ELE604/ELE704 Optimization
Unconstrained Optimization
http://www.ee.hacettepe.edu.tr/~usezen/ele604/
Dr. Umut Sezen & Dr. Cenk Toker
Department of Electrical and Electronic Engineering
Hacettepe University
Umut Sezen & Cenk Toker (Hacettepe University) ELE604 Optimization 22-Nov-2016
Contents
Unconstrained Optimization
  Unconstrained Minimization
Descent Methods
  Motivation
  General Descent Method
  Line Search
    Exact Line Search
    Bisection Algorithm
    Backtracking Line Search
  Convergence
Gradient Descent (GD) Method
  Gradient Descent Method
  Convergence Analysis
  Conv. of GD with Exact Line Search
  Conv. of GD with Backtracking Line Search
  Examples
Steepest Descent (SD) Method
  Preliminary Definitions
  Steepest Descent Method
  Steepest Descent for different norms
    Euclidean Norm
    Quadratic Norm
    L1-norm
    Choice of norm
  Convergence Analysis
  Examples
Conjugate Gradient (CG) Method
  Introduction
  Conjugate Directions
    Descent Properties of the Conjugate Gradient Method
  The Conjugate Gradient Method
  Extension to Nonquadratic Problems
Newton's Method (NA)
  The Newton Step
    Interpretation of the Newton Step
  The Newton Decrement
  Newton's Method
  Convergence Analysis
  Examples
  Approximation of the Hessian
Unconstrained Optimization Unconstrained Minimization
Unconstrained Minimization
I The aim is
    min f(x)
where f(x) : R^N → R is twice differentiable.
I The problem is solvable, i.e., a finite optimal point x* exists.
I The optimal value (finite) is given by
    p* = inf_x f(x) = f(x*)   (> −∞)
I Example 1: Quadratic program
    min_{x∈R^N} f(x) = (1/2)xᵀQx − bᵀx + c
where Q ∈ R^{N×N} is symmetric, b ∈ R^N and c ∈ R.
Necessary conditions:
    ∇f(x*) = Qx* − b = 0
    H(x*) = Q ⪰ 0 (PSD)
- Q ≺ 0 ⇒ f(x) has no local minimum.
- Q ≻ 0 ⇒ x* = Q⁻¹b is the unique global minimum.
- Q ⪰ 0 (singular) ⇒ either no solution or an infinite number of solutions.
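The three cases above can be illustrated with a minimal numpy sketch (the function name `quadratic_minimizer` is illustrative, not from the slides):

```python
import numpy as np

def quadratic_minimizer(Q, b):
    """Minimize f(x) = 0.5 x^T Q x - b^T x + c for symmetric Q.

    Returns the unique global minimum x* = Q^{-1} b when Q is positive
    definite; otherwise reports which of the other cases applies."""
    eigvals = np.linalg.eigvalsh(Q)      # eigenvalues of the symmetric Q
    if np.all(eigvals > 0):              # Q > 0: unique global minimum
        return np.linalg.solve(Q, b)
    if np.any(eigvals < 0):              # a negative eigenvalue: unbounded below
        raise ValueError("f has no local minimum")
    raise ValueError("Q is singular PSD: no solution or infinitely many")

Q = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([1.0, 0.0])
x_star = quadratic_minimizer(Q, b)       # Q^{-1} b = (0.5, 0)
```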
I Example 2: Consider
    min_{x1,x2∈R} f(x1, x2) = (1/2)(αx1² + βx2²) − x1
Here, let us first express the above equation in the quadratic program form with
    Q = [α γ; −γ β],  b = [1; 0]
where γ ∈ R; for simplicity we can take γ = 0. So,
- If α > 0 and β > 0 (i.e., Q ≻ 0): x* = (1/α, 0) is the unique global minimum.
- If α > 0 and β = 0 (i.e., Q ⪰ 0): infinite number of solutions, {(1/α, y) : y ∈ R}.
- If α = 0 and β > 0 (i.e., Q ⪰ 0): no solution.
- If α < 0 and β > 0, or α > 0 and β < 0 (i.e., Q is indefinite): no solution.
[Figure: surface plots of f(x1, x2) for the four cases: α > 0, β > 0; α > 0, β = 0; α = 0, β > 0; and α > 0, β < 0.]
I Two possibilities:
- {f(x) : x ∈ X} is unbounded below ⇒ no optimal solution.
- {f(x) : x ∈ X} is bounded below ⇒ a global minimum exists, provided the minimum is attained at a finite point (‖x‖ ≠ ∞).
Then, unconstrained minimization methods
- produce a sequence of points x^(k) ∈ dom f(x) for k = 0, 1, . . . with
    f(x^(k)) → p*
- can be interpreted as iterative methods for solving the optimality condition
    ∇f(x*) = 0
Descent Methods Motivation
Motivation
I If ∇f(x) ≠ 0, there is an interval (0, δ) of stepsizes such that
    f(x − α∇f(x)) < f(x),  ∀α ∈ (0, δ)
I If d makes an angle with ∇f(x) that is greater than 90°, i.e.,
    ∇ᵀf(x)d < 0
then ∃ an interval (0, δ) of stepsizes such that
    f(x + αd) < f(x),  ∀α ∈ (0, δ)
I Definition: The descent direction d is selected such that
    ∇ᵀf(x)d < 0
I Proposition: For a descent method
    f(x^(k+1)) < f(x^(k))
except when x^(k) = x*.
I Definition: The minimizing sequence is defined as
    x^(k+1) = x^(k) + α^(k) d^(k)
where the scalar α^(k) ∈ (0, δ) is the stepsize (or step length) at iteration k, and d^(k) ∈ R^N is the step or search direction.
- How to find the optimum α^(k)? Line search algorithm.
- How to find the optimum d^(k)? Depends on the descent algorithm, e.g., d = −∇f(x^(k)).
Descent Methods General Descent Method
General Descent Method
I Given a starting point x^(0) ∈ dom f(x)
repeat
    1. Determine a descent direction d^(k),
    2. Line search: choose a stepsize α^(k) > 0,
    3. Update: x^(k+1) = x^(k) + α^(k) d^(k),
until stopping criterion is satisfied.
I Example 3: Simplest method: Gradient Descent
    x^(k+1) = x^(k) − α^(k)∇f(x^(k)),  k = 0, 1, . . .
Note that here the descent direction is d^(k) = −∇f(x^(k)).
I Example 4: Most sophisticated method: Newton's Method
    x^(k+1) = x^(k) − α^(k) H⁻¹(x^(k)) ∇f(x^(k)),  k = 0, 1, . . .
Note that here the descent direction is d^(k) = −H⁻¹(x^(k))∇f(x^(k)).
Descent Methods Line Search
Line Search
I Suppose f(x) is a continuously differentiable convex function and we want to find
    α^(k) = argmin_α f(x^(k) + αd^(k))
for a given descent direction d^(k). Now, let
    h(α) = f(x^(k) + αd^(k))
where h(α) : R → R is a convex function in the scalar variable α; the problem then becomes
    α^(k) = argmin_α h(α)
Then, as h(α) is convex, it has a minimum where
    h′(α^(k)) = ∂h(α^(k))/∂α = 0
where h′(α) is given by
    h′(α) = ∂h(α)/∂α = ∇ᵀf(x^(k) + αd^(k)) d^(k)    (using the chain rule)
Therefore, since d^(k) is a descent direction (i.e., ∇ᵀf(x^(k))d^(k) < 0), we have h′(0) < 0. Also, h′(α) is a monotonically increasing function of α because h(α) is convex. Hence, search for the point where h′(α^(k)) = 0.
Choice of stepsize:
I Constant stepsize
    α^(k) = c : constant
I Diminishing stepsize
    α^(k) → 0, while satisfying Σ_{k=0}^∞ α^(k) = ∞.
I Exact line search (analytic)
    α^(k) = argmin_α f(x^(k) + αd^(k))
Exact line search: (for quadratic programs)
I If f(x) is a quadratic function, then h(α) is also a quadratic function, i.e.,
    h(α) = f(x^(k) + αd^(k))
         = f(x^(k)) + α∇ᵀf(x^(k))d^(k) + (α²/2) d^(k)ᵀ H(x^(k)) d^(k)
The exact line search solution α0, which minimizes the quadratic above (i.e., ∂h(α0)/∂α = 0), is given by
    α0 = α^(k) = −∇ᵀf(x^(k))d^(k) / (d^(k)ᵀ H(x^(k)) d^(k))
- If f(x) is a higher-order function, then a second-order Taylor series approximation can be used for the exact line search algorithm (which then gives an approximate solution).
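The closed-form stepsize above fits in a few lines of numpy (the function name `exact_step_quadratic` and the test data are illustrative):

```python
import numpy as np

def exact_step_quadratic(grad, H, d):
    """Exact line-search stepsize for a quadratic model:
    h(a) = f + a*grad^T d + (a^2/2) d^T H d, minimized at
    a0 = -grad^T d / (d^T H d)."""
    return -(grad @ d) / (d @ H @ d)

# f(x) = 0.5 x^T Q x - b^T x with Q = diag(2, 4), b = (1, 0)
Q = np.diag([2.0, 4.0])
b = np.array([1.0, 0.0])
x = np.zeros(2)
grad = Q @ x - b                 # gradient at x
d = -grad                        # gradient-descent direction
a0 = exact_step_quadratic(grad, Q, d)
```

For a quadratic f with this Q and b, one exact step along −∇f from the origin lands exactly on the minimizer (0.5, 0).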
Bisection Algorithm:
I Assume h(α) is convex; then h′(α) is a monotonically increasing function. Suppose that we know a value ᾱ such that h′(ᾱ) > 0.
- Since h′(0) < 0, α = (0 + ᾱ)/2 is the next test point
- If h′(α) = 0, α^(k) = α is found (very difficult to achieve exactly)
- If h′(α) > 0, narrow down the search interval to (0, α)
- If h′(α) < 0, narrow down the search interval to (α, ᾱ)
Algorithm:
    1. Set k = 0, αℓ = 0 and αu = ᾱ
    2. Set α = (αℓ + αu)/2 and calculate h′(α)
    3. If h′(α) > 0 ⇒ αu = α, k = k + 1, go to step 2
    4. If h′(α) < 0 ⇒ αℓ = α, k = k + 1, go to step 2
    5. If h′(α) = 0 ⇒ stop.
Proposition: After every iteration, the current interval [αℓ, αu] contains α*, where h′(α*) = 0.
Proposition: At the k-th iteration, the length of the current interval is
    L = (1/2)^k ᾱ
Proposition: A value of α such that |α − α*| < ε can be found in at most ⌈log₂(ᾱ/ε)⌉ steps.
I How to find ᾱ such that h′(ᾱ) > 0?
    1. Make an initial guess of ᾱ
    2. If h′(ᾱ) < 0 ⇒ ᾱ = 2ᾱ, go to step 2
    3. Stop.
I Stopping criterion for the Bisection Algorithm: h′(α) → 0 only as k → ∞, so the algorithm may not terminate quickly on its own.
Some relevant stopping criteria:
    1. Stop after k = K iterations (K: user defined)
    2. Stop when |αu − αℓ| ≤ ε (ε: user defined)
    3. Stop when |h′(α)| ≤ ε (ε: user defined)
In general, the 3rd criterion is the best.
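The doubling phase and the bisection loop above can be sketched in plain Python (the function name and default tolerances are assumptions; the stopping rule is criterion 3):

```python
def bisection_line_search(hprime, alpha_bar=1.0, eps=1e-8, max_iter=100):
    """Bisection on h'(alpha): assumes h'(0) < 0. First doubles alpha_bar
    until h'(alpha_bar) > 0, then halves the bracket [a_l, a_u]."""
    while hprime(alpha_bar) < 0:       # find an upper point with h' > 0
        alpha_bar *= 2.0
    a_l, a_u = 0.0, alpha_bar
    for _ in range(max_iter):
        a = 0.5 * (a_l + a_u)
        g = hprime(a)
        if abs(g) <= eps:              # criterion 3: |h'(a)| small
            break
        if g > 0:
            a_u = a                    # root lies in (a_l, a)
        else:
            a_l = a                    # root lies in (a, a_u)
    return a

# h(a) = (a - 0.3)^2 has minimizer a* = 0.3; h'(a) = 2(a - 0.3)
alpha = bisection_line_search(lambda a: 2.0 * (a - 0.3))
```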
Backtracking line search
For small enough α:
    f(x0 + αd) ≈ f(x0) + α∇ᵀf(x0)d < f(x0) + γα∇ᵀf(x0)d
where 0 < γ < 0.5, since ∇ᵀf(x0)d < 0.
I Algorithm: Backtracking line search
Given a descent direction d for f(x) at x0 ∈ dom f(x)
    α = 1
    while f(x0 + αd) > f(x0) + γα∇ᵀf(x0)d
        α = βα
    end
where 0 < γ < 0.5 and 0 < β < 1.
- At each iteration the step size α is reduced by the factor β (β ≈ 0.1: coarse search, β ≈ 0.8: fine search).
- γ can be interpreted as the fraction of the decrease in f(x) predicted by linear extrapolation that we accept (γ = 0.01 to 0.3 typically, meaning that we accept a decrease in f(x) between 1% and 30% of the prediction).
- The backtracking exit inequality
    f(x0 + αd) ≤ f(x0) + γα∇ᵀf(x0)d
holds for α ∈ [0, α0]. Then the line search stops with a step length α satisfying
    i. α = 1 if α0 ≥ 1
    ii. α ∈ [βα0, α0] otherwise.
In other words, the step length obtained by backtracking line search satisfies
    α ≥ min {1, βα0}.
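The backtracking loop above translates almost line-for-line into code; a minimal numpy sketch (parameter defaults and the test function are illustrative):

```python
import numpy as np

def backtracking(f, grad_f, x0, d, gamma=0.1, beta=0.5):
    """Backtracking line search: start at alpha = 1 and shrink by beta
    until f(x0 + alpha d) <= f(x0) + gamma * alpha * grad^T d."""
    alpha = 1.0
    fx0 = f(x0)
    slope = grad_f(x0) @ d           # negative for a descent direction
    while f(x0 + alpha * d) > fx0 + gamma * alpha * slope:
        alpha *= beta
    return alpha

f = lambda x: x @ x                  # f(x) = ||x||^2
grad_f = lambda x: 2.0 * x
x0 = np.array([1.0, 1.0])
d = -grad_f(x0)                      # gradient-descent direction
alpha = backtracking(f, grad_f, x0, d)
```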
Descent Methods Convergence
I Convergence
Definition: Let ‖·‖ be a norm on R^N. Let {x^(k)}_{k=0}^∞ be a sequence of vectors in R^N. Then, the sequence {x^(k)}_{k=0}^∞ is said to converge to a limit x* if
    ∀ε > 0, ∃Nε ∈ Z+ : (k ∈ Z+ and k ≥ Nε) ⇒ (‖x^(k) − x*‖ < ε)
If the sequence {x^(k)}_{k=0}^∞ converges to x*, then we write
    lim_{k→∞} x^(k) = x*
and call x* the limit of the sequence {x^(k)}_{k=0}^∞.
- Nε may depend on ε
- For a distance ε, after Nε iterations, all the subsequent iterates are within this distance ε of x*.
This definition does not characterize how fast the convergence is (i.e., the rate of convergence).
I Rate of Convergence
Definition: Let ‖·‖ be a norm on R^N. A sequence {x^(k)}_{k=0}^∞ that converges to x* ∈ R^N is said to converge at rate R ∈ R++ and with rate constant δ ∈ R++ if
    lim_{k→∞} ‖x^(k+1) − x*‖ / ‖x^(k) − x*‖^R = δ
- If R = 1 and 0 < δ < 1, then the rate is linear
- If 1 < R < 2 and 0 < δ < ∞, then the rate is called super-linear
- If R = 2 and 0 < δ < ∞, then the rate is called quadratic
The rate of convergence R is sometimes called the asymptotic convergence rate. It may not apply to the early iterates, but applies asymptotically as k → ∞.
Example 5: The sequence {a^k}_{k=0}^∞, 0 < a < 1, converges to 0.
    lim_{k→∞} ‖a^(k+1) − 0‖ / ‖a^k − 0‖¹ = a  ⇒  R = 1, δ = a
Example 6: The sequence {a^(2^k)}_{k=0}^∞, 0 < a < 1, converges to 0.
    lim_{k→∞} ‖a^(2^(k+1)) − 0‖ / ‖a^(2^k) − 0‖² = 1  ⇒  R = 2, δ = 1
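The two limits in Examples 5 and 6 can be checked numerically; a small plain-Python sketch (variable names and a = 0.5 are illustrative):

```python
# Successive error ratios for the sequences of Examples 5 and 6 (a = 0.5).
a = 0.5
lin = [a**k for k in range(1, 10)]           # a^k: linear rate, delta = a
quad = [a**(2**k) for k in range(1, 6)]      # a^(2^k): quadratic rate, delta = 1

# Example 5: e_{k+1} / e_k -> a
lin_ratios = [lin[k + 1] / lin[k] for k in range(len(lin) - 1)]

# Example 6: e_{k+1} / e_k^2 -> 1
quad_ratios = [quad[k + 1] / quad[k]**2 for k in range(len(quad) - 1)]
```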
Gradient Descent (GD) Method Gradient Descent Method
Gradient Descent Method
I First-order Taylor series expansion at x0 gives us
    f(x0 + αd) ≈ f(x0) + α∇ᵀf(x0)d.
This approximation is valid as α‖d‖ → 0.
I We want to choose d so that ∇ᵀf(x0)d is as small (as negative) as possible for maximum descent.
I If we normalize d, i.e., ‖d‖ = 1, then the normalized direction
    d = −∇f(x0)/‖∇f(x0)‖
makes the smallest inner product with ∇f(x0).
I Then, the unnormalized direction
    d = −∇f(x0)
is called the direction of gradient descent (GD) at the point x0.
I d is a descent direction as long as ∇f(x0) ≠ 0.
I Algorithm: Gradient Descent Algorithm
Given a starting point x^(0) ∈ dom f(x)
repeat
    1. d^(k) = −∇f(x^(k))
    2. Line search: choose step size α^(k) via a line search algorithm
    3. Update: x^(k+1) = x^(k) + α^(k)d^(k)
until stopping criterion is satisfied
- A typical stopping criterion is ‖∇f(x)‖ < ε with ε small.
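Putting the pieces together, a minimal sketch of gradient descent with backtracking line search, assuming numpy (the function name, defaults, and test problem are illustrative):

```python
import numpy as np

def gradient_descent(f, grad_f, x0, eps=1e-6, max_iter=10_000,
                     gamma=0.1, beta=0.5):
    """Gradient descent with backtracking line search;
    stops when ||grad f(x)|| < eps."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < eps:          # stopping criterion
            break
        d = -g                               # descent direction
        alpha, fx = 1.0, f(x)
        while f(x + alpha * d) > fx + gamma * alpha * (g @ d):
            alpha *= beta                    # backtracking
        x = x + alpha * d
    return x

# f(x) = 0.5 x^T Q x - b^T x, minimizer x* = Q^{-1} b = (0.5, 0)
Q = np.diag([2.0, 4.0])
b = np.array([1.0, 0.0])
x_star = gradient_descent(lambda x: 0.5 * x @ Q @ x - b @ x,
                          lambda x: Q @ x - b, np.array([5.0, 5.0]))
```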
Gradient Descent (GD) Method Convergence Analysis
I Convergence Analysis
The Hessian matrix H(x) is bounded as
1. mI ⪯ H(x), i.e.,
    (H(x) − mI) ⪰ 0
    yᵀH(x)y ≥ m‖y‖²,  ∀y ∈ R^N
2. H(x) ⪯ MI, i.e.,
    (MI − H(x)) ⪰ 0
    yᵀH(x)y ≤ M‖y‖²,  ∀y ∈ R^N
for all x ∈ dom f(x).
Note that the condition number of a matrix is given by the ratio of the largest and the smallest eigenvalues, e.g.,
    κ(H(x)) = |max λi / min λi| = M/m
If the condition number is close to one, the matrix is well-conditioned, which means its inverse can be computed with good accuracy. If the condition number is large, then the matrix is said to be ill-conditioned.
I Lower Bound: mI ⪯ H(x)
For x, y ∈ dom f(x)
    f(y) = f(x) + ∇ᵀf(x)(y − x) + (1/2)(y − x)ᵀH(z)(y − x)
for some z on the line segment [x, y], where H(z) ⪰ mI. Thus,
    f(y) ≥ f(x) + ∇ᵀf(x)(y − x) + (m/2)‖y − x‖²
- If m = 0, then the inequality characterizes convexity.
- If m > 0, then we have a better lower bound for f(y).
The right-hand side is convex in y. Its minimum is achieved at
    y0 = x − (1/m)∇f(x)
Then,
    f(y) ≥ f(x) + ∇ᵀf(x)(y0 − x) + (m/2)‖y0 − x‖²
         = f(x) − (1/2m)‖∇f(x)‖²
∀y ∈ dom f.
When y = x*,
    f(x*) = p* ≥ f(x) − (1/2m)‖∇f(x)‖²
- This suggests a stopping criterion, since
    f(x) − p* ≤ (1/2m)‖∇f(x)‖²
so a small gradient norm guarantees near-optimality.
I Upper Bound: H(x) ⪯ MI
For any x, y ∈ dom f(x), using similar derivations as for the lower bound, we arrive at
    f(y) ≤ f(x) + ∇ᵀf(x)(y − x) + (M/2)‖y − x‖²
Then, minimizing both sides over y (the left-hand side gives p*),
    f(x*) = p* ≤ f(x) − (1/2M)‖∇f(x)‖²
Gradient Descent (GD) Method Conv. of GD with Exact Line Search
I Convergence of GD using exact line search
For the exact line search, let us use the second-order approximation for f(x^(k+1)):
    f(x^(k+1)) = f(x^(k) − α∇f(x^(k)))
              ≈ f(x^(k)) − α‖∇f(x^(k))‖² + (α²/2) ∇ᵀf(x^(k)) H(x^(k)) ∇f(x^(k))
This expression is quadratic in α.
Normally, the exact line search solution α0, which minimizes the quadratic above, is given by
    α0 = ∇ᵀf(x^(k))∇f(x^(k)) / (∇ᵀf(x^(k)) H(x^(k)) ∇f(x^(k)))
- However, let us use the upper bound of the second-order approximation (H(x) ⪯ MI) for the convergence analysis:
    f(x^(k+1)) ≤ f(x^(k)) − α‖∇f(x^(k))‖² + (Mα²/2)‖∇f(x^(k))‖²
Find α′0 such that the upper bound of f(x^(k) − α∇f(x^(k))) is minimized over α.
The upper bound (i.e., the right-hand side) is quadratic in α, hence minimized for
    α′0 = 1/M
with the minimum value
    f(x^(k)) − (1/2M)‖∇f(x^(k))‖²
Then, for α′0,
    f(x^(k+1)) ≤ f(x^(k)) − (1/2M)‖∇f(x^(k))‖²
Subtract p* from both sides:
    f(x^(k+1)) − p* ≤ f(x^(k)) − p* − (1/2M)‖∇f(x^(k))‖²
We know that
    f(x^(k)) − p* ≤ (1/2m)‖∇f(x^(k))‖²  ⇒  ‖∇f(x^(k))‖² ≥ 2m(f(x^(k)) − p*)
Then, substituting this result into the above inequality,
    f(x^(k+1)) − p* ≤ (f(x^(k)) − p*) − (m/M)(f(x^(k)) − p*)
                  = (1 − m/M)(f(x^(k)) − p*)
or
    (f(x^(k+1)) − p*) / (f(x^(k)) − p*) ≤ (1 − m/M) = c < 1    (since 0 < m ≤ M)
- The rate of convergence is unity (i.e., R = 1) ⇒ linear convergence
- The upper limit of the rate constant is (1 − m/M)
I Number of steps? Apply the above inequality recursively:
    f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*)
i.e., f(x^(k)) → p* as k → ∞, since 0 ≤ c < 1. Thus, convergence is guaranteed.
- If m = M ⇒ c = 0, then convergence occurs in one iteration.
- If m ≪ M ⇒ c → 1, then convergence is slow.
(f(x^(k)) − p*) ≤ ε is achieved after at most
    K = log([f(x^(0)) − p*]/ε) / log(1/c)
iterations.
- The numerator is small when the initial point is close to x* (K gets smaller).
- The numerator increases as the accuracy increases (i.e., ε decreases) (K gets larger).
- The denominator scales linearly with m/M (the reciprocal of the condition number), since c = (1 − m/M) gives log(1/c) = −log(1 − m/M) ≈ m/M (using the Taylor expansion log(x) ≈ (x − x0)/x0 around x0 = 1, i.e., log(1 − u) ≈ −u for small u).
    - Well-conditioned Hessian, m/M → 1 ⇒ the denominator is large (K gets smaller).
    - Ill-conditioned Hessian, m/M → 0 ⇒ the denominator is small (K gets larger).
Gradient Descent (GD) Method Conv. of GD with Backtracking Line Search
I Convergence of GD using backtracking line search
The backtracking exit condition
    f(x^(k) − α∇f(x^(k))) ≤ f(x^(k)) − γα‖∇f(x^(k))‖²
is satisfied for all α ∈ (0, α0], where α0 ≥ 1/M. The backtracking line search therefore terminates either with α = 1 or with α ≥ β/M, which gives a lower bound on the decrease:
    1. f(x^(k+1)) ≤ f(x^(k)) − γ‖∇f(x^(k))‖²  if α = 1
    2. f(x^(k+1)) ≤ f(x^(k)) − (βγ/M)‖∇f(x^(k))‖²  if α ≥ β/M
If we put these inequalities (1 & 2) together:
    f(x^(k+1)) ≤ f(x^(k)) − γ min{1, β/M}‖∇f(x^(k))‖²
Similar to the analysis of exact line search, subtract p* from both sides:
    f(x^(k+1)) − p* ≤ f(x^(k)) − p* − γ min{1, β/M}‖∇f(x^(k))‖²
But we know that ‖∇f(x^(k))‖² ≥ 2m(f(x^(k)) − p*), then
    f(x^(k+1)) − p* ≤ (1 − 2mγ min{1, β/M})(f(x^(k)) − p*)
Finally,
    (f(x^(k+1)) − p*)/(f(x^(k)) − p*) ≤ (1 − 2mγ min{1, β/M}) = c < 1
- The rate of convergence is unity (i.e., R = 1) ⇒ linear convergence
- The rate constant is c < 1:
    f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*)
Thus, as k → ∞, c^k → 0, so convergence is guaranteed.
Gradient Descent (GD) Method Examples
Note: Examples 7, 8, 9 and 10 are taken from Convex Optimization (Boyd and Vandenberghe) (Ch. 9).
Example 7: (quadratic problem in R^2) Replace γ with σ. [Figures omitted.]
Example 8: (nonquadratic problem in R^2) Replace α and t with γ and α. [Figures omitted.]
Example 9: (problem in R^100) Replace α and t with γ and α. [Figures omitted.]
Example 10: (Condition number) Replace γ, α and t with σ, γ and α. [Figures omitted.]
Observations:
- The gradient descent algorithm is simple.
- The gradient descent method often exhibits approximately linear convergence.
- The choice of backtracking parameters γ and β has a noticeable but not dramatic effect on the convergence. Exact line search sometimes improves the convergence of the gradient method, but the effect is not large (and probably not worth the trouble of implementing the exact line search).
- The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets. Convergence can be very slow, even for problems that are moderately well-conditioned (say, with condition number in the 100s). When the condition number is larger (say, 1000 or more) the gradient method is so slow that it is useless in practice.
- The main advantage of the gradient method is its simplicity. Its main disadvantage is that its convergence rate depends so critically on the condition number of the Hessian or of the sublevel sets.
Steepest Descent (SD) Method Preliminary Definitions
I Dual Norm: Let ‖·‖ denote any norm on R^N; then the dual norm, denoted by ‖·‖*, is the function from R^N to R with values
    ‖x‖* = max_y {yᵀx : ‖y‖ ≤ 1} = sup {yᵀx : ‖y‖ ≤ 1}
The above definition also yields a norm: it is convex, as it is the pointwise maximum of convex (in fact, linear) functions y ↦ xᵀy; it is homogeneous of degree 1, that is, ‖αx‖* = α‖x‖* for every x in R^N and α ≥ 0.
I By the definition of the dual norm,
    xᵀy ≤ ‖x‖ · ‖y‖*
This can be seen as a generalized version of the Cauchy-Schwarz inequality, which corresponds to the Euclidean norm.
I The dual of the dual norm is the original norm.
- The norm dual to the Euclidean norm is itself; this comes directly from the Cauchy-Schwarz inequality:
    ‖x‖2* = ‖x‖2
- The norm dual to the L∞-norm is the L1-norm, and vice versa:
    ‖x‖∞* = ‖x‖1 and ‖x‖1* = ‖x‖∞
- More generally, the dual of the Lp-norm is the Lq-norm,
    ‖x‖p* = ‖x‖q
where q = p/(p − 1), i.e., 1/p + 1/q = 1.
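A small numeric sanity check of the dual-norm pairs above, assuming numpy (the test vector and p = 3 are illustrative):

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])

# Dual of the L-infinity norm: sup { y^T x : ||y||_inf <= 1 } is attained
# at y = sign(x), and the optimal value equals ||x||_1.
y_opt = np.sign(x)
dual_inf = y_opt @ x                   # = 1 + 2 + 3 = ||x||_1
l1 = np.linalg.norm(x, 1)

# For the Lp-norm the dual index is the Holder conjugate q = p/(p-1).
p = 3.0
q = p / (p - 1.0)
```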
I Quadratic norm: A generalized quadratic norm of x is defined by
    ‖x‖P = (xᵀPx)^(1/2) = ‖P^(1/2)x‖2 = ‖Mx‖2
where P = MᵀM is an N×N symmetric positive definite (SPD) matrix.
I When P = I, the quadratic norm is equal to the Euclidean norm.
I The dual of the quadratic norm is given by
    ‖x‖P* = ‖x‖Q = (xᵀP⁻¹x)^(1/2)
where Q = P⁻¹.
Steepest Descent (SD) Method Steepest Descent Method
Steepest Descent Method
I The first-order Taylor series approximation of f(x^(k) + αd) around x^(k) is
    f(x^(k) + αd) ≈ f(x^(k)) + α∇ᵀf(x^(k))d.
This approximation is valid as α‖d‖2 → 0.
I We want to choose d so that ∇ᵀf(x^(k))d is as small (as negative) as possible for maximum descent.
I First normalize d to obtain the normalized steepest descent direction (nsd) dnsd:
    dnsd = argmin {∇ᵀf(x^(k))d : ‖d‖ = 1}
where ‖·‖ is any norm on R^N. The choice of norm is very important.
I It is also convenient to consider the unnormalized steepest descent direction (sd)
    dsd = ‖∇f(x)‖* dnsd
where ‖·‖* is the dual norm of ‖·‖.
I Then, for the steepest descent step, we have
    ∇ᵀf(x)dsd = ‖∇f(x)‖* ∇ᵀf(x)dnsd = −‖∇f(x)‖*²
(using ∇ᵀf(x)dnsd = −‖∇f(x)‖*).
I Algorithm: Steepest Descent Algorithm
Given a starting point x^(0) ∈ dom f(x)
repeat
    1. Compute the steepest descent direction d_sd^(k)
    2. Line search: choose step size α^(k) via a line search algorithm
    3. Update: x^(k+1) = x^(k) + α^(k) d_sd^(k)
until stopping criterion is satisfied
Steepest Descent (SD) Method Steepest Descent for different norms
I Steepest Descent for different norms:
- Euclidean norm: As ‖·‖2* = ‖·‖2 and with x0 = x^(k), the steepest descent direction is the negative gradient, i.e.,
    dsd = −∇f(x0)
For the Euclidean norm, the steepest descent algorithm is the same as the gradient descent algorithm.
- Quadratic norm: For a quadratic norm ‖·‖P and with x0 = x^(k), the normalized descent direction is given by
    dnsd = −P⁻¹∇f(x0)/‖∇f(x0)‖P* = −P⁻¹∇f(x0)/(∇ᵀf(x0)P⁻¹∇f(x0))^(1/2)
As ‖∇f(x)‖P* = ‖P^(−1/2)∇f(x)‖2, the unnormalized direction is
    dsd = −P⁻¹∇f(x0)
Change of coordinates: Let y = P^(1/2)x; then ‖x‖P = ‖y‖2. Using this change of coordinates, we can solve the original problem of minimizing f(x) by solving the equivalent problem of minimizing the function f̄(y) : R^N → R, given by
    f̄(y) = f(P^(−1/2)y) = f(x)
Apply the gradient descent method to f̄(y). The descent direction at y0 (x0 = P^(−1/2)y0 for the original problem) is
    dy = −∇f̄(y0) = −P^(−1/2)∇f(P^(−1/2)y0) = −P^(−1/2)∇f(x0)
Then the descent direction for the original problem becomes
    dx = P^(−1/2)dy = −P⁻¹∇f(x0)
Thus, x* = P^(−1/2)y*.
The steepest descent method in the quadratic norm ‖·‖P is equivalent to the gradient descent method applied to the problem after the coordinate transformation y = P^(1/2)x.
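The equivalence can be checked numerically; a sketch assuming numpy (the matrix P and the gradient value are made up for illustration):

```python
import numpy as np

# Steepest descent direction in the quadratic norm ||.||_P is
# d_sd = -P^{-1} grad f(x0); equivalently, gradient descent after the
# change of coordinates y = P^{1/2} x, mapped back with P^{-1/2}.
P = np.array([[4.0, 1.0], [1.0, 3.0]])    # SPD matrix defining the norm
grad = np.array([2.0, -1.0])              # gradient of f at some point x0

d_sd = -np.linalg.solve(P, grad)          # -P^{-1} grad f(x0)

# Same direction via the coordinate transformation:
w, V = np.linalg.eigh(P)                  # P = V diag(w) V^T
P_inv_half = V @ np.diag(w**-0.5) @ V.T   # P^{-1/2}
d_y = -P_inv_half @ grad                  # descent direction in y-coordinates
d_x = P_inv_half @ d_y                    # mapped back: equals d_sd
```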
- L1-norm: For the L1-norm ‖·‖1 and with x0 = x^(k), the normalized descent direction is given by
    dnsd = argmin {∇ᵀf(x)d : ‖d‖1 = 1}.
Let i be any index for which ‖∇f(x0)‖∞ = |(∇f(x0))i|. Then a normalized steepest descent direction dnsd for the L1-norm is given by
    dnsd = −sign(∂f(x0)/∂xi) ei
where ei is the i-th standard basis vector (i.e., the coordinate axis direction) with the steepest gradient. For example, in the figure above we have dnsd = e1.
Then, the unnormalized steepest descent direction is given by
    dsd = dnsd ‖∇f(x0)‖∞ = −(∂f(x0)/∂xi) ei
The steepest descent algorithm in the L1-norm has a very natural interpretation:
- At each iteration we select a component of ∇f(x0) with maximum absolute value, and then decrease or increase the corresponding component of x0 according to the sign of (∇f(x0))i.
- The algorithm is sometimes called a coordinate-descent algorithm, since only one component of the variable x^(k) is updated at each iteration.
- This can greatly simplify, or even trivialize, the line search.
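A minimal sketch of one L1-norm steepest descent (coordinate descent) step, assuming numpy (the function name and test gradient are illustrative):

```python
import numpy as np

def l1_steepest_descent_step(grad):
    """L1-norm steepest descent direction: only the coordinate with the
    largest |gradient| component is updated (coordinate descent):
    d_sd = -(df/dx_i) e_i."""
    i = np.argmax(np.abs(grad))          # index of steepest component
    d = np.zeros_like(grad)
    d[i] = -grad[i]
    return d

grad = np.array([0.5, -3.0, 1.0])
d = l1_steepest_descent_step(grad)       # only component 1 is nonzero
```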
Choice of norm:
- The choice of norm can dramatically affect the convergence.
- The condition number of the Hessian should be close to unity for fast convergence.
- Consider the quadratic norm with respect to an SPD matrix P. Performing the change of coordinates
    y = P^(1/2)x
can change the condition number.
- If an approximation of the Hessian at the optimal point, H(x*), is known, then setting P ≅ H(x*) will yield
    P^(−1/2) H(x*) P^(−1/2) ≅ I
resulting in a very low condition number.
- If P is chosen correctly, the ellipsoid E = {x : xᵀPx ≤ 1} approximates the cost surface at the point x.
- A correct P will greatly improve the convergence, whereas the wrong choice of P will result in very poor convergence.
Steepest Descent (SD) Method Convergence Analysis
Convergence Analysis
- (Using backtracking line search.) Any norm can be bounded in terms of the Euclidean norm with a constant η ∈ (0, 1]:
    ‖x‖* ≥ η‖x‖2
- Assuming strongly convex f(x) and using H(x) ⪯ MI:
    f(x^(k) + αdsd) ≤ f(x^(k)) + α∇ᵀf(x^(k))dsd + (Mα²/2)‖dsd‖2²
                  ≤ f(x^(k)) + α∇ᵀf(x^(k))dsd + (Mα²/2η²)‖dsd‖*²
                  = f(x^(k)) − α‖∇f(x^(k))‖*² + (Mα²/2η²)‖∇f(x^(k))‖*²
The right-hand side of the inequality is a quadratic function of α and has a minimum at α = η²/M. Then,
    f(x^(k) + αdsd) ≤ f(x^(k)) − (η²/2M)‖∇f(x^(k))‖*² ≤ f(x^(k)) + (γη²/M)∇ᵀf(x^(k))dsd
Since γ < 0.5 and −‖∇f(x)‖*² = ∇ᵀf(x)dsd, backtracking line search will return α ≥ min{1, βη²/M}; then
    f(x^(k) + αdsd) ≤ f(x^(k)) − γ min{1, βη²/M}‖∇f(x^(k))‖*²
                  ≤ f(x^(k)) − γη² min{1, βη²/M}‖∇f(x^(k))‖2²
Subtracting p* from both sides and using ‖∇f(x^(k))‖2² ≥ 2m(f(x^(k)) − p*), we have
    (f(x^(k+1)) − p*)/(f(x^(k)) − p*) ≤ 1 − 2mγη² min{1, βη²/M} = c < 1
- Linear convergence:
    f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*)
As k → ∞, c^k → 0, so convergence is guaranteed.
Steepest Descent (SD) Method Examples
Example 11: A steepest descent example with the L1-norm. [Figures omitted.]
Example 12: Consider the nonquadratic problem in R^2 given in Example 8 (replace α and t with γ and α). [Figures omitted: steepest descent with two quadratic norms, and the case P = I, i.e., gradient descent.]
Conjugate Gradient (CG) Method Introduction
Conjugate Gradient Method
I Can overcome the slow convergence of the Gradient Descent algorithm.
I Computational complexity is lower than Newton's Method.
I Can be very effective in dealing with general objective functions.
I We will first investigate the quadratic problem
    min (1/2)xᵀQx − bᵀx
where Q is SPD, and then extend the solution to the general case by approximation.
Conjugate Gradient (GD) Method Conjugate Directions
Conjugate Directions
I Definition: Given a symmetric matrix Q, two vectors d_1 and d_2 are said to be Q-orthogonal, or conjugate with respect to Q, if

    d_1^T Q d_2 = 0

- Although it is not required, we will assume that Q is SPD.
- If Q = I, then the above definition becomes the definition of orthogonality.
- A finite set of non-zero vectors d_0, d_1, . . . , d_k is said to be a Q-orthogonal set if

    d_i^T Q d_j = 0,   ∀ i, j : i ≠ j
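As a quick numerical illustration (a minimal sketch; the matrix Q and the two directions below are hand-picked for this example, not taken from the slides), two vectors can be Q-orthogonal without being orthogonal in the usual sense:

```python
# Check Q-orthogonality (conjugacy): d1^T Q d2 == 0.
# Q, d1 and d2 are hand-picked for illustration.

def matvec(Q, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(q * x for q, x in zip(row, v)) for row in Q]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

Q = [[3.0, 2.0],
     [2.0, 6.0]]      # symmetric PD
d1 = [1.0, 0.0]
d2 = [2.0, -3.0]      # chosen so that d1^T Q d2 = 0

print(dot(d1, matvec(Q, d2)))   # 0.0 -> Q-orthogonal (conjugate)
print(dot(d1, d2))              # 2.0 -> not orthogonal in the usual sense
```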
I Proposition: If Q is SPD and the set of non-zero vectors d_0, d_1, . . . , d_k are Q-orthogonal, then these vectors are linearly independent.
Proof: Assume linear dependence, i.e., suppose there exist α_i, i = 0, 1, . . . , k, not all zero, such that

    α_0 d_0 + α_1 d_1 + · · · + α_k d_k = 0

Multiplying from the left with d_i^T Q yields

    α_0 d_i^T Q d_0 + · · · + α_i d_i^T Q d_i + · · · + α_k d_i^T Q d_k = α_i d_i^T Q d_i = 0

since every cross term vanishes by Q-orthogonality. But d_i^T Q d_i > 0 (Q is PD), so α_i = 0. Repeating for all i shows that every α_i = 0, contradicting the assumed dependence.
I Quadratic Problem:

    min (1/2) x^T Q x − b^T x

If Q is an N × N PD matrix, then we have the unique solution

    Q x* = b

Let d_0, d_1, . . . , d_{N−1} be non-zero Q-orthogonal vectors corresponding to the N × N SPD matrix Q. They are linearly independent, so the optimum solution can be written as

    x* = α_0 d_0 + α_1 d_1 + · · · + α_{N−1} d_{N−1}

We can find the value of the coefficients α_i by multiplying the above equation with d_i^T Q:

    d_i^T Q x* = α_i d_i^T Q d_i   ⟹   α_i = d_i^T b / (d_i^T Q d_i)    (using Q x* = b)

Finally, the optimum solution is given by

    x* = Σ_{i=0}^{N−1} ( d_i^T b / (d_i^T Q d_i) ) d_i
- The α_i can be found from the known vector b and matrix Q once the d_i are found.
- The expansion of x* is a result of an iterative process of N steps, where at the i-th step α_i d_i is added.
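The expansion can be checked numerically. A minimal sketch (the SPD matrix Q, the vector b, and the hand-picked Q-orthogonal directions d_0, d_1 are illustration choices, not from the slides):

```python
# Solve Q x* = b through the Q-orthogonal expansion
#   x* = sum_i (d_i^T b / d_i^T Q d_i) d_i
# Q, b and the conjugate directions below are hand-picked for illustration.

def matvec(Q, v):
    return [sum(q * x for q, x in zip(row, v)) for row in Q]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

Q = [[3.0, 2.0], [2.0, 6.0]]       # SPD
b = [2.0, -8.0]
dirs = [[1.0, 0.0], [2.0, -3.0]]   # satisfy d0^T Q d1 = 0

x = [0.0, 0.0]
for d in dirs:
    alpha = dot(d, b) / dot(d, matvec(Q, d))   # alpha_i = d_i^T b / d_i^T Q d_i
    x = [xi + alpha * di for xi, di in zip(x, d)]

print(x)   # ≈ [2.0, -2.0] up to rounding, which indeed solves Q x = b
```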
I Conjugate Direction Theorem: Let {d_i}_{i=0}^{N−1} be a set of non-zero Q-orthogonal vectors. For any x^(0) ∈ dom f(x), the sequence {x^(k)}_{k=0}^{N} generated according to

    x^(k+1) = x^(k) + α^(k) d_k,   k ≥ 0

with

    α^(k) = − (d_k^T g^(k)) / (d_k^T Q d_k)

where g^(k) is the gradient at x^(k),

    g^(k) = ∇f(x^(k)) = Q x^(k) − b,

converges to the unique solution x* of Q x* = b after N steps, i.e., x^(N) = x*.
Proof: Since the d_k are linearly independent, we can write

    x* − x^(0) = α^(0) d_0 + α^(1) d_1 + · · · + α^(N−1) d_{N−1}

for some α^(k). We can find α^(k) by

    α^(k) = d_k^T Q (x* − x^(0)) / (d_k^T Q d_k)    (1)

Now, following the iterative steps from x^(0) to x^(k),

    x^(k) − x^(0) = α^(0) d_0 + α^(1) d_1 + · · · + α^(k−1) d_{k−1}

and due to Q-orthogonality

    d_k^T Q (x^(k) − x^(0)) = 0    (2)

Using (1) and (2) we arrive at

    α^(k) = d_k^T Q (x* − x^(k)) / (d_k^T Q d_k) = − (d_k^T g^(k)) / (d_k^T Q d_k)

since d_k^T Q x* = d_k^T b and g^(k) = Q x^(k) − b.
Descent Properties of the Conjugate Gradient Method
I We define B^(k) as the subspace of R^N spanned by {d_0, d_1, . . . , d_{k−1}}, i.e.,

    B^(k) = span{d_0, d_1, . . . , d_{k−1}} ⊆ R^N

We will show that at each step x^(k) minimizes the objective over the k-dimensional linear variety x^(0) + B^(k).

I Theorem (Expanding Subspace Theorem): Let {d_i}_{i=0}^{N−1} be non-zero Q-orthogonal vectors in R^N. For any x^(0) ∈ R^N, the sequence generated by

    x^(k+1) = x^(k) + α^(k) d_k,   α^(k) = − (d_k^T g^(k)) / (d_k^T Q d_k)

minimizes f(x) = (1/2) x^T Q x − b^T x on the line

    x = x^(k−1) − α d_{k−1},   −∞ < α < ∞

and on x^(0) + B^(k).
Proof: Since x^(k) ∈ x^(0) + B^(k), i.e., B^(k) contains the line x = x^(k−1) − α d_{k−1}, it is enough to show that x^(k) minimizes f(x) over x^(0) + B^(k).

Since we assume that f(x) is strictly convex, the above condition holds when g^(k) is orthogonal to B^(k), i.e., when the gradient of f(x) at x^(k) is orthogonal to B^(k).
- The proof of g^(k) ⊥ B^(k) is by induction.

For k = 0, B^(0) = {} (the empty set), so g^(0) ⊥ B^(0) holds trivially.

Now assume that g^(k) ⊥ B^(k); we show that g^(k+1) ⊥ B^(k+1).

From the definition of g^(k) (g^(k) = Q x^(k) − b), it can be shown that

    g^(k+1) = g^(k) + α^(k) Q d_k

Hence, by the definition of α^(k),

    d_k^T g^(k+1) = d_k^T g^(k) + α^(k) d_k^T Q d_k = 0

Also, for i < k,

    d_i^T g^(k+1) = d_i^T g^(k) + α^(k) d_i^T Q d_k = 0

where the first term vanishes by the induction hypothesis and the second by Q-orthogonality.
- Corollary: The gradients g^(k), k = 0, 1, . . . , N, satisfy

    d_i^T g^(k) = 0,   for i < k.

By the expanding subspace property, at every iteration d_k increases the dimensionality of B. Since x^(k) minimizes f(x) over x^(0) + B^(k), x^(N) is the overall minimum of f(x).
The Conjugate Gradient Method
In the conjugate direction method, select the successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses.
I Conjugate Gradient Algorithm:

Start at any x^(0) ∈ R^N and define d^(0) = −g^(0) = b − Q x^(0)

repeat

1. α^(k) = − (d^(k)T g^(k)) / (d^(k)T Q d^(k))
2. x^(k+1) = x^(k) + α^(k) d^(k)
3. g^(k+1) = Q x^(k+1) − b
4. β^(k) = (g^(k+1)T Q d^(k)) / (d^(k)T Q d^(k))
5. d^(k+1) = −g^(k+1) + β^(k) d^(k)

until k = N.
- The algorithm terminates in at most N steps with the exact solution (for the quadratic case).
- The gradient is always linearly independent of all previous direction vectors, i.e., g^(k) ⊥ B^(k), where B^(k) = span{d_0, d_1, . . . , d_{k−1}}.
- If the solution is reached before N steps, the gradient is zero.
- Very simple formulas; the computational complexity is only slightly higher than that of the gradient descent algorithm.
- The process makes uniform progress toward the solution at every step, which is important for the nonquadratic case.
Example 13: Consider the quadratic problem

    min (1/2) x^T Q x − b^T x

where Q = [3 2; 2 6] and b = [2; −8].
Solution: the unique minimizer is x* = Q^{−1} b = (2, −2)^T, reached in N = 2 CG steps.
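The worked iterations on the original slide are a figure; as a quick check, the optimum can be verified directly from the optimality condition Q x* = b (a minimal sketch using Cramer's rule for the 2×2 system):

```python
# Verify Example 13: solve Q x* = b for Q = [[3,2],[2,6]], b = [2,-8]
# via Cramer's rule (2x2 system).
Q = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]

det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]     # 3*6 - 2*2 = 14
x1 = (b[0] * Q[1][1] - Q[0][1] * b[1]) / det    # (12 + 16) / 14 = 2
x2 = (Q[0][0] * b[1] - b[0] * Q[1][0]) / det    # (-24 - 4) / 14 = -2
print([x1, x2])   # [2.0, -2.0]
```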
CG Summary
I In theory (with exact arithmetic), CG converges to the solution in N steps

- The bad news: due to numerical round-off errors, it can take more than N steps (or fail to converge)
- The good news: with luck (i.e., a good spectrum of Q), it can get a good approximate solution in ≪ N steps

I Compared to direct (factor-solve) methods, CG is less reliable and data dependent; it often requires a good (problem-dependent) preconditioner

I But, when it works, it can solve extremely large systems
Extension to Nonquadratic Problems
I The idea is simple. We have two loops:

- The outer loop approximates the problem with a quadratic one
- The inner loop runs the conjugate gradient method (CGM) on the approximation

i.e., in the neighbourhood of a point x_0,

    f(x) ≅ f(x_0) + ∇^T f(x_0)(x − x_0) + (1/2)(x − x_0)^T H(x_0)(x − x_0) + residual

where the first three terms form a quadratic function and the residual → 0 near x_0.

- Expanding,

    f(x) ≅ (1/2) x^T H(x_0) x + (∇^T f(x_0) − x_0^T H(x_0)) x + f(x_0) + (1/2) x_0^T H(x_0) x_0 − ∇^T f(x_0) x_0

where the last three terms are independent of x, i.e., constant. Thus,

    min f(x) ≡ min (1/2) x^T H(x_0) x + (∇^T f(x_0) − x_0^T H(x_0)) x
            ≡ min (1/2) x^T Q x − b^T x
I Here,

    Q = H(x_0)
    b^T = −∇^T f(x_0) + x_0^T H(x_0)

The gradient g^(k) at x_0 = x^(k) is

    g^(k) = Q x^(k) − b = H(x_0) x_0 + ∇f(x_0) − H(x_0) x_0 = ∇f(x_0)
I Nonquadratic Conjugate Gradient Algorithm:

Starting at any x^(0) ∈ R^N, compute g^(0) = ∇f(x^(0)) and set d^(0) = −g^(0)

repeat
    repeat
        1. α^(k) = − (d^(k)T g^(k)) / (d^(k)T H(x^(k)) d^(k))
        2. x^(k+1) = x^(k) + α^(k) d^(k)
        3. g^(k+1) = ∇f(x^(k+1))
        4. β^(k) = (g^(k+1)T H(x^(k)) d^(k)) / (d^(k)T H(x^(k)) d^(k))
        5. d^(k+1) = −g^(k+1) + β^(k) d^(k)
    until k = N.
    The new starting point is x^(0) = x^(N), g^(0) = ∇f(x^(N)) and d^(0) = −g^(0).
until stopping criterion is satisfied
- No line search is required.
- H(x^(k)) must be evaluated at each point, which can be impractical.
- The algorithm may not be globally convergent.

I The involvement of H(x^(k)) can be avoided by employing a line search algorithm for α^(k) and slightly modifying β^(k).
I Nonquadratic Conjugate Gradient Algorithm with Line Search:

Starting at any x^(0) ∈ R^N, compute g^(0) = ∇f(x^(0)) and set d^(0) = −g^(0)

repeat
    repeat
        1. Line search: α^(k) = argmin_α f(x^(k) + α d^(k))
        2. Update: x^(k+1) = x^(k) + α^(k) d^(k)
        3. Gradient: g^(k+1) = ∇f(x^(k+1))
        4. Use the
           Fletcher-Reeves method: β^(k) = (g^(k+1)T g^(k+1)) / (g^(k)T g^(k)), or
           Polak-Ribiere method: β^(k) = ((g^(k+1) − g^(k))^T g^(k+1)) / (g^(k)T g^(k))
        5. d^(k+1) = −g^(k+1) + β^(k) d^(k)
    until k = N.
    The new starting point is x^(0) = x^(N), g^(0) = ∇f(x^(N)) and d^(0) = −g^(0).
until stopping criterion is satisfied
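A minimal pure-Python sketch of the line-search variant, using backtracking (Armijo) in place of the exact argmin, Fletcher-Reeves updates, and a restart every N inner steps. The test function f(x, y) = x² + 2y² + x⁴, the descent-direction safeguard, and all tolerances are illustration choices, not part of the slide's algorithm:

```python
# Nonquadratic CG with line search (Fletcher-Reeves), restarted every N steps.
# Backtracking (Armijo) replaces the exact line search of step 1.

def f(x):
    return x[0] ** 2 + 2.0 * x[1] ** 2 + x[0] ** 4   # smooth, convex, min at (0, 0)

def grad(x):
    return [2.0 * x[0] + 4.0 * x[0] ** 3, 4.0 * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def backtracking(x, d, g, rho=0.5, c=1e-4):
    """Armijo backtracking: shrink alpha until sufficient decrease."""
    alpha, fx, slope = 1.0, f(x), dot(g, d)
    while f([xi + alpha * di for xi, di in zip(x, d)]) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

def nonlinear_cg(x, n_restarts=100):
    N = len(x)
    for _ in range(n_restarts):                # outer loop
        g = grad(x)
        if dot(g, g) < 1e-18:                  # stopping criterion
            break
        d = [-gi for gi in g]                  # restart: gradient descent spacer step
        for _ in range(N):                     # inner loop, N steps
            if dot(g, d) >= 0.0:               # safeguard: keep d a descent direction
                d = [-gi for gi in g]
            alpha = backtracking(x, d, g)
            x = [xi + alpha * di for xi, di in zip(x, d)]
            g_new = grad(x)
            gg_new = dot(g_new, g_new)
            if gg_new == 0.0:
                return x
            beta = gg_new / dot(g, g)          # Fletcher-Reeves
            d = [-gn + beta * di for gn, di in zip(g_new, d)]
            g = g_new
    return x

x = nonlinear_cg([1.0, 1.0])
print(x)   # near the minimizer (0, 0)
```

The periodic restart is exactly the "spacer step" argument used below for global convergence: every N-th step is a plain gradient descent step.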
- The Polak-Ribiere method can be superior to the Fletcher-Reeves method.
- Global convergence of the line search methods is established by noting that a gradient descent step is taken every N steps and serves as a spacer step. Since the other steps do not increase the objective, and in fact hopefully decrease it, global convergence is guaranteed.
Example 14: Convergence example of the nonlinear Conjugate Gradient Method: (a) A complicated function with many local minima and maxima. (b) Convergence path of Fletcher-Reeves CG. Unlike linear CG, convergence does not occur in two steps. (c) Cross-section of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribiere CG.
The Newton Step
I In Newton's Method, local quadratic approximations of f(x) are utilized. Starting with the second-order Taylor approximation around x^(k),

    f(x^(k+1)) = f̂(x^(k+1)) + residual,
    f̂(x^(k+1)) = f(x^(k)) + ∇^T f(x^(k)) Δx + (1/2) Δx^T H(x^(k)) Δx

where Δx = x^(k+1) − x^(k), find Δx = Δx_nt such that f̂(x^(k+1)) is minimized.

I The optimum step of the quadratic approximation (obtained by solving ∂f̂(x^(k+1))/∂Δx = 0),

    Δx_nt = −H^{−1}(x^(k)) ∇f(x^(k)),

is called the Newton step. It is a descent direction, i.e.,

    ∇^T f(x^(k)) Δx_nt = −∇^T f(x^(k)) H^{−1}(x^(k)) ∇f(x^(k)) < 0

since H(x^(k)) ≻ 0 for strictly convex f(x).
I Then

    x^(k+1) = x^(k) + Δx_nt
Interpretation of the Newton Step

1. Minimizer of the second-order approximation

As given on the previous slide, Δx_nt minimizes f̂(x^(k+1)), i.e., the quadratic approximation of f(x) in the neighbourhood of x^(k).

- If f(x) is quadratic, then x^(0) + Δx is the exact minimizer of f(x) and the algorithm terminates in a single step with the exact answer.
- If f(x) is nearly quadratic, then x + Δx is a very good estimate of the minimizer x* of f(x).
- For twice differentiable f(x), the quadratic approximation is very accurate in the neighbourhood of x*, i.e., when x is very close to x*, the point x + Δx is a very good estimate of x*.
2. Steepest Descent Direction in the Hessian Norm

- The Newton step is the steepest descent direction at x^(k) in the quadratic norm defined by the Hessian,

    ‖v‖_{H(x^(k))} = (v^T H(x^(k)) v)^{1/2}

- In the steepest descent method, the quadratic norm ‖·‖_P can significantly increase the speed of convergence by decreasing the condition number. In the neighbourhood of x*, P = H(x*) is a very good choice.
- In Newton's method, when x is near x*, we have H(x) ≅ H(x*).
3. Solution of the Linearized Optimality Condition

- The first-order optimality condition is

    ∇f(x*) = 0

Near x* (using the first-order Taylor approximation for ∇f(x + Δx)),

    ∇f(x + Δx) ≅ ∇f(x) + H(x) Δx = 0

with the solution

    Δx_nt = −H^{−1}(x) ∇f(x)
The Newton Decrement
I The norm of the Newton step in the quadratic norm defined by H(x) is called the Newton decrement:

    λ(x) = ‖Δx_nt‖_{H(x)} = (Δx_nt^T H(x) Δx_nt)^{1/2}
I It can be used as a stopping criterion since it is an estimate of f(x) − p*, i.e.,

    f(x) − inf_y f̂(y) = f(x) − f̂(x + Δx_nt) = (1/2) λ²(x)

where

    f̂(x + Δx_nt) = f(x) + ∇^T f(x) Δx_nt + (1/2) Δx_nt^T H(x) Δx_nt

is the second-order quadratic approximation of f(x) at x.
Substituting f̂(x + Δx_nt) into f(x) − inf_y f̂(y) and letting

    Δx_nt = −H^{−1}(x) ∇f(x)

gives

    f(x) − inf_y f̂(y) = (1/2) ∇^T f(x) H^{−1}(x) ∇f(x) = (1/2) λ²(x)

I So, the algorithm can be terminated when λ²(x)/2 < ε, for some small ε.

I With the substitution Δx_nt = −H^{−1}(x) ∇f(x), the Newton decrement can also be written as

    λ(x^(k)) = (∇^T f(x^(k)) H^{−1}(x^(k)) ∇f(x^(k)))^{1/2}
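For a quadratic objective the second-order "approximation" is exact, so f(x) − f(x + Δx_nt) = λ²(x)/2 can be checked numerically. A minimal sketch (the quadratic f and the test point are illustration choices):

```python
# Check f(x) - f(x + dx_nt) = lambda^2 / 2 for the quadratic
# f(x) = (1/2) x^T Q x - b^T x, where H = Q and grad f = Qx - b.

def matvec(Q, v):
    return [sum(q * z for q, z in zip(row, v)) for row in Q]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve2(H, r):
    """Solve H d = r for a 2x2 matrix via Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(r[0] * H[1][1] - H[0][1] * r[1]) / det,
            (H[0][0] * r[1] - r[0] * H[1][0]) / det]

Q = [[3.0, 2.0], [2.0, 6.0]]
b = [2.0, -8.0]

def f(x):
    return 0.5 * dot(x, matvec(Q, x)) - dot(b, x)

x = [1.0, 1.0]                       # arbitrary test point
g = [gi - bi for gi, bi in zip(matvec(Q, x), b)]
dx = [-di for di in solve2(Q, g)]    # Newton step -H^{-1} g
lam2 = dot(g, solve2(Q, g))          # lambda^2 = g^T H^{-1} g

drop = f(x) - f([xi + di for xi, di in zip(x, dx)])
print(abs(drop - lam2 / 2.0) < 1e-12)   # True: the decrement predicts the decrease
```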
Newton's Method
I Given a starting point x^(0) ∈ dom f(x) and some small tolerance ε > 0

repeat

1. Compute the Newton step and Newton decrement:

    Δx^(k) = −H^{−1}(x^(k)) ∇f(x^(k))
    λ(x^(k)) = (∇^T f(x^(k)) H^{−1}(x^(k)) ∇f(x^(k)))^{1/2}

2. Stopping criterion: quit if λ²(x^(k))/2 ≤ ε.
3. Line search: choose a stepsize α^(k) > 0, e.g., by backtracking line search.
4. Update: x^(k+1) = x^(k) + α^(k) Δx^(k).
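The loop above can be sketched in pure Python. The test function f(x, y) = e^x + e^{−x} + y², its derivatives, and the backtracking parameters are illustration choices, not from the slides:

```python
import math

# Damped Newton's method with backtracking line search and the
# lambda^2/2 stopping criterion, on f(x, y) = exp(x) + exp(-x) + y^2.

def f(v):
    x, y = v
    return math.exp(x) + math.exp(-x) + y * y

def grad(v):
    x, y = v
    return [math.exp(x) - math.exp(-x), 2.0 * y]

def hess(v):
    x, _ = v
    return [[math.exp(x) + math.exp(-x), 0.0], [0.0, 2.0]]

def solve2(H, r):
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(r[0] * H[1][1] - H[0][1] * r[1]) / det,
            (H[0][0] * r[1] - r[0] * H[1][0]) / det]

def newton(v, eps=1e-10, rho=0.5, c=1e-4):
    while True:
        g = grad(v)
        dx = [-d for d in solve2(hess(v), g)]          # step 1: Newton step
        lam2 = -sum(gi * di for gi, di in zip(g, dx))  # lambda^2 = -grad^T dx
        if lam2 / 2.0 <= eps:                          # step 2: stopping criterion
            return v
        alpha = 1.0                                    # step 3: backtracking
        while f([vi + alpha * di for vi, di in zip(v, dx)]) > f(v) - c * alpha * lam2:
            alpha *= rho
        v = [vi + alpha * di for vi, di in zip(v, dx)]  # step 4: update

v = newton([1.0, 1.0])
print(v)   # very close to the minimizer (0, 0) after a handful of iterations
```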
I The stepsize α^(k) (i.e., line search) is required for the non-quadratic initial part of the algorithm. Otherwise, the algorithm may not converge due to large higher-order residuals.

I As x^(k) gets closer to x*, f(x) can be better approximated by the second-order expansion. Hence, the stepsize α^(k) is no longer required; the line search algorithm will automatically set α^(k) = 1.

I If we start with α^(k) = 1 and keep it the same, then the algorithm is called the pure Newton's method.

I For an arbitrary f(x), there are two regions of convergence:

- the damped Newton phase, when x is far from x*
- the quadratically convergent phase, when x gets close to x*

I If we let H(x) = I, the algorithm reduces to gradient descent (GD):

    x^(k+1) = x^(k) − α^(k) ∇f(x^(k))
I If H(x) is not positive definite, Newton's method will not converge.

So, use (aI + H(x))^{−1} instead of H^{−1}(x); this is also known as (a.k.a.) the Marquardt method. There always exists an a which will make the matrix (aI + H(x)) positive definite.

a is a trade-off between GD and NA:

- a → ∞ ⇒ Gradient Descent (GD)
- a → 0 ⇒ Newton's Method (NA)
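A small numerical illustration of why the damping helps (the indefinite H and the gradient g are hand-picked): with a = 0 the "Newton" direction can be an ascent direction, while a large enough a restores descent.

```python
# Marquardt-style damping: direction d = -(aI + H)^{-1} g.
# H here is indefinite (eigenvalues 1 and -2), hand-picked for illustration.

def solve2(M, r):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - r[0] * M[1][0]) / det]

def damped_direction(H, g, a):
    M = [[H[0][0] + a, H[0][1]], [H[1][0], H[1][1] + a]]
    return [-d for d in solve2(M, g)]

H = [[1.0, 0.0], [0.0, -2.0]]   # indefinite Hessian
g = [0.0, 1.0]

d0 = damped_direction(H, g, a=0.0)   # pure Newton direction
d3 = damped_direction(H, g, a=3.0)   # damped: aI + H is PD

print(sum(gi * di for gi, di in zip(g, d0)))   # 0.5 > 0: ascent direction!
print(sum(gi * di for gi, di in zip(g, d3)))   # -1.0 < 0: descent direction
```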
I The Newton step and decrement are independent of affine transformations (i.e., linear coordinate transformations): for non-singular T ∈ R^{N×N}, let

    x = T y   and   f̃(y) = f(Ty)

Then

    ∇f̃(y) = T^T ∇f(x)
    H̃(y) = T^T H(x) T

- So, the Newton step will be

    Δy_nt = −H̃^{−1}(y) ∇f̃(y)
          = −(T^T H(x) T)^{−1} (T^T ∇f(x))
          = −T^{−1} H^{−1}(x) ∇f(x)
          = T^{−1} Δx_nt

i.e.,

    x + Δx_nt = T (y + Δy_nt),   ∀x
- Similarly, the Newton decrement will be

    λ(y) = (∇^T f̃(y) H̃^{−1}(y) ∇f̃(y))^{1/2}
         = ((∇^T f(x) T)(T^T H(x) T)^{−1}(T^T ∇f(x)))^{1/2}
         = (∇^T f(x) H^{−1}(x) ∇f(x))^{1/2}
         = λ(x)

I Thus, Newton's Method is independent of affine transformations (i.e., linear coordinate transformations).
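The two identities can be verified numerically for a quadratic objective, where ∇f(x) = Qx − c and H = Q, with a hand-picked non-singular T (all values below are illustration choices):

```python
# Verify affine invariance: dy_nt = T^{-1} dx_nt and lambda(y) = lambda(x)
# for f(x) = (1/2) x^T Q x - c^T x, so grad f = Qx - c and H = Q.

def matvec(M, v):
    return [sum(m * z for m, z in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve2(M, r):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - r[0] * M[1][0]) / det]

def transpose(M):
    return [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Q = [[2.0, 0.0], [0.0, 4.0]]
c = [1.0, 1.0]
T = [[1.0, 1.0], [0.0, 1.0]]   # non-singular change of coordinates, x = T y

y = [0.5, -1.0]                # arbitrary point in y-coordinates
x = matvec(T, y)

# In x-coordinates
gx = [gi - ci for gi, ci in zip(matvec(Q, x), c)]
dx = [-d for d in solve2(Q, gx)]
lam_x2 = dot(gx, solve2(Q, gx))

# In y-coordinates: gradient T^T gx, Hessian T^T Q T
gy = matvec(transpose(T), gx)
Hy = matmul(transpose(T), matmul(Q, T))
dy = [-d for d in solve2(Hy, gy)]
lam_y2 = dot(gy, solve2(Hy, gy))

print(matvec(T, dy))                  # [1.0, 1.25], equal to dx
print(abs(lam_x2 - lam_y2) < 1e-12)   # True: the decrement is invariant
```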
Convergence Analysis
Read Boyd, Section 9.5.3.
I Assume a strongly convex f(x) with mI ⪯ H(x) for a constant m > 0, for all x ∈ dom f(x), and that H(x) is Lipschitz continuous on dom f(x), i.e.,

    ‖H(x) − H(y)‖_2 ≤ L ‖x − y‖_2

for a constant L > 0. This inequality imposes a bound on the third derivative of f(x).

If L is small, f(x) is closer to a quadratic function; if L is large, f(x) is far from a quadratic function; if L = 0, then f(x) is quadratic.

Thus, L measures how well f(x) can be approximated by a quadratic function.

- Newton's Method will perform well for small L.
Convergence: There exist constants η ∈ (0, m²/L) and σ > 0 such that:

I Damped Newton Phase: ‖∇f(x)‖_2 ≥ η

- α < 1 gives better solutions, so most iterations require a line search, e.g., backtracking line search.
- As k increases, the function value decreases by at least σ, but convergence is not necessarily quadratic.
- This phase ends after at most (f(x^(0)) − p*)/σ iterations.

I Quadratically Convergent Phase: ‖∇f(x)‖_2 < η

- All iterations use α = 1 (i.e., the quadratic approximation fits very well).
- ‖∇f(x^(k+1))‖_2 / ‖∇f(x^(k))‖_2² ≤ L/(2m²), i.e., quadratic convergence.
- For small ε > 0, f(x) − p* < ε is achieved after at most

    log₂ log₂ (ε₀/ε)

iterations in this phase, where ε₀ = 2m³/L². This is typically 5-6 iterations.

- The total number of iterations is bounded above by

    (f(x^(0)) − p*)/σ + log₂ log₂ (ε₀/ε)

where σ and ε₀ depend on m, L and x^(0).
NA Summary
I Convergence of Newton's method is rapid in general, and quadratic near x*. Once the quadratic convergence phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.

I Newton's method is affine invariant. It is insensitive to the choice of coordinates, or to the condition number of the sublevel sets of the objective.

I Newton's method scales well with problem size. Ignoring the computation of the Hessian, its performance on problems in R^{10000} is similar to its performance on problems in R^{10}, with only a modest increase in the number of steps required.

I The good performance of Newton's method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.
I The main disadvantage of Newton's method is the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations.

I Other alternatives (called quasi-Newton methods) are also provided by a family of algorithms for unconstrained optimization. These methods require less computational effort to form the search direction, but they share some of the strong advantages of Newton's method, such as rapid convergence near x*.
Example 15: Consider the nonquadratic problem in R^2 given in Example 8 and Example 12 (replace α and t with γ and α).
Example 16: Consider the nonquadratic problem in R^{100} given in Example 9 (replace α and t with γ and α).
Example 17: (problem in R^{10000}) Replace α and t with γ and α.
Approximation of the Hessian
For relatively large-scale problems, i.e., when N is large, calculating the inverse of the Hessian at each iteration can be costly. So, we may use some approximation S(x) of the inverse Hessian,

    S(x) ≈ H^{−1}(x)
    x^(k+1) = x^(k) − α^(k) S(x^(k)) ∇f(x^(k))
1. Hybrid GD + NA

We know that the first phase of Newton's Algorithm (NA) is not very fast. Therefore, we can first run GD, which has considerably lower complexity, and after some conditions are satisfied, switch to NA.

Newton's Algorithm may not converge for highly non-quadratic functions unless x is close to x*.

The hybrid method (given on the next slide) also guarantees global convergence.
I Hybrid Algorithm

- Start at x^(0) ∈ dom f(x)

repeat
    run GD (i.e., S(x^(k)) = I)
until stopping criterion is satisfied

- Start at the final point of GD

repeat
    run NA with exact H(x) (i.e., S(x^(k)) = H^{−1}(x^(k)))
until stopping criterion is satisfied
2. The Chord Method

If f(x) is close to a quadratic function, we may use S(x^(k)) = H^{−1}(x^(0)) throughout the iterations, i.e.,

    Δx^(k) = −H^{−1}(x^(0)) ∇f(x^(k))
    x^(k+1) = x^(k) + Δx^(k)

This is also the same as the SD algorithm with P = H(x^(0)) and α^(k) = 1.
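A 1-D sketch of the chord iteration (the test function f(x) = e^x + e^{−x} and the starting point are illustration choices); the factor "H^{−1}(x^(0))" is computed once and reused:

```python
import math

# Chord method in 1-D on f(x) = exp(x) + exp(-x):
# reuse f''(x0) for every step instead of re-evaluating the second derivative.

fprime = lambda x: math.exp(x) - math.exp(-x)
fsecond = lambda x: math.exp(x) + math.exp(-x)

x0 = 1.0
h_inv = 1.0 / fsecond(x0)       # "H^{-1}(x0)", computed once

x = x0
for _ in range(100):
    x = x - h_inv * fprime(x)   # dx = -H^{-1}(x0) f'(x)

print(abs(x) < 1e-12)   # True: converged (linearly) to the minimizer 0
```

Each step is cheaper than a full Newton step, at the price of linear rather than quadratic convergence.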
3. The Shamanskii Method

Updating the Hessian only every N iterations may give better performance, i.e.,

    S(x^(k)) = H^{−1}(x^(⌊k/N⌋N))
    Δx^(k) = −H^{−1}(x^(⌊k/N⌋N)) ∇f(x^(k))
    x^(k+1) = x^(k) + Δx^(k)

This is a trade-off between the Chord method (N → ∞) and the full NA (N → 1).
4. Approximating Particular Terms

Inversion of sparse matrices can be easier, i.e., when many entries of H(x) are zero:

- If some entries of H(x) are small or below a small threshold, set them to zero, obtaining H̃(x). Thus, H̃(x) becomes sparse.
- In the extreme case, when the Hessian is strongly diagonally dominant, set the off-diagonal terms to zero, obtaining H̃(x). Thus, H̃(x) becomes diagonal, which is very easy to invert.
There are also other, more advanced quasi-Newton (modified Newton) algorithms developed to approximate the inverse of the Hessian, e.g., the Broyden and Davidon-Fletcher-Powell (DFP) methods.