
COM S 578X: Optimization for Machine Learning

Lecture Note 7: Accelerated First-Order Methods

Jia (Kevin) Liu

Assistant Professor
Department of Computer Science

Iowa State University, Ames, Iowa, USA

Fall 2019


Outline

In this lecture:

Heavy-ball method

Conjugate gradient

Nesterov optimal first-order method

FISTA

Barzilai-Borwein

Adding simple constraints

Extending to regularized optimization


First-Order Methods

So far, you've seen gradient descent – the most natural first-order method

GD has a sublinear $O(1/k)$ rate for $\mathcal{F}_L^{\infty,1}$ and a slow linear rate for $\mathcal{S}_{\mu,L}^{2,1}$. Can we do better?

First-Order Method (Nesterov): An iterative method generates a sequence of test points $x_k$ such that

$$x_k \in x_0 + \mathrm{span}\{\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_{k-1})\}, \quad k \geq 1$$

Theorem 1 (Nesterov)

For any $k$ with $1 \leq k \leq \frac{1}{2}(n-1)$ and any $x_0 \in \mathbb{R}^n$, there exists a function $f \in \mathcal{F}_L^{\infty,1}$ such that for any first-order method, we have

$$f(x_k) - f(x^*) \geq \frac{3L\|x_0 - x^*\|_2^2}{32(k+1)^2} = O(1/k^2)$$

Pretty large gap between $O(1/k)$ and $O(1/k^2)$. So the answer is: yes, we can!


Heavy-Ball

Historical Perspective

▸ First proposed by Polyak in the 1960s [Polyak '64]

▸ Original goal: accelerate the gradient descent method

▸ Inspired by a second-order ODE of heavy-body motion

▸ Adopted in ML very early [Rumelhart et al. '86]

Basic Idea:

▸ Search direction: linear combination of the current gradient (gravity) and the previous search direction (momentum)

▸ Also called a two-step method; can be $N$-step in the general case


The Heavy-Ball Algorithm

Consider the unconstrained optimization problem, with f smooth and convex:

$$\min_{x \in \mathbb{R}^n} f(x)$$

Denote the optimal value by $f^* = \min_x f(x)$ and an optimal solution by $x^*$.

Heavy-Ball Method

Choose an initial point $x_0 \in \mathbb{R}^n$ and let $x_{-1} = x_0$. Repeat:

$$x_{k+1} = x_k - s_k \nabla f(x_k) + \beta_k (x_k - x_{k-1}), \quad k = 0, 1, 2, \ldots$$

Stop if some stopping criterion is satisfied.
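A minimal NumPy sketch of this iteration, using the constant step size and momentum coefficient from Theorem 2 below; the quadratic test problem and the stopping tolerance are illustrative choices, not from the slides:

```python
import numpy as np

def heavy_ball(grad, x0, s, beta, tol=1e-8, max_iter=10_000):
    """Heavy-ball iteration: x_{k+1} = x_k - s*grad(x_k) + beta*(x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()          # x_{-1} = x_0, as on the slide
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:          # a simple stopping criterion
            break
        x, x_prev = x - s * g + beta * (x - x_prev), x
    return x

# Toy strongly convex quadratic f(x) = 0.5 x^T A x with mu = 1, L = 100
A = np.diag(np.linspace(1.0, 100.0, 50))
mu, L = 1.0, 100.0
s = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2     # constant step size (Theorem 2)
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
x_sol = heavy_ball(lambda x: A @ x, np.ones(50), s, beta)
```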


Convergence Rate Analysis: Strongly Convex Case

Theorem 2 (Polyak)

If $f \in \mathcal{S}_{\mu,L}^{2,1}$, the Heavy-Ball method with constant step size $s = \frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$ and constant momentum coefficient $\beta = \left(\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}\right)^2$ satisfies:

$$\|x_k - x^*\|_2 \leq C \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \sqrt{2}\,\|x_0 - x^*\|_2,$$

where $\kappa \triangleq L/\mu$ and $C$ is a constant.


HB Convergence Rate Analysis: Strongly Convex Case

Proof Sketch.

Consider the new variable $z_k = [(x_k - x^*)^\top \;\; (x_{k-1} - x^*)^\top]^\top$.

Under constant step size $s$ and momentum coefficient $\beta$, the HB dynamics can be rewritten in terms of $z_k$:

$$\|z_{k+1}\|_2 = \left\| \begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix} \right\|_2 = \left\| \begin{bmatrix} x_k - s\nabla f(x_k) + \beta(x_k - x_{k-1}) - x^* \\ x_k - x^* \end{bmatrix} \right\|_2 = \left\| \begin{bmatrix} (1+\beta)I & -\beta I \\ I & 0 \end{bmatrix} \begin{bmatrix} x_k - x^* \\ x_{k-1} - x^* \end{bmatrix} - s \begin{bmatrix} \nabla f(x_k) \\ 0 \end{bmatrix} \right\|_2 \quad (1)$$

Note that

$$\nabla f(x_k) = \nabla f(x^*) + \int_0^1 \nabla^2 f(x^* + \tau(x_k - x^*))(x_k - x^*)\, d\tau = \Big[\underbrace{\int_0^1 \nabla^2 f(x^* + \tau(x_k - x^*))\, d\tau}_{B(x_k)}\Big](x_k - x^*)$$

Plugging this into (1), we obtain:


HB Convergence Rate Analysis: Strongly Convex Case

$$z_{k+1} = \begin{bmatrix} x_{k+1} - x^* \\ x_k - x^* \end{bmatrix} = \begin{bmatrix} x_k - s\nabla f(x_k) + \beta(x_k - x_{k-1}) - x^* \\ x_k - x^* \end{bmatrix} = \underbrace{\begin{bmatrix} (1+\beta)I - sB(x_k) & -\beta I \\ I & 0 \end{bmatrix}}_{\triangleq\, \Gamma(x_k)} z_k \quad (2)$$

So, (2) implies $\|z_k\|_2 = \big\|\prod_{i=0}^{k-1}\Gamma(x_i)\, z_0\big\|_2 \leq \max_{x_i, \forall i} \big\|\prod_{i=0}^{k-1}\Gamma(x_i)\, z_0\big\|_2$.

Noting the identical structure of $\Gamma(x_i)$ for all $i$ (implying index independence), let

$$\Gamma \triangleq \begin{bmatrix} (1+\beta)I - sB(x_\triangle) & -\beta I \\ I & 0 \end{bmatrix},$$

where $x_\triangle$ is a maximizer of the function $\|(\Gamma(x))^k z_0\|$. Then it can be shown that

$$\rho(\Gamma) \leq \sqrt{\beta}, \quad \text{where } \sqrt{\beta} = \max\{|1-\sqrt{s\mu}|,\, |1-\sqrt{sL}|\}.$$


HB Convergence Rate Analysis: Strongly Convex Case

Plugging this into (2), we have:

$$\|z_{k+1}\|_2 \leq \max\{|1-\sqrt{s\mu}|,\, |1-\sqrt{sL}|\}\, \|z_k\|_2. \quad (3)$$

If we let $s = \frac{4}{(\sqrt{\mu}+\sqrt{L})^2}$, we have:

$$|1-\sqrt{s\mu}| = |1-\sqrt{sL}| = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}} = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}.$$

Substituting this back into (3) yields:

$$\|x_k - x^*\|_2 \leq \|z_k\|_2 \leq \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \sqrt{2}\,\|x_0 - x^*\|_2.$$


GD vs. HB under Strong Convexity

Gradient Descent: linear rate $\left(\dfrac{\kappa-1}{\kappa+1}\right)$

Heavy-ball: linear rate $\left(\dfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)$

Big difference! For $\|x_k - x^*\|$ to reach $\varepsilon$-accuracy, we need $k$ large enough so that

$$\left(\frac{\kappa-1}{\kappa+1}\right)^k \leq \varepsilon \;\Rightarrow\; k \geq \frac{\kappa}{2}|\log \varepsilon| \quad \text{(Gradient Descent)}$$

$$\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \leq \varepsilon \;\Rightarrow\; k \geq \frac{\sqrt{\kappa}}{2}|\log \varepsilon| \quad \text{(Heavy-Ball)}$$

A factor of $\sqrt{\kappa}$ difference; e.g., if $\kappa = 1000$ (not at all uncommon in ML problems), heavy-ball needs roughly $\sqrt{1000} \approx 30$ times fewer iterations.
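A quick numeric sanity check of these iteration counts (a sketch; the accuracy $\varepsilon$ and condition number $\kappa$ below are illustrative choices, not from the slides):

```python
import math

kappa, eps = 1000.0, 1e-6
k_gd = math.log(eps) / math.log((kappa - 1) / (kappa + 1))                       # from the GD rate
k_hb = math.log(eps) / math.log((math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)) # from the HB rate
print(round(k_gd), round(k_hb), round(k_gd / k_hb))  # roughly 6900, 220, and a ~32x speedup
```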


Convergence Rate Analysis: Weakly Convex Case

Theorem 3 (Ghadimi-Feyzmahdavian-Johansson)

If $f \in \mathcal{F}_L^{1,1}$, the Heavy-Ball method with momentum $\beta_k = \frac{k}{k+2}$ and step size $\alpha_k = \frac{\alpha_0}{k+2}$, where $\alpha_0 \in (0, 1/L]$, satisfies:

$$f(x_k) - f(x^*) \leq \frac{\|x_0 - x^*\|_2^2}{2\alpha_0(k+1)} = O(1/k).$$

Remark:

HB has essentially the same sublinear rate as GD in the weakly convex case.


Conjugate Gradient

Definition 4 (The Notion of Conjugacy)

Let $H \in \mathbb{R}^{n \times n}$ be symmetric. The vectors $d_1, \ldots, d_n$ are called $H$-conjugate (or simply conjugate) if they are linearly independent and if $d_i^\top H d_j = 0$ for all $i \neq j$.

Conjugacy is significant for minimizing quadratic functions (which often arise in ML):

Let $f(x) = c^\top x + \frac{1}{2}x^\top H x$ and suppose $d_1, \ldots, d_n$ are $H$-conjugate. Then minimizing $f$ reduces to $n$ one-dimensional minimizations along $d_1, \ldots, d_n$ (see the expanding subspace property below).

Note: Conjugate directions are not unique
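A small NumPy sketch that builds one $H$-conjugate set by Gram-Schmidt-style conjugation of the standard basis and verifies $d_i^\top H d_j = 0$; the matrix $H$ below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)            # symmetric positive definite H

# Conjugate the standard basis: subtract H-projections onto previous directions
D = []
for e in np.eye(n):
    d = e.copy()
    for dj in D:
        d -= (dj @ H @ e) / (dj @ H @ dj) * dj
    D.append(d)
D = np.array(D)

# Off-diagonal entries of D H D^T are (numerically) zero, confirming H-conjugacy
print(np.round(D @ H @ D.T, 8))
```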


Conjugate Directions: Geometric Interpretation


Quadratic Minimization w/ Conjugacy: Finite Convergence

Theorem 5 (Expanding subspace property)

Let $f(x) = c^\top x + \frac{1}{2}x^\top H x$, where $H \in \mathbb{R}^{n \times n}$ is symmetric. Let $d_1, \ldots, d_n$ be $H$-conjugate and let $x_1$ be an arbitrary starting point. Let $s_k = \arg\min_{s \in \mathbb{R}} f(x_k + s d_k)$, $k = 1, \ldots, n$, and let $x_{k+1} = x_k + s_k d_k$. Then, for $k = 1, \ldots, n$, we must have:

$\nabla f(x_{k+1})^\top d_j = 0$, for $j = 1, \ldots, k$

$\nabla f(x_1)^\top d_k = \nabla f(x_k)^\top d_k$

$x_{k+1} = \arg\min_x \{ f(x) \mid x - x_1 \in \mathrm{span}\{d_1, \ldots, d_k\} \}$

$x_{n+1}$ is a minimizing point of $f(x)$ over $\mathbb{R}^n$
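A minimal sketch of the finite-termination property on a quadratic, using the standard linear CG recursion with exact line search; the problem size and the random $H$, $c$ below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)                 # symmetric positive definite
c = rng.standard_normal(n)

# Minimize f(x) = c^T x + 0.5 x^T H x (i.e., solve H x = -c) with linear CG
x = np.zeros(n)
g = c + H @ x                            # gradient of f at x
d = -g
for k in range(n):                       # at most n steps for an n x n Hessian
    s = -(g @ d) / (d @ H @ d)           # exact line search along d
    x = x + s * d
    g = c + H @ x
    if np.linalg.norm(g) < 1e-10:
        break
    gamma = (g @ H @ d) / (d @ H @ d)    # makes the new direction H-conjugate to d
    d = -g + gamma * d

print(k + 1, np.linalg.norm(H @ x + c))  # converges in <= n iterations (up to round-off)
```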


Geometric Interpretation

[Figure: iterates $x_1, x_2, x_3$ generated along conjugate directions $d_1, d_2$, with the gradients $\nabla f(x_2)$ and $\nabla f(x_3)$ shown at the corresponding points.]


Conjugate Gradient (Standard Version)

1. Initial point $x_1$; set $y_1 = x_1$, $d_1 = -\nabla f(y_1)$, $k = j = 1$.

2. If $\|\nabla f(y_j)\| \leq \epsilon$, stop. Otherwise, let $s_j = \arg\min_{s \geq 0} f(y_j + s d_j)$ and $y_{j+1} = y_j + s_j d_j$. If $j = n$, go to Step 4.

3. Let $d_{j+1} = -\nabla f(y_{j+1}) + \gamma_j d_j$. Let $j = j + 1$ and go to Step 2.

4. Let $x_{k+1} = y_1 = y_{n+1}$, $d_1 = -\nabla f(y_1)$. Let $j = 1$, $k = k + 1$. Go to Step 2.

Can be viewed as heavy-ball, with $\beta_j = \frac{s_j \gamma_j}{s_{j-1}}$. But CG can be implemented without requiring knowledge (or estimation) of $L$ and $\mu$.

▸ Choose $s_j$ to (approximately) minimize $f$ along $d_j$. Variants of choosing $\gamma_j$:

Fletcher-Reeves: $\gamma_j^{\mathrm{FR}} = \dfrac{\|\nabla f(y_{j+1})\|^2}{\|\nabla f(y_j)\|^2}$

Polak-Ribière: $\gamma_j^{\mathrm{PR}} = \dfrac{\nabla f(y_{j+1})^\top (\nabla f(y_{j+1}) - \nabla f(y_j))}{\|\nabla f(y_j)\|^2}$

Hestenes-Stiefel: $\gamma_j^{\mathrm{HS}} = \dfrac{s_j \nabla f(y_{j+1})^\top (\nabla f(y_{j+1}) - \nabla f(y_j))}{(y_{j+1} - y_j)^\top (\nabla f(y_{j+1}) - \nabla f(y_j))}$

▸ All are equivalent if $f$ is a convex quadratic and exact line search is used ([BSS Ch. 8])
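A sketch of the loop above with the Fletcher-Reeves choice of $\gamma_j$, an Armijo backtracking line search standing in for the exact line search, and the periodic restart mentioned on the next slide; the log-sum-exp test function is an illustrative choice, not from the slides:

```python
import numpy as np

def backtracking(f, x, d, g, s0=1.0, rho=0.5, c=1e-4):
    """Armijo backtracking line search along a descent direction d."""
    s = s0
    while f(x + s * d) > f(x) + c * s * (g @ d):
        s *= rho
    return s

def nonlinear_cg_fr(f, grad, x0, tol=1e-6, max_iter=1000):
    x = x0.copy()
    g = grad(x)
    d = -g
    for j in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        s = backtracking(f, x, d, g)
        x = x + s * d
        g_new = grad(x)
        gamma = (g_new @ g_new) / (g @ g)          # Fletcher-Reeves
        d = -g_new + gamma * d
        g = g_new
        if (j + 1) % len(x0) == 0 or g @ d >= 0:   # restart every n steps or if d is not a descent dir.
            d = -g
    return x

# Illustrative smooth convex test: f(x) = log(sum(exp(A x - b)))
A = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -1.0]])
b = np.array([0.5, 0.2, -0.1])
f = lambda x: np.log(np.sum(np.exp(A @ x - b)))
grad = lambda x: A.T @ (np.exp(A @ x - b) / np.sum(np.exp(A @ x - b)))
x_min = nonlinear_cg_fr(f, grad, x0=np.zeros(2))
```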


Conjugate Gradient

Nonlinear CG: variants include Fletcher-Reeves, Polak-Ribière, Hestenes-Stiefel, etc.

Restart periodically with $d_1 = -\nabla f(x_k)$ (e.g., every $n$ iterations, or when $d_j$ fails to be a descent direction)

For quadratic $f(\cdot)$, the convergence analysis is based on the eigenvalues of the Hessian $H$, Chebyshev polynomials, and min-max arguments. We get:

▸ Finite termination in at most as many iterations as there are distinct eigenvalues

▸ Asymptotic linear convergence with rate approximately $\dfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ (like heavy-ball, see [Nocedal and Wright, Ch. 5])

▸ Similar sublinear $O(1/k)$ rate for $f \in \mathcal{F}_L^{1,1}$

Can we close the gap between $O(1/k)$ and $O(1/k^2)$ for $\mathcal{F}_L^{1,1}$?


Nesterov Accelerated First-Order Method

Consider an unconstrained convex optimization problem:

$$\min_{x \in \mathbb{R}^n} f(x), \quad \text{where } f(\cdot) \text{ has an } L\text{-Lipschitz continuous gradient.}$$

Nesterov's AGD method [Nesterov '83]:

1. Choose initial $x_0$. Let $y_0 = x_0$. Choose $\alpha_0 \in (0, 1)$.

2. Iteration $k \geq 0$:

a) Compute $f'(y_k)$ and let:
$$x_{k+1} = y_k - (1/L) f'(y_k). \quad \text{(regular gradient step)}$$

b) Compute $\alpha_{k+1} \in (0,1)$ by solving $\alpha_{k+1}^2 + [\alpha_k^2 - (1/\kappa)]\alpha_{k+1} - \alpha_k^2 = 0$. Let $\beta_k = \dfrac{\alpha_k(1-\alpha_{k+1})}{\alpha_k^2 + \alpha_{k+1}}$, and
$$y_{k+1} = x_{k+1} + \beta_k(x_{k+1} - x_k). \quad \text{(``slide'' in the direction of } x\text{)}$$

Still works for weakly convex (κ =∞)
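A NumPy sketch of this scheme; the stopping test on $\|f'(y_k)\|$, the choice of $\alpha_0$, and the quadratic test problem are illustrative additions:

```python
import numpy as np

def nesterov_agd(grad, x0, L, mu=0.0, alpha0=0.5, tol=1e-8, max_iter=10_000):
    """Nesterov's AGD as on this slide; mu = 0 gives the weakly convex case (1/kappa = 0)."""
    inv_kappa = mu / L
    x, y, alpha = x0.copy(), x0.copy(), alpha0
    for _ in range(max_iter):
        g = grad(y)
        if np.linalg.norm(g) <= tol:
            break
        x_next = y - g / L                                     # regular gradient step
        # positive root of a^2 + (alpha_k^2 - 1/kappa) a - alpha_k^2 = 0
        b = alpha ** 2 - inv_kappa
        alpha_next = (-b + np.sqrt(b ** 2 + 4 * alpha ** 2)) / 2
        beta = alpha * (1 - alpha_next) / (alpha ** 2 + alpha_next)
        y = x_next + beta * (x_next - x)                       # "slide" in the direction of x
        x, alpha = x_next, alpha_next
    return x

# Illustrative strongly convex quadratic with mu = 1, L = 100
A = np.diag(np.linspace(1.0, 100.0, 50))
x_hat = nesterov_agd(lambda x: A @ x, np.ones(50), L=100.0, mu=1.0)
```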


Nesterov’s Accelerated First-Order Method

[Figure: interleaved iterates $x_k, y_k, x_{k+1}, y_{k+1}, x_{k+2}, y_{k+2}$ produced by alternating gradient and momentum steps.]

In its simplest form (constant step sizes):

$$x_{k+1} = y_k - (1/L) f'(y_k), \qquad y_{k+1} = x_{k+1} + \beta(x_{k+1} - x_k).$$

Nesterov's AGD achieves:

▸ Strongly convex: linear convergence, at a rate that depends on $\sqrt{\kappa}$ rather than $\kappa \triangleq L/\mu$

▸ Weakly convex: sublinear $O(1/k^2)$ (order-optimal)


Convergence Performance of Nesterov’s Method

Theorem 6

The Nesterov accelerated first-order method with $\alpha_0 \geq 1/\sqrt{\kappa}$ satisfies:

$$f(x_k) - f(x^*) \leq c_1 \min\left\{\left(1 - \frac{1}{\sqrt{\kappa}}\right)^k,\; \frac{4L}{(\sqrt{L} + c_2 k)^2}\right\},$$

where the constants $c_1$ and $c_2$ depend on $x_0$, $\alpha_0$, and $L$.

Remark:

Linear convergence rate similar to Heavy-Ball if $f \in \mathcal{S}_{\mu,L}^{2,1}$

Sublinear $O(1/k^2)$ if $f \in \mathcal{F}_L^{1,1}$

In the special case where $\alpha_0 = 1/\sqrt{\kappa}$, this scheme yields:

$$\alpha_k = \frac{1}{\sqrt{\kappa}}, \qquad \beta_k = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$$


FISTA

Beck and Teboulle proposed a similar momentum-based algorithm:

1. Choose initial $x_0$. Let $y_1 = x_0$, $t_1 = 1$.

2. Iteration $k \geq 1$: Do the following computations:

$$x_k \leftarrow y_k - \frac{1}{L}\nabla f(y_k);$$

$$t_{k+1} \leftarrow \frac{1}{2}\left(1 + \sqrt{1 + 4t_k^2}\right);$$

$$y_{k+1} \leftarrow x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$$

For weakly convex $f$, FISTA converges with $f(x_k) - f(x^*) = O(1/k^2)$

When L is unknown, increase an estimate of L until it’s big enough
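A sketch of the iteration above for smooth $f$ (no nonsmooth term, hence no prox step), applied to an illustrative least-squares problem with $L = \|A\|_2^2$:

```python
import numpy as np

def fista_smooth(grad, L, x0, max_iter=500):
    """FISTA momentum schedule for smooth f (no regularizer / prox step)."""
    x_prev = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(max_iter):
        x = y - grad(y) / L                          # gradient step at the extrapolated point
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)  # momentum ("slide") step
        x_prev, t = x, t_next
    return x_prev

# Illustrative least-squares problem f(x) = 0.5 ||Ax - b||^2
rng = np.random.default_rng(2)
A, b = rng.standard_normal((40, 20)), rng.standard_normal(40)
L = np.linalg.norm(A, 2) ** 2
x_ls = fista_smooth(lambda x: A.T @ (A @ x - b), L, x0=np.zeros(20))
```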


Nonmonotone Gradient Method: Barzilai-Borwein

Barzilai and Borwein (BB) proposed an unusual choice of $s_k$

Allows $f$ to increase (sometimes a lot) on some steps: non-monotone

$$x_{k+1} = x_k - s_k \nabla f(x_k), \qquad s_k = \arg\min_s \|r_k - s q_k\|_2,$$

where $r_k = x_k - x_{k-1}$ and $q_k \triangleq \nabla f(x_k) - \nabla f(x_{k-1})$

Explicitly, we have

$$s_k = \frac{r_k^\top q_k}{q_k^\top q_k}$$

Note that for $f(x) = \frac{1}{2}x^\top A x$, we have

$$s_k = \frac{r_k^\top A r_k}{r_k^\top A^2 r_k} \in \left[\frac{1}{L}, \frac{1}{\mu}\right]$$

BB can be viewed as a quasi-Newton method, with the Hessian approximated by $s_k^{-1} I$
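A sketch of the BB-step gradient method on an illustrative quadratic; the plain first step (used to initialize $r_k$ and $q_k$) and the stopping test are assumptions, not from the slides:

```python
import numpy as np

def bb_gradient(grad, x0, s_init=1e-3, tol=1e-8, max_iter=5000):
    x_prev, g_prev = x0.copy(), grad(x0)
    x = x_prev - s_init * g_prev                 # plain gradient step to initialize r, q
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        r, q = x - x_prev, g - g_prev            # r_k and q_k from the slide
        s = (r @ q) / (q @ q)                    # BB step size
        x_prev, g_prev = x, g
        x = x - s * g
    return x

# Illustrative quadratic f(x) = 0.5 x^T A x with spectrum in [1, 100]
A = np.diag(np.linspace(1.0, 100.0, 50))
x_bb = bb_gradient(lambda x: A @ x, np.ones(50))
```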


Comparison: BB vs Gradient Descent


Extension to Simple Constraints

Constraint set Ω: a (relatively simple) closed convex set

Some algorithms and theory stay largely the same, if we can involve the constraint $x \in \Omega$ explicitly in the subproblems

Example: Nesterov's constant-step scheme requires just one calculation to be changed from the unconstrained version, as follows:

1. Choose initial $x_0$. Let $y_0 = x_0$. Choose $\alpha_0 \in (0, 1)$.

2. Iteration $k \geq 0$:

a) Compute $f'(y_k)$ and let:
$$x_{k+1} = \arg\min_{y \in \Omega} \big\| y - \big[ y_k - (1/L) f'(y_k) \big] \big\|_2. \quad \text{(projected gradient step)}$$

b) Compute $\alpha_{k+1} \in (0,1)$ by solving $\alpha_{k+1}^2 + [\alpha_k^2 - (1/\kappa)]\alpha_{k+1} - \alpha_k^2 = 0$. Let $\beta_k = \dfrac{\alpha_k(1-\alpha_{k+1})}{\alpha_k^2 + \alpha_{k+1}}$, and
$$y_{k+1} = x_{k+1} + \beta_k(x_{k+1} - x_k). \quad \text{(``slide'' in the direction of } x\text{)}$$

Convergence theory is unchanged.
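A sketch of the one changed computation, the projected gradient step, for a box constraint $\Omega = \{x : l \leq x \leq u\}$; the box is an illustrative choice of a "simple" set whose projection is cheap:

```python
import numpy as np

def project_box(z, lower, upper):
    """Euclidean projection onto the box {x : lower <= x <= upper} (componentwise clip)."""
    return np.clip(z, lower, upper)

def projected_gradient_step(y, grad_y, L, lower, upper):
    """x_{k+1} = argmin_{x in Omega} ||x - (y_k - (1/L) f'(y_k))||_2 for a box Omega."""
    return project_box(y - grad_y / L, lower, upper)

# Example: one step for f(x) = 0.5 ||x - c||^2 with Omega = [0, 1]^3
c = np.array([2.0, -1.0, 0.5])
y = np.zeros(3)
x_next = projected_gradient_step(y, grad_y=(y - c), L=1.0, lower=0.0, upper=1.0)
print(x_next)   # -> [1. 0. 0.5], the box projection of the unconstrained step
```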


Extension to Regularized Optimization

Consider the following optimization with regularization:

$$\min_x\; f(x) + \tau\psi(x),$$

where $f \in \mathcal{F}_L^{1,1}$ and $\psi$ is convex but usually nonsmooth

Often, we only need to change the update to:

$$x_{k+1} = \arg\min_x\; \frac{1}{2s_k}\|x - (x_k + s_k d_k)\|_2^2 + \tau\psi(x),$$

where $d_k$ could be a scaled gradient descent step or a deflected gradient (e.g., heavy-ball, Nesterov, etc.), while $s_k$ is the step size

This is also referred to as a shrinkage/thresholding step. More on how to solve the above subproblem later, when we discuss sparse/regularized optimization
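For instance, with $\psi(x) = \|x\|_1$ the subproblem above has the closed-form soft-thresholding solution; a minimal sketch (the specific $f$ in the usage example is an illustrative choice):

```python
import numpy as np

def soft_threshold(z, thresh):
    """Closed-form solution of argmin_x (1/(2s))||x - z||^2 + tau*||x||_1 with thresh = s*tau."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def regularized_update(x, d, s, tau):
    """x_{k+1} = argmin_x (1/(2s)) ||x - (x_k + s d_k)||^2 + tau ||x||_1."""
    return soft_threshold(x + s * d, s * tau)

# One update with d_k = -grad f(x_k) for f(x) = 0.5 ||x - c||^2
c = np.array([3.0, -0.2, 0.0, 1.5])
x = np.zeros(4)
print(regularized_update(x, d=-(x - c), s=1.0, tau=0.5))   # -> [2.5, 0., 0., 1.]
```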


Next Class

Subgradient Method
