Benefiting from Negative Curvature
Daniel P. Robinson, Johns Hopkins University
Department of Applied Mathematics and Statistics
Collaborator: Frank E. Curtis (Lehigh University)
US and Mexico Workshop on Optimization and Its Applications, Huatulco, Mexico, January 8, 2018
Negative Curvature US-Mexico-2018 1 / 31
Outline
1 Motivation
2 Deterministic Setting
   The Method
   Convergence Results
   Numerical Results
   Comments
3 Stochastic Setting
Motivation
Problem of interest: deterministic setting

$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \ f(x)$$

- f : R^n → R is assumed to be twice-continuously differentiable.
- L will denote the Lipschitz constant for ∇f.
- σ will denote the Lipschitz constant for ∇²f.
- f may be nonconvex.

Notation: g(x) := ∇f(x) and H(x) := ∇²f(x).
Much work has been done on convergence to second-order points:

D. Goldfarb (1980) [6]
- proves a convergence result to second-order optimal points (unconstrained)
- curvilinear search using a descent direction and a negative curvature direction

D. Goldfarb, C. Mu, J. Wright, and C. Zhou (2017) [7]
- consider equality constrained problems
- prove a convergence result to second-order optimal points
- extend the curvilinear search from the unconstrained setting

F. Facchinei and S. Lucidi (1998) [3]
- consider inequality constrained problems
- exact penalty function, directions of negative curvature, and line search

P. Gill, V. Kungurtsev, and D. Robinson (2017) [4, 5]
- consider inequality constrained problems
- convergence to second-order optimal points under weak assumptions

J. Moré and D. Sorensen (1979), A. Forsgren, P. Gill, and W. Murray (1995), and many more . . .
None consistently perform better by using directions of negative curvature!
Others hope to avoid saddle points:

J. Lee, M. Simchowitz, M. Jordan, and B. Recht (2016) [8]
- Gradient descent converges to a local minimizer almost surely.
- Uses random initialization.

Y. Dauphin et al. (2014) [2]
- Present a saddle-free Newton method (it is a modified-Newton method).
- Goal is to escape saddle points (move away when close).

These (and others) try to avoid the ill effects of negative curvature.
Purpose of this research:
Design a method that consistently performs better by using directions of negative curvature.
Do not try to avoid negative curvature. Use it!
Deterministic Setting
Deterministic Setting The Method
Overview:
- Compute a descent direction (s_k) and a negative curvature direction (d_k).
- Predict which step will make more progress in reducing the objective f.
- If the predicted decrease is not realized, adjust parameters.
- Iterate until an approximate second-order solution is obtained.
Requirements on the descent direction s_k

Compute s_k to satisfy

$$-g(x_k)^T s_k \ge \delta \|s_k\|_2 \|g(x_k)\|_2 \quad \text{(some } \delta \in (0,1]\text{)}.$$

Examples:
- s_k = −g(x_k)
- B_k s_k = −g(x_k) with B_k appropriately chosen

Requirements on the negative curvature direction d_k

Compute d_k to satisfy

$$d_k^T H(x_k) d_k \le \gamma \lambda_k \|d_k\|_2^2 < 0 \quad \text{(some } \gamma \in (0,1]\text{)} \qquad \text{and} \qquad g(x_k)^T d_k \le 0.$$

Examples:
- d_k = ±v_k with (λ_k, v_k) being the leftmost eigenpair of H(x_k)
- d_k a sufficiently accurate estimate of ±v_k
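As a concrete illustration, the two example choices above can be computed with a few lines of numpy (a sketch; the function names are mine, not from the talk):

```python
import numpy as np

def descent_direction(g):
    """Steepest-descent choice s_k = -g(x_k); it satisfies the
    sufficient-descent condition with delta = 1."""
    return -g

def negative_curvature_direction(H, g):
    """Leftmost eigenpair of H(x_k).  Returns (lambda_k, d_k) with the sign
    of d_k chosen so that g(x_k)^T d_k <= 0, or (lambda_k, None) when no
    negative curvature exists."""
    eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues in ascending order
    lam, v = eigvals[0], eigvecs[:, 0]
    if lam >= 0.0:
        return lam, None                   # H(x_k) is positive semidefinite
    d = -v if g @ v > 0.0 else v           # enforce g(x_k)^T d_k <= 0
    return lam, d
```

With the exact eigenpair, both direction conditions hold with γ = 1; an inexact eigenvector estimate (e.g. from a few Lanczos iterations) gives γ < 1.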
How to use s_k and d_k?

Use both in a curvilinear line search?
- Often taints good descent directions with "poorly scaled" directions of negative curvature.
- No consistent performance gains!

Start using d_k only once ‖g(x_k)‖ is "small"?
- No consistent performance gains!
- Misses areas of the space in which great decrease in f is possible.

Use s_k when ‖g(x_k)‖ is big relative to |(λ_k)_−|; otherwise, use d_k?
- Better, but still inconsistent performance gains!

We propose to use upper-bounding models. It works!
Predicted decrease along the descent direction s_k

If L_k ≥ L, then

$$f(x_k + \alpha s_k) \le f(x_k) - m_{s,k}(\alpha) \quad \text{(for all } \alpha\text{)}$$

with

$$m_{s,k}(\alpha) := -\alpha\, g(x_k)^T s_k - \tfrac{1}{2} L_k \alpha^2 \|s_k\|_2^2,$$

and define the quantity

$$\alpha_k := \frac{-g(x_k)^T s_k}{L_k \|s_k\|_2^2} = \operatorname*{argmax}_{\alpha \ge 0}\; m_{s,k}(\alpha).$$

Comments:
- m_{s,k}(α_k) is the best predicted decrease along s_k.
- If s_k = −g(x_k), then α_k = 1/L_k.
Predicted decrease along the negative curvature direction d_k

If σ_k ≥ σ, then

$$f(x_k + \beta d_k) \le f(x_k) - m_{d,k}(\beta) \quad \text{(for all } \beta\text{)}$$

with

$$m_{d,k}(\beta) := -\beta\, g(x_k)^T d_k - \tfrac{1}{2} \beta^2 d_k^T H(x_k) d_k - \tfrac{\sigma_k}{6} \beta^3 \|d_k\|_2^3,$$

and define, with c_k := d_k^T H(x_k) d_k, the quantity

$$\beta_k := \frac{-c_k + \sqrt{c_k^2 - 2\sigma_k \|d_k\|_2^3\, g(x_k)^T d_k}}{\sigma_k \|d_k\|_2^3} = \operatorname*{argmax}_{\beta \ge 0}\; m_{d,k}(\beta).$$

Comments:
- m_{d,k}(β_k) is the best predicted decrease along d_k.
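Both maximizers have closed forms: α_k sets the derivative of the quadratic model to zero, and β_k is the positive root of the quadratic (σ_k/2)‖d_k‖³β² + c_kβ + g(x_k)ᵀd_k = 0. A small numpy sketch (helper names are mine):

```python
import numpy as np

def alpha_and_decrease(g, s, Lk):
    """Maximizer alpha_k of m_{s,k} and the predicted decrease m_{s,k}(alpha_k)."""
    alpha = -(g @ s) / (Lk * (s @ s))
    m = -alpha * (g @ s) - 0.5 * Lk * alpha**2 * (s @ s)
    return alpha, m

def beta_and_decrease(g, d, H, sigma_k):
    """Maximizer beta_k of m_{d,k} (positive root of the quadratic
    (sigma_k/2)*||d||^3 * beta^2 + c_k*beta + g^T d = 0) and m_{d,k}(beta_k)."""
    nd3 = np.linalg.norm(d) ** 3
    c = d @ H @ d                        # c_k < 0 by the curvature condition
    beta = (-c + np.sqrt(c**2 - 2.0 * sigma_k * nd3 * (g @ d))) / (sigma_k * nd3)
    m = -beta * (g @ d) - 0.5 * beta**2 * c - (sigma_k / 6.0) * beta**3 * nd3
    return beta, m
```

Note that the discriminant is always nonnegative because g(x_k)ᵀd_k ≤ 0, so β_k is well defined.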
Choose the step that predicts the larger decrease in f:
- If m_{s,k}(α_k) ≥ m_{d,k}(β_k), then try the step s_k.
- If m_{d,k}(β_k) > m_{s,k}(α_k), then try the step d_k.

Question: Why "try" instead of "use"?
Answer: We do not know whether L_k ≥ L and σ_k ≥ σ.
- If L_k < L, then it could be the case that f(x_k + α_k s_k) > f(x_k) − m_{s,k}(α_k).
- If σ_k < σ, then it could be the case that f(x_k + β_k d_k) > f(x_k) − m_{d,k}(β_k).
Dynamic Step-Size Algorithm

1: for k ∈ N do
2:    compute s_k and d_k satisfying the required step conditions
3:    loop
4:       compute α_k = argmax_{α ≥ 0} m_{s,k}(α) and β_k = argmax_{β ≥ 0} m_{d,k}(β)
5:       if m_{s,k}(α_k) ≥ m_{d,k}(β_k) then
6:          if f(x_k + α_k s_k) ≤ f(x_k) − m_{s,k}(α_k) then
7:             set x_{k+1} ← x_k + α_k s_k and then exit loop
8:          else
9:             set L_k ← ρ L_k   [ρ ∈ (1, ∞)]
10:      else
11:         if f(x_k + β_k d_k) ≤ f(x_k) − m_{d,k}(β_k) then
12:            set x_{k+1} ← x_k + β_k d_k and then exit loop
13:         else
14:            set σ_k ← ρ σ_k
15:   set (L_{k+1}, σ_{k+1}) ∈ (L_min, L_k] × (σ_min, σ_k]
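The inner loop (lines 3–14) can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation; it inlines the two model formulas and assumes d_k is a valid negative curvature direction:

```python
import numpy as np

def dynamic_step(f, x, g, s, d, H, Lk, sigma_k, rho=2.0):
    """One inner-loop pass: try whichever of s_k, d_k predicts the larger
    decrease; if the upper-bounding model is violated, inflate the
    corresponding parameter and retry.  Returns (x_next, Lk, sigma_k)."""
    fx = f(x)
    while True:
        # Model maximizers and predicted decreases (see the previous slides).
        alpha = -(g @ s) / (Lk * (s @ s))
        ms = -alpha * (g @ s) - 0.5 * Lk * alpha**2 * (s @ s)
        nd3 = np.linalg.norm(d) ** 3
        c = d @ H @ d
        beta = (-c + np.sqrt(c**2 - 2.0 * sigma_k * nd3 * (g @ d))) / (sigma_k * nd3)
        md = -beta * (g @ d) - 0.5 * beta**2 * c - (sigma_k / 6.0) * beta**3 * nd3
        if ms >= md:
            if f(x + alpha * s) <= fx - ms:       # model bound held: accept
                return x + alpha * s, Lk, sigma_k
            Lk *= rho                              # bound violated: raise L_k
        else:
            if f(x + beta * d) <= fx - md:
                return x + beta * d, Lk, sigma_k
            sigma_k *= rho                         # bound violated: raise sigma_k
```

Each pass costs one extra function evaluation at most, and only the parameter whose bound failed is increased.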
Deterministic Setting Convergence Results
Key decrease inequality: For all k ∈ N it holds that

$$f(x_k) - f(x_{k+1}) \ge \max\left\{ \frac{\delta^2}{2L_k} \|g(x_k)\|_2^2,\; \frac{2\gamma^3}{3\sigma_k^2} |(\lambda_k)_-|^3 \right\}.$$

Comments:
- The first term in the max holds when x_{k+1} = x_k + α_k s_k.
- The second term in the max holds when x_{k+1} = x_k + β_k d_k.
- The max holds because we choose whether to try s_k or d_k based on m_{s,k}(α_k) ≥ m_{d,k}(β_k).
- One can prove that {L_k} and {σ_k} remain uniformly bounded.
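To see where the two terms come from, one can substitute the maximizers back into the models. A sketch, shown for the curvature term in the simplified case g(x_k)ᵀd_k = 0 (the general argument is in the paper):

```latex
% Descent term: plug \alpha_k into m_{s,k} and apply the sufficient-descent condition.
m_{s,k}(\alpha_k)
  = \frac{\big(g(x_k)^T s_k\big)^2}{2 L_k \|s_k\|_2^2}
  \ge \frac{\delta^2}{2 L_k} \|g(x_k)\|_2^2,
\quad\text{using } -g(x_k)^T s_k \ge \delta \|s_k\|_2 \|g(x_k)\|_2 .

% Curvature term: when g(x_k)^T d_k = 0, the maximizer simplifies to
\beta_k = \frac{-2 c_k}{\sigma_k \|d_k\|_2^3},
\qquad
m_{d,k}(\beta_k)
  = \frac{2 |c_k|^3}{3 \sigma_k^2 \|d_k\|_2^6}
  \ge \frac{2\gamma^3}{3\sigma_k^2} |(\lambda_k)_-|^3,
\quad\text{using } c_k \le \gamma \lambda_k \|d_k\|_2^2 < 0 .
```

Taking the larger predicted decrease therefore guarantees the max of the two lower bounds on actual decrease whenever the accepted step's model bound holds.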
Theorem (Limit points satisfy second-order necessary conditions). The computed iterates satisfy

$$\lim_{k\to\infty} \|g(x_k)\|_2 = 0 \quad \text{and} \quad \liminf_{k\to\infty} \lambda_k \ge 0.$$

Theorem (Complexity result). The number of iterations, function, and derivative (i.e., gradient and Hessian) evaluations required until some iteration k ∈ N is reached with

$$\|g(x_k)\|_2 \le \epsilon_g \quad \text{and} \quad |(\lambda_k)_-| \le \epsilon_H$$

is at most $O(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$.
Deterministic Setting Numerical Results
Refined parameter increase strategy

$$\hat{L}_k \leftarrow L_k + \frac{2\big(f(x_k + \alpha_k s_k) - f(x_k) + m_{s,k}(\alpha_k)\big)}{\alpha_k^2 \|s_k\|^2},
\qquad
\hat{\sigma}_k \leftarrow \sigma_k + \frac{6\big(f(x_k + \beta_k d_k) - f(x_k) + m_{d,k}(\beta_k)\big)}{\beta_k^3 \|d_k\|^3},$$

then, with ρ ← 2, use the update

$$L_k \leftarrow \max\{\rho L_k, \min\{10^3 L_k, \hat{L}_k\}\}, \qquad \sigma_k \leftarrow \max\{\rho \sigma_k, \min\{10^3 \sigma_k, \hat{\sigma}_k\}\}.$$

Refined parameter decrease strategy

$$L_{k+1} \leftarrow \max\{10^{-3}, 10^{-3} L_k, \hat{L}_k\} \ \text{ and } \ \sigma_{k+1} \leftarrow \sigma_k \quad \text{when } x_{k+1} \leftarrow x_k + \alpha_k s_k,$$
$$\sigma_{k+1} \leftarrow \max\{10^{-3}, 10^{-3} \sigma_k, \hat{\sigma}_k\} \ \text{ and } \ L_{k+1} \leftarrow L_k \quad \text{when } x_{k+1} \leftarrow x_k + \beta_k d_k.$$
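The increase rule can be read as: estimate the smallest parameter value that would have made the model bound tight at the rejected trial point (for the current trial step size), then clip the result to [ρL_k, 10³L_k]. A sketch for L_k, assuming these inputs; the σ_k update is symmetric with the factor 6 and β_k³‖d_k‖³:

```python
def refined_L_update(Lk, fx, f_trial, ms_alpha, alpha, s_norm_sq, rho=2.0):
    """Refined increase for L_k after a rejected step along s_k.

    L_hat is the value at which f(x_k + alpha*s_k) <= f(x_k) - m_{s,k}(alpha)
    would hold with equality at the current trial alpha; the returned value
    is L_hat clipped to the interval [rho*L_k, 1e3*L_k]."""
    L_hat = Lk + 2.0 * (f_trial - fx + ms_alpha) / (alpha**2 * s_norm_sq)
    return max(rho * Lk, min(1e3 * Lk, L_hat))
```

The clipping keeps the update at least as aggressive as the plain ρ-inflation while preventing a single bad trial from blowing L_k up by more than three orders of magnitude.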
Termination condition

$$\|g(x_k)\| \le 10^{-5} \max\{1, \|g(x_0)\|\} \quad \text{and} \quad |(\lambda_k)_-| \le 10^{-5} \max\{1, |(\lambda_0)_-|\}.$$

Measures of interest

Final objective value:
$$\frac{f_{\text{final}}(s_k) - f_{\text{final}}(s_k, d_k)}{\max\{|f_{\text{final}}(s_k)|, |f_{\text{final}}(s_k, d_k)|, 1\}} \in [-1, 1]$$

Required number of iterations:
$$\frac{\#\text{its}(s_k) - \#\text{its}(s_k, d_k)}{\max\{\#\text{its}(s_k), \#\text{its}(s_k, d_k), 1\}} \in [-1, 1]$$

Required number of function evaluations:
$$\frac{\#\text{fevals}(s_k) - \#\text{fevals}(s_k, d_k)}{\max\{\#\text{fevals}(s_k), \#\text{fevals}(s_k, d_k), 1\}} \in [-1, 1]$$
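All three measures share one relative-difference form; as a sketch (function name mine), where positive values favor the (s_k, d_k) variant that uses negative curvature:

```python
def rel_diff(base, with_nc):
    """Relative difference in [-1, 1] between a quantity for the s_k-only
    run (base) and the (s_k, d_k) run (with_nc); positive values mean the
    negative-curvature variant did better (lower objective / fewer counts)."""
    return (base - with_nc) / max(abs(base), abs(with_nc), 1.0)
```

The floor of 1 in the denominator keeps the measure stable when both quantities are near zero.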
Steepest descent: s_k = −g(x_k) and d_k = ±v_k
[Bar charts comparing the two variants over roughly 30 CUTEst test problems (BIGGS6, RAT43LS, VIBRBEAM, HELIX, among others), with values in [−1, 1]: (a) Final objective value. (b) Required number of iterations.]
Figure: Only problems for which at least one negative curvature direction is used and the difference in final f-values is larger than 10^{-5} in absolute value are presented.
Shifted Newton: B_k = H(x_k) + δ_k I, B_k s_k = −g(x_k), and d_k = ±v_k
[Bar charts comparing the two variants over roughly 35 CUTEst test problems (HEART8LS, BIGGS6, HEART6LS, ENGVAL2, among others), with values in [−1, 1]: (a) Final objective value. (b) Required number of iterations.]
Figure: Only problems for which at least one negative curvature direction is used and the difference in final f-values is larger than 10^{-5} in absolute value are presented.
Deterministic Setting Comments
Comments:
- If L and σ are known, then in theory L_k and σ_k never need to be updated. In practice, we still allow increases and decreases for efficiency.
- Currently, each trial step costs one function evaluation. If evaluating f is very cheap, one could consider evaluating both trial steps during each iteration.
- Relevance to strict saddle points:
  - We do not make any nondegeneracy assumption.
  - Our convergence result holds regardless of the types of saddle points.
  - When the strict saddle point property holds, our theory implies that
    * any limit point of the sequence {x_k} is a minimizer of f;
    * the iterates eventually enter a region that contains only minimizers.
  - We get a stronger convergence theory (cf. Paternain, Mokhtari, and Ribeiro (2017)) because we incorporate directions of negative curvature.
- The complexity result for our method is not "optimal" from a traditional complexity perspective.
- F. Curtis and I have been intrigued by alternate complexity perspectives:
  - Typically, results are for general problems and based on the worst case.
  - From some perspective, the algorithm I presented today is "optimal".
  - See his talk later this afternoon!
Stochastic Setting
Summary

- Apply the same ideas as in the deterministic case, but in the mini-batch setting.
- Add a negative curvature direction d_k = ±v_k with the sign chosen randomly. This can be thought of as a "smart noise" approach.
- Small gain in performance relative to a similar algorithm without d_k.
- See our paper [1] for additional details.
References I
[1] F. E. Curtis and D. P. Robinson, Exploiting negative curvature directions in stochastic optimization, arXiv:1703.00412, submitted to Mathematical Programming (Special Issue on Nonconvex Optimization for Statistical Learning), 2017.

[2] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds., Curran Associates, Inc., 2014, pp. 2933–2941.

[3] F. Facchinei and S. Lucidi, Convergence to second order stationary points in inequality constrained optimization, Mathematics of Operations Research, 23 (1998), pp. 746–766.
References II
[4] P. E. Gill, V. Kungurtsev, and D. P. Robinson, A stabilized SQP method: global convergence, IMA Journal of Numerical Analysis, 37 (2017), pp. 407–443.

[5] P. E. Gill, V. Kungurtsev, and D. P. Robinson, A stabilized SQP method: superlinear convergence, Mathematical Programming, 163 (2017), pp. 369–410.

[6] D. Goldfarb, Curvilinear path steplength algorithms for minimization which use directions of negative curvature, Mathematical Programming, 18 (1980), pp. 31–40.

[7] D. Goldfarb, C. Mu, J. Wright, and C. Zhou, Using negative curvature in solving nonlinear programs, arXiv preprint arXiv:1706.00896, 2017.
References III
[8] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, Gradient descent only converges to minimizers, in Conference on Learning Theory, 2016, pp. 1246–1257.