Benefiting from Negative Curvature
Daniel P. Robinson, Johns Hopkins University
Department of Applied Mathematics and Statistics
Collaborator: Frank E. Curtis (Lehigh University)
US and Mexico Workshop on Optimization and Its Applications, Huatulco, Mexico, January 8, 2018
Negative Curvature US-Mexico-2018 1 / 31
Outline
1 Motivation
2 Deterministic Setting
   The Method
   Convergence Results
   Numerical Results
   Comments
3 Stochastic Setting
Motivation
Problem of interest: deterministic setting

$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \ f(x)$$

- f : R^n → R is assumed to be twice-continuously differentiable.
- L will denote the Lipschitz constant for ∇f.
- σ will denote the Lipschitz constant for ∇²f.
- f may be nonconvex.

Notation: g(x) := ∇f(x) and H(x) := ∇²f(x).
Much work has been done on convergence to second-order points:

D. Goldfarb (1980) [6]
- proves a convergence result to second-order optimal points (unconstrained)
- curvilinear search using a descent direction and a negative curvature direction

D. Goldfarb, C. Mu, J. Wright, and C. Zhou (2017) [7]
- consider equality constrained problems
- prove a convergence result to second-order optimal points
- extend the curvilinear search from the unconstrained setting

F. Facchinei and S. Lucidi (1998) [3]
- consider inequality constrained problems
- exact penalty function, directions of negative curvature, and line search

P. Gill, V. Kungurtsev, and D. Robinson (2017) [4, 5]
- consider inequality constrained problems
- convergence to second-order optimal points under weak assumptions

J. Moré and D. Sorensen (1979), A. Forsgren, P. Gill, and W. Murray (1995), and many more . . .
None consistently perform better by using directions of negative curvature!
Others hope to avoid saddle points:

J. Lee, M. Simchowitz, M. Jordan, and B. Recht (2016) [8]
- Gradient descent converges to a local minimizer almost surely.
- Uses random initialization.

Y. Dauphin et al. (2014) [2]
- Present a saddle-free Newton method (it is a modified-Newton method).
- Goal is to escape saddle points (move away when close).

These (and others) try to avoid the ill effects of negative curvature.
Purpose of this research:
Design a method that consistently performs better by using directions of negative curvature.
Do not try to avoid negative curvature. Use it!
Deterministic Setting
Deterministic Setting The Method
Overview:
- Compute a descent direction (s_k) and a negative curvature direction (d_k).
- Predict which step will make more progress in reducing the objective f.
- If the predicted decrease is not realized, adjust parameters.
- Iterate until an approximate second-order solution is obtained.
Requirements on the descent direction s_k

Compute s_k to satisfy

$$-g(x_k)^T s_k \ge \delta \|s_k\|_2 \|g(x_k)\|_2 \quad \text{(some } \delta \in (0,1]\text{)}.$$

Examples:
- s_k = −g(x_k)
- B_k s_k = −g(x_k) with B_k appropriately chosen

Requirements on the negative curvature direction d_k

Compute d_k to satisfy

$$d_k^T H(x_k) d_k \le \gamma \lambda_k \|d_k\|_2^2 < 0 \quad \text{(some } \gamma \in (0,1]\text{)} \qquad \text{and} \qquad g(x_k)^T d_k \le 0.$$

Examples:
- d_k = ±v_k with (λ_k, v_k) being the leftmost eigenpair of H(x_k)
- d_k a sufficiently accurate estimate of ±v_k
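As a concrete illustration, the two example choices above can be computed with a few lines of numpy (a sketch; the function names are mine, not from the talk):

```python
import numpy as np

def descent_direction(g):
    """Steepest-descent choice s_k = -g(x_k); it satisfies the
    sufficient-descent condition with delta = 1."""
    return -g

def negative_curvature_direction(H, g):
    """Leftmost eigenpair of H(x_k).  Returns (lambda_k, d_k) with the sign
    of d_k chosen so that g(x_k)^T d_k <= 0, or (lambda_k, None) when no
    negative curvature exists."""
    eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues in ascending order
    lam, v = eigvals[0], eigvecs[:, 0]
    if lam >= 0.0:
        return lam, None                   # H(x_k) is positive semidefinite
    d = -v if g @ v > 0.0 else v           # enforce g(x_k)^T d_k <= 0
    return lam, d
```

With the exact eigenpair, both direction conditions hold with γ = 1; an inexact eigenvector estimate (e.g. from a few Lanczos iterations) gives γ < 1.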
How to use s_k and d_k?

Use both in a curvilinear line search?
- Often taints good descent directions with "poorly scaled" directions of negative curvature.
- No consistent performance gains!

Start using d_k only once ‖g(x_k)‖ is "small"?
- No consistent performance gains!
- Misses areas of the space in which great decrease in f is possible.

Use s_k when ‖g(x_k)‖ is big relative to |(λ_k)_−|; otherwise, use d_k?
- Better, but still inconsistent performance gains!

We propose to use upper-bounding models. It works!
Predicted decrease along the descent direction s_k

If L_k ≥ L, then

$$f(x_k + \alpha s_k) \le f(x_k) - m_{s,k}(\alpha) \quad \text{(for all } \alpha\text{)}$$

with

$$m_{s,k}(\alpha) := -\alpha\, g(x_k)^T s_k - \tfrac{1}{2} L_k \alpha^2 \|s_k\|_2^2,$$

and define the quantity

$$\alpha_k := \frac{-g(x_k)^T s_k}{L_k \|s_k\|_2^2} = \operatorname*{argmax}_{\alpha \ge 0}\; m_{s,k}(\alpha).$$

Comments:
- m_{s,k}(α_k) is the best predicted decrease along s_k.
- If s_k = −g(x_k), then α_k = 1/L_k.
Predicted decrease along the negative curvature direction d_k

If σ_k ≥ σ, then

$$f(x_k + \beta d_k) \le f(x_k) - m_{d,k}(\beta) \quad \text{(for all } \beta\text{)}$$

with

$$m_{d,k}(\beta) := -\beta\, g(x_k)^T d_k - \tfrac{1}{2} \beta^2 d_k^T H(x_k) d_k - \tfrac{\sigma_k}{6} \beta^3 \|d_k\|_2^3,$$

and define, with c_k := d_k^T H(x_k) d_k, the quantity

$$\beta_k := \frac{-c_k + \sqrt{c_k^2 - 2\sigma_k \|d_k\|_2^3\, g(x_k)^T d_k}}{\sigma_k \|d_k\|_2^3} = \operatorname*{argmax}_{\beta \ge 0}\; m_{d,k}(\beta).$$

Comments:
- m_{d,k}(β_k) is the best predicted decrease along d_k.
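Both maximizers have closed forms: α_k sets the derivative of the quadratic model to zero, and β_k is the positive root of the quadratic (σ_k/2)‖d_k‖³β² + c_kβ + g(x_k)ᵀd_k = 0. A small numpy sketch (helper names are mine):

```python
import numpy as np

def alpha_and_decrease(g, s, Lk):
    """Maximizer alpha_k of m_{s,k} and the predicted decrease m_{s,k}(alpha_k)."""
    alpha = -(g @ s) / (Lk * (s @ s))
    m = -alpha * (g @ s) - 0.5 * Lk * alpha**2 * (s @ s)
    return alpha, m

def beta_and_decrease(g, d, H, sigma_k):
    """Maximizer beta_k of m_{d,k} (positive root of the quadratic
    (sigma_k/2)*||d||^3 * beta^2 + c_k*beta + g^T d = 0) and m_{d,k}(beta_k)."""
    nd3 = np.linalg.norm(d) ** 3
    c = d @ H @ d                        # c_k < 0 by the curvature condition
    beta = (-c + np.sqrt(c**2 - 2.0 * sigma_k * nd3 * (g @ d))) / (sigma_k * nd3)
    m = -beta * (g @ d) - 0.5 * beta**2 * c - (sigma_k / 6.0) * beta**3 * nd3
    return beta, m
```

Note that the discriminant is always nonnegative because g(x_k)ᵀd_k ≤ 0, so β_k is well defined.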
Choose the step that predicts the larger decrease in f:
- If m_{s,k}(α_k) ≥ m_{d,k}(β_k), then try the step s_k.
- If m_{d,k}(β_k) > m_{s,k}(α_k), then try the step d_k.

Question: Why "try" instead of "use"?
Answer: We do not know whether L_k ≥ L and σ_k ≥ σ.
- If L_k < L, then it could be the case that f(x_k + α_k s_k) > f(x_k) − m_{s,k}(α_k).
- If σ_k < σ, then it could be the case that f(x_k + β_k d_k) > f(x_k) − m_{d,k}(β_k).
Dynamic Step-Size Algorithm

1: for k ∈ N do
2:    compute s_k and d_k satisfying the required step conditions
3:    loop
4:       compute α_k = argmax_{α ≥ 0} m_{s,k}(α) and β_k = argmax_{β ≥ 0} m_{d,k}(β)
5:       if m_{s,k}(α_k) ≥ m_{d,k}(β_k) then
6:          if f(x_k + α_k s_k) ≤ f(x_k) − m_{s,k}(α_k) then
7:             set x_{k+1} ← x_k + α_k s_k and then exit loop
8:          else
9:             set L_k ← ρ L_k   [ρ ∈ (1, ∞)]
10:      else
11:         if f(x_k + β_k d_k) ≤ f(x_k) − m_{d,k}(β_k) then
12:            set x_{k+1} ← x_k + β_k d_k and then exit loop
13:         else
14:            set σ_k ← ρ σ_k
15:   set (L_{k+1}, σ_{k+1}) ∈ (L_min, L_k] × (σ_min, σ_k]
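The inner loop (lines 3–14) can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation; it inlines the two model formulas and assumes d_k is a valid negative curvature direction:

```python
import numpy as np

def dynamic_step(f, x, g, s, d, H, Lk, sigma_k, rho=2.0):
    """One inner-loop pass: try whichever of s_k, d_k predicts the larger
    decrease; if the upper-bounding model is violated, inflate the
    corresponding parameter and retry.  Returns (x_next, Lk, sigma_k)."""
    fx = f(x)
    while True:
        # Model maximizers and predicted decreases (see the previous slides).
        alpha = -(g @ s) / (Lk * (s @ s))
        ms = -alpha * (g @ s) - 0.5 * Lk * alpha**2 * (s @ s)
        nd3 = np.linalg.norm(d) ** 3
        c = d @ H @ d
        beta = (-c + np.sqrt(c**2 - 2.0 * sigma_k * nd3 * (g @ d))) / (sigma_k * nd3)
        md = -beta * (g @ d) - 0.5 * beta**2 * c - (sigma_k / 6.0) * beta**3 * nd3
        if ms >= md:
            if f(x + alpha * s) <= fx - ms:       # model bound held: accept
                return x + alpha * s, Lk, sigma_k
            Lk *= rho                              # bound violated: raise L_k
        else:
            if f(x + beta * d) <= fx - md:
                return x + beta * d, Lk, sigma_k
            sigma_k *= rho                         # bound violated: raise sigma_k
```

Each pass costs one extra function evaluation at most, and only the parameter whose bound failed is increased.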
Deterministic Setting Convergence Results
Key decrease inequality: For all k ∈ N it holds that

$$f(x_k) - f(x_{k+1}) \ge \max\left\{ \frac{\delta^2}{2L_k} \|g(x_k)\|_2^2,\; \frac{2\gamma^3}{3\sigma_k^2} |(\lambda_k)_-|^3 \right\}.$$

Comments:
- The first term in the max holds when x_{k+1} = x_k + α_k s_k.
- The second term in the max holds when x_{k+1} = x_k + β_k d_k.
- The max holds because we choose whether to try s_k or d_k based on m_{s,k}(α_k) ≥ m_{d,k}(β_k).
- One can prove that {L_k} and {σ_k} remain uniformly bounded.
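To see where the two terms come from, one can substitute the maximizers back into the models. A sketch, shown for the curvature term in the simplified case g(x_k)ᵀd_k = 0 (the general argument is in the paper):

```latex
% Descent term: plug \alpha_k into m_{s,k} and apply the sufficient-descent condition.
m_{s,k}(\alpha_k)
  = \frac{\big(g(x_k)^T s_k\big)^2}{2 L_k \|s_k\|_2^2}
  \ge \frac{\delta^2}{2 L_k} \|g(x_k)\|_2^2,
\quad\text{using } -g(x_k)^T s_k \ge \delta \|s_k\|_2 \|g(x_k)\|_2 .

% Curvature term: when g(x_k)^T d_k = 0, the maximizer simplifies to
\beta_k = \frac{-2 c_k}{\sigma_k \|d_k\|_2^3},
\qquad
m_{d,k}(\beta_k)
  = \frac{2 |c_k|^3}{3 \sigma_k^2 \|d_k\|_2^6}
  \ge \frac{2\gamma^3}{3\sigma_k^2} |(\lambda_k)_-|^3,
\quad\text{using } c_k \le \gamma \lambda_k \|d_k\|_2^2 < 0 .
```

Taking the larger predicted decrease therefore guarantees the max of the two lower bounds on actual decrease whenever the accepted step's model bound holds.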
Theorem (Limit points satisfy second-order necessary conditions). The computed iterates satisfy

$$\lim_{k\to\infty} \|g(x_k)\|_2 = 0 \quad \text{and} \quad \liminf_{k\to\infty} \lambda_k \ge 0.$$

Theorem (Complexity result). The number of iterations, function, and derivative (i.e., gradient and Hessian) evaluations required until some iteration k ∈ N is reached with

$$\|g(x_k)\|_2 \le \epsilon_g \quad \text{and} \quad |(\lambda_k)_-| \le \epsilon_H$$

is at most $O(\max\{\epsilon_g^{-2}, \epsilon_H^{-3}\})$.
Deterministic Setting Numerical Results
Refined parameter increase strategy

$$\hat{L}_k \leftarrow L_k + \frac{2\big(f(x_k + \alpha_k s_k) - f(x_k) + m_{s,k}(\alpha_k)\big)}{\alpha_k^2 \|s_k\|^2},
\qquad
\hat{\sigma}_k \leftarrow \sigma_k + \frac{6\big(f(x_k + \beta_k d_k) - f(x_k) + m_{d,k}(\beta_k)\big)}{\beta_k^3 \|d_k\|^3},$$

then, with ρ ← 2, use the update

$$L_k \leftarrow \max\{\rho L_k, \min\{10^3 L_k, \hat{L}_k\}\}, \qquad \sigma_k \leftarrow \max\{\rho \sigma_k, \min\{10^3 \sigma_k, \hat{\sigma}_k\}\}.$$

Refined parameter decrease strategy

$$L_{k+1} \leftarrow \max\{10^{-3}, 10^{-3} L_k, \hat{L}_k\} \ \text{ and } \ \sigma_{k+1} \leftarrow \sigma_k \quad \text{when } x_{k+1} \leftarrow x_k + \alpha_k s_k,$$
$$\sigma_{k+1} \leftarrow \max\{10^{-3}, 10^{-3} \sigma_k, \hat{\sigma}_k\} \ \text{ and } \ L_{k+1} \leftarrow L_k \quad \text{when } x_{k+1} \leftarrow x_k + \beta_k d_k.$$
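The increase rule can be read as: estimate the smallest parameter value that would have made the model bound tight at the rejected trial point (for the current trial step size), then clip the result to [ρL_k, 10³L_k]. A sketch for L_k, assuming these inputs; the σ_k update is symmetric with the factor 6 and β_k³‖d_k‖³:

```python
def refined_L_update(Lk, fx, f_trial, ms_alpha, alpha, s_norm_sq, rho=2.0):
    """Refined increase for L_k after a rejected step along s_k.

    L_hat is the value at which f(x_k + alpha*s_k) <= f(x_k) - m_{s,k}(alpha)
    would hold with equality at the current trial alpha; the returned value
    is L_hat clipped to the interval [rho*L_k, 1e3*L_k]."""
    L_hat = Lk + 2.0 * (f_trial - fx + ms_alpha) / (alpha**2 * s_norm_sq)
    return max(rho * Lk, min(1e3 * Lk, L_hat))
```

The clipping keeps the update at least as aggressive as the plain ρ-inflation while preventing a single bad trial from blowing L_k up by more than three orders of magnitude.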
Termination condition

$$\|g(x_k)\| \le 10^{-5} \max\{1, \|g(x_0)\|\} \quad \text{and} \quad |(\lambda_k)_-| \le 10^{-5} \max\{1, |(\lambda_0)_-|\}.$$

Measures of interest

Final objective value:
$$\frac{f_{\text{final}}(s_k) - f_{\text{final}}(s_k, d_k)}{\max\{|f_{\text{final}}(s_k)|, |f_{\text{final}}(s_k, d_k)|, 1\}} \in [-1, 1]$$

Required number of iterations:
$$\frac{\#\text{its}(s_k) - \#\text{its}(s_k, d_k)}{\max\{\#\text{its}(s_k), \#\text{its}(s_k, d_k), 1\}} \in [-1, 1]$$

Required number of function evaluations:
$$\frac{\#\text{fevals}(s_k) - \#\text{fevals}(s_k, d_k)}{\max\{\#\text{fevals}(s_k), \#\text{fevals}(s_k, d_k), 1\}} \in [-1, 1]$$
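All three measures share one relative-difference form; as a sketch (function name mine), where positive values favor the (s_k, d_k) variant that uses negative curvature:

```python
def rel_diff(base, with_nc):
    """Relative difference in [-1, 1] between a quantity for the s_k-only
    run (base) and the (s_k, d_k) run (with_nc); positive values mean the
    negative-curvature variant did better (lower objective / fewer counts)."""
    return (base - with_nc) / max(abs(base), abs(with_nc), 1.0)
```

The floor of 1 in the denominator keeps the measure stable when both quantities are near zero.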
Steepest descent: s_k = −g(x_k) and d_k = ±v_k
[Bar charts comparing the two variants over roughly 30 CUTEst test problems (BIGGS6, RAT43LS, VIBRBEAM, HELIX, among others), with values in [−1, 1]: (a) Final objective value. (b) Required number of iterations.]
Figure: Only problems for which at least one negative curvature direction is used and the difference in final f-values is larger than 10^{-5} in absolute value are presented.
Shifted Newton: B_k = H(x_k) + δ_k I, B_k s_k = −g(x_k), and d_k = ±v_k
[Bar charts comparing the two variants over roughly 35 CUTEst test problems (HEART8LS, BIGGS6, HEART6LS, ENGVAL2, among others), with values in [−1, 1]: (a) Final objective value. (b) Required number of iterations.]
Figure: Only problems for which at least one negative curvature direction is used and the difference in final f-values is larger than 10^{-5} in absolute value are presented.
Deterministic Setting Comments
Comments:
- If L and σ are known, then in theory L_k and σ_k never need to be updated. In practice, we still allow increases and decreases for efficiency.
- Currently, each trial step costs one function evaluation. If evaluating f is very cheap, one could consider evaluating both trial steps during each iteration.
- Relevance to strict saddle points:
  - We do not make any nondegeneracy assumption.
  - Our convergence result holds regardless of the types of saddle points.
  - When the strict saddle point property holds, our theory implies that
    * any limit point of the sequence {x_k} is a minimizer of f;
    * the iterates eventually enter a region that contains only minimizers.
  - We get a stronger convergence theory (cf. Paternain, Mokhtari, and Ribeiro (2017)) because we incorporate directions of negative curvature.
- The complexity result for our method is not "optimal" from a traditional complexity perspective.
- F. Curtis and I have been intrigued by alternate complexity perspectives:
  - Typically, results are for general problems and based on the worst case.
  - From some perspective, the algorithm I presented today is "optimal".
  - See his talk later this afternoon!
Stochastic Setting
Summary

- Apply the same ideas as in the deterministic case, but in the mini-batch setting.
- Add a negative curvature direction d_k = ±v_k with the sign chosen randomly. This can be thought of as a "smart noise" approach.
- Small gain in performance relative to a similar algorithm without d_k.
- See our paper [1] for additional details.
References I
[1] F. E. Curtis and D. P. Robinson, Exploiting negative curvature directions in stochastic optimization, arXiv:1703.00412, submitted to Mathematical Programming (Special Issue on Nonconvex Optimization for Statistical Learning), 2017.

[2] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds., Curran Associates, Inc., 2014, pp. 2933–2941.

[3] F. Facchinei and S. Lucidi, Convergence to second order stationary points in inequality constrained optimization, Mathematics of Operations Research, 23 (1998), pp. 746–766.
References II
[4] P. E. Gill, V. Kungurtsev, and D. P. Robinson, A stabilized SQP method: global convergence, IMA Journal of Numerical Analysis, 37 (2017), pp. 407–443.

[5] P. E. Gill, V. Kungurtsev, and D. P. Robinson, A stabilized SQP method: superlinear convergence, Mathematical Programming, 163 (2017), pp. 369–410.

[6] D. Goldfarb, Curvilinear path steplength algorithms for minimization which use directions of negative curvature, Mathematical Programming, 18 (1980), pp. 31–40.

[7] D. Goldfarb, C. Mu, J. Wright, and C. Zhou, Using negative curvature in solving nonlinear programs, arXiv preprint arXiv:1706.00896, 2017.
References III
[8] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, Gradient descent only converges to minimizers, in Conference on Learning Theory, 2016, pp. 1246–1257.