Analyzing Optimization in Deep Learning via Trajectories

Nadav Cohen

Institute for Advanced Study

Institute for Computational and Experimental Research in Mathematics (ICERM)

Workshop on Theory and Practice in Machine Learning and Computer Vision

19 February 2019

Deep Learning

Source: NVIDIA (www.slideshare.net/openomics/the-revolution-of-deep-learning)

Limited Formal Understanding

Outline

1 Deep Learning Theory: Expressiveness, Optimization and Generalization

2 Analyzing Optimization via Trajectories

3 Trajectories of Gradient Descent for Deep Linear Neural Networks
    Convergence to Global Optimum
    Acceleration by Depth

4 Conclusion

DL Theory: Expressiveness, Optimization & Generalization

Statistical Learning Setup

X — instance space (e.g. R^{100×100} for 100-by-100 grayscale images)

Y — label space (e.g. R for regression or {1, . . . , k} for classification)

D — distribution over X × Y (unknown)

ℓ : Y × Y → R_{≥0} — loss function (e.g. ℓ(ŷ, y) = (ŷ − y)^2 for Y = R)

Task: Given a training set S = {(X_i, y_i)}_{i=1}^m drawn i.i.d. from D, return a hypothesis (predictor) h : X → Y that minimizes the population loss:

L_D(h) := E_{(X,y)∼D}[ℓ(y, h(X))]

Approach: Predetermine a hypotheses space H ⊂ Y^X, and return a hypothesis h ∈ H that minimizes the empirical loss:

L_S(h) := (1/m) Σ_{i=1}^m ℓ(y_i, h(X_i))
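
For concreteness, a minimal numerical sketch (not from the slides) of the empirical loss above, assuming ℓ2 loss and a linear hypothesis h(X) = Xw; all names and data below are synthetic and purely illustrative:

import numpy as np

# Toy training set S = {(X_i, y_i)}_{i=1}^m, drawn from some (unknown) distribution D
rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.normal(size=(m, d))                     # instances (rows)
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=m)       # labels

def empirical_loss(w):
    """L_S(h) = (1/m) * sum_i l(y_i, h(X_i)), with l(y', y) = (y' - y)^2 and h(X) = X @ w."""
    return np.mean((X @ w - y) ** 2)

print(empirical_loss(np.zeros(d)))              # loss of a trivial hypothesis
print(empirical_loss(w_true))                   # near the noise level (~0.01)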

Three Pillars of Statistical Learning Theory: Expressiveness, Generalization and Optimization

[Figure: hypotheses space H within the space of all functions Y^X, marking f*_D, h*_D, h*_S and h, with gaps labeled Approximation Error (Expressiveness), Estimation Error (Generalization) and Training Error (Optimization)]

f*_D — ground truth (minimizer of population loss over Y^X)

h*_D — optimal hypothesis (minimizer of population loss over H)

h*_S — empirically optimal hypothesis (minimizer of empirical loss over H)

h — returned hypothesis
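
The three gaps in the figure can be summarized by a telescoping decomposition (a standard identity, stated here for reference; each gap is written in population loss, with the training-error arrow referring to the corresponding empirical gap L_S(h) − L_S(h*_S)):

L_D(h) − L_D(f*_D)
  = [L_D(h*_D) − L_D(f*_D)]   (approximation error / expressiveness)
  + [L_D(h*_S) − L_D(h*_D)]   (estimation error / generalization)
  + [L_D(h) − L_D(h*_S)]      (optimization)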

Classical Machine Learning

Optimization: Empirical loss minimization is a convex program:

h ≈ h*_S (training err ≈ 0)

Expressiveness & Generalization: Bias-variance trade-off:

as H expands, approximation err shrinks but estimation err grows

Well-developed theory

Deep Learning

Optimization: Empirical loss minimization is a non-convex program:

h*_S is not unique — many hypotheses have low training err

Gradient descent (GD) somehow reaches one of these

Expressiveness & Generalization: Vast difference from classical ML:

Some low training err hypotheses generalize well, others don't

W/ typical data, the solution returned by GD often generalizes well

Expanding H reduces approximation err, but also estimation err!

Not well understood

Analyzing Optimization via Trajectories

Optimization

[Figure: same diagram as above, highlighting the Training Error (Optimization) gap between h and h*_S]

f*_D — ground truth

h*_D — optimal hypothesis

h*_S — empirically optimal hypothesis

h — returned hypothesis

Approach: Convergence via Critical Points

A prominent approach for analyzing optimization in DL is via critical points (∇ = 0) in the loss landscape

[Figure: loss landscape sketches of four critical-point types: good local minimum (≈ global minimum), poor local minimum, strict saddle, and non-strict saddle; the labels (1) and (2) refer to the conditions in the result below]

Result (cf. Ge et al. 2015; Lee et al. 2016): If (1) there are no poor local minima, and (2) all saddle points are strict, then gradient descent (GD) converges to a global minimum

Motivated by this, many works¹ studied the validity of (1) and/or (2)

¹ e.g. Haeffele & Vidal 2015; Kawaguchi 2016; Soudry & Carmon 2016; Safran & Shamir 2018

Limitations

Convergence of GD to global min was proven via critical points only for problems involving shallow (2-layer) models

Approach is insufficient when treating deep (≥ 3-layer) models:

(2) is violated — ∃ non-strict saddles, e.g. when all weights = 0

Algorithmic aspects essential for convergence w/ deep models, e.g. proper initialization, are ignored

Optimizer Trajectories Matter

Different optimization trajectories may lead to qualitatively different results

=⇒ details of algorithm and init should be taken into account!

Existing Trajectory Analyses

Trajectory approach led to successful analyses of shallow models: Brutzkus & Globerson 2017; Li & Yuan 2017; Zhong et al. 2017; Tian 2017; Brutzkus et al. 2018; Li et al. 2018; Du et al. 2018; Oymak & Soltanolkotabi 2018

It also allowed treating prohibitively large deep models: Du et al. 2018; Allen-Zhu et al. 2018; Zou et al. 2018

For deep linear residual networks, trajectories were used to show efficient convergence of GD to global min (Bartlett et al. 2018)

Trajectories of GD for Deep LNNs

Sources

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
Arora + C + Hazan (alphabetical order)
International Conference on Machine Learning (ICML) 2018

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks
Arora + C + Golowich + Hu (alphabetical order)
To appear: International Conference on Learning Representations (ICLR) 2019

Collaborators

Sanjeev Arora, Elad Hazan, Wei Hu, Noah Golowich

Convergence to Global Optimum

Linear Neural Networks

Linear neural networks (LNN) are fully-connected neural networks w/ linear (no) activation

[Diagram: x → W_1 → W_2 → ··· → W_N → y = W_N ··· W_2 W_1 x]

As a surrogate for optimization in DL, GD over LNN (a highly non-convex problem) is studied extensively¹

Existing Result (Bartlett et al. 2018): W/ linear residual networks (a special case: W_j are square and init to Id), for ℓ2 loss on certain data, GD efficiently converges to global min

↑ Only existing proof of efficient convergence to global min for GD training a deep model

¹ e.g. Saxe et al. 2014; Kawaguchi 2016; Hardt & Ma 2017; Laurent & Brecht 2018
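
A minimal sketch (illustrative only; all dimensions and data are arbitrary) of the objects above: a depth-N LNN computes y = W_N ··· W_2 W_1 x, and its ℓ2 training loss depends on the weights only through the end-to-end matrix:

import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_hidden, d_out, m = 3, 4, 4, 2, 50
dims = [d_in] + [d_hidden] * (N - 1) + [d_out]
Ws = [0.1 * rng.normal(size=(dims[j + 1], dims[j])) for j in range(N)]   # W_1, ..., W_N

X = rng.normal(size=(d_in, m))                  # training inputs (columns)
Y = rng.normal(size=(d_out, m))                 # training targets

def end_to_end(Ws):
    """W_{1:N} = W_N ... W_2 W_1."""
    W = np.eye(Ws[0].shape[1])
    for Wj in Ws:
        W = Wj @ W
    return W

def loss(W):
    """l(W) = (1/m) * sum_i ||W x_i - y_i||^2."""
    return np.mean(np.sum((W @ X - Y) ** 2, axis=0))

print(loss(end_to_end(Ws)))                     # objective value phi(W_1, ..., W_N) = l(W_N ... W_1)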

Gradient Flow

Gradient flow (GF) is a continuous version of GD (learning rate → 0):

d/dt α(t) = −∇f(α(t)) ,  t ∈ R_{>0}

[Figure: gradient descent (discrete steps) vs. gradient flow (continuous trajectory)]

Admits use of theoretical tools from differential geometry/equations
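
A minimal sketch (illustrative, not from the talk) of the relation between the two: GD is the forward-Euler discretization of GF, and with a small learning rate the two trajectories nearly coincide. The quadratic objective below is an arbitrary choice:

import numpy as np

def grad_f(a):                      # f(a) = 0.5 * a^T A a  =>  grad f(a) = A a
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    return A @ a

a_gd = np.array([1.0, -1.0])        # gradient descent iterate
a_gf = a_gd.copy()                  # fine Euler integration approximating gradient flow
eta, fine = 0.01, 100               # GD learning rate; number of substeps per GD step

for _ in range(500):
    a_gd = a_gd - eta * grad_f(a_gd)              # GD: alpha_{k+1} = alpha_k - eta * grad f(alpha_k)
    for _ in range(fine):                         # GF: d/dt alpha(t) = -grad f(alpha(t))
        a_gf = a_gf - (eta / fine) * grad_f(a_gf)

print(np.linalg.norm(a_gd - a_gf))  # small: GD with a small learning rate tracks the flow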

Trajectories of Gradient Flow

[Diagram: x → W_1 → W_2 → ··· → W_N → y = W_N ··· W_2 W_1 x]

Loss ℓ(·) for the linear model induces an overparameterized objective for the LNN:

φ(W_1, . . . , W_N) := ℓ(W_N ··· W_2 W_1)

Definition: Weights W_1, . . . , W_N are balanced if W_{j+1}^⊤ W_{j+1} = W_j W_j^⊤ , ∀j.

↑ Holds approximately under ≈ 0 init, exactly under residual (Id) init

Claim: Trajectories of GF over an LNN preserve balancedness: if W_1, . . . , W_N are balanced at init, they remain that way throughout GF optimization
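
A numerical sanity check of the claim (a sketch, not a proof): run small-step GD, which approximates GF, on a small LNN w/ ℓ2 loss from a balanced init, and track the balancedness residuals ‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F. All sizes and data below are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
N, d, m = 3, 4, 50
X = rng.normal(size=(d, m))
Y = rng.normal(size=(d, m))

# Balanced init: all layers equal to one symmetric matrix, so W_{j+1}^T W_{j+1} = W_j W_j^T exactly
B = 0.2 * rng.normal(size=(d, d))
S = (B + B.T) / 2
Ws = [S.copy() for _ in range(N)]

def prod(Ws, lo, hi):
    """W_hi ... W_lo (identity if the range is empty); layers are 1-indexed."""
    W = np.eye(d)
    for j in range(lo, hi + 1):
        W = Ws[j - 1] @ W
    return W

def grad_loss(W):
    return (2.0 / m) * (W @ X - Y) @ X.T          # gradient of the l2 loss at the end-to-end matrix

eta = 1e-3                                         # small step, approximating gradient flow
for _ in range(2000):
    W_full = prod(Ws, 1, N)
    G = grad_loss(W_full)
    grads = [prod(Ws, j + 1, N).T @ G @ prod(Ws, 1, j - 1).T for j in range(1, N + 1)]
    Ws = [Wj - eta * gj for Wj, gj in zip(Ws, grads)]

residuals = [np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T) for j in range(N - 1)]
print(residuals)   # remains close to zero (exactly conserved in the continuous-time limit)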

Implicit Preconditioning

Question: How does the end-to-end matrix W_{1:N} := W_N ··· W_1 move on GF trajectories?

[Diagram: Linear Neural Network (W_1, W_2, . . . , W_N) vs. Equivalent Linear Model (W_{1:N}); gradient flow over (W_1, . . . , W_N) corresponds to preconditioned gradient flow over W_{1:N}]

Theorem: If W_1, . . . , W_N are balanced at init, W_{1:N} follows the end-to-end dynamics:

d/dt vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

where P_{W_{1:N}(t)} is a preconditioner (PSD matrix) that “reinforces” W_{1:N}(t)

Adding (redundant) linear layers to the classic linear model induces a preconditioner promoting movement in directions already taken!
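
For reference, the preconditioned dynamics also admit a vectorization-free form; the expression below is the one derived in the ICML'18 paper listed under Sources, up to notation (the fractional powers are well defined since the bracketed matrices are PSD):

d/dt W_{1:N}(t) = − Σ_{j=1}^{N} [ W_{1:N}(t) W_{1:N}(t)^⊤ ]^{(j−1)/N} · ∇ℓ(W_{1:N}(t)) · [ W_{1:N}(t)^⊤ W_{1:N}(t) ]^{(N−j)/N}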

Convergence to Global Optimum

d/dt vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

P_{W_{1:N}(t)} ≻ 0 when W_{1:N}(t) has full rank =⇒ loss decreases until:

(1) ∇ℓ(W_{1:N}(t)) = 0   or   (2) W_{1:N}(t) is singular

ℓ(·) is typically convex =⇒ (1) means global min was reached

Corollary: Assume ℓ(·) is convex and the LNN is init such that:

W_1, . . . , W_N are balanced

ℓ(W_{1:N}) < ℓ(W) for any singular W

Then, GF converges to global min

From Gradient Flow to Gradient Descent

Our convergence result for GF made two assumptions on init:

1. Weights are balanced: ‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F = 0 , ∀j

2. Loss is smaller than that of any singular solution: ℓ(W_{1:N}) < ℓ(W) , ∀W s.t. σ_min(W) = 0

For translating to GD, we define discrete forms of these conditions:

Definition: For δ ≥ 0, weights W_1, . . . , W_N are δ-balanced if:

‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F ≤ δ , ∀j

Definition: For c > 0, weights W_1, . . . , W_N have deficiency margin c if:

ℓ(W_{1:N}) ≤ ℓ(W) , ∀W s.t. σ_min(W) ≤ c
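
A small helper (illustrative only) making δ-balancedness concrete: it returns the smallest δ for which given weights are δ-balanced; e.g. a near-zero random init is approximately balanced:

import numpy as np

def delta_balancedness(Ws):
    """Smallest delta such that W_1, ..., W_N are delta-balanced:
    max_j || W_{j+1}^T W_{j+1} - W_j W_j^T ||_F."""
    return max(np.linalg.norm(Wn.T @ Wn - Wp @ Wp.T)
               for Wp, Wn in zip(Ws[:-1], Ws[1:]))

rng = np.random.default_rng(0)
Ws = [1e-3 * rng.normal(size=(4, 4)) for _ in range(3)]   # near-zero random init
print(delta_balancedness(Ws))                              # small: approximately balanced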

Convergence to Global Optimum for Gradient Descent

Suppose ℓ(·) = ℓ2 loss, i.e. ℓ(W) = (1/m) ∑_{i=1}^m ‖W x_i − y_i‖_2^2

Theorem
Assume GD over LNN is init s.t. W_1, ..., W_N have deficiency margin c > 0 and are δ-balanced w/ δ ≤ O(c^2). Then, for any learning rate η ≤ O(c^4):

    loss(iteration t) ≤ e^{−Ω(c^2 η t)}

Claim
Assumptions on init (deficiency margin and δ-balancedness):

    Are necessary (violating any of them can lead to divergence)
    For output dim 1, hold w/ const prob under random “balanced” init

Guarantee of efficient (linear rate) convergence to global min!
Most general guarantee to date for GD efficiently training a deep net.

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 25 / 35
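As a sanity check of this guarantee, the following self-contained NumPy sketch (illustrative only: depth, dimensions, step size and init scale are placeholder choices, not the theorem's constants, and the deficiency margin of the random draw is not verified) runs GD on a deep linear network from an exactly balanced init under the ℓ2 loss and prints the decaying loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m, N = 8, 4, 200, 3                    # input dim, output dim, samples, depth (placeholders)
X = rng.standard_normal((m, d))
Y = X @ rng.standard_normal((k, d)).T        # targets from a ground-truth linear map

# Exactly balanced init: factor an end-to-end matrix through its SVD, giving each
# layer the singular values raised to 1/N, so W_{j+1}^T W_{j+1} = W_j W_j^T holds.
U, s, Vt = np.linalg.svd(rng.standard_normal((k, d)), full_matrices=False)
S = np.diag(s ** (1.0 / N))
weights = [S @ Vt] + [S.copy() for _ in range(N - 2)] + [U @ S]   # [W_1, ..., W_N]

def product(ws):                              # W_{1:N} = W_N ... W_1
    P = ws[0]
    for W in ws[1:]:
        P = W @ P
    return P

def loss(ws):
    R = X @ product(ws).T - Y
    return np.mean(np.sum(R**2, axis=1))

eta = 1e-3                                    # placeholder learning rate
for t in range(5001):
    P = product(weights)
    G = (2.0 / m) * (X @ P.T - Y).T @ X       # gradient of the l2 loss w.r.t. W_{1:N}
    new_weights = []
    for j, W in enumerate(weights):           # chain rule: dl/dW_j = (W_{j+1:N})^T G (W_{1:j-1})^T
        left = product(weights[j + 1:]) if j < N - 1 else np.eye(k)
        right = product(weights[:j]) if j > 0 else np.eye(d)
        new_weights.append(W - eta * left.T @ G @ right.T)
    weights = new_weights
    if t % 1000 == 0:
        print(f"iter {t:5d}  loss {loss(weights):.6f}")
```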


Trajectories of GD for Deep LNNs Acceleration by Depth

Outline

1 Deep Learning Theory: Expressiveness, Optimization and Generalization

2 Analyzing Optimization via Trajectories

3 Trajectories of Gradient Descent for Deep Linear Neural Networks
   Convergence to Global Optimum
   Acceleration by Depth

4 Conclusion

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 26 / 35


Trajectories of GD for Deep LNNs Acceleration by Depth

The Effect of Depth

Conventional wisdom:

Depth boosts expressiveness

[Figure: features learned by a deep net, from the input through early, intermediate, and deep layers]

But complicates optimization

We will see: not always true...

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 27 / 35


Trajectories of GD for Deep LNNs Acceleration by Depth

Effect of Depth for Linear Neural Networks

For LNN, we derived end-to-end dynamics:

    (d/dt) vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

Consider a discrete version:

    vec[W_{1:N}(t+1)] ← vec[W_{1:N}(t)] − η · P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

Claim
For any p > 2, there exist settings where ℓ(·) = ℓp loss:

    ℓ(W) = (1/m) ∑_{i=1}^m ‖W x_i − y_i‖_p^p   ← convex

and the discrete end-to-end dynamics reach the global min arbitrarily faster than GD

[Figure: optimization trajectories in the (w1, w2) plane: gradient descent vs. end-to-end dynamics]

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 28 / 35
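To spell out the discrete update for the end-to-end matrix itself, without the vec/P notation, here is a minimal NumPy sketch. It assumes the closed form known from the analysis of overparameterization in linear networks, namely a sum over layers of the gradient multiplied by fractional powers of W_{1:N} W_{1:N}^T on the left and of W_{1:N}^T W_{1:N} on the right; treat it as an illustration rather than a verbatim restatement of the talk's P_{W_{1:N}}.

```python
import numpy as np

def psd_power(A, alpha):
    """Fractional power of a symmetric PSD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative eigenvalues
    return (vecs * vals**alpha) @ vecs.T

def end_to_end_step(W_e, grad, N, eta):
    """One discrete step of the (assumed) end-to-end dynamics for a depth-N LNN:

    W_e <- W_e - eta * sum_{j=1}^{N} (W_e W_e^T)^{(j-1)/N} @ grad @ (W_e^T W_e)^{(N-j)/N}

    where W_e plays the role of W_{1:N} and grad is dl/dW evaluated at W_e.
    """
    update = np.zeros_like(W_e)
    for j in range(1, N + 1):
        left = psd_power(W_e @ W_e.T, (j - 1) / N)
        right = psd_power(W_e.T @ W_e, (N - j) / N)
        update += left @ grad @ right
    return W_e - eta * update
```

For N = 1 the update reduces to plain GD on W; for N > 1 the fractional powers act as a gradient-dependent preconditioner, which is the source of the acceleration discussed next.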


Trajectories of GD for Deep LNNs Acceleration by Depth

Experiments

Linear neural networks: regression problem from UCI ML Repository; ℓ4 loss

Depth can speed up GD, even w/o any gain in expressiveness, and despite introducing non-convexity!

This speed-up can outperform popular acceleration methods designed for convex problems!

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 29 / 35
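A toy version of this comparison can be scripted directly. The sketch below is illustrative only: the synthetic data, depth, learning rate and init are placeholders rather than the UCI setup from the slide, and whether (and by how much) the deep parameterization wins depends on this tuning. It trains the same linear model under the ℓ4 loss once as a single matrix and once as a product of N matrices, with plain GD in both cases.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, m, N = 10, 1, 300, 3              # input dim, output dim, samples, depth (placeholders)
X = rng.standard_normal((m, d))
Y = X @ rng.standard_normal((k, d)).T   # targets from a ground-truth linear map

def l4_loss_and_grad(W):
    """l4 loss (1/m) sum_i ||W x_i - y_i||_4^4 and its gradient w.r.t. the matrix W."""
    R = X @ W.T - Y                      # (m, k) residuals
    return np.mean(np.sum(R**4, axis=1)), (4.0 / m) * (R**3).T @ X

def product(Ws):                          # W_{1:N} = W_N ... W_1
    P = Ws[0]
    for W in Ws[1:]:
        P = W @ P
    return P

def run(depth, eta, steps=5000):
    """Plain GD; depth=1 is the direct parameterization, depth>1 the deep (factored) one."""
    Ws = [np.eye(d) for _ in range(depth - 1)] + [0.3 * rng.standard_normal((k, d))]
    for _ in range(steps):
        _, G = l4_loss_and_grad(product(Ws))
        grads = []
        for j in range(depth):           # chain rule: dl/dW_j = (W_{j+1:N})^T G (W_{1:j-1})^T
            left = product(Ws[j + 1:]) if j < depth - 1 else np.eye(Ws[-1].shape[0])
            right = product(Ws[:j]) if j > 0 else np.eye(d)
            grads.append(left.T @ G @ right.T)
        Ws = [W - eta * g for W, g in zip(Ws, grads)]
    return l4_loss_and_grad(product(Ws))[0]

# Final losses after the same number of GD steps (outcome depends on tuning):
print("depth 1 :", run(1, eta=5e-4))
print("depth", N, ":", run(N, eta=5e-4))
```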


Trajectories of GD for Deep LNNs Acceleration by Depth

Experiments (cont’)
Non-linear neural networks: TensorFlow convolutional network tutorial for MNIST [1]

Arch: (conv → ReLU → max pool) × 2 → dense → ReLU → dense
Training: stochastic GD w/ momentum, dropout

We overparameterized by adding a linear layer after each dense layer

[Figure: batch loss (log scale) vs. iteration, original vs. overparameterized network]

Adding depth, w/o any gain in expressiveness, and only +15% in params, accelerated the non-linear net by orders of magnitude!

[1] https://github.com/tensorflow/models/tree/master/tutorials/image/mnist

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 30 / 35
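For concreteness, here is a Keras-style sketch of the overparameterization (not the tutorial's original code; exact layer sizes, padding and dropout placement are assumptions): after each dense layer we insert an extra dense layer with no activation, which adds depth but no expressive power, since two stacked linear maps are still one linear map.

```python
import tensorflow as tf

def build_model(overparameterize: bool) -> tf.keras.Model:
    layers = [
        tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation=None),   # first dense layer (pre-activation)
    ]
    if overparameterize:
        # Extra linear layer: adds depth but no expressiveness.
        layers.append(tf.keras.layers.Dense(512, activation=None))
    layers += [
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation=None),    # second dense layer (logits)
    ]
    if overparameterize:
        layers.append(tf.keras.layers.Dense(10, activation=None))
    return tf.keras.Sequential(layers)

model = build_model(overparameterize=True)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```

The extra Dense(512) and Dense(10) layers are what add the modest parameter overhead while leaving the set of representable functions unchanged.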


Conclusion

Outline

1 Deep Learning Theory: Expressiveness, Optimization and Generalization

2 Analyzing Optimization via Trajectories

3 Trajectories of Gradient Descent for Deep Linear Neural Networks
   Convergence to Global Optimum
   Acceleration by Depth

4 Conclusion

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 31 / 35


Conclusion

Recap

Understanding DL calls for addressing three fundamental Qs:

Expressiveness Optimization Generalization

Optimization

Deep (≥ 3 layer) models can’t be treated via geometry alone ⇒ specific optimizer trajectories should be taken into account

We analyzed trajectories of GD over deep linear neural nets:

Derived guarantee for convergence to global min at linear rate (most general guarantee to date for GD efficiently training a deep model)

Depth induces a preconditioner that can accelerate convergence, w/o any gain in expressiveness, and despite introducing non-convexity

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 32 / 35


Conclusion

Next Step: Analyzing Generalization via Trajectories

[Figure: the space of all functions containing the hypotheses space, with f* (ground truth), h* (best hypothesis in the space), and h*_S (hypothesis returned from the training set S); the gap between h*_S and h* is the estimation error (generalization)]

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 33 / 35


Outline

1 Deep Learning Theory: Expressiveness, Optimization and Generalization

2 Analyzing Optimization via Trajectories

3 Trajectories of Gradient Descent for Deep Linear Neural Networks
   Convergence to Global Optimum
   Acceleration by Depth

4 Conclusion

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 34 / 35


Thank You

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 35 / 35