Analyzing Optimization in Deep Learning via Trajectories

Nadav Cohen

Institute for Advanced Study

Institute for Computational and Experimental Research in Mathematics (ICERM)

Workshop on Theory and Practice in Machine Learning and Computer Vision

19 February 2019

Deep Learning

Source: NVIDIA (www.slideshare.net/openomics/the-revolution-of-deep-learning)

Limited Formal Understanding

Outline

1 Deep Learning Theory: Expressiveness, Optimization and Generalization

2 Analyzing Optimization via Trajectories

3 Trajectories of Gradient Descent for Deep Linear Neural Networks
    Convergence to Global Optimum
    Acceleration by Depth

4 Conclusion

DL Theory: Expressiveness, Optimization & Generalization

Statistical Learning Setup

X — instance space (e.g. R^{100×100} for 100-by-100 grayscale images)

Y — label space (e.g. R for regression or {1, . . . , k} for classification)

D — distribution over X × Y (unknown)

ℓ : Y × Y → R_{≥0} — loss function (e.g. ℓ(ŷ, y) = (ŷ − y)^2 for Y = R)

Task: Given a training set S = {(X_i, y_i)}_{i=1}^m drawn i.i.d. from D, return a hypothesis (predictor) h : X → Y that minimizes the population loss:

L_D(h) := E_{(X,y)∼D}[ℓ(y, h(X))]

Approach: Predetermine a hypotheses space H ⊂ Y^X, and return a hypothesis h ∈ H that minimizes the empirical loss:

L_S(h) := (1/m) Σ_{i=1}^m ℓ(y_i, h(X_i))
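
For concreteness, a minimal numerical sketch (not from the slides) of the empirical loss above, assuming ℓ2 loss and a linear hypothesis h(X) = Xw; all names and data below are synthetic and purely illustrative:

import numpy as np

# Toy training set S = {(X_i, y_i)}_{i=1}^m, drawn from some (unknown) distribution D
rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.normal(size=(m, d))                     # instances (rows)
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=m)       # labels

def empirical_loss(w):
    """L_S(h) = (1/m) * sum_i l(y_i, h(X_i)), with l(y', y) = (y' - y)^2 and h(X) = X @ w."""
    return np.mean((X @ w - y) ** 2)

print(empirical_loss(np.zeros(d)))              # loss of a trivial hypothesis
print(empirical_loss(w_true))                   # near the noise level (~0.01)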

Three Pillars of Statistical Learning Theory: Expressiveness, Generalization and Optimization

[Figure: hypotheses space H within the space of all functions Y^X, marking f*_D, h*_D, h*_S and h, with gaps labeled Approximation Error (Expressiveness), Estimation Error (Generalization) and Training Error (Optimization)]

f*_D — ground truth (minimizer of population loss over Y^X)

h*_D — optimal hypothesis (minimizer of population loss over H)

h*_S — empirically optimal hypothesis (minimizer of empirical loss over H)

h — returned hypothesis
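
The three gaps in the figure can be summarized by a telescoping decomposition (a standard identity, stated here for reference; each gap is written in population loss, with the training-error arrow referring to the corresponding empirical gap L_S(h) − L_S(h*_S)):

L_D(h) − L_D(f*_D)
  = [L_D(h*_D) − L_D(f*_D)]   (approximation error / expressiveness)
  + [L_D(h*_S) − L_D(h*_D)]   (estimation error / generalization)
  + [L_D(h) − L_D(h*_S)]      (optimization)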

Classical Machine Learning

Optimization: Empirical loss minimization is a convex program:

h ≈ h*_S (training err ≈ 0)

Expressiveness & Generalization: Bias-variance trade-off:

as H expands, approximation err shrinks but estimation err grows

Well-developed theory

Deep Learning

Optimization: Empirical loss minimization is a non-convex program:

h*_S is not unique — many hypotheses have low training err

Gradient descent (GD) somehow reaches one of these

Expressiveness & Generalization: Vast difference from classical ML:

Some low training err hypotheses generalize well, others don't

W/ typical data, the solution returned by GD often generalizes well

Expanding H reduces approximation err, but also estimation err!

Not well understood

Analyzing Optimization via Trajectories

Optimization

[Figure: same diagram as above, highlighting the Training Error (Optimization) gap between h and h*_S]

f*_D — ground truth

h*_D — optimal hypothesis

h*_S — empirically optimal hypothesis

h — returned hypothesis

Approach: Convergence via Critical Points

A prominent approach for analyzing optimization in DL is via critical points (∇ = 0) in the loss landscape

[Figure: loss landscape sketches of four critical-point types: good local minimum (≈ global minimum), poor local minimum, strict saddle, and non-strict saddle; the labels (1) and (2) refer to the conditions in the result below]

Result (cf. Ge et al. 2015; Lee et al. 2016): If (1) there are no poor local minima, and (2) all saddle points are strict, then gradient descent (GD) converges to a global minimum

Motivated by this, many works¹ studied the validity of (1) and/or (2)

¹ e.g. Haeffele & Vidal 2015; Kawaguchi 2016; Soudry & Carmon 2016; Safran & Shamir 2018

Limitations

Convergence of GD to global min was proven via critical points only for problems involving shallow (2-layer) models

Approach is insufficient when treating deep (≥ 3-layer) models:

(2) is violated — ∃ non-strict saddles, e.g. when all weights = 0

Algorithmic aspects essential for convergence w/ deep models, e.g. proper initialization, are ignored

Optimizer Trajectories Matter

Different optimization trajectories may lead to qualitatively different results

=⇒ details of algorithm and init should be taken into account!

Existing Trajectory Analyses

Trajectory approach led to successful analyses of shallow models: Brutzkus & Globerson 2017; Li & Yuan 2017; Zhong et al. 2017; Tian 2017; Brutzkus et al. 2018; Li et al. 2018; Du et al. 2018; Oymak & Soltanolkotabi 2018

It also allowed treating prohibitively large deep models: Du et al. 2018; Allen-Zhu et al. 2018; Zou et al. 2018

For deep linear residual networks, trajectories were used to show efficient convergence of GD to global min (Bartlett et al. 2018)

Trajectories of GD for Deep LNNs

Sources

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
Arora + C + Hazan (alphabetical order)
International Conference on Machine Learning (ICML) 2018

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks
Arora + C + Golowich + Hu (alphabetical order)
To appear: International Conference on Learning Representations (ICLR) 2019

Collaborators

Sanjeev Arora, Elad Hazan, Wei Hu, Noah Golowich

Convergence to Global Optimum

Linear Neural Networks

Linear neural networks (LNN) are fully-connected neural networks w/ linear (no) activation

[Diagram: x → W_1 → W_2 → ··· → W_N → y = W_N ··· W_2 W_1 x]

As a surrogate for optimization in DL, GD over LNN (a highly non-convex problem) is studied extensively¹

Existing Result (Bartlett et al. 2018): W/ linear residual networks (a special case: W_j are square and init to Id), for ℓ2 loss on certain data, GD efficiently converges to global min

↑ Only existing proof of efficient convergence to global min for GD training a deep model

¹ e.g. Saxe et al. 2014; Kawaguchi 2016; Hardt & Ma 2017; Laurent & Brecht 2018
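
A minimal sketch (illustrative only; all dimensions and data are arbitrary) of the objects above: a depth-N LNN computes y = W_N ··· W_2 W_1 x, and its ℓ2 training loss depends on the weights only through the end-to-end matrix:

import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_hidden, d_out, m = 3, 4, 4, 2, 50
dims = [d_in] + [d_hidden] * (N - 1) + [d_out]
Ws = [0.1 * rng.normal(size=(dims[j + 1], dims[j])) for j in range(N)]   # W_1, ..., W_N

X = rng.normal(size=(d_in, m))                  # training inputs (columns)
Y = rng.normal(size=(d_out, m))                 # training targets

def end_to_end(Ws):
    """W_{1:N} = W_N ... W_2 W_1."""
    W = np.eye(Ws[0].shape[1])
    for Wj in Ws:
        W = Wj @ W
    return W

def loss(W):
    """l(W) = (1/m) * sum_i ||W x_i - y_i||^2."""
    return np.mean(np.sum((W @ X - Y) ** 2, axis=0))

print(loss(end_to_end(Ws)))                     # objective value phi(W_1, ..., W_N) = l(W_N ... W_1)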

Gradient Flow

Gradient flow (GF) is a continuous version of GD (learning rate → 0):

d/dt α(t) = −∇f(α(t)) ,  t ∈ R_{>0}

[Figure: gradient descent (discrete steps) vs. gradient flow (continuous trajectory)]

Admits use of theoretical tools from differential geometry/equations
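
A minimal sketch (illustrative, not from the talk) of the relation between the two: GD is the forward-Euler discretization of GF, and with a small learning rate the two trajectories nearly coincide. The quadratic objective below is an arbitrary choice:

import numpy as np

def grad_f(a):                      # f(a) = 0.5 * a^T A a  =>  grad f(a) = A a
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    return A @ a

a_gd = np.array([1.0, -1.0])        # gradient descent iterate
a_gf = a_gd.copy()                  # fine Euler integration approximating gradient flow
eta, fine = 0.01, 100               # GD learning rate; number of substeps per GD step

for _ in range(500):
    a_gd = a_gd - eta * grad_f(a_gd)              # GD: alpha_{k+1} = alpha_k - eta * grad f(alpha_k)
    for _ in range(fine):                         # GF: d/dt alpha(t) = -grad f(alpha(t))
        a_gf = a_gf - (eta / fine) * grad_f(a_gf)

print(np.linalg.norm(a_gd - a_gf))  # small: GD with a small learning rate tracks the flow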

Trajectories of Gradient Flow

[Diagram: x → W_1 → W_2 → ··· → W_N → y = W_N ··· W_2 W_1 x]

Loss ℓ(·) for the linear model induces an overparameterized objective for the LNN:

φ(W_1, . . . , W_N) := ℓ(W_N ··· W_2 W_1)

Definition: Weights W_1, . . . , W_N are balanced if W_{j+1}^⊤ W_{j+1} = W_j W_j^⊤ , ∀j.

↑ Holds approximately under ≈ 0 init, exactly under residual (Id) init

Claim: Trajectories of GF over an LNN preserve balancedness: if W_1, . . . , W_N are balanced at init, they remain that way throughout GF optimization
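
A numerical sanity check of the claim (a sketch, not a proof): run small-step GD, which approximates GF, on a small LNN w/ ℓ2 loss from a balanced init, and track the balancedness residuals ‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F. All sizes and data below are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
N, d, m = 3, 4, 50
X = rng.normal(size=(d, m))
Y = rng.normal(size=(d, m))

# Balanced init: all layers equal to one symmetric matrix, so W_{j+1}^T W_{j+1} = W_j W_j^T exactly
B = 0.2 * rng.normal(size=(d, d))
S = (B + B.T) / 2
Ws = [S.copy() for _ in range(N)]

def prod(Ws, lo, hi):
    """W_hi ... W_lo (identity if the range is empty); layers are 1-indexed."""
    W = np.eye(d)
    for j in range(lo, hi + 1):
        W = Ws[j - 1] @ W
    return W

def grad_loss(W):
    return (2.0 / m) * (W @ X - Y) @ X.T          # gradient of the l2 loss at the end-to-end matrix

eta = 1e-3                                         # small step, approximating gradient flow
for _ in range(2000):
    W_full = prod(Ws, 1, N)
    G = grad_loss(W_full)
    grads = [prod(Ws, j + 1, N).T @ G @ prod(Ws, 1, j - 1).T for j in range(1, N + 1)]
    Ws = [Wj - eta * gj for Wj, gj in zip(Ws, grads)]

residuals = [np.linalg.norm(Ws[j + 1].T @ Ws[j + 1] - Ws[j] @ Ws[j].T) for j in range(N - 1)]
print(residuals)   # remains close to zero (exactly conserved in the continuous-time limit)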

Implicit Preconditioning

Question: How does the end-to-end matrix W_{1:N} := W_N ··· W_1 move on GF trajectories?

[Diagram: Linear Neural Network (W_1, W_2, . . . , W_N) vs. Equivalent Linear Model (W_{1:N}); gradient flow over (W_1, . . . , W_N) corresponds to preconditioned gradient flow over W_{1:N}]

Theorem: If W_1, . . . , W_N are balanced at init, W_{1:N} follows the end-to-end dynamics:

d/dt vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

where P_{W_{1:N}(t)} is a preconditioner (PSD matrix) that “reinforces” W_{1:N}(t)

Adding (redundant) linear layers to the classic linear model induces a preconditioner promoting movement in directions already taken!
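
For reference, the preconditioned dynamics also admit a vectorization-free form; the expression below is the one derived in the ICML'18 paper listed under Sources, up to notation (the fractional powers are well defined since the bracketed matrices are PSD):

d/dt W_{1:N}(t) = − Σ_{j=1}^{N} [ W_{1:N}(t) W_{1:N}(t)^⊤ ]^{(j−1)/N} · ∇ℓ(W_{1:N}(t)) · [ W_{1:N}(t)^⊤ W_{1:N}(t) ]^{(N−j)/N}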

Convergence to Global Optimum

d/dt vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

P_{W_{1:N}(t)} ≻ 0 when W_{1:N}(t) has full rank =⇒ loss decreases until:

(1) ∇ℓ(W_{1:N}(t)) = 0   or   (2) W_{1:N}(t) is singular

ℓ(·) is typically convex =⇒ (1) means global min was reached

Corollary: Assume ℓ(·) is convex and the LNN is init such that:

W_1, . . . , W_N are balanced

ℓ(W_{1:N}) < ℓ(W) for any singular W

Then, GF converges to global min

From Gradient Flow to Gradient Descent

Our convergence result for GF made two assumptions on init:

1. Weights are balanced: ‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F = 0 , ∀j

2. Loss is smaller than that of any singular solution: ℓ(W_{1:N}) < ℓ(W) , ∀W s.t. σ_min(W) = 0

For translating to GD, we define discrete forms of these conditions:

Definition: For δ ≥ 0, weights W_1, . . . , W_N are δ-balanced if:

‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F ≤ δ , ∀j

Definition: For c > 0, weights W_1, . . . , W_N have deficiency margin c if:

ℓ(W_{1:N}) ≤ ℓ(W) , ∀W s.t. σ_min(W) ≤ c
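
A small helper (illustrative only) making δ-balancedness concrete: it returns the smallest δ for which given weights are δ-balanced; e.g. a near-zero random init is approximately balanced:

import numpy as np

def delta_balancedness(Ws):
    """Smallest delta such that W_1, ..., W_N are delta-balanced:
    max_j || W_{j+1}^T W_{j+1} - W_j W_j^T ||_F."""
    return max(np.linalg.norm(Wn.T @ Wn - Wp @ Wp.T)
               for Wp, Wn in zip(Ws[:-1], Ws[1:]))

rng = np.random.default_rng(0)
Ws = [1e-3 * rng.normal(size=(4, 4)) for _ in range(3)]   # near-zero random init
print(delta_balancedness(Ws))                              # small: approximately balanced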

Convergence to Global Optimum for Gradient Descent

Suppose ℓ(·) = ℓ2 loss, i.e. ℓ(W) = (1/m) ∑_{i=1}^m ‖W x_i − y_i‖_2^2

Theorem
Assume GD over LNN is init s.t. W_1, ..., W_N have deficiency margin c > 0 and are δ-balanced w/ δ ≤ O(c^2). Then, for any learning rate η ≤ O(c^4):

    loss(iteration t) ≤ e^{−Ω(c^2 η t)}

Claim
Assumptions on init (deficiency margin and δ-balancedness):

    Are necessary (violating any of them can lead to divergence)
    For output dim 1, hold w/ const prob under random “balanced” init

Guarantee of efficient (linear rate) convergence to global min!
Most general guarantee to date for GD efficiently training a deep net.

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 25 / 35
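As a sanity check of this guarantee, the following self-contained NumPy sketch (illustrative only: depth, dimensions, step size and init scale are placeholder choices, not the theorem's constants, and the deficiency margin of the random draw is not verified) runs GD on a deep linear network from an exactly balanced init under the ℓ2 loss and prints the decaying loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m, N = 8, 4, 200, 3                    # input dim, output dim, samples, depth (placeholders)
X = rng.standard_normal((m, d))
Y = X @ rng.standard_normal((k, d)).T        # targets from a ground-truth linear map

# Exactly balanced init: factor an end-to-end matrix through its SVD, giving each
# layer the singular values raised to 1/N, so W_{j+1}^T W_{j+1} = W_j W_j^T holds.
U, s, Vt = np.linalg.svd(rng.standard_normal((k, d)), full_matrices=False)
S = np.diag(s ** (1.0 / N))
weights = [S @ Vt] + [S.copy() for _ in range(N - 2)] + [U @ S]   # [W_1, ..., W_N]

def product(ws):                              # W_{1:N} = W_N ... W_1
    P = ws[0]
    for W in ws[1:]:
        P = W @ P
    return P

def loss(ws):
    R = X @ product(ws).T - Y
    return np.mean(np.sum(R**2, axis=1))

eta = 1e-3                                    # placeholder learning rate
for t in range(5001):
    P = product(weights)
    G = (2.0 / m) * (X @ P.T - Y).T @ X       # gradient of the l2 loss w.r.t. W_{1:N}
    new_weights = []
    for j, W in enumerate(weights):           # chain rule: dl/dW_j = (W_{j+1:N})^T G (W_{1:j-1})^T
        left = product(weights[j + 1:]) if j < N - 1 else np.eye(k)
        right = product(weights[:j]) if j > 0 else np.eye(d)
        new_weights.append(W - eta * left.T @ G @ right.T)
    weights = new_weights
    if t % 1000 == 0:
        print(f"iter {t:5d}  loss {loss(weights):.6f}")
```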


Trajectories of GD for Deep LNNs Acceleration by Depth

Outline

1 Deep Learning Theory: Expressiveness, Optimization and Generalization

2 Analyzing Optimization via Trajectories

3 Trajectories of Gradient Descent for Deep Linear Neural Networks
   Convergence to Global Optimum
   Acceleration by Depth

4 Conclusion

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 26 / 35


Trajectories of GD for Deep LNNs Acceleration by Depth

The Effect of Depth

Conventional wisdom:

Depth boosts expressiveness

[Figure: features learned by a deep net, from the input through early, intermediate, and deep layers]

But complicates optimization

We will see: not always true...

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 27 / 35


Trajectories of GD for Deep LNNs Acceleration by Depth

Effect of Depth for Linear Neural Networks

For LNN, we derived end-to-end dynamics:

    (d/dt) vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

Consider a discrete version:

    vec[W_{1:N}(t+1)] ← vec[W_{1:N}(t)] − η · P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

Claim
For any p > 2, there exist settings where ℓ(·) = ℓp loss:

    ℓ(W) = (1/m) ∑_{i=1}^m ‖W x_i − y_i‖_p^p   ← convex

and the discrete end-to-end dynamics reach the global min arbitrarily faster than GD

[Figure: optimization trajectories in the (w1, w2) plane: gradient descent vs. end-to-end dynamics]

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 28 / 35
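To spell out the discrete update for the end-to-end matrix itself, without the vec/P notation, here is a minimal NumPy sketch. It assumes the closed form known from the analysis of overparameterization in linear networks, namely a sum over layers of the gradient multiplied by fractional powers of W_{1:N} W_{1:N}^T on the left and of W_{1:N}^T W_{1:N} on the right; treat it as an illustration rather than a verbatim restatement of the talk's P_{W_{1:N}}.

```python
import numpy as np

def psd_power(A, alpha):
    """Fractional power of a symmetric PSD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative eigenvalues
    return (vecs * vals**alpha) @ vecs.T

def end_to_end_step(W_e, grad, N, eta):
    """One discrete step of the (assumed) end-to-end dynamics for a depth-N LNN:

    W_e <- W_e - eta * sum_{j=1}^{N} (W_e W_e^T)^{(j-1)/N} @ grad @ (W_e^T W_e)^{(N-j)/N}

    where W_e plays the role of W_{1:N} and grad is dl/dW evaluated at W_e.
    """
    update = np.zeros_like(W_e)
    for j in range(1, N + 1):
        left = psd_power(W_e @ W_e.T, (j - 1) / N)
        right = psd_power(W_e.T @ W_e, (N - j) / N)
        update += left @ grad @ right
    return W_e - eta * update
```

For N = 1 the update reduces to plain GD on W; for N > 1 the fractional powers act as a gradient-dependent preconditioner, which is the source of the acceleration discussed next.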


Trajectories of GD for Deep LNNs Acceleration by Depth

Experiments

Linear neural networks: regression problem from UCI ML Repository; ℓ4 loss

Depth can speed up GD, even w/o any gain in expressiveness, and despite introducing non-convexity!

This speed-up can outperform popular acceleration methods designed for convex problems!

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 29 / 35
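A toy version of this comparison can be scripted directly. The sketch below is illustrative only: the synthetic data, depth, learning rate and init are placeholders rather than the UCI setup from the slide, and whether (and by how much) the deep parameterization wins depends on this tuning. It trains the same linear model under the ℓ4 loss once as a single matrix and once as a product of N matrices, with plain GD in both cases.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, m, N = 10, 1, 300, 3              # input dim, output dim, samples, depth (placeholders)
X = rng.standard_normal((m, d))
Y = X @ rng.standard_normal((k, d)).T   # targets from a ground-truth linear map

def l4_loss_and_grad(W):
    """l4 loss (1/m) sum_i ||W x_i - y_i||_4^4 and its gradient w.r.t. the matrix W."""
    R = X @ W.T - Y                      # (m, k) residuals
    return np.mean(np.sum(R**4, axis=1)), (4.0 / m) * (R**3).T @ X

def product(Ws):                          # W_{1:N} = W_N ... W_1
    P = Ws[0]
    for W in Ws[1:]:
        P = W @ P
    return P

def run(depth, eta, steps=5000):
    """Plain GD; depth=1 is the direct parameterization, depth>1 the deep (factored) one."""
    Ws = [np.eye(d) for _ in range(depth - 1)] + [0.3 * rng.standard_normal((k, d))]
    for _ in range(steps):
        _, G = l4_loss_and_grad(product(Ws))
        grads = []
        for j in range(depth):           # chain rule: dl/dW_j = (W_{j+1:N})^T G (W_{1:j-1})^T
            left = product(Ws[j + 1:]) if j < depth - 1 else np.eye(Ws[-1].shape[0])
            right = product(Ws[:j]) if j > 0 else np.eye(d)
            grads.append(left.T @ G @ right.T)
        Ws = [W - eta * g for W, g in zip(Ws, grads)]
    return l4_loss_and_grad(product(Ws))[0]

# Final losses after the same number of GD steps (outcome depends on tuning):
print("depth 1 :", run(1, eta=5e-4))
print("depth", N, ":", run(N, eta=5e-4))
```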


Trajectories of GD for Deep LNNs Acceleration by Depth

Experiments (cont’)
Non-linear neural networks: TensorFlow convolutional network tutorial for MNIST [1]

Arch: (conv → ReLU → max pool) × 2 → dense → ReLU → dense
Training: stochastic GD w/ momentum, dropout

We overparameterized by adding a linear layer after each dense layer

[Figure: batch loss (log scale) vs. iteration, original vs. overparameterized network]

Adding depth, w/o any gain in expressiveness, and only +15% in params, accelerated the non-linear net by orders of magnitude!

[1] https://github.com/tensorflow/models/tree/master/tutorials/image/mnist

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 30 / 35
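For concreteness, here is a Keras-style sketch of the overparameterization (not the tutorial's original code; exact layer sizes, padding and dropout placement are assumptions): after each dense layer we insert an extra dense layer with no activation, which adds depth but no expressive power, since two stacked linear maps are still one linear map.

```python
import tensorflow as tf

def build_model(overparameterize: bool) -> tf.keras.Model:
    layers = [
        tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation=None),   # first dense layer (pre-activation)
    ]
    if overparameterize:
        # Extra linear layer: adds depth but no expressiveness.
        layers.append(tf.keras.layers.Dense(512, activation=None))
    layers += [
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(10, activation=None),    # second dense layer (logits)
    ]
    if overparameterize:
        layers.append(tf.keras.layers.Dense(10, activation=None))
    return tf.keras.Sequential(layers)

model = build_model(overparameterize=True)
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```

The extra Dense(512) and Dense(10) layers are what add the modest parameter overhead while leaving the set of representable functions unchanged.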


Conclusion

Outline

1 Deep Learning Theory: Expressiveness, Optimization and Generalization

2 Analyzing Optimization via Trajectories

3 Trajectories of Gradient Descent for Deep Linear Neural Networks
   Convergence to Global Optimum
   Acceleration by Depth

4 Conclusion

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 31 / 35


Conclusion

Recap

Understanding DL calls for addressing three fundamental Qs:

Expressiveness Optimization Generalization

Optimization

Deep (≥ 3 layer) models can’t be treated via geometry alone ⇒ specific optimizer trajectories should be taken into account

We analyzed trajectories of GD over deep linear neural nets:

Derived guarantee for convergence to global min at linear rate (most general guarantee to date for GD efficiently training a deep model)

Depth induces a preconditioner that can accelerate convergence, w/o any gain in expressiveness, and despite introducing non-convexity

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 32 / 35


Conclusion

Next Step: Analyzing Generalization via Trajectories

[Figure: the space of all functions containing the hypotheses space, with f* (ground truth), h* (best hypothesis in the space), and h*_S (hypothesis returned from the training set S); the gap between h*_S and h* is the estimation error (generalization)]

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 33 / 35


Outline

1 Deep Learning Theory: Expressiveness, Optimization and Generalization

2 Analyzing Optimization via Trajectories

3 Trajectories of Gradient Descent for Deep Linear Neural Networks
   Convergence to Global Optimum
   Acceleration by Depth

4 Conclusion

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 34 / 35


Thank You

Nadav Cohen (IAS) Optimization in DL via Trajectories ICERM Workshop, Feb’19 35 / 35