Understanding Generalization
Ou Changkun (LMU Munich)
University of Munich, Seminar Deep Learning, WiSe 2017/2018
February 1, 2018
“This paper explains why deep learning can generalize well”
Outline
1 Motivation
● Backgrounds Introduction
● Recap of Traditional Generalization Theory
2 Rethinking of Generalization
3 Bounds of Generalization Gap
4 DARC Regularizer
What is Generalization?
1 Motivation Backgrounds Introduction
Definition: Generalization Bound
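In its standard form (not necessarily the slide's exact statement): with probability at least 1 − δ over an i.i.d. sample S of size m,

$$ R(h) \;\le\; \hat{R}_S(h) + \epsilon(m, \delta) \qquad \text{for all } h \in \mathcal{H}, $$

where R(h) is the expected (test) risk, R̂_S(h) the empirical (training) risk, and ε(m, δ) bounds the generalization gap.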
Goals of Generalization Theory
1 Motivation Backgrounds Introduction
Generalization theory is all about: Test error - Training error = ????
The Force of Generalization
1 Motivation Backgrounds Introduction
Recap I: Model Complexity
1 Motivation Recap of Generalization Theory
[Figure: the memorization vs. generalization trade-off; image source [Mohri et al. 2012]]
Complexity-based bound: Test error ≤ Training error + Complexity Penalty
Rademacher bound:
VC bound:
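Standard forms of both bounds (as in Mohri et al. 2012; constants vary by statement), each holding with probability at least 1 − δ:

$$ \text{(Rademacher)} \quad R(h) \;\le\; \hat{R}_S(h) + 2\,\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}} $$

$$ \text{(VC, with } d \text{ the VC dimension)} \quad R(h) \;\le\; \hat{R}_S(h) + \sqrt{\frac{2d\log(em/d)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}} $$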
[Plot: error vs. model complexity, showing training error, the complexity term, and the resulting test error]
Recap II: Algorithmic Stability & Robustness
1 Motivation Recap of Generalization Theory
[Image source: Google Inc., 2015]
Stability/robustness-based bound: Test error ≤ Training error + Algorithm Penalty
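For reference, the uniform-stability bound of [Bousquet et al. 2002], in a standard form (the slide's constants may differ): if algorithm A is β-uniformly stable and the loss is bounded by M, then with probability at least 1 − δ,

$$ R(A_S) \;\le\; \hat{R}_S(A_S) + 2\beta + (4m\beta + M)\sqrt{\frac{\log(1/\delta)}{2m}}. $$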
Recap III: Flat Region Conjecture
1 Motivation Recap of Generalization Theory
● Conjecture: minima lying in flat regions of the loss landscape generalize better than minima in sharp regions.
[Figures: loss landscape, image source [Li et al. 2018]; image source [He et al. 2015]; "flat region" vs. "sharp region", image source [Dauphin et al. 2015]]
Outline
1 Motivation
2 Rethinking of Generalization
● Understanding generalization failure
● Empirical risk guarantee
3 Bounds of Generalization Gap
4 DARC Regularizer
On the origin of “paradox”
2 Rethinking of Generalization
● The "paradox" [Zhang et al. 2017]: deep networks can perfectly fit even randomly labeled data, yet still generalize on real data, so complexity-based bounds alone cannot explain their generalization.
Empirical Risk Guarantee (Estimating Test Error)
2 Rethinking of Generalization Empirical Risk Guarantee
Theorem: empirical risk estimation (for finite hypothesis class)
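A standard statement of this result (Hoeffding's inequality plus a union bound over H; the slide's constants may differ): with probability at least 1 − δ, for every h ∈ H,

$$ R(h) \;\le\; \hat{R}_S(h) + \sqrt{\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}. $$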
Empirical Risk Guarantee II: An Example
2 Rethinking of Generalization Empirical Risk Guarantee
Theorem: empirical risk estimation (for finite hypothesis class)
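A minimal numeric sketch of evaluating this bound (illustrative numbers, not those of the original slide):

import math

def finite_class_bound(train_err, num_hypotheses, m, delta=0.01):
    # test error <= train error + sqrt((ln|H| + ln(1/delta)) / (2m))
    gap = math.sqrt((math.log(num_hypotheses) + math.log(1 / delta)) / (2 * m))
    return train_err + gap

# |H| = 10^6 hypotheses, m = 10,000 samples, 99% confidence:
print(finite_class_bound(0.02, 10**6, 10_000))  # ~= 0.02 + 0.030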
Outline
1 Motivation
2 Rethinking of Generalization
3 Bounds of Generalization Gap
● Data-dependent Bound
● Data-independent Bound
4 DARC Regularizer
“Good” Dataset
3 Bounds of Generalization Gap Data-dependent Bound
Definition: Concentrated dataset
Data-dependent Bound (Generalization Quality)
3 Bounds of Generalization Gap Data-dependent Bound
Theorem: Data-dependent Bound
Two-phase Training Technique
3 Bounds of Generalization Gap Data-independent Bound
● "Standard" phase: train the network in the standard way on αm samples of the dataset;
● "Freeze" phase: freeze the activation pattern and keep training on the remaining (1−α)m samples.
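A minimal NumPy sketch of the freeze-phase forward pass, assuming a plain fully connected ReLU net (illustrative names, not the paper's code):

import numpy as np

def forward_frozen_pattern(W_train, W_frozen, x):
    # Freeze-phase forward pass: pre-activations use the trainable
    # weights W_train, but each ReLU gate (the on/off activation
    # pattern) is computed from the phase-1 weights W_frozen.
    a_train, a_frozen = x, x
    for Wt, Wf in zip(W_train, W_frozen):
        z_train, z_frozen = a_train @ Wt, a_frozen @ Wf
        gate = (z_frozen > 0).astype(x.dtype)  # fixed activation pattern
        a_train = z_train * gate               # gated, linear in W_train
        a_frozen = z_frozen * gate             # ordinary ReLU output
    return a_train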
* In the experiments, ratio = two-phase / base.
Data-independent Bound (Determining a Good Model)
3 Bounds of Generalization Gap Data-independent Bound
Theorem: Data-independent bound
● Key quantities in the bound: ρ, m_σ
Data-independent Bound: Examples
3 Bounds of Generalization Gap Data-independent Bound
Theorem: Data-independent bound
Outline
1 Motivation
2 Rethinking of Generalization
3 Bounds of Generalization Gap
4 DARC Regularizer
● Directly Approximately Regularizing Complexity
● DARC1 Experiments
Directly Approximately Regularizing Complexity (DARC) Family
4 DARC Regularizer Directly Approximately Regularizing Complexity
DARC1 Regularization
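Reconstructed from the Keras implementation on the next slide (the paper's normalization may differ): DARC1 adds to the cross-entropy loss the largest per-class sum of absolute network outputs over the mini-batch,

$$ \mathcal{L}_{\mathrm{DARC1}} \;=\; \mathcal{L}_{\mathrm{CE}} + \lambda \,\max_{k} \sum_{i=1}^{m} \big|\hat{z}_k(x_i)\big|, $$

where ẑ_k(x_i) is the k-th output-layer unit on the i-th sample of the batch.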
Experiment results: DARC1 Regularizer on MNIST & CIFAR-10
4 DARC Regularizer DARC1 Experiments
Test error ratio (DARC1 / Base):

            MNIST (ND)      MNIST           CIFAR-10
            mean   stdv     mean   stdv     mean   stdv
DARC1/Base  0.89   0.61     0.95   0.67     0.97   0.79
DARC1 implementation (Keras)
from keras import backend as K

def darc1_loss(model, lamb=0.001):
    # DARC1: cross-entropy plus lambda times the largest per-class
    # sum of absolute last-layer outputs over the batch (axis 0).
    def _loss(y_true, y_pred):
        original_loss = K.categorical_crossentropy(y_true, y_pred)
        custom_loss = lamb * K.max(K.sum(K.abs(model.layers[-1].output), axis=0))
        return original_loss + custom_loss
    return _loss
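Usage sketch (the optimizer choice is illustrative; the closure captures model so the penalty can read the last layer's output):

model.compile(optimizer='adam', loss=darc1_loss(model), metrics=['accuracy'])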
TL;DR: Only Model Architecture & Dataset Matter
5 Conclusions
This paper is all about: Test error ≤ Training error + (Model, Data) Penalty
Outline
1 Motivation
2 Rethinking of Generalization
3 Bounds of Generalization Gap
4 DARC Regularizer
5 Related Readings
Related Readings I: Previous Research
References
[Shalev-Shwartz et al. 2014] Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
[Neyshabur et al. 2017] Exploring generalization in deep learning. arXiv preprint: 1706.08947, NIPS 2017.
[Xu et al. 2010] Robustness and generalization. arXiv preprint: 1005.2243, JMLR 2012.
[Bousquet et al. 2002] Stability and generalization. JMLR 2002.
Related Readings II: Rethinking Generalization
References
[Zhang et al. 2017] Understanding deep learning requires rethinking generalization. arXiv preprint: 1611.03530, ICLR 2017.
[Krueger et al. 2017] Deep nets don't learn via memorization. ICLR 2017 Workshop.
[Kawaguchi et al. 2017] Generalization in deep learning. arXiv preprint: 1710.05468.
Related Readings III: The Trick of Deep Path
References
[Baldi et al. 1989] Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, Vol. 2, 53–58, 1989.
[Choromanska et al. 2015] The Loss Surfaces of Multilayer Networks. arXiv preprint: 1412.0233, AISTATS 2015.
[Kawaguchi 2016] Deep learning without poor local minima. NIPS 2016 (oral).
Related Readings IV: Recent Research
References
[Dziugaite et al. 2017] Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. arXiv preprint: 1703.11008.
[Bartlett et al. 2017] Spectrally-normalized margin bounds for neural networks. arXiv preprint: 1706.08498, NIPS 2017.
[Neyshabur et al. 2018] A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint: 1707.09564, ICLR 2018.
[Sinha et al. 2018] Certifying Some Distributional Robustness with Principled Adversarial Training. arXiv preprint: 1710.10571, ICLR 2018 Best Paper.
General Concept Questions
1 Motivation Backgrounds Introduction
Properties of (over-parameterized) linear models
2 Rethinking of Generalization
Theorem: Generalization failure on linear models
On the origin of “paradox”
2 Rethinking of Generalization
Proposition
Path Trick in ReLU Net [ Choromanska et al. 2015 ]
3 Bounds of Generalization Gap Data-dependent Bound
[Diagram: feedforward network; Layer 1 … Layer (H), Layer (H+1)]
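The path decomposition the diagram illustrates is standard [Choromanska et al. 2015]: a ReLU network output is a sum over all input-to-output paths,

$$ \hat{y}_k(x) \;=\; \sum_{p \,\in\, \mathrm{paths}(k)} x_{i(p)} \, \sigma_p(x) \prod_{h=1}^{H+1} w_p^{(h)}, $$

where x_{i(p)} is the input coordinate at which path p starts, w_p^{(h)} is the weight the path uses in layer h, and σ_p(x) ∈ {0, 1} indicates whether every ReLU along the path is active.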
Concentrated Dataset
3 Bounds of Generalization Gap Data-dependent Bound
Definition 3: “good” dataset
Data-dependent Guarantee
3 Bounds of Generalization Gap Data-dependent Bound
Proposition 4: Data-dependent Bound
Why z?
2 Rethinking of Generalization Empirical Risk Guarantee
Bernstein Inequality
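A standard one-sided form of the inequality (the slide's exact variant may differ): for independent zero-mean random variables X_1, …, X_m with |X_i| ≤ M almost surely,

$$ \Pr\!\Big[\textstyle\sum_{i=1}^{m} X_i > t\Big] \;\le\; \exp\!\Big(-\frac{t^2/2}{\sum_{i=1}^{m} \mathbb{E}[X_i^2] + Mt/3}\Big). $$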
SGD, z, w; relations?
3 Bounds of Generalization Gap Data-independent Bound
Q: Did you forget that σ depends on w?
A: No; they can be independent.
Two-phase training
3 Bounds of Generalization Gap Data-independent Bound
● Standard phase: train the network in the standard way on part of the dataset;
● Freeze phase: freeze the activation pattern and keep training on the rest of the dataset.
Directly Approximately Regularizing Complexity (DARC)
Backstage Slides Directly Approximately Regularizing Complexity
Bound by Rademacher Complexity [Koltchinskii and Panchenko 2002]
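A standard margin version of this bound (as in Mohri et al. 2012; constants vary by statement): for margin γ > 0, with probability at least 1 − δ,

$$ R(f) \;\le\; \hat{R}_{S,\gamma}(f) + \frac{2}{\gamma}\,\mathfrak{R}_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2m}}, $$

where R̂_{S,γ} is the fraction of training points with margin at most γ.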
Experiment results: DARC1 Regularizer with ResNet-50 on ILSVRC2012
Backstage Slides Experiments
[Results table: 10 image classes / entire dataset]
Personal Opinions regarding SGD
5 Conclusions