Understanding Generalization
Ou Changkun (LMU Munich)
University of Munich, Seminar Deep Learning, WiSe 2017/2018
February 1, 2018
“This paper explains why deep learning can generalize well”
Outline
1 Motivation
● Backgrounds Introduction
● Recap of Traditional Generalization Theory
2 Rethinking of Generalization
3 Bounds of Generalization Gap
4 DARC Regularizer
What is Generalization?
1 Motivation Backgrounds Introduction
Definition: Generalization Bound
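In its standard form (not necessarily the slide's exact statement): with probability at least 1 − δ over an i.i.d. sample S of size m,

$$ R(h) \;\le\; \hat{R}_S(h) + \epsilon(m, \delta) \qquad \text{for all } h \in \mathcal{H}, $$

where R(h) is the expected (test) risk, R̂_S(h) the empirical (training) risk, and ε(m, δ) bounds the generalization gap.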
Goals of Generalization Theory
1 Motivation Backgrounds Introduction
Generalization theory is all about: Test error - Training error = ????
The Force of Generalization
1 Motivation Backgrounds Introduction
Recap I: Model Complexity
1 Motivation Recap of Generalization Theory
[Figure: the memorization vs. generalization trade-off; image source [Mohri et al. 2012]]
Complexity-based bound: Test error ≤ Training error + Complexity Penalty
Rademacher bound:
VC bound:
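Standard forms of both bounds (as in Mohri et al. 2012; constants vary by statement), each holding with probability at least 1 − δ:

$$ \text{(Rademacher)} \quad R(h) \;\le\; \hat{R}_S(h) + 2\,\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}} $$

$$ \text{(VC, with } d \text{ the VC dimension)} \quad R(h) \;\le\; \hat{R}_S(h) + \sqrt{\frac{2d\log(em/d)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}} $$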
[Plot: error vs. model complexity, showing training error, the complexity term, and the resulting test error]
Recap II: Algorithmic Stability & Robustness
1 Motivation Recap of Generalization Theory
[Image source: Google Inc., 2015]
Stability/robustness-based bound: Test error ≤ Training error + Algorithm Penalty
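For reference, the uniform-stability bound of [Bousquet et al. 2002], in a standard form (the slide's constants may differ): if algorithm A is β-uniformly stable and the loss is bounded by M, then with probability at least 1 − δ,

$$ R(A_S) \;\le\; \hat{R}_S(A_S) + 2\beta + (4m\beta + M)\sqrt{\frac{\log(1/\delta)}{2m}}. $$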
Recap III: Flat Region Conjecture
1 Motivation Recap of Generalization Theory
● Conjecture: minima lying in flat regions of the loss landscape generalize better than minima in sharp regions.
[Figures: loss landscape, image source [Li et al. 2018]; image source [He et al. 2015]; "flat region" vs. "sharp region", image source [Dauphin et al. 2015]]
Outline
1 Motivation
2 Rethinking of Generalization
● Understanding generalization failure
● Empirical risk guarantee
3 Bounds of Generalization Gap
4 DARC Regularizer
On the origin of “paradox”
2 Rethinking of Generalization
● The "paradox" [Zhang et al. 2017]: deep networks can perfectly fit even randomly labeled data, yet still generalize on real data, so complexity-based bounds alone cannot explain their generalization.
Empirical Risk Guarantee (Estimating Test Error)
2 Rethinking of Generalization Empirical Risk Guarantee
Theorem: empirical risk estimation (for finite hypothesis class)
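A standard statement of this result (Hoeffding's inequality plus a union bound over H; the slide's constants may differ): with probability at least 1 − δ, for every h ∈ H,

$$ R(h) \;\le\; \hat{R}_S(h) + \sqrt{\frac{\log|\mathcal{H}| + \log(1/\delta)}{2m}}. $$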
Empirical Risk Guarantee II: An Example
2 Rethinking of Generalization Empirical Risk Guarantee
Theorem: empirical risk estimation (for finite hypothesis class)
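A minimal numeric sketch of evaluating this bound (illustrative numbers, not those of the original slide):

import math

def finite_class_bound(train_err, num_hypotheses, m, delta=0.01):
    # test error <= train error + sqrt((ln|H| + ln(1/delta)) / (2m))
    gap = math.sqrt((math.log(num_hypotheses) + math.log(1 / delta)) / (2 * m))
    return train_err + gap

# |H| = 10^6 hypotheses, m = 10,000 samples, 99% confidence:
print(finite_class_bound(0.02, 10**6, 10_000))  # ~= 0.02 + 0.030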
Outline
1 Motivation
2 Rethinking of Generalization
3 Bounds of Generalization Gap
● Data-dependent Bound
● Data-independent Bound
4 DARC Regularizer
“Good” Dataset
3 Bounds of Generalization Gap Data-dependent Bound
Definition: Concentrated dataset
Data-dependent Bound (Generalization Quality)
3 Bounds of Generalization Gap Data-dependent Bound
Theorem: Data-dependent Bound
Two-phase Training Technique
3 Bounds of Generalization Gap Data-independent Bound
● "Standard" phase: train the network in the standard way on αm samples of the dataset;
● "Freeze" phase: freeze the activation pattern and keep training on the remaining (1−α)m samples.
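A minimal NumPy sketch of the freeze-phase forward pass, assuming a plain fully connected ReLU net (illustrative names, not the paper's code):

import numpy as np

def forward_frozen_pattern(W_train, W_frozen, x):
    # Freeze-phase forward pass: pre-activations use the trainable
    # weights W_train, but each ReLU gate (the on/off activation
    # pattern) is computed from the phase-1 weights W_frozen.
    a_train, a_frozen = x, x
    for Wt, Wf in zip(W_train, W_frozen):
        z_train, z_frozen = a_train @ Wt, a_frozen @ Wf
        gate = (z_frozen > 0).astype(x.dtype)  # fixed activation pattern
        a_train = z_train * gate               # gated, linear in W_train
        a_frozen = z_frozen * gate             # ordinary ReLU output
    return a_train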
* In the experiments, ratio = two-phase / base.
Data-independent Bound (Determining a Good Model)
3 Bounds of Generalization Gap Data-independent Bound
Theorem: Data-independent bound
● Key quantities in the bound: ρ, m_σ
Data-independent Bound: Examples
3 Bounds of Generalization Gap Data-independent Bound
Theorem: Data-independent bound
Outline
1 Motivation
2 Rethinking of Generalization
3 Bounds of Generalization Gap
4 DARC Regularizer
● Directly Approximately Regularizing Complexity
● DARC1 Experiments
Directly Approximately Regularizing Complexity (DARC) Family
4 DARC Regularizer Directly Approximately Regularizing Complexity
DARC1 Regularization
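Reconstructed from the Keras implementation on the next slide (the paper's normalization may differ): DARC1 adds to the cross-entropy loss the largest per-class sum of absolute network outputs over the mini-batch,

$$ \mathcal{L}_{\mathrm{DARC1}} \;=\; \mathcal{L}_{\mathrm{CE}} + \lambda \,\max_{k} \sum_{i=1}^{m} \big|\hat{z}_k(x_i)\big|, $$

where ẑ_k(x_i) is the k-th output-layer unit on the i-th sample of the batch.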
Experiment results: DARC1 Regularizer on MNIST & CIFAR-10
4 DARC Regularizer DARC1 Experiments
Test error ratio (DARC1 / Base):

            MNIST (ND)      MNIST           CIFAR-10
            mean   stdv     mean   stdv     mean   stdv
DARC1/Base  0.89   0.61     0.95   0.67     0.97   0.79
DARC1 implementation (Keras)
from keras import backend as K

def darc1_loss(model, lamb=0.001):
    # DARC1: cross-entropy plus lambda times the largest per-class
    # sum of absolute last-layer outputs over the batch (axis 0).
    def _loss(y_true, y_pred):
        original_loss = K.categorical_crossentropy(y_true, y_pred)
        custom_loss = lamb * K.max(K.sum(K.abs(model.layers[-1].output), axis=0))
        return original_loss + custom_loss
    return _loss
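Usage sketch (the optimizer choice is illustrative; the closure captures model so the penalty can read the last layer's output):

model.compile(optimizer='adam', loss=darc1_loss(model), metrics=['accuracy'])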
TL;DR: Only Model Architecture & Dataset Matter
5 Conclusions
This paper is all about: Test error ≤ Training error + (Model, Data) Penalty
Outline
1 Motivation
2 Rethinking of Generalization
3 Bounds of Generalization Gap
4 DARC Regularizer
5 Related Readings
Related Readings I: Previous Research
References
[Shalev-Shwartz et al. 2014] Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
[Neyshabur et al. 2017] Exploring generalization in deep learning. arXiv preprint: 1706.08947, NIPS 2017.
[Xu et al. 2010] Robustness and generalization. arXiv preprint: 1005.2243, JMLR 2012.
[Bousquet et al. 2002] Stability and generalization. JMLR 2002.
Related Readings II: Rethinking Generalization
References
[Zhang et al. 2017] Understanding deep learning requires rethinking generalization. arXiv preprint: 1611.03530, ICLR 2017.
[Krueger et al. 2017] Deep nets don't learn via memorization. ICLR 2017 Workshop.
[Kawaguchi et al. 2017] Generalization in deep learning. arXiv preprint: 1710.05468.
Related Readings III: The Trick of Deep Path
References
[Baldi et al. 1989] Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, Vol. 2, 53–58, 1989.
[Choromanska et al. 2015] The Loss Surfaces of Multilayer Networks. arXiv preprint: 1412.0233, AISTATS 2015.
[Kawaguchi 2016] Deep learning without poor local minima. NIPS 2016 (oral).
Related Readings IV: Recent Research
References
[Dziugaite et al. 2017] Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. arXiv preprint: 1703.11008.
[Bartlett et al. 2017] Spectrally-normalized margin bounds for neural networks. arXiv preprint: 1706.08498, NIPS 2017.
[Neyshabur et al. 2018] A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint: 1707.09564, ICLR 2018.
[Sinha et al. 2018] Certifying Some Distributional Robustness with Principled Adversarial Training. arXiv preprint: 1710.10571, ICLR 2018 Best Paper.
General Concept Questions
1 Motivation Backgrounds Introduction
Properties of (over-parameterized) linear models
2 Rethinking of Generalization
Theorem: Generalization failure on linear models
On the origin of “paradox”
2 Rethinking of Generalization
Proposition
Path Trick in ReLU Net [ Choromanska et al. 2015 ]
3 Bounds of Generalization Gap Data-dependent Bound
[Diagram: feedforward network; Layer 1 … Layer (H), Layer (H+1)]
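The path decomposition the diagram illustrates is standard [Choromanska et al. 2015]: a ReLU network output is a sum over all input-to-output paths,

$$ \hat{y}_k(x) \;=\; \sum_{p \,\in\, \mathrm{paths}(k)} x_{i(p)} \, \sigma_p(x) \prod_{h=1}^{H+1} w_p^{(h)}, $$

where x_{i(p)} is the input coordinate at which path p starts, w_p^{(h)} is the weight the path uses in layer h, and σ_p(x) ∈ {0, 1} indicates whether every ReLU along the path is active.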
Concentrated Dataset
3 Bounds of Generalization Gap Data-dependent Bound
Definition 3: “good” dataset
Data-dependent Guarantee
3 Bounds of Generalization Gap Data-dependent Bound
Proposition 4: Data-dependent Bound
Why z?
2 Rethinking of Generalization Empirical Risk Guarantee
Bernstein Inequality
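A standard one-sided form of the inequality (the slide's exact variant may differ): for independent zero-mean random variables X_1, …, X_m with |X_i| ≤ M almost surely,

$$ \Pr\!\Big[\textstyle\sum_{i=1}^{m} X_i > t\Big] \;\le\; \exp\!\Big(-\frac{t^2/2}{\sum_{i=1}^{m} \mathbb{E}[X_i^2] + Mt/3}\Big). $$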
SGD, z, w; relations?
3 Bounds of Generalization Gap Data-independent Bound
Q: Did you forget that σ depends on w?
A: No; they can be independent.
Two-phase training
3 Bounds of Generalization Gap Data-independent Bound
● Standard phase: train the network in the standard way on part of the dataset;
● Freeze phase: freeze the activation pattern and keep training on the rest of the dataset.
Directly Approximately Regularizing Complexity (DARC)
Backstage Slides Directly Approximately Regularizing Complexity
Bound by Rademacher Complexity [Koltchinskii and Panchenko 2002]
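A standard margin version of this bound (as in Mohri et al. 2012; constants vary by statement): for margin γ > 0, with probability at least 1 − δ,

$$ R(f) \;\le\; \hat{R}_{S,\gamma}(f) + \frac{2}{\gamma}\,\mathfrak{R}_m(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2m}}, $$

where R̂_{S,γ} is the fraction of training points with margin at most γ.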
Experiment results: DARC1 Regularizer with ResNet-50 on ILSVRC2012
Backstage Slides Experiments
[Results table: 10 image classes / entire dataset]
Personal Opinions regarding SGD
5 Conclusions