by
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
November 27, 2017
Eugene McDermott Professor of Brain and Cognitive Sciences
Thesis Supervisor
Accepted by: Professor Leslie A. Kolodziejski
Professor of Electrical Engineering and Computer Science
Chair, Department Committee on Graduate Students
by
Chiyuan Zhang
Submitted to the Department of Electrical Engineering and Computer
Science on November 27, 2017, in partial fulfillment of the
requirements for the degree of Doctor of Philosophy
Abstract

In recent years, deep learning has witnessed successful applications in many different domains such as visual object recognition, detection and segmentation, automatic speech recognition, natural language processing, and reinforcement learning. In this thesis, we will investigate deep learning from a spectrum of different perspectives.
First of all, we will study the question of generalization, which is one of the most fundamental notions in machine learning theory. We will show how, in the regime of deep learning, the characterization of generalization differs from the conventional one, and propose alternative ways to approach it.
Moving from theory to more practical perspectives, we will show two different applications of deep learning. One originates from the real world problem of automatic geophysical feature detection from seismic recordings to help oil & gas exploration; the other is motivated by computational neuroscientific modeling and study of the human auditory system. More specifically, we will show how deep learning can be adapted to play nicely with the unique structures associated with the problems from different domains.
Lastly, we move to the computer system design perspective, and present our efforts in building better deep learning systems to allow efficient and flexible computation in both academia and industry.
Thesis Supervisor: Tomaso Poggio Title: Eugene McDermott Professor
of Brain and Cognitive Sciences
Acknowledgments
First of all, I would like to thank all the people who brought me to the field of machine learning and academic research, especially Dr. Xiaofei He, Dr. Deng Cai, Dr. Binbin Lin, and all the other people in the zju-learning group at Zhejiang University. Without them, I would never have made up my mind to go abroad and pursue a PhD, and I would have totally missed this great journey of the past five years of my life. It is still hard to believe that I am already near the end of it now.
I would like to thank my advisor Dr. Tomaso Poggio, who offered me
the opportunity
to be in the great family of the poggio-lab. Tommy always has a great vision for bridging the understanding of the foundations of machine learning systems and human intelligence. I
would have been completely lost without his guidance. On the other
hand, he is also very
supportive and gives me great freedom to explore and learn to do independent research, as well as to pursue a colorful life apart from study.
I would like to thank my great thesis committee, Dr. Piotr Indyk,
Dr. Tommi Jaakkola and
Dr. Stefanie Jegelka. All of them not only directly gave me valuable advice and comments on the thesis, but also influenced me on other occasions during my graduate study. I started my PhD on a collaborative project with Shell Oil, and Piotr was the leading Principal Investigator on the team. Piotr has been a great inspiration for me ever since. One of the first classes I took at MIT was 6.438, taught by Tommi. Tommi is a great teacher, and everything I learned in the class has laid a solid foundation for the rest of my pursuit. I also came to interact with Stefanie initially through a class, and she continued to offer valuable feedback afterwards; the conversations directly influenced part of my thesis projects.
I’m grateful to all the other members of the Poggio-lab. Dr. Lorenzo Rosasco is essentially a second advisor to me. His sharp opinions have shaped my critical and mathematical thinking. Gadi Geiger has also become a mentor as well as a good friend throughout our daily coffee chats. Youssef Mroueh was my CSAIL buddy and a valuable source of research advice and friendship. Andrea Tacchetti and Yena Han both offered invaluable friendship and inspiration for life. Many thanks to Fabio Anselmi, Guillermo D. Canas, Carlo Ciliberto, Georgios Evangelopoulos, Charlie Frogner, Leyla Isik, Joel Z. Leibo, Owen Lewis, Qianli Liao, Ethan Meyers, Brando Miranda, Jim Mutch, Yuzhao (Allen) Ni, Maximilian Nickel, Gemma Roig Noguera, and Stephen Voinea for their help and many interesting and insightful conversations. Special thanks to our administrative assistant Kathleen Sullivan for all the help and donuts, and for introducing me to Inktober.
I’m extremely fortunate to have had the chance to interact with many other great people at MIT. Thanks to Dr. Afonso Bandeira, Dr. Jim Glass, Dr. Leslie Kaelbling, Dr. Ankur Moitra, Dr. Sasha Rakhlin, and Dr. Philippe Rigollet. I’m also grateful for the invaluable experiences working with wonderful people outside MIT during internships and visits. Many thanks to Dr. Samy Bengio, Dr. Marco Cuturi, Dr. Moritz Hardt, Dr. Detlef Hohl, Dr. Sugiyama Masashi, Dr. Mauricio Araya Polo, Dr. Neil Rabinowitz, Dr. Ben Recht, Dr. Francis Song, Dr. Ichiro Takeuchi and Dr. Oriol Vinyals.
A very special gratitude goes out to Shell R&D, the Nuance Foundation, and the Center for Brains, Minds and Machines for providing research funding. All this would have been impossible without their generous support.
I’m also grateful to all the friends, old and new, who have supported me along the way. Special
thanks to Tianqi Chen, Xinlei Chen, Louis Chen, Xinyun Chen, Fei
Fang, Xue Feng, Siong
Thye Goh, Chong Yang Goh, Xuanzong Guo, Sonya Han, Daniel
“Shang-Wen” Li, Duanhui
Li, Chengtao Li, Mu Li, Song Liu, Tengyu Ma, Hongyao Ma, Yao Ma, He
Meng, Yin Qu, Ling
Ren, Ludwig Schmidt, Peng Wang, Shujing Wang, Zi Wang, Xiaoyu Wu,
Jiajun Wu, Bing Xu,
Tianfan Xue, Cong Yan, Judy “Qingjun” Yang, Xi Yang, Wei Yu, Zhou
Yu, Wenzhen Yuan, Yu
Zhang, Zhengdong Zhang, Xuhong Zhang, Sizhuo Zhang, Shu Zhang, Ya
Zhang, Hao Zhang,
Xijia Zheng, Qinqing Zheng, Bolei Zhou, and many others. They have become an essential part of my research, study, and life. I will miss the study groups, the board games, the hikes, the day trips, and the home-made hot pots.
A special thanks to Sam Madden, my academic advisor, and Janet Fischer, our EECS Graduate Administrator, who were always kind, patient, and knowledgeable in helping to make sure everything went well with my study and academic life.
And finally, last but by no means least, I want to dedicate the
thesis to my mother Hua Wang
and my father Zhongmo Zhang, for their love and for fostering all
my academic endeavors.
Contents

2.1 Backgrounds and problem setup  16
    2.1.1 Rademacher complexity and generalization error  21
2.2 Related work on the three major theoretical questions in deep learning  24
    2.2.1 Approximation error  24
    2.2.2 Optimization error  25
    2.2.3 Generalization error  27
    2.4.1 Our contributions  32
    2.4.2 Related work  35
    2.4.4 The role of regularizers  39
    2.4.5 Finite-sample expressivity  43
    2.4.7 Conclusion  46
    2.4.8 Appendix  47
    2.5.3 Norms and complexity measures  57
3 Application and adaptation of deep learning with domain structures  59
3.1 Learning based automatic geophysical feature detection system  60
    3.1.1 Petroleum exploration and seismic survey  60
    3.1.2 Related work on automatic geophysical feature detection  62
    3.1.3 Deep learning based fault detection from raw seismic signals  62
    3.1.4 Learning with a Wasserstein distance  72
3.2 Invariant representation learning for speech and audio signals  98
    3.2.1 Backgrounds  100
    …imation  102
4 Building flexible and efficient deep learning systems  117
4.1 The importance of high quality deep learning systems and related work  117
4.2 MXNet: flexible and efficient deep learning library for distributed systems  123
    4.2.1 Introduction  123
List of Figures

2-3 Fitting random labels and random pixels on CIFAR-10.  37
2-4 Effects of implicit regularizers on generalization performance.  42
2-5 The network architecture of the Inception model adapted for CIFAR-10.  48
2-6 The training accuracy on the convex interpolated weights of three different (global) minimizers found by SGD via different random initializations. Each subfigure shows the interpolated plot on a different dataset with either true labels or random labels.  54
3-2 Illustrations of faults.  63
3-3 Workflow of a machine learning based fault detection system.  64
3-4 Example of a randomly generated velocity model with multiple faults.  65
3-5 Illustration of the benefits of the Wasserstein loss.  68
3-6 Visualization of a 3D velocity model, its fault location and our predictions.  69
3-7 A second visualization of a 3D velocity model, its fault location and our predictions.  70
3-8 Example of two images from two different categories in ImageNet.  73
3-9 Illustration of the Wasserstein loss with the lattice example.  74
3-10 The relaxed transport problem.  79
3-11 MNIST example for learning with a Wasserstein loss.  81
3-12 Top-K cost comparison of the proposed loss (Wasserstein) and the baseline (Divergence).  83
3-13 Trade-off between semantic smoothness and maximum likelihood.  84
3-14 Examples of images and tag predictions in the Flickr dataset.  84
3-15 Illustration of training samples on a 3x3 lattice with different noise levels.  94
3-16 Full figure for the MNIST example.  95
3-17 More examples of images and tag predictions in the Flickr dataset.  96
3-18 More examples of images and tag predictions in the Flickr dataset.  97
3-19 Illustration of the speech generation and recognition loop.  99
3-20 The word “park” spoken by different speakers.  99
3-21 Illustration of the approximation of the group orbits.  105
3-22 Illustration of simple-complex cell based network architecture.  106
3-23 Illustration of the phone classification experiment.  108
3-24 Phone classification error rates on TIMIT.  109
3-25 Illustration of VTL invariance based network architecture.  113
3-26 Training cost and dev set frame accuracy against the training iterations on the TIMIT dataset.  114
3-27 Performance of models trained on original and VTL-augmented TIMIT.  116
4-1 Illustration of the architecture and computation graph for MXNet.  127
4-2 Comparing MXNet with other systems for speed on a single forward-backward pass.  129
4-4 Training curves of googlenet in a distributed setting.  131
List of Tables

2.1 List of the number of parameters vs. the number of training examples in some common models used on two image classification datasets.  31
2.2 List of p/n (number of parameters over number of training samples) ratio and classification performance on a few over-parameterized architectures on CIFAR-10.  32
2.3 The training and test accuracy of various models on CIFAR-10.  40
2.4 The accuracy of the Inception v3 model on ImageNet.  49
2.5 Generalizing with kernels.  51
2.6 Results on fitting random labels on the CIFAR-10 dataset with weight decay and data augmentation.  52
3.2 Phone classification error rate using different invariance modules.  111
3.3 Frame classification accuracy.  115
1 Introduction
Machine learning covers a wide range of methods to extract useful
patterns from a dataset that is
usually assumed to be samples from an unknown probability
distribution. Supervised learning, one of the main setups in machine learning, is formulated as finding a function f such that the expected loss E_z[ℓ(f, z)] over the random variable z representing the data is minimized, where ℓ is a loss function. The
basic setting is that the distribution for the input random
variable z is unknown, but we have
access to a training set of n i.i.d. samples S = {z1, . . . , zn}.
As a result, the problem is usually
approached by solving the following Empirical Risk Minimization
(ERM) problem:

    min_{f∈F} (1/n) Σ_{i=1}^n ℓ(f, z_i)    (1.1)

One then provides certificates showing that a “reasonably good” solution to this problem is also “reasonably good” for the original (expected) loss. Here F is a
hypothesis space of candidate
functions that we choose prior to seeing the data. A trade-off between how well the best candidate in F can perform (the bias) and how close the solution found by ERM is to the optimal solution under the expected loss (the variance) determines the optimal choice of F. In practice, other concerns, such as which algorithm is used to solve (1.1), also affect the choice of F.
Before the wide adoption of deep learning, practitioners relied heavily on the nice properties of convexity in both optimization and statistical theory: they tended to choose F to be linear functions in some (predefined) feature space, and used convex surrogate losses to solve the ERM problems.
On the other hand, with deep neural networks, F consists of hypotheses defined through compositions of linear maps and component-wise non-linear activations. The optimization problem (1.1) becomes non-convex even with a convex loss ℓ.
While analyzing both the
optimization dynamics and the statistical behavior become harder in
this setting, empirically,
this formulation drastically improves the state-of-the-art
performance in real world problems
from various domains such as computer vision [Krizhevsky et al.,
2012, Russakovsky et al.,
2015, He et al., 2016, Taigman et al., 2014, Ren et al., 2015],
speech recognition [Hinton et al.,
2012b, Chan et al., 2016, Amodei et al., 2016], natural language
processing [Sutskever et al.,
2011, Bahdanau et al., 2015a, Sutskever et al., 2014], and
reinforcement learning [Mnih et al.,
2015, Silver et al., 2016], just to name a few.
The success of deep learning inspired a spectrum of different
research topics from theoretical
understanding, to applying and adapting deep neural networks to the
structures of specific tasks,
to building high performance deep learning systems. In this thesis,
we compile different projects
that the author has worked on at different stages to provide a
systematic view from the three
different perspectives.
Specifically, in Chapter 2, we will present our efforts on the theoretical aspects of understanding the generalization of deep learning. Chapter 3 consists of two applications of deep learning, with adaptations to the domain specific structures arising in each of the application scenarios. Chapter 4 is devoted to building high performance
deep learning systems. The
detailed problem setup and background will be provided in each
chapter, respectively. Finally,
we will conclude with a summary and a list of contributions in
Chapter 5.
1.1 Main contributions
We summarize the main contributions of this thesis in three parts. A full list of publications and software included in this thesis is given in Section 5.2.
Theory We studied the theoretical aspects of the generalization behavior of large scale deep neural networks, and revealed some puzzling behaviors in the regime of deep learning that make it hard to directly apply the conventional distribution free analysis from the statistical learning theory toolkits [Zhang et al., 2017a].
Application We applied deep learning to two different problems:
automatic geophysical fea-
ture detection [Zhang et al., 2014b, Dahlke et al., 2016,
Araya-Polo et al., 2017] and invariant
representation learning for speech and audio signals [Zhang et al.,
2014a,c, 2015b]. The deep
learning framework is adapted to the domain specific structures of
the data and task in those two problems. Besides successfully applying deep learning algorithms, we also gained new insights into the design of convolutional neural network architectures [Zhang et al., 2015b] and new algorithms for learning with output structures [Zhang et al., 2015a].
System We built efficient deep learning toolkits [Chen et al., 2015b, 2016] to support our own research, and also made them general purpose to support deep learning use cases in both academia and industry. Our open source project MXNet has an efficient asynchronous computation engine written in C++, built-in support for distributed computation, and a wide range of front end interface languages including Python, Julia, R, and Scala. It is widely recognized and adopted by the community: it was chosen as the main deep learning system for the Amazon AWS cloud computing platform, and is integrated into the NVIDIA GPU Cloud as one of the six major deep learning platforms.
2 Generalization in deep learning
Generalization is a fundamental notion in machine learning and a
central topic in statistical and
computational learning theory. In Section 2.1, we will formally
define the problem and provide
a basic introduction to the existing literature. In Section 2.2, by summarizing the existing literature of learning theoretical studies, we will see how the major theoretical questions in learning can be organized into three core components that study approximation error, optimization error and generalization error, respectively. Among those, we will show in Section 2.3 some examples of puzzles in our theoretical understanding that are unique to the regime of deep learning. Finally, in Section 2.4, we provide a systematic study of the generalization behaviors in the regime of deep learning. To conclude this chapter, we discuss future work and recent progress in Section 2.5.
2.1 Backgrounds and problem setup
Deep learning generally refers to machine learning techniques that use “deep” neural networks, i.e., neural networks with more than one hidden layer. In this thesis, we specifically focus on supervised learning, as formalized in Chapter 1, and we will primarily study feedforward neural networks.
A feedforward neural network is a parametric function f_θ^G : R^d → R^k, where d and k are the input and output dimensions, respectively. Here G is a directed acyclic graph (DAG) describing the architecture of the neural network, which maps from the input nodes to the output nodes. As illustrated by Figure 2-1a, the computation at a vertex v is defined as

    x_v = σ( Σ_{(u→v)∈G} θ_{u→v} x_u )    (2.1)

where σ : R → R is a non-linear activation function, typically chosen to be the sigmoid function σ(x) = 1/(1 + e^{−x}) or the Rectified Linear Unit (ReLU) σ(x) = max(x, 0). The parameters {θ_{u→v} : (u→v) ∈ G} associated with the edges are called the “weights” of the network.
In practice, the DAG is typically organized in a layerwise manner, as illustrated in Figure 2-1b. The nodes in each layer receive connections only from the previous layer. This kind of architecture can be easily described by compositions of linear transforms and componentwise nonlinear activations. For example, an L-layer feedforward neural network can be described in either of the two ways:

    x_{i,j} = σ_i( Σ_k w_{ijk} x_{i−1,k} ),  i = 1, . . . , L    (2.2)
    x_i = σ_i(W_i x_{i−1}),  i = 1, . . . , L    (2.3)

where the activation function σ_i(·) is applied componentwise for vector inputs. Now the network weights θ are the collection of linear transformation coefficients {w_{ijk}} or {W_i}, with x = x_0 being the input and y = x_L being the output. We use a layer index i for the activation function because, conventionally, the output layer does not use an activation function, or uses the identity activation.
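As a minimal sketch of the matrix form (2.3), the following plain-Python forward pass applies x_i = σ_i(W_i x_{i−1}) layer by layer; the dimensions, Gaussian weight initialization, and choice of ReLU hidden / identity output activations are illustrative assumptions.

```python
import random

# Layerwise forward computation x_i = sigma_i(W_i @ x_{i-1}), with x_0 the input.
# Shapes and initialization below are illustrative, not from the thesis.

def relu(v):
    return [max(x, 0.0) for x in v]

def identity(v):
    return list(v)

def matvec(W, x):
    # Plain matrix-vector product over nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(weights, activations, x):
    # Apply one (linear map, activation) pair per layer, i = 1..L.
    for W, sigma in zip(weights, activations):
        x = sigma(matvec(W, x))
    return x

random.seed(0)
d, h, k = 4, 8, 3  # input, hidden, and output dimensions
W1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(h)]
W2 = [[random.gauss(0, 1) for _ in range(h)] for _ in range(k)]
# The output layer uses the identity activation, as noted above.
y = forward([W1, W2], [relu, identity], [1.0, -0.5, 0.25, 2.0])
print(y)
```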
The architecture with a simple chain of densely connected layers is usually called a multi-layer perceptron (MLP). Each layer could also have its own structure. For example, a convolutional layer, commonly used in computer vision tasks, organizes its nodes as a spatial grid. Each node is only sparsely connected to the surrounding nodes from the previous layer, and shares its weights with the other nodes in the same layer. It is very common to take these basic layers (as opposed to nodes) as building blocks and design a DAG to represent the computation. Figure 2-1c shows the “GoogLeNet” or “Inception” architecture proposed by Szegedy et al. [2015] as an example. Modern network architectures can easily have hundreds to thousands of computational layers [He et al., 2016], with auxiliary components for normalization [Ioffe and Szegedy, 2015] and transformations [Jaderberg et al., 2015], recurrent connections with attention [Bahdanau et al., 2015b], gates [Chung et al., 2014] and differentiable memory [Graves et al., 2014], and numbers of parameters ranging from 10^5 to 10^12 [Shazeer et al., 2017].
Our goal is to study the practice of using deep neural networks (of a specific architecture) as the hypothesis space for solving supervised learning problems. Specifically, we assume an algorithm A is used to solve the ERM problem (1.1). Given a training set S = {z_1, . . . , z_n}, A outputs a hypothesis f_{A,S}. Without loss of generality, we assume the following minimizers are well defined¹:

    f* = argmin_f E_z[ℓ(f, z)]    (2.4)
    f_F = argmin_{f∈F} E_z[ℓ(f, z)]    (2.5)
    f_S = argmin_{f∈F} E_S[ℓ(f, z)]    (2.6)
For classification problems, f* is usually called the Bayes classifier, and achieves the minimum possible loss among all deterministic functions. E[ℓ(f*, z)] is called the Bayes error, representing the intrinsic difficulty of the problem, and is commonly used as the baseline performance for analyzing a learning algorithm. f_F is the optimal model in F for the expected loss, while f_S is the optimal model for the empirical loss on the training set S.
A common way of analyzing the loss of a learned model f_{A,S} is through the following bias-variance decomposition:

    E[ℓ(f_{A,S}, z)] = E[ℓ(f_{A,S}, z)] − E[ℓ(f_F, z)]    (estimation error / variance)
                     + E[ℓ(f_F, z)] − E[ℓ(f*, z)]    (approximation error / bias)
                     + E[ℓ(f*, z)]    (Bayes error)    (2.7)
The Bayes error is intrinsic to the problem and beyond our control. The approximation error describes how well F can approximate the optimal Bayes classifier f*. A larger F leads to smaller approximation error. On the other hand, the estimation error characterizes how well the algorithm A, given only the training set S as a proxy to the underlying unknown data distribution, can find the best model f_F in F. Due to the discrepancy between the empirical loss and the expected loss, a larger F usually makes this problem harder. As a result, there is a bias-variance trade-off in the size / complexity of F.
In statistical learning theory, the algorithm A is usually assumed to be the ERM estimator. In other words, f_{A,S} = f_S. However, in some cases, for example when ℓ(f, (x, y)) = I[f(x) ≠ y] is the binary classification loss, it is not computationally efficient to compute the ERM estimator. Even when the ERM problem (1.1) is convex, most existing methods can only guarantee an ε-approximate solution within a finite time budget. Therefore, we distinguish
¹For brevity of notation, we also assume the minimizers are unique. Most of the conclusions can be extended to non-unique solutions, and even to ε-approximate minimizers when no minimizer exists.
between f_{A,S} and f_S and make the following decomposition instead:

    E[ℓ(f_{A,S}, z)] = E[ℓ(f_{A,S}, z)] − E_S[ℓ(f_{A,S}, z)] + E_S[ℓ(f_{A,S}, z)] − E_S[ℓ(f_S, z)]
                     + E_S[ℓ(f_S, z)] − E[ℓ(f_F, z)] + E[ℓ(f_F, z)] − E[ℓ(f*, z)] + E[ℓ(f*, z)]
                   = E_S[ℓ(f_{A,S}, z)] − E_S[ℓ(f_S, z)]    (optimization error)
                     + E[ℓ(f_F, z)] − E[ℓ(f*, z)]    (approximation error / bias)
                     + E[ℓ(f*, z)]    (Bayes error)
                     + E[ℓ(f_{A,S}, z)] − E_S[ℓ(f_{A,S}, z)] + E_S[ℓ(f_S, z)] − E[ℓ(f_F, z)]    (2.8)

where the last four terms can be bounded via

    E[ℓ(f_{A,S}, z)] − E_S[ℓ(f_{A,S}, z)] + E_S[ℓ(f_S, z)] − E[ℓ(f_F, z)]
    ≤ E[ℓ(f_{A,S}, z)] − E_S[ℓ(f_{A,S}, z)] + E_S[ℓ(f_F, z)] − E[ℓ(f_F, z)]
    ≤ 2 sup_{f∈F} |E[ℓ(f, z)] − E_S[ℓ(f, z)]|    (generalization error)    (2.9)
The generalization error, sometimes also referred to as the generalization gap, is determined by the difference between the expected loss and the empirical loss. It is easy to see that a larger F will increase the generalization error, and the trade-off in choosing F is now between the approximation error and the generalization error.
Furthermore, the optimization error, which characterizes how well we can solve the ERM problem (1.1), can also depend on F in complicated ways. For convex optimization, the convergence is well understood [Bubeck, 2015]. On the other hand, non-convex optimization in machine learning is a very active research topic today [Zheng and Lafferty, 2015, Sun et al., 2015, Montanari, 2016, Raginsky et al., 2017, Zhang et al., 2017b], and for the most general case of training large neural networks, it is still not very clear why we are empirically so successful at solving those non-convex problems.
Note that upper bounding by the supremum over f ∈ F seems very loose. However, since both f_{A,S} and E_S depend on the data, one needs some way to decouple the dependency for the ease of analysis. Usually this upper bound is already good enough to provide useful sufficient conditions for learnability. In the distribution free setting, where the learning algorithm is required to work well for all possible data distributions, it can be shown that this upper bound is actually tight.
Various ways to control (2.9) usually reduce to bounds of order O(d/n) to O(√(d/n)), where d is some complexity measure of the hypothesis space F. A high level idea of the
general argument used here is: for any given f, by the law of large numbers, under appropriate conditions such as boundedness, one can show that the empirical average E_S[ℓ(f, z)] converges to the expectation E_z[ℓ(f, z)] with high probability. However, to deal with the supremum over f ∈ F, one needs to make sure every f ∈ F converges uniformly. When F is finite, this can be controlled by a simple union bound, which leads to a uniform convergence bound with a multiplier log |F|. When |F| = ∞, this bound becomes trivial. In this case, one needs to employ a more clever union bound where “similar” functions in F are grouped together and “counted” once. The exact notion of “similarity” depends on what kind of structures of F are available. For example, if F is a metric space with an appropriate metric, one can treat functions within an ε metric ball as “similar” and use the covering number of F to represent the complexity [Cucker and Smale, 2002]. Another example is in binary classification: when evaluated on a given finite dataset S = (z_1, . . . , z_n) of n points, there are at most 2^n possible unique outcomes:

    |F_S| = |{(f(z_1), . . . , f(z_n)) : f ∈ F}| ≤ 2^n

as many functions are mapped to the same n-dimensional binary vector. We get |F_S| < ∞, but O(√(log |F_S| / n)) = O(1) is still not good enough for a useful generalization bound.
When some structures in F are known that allow us to argue about the combinatorial properties of |F_S|, so that it grows more slowly than 2^n, a meaningful bound can be obtained. The notion of Vapnik-Chervonenkis (VC) dimension allows us to bound |F_S| by O(n^d) for n > d. Therefore, if F has VC dimension d, then the generalization bound can be reduced to O(√(d log n / n)). Please refer to books such as Kearns and Vazirani [1994], Mohri et al. [2012], and Shalev-Shwartz and Ben-David [2014] for detailed developments of the upper and lower bounds for the generalization error.
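To make the growth of |F_S| concrete, the following sketch enumerates the labelings induced on a fixed sample by one-dimensional threshold classifiers, a class of VC dimension 1; on n distinct points it realizes exactly n + 1 of the 2^n possible labelings. The sample points are an illustrative choice.

```python
# Count the distinct labelings |F_S| that threshold classifiers
# f_t(x) = 1[x >= t] induce on a fixed sample of n distinct points.

def labelings_by_thresholds(points):
    pts = sorted(points)
    seen = set()
    # Sweep one threshold below all points, then one just above each point.
    candidates = [pts[0] - 1.0] + [p + 1e-9 for p in pts]
    for t in candidates:
        seen.add(tuple(int(x >= t) for x in points))
    return seen

points = [0.1, 0.4, 0.7, 0.9, 1.3]
F_S = labelings_by_thresholds(points)
print(len(F_S))  # n + 1 = 6, far below 2^n = 32
```

The gap between n + 1 and 2^n is exactly what makes the refined union bound over |F_S|, rather than over all of F, useful.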
For completeness, we provide a brief overview of the standard
arguments through the
Rademacher complexity in the following subsection.
2.1.1 Rademacher complexity and generalization error
For simpler notation, let us define the loss class L = {z ↦ ℓ(f, z) : f ∈ F}. The generalization error we want to bound is

    sup_{l∈L} |E_S[l(z)] − E_z[l(z)]|    (2.10)
where we now explicitly state that the empirical average E_S is with respect to the sample set S, and the expectation E_z is with respect to the distribution of the random variable z. The object (2.10) that we want to control is a random variable, since the training set S is random. Typical ways to bound a random variable are to control its expectation and to provide a high probability bound. In this case, when the loss function is bounded, a high probability bound can be derived from a bound in expectation by directly applying McDiarmid's inequality. Therefore, in the following, we only show how to control the expected generalization error.
Because the absolute value is a bit tedious to deal with, we will show how to bound

    sup_{l∈L} E_S[l(z)] − E_z[l(z)]    (2.11)

instead. It will be clear that the same procedure can be repeated to bound

    sup_{l∈L} E_z[l(z)] − E_S[l(z)]

The two bounds can then be put together to get a high probability bound for the generalization gap with the absolute value function. More specifically, we hope to bound the following expectation:

    E_S[ sup_{l∈L} E_S[l(z)] − E_z[l(z)] ]    (2.12)
The notation is a bit overloaded here, as the outer E_S and the inner E_S mean completely different things. The former means expectation with respect to (the distribution of) the random variable S, while the latter means an empirical average with respect to the uniform distribution over z ∈ S, where S just happens to be a random variable in this case.
Introducing a “ghost sample” S̃ = {z̃_1, . . . , z̃_n} of n i.i.d. points from the same distribution, we can write E_z[l(z)] = E_{S̃}[E_{S̃}[l(z)]], so that, by Jensen's inequality,

    E_S[ sup_{l∈L} E_S[l(z)] − E_z[l(z)] ] ≤ E_{S,S̃}[ sup_{l∈L} (1/n) Σ_{i=1}^n (l(z_i) − l(z̃_i)) ]

Note that for all i, z_i and z̃_i are i.i.d. samples from the same distribution; therefore, with the expectation with respect to S and S̃ outside, we can freely exchange l(z_i) − l(z̃_i) and l(z̃_i) − l(z_i) inside. As a result, we can introduce Rademacher random variables σ_1, . . . , σ_n, which are independent and uniformly distributed on {±1}:

    E_{S,S̃}[ sup_{l∈L} (1/n) Σ_i (l(z_i) − l(z̃_i)) ] = E_{S,S̃,σ}[ sup_{l∈L} (1/n) Σ_i σ_i (l(z_i) − l(z̃_i)) ]
    ≤ 2 E_{S,σ}[ sup_{l∈L} (1/n) Σ_i σ_i l(z_i) ]

where the last step is because the ghost sample S̃ is independent of, and identically distributed as, S. We define the empirical Rademacher complexity and the Rademacher complexity, respectively, as

    R_S(L) = E_σ[ sup_{l∈L} (1/n) Σ_{i=1}^n σ_i l(z_i) ]    (2.13)
    R(L) = E_S[R_S(L)]    (2.14)
Combining the steps above, we get the following theorem.
Theorem 1. Let L be a loss class of functions l : Z → R. Then the expected generalization gap can be bounded as

    E_S[ sup_{l∈L} E_S[l(z)] − E_z[l(z)] ] ≤ 2R(L)    (2.15)
In order to connect the Rademacher complexity of the loss class L with the original hypothesis space, we can use Talagrand's contraction lemma.

Lemma 1 (Talagrand's Contraction Lemma). If φ : R → R is an L-Lipschitz function, then for all S, R_S(φ ∘ F) ≤ L·R_S(F).

Corollary 1. Let F be a hypothesis space of functions f : X → R and let ℓ(f, (x, y)) = ℓ(f(x), y) be a loss function such that ℓ(·, y) is L-Lipschitz for all y ∈ Y. Then

    E_S[ sup_{f∈F} E_S[ℓ(f, z)] − E_z[ℓ(f, z)] ] ≤ 2L·R(F)    (2.16)
Remark. 1) We cannot directly apply Lemma 1 to prove this corollary, because $\mathcal{L}$ is not simply $\varphi \circ \mathcal{F}$, due to the fact that $\varphi$ also takes the label $y$ as an input. But the proof of Lemma 1 can be adapted to show this specific variant. 2) In practice, some loss functions are Lipschitz (e.g. the hinge loss), but some are not (e.g. the square loss and the cross entropy loss). An easy workaround is to assume boundedness of $\mathcal{X}$ and $\mathcal{Y}$; then the loss is Lipschitz on this bounded domain. Alternatively, generalization bounds can also be shown without those assumptions, via more involved analysis such as Balázs et al. [2016].
Intuitively, the Rademacher complexity measures the complexity of $\mathcal{F}$ by assessing how well the functions in $\mathcal{F}$ can fit arbitrary binary label assignments — here the Rademacher random variables $\sigma_1, \ldots, \sigma_n$ can be thought of as "pseudo labels".
In binary classification problems, the Rademacher complexity of $\mathcal{F}$ can be bounded by $O(\sqrt{d \log n / n})$ using the corresponding VC dimension $d$ with the help of Massart's lemma. See, for example, Mohri et al. [2012, Chapter 3] for more details on this. In many cases, the Rademacher complexity can be directly estimated.
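Such a direct estimate is easy to set up for simple hypothesis classes. For a linear class with a bounded weight norm (an illustrative choice of ours, not one used in this chapter), the supremum in the definition (2.13) has a closed form, so the empirical Rademacher complexity can be estimated by Monte Carlo sampling over the sign variables:

```python
import numpy as np

def empirical_rademacher_linear(X, B, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    the linear class {x -> <w, x> : ||w||_2 <= B} on the sample X.

    For this class the supremum over w has a closed form:
        sup_{||w|| <= B} (1/n) sum_i sigma_i <w, x_i>
            = (B/n) * || sum_i sigma_i x_i ||_2,
    so we only need to average this quantity over random sign draws."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))  # Rademacher draws
    sums = sigma @ X                        # each row is sum_i sigma_i x_i
    return (B / n) * np.linalg.norm(sums, axis=1).mean()

# Example: 200 points in 10 dimensions, norm bound B = 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
est = empirical_rademacher_linear(X, B=1.0)

# Sanity check against the classical bound B * max_i ||x_i|| / sqrt(n).
bound = np.linalg.norm(X, axis=1).max() / np.sqrt(200)
print(est, bound)  # the estimate should fall below the bound
```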
2.2 Related work on the three major theoretical questions in
deep learning
As introduced in Section 2.1, we can roughly summarize the main
learning theoretical questions
of deep learning in three parts:
Approximation error What kind of functions can deep neural networks
approximate? Are
deep neural networks better than shallow (but wide) ones?
Optimization error What does the landscape of the ERM loss surface
in deep learning look
like? Are there saddle points, local minimizers and global
minimizers and how many of
them? Is SGD guaranteed to converge? If it is, then what does it
converge to and how
fast is the convergence?
Generalization error How well can deep neural networks generalize?
Is it generalizing because
the size of F is controlled? Is there any way to improve the
generalization performance?
Note that the three parts are not completely separated. For example, the choice of $\mathcal{F}$ (the architecture of neural networks to use) inevitably affects all three components. However, we find it convenient to organize the literature in this way.
2.2.1 Approximation error
A good overview of the study of approximation in general machine
learning is by Cucker and
Zhou [2007]. In general, it is known that the Reproducing Kernel
Hilbert Spaces (RKHSs)
induced by certain kernels (e.g. the Gaussian RBF kernels) are
universal approximators in the
sense that they can uniformly approximate arbitrary continuous
functions on a compact set
[Steinwart, 2001, 2002, Micchelli et al., 2006].
Similar universal approximation theorems for feedforward neural networks have also been known since the early 90s [Gybenko, 1989, Hornik, 1991]. Given a sufficient number of units, the universal approximation theorems can be proved for neural networks with only one hidden layer. See Anthony and Bartlett [2009] for a nice overview of early work on this topic. However, the recent empirical success of deep neural networks inspired people to study the approximation or representation power of neural networks with more layers.
Much recent progress shows that to approximate certain types of functions (e.g. functions that are hierarchical and compositional), shallow networks need exponentially more units than their deep counterparts [Pinkus, 1999, Delalleau and Bengio, 2011, Montufar et al., 2014, Telgarsky, 2016, Shaham et al., 2015, Eldan and Shamir, 2015, Mhaskar and Poggio, 2016, Mhaskar et al., 2017, Rolnick and Tegmark, 2017].
On the other hand, people have also conducted empirical studies to compare deep and shallow networks. Surprisingly, it is found that with more careful training, shallow networks with the same number of units can be trained to approximate well the functions represented by deep networks [Ba and Caruana, 2014, Urban et al., 2017]. More specifically, the shallow networks are trained to directly predict the real valued outputs of their deep counterparts (as opposed to discrete classification labels). In practice, depth alone generally does not generate big performance gaps between variants of neural networks, but more specific structures (e.g. convolutional vs. non-convolutional) do [Urban et al., 2017]. But in this case, the analysis is not only about representation power, but is also coupled with optimization and generalization.
Apart from explicitly choosing $\mathcal{F}$, it is also very common to apply regularizers to implicitly change the "effective" $\mathcal{F}$. The intuition is to learn with a large $\mathcal{F}$, but modify the ERM objective with a regularization term to prefer a "simpler subset of $\mathcal{F}$". Usually a regularizer in the objective is equivalent to applying some hard constraints to $\mathcal{F}$, but the former might be easier to optimize. Many general regularization techniques such as weight decay, data augmentation and early stopping continue to be used in deep learning. But there are also regularization techniques designed specifically for deep neural networks; notable ones include dropout [Hinton et al., 2012c] and stochastic depth [Huang et al., 2016].
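Dropout, for instance, randomly zeroes each unit during training. Below is a minimal numpy sketch of the commonly used "inverted" dropout variant (the rescaling convention here is a standard implementation choice, not taken from the cited papers):

```python
import numpy as np

def dropout(activations, drop_prob, rng, train=True):
    """Inverted dropout: during training, zero each unit independently
    with probability drop_prob and rescale the survivors by
    1 / (1 - drop_prob) so the expected activation is unchanged.
    At test time the layer is the identity."""
    if not train or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones(10000)
out = dropout(h, drop_prob=0.5, rng=rng)
# Roughly half the units are zeroed, but the mean stays near 1.
print(out.mean())
```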
2.2.2 Optimization error
It is true that learning boils down to an optimization problem as shown in (1.1) in the end. However, learning problems have their own unique properties and structures that bias the study of optimization algorithms in this domain. For example, while many fast second order optimization algorithms with quadratic convergence exist [Boyd and Vandenberghe, 2004, Bubeck, 2015], first order methods are usually preferred in machine learning, not only because of their much cheaper computation and better scalability to large datasets, but also because the generalization error is generally no better than $O(1/n) \sim O(\sqrt{1/n})$, so there is no need to use optimization algorithms that converge faster than that². Similarly, the fact that many learning problems consist of a smooth loss function and a non-smooth regularizer inspired the study of proximal gradient methods [Parikh et al., 2014], and the ERM problem (1.1) being a summation of many terms attracts many researchers to study topics on stochastic gradient methods [Bottou et al., 2016] such as adaptive learning rates [Duchi et al., 2011, Kingma and Ba, 2014] and asynchronous & distributed optimization [Recht et al., 2011, Duchi et al., 2015].
While most of the earlier work on optimization focused on convex optimization, in recent years there has been increasing interest in non-convex problems, mostly inspired by the empirical success of "naive" gradient descent based algorithms applied to problems such as matrix completion that were usually solved via convex relaxation to Semidefinite Programs (SDPs). With some new analysis techniques, people are able to prove convergence for many non-convex problems [Zheng and Lafferty, 2015, Sun et al., 2015, Montanari, 2016, Raginsky et al., 2017, Zhang et al., 2017b].
For general non-convex objective functions, a growing body of recent work on the convergence behavior of Stochastic Gradient Descent (SGD) [Lee et al., 2016], the main optimization algorithm used in deep learning, and its variants [Raginsky et al., 2017, Zhang et al., 2017b, Chaudhari et al., 2017] can be found in the literature.
Studies that directly focus on the non-convex optimization problems in deep learning are also emerging, although many of the current results rely on some unrealistic assumptions. For example, Choromanska et al. [2015] made a connection between neural networks and spin glass models and showed that the number of bad local minima diminishes exponentially with the size of the network. However, for the analogy to go through, a number of strong independence assumptions need to be made. Kawaguchi [2016] improved the results by relaxing a number of assumptions, although the remaining assumptions are still considered unrealistic for any trained neural networks.
²In the optimization literature, "linear" and "quadratic" convergence actually refer to $O(\exp(-n))$ and $O(\exp(-2^n))$ rates, respectively. So an $O(1/n)$ rate is "sublinear" convergence.
Although characterizing the most general optimization problems in deep learning remains challenging, progress has been made in various special cases. For example, in the case of linear networks, where all the activation functions are identity maps, people have developed a good understanding of both the landscape [Kawaguchi, 2016, Hardt and Ma, 2017] and the optimization dynamics [Saxe et al., 2014]. Soudry and Carmon [2016] use a smoothed analysis technique to prove that for an MLP with one hidden layer and piecewise linear activation functions, with the quadratic loss, every differentiable local minimum has zero training error. Brutzkus and Globerson [2017] and Tian [2017] both studied a ReLU network with one hidden layer and fixed top layer weights, with or without a convolutional structure, respectively, and proved global optimality for random Gaussian inputs. Freeman and Bruna [2017] also looked at single hidden layer half-rectified networks and analyzed the interplay between data smoothness and model over-parameterization.
Various empirical inspections have also been applied to analyze the optimization behavior of real world deep learning problems. For example, Goodfellow and Vinyals [2015] found that the objective function evaluated along the linear interpolation between two different solutions, or between a random starting point and a solution, demonstrates highly consistent and regular behavior. Sagun et al. [2016] studied the singularity of Hessians in deep learning problems. Poggio and Liao [2017] inspected the degenerate minimizers in over-parameterized models.
2.2.3 Generalization error
Hardt et al. [2016] used uniform stability to provide
generalization bounds that are independent
of the underlying hypothesis space for general deep learning
algorithms that can be trained
quickly. However, their bounds for the general non-convex objective
are a bit loose and cannot
be directly applied to the case when the networks are trained for
many epochs. Keskar et al.
[2017] compared SGD training with small and large mini-batches and
found that large batch
sizes tend to find “sharp” minimizers that generalize worse than
“flat” minimizers. On the
other hand, Dinh et al. [2017] show that due to non-negative
homogeneity of ReLUs, “flat”
minimizers can be warped into equivalent “sharp” ones, which still
generalize well. Moreover,
Hoffer et al. [2017] investigated different phases during training
and proposed an algorithm
that could achieve good generalization performance even for large
batch training.
In a different line of research, Neyshabur et al. [2015b] proposed the notion of path norm to control the complexity of neural network hypothesis spaces. The parameterization of deep neural networks is known to produce many equivalent weight assignments: for example, one can freely re-order the nodes in a layer; or, if the ReLU is used as the activation function, one can freely scale the weights in one layer by a scalar $a$ and a consecutive layer by $1/a$ without changing the function defined by the network. The path norm is designed to be invariant to this kind of equivalent re-parameterization. In a follow-up paper, the path norm is used as a regularizer in training neural networks [Neyshabur et al., 2015a].
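The rescaling invariance, and the invariance of a path-based norm, can be checked numerically on a tiny two-layer ReLU network. Here the squared path norm is taken as the sum over all input-to-output paths of the product of squared weights along the path; this is a simplified stand-in for the path norm of Neyshabur et al. [2015b], and the network sizes are arbitrary:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, W1, w2):
    # Two-layer ReLU network with scalar output.
    return w2 @ relu(W1 @ x)

def sq_path_norm(W1, w2):
    # Sum over all input -> hidden -> output paths of the product of
    # squared weights along the path.
    return float(np.sum((w2 ** 2)[:, None] * W1 ** 2))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 5))
w2 = rng.normal(size=(8,))
x = rng.normal(size=(5,))

a = 3.0                      # scale layer 1 up, layer 2 down
W1s, w2s = a * W1, w2 / a    # relu(a*z) = a*relu(z) for a > 0

print(forward(x, W1, w2), forward(x, W1s, w2s))      # same function
print(sq_path_norm(W1, w2), sq_path_norm(W1s, w2s))  # same path norm
print(np.sum(W1 ** 2) + np.sum(w2 ** 2),
      np.sum(W1s ** 2) + np.sum(w2s ** 2))           # plain l2 norm differs
```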
2.3 Some puzzles in understanding deep learning
Although a lot of progress has been made in recent years on the theoretical understanding of deep learning, there are still many observations that are not rigorously understood. Some of them might be due to trivial artifacts, while some of them might have deeper reasons. We give a few examples here to illustrate the current state of understanding in deep learning.
First of all, although tremendous empirical success of deep neural networks in various fields has been reported in the literature, and, as we surveyed in the previous section, people have studied the benefits of depth from various theoretical points of view, it is still not very clear when deep neural networks would work, or work better than traditional methods. For example, in recommender systems, people have tried to combine deep models with traditional shallow ones in order to achieve good performance [Cheng et al., 2016].
Moreover, the interplay between representation power and optimization is also not very well understood. The current common practice seems to be adding more specialized structures such as attention [Bahdanau et al., 2015b] and external memory [Graves et al., 2014] into the network architectures. Intuitively, this introduces more specialized inductive biases that could potentially be beneficial for solving the particular problems considered. However, it is not clear at all whether those make the optimization problem easier or harder. In many cases, it seems the usual optimization algorithm (i.e. SGD) still works very well. Another line of research that follows a similar philosophy is to unroll iterative approximate inference algorithms for Markov Random Fields (MRFs) as layers of neural networks, and perform end-to-end training for structured output learning problems. See, for example, Zheng et al. [2015] and references therein for more details.
Despite being extremely simple, SGD seems to be good at navigating the very complicated non-convex landscapes arising from those highly structured and deep neural networks. For example, although various proofs [Zhang et al., 2017a, Poggio and Liao, 2017] exist showing that, for a given finite dataset, over-parameterized deep networks can encode arbitrary input-output maps perfectly, it is not clear why SGD could easily find the right set of parameters. Moreover, it is not because the maps in image classification are "nice" (e.g. hierarchical and compositional): as we have empirically shown in Zhang et al. [2017a], SGD can even fit arbitrary mappings with random labels, despite being only a first-order algorithm that is known to suffer from slow convergence, potential flat plateaus around saddle points [Dauphin et al., 2014] and local minima.
Figure 2-2 shows some examples of learning curves for training Convolutional Neural Networks (ConvNets) on CIFAR-10. As we can see, measured in both the cross-entropy loss (the objective function typically used for classification tasks in deep learning) and the classification accuracy, the network quickly reaches the global optimum during training. There are also some other interesting observations: when weight decay is not used, the validation loss goes up at some point, as shown in Figure 2-2a, which is what is expected as a result of overfitting. However, if we look at the classification accuracy in Figure 2-2b, the validation curve does not show a sign of overfitting. Looking at the bottom row of the figure, we find that an abrupt drop in the loss can be observed at epoch 150, when the learning rate is decreased from 0.1 to 0.01. This kind of behavior is quite common in the training of ConvNets. But it is hard to imagine what has happened at that moment in a landscape of thousands to millions of dimensions.
Please note that it does not come for free that optimization in deep learning is easy. Usually, easy convergence to global minimizers is observed in problems with high dimensional inputs and largely over-parameterized architectures³. Various normalization [Ioffe and Szegedy, 2015] and initialization [Glorot and Bengio, 2010] techniques and special architecture design patterns

³For example, CIFAR-10 has 50,000 training examples of 32 × 32 × 3 = 3,072 input dimensions. Typical ConvNets trained for this dataset have 10⁵ ∼ 10⁶ parameters. Problems on ImageNet are of even larger scales.
Figure 2-2: Example training curves of a wide ResNet (depth=28, widen factor=1) on the CIFAR-10 dataset. (a) Cross entropy loss, w/o weight decay; (b) classification accuracy, w/o weight decay; (c) cross entropy loss, weight decay λ = 10⁻⁴; (d) classification accuracy, weight decay λ = 10⁻⁴. SGD with momentum 0.9 is used, with a learning rate of 0.1 at the beginning, 0.01 after epoch 150 and 0.001 after epoch 225.
Table 2.1: Number of parameters vs. number of training examples for some common models on two image classification datasets.

CIFAR-10 (number of training points: 50,000)
  Inception: 1,649,402
  Alexnet: 1,387,786
  MLP 1×512: 1,209,866

ImageNet (number of training points: ∼1,200,000)
  Inception V4: 42,681,353
  Alexnet: 61,100,840
  Resnet-18: 11,689,512; Resnet-152: 60,192,808
  VGG-11: 132,863,336; VGG-19: 143,667,240
[He et al., 2016, Srivastava et al., 2015] are proposed to make
training those very deep networks
possible.
On the other hand, under cryptographic assumptions, probably approximately correct (PAC) learning [Kearns and Vazirani, 1994] of intersections of half spaces is hard in the worst case [Klivans and Sherstov, 2009]. Since neural networks can encode intersections of half spaces, this implies cryptographic hardness of learning neural networks.
2.4 Understanding deep learning requires rethinking generalization
In this section, we summarize our study of the generalization behavior in the regime of deep learning, published in Zhang et al. [2017a]. The motivation of this study is that, as introduced in Section 2.1, the generalization bound is a crucial part of understanding the performance of a learning algorithm. The classical generalization bounds are typically $O(\sqrt{d/n})$, where $d$ is the VC dimension of the hypothesis space. For linear classifiers, the VC dimension coincides with the input dimension, which is also the number of parameters. In neural networks, the VC dimension can also be related to the number of parameters. For example, for standard sigmoid networks, $d = O(p^2)$, where $p$ is the number of parameters [Anthony and Bartlett, 2009, Theorem 8.13].
On the other hand, if we look at the typical applications — here we focus on image classification problems — the number of parameters $p$ in the neural network architectures people use is usually one or two orders of magnitude larger than the number of training examples $n$.

Table 2.2: The $p/n$ ratio (number of parameters over number of training samples) and classification performance of a few over-parameterized architectures on CIFAR-10.

  Architecture    p/n ratio    Training accuracy    Test accuracy
  MLP 1×512       24           100%                 51.51%
  Alexnet         28           100%                 76.07%
  Inception       33           100%                 85.75%
  Wide Resnet     179          100%                 88.21%
Table 2.1 shows some statistics on CIFAR-10 and ImageNet. This, however, creates a puzzle in our understanding of the generalization of deep learning, as the bound of $O(\sqrt{d/n})$ is only useful when $n$ is much larger than $d$. With the number of parameters greatly exceeding the number of training points, these kinds of bounds are no longer informative. Yet in practice, we have observed great empirical success for those huge deep learning models. Moreover, in this over-parameterized regime, sometimes when the $p/n$ ratio increases, the test performance even improves, as shown in Table 2.2, despite the increasing variance. The mysterious gap between theory and practice is the main topic of this study.
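Plugging the numbers from Table 2.1 into the bound makes the mismatch concrete. Even under the optimistic assumption $d \approx p$ (for sigmoid networks $d$ can be as large as $O(p^2)$), the bound far exceeds the trivial bound of 1 on the classification error:

```python
# Plug the Inception-on-CIFAR-10 numbers from Table 2.1 into the
# sqrt(d / n) bound, optimistically taking the VC dimension d to be p.
p = 1_649_402   # number of parameters
n = 50_000      # number of training examples

bound = (p / n) ** 0.5
print(bound)    # ≈ 5.74, vacuous: the classification error is at most 1
```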
2.4.1 Our contributions
In this work, we problematize the traditional view of
generalization by showing that it is inca-
pable of distinguishing between different neural networks that have
radically different general-
ization performance.
Randomization tests. At the heart of our methodology is a variant
of the well-known ran-
domization test from non-parametric statistics [Edgington and
Onghena, 2007]. In a first set
of experiments, we train several standard architectures on a copy
of the data where the true
labels were replaced by random labels. Our central finding can be
summarized as:
Deep neural networks easily fit random labels.
More precisely, when trained on a completely random labeling of the
true data, neural net-
works achieve 0 training error. The test error, of course, is no
better than random chance as
there is no correlation between the training labels and the test
labels. In other words, by ran-
domizing labels alone we can force the generalization error of a
model to jump up considerably
without changing the model, its size, hyperparameters, or the
optimizer. We establish this fact
for several different standard architectures trained on the
CIFAR-10 and ImageNet classifica-
tion benchmarks. While simple to state, this observation has
profound implications from a
statistical learning perspective:
1. The effective capacity of neural networks is sufficient for
memorizing the entire data set.
2. Even optimization on random labels remains easy. In fact,
training time increases only
by a small constant factor compared with training on the true
labels.
3. Randomizing labels is solely a data transformation, leaving all
other properties of the
learning problem unchanged.
Extending this first set of experiments, we also replace the true images with completely random pixels (e.g., Gaussian noise) and observe that convolutional neural networks continue to fit the data with zero training error. This shows that, despite their structure, convolutional neural nets can fit random noise. We furthermore vary the amount of randomization, interpolating smoothly between the case of no noise and complete noise. This leads to a range of intermediate learning problems where there remains some level of signal in the labels. We observe a steady deterioration of the generalization error as we increase the noise level. This shows that neural networks are able to capture the remaining signal in the data, while at the same time fitting the noisy part using brute force.
We discuss in further detail below how these observations rule out
all of VC-dimension,
Rademacher complexity, and uniform stability as possible
explanations for the generalization
performance of state-of-the-art neural networks.
The role of explicit regularization. If the model architecture
itself isn’t a sufficient regular-
izer, it remains to see how much explicit regularization helps. We
show that explicit forms
of regularization, such as weight decay, dropout, and data
augmentation, do not adequately
explain the generalization error of neural networks. Put
differently:
Explicit regularization may improve generalization performance, but
is neither necessary nor
by itself sufficient for controlling generalization error.
In contrast with classical convex empirical risk minimization, where explicit regularization is necessary to rule out trivial solutions, we found that regularization plays a rather different role in deep learning. It appears to be more of a tuning parameter that often helps improve the final test error of a model, but the absence of all regularization does not necessarily imply poor generalization error. As reported by Krizhevsky et al. [2012], $\ell_2$-regularization (weight decay) sometimes even helps optimization, illustrating its poorly understood nature in deep learning.
Finite sample expressivity. We complement our empirical
observations with a theoretical
construction showing that generically large neural networks can
express any labeling of the train-
ing data. More formally, we exhibit a very simple two-layer ReLU
network with p = 2n + d
parameters that can express any labeling of any sample of size n in
d dimensions. A previ-
ous construction due to Livni et al. [2014] achieved a similar
result with far more parameters,
namely, O(dn). While our depth 2 network inevitably has large
width, we can also come up
with a depth k network in which each layer has only O(n/k)
parameters.
While prior expressivity results focused on what functions neural
nets can represent over the
entire domain, we focus instead on the expressivity of neural nets
with regards to a finite sample.
In contrast to existing depth separations [Delalleau and Bengio,
2011, Eldan and Shamir, 2015,
Telgarsky, 2016, Cohen and Shashua, 2016] in function space, our
result shows that even depth-
2 networks of linear size can already represent any labeling of the
training data.
The role of implicit regularization. While explicit regularizers
like dropout and weight-
decay may not be essential for generalization, it is certainly the
case that not all models that
fit the training data well generalize well. Indeed, in neural
networks, we almost always choose
our model as the output of running stochastic gradient descent.
Appealing to linear models, we
analyze how SGD acts as an implicit regularizer. For linear models,
SGD always converges to
a solution with small norm. Hence, the algorithm itself is
implicitly regularizing the solution.
Indeed, we show on small data sets that even Gaussian kernel
methods can generalize well with
no regularization. Though this doesn't explain why certain architectures generalize better than others, it does suggest that more investigation is needed to understand exactly what properties are inherited by models trained using SGD.
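For linear models this claim can be checked directly: on an underdetermined least-squares problem, SGD initialized at zero keeps its iterates in the row span of the data, and therefore converges to the minimum-norm interpolating solution (the pseudoinverse solution). A small numpy sketch with arbitrary dimensions of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                        # fewer samples than dimensions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# SGD on the squared loss, initialized at zero: each update adds a
# multiple of some row x_i, so w stays in the row space of X.
w = np.zeros(d)
lr = 0.01
for _ in range(5000):               # many passes over the data
    for i in range(n):
        w -= lr * (w @ X[i] - y[i]) * X[i]

w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolating solution
print(np.abs(X @ w - y).max())      # ~0: zero training error
print(np.abs(w - w_min_norm).max()) # ~0: SGD found the min-norm solution
```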
2.4.2 Related work
Hardt et al. [2016] give an upper bound on the generalization error
of a model trained with
stochastic gradient descent in terms of the number of steps
gradient descent took. Their analysis
goes through the notion of uniform stability [Bousquet and
Elisseeff, 2002]. As we point out
in this work, uniform stability of a learning algorithm is
independent of the labeling of the
training data. Hence, the concept is not strong enough to
distinguish between the models
trained on the true labels (small generalization error) and models
trained on random labels
(high generalization error). This also highlights why the analysis of Hardt et al. [2016] for non-convex optimization was rather pessimistic, allowing only very few passes over the data. Our results show empirically that training neural networks is not uniformly stable for many passes over the data. Consequently, a weaker stability notion is necessary to make further progress along this direction.
There has been much work on the representational power of neural
networks, starting from
universal approximation theorems for multi-layer perceptrons
[Gybenko, 1989, Mhaskar, 1993,
Delalleau and Bengio, 2011, Mhaskar and Poggio, 2016, Eldan and
Shamir, 2015, Telgarsky,
2016, Cohen and Shashua, 2016]. All of these results are at the
population level characterizing
which mathematical functions certain families of neural networks
can express over the entire
domain. We instead study the representational power of neural
networks for a finite sample
of size n. This leads to a very simple proof that even O(n)-sized
two-layer perceptrons have
universal finite-sample expressivity.
Bartlett [1998] proved bounds on the fat-shattering dimension of multilayer perceptrons with sigmoid activations in terms of the $\ell_1$-norm of the weights at each node. This important result gives a generalization bound for neural nets that is independent of the network size. However, for ReLU networks the $\ell_1$-norm is no longer informative. This leads to the question
of whether there is a different form of capacity control that
bounds generalization error for large
neural nets. This question was raised in a thought-provoking work
by Neyshabur et al. [2014],
who argued through experiments that network size is not the main
form of capacity control
for neural networks. An analogy to matrix factorization illustrated
the importance of implicit
regularization.
2.4.3 Effective capacity of neural networks
Our goal is to understand the effective model capacity of
feed-forward neural networks. Toward
this goal, we choose a methodology inspired by non-parametric
randomization tests. Specifi-
cally, we take a candidate architecture and train it both on the
true data and on a copy of
the data in which the true labels were replaced by random labels.
In the second case, there is
no longer any relationship between the instances and the class
labels. As a result, learning is
impossible. Intuition suggests that this impossibility should manifest itself clearly during training, e.g., by training not converging or slowing down substantially. To our surprise, several properties of the training process for multiple standard architectures are largely unaffected by this transformation of the labels. This poses a conceptual challenge.
Whatever justification we had
for expecting a small generalization error to begin with must no
longer apply to the case of
random labels.
To gain further insight into this phenomenon, we experiment with
different levels of ran-
domization exploring the continuum between no label noise and
completely corrupted labels.
We also try out different randomizations of the inputs (rather than
labels), arriving at the same
general conclusion.
The experiments are run on two image classification datasets, the
CIFAR-10 dataset [Krizhevsky
and Hinton, 2009] and the ImageNet [Russakovsky et al., 2015]
ILSVRC 2012 dataset. We
test the Inception V3 [Szegedy et al., 2016] architecture on
ImageNet and a smaller version of In-
ception, Alexnet [Krizhevsky et al., 2012], and MLPs on CIFAR-10.
Please see Subsection 2.4.8
in the appendix for more details of the experimental setup.
Fitting random labels and pixels
We run our experiments with the following modifications of the
labels and input images:
Figure 2-3: Fitting random labels and random pixels on CIFAR-10. (a) shows the training loss of various experiment settings decaying with the training steps. (b) shows the relative convergence time with different label corruption ratios. (c) shows the test error (also the generalization error since training error is 0) under different label corruptions.
• True labels: the original dataset without modification.
• Partially corrupted labels: independently with probability p, the label of each image is replaced by a uniformly random class.
• Random labels: all the labels are replaced with random
ones.
• Shuffled pixels: a random permutation of the pixels is chosen and
then the same per-
mutation is applied to all the images in both training and test
set.
• Random pixels: a different random permutation is applied to each
image independently.
• Gaussian: A Gaussian distribution (with matching mean and
variance to the original
image dataset) is used to generate random pixels for each
image.
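These transformations are simple to express in code. The following numpy sketch is illustrative only: the helper names and array shapes are our own, and details may differ from our actual experimental pipeline:

```python
import numpy as np

def corrupt_labels(labels, p, num_classes, rng):
    """With probability p, independently replace each label by a
    uniformly random class (which may coincide with the true one)."""
    mask = rng.random(len(labels)) < p
    random_labels = rng.integers(0, num_classes, size=len(labels))
    return np.where(mask, random_labels, labels)

def shuffle_pixels(images, rng):
    """Apply one fixed pixel permutation to every image."""
    n, num_pixels = images.shape[0], images[0].size
    perm = rng.permutation(num_pixels)
    flat = images.reshape(n, num_pixels)
    return flat[:, perm].reshape(images.shape)

def random_pixels(images, rng):
    """Apply an independent pixel permutation to each image."""
    n, num_pixels = images.shape[0], images[0].size
    flat = images.reshape(n, num_pixels)
    out = np.stack([img[rng.permutation(num_pixels)] for img in flat])
    return out.reshape(images.shape)

def gaussian_images(images, rng):
    """Replace images by Gaussian noise matching the dataset's
    overall mean and standard deviation."""
    return rng.normal(images.mean(), images.std(), size=images.shape)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
noisy = corrupt_labels(labels, p=0.5, num_classes=10, rng=rng)
# ≈ 0.45 for p = 0.5 with 10 classes: half the labels are redrawn,
# and a redrawn label collides with the true one 1/10 of the time.
print((noisy != labels).mean())
```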
Surprisingly, stochastic gradient descent with unchanged hyperparameter settings can optimize the weights to fit random labels perfectly, even though the random labels completely destroy the relationship between images and labels. We further break the structure of the images by shuffling the image pixels, and even by completely re-sampling random pixels from a Gaussian distribution. But the networks we tested are still able to fit.
Figure 2-3a shows the learning curves of the Inception model on the CIFAR-10 dataset under various settings. We expect the objective function to take longer to start decreasing on random labels because initially the label assignments of the training samples are uncorrelated with the inputs. Therefore, large prediction errors are back-propagated to make large gradients for parameter updates. However, since the random labels are fixed and consistent across epochs, the network starts fitting after going through the training set multiple times. We find the following observations about fitting random labels very interesting: a) we do not need to change the learning rate schedule; b) once the fitting starts, it converges quickly; c) it converges to (over)fit the training set perfectly. Also note that "random pixels" and "Gaussian" start converging faster than "random labels". This might be because with random pixels the inputs are more separated from each other than natural images that originally belong to the same category, making it easier to build a network for arbitrary label assignments.
On the CIFAR-10 dataset, Alexnet and MLPs all converge to zero loss
on the training set.
The shaded rows in Table 2.3 show the exact numbers and
experimental setup. We also tested
random labels on the ImageNet dataset. As shown in the last three
rows of Table 2.4 in the
appendix, although it does not reach the perfect 100% top-1
accuracy, 95.20% accuracy is still
very surprising for a million random labels from 1000 categories.
Note that we did not do any
hyperparameter tuning when switching from the true labels to random
labels. It is likely that
with some modification of the hyperparameters, perfect accuracy
could be achieved on random
labels. The network also manages to reach ∼90% top-1 accuracy even with explicit regularizers turned on.
Partially corrupted labels We further inspect the behavior of neural network training with a varying level of label corruption, from 0 (no corruption) to 1 (completely random labels), on the CIFAR-10 dataset. The networks fit the corrupted training set perfectly in all the cases. Figure 2-3b shows the slowdown of the convergence time with increasing levels of label noise. Figure 2-3c depicts the test errors after convergence. Since the training errors are always zero, the test errors equal the generalization errors. As the noise level approaches 1, the generalization errors converge to 90%, the performance of random guessing on CIFAR-10.
Implications
In light of our randomization experiments, we discuss how our
findings pose a challenge for
several traditional approaches for reasoning about
generalization.
Rademacher complexity and VC-dimension. Rademacher complexity is a commonly used and flexible complexity measure of a hypothesis class; see Subsection 2.1.1 for a brief introduction. The definition (2.13) of the empirical Rademacher complexity closely resembles our randomization test. Specifically, R_S(F) measures the ability of F to fit random ±1 binary label assignments. While we consider multiclass problems, it is straightforward to consider related binary classification problems for which the same experimental observations hold. Since our randomization tests suggest that many neural networks fit the training set with random labels perfectly, we expect that R_S(F) ≈ 1 for the corresponding model class F. This is, of course, a trivial upper bound on the Rademacher complexity that does not lead to useful generalization bounds in realistic settings. A similar reasoning applies to VC-dimension and its continuous analog, the fat-shattering dimension, unless we further restrict the network. While Bartlett [1998] proves a bound on the fat-shattering dimension in terms of ℓ1 norm bounds on the weights of the network, this bound does not apply to the ReLU networks that we consider here. This result was generalized to other norms by Neyshabur et al. [2015b], but even these do not seem to explain the generalization behavior that we observe.
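The connection between the randomization test and the definition of R_S(F) can be made concrete with a Monte Carlo estimate. This is a sketch; `fit_and_eval` is a hypothetical interface standing in for "train the best model in F against random signs and return its outputs":

```python
import numpy as np

def empirical_rademacher(fit_and_eval, X, num_draws=20, seed=0):
    """Monte Carlo estimate of R_S(F) = E_sigma sup_f (1/n) sum_i sigma_i f(x_i).

    fit_and_eval(X, sigma) is a hypothetical interface: it should train the
    best model in F against the random signs sigma and return its outputs on X.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    vals = []
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        vals.append(np.mean(sigma * fit_and_eval(X, sigma)))
    return float(np.mean(vals))

# If the class can realize any sign pattern, as the randomization tests
# suggest for the networks above (with outputs clipped to +/-1), the
# supremum is attained by matching sigma itself and the estimate is 1:
assert empirical_rademacher(lambda X, s: s, np.zeros((8, 2))) == 1.0
```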
Uniform stability. Stepping away from complexity measures of the
hypothesis class, we can
instead consider properties of the algorithm used for training.
This is commonly done with
some notion of stability, such as uniform stability [Bousquet and
Elisseeff, 2002]. Uniform
stability of an algorithm measures how sensitive the algorithm is
to the replacement of a single
example. However, it is solely a property of the algorithm, which
does not take into account
specifics of the data or the distribution of the labels. It is
possible to define weaker notions of
stability [Mukherjee et al., 2002, Poggio et al., 2004,
Shalev-Shwartz et al., 2010]. The weakest
stability measure is directly equivalent to bounding generalization
error and does take the data
into account. However, it has been difficult to utilize this weaker
stability notion effectively.
2.4.4 The role of regularizers
Most of our randomization tests are performed with explicit regularization turned off. Regularizers are the standard tool in theory and practice to mitigate overfitting in the regime where there are more parameters than data points [Vapnik, 1998]. The basic idea is that although
Table 2.3: The training and test accuracy (in percentage) of various models on the CIFAR-10 dataset. Performance with and without data augmentation and weight decay are compared. The results of fitting random labels are also included.

model                      # params   random crop  weight decay  train acc.  test acc.
Inception                  1,649,402  yes          yes           100.0       89.05
                                      yes          no            100.0       89.31
                                      no           yes           100.0       86.03
                                      no           no            100.0       85.75
  (fitting random labels)             no           no            100.0       9.78
Inception w/o BatchNorm    1,649,402  no           yes           100.0       83.00
                                      no           no            100.0       82.00
  (fitting random labels)             no           no            100.0       10.12
Alexnet                    1,387,786  yes          yes           99.90       81.22
                                      yes          no            99.82       79.66
                                      no           yes           100.0       77.36
                                      no           no            100.0       76.07
  (fitting random labels)             no           no            99.82       9.86
MLP 3x512                  1,735,178  no           yes           100.0       53.35
                                      no           no            100.0       52.39
  (fitting random labels)             no           no            100.0       10.48
MLP 1x512                  1,209,866  no           yes           99.80       50.39
                                      no           no            100.0       50.51
  (fitting random labels)             no           no            99.34       10.61
the original hypothesis space is too large to generalize well, regularizers help confine learning to a subset of the hypothesis space with manageable complexity. By adding an explicit regularizer, say by penalizing the norm of the optimal solution, the effective Rademacher complexity of the possible solutions is dramatically reduced.

As we will see, in deep learning, explicit regularization seems to play a rather different role. As the bottom rows of Table 2.4 in the appendix show, even with dropout and weight decay, InceptionV3 is still able to fit the random training set extremely well, if not perfectly. Although not shown explicitly, on CIFAR-10 both Inception and MLPs still fit the random training set perfectly with weight decay turned on. However, AlexNet with weight decay turned on fails to converge on random labels. To investigate the role of regularization in deep learning, we explicitly compare the behavior of deep nets trained with and without regularizers.
Instead of doing a full survey of all kinds of regularization techniques introduced for deep learning, we simply take several commonly used network architectures and compare their behavior when turning off the equipped regularizers. The following regularizers are covered:

• Data augmentation: augment the training set via domain-specific transformations. For image data, commonly used transformations include random cropping and random perturbation of brightness, saturation, hue and contrast.

• Weight decay: equivalent to an ℓ2 regularizer on the weights; also equivalent to a hard constraint of the weights to a Euclidean ball, with the radius determined by the amount of weight decay.

• Dropout [Srivastava et al., 2014]: mask out each element of a layer output randomly with a given dropout probability. Only the Inception V3 for ImageNet uses dropout in our experiments.
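The first half of the weight-decay bullet is easy to check numerically: one SGD step on the penalized objective L(w) + (λ/2)‖w‖² coincides with the "decay the weights, then step" form of weight decay. A minimal NumPy sketch with a stand-in gradient (the constraint-form equivalence is a separate duality argument not checked here):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)      # current weights
g = rng.normal(size=5)      # stand-in for the gradient of the data loss L(w)
eta, lam = 0.1, 0.01        # step size and weight-decay coefficient

# One SGD step on the penalized objective L(w) + (lam / 2) * ||w||^2 ...
step_penalized = w - eta * (g + lam * w)
# ... equals multiplicatively shrinking w, then taking a plain gradient step.
step_decayed = (1 - eta * lam) * w - eta * g

assert np.allclose(step_penalized, step_decayed)
```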
Table 2.3 shows the results of Inception, Alexnet and MLPs on CIFAR-10, toggling the use of data augmentation and weight decay. Both regularization techniques help to improve the generalization performance, but even with all of the regularizers turned off, all of the models still generalize very well.

Table 2.4 in the appendix shows a similar experiment on the ImageNet dataset. An 18% top-1 accuracy drop is observed when we turn off all the regularizers. Specifically, the top-1 accuracy
[Figure 2-4: two panels, x-axis in thousands of training steps; (a) Inception on ImageNet, (b) Inception on CIFAR-10]

Figure 2-4: Effects of implicit regularizers on generalization performance. aug is data augmentation, wd is weight decay, BN is batch normalization. The shaded areas are the cumulative best test accuracy, as an indicator of the potential performance gain of early stopping. (a) Early stopping could potentially improve generalization when other regularizers are absent. (b) Early stopping is not necessarily helpful on CIFAR-10, but batch normalization stabilizes the training process and improves generalization.
without regularization is 59.80%, while random guessing only achieves 0.1% top-1 accuracy on ImageNet. More strikingly, with data augmentation on but other explicit regularizers off, Inception is able to achieve a top-1 accuracy of 72.95%. Indeed, it seems that the ability to augment the data using known symmetries is significantly more powerful than just tuning weight decay or preventing low training error.

Inception achieves 80.38% top-5 accuracy without regularization, while the winner of ILSVRC 2012 [Krizhevsky et al., 2012] reported 83.6%. So while regularization is important, bigger gains can be achieved by simply changing the model architecture. It is difficult to say that the regularizers count as a fundamental phase change in the generalization capability of deep nets.
Implicit regularizations
Early stopping was shown to implicitly regularize some convex learning problems [Yao et al., 2007, Lin et al., 2016]. In Table 2.4 in the appendix, we show in parentheses the best test accuracy along the training process. It confirms that early stopping could potentially⁴ improve the generalization performance. Figure 2-4a shows the training and testing accuracy on ImageNet. The shaded areas indicate the cumulative best test accuracy, as a reference for the potential performance gain of early stopping. However, on the CIFAR-10 dataset, we do not observe any potential benefit of early stopping.

⁴We say “potentially” because, to make this statement rigorous, we would need to hold out another isolated test set and measure the performance there after choosing the early stopping point on the first test set (which then acts like a validation set).
Batch normalization [Ioffe and Szegedy, 2015] is an operator that normalizes the layer responses within each mini-batch. It has been widely adopted in many modern neural network architectures such as Inception [Szegedy et al., 2016] and Residual Networks [He et al., 2016]. Although not explicitly designed for regularization, batch normalization is usually found to improve the generalization performance. The Inception architecture uses a large number of batch normalization layers. To test the impact of batch normalization, we create an “Inception w/o BatchNorm” architecture that is exactly the same as the Inception in Figure 2-5, except with all the batch normalization layers removed. Figure 2-4b compares the learning curves of the two variants of Inception on CIFAR-10, with all the explicit regularizers turned off. The normalization operator helps stabilize the learning dynamics, but the impact on the generalization performance is only 3∼4%. The exact accuracies are also listed in the “Inception w/o BatchNorm” rows of Table 2.3.
In summary, our observations on both explicit and implicit regularizers consistently suggest that regularizers, when properly tuned, can help to improve the generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers are removed.
2.4.5 Finite-sample expressivity
Much effort has gone into characterizing the expressivity of neural networks, e.g., Cybenko [1989], Mhaskar [1993], Delalleau and Bengio [2011], Mhaskar and Poggio [2016], Eldan and Shamir [2015], Telgarsky [2016], Cohen and Shashua [2016]. Almost all of these results are at the “population level”, showing what functions over the entire domain can and cannot be represented by certain classes of neural networks with the same number of parameters. For example, it is known that at the population level, depth k is generically more powerful than depth k − 1.

We argue that what is more relevant in practice is the expressive power of neural networks on a finite sample of size n. It is possible to transfer population-level results to finite-sample results using uniform convergence theorems. However, such uniform convergence bounds would require the sample size to be polynomially large in the dimension of the input and exponential in the depth of the network, posing a clearly unrealistic requirement in practice.
We instead directly analyze the finite-sample expressivity of neural networks, noting that this dramatically simplifies the picture. Specifically, as soon as the number of parameters p of a network is greater than n, even simple two-layer neural networks can represent any function of the input sample. We say that a neural network C can represent any function of a sample of size n in d dimensions if for every sample S ⊆ R^d with |S| = n and every function f : S → R, there exists a setting of the weights of C such that C(x) = f(x) for every x ∈ S.

Theorem 2. There exists a two-layer neural network with ReLU activations and 2n + d weights that can represent any function on a sample of size n in d dimensions.

The proof is given in Section 2.4.8 in the appendix, where we also discuss how to achieve width O(n/k) with depth k. We remark that it is a simple exercise to give bounds on the weights of the coefficient vectors in our construction. Lemma 2 gives a bound on the smallest eigenvalue of the matrix A. This can be used to give reasonable bounds on the weight of the solution w.
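A minimal sketch of the construction behind Theorem 2 (our rendering of the triangular-system idea; the helper name and interface are ours): project the sample onto a generic direction a (d weights), pick n biases b interleaving the sorted projections (n weights), and solve the resulting lower-triangular system for the n output weights, for 2n + d weights in total.

```python
import numpy as np

def interpolating_relu_net(X, y, seed=0):
    """Two-layer ReLU net C(x) = sum_j w_j * relu(a.x - b_j) fitting y on X.

    Uses d + n + n = 2n + d weights, matching Theorem 2."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    a = rng.normal(size=d)           # generic direction: projections distinct a.s.
    proj = X @ a
    order = np.argsort(proj)
    z = proj[order]                  # sorted projections z_1 < ... < z_n
    b = np.concatenate([[z[0] - 1.0], (z[:-1] + z[1:]) / 2.0])  # interleaving biases
    A = np.maximum(z[:, None] - b[None, :], 0.0)  # lower triangular, positive diagonal
    w = np.linalg.solve(A, y[order])              # invertible, so any y is matched
    return lambda x: np.maximum(x @ a - b, 0.0) @ w
```

Because A is lower triangular with a strictly positive diagonal, it is invertible, so the output weights exist for every choice of labels y.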
2.4.6 Implicit regularization: an appeal to linear models
Although deep neural nets remain mysterious for many reasons, we
note in this section that it is
not necessarily easy to understand the source of generalization for
linear models either. Indeed,
it is useful to appeal to the simple case of linear models to see
if there are parallel insights that
can help us better understand neural networks.
Suppose we collect n distinct data points {(x_i, y_i)}, where the x_i are d-dimensional feature vectors and the y_i are labels. Letting loss denote a nonnegative loss function with loss(y, y) = 0, consider the empirical risk minimization (ERM) problem

    min_{w ∈ R^d} (1/n) ∑_{i=1}^n loss(w^T x_i, y_i)    (2.17)

If d ≥ n, then we can fit any labeling. But is it then possible to generalize with such a rich model class and no explicit regularization?
Let X denote the n × d data matrix whose i-th row is x_i^T. If X has rank n, then the system of equations Xw = y has an infinite number of solutions regardless of the right-hand side. We can find a global minimum of the ERM problem (2.17) by simply solving this linear system.
But do all global minima generalize equally well? Is there a way to determine when one global minimum will generalize whereas another will not? One popular way to understand the quality of minima is the curvature of the loss function at the solution. But in the linear case, the curvature of all optimal solutions is the same [Choromanska et al., 2015]. To see this, note that in the case when y_i is a scalar,

    ∇² (1/n) ∑_{i=1}^n loss(w^T x_i, y_i) = (1/n) ∑_{i=1}^n (∂² loss(z, y_i)/∂z²)|_{z=y_i} · x_i x_i^T

at any global minimum, since w^T x_i = y_i there. A similar formula can be found when y is vector valued. In particular, at the global minima the Hessian is not a function of the choice of w. Moreover, the Hessian is degenerate at all globally optimal solutions.
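For the common special case of squared loss, loss(z, y) = ½(z − y)², the Hessian of the empirical risk is (1/n) XᵀX at every w, not just at the minima, so the independence is easy to verify numerically. A quick finite-difference check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def risk(w):
    """Empirical risk under squared loss, loss(z, y) = 0.5 * (z - y)^2."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def num_hessian(w, eps=1e-4):
    """Finite-difference Hessian of the empirical risk at w."""
    H, I = np.zeros((d, d)), np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (risk(w + eps * (I[i] + I[j])) - risk(w + eps * (I[i] - I[j]))
                       - risk(w - eps * (I[i] - I[j])) + risk(w - eps * (I[i] + I[j]))) / (4 * eps ** 2)
    return H

H_a = num_hessian(rng.normal(size=d))
H_b = num_hessian(rng.normal(size=d))
assert np.allclose(H_a, H_b, atol=1e-5)          # same curvature at every w
assert np.allclose(H_a, X.T @ X / n, atol=1e-5)  # equals (1/n) X^T X
```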
If curvature doesn’t distinguish global minima, what does? A promising direction is to consider the workhorse algorithm, stochastic gradient descent (SGD), and inspect which solution SGD converges to. The SGD update takes the form

    w_{t+1} = w_t − η_t e_t x_{i_t}

where η_t is the step size and e_t is the prediction error. If w_0 = 0, the solution must have the form w = ∑_{i=1}^n α_i x_i for some coefficients α_i. Hence, if we run SGD, the solution w = X^T α lies in the span of the data points. If we also perfectly interpolate the labels, we have Xw = y. Enforcing both of these identities reduces to the single equation

    X X^T α = y    (2.18)

which has a unique solution. Note that this equation only depends on the dot products between the data points x_i. We have thus derived the “kernel trick” [Schölkopf et al., 2001], albeit in a roundabout fashion.
We can therefore perfectly fit any set of labels by forming the Gram matrix (aka the kernel matrix) on the data, K = XX^T, and solving the linear system Kα = y for α. This is an n × n linear system that can be solved on standard workstations whenever n is less than a hundred thousand, as is the case for small benchmarks like CIFAR-10 and MNIST.
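The derivation can be run end to end on synthetic overparameterized data. The sketch below also verifies the minimum-norm interpretation discussed shortly, using the fact that NumPy's `lstsq` returns the minimum ℓ2-norm solution of an underdetermined full-rank system:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 100                      # overparameterized: d > n
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

K = X @ X.T                         # Gram (kernel) matrix, n x n
alpha = np.linalg.solve(K, y)       # solve Eq. (2.18): X X^T alpha = y
w = X.T @ alpha                     # recover w in the span of the data

assert np.allclose(X @ w, y)        # w interpolates the labels exactly
w_min = np.linalg.lstsq(X, y, rcond=None)[0]  # minimum l2-norm interpolant
assert np.allclose(w, w_min)        # the kernel solution is the min-norm one
```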
Quite surprisingly, fitting the training labels exactly yields excellent performance for convex models. On MNIST with no preprocessing, we are able to achieve a test error of 1.2% by simply solving (2.18). Note that this is not entirely trivial, as the kernel matrix requires 30GB to store in memory. Nonetheless, this system can be solved in under 3 minutes on a commodity workstation with 24 cores and 256 GB of RAM with a conventional LAPACK call. By first applying a Gabor wavelet transform to the data and then solving (2.18), the error on MNIST drops to 0.6%. Surprisingly, adding regularization does not improve either model’s performance!

Similar results follow for CIFAR-10. Simply applying a Gaussian kernel on pixels and using no regularization achieves 46% test error. By preprocessing with a random convolutional neural net with 32,000 random filters, this test error drops to 17%⁵. Adding ℓ2 regularization further reduces this number to 15%. Note that this is without any data augmentation.
Note that this kernel solution has an appealing interpretation in terms of implicit regularization. Simple algebra reveals that it is equivalent to the minimum ℓ2-norm solution of Xw = y. That is, out of all models that exactly fit the data, SGD will often converge to the solution with minimum norm. It is very easy to construct solutions of Xw = y that don’t generalize: for example, one could fit a Gaussian kernel to the data and place the centers at random points. Another simple example would be to force the data to fit random labels on the test data. In both cases, the norm of the solution is significantly larger than that of the minimum-norm solution.
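The existence of such larger-norm interpolants is easy to demonstrate on synthetic data: adding any null-space component of X to the minimum-norm solution leaves the fit unchanged but strictly increases the norm (a sketch with made-up Gaussian data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_min = np.linalg.lstsq(X, y, rcond=None)[0]    # minimum-norm interpolant

# Project a random vector onto the null space of X: X @ v_null == 0, so
# adding it changes nothing about the fit, only the norm of the solution.
v = rng.normal(size=d)
v_null = v - X.T @ np.linalg.solve(X @ X.T, X @ v)
w_other = w_min + v_null

assert np.allclose(X @ w_other, y)                       # still fits exactly
assert np.linalg.norm(w_other) > np.linalg.norm(w_min)   # but has larger norm
```

Since w_min lies in the row space of X and v_null is orthogonal to it, the norms add in quadrature, so any nonzero null-space component strictly increases the norm.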
Unfortunately, this notion of minimum norm is not predictive of generalization performance. For example, returning to the MNIST example, the ℓ2-norm of the minimum-norm solution with no preprocessing is approximately 220. With wavelet preprocessing, the norm jumps to 390, yet the test error drops by a factor of 2. So while this minimum-norm intuition may provide some guidance for new algorithm design, it is only a very small piece of the generalization story.
2.4.7 Conclusion
In this work we presented a simple experimental framework for defining and understanding a notion of effective capacity of machine learning models. The experiments we conducted emphasize that the effective capacity of several successful neural network architectures is large enough
5This conv-net is the Coates and Ng [2012] net, but with the
filters selected at random instead of with k-means.
to shatter the training data. Consequently, these models are in principle rich enough to memorize the training data. This situation poses a conceptual challenge to statistical learning theory, as traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks. We argue that we have yet to discover a precise formal measure under which these enormous models are simple. Another insight resulting from our experiments is that optimization continues to be empirically easy even if the resulting model does not generalize. This shows that the reasons why optimization is empirically easy must be differe