Some New Insights on Regularization and Interpolation Motivated from Neural Networks
Tengyuan Liang
Econometrics and Statistics
OUTLINE
Generative Adversarial Networks (unsupervised)
• statistical rates
• pair regularization
• optimization

Interpolation (supervised)
• regularization?
• kernel ridgeless regression
• GD on two-layer ReLU networks
GENERATIVE ADVERSARIAL NETWORKS
• GAN Goodfellow et al. (2014)
• WGAN Arjovsky et al. (2017); Arjovsky and Bottou (2017)
• MMD GAN Li, Swersky, and Zemel (2015); Dziugaite, Roy, and Ghahramani (2015); Arbel, Sutherland, Binkowski, and Gretton (2018)
• f-GAN Nowozin, Cseke, and Tomioka (2016)
• Sobolev GAN Mroueh et al. (2017)
• many others... Liu, Bousquet, and Chaudhuri (2017); Tolstikhin, Gelly, Bousquet, Simon-Gabriel, and Scholkopf (2017)
Generator g_θ, Discriminator f_ω

U(θ,ω) = E_{X∼P_real} h₁(f_ω(X)) − E_{Z∼P_input} h₂(f_ω(g_θ(Z)))

min_θ max_ω U(θ,ω)

GANs are widely used in practice; however...
MUCH NEEDS TO BE UNDERSTOOD, IN THEORY
• Approximation: which distributions can be approximated by the generator g_θ(Z)?
• Statistical: given n samples, what is the statistical/generalization error rate?
• Computational: local convergence for practical optimization; how to stabilize?
• Landscape: are local saddle points good globally?
FORMULATION
D_G: dist. class realized by the generator; F_D: func. class of the discriminator; ν: target dist.

population µ∗ := argmin_{µ∈D_G} max_{f∈F_D} [ E_{Y∼µ} f(Y) − E_{X∼ν} f(X) ]

ν_n: empirical dist.

empirical µ̂_n := argmin_{µ∈D_G} max_{f∈F_D} [ E_{Y∼µ} f(Y) − E_{X∼ν_n} f(X) ]
• Density learning/estimation: a long history in nonparametric statistics; target density ν ∈ W^α, the Sobolev space with smoothness α ≥ 0
Stone (1982); Nemirovski (2000); Tsybakov (2009); Wasserman (2006)
• GAN statistical theory is needed
Arora and Zhang (2017); Arora et al. (2017a,b); Liu et al. (2017)
DISCRIMINATOR METRIC
Define the critic metric (IPM)

d_F(µ,ν) := sup_{f∈F} [ E_{Y∼µ} f(Y) − E_{X∼ν} f(X) ].

• F Lip-1: Wasserstein metric d_W
• F bounded by 1: total variation/Radon metric d_TV
• RKHS H, F = {f ∈ H : ∥f∥_H ≤ 1}: MMD GAN
• F with Sobolev smoothness β: Sobolev GAN

Statistical question: what is the statistical error rate with n i.i.d. samples, E d_F(ν, µ̂_n)?
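As a concrete instance of an IPM, when F is the unit ball of an RKHS the metric is the MMD, which has a closed-form empirical estimate. A minimal NumPy sketch, assuming a Gaussian kernel and the biased (V-statistic) form:

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # Pairwise Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 h^2)).
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd_sq(X, Y, bandwidth=1.0):
    # Biased estimate of MMD^2 = E k(x,x') + E k(y,y') - 2 E k(x,y),
    # i.e. the squared IPM d_F(mu, nu) over the RKHS unit ball.
    Kxx = gaussian_kernel(X, X, bandwidth).mean()
    Kyy = gaussian_kernel(Y, Y, bandwidth).mean()
    Kxy = gaussian_kernel(X, Y, bandwidth).mean()
    return Kxx + Kyy - 2 * Kxy

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # samples from mu
Y = rng.normal(0.5, 1.0, size=(500, 2))   # samples from nu
print(mmd_sq(X, Y))  # larger when the two distributions differ more
```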
MINIMAX OPTIMAL RATES: SOBOLEV GAN
Consider the target density ν ∈ G = W^α, the Sobolev space with smoothness α > 0, and the evaluation metric F = W^β with smoothness β > 0.

Theorem (L. '17 & L. '18, Sobolev).
The minimax optimal rate is

inf_{ν̂_n} sup_{ν∈G} E d_F(ν, ν̂_n) ≍ n^{−(α+β)/(2α+d)} ∨ n^{−1/2}.

Here ν̂_n is any estimator based on n samples, and d is the dimension.

Mair and Ruymgaart (1996); Liang (2017); Singh et al. (2018)
MINIMAX OPTIMAL RATES: MMD GAN
Consider a reproducing kernel Hilbert space (RKHS) H
• integral operator T with eigenvalue decay t_i ≍ i^{−κ}, 0 < κ < ∞
• evaluation metric F = {f ∈ H : ∥f∥_H ≤ 1}
• target density ν(x) in G = {ν : ∥T^{−(α−1)/2} ν∥_H ≤ 1} with smoothness α > 0

Theorem (L. '18, RKHS).
The minimax optimal rate is

inf_{ν̂_n} sup_{ν∈G} E d_F(ν, ν̂_n) ≾ n^{−(α+1)κ/(2ακ+2)} ∨ n^{−1/2}.

κ > 1: intrinsic dim. ∑_{i≥1} t_i = ∑_{i≥1} i^{−κ} ≤ C; the parametric rate n^{−(α+1)κ/(2ακ+2)} ∨ n^{−1/2} = n^{−1/2}.
κ < 1: the sample complexity scales as n = ε^{−2 − (2/(α+1))(1/κ − 1)}, with "effective dim." 1/κ.
ORACLE INEQUALITY
Generator class D_G may not contain the target density ν: oracle approach.

Let D_G be any generator class, the discriminator metric F_D = W^β, and the target density ν ∈ W^α.

Corollary (L. '18).
With the empirical density ν_n(x) as plug-in, the GAN estimator

µ̂_n ∈ argmin_{µ∈D_G} max_{f∈F_D} [ ∫ f(x)µ(x)dx − ∫ f(x)ν_n(x)dx ]

attains a sub-optimal rate

E d_{F_D}(µ̂_n, ν) ≤ min_{µ∈D_G} d_{F_D}(µ,ν) + n^{−β/d} ∨ (log n)/√n.

Corollary (L. '18).
In contrast, with a smoothed/regularized empirical density ν̃_n(x) as plug-in,

µ̃_n ∈ argmin_{µ∈D_G} max_{f∈F_D} [ ∫ f(x)µ(x)dx − ∫ f(x)ν̃_n(x)dx ],

a faster rate is attainable:

E d_{F_D}(µ̃_n, ν) ≤ min_{µ∈D_G} d_{F_D}(µ,ν) + n^{−(α+β)/(2α+d)} ∨ 1/√n.

Canas and Rosasco (2012)
SUB-OPTIMALITY AND REGULARIZATION
Regularization helps achieve a faster rate!
However, notions of regularization/complexity are not yet well understood for neural nets...

Use ν̃_n, a "smoothed" empirical estimate, which serves as regularization.

For example, kernel smoothing: ν̃_n(x) = (1/(n h_n^d)) ∑_{i=1}^n K((x − x_i)/h_n)

In practice, SGD still carries through, since sampling from ν̃_n is as easy as sampling from a Gaussian mixture.

It turns out this is used in practice, under the names "instance noise" or "data augmentation".

Sønderby et al. (2016); Liang et al. (2017); Arjovsky and Bottou (2017); Mescheder et al. (2018)
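A minimal sketch of this smoothing in code, assuming a Gaussian kernel so that ν̃_n is a Gaussian mixture centered at the data points; drawing a sample is just resampling a data point and adding noise:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))  # stand-in for the real samples x_1..x_n

def sample_smoothed(data, n_samples, h):
    # Draw from the kernel-smoothed empirical measure: pick a data point
    # uniformly at random, then add Gaussian noise with bandwidth h
    # ("instance noise").
    idx = rng.integers(0, len(data), size=n_samples)
    return data[idx] + h * rng.normal(size=(n_samples, data.shape[1]))

batch = sample_smoothed(data, n_samples=64, h=0.1)
# `batch` would replace the raw real batch when training the discriminator.
```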
Consider the parametrized GAN estimator

θ̂_{m,n} ∈ argmin_{θ: g_θ∈G} max_{ω: f_ω∈F} [ Ê_m f_ω(g_θ(Z)) − Ê_n f_ω(X) ],

where m and n denote the number of generator samples and real samples, respectively.
GENERALIZED ORACLE INEQUALITY
approx. err.: A₁(F,G,ν) := sup_θ inf_ω ∥ log(ν/µ_θ) − f_ω ∥_∞,  A₂(G,ν) := inf_θ ∥ log(µ_θ/ν) ∥_∞^{1/2},

sto. err.: S_{n,m}(F,G) := √( Pdim(F) (log m/m ∨ log n/n) ) ∨ √( Pdim(F∘G) log m/m ),

where Pdim(·) is the pseudo-dimension of the neural network function class.

Theorem (L. '18).
E d²_TV(ν, µ_{θ̂_{m,n}}), E d²_W(ν, µ_{θ̂_{m,n}}), E d_KL(ν∥µ_{θ̂_{m,n}}) + E d_KL(µ_{θ̂_{m,n}}∥ν)
≤ A₁(F,G,ν) + A₂(G,ν) + S_{n,m}(F,G).

We emphasize the interplay between (G,F) as a pair of tuning parameters for regularization.
PAIR REGULARIZATION
For instance, one simple form of the interplay:
fix G; as F increases: A₁(F,G,ν) decreases, A₂(G,ν) stays constant, S_{n,m}(F,G) increases;
fix F; as G increases: A₁(F,G,ν) increases, A₂(G,ν) decreases, S_{n,m}(F,G) increases.

[Figure: phase diagram over the generator class (x-axis) and discriminator class (y-axis), indicating which of A₁, A₂, S_{n,m} dominates in each region.]
APPLICATION I: PARAMETRIC RATES FOR LEAKY RELU NETWORKS
When the generator G and discriminator F are both leaky ReLU networks with depth L (width properly chosen depending on the dimension), and the target density is realizable by the generator:

log µ_θ(x) = c₁ ∑_{l=1}^{L−1} ∑_{i=1}^{d} 1{m_{li}(x) ≥ 0} + c₀.

[Figure: generator and discriminator network architectures.]

Theorem (L. '18, leaky ReLU).
E d²_TV(ν, µ_{θ̂_{m,n}}) ≾ √( d²L² log(dL) (log m/m ∨ log n/n) ).

The result holds for very deep networks, with depth L = o(√n / log n).

Bai et al. (2018)
APPLICATION II: LEARNING MULTIVARIATE GAUSSIAN
Corollary (L. '18, Gaussian).
GANs enjoy near-optimal sample complexity (w.r.t. the dimension d) with proper choices of architecture and activation:

E d²_TV(ν, µ_{θ̂_{m,n}}) ≾ √( d² log d / (n ∧ m) ).
PAIR REGULARIZATION: WHY GANS MIGHT BE BETTER
[Figure: phase diagram over the generator and discriminator classes, with regions labeled "classic parametric models", "nonparametric density estimation", and "data-memorization, empirical deviation", indicating which error term dominates.]
FORMULATION
Generator g_θ, Discriminator f_ω

U(θ,ω) = E_{X∼P_real} h₁(f_ω(X)) − E_{Z∼P_input} h₂(f_ω(g_θ(Z)))

min_θ max_ω U(θ,ω)

• global optimization for general U(θ,ω) is hard Singh et al. (2000); Pfau and Vinyals (2016); Salimans et al. (2016)

Local saddle point (θ∗,ω∗): no incentive to deviate locally,

U(θ∗,ω) ≤ U(θ∗,ω∗) ≤ U(θ,ω∗),

for (θ,ω) in an open neighborhood of (θ∗,ω∗).

• also called a local Nash Equilibrium (NE)
• modest goal: initialized properly, the algorithm converges to a local NE
21 / 60
Intro. GANs Interpolation
MAIN MESSAGE: INTERACTION MATTERS
Exponential local convergence to stable equilibrium
However, “interaction term” matters, slows down the convergence⇐ curse
What if unstable? turns out “interaction term” matters, utilize it rendersexponential convergence⇐ blessing
22 / 60
Intro. GANs Interpolation
MAIN MESSAGE: INTERACTION MATTERS
Exponential local convergence to a stable equilibrium
• analogous to GD in single-player optimization, strongly convex case
• intuitive picture: discrete-time SGA cycles inward to a stable equilibrium fast

However, the "interaction term" matters and slows down the convergence ⇐ curse
• compared to conventional GD in the strongly convex case, due to the presence of ∇_θω U ∇_θω Uᵀ
• we also show a lower bound on T for SGA, showing the curse is necessary

What if the equilibrium is unstable? It turns out the "interaction term" matters again; utilizing it renders exponential convergence ⇐ blessing
• SGA fails; modify the dynamics to utilize the interaction
• the analogy to single-player optimization, non-strongly convex case, is surprising
• single-player: first-order methods cannot obtain error better than 1/T² in the smooth but non-strongly convex case, a classic result Nemirovski and Yudin (1983); Nesterov (2013)
• two-player: we will show a first-order method can obtain exponential convergence exp(−cT) to an unstable equilibrium
“However, no guarantees are known beyond the convex-concave setting and, more importantly for the paper, even in convex-concave games, no guarantees are known for the last-iterate pair.”
— Daskalakis, Ilyas, Syrgkanis, and Zeng (2017)
EXPONENTIAL CONVERGENCE TO UNSTABLE EQUILIBRIUM
OMD, proposed in Daskalakis et al. (2017):

θ_{t+1} = θ_t − 2η ∇_θ U(θ_t, ω_t) + η ∇_θ U(θ_{t−1}, ω_{t−1})
ω_{t+1} = ω_t + 2η ∇_ω U(θ_t, ω_t) − η ∇_ω U(θ_{t−1}, ω_{t−1})

Corollary (L. & Stokes, '18).
For the bilinear game U(θ,ω) = θᵀCω, to obtain an ε-close solution:

shown in Daskalakis et al. (2017): T ≿ ε^{−4} log(1/ε) · Poly( λ_max(CCᵀ)/λ_min(CCᵀ) )
we show: T ≿ log(1/ε) · λ_max(CCᵀ)/λ_min(CCᵀ)
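A minimal NumPy sketch of these OMD updates on a bilinear game, assuming U(θ,ω) = θᵀCω so that ∇_θU = Cω and ∇_ωU = Cᵀθ; the matrix C is built with a known spectrum so the condition number in the corollary is explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
# Bilinear game U(theta, omega) = theta^T C omega with singular values
# (1, 1.5, 2), so lambda_max(CC^T) / lambda_min(CC^T) = 4.
Q1, _ = np.linalg.qr(rng.normal(size=(d, d)))
Q2, _ = np.linalg.qr(rng.normal(size=(d, d)))
C = Q1 @ np.diag([1.0, 1.5, 2.0]) @ Q2.T

theta = rng.normal(size=d)
omega = rng.normal(size=d)
theta_prev, omega_prev = theta.copy(), omega.copy()
eta = 0.1

for t in range(3000):
    g_th, g_om = C @ omega, C.T @ theta            # grad_theta U, grad_omega U
    g_th_p, g_om_p = C @ omega_prev, C.T @ theta_prev
    # OMD: descend in theta / ascend in omega, extrapolating with the last gradient.
    theta_next = theta - 2 * eta * g_th + eta * g_th_p
    omega_next = omega + 2 * eta * g_om - eta * g_om_p
    theta_prev, omega_prev, theta, omega = theta, omega, theta_next, omega_next

# Last iterates should shrink toward the saddle at the origin.
print(np.linalg.norm(theta), np.linalg.norm(omega))
```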
HOW DO WE TEACH STAT/ML?
[Figure: the classical U-shaped curve: generalization error vs. model complexity, decomposed into bias (decreasing) and variance (increasing); the optimum sits at λ_opt along the regularization path ∞ ← λ → 0, and the interpolation end (λ → 0, "overfitting") is marked DANGER ZONE!!]
IS THIS REALLY WHAT’S HAPPENING IN PRACTICE?
[Figure: the empirical picture on the same axes: the generalization error keeps decreasing as λ → 0, so the former "overfitting" region is instead marked DANGER ZONE??]
Is explicit regularization λ_opt really needed?

Is interpolation really bad for statistics and machine learning?
AN EMPIRICAL EXAMPLE
[Figure: Kernel Regression on MNIST: test error (log scale) vs. ridge parameter λ ∈ [0, 1.2] for digit pairs [i,j], with i ∈ {2,3,4} and j ∈ {5,...,9}; the error curves are smallest at or near λ = 0.]

λ = 0: the interpolated solution, a perfect fit on the training data.
MNIST data from LeCun et al. (2010)
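A minimal sketch of this λ-sweep experiment, assuming scikit-learn's KernelRidge and the small built-in digits dataset as a stand-in for MNIST (the hyper-parameters here are illustrative, not those behind the slide's figure):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

# Two-class digit task, posed as regression with +/-1 labels.
X, y = load_digits(return_X_y=True)
mask = (y == 2) | (y == 5)
X, y = X[mask] / 16.0, np.where(y[mask] == 2, 1.0, -1.0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for lam in [1e-8, 1e-4, 1e-2, 1e-1, 1.0]:  # lam ~ 0 is (near-)interpolation
    model = KernelRidge(alpha=lam, kernel="rbf", gamma=1.0 / X.shape[1])
    model.fit(Xtr, ytr)
    err = np.mean((model.predict(Xte) - yte) ** 2)
    print(f"lambda={lam:g}  test MSE={err:.4f}")
```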
ISOLATED PHENOMENON? NO
• Methodology: deep learning, kernel learning, boosting, random forests... Zhang, Bengio, Hardt, Recht, and Vinyals (2016); Wyner, Olson, Bleich, and Mease (2017); Maennel, Bousquet, and Gelly (2018)
• Datasets: MNIST, CIFAR-10, others Belkin, Ma, and Mandal (2018b)
PUZZLES
Interpolated solutions perform very well in practice, across many (modern) methodologies and datasets!

What is happening? "Overfitting" is not that bad...
OUR MESSAGE
Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.
Potential theories in statistics/learning for the interpolated solution:

• Analysis through explicit regularization (✗)
• Capacity control (✗)
• Early stopping (algorithmic) (✗)
• Stability analysis (algorithmic) (✗)
• Nonparametric smoothing analysis (✗)
• Inductive bias (✓?), at least promising

Belkin, Hsu, and Mitra (2018a)
INDUCTIVE BIAS
There are many functions that behave exactly the same on the training data, but the method/algorithm prefers certain ones:

• kernels/RKHS: Representer Thm., min-norm interpolation
• over-parametrized linear regression: 0-initialization, min-norm interpolation
• matrix factorization, etc. Gunasekar, Woodworth, Bhojanapalli, Neyshabur, and Srebro (2017); Li, Ma, and Zhang (2017)
• two-layer ReLU networks Maennel, Bousquet, and Gelly (2018)
• Inductive bias (✓?), at least promising

Zhang et al. (2016); Neyshabur et al. (2017); Liang et al. (2017); Belkin et al. (2018b)
HISTORY: INTERPOLATION RULES
Understudied in the literature, especially when there is label noise.

Recent progress on local/direct interpolation schemes:

• Geometric simplicial interpolation and weighted kNN Belkin, Hsu, and Mitra (2018a)
• Nonparametric Nadaraya-Watson estimator with singular kernels Shepard (1968); Devroye, Gyorfi, and Krzyzak (1998); Belkin, Rakhlin, and Tsybakov (2018c)
SIMPLICIAL INTERPOLATION
Belkin, Hsu, and Mitra (2018a) showed that, under regularity conditions, the simplicial interpolation f̂_n satisfies

lim sup_{n→∞} E(f̂_n(x) − f∗(x))² ≤ (2/(d+2)) · E(f∗(x) − y)²
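For intuition, simplicial interpolation is piecewise-linear interpolation over a triangulation of the data; a minimal SciPy sketch (LinearNDInterpolator builds the Delaunay triangulation internally), with an assumed toy target:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(3 * X[:, 0]) + 0.5 * rng.normal(size=n)   # noisy labels

# Piecewise-linear (simplicial) interpolation: exact fit at every x_i,
# linear on each simplex of the Delaunay triangulation of X.
f_hat = LinearNDInterpolator(X, y)

x_test = rng.uniform(-0.9, 0.9, size=(1000, d))
pred = f_hat(x_test)                                  # NaN outside the convex hull
mse = np.nanmean((pred - np.sin(3 * x_test[:, 0])) ** 2)
print(mse)
```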
SINGULAR KERNEL
Shepard (1968); Devroye et al. (1998); Belkin et al. (2018c) showed that for singular kernels

K(u) := ∥u∥^{−a} 1{∥u∥ ≤ 1}  ⇒  lim_{u→0} K(u) = ∞,

the Nadaraya-Watson estimator

f̂_n(x) = ∑_{i=1}^n y_i K((x−x_i)/h) / ∑_{i=1}^n K((x−x_i)/h)

achieves the optimal error when f∗ lies in a Hölder space with smoothness β:

E(f̂_n(x) − f∗(x))² ∼ n^{−2β/(2β+d)}
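A minimal NumPy sketch of this estimator (Shepard's inverse-distance weighting), using the truncated singular kernel above; the singularity at u = 0 is exactly what forces interpolation of the training labels:

```python
import numpy as np

def nw_singular(x, X, y, h=0.5, a=2.0):
    # Singular kernel K(u) = ||u||^{-a} * 1{||u|| <= 1}: the weight blows up
    # as x approaches a training point, so the estimator interpolates the y_i.
    dist = np.linalg.norm(X - x, axis=1) / h
    if np.any(dist == 0):                 # exactly at a training point
        return y[dist == 0][0]
    w = np.where(dist <= 1, dist**(-a), 0.0)
    if w.sum() == 0:                      # no training point within radius h
        return np.nan
    return np.dot(w, y) / w.sum()

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=200)
print(nw_singular(X[0], X, y))            # equals y[0]: interpolation
print(nw_singular(np.zeros(1), X, y))     # a weighted local average elsewhere
```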
Global/inverse interpolation methods (kernel machines/neural networks/boosting) empirically perform better than the local interpolation schemes.

Several conjectures have been made about global/inverse interpolation methods, such as kernel machines in Belkin, Hsu, and Mitra (2018a) and two-layer ReLU nets in Maennel, Bousquet, and Gelly (2018).
PROBLEM FORMULATION
Given n i.i.d. pairs (x_i, y_i) drawn from an unknown µ: the x_i are d-dim. covariates in Ω ⊂ R^d, and the y_i ∈ R are the responses/labels.

The goal is to estimate

f∗(x) = E[y | x = x],

which is assumed to lie in a reproducing kernel Hilbert space (RKHS) H with kernel K(·,·).

Smola and Scholkopf (1998); Wahba (1990); Shawe-Taylor and Cristianini (2004)
Conventional wisdom: Kernel Ridge Regression, with explicit regularization λ ≠ 0 added when H is high- or infinite-dimensional:

min_{f∈H} (1/n) ∑_{i=1}^n (f(x_i) − y_i)² + λ ∥f∥²_H.
We study the min-norm interpolation estimator f̂:

f̂ := argmin_{f∈H} ∥f∥_H,  s.t. f(x_i) = y_i, ∀i ≤ n.

Equivalently,

f̂(x) = K(x,X) K(X,X)^{−1} Y

when K(X,X) ∈ R^{n×n} is invertible.
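A minimal sketch of both estimators side by side, assuming a Gaussian kernel; the ridgeless fit is the λ → 0 limit of kernel ridge regression:

```python
import numpy as np

def gram(A, B, gamma=1.0):
    # Gaussian kernel matrix K(a, b) = exp(-gamma ||a - b||^2).
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_predict(Xtr, ytr, Xte, lam=0.0):
    # lam > 0: kernel ridge regression; lam = 0: min-norm ("ridgeless")
    # interpolation f_hat(x) = K(x, X) K(X, X)^{-1} Y.
    K = gram(Xtr, Xtr) + lam * np.eye(len(Xtr))
    alpha = np.linalg.solve(K, ytr)
    return gram(Xte, Xtr) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.1 * rng.normal(size=200)
# Ridgeless fit is exact on the training data (up to numerical error):
print(np.max(np.abs(fit_predict(X, y, X, lam=0.0) - y)))
```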
We look for adaptive, "data-dependent" bounds φ_{n,d}(X, f∗) to understand when and why the interpolated estimator f̂ generalizes.

We provide high-probability bounds on

the integrated squared risk: E(f̂(x) − f∗(x))² ≤ φ_{n,d}(X, f∗)

the generalization error: E(f̂(x) − y)² − E(f∗(x) − y)² ≤ φ_{n,d}(X, f∗)
ASSUMPTIONS
(A.1) High-dim.: c ≤ d/n ≤ C; Σ_d = E_µ[xx∗] satisfies ∥Σ_d∥ ≤ C and Tr(Σ_d)/d ≥ c.

(A.2) (8+m)-moments: z_i := Σ_d^{−1/2} x_i; the entries of z_i are i.i.d., mean zero, with bounded (8+m)-th moments.

(A.3) Noise: E[(f∗(x) − y)² | x = x] ≤ σ² for all x ∈ Ω.

(A.4) Non-linear kernel: for a non-linear smooth function h(·),

K(x, x′) = h( (1/d) ⟨x, x′⟩ )
Define the following quantities related to the curvature of h(·):

α := h(0) + h″(0) Tr(Σ_d²)/d²,  β := h′(0),

γ := h( Tr(Σ_d)/d ) − h(0) − h′(0) Tr(Σ_d)/d.
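For concreteness, a quick numerical sketch of these curvature quantities for the inner-product kernel with h(t) = e^t (an assumed example choice of h, for which h′ = h″ = h) and a toy covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200
Sigma = np.diag(rng.uniform(0.5, 1.5, size=d))    # toy covariance Sigma_d

h = h1 = h2 = np.exp                              # h(t) = e^t, so h' = h'' = h
tr1 = np.trace(Sigma) / d                         # Tr(Sigma_d) / d
tr2 = np.trace(Sigma @ Sigma) / d**2              # Tr(Sigma_d^2) / d^2

alpha = h(0.0) + h2(0.0) * tr2                    # h(0) + h''(0) Tr(Sigma^2)/d^2
beta = h1(0.0)                                    # h'(0)
gamma = h(tr1) - h(0.0) - h1(0.0) * tr1           # curvature gap of h
print(alpha, beta, gamma)                         # gamma/beta acts as an implicit ridge
```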
MAIN RESULTS
Define

φ_{n,d}(X, f∗) := (8σ²∥Σ_d∥/d) ∑_j λ_j( XX∗/d + (α/β)11∗ ) / [ γ/β + λ_j( XX∗/d + (α/β)11∗ ) ]²
  + ∥f∗∥²_H inf_{0≤k≤n} { (1/n) ∑_{j>k} λ_j(K_X K_X∗) + 2M √(k/n) }

Theorem (L. & Rakhlin, '18).
Under (A.1)-(A.4), with prob. 1 − 2δ − d^{−2}, the interpolation estimator f̂ satisfies

E_{Y|X} ∥f̂ − f∗∥²_{L²_µ},  E_{Y|X} [ E(f̂(x) − y)² − E(f∗(x) − y)² ]  ≤  φ_{n,d}(X, f∗) + ε(n, d).

The remainder term ε(n, d) = O(d^{−m/(8+m)} log^{4.1} d) + O(n^{−1/2} log^{0.5}(n/δ)).
MAIN MESSAGE
Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

φ_{n,d}(X, f∗) := (8σ²∥Σ_d∥/d) ∑_j λ_j( XX∗/d + (α/β)11∗ ) / [ γ/β + λ_j( XX∗/d + (α/β)11∗ ) ]²
  + ∥f∗∥²_H inf_{0≤k≤n} { (1/n) ∑_{j>k} λ_j(K_X K_X∗) + 2M √(k/n) }

The proof is different from the classic RKHS analysis with explicit regularization.
GEOMETRIC PROPERTIES OF DESIGN
Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes. (Recall φ_{n,d}(X, f∗) above; here the spectral terms are in focus.)

[Figure: spectral decay of the kernel eigenvalues and of the covariance eigenvalues on MNIST digit pairs [2,5], [2,9], [3,6], [3,8], [4,7] (log scale, first 100 indices).]
Spectral decay of λ_j(K_X K_X∗) and λ_j(XX∗/d).

Preferable geometric properties of the design: not all designs work!
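A minimal sketch for inspecting these two spectra on a given design matrix, assuming a Gaussian kernel as the example kernel:

```python
import numpy as np

def spectra(X, gamma=None):
    n, d = X.shape
    gamma = gamma if gamma is not None else 1.0 / d
    # Covariance-side spectrum: eigenvalues of X X^T / d.
    cov_eigs = np.linalg.eigvalsh(X @ X.T / d)[::-1]
    # Kernel-side spectrum: eigenvalues of the Gaussian kernel matrix K_X.
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    ker_eigs = np.linalg.eigvalsh(np.exp(-gamma * sq))[::-1]
    return cov_eigs, ker_eigs

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 300))       # d/n = 1: the high-dim regime of (A.1)
cov_eigs, ker_eigs = spectra(X)
print(cov_eigs[:5], ker_eigs[:5])     # fast decay is the favorable geometry
```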
HIGH DIMENSIONALITY
Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes. (Recall φ_{n,d}(X, f∗) above.)

Scalings:
• c < d/n < C, the typical high-dim. scaling in RMT, El Karoui (2010); Johnstone (2001)
• scaling K(x, x′) = h(⟨x, x′⟩/d): the default choice for high-dim. data in computing packages, e.g. Scikit-learn Pedregosa et al. (2011)
• the bounds work in the large (d, n) regime, ε(n, d) = O(d^{−m/(8+m)} log^{4.1} d) + O(n^{−1/2} log^{0.5}(n/δ))

Blessings of high dimensionality:
• a similar effect is observed for local/direct interpolating schemes in Belkin et al. (2018a), for simplicial interpolation and weighted kNN
• kernel "ridgeless" regression is a global/inverse interpolation scheme
CURVATURE AND IMPLICIT REGULARIZATION
Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes. (Recall φ_{n,d}(X, f∗) above.)

Role of the implicit regularization γ/β ≠ 0, due to the curvature/non-linearity of the kernel:
• the analysis is very different from that of explicit regularization in an RKHS Caponnetto and De Vito (2007)
• we borrow tools from recent developments in RMT for kernel matrices, El Karoui (2010)

effective dim.:
• classic analysis: ∑_j λ_j/(λ∗ + λ_j)
• our analysis: ∑_j λ_j/(λ∗ + λ_j)²
• naive bound: ∑_j 1/λ_j
MAIN MESSAGE
Geometric properties of the data design X, high dimensionality, and curvature of the kernel ⇒ the interpolated solution generalizes.

"implicit regularization" + "inductive bias"
“Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.”
— Zhang, Bengio, Hardt, Recht, and Vinyals (2016)
FORMULATION
Two-layer ReLU networks:

f_t(x) = ∑_{i=1}^m w_i(t) σ(xᵀu_i(t)),

with gradient descent (GD) on the parameters w_i(t), u_i(t):

dw_i(t)/dt = −E_z[ ∂l(y, f(x))/∂f · σ(xᵀu_i) ]
du_i(t)/dt = −E_z[ ∂l(y, f(x))/∂f · w_i 1{xᵀu_i ≥ 0} x ]

Initialization: m large, u_i random from the uniform spherical distribution, with |w_i| = ∥u_i∥ = 1/√m.

Algorithmic approximation: given (x, y), run GD on a two-layer ReLU network; how does f_t(x) approximate f∗(x) = E[y | x = x]? Does it interpolate y?

No further assumption on f∗ besides that it lies in L².
Maennel, Bousquet, and Gelly (2018)
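A minimal NumPy sketch of these dynamics, discretizing the gradient flow with a small step and assuming squared loss (so ∂l/∂f = f(x) − y); the 1/√m initialization matches the slide, while the target and hyper-parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 5, 2000
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sin(X @ rng.normal(size=d))             # stand-in target

# Initialization: |w_i| = ||u_i|| = 1/sqrt(m), u_i uniform on the sphere.
U = rng.normal(size=(m, d))
U /= np.linalg.norm(U, axis=1, keepdims=True) * np.sqrt(m)
w = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

lr = 0.5
for step in range(2000):
    pre = X @ U.T                               # (n, m) pre-activations x^T u_i
    f = np.maximum(pre, 0.0) @ w                # f(x) = sum_i w_i relu(x^T u_i)
    r = f - y                                   # dl/df for squared loss
    # Discretized versions of the gradient-flow equations above.
    grad_w = np.maximum(pre, 0.0).T @ r / n
    grad_U = ((pre > 0) * r[:, None] * w[None, :]).T @ X / n
    w -= lr * grad_w
    U -= lr * grad_U

print(np.mean((f - y) ** 2))                    # training error should shrink toward interpolation
```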
VIEW GD ON RELU NETWORK AS DYNAMIC KERNELS
Lemma (Dou & L. '18+).
d E_x[ (f∗(x) − f_t(x))² ] = −2 E_{x,x̃}[ (f∗(x) − f_t(x)) K_t(x, x̃) (f∗(x̃) − f_t(x̃)) ] dt.

View the NN as a fixed kernel:

lim_{m→∞} K_0(x, x̃) = 2 [ ((π − arccos(t))/π) · t + √(1 − t²)/(2π) ],  where t = ⟨x, x̃⟩

Rahimi and Recht (2008); Cho and Saul (2009); Daniely et al. (2016); Bach (2017)

We view the NN as a dynamic kernel! We provide a mean-field approximation (as m → ∞) and a PDE characterization (distribution dynamics) for ρ_t, and thus for K_t.

Mei, Montanari, and Nguyen (2018); Rotskoff and Vanden-Eijnden (2018)
REPRESENTATION BENEFITS
NN: a data-dependent basis, an adaptive representation learned from the data.
Classic nonparametrics: a fixed basis from analysis, not adaptive to the data.

This is a heuristic justification, but what does it really mean mathematically?

Follow the GD dynamics to any stationarity; denote the corresponding RKHS by 𝒦⋆ (kernel K⋆).

Theorem (Dou & L., '18+).
For any f∗ ∈ L²_µ:
• the function computed by GD on the NN is the projection onto the RKHS 𝒦⋆:

lim_{t→∞} f_t^GD = H⋆ f∗ ∈ 𝒦⋆

• the residual lies in a smaller space:

residual := f∗ − lim_{t→∞} f_t^GD,  residual ∈ 𝒦⊥_GD ⊂ 𝒦⊥⋆

𝒦⋆ is adaptive to f∗! Gap in spaces: a non-trivial decomposition.
INTERPOLATION BENEFITS
Running GD on the NN is learning a data-dependent kernel and performing (RKHS) least squares simultaneously.

The kernel is adaptive to the task f∗, so the least-squares projection f_∞^GD lies in 𝒦⋆, and the residual f∗ − f_∞^GD is smaller.

Having a (trainable) additional layer serves as "implicit regularization" on K_t: faster interpolation.

For more benefits and generalizations, see Dou & L. '18+, to be posted.
CONCLUSION
• Minimax optimal rates do not explain the empirical success of neural networks.
• One needs a new adaptive (to properties of the data), data-dependent framework/understanding.
• This requires new insights on regularization and interpolation.
References
Thank you!
Liang, T. (2018). On How Well Generative Adversarial Networks Learn Densities: Nonparametric and Parametric Results. Available on arXiv:1811.03179. Under review.

Liang, T. & Rakhlin, A. (2018). Just Interpolate: Kernel "Ridgeless" Regression Can Generalize. Available on arXiv:1808.00387. Revision invited.

Liang, T. & Stokes, J. (2018). Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks. Available on arXiv:1802.06132. AISTATS 2019, to appear.

Liang, T., Poggio, T., Rakhlin, A. & Stokes, J. (2017). Fisher-Rao Metric, Geometry, and Complexity of Neural Networks. Available on arXiv:1711.01530. AISTATS 2019, to appear.

Dou, X. & Liang, T. (2018+). Training Neural Networks as Learning Data-adaptive Kernels: Provable Representation and Approximation Benefits. Available here.

Michael Arbel, Dougal J Sutherland, Mikołaj Binkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. arXiv preprint arXiv:1805.11565, 2018.
Martin Arjovsky and Leon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. arXiv preprint arXiv:1706.08224, 2017.
Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017a.
Sanjeev Arora, Andrej Risteski, and Yi Zhang. Theoretical limitations of encoder-decoder GAN architectures. arXiv preprint arXiv:1711.02651, 2017b.
Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017.
Yu Bai, Tengyu Ma, and Andrej Risteski. Approximability of discriminators implies diversity in GANs. arXiv preprint arXiv:1806.10586, 2018.
Mikhail Belkin, Daniel Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. arXiv preprint arXiv:1806.05161, 2018a.
Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018b.
Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? arXiv preprint arXiv:1806.09471, 2018c.
Guillermo Canas and Lorenzo Rosasco. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pages 2492–2500, 2012.
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.
Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.
Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.
Luc Devroye, Laszlo Gyorfi, and Adam Krzyzak. The Hilbert kernel regression estimate. Journal of Multivariate Analysis, 65(2):209–227, 1998.
Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1–50, 2010.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.
Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.
Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.
Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1718–1727, 2015.
Tengyuan Liang. How well can generative adversarial networks learn densities: A nonparametric view. arXiv preprint arXiv:1712.08244, 2017.
Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530, 2017.
Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. arXiv preprint arXiv:1705.08991, 2017.
Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient descent quantizes ReLU network features. arXiv preprint arXiv:1803.08367, 2018.
Bernard A. Mair and Frits H. Ruymgaart. Statistical inverse estimation in Hilbert scales. SIAM Journal on Applied Mathematics, 56(5):1424–1444, 1996.
Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561, 2018.
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3478–3487, 2018.
Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. arXiv preprint arXiv:1711.04894, 2017.
A. Nemirovski and D. Yudin. Information-based complexity of mathematical programming. Izvestia AN SSSR, Ser. Tekhnicheskaya Kibernetika (translated to English as Engineering Cybernetics. Soviet J. Computer & Systems Sci.), 1, 1983.
Arkadi Nemirovski. Topics in non-parametric statistics. Ecole d'Ete de Probabilites de Saint-Flour, 28:85, 2000.
Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
Grant M. Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915, 2018.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Donald Shepard. A two-dimensional interpolation function for irregularly-spaced data. In Proceedings of the 1968 23rd ACM National Conference, pages 517–524. ACM, 1968.
Satinder Singh, Michael Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 541–548. Morgan Kaufmann Publishers Inc., 2000.
Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, and Barnabas Poczos. Nonparametric density estimation under adversarial losses. arXiv preprint arXiv:1805.08836, 2018.
Alex J. Smola and Bernhard Scholkopf. Learning with Kernels, volume 4. Citeseer, 1998.
Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszar. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
Charles J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040–1053, 1982.
Ilya O. Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Scholkopf. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5424–5433, 2017.
Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.
Grace Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.
Larry Wasserman. All of Nonparametric Statistics. Springer Science+Business Media, New York, 2006.
Abraham J. Wyner, Matthew Olson, Justin Bleich, and David Mease. Explaining the success of AdaBoost and random forests as interpolating classifiers. The Journal of Machine Learning Research, 18(1):1558–1590, 2017.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.