Analyzing Priors on Deep Networks
David Duvenaud, Oren Rippel, Ryan Adams, Zoubin Ghahramani
Sheffield Workshop on Deep Probabilistic Models
October 2, 2014
Designing neural nets
I Neural nets require lots of design decisions whose implications are hard to understand.
I We want to understand them without reference to a specific dataset, loss function, or training method.
I We can analyze different network architectures by looking at nets whose parameters are drawn randomly.
Why look at priors if I’m going to learn everything anyway?
I When using Bayesian neural nets:
I Can’t learn types of networks having vanishing probability under the prior.
I Even when non-probabilistic:
I Good prior → a good initialization strategy.
I Good prior → a good regularization strategy.
I Good prior → a higher fraction of parameter settings specify reasonable models → an easier optimization problem.
GPs as Neural Nets
[Network diagram: inputs x1, x2, x3 → hidden units h1, h2, …, h∞ → output f(x)]
A weighted sum of features,

f(x) = (1/K) ∑_{i=1}^K w_i h_i(x)

with any weight distribution,

E[w_i] = 0,  V[w_i] = σ²,  i.i.d.

by CLT, gives a GP as K → ∞:

cov[f(x), f(x′)] → (σ²/K) ∑_{i=1}^K h_i(x) h_i(x′)
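A quick numerical check of this limit (a minimal sketch, not from the talk): draw many random weight vectors for a fixed feature map and compare the empirical covariance of f(x) and f(x′) with (σ²/K) ∑ h_i(x) h_i(x′). The feature map below is an arbitrary illustrative choice, and the sum is scaled by 1/√K rather than 1/K so the limiting covariance matches the displayed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed feature map (random cosine features, purely for illustration).
K = 2000                                    # number of hidden units / features
omegas = rng.normal(size=K)                 # fixed "frequencies" defining h_i
phases = rng.uniform(0, 2 * np.pi, size=K)

def h(x):
    """Fixed feature vector h(x) of length K."""
    return np.sqrt(2.0) * np.cos(omegas * x + phases)

sigma2 = 1.0       # weight variance
x, xp = 0.3, 1.1   # two inputs to compare

# Monte Carlo over weight draws: f(x) = (1/sqrt(K)) * sum_i w_i h_i(x)
# (1/sqrt(K) keeps the variance finite; the slides write the 1/K normalization informally.)
n_draws = 20000
W = rng.normal(scale=np.sqrt(sigma2), size=(n_draws, K))
fx  = W @ h(x)  / np.sqrt(K)
fxp = W @ h(xp) / np.sqrt(K)

empirical_cov = np.mean(fx * fxp)                  # estimates cov[f(x), f(x')] (means are 0)
kernel        = sigma2 / K * np.dot(h(x), h(xp))   # sigma^2/K * sum_i h_i(x) h_i(x')
print(empirical_cov, kernel)                       # the two numbers should roughly agree
```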
Kernel learning as feature learning
I GPs have fixed features and integrate out the feature weights.
I Mapping between kernels and features: k(x, x′) = h(x)^T h(x′).
I Any PSD kernel can be written as an inner product of features (Mercer’s theorem).
I Kernel learning = feature learning.
I What if we make the GP’s neural network deep?
Example deep kernel: Periodic
[Network diagram: input x → fixed features sin(x), cos(x) → hidden units h^(2)_1, h^(2)_2, …, h^(2)_∞ → output f(x)]
Now our model is:

h_1(x) = [sin(x), cos(x)]

and we have the “deep kernel”:

k_2(x, x′) = exp(−½ ||h_1(x) − h_1(x′)||²)
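As a sanity check (a minimal sketch, not from the talk), composing the sin/cos layer with SE features can be evaluated directly and compared against its closed form exp(cos(x − x′) − 1), which is periodic in x − x′:

```python
import numpy as np

def h1(x):
    """First-layer feature map: x -> [sin(x), cos(x)]."""
    return np.array([np.sin(x), np.cos(x)])

def k2(x, xp):
    """Deep kernel: SE kernel applied to the sin/cos features."""
    d = h1(x) - h1(xp)
    return np.exp(-0.5 * d @ d)

def k2_closed(x, xp):
    """Same kernel in closed form: exp(cos(x - x') - 1)."""
    return np.exp(np.cos(x - xp) - 1.0)

x, xp = 0.7, 2.9
print(k2(x, xp), k2_closed(x, xp))                      # identical values
print(np.isclose(k2(x, xp + 2 * np.pi), k2(x, xp)))     # periodic with period 2*pi
```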
Deep nets, deep kernels
[Network diagram: inputs x1, x2, x3 → hidden units h^(1)_1, h^(1)_2, …, h^(1)_∞ → hidden units h^(2)_1, h^(2)_2, …, h^(2)_∞ → output f(x)]
Now our model is:

f(x) = (1/K) ∑_{i=1}^K w_i h^(2)_i(h^(1)(x)) = w^T h^(2)(h^(1)(x))

Instead of

k_1(x, x′) = h^(1)(x)^T h^(1)(x′),

we have the “deep kernel”:

k_2(x, x′) = [h^(2)(h^(1)(x))]^T h^(2)(h^(1)(x′))
Deep Kernels
I (Cho, 2012) built kernels by composing feature mappings.
I Composing any kernel k_1 with a squared-exp kernel (SE):

k_2(x, x′) = (h_SE(h_1(x)))^T h_SE(h_1(x′))
           = exp(−½ ||h_1(x) − h_1(x′)||²)
           = exp(−½ [h_1(x)^T h_1(x) − 2 h_1(x)^T h_1(x′) + h_1(x′)^T h_1(x′)])
           = exp(−½ [k_1(x, x) − 2 k_1(x, x′) + k_1(x′, x′)])

I A closed form. . . let’s do it again!
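The composition needs only kernel evaluations, never the (infinite-dimensional) SE features themselves. A minimal sketch of the composition step, using an SE base kernel purely as an example:

```python
import numpy as np

def compose_with_se(k):
    """Given a kernel k, return the kernel of SE features applied to k's feature map."""
    def k_next(x, xp):
        return np.exp(-0.5 * (k(x, x) - 2.0 * k(x, xp) + k(xp, xp)))
    return k_next

# Example base kernel: squared-exponential on the raw inputs.
k1 = lambda x, xp: np.exp(-0.5 * (x - xp) ** 2)

k2 = compose_with_se(k1)   # "let's do it again":
k3 = compose_with_se(k2)

print(k1(0.0, 1.0), k2(0.0, 1.0), k3(0.0, 1.0))
```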
Repeated Fixed Feature Mappings
[Network diagram: inputs x1, x2, x3 → fixed feature layers h^(1), h^(2), h^(3), h^(4), each infinitely wide → output f(x)]
Infinitely Deep Kernels
I For the SE kernel, k_{L+1}(x, x′) = exp(k_L(x, x′) − 1).
I What is the limit of composing SE features?
[Figure: left panel (Kernel) “Deep RBF Network Kernel Functions” plots cov(f(x), f(x′)) against x − x′ for 1, 2, 3, and ∞ layers; right panel (Draws from GP prior) shows “Draws from a GP with Deep RBF Kernels” for the same depths.]
I k_∞(x, x′) = 1 everywhere, so the limiting kernel is degenerate.
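A quick numerical check (a sketch, not from the talk) that the recursion k_{L+1} = exp(k_L − 1) flattens every kernel value toward 1:

```python
import numpy as np

# Start from SE kernel values at a few separations |x - x'|.
k = np.exp(-0.5 * np.linspace(0.0, 4.0, 5) ** 2)   # k_1 at |x - x'| = 0, 1, 2, 3, 4
for layer in range(500):
    k = np.exp(k - 1.0)        # k_{L+1}(x, x') = exp(k_L(x, x') - 1)
print(k)                        # all entries climb toward 1: the kernel flattens with depth
```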
A simple fix
I Following a suggestion from Neal (1995), we connect the inputs x to each layer:

Standard architecture:
x → f^(1)(x) → f^(2)(x) → f^(3)(x) → f^(4)(x)

Input-connected architecture:
x → f^(1)(x) → f^(2)(x) → f^(3)(x) → f^(4)(x), with x also fed directly into every layer
A simple fix
k_{L+1}(x, x′) = exp(−½ || [h_L(x); x] − [h_L(x′); x′] ||²)
             = exp(−½ [k_L(x, x) − 2 k_L(x, x′) + k_L(x′, x′)] − ½ ||x − x′||²)
Infinitely deep kernels, take two
I What is the limit of compositions of input-connected SE features?
I k_{L+1}(x, x′) = exp(k_L(x, x′) − 1 − ½ ||x − x′||²).
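The same iteration with the input connection (again a sketch): the extra −½||x − x′||² term makes the fixed point depend on the separation, so the limiting kernel no longer collapses to a constant.

```python
import numpy as np

d = np.linspace(0.0, 4.0, 5)              # separations |x - x'|
k = np.exp(-0.5 * d ** 2)                 # k_1: SE kernel
for layer in range(500):
    k = np.exp(k - 1.0 - 0.5 * d ** 2)    # input-connected recursion
print(k)   # the limit still decays with |x - x'| instead of flattening to 1
```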
[Figure: left panel (Kernel) “Deep RBF Network Kernel Functions” plots cov(f(x), f(x′)) against x − x′ for 1, 2, 3, and ∞ layers of the input-connected architecture; right panel (Draws from GP prior) shows “Draws from a GP with Deep RBF Kernels” for the same depths.]
I Like an Ornstein-Uhlenbeck process with skinny tails.
I Samples are non-differentiable (fractal).
Not very exciting...
I A fixed feature mapping, unlikely to be useful for anything.
I The power of neural nets comes from learning a custom representation.
Deep Gaussian Processes
I A prior over compositions of functions:
f^(1:L)(x) = f^(L)(f^(L−1)(· · · f^(2)(f^(1)(x)) · · · ))   (1)

with each f^(ℓ)_d drawn independently from GP(0, k^(ℓ)_d(x, x′)).

I Can be seen as a “simpler” version of Bayesian neural nets.
I Two equivalent architectures.
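A minimal sketch (assuming SE kernels and a 1-D input grid; not code from the talk) of how one draw from a one-neuron-per-layer deep GP can be generated: sample a GP at the current inputs, then feed the sampled values back in as the next layer's inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

def se_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def gp_draw(inputs):
    """One draw from GP(0, k) evaluated at `inputs` (jitter added for stability)."""
    K = se_kernel(inputs, inputs) + 1e-6 * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.normal(size=len(inputs))

x = np.linspace(-4.0, 4.0, 300)
h = x.copy()
for layer in range(5):     # f^(1:5)(x) = f^(5)(f^(4)(... f^(1)(x) ...))
    h = gp_draw(h)         # next layer's GP evaluated at the previous layer's outputs
# `h` now holds one draw of a 5-layer, one-neuron-per-layer deep GP at the grid points.
print(h[:5])
```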
Deep GPs as nonparametric nets
[Network diagram: inputs x1, x2, x3 (x) → … → targets t]
I A neural net where each neuron’s activation function is drawn from a Gaussian process prior.
I Avoids the problem of unit saturation (as with sigmoidal units).
I Each draw from the neural-net prior gives a function y = f(x).
I In this talk we only consider noiseless functions.
Deep GPs as infinitely wide parametric nets
[Network diagram: inputs x → fixed, infinitely wide feature layers h^(1), h^(2), h^(3), … alternating with random finite layers f^(1)(x), f^(1:2)(x), … → output y]
I Infinitely-wide fixed feature maps alternating with finite linear information bottlenecks:

h^(ℓ)(x) = σ(b^(ℓ) + [V^(ℓ) W^(ℓ−1)] h^(ℓ−1)(x))
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 1 Composition” — f(x) against x, 1 layer]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 2 Composition” — f(x) against x, 2 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 3 Composition” — f(x) against x, 3 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 4 Composition” — f(x) against x, 4 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 5 Composition” — f(x) against x, 5 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 6 Composition” — f(x) against x, 6 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 7 Composition” — f(x) against x, 7 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 8 Composition” — f(x) against x, 8 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 9 Composition” — f(x) against x, 9 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 10 Composition” — f(x) against x, 10 layers]
The size of the derivative becomes log-normally distributed.
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
1 Layer
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
2 Layers
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
3 Layers
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
4 Layers
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
5 Layers
Density concentrates along filaments.
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
No warping
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
1 Layer
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
2 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
3 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
4 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
5 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
10 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
20 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
40 Layers
Representation only changes in one direction locally.
What makes a good representation?
[Diagram: a data manifold with its tangent and orthogonal directions]
I Good representations of data manifolds don’t change in directions orthogonal to the manifold. (Rifai et al., 2011)
I Good representations also change in directions tangent to the manifold, to preserve information.
I The representation of a D-dimensional manifold should change, locally, in D orthogonal directions.
I Our prior on functions might be too restrictive.
Analysis of Jacobian
[Figure: normalized singular values of the Jacobian, by singular value index, for 2-layer and 6-layer networks — the distribution of normalized singular values of the Jacobian of functions drawn from a 5-dimensional deep GP prior.]
I Lemma from the paper: the Jacobian of a deep GP is a product of i.i.d. random Gaussian matrices.
I The output changes with respect to only one direction as the net deepens.
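A small illustration of the lemma's consequence (a sketch, not the paper's experiment): multiply i.i.d. Gaussian matrices and normalize the singular values; as the product deepens, a single direction dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5   # input/output dimension, matching the 5-D example above

def normalized_singvals(depth):
    """Normalized singular values of a product of `depth` i.i.d. Gaussian matrices."""
    J = np.eye(D)
    for _ in range(depth):
        J = rng.normal(size=(D, D)) @ J   # Jacobian = product of i.i.d. Gaussian matrices
    s = np.linalg.svd(J, compute_uv=False)
    return s / s.max()

print(normalized_singvals(2))    # several non-negligible directions
print(normalized_singvals(20))   # essentially one dominant direction
```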
A simple fix
I Following a suggestion from Neal (1995), we connect the inputs x to each layer:

Standard architecture:
x → f^(1)(x) → f^(2)(x) → f^(3)(x) → f^(4)(x)

Input-connected architecture:
x → f^(1)(x) → f^(2)(x) → f^(3)(x) → f^(4)(x), with x also fed directly into every layer
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 1 Composition” — f(x) against x, 1 layer]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 2 Composition” — f(x) against x, 2 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 3 Composition” — f(x) against x, 3 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 4 Composition” — f(x) against x, 4 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 5 Composition” — f(x) against x, 5 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 6 Composition” — f(x) against x, 6 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 7 Composition” — f(x) against x, 7 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 8 Composition” — f(x) against x, 8 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 9 Composition” — f(x) against x, 9 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 10 Composition” — f(x) against x, 10 layers]
Greater variety of derivatives.
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
1 Layer
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
2 Layers
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
3 Layers
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
4 Layers
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
5 Layers
Density becomes more complex but remains 2D.
A different architecture (show video)
I Color shows y that each x is mapped to
No warping
A different architecture (show video)
I Color shows y that each x is mapped to
2 Layers
A different architecture (show video)
I Color shows y that each x is mapped to
10 Layers
A different architecture (show video)
I Color shows y that each x is mapped to
20 Layers
A different architecture (show video)
I Color shows y that each x is mapped to
40 Layers
Representation sometimes depends on all directions.
Understanding dropout
I Dropout is a method for regularizing neural networks (Hinton et al., 2012; Srivastava, 2013).
I Recipe:
1. Randomly set to zero (drop out) some neuron activations.
2. Average over all possible ways of doing this.
I Gives robustness, since neurons can’t depend on each other.
I How does dropout affect priors on functions?
I Related work: (Baldi and Sadowski, 2013; Cho, 2013; Wager, Wang and Liang, 2013)
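A minimal sketch of the training-time half of the recipe for one layer (hypothetical shapes; the usual “inverted dropout” rescaling is applied so the expected activation matches the averaged network):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Zero each activation with probability p; rescale survivors so the expectation is unchanged."""
    if not training:
        return activations                       # at test time, use the full (averaged) net
    mask = rng.random(activations.shape) >= p    # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)

h = rng.normal(size=10)    # some hidden-layer activations
print(dropout(h))          # roughly half the units zeroed, survivors scaled by 2
```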
Dropout on Feature Activations
[Network diagram: inputs x1, x2, x3 → features h1(x), …, h5(x) → output f(x)]

Original formulation:

f(x) = (1/K) ∑_{i=1}^K w_i h_i(x)

with any weight distribution,

E[w_i] = 0,  V[w_i] = σ²

by CLT, gives a GP as K → ∞:

cov[f(x), f(x′)] → (σ²/K) ∑_{i=1}^K h_i(x) h_i(x′)
Dropout on Feature Activations
[Network diagram: inputs x1, x2, x3 → features h1(x), …, h5(x) → output f(x)]

Remove units with probability ½:

f(x) = (1/K) ∑_{i=1}^K r_i w_i h_i(x),   r_i ∼ iid Ber(½)

with any weight distribution,

E[r_i w_i] = 0,  V[r_i w_i] = ½ σ²

by CLT, gives a GP as K → ∞:

cov[f(x), f(x′)] → ½ (σ²/K) ∑_{i=1}^K h_i(x) h_i(x′)
Dropout on Feature Activations
[Network diagram: inputs x1, x2, x3 → features h1(x), …, h5(x) → output f(x)]

Double the output variance:

f(x) = (√2/K) ∑_{i=1}^K r_i w_i h_i(x),   r_i ∼ iid Ber(½)

with any weight distribution,

E[√2 r_i w_i] = 0,  V[√2 r_i w_i] = (2/2) σ²

by CLT, gives a GP as K → ∞:

cov[f(x), f(x′)] → (2/2) (σ²/K) ∑_{i=1}^K h_i(x) h_i(x′)
Dropout on Feature Activations
I Dropout on feature activations gives the same GP.
I Averaging the same model doesn’t do anything.
I GPs were doing dropout all along?
I GPs are strange because any one feature doesn’t matter.
I Is there a better way to drop out features that would lead to robustness?
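A Monte Carlo check of the argument above (a sketch; arbitrary fixed features, and the sum scaled by 1/√K as in the earlier sketch): dropping units with probability ½ halves the covariance, and scaling the weights by √2 restores the original GP covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_draws, sigma2 = 2000, 20000, 1.0

# Hypothetical fixed features at two nearby inputs x and x'.
hx  = rng.normal(size=K)
hxp = hx + 0.3 * rng.normal(size=K)

W = rng.normal(scale=np.sqrt(sigma2), size=(n_draws, K))   # weight draws
R = rng.integers(0, 2, size=(n_draws, K))                  # Bernoulli(1/2) dropout masks

def mc_cov(scale):
    """Empirical cov[f(x), f(x')] when weights are scaled by `scale` and masked by R."""
    fx  = (scale * R * W) @ hx  / np.sqrt(K)
    fxp = (scale * R * W) @ hxp / np.sqrt(K)
    return np.mean(fx * fxp)

target = sigma2 / K * hx @ hxp             # covariance of the original (no-dropout) GP
print(mc_cov(1.0), target / 2)             # dropout alone: covariance is halved
print(mc_cov(np.sqrt(2.0)), target)        # rescale weights by sqrt(2): original GP recovered
```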
Dropout on GP inputs
[Network diagram: inputs x1, x2 (x) → output f(x) (y)]

I Each function only depends on some input dimensions.
I Given prior covariance cov[f(x), f(x′)] = k(x, x′), exact dropout gives a mixture of GPs:

p(f(x)) = (1/2^D) ∑_{r∈{0,1}^D} GP(0, k(r^T x, r^T x′))

I Can be viewed as a spike-and-slab ARD prior.
Covariance before and after dropout

Original squared-exp:  cov[f(x), f(x′)] = k(x, x′)
After dropout:  cov[f(x), f(x′)] = ∑_{r∈{0,1}^D} k(r^T x, r^T x′)

[Figure: both covariance functions plotted against x − x′]

I A sum of many functions, each depending only on a subset of the inputs.
I The output stays similar even if some input dimensions change a lot.
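A small illustration (a sketch, assuming the dropout mask acts elementwise on the input dimensions, i.e. the spike-and-slab ARD reading): for a 2-D SE kernel, averaging the kernel over all masks keeps the covariance high when only one input dimension changes a lot.

```python
import numpy as np
from itertools import product

def se(x, xp):
    """Squared-exponential kernel on vectors."""
    d = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return np.exp(-0.5 * d @ d)

def dropout_kernel(x, xp, D=2):
    """Average the kernel over all 2^D binary masks of the input dimensions."""
    total = 0.0
    for r in product([0, 1], repeat=D):
        r = np.array(r)
        total += se(r * x, r * xp)
    return total / 2 ** D

x      = np.array([0.0, 0.0])
xp_far = np.array([5.0, 0.0])   # first dimension changes a lot, second not at all

print(se(x, xp_far))              # ~0: the plain SE kernel treats the points as unrelated
print(dropout_kernel(x, xp_far))  # stays well above 0: masks dropping dim 0 still match
```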
Summary
I Priors on functions can shed light on design choices in a data-independent way.
I Example 1: Increasing depth makes net outputs change in fewer input directions.
I Example 2: Dropout makes output similar even if some inputs change a lot.
I What sorts of structures do we want to be able to learn?
Thanks!