Analyzing Priors on Deep Networks
David Duvenaud, Oren Rippel, Ryan Adams, Zoubin Ghahramani
Sheffield Workshop on Deep Probabilistic Models
October 2, 2014
Designing neural nets
I Neural nets require lots of design decisions whose implications are hard to understand.
I We want to understand them without reference to a specific dataset, loss function, or training method.
I We can analyze different network architectures by looking at nets whose parameters are drawn randomly.
Why look at priors if I’m going to learn everything anyway?
I When using Bayesian neural nets:
I Can’t learn types of networks having vanishing probability under the prior.
I Even when non-probabilistic:
I Good prior → a good initialization strategy.
I Good prior → a good regularization strategy.
I Good prior → a higher fraction of parameter settings specify reasonable models → an easier optimization problem.
GPs as Neural Nets
[Network diagram: inputs x1, x2, x3 → hidden units h1, h2, …, h∞ → output f(x)]
A weighted sum of features,

f(x) = (1/K) ∑_{i=1}^K w_i h_i(x)

with any weight distribution,

E[w_i] = 0,  V[w_i] = σ²,  i.i.d.

by CLT, gives a GP as K → ∞:

cov[f(x), f(x′)] → (σ²/K) ∑_{i=1}^K h_i(x) h_i(x′)
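A quick numerical check of this limit (a minimal sketch, not from the talk): draw many random weight vectors for a fixed feature map and compare the empirical covariance of f(x) and f(x′) with (σ²/K) ∑ h_i(x) h_i(x′). The feature map below is an arbitrary illustrative choice, and the sum is scaled by 1/√K rather than 1/K so the limiting covariance matches the displayed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed feature map (random cosine features, purely for illustration).
K = 2000                                    # number of hidden units / features
omegas = rng.normal(size=K)                 # fixed "frequencies" defining h_i
phases = rng.uniform(0, 2 * np.pi, size=K)

def h(x):
    """Fixed feature vector h(x) of length K."""
    return np.sqrt(2.0) * np.cos(omegas * x + phases)

sigma2 = 1.0       # weight variance
x, xp = 0.3, 1.1   # two inputs to compare

# Monte Carlo over weight draws: f(x) = (1/sqrt(K)) * sum_i w_i h_i(x)
# (1/sqrt(K) keeps the variance finite; the slides write the 1/K normalization informally.)
n_draws = 20000
W = rng.normal(scale=np.sqrt(sigma2), size=(n_draws, K))
fx  = W @ h(x)  / np.sqrt(K)
fxp = W @ h(xp) / np.sqrt(K)

empirical_cov = np.mean(fx * fxp)                  # estimates cov[f(x), f(x')] (means are 0)
kernel        = sigma2 / K * np.dot(h(x), h(xp))   # sigma^2/K * sum_i h_i(x) h_i(x')
print(empirical_cov, kernel)                       # the two numbers should roughly agree
```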
Kernel learning as feature learning
I GPs have fixed features and integrate out the feature weights.
I Mapping between kernels and features: k(x, x′) = h(x)^T h(x′).
I Any PSD kernel can be written as an inner product of features (Mercer’s theorem).
I Kernel learning = feature learning.
I What if we make the GP’s neural network deep?
Example deep kernel: Periodic
[Network diagram: input x → fixed features sin(x), cos(x) → hidden units h^(2)_1, h^(2)_2, …, h^(2)_∞ → output f(x)]
Now our model is:

h_1(x) = [sin(x), cos(x)]

and we have the “deep kernel”:

k_2(x, x′) = exp(−½ ||h_1(x) − h_1(x′)||²)
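As a sanity check (a minimal sketch, not from the talk), composing the sin/cos layer with SE features can be evaluated directly and compared against its closed form exp(cos(x − x′) − 1), which is periodic in x − x′:

```python
import numpy as np

def h1(x):
    """First-layer feature map: x -> [sin(x), cos(x)]."""
    return np.array([np.sin(x), np.cos(x)])

def k2(x, xp):
    """Deep kernel: SE kernel applied to the sin/cos features."""
    d = h1(x) - h1(xp)
    return np.exp(-0.5 * d @ d)

def k2_closed(x, xp):
    """Same kernel in closed form: exp(cos(x - x') - 1)."""
    return np.exp(np.cos(x - xp) - 1.0)

x, xp = 0.7, 2.9
print(k2(x, xp), k2_closed(x, xp))                      # identical values
print(np.isclose(k2(x, xp + 2 * np.pi), k2(x, xp)))     # periodic with period 2*pi
```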
Deep nets, deep kernels
[Network diagram: inputs x1, x2, x3 → hidden units h^(1)_1, h^(1)_2, …, h^(1)_∞ → hidden units h^(2)_1, h^(2)_2, …, h^(2)_∞ → output f(x)]
Now our model is:

f(x) = (1/K) ∑_{i=1}^K w_i h^(2)_i(h^(1)(x)) = w^T h^(2)(h^(1)(x))

Instead of

k_1(x, x′) = h^(1)(x)^T h^(1)(x′),

we have the “deep kernel”:

k_2(x, x′) = [h^(2)(h^(1)(x))]^T h^(2)(h^(1)(x′))
Deep Kernels
I (Cho, 2012) built kernels by composing feature mappings.
I Composing any kernel k_1 with a squared-exp kernel (SE):

k_2(x, x′) = (h_SE(h_1(x)))^T h_SE(h_1(x′))
           = exp(−½ ||h_1(x) − h_1(x′)||²)
           = exp(−½ [h_1(x)^T h_1(x) − 2 h_1(x)^T h_1(x′) + h_1(x′)^T h_1(x′)])
           = exp(−½ [k_1(x, x) − 2 k_1(x, x′) + k_1(x′, x′)])

I A closed form. . . let’s do it again!
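The composition needs only kernel evaluations, never the (infinite-dimensional) SE features themselves. A minimal sketch of the composition step, using an SE base kernel purely as an example:

```python
import numpy as np

def compose_with_se(k):
    """Given a kernel k, return the kernel of SE features applied to k's feature map."""
    def k_next(x, xp):
        return np.exp(-0.5 * (k(x, x) - 2.0 * k(x, xp) + k(xp, xp)))
    return k_next

# Example base kernel: squared-exponential on the raw inputs.
k1 = lambda x, xp: np.exp(-0.5 * (x - xp) ** 2)

k2 = compose_with_se(k1)   # "let's do it again":
k3 = compose_with_se(k2)

print(k1(0.0, 1.0), k2(0.0, 1.0), k3(0.0, 1.0))
```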
Repeated Fixed Feature Mappings
[Network diagram: inputs x1, x2, x3 → fixed feature layers h^(1), h^(2), h^(3), h^(4), each infinitely wide → output f(x)]
Infinitely Deep Kernels
I For the SE kernel, k_{L+1}(x, x′) = exp(k_L(x, x′) − 1).
I What is the limit of composing SE features?
[Figure: left panel (Kernel) “Deep RBF Network Kernel Functions” plots cov(f(x), f(x′)) against x − x′ for 1, 2, 3, and ∞ layers; right panel (Draws from GP prior) shows “Draws from a GP with Deep RBF Kernels” for the same depths.]
I k_∞(x, x′) = 1 everywhere, so the limiting kernel is degenerate.
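A quick numerical check (a sketch, not from the talk) that the recursion k_{L+1} = exp(k_L − 1) flattens every kernel value toward 1:

```python
import numpy as np

# Start from SE kernel values at a few separations |x - x'|.
k = np.exp(-0.5 * np.linspace(0.0, 4.0, 5) ** 2)   # k_1 at |x - x'| = 0, 1, 2, 3, 4
for layer in range(500):
    k = np.exp(k - 1.0)        # k_{L+1}(x, x') = exp(k_L(x, x') - 1)
print(k)                        # all entries climb toward 1: the kernel flattens with depth
```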
A simple fix
I Following a suggestion from Neal (1995), we connect the inputs x to each layer:

Standard architecture:
x → f^(1)(x) → f^(2)(x) → f^(3)(x) → f^(4)(x)

Input-connected architecture:
x → f^(1)(x) → f^(2)(x) → f^(3)(x) → f^(4)(x), with x also fed directly into every layer
A simple fix
k_{L+1}(x, x′) = exp(−½ || [h_L(x); x] − [h_L(x′); x′] ||²)
             = exp(−½ [k_L(x, x) − 2 k_L(x, x′) + k_L(x′, x′)] − ½ ||x − x′||²)
Infinitely deep kernels, take two
I What is the limit of compositions of input-connected SE features?
I k_{L+1}(x, x′) = exp(k_L(x, x′) − 1 − ½ ||x − x′||²).
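The same iteration with the input connection (again a sketch): the extra −½||x − x′||² term makes the fixed point depend on the separation, so the limiting kernel no longer collapses to a constant.

```python
import numpy as np

d = np.linspace(0.0, 4.0, 5)              # separations |x - x'|
k = np.exp(-0.5 * d ** 2)                 # k_1: SE kernel
for layer in range(500):
    k = np.exp(k - 1.0 - 0.5 * d ** 2)    # input-connected recursion
print(k)   # the limit still decays with |x - x'| instead of flattening to 1
```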
[Figure: left panel (Kernel) “Deep RBF Network Kernel Functions” plots cov(f(x), f(x′)) against x − x′ for 1, 2, 3, and ∞ layers of the input-connected architecture; right panel (Draws from GP prior) shows “Draws from a GP with Deep RBF Kernels” for the same depths.]
I Like an Ornstein-Uhlenbeck process with skinny tails.
I Samples are non-differentiable (fractal).
Not very exciting...
I A fixed feature mapping, unlikely to be useful for anything.
I The power of neural nets comes from learning a custom representation.
Deep Gaussian Processes
I A prior over compositions of functions:
f^(1:L)(x) = f^(L)(f^(L−1)(· · · f^(2)(f^(1)(x)) · · · ))   (1)

with each f^(ℓ)_d drawn independently from GP(0, k^(ℓ)_d(x, x′)).

I Can be seen as a “simpler” version of Bayesian neural nets.
I Two equivalent architectures.
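A minimal sketch (assuming SE kernels and a 1-D input grid; not code from the talk) of how one draw from a one-neuron-per-layer deep GP can be generated: sample a GP at the current inputs, then feed the sampled values back in as the next layer's inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

def se_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def gp_draw(inputs):
    """One draw from GP(0, k) evaluated at `inputs` (jitter added for stability)."""
    K = se_kernel(inputs, inputs) + 1e-6 * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.normal(size=len(inputs))

x = np.linspace(-4.0, 4.0, 300)
h = x.copy()
for layer in range(5):     # f^(1:5)(x) = f^(5)(f^(4)(... f^(1)(x) ...))
    h = gp_draw(h)         # next layer's GP evaluated at the previous layer's outputs
# `h` now holds one draw of a 5-layer, one-neuron-per-layer deep GP at the grid points.
print(h[:5])
```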
Deep GPs as nonparametric nets
[Network diagram: inputs x1, x2, x3 (x) → … → targets t]
I A neural net where each neuron’s activation function is drawn from a Gaussian process prior.
I Avoids the problem of unit saturation (as with sigmoidal units).
I Each draw from the neural-net prior gives a function y = f(x).
I In this talk we only consider noiseless functions.
Deep GPs as infinitely wide parametric nets
[Network diagram: inputs x → fixed, infinitely wide feature layers h^(1), h^(2), h^(3), … alternating with random finite layers f^(1)(x), f^(1:2)(x), … → output y]
I Infinitely-wide fixed feature maps alternating with finite linear information bottlenecks:

h^(ℓ)(x) = σ(b^(ℓ) + [V^(ℓ) W^(ℓ−1)] h^(ℓ−1)(x))
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 1 Composition” — f(x) against x, 1 layer]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 2 Composition” — f(x) against x, 2 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 3 Composition” — f(x) against x, 3 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 4 Composition” — f(x) against x, 4 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 5 Composition” — f(x) against x, 5 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 6 Composition” — f(x) against x, 6 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 7 Composition” — f(x) against x, 7 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 8 Composition” — f(x) against x, 8 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 9 Composition” — f(x) against x, 9 layers]
Priors on deep networks
I A draw from a one-neuron-per-layer deep GP:
[Figure: “Layer 10 Composition” — f(x) against x, 10 layers]
The size of the derivative becomes log-normally distributed.
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
1 Layer
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
2 Layers
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
3 Layers
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
4 Layers
Priors on deep networks
I 2D to 2D warpings of a set of coloured points:
5 Layers
Density concentrates along filaments.
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
No warping
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
1 Layer
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
2 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
3 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
4 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
5 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
10 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
20 Layers
Priors on deep networks
Color shows y that each x is mapped to (decision boundary)
40 Layers
Representation only changes in one direction locally.
What makes a good representation?
[Diagram: a data manifold with its tangent and orthogonal directions]
I Good representations of data manifolds don’t change in directions orthogonal to the manifold. (Rifai et al., 2011)
I Good representations also change in directions tangent to the manifold, to preserve information.
I The representation of a D-dimensional manifold should change, locally, in D orthogonal directions.
I Our prior on functions might be too restrictive.
Analysis of Jacobian
[Figure: normalized singular values of the Jacobian, by singular value index, for 2-layer and 6-layer networks — the distribution of normalized singular values of the Jacobian of functions drawn from a 5-dimensional deep GP prior.]
I Lemma from the paper: the Jacobian of a deep GP is a product of i.i.d. random Gaussian matrices.
I The output changes with respect to only one direction as the net deepens.
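A small illustration of the lemma's consequence (a sketch, not the paper's experiment): multiply i.i.d. Gaussian matrices and normalize the singular values; as the product deepens, a single direction dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5   # input/output dimension, matching the 5-D example above

def normalized_singvals(depth):
    """Normalized singular values of a product of `depth` i.i.d. Gaussian matrices."""
    J = np.eye(D)
    for _ in range(depth):
        J = rng.normal(size=(D, D)) @ J   # Jacobian = product of i.i.d. Gaussian matrices
    s = np.linalg.svd(J, compute_uv=False)
    return s / s.max()

print(normalized_singvals(2))    # several non-negligible directions
print(normalized_singvals(20))   # essentially one dominant direction
```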
A simple fix
I Following a suggestion from Neal (1995), we connect the inputs x to each layer:

Standard architecture:
x → f^(1)(x) → f^(2)(x) → f^(3)(x) → f^(4)(x)

Input-connected architecture:
x → f^(1)(x) → f^(2)(x) → f^(3)(x) → f^(4)(x), with x also fed directly into every layer
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 1 Composition” — f(x) against x, 1 layer]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 2 Composition” — f(x) against x, 2 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 3 Composition” — f(x) against x, 3 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 4 Composition” — f(x) against x, 4 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 5 Composition” — f(x) against x, 5 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 6 Composition” — f(x) against x, 6 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 7 Composition” — f(x) against x, 7 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 8 Composition” — f(x) against x, 8 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 9 Composition” — f(x) against x, 9 layers]
A different architecture
I A draw from a one-neuron-per-layer deep GP, with the input also connected to each layer:
[Figure: “Layer 10 Composition” — f(x) against x, 10 layers]
Greater variety of derivatives.
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
1 Layer
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
2 Layers
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
3 Layers
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
4 Layers
A different architecture
I Input-connected 2D to 2D warpings of coloured points:
5 Layers
Density becomes more complex but remains 2D.
A different architecture (show video)
I Color shows y that each x is mapped to
No warping
A different architecture (show video)
I Color shows y that each x is mapped to
2 Layers
A different architecture (show video)
I Color shows y that each x is mapped to
10 Layers
A different architecture (show video)
I Color shows y that each x is mapped to
20 Layers
A different architecture (show video)
I Color shows y that each x is mapped to
40 Layers
Representation sometimes depends on all directions.
Understanding dropout
I Dropout is a method for regularizing neural networks (Hinton et al., 2012; Srivastava, 2013).
I Recipe:
1. Randomly set to zero (drop out) some neuron activations.
2. Average over all possible ways of doing this.
I Gives robustness, since neurons can’t depend on each other.
I How does dropout affect priors on functions?
I Related work: (Baldi and Sadowski, 2013; Cho, 2013; Wager, Wang and Liang, 2013)
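A minimal sketch of the training-time half of the recipe for one layer (hypothetical shapes; the usual “inverted dropout” rescaling is applied so the expected activation matches the averaged network):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Zero each activation with probability p; rescale survivors so the expectation is unchanged."""
    if not training:
        return activations                       # at test time, use the full (averaged) net
    mask = rng.random(activations.shape) >= p    # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)

h = rng.normal(size=10)    # some hidden-layer activations
print(dropout(h))          # roughly half the units zeroed, survivors scaled by 2
```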
Dropout on Feature Activations
[Network diagram: inputs x1, x2, x3 → features h1(x), …, h5(x) → output f(x)]

Original formulation:

f(x) = (1/K) ∑_{i=1}^K w_i h_i(x)

with any weight distribution,

E[w_i] = 0,  V[w_i] = σ²

by CLT, gives a GP as K → ∞:

cov[f(x), f(x′)] → (σ²/K) ∑_{i=1}^K h_i(x) h_i(x′)
Dropout on Feature Activations
[Network diagram: inputs x1, x2, x3 → features h1(x), …, h5(x) → output f(x)]

Remove units with probability ½:

f(x) = (1/K) ∑_{i=1}^K r_i w_i h_i(x),   r_i ∼ iid Ber(½)

with any weight distribution,

E[r_i w_i] = 0,  V[r_i w_i] = ½ σ²

by CLT, gives a GP as K → ∞:

cov[f(x), f(x′)] → ½ (σ²/K) ∑_{i=1}^K h_i(x) h_i(x′)
Dropout on Feature Activations
[Network diagram: inputs x1, x2, x3 → features h1(x), …, h5(x) → output f(x)]

Double the output variance:

f(x) = (√2/K) ∑_{i=1}^K r_i w_i h_i(x),   r_i ∼ iid Ber(½)

with any weight distribution,

E[√2 r_i w_i] = 0,  V[√2 r_i w_i] = (2/2) σ²

by CLT, gives a GP as K → ∞:

cov[f(x), f(x′)] → (2/2) (σ²/K) ∑_{i=1}^K h_i(x) h_i(x′)
Dropout on Feature Activations
I Dropout on feature activations gives the same GP.
I Averaging the same model doesn’t do anything.
I GPs were doing dropout all along?
I GPs are strange because any one feature doesn’t matter.
I Is there a better way to drop out features that would lead to robustness?
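A Monte Carlo check of the argument above (a sketch; arbitrary fixed features, and the sum scaled by 1/√K as in the earlier sketch): dropping units with probability ½ halves the covariance, and scaling the weights by √2 restores the original GP covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_draws, sigma2 = 2000, 20000, 1.0

# Hypothetical fixed features at two nearby inputs x and x'.
hx  = rng.normal(size=K)
hxp = hx + 0.3 * rng.normal(size=K)

W = rng.normal(scale=np.sqrt(sigma2), size=(n_draws, K))   # weight draws
R = rng.integers(0, 2, size=(n_draws, K))                  # Bernoulli(1/2) dropout masks

def mc_cov(scale):
    """Empirical cov[f(x), f(x')] when weights are scaled by `scale` and masked by R."""
    fx  = (scale * R * W) @ hx  / np.sqrt(K)
    fxp = (scale * R * W) @ hxp / np.sqrt(K)
    return np.mean(fx * fxp)

target = sigma2 / K * hx @ hxp             # covariance of the original (no-dropout) GP
print(mc_cov(1.0), target / 2)             # dropout alone: covariance is halved
print(mc_cov(np.sqrt(2.0)), target)        # rescale weights by sqrt(2): original GP recovered
```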
Dropout on GP inputs
[Network diagram: inputs x1, x2 (x) → output f(x) (y)]

I Each function only depends on some input dimensions.
I Given prior covariance cov[f(x), f(x′)] = k(x, x′), exact dropout gives a mixture of GPs:

p(f(x)) = (1/2^D) ∑_{r∈{0,1}^D} GP(0, k(r^T x, r^T x′))

I Can be viewed as a spike-and-slab ARD prior.
Covariance before and after dropout

Original squared-exp:  cov[f(x), f(x′)] = k(x, x′)
After dropout:  cov[f(x), f(x′)] = ∑_{r∈{0,1}^D} k(r^T x, r^T x′)

[Figure: both covariance functions plotted against x − x′]

I A sum of many functions, each depending only on a subset of the inputs.
I The output stays similar even if some input dimensions change a lot.
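A small illustration (a sketch, assuming the dropout mask acts elementwise on the input dimensions, i.e. the spike-and-slab ARD reading): for a 2-D SE kernel, averaging the kernel over all masks keeps the covariance high when only one input dimension changes a lot.

```python
import numpy as np
from itertools import product

def se(x, xp):
    """Squared-exponential kernel on vectors."""
    d = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return np.exp(-0.5 * d @ d)

def dropout_kernel(x, xp, D=2):
    """Average the kernel over all 2^D binary masks of the input dimensions."""
    total = 0.0
    for r in product([0, 1], repeat=D):
        r = np.array(r)
        total += se(r * x, r * xp)
    return total / 2 ** D

x      = np.array([0.0, 0.0])
xp_far = np.array([5.0, 0.0])   # first dimension changes a lot, second not at all

print(se(x, xp_far))              # ~0: the plain SE kernel treats the points as unrelated
print(dropout_kernel(x, xp_far))  # stays well above 0: masks dropping dim 0 still match
```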
Summary
I Priors on functions can shed light on design choices in a data-independent way.
I Example 1: Increasing depth makes net outputs change in fewer input directions.
I Example 2: Dropout makes output similar even if some inputs change a lot.
I What sorts of structures do we want to be able to learn?
Thanks!