Training and Inference for Deep Gaussian Processes
Keyon Vafa
April 26, 2016
Motivation

An ideal model for prediction is:
- accurate
- computationally efficient
- easy to tune without overfitting
- able to provide certainty estimates
Motivation

This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a new model, having been introduced by Damianou and Lawrence in 2013.

Exact inference is intractable. In this thesis, we introduce a new method to learn deep GPs, the Deep Gaussian Process Sampling (DGPS) algorithm.
Motivation

The DGPS algorithm:
- is more straightforward than existing methods
- can more easily adapt to arbitrary kernels
- relies on Monte Carlo sampling to circumvent the intractability hurdle
- uses pseudo data to ease the computational burden
Table of Contents

1 Gaussian Processes
2 Deep Gaussian Processes
3 Implementation
4 Experiments and Analysis
5 Conclusion

Gaussian Processes
Definition of a Gaussian Process

A function $f$ is a Gaussian process (GP) if any finite set of values $f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)$ has a multivariate normal distribution.

- The inputs $\{\mathbf{x}_n\}_{n=1}^N$ can be vectors from a domain of any size.
- A GP is specified by a mean function $m(\mathbf{x})$ and a covariance function $k(\mathbf{x}, \mathbf{x}')$, where
$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}(f(\mathbf{x}), f(\mathbf{x}')).$$
Covariance Function

- The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP.
- The squared exponential covariance function has the following form:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^T M (\mathbf{x} - \mathbf{x}')\right)$$
- When $M$ is a diagonal matrix, its diagonal elements $l_i^{-2}$ encode the length-scales $l_i$, and $\sigma_f^2$ is known as the signal variance.
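To make the kernel concrete, here is a minimal NumPy sketch of the squared exponential covariance function with a diagonal $M$; the function name and signature are illustrative, not from the thesis.

```python
import numpy as np

def squared_exponential(X1, X2, signal_variance=1.0, length_scales=1.0):
    """k(x, x') = sigma_f^2 * exp(-0.5 * (x - x')^T M (x - x')), M = diag(1 / l_i^2)."""
    X1 = np.atleast_2d(X1) / length_scales  # rescale each dimension by its length-scale
    X2 = np.atleast_2d(X2) / length_scales
    # Pairwise squared Euclidean distances between the rescaled rows
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return signal_variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0))
```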
Sampling from a GP

Figure: Random samples from GP priors, varying the signal variance (0.2, 1.0, 5.0) and the length-scale (0.2, 1.0, 5.0). The length-scale controls the smoothness of our function, while the signal variance controls the deviation from the mean.
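Prior draws like those in the figure can be generated by evaluating the kernel on a grid and sampling a multivariate normal; a sketch, reusing the `squared_exponential` helper above:

```python
x = np.linspace(-5, 5, 200)[:, None]                   # 1-D inputs on a grid
K = squared_exponential(x, x, signal_variance=1.0, length_scales=0.2)
K += 1e-8 * np.eye(len(x))                             # jitter for numerical stability
draws = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
```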
GPs for Regression

Setup: We are given a set of inputs $X \in \mathbb{R}^{N \times D}$ and corresponding outputs $\mathbf{y} \in \mathbb{R}^N$, the function values from a GP evaluated at $X$. We assume a mean function $m(\mathbf{x})$ and a covariance function $k(\mathbf{x}, \mathbf{x}')$, which rely on parameters $\theta$.

We would like to learn the optimal $\theta$ and estimate the function values $\mathbf{y}_*$ for a set of new inputs $X_*$.

- To learn $\theta$, we optimize the marginal likelihood:
$$P(\mathbf{y}|X, \theta) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K_{XX}).$$
- We can then use the multivariate normal conditional distribution to evaluate the predictive distribution:
$$P(\mathbf{y}_*|X_*, X, \mathbf{y}, \theta) = \mathcal{N}\left(K_{X_*X} K_{XX}^{-1} \mathbf{y},\; K_{X_*X_*} - K_{X_*X} K_{XX}^{-1} K_{XX_*}\right).$$
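A sketch of both steps for a noise-free, zero-mean GP, under the same assumptions as the helpers above (in practice a noise term is usually added to $K_{XX}$):

```python
from scipy.stats import multivariate_normal

def log_marginal_likelihood(X, y, signal_variance, length_scales, jitter=1e-8):
    """log P(y | X, theta) = log N(y | 0, K_XX); optimize this w.r.t. theta."""
    K = squared_exponential(X, X, signal_variance, length_scales) + jitter * np.eye(len(X))
    return multivariate_normal(np.zeros(len(X)), K).logpdf(y)

def gp_predict(X_train, y_train, X_test, signal_variance, length_scales, jitter=1e-8):
    """Predictive mean and covariance from the multivariate normal conditional."""
    K = squared_exponential(X_train, X_train, signal_variance, length_scales)
    K += jitter * np.eye(len(X_train))
    K_star = squared_exponential(X_test, X_train, signal_variance, length_scales)
    K_ss = squared_exponential(X_test, X_test, signal_variance, length_scales)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov
```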
GPs for Regression

Note this all holds because we assume the outputs come from a Gaussian process. We therefore make the following assumption:
$$\begin{pmatrix} \mathbf{y} \\ \mathbf{y}_* \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix}, \begin{pmatrix} K_{XX} & K_{XX_*} \\ K_{X_*X} & K_{X_*X_*} \end{pmatrix}\right).$$
Computing $P(\mathbf{y}|X)$ and $P(\mathbf{y}_*|X_*, X, \mathbf{y})$ then only requires matrix algebra on this joint distribution.
Example of a GP for Regression

Figure: On the left, data from a sigmoidal curve with noise. On the right, samples from a GP trained on the data (represented by 'x'), using a squared exponential covariance function.
Deep Gaussian Processes
Definition of a Deep Gaussian Process

Formally, we define a deep Gaussian process as the composition of GPs:
$$\mathbf{f}^{(1:L)}(\mathbf{x}) = \mathbf{f}^{(L)}(\mathbf{f}^{(L-1)}(\dots \mathbf{f}^{(2)}(\mathbf{f}^{(1)}(\mathbf{x})) \dots))$$
where $f_d^{(l)} \sim \mathcal{GP}\left(0, k_d^{(l)}(\mathbf{x}, \mathbf{x}')\right)$ for each $f_d^{(l)} \in \mathbf{f}^{(l)}$.
Deep GP Notation

- Each layer $l$ consists of $D^{(l)}$ GPs, where $D^{(l)}$ is the number of units at layer $l$.
- For an $L$-layer deep GP, we have:
  - one input layer $\mathbf{x}_n \in \mathbb{R}^{D^{(0)}}$
  - $L - 1$ hidden layers $\{\mathbf{h}_n^{(l)}\}_{l=1}^{L-1}$
  - one output layer $y_n$, which we assume to be 1-dimensional.
- All layers are fully connected by GPs, each with its own kernel.
Example: Two-Layer Deep GP

$x_n \xrightarrow{\;f\;} h_n \xrightarrow{\;g\;} y_n$

We have a one-dimensional input $x_n$, a one-dimensional hidden unit $h_n$, and a one-dimensional output $y_n$. This two-layer network consists of two GPs, $f$ and $g$, where
$$h_n = f(x_n), \quad \text{where } f \sim \mathcal{GP}(0, k^{(1)}(x, x'))$$
and
$$y_n = g(h_n), \quad \text{where } g \sim \mathcal{GP}(0, k^{(2)}(h, h')).$$
Example: More Complicated Model

Figure: Graphical representation of a more complicated deep GP architecture. Every edge corresponds to a GP between units, as the outputs of each layer are the inputs of the following layer. Our input data is 3-dimensional, while the two hidden layers in this model each have 4 hidden units.
Sampling From a Deep GP

Figure: Samples from a two-layer deep GP: the full composition $g(f(x))$, layer 1 (length-scale 0.5), and layer 2 (length-scale 1.0). As opposed to single-layer GPs, a deep GP can model non-stationary functions (functions whose shape changes along the input space) without the use of a non-stationary kernel.
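The figure's construction can be sketched by feeding one GP prior draw into another; `sample_gp` below is an illustrative helper, not thesis code:

```python
def sample_gp(inputs, length_scales, signal_variance=1.0):
    """One function draw from a zero-mean GP prior, evaluated at `inputs`."""
    K = squared_exponential(inputs, inputs, signal_variance, length_scales)
    K += 1e-8 * np.eye(len(inputs))
    return np.random.multivariate_normal(np.zeros(len(inputs)), K)

x = np.linspace(-6, 6, 300)[:, None]
f_x = sample_gp(x, length_scales=0.5)               # layer 1
g_f_x = sample_gp(f_x[:, None], length_scales=1.0)  # layer 2, applied to layer 1's output
```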
Comparison with Neural Networks

- Similarities: deep architectures; fully connected; a single-layer GP corresponds to a two-layer neural network with random weights and infinitely many hidden units.
- Differences: a deep GP is nonparametric, has no activation functions, requires specifying kernels, and its training is intractable.
Implementation
FITC Approximation for Single-Layer GP

- The Fully Independent Training Conditional (FITC) approximation circumvents the $O(N^3)$ training time for a single-layer GP by introducing pseudo data: points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005).
- We introduce $M$ pseudo inputs $\bar{X} = \{\bar{\mathbf{x}}_m\}_{m=1}^M$ and corresponding pseudo outputs $\bar{\mathbf{y}} = \{\bar{y}_m\}_{m=1}^M$, the function values at the pseudo inputs.
- Key assumption: conditioned on the pseudo data, the output values are independent.
FITC Approximation for Single-Layer GP

- We assume a prior $P(\bar{\mathbf{y}}|\bar{X}) = \mathcal{N}(\mathbf{0}, K_{\bar{X}\bar{X}})$.
- Training takes time $O(NM^2)$, and testing requires $O(M^2)$.
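A sketch of the FITC predictive equations used throughout this section, conditioning on pseudo inputs and outputs and keeping only the diagonal of the predictive covariance (names and signature are illustrative):

```python
def fitc_predict(X, X_bar, y_bar, signal_variance, length_scales, jitter=1e-8):
    """Predictive mean and diagonal variance at X given pseudo data (X_bar, y_bar)."""
    X = np.atleast_2d(X)
    K_mm = squared_exponential(X_bar, X_bar, signal_variance, length_scales)
    K_mm += jitter * np.eye(len(X_bar))
    K_nm = squared_exponential(X, X_bar, signal_variance, length_scales)
    k_nn = signal_variance * np.ones(len(X))  # diag(K_XX) for the SE kernel
    mean = K_nm @ np.linalg.solve(K_mm, y_bar)
    # diag(K_XX - K_Xm K_mm^{-1} K_mX), computed without forming the full N x N matrix
    var = k_nn - np.sum(K_nm * np.linalg.solve(K_mm, K_nm.T).T, axis=1)
    return mean, np.maximum(var, 0.0)
```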
FITC Example

Figure: The predictive mean of a GP trained on sigmoidal data using the FITC approximation. On the left, we use 5 pseudo data points; on the right, we use 10.
Learning Deep GPs is Intractable

Example: a two-layer model with inputs $X$, outputs $\mathbf{y}$, and hidden layer $H$ (which is $N \times 1$), where $f$ (with parameters $\theta^{(1)}$) maps $X$ to $H$ and $g$ (with parameters $\theta^{(2)}$) maps $H$ to $\mathbf{y}$. Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate
$$P(\mathbf{y}|X, \theta) = \int P(\mathbf{y}|H, \theta^{(2)})\, P(H|X, \theta^{(1)})\, dH = \int \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K_{HH})\, \mathcal{N}(H \mid \mathbf{0}, K_{XX})\, dH.$$
Evaluating this integral is intractable: $H$ enters the first Gaussian through the kernel matrix $K_{HH}$, so the product cannot be integrated in closed form.
DGPS Algorithm Overview

The Deep Gaussian Process Sampling algorithm relies on two central ideas:
- We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation to evaluate the gradients and optimize our objective.
- We replace every GP with a FITC GP, so the time complexity for $L$ layers and $H$ hidden units per layer is $O(N^2MLH)$ as opposed to $O(N^3LH)$.
Related Work

- Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations.
- These methods are able to integrate out the pseudo outputs at each layer, but they rely on integral approximations that restrict the kernel. The DGPS instead uses Monte Carlo sampling, which is easier to implement, more intuitive to understand, and extends easily to most kernels.
Sampling Hidden Values

- For inputs $X$, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution.
- For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer.
- We use $K$ different samples $\{(\mu_k, \Sigma_k)\}_{k=1}^K$ to approximate the marginal likelihood:
$$P(\mathbf{y}|X) \approx \frac{1}{K} \sum_{k=1}^{K} P(\mathbf{y} \mid \mu_k, \Sigma_k) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}(\mu_k, \Sigma_k)$$
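A sketch of this forward-sampling pass for a chain of one-unit hidden layers, reusing `fitc_predict`; the parameter layout is an assumption for illustration:

```python
def sample_hidden_layers(X, hidden_params, num_samples):
    """Propagate X through the hidden layers by sampling each FITC predictive.
    `hidden_params` holds one (X_bar, y_bar, signal_variance, length_scales)
    tuple per hidden layer (a single unit per layer, for brevity)."""
    samples = []
    for _ in range(num_samples):
        H = X
        for (X_bar, y_bar, sv, ls) in hidden_params:
            mean, var = fitc_predict(H, X_bar, y_bar, sv, ls)
            H = (mean + np.sqrt(var) * np.random.randn(len(mean)))[:, None]
        samples.append(H)
    return samples
```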
FITC for Deep GPs

- To make fitting more scalable, we replace every GP in the model with a FITC GP.
- For each GP, corresponding to hidden unit $d$ in layer $l$, we introduce pseudo inputs $\bar{X}_d^{(l)}$ and corresponding pseudo outputs $\bar{\mathbf{y}}_d^{(l)}$.
- With the addition of the pseudo data, we are required to learn the following set of parameters:
$$\Theta = \left\{ \left\{ \bar{X}_d^{(l)}, \bar{\mathbf{y}}_d^{(l)}, \theta_d^{(l)} \right\}_{d=1}^{D^{(l)}} \right\}_{l=1}^{L}.$$
Example: DGPS Algorithm on 2 Layers

The model is $X_n \xrightarrow{\;f\;} H_n \xrightarrow{\;g\;} y_n$, with pseudo data and kernel parameters $(\bar{X}^{(1)}, \bar{\mathbf{y}}^{(1)}, \theta^{(1)})$ for $f$ and $(\bar{X}^{(2)}, \bar{\mathbf{y}}^{(2)}, \theta^{(2)})$ for $g$. Our goal is to learn:
- $\{(\bar{X}^{(l)}, \bar{\mathbf{y}}^{(l)})\}_{l=1}^2$, the pseudo data for each layer
- $\theta^{(1)}$ and $\theta^{(2)}$, the kernel parameters for $f$ and $g$
Example: DGPS Algorithm on 2 Layers

To sample values $H$ from the hidden layer, we use the FITC approximation and assume
$$P\left(H \mid X, \bar{X}^{(1)}, \bar{\mathbf{y}}^{(1)}\right) = \mathcal{N}\left(\mu^{(1)}, \Sigma^{(1)}\right)$$
where
$$\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{\mathbf{y}}^{(1)}$$
$$\Sigma^{(1)} = \mathrm{diag}\left(K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X}\right).$$
We obtain $K$ samples $\{H_k\}_{k=1}^K$ from the above distribution.
Example: DGPS Algorithm on 2 Layers

For each sample $H_k$, we can approximate
$$P\left(\mathbf{y} \mid H_k, \bar{X}^{(2)}, \bar{\mathbf{y}}^{(2)}\right) \approx \mathcal{N}\left(\mu^{(2)}, \Sigma^{(2)}\right)$$
where
$$\mu^{(2)} = K_{H_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} \bar{\mathbf{y}}^{(2)}$$
$$\Sigma^{(2)} = \mathrm{diag}\left(K_{H_kH_k} - K_{H_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} K_{\bar{X}^{(2)}H_k}\right).$$
Example: DGPS Algorithm on 2 Layers

Thus, we can approximate the marginal likelihood with our samples:
$$P(\mathbf{y}|X, \Theta) \approx \frac{1}{K} \sum_{k=1}^{K} P\left(\mathbf{y} \mid H_k, \bar{X}^{(2)}, \bar{\mathbf{y}}^{(2)}\right).$$
Incorporating the prior over the pseudo outputs into our objective, we have:
$$\mathcal{L}(\mathbf{y}|X, \Theta) = \log P(\mathbf{y}|X, \Theta) + \sum_{l=1}^{L} \sum_{d=1}^{D^{(l)}} \log P\left(\bar{\mathbf{y}}_d^{(l)} \mid \bar{X}_d^{(l)}\right).$$
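Putting the pieces together, a sketch of the Monte Carlo objective for the two-layer example; the names and parameter layout are illustrative, and the thesis optimizes this quantity with automatic differentiation rather than plain NumPy:

```python
from scipy.special import logsumexp

def dgps_objective(X, y, hidden_params, output_params, num_samples=10):
    """Monte Carlo estimate of log P(y | X, Theta) plus the log priors
    on the pseudo outputs at every layer."""
    X_bar2, y_bar2, sv2, ls2 = output_params
    sample_logliks = []
    for H_k in sample_hidden_layers(X, hidden_params, num_samples):
        mean, var = fitc_predict(H_k, X_bar2, y_bar2, sv2, ls2)
        var = np.maximum(var, 1e-8)
        # log N(y | mean, diag(var))
        sample_logliks.append(-0.5 * np.sum((y - mean)**2 / var + np.log(2 * np.pi * var)))
    log_marginal = logsumexp(sample_logliks) - np.log(num_samples)  # log of the average
    log_prior = log_marginal_likelihood(X_bar2, y_bar2, sv2, ls2)   # log P(y_bar | X_bar)
    for (X_bar, y_bar, sv, ls) in hidden_params:
        log_prior += log_marginal_likelihood(X_bar, y_bar, sv, ls)
    return log_marginal + log_prior
```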
Experiments and Analysis
Step Function

- We test on a step function with noise: $X \in [-2, 2]$, $y_i = \mathrm{sign}(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 0.01)$.
- The non-stationarity of a step function is appealing from a deep GP perspective.
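For reference, one way to generate this dataset (the experiments below vary N across 50, 100, and 200):

```python
N = 100
x_step = np.sort(np.random.uniform(-2, 2, size=N))[:, None]
y_step = np.sign(x_step[:, 0]) + np.sqrt(0.01) * np.random.randn(N)
```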
Figure: Functions sampled from a single-layer GP. Evidently, the predictive draws do not fully capture the shape of the step function.
Figure: Predictive draws from a single-layer GP and a two-layer deep GP.
Figure: Predictive draws from a three-layer deep GP.
Figure: Impact of parameter initializations on predictive draws. Predictive draws and learned hidden values for a deep GP under a random initialization and a smart initialization.
Figure: Experimental results measuring the test log-likelihood per data point and test mean squared error on the noisy step function. We vary the number of layers used in the model (1, 2, 3), along with the number of data points in the original step function (50, 100, 200, each divided 80/20 into train/test). We run 10 trials at each combination.
Occasionally, models with deeper architectures outperform those that are more shallow, yet they also possess the widest distributions and the trials with the worst results.

Figure: Test set log-likelihoods per data point and mean squared errors plotted against their training set counterparts for the step function experiment, by number of layers. Overfitting does not appear to be a problem.
- Overfitting does not appear to be a problem.
- If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones.
- However, training and successful optimization become more difficult as the number of layers grows and the number of parameters increases.
Figure: Predictive draws and hidden values from two identical three-layer models, albeit with different random parameter initializations (random seeds 66 and 0).
Ways to combat optimization challenges:
- using random restarts
- decreasing the number of model parameters
- trying different optimization methods
- experimenting with more diverse architectures, e.g. increasing the dimension of the hidden layers
Toy Non-Stationary Data

- We create toy non-stationary data to evaluate a deep GP's ability to learn a non-stationary function.
- We divide the input space into three regions: $X_1 \in [-4, -3]$, $X_2 \in [-1, 1]$, and $X_3 \in [2, 4]$, each of which consists of 40 data points.
- We sample from a GP with length-scale $l = 0.25$ for regions $X_1$ and $X_3$, and $l = 2$ for region $X_2$.
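A sketch of this construction, drawing each region from an independent GP prior via the `sample_gp` helper above (whether the regions share a single underlying draw is an assumption here):

```python
regions = [(-4.0, -3.0, 0.25), (-1.0, 1.0, 2.0), (2.0, 4.0, 0.25)]  # (low, high, length-scale)
xs, ys = [], []
for low, high, ls in regions:
    x_r = np.sort(np.random.uniform(low, high, size=40))[:, None]
    xs.append(x_r)
    ys.append(sample_gp(x_r, length_scales=ls))
x_toy, y_toy = np.vstack(xs), np.concatenate(ys)
```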
Figure: Predictive draws (and learned hidden values) from the single-layer and 2-layer models for toy non-stationary data with squared exponential kernels.
Figure: The optimization for a 3-layer model can get stuck in a local optimum; although the predictive draws are non-stationary, our predictions are poor at the tails.
Motorcycle Data

- 94 points, where the inputs are times in milliseconds since impact in a motorcycle accident and the outputs are the corresponding helmet accelerations.
- The dataset is somewhat non-stationary, as the accelerations are roughly constant early on but become more varied after a certain time.
Figure: Predictive draws (and learned hidden values) from the single-layer and 2-layer models trained on the motorcycle data with squared exponential kernels.
Conclusion
Future Directions

Natural extensions include:
- trying different optimization methods to avoid getting stuck in local optima
- introducing variational parameters so we do not have to learn pseudo outputs
- extending the model to classification
- exploring properties of more complex architectures, and evaluating the model likelihood to choose an optimal configuration
Acknowledgments

A huge thank-you to Sasha Rush, Finale Doshi-Velez, David Duvenaud, and Miguel Hernandez-Lobato. This thesis would not be possible without your help and support!