Training and Inference for Deep Gaussian Processes


Keyon Vafa

April 26, 2016


Motivation

An ideal model for prediction is

accurate

computationally efficient

easy to tune without overfitting

able to provide certainty estimates


Motivation

This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a new model, introduced by Damianou and Lawrence in 2013.

Exact inference is intractable. In this thesis, we introduce a new method to learn deep GPs, the Deep Gaussian Process Sampling algorithm (DGPS).


Motivation

The DGPS algorithm

is more straightforward than existing methods

can more easily adapt to using arbitrary kernels

relies on Monte Carlo sampling to circumvent the intractability hurdle

uses pseudo data to ease the computational burden


Gaussian Processes

Table of Contents

1 Gaussian Processes

2 Deep Gaussian Processes

3 Implementation

4 Experiments and Analysis

5 Conclusion


Gaussian Processes

Definition of a Gaussian Process

A function $f$ is a Gaussian process (GP) if any finite set of values $f(x_1), \ldots, f(x_N)$ has a multivariate normal distribution.

The inputs $\{x_n\}_{n=1}^N$ can be vectors from a domain of arbitrary dimension.

A GP is specified by a mean function $m(x)$ and a covariance function $k(x, x')$, where
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathrm{Cov}(f(x), f(x')).$$


Gaussian Processes

Covariance Function

The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP.

The squared exponential covariance function has the following form:
$$k(x, x') = \sigma_f^2 \exp\left(-\tfrac{1}{2}(x - x')^T M (x - x')\right)$$

When $M$ is a diagonal matrix, its diagonal elements are the inverse squared length-scales, $l_i^{-2}$; $\sigma_f^2$ is known as the signal variance.

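A minimal NumPy sketch of this kernel may help make the notation concrete; the function name, the use of one length-scale per input dimension, and the NumPy setup are our own illustration rather than code from the thesis.

```python
import numpy as np

def se_kernel(X1, X2, signal_var=1.0, lengthscales=1.0):
    """Squared exponential kernel k(x, x') = sigma_f^2 exp(-0.5 (x - x')^T M (x - x'))
    with M = diag(1 / lengthscales^2). X1 has shape (N1, D) and X2 has shape (N2, D)."""
    X1 = np.atleast_2d(X1) / lengthscales
    X2 = np.atleast_2d(X2) / lengthscales
    # Pairwise squared Euclidean distances between the rescaled inputs.
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return signal_var * np.exp(-0.5 * np.maximum(sq_dists, 0.0))
```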

Gaussian Processes

Sampling from a GP

[Figure: random samples from GP priors, varying the length-scale (0.2, 1.0, 5.0) at signal variance 1.0 and the signal variance (0.2, 1.0, 5.0) at length-scale 1.0. The length-scale controls the smoothness of our function, while the signal variance controls the deviation from the mean.]
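Draws like these can be reproduced by sampling a multivariate normal whose covariance is the kernel matrix evaluated on a grid of inputs. This short sketch reuses the `se_kernel` helper defined above and adds a small jitter term for numerical stability; both choices are our own assumptions.

```python
import numpy as np

x = np.linspace(-5, 5, 200)[:, None]            # a grid of test inputs
K = se_kernel(x, x, signal_var=1.0, lengthscales=0.2)
jitter = 1e-8 * np.eye(len(x))                  # keeps the covariance positive definite
samples = np.random.multivariate_normal(np.zeros(len(x)), K + jitter, size=3)
# Each row of `samples` is one random function drawn from the GP prior.
```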

Gaussian Processes

GPs for Regression

Setup: We are given a set of inputs $X \in \mathbb{R}^{N \times D}$ and corresponding outputs $y \in \mathbb{R}^N$, the function values from a GP evaluated at $X$. We assume a mean function $m(x)$ and a covariance function $k(x, x')$, which rely on parameters $\theta$.

We would like to learn the optimal $\theta$, and estimate the function values $y_*$ for a set of new inputs $X_*$.

To learn $\theta$, we optimize the marginal likelihood:
$$P(y \mid X, \theta) = \mathcal{N}(0, K_{XX}).$$

We can then use the multivariate normal conditional distribution to evaluate the predictive distribution:
$$P(y_* \mid X_*, X, y, \theta) = \mathcal{N}\left(K_{X_*X} K_{XX}^{-1} y,\; K_{X_*X_*} - K_{X_*X} K_{XX}^{-1} K_{XX_*}\right).$$


Gaussian Processes

GPs for Regression

Note this is all true because we are assuming the outputs correspond to a Gaussian process. We therefore make the following assumption:
$$\begin{pmatrix} y \\ y_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K_{XX} & K_{XX_*} \\ K_{X_*X} & K_{X_*X_*} \end{pmatrix} \right).$$

Computing $P(y \mid X)$ and $P(y_* \mid X_*, X, y)$ only requires matrix algebra on the above assumption.

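The predictive equations above amount to a few lines of linear algebra. The following sketch, our own illustration rather than thesis code, computes the posterior mean and covariance at test inputs using a Cholesky solve instead of an explicit inverse.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X, y, X_star, kernel, noise_var=1e-6):
    """Posterior mean and covariance of f(X_star) given data (X, y) under a zero-mean GP."""
    K_xx = kernel(X, X) + noise_var * np.eye(len(X))
    K_sx = kernel(X_star, X)
    K_ss = kernel(X_star, X_star)
    chol = cho_factor(K_xx, lower=True)
    alpha = cho_solve(chol, y)                     # K_xx^{-1} y
    mean = K_sx @ alpha                            # K_*x K_xx^{-1} y
    cov = K_ss - K_sx @ cho_solve(chol, K_sx.T)    # K_** - K_*x K_xx^{-1} K_x*
    return mean, cov
```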

Gaussian Processes

Example of a GP for Regression

Figure: On the left, data from a sigmoidal curve with noise. On the right, samples from a GP trained on the data (represented by 'x'), using a squared exponential covariance function.

Deep Gaussian Processes

Table of Contents

1 Gaussian Processes

2 Deep Gaussian Processes

3 Implementation

4 Experiments and Analysis

5 Conclusion


Deep Gaussian Processes

Definition of a Deep Gaussian Process

Formally, we define a deep Gaussian process as the composition of GPs:
$$f^{(1:L)}(x) = f^{(L)}\!\left(f^{(L-1)}\!\left(\cdots f^{(2)}\!\left(f^{(1)}(x)\right) \cdots\right)\right)$$
where $f_d^{(l)} \sim \mathcal{GP}\!\left(0,\, k_d^{(l)}(x, x')\right)$ for each $f_d^{(l)} \in f^{(l)}$.

Deep Gaussian Processes

Deep GP Notation

Each layer $l$ consists of $D^{(l)}$ GPs, where $D^{(l)}$ is the number of units at layer $l$.

For an $L$-layer deep GP, we have
one input layer $x_n \in \mathbb{R}^{D^{(0)}}$,
$L - 1$ hidden layers $\{h_n^{(l)}\}_{l=1}^{L-1}$,
and one output layer $y_n$, which we assume to be 1-dimensional.

All layers are completely connected by GPs, each with their own kernel.


Deep Gaussian Processes

Example: Two-Layer Deep GP

[Diagram: $x_n \to h_n \to y_n$, with GP $f$ on the first edge and GP $g$ on the second.]

We have a one dimensional input, $x_n$, a one dimensional hidden unit, $h_n$, and a one dimensional output, $y_n$. This two-layer network consists of two GPs, $f$ and $g$, where
$$h_n = f(x_n), \text{ where } f \sim \mathcal{GP}(0, k^{(1)}(x, x'))$$
and
$$y_n = g(h_n), \text{ where } g \sim \mathcal{GP}(0, k^{(2)}(h, h')).$$


Deep Gaussian Processes

Example: More Complicated Model

[Figure: graphical representation of a more complicated deep GP architecture, with inputs $x_1, x_2, x_3$, two hidden layers $h_1^{(1)}, \ldots, h_4^{(1)}$ and $h_1^{(2)}, \ldots, h_4^{(2)}$, and output $y$.]

Every edge corresponds to a GP between units, as the outputs of each layer are the inputs of the following layer. Our input data is 3-dimensional, while the two hidden layers in this model each have 4 hidden units.

Deep Gaussian Processes

Sampling From a Deep GP

[Figure panels: the full deep GP $g(f(x))$; layer 1, $f(x)$ with length-scale 0.5; layer 2, $g$ with length-scale 1.0.]

Samples from deep GPs. As opposed to single-layer GPs, a deep GP can model non-stationary functions (functions whose shape changes along the input space) without the use of a non-stationary kernel.
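A draw from a two-layer deep GP prior like the one pictured can be generated by feeding one GP sample through another. This is a minimal sketch under our own choices: it reuses the `se_kernel` helper from the Gaussian Processes section, and the grid and length-scales of 0.5 and 1.0 mirror the panel titles.

```python
import numpy as np

def sample_gp(inputs, lengthscale, signal_var=1.0):
    """Draw one function sample from a zero-mean GP prior evaluated at `inputs` of shape (N, 1)."""
    K = se_kernel(inputs, inputs, signal_var=signal_var, lengthscales=lengthscale)
    return np.random.multivariate_normal(np.zeros(len(inputs)), K + 1e-8 * np.eye(len(inputs)))

x = np.linspace(-6, 6, 300)[:, None]
f_x = sample_gp(x, lengthscale=0.5)                # layer 1: h = f(x)
g_fx = sample_gp(f_x[:, None], lengthscale=1.0)    # layer 2: y = g(h), kernel applied to h
# Viewed as a function of x, g_fx is typically non-stationary even though both kernels are stationary.
```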

Deep Gaussian Processes

Comparison with Neural Networks

Similarities: deep architectures, completely connected, single-layer GPs correspond to two-layer neural networks with random weights and infinitely many hidden units.

Differences: a deep GP is nonparametric, has no activation functions, requires specifying kernels, and training is intractable.


Implementation

Table of Contents

1 Gaussian Processes

2 Deep Gaussian Processes

3 Implementation

4 Experiments and Analysis

5 Conclusion


Implementation

FITC Approximation for Single-Layer GP

The Fully Independent Training Conditional (FITC) approximation circumvents the $O(N^3)$ training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005).

We introduce $M$ pseudo inputs $\bar{X} = \{\bar{x}_m\}_{m=1}^M$ and the corresponding pseudo outputs $\bar{y} = \{\bar{y}_m\}_{m=1}^M$, which correspond to the function values at the pseudo inputs.

Key assumption: conditioned on the pseudo data, the output values are independent.


Implementation

FITC Approximation for Single-Layer GP

We assume a prior
$$P(\bar{y} \mid \bar{X}) = \mathcal{N}(0, K_{\bar{X}\bar{X}}).$$

Training takes time $O(NM^2)$, and testing requires $O(M^2)$.


Implementation

FITC Example

Figure: The predictive mean of a GP trained on sigmoidal data using the FITC approximation. On the left, we use 5 pseudo data points, while on the right, we use 10.
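For reference, the FITC predictive mean and (diagonal) variance at a set of inputs can be computed from the pseudo data as follows; the function name and the explicit matrix inverse are our own simplification of the idea, not code from the thesis.

```python
import numpy as np

def fitc_predict(X, X_bar, y_bar, kernel, jitter=1e-6):
    """Predictive mean and diagonal variance at inputs X, conditioned on pseudo data (X_bar, y_bar)."""
    K_mm = kernel(X_bar, X_bar) + jitter * np.eye(len(X_bar))
    K_nm = kernel(X, X_bar)
    K_mm_inv = np.linalg.inv(K_mm)
    mean = K_nm @ K_mm_inv @ y_bar
    # Only the diagonal of K_XX - K_nm K_mm^{-1} K_mn is needed under the FITC assumption.
    k_diag = np.diag(kernel(X, X))
    var = k_diag - np.sum((K_nm @ K_mm_inv) * K_nm, axis=1)
    return mean, np.maximum(var, 0.0)
```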

Implementation

Learning Deep GPs is Intractable

[Diagram: $x_n \to h_n \to y_n$, with GP $f$ (parameters $\theta^{(1)}$) on the first edge and GP $g$ (parameters $\theta^{(2)}$) on the second.]

Example: two-layer model, with inputs $X$, outputs $y$, and hidden layer $H$ (which is $N \times 1$). Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate
$$P(y \mid X, \theta) = \int P(y \mid H, \theta^{(2)})\, P(H \mid X, \theta^{(1)})\, dH = \int \mathcal{N}(y \mid 0, K_{HH})\, \mathcal{N}(H \mid 0, K_{XX})\, dH.$$

This integral is intractable, because $H$ enters the first Gaussian nonlinearly through the kernel matrix $K_{HH}$.


Implementation

DGPS Algorithm Overview

The Deep Gaussian Process Sampling algorithm relies on two central ideas:

We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation techniques to evaluate the gradients and optimize our objective.

We replace every GP with a FITC GP, so the time complexity for $L$ layers and $H$ hidden units per layer is $O(N^2MLH)$ as opposed to $O(N^3LH)$.


Implementation

Related Work

Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations.

These methods are able to integrate out the pseudo outputs at each layer, but they rely on integral approximations that restrict the kernel. Meanwhile, the DGPS uses Monte Carlo sampling, which is easier to implement, more intuitive to understand, and can extend easily to most kernels.


Implementation

Sampling Hidden Values

For inputs $X$, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution.

For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer.

We use $K$ different samples $\{(\mu_k, \Sigma_k)\}_{k=1}^K$ of the output-layer predictive moments to approximate the marginal likelihood:
$$P(y \mid X) \approx \frac{1}{K} \sum_{k=1}^{K} P(y \mid \mu_k, \Sigma_k) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}(y \mid \mu_k, \Sigma_k).$$

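Concretely, one Monte Carlo pass through the network can be sketched as below; `fitc_predict` is the helper sketched in the FITC section, and the single-unit-per-layer structure and parameter container are our own simplifications for illustration.

```python
import numpy as np

def sample_forward(X, layers, kernel):
    """One stochastic forward pass through a deep GP with one unit per layer.
    `layers` is a list of (X_bar, y_bar) pseudo-data pairs, one entry per layer."""
    H = X
    for X_bar, y_bar in layers[:-1]:
        mu, var = fitc_predict(H, X_bar, y_bar, kernel)
        # Sample hidden values from the (diagonal) predictive distribution of this layer.
        H = (mu + np.sqrt(var) * np.random.randn(len(mu)))[:, None]
    # At the final layer, return the predictive mean and variance over the outputs y.
    X_bar, y_bar = layers[-1]
    return fitc_predict(H, X_bar, y_bar, kernel)
```

Averaging $\mathcal{N}(y \mid \mu_k, \Sigma_k)$ over $K$ such passes gives the Monte Carlo estimate of the marginal likelihood above.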

Implementation

FITC for Deep GPs

To make fitting more scalable, we replace every GP in the model with a FITC GP.

For each GP, corresponding to hidden unit $d$ in layer $l$, we introduce pseudo inputs $\bar{X}_d^{(l)}$ and corresponding pseudo outputs $\bar{y}_d^{(l)}$.

With the addition of the pseudo data, we are required to learn the following set of parameters:
$$\Theta = \left\{ \left\{ \bar{X}_d^{(l)},\, \bar{y}_d^{(l)},\, \theta_d^{(l)} \right\}_{d=1}^{D^{(l)}} \right\}_{l=1}^{L}.$$


Implementation

Example: DGPS Algorithm on 2 Layers

[Diagram: $X_n \to H_n \to y_n$, with GP $f$ and pseudo data and parameters $(\bar{X}^{(1)}, \bar{y}^{(1)}, \theta^{(1)})$ on the first edge, and GP $g$ with $(\bar{X}^{(2)}, \bar{y}^{(2)}, \theta^{(2)})$ on the second.]

Our goal is to learn

$\{(\bar{X}^{(l)}, \bar{y}^{(l)})\}_{l=1}^{2}$, the pseudo data for each layer

$\theta^{(1)}$ and $\theta^{(2)}$, the kernel parameters for $f$ and $g$


Implementation

Example: DGPS Algorithm on 2 Layers

To sample values $H$ from the hidden layer, we use the FITC approximation and assume
$$P\!\left(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}\right) = \mathcal{N}\!\left(\mu^{(1)}, \Sigma^{(1)}\right)$$
where
$$\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{y}^{(1)}$$
$$\Sigma^{(1)} = \operatorname{diag}\!\left(K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X}\right).$$

We obtain $K$ samples $\{H_k\}_{k=1}^K$ from the above distribution.


Implementation

Example: DGPS Algorithm on 2 Layers

For each sample $H_k$, we can approximate
$$P\!\left(y \mid H_k, \bar{X}^{(2)}, \bar{y}^{(2)}\right) \approx \mathcal{N}\!\left(\mu_k^{(2)}, \Sigma_k^{(2)}\right)$$
where
$$\mu_k^{(2)} = K_{H_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} \bar{y}^{(2)}$$
$$\Sigma_k^{(2)} = \operatorname{diag}\!\left(K_{H_kH_k} - K_{H_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} K_{\bar{X}^{(2)}H_k}\right).$$


Implementation

Example: DGPS Algorithm on 2 Layers

Thus, we can approximate the marginal likelihood with our samples:
$$P(y \mid X, \Theta) \approx \frac{1}{K} \sum_{k=1}^{K} P\!\left(y \mid H_k, \bar{X}^{(2)}, \bar{y}^{(2)}\right).$$

Incorporating the prior over the pseudo outputs into our objective, we have:
$$\mathcal{L}(y \mid X, \Theta) = \log P(y \mid X, \Theta) + \sum_{l=1}^{L} \sum_{d=1}^{D^{(l)}} \log P\!\left(\bar{y}_d^{(l)} \mid \bar{X}_d^{(l)}\right).$$

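To make the training loop concrete, here is a compact sketch of this objective for a two-layer model with one unit per layer, written with autograd so that gradients with respect to the pseudo data come from automatic differentiation, as the algorithm relies on. Everything here, from the function names to the fixed kernel hyperparameters, is our own simplified illustration rather than the thesis implementation.

```python
import autograd.numpy as np
from autograd import grad
from autograd.scipy.stats import multivariate_normal as mvn

def se(a, b, ls=1.0, sf=1.0):
    """1-D squared exponential kernel matrix between vectors a (N,) and b (M,)."""
    return sf * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def fitc_moments(inputs, Xb, yb):
    """FITC predictive mean and diagonal variance at `inputs`, given pseudo data (Xb, yb)."""
    Kmm = se(Xb, Xb) + 1e-6 * np.eye(len(Xb))
    Knm = se(inputs, Xb)
    A = np.linalg.solve(Kmm, Knm.T).T                  # Knm Kmm^{-1}
    mean = A @ yb
    var = 1.0 - np.sum(A * Knm, axis=1)                # diag(K_XX) = signal variance (1.0 here)
    return mean, np.maximum(var, 1e-8)

def neg_objective(params, X, y, K=10):
    """Negative DGPS-style objective: -(log Monte Carlo marginal likelihood + pseudo-output priors)."""
    (Xb1, yb1), (Xb2, yb2) = params
    logliks = []
    for _ in range(K):
        mu1, v1 = fitc_moments(X, Xb1, yb1)
        h = mu1 + np.sqrt(v1) * np.random.randn(len(X))        # sample the hidden layer
        mu2, v2 = fitc_moments(h, Xb2, yb2)
        logliks.append(-0.5 * np.sum(np.log(2 * np.pi * v2) + (y - mu2) ** 2 / v2))
    m = max(logliks)
    log_marg = m + np.log(sum(np.exp(l - m) for l in logliks) / K)   # log of the sample average
    prior = sum(mvn.logpdf(yb, np.zeros(len(yb)), se(Xb, Xb) + 1e-6 * np.eye(len(Xb)))
                for Xb, yb in [(Xb1, yb1), (Xb2, yb2)])
    return -(log_marg + prior)

objective_grad = grad(neg_objective)    # gradients w.r.t. the pseudo inputs and outputs
```

In practice the kernel hyperparameters would also be entries of `params`, and `objective_grad` would be handed to a gradient-based optimizer.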

Experiments and Analysis

Table of Contents

1 Gaussian Processes

2 Deep Gaussian Processes

3 Implementation

4 Experiments and Analysis

5 Conclusion


Experiments and Analysis

Step Function

We test on a step function with noise: $x_i \in [-2, 2]$, $y_i = \operatorname{sign}(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 0.01)$.

The non-stationarity of a step function is appealing from a deep GP perspective.

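Data like this can be generated in a few lines. The uniform sampling of inputs and the 80/20 train/test split (mentioned with the results later) are our assumptions about the setup; the sketch is illustrative rather than the exact data-generation code.

```python
import numpy as np

def make_step_data(n=100, noise_var=0.01, seed=0):
    """Noisy step function: y = sign(x) + eps with x drawn from [-2, 2]."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(-2, 2, size=n)
    y = np.sign(x) + rng.normal(0.0, np.sqrt(noise_var), size=n)
    # 80/20 train/test split, as used in the experiments.
    idx = rng.permutation(n)
    n_train = int(0.8 * n)
    return (x[idx[:n_train]], y[idx[:n_train]]), (x[idx[n_train:]], y[idx[n_train:]])
```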

Experiments and Analysis

Step Function

Figure: Functions sampled from a single-layer GP. Evidently, the predictive draws do not fully capture the shape of the step function.

Experiments and Analysis

Step Function

Figure: Predictive draws from a single-layer GP and a two-layer deep GP.


Experiments and Analysis

Step Function

Figure: Predictive draws from a three-layer deep GP.


Experiments and Analysis

Step Function

[Figure: impact of parameter initializations on predictive draws, comparing a random initialization with a smart initialization; each panel also shows the corresponding learned hidden values.]

Experiments and Analysis

Step Function

Figure: Experimental results measuring the test log-likelihood per data point and test mean squared error on the noisy step function. We vary the number of layers used in the model (1, 2, or 3), along with the number of data points used in the original step function (50, 100, or 200, divided 80/20 into train/test). We run 10 trials at each combination.
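For reference, the two reported metrics can be computed from a model's predictive means and variances on held-out data as follows; the function name is our own, and the slides do not show this code.

```python
import numpy as np

def test_metrics(y_true, pred_mean, pred_var):
    """Mean squared error and average Gaussian log-likelihood per test point."""
    mse = np.mean((y_true - pred_mean) ** 2)
    log_lik = np.mean(-0.5 * (np.log(2 * np.pi * pred_var)
                              + (y_true - pred_mean) ** 2 / pred_var))
    return mse, log_lik
```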

Experiments and Analysis

Step Function

Occasionally, models with deeper architectures outperform shallower ones, yet they also have the widest distributions of results and the trials with the worst outcomes.

Figure: Test set log-likelihoods per data point and mean squared errors plotted against their training set counterparts for the step function experiment, for models with 1, 2, and 3 layers. Overfitting does not appear to be a problem.

Experiments and Analysis

Step Function

Overfitting does not appear to be a problem.

If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones.

However, training and optimization become more difficult as the number of layers grows and the number of parameters increases.


Experiments and Analysis

Step Function

Figure: Predictive draws from two identical three-layer models (random seeds 66 and 0), albeit with different random parameter initializations; each panel also shows the learned values at the two hidden layers.

Experiments and Analysis

Step Function

Ways to combat optimization challenges:

Using random restarts

Decreasing the number of model parameters

Trying different optimization methods

Experimenting with more diverse architectures, e.g. increasing the dimension of the hidden layers


Experiments and Analysis

Toy Non-Stationary Data

We create toy non-stationary data to evaluate a deep GP's ability to learn a non-stationary function.

We divide the input space into three regions: $X_1 \in [-4, -3]$, $X_2 \in [-1, 1]$, and $X_3 \in [2, 4]$, each of which consists of 40 data points.

We sample from a GP with length-scale $l = 0.25$ for regions $X_1$ and $X_3$, and use $l = 2$ for region $X_2$.

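One way to generate such data is to draw from independent GP priors on each region with the stated length-scales; this sketch reuses the `se_kernel` helper from the Gaussian Processes section, and treating the three regions as independent draws is our own simplification.

```python
import numpy as np

def make_nonstationary_data(seed=0):
    """Three regions of 40 points each: short length-scale (0.25) outside, long (2.0) in the middle."""
    rng = np.random.RandomState(seed)
    regions = [(-4.0, -3.0, 0.25), (-1.0, 1.0, 2.0), (2.0, 4.0, 0.25)]
    xs, ys = [], []
    for lo, hi, ls in regions:
        x = np.sort(rng.uniform(lo, hi, 40))[:, None]
        K = se_kernel(x, x, signal_var=1.0, lengthscales=ls) + 1e-8 * np.eye(40)
        xs.append(x.ravel())
        ys.append(rng.multivariate_normal(np.zeros(40), K))
    return np.concatenate(xs), np.concatenate(ys)
```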

Experiments and Analysis

Toy Non-Stationary Data

Figure: Predictive draws from the single-layer and 2-layer models for toy non-stationary data with squared exponential kernels. Panels show the data, the 2-layer deep GP draws along with the learned hidden values, and the 1-layer GP draws.

Experiments and Analysis

Toy Non-Stationary Data

Figure: The optimization for a 3-layer model can get stuck in a local optimum, and although the predictive draws are non-stationary, our predictions are poor at the tails.

Experiments and Analysis

Motorcycle Data

94 points, where the inputs are times in milliseconds since impact in a motorcycle accident and the outputs are the corresponding helmet accelerations.

The dataset is somewhat non-stationary, as the accelerations are roughly constant early on but become more variable after a certain time.


Experiments and Analysis

Motorcycle Data

Figure: Predictive draws from the single-layer and 2-layer models trained on the motorcycle data with squared exponential kernels. Panels show the data, the 2-layer deep GP draws along with the learned hidden values, and the 1-layer GP draws.

Conclusion

Table of Contents

1 Gaussian Processes

2 Deep Gaussian Processes

3 Implementation

4 Experiments and Analysis

5 Conclusion


Conclusion

Future Directions

Natural extensions include

Trying different optimization methods to avoid getting stuck in local optima

Introducing variational parameters so we do not have to learn pseudo outputs

Extending the model to classification

Exploring the properties of more complex architectures, and evaluating the model likelihood to choose an optimal configuration


Conclusion

Acknowledgments

A huge thank-you to Sasha Rush, Finale Doshi-Velez, David Duvenaud, and Miguel Hernandez-Lobato. This thesis would not have been possible without your help and support!
