
Training and Inference for Deep Gaussian Processes

Keyon Vafa

April 26, 2016

Motivation

An ideal model for prediction is

accurate

computationally efficient

easy to tune without overfitting

able to provide certainty estimates

This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a recent model, introduced by Damianou and Lawrence in 2013.

Exact inference is intractable. In this thesis, we introduce a new method to learn deep GPs, the Deep Gaussian Process Sampling (DGPS) algorithm.

The DGPS algorithm

is more straightforward than existing methods

can more easily adapt to using arbitrary kernels

relies on Monte Carlo sampling to circumvent the intractability hurdle

uses pseudo data to ease the computational burden

Table of Contents

1 Gaussian Processes

2 Deep Gaussian Processes

3 Implementation

4 Experiments and Analysis

5 Conclusion

Gaussian Processes

Definition of a Gaussian Process

A function f is a Gaussian process (GP) if any finite set of values f(x_1), ..., f(x_N) has a multivariate normal distribution.

The inputs \{x_n\}_{n=1}^N can be vectors from a domain of arbitrary dimension.

A GP is specified by a mean function m(x) and a covariance function k(x, x') where

m(x) = E[f(x)]

k(x, x') = Cov(f(x), f(x')).

Covariance Function

The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP.

The squared exponential covariance function has the following form:

k(x, x') = \sigma_f^2 \exp\left( -\frac{1}{2} (x - x')^T M (x - x') \right)

When M is a diagonal matrix, its diagonal elements are the inverse squared length-scales, denoted \ell_i^{-2}; \sigma_f^2 is known as the signal variance.
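
As a concrete reference for the experiments later, here is a minimal NumPy sketch of this kernel. It is our own illustration, not the thesis code, and the function and argument names are ours.

```python
import numpy as np

def squared_exponential(X1, X2, signal_variance=1.0, length_scales=1.0):
    """k(x, x') = sigma_f^2 exp(-0.5 (x - x')^T M (x - x')) with M = diag(length_scales)^{-2}."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    # Rescale each input dimension by its length-scale, then compute pairwise squared distances.
    Z1, Z2 = X1 / length_scales, X2 / length_scales
    sq_dists = np.sum(Z1 ** 2, 1)[:, None] + np.sum(Z2 ** 2, 1)[None, :] - 2.0 * Z1 @ Z2.T
    return signal_variance * np.exp(-0.5 * sq_dists)
```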

Sampling from a GP

Figure: Random samples from GP priors. The panels vary the length-scale (0.2, 1.0, 5.0) at signal variance 1.0 and the signal variance (0.2, 1.0, 5.0) at length-scale 1.0. The length-scale controls the smoothness of our function, while the signal variance controls the deviation from the mean.
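
Draws like those above can be reproduced with a short sketch along these lines (our own illustration, reusing squared_exponential from the earlier block; the grid and sample counts are arbitrary choices):

```python
def sample_gp_prior(inputs, kernel, num_samples=1, jitter=1e-8):
    """Draw functions from a zero-mean GP prior evaluated at `inputs` (illustrative sketch)."""
    K = kernel(inputs, inputs) + jitter * np.eye(len(inputs))  # jitter for numerical stability
    return np.random.multivariate_normal(np.zeros(len(inputs)), K, size=num_samples)

x = np.linspace(-5, 5, 200)[:, None]
draws = sample_gp_prior(x, lambda a, b: squared_exponential(a, b, length_scales=0.2), num_samples=3)
```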

GPs for Regression

Setup: We are given a set of inputs X ∈ R^{N×D} and corresponding outputs y ∈ R^N, the function values from a GP evaluated at X. We assume a mean function m(x) and a covariance function k(x, x'), which rely on parameters θ.

We would like to learn the optimal θ and estimate the function values y_* for a set of new inputs X_*.

To learn θ, we optimize the marginal likelihood:

P(y \mid X, \theta) = \mathcal{N}(0, K_{XX}).

We can then use the multivariate normal conditional distribution to evaluate the predictive distribution:

P(y_* \mid X_*, X, y, \theta) = \mathcal{N}\left( K_{X_*X} K_{XX}^{-1} y,\; K_{X_*X_*} - K_{X_*X} K_{XX}^{-1} K_{XX_*} \right).
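
For concreteness, the predictive equations above translate into a short function. This is a sketch under our own naming; the small noise_variance jitter on the diagonal is our addition for numerical stability and is not part of the slide.

```python
def gp_predict(X_train, y_train, X_test, kernel, noise_variance=1e-6):
    """Predictive mean and covariance of a zero-mean GP, following the equations above."""
    K_xx = kernel(X_train, X_train) + noise_variance * np.eye(len(X_train))
    K_sx = kernel(X_test, X_train)   # K_{X*X}
    K_ss = kernel(X_test, X_test)    # K_{X*X*}
    # Solve linear systems instead of forming K_xx^{-1} explicitly.
    mean = K_sx @ np.linalg.solve(K_xx, y_train)
    cov = K_ss - K_sx @ np.linalg.solve(K_xx, K_sx.T)
    return mean, cov
```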

Note this is all true because we are assuming the outputs correspond to a Gaussian process. We therefore make the following assumption:

\begin{pmatrix} y \\ y_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K_{XX} & K_{XX_*} \\ K_{X_*X} & K_{X_*X_*} \end{pmatrix} \right).

Computing P(y \mid X) and P(y_* \mid X_*, X, y) only requires matrix algebra given the above assumption.

Example of a GP for Regression

Figure: On the left, data from a sigmoidal curve with noise. On the right, samples from a GP trained on the data (represented by 'x'), using a squared exponential covariance function.

Deep Gaussian Processes

Definition of a Deep Gaussian Process

Formally, we define a deep Gaussian process as the composition of GPs:

f^{(1:L)}(x) = f^{(L)}(f^{(L-1)}(\cdots f^{(2)}(f^{(1)}(x)) \cdots))

where f_d^{(l)} \sim \mathcal{GP}\left(0, k_d^{(l)}(x, x')\right) for each f_d^{(l)} \in f^{(l)}.

Deep GP Notation

Each layer l consists of D^{(l)} GPs, where D^{(l)} is the number of units at layer l.

For an L-layer deep GP, we have

one input layer x_n ∈ R^{D^{(0)}}

L − 1 hidden layers \{h_n^{(l)}\}_{l=1}^{L-1}

one output layer y_n, which we assume to be 1-dimensional.

All layers are completely connected by GPs, each with its own kernel.

Example: Two-Layer Deep GP

Diagram: x_n → f → h_n → g → y_n.

We have a one-dimensional input x_n, a one-dimensional hidden unit h_n, and a one-dimensional output y_n. This two-layer network consists of two GPs, f and g, where

h_n = f(x_n), where f \sim \mathcal{GP}(0, k^{(1)}(x, x'))

and

y_n = g(h_n), where g \sim \mathcal{GP}(0, k^{(2)}(h, h')).
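
A draw from this two-layer prior can be sketched by composing two single-layer draws, reusing squared_exponential and sample_gp_prior from the earlier blocks (our own illustration; the length-scales match the samples shown two slides below):

```python
x = np.linspace(-6, 6, 300)[:, None]
# h_n = f(x_n): draw f from a GP prior over the inputs x.
h = sample_gp_prior(x, lambda a, b: squared_exponential(a, b, length_scales=0.5))[0]
# y_n = g(h_n): draw g from a GP prior over the hidden values h.
y = sample_gp_prior(h[:, None], lambda a, b: squared_exponential(a, b, length_scales=1.0))[0]
```

Because g is evaluated at the warped inputs h rather than at x, the composition can look non-stationary in x even though both kernels are stationary.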

Example: More Complicated Model

Graphical representation of a more complicated deep GP architecture. Every edge corresponds to a GP between units, as the outputs of each layer are the inputs of the following layer. Our input data is 3-dimensional, while the two hidden layers in this model each have 4 hidden units.

Sampling From a Deep GP

Figure: Samples from deep GPs. The panels show the full deep GP g(f(x)) against x, the first layer f(x) against x (length-scale 0.5), and the second layer g(f(x)) against f(x) (length-scale 1.0). As opposed to single-layer GPs, a deep GP can model non-stationary functions (functions whose shape changes along the input space) without the use of a non-stationary kernel.

Comparison with Neural Networks

Similarities: deep architectures, completely connected; single-layer GPs correspond to two-layer neural networks with random weights and infinitely many hidden units.

Differences: a deep GP is nonparametric, has no activation functions, requires specifying kernels, and training is intractable.

Implementation

FITC Approximation for Single-Layer GP

The Fully Independent Training Conditional (FITC) approximation circumvents the O(N^3) training time for a single-layer GP by introducing pseudo data, points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005).

We introduce M pseudo inputs \bar{X} = \{\bar{x}_m\}_{m=1}^M and corresponding pseudo outputs \bar{y} = \{\bar{y}_m\}_{m=1}^M, the function values at the pseudo inputs.

Key assumption: conditioned on the pseudo data, the output values are independent.

We assume a prior P(\bar{y} \mid \bar{X}) = \mathcal{N}(0, K_{\bar{X}\bar{X}}).

Training takes O(NM^2) time, and testing requires O(M^2).

FITC Example

Figure: The predictive mean of a GP trained on sigmoidal data using the FITC approximation. On the left, we use 5 pseudo data points, while on the right, we use 10.

Learning Deep GPs is Intractable

Diagram: x_n → f → h_n → g → y_n, with kernel parameters θ^{(1)} for f and θ^{(2)} for g.

Example: two-layer model, with inputs X, outputs y, and hidden layer H (which is N × 1). Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate

P(y \mid X, \theta) = \int P(y \mid H, \theta^{(2)})\, P(H \mid X, \theta^{(1)})\, dH = \int \mathcal{N}(y \mid 0, K_{HH})\, \mathcal{N}(H \mid 0, K_{XX})\, dH.

Evaluating these integrals of Gaussians with respect to kernel functions is intractable.

DGPS Algorithm Overview

The Deep Gaussian Process Sampling algorithm relies on two central ideas:

We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation techniques to evaluate the gradients and optimize our objective.

We replace every GP with the FITC GP, so the time complexity for L layers and H hidden units per layer is O(N^2 MLH) as opposed to O(N^3 LH).

Related Work

Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations.

These methods are able to integrate out the pseudo outputs at each layer, but they rely on integral approximations that restrict the kernel. Meanwhile, the DGPS uses Monte Carlo sampling, which is easier to implement, more intuitive to understand, and extends easily to most kernels.

Sampling Hidden Values

For inputs X, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution.

For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer.

We use K different samples \{(\mu_k, \Sigma_k)\}_{k=1}^K to approximate the marginal likelihood:

P(y \mid X) \approx \frac{1}{K} \sum_{k=1}^K P(y \mid \mu_k, \Sigma_k) = \frac{1}{K} \sum_{k=1}^K \mathcal{N}(y \mid \mu_k, \Sigma_k)

FITC for Deep GPs

To make fitting more scalable, we replace every GP in the model with a FITC GP.

For each GP, corresponding to hidden unit d in layer l, we introduce pseudo inputs \bar{X}_d^{(l)} and corresponding pseudo outputs \bar{y}_d^{(l)}.

With the addition of the pseudo data, we are required to learn the following set of parameters:

\Theta = \left\{ \left\{ \bar{X}_d^{(l)}, \bar{y}_d^{(l)}, \theta_d^{(l)} \right\}_{d=1}^{D^{(l)}} \right\}_{l=1}^{L}.
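
As a concrete picture of what Θ contains, a minimal initialization sketch is below. The layout, names, and random initial values are our own bookkeeping assumptions, not the thesis code.

```python
def init_params(input_dim, layer_dims, num_pseudo, rng=np.random):
    """One entry per GP unit: pseudo inputs, pseudo outputs, and kernel parameters."""
    params = []
    for l, D_l in enumerate(layer_dims):   # layer_dims = [D^(1), ..., D^(L)], with D^(L) = 1 here
        in_dim = input_dim if l == 0 else layer_dims[l - 1]
        params.append([{
            "pseudo_inputs": rng.randn(num_pseudo, in_dim),    # Xbar_d^(l)
            "pseudo_outputs": rng.randn(num_pseudo),           # ybar_d^(l)
            "kernel": {"signal_variance": 1.0, "length_scales": np.ones(in_dim)},  # theta_d^(l)
        } for _ in range(D_l)])
    return params
```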

Example: DGPS Algorithm on 2 Layers

Diagram: X_n → f → H_n → g → y_n, with pseudo data and kernel parameters (\bar{X}^{(1)}, \bar{y}^{(1)}, \theta^{(1)}) for f and (\bar{X}^{(2)}, \bar{y}^{(2)}, \theta^{(2)}) for g.

Our goal is to learn

\{(\bar{X}^{(l)}, \bar{y}^{(l)})\}_{l=1}^2, the pseudo data for each layer

\theta^{(1)} and \theta^{(2)}, the kernel parameters for f and g

To sample values H from the hidden layer, we use the FITC approximation and assume

P(H \mid X, \bar{X}^{(1)}, \bar{y}^{(1)}) = \mathcal{N}(\mu^{(1)}, \Sigma^{(1)})

where

\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{y}^{(1)}

\Sigma^{(1)} = \mathrm{diag}\left( K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X} \right).

We obtain K samples \{H_k\}_{k=1}^K from the above distribution.
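
These are exactly the quantities computed at every layer. A minimal sketch (our own naming, reusing squared_exponential from the earlier block) is:

```python
def fitc_predict(inputs, pseudo_inputs, pseudo_outputs, kernel, jitter=1e-6):
    """FITC predictive mean and diagonal variance at `inputs`, as in the equations above."""
    K_mm = kernel(pseudo_inputs, pseudo_inputs) + jitter * np.eye(len(pseudo_inputs))
    K_nm = kernel(inputs, pseudo_inputs)
    mean = K_nm @ np.linalg.solve(K_mm, pseudo_outputs)
    # Only the diagonal of the predictive covariance is kept, as on the slide.
    var = np.diag(kernel(inputs, inputs)) - np.sum(K_nm * np.linalg.solve(K_mm, K_nm.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

# One Monte Carlo sample of the hidden layer would then be (hypothetical variable names):
#   mu1, var1 = fitc_predict(X, pseudo_inputs_1, pseudo_outputs_1, kernel_1)
#   H_k = mu1 + np.sqrt(var1) * np.random.randn(len(X))
```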

For each sample H_k, we can approximate

P(y \mid H_k, \bar{X}^{(2)}, \bar{y}^{(2)}) \approx \mathcal{N}(\mu^{(2)}, \Sigma^{(2)})

where

\mu^{(2)} = K_{H_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} \bar{y}^{(2)}

\Sigma^{(2)} = \mathrm{diag}\left( K_{H_kH_k} - K_{H_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} K_{\bar{X}^{(2)}H_k} \right).

Thus, we can approximate the marginal likelihood with our samples:

P(y \mid X, \Theta) \approx \frac{1}{K} \sum_{k=1}^K P(y \mid H_k, \bar{X}^{(2)}, \bar{y}^{(2)}).

Incorporating the prior over the pseudo outputs into our objective, we have:

\mathcal{L}(y \mid X, \Theta) = \log P(y \mid X, \Theta) + \sum_{l=1}^{L} \sum_{d=1}^{D^{(l)}} \log P\left( \bar{y}_d^{(l)} \mid \bar{X}_d^{(l)} \right).
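
Putting the pieces together, the objective for this two-layer example can be sketched as follows. This is our own code (not the thesis implementation), reusing squared_exponential and fitc_predict from the earlier blocks plus SciPy helpers; shapes assume a one-dimensional hidden layer and pseudo inputs stored as column vectors. The thesis relies on automatic differentiation (e.g. autograd) to take gradients of a function like this with respect to Θ.

```python
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def dgps_objective(params, X, y, num_samples=10):
    """Monte Carlo estimate of L(y | X, Theta) for the two-layer example (illustrative sketch)."""
    (Xb1, yb1, theta1), (Xb2, yb2, theta2) = params   # pseudo data and kernel parameters per layer
    k1 = lambda a, b: squared_exponential(a, b, **theta1)
    k2 = lambda a, b: squared_exponential(a, b, **theta2)
    mu1, var1 = fitc_predict(X, Xb1, yb1, k1)
    log_liks = []
    for _ in range(num_samples):
        H_k = (mu1 + np.sqrt(var1) * np.random.randn(len(X)))[:, None]   # sample the hidden layer
        mu2, var2 = fitc_predict(H_k, Xb2, yb2, k2)
        var2 = var2 + 1e-8
        # log N(y | mu2, diag(var2)) for this sample
        log_liks.append(-0.5 * np.sum(np.log(2 * np.pi * var2) + (y - mu2) ** 2 / var2))
    log_marginal = logsumexp(log_liks) - np.log(num_samples)   # log of the Monte Carlo average
    # GP priors over the pseudo outputs, log P(ybar | Xbar), one per layer
    log_prior = multivariate_normal.logpdf(yb1, mean=np.zeros(len(yb1)), cov=k1(Xb1, Xb1) + 1e-6 * np.eye(len(yb1)))
    log_prior += multivariate_normal.logpdf(yb2, mean=np.zeros(len(yb2)), cov=k2(Xb2, Xb2) + 1e-6 * np.eye(len(yb2)))
    return log_marginal + log_prior
```

Training then maximizes this objective over the pseudo data and kernel parameters.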

Experiments and Analysis

Step Function

We test on a step function with noise: X ∈ [−2, 2], y_i = sign(x_i) + ε_i, where ε_i \sim \mathcal{N}(0, 0.01).

The non-stationarity of a step function is appealing from a deep GP perspective.
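
A sketch of generating this dataset is below (our own code; we treat 0.01 as the noise variance and assume the inputs are sampled uniformly on [−2, 2], which the slide does not specify). The experiments split each dataset 80/20 into train/test.

```python
# Noisy step function data, as described above.
N = 100
X = np.random.uniform(-2, 2, size=(N, 1))
y = np.sign(X[:, 0]) + np.sqrt(0.01) * np.random.randn(N)
```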

Figure: Functions sampled from a single-layer GP. Evidently, the predictive draws do not fully capture the shape of the step function.

Figure: Predictive draws from a single-layer GP and a two-layer deep GP.

Figure: Predictive draws from a three-layer deep GP.

Figure: Impact of parameter initializations on predictive draws: a random initialization versus a smart initialization, each shown with its learned hidden values and the resulting fit.

Figure: Experimental results measuring the test log-likelihood per data point and test mean squared error on the noisy step function. We vary the number of layers used in the model (1, 2, or 3), along with the number of data points used in the original step function (50, 100, or 200, each divided 80/20 into train/test). We run 10 trials at each combination.

Occasionally, models with deeper architectures outperform those that are more shallow, yet they also possess the widest distributions and the trials with the worst results.

Figure: Test set log-likelihoods per data point and mean squared errors plotted against their training set counterparts for the step function experiment, grouped by the number of layers (1, 2, 3). Overfitting does not appear to be a problem.

Overfitting does not appear to be a problem.

If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones.

However, it becomes more difficult to train and successfully optimize as the number of layers grows and the number of parameters increases.

Figure: Predictive draws from two identical three-layer models, albeit with different random parameter initializations (random seeds 66 and 0), each shown with its two layers of hidden values.

Ways to combat optimization challenges:

Using random restarts

Decreasing the number of model parameters

Trying different optimization methods

Experimenting with more diverse architectures, e.g. increasing the dimension of the hidden layer

Toy Non-Stationary Data

We create toy non-stationary data to evaluate a deep GP's ability to learn a non-stationary function.

We divide the input space into three regions: X_1 ∈ [−4, −3], X_2 ∈ [−1, 1], and X_3 ∈ [2, 4], each of which consists of 40 data points.

We sample from a GP with length-scale ℓ = 0.25 for regions X_1 and X_3, and ℓ = 2 for region X_2.
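
A sketch of generating this dataset, reusing sample_gp_prior and squared_exponential from the earlier blocks (our own illustration; we assume independent draws in each region):

```python
# Three regions with different length-scales; 40 points each.
regions = [(-4.0, -3.0, 0.25), (-1.0, 1.0, 2.0), (2.0, 4.0, 0.25)]
X_parts, y_parts = [], []
for lo, hi, ell in regions:
    x_r = np.sort(np.random.uniform(lo, hi, size=(40, 1)), axis=0)
    y_r = sample_gp_prior(x_r, lambda a, b: squared_exponential(a, b, length_scales=ell))[0]
    X_parts.append(x_r)
    y_parts.append(y_r)
X, y = np.vstack(X_parts), np.concatenate(y_parts)
```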

Figure: Predictive draws from the single-layer and 2-layer models for the toy non-stationary data with squared exponential kernels, with the 2-layer panel also showing the learned hidden values.

Figure: A 3-layer deep GP on the non-stationary data. The optimization for a 3-layer model can get stuck in a local optimum, and although the predictive draws are non-stationary, our predictions are poor at the tails.

Motorcycle Data

94 points, where the inputs are times in milliseconds since impact in a motorcycle accident and the outputs are the corresponding helmet accelerations.

The dataset is somewhat non-stationary, as the accelerations are constant early on but become more variable after a certain time.

Figure: Predictive draws from the single-layer and 2-layer models trained on the motorcycle data with squared exponential kernels, with the 2-layer panel also showing the learned hidden values.

Conclusion

Future Directions

Natural extensions include

Trying different optimization methods to avoid getting stuck in local optima

Introducing variational parameters so we do not have to learn pseudo outputs

Extending the model to classification

Exploring the properties of more complex architectures, and evaluating the model likelihood to choose an optimal configuration

Acknowledgments

A huge thank-you to Sasha Rush, Finale Doshi-Velez, David Duvenaud, and Miguel Hernandez-Lobato. This thesis would not be possible without your help and support!
