Training and Inference for Deep Gaussian Processes
Keyon Vafa
April 26, 2016
Motivation

An ideal model for prediction is:
- accurate
- computationally efficient
- easy to tune without overfitting
- able to provide certainty estimates
Motivation

This thesis focuses on one particular class of prediction models, deep Gaussian processes for regression. They are a new model, having been introduced by Damianou and Lawrence in 2013.

Exact inference is intractable. In this thesis, we introduce a new method to learn deep GPs, the Deep Gaussian Process Sampling (DGPS) algorithm.
Motivation

The DGPS algorithm:
- is more straightforward than existing methods
- can more easily adapt to arbitrary kernels
- relies on Monte Carlo sampling to circumvent the intractability hurdle
- uses pseudo data to ease the computational burden
Table of Contents

1 Gaussian Processes
2 Deep Gaussian Processes
3 Implementation
4 Experiments and Analysis
5 Conclusion

Gaussian Processes
Definition of a Gaussian Process

A function $f$ is a Gaussian process (GP) if any finite set of values $f(\mathbf{x}_1), \dots, f(\mathbf{x}_N)$ has a multivariate normal distribution.

- The inputs $\{\mathbf{x}_n\}_{n=1}^N$ can be vectors from a domain of any size.
- A GP is specified by a mean function $m(\mathbf{x})$ and a covariance function $k(\mathbf{x}, \mathbf{x}')$, where
$$m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}(f(\mathbf{x}), f(\mathbf{x}')).$$
Covariance Function

- The covariance function (or kernel) determines the smoothness and stationarity of functions drawn from a GP.
- The squared exponential covariance function has the following form:
$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^T M (\mathbf{x} - \mathbf{x}')\right)$$
- When $M$ is a diagonal matrix, its diagonal elements $l_i^{-2}$ encode the length-scales $l_i$, and $\sigma_f^2$ is known as the signal variance.
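To make the kernel concrete, here is a minimal NumPy sketch of the squared exponential covariance function with a diagonal $M$; the function name and signature are illustrative, not from the thesis.

```python
import numpy as np

def squared_exponential(X1, X2, signal_variance=1.0, length_scales=1.0):
    """k(x, x') = sigma_f^2 * exp(-0.5 * (x - x')^T M (x - x')), M = diag(1 / l_i^2)."""
    X1 = np.atleast_2d(X1) / length_scales  # rescale each dimension by its length-scale
    X2 = np.atleast_2d(X2) / length_scales
    # Pairwise squared Euclidean distances between the rescaled rows
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return signal_variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0))
```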
Sampling from a GP

Figure: Random samples from GP priors, varying the signal variance (0.2, 1.0, 5.0) and the length-scale (0.2, 1.0, 5.0). The length-scale controls the smoothness of our function, while the signal variance controls the deviation from the mean.
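Prior draws like those in the figure can be generated by evaluating the kernel on a grid and sampling a multivariate normal; a sketch, reusing the `squared_exponential` helper above:

```python
x = np.linspace(-5, 5, 200)[:, None]                   # 1-D inputs on a grid
K = squared_exponential(x, x, signal_variance=1.0, length_scales=0.2)
K += 1e-8 * np.eye(len(x))                             # jitter for numerical stability
draws = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
```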
GPs for Regression

Setup: We are given a set of inputs $X \in \mathbb{R}^{N \times D}$ and corresponding outputs $\mathbf{y} \in \mathbb{R}^N$, the function values from a GP evaluated at $X$. We assume a mean function $m(\mathbf{x})$ and a covariance function $k(\mathbf{x}, \mathbf{x}')$, which rely on parameters $\theta$.

We would like to learn the optimal $\theta$ and estimate the function values $\mathbf{y}_*$ for a set of new inputs $X_*$.

- To learn $\theta$, we optimize the marginal likelihood:
$$P(\mathbf{y}|X, \theta) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K_{XX}).$$
- We can then use the multivariate normal conditional distribution to evaluate the predictive distribution:
$$P(\mathbf{y}_*|X_*, X, \mathbf{y}, \theta) = \mathcal{N}\left(K_{X_*X} K_{XX}^{-1} \mathbf{y},\; K_{X_*X_*} - K_{X_*X} K_{XX}^{-1} K_{XX_*}\right).$$
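A sketch of both steps for a noise-free, zero-mean GP, under the same assumptions as the helpers above (in practice a noise term is usually added to $K_{XX}$):

```python
from scipy.stats import multivariate_normal

def log_marginal_likelihood(X, y, signal_variance, length_scales, jitter=1e-8):
    """log P(y | X, theta) = log N(y | 0, K_XX); optimize this w.r.t. theta."""
    K = squared_exponential(X, X, signal_variance, length_scales) + jitter * np.eye(len(X))
    return multivariate_normal(np.zeros(len(X)), K).logpdf(y)

def gp_predict(X_train, y_train, X_test, signal_variance, length_scales, jitter=1e-8):
    """Predictive mean and covariance from the multivariate normal conditional."""
    K = squared_exponential(X_train, X_train, signal_variance, length_scales)
    K += jitter * np.eye(len(X_train))
    K_star = squared_exponential(X_test, X_train, signal_variance, length_scales)
    K_ss = squared_exponential(X_test, X_test, signal_variance, length_scales)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov
```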
GPs for Regression

Note this all holds because we assume the outputs come from a Gaussian process. We therefore make the following assumption:
$$\begin{pmatrix} \mathbf{y} \\ \mathbf{y}_* \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix}, \begin{pmatrix} K_{XX} & K_{XX_*} \\ K_{X_*X} & K_{X_*X_*} \end{pmatrix}\right).$$
Computing $P(\mathbf{y}|X)$ and $P(\mathbf{y}_*|X_*, X, \mathbf{y})$ then only requires matrix algebra on this joint distribution.
Example of a GP for Regression

Figure: On the left, data from a sigmoidal curve with noise. On the right, samples from a GP trained on the data (represented by 'x'), using a squared exponential covariance function.
Deep Gaussian Processes
Definition of a Deep Gaussian Process

Formally, we define a deep Gaussian process as the composition of GPs:
$$\mathbf{f}^{(1:L)}(\mathbf{x}) = \mathbf{f}^{(L)}(\mathbf{f}^{(L-1)}(\dots \mathbf{f}^{(2)}(\mathbf{f}^{(1)}(\mathbf{x})) \dots))$$
where $f_d^{(l)} \sim \mathcal{GP}\left(0, k_d^{(l)}(\mathbf{x}, \mathbf{x}')\right)$ for each $f_d^{(l)} \in \mathbf{f}^{(l)}$.
Deep GP Notation

- Each layer $l$ consists of $D^{(l)}$ GPs, where $D^{(l)}$ is the number of units at layer $l$.
- For an $L$-layer deep GP, we have:
  - one input layer $\mathbf{x}_n \in \mathbb{R}^{D^{(0)}}$
  - $L - 1$ hidden layers $\{\mathbf{h}_n^{(l)}\}_{l=1}^{L-1}$
  - one output layer $y_n$, which we assume to be 1-dimensional.
- All layers are fully connected by GPs, each with its own kernel.
Example: Two-Layer Deep GP

$x_n \xrightarrow{\;f\;} h_n \xrightarrow{\;g\;} y_n$

We have a one-dimensional input $x_n$, a one-dimensional hidden unit $h_n$, and a one-dimensional output $y_n$. This two-layer network consists of two GPs, $f$ and $g$, where
$$h_n = f(x_n), \quad \text{where } f \sim \mathcal{GP}(0, k^{(1)}(x, x'))$$
and
$$y_n = g(h_n), \quad \text{where } g \sim \mathcal{GP}(0, k^{(2)}(h, h')).$$
Example: More Complicated Model

Figure: Graphical representation of a more complicated deep GP architecture. Every edge corresponds to a GP between units, as the outputs of each layer are the inputs of the following layer. Our input data is 3-dimensional, while the two hidden layers in this model each have 4 hidden units.
Sampling From a Deep GP

Figure: Samples from a two-layer deep GP: the full composition $g(f(x))$, layer 1 (length-scale 0.5), and layer 2 (length-scale 1.0). As opposed to single-layer GPs, a deep GP can model non-stationary functions (functions whose shape changes along the input space) without the use of a non-stationary kernel.
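The figure's construction can be sketched by feeding one GP prior draw into another; `sample_gp` below is an illustrative helper, not thesis code:

```python
def sample_gp(inputs, length_scales, signal_variance=1.0):
    """One function draw from a zero-mean GP prior, evaluated at `inputs`."""
    K = squared_exponential(inputs, inputs, signal_variance, length_scales)
    K += 1e-8 * np.eye(len(inputs))
    return np.random.multivariate_normal(np.zeros(len(inputs)), K)

x = np.linspace(-6, 6, 300)[:, None]
f_x = sample_gp(x, length_scales=0.5)               # layer 1
g_f_x = sample_gp(f_x[:, None], length_scales=1.0)  # layer 2, applied to layer 1's output
```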
Comparison with Neural Networks

- Similarities: deep architectures; fully connected; a single-layer GP corresponds to a two-layer neural network with random weights and infinitely many hidden units.
- Differences: a deep GP is nonparametric, has no activation functions, requires specifying kernels, and its training is intractable.
Implementation
FITC Approximation for Single-Layer GP

- The Fully Independent Training Conditional (FITC) approximation circumvents the $O(N^3)$ training time for a single-layer GP by introducing pseudo data: points that are not in the data set but can be chosen to approximate the function (Snelson and Ghahramani, 2005).
- We introduce $M$ pseudo inputs $\bar{X} = \{\bar{\mathbf{x}}_m\}_{m=1}^M$ and corresponding pseudo outputs $\bar{\mathbf{y}} = \{\bar{y}_m\}_{m=1}^M$, the function values at the pseudo inputs.
- Key assumption: conditioned on the pseudo data, the output values are independent.
FITC Approximation for Single-Layer GP

- We assume a prior $P(\bar{\mathbf{y}}|\bar{X}) = \mathcal{N}(\mathbf{0}, K_{\bar{X}\bar{X}})$.
- Training takes time $O(NM^2)$, and testing requires $O(M^2)$.
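A sketch of the FITC predictive equations used throughout this section, conditioning on pseudo inputs and outputs and keeping only the diagonal of the predictive covariance (names and signature are illustrative):

```python
def fitc_predict(X, X_bar, y_bar, signal_variance, length_scales, jitter=1e-8):
    """Predictive mean and diagonal variance at X given pseudo data (X_bar, y_bar)."""
    X = np.atleast_2d(X)
    K_mm = squared_exponential(X_bar, X_bar, signal_variance, length_scales)
    K_mm += jitter * np.eye(len(X_bar))
    K_nm = squared_exponential(X, X_bar, signal_variance, length_scales)
    k_nn = signal_variance * np.ones(len(X))  # diag(K_XX) for the SE kernel
    mean = K_nm @ np.linalg.solve(K_mm, y_bar)
    # diag(K_XX - K_Xm K_mm^{-1} K_mX), computed without forming the full N x N matrix
    var = k_nn - np.sum(K_nm * np.linalg.solve(K_mm, K_nm.T).T, axis=1)
    return mean, np.maximum(var, 0.0)
```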
FITC Example

Figure: The predictive mean of a GP trained on sigmoidal data using the FITC approximation. On the left, we use 5 pseudo data points; on the right, we use 10.
Learning Deep GPs is Intractable

Example: a two-layer model with inputs $X$, outputs $\mathbf{y}$, and hidden layer $H$ (which is $N \times 1$), where $f$ (with parameters $\theta^{(1)}$) maps $X$ to $H$ and $g$ (with parameters $\theta^{(2)}$) maps $H$ to $\mathbf{y}$. Ideally, a Bayesian treatment would allow us to integrate out the hidden function values to evaluate
$$P(\mathbf{y}|X, \theta) = \int P(\mathbf{y}|H, \theta^{(2)})\, P(H|X, \theta^{(1)})\, dH = \int \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K_{HH})\, \mathcal{N}(H \mid \mathbf{0}, K_{XX})\, dH.$$
Evaluating this integral is intractable: $H$ enters the first Gaussian through the kernel matrix $K_{HH}$, so the product cannot be integrated in closed form.
DGPS Algorithm Overview

The Deep Gaussian Process Sampling algorithm relies on two central ideas:
- We sample predictive means and covariances to approximate the marginal likelihood, relying on automatic differentiation to evaluate the gradients and optimize our objective.
- We replace every GP with a FITC GP, so the time complexity for $L$ layers and $H$ hidden units per layer is $O(N^2MLH)$ as opposed to $O(N^3LH)$.
Related Work

- Damianou and Lawrence (2013) also use the FITC approximation at every layer, but they perform inference with approximate variational marginalization. Subsequent methods (Hensman and Lawrence, 2014; Dai et al., 2015; Bui et al., 2016) also use variational approximations.
- These methods are able to integrate out the pseudo outputs at each layer, but they rely on integral approximations that restrict the kernel. The DGPS instead uses Monte Carlo sampling, which is easier to implement, more intuitive to understand, and extends easily to most kernels.
Sampling Hidden Values

- For inputs $X$, we calculate the predictive mean and covariance for every unit in the first hidden layer. We then sample values from each predictive distribution.
- For every hidden layer thereafter, we take the samples from the previous layer, calculate the predictive mean and covariance, and repeat sampling until the final layer.
- We use $K$ different samples $\{(\mu_k, \Sigma_k)\}_{k=1}^K$ to approximate the marginal likelihood:
$$P(\mathbf{y}|X) \approx \frac{1}{K} \sum_{k=1}^{K} P(\mathbf{y} \mid \mu_k, \Sigma_k) = \frac{1}{K} \sum_{k=1}^{K} \mathcal{N}(\mu_k, \Sigma_k)$$
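A sketch of this forward-sampling pass for a chain of one-unit hidden layers, reusing `fitc_predict`; the parameter layout is an assumption for illustration:

```python
def sample_hidden_layers(X, hidden_params, num_samples):
    """Propagate X through the hidden layers by sampling each FITC predictive.
    `hidden_params` holds one (X_bar, y_bar, signal_variance, length_scales)
    tuple per hidden layer (a single unit per layer, for brevity)."""
    samples = []
    for _ in range(num_samples):
        H = X
        for (X_bar, y_bar, sv, ls) in hidden_params:
            mean, var = fitc_predict(H, X_bar, y_bar, sv, ls)
            H = (mean + np.sqrt(var) * np.random.randn(len(mean)))[:, None]
        samples.append(H)
    return samples
```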
FITC for Deep GPs

- To make fitting more scalable, we replace every GP in the model with a FITC GP.
- For each GP, corresponding to hidden unit $d$ in layer $l$, we introduce pseudo inputs $\bar{X}_d^{(l)}$ and corresponding pseudo outputs $\bar{\mathbf{y}}_d^{(l)}$.
- With the addition of the pseudo data, we are required to learn the following set of parameters:
$$\Theta = \left\{ \left\{ \bar{X}_d^{(l)}, \bar{\mathbf{y}}_d^{(l)}, \theta_d^{(l)} \right\}_{d=1}^{D^{(l)}} \right\}_{l=1}^{L}.$$
Example: DGPS Algorithm on 2 Layers

The model is $X_n \xrightarrow{\;f\;} H_n \xrightarrow{\;g\;} y_n$, with pseudo data and kernel parameters $(\bar{X}^{(1)}, \bar{\mathbf{y}}^{(1)}, \theta^{(1)})$ for $f$ and $(\bar{X}^{(2)}, \bar{\mathbf{y}}^{(2)}, \theta^{(2)})$ for $g$. Our goal is to learn:
- $\{(\bar{X}^{(l)}, \bar{\mathbf{y}}^{(l)})\}_{l=1}^2$, the pseudo data for each layer
- $\theta^{(1)}$ and $\theta^{(2)}$, the kernel parameters for $f$ and $g$
Example: DGPS Algorithm on 2 Layers

To sample values $H$ from the hidden layer, we use the FITC approximation and assume
$$P\left(H \mid X, \bar{X}^{(1)}, \bar{\mathbf{y}}^{(1)}\right) = \mathcal{N}\left(\mu^{(1)}, \Sigma^{(1)}\right)$$
where
$$\mu^{(1)} = K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} \bar{\mathbf{y}}^{(1)}$$
$$\Sigma^{(1)} = \mathrm{diag}\left(K_{XX} - K_{X\bar{X}^{(1)}} K_{\bar{X}^{(1)}\bar{X}^{(1)}}^{-1} K_{\bar{X}^{(1)}X}\right).$$
We obtain $K$ samples $\{H_k\}_{k=1}^K$ from the above distribution.
Example: DGPS Algorithm on 2 Layers

For each sample $H_k$, we can approximate
$$P\left(\mathbf{y} \mid H_k, \bar{X}^{(2)}, \bar{\mathbf{y}}^{(2)}\right) \approx \mathcal{N}\left(\mu^{(2)}, \Sigma^{(2)}\right)$$
where
$$\mu^{(2)} = K_{H_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} \bar{\mathbf{y}}^{(2)}$$
$$\Sigma^{(2)} = \mathrm{diag}\left(K_{H_kH_k} - K_{H_k\bar{X}^{(2)}} K_{\bar{X}^{(2)}\bar{X}^{(2)}}^{-1} K_{\bar{X}^{(2)}H_k}\right).$$
Example: DGPS Algorithm on 2 Layers

Thus, we can approximate the marginal likelihood with our samples:
$$P(\mathbf{y}|X, \Theta) \approx \frac{1}{K} \sum_{k=1}^{K} P\left(\mathbf{y} \mid H_k, \bar{X}^{(2)}, \bar{\mathbf{y}}^{(2)}\right).$$
Incorporating the prior over the pseudo outputs into our objective, we have:
$$\mathcal{L}(\mathbf{y}|X, \Theta) = \log P(\mathbf{y}|X, \Theta) + \sum_{l=1}^{L} \sum_{d=1}^{D^{(l)}} \log P\left(\bar{\mathbf{y}}_d^{(l)} \mid \bar{X}_d^{(l)}\right).$$
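Putting the pieces together, a sketch of the Monte Carlo objective for the two-layer example; the names and parameter layout are illustrative, and the thesis optimizes this quantity with automatic differentiation rather than plain NumPy:

```python
from scipy.special import logsumexp

def dgps_objective(X, y, hidden_params, output_params, num_samples=10):
    """Monte Carlo estimate of log P(y | X, Theta) plus the log priors
    on the pseudo outputs at every layer."""
    X_bar2, y_bar2, sv2, ls2 = output_params
    sample_logliks = []
    for H_k in sample_hidden_layers(X, hidden_params, num_samples):
        mean, var = fitc_predict(H_k, X_bar2, y_bar2, sv2, ls2)
        var = np.maximum(var, 1e-8)
        # log N(y | mean, diag(var))
        sample_logliks.append(-0.5 * np.sum((y - mean)**2 / var + np.log(2 * np.pi * var)))
    log_marginal = logsumexp(sample_logliks) - np.log(num_samples)  # log of the average
    log_prior = log_marginal_likelihood(X_bar2, y_bar2, sv2, ls2)   # log P(y_bar | X_bar)
    for (X_bar, y_bar, sv, ls) in hidden_params:
        log_prior += log_marginal_likelihood(X_bar, y_bar, sv, ls)
    return log_marginal + log_prior
```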
Experiments and Analysis
Step Function

- We test on a step function with noise: $X \in [-2, 2]$, $y_i = \mathrm{sign}(x_i) + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, 0.01)$.
- The non-stationarity of a step function is appealing from a deep GP perspective.
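For reference, one way to generate this dataset (the experiments below vary N across 50, 100, and 200):

```python
N = 100
x_step = np.sort(np.random.uniform(-2, 2, size=N))[:, None]
y_step = np.sign(x_step[:, 0]) + np.sqrt(0.01) * np.random.randn(N)
```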
Figure: Functions sampled from a single-layer GP. Evidently, the predictive draws do not fully capture the shape of the step function.
Figure: Predictive draws from a single-layer GP and a two-layer deep GP.
Figure: Predictive draws from a three-layer deep GP.
Figure: Impact of parameter initializations on predictive draws. Predictive draws and learned hidden values for a deep GP under a random initialization and a smart initialization.
Figure: Experimental results measuring the test log-likelihood per data point and test mean squared error on the noisy step function. We vary the number of layers used in the model (1, 2, 3), along with the number of data points in the original step function (50, 100, 200, each divided 80/20 into train/test). We run 10 trials at each combination.
Occasionally, models with deeper architectures outperform those that are more shallow, yet they also possess the widest distributions and the trials with the worst results.

Figure: Test set log-likelihoods per data point and mean squared errors plotted against their training set counterparts for the step function experiment, by number of layers. Overfitting does not appear to be a problem.
- Overfitting does not appear to be a problem.
- If we can successfully optimize our objective, deeper architectures are better suited to learning the noisy step function than shallower ones.
- However, training and successful optimization become more difficult as the number of layers grows and the number of parameters increases.
Figure: Predictive draws and hidden values from two identical three-layer models, albeit with different random parameter initializations (random seeds 66 and 0).
Ways to combat optimization challenges:
- using random restarts
- decreasing the number of model parameters
- trying different optimization methods
- experimenting with more diverse architectures, e.g. increasing the dimension of the hidden layers
Toy Non-Stationary Data

- We create toy non-stationary data to evaluate a deep GP's ability to learn a non-stationary function.
- We divide the input space into three regions: $X_1 \in [-4, -3]$, $X_2 \in [-1, 1]$, and $X_3 \in [2, 4]$, each of which consists of 40 data points.
- We sample from a GP with length-scale $l = 0.25$ for regions $X_1$ and $X_3$, and $l = 2$ for region $X_2$.
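A sketch of this construction, drawing each region from an independent GP prior via the `sample_gp` helper above (whether the regions share a single underlying draw is an assumption here):

```python
regions = [(-4.0, -3.0, 0.25), (-1.0, 1.0, 2.0), (2.0, 4.0, 0.25)]  # (low, high, length-scale)
xs, ys = [], []
for low, high, ls in regions:
    x_r = np.sort(np.random.uniform(low, high, size=40))[:, None]
    xs.append(x_r)
    ys.append(sample_gp(x_r, length_scales=ls))
x_toy, y_toy = np.vstack(xs), np.concatenate(ys)
```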
Figure: Predictive draws (and learned hidden values) from the single-layer and 2-layer models for toy non-stationary data with squared exponential kernels.
Figure: The optimization for a 3-layer model can get stuck in a local optimum; although the predictive draws are non-stationary, our predictions are poor at the tails.
Motorcycle Data

- 94 points, where the inputs are times in milliseconds since impact in a motorcycle accident and the outputs are the corresponding helmet accelerations.
- The dataset is somewhat non-stationary, as the accelerations are roughly constant early on but become more varied after a certain time.
Figure: Predictive draws (and learned hidden values) from the single-layer and 2-layer models trained on the motorcycle data with squared exponential kernels.
Conclusion
Future Directions

Natural extensions include:
- trying different optimization methods to avoid getting stuck in local optima
- introducing variational parameters so we do not have to learn pseudo outputs
- extending the model to classification
- exploring properties of more complex architectures, and evaluating the model likelihood to choose an optimal configuration
Acknowledgments

A huge thank-you to Sasha Rush, Finale Doshi-Velez, David Duvenaud, and Miguel Hernandez-Lobato. This thesis would not be possible without your help and support!