Composite Gaussian Processes: Scalable Computation and Performance Analysis

Xiuming Liu 1 Dave Zachariah 1 Edith C. H. Ngai 1

Abstract

Gaussian process (GP) models provide a powerful tool for prediction but are computationally prohibitive for large data sets. In such scenarios, one has to resort to approximate methods. We derive an approximation based on a composite likelihood approach using a general belief updating framework, which leads to a recursive computation of the predictor as well as recursive learning of the hyper-parameters. We then provide an analysis of the derived composite GP model in predictive and information-theoretic terms. Finally, we evaluate the approximation with both synthetic data and a real-world application.

1. Introduction

Regression is a fundamental problem in machine learning, signal processing, and system identification. In general, an input-output pair (x, y) can be described by

y = f(x) + ε, (1)

where f(·) is an unknown regression function and ε is a zero-mean error term. Given a set of input-output pairs, the goal is to model f(·) and infer the value of f(x*) at some test points.

Gaussian Processes (GPs) are a family of nonlinear and non-parametric models (Rasmussen & Williams, 2006). Inference based on GPs is conceptually straightforward, and the model class provides an internal measure of the uncertainty of the inferred quantities. Therefore GPs are widely applied in machine learning (Bishop, 2006), time series analysis (Shumway & Stoffer, 2011), spatial statistics (Kroese & Botev, 2015), and control systems (Deisenroth et al., 2015).

A GP model of f(x) is specified by its mean and covariance functions, which are learned from data. After GP learning, the latent values of f(·) can be inferred from observed data. Both GP learning and inference rely on evaluating a likelihood function with the available data. For a large data set, this evaluation becomes computationally prohibitive due to the requirement of inverting covariance matrices, which has a typical runtime on the order of O(N³), where N is the size of the training data. Thus the direct use of GPs has been restricted to moderately sized data sets.

¹Uppsala University, Uppsala, Sweden. Correspondence to: Xiuming Liu and Dave Zachariah <[email protected] and [email protected]>.

Copyright 2018 by the author(s).

Considerable research efforts have investigated scalable GP models and approximations, which can be divided into four broad categories: subset-of-data methods, low-rank approximation of covariance matrices (Williams & Seeger, 2001), sparse GPs (Quiñonero Candela & Rasmussen, 2005; Hensman et al., 2013), and products of experts (or Bayesian committee machines) (Tresp, 2000; Deisenroth & Ng, 2015). Many methods rely on fast heuristic searches (Seeger et al., 2003; Titsias, 2009) for an optimal subset of data or dimensions to, for instance, maximize the information gain, which has been shown to be an NP-hard problem (Krause et al., 2008).

Sparse GPs use a small and special subset of data, namely inducing variables, and apply two key ideas: an assumption of conditional independence between training data and test data given the inducing variables, and an approximated conditional distribution for the training data given the inducing variables, for example the fully or partially independent training conditional (FITC or PITC) (Quiñonero Candela & Rasmussen, 2005; Snelson & Ghahramani, 2006; 2007; Bijl et al., 2015). Based on these assumptions, sparse GPs reduce the complexity to O(NM²), where M ≪ N is the number of inducing variables. By contrast, products of experts (PoEs) or Bayesian committee machines (BCMs) (Tresp, 2000; Cao & Fleet, 2014; Deisenroth & Ng, 2015) do not require finding a special set of variables. These methods divide the large training data set into segments and use each segment with the GP model as a 'local expert'. The final result is formed by multiplying the results given by the local experts, usually with certain weights in order to be robust against outliers.

Several important questions remain for scalable GP methods. First, how are these different methods connected to each other? Second, how do we quantify the difference between an approximated GP posterior distribution and a full GP posterior distribution? This work aims to address both questions. First, we show that various approximated GP posteriors can be derived by applying a general belief updating framework (Bissiri et al., 2016). Using composite likelihoods (Varin et al., 2011), we obtain a scalable composite GP model, which can be implemented in a recursive fashion. Second, we analyze how the posterior of the composite GP differs from that of a full GP with respect to predictive performance and its representation of uncertainty (Bernardo, 1979; Cover & Thomas, 1991).

This paper is organized as follows. In Section 3, we derive the composite GP posterior. In Section 4, we give equations for scalable GP learning and prediction. The performance analysis of the composite GP is presented in Section 5. Examples and discussions are presented in Sections 6 and 7.

2. Problem Formulation

We consider f(x) in (1) to be a stochastic process modeled as a Gaussian process GP(µ(x), σ(x, x′)) (Rasmussen & Williams, 2006). This model yields a prior belief distribution p(z) over the latent variable at M test points:

$$z = [f(x^*_1), \cdots, f(x^*_M)]^\top \sim \mathcal{N}(\mu_z, \Sigma_z). \qquad (2)$$

The joint mean µ_z and covariance matrix Σ_z are functions of the test points {x*_1, …, x*_M}. Using a training data set
$$\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\} = \{X, y\},$$

our goal is to update the belief distribution over z so as to produce a prediction ẑ together with a dispersion measure of its uncertainty.

In addition to (2), we model εi in each sample from (1) as

$$\varepsilon_i \sim \mathcal{N}(0, \sigma^2_\varepsilon) \quad \text{(i.i.d.)} \qquad (3)$$

This specifies a conditional data model p(y|z) that is Gaussian with mean µ_{y|z} and covariance Σ_{y|z}. The standard Bayesian inference framework then updates the prior belief distribution p(z) into the posterior p(z|y) via Bayes' rule.

The inference of z depends, moreover, on a specified mean and covariance model, which can be learned in several ways. When the model is parameterized by a vector θ, the maximum marginal likelihood approach is a popular learning method that aims to solve the problem

$$\hat{\theta} = \arg\max_{\theta} \int p_\theta(y|z)\, p_\theta(z)\, dz, \qquad (4)$$

which requires several matrix inversions in itself. Finally, the posterior

$$p_{\hat\theta}(z|y) = \frac{p_{\hat\theta}(y|z)\, p_{\hat\theta}(z)}{p_{\hat\theta}(y)} \qquad (5)$$

is evaluated at θ = θ̂. Both (4) and (5) require a runtime on the order of O(N³) and a storage that scales as O(N²), which renders standard GP training and inference intractable for large N (Quiñonero Candela & Rasmussen, 2005).
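For concreteness, the following is a minimal sketch (our illustration, not code from the paper) of the standard full-GP computations in (4)–(5) using a Cholesky factorization; the squared-exponential kernel, the noise level, and all names are illustrative assumptions. The N × N factorization is what gives the O(N³) runtime and O(N²) storage, and the returned log marginal likelihood is the objective maximized in (4).

```python
import numpy as np

def se_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs (assumed kernel)."""
    d = X1[:, None] - X2[None, :]
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X, y, X_star, noise_var=0.1, **kern_args):
    """Full-GP posterior mean/covariance and log marginal likelihood, cf. (4)-(5)."""
    K = se_kernel(X, X, **kern_args) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                     # O(N^3) factorization of the N x N matrix
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_s = se_kernel(X, X_star, **kern_args)
    mu = K_s.T @ alpha                            # posterior mean
    v = np.linalg.solve(L, K_s)
    cov = se_kernel(X_star, X_star, **kern_args) - v.T @ v   # posterior covariance
    # log marginal likelihood, the objective maximized in (4)
    lml = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(X) * np.log(2 * np.pi)
    return mu, cov, lml

# usage sketch
X = np.linspace(0, 10, 200)
y = np.sin(X) + 0.3 * np.random.default_rng(0).standard_normal(200)
mu, cov, lml = gp_posterior(X, y, np.linspace(0, 10, 50))
```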

In this work, our goals are, first, to formulate an alternative update of the belief distribution; second, to provide a scalable training and inference method; and finally, to present an analysis of the performance of the approximation in terms of MSE and information loss.

3. Updating Belief Distributions

The posterior above can be thought of as a special case of updating the belief distribution p(z) into a new distribution q(z) using the data D. A more general belief updating framework was formulated in (Bissiri et al., 2016). Using this framework, we first define a loss function ℓ(y; z) and then find the distribution q(z) which minimizes the average loss

$$\mathcal{L}(q(z)) \triangleq \int_{\mathcal{Z}} \ell(y; z)\, q(z)\, dz + D(q(z)\,\|\,p(z)), \qquad (6)$$

regularized by the Kullback–Leibler divergence (KLD) D(q(z)||p(z)) (Cover & Thomas, 1991). The first term in (6) fits q(z) to the data, while the second term constrains it to the prior belief distribution. The updated belief distribution is then obtained as

$$q(z) = \arg\min_{q(z)} \mathcal{L}(q(z)). \qquad (7)$$

It is readily seen that for the loss function

$$\ell_{\mathrm{GP}}(y; z) = -\ln p(y|z) \qquad (8)$$

the minimizer of L_GP(q(z)) is the posterior p(z|y); cf. (Bissiri et al., 2016) for more details. It is also interesting to point out that the above optimal belief updating framework has the same structure as the KLD optimization problem in variational Bayesian inference (Fox & Roberts, 2012). Here, however, the problem is considered from a different angle: in (Fox & Roberts, 2012), the challenge is to tackle the joint distribution of a high-dimensional z by factorizing it into single-variable factors; in this paper the challenge is to tackle the distribution of large data sets y.

To alleviate the computational requirements of using the GP loss (8), we formulate a different loss function based on marginal blocks of the full data distribution p(y|z), similar to the composite likelihood approach, cf. (Varin et al., 2011). Specifically, we choose

$$\ell_{\mathrm{CGP}}(y; z) = -\sum_{k=1}^{K} \ln p(y_k|z), \qquad (9)$$


where the data set D has been divided into K segments

$$\mathcal{D}_k = \{X_k, y_k\}, \quad k = 1, \ldots, K, \qquad (10)$$

with N_k ≪ N samples each. As we show below, this cost function enables scalable and online processing for large data sets.

Theorem 1 (Composite GP (CGP) update). By applying (9), we obtain a recursively updated belief distribution
$$p_{\mathrm{CGP}}(z \mid y_{1:K}) \triangleq \arg\min_{q(z)} \mathcal{L}_{\mathrm{CGP}}(q(z)) = \frac{p_{\mathrm{CGP}}(z \mid y_{1:K-1})\, p(y_K \mid z)}{p(y_K)}, \qquad (11)$$
where
$$p(y_{1:K}) = \int_{\mathcal{Z}} p(z) \prod_{k=1}^{K} p(y_k \mid z)\, dz$$
is the marginalized distribution for all data, and
$$p(y_K) = \int_{\mathcal{Z}} p_{\mathrm{CGP}}(z \mid y_{1:K-1})\, p(y_K \mid z)\, dz$$
is the marginalized distribution for segment K. For a fixed N_k, (11) can be evaluated with a runtime of order O(K N_k³).

The proof follows by recognizing that (6) is equivalent to the divergence
$$D\big(q(z)\,\big\|\, \exp(-\ell(y; z))\, p(z)\big) \geq 0, \qquad (12)$$
which attains its minimum 0 only when q(z) ∝ exp(−ℓ(y; z)) p(z). The result is then obtained by noting that q(z) is a distribution that integrates to unity. The recursive computation of the CGP posterior in (11) can be tackled using standard tools (Särkkä, 2013), as discussed in Section 4 below.
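To make the recursion in (11) concrete, here is a small numerical sketch (our own illustration) for a scalar latent variable evaluated on a grid: the belief is updated one segment at a time, multiplying in p(y_k | z) and renormalizing. In this toy model the observations really are conditionally independent given z, so the recursion recovers the exact posterior; for GP models with cross-segment correlations the CGP posterior is an approximation.

```python
import numpy as np
from scipy.stats import norm

# scalar latent variable z with a Gaussian prior belief, evaluated on a grid
grid = np.linspace(-5.0, 5.0, 2001)
dz = grid[1] - grid[0]
sigma_z, sigma_eps = 1.5, 0.5
belief = norm.pdf(grid, loc=0.0, scale=sigma_z)      # prior p(z)

rng = np.random.default_rng(0)
z_true = 1.0
segments = [z_true + sigma_eps * rng.standard_normal(50) for _ in range(4)]  # y_1, ..., y_K

for y_k in segments:
    # p(y_k | z) evaluated on the grid; each segment is processed once, cf. (11)
    log_lik = norm.logpdf(y_k[:, None], loc=grid[None, :], scale=sigma_eps).sum(axis=0)
    belief = belief * np.exp(log_lik - log_lik.max())  # unnormalized update
    belief /= belief.sum() * dz                         # normalization plays the role of p(y_k) in (11)

print("posterior mean ~", (grid * belief).sum() * dz)
```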

Here it is instructive to compare the CGP with the SGP approach (Snelson & Ghahramani, 2006), which is formulated using a set of latent 'inducing variables' u with a joint distribution p(y, u|z). By assuming that y is independent of z given u, the joint distribution can be factorized into p(y|u)p(u|z). The implicit loss function used in SGP is then based on marginalizing out u from the joint distribution:

$$\ell_{\mathrm{SGP}}(y; z) = -\ln \int_{\mathcal{U}} p(y|u)\, p(u|z)\, du, \qquad (13)$$
using a segmented model $p(y|u) = \prod_{b=1}^{B} p(y_b|u)$. The SGP posterior is

$$p_{\mathrm{SGP}}(z|y) \triangleq \arg\min_{q(z)} \mathcal{L}_{\mathrm{SGP}}(q(z)) = \frac{p(z) \int_{\mathcal{U}} \prod_{b=1}^{B} p(y_b|u)\, p(u|z)\, du}{\int_{\mathcal{U}} p(y|u)\, p(u)\, du}. \qquad (14)$$

This formulation reproduces the FITC and PITC approximations depending on the size of the segments y_b, cf. (Quiñonero Candela & Rasmussen, 2005; Snelson & Ghahramani, 2007). The inducing variables target a compressed representation of the data, and the information is transferred to the belief distribution of z via p(u|z). Thus u must be carefully selected so as to transfer the maximum amount of information about the training data y. The optimal placement of inducing variables involves a challenging combinatorial optimization problem, but it can be tackled using greedy search heuristics (Seeger et al., 2003; Krause et al., 2008; Titsias, 2009).
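For comparison with the CGP recursion developed below, the FITC predictive distribution can be written in terms of the Nyström-type matrices Q_ab = K_au K_uu⁻¹ K_ub (Quiñonero Candela & Rasmussen, 2005). The following is a naive sketch (ours, written for clarity rather than efficiency: it forms N × N matrices) under an assumed squared-exponential kernel and given inducing inputs.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0, sf2=1.0):
    d = X1[:, None] - X2[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

def fitc_predict(X, y, X_u, X_star, noise_var=0.1, ell=1.0, sf2=1.0):
    """FITC predictive mean/covariance in the naive 'Q-matrix' form; X_u are inducing inputs."""
    Kuu = se_kernel(X_u, X_u, ell, sf2) + 1e-8 * np.eye(len(X_u))   # jitter for stability
    Kuf = se_kernel(X_u, X, ell, sf2)
    Kus = se_kernel(X_u, X_star, ell, sf2)
    Kuu_inv = np.linalg.inv(Kuu)
    Qff = Kuf.T @ Kuu_inv @ Kuf
    # exact diagonal correction plus observation noise
    Lam = np.diag(np.diag(se_kernel(X, X, ell, sf2) - Qff)) + noise_var * np.eye(len(X))
    A = np.linalg.inv(Qff + Lam)
    Qsf = Kus.T @ Kuu_inv @ Kuf
    mu = Qsf @ A @ y
    cov = se_kernel(X_star, X_star, ell, sf2) - Qsf @ A @ Qsf.T
    return mu, cov

# usage sketch with uniformly placed inducing inputs
X = np.linspace(0, 10, 500)
y = np.sin(X) + 0.3 * np.random.default_rng(4).standard_normal(500)
mu, cov = fitc_predict(X, y, X_u=np.linspace(0, 10, 20), X_star=np.linspace(0, 10, 50))
```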

4. Recursive Computation

The model parameters θ are fixed unknown quantities and are typically learned using the maximum likelihood (ML) method. Once this is completed, the prediction of z along with its dispersion is computed. In this section, we present recursive computations for both CGP learning and prediction.

4.1. Learning

The Fisher information matrix (FIM), J(θ|y_k), quantifies the information about the model parameters θ contained in the data y_k under an assumed model p_θ(y_k). For Gaussian distributed data, the FIM can be obtained by the Slepian–Bangs formula; we refer readers to (B.3.3) in (Stoica & Moses, 1997) for this formula. The larger the FIM in the Löwner order sense, the lower the errors we can achieve when learning the optimal model parameters. Indeed, the inverse of the FIM provides an estimate of the variance of an efficient estimator θ̂ (Kay, 1993; Van Trees et al., 2013).

Each data segment y_k yields its own maximum likelihood estimate θ̂_k with an approximate error covariance given by the inverse FIM, where J_k = J(θ̂_k|y_k). When the segments are obtained from the same data-generating process, the model is applicable to each segment and we combine the estimates to provide a refined parameter estimate at segment K:

$$\hat\theta_K = \Big(\sum_{k=1}^{K} J_k\Big)^{-1} \Big(\sum_{k=1}^{K} J_k \hat\theta_k\Big) = \Lambda_K^{-1} s_K \qquad (15)$$

The quantities are computed recursively as

$$\Lambda_k = \Lambda_{k-1} + J_k, \qquad (16)$$
$$s_k = s_{k-1} + J_k \hat\theta_k, \qquad (17)$$

where Λ0 = 0 and s0 = 0, cf. (Zachariah et al., 2017).
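A minimal sketch of the recursions (15)–(17), assuming that each segment has already produced a parameter estimate θ̂_k and an information matrix J_k (here generated synthetically; the function and variable names are ours):

```python
import numpy as np

def combine_segment_estimates(estimates, fims):
    """Recursive FIM-weighted combination of per-segment ML estimates, cf. (15)-(17).

    estimates : list of parameter vectors theta_hat_k
    fims      : list of Fisher information matrices J_k (same order)
    """
    dim = len(estimates[0])
    Lam = np.zeros((dim, dim))        # Lambda_0 = 0
    s = np.zeros(dim)                 # s_0 = 0
    for theta_k, J_k in zip(estimates, fims):
        Lam = Lam + J_k               # (16)
        s = s + J_k @ theta_k         # (17)
    return np.linalg.solve(Lam, s)    # theta_hat_K = Lambda_K^{-1} s_K, cf. (15)

# usage sketch with synthetic per-segment estimates around theta = [1, 2]
rng = np.random.default_rng(1)
true_theta = np.array([1.0, 2.0])
fims = [np.diag(rng.uniform(50, 200, size=2)) for _ in range(25)]
ests = [true_theta + rng.multivariate_normal(np.zeros(2), np.linalg.inv(J)) for J in fims]
print(combine_segment_estimates(ests, fims))
```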

The recursive CGP learning of a model with a zero-mean function and a squared exponential (SE) covariance function (θ = [1, 2]⊤) is shown in Figure 1.

Figure 1. FIM-weighted average of MLEs based on segmented training data with different block lengths: (a) Nk = 50, k = 1, …, 100; (b) Nk = 100, k = 1, …, 50; (c) Nk = 200, k = 1, …, 25.

In this example we split the training data (N = 5000) into a number of segments. Three cases with different segment lengths (Nk = 50, 100, and 200) are shown. Note that the ML estimator is asymptotically efficient, and thus the accuracy of θ̂_k and J_k improves as Nk becomes larger. There is therefore a trade-off between accuracy and computational cost. Setting Nk = 200, the accuracy of the weighted combination (15) is satisfactory, while the computational complexity is still kept at a relatively low level compared to full GP learning.

It is worth pointing out that the above idea of segmenting data and weighting the contributions from each module is similar to the approach found in the recent work by Jacob et al. (2017). The modularized inference increases robustness to misspecifications of the joint data model.

4.2. Prediction

We now present the mean and covariance of the updated belief distribution for the CGP, cf. Theorem 1. Specifically, given the CGP posterior distribution p_CGP(z|y_{1:k−1}) = N(µ_{z|y_{1:k−1}}, Σ_{z|y_{1:k−1}}) and a new data segment y_k, the posterior p_CGP(z|y_{1:k}) = N(µ_{z|y_{1:k}}, Σ_{z|y_{1:k}}) is updated recursively, cf. (Särkkä, 2013).

First, a prior is constructed. For k = 1, the GP prior in (2) is used as the prior distribution for z; for k ≥ 2, the previous posterior p_CGP(z|y_{1:k−1}) is used as the new prior.

Second, the conditional distribution p(y_k|z) is obtained with mean and covariance matrix
$$\mu_{y_k|z} = \mu_{y_k} + \Sigma_{y_k,z} \Sigma_{z,z}^{-1} (\mu_{z|y_{1:k-1}} - \mu_z), \qquad (18)$$
$$\Sigma_{y_k|z} = \Sigma_{y_{k,k}} - \Sigma_{y_k,z} \Sigma_{z,z}^{-1} \Sigma_{z,y_k}. \qquad (19)$$

Finally, the posterior p_CGP(z|y_{1:k}) has mean and covariance
$$\mu_{z|y_{1:k}} = \mu_{z|y_{1:k-1}} + \Sigma_{z|y_{1:k-1}} H_k^\top G_k^{-1} \big[y_k - \mu_{y_k} - H_k(\mu_{z|y_{1:k-1}} - \mu_z)\big], \qquad (20)$$
$$\Sigma_{z|y_{1:k}} = \Sigma_{z|y_{1:k-1}} - \Sigma_{z|y_{1:k-1}} H_k^\top G_k^{-1} H_k \Sigma_{z|y_{1:k-1}}. \qquad (21)$$

The matrices H_k and G_k are defined as
$$H_k \triangleq \Sigma_{y_k,z} \Sigma_{z,z}^{-1}, \qquad (22)$$
$$G_k \triangleq \Sigma_{y_k|z} + H_k \Sigma_{z|y_{1:k-1}} H_k^\top. \qquad (23)$$

Together, these equations form a recursive computation in which p_CGP(z|y_{1:k}) is the basis for obtaining p_CGP(z|y_{1:k+1}) after observing y_{k+1}.
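The update equations (18)–(23) can be implemented directly. Below is a sketch (our own; a zero prior mean and a squared-exponential kernel are assumed for illustration) that processes one data segment at a time and returns the CGP posterior mean and covariance at the test points.

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0, sf2=1.0):
    d = X1[:, None] - X2[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

def cgp_predict(segments, X_star, noise_var=0.1, ell=1.0, sf2=1.0):
    """Recursive CGP prediction, cf. (18)-(23); zero prior mean assumed."""
    Sig_zz = se_kernel(X_star, X_star, ell, sf2)      # prior covariance of z
    mu = np.zeros(len(X_star))                        # prior mean mu_z = 0
    Sig = Sig_zz.copy()                               # current posterior covariance
    Sig_zz_inv = np.linalg.inv(Sig_zz + 1e-9 * np.eye(len(X_star)))  # jitter for stability
    for X_k, y_k in segments:
        Sig_yz = se_kernel(X_k, X_star, ell, sf2)     # Sigma_{y_k, z}
        Sig_yy = se_kernel(X_k, X_k, ell, sf2) + noise_var * np.eye(len(X_k))
        H = Sig_yz @ Sig_zz_inv                       # (22)
        Sig_y_given_z = Sig_yy - H @ Sig_yz.T         # (19)
        G = Sig_y_given_z + H @ Sig @ H.T             # (23)
        gain = Sig @ H.T @ np.linalg.inv(G)
        mu = mu + gain @ (y_k - H @ mu)               # (20), with mu_{y_k} = mu_z = 0
        Sig = Sig - gain @ H @ Sig                    # (21)
    return mu, Sig

# usage sketch: split a 1-D data set into K = 4 contiguous segments
rng = np.random.default_rng(2)
X = np.linspace(0, 20, 400)
y = np.sin(X) + 0.3 * rng.standard_normal(X.size)
segs = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
mu, Sig = cgp_predict(segs, np.linspace(0, 20, 60))
```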

5. Performance Analysis of Composite GPs

Given the appealing computational properties of the CGP, a natural question is how good its posterior is compared with that of the full GP. Specifically, we are interested in the predictive performance and the ability to represent uncertainty about the latent state z, which will be addressed in the following subsections. Both aspects are related to the amount of information that the training data provide about the latent state, i.e., I(z; y) (Cover & Thomas, 1991).

5.1. Data-Averaged MSE

We begin by considering an arbitrary test point, so that z is scalar and ẑ is any predictor. Given the prior belief distribution of z with known variance σ_z², the data-averaged mean squared error (MSE) is lower bounded in terms of the mutual information between z and the training data y:

$$\mathbb{E}[(z - \hat z)^2] \;\geq\; \frac{1}{2\pi e}\exp\!\big[2h(z|y)\big] \;=\; \sigma_z^2 \exp[-2 I(z; y)], \qquad (24)$$

which holds under fairly general conditions, where h(z|y) denotes the conditional differential entropy; cf. Theorem 17.3.2 in (Cover & Thomas, 1991). When the marginal data distribution p(y) obtained from the GP is well specified, the bound (24) equals the posterior variance σ²_{z|y} of the GP and is attained by setting ẑ = µ_{z|y}. Thus I(z; y) also represents the reduced uncertainty of the updated belief distribution for the GP.

In this scenario, what is the additional MSE incurred when using the CGP posterior mean as a predictor ẑ_CGP? In general, the data-averaged MSE equals

$$\mathbb{E}[(z - \hat z_{\mathrm{CGP}})^2] = \sigma^2_{z|y} + \mathbb{E}\big[(\mu_{z|y} - \mu_{z|y_{1:K}})^2\big], \qquad (25)$$

where the second term is the additional MSE, given by the difference between the GP and CGP posterior means. To obtain closed-form expressions for this term, we consider the case of K = 2 segments.

Theorem 2 (Excess MSE of CGP). The additional MSE of CGP when K = 2 equals
$$\mathbb{E}\big[(\mu_{z|y} - \mu_{z|y_{1:K}})^2\big] = \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}^{\!\top} \begin{bmatrix} \Sigma_{y_{1,1}} & \Sigma_{y_{1,2}} \\ \Sigma_{y_{2,1}} & \Sigma_{y_{2,2}} \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix}, \qquad (26)$$

where Σ_{y_{i,j}} is the covariance matrix of the i-th and j-th data segments. The vectors α₁ and α₂ are given by
$$\begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \left( \begin{bmatrix} A & B \\ C & D \end{bmatrix} - \begin{bmatrix} A' & B' \\ C' & D' \end{bmatrix} \right)^{\!\top} \begin{bmatrix} \Sigma_{z,y_1} \\ \Sigma_{z,y_2} \end{bmatrix}, \qquad (27)$$
where Σ_{z,y_i} is the column vector of covariances between the test point and the i-th data segment, and the two block matrices [A, B; C, D] and [A′, B′; C′, D′] are the coefficient matrices for the full GP and the CGP predictions, respectively.

The full derivation is omitted here due to page limitations; we present a sketch of the proof in the following. It can be shown that the coefficient matrix for the GP prediction is the block inverse of the covariance matrix of the data y:
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} \Sigma_{y_{1,1}} & \Sigma_{y_{1,2}} \\ \Sigma_{y_{2,1}} & \Sigma_{y_{2,2}} \end{bmatrix}^{-1} = \begin{bmatrix} \Sigma_{y_{1,1}}^{-1} + \Sigma_{y_{1,1}}^{-1}\Sigma_{y_{1,2}}\Sigma_{y_{2|1}}^{-1}\Sigma_{y_{2,1}}\Sigma_{y_{1,1}}^{-1} & -\Sigma_{y_{1,1}}^{-1}\Sigma_{y_{1,2}}\Sigma_{y_{2|1}}^{-1} \\ -\Sigma_{y_{2|1}}^{-1}\Sigma_{y_{2,1}}\Sigma_{y_{1,1}}^{-1} & \Sigma_{y_{2|1}}^{-1} \end{bmatrix} \qquad (28)$$
where
$$\Sigma_{y_{2|1}} = \Sigma_{y_{2,2}} - \Sigma_{y_{2,1}}\Sigma_{y_{1,1}}^{-1}\Sigma_{y_{1,2}}. \qquad (29)$$

To derive the corresponding coefficient matrix for the CGP, we first define the approximated covariance matrix between data blocks i and j:
$$\widetilde\Sigma_{y_{i,j}} \triangleq \Sigma_{y_i,z}\, \frac{1}{\sigma_z^2}\, \Sigma_{z,y_j}, \qquad (30)$$

which means that the segmented observations are indirectly connected through the test point. Thereafter, the approximated conditional covariance matrix of y₂ given y₁ can be expressed as
$$\widetilde\Sigma_{y_{2|1}} \triangleq \Sigma_{y_{2,2}} - \widetilde\Sigma_{y_{2,1}}\Sigma_{y_{1,1}}^{-1}\widetilde\Sigma_{y_{1,2}}. \qquad (31)$$

After a few steps of algebraic manipulation, the block coefficient matrix for the CGP can be obtained as
$$\begin{bmatrix} A' & B' \\ C' & D' \end{bmatrix} = \begin{bmatrix} \Sigma_{y_{1,1}}^{-1} & 0 \\[4pt] -\dfrac{\sigma^2_{z|y_1}}{\sigma_z^2}\,\widetilde\Sigma_{y_{2|1}}^{-1}\widetilde\Sigma_{y_{2,1}}\Sigma_{y_{1,1}}^{-1} & \dfrac{\sigma^2_{z|y_1}}{\sigma_z^2}\,\widetilde\Sigma_{y_{2|1}}^{-1} \end{bmatrix} \qquad (32)$$
where σ²_{z|y₁} = σ_z² − Σ_{z,y₁} Σ_{y_{1,1}}⁻¹ Σ_{y₁,z} is the posterior variance of the latent variable z given the first data block y₁.

By comparing the expressions for the coefficient matrices of GP and CGP (K = 2), we can make the following observations. First, for CGP, the covariance matrix Σ_{y_{2,1}} is replaced by an approximation Σ̃_{y_{2,1}}, which relies on the test points to 'connect' the two blocks, cf. (30). Second, the CGP prediction effectively assumes that the segmented training data blocks are conditionally independent given the test data. Therefore, the coefficient matrix A′ is simply the inverse of the first data block's covariance matrix, and B′ = 0. Thus Theorem 2 provides a means of probing the sources of the additional MSE incurred when using CGP compared to GP.

5.2. Data-Averaged KLD

Another way to compare different updated belief distributions is to quantify how much they differ from the prior distribution p(z). Specifically, we use the Kullback–Leibler divergence D(p(z|y)||p(z)) and average it over all realizations y under the marginal data model. For GP, it is straightforward to show the identity E_y[D(p(z|y)||p(z))] = I(z; y), which is the average information gain about z provided by the data and the selected model, cf. (Bernardo, 1979).

Theorem 3 (Difference between average information gains). The data-averaged KL divergences of the posterior belief distributions differ by
$$\mathbb{E}_y\big[D(p(z|y)\,\|\,p(z))\big] - \mathbb{E}_{y_{1:K}}\big[D(p_{\mathrm{CGP}}(z|y_{1:K})\,\|\,p(z))\big] = I(z; y) - \sum_{k=1}^{K} I(z; y_k), \qquad (33)$$
where I(z; y) is the mutual information between the latent variable z and all observations y, and $\sum_{k=1}^{K} I(z; y_k)$ is the sum of the mutual information between z and each observation block y_k.

Proof. The data-averaged KLD between the CGP posterior and the prior is given by


Figure 2. The GP, CGP, and SGP predictive posterior distributions (panels a–c, showing training data, test data, and predictions over time t; inducing variables marked for SGP) and the corresponding covariance matrices (panels d–f).

$$\begin{aligned}
\mathbb{E}_{y_{1:K}}\big[D(p_{\mathrm{CGP}}(z|y_{1:K})\,\|\,p(z))\big]
&= \int_{\mathcal{Y}_{1:K}} p(y_{1:K}) \int_{\mathcal{Z}} p_{\mathrm{CGP}}(z|y_{1:K}) \ln \frac{p_{\mathrm{CGP}}(z|y_{1:K})}{p(z)}\, dz\, dy_{1:K} \\
&= \sum_{k=1}^{K} \int_{\mathcal{Y}_k} \int_{\mathcal{Z}} p(z, y_k) \Big[\ln \frac{1}{p(z)} + \ln p(z|y_k)\Big]\, dz\, dy_k \\
&= \sum_{k=1}^{K} I(z; y_k).
\end{aligned} \qquad (34)$$

Note that the second equality holds because $p_{\mathrm{CGP}}(y_{1:K}|z) = \prod_{k=1}^{K} p(y_k|z)$, $p(y_{1:K}) = \int_{\mathcal{Z}} p(z) \prod_{k=1}^{K} p(y_k|z)\, dz$, and
$$\int_{\mathcal{Y}_j} \int_{\mathcal{Y}_k} \int_{\mathcal{Z}} p(y_j|z)\, p(z, y_k) \ln \frac{p(z|y_k)}{p(z)}\, dz\, dy_k\, dy_j = \int_{\mathcal{Y}_k} \int_{\mathcal{Z}} p(z, y_k) \ln \frac{p(z|y_k)}{p(z)}\, dz\, dy_k, \quad \forall j \neq k. \qquad (35)$$

Remark 1 (Redundancy and synergy). The difference between information gains in Theorem 3 can be positive or negative. If the difference is negative, the segmented data are said to be redundant and there is overlapping information in the data segments about the latent variable; otherwise the segmented data are said to be synergistic, in which case there is more information about the latent state when jointly considering all segmented data (Barrett, 2015; Timme et al., 2014).

Remark 2 (Under- and overestimation of uncertainty). When the marginal data distribution obtained from the GP is correctly specified, I(z; y) provides the correct measure of uncertainty about the latent variable, cf. (24). In this scenario, the CGP belief distribution will either under- or overestimate the uncertainty depending on whether the data segments are redundant or synergistic in nature.
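For jointly Gaussian models the mutual information has the closed form I(z; y) = ½ ln(σ_z² / σ²_{z|y}), so the redundancy or synergy of a given segmentation can be checked directly. A sketch (ours; a scalar test point and a squared-exponential kernel are assumed):

```python
import numpy as np

def se_kernel(X1, X2, ell=1.0, sf2=1.0):
    d = X1[:, None] - X2[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

def mutual_info(Xobs, x_star, noise_var=0.1, ell=1.0, sf2=1.0):
    """I(z; y) = 0.5 * ln(sigma_z^2 / sigma_{z|y}^2) for a scalar Gaussian latent z."""
    x_star = np.atleast_1d(np.asarray(x_star, dtype=float))
    Kyy = se_kernel(Xobs, Xobs, ell, sf2) + noise_var * np.eye(len(Xobs))
    kzy = se_kernel(Xobs, x_star, ell, sf2)
    sz2 = se_kernel(x_star, x_star, ell, sf2).item()
    s_post = sz2 - (kzy.T @ np.linalg.solve(Kyy, kzy)).item()
    return 0.5 * np.log(sz2 / s_post)

# information gain of the full data vs. the sum over K segments, cf. Theorem 3
X = np.linspace(0, 10, 200)
segments = np.array_split(X, 4)
I_full = mutual_info(X, x_star=5.0)
I_sum = sum(mutual_info(Xk, x_star=5.0) for Xk in segments)
# I_full < I_sum indicates redundant segments (CGP underestimates uncertainty);
# I_full > I_sum indicates synergistic segments (CGP overestimates uncertainty).
print(I_full, I_sum)
```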

6. Examples

In this section, we present examples of CGPs for processing large data sets. The first two examples are based on synthetic data. In the third example, we demonstrate the CGP for NOx predictions based on real-world data.

6.1. Synthetic Time Series Data

We consider a GP model with a linear mean function µ(t) = at + b and a covariance function with periodic patterns,
$$\sigma(t, t') = \alpha_1^2 \exp\!\Big[-\frac{2\sin^2(\pi|t - t'|/T)}{\theta_1^2}\Big] + \alpha_2^2 \exp\!\Big[-\frac{(t - t')^2}{\theta_2^2}\Big] + \sigma_\varepsilon^2,$$
where the period is T = 128. In total 4224 data points are simulated: the first 4096 points are used as observations, and the last 128 points are to be predicted. Assuming the hyper-parameters are known, the GP, CGP, and SGP predictive posterior distributions are illustrated in Figure 2.
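A sketch of this composite covariance function (our own implementation; the parameter values are placeholders rather than the ones used in the experiment):

```python
import numpy as np

def periodic_plus_se_kernel(t1, t2, a1=1.0, theta1=1.0, a2=1.0, theta2=50.0,
                            period=128.0, noise_var=0.1):
    """Periodic + squared-exponential covariance with additive noise, as in Section 6.1."""
    d = t1[:, None] - t2[None, :]
    k = (a1 ** 2) * np.exp(-2.0 * np.sin(np.pi * np.abs(d) / period) ** 2 / theta1 ** 2)
    k += (a2 ** 2) * np.exp(-(d ** 2) / theta2 ** 2)
    if t1 is t2:
        k += noise_var * np.eye(len(t1))   # sigma_eps^2 added only on the diagonal of K(X, X)
    return k

t = np.arange(256.0)
K = periodic_plus_se_kernel(t, t)          # 256 x 256 covariance matrix
```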


Figure 3. The GP, CGP (K = 4 and 16), and SGP interpolations for the missing values of the GRF in Figure 4: predictive means (panels a–d) and predictive variances (panels e–h) over the test region (x1, x2).

Figure 4. A realization of the GRF: points within the red box are used for testing; blue dots are the inducing variables.

In the CGP case, the observations are sequentially divided into four segments (each of length 1024); in the SGP (FITC) case, 128 inducing variables are placed uniformly across the space of observation inputs.

Several observations can be made from this example. First, the recursive computations render the CGP very fast compared to the GP or the SGP methods. Running on a standard laptop, the runtime of the CGP is less than a second to process 5000 data points, and the runtime grows linearly with respect to K. Second, in this particular case, the predictive variances are underestimated by the composite GP. This is consistent with redundant information in the data segments, as per Remark 2.

6.2. Synthetic Spatial Data

In this example, we consider a two-dimensional Gaussian random field (GRF) model: y(x₁, x₂) ~ GP(0, k(x₁, x₂, x₁′, x₂′)), where
$$k(x_1, x_2, x_1', x_2') = \alpha^2 \exp\!\Big[-\frac{(x_1 - x_1')^2}{\theta_1^2} - \frac{(x_2 - x_2')^2}{\theta_2^2}\Big].$$
This GRF model can be efficiently simulated via circulant embedding (Kroese & Botev, 2015). The contour plot of a realization of the GRF (α = 1, θ₁ = 8, θ₂ = 8) is illustrated in Figure 4. This 64 × 64 grid of GRF points is partitioned into N = 3840 observations and 512 test points.
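A sketch of the anisotropic squared-exponential covariance above, together with a direct Cholesky-based draw of one realization on a small grid (our illustration; the experiment in the paper uses circulant embedding, which scales better and is not reproduced here):

```python
import numpy as np

def grf_kernel(P1, P2, alpha=1.0, theta1=8.0, theta2=8.0):
    """Anisotropic squared-exponential covariance between 2-D points (rows of P1, P2)."""
    d1 = P1[:, 0][:, None] - P2[:, 0][None, :]
    d2 = P1[:, 1][:, None] - P2[:, 1][None, :]
    return alpha ** 2 * np.exp(-(d1 ** 2) / theta1 ** 2 - (d2 ** 2) / theta2 ** 2)

# one realization on a small grid via a Cholesky factor
n = 32
g1, g2 = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
P = np.column_stack([g1.ravel(), g2.ravel()]).astype(float)
K = grf_kernel(P, P) + 1e-8 * np.eye(n * n)       # jitter for numerical stability
rng = np.random.default_rng(3)
field = (np.linalg.cholesky(K) @ rng.standard_normal(n * n)).reshape(n, n)
```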

The predictions are illustrated in Figure 3. Visually, the CGP is most similar to the GP, and the posterior variance of the CGP, representing the uncertainty, is slightly higher than that of the GP. By contrast, the SGP and the GP exhibit notable differences, and the posterior variance of the former is significantly higher than that of the latter. The additional MSE (25) incurred by using these posterior distributions is visualized for CGP (K = 4 and K = 16) and SGP in Figure 5, using 100 different realizations of the GRF. When the number of data segments varies from 4 to 16, the computation time of the CGP is reduced but the additional MSE increases, which is consistent with our analysis in the previous section.


Figure 5. Data-averaged approximation errors for CGP (K = 4 and 16) and SGP compared to GP spatial interpolations: (a) E[(µ_GP − µ_CGP)²] with K = 4; (b) E[(µ_GP − µ_CGP)²] with K = 16; (c) E[(µ_GP − µ_SGP)²].

By contrast, SGP yields a notably higher additional MSE than GP.

6.3. A Real-world Case: NOx Prediction

Next we demonstrate the use of the CGP in a real application for predicting NOx (nitrogen oxides). The data set (available online: http://slb.nu/slbanalys/) includes more than 10 years of hourly NOx measurements for a city with a population of 1 million. Figure 6(a) shows that the NOx data are far from Gaussian: the measurements are non-negative, highly skewed towards lower values, and have a long tail at high values. Therefore, we apply a logarithm transformation (Shumway & Stoffer, 2011) to bring the NOx measurements closer to normally distributed data. The result after the transformation is shown in Figure 6(b).

Figure 6. Logarithm transformation of NOx measurements: (a) histogram of raw NOx measurements (µg/m³); (b) histogram after the log transformation.

An example of 24-hour NOx prediction is illustrated in Figure 7. In this example, the previous two years of measurements (17472 hourly measurements) are used by the CGP model to produce the posterior of NOx levels on March 12, 2014. The resulting posterior is a log-normal process: the predictions are non-negative; the distribution is skewed towards lower values and has a long tail at higher values; and the width of the credibility intervals is significantly larger for rush hours than for late and early hours on working days.

Figure 7. A 24-hour NOx prediction with the composite GP.
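Since the CGP is fitted to log-transformed measurements, its Gaussian posterior on the log scale maps to a log-normal predictive distribution on the original NOx scale. A sketch (ours; the function name and inputs are illustrative) of recovering medians and credibility intervals in µg/m³:

```python
import numpy as np
from scipy.stats import norm

def lognormal_summary(mu_log, var_log, level=0.95):
    """Median and credibility interval on the original scale for a Gaussian
    posterior (mu_log, var_log) obtained on the log scale."""
    z = norm.ppf(0.5 + level / 2.0)
    sd = np.sqrt(var_log)
    median = np.exp(mu_log)                       # skewed towards lower values
    lower, upper = np.exp(mu_log - z * sd), np.exp(mu_log + z * sd)
    mean = np.exp(mu_log + 0.5 * var_log)         # log-normal mean (long upper tail)
    return median, (lower, upper), mean

# usage sketch: a wider log-scale variance (e.g., at rush hours) gives a wider NOx interval
print(lognormal_summary(np.array([3.0, 4.2]), np.array([0.1, 0.3])))
```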

Figure 8. Experiments of daily NOx prediction over the year 2014: RMSE (µg/m³) and runtime (s) of the CGP and the SGP versus the data block length Nk or the number of inducing variables M (in weeks).

Finally, we show the root mean square error (RMSE) and runtime results for the CGP and the SGP, with 365 24-hour predictions over the year 2014. The GP prediction is used as a reference for comparison. As GP prediction is not very scalable, we limit the maximum number of weeks used for one prediction (for instance, the prediction shown in Figure 7) to 50. Thereafter, we vary the number of observations per batch for the CGP and the number of inducing variables for the SGP from 1 to 25 weeks. The resulting RMSE (µg/m³) and runtime (seconds) are shown in Figure 8. The RMSE of the GP with 50 weeks of data is slightly above 75, which is provided as a lower bound when comparing the CGP and the SGP results. Figure 8 shows a trade-off between the runtime and the RMSE with CGPs and SGPs for this particular data set when the total data used for one prediction is fixed (50 weeks).


The SGP with 5 weeks of inducing variables achieves a lower RMSE than the CGP with 10 batches of 5 weeks each. However, the runtime of the CGP is significantly lower than that of the SGP. Note that as the number of observations per batch or the number of inducing variables increases, the difference in runtimes becomes larger while the gap between the RMSEs closes, which demonstrates the favorable scalability and accuracy of the CGP model.

7. Conclusion

In this work, we addressed the problem of learning GP models and predicting an underlying process in scenarios where the data sets are large. Using a general belief updating framework, we applied a composite likelihood approach and derived the CGP posterior distribution. It can be both updated and learned recursively with a runtime that is linear in the number of data segments. We showed that the GP, CGP, and SGP posteriors can all be recovered in this framework using different likelihood models.

Furthermore, we compared the CGP posterior with that of the full GP. We obtained closed-form expressions for the additional prediction MSE incurred by the CGP and for the difference in information gains under both models. The results can be used as a conceptual as well as computational tool for designing segmentation schemes and evaluating the errors induced when using the scalable CGP in different applications. The design of segmentation schemes based on the derived quantities is a topic of future research.


References

Barrett, Adam B. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Phys. Rev. E, 91:052802, May 2015. doi: 10.1103/PhysRevE.91.052802.

Bernardo, Jose M. Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society. Series B (Methodological), 41(2):113–147, 1979. ISSN 0035-9246.

Bijl, Hildo, Wingerden, Jan-Willem, Schön, Thomas B., and Verhaegen, Michel. Online sparse Gaussian process regression using FITC and PITC approximations. IFAC-PapersOnLine, 48(28):703–708, 2015. ISSN 2405-8963. 17th IFAC Symposium on System Identification SYSID 2015.

Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006. ISBN 9780387310732.

Bissiri, P. G., Holmes, C. C., and Walker, S. G. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016. ISSN 1467-9868. doi: 10.1111/rssb.12158.

Cao, Yanshuai and Fleet, David J. Generalized product of experts for automatic and principled fusion of Gaussian process predictions. CoRR, abs/1410.7827, 2014.

Cover, Thomas M. and Thomas, Joy A. Elements of Information Theory. Wiley, New York, 1991. ISBN 9780471062592.

Deisenroth, M. P., Fox, D., and Rasmussen, C. E. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, Feb 2015. ISSN 0162-8828. doi: 10.1109/TPAMI.2013.218.

Deisenroth, Marc and Ng, Jun Wei. Distributed Gaussian processes. In Bach, Francis and Blei, David (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1481–1490, Lille, France, 07–09 Jul 2015. PMLR.

Fox, Charles W. and Roberts, Stephen J. A tutorial on variational Bayesian inference. Artificial Intelligence Review, 38(2):85–95, Aug 2012. ISSN 1573-7462. doi: 10.1007/s10462-011-9236-8.

Hensman, James, Fusi, Nicolò, and Lawrence, Neil D. Gaussian processes for big data. CoRR, abs/1309.6835, 2013.

Jacob, P. E., Murray, L. M., Holmes, C. C., and Robert, C. P. Better together? Statistical learning in models made of modules. arXiv e-prints, August 2017.

Kay, Steven M. Fundamentals of Statistical Signal Processing: Vol. 1, Estimation Theory. Prentice Hall PTR, Upper Saddle River, NJ, 1993. ISBN 0133457117.

Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. J. Mach. Learn. Res., 9:235–284, June 2008. ISSN 1532-4435.

Kroese, Dirk P. and Botev, Zdravko I. Spatial process simulation, pp. 369–404. Springer International Publishing, 2015. ISBN 978-3-319-10064-7. doi: 10.1007/978-3-319-10064-7_12.

Quiñonero Candela, Joaquin and Rasmussen, Carl Edward. A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res., 6:1939–1959, December 2005. ISSN 1532-4435.

Rasmussen, Carl E. and Williams, Christopher K. I. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006. ISBN 9780262182539.

Särkkä, Simo. Bayesian Filtering and Smoothing. Cambridge University Press, New York, NY, USA, 2013. ISBN 9781107619289.

Seeger, Matthias, Williams, Christopher, and Lawrence, Neil. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, 2003.

Shumway, Robert H. and Stoffer, David S. Time Series Analysis and Its Applications: With R Examples. Springer, New York, 3rd edition, 2011. ISBN 9781441978646.

Snelson, Edward and Ghahramani, Zoubin. Sparse Gaussian processes using pseudo-inputs. In Weiss, Y., Schölkopf, B., and Platt, J. C. (eds.), Advances in Neural Information Processing Systems 18, pp. 1257–1264. MIT Press, 2006.

Snelson, Edward and Ghahramani, Zoubin. Local and global sparse Gaussian process approximations. In Meila, Marina and Shen, Xiaotong (eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pp. 524–531, San Juan, Puerto Rico, 21–24 Mar 2007. PMLR.

Stoica, Petre and Moses, Randolph L. Introduction to Spectral Analysis. Prentice Hall, Upper Saddle River, 1997. ISBN 9780132584197.

Timme, Nicholas, Alford, Wesley, Flecker, Benjamin, and Beggs, John M. Synergy, redundancy, and multivariate information measures: an experimentalist's perspective. Journal of Computational Neuroscience, 36(2):119–140, Apr 2014. ISSN 1573-6873. doi: 10.1007/s10827-013-0458-4.

Titsias, Michalis. Variational learning of inducing variables in sparse Gaussian processes. In van Dyk, David and Welling, Max (eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pp. 567–574, Clearwater Beach, Florida, USA, 16–18 Apr 2009. PMLR.

Tresp, Volker. A Bayesian committee machine. Neural Comput., 12(11):2719–2741, November 2000. ISSN 0899-7667. doi: 10.1162/089976600300014908.

Van Trees, Harry L., Bell, Kristine L., and Tian, Zhi. Detection, Estimation, and Modulation Theory. John Wiley & Sons, Inc., Hoboken, NJ, second edition, 2013. ISBN 9780470542965.

Varin, Cristiano, Reid, Nancy, and Firth, David. An overview of composite likelihood methods. Statistica Sinica, 21(1):5–42, 2011. ISSN 1017-0405.

Williams, Christopher K. I. and Seeger, Matthias. Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V. (eds.), Advances in Neural Information Processing Systems 13, pp. 682–688. MIT Press, 2001.

Zachariah, D., Dwivedi, S., Händel, P., and Stoica, P. Scalable and passive wireless network clock synchronization in LoS environments. IEEE Transactions on Wireless Communications, 16(6):3536–3546, June 2017. ISSN 1536-1276. doi: 10.1109/TWC.2017.2683486.