
Online Sparse Gaussian Process Training with Input Noise

Hildo Bijl [email protected]
Delft Center for Systems and Control
Delft University of Technology
Delft, The Netherlands

Thomas B. Schön [email protected]
Department of Information Technology
Uppsala University
Uppsala, Sweden

Jan-Willem van Wingerden [email protected]
Delft Center for Systems and Control
Delft University of Technology
Delft, The Netherlands

Michel Verhaegen [email protected]

Delft Center for Systems and Control

Delft University of Technology

Delft, The Netherlands

Abstract

Gaussian process regression traditionally has three important downsides. (1) It is computationally intensive, (2) it cannot efficiently implement newly obtained measurements online, and (3) it cannot deal with stochastic (noisy) input points. In this paper we present an algorithm tackling all three of these issues simultaneously. The resulting Sparse Online Noisy Input GP (SONIG) regression algorithm can incorporate new measurements in constant runtime. A comparison has shown that it is more accurate than similar existing regression algorithms. In addition, the algorithm can be applied to non-linear black-box system modeling, where its performance is competitive with non-linear ARX models.

Keywords: Gaussian Processes, Regression, Machine Learning, Sparse Methods, System Identification

1. Introduction

Gaussian Process (GP) regression (see Rasmussen and Williams (2006) for an introduction) has been increasing in popularity recently. Fundamentally, it works similarly to other function approximation algorithms like neural networks and support vector machines. However, it has its basis in Bayesian probability theory. The result is that GP regression not only gives estimates of function outputs, but also variances – information about the uncertainty – of these estimates. This advantage has proven to be useful in a variety of applications, like those by Deisenroth and Rasmussen (2011) or Hensman et al. (2013).

Traditional GP regression is not without its limitations though. In fact, there are three important limitations, which we will outline here, also discussing the current state of the literature on these limitations. The contribution of this paper is a new GP algorithm that overcomes all three of these limitations simultaneously, which has not been done in this way before.

1.1 Using stochastic input points

The first limitation is that the GP regression algorithm assumes that the input points are deterministic. This assumption concerns both the measurement points $x_m$ and the trial points $x_*$. For noisy (stochastic) trial points, we can work around the problem by applying moment matching, as explained by Deisenroth (2010). This technique can subsequently also be expanded for noisy measurement points using the ideas from Dallaire et al. (2009).

However, the problem with this technique is that it only adjusts the uncertainty (the covariance) of points based on prior data and does not involve posterior data (derived from measurement values) at all. As such, its performance is significantly limited. A method which does include posterior data is given by Girard and Murray-Smith (2003), but this method only works for noisy trial points and not for noisy measurement points. To the best of our knowledge, there is only one previously available method which both includes posterior data and works with noisy measurement points, which is the NIGP algorithm by McHutchon and Rasmussen (2011). This is the method we will be expanding upon in this paper.

It should be noted that there is also another way to take into account the effects of noisy measurement points: by assuming that the noise variance varies over the input space, a feature called heteroscedasticity. This has been investigated quite in-depth by, among others, Goldberg et al. (1997), Le and Smola (2005), Wang and Neal (2012) and Snelson and Ghahramani (2006a), but it would give more degrees of freedom to the learning algorithm than would be required, and as a result these methods have a reduced performance for our particular problem.

1.2 Reducing the computational complexity

The second limitation of Gaussian process regression is its computational complexity. The straightforward regression algorithm has a runtime of $O(n^3)$, with $n$ the number of measurements. This makes it hard to combine GPs with the recent trend of using more and more data. As a solution, sparse methods have been proposed. An overview of such methods is given in Candela and Rasmussen (2005), summarizing various contributions from Smola and Bartlett (2001), Csato and Opper (2002), Seeger et al. (2003) and Snelson and Ghahramani (2006b) into a comprehensive framework. With these methods, and particularly with the FITC method which we will apply in this paper (see Candela and Rasmussen (2005)), the runtime can be reduced to being linear with respect to $n$. More recent work on sparse GPs involves the use of variational inference (Titsias, 2009; Hensman et al., 2013; Gal et al., 2014).

An alternative to the sparse GPs is offered by the distributed Gaussian process construction recently developed by Deisenroth and Ng (2015). Their construction makes use of the full data set and instead reduces the computational complexity by distributing both the computational load and the memory load amongst a large set of independent computational units.


1.3 Online application

The third limitation is the difficulty with which GP regression algorithms can add new measurements. Traditionally, when a new measurement $n+1$ is added to the existing set of $n$ measurement points, the GP regression algorithm would have to redo all computations to take this new measurement into account. For online applications, in which new measurement data continuously arrive, this seems unnecessarily inefficient.

And indeed, various online GP regression methods exist which approximate a GP through a relatively small subset of $n_u$ inducing input (basis) points. Many of these techniques (Csato and Opper, 2002; Candela and Rasmussen, 2005; Ranganathan et al., 2011; Kou et al., 2013; Hensman et al., 2013) are designed for sparse GP algorithms, but they do not work for any set of inducing input points. The exceptions are Huber (2013, 2014) and Bijl et al. (2015), which both give an online method of applying the FITC algorithm for any set of inducing input points.

1.4 Contribution and outline of the paper

In this paper the solutions to these three limitations will be combined into a single novel algorithm, called Sparse Online Noisy Input GP (SONIG). This algorithm will be capable of adding information obtained from noisy input points in an online way to Gaussian process predictions. To the best of our knowledge, an algorithm with these features has not been published before. In addition, we will subsequently expand this algorithm, enabling it to identify non-linear system dynamics. The resulting algorithm, called System Identification using Sparse Online GPs (SISOG), will be applied to a magneto-rheological fluid damper example.

This paper is set up as follows. We start by defining the notation we use in Section 2, as well as outlining the equations normally used in GP regression. In Section 3 we expand these methods, making them applicable to GP regression with noisy input points. This results in the main version of the SONIG algorithm. Section 4 subsequently outlines various ways to expand on this SONIG algorithm. Test results of the main SONIG algorithm and of an extension focused on system identification are then shown in Section 5. Section 6 finally gives conclusions and recommendations.

2. Overview of Gaussian process regression methods

In this section we will give an overview of the existing theory on GP regression. Specifically, we introduce the notation that we will use as well as the method of thought behind the regression algorithm. Section 2.1 mainly covers theory discussed by Rasmussen and Williams (2006) on regular GP regression, Section 2.2 discusses sparse GP regression with insights from Candela and Rasmussen (2005) and Section 2.3 looks at the online implementation of these algorithms using recent results from Bijl et al. (2015).

2.1 Regular Gaussian process regression

In GP regression we aim to approximate a function $f(x)$. To do this, we first obtain various measurements $(x_{m_1}, y_1), (x_{m_2}, y_2), \ldots, (x_{m_n}, y_n)$, with $x_m$ the measurement input point, $y$ the measured function output and $n$ the number of measurements. Here we assume that the measurement outputs are corrupted by noise, such that $y_i = f(x_{m_i}) + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$ being Gaussian white noise with variance $\sigma_n^2$.

As shorthand notation, we merge all the measurement points $x_{m_i}$ into a measurement set $X_m$ and all output values $y_i$ into an output vector $y$. We now write the noiseless function output $f(X_m)$ as $f_m$, such that $y = f_m + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \Sigma_n)$ and $\Sigma_n = \sigma_n^2 I$.

Once we have the measurements, we want to predict the function value $f(x_*)$ at a specific trial point $x_*$. Equivalently, we can also predict the function values $f_* = f(X_*)$ at a whole set of trial points $X_*$. To accomplish this using GP regression, we assume that $f_*$ and $f_m$ have a prior joint Gaussian distribution given by
\[
\begin{bmatrix} f_m^0 \\ f_*^0 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(X_m) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} k(X_m, X_m) & k(X_m, X_*) \\ k(X_*, X_m) & k(X_*, X_*) \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} m_m \\ m_* \end{bmatrix}, \begin{bmatrix} K_{mm} & K_{m*} \\ K_{*m} & K_{**} \end{bmatrix} \right), \tag{1}
\]

where in the second part of the equation we have introduced another shorthand notation. Note here that $m(x)$ is the prior mean function for the Gaussian process and $k(x, x')$ is the prior covariance function. The superscript $0$ in $f_m^0$ and $f_*^0$ also denotes that we are referring to the prior distribution: no measurements have been taken into account yet. In this paper we make no assumptions on the prior functions, but our examples will apply a zero mean function $m(x) = 0$ and a squared exponential covariance function
\[
k(x, x') = \alpha^2 \exp\left( -\tfrac{1}{2} (x - x')^T \Lambda^{-1} (x - x') \right), \tag{2}
\]

with $\alpha$ a characteristic output length scale and $\Lambda$ a diagonal matrix of characteristic squared input length scales. For now we assume that these hyperparameters are known, but in Section 4.3 we look at ways to tune them.
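
As an illustration, the squared exponential covariance function (2) can be evaluated for whole sets of input points at once. The following is a minimal sketch in Python/NumPy; the function name and the array layout (one input point per row) are our own choices and not part of the paper.

```python
import numpy as np

def squared_exponential(X1, X2, alpha, Lambda):
    """Squared exponential covariance (2) between two sets of input points.

    X1: (n1, d) array, X2: (n2, d) array, alpha: output length scale,
    Lambda: (d,) vector with the squared input length scales (the diagonal of Lambda).
    Returns the (n1, n2) covariance matrix.
    """
    diff = X1[:, None, :] - X2[None, :, :]        # (n1, n2, d) pairwise differences
    sq = np.sum(diff**2 / Lambda, axis=-1)        # (x - x')^T Lambda^{-1} (x - x')
    return alpha**2 * np.exp(-0.5 * sq)

# Example: covariance matrix of five random 2D points with themselves
X = np.random.randn(5, 2)
K = squared_exponential(X, X, alpha=1.0, Lambda=np.array([1.0, 1.0]))
```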

From (1) we can find the posterior distribution of both $f_m$ and $f_*$ given $y$. This can be derived in a variety of ways, but they all result in the same expressions, being
\[
\begin{bmatrix} f_m^n \\ f_*^n \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_m^n \\ \mu_*^n \end{bmatrix}, \begin{bmatrix} \Sigma_{mm}^n & \Sigma_{m*}^n \\ \Sigma_{*m}^n & \Sigma_{**}^n \end{bmatrix} \right),
\]
\[
\begin{bmatrix} \mu_m^n \\ \mu_*^n \end{bmatrix} = \begin{bmatrix} \left(K_{mm}^{-1} + \Sigma_n^{-1}\right)^{-1} \left(K_{mm}^{-1} m_m + \Sigma_n^{-1} y\right) \\ m_* + K_{*m} \left(K_{mm} + \Sigma_n\right)^{-1} (y - m_m) \end{bmatrix},
\]
\[
\begin{bmatrix} \Sigma_{mm}^n & \Sigma_{m*}^n \\ \Sigma_{*m}^n & \Sigma_{**}^n \end{bmatrix} = \begin{bmatrix} \left(K_{mm}^{-1} + \Sigma_n^{-1}\right)^{-1} & \Sigma_n \left(K_{mm} + \Sigma_n\right)^{-1} K_{m*} \\ K_{*m} \left(K_{mm} + \Sigma_n\right)^{-1} \Sigma_n & K_{**} - K_{*m} \left(K_{mm} + \Sigma_n\right)^{-1} K_{m*} \end{bmatrix}. \tag{3}
\]

Note here that while we use $m$ and $K$ to denote properties of prior distributions, we use $\mu$ and $\Sigma$ for posterior distributions. The superscript $n$ indicates these are posteriors taking $n$ measurements into account.
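
To make the use of (3) concrete, the sketch below computes the posterior mean and covariance at the trial points (the second rows of (3)) in Python/NumPy. The kernel and mean callables stand for the prior covariance and mean functions $k$ and $m$; they are assumptions of this sketch rather than code from the paper.

```python
import numpy as np

def gp_posterior(Xm, y, Xs, kernel, mean, sigma_n):
    """Posterior of f_* at trial points Xs given noisy measurements (Xm, y), as in (3)."""
    n = len(Xm)
    Kmm = kernel(Xm, Xm) + sigma_n**2 * np.eye(n)      # K_mm + Sigma_n
    Ksm = kernel(Xs, Xm)                               # K_*m
    Kss = kernel(Xs, Xs)                               # K_**
    alpha = np.linalg.solve(Kmm, y - mean(Xm))         # (K_mm + Sigma_n)^{-1} (y - m_m)
    mu_s = mean(Xs) + Ksm @ alpha                      # posterior mean mu_*^n
    Sigma_s = Kss - Ksm @ np.linalg.solve(Kmm, Ksm.T)  # posterior covariance Sigma_**^n
    return mu_s, Sigma_s
```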

2.2 Sparse Gaussian process regression

To reduce the computational complexity of predicting the function value $f_*$ at trial points $X_*$, we can also use sparse GP regression. The first step in this process is to introduce a set of so-called inducing input points $X_u$. We choose these points ourselves, either evenly distributed over the input space, or with some more inducing input points in regions we are specifically interested in. We then first predict the distribution of $f_u$, identically to (3), through
\[
\begin{bmatrix} f_m^n \\ f_u^n \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_m^n \\ \mu_u^n \end{bmatrix}, \begin{bmatrix} \Sigma_{mm}^n & \Sigma_{mu}^n \\ \Sigma_{um}^n & \Sigma_{uu}^n \end{bmatrix} \right),
\]
\[
\begin{bmatrix} \mu_m^n \\ \mu_u^n \end{bmatrix} = \begin{bmatrix} \left(K_{mm}^{-1} + \Sigma_n^{-1}\right)^{-1} \left(K_{mm}^{-1} m_m + \Sigma_n^{-1} y\right) \\ m_u + K_{um} \left(K_{mm} + \Sigma_n\right)^{-1} (y - m_m) \end{bmatrix},
\]
\[
\begin{bmatrix} \Sigma_{mm}^n & \Sigma_{mu}^n \\ \Sigma_{um}^n & \Sigma_{uu}^n \end{bmatrix} = \begin{bmatrix} \left(K_{mm}^{-1} + \Sigma_n^{-1}\right)^{-1} & \Sigma_n \left(K_{mm} + \Sigma_n\right)^{-1} K_{mu} \\ K_{um} \left(K_{mm} + \Sigma_n\right)^{-1} \Sigma_n & K_{uu} - K_{um} \left(K_{mm} + \Sigma_n\right)^{-1} K_{mu} \end{bmatrix}. \tag{4}
\]

In this expression, it is usually only the distribution $f_u^n \sim \mathcal{N}(\mu_u^n, \Sigma_{uu}^n)$ that is useful. This is because, after we know the distribution of $f_u$, we throw away all measurement data and only use this distribution to predict $f_*$. (Mathematically, this comes down to assuming that $f_m$ and $f_*$ are conditionally independent, given $f_u$.) Taking care not to use prior data twice, we then find that
\[
\begin{bmatrix} f_u^n \\ f_*^n \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_u^n \\ \mu_*^n \end{bmatrix}, \begin{bmatrix} \Sigma_{uu}^n & \Sigma_{u*}^n \\ \Sigma_{*u}^n & \Sigma_{**}^n \end{bmatrix} \right),
\]
\[
\begin{bmatrix} \mu_u^n \\ \mu_*^n \end{bmatrix} = \begin{bmatrix} \mu_u^n \\ m_* + K_{*u} K_{uu}^{-1} \left(\mu_u^n - m_u\right) \end{bmatrix},
\]
\[
\begin{bmatrix} \Sigma_{uu}^n & \Sigma_{u*}^n \\ \Sigma_{*u}^n & \Sigma_{**}^n \end{bmatrix} = \begin{bmatrix} \Sigma_{uu}^n & \Sigma_{uu}^n K_{uu}^{-1} K_{u*} \\ K_{*u} K_{uu}^{-1} \Sigma_{uu}^n & K_{**} - K_{*u} K_{uu}^{-1} \left(K_{uu} - \Sigma_{uu}^n\right) K_{uu}^{-1} K_{u*} \end{bmatrix}. \tag{5}
\]

Given that the number of inducing input points $n_u$ is much smaller than the number of measurements $n$, this will result in a fast prediction for $f_*$. However, predicting $f_u$ is still slow. One way of making it faster is by first predicting (finding the distribution of) $f_u$, only based on the first measurement $(x_{m_1}, y_1)$. Then we do the same, only based on the second measurement, and so on. Finally, we merge all distributions of $f_u$ which we have found together, taking care to use prior data only once. (Mathematically, this comes down to assuming that $f_{m_i}$ and $f_{m_j}$ are conditionally independent, given $f_u$.) This will give us the FITC regression equations, in their (computationally) most efficient form
\[
\begin{bmatrix} f_m^n \\ f_u^n \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_m^n \\ \mu_u^n \end{bmatrix}, \begin{bmatrix} \Sigma_{mm}^n & \Sigma_{mu}^n \\ \Sigma_{um}^n & \Sigma_{uu}^n \end{bmatrix} \right),
\]
\[
\begin{bmatrix} \mu_m^n \\ \mu_u^n \end{bmatrix} = \begin{bmatrix} m_m + \Sigma_{mm}^n \Sigma_n^{-1} (y - m_m) \\ m_u + \Sigma_{um}^n \left(\Lambda_{mm}^{-1} + \Sigma_n^{-1}\right) (y - m_m) \end{bmatrix},
\]
\[
\begin{aligned}
\Sigma_{mm}^n &= \left(\Lambda_{mm}^{-1} + \Sigma_n^{-1}\right)^{-1} + \Sigma_n \left(\Lambda_{mm} + \Sigma_n\right)^{-1} K_{mu} \Delta^{-1} K_{um} \left(\Lambda_{mm} + \Sigma_n\right)^{-1} \Sigma_n, \\
\Sigma_{mu}^n &= \Sigma_n \left(\Lambda_{mm} + \Sigma_n\right)^{-1} K_{mu} \Delta^{-1} K_{uu}, \\
\Sigma_{um}^n &= K_{uu} \Delta^{-1} K_{um} \left(\Lambda_{mm} + \Sigma_n\right)^{-1} \Sigma_n, \\
\Sigma_{uu}^n &= K_{uu} \Delta^{-1} K_{uu}.
\end{aligned} \tag{6}
\]
Here we have used the shorthand notation $\Delta = K_{uu} + K_{um} \left(\Lambda_{mm} + \Sigma_n\right)^{-1} K_{mu}$ and $\Lambda_{mm} = \mathrm{diag}\left(K_{mm} - K_{mu} K_{uu}^{-1} K_{um}\right)$, with $\mathrm{diag}$ being the function which sets all non-diagonal elements of the given matrix to zero. It is important to note that the above set of expressions does not equal (4), but is merely a computationally efficient approximation.
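
As a concrete illustration of (6), the sketch below computes the FITC distribution of the inducing outputs $f_u$ in Python/NumPy. It follows the expressions for $\mu_u^n$ and $\Sigma_{uu}^n$ above literally; the kernel and mean callables, and the use of plain linear solves rather than Cholesky factorizations, are simplifications of this sketch, not of the paper.

```python
import numpy as np

def fitc_inducing_posterior(Xm, y, Xu, kernel, mean, sigma_n):
    """FITC distribution N(mu_u, Sigma_uu) of the inducing outputs f_u, following (6)."""
    Kmm = kernel(Xm, Xm)
    Kmu = kernel(Xm, Xu)
    Kuu = kernel(Xu, Xu)
    # Lambda_mm = diag(K_mm - K_mu K_uu^{-1} K_um), stored as a vector of its diagonal
    lam = np.diag(Kmm - Kmu @ np.linalg.solve(Kuu, Kmu.T))
    ls = lam + sigma_n**2                              # diagonal of Lambda_mm + Sigma_n
    Delta = Kuu + Kmu.T @ (Kmu / ls[:, None])          # K_uu + K_um (Lambda_mm + Sigma_n)^{-1} K_mu
    Sigma_uu = Kuu @ np.linalg.solve(Delta, Kuu)       # K_uu Delta^{-1} K_uu
    # Sigma_um = K_uu Delta^{-1} K_um (Lambda_mm + Sigma_n)^{-1} Sigma_n
    Sigma_um = Kuu @ np.linalg.solve(Delta, (Kmu * (sigma_n**2 / ls)[:, None]).T)
    # mu_u = m_u + Sigma_um (Lambda_mm^{-1} + Sigma_n^{-1}) (y - m_m)
    mu_u = mean(Xu) + Sigma_um @ ((1.0 / lam + 1.0 / sigma_n**2) * (y - mean(Xm)))
    return mu_u, Sigma_uu
```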


2.3 Sparse online Gaussian process regression

The methods developed so far can also be implemented in an online fashion. Suppose we get a new measurement $(x_{m_{n+1}}, y_{n+1})$, whose notation we will shorten to $(x_{m_+}, y_+)$. To incorporate it, we first predict the posterior distribution of $f_+ = f(x_{m_+})$ using only the first $n$ measurements. This results in
\[
\begin{bmatrix} f_u^n \\ f_+^n \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_u^n \\ \mu_+^n \end{bmatrix}, \begin{bmatrix} \Sigma_{uu}^n & \Sigma_{u+}^n \\ \Sigma_{+u}^n & \Sigma_{++}^n \end{bmatrix} \right),
\]
\[
\begin{bmatrix} \mu_u^n \\ \mu_+^n \end{bmatrix} = \begin{bmatrix} \mu_u^n \\ m_+ + K_{+u} K_{uu}^{-1} \left(\mu_u^n - m_u\right) \end{bmatrix},
\]
\[
\begin{bmatrix} \Sigma_{uu}^n & \Sigma_{u+}^n \\ \Sigma_{+u}^n & \Sigma_{++}^n \end{bmatrix} = \begin{bmatrix} \Sigma_{uu}^n & \Sigma_{uu}^n K_{uu}^{-1} K_{u+} \\ K_{+u} K_{uu}^{-1} \Sigma_{uu}^n & K_{++} - K_{+u} K_{uu}^{-1} \left(K_{uu} - \Sigma_{uu}^n\right) K_{uu}^{-1} K_{u+} \end{bmatrix}. \tag{7}
\]

Then we do the same using only our new measurement (using (4)). If we merge the results together in the proper way, we end up with
\[
\begin{bmatrix} f_u^{n+1} \\ f_+^{n+1} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_u^{n+1} \\ \mu_+^{n+1} \end{bmatrix}, \begin{bmatrix} \Sigma_{uu}^{n+1} & \Sigma_{u+}^{n+1} \\ \Sigma_{+u}^{n+1} & \Sigma_{++}^{n+1} \end{bmatrix} \right),
\]
\[
\begin{bmatrix} \mu_u^{n+1} \\ \mu_+^{n+1} \end{bmatrix} = \begin{bmatrix} \mu_u^n + \Sigma_{u+}^n \left(\Sigma_{++}^n + \sigma_n^2\right)^{-1} \left(y_+ - \mu_+^n\right) \\ \sigma_n^2 \left(\Sigma_{++}^n + \sigma_n^2\right)^{-1} \mu_+^n + \Sigma_{++}^n \left(\Sigma_{++}^n + \sigma_n^2\right)^{-1} y_+ \end{bmatrix},
\]
\[
\begin{bmatrix} \Sigma_{uu}^{n+1} & \Sigma_{u+}^{n+1} \\ \Sigma_{+u}^{n+1} & \Sigma_{++}^{n+1} \end{bmatrix} = \begin{bmatrix} \Sigma_{uu}^n - \Sigma_{u+}^n \left(\Sigma_{++}^n + \sigma_n^2\right)^{-1} \Sigma_{+u}^n & \Sigma_{u+}^n \left(\Sigma_{++}^n + \sigma_n^2\right)^{-1} \sigma_n^2 \\ \sigma_n^2 \left(\Sigma_{++}^n + \sigma_n^2\right)^{-1} \Sigma_{+u}^n & \sigma_n^2 \left(\Sigma_{++}^n + \sigma_n^2\right)^{-1} \Sigma_{++}^n \end{bmatrix}. \tag{8}
\]

This expression (or at least the part relating to $f_u$) is the update law for the FITC algorithm. It tells us exactly how the distribution of $f_u^{n+1}$ (both $\mu_u^{n+1}$ and $\Sigma_{uu}^{n+1}$) depends on $x_+$ and $y_+$. And with this distribution, we can always predict new trial points $f_*$ in an efficient way through (5).
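
The update law (7)-(8) is cheap to apply, since it only involves the $n_u$ inducing input points. The sketch below, in Python/NumPy, processes one new measurement; the kernel and mean callables are assumed to be given, and the new input point is treated as a single deterministic point (handling noisy inputs is what Section 3 adds).

```python
import numpy as np

def fitc_online_update(mu_u, Sigma_uu, Xu, x_plus, y_plus, kernel, mean, sigma_n):
    """Incorporate one measurement (x_plus, y_plus) into N(mu_u, Sigma_uu), as in (7)-(8)."""
    x_plus = np.atleast_2d(x_plus)
    Kuu = kernel(Xu, Xu)
    a = np.linalg.solve(Kuu, kernel(Xu, x_plus))             # K_uu^{-1} K_{u+}
    # Prior prediction at the new input point, equation (7)
    mu_p = float(mean(x_plus) + a.T @ (mu_u - mean(Xu)))     # mu_+^n
    Sigma_up = (Sigma_uu @ a).ravel()                        # Sigma_{u+}^n
    Sigma_pp = float(kernel(x_plus, x_plus) - a.T @ (Kuu - Sigma_uu) @ a)  # Sigma_{++}^n
    # Update of the inducing point distribution, equation (8)
    gain = Sigma_up / (Sigma_pp + sigma_n**2)
    mu_u_new = mu_u + gain * (y_plus - mu_p)
    Sigma_uu_new = Sigma_uu - np.outer(gain, Sigma_up)
    return mu_u_new, Sigma_uu_new
```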

3. Expanding the algorithm for stochastic measurement points

In this section we will expand on the existing theory to enable the online FITC algorithm to handle stochastic measurement points. Though we make use of many existing ideas, the combination of these ideas in this form has not been done before.

Consider the case where we know the distribution $f_u^n \sim \mathcal{N}\left(\mu_u^n, \Sigma_{uu}^n\right)$. (Initially we have $f_u^0 \sim \mathcal{N}\left(\mu_u^0, \Sigma_{uu}^0\right) = \mathcal{N}(m_u, K_{uu})$.) We now obtain a new measurement at some unknown input point $x_+$. Also the true function output $f_+ = f(x_+)$ is unknown. Our measurements do give us values $\hat{x}_+$ and $\hat{f}_+$ approximating these. (Note that $\hat{f}_+$ and $y_+$ are exactly the same, but for the sake of uniform notation, we have renamed $y_+$ to $\hat{f}_+$.) This measurement basically tells us that $x_+ \sim \mathcal{N}(\hat{x}_+, \Sigma_{+x})$ and $f_+ \sim \mathcal{N}(\hat{f}_+, \Sigma_{+f})$. (Here $\Sigma_{+f}$ and $\sigma_n^2$ are the same.) We assume that the noise on the input and the output is independent, and hence that $x_+$ and $f_+$ are a priori not correlated.

The question now arises how we should calculate $p(f_u^{n+1})$. Given $x_+$, we can calculate the distribution of $f_u^{n+1}$ using (8). This equation hence tells us $p(f_u^{n+1} | x_+, \hat{f}_+, f_u^n)$. It now follows that
\[
p(f_u^{n+1} | \hat{x}_+, \hat{f}_+, f_u^n) = \int_X p(f_u^{n+1} | x_+, \hat{x}_+, \hat{f}_+, f_u^n) \, p(x_+ | \hat{x}_+, \hat{f}_+, f_u^n) \, dx_+, \tag{9}
\]

where the first probability $p(f_u^{n+1} | x_+, \hat{x}_+, \hat{f}_+, f_u^n)$ follows from (8). The second probability $p(x_+ | \hat{x}_+, \hat{f}_+, f_u^n)$ is the posterior distribution of $x_+$, given both $f_u^n$ and the new measurement. We will find this latter probability first in Section 3.1 and show how to implement it in Section 3.2. We then merge all ideas together into the resulting SONIG algorithm in Section 3.3.

3.1 The posterior distribution of the measurement point

The posterior distribution of $x_+$ can be found through Bayes' theorem,
\[
p(x_+ | \hat{f}_+, \hat{x}_+, f_u^n) = \frac{p(\hat{f}_+ | x_+, \hat{x}_+, f_u^n) \, p(x_+ | \hat{x}_+, f_u^n)}{p(\hat{f}_+ | \hat{x}_+, f_u^n)}. \tag{10}
\]
Here $p(x_+ | \hat{x}_+, f_u^n) = p(x_+ | \hat{x}_+) = \mathcal{N}(\hat{x}_+, \Sigma_{+x})$ and $p(\hat{f}_+ | \hat{x}_+, f_u^n)$ equals an unknown constant (i.e., not depending on $x_+$). Additionally,
\[
p(\hat{f}_+ | x_+, \hat{x}_+, f_u^n) = p(\hat{f}_+ | x_+, f_u^n) = \mathcal{N}\left(\mu_+^n, \Sigma_{++}^n + \Sigma_{+f}\right), \tag{11}
\]
where $\mu_+^n$ and $\Sigma_{++}^n$ follow from (7). Note that both these quantities depend on $x_+$ in a non-linear way. Because of this, the resulting probability $p(x_+ | \hat{f}_+, \hat{x}_+, f_u^n)$ will be non-Gaussian. To work around this problem, we have to make some simplifying assumptions. What we can do, similarly to Girard and Murray-Smith (2003), is to linearize the Gaussian process $f_+$ (which depends on $x_+$) around a point $\bar{x}_+$. That is, we assume that
\[
p(\hat{f}_+ | x_+, f_u^n) = \mathcal{N}\left( \mu_+^n(\bar{x}_+) + \frac{d\mu_+^n(\bar{x}_+)}{dx_+} (x_+ - \bar{x}_+), \; \Sigma_{++}^n(\bar{x}_+) + \Sigma_{+f} \right). \tag{12}
\]

(Note that we use the notation that a derivative of a scalar with respect to a vector is a row vector.) Hence, the mean varies linearly with $x_+$, but the covariance is constant everywhere. With this approximation, the solution for $p(x_+ | \hat{f}_+, \hat{x}_+, f_u^n)$ is Gaussian. It can be solved analytically, by manually working out all the matrix exponents, as
\[
p(x_+ | \hat{f}_+, \hat{x}_+, f_u^n) = \mathcal{N}\left(x_+^{n+1}, \Sigma_{+x}^{n+1}\right),
\]
\[
x_+^{n+1} = \hat{x}_+ + \Sigma_{+x}^{n+1} \left( \frac{d\mu_+^n(\bar{x}_+)}{dx_+} \right)^T \left( \Sigma_{++}^n(\bar{x}_+) + \Sigma_{+f} \right)^{-1} \left( \frac{d\mu_+^n(\bar{x}_+)}{dx_+} \left(\bar{x}_+ - \hat{x}_+\right) + \left( \hat{f}_+ - \mu_+^n(\bar{x}_+) \right) \right),
\]
\[
\Sigma_{+x}^{n+1} = \left( \left( \frac{d\mu_+^n(\bar{x}_+)}{dx_+} \right)^T \left( \Sigma_{++}^n(\bar{x}_+) + \Sigma_{+f} \right)^{-1} \left( \frac{d\mu_+^n(\bar{x}_+)}{dx_+} \right) + \Sigma_{+x}^{-1} \right)^{-1}. \tag{13}
\]
The remaining question is which point $\bar{x}_+$ we should linearize around. The above equation is easiest to apply when we choose $\bar{x}_+ = \hat{x}_+$, but when $\left(\hat{f}_+ - \mu_+^n(\bar{x}_+)\right)$ is large, this may result in inaccuracies due to the linearization. It is generally more accurate to choose $\bar{x}_+$ as $x_+^{n+1}$. However, $x_+^{n+1}$ is initially unknown, so it may be necessary to apply the above equation a few times, each time resetting $\bar{x}_+$ to the latest value of $x_+^{n+1}$ that was found, to find the most accurate posterior distribution of $x_+$.
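
To show how (13) can be iterated in practice, the sketch below (Python/NumPy) re-linearizes a few times as suggested above. The callables mu_plus, dmu_plus and Sigma_pp stand for the GP prediction mean, its derivative (a row vector) and its variance from (7), evaluated at a given input point; these callables, the fixed number of iterations and all names are assumptions of this sketch.

```python
import numpy as np

def posterior_input(x_hat, f_hat, Sigma_x, Sigma_f, mu_plus, dmu_plus, Sigma_pp, n_iter=3):
    """Posterior N(x_+^{n+1}, Sigma_{+x}^{n+1}) of the noisy input point, equation (13)."""
    x_bar = np.array(x_hat, dtype=float)       # start by linearizing around the prior mean
    for _ in range(n_iter):
        d = np.atleast_2d(dmu_plus(x_bar))     # d mu_+^n / dx_+ as a (1, d) row vector
        s = Sigma_pp(x_bar) + Sigma_f          # Sigma_{++}^n(x_bar) + Sigma_{+f}
        Sigma_post = np.linalg.inv(d.T @ d / s + np.linalg.inv(Sigma_x))
        innovation = d @ (x_bar - x_hat) + (f_hat - mu_plus(x_bar))
        x_post = x_hat + Sigma_post @ (d.T / s) @ np.atleast_1d(innovation)
        x_bar = x_post                         # re-linearize around the new posterior mean
    return x_post, Sigma_post
```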


3.2 Updating the inducing input points

We are now ready to derive a solution to the integral in (9). To do so, we again need various assumptions. Here we use less strict assumptions than in the above derivation: we only assume that $x_+$ is independent of $f_u$ and that $\Sigma_{+x}^2$ and higher powers of $\Sigma_{+x}$ are negligible. Both assumptions are reasonable to make for most applications with reasonably accurate measurements.

To solve (9), we will not linearize the GP $f_u(x_+)$, but instead approximate it by its full Taylor expansion around $x_+^{n+1}$. Element-wise, we write this as
\[
f_{u_i}^{n+1}(x_+) = f_{u_i}^{n+1}(x_+^{n+1}) + \frac{df_{u_i}^{n+1}(x_+^{n+1})}{dx_+} \left(x_+ - x_+^{n+1}\right) + \frac{1}{2} \left(x_+ - x_+^{n+1}\right)^T \frac{d^2 f_{u_i}^{n+1}(x_+^{n+1})}{dx_+^2} \left(x_+ - x_+^{n+1}\right) + \ldots. \tag{14}
\]

Within this approximation, Girard and Murray-Smith (2003) made the further assumption that higher order derivatives like $\frac{d^2 f_u^n}{dx_+^2}$ were negligible. We do not make this assumption, but our assumption that higher powers of $\Sigma_{+x}$ are negligible causes these higher order derivatives to drop out anyway. As an added bonus, this latter assumption is easier to justify.

It is important to realize here that not only $x_+$ but also $f_u^{n+1}(x_+^{n+1})$ and its derivatives are Gaussian random variables. (The derivative of a Gaussian process is also a Gaussian process.) Hence, in order to find the expected value of $f_u^{n+1}$, we have to take the expectation with respect to both the function values $f_u^{n+1}$ and the input point $x_+$. If we look at the equation element-wise, the result will be

\[
\begin{aligned}
\mathbb{E}_{f_u, x_+}\left[f_u^{n+1}\right]_i &= \mathbb{E}_{f_u}\left[f_{u_i}^{n+1}(x_+^{n+1})\right] + \mathbb{E}_{f_u}\left[\frac{df_{u_i}^{n+1}(x_+^{n+1})}{dx_+}\right] \mathbb{E}_{x_+}\left[x_+ - x_+^{n+1}\right] \\
&\quad + \frac{1}{2} \mathbb{E}_{x_+}\left[\left(x_+ - x_+^{n+1}\right)^T \mathbb{E}_{f_u}\left[\frac{d^2 f_{u_i}^{n+1}(x_+^{n+1})}{dx_+^2}\right] \left(x_+ - x_+^{n+1}\right)\right] + \ldots \\
&= \mathbb{E}_{f_u}\left[f_{u_i}^{n+1}(x_+^{n+1})\right] + \frac{1}{2} \mathrm{tr}\left( \frac{d^2 \mathbb{E}_{f_u}\left[f_{u_i}^{n+1}(x_+^{n+1})\right]}{dx_+^2} \Sigma_{+x}^{n+1} \right) \\
&= \mu_{u_i}^{n+1}(x_+^{n+1}) + \frac{1}{2} \mathrm{tr}\left( \frac{d^2 \mu_{u_i}^{n+1}(x_+^{n+1})}{dx_+^2} \Sigma_{+x}^{n+1} \right).
\end{aligned} \tag{15}
\]

Note that, because of our assumption that $\Sigma_{+x}^2$ and higher powers of $\Sigma_{+x}$ are negligible, all further terms drop out. Additionally, we have used our assumption that $x_+$ and $f_u$ are independent. We have also used the fact that both the derivative and the expectation operators are linear operators, allowing us to interchange their order of application.

We can find the covariance matrix of $f_u^{n+1}$ with a similar but much more lengthy procedure. For that reason, we have put the derivation in Appendix A and only present the final result here element-wise as
\[
\mathbb{V}_{f_u, x_+}\left[f_u^{n+1}\right]_{i,j} = \Sigma_{u_i u_j}^{n+1}(x_+^{n+1}) + \left(\frac{d\mu_{u_j}^{n+1}(x_+^{n+1})}{dx_+}\right) \Sigma_{+x}^{n+1} \left(\frac{d\mu_{u_i}^{n+1}(x_+^{n+1})}{dx_+}\right)^T + \frac{1}{2} \mathrm{tr}\left( \left(\frac{d^2 \Sigma_{u_i u_j}^{n+1}(x_+^{n+1})}{dx_+^2}\right) \Sigma_{+x}^{n+1} \right). \tag{16}
\]

Finding all the necessary derivatives for the above equations can be a daunting task, especially for non-scalar inputs $x_+$, but the mathematics are relatively straightforward, so for this we refer to Appendix B.
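
For intuition, the sketch below applies (15) and (16) in the simplest setting of a scalar input, where the derivatives are vectors or matrices indexed by the inducing points and the traces reduce to products. The callables return the posterior mean vector of $f_u$, its first and second derivatives with respect to $x_+$, the posterior covariance and its second derivative (for instance obtained from the expressions referred to in Appendix B); these callables and the restriction to a scalar input are assumptions of this sketch.

```python
import numpy as np

def sonig_update_scalar_input(x_post, s_x, mu_u, dmu_u, d2mu_u, Sigma_uu, d2Sigma_uu):
    """SONIG update (15)-(16) of the inducing point distribution for a scalar input x_+.

    x_post and s_x are the posterior mean and variance of x_+ from (13)."""
    mu = mu_u(x_post)                      # mu_u^{n+1}(x_+^{n+1}), one entry per inducing point
    dmu = dmu_u(x_post)                    # first derivative with respect to x_+
    d2mu = d2mu_u(x_post)                  # second derivative with respect to x_+
    mean = mu + 0.5 * d2mu * s_x                                   # equation (15)
    cov = (Sigma_uu(x_post)                                        # equation (16)
           + s_x * np.outer(dmu, dmu)
           + 0.5 * s_x * d2Sigma_uu(x_post))
    return mean, cov
```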

It is interesting to compare the expressions we found with those used by McHutchon and Rasmussen (2011) in their NIGP algorithm. If we do, we find that for both (15) and (16), McHutchon and Rasmussen (2011) did not include the last term of the equation in their paper. That is, they did not take second derivatives into account. In their simulation code, which is available online, they did include the last term of (16) and found (correctly) that its effect was not significant. However, they still did not include the last term of (15), and in the test results section (see Section 5.1) we will find that this term actually is significant. As such, (15) and (16) also serve as an improvement with respect to the NIGP algorithm.

3.3 The SONIG algorithm

Using all the ideas developed so far, we can set up an online regression algorithm. The resulting Sparse Online Noisy Input GP (SONIG) algorithm is presented in Algorithm 1. This algorithm is computationally efficient, in the sense that a single updating step (incorporating one measurement) can be done in constant runtime with respect to the number of measurements already processed. The runtime does depend on the number of inducing input points through $O(n_u^3)$. Hence, increasing the number of inducing input points does slow down the algorithm. Generally, $n_u$ stays below a hundred though, preventing this from becoming a significant issue.

Input:
A possibly expanding set of measurements $(x_{m_1}, y_1), \ldots, (x_{m_n}, y_n)$ in which both $x_m$ and $y$ are distorted by Gaussian white noise.

Preparation:
Either choose the hyperparameters based on expert knowledge, or apply the NIGP hyperparameter tuning methods of McHutchon and Rasmussen (2011) on a subset of the data (a few hundred points) to find the hyperparameters. Optionally, apply the NIGP regression methods on this subset of data to obtain an initial distribution of $f_u$. Otherwise, initialize $f_u$ as $\mathcal{N}(m_u, K_{uu})$.

Updating:
while there are unprocessed measurements $(x_{m_{n+1}}, y_{n+1})$ do
1. Apply (13) to find the posterior distribution of the measurement point $x_{m_{n+1}}$ (written as $x_+$).
2. Use (15) and (16) to update the distribution of $f_u$.
3. Optionally, use (17) and (18) to calculate the posterior distribution of the function value $f(x_+)$ (written as $f_+$).
end

Prediction:
Apply (5) to find the distribution $f_*$ for any set of deterministic trial points. For stochastic trial points, use the expansion from Section 4.5.

Algorithm 1: The Sparse Online Noisy Input GP (SONIG) algorithm: an online version of the FITC algorithm capable of dealing with stochastic (noisy) measurement points.

4. Extensions of the SONIG algorithm

In the previous section, we have presented the basic idea behind the SONIG algorithm. There are various further details that can be derived and implemented in the algorithm. For instance, the algorithm can deal with multi-output functions $f(x)$ (Section 4.1), it can give us the posterior distribution of the output $f_+$ as well as its correlation with the input $x_+$ (Section 4.2), we can implement hyperparameter tuning (Section 4.3), we can add inducing input points online (Section 4.4) and we can make predictions $f_*$ using stochastic trial points $x_*$ (Section 4.5).

4.1 Multiple outputs

So far we have approximated functions $f(x)$ with only one output. It is also possible to approximate functions $f(x)$ with $d_y > 1$ outputs. The standard way in which this is done in GP regression algorithms (see for instance Deisenroth and Rasmussen (2011)) is by assuming that, given a deterministic input $x$, all outputs $f_1(x), \ldots, f_{d_y}(x)$ are independent. With this assumption, it is possible to keep a separate inducing input point distribution $f_u^i \sim \mathcal{N}\left(\mu_u^i, \Sigma_u^i\right)$ for each output $f_i(x)$. Hence, each output is basically treated separately.

When using stochastic input points, the outputs do become correlated (as noted by Deisenroth and Rasmussen (2011)). We now have two options. If we take this correlation into account, we have to keep track of the joint distribution of the vectors $f_u^1, \ldots, f_u^{d_y}$, effectively merging them into one big vector. This results in a vector of size $n_u d_y$, giving our algorithm a computational complexity of $O(n_u^3 d_y^3)$. Alternatively, we could also neglect the correlation between the inducing input point distributions $f_u^i$ caused by stochastic measurement points $x_+ \sim \mathcal{N}(\hat{x}_+, \Sigma_{+x})$. If we do, we can continue to treat each function output separately, giving our algorithm a runtime of $O(n_u^3 d_y)$. Because one of the focuses of this paper is to reduce the runtime of GP regression algorithms, we will apply the second option in this paper.

When each output is treated separately, each output also has its own hyperparameters. Naturally, the prior output covariance $\alpha_i^2$ and the output noise $\sigma_{n_i}^2$ can differ per output $f_i$, but also the input length scales $\Lambda_i$ may be chosen differently for each output. In fact, it is even possible to specify a fully separate covariance function $k_i(x, x')$ per output $f_i$, though in this paper we stick with the squared exponential covariance function.

Naturally, there are a few equations which we should adjust slightly in the case of multivariate outputs. In particular, in equation (13), the parameter $\Sigma_{++}^n(\bar{x}_+)$ would not be a scalar anymore, but become a matrix. And because of our assumption that the outputs are independent, it would be a diagonal matrix. Similarly, the derivative $d\mu_+^n/dx_+$ would not be a row vector anymore. Instead, it would turn into a matrix. With these adjustments, (13) still holds and all other equations can be applied as usual.

4.2 The posterior distribution of the measured output

In Section 3.2 we found the posterior distribution for $f_u$. For some applications (like the SISOG algorithm presented in Section 5.2) we also need the posterior distribution of the measured function value $f_+$, even though we do not exactly know to which input it corresponds. We can find this element-wise, using the same methods, through
\[
\begin{aligned}
\mathbb{E}[f_+]_i &= \mu_{+_i}^{n+1}(x_+^{n+1}) + \frac{1}{2} \mathrm{tr}\left( \frac{d^2 \mu_{+_i}^{n+1}(x_+^{n+1})}{dx_+^2} \Sigma_{+x}^{n+1} \right), \\
\mathbb{V}[f_+, f_+]_{i,j} &= \Sigma_{+_i +_j}^{n+1}(x_+^{n+1}) + \left(\frac{d\mu_{+_i}^{n+1}(x_+^{n+1})}{dx_+}\right) \Sigma_{+x}^{n+1} \left(\frac{d\mu_{+_j}^{n+1}(x_+^{n+1})}{dx_+}\right)^T \\
&\quad + \frac{1}{2} \mathrm{tr}\left( \left(\frac{d^2 \Sigma_{+_i +_j}^{n+1}(x_+^{n+1})}{dx_+^2}\right) \Sigma_{+x}^{n+1} \right).
\end{aligned} \tag{17}
\]

Note here that $\Sigma_{++}^{n+1}$ is (by assumption) a diagonal matrix, simplifying the above equation for non-diagonal terms. As such, the covariance between two different function outputs $f_{+_i}$ and $f_{+_j}$ only depends on the second term in the above expression.

With the above expressions, we can now calculate the covariance between each of the entries of $f_+$. However, it may occur that we also need to know the posterior covariance between the function value $f_+$ and the function input $x_+$. Using the same method, we can find that
\[
\mathbb{V}[f_+, x_+] = \left(\frac{d\mu_+^{n+1}(x_+^{n+1})}{dx_+}\right) \Sigma_{+x}^{n+1}. \tag{18}
\]

Now we know everything about the posterior distributions of $x_+$ and $f_+$. We can calculate their posterior covariances, after taking into account the data stored in the inducing input point distributions $f_u$. And all the time everything is approximated as Gaussian distributions, making all computations very efficient.

4.3 Applying hyperparameter tuning to the algorithm

So far we have assumed that the hyperparameters of the Gaussian process were known a priori. When this is not the case, they need to be tuned first. Naturally, this could be done using expert knowledge of the system, but can it also be done automatically?

McHutchon and Rasmussen (2011), with their NIGP method, offer an effective method of tuning the hyperparameters, which also tells us the input noise variance $\Sigma_{+x}$. However, this method has a computational complexity of $O(n^3)$, with $n$ still the number of measurements. As such, it can only be used for a small number of measurements and it cannot be used online. NIGP therefore seems not to be applicable to our problem on its own.

However, Chalupka et al. (2013) compare various GP regression algorithms, including methods to tune hyperparameters. One of their main conclusions is that the subset-of-data (SoD) method – simply using a limited number of measurement points to tune the hyperparameters – provides a good trade-off between computational complexity and prediction accuracy. As such, applying NIGP for hyperparameter tuning on a subset of the data – say, the first few hundred data points – seems to be a decent choice.

Chalupka et al. (2013) also conclude that, for the regression problem with known hyperparameters, the FITC algorithm provides a very good trade-off between complexity and accuracy. So after the SoD method has given us an estimate of the hyperparameters, it will be an effective choice to use the online FITC algorithm with stochastic measurement points (that is, the SONIG algorithm) as our regression method.

4.4 Adjusting the set of inducing input points online

When using an online GP regression algorithm, it is often not known in advance what kind of measured input points $x_m$ the system will get. As such, choosing the inducing input points $X_u$ in advance is not always possible, nor wise. Instead, we can adjust the inducing input points while the algorithm is running.

Suppose that we denote the current set of inducing input points by $X_u$, and that we want to add an extra set of inducing input points $X_{u_+}$. In this case, given all the data that we have, the distributions of $f_u$ and $f_{u_+}$ satisfy, identically to (5),
\[
\begin{bmatrix} f_u \\ f_{u_+} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_u \\ \mu_{u_+} \end{bmatrix}, \begin{bmatrix} \Sigma_{uu} & \Sigma_{u u_+} \\ \Sigma_{u_+ u} & \Sigma_{u_+ u_+} \end{bmatrix} \right),
\]
\[
\begin{bmatrix} \mu_u \\ \mu_{u_+} \end{bmatrix} = \begin{bmatrix} \mu_u \\ m_{u_+} + K_{u_+ u} K_{uu}^{-1} (\mu_u - m_u) \end{bmatrix},
\]
\[
\begin{bmatrix} \Sigma_{uu}^n & \Sigma_{u u_+}^n \\ \Sigma_{u_+ u}^n & \Sigma_{u_+ u_+}^n \end{bmatrix} = \begin{bmatrix} \Sigma_{uu}^n & \Sigma_{uu}^n K_{uu}^{-1} K_{u u_+} \\ K_{u_+ u} K_{uu}^{-1} \Sigma_{uu}^n & K_{u_+ u_+} - K_{u_+ u} K_{uu}^{-1} \left(K_{uu} - \Sigma_{uu}^n\right) K_{uu}^{-1} K_{u u_+} \end{bmatrix}. \tag{19}
\]

With this combined set of old and new inducing input points, we can then continue incorporating new measurements without losing any data whatsoever.

Additionally, it is also possible to remove unimportant inducing input points. In that case, the corresponding entries can simply be removed from $f_u$. And since it is possible to both add and delete inducing input points, it is naturally also possible to shift them around (first add new points, then remove old points) whenever deemed necessary.

The way in which inducing input points are usually added in the SONIG algorithm is as follows. Whenever we incorporate a new measurement with posterior input distribution $x_+ \sim \mathcal{N}\left(x_+^{n+1}, \Sigma_{+x}^{n+1}\right)$, we check if $x_+^{n+1}$ is already close to any existing inducing input point. To be more precise, we examine the normalized squared distance
\[
\left(x_+^{n+1} - x_{u_i}\right)^T \Lambda^{-1} \left(x_+^{n+1} - x_{u_i}\right) \tag{20}
\]
for each inducing input point $x_{u_i}$. If there is no inducing input point whose normalized squared distance is below a given threshold (often chosen to be roughly 1, but tuned not to get too few or too many points), then it means that there is no inducing input point $x_{u_i}$ close to our new measurement $x_+^{n+1}$. As a result, we add $x_+^{n+1}$ to our set of inducing input points. This guarantees that each measurement is close to at least one inducing input point, which always allows the data from the measurement to be taken into account.
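
A minimal sketch of this criterion in Python/NumPy is given below; the function name, the returned flag and the default threshold are our own choices. When a point is actually added, the distribution of $f_u$ must also be extended through (19) as a separate step.

```python
import numpy as np

def maybe_add_inducing_point(Xu, x_new, Lambda, threshold=1.0):
    """Add x_new to the inducing inputs if its normalized squared distance (20) to every
    existing inducing input point exceeds the threshold. Lambda is the vector of squared
    length scales (the diagonal of the matrix Lambda)."""
    if Xu.shape[0] == 0:
        return np.atleast_2d(x_new), True
    # (x_+^{n+1} - x_{u_i})^T Lambda^{-1} (x_+^{n+1} - x_{u_i}) for every inducing input point
    dist = np.sum((Xu - x_new)**2 / Lambda, axis=1)
    if np.min(dist) > threshold:
        return np.vstack([Xu, x_new]), True   # no existing inducing input point is close enough
    return Xu, False
```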


4.5 Predictions for stochastic trial points

Suppose we have a stochastic trial point $x_* \sim \mathcal{N}(\hat{x}_*, \Sigma_{*x})$ and we want to calculate the distribution of $f_* = f(x_*)$. How does that work?

For deterministic trial points $x_*$ we can use (5). For stochastic trial points, we need to integrate over this. This will not result in a Gaussian distribution, so once more we will apply moment matching. Previously, we had to make additional assumptions, to make sure that the mean vector and the covariance matrix could be solved for analytically. This time we do not have to. Deisenroth (2010) showed that, for the squared exponential covariance function, the mean vector and the covariance matrix can be calculated analytically. We can apply the same ideas in our present setting.

For our results, we will first define some helpful quantities. When doing so, we should note that in theory every output $f_k(x)$ can have its own covariance function $k_k(\cdot, \cdot)$, and as such its own hyperparameters $\alpha_k$ and $\Lambda_k$. (See Section 4.1.) Keeping this in mind, we now define the vectors $q^k$ and matrices $Q^{kl}$ element-wise as
\[
q_i^k = \int_X k_k(x_{u_i}, x_*) p(x_*) \, dx_* = \frac{\alpha_k^2}{\sqrt{|\Sigma_{*x}| \left|\Sigma_{*x}^{-1} + \Lambda_k^{-1}\right|}} \exp\left( -\frac{1}{2} (x_{u_i} - \hat{x}_*)^T \left(\Lambda_k + \Sigma_{*x}\right)^{-1} (x_{u_i} - \hat{x}_*) \right), \tag{21}
\]
\[
\begin{aligned}
Q_{ij}^{kl} &= \int_X k_k(x_{u_i}, x_*) k_l(x_*, x_{u_j}) p(x_*) \, dx_* \\
&= \frac{\alpha_k^2 \alpha_l^2}{\sqrt{|\Sigma_{*x}| \left|\Sigma_{*x}^{-1} + \Lambda_k^{-1} + \Lambda_l^{-1}\right|}} \exp\left( -\frac{1}{2} \left(x_{u_i} - x_{u_j}\right)^T \left(\Lambda_k + \Lambda_l\right)^{-1} \left(x_{u_i} - x_{u_j}\right) \right) \\
&\quad \cdot \exp\left( -\frac{1}{2} \left(x_{u_{ij}}^{kl} - \hat{x}_*\right)^T \left( \left(\Lambda_k^{-1} + \Lambda_l^{-1}\right)^{-1} + \Sigma_{*x} \right)^{-1} \left(x_{u_{ij}}^{kl} - \hat{x}_*\right) \right),
\end{aligned} \tag{22}
\]
where we have defined
\[
x_{u_{ij}}^{kl} = \left(\Lambda_k^{-1} + \Lambda_l^{-1}\right)^{-1} \left(\Lambda_k^{-1} x_{u_i} + \Lambda_l^{-1} x_{u_j}\right). \tag{23}
\]

With these quantities, we can find that
\[
\begin{aligned}
f_* &\sim \mathcal{N}(\mu_*, \Sigma_*), \\
[\mu_*]_k &= \left(q^k\right)^T \left(K_{uu}^k\right)^{-1} \mu_u^k, \\
[\Sigma_*]_{k,k} &= \alpha_k^2 - \mathrm{tr}\left( \left(K_{uu}^k\right)^{-1} \left(K_{uu}^k - \Sigma_u^k\right) \left(K_{uu}^k\right)^{-1} Q^{kk} \right) + \left(\mu_u^k\right)^T \left(K_{uu}^k\right)^{-1} Q^{kk} \left(K_{uu}^k\right)^{-1} \mu_u^k - [\mu_*]_k^2, \\
[\Sigma_*]_{k,l} &= \left(\mu_u^k\right)^T \left(K_{uu}^k\right)^{-1} Q^{kl} \left(K_{uu}^l\right)^{-1} \mu_u^l - [\mu_*]_k [\mu_*]_l,
\end{aligned} \tag{24}
\]
where the latter expression is for the non-diagonal terms of $\Sigma_*$ (with $k \neq l$). Note that the first line of the above expression is in fact an approximation. In reality the distribution of $f_*$ is not Gaussian. The other two lines, however, are the analytic mean vector and covariance matrix. With these quantities, we can accurately predict the distribution of the output $f_*$ for stochastic trial points $x_*$.
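
To illustrate (21)-(24), the sketch below computes the predictive mean and variance for a single output with a squared exponential covariance function and a zero prior mean, in Python/NumPy. The diagonal length scale matrix is passed as a vector lam, and the inducing point distribution (mu_u, Sigma_uu) is assumed to come from the SONIG updates; these conventions are ours, not the paper's.

```python
import numpy as np

def predict_stochastic_trial_point(x_hat, Sigma_x, Xu, mu_u, Sigma_uu, alpha, lam):
    """Mean and variance of f_* for a stochastic trial point x_* ~ N(x_hat, Sigma_x),
    following (21)-(24) for one output with a squared exponential covariance."""
    d = len(x_hat)
    Kuu = alpha**2 * np.exp(-0.5 * np.sum((Xu[:, None, :] - Xu[None, :, :])**2 / lam, axis=-1))
    # The q vector of equation (21)
    c1 = alpha**2 / np.sqrt(np.linalg.det(np.eye(d) + Sigma_x / lam))
    diff = Xu - x_hat
    q = c1 * np.exp(-0.5 * np.sum(diff * np.linalg.solve(np.diag(lam) + Sigma_x, diff.T).T, axis=1))
    # The Q matrix of equations (22)-(23), with k = l
    c2 = alpha**4 / np.sqrt(np.linalg.det(np.eye(d) + 2.0 * Sigma_x / lam))
    dij = Xu[:, None, :] - Xu[None, :, :]
    mij = 0.5 * (Xu[:, None, :] + Xu[None, :, :]) - x_hat     # x_u_ij minus x_hat
    A = np.linalg.inv(np.diag(lam / 2.0) + Sigma_x)
    Q = c2 * np.exp(-0.25 * np.sum(dij**2 / lam, axis=-1)
                    - 0.5 * np.einsum('ijk,kl,ijl->ij', mij, A, mij))
    # The moments of f_*, equation (24)
    Kinv_mu = np.linalg.solve(Kuu, mu_u)
    mu_star = q @ Kinv_mu
    M = np.linalg.solve(Kuu, Kuu - Sigma_uu)                  # K_uu^{-1} (K_uu - Sigma_u)
    var_star = (alpha**2 - np.trace(M @ np.linalg.solve(Kuu, Q))
                + Kinv_mu @ Q @ Kinv_mu - mu_star**2)
    return mu_star, var_star
```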

5. Experimental results

In this section we apply the developed algorithm to test problems and compare its performance to existing state-of-the-art solutions. First we apply the SONIG algorithm to a sample function, allowing us to compare its performance to other regression algorithms. The results of this are discussed in Section 5.1. Then we apply a variant of the SONIG algorithm to identify a magneto-rheological fluid damper, the outcome of which is reported in Section 5.2. All code for these examples, as well as for using the SONIG algorithm in general, is available on GitHub (see Bijl (2016)).

5.1 Evaluating the SONIG algorithm through a sample function

To compare the SONIG algorithm with other algorithms, we have set up a basic single-input single-output GP test. First, we randomly generate a sample function from a Gaussian process with a squared exponential covariance function (see (2)). This is done on the range $x \in [-5, 5]$, subject to the hyperparameters $\alpha = 1$ and $\Lambda = 1$. Subsequently, we take $n$ measurements at random points from the input range and distort both the input $x_m$ and the output $y$ with zero-mean Gaussian white noise with standard deviations $\sigma_x = 0.4$ and $\sigma_n = 0.1$, respectively. To this data set, we then apply the following algorithms.

(1) GP regression without any input noise and with the exact hyperparameters, given above. This serves as a reference case: all other algorithms get noisy input points and tuned hyperparameters.

(2) GP regression with input noise and with hyperparameters tuned through the maximum-likelihood method.

(3) The NIGP algorithm of McHutchon and Rasmussen (2011). This algorithm has its own method of tuning hyperparameters, including $\sigma_x$.

(4) The SONIG algorithm, starting with $\mu_u^0 = m_u$ and $\Sigma_{uu}^0 = K_{uu}$, using the hyperparameters given by (3). We use $n_u = 21$ evenly distributed inducing input points.

(5) The same as (4), but now with more measurements (800 instead of 200). Because the SONIG algorithm is computationally a lot more efficient than the NIGP algorithm, the runtime of this is similar to that of (3), being roughly 2-3 seconds when using Matlab, although this of course depends on the exact implementation of the algorithms.

(6) NIGP applied to a subset of data (100 measurements) to predict the distribution of the inducing input points, followed by the SONIG algorithm applied to the remainder (700) of the measurements, further updating the inducing input points. The runtime of this approach is again similar to that of (3), being 2-3 seconds.

(7) The FITC algorithm, using the hyperparameters of (2). This serves as a reference case.


Table 1: Comparison of various GP regression algorithms, applied to noisy measurements of 200 randomly generated sample functions. For details, see the main text.

Algorithm | Measurements | MSE | Mean variance | Ratio

(1) GPR with exact hyperparameters and no input noise | 200 | 0.74 · 10^-3 | 0.83 · 10^-3 | 0.89

(2) GPR with tuned hyperparameters | 200 | 24.0 · 10^-3 | 6.1 · 10^-3 | 3.9

(3) NIGP with its own hyperparameter tuning | 200 | 22.5 · 10^-3 | 4.9 · 10^-3 | 4.5

(4) SONIG using the hyperparameters of (3) | 200 | 14.4 · 10^-3 | 6.9 · 10^-3 | 2.1

(5) SONIG using the hyperparameters of (3) | 800 | 10.4 · 10^-3 | 2.1 · 10^-3 | 5.0

(6) NIGP on a subset, followed by SONIG on the rest | 100/700 | 10.3 · 10^-3 | 2.0 · 10^-3 | 5.1

(7) FITC, using the hyperparameters of (2) | 800 | 17.4 · 10^-3 | 1.6 · 10^-3 | 10.6

For all these algorithms, we examine both the Mean Squared Error (MSE) of the resulting prediction and the mean variance given by the regression algorithm. The latter is basically the estimate by the regression algorithm of the MSE. By comparing it with the real MSE, we learn about the integrity of the algorithm. As such, the ratio between these two is an indication of the algorithm's integrity.

To get fair results, we repeat this whole process 200 times, each time for a different randomly generated function. In a small number of these cases, for some algorithms, Matlab runs into numerical issues. For instance, it sometimes has problems properly inverting even a small covariance matrix. These issues are rare and unpredictable, but they significantly distort the results. To solve this, while keeping things fair, we discard the 10% worst results of each algorithm. The average of the remaining results is subsequently shown in Table 1.

There are many things that can be noticed from Table 1. First of all, for the given type of functions, and for an equal number of measurements, the SONIG algorithm performs better than the NIGP algorithm. This is surprising, because the SONIG algorithm can be seen as a computationally efficient approximation of the NIGP algorithm. Further experiments have shown that this is mostly because the SONIG algorithm takes into account the second derivative of the mean in its approximation (see (15)). The NIGP algorithm does not, and if SONIG also does not, the performance of the two algorithms is comparable.

A second thing that can be noticed is that more measurements provide a higher accuracy. In particular, even the FITC algorithm (which does not take input noise into account) with 800 measurements performs better than the NIGP algorithm with 200 measurements. It should be noted here that part of the reason is the type of function used: for functions with a steeper slope, it is expected that the NIGP algorithm still performs better than FITC.

Thirdly, it should be noted that the SONIG algorithms with and without an initial NIGP prediction of $f_u$ (algorithms (5) and (6)) have a similar performance. Hence, in the long run, it does not matter much whether or not we use the NIGP algorithm to make an initial prediction of the distribution of $f_u$.

Finally, it is interesting to note that all algorithms, with the exception of regular GP regression with the exact hyperparameters, are much more optimistic about their predictions than is reasonable. That is, the ratio between the MSE and the mean variance is much larger than the value of 1 which it should have. Ideally, the predicted variance of all algorithms would be significantly higher (by roughly a factor 4 to 5).

Figure 1: Predictions of the NIGP algorithm (left) and the SONIG algorithm (right) after n = 30 measurements. Exact conditions are described in the main text.

Next, we will look at some plots. To be precise, we will examine algorithms (3) and (4) more closely, but subject to only $n = 30$ measurements and $n_u = 11$ inducing input points. The predictions of the two algorithms, for a single random sample function, are shown in Figure 1.

The most important thing that can be noticed here is that (for both methods) the posterior standard deviation varies with the slope of the to-be-approximated function. When the input is near −2, and the function is nearly flat, the standard deviation is small (well below 0.1). However, when the input is near −3 or −1/2, the standard deviation is larger (near 0.2). This is what can be expected, because measurements in these steep regions are much more affected/distorted by the noise, and hence provide less information.

A second thing to be noticed is the difference between the two methods. Especially for x > 2, where there are relatively few measurements, the SONIG algorithm gives much higher variances. There are two reasons for this. The first is inherent to sparse algorithms. (The FITC algorithm would show a similar trend.) The second reason is inherent to the SONIG algorithm. Whereas regular GP regression (and similarly the NIGP algorithm) uses all measurement points together, the SONIG algorithm only uses data from previous measurement points while incorporating a new measurement point. As a result, when there are relatively few measurements in a certain region, and many of these measurements appear early in the updating process, the accuracy in that region can be expected to be slightly lower. However, as more measurement points are incorporated, which can be done very efficiently, the problem will quickly disappear.

5.2 Identifying the dynamics of a magneto-rheological fluid damper

We will now show how the SONIG algorithm can be used to learn a model of a nonlinear dynamical system based on measured data. To this end we will make use of a nonlinear autoregressive model with exogenous inputs (ARX) of the following form
\[
y_k = \phi(y_{k-1}, \ldots, y_{k-n_y}, u_{k-1}, \ldots, u_{k-n_u}), \tag{25}
\]
where $\phi(\cdot)$ denotes some nonlinear function of past inputs $u_{k-1}, \ldots, u_{k-n_u}$ and past outputs $y_{k-1}, \ldots, y_{k-n_y}$. To fit (25) into the SONIG model, let us define
\[
x_k = (y_{k-1}, \ldots, y_{k-n_y}, u_{k-1}, \ldots, u_{k-n_u}) \tag{26}
\]
as the input vector, implying that the output is given by $y_k = \phi(x_k)$. We can now make use of the SONIG algorithm to approximate $\phi$. While doing this, we keep track of the covariances between the respective outputs $y_k, y_{k-1}, \ldots$ and inputs $u_{k-1}, u_{k-2}, \ldots$. When done properly this results in what we call the System Identification using Sparse Online GP (SISOG) algorithm, which is described in Algorithm 2.
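
As a small illustration of (26), the helper below builds the regression input vector from the recorded input and output histories; the function and its argument names are our own, not part of the SISOG algorithm itself.

```python
import numpy as np

def narx_input(y_hist, u_hist, n_y, n_u, k):
    """Build x_k = (y_{k-1}, ..., y_{k-n_y}, u_{k-1}, ..., u_{k-n_u}) as in (26)."""
    past_y = [np.atleast_1d(y_hist[k - i]) for i in range(1, n_y + 1)]
    past_u = [np.atleast_1d(u_hist[k - i]) for i in range(1, n_u + 1)]
    return np.concatenate(past_y + past_u)

# Example: with n_y = 1 and n_u = 3 the regression vector for output y_{k+1} is
# narx_input(y_hist, u_hist, 1, 3, k + 1) = (y_k, u_k, u_{k-1}, u_{k-2}).
```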

We will apply this SISOG algorithm to model the dynamical behavior of a magneto-rheological fluid damper. The measured data for this example was provided by Wang et al. (2009) and supplied through The MathWorks Inc. (2015), which also discusses various system identification examples using the techniques from Ljung (1999). This example is a common benchmark in system identification applications. It has, for instance, been used more recently in the context of Gaussian Process State Space Models (GP-SSM) by Svensson et al. (2016) in their Reduced Rank GP-SSM (RR GP-SSM) algorithm.

This example has 3499 measurements provided, sampled every $\Delta t = 0.05$ seconds. We will use the first 2000 measurements (10 seconds) for training and the next 1499 measurements (7.5 seconds) for evaluation. The MathWorks Inc. (2015) recommended using one past output and three past inputs to predict subsequent outputs. Based on this, we decided to learn a black-box model of the following functional form
\[
y_{k+1} = \phi(y_k, u_k, u_{k-1}, u_{k-2}). \tag{27}
\]
Hyperparameters were initially tuned by passing a subset of the data to the NIGP algorithm, but were subsequently improved through manual tuning. They were set to
\[
\Lambda = \mathrm{diag}\left(70^2, 20^2, 10^2, 10^2\right), \quad \alpha^2 = 70^2, \quad \Sigma_{+x} = \mathrm{diag}\left(2^2, 0.1^2, 0.1^2, 0.1^2\right), \quad \Sigma_{+f} = 2^2. \tag{28}
\]

After processing a measurement $y_{k+1}$, the SONIG algorithm provided us with a posterior distribution of $y_{k+1}$, $y_k$, $u_k$, $u_{k-1}$ and $u_{k-2}$. The marginal posterior distribution of $y_{k+1}$, $u_k$ and $u_{k-1}$ was then used as prior distribution while incorporating the next measurement. Inducing input points were added online, as specified in Section 4.4, which eventually gave us 32 inducing input points. This is a low number, and as a result, the whole training was done in only a few (roughly 10) seconds.

After all training measurements had been used, the SISOG algorithm was given the input data for the remaining 1499 measurements, but not the output data. It had to predict this output data by itself, using each prediction $y_k$ to predict the subsequent $y_{k+1}$. While doing so, the SISOG algorithm also calculated the variance of each prediction $y_k$ to take this into account while predicting the next output using the techniques from Section 4.5. The resulting predictions can be seen in Figure 2.
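
The evaluation just described is a free-run simulation: each predicted output is fed back as a regressor for the next prediction. The sketch below shows the basic loop in Python/NumPy for the model structure with one past output and three past inputs; for simplicity it only feeds back the predicted mean, whereas the SISOG algorithm also propagates the prediction variance through the techniques of Section 4.5. The predict callable and all names are assumptions of this sketch.

```python
import numpy as np

def free_run_simulation(u, y_start, predict, k_start, n_steps):
    """Simulate the identified model forward, feeding each predicted output back in.

    predict(x) returns (mean, variance) of the GP for an input x = (y_k, u_k, u_{k-1}, u_{k-2});
    y_start is the last known output and k_start the corresponding time index."""
    y_pred = [y_start]
    for k in range(k_start, k_start + n_steps):
        x = np.array([y_pred[-1], u[k], u[k - 1], u[k - 2]])
        mean, _ = predict(x)
        y_pred.append(mean)        # only the mean is fed back in this simplified sketch
    return np.array(y_pred[1:])
```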


Input:
A set of inputs $u_1, u_2, \ldots$ and outputs $y_1, y_2, \ldots$ of a system that is to be identified. Both the input and the output can be disturbed by noise.

Preparation:
Define hyperparameters, either through the NIGP algorithm or by using expert knowledge about the system. Optionally, also define an initial set of inducing input points $X_u$.

Updating:
while there are unprocessed measurements $y_{k+1}$ do
1. Set up $x_{k+1}$ (shortened to $x_+$) using its definition in (26). Find its prior distribution using the known covariances between system outputs $y_k, \ldots, y_{k-(n_y-1)}$ and (if necessary) system inputs $u_k, \ldots, u_{k-(n_u-1)}$. Also find the prior distribution of the function output $y_{k+1}$ (denoted as $f_{k+1}$ or shortened as $f_+$).
2. Apply (13) to find the posterior distribution $\mathcal{N}\left(x_+^{k+1}, \Sigma_{+x}^{k+1}\right)$ of $x_+$. Optionally, use this to update the posterior distribution of the system outputs $y_k, \ldots, y_{k-(n_y-1)}$ and system inputs $u_k, \ldots, u_{k-(n_u-1)}$.
3. Optionally, if $x_+^{k+1}$ is far removed from any inducing input point, add it to the set of inducing inputs $X_u$ using (19). (Or rearrange the inducing input points in any desired way.)
4. Calculate the posterior distribution of the inducing input vector $f_u$ for each of the outputs of $\phi$ using (15) and (16).
5. Calculate the posterior distribution of $y_{k+1}$ using (17). Additionally, calculate the covariances between $y_{k+1}$ and each of the previous system outputs $y_k, \ldots, y_{k-(n_y-1)}$ and inputs $u_k, \ldots, u_{k-(n_u-1)}$ through (18).
end

Prediction:
For any deterministic set of previous outputs $y_k, \ldots, y_{k-(n_y-1)}$ and inputs $u_k, \ldots, u_{k-(n_u-1)}$, apply (5) to predict the next output $y_{k+1}$. For stochastic parameters, use the expansion from Section 4.5.

Algorithm 2: The System Identification using Sparse Online GP (SISOG) algorithm: an application of the SONIG algorithm to identify a non-linear system in an online way.


A comparison of the SISOG algorithm with various other methods is shown in Table 2. This table shows that the SISOG algorithm can clearly outperform other black-box modeling approaches. It should be noted here, however, that this is all subject to the proper tuning of hyperparameters and the proper choice of inducing input points. With different hyperparameters or inducing input point selection strategies, the performance will degrade slightly, though in most cases it still is competitive with other identification algorithms.

Figure 2: Prediction of the output of the magneto-rheological fluid damper by the SISOG algorithm (black) compared to the real output (blue). The grey area represents the 95% uncertainty region as given by the algorithm. It shows that in the transition regions (like near t = 2 s), on which the algorithm is less well trained, the uncertainty is larger. As a comparison, the best non-linear ARX model predictions from The MathWorks Inc. (2015) (red) are also plotted. It is interesting to note that this model makes very similar errors to the SISOG algorithm, indicating the errors are mostly caused by distortions in the training/evaluation data.

Table 2: Comparison of various system identification models and algorithms when applied to data from the magneto-rheological fluid damper. All algorithms were given 2000 measurements for training and 1499 measurements for evaluation.

Algorithm | RMSE | Source

Linear OE model (4th order) | 27.1 | The MathWorks Inc. (2015)

Hammerstein-Wiener (4th order) | 27.0 | The MathWorks Inc. (2015)

NLARX (3rd order, wavelet network) | 24.5 | The MathWorks Inc. (2015)

NLARX (3rd order, tree partition) | 19.3 | The MathWorks Inc. (2015)

NLARX (3rd order, sigmoid network) | 8.24 | The MathWorks Inc. (2015)

RR GP-SSM | 8.17 | Svensson et al. (2016)

SISOG | 7.12 | This paper

6. Conclusions and recommendations

We can conclude that the presented SONIG algorithm works as intended. Just like the FITC algorithm which it expands on, it is mainly effective when there are more measurements than the NIGP algorithm (or regular GP regression) can handle. The SONIG algorithm can then include the additional measurements very efficiently – incorporating each measurement in constant runtime – giving a much higher accuracy than the NIGP algorithm could have given. However, even when this is not the case, the SONIG algorithm on average performs better than the NIGP algorithm, though it still needs the NIGP algorithm for hyperparameter tuning.

Though the SONIG algorithm can be used for any type of regression problem, it has been successfully applied, in the form of the SISOG algorithm, to a non-linear black-box system identification problem. With the proper choice of hyperparameters and inducing input points, it outperformed existing state-of-the-art non-linear system identification algorithms.

Nevertheless, there are still many improvements which can be made to the SONIG algorithm. For instance, to improve the accuracy of the algorithm, we can look at reducing some of the approximating assumptions, like the linearization assumption (12) or the assumption that higher order terms of Σ+ are negligible.

Another way to improve the accuracy of the algorithm is to increase the number of inducing input points, but this will slow the algorithm down. To compensate, we could look into updating only the few nearest inducing input points (with the highest covariance) when incorporating a new measurement. Experience has shown that updates hardly affect inducing inputs far away from the measurement point anyway.
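To illustrate the selection step this would require, one could rank the inducing inputs by their kernel value with the new measurement point and restrict the update to the top few. The sketch below only shows this ranking step; the restricted SONIG update itself is not part of the current algorithm, and all names are illustrative.

```python
# Rank inducing inputs by their covariance (kernel value) with a new
# measurement point and return the indices of the closest few.
import numpy as np

def nearest_inducing_points(X_u, x_new, kernel, n_update=10):
    """Indices of the inducing inputs with the highest covariance to x_new."""
    k = np.array([kernel(x_u, x_new) for x_u in X_u])
    return np.argsort(k)[::-1][:n_update]
```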

A final possible improvement would concern the addition of a smoothing step in the algorithm. Currently, early measurements are used to provide more accuracy for later measurements, but not vice versa. If we also walk back through the measurements, like in a smoothing algorithm, a higher accuracy might again be obtained.

Acknowledgments

This research is supported by the Dutch Technology Foundation STW, which is part of the Netherlands Organisation for Scientific Research (NWO), and which is partly funded by the Ministry of Economic Affairs. The work was also supported by the Swedish Research Council (VR) via the project Probabilistic modeling of dynamical systems (Contract number: 621-2013-5524). We would also like to thank Marc Deisenroth for fruitful discussion and in particular for pointing us to the NIGP algorithm.

Appendix A. Derivation of the inducing input point covariance

In this appendix we derive update law (16) for the covariance matrix of $f_u^{n+1}$ subject to a stochastic measurement point $x_+ \sim \mathcal{N}(\bar{x}_+, \Sigma_{+x})$. (We drop the superscript $n+1$ here for ease of notation.) This quantity is formally defined element-wise as
\[
\mathbb{V}_{f_u x_+}\left[f_{u_i}(x_+), f_{u_j}(x_+)\right] = \mathbb{E}_{f_u x_+}\left[\left(f_{u_i}(x_+) - \mathbb{E}_{f_u x_+}\left[f_{u_i}(x_+)\right]\right)\left(f_{u_j}(x_+) - \mathbb{E}_{f_u x_+}\left[f_{u_j}(x_+)\right]\right)\right]. \tag{29}
\]
The key to solving this lies in expanding it to
\[
\mathbb{E}_{f_u x_+}\left[\left(\left(f_{u_i}(x_+) - \mathbb{E}_{f_u}\left[f_{u_i}(x_+)\right]\right) + \left(\mathbb{E}_{f_u}\left[f_{u_i}(x_+)\right] - \mathbb{E}_{f_u x_+}\left[f_{u_i}(x_+)\right]\right)\right)\left(\ldots\right)\right], \tag{30}
\]
where the dots $(\ldots)$ denote the exact same as the previous term within brackets, but then with subscript $j$ instead of $i$. Expanding the brackets will lead to four terms. The two cross-terms of the form
\[
\mathbb{E}_{f_u x_+}\left[\left(f_{u_i}(x_+) - \mathbb{E}_{f_u}\left[f_{u_i}(x_+)\right]\right)\left(\mathbb{E}_{f_u}\left[f_{u_j}(x_+)\right] - \mathbb{E}_{f_u x_+}\left[f_{u_j}(x_+)\right]\right)\right] \tag{31}
\]
will turn out to be zero. To see why, we first note that the second half of the above expression does not depend on the value of $f_u$. When we then apply the expectation operator with respect to $f_u$, the first half will become zero.

That leaves us with two terms, one of them being
\[
\mathbb{E}_{f_u x_+}\left[\left(f_{u_i}(x_+) - \mathbb{E}_{f_u}\left[f_{u_i}(x_+)\right]\right)\left(f_{u_j}(x_+) - \mathbb{E}_{f_u}\left[f_{u_j}(x_+)\right]\right)\right]. \tag{32}
\]
If we expand $f_{u_i}$ (and similarly $f_{u_j}$) using (14), we will find
\begin{align}
\mathbb{E}_{f_u x_+}\bigg[\bigg(&\left(f_{u_i}(\bar{x}_+) - \mathbb{E}_{f_u}\left[f_{u_i}(\bar{x}_+)\right]\right) + \left(\frac{d f_{u_i}(\bar{x}_+)}{d x_+} - \mathbb{E}_{f_u}\left[\frac{d f_{u_i}(\bar{x}_+)}{d x_+}\right]\right)\left(x_+ - \bar{x}_+\right) \tag{33} \\
&+ \frac{1}{2}\left(x_+ - \bar{x}_+\right)^T\left(\frac{d^2 f_{u_i}(\bar{x}_+)}{d x_+^2} - \mathbb{E}_{f_u}\left[\frac{d^2 f_{u_i}(\bar{x}_+)}{d x_+^2}\right]\right)\left(x_+ - \bar{x}_+\right) + \ldots\bigg)^T\bigg(\ldots\bigg)\bigg]. \nonumber
\end{align}
The nice part is that each of the terms is either dependent on $x_+$ or on $f_u$, but not on both. Hence, we can split the expectation, and after working out the results, neglecting higher orders of $\Sigma_{+x}$, we can reduce this term to
\[
\mathbb{V}\left[f_{u_i}, f_{u_j}\right] + \mathrm{tr}\left(\left(\mathbb{V}\left[\frac{d f_{u_i}}{d x_+}, \frac{d f_{u_j}}{d x_+}\right] + \frac{1}{2}\mathbb{V}\left[f_{u_i}, \frac{d^2 f_{u_j}}{d x_+^2}\right] + \frac{1}{2}\mathbb{V}\left[\frac{d^2 f_{u_i}}{d x_+^2}, f_{u_j}\right]\right)\Sigma_{+x}\right). \tag{34}
\]
Here we have dropped the function notation $(\bar{x}_+)$ from $f_{u_i}(\bar{x}_+)$ and the subscript $f_u x_+$ from $\mathbb{V}$ because of space constraints, and because it is clear with respect to what we take the function/covariance.

The above expression does leave us with the question of how to find the covariance of the derivatives of a GP with respect to some external variable $x_+$. Luckily, it can be shown in a straightforward way (by directly taking the derivative of the definition of the variance operator) that
\begin{align}
\frac{d\,\mathbb{V}\left[f_{u_i}, f_{u_j}\right]}{dx_+} &= \mathbb{V}\left[f_{u_i}, \frac{df_{u_j}}{dx_+}\right] + \mathbb{V}\left[\frac{df_{u_i}}{dx_+}, f_{u_j}\right], \nonumber \\
\frac{d^2\,\mathbb{V}\left[f_{u_i}, f_{u_j}\right]}{dx_+^2} &= \mathbb{V}\left[f_{u_i}, \frac{d^2 f_{u_j}}{dx_+^2}\right] + 2\,\mathbb{V}\left[\frac{df_{u_i}}{dx_+}, \frac{df_{u_j}}{dx_+}\right] + \mathbb{V}\left[\frac{d^2 f_{u_i}}{dx_+^2}, f_{u_j}\right]. \tag{35}
\end{align}
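Spelling this step out, the first identity follows from interchanging the derivative with the expectation in the definition of the covariance operator and applying the product rule; the second identity then follows by applying the first one twice:
\begin{align*}
\frac{d\,\mathbb{V}\left[f_{u_i}, f_{u_j}\right]}{dx_+}
&= \mathbb{E}\left[\frac{d}{dx_+}\Big(\big(f_{u_i} - \mathbb{E}\left[f_{u_i}\right]\big)\big(f_{u_j} - \mathbb{E}\left[f_{u_j}\right]\big)\Big)\right] \\
&= \mathbb{E}\left[\left(f_{u_i} - \mathbb{E}\left[f_{u_i}\right]\right)\left(\frac{df_{u_j}}{dx_+} - \mathbb{E}\left[\frac{df_{u_j}}{dx_+}\right]\right)\right] + \mathbb{E}\left[\left(\frac{df_{u_i}}{dx_+} - \mathbb{E}\left[\frac{df_{u_i}}{dx_+}\right]\right)\left(f_{u_j} - \mathbb{E}\left[f_{u_j}\right]\right)\right] \\
&= \mathbb{V}\left[f_{u_i}, \frac{df_{u_j}}{dx_+}\right] + \mathbb{V}\left[\frac{df_{u_i}}{dx_+}, f_{u_j}\right].
\end{align*}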

Next, we will look at the last of the four terms. It is defined as
\[
\mathbb{E}_{f_u x_+}\left[\left(\mathbb{E}_{f_u}\left[f_{u_i}(x_+)\right] - \mathbb{E}_{f_u x_+}\left[f_{u_i}(x_+)\right]\right)\left(\mathbb{E}_{f_u}\left[f_{u_j}(x_+)\right] - \mathbb{E}_{f_u x_+}\left[f_{u_j}(x_+)\right]\right)\right], \tag{36}
\]
and working the above out further will eventually turn it into
\[
\left(\frac{d\,\mathbb{E}_{f_u}\left[f_{u_i}(\bar{x}_+)\right]}{dx_+}\right)\Sigma_{+x}\left(\frac{d\,\mathbb{E}_{f_u}\left[f_{u_j}(\bar{x}_+)\right]}{dx_+}\right)^T. \tag{37}
\]
If we now merge all our results into one expression, we find that
\begin{align}
\mathbb{V}_{f_u x_+}\left[f_{u_i}(x_+), f_{u_j}(x_+)\right] = \mathbb{V}\left[f_{u_i}(\bar{x}_+), f_{u_j}(\bar{x}_+)\right] &+ \left(\frac{d\,\mathbb{E}\left[f_{u_i}(\bar{x}_+)\right]}{dx_+}\right)\Sigma_{+x}\left(\frac{d\,\mathbb{E}\left[f_{u_j}(\bar{x}_+)\right]}{dx_+}\right)^T \nonumber \\
&+ \frac{1}{2}\mathrm{tr}\left(\left(\frac{d^2\,\mathbb{V}\left[f_{u_i}(\bar{x}_+), f_{u_j}(\bar{x}_+)\right]}{dx_+^2}\right)\Sigma_{+x}\right), \tag{38}
\end{align}
which equals (16).
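As a minimal numerical sketch of this correction, assuming the posterior covariance and the required derivatives with respect to the input point have already been computed (for instance with the expressions from Appendix B); all function and variable names are illustrative:

```python
# Apply the covariance correction of (38)/(16) element-wise for a noisy
# input point x+ with covariance Sigma_x.
import numpy as np

def corrected_cov(V, dmu_dx, d2V_dx2, Sigma_x):
    """Covariance of the inducing-point values after marginalizing over x+.

    V        : (m, m) posterior covariance of the inducing-point values
    dmu_dx   : (m, d) derivative of the posterior mean with respect to x+
    d2V_dx2  : (m, m, d, d) second derivative of V with respect to x+
    Sigma_x  : (d, d) covariance of the stochastic input point x+
    """
    m = V.shape[0]
    V_new = V + dmu_dx @ Sigma_x @ dmu_dx.T      # mean-derivative term of (38)
    for i in range(m):
        for j in range(m):
            V_new[i, j] += 0.5 * np.trace(d2V_dx2[i, j] @ Sigma_x)  # trace term
    return V_new
```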

Appendix B. Derivatives of prediction matrices

Equations (15) and (16) contain various derivatives of matrices. Using (7) and (8) we can find them. To do so, we first define the scalar quantity
\[
P = \Sigma_{++}^n + \sigma_n^2 = K_{++} + \sigma_n^2 - K_{+u}K_{uu}^{-1}\left(K_{uu} - \Sigma_{uu}^n\right)K_{uu}^{-1}K_{u+}. \tag{39}
\]


We also assume that $m(x) = 0$ for ease of notation. (If not, this can of course be taken into account.) The derivatives of $\mu_u^{n+1}$ and $\Sigma_{uu}^{n+1}$ can now be found element-wise through
\begin{align}
\frac{d\mu_u^{n+1}}{dx_{+j}} &= \Sigma_{uu}^n K_{uu}^{-1}\bigg(\frac{dK_{u+}}{dx_{+j}}P^{-1}\left(y_+ - K_{+u}K_{uu}^{-1}\mu_u^n\right) + K_{u+}\frac{dP^{-1}}{dx_{+j}}\left(y_+ - K_{+u}K_{uu}^{-1}\mu_u^n\right) - K_{u+}P^{-1}\frac{dK_{+u}}{dx_{+j}}K_{uu}^{-1}\mu_u^n\bigg), \nonumber \\
\frac{d^2\mu_u^{n+1}}{dx_{+j}dx_{+k}} &= \Sigma_{uu}^n K_{uu}^{-1}\bigg(\frac{d^2K_{u+}}{dx_{+j}dx_{+k}}P^{-1}\left(y_+ - K_{+u}K_{uu}^{-1}\mu_u^n\right) + \frac{dK_{u+}}{dx_{+j}}\frac{dP^{-1}}{dx_{+k}}\left(y_+ - K_{+u}K_{uu}^{-1}\mu_u^n\right) - \frac{dK_{u+}}{dx_{+j}}P^{-1}\frac{dK_{+u}}{dx_{+k}}K_{uu}^{-1}\mu_u^n \nonumber \\
&\quad + \frac{dK_{u+}}{dx_{+k}}\frac{dP^{-1}}{dx_{+j}}\left(y_+ - K_{+u}K_{uu}^{-1}\mu_u^n\right) + K_{u+}\frac{d^2P^{-1}}{dx_{+j}dx_{+k}}\left(y_+ - K_{+u}K_{uu}^{-1}\mu_u^n\right) - K_{u+}\frac{dP^{-1}}{dx_{+j}}\frac{dK_{+u}}{dx_{+k}}K_{uu}^{-1}\mu_u^n \nonumber \\
&\quad - \frac{dK_{u+}}{dx_{+k}}P^{-1}\frac{dK_{+u}}{dx_{+j}}K_{uu}^{-1}\mu_u^n - K_{u+}\frac{dP^{-1}}{dx_{+k}}\frac{dK_{+u}}{dx_{+j}}K_{uu}^{-1}\mu_u^n - K_{u+}P^{-1}\frac{d^2K_{+u}}{dx_{+j}dx_{+k}}K_{uu}^{-1}\mu_u^n\bigg), \nonumber \\
\frac{d\Sigma_{uu}^{n+1}}{dx_{+j}} &= -\Sigma_{uu}^n K_{uu}^{-1}\bigg(\frac{dK_{u+}}{dx_{+j}}P^{-1}K_{+u} + K_{u+}\frac{dP^{-1}}{dx_{+j}}K_{+u} + K_{u+}P^{-1}\frac{dK_{+u}}{dx_{+j}}\bigg)K_{uu}^{-1}\Sigma_{uu}^n, \nonumber \\
\frac{d^2\Sigma_{uu}^{n+1}}{dx_{+j}dx_{+k}} &= -\Sigma_{uu}^n K_{uu}^{-1}\bigg(\frac{d^2K_{u+}}{dx_{+j}dx_{+k}}P^{-1}K_{+u} + \frac{dK_{u+}}{dx_{+j}}\frac{dP^{-1}}{dx_{+k}}K_{+u} + \frac{dK_{u+}}{dx_{+j}}P^{-1}\frac{dK_{+u}}{dx_{+k}} \nonumber \\
&\quad + \frac{dK_{u+}}{dx_{+k}}\frac{dP^{-1}}{dx_{+j}}K_{+u} + K_{u+}\frac{d^2P^{-1}}{dx_{+j}dx_{+k}}K_{+u} + K_{u+}\frac{dP^{-1}}{dx_{+j}}\frac{dK_{+u}}{dx_{+k}} \nonumber \\
&\quad + \frac{dK_{u+}}{dx_{+k}}P^{-1}\frac{dK_{+u}}{dx_{+j}} + K_{u+}\frac{dP^{-1}}{dx_{+k}}\frac{dK_{+u}}{dx_{+j}} + K_{u+}P^{-1}\frac{d^2K_{+u}}{dx_{+j}dx_{+k}}\bigg)K_{uu}^{-1}\Sigma_{uu}^n. \tag{40}
\end{align}
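For a single output and a scalar measurement $y_+$, the first line of (40) could be evaluated along the following lines, assuming all kernel blocks and their derivatives with respect to element $j$ of $x_+$ have been precomputed; names and the interface are illustrative.

```python
# Sketch of the first line of (40): derivative of the updated inducing-point
# mean with respect to x_{+j}. All 1-D arrays have length m (inducing points);
# P_inv, dPinv_j and y_plus are scalars for a single output.
import numpy as np

def dmu_dxj(S_uu, K_uu_inv, K_up, dK_up_j, P_inv, dPinv_j, y_plus, mu_u):
    r = y_plus - K_up @ K_uu_inv @ mu_u                 # y+ - K_{+u} K_uu^{-1} mu_u^n
    term1 = dK_up_j * (P_inv * r)                       # dK_{u+}/dx_{+j} P^{-1} r
    term2 = K_up * (dPinv_j * r)                        # K_{u+} dP^{-1}/dx_{+j} r
    term3 = K_up * P_inv * (dK_up_j @ K_uu_inv @ mu_u)  # K_{u+} P^{-1} dK_{+u}/dx_{+j} K_uu^{-1} mu_u^n
    return S_uu @ K_uu_inv @ (term1 + term2 - term3)
```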

These expressions contain various additional derivatives. To find them, we need to choose a covariance function. (The above expressions are valid for any covariance function.) If we use the squared exponential covariance function of (2), we can derive
\begin{align}
\frac{dK_{u_i+}}{dx_+} &= \alpha^2 \exp\left(-\frac{1}{2}\left(x_{u_i} - x_+\right)^T \Lambda^{-1}\left(x_{u_i} - x_+\right)\right)\left(x_{u_i} - x_+\right)^T \Lambda^{-1}, \nonumber \\
\frac{d^2K_{u_i+}}{dx_+^2} &= \alpha^2 \exp\left(-\frac{1}{2}\left(x_{u_i} - x_+\right)^T \Lambda^{-1}\left(x_{u_i} - x_+\right)\right)\left(\Lambda^{-1}\left(x_{u_i} - x_+\right)\left(x_{u_i} - x_+\right)^T \Lambda^{-1} - \Lambda^{-1}\right), \nonumber \\
\frac{dP^{-1}}{dx_+} &= -P^{-2}\frac{dP}{dx_+} = 2P^{-2}\left(K_{+u}K_{uu}^{-1}\left(K_{uu} - \Sigma_{uu}^n\right)K_{uu}^{-1}\frac{dK_{u+}}{dx_+}\right), \nonumber \\
\frac{d^2P^{-1}}{dx_+^2} &= \frac{d}{dx_+}\left(-P^{-2}\frac{dP}{dx_+}\right) = 2P^{-3}\left(\frac{dP}{dx_+}\right)^T\left(\frac{dP}{dx_+}\right) - P^{-2}\frac{d^2P}{dx_+^2}, \nonumber \\
\frac{dP}{dx_+} &= -2K_{+u}K_{uu}^{-1}\left(K_{uu} - \Sigma_{uu}^n\right)K_{uu}^{-1}\frac{dK_{u+}}{dx_+}, \nonumber \\
\frac{d^2P}{dx_+^2} &= -2\frac{dK_{+u}}{dx_+}K_{uu}^{-1}\left(K_{uu} - \Sigma_{uu}^n\right)K_{uu}^{-1}\frac{dK_{u+}}{dx_+} - 2K_{+u}K_{uu}^{-1}\left(K_{uu} - \Sigma_{uu}^n\right)K_{uu}^{-1}\frac{d^2K_{u+}}{dx_+^2}. \tag{41}
\end{align}
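The squared exponential derivatives above are straightforward to evaluate. A minimal sketch follows, assuming $\Lambda$ is the (diagonal) matrix of squared length scales and $\alpha^2$ the signal variance from (2); function and variable names are illustrative.

```python
# Squared exponential kernel value and its first two derivatives with respect
# to the (mean) trial point x_plus, for a single inducing input x_u, as in (41).
import numpy as np

def se_kernel_derivatives(x_u, x_plus, Lambda, alpha2):
    """Return K(x_u, x+), dK/dx+ (row vector) and d2K/dx+^2 (matrix)."""
    Li = np.linalg.inv(Lambda)
    d = x_u - x_plus
    k = alpha2 * np.exp(-0.5 * d @ Li @ d)         # K_{u_i +}
    dk = k * (d @ Li)                              # dK_{u_i +}/dx+
    d2k = k * (np.outer(Li @ d, d @ Li) - Li)      # d2K_{u_i +}/dx+^2
    return k, dk, d2k
```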


References

Hildo Bijl. SONIG source code, 2016. URL https://github.com/HildoBijl/SONIG.

Hildo Bijl, Jan-Willem van Wingerden, Thomas B. Schön, and Michel Verhaegen. Online sparse Gaussian process regression using FITC and PITC approximations. In Proceedings of the IFAC Symposium on System Identification, SYSID, Beijing, China, October 2015.

Joaquin Quiñonero-Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

Krzysztof Chalupka, Christopher K. I. Williams, and Iain Murray. A framework for evaluating approximation methods for Gaussian process regression. Journal of Machine Learning Research, 14:333–350, 2013.

Lehel Csató and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14(3):641–669, 2002.

Patrick Dallaire, Camille Besse, and Brahim Chaib-draa. Learning Gaussian process models from uncertain data. In Proceedings of the 16th International Conference on Neural Information Processing, December 2009.

M. P. Deisenroth and J. W. Ng. Distributed Gaussian processes. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 2015.

Marc P. Deisenroth. Efficient Reinforcement Learning using Gaussian Processes. PhD thesis, Karlsruhe Institute of Technology, 2010.

Marc P. Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning (ICML), Bellevue, Washington, USA, pages 465–472. ACM Press, 2011.

Y. Gal, M. van der Wilk, and C. E. Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems (NIPS), 2014.

Agathe Girard and Roderick Murray-Smith. Learning a Gaussian process model with uncertain inputs. Technical Report 144, Department of Computing Science, University of Glasgow, June 2003.

P. W. Goldberg, C. K. I. Williams, and Christopher M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems (NIPS), volume 10, pages 493–499. MIT Press, January 1997.

James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), Bellevue, Washington, USA, 2013.

Marco F. Huber. Recursive Gaussian process regression. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, pages 3362–3366, May 2013.

Marco F. Huber. Recursive Gaussian process: On-line regression and learning. Pattern Recognition Letters, 45:85–91, 2014.

Peng Kou, Feng Gao, and Xiaohong Guan. Sparse online warped Gaussian process for wind power probabilistic forecasting. Applied Energy, 108:410–428, August 2013.

Quoc V. Le and Alex J. Smola. Heteroscedastic Gaussian process regression. In Proceedings of the International Conference on Machine Learning (ICML), Bonn, Germany, 2005.

Lennart Ljung. System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ, USA, 1999.

Andrew McHutchon and Carl E. Rasmussen. Gaussian process training with input noise. In Advances in Neural Information Processing Systems (NIPS), Granada, Spain, pages 1341–1349, December 2011.

Ananth Ranganathan, Ming-Hsuan Yang, and Jeffrey Ho. Online sparse Gaussian process regression and its applications. IEEE Transactions on Image Processing, 20(2):391–404, 2011.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Matthias Seeger, Christopher K. I. Williams, and Neil D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Workshop on AI and Statistics, number 9, 2003.

Alex J. Smola and Peter Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada, pages 619–625, 2001.

Edward Snelson and Zoubin Ghahramani. Variable noise and dimensionality reduction for sparse Gaussian processes. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI), Cambridge, Massachusetts, USA, 2006a.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS), pages 1257–1264, 2006b.

Andreas Svensson, Arno Solin, Simo Särkkä, and Thomas B. Schön. Computationally efficient Bayesian learning of Gaussian process state space models. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, May 2016.

The MathWorks Inc. Nonlinear modeling of a magneto-rheological fluid damper. Example file provided by MATLAB R2015b System Identification Toolbox, 2015. Available at http://mathworks.com/help/ident/examples/nonlinear-modeling-of-a-magneto-rheological-fluid-damper.html.

M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, FL, USA, 2009.

Chunyi Wang and Radford M. Neal. Gaussian process regression with heteroscedastic or non-Gaussian residuals. Technical report, arXiv.org, 2012.

J. Wang, A. Sano, T. Chen, and B. Huang. Identification of Hammerstein systems without explicit parameterization of nonlinearity. International Journal of Control, 82(5):937–952, 2009.