
Sparse Gaussian processes using pseudo-inputs


Page 1: Sparse Gaussian processes using pseudo-inputs

Sparse Gaussian processes using pseudo-inputs

Ed Snelson ([email protected])

Gatsby Computational Neuroscience Unit, UCL

NIPS 2005, 6th December

Work done with Zoubin Ghahramani

Page 2: Sparse Gaussian processes using pseudo-inputs

Nonlinear regression

Consider the problem of nonlinear regression:

You want to learn a function f with error bars from data D = {X,y}

[Figure: example data set, y plotted against x]

A Gaussian process is a prior over functions p(f) which can be used for Bayesian regression:

p(f|D) = p(f) p(D|f) / p(D)


Page 3: Sparse Gaussian processes using pseudo-inputs

Gaussian process (GP) priors

GP: consistent Gaussian prior on any set of function values f = {f_n}_{n=1}^N, given corresponding inputs X = {x_n}_{n=1}^N

[Figure: one sample function, f plotted against x, with the covariance matrix K_N]

prior: p(f|X) = N(0, K_N)

Covariance: K_nn' = K(x_n, x_n'; θ), hyperparameters θ

K_nn' = v exp( −(1/2) Σ_{d=1}^{D} ( (x_n^(d) − x_n'^(d)) / r_d )² )

Page 4: Sparse Gaussian processes using pseudo-inputs

Gaussian process (GP) priors

GP: consistent Gaussian prior on any set of function values f = {f_n}_{n=1}^N, given corresponding inputs X = {x_n}_{n=1}^N

[Figure: N function values f_1, f_2, f_3, ..., f_N marked on a sample function, with the covariance matrix K_N]

prior: p(f|X) = N(0, K_N)

Covariance: K_nn' = K(x_n, x_n'; θ), hyperparameters θ

K_nn' = v exp( −(1/2) Σ_{d=1}^{D} ( (x_n^(d) − x_n'^(d)) / r_d )² )
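The talk's accompanying code is in Matlab; purely as an illustration, here is a minimal NumPy sketch of the squared-exponential (ARD) covariance above, with amplitude v and lengthscales r_d, used to draw one sample function from the prior p(f|X) = N(0, K_N). The function name se_cov and all constants are my own choices, not from the talk.

    import numpy as np

    def se_cov(X1, X2, v, r):
        # K(x, x') = v * exp(-0.5 * sum_d ((x_d - x'_d) / r_d)^2)
        diff = (X1[:, None, :] - X2[None, :, :]) / r        # pairwise scaled differences
        return v * np.exp(-0.5 * np.sum(diff ** 2, axis=2))

    # Draw one sample function from the GP prior p(f|X) = N(0, K_N)
    rng = np.random.default_rng(0)
    X = np.linspace(-3.0, 3.0, 100)[:, None]                # N = 100 inputs, D = 1
    K_N = se_cov(X, X, v=1.0, r=np.array([1.0]))
    f = rng.multivariate_normal(np.zeros(len(X)), K_N + 1e-8 * np.eye(len(X)))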

Page 5: Sparse Gaussian processes using pseudo-inputs

GP regression

Gaussian observation noise: y_n = f_n + ε_n, where ε_n ∼ N(0, σ²)

[Figure: sample data, y against x]

marginal likelihood: p(y|X) = N(0, K_N + σ²I)

[Figure: predictive distribution, y against x]

predictive distribution: p(y∗|x∗, X, y) = N(μ∗, σ²∗)

μ∗ = K_∗N (K_N + σ²I)^{-1} y
σ²∗ = K_∗∗ − K_∗N (K_N + σ²I)^{-1} K_N∗ + σ²

Problem: O(N³) computation

Page 6: Sparse Gaussian processes using pseudo-inputs

GP regression

Gaussian observation noise: y_n = f_n + ε_n, where ε_n ∼ N(0, σ²)

[Figure: sample data, y against x]

marginal likelihood: p(y|X) = N(0, K_N + σ²I)

[Figure: predictive distribution, y against x, with the test input x∗ marked]

predictive distribution: p(y∗|x∗, X, y) = N(μ∗, σ²∗)

μ∗ = K_∗N (K_N + σ²I)^{-1} y
σ²∗ = K_∗∗ − K_∗N (K_N + σ²I)^{-1} K_N∗ + σ²

Problem: O(N³) computation
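The predictive equations above map directly onto code. The following self-contained sketch (mine, not from the talk) computes the full-GP predictive mean and variance; the N×N Cholesky factorization is the O(N³) bottleneck that motivates the sparse approximation.

    import numpy as np

    def se_cov(X1, X2, v, r):
        diff = (X1[:, None, :] - X2[None, :, :]) / r
        return v * np.exp(-0.5 * np.sum(diff ** 2, axis=2))

    def gp_predict(X, y, Xstar, v, r, sigma2):
        # mu_* = K_*N (K_N + sigma^2 I)^{-1} y
        # var_* = K_** - K_*N (K_N + sigma^2 I)^{-1} K_N* + sigma^2
        Ky  = se_cov(X, X, v, r) + sigma2 * np.eye(len(X))   # K_N + sigma^2 I
        KsN = se_cov(Xstar, X, v, r)                         # K_*N
        L = np.linalg.cholesky(Ky)                           # O(N^3) -- the bottleneck
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        mu = KsN @ alpha
        V = np.linalg.solve(L, KsN.T)
        var = v - np.sum(V ** 2, axis=0) + sigma2            # K_** diagonal equals v for the SE kernel
        return mu, var

    # usage: mu, var = gp_predict(X, y, Xstar, v=1.0, r=np.array([1.0]), sigma2=0.1)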

Page 7: Sparse Gaussian processes using pseudo-inputs

Overview

This talk contains 2 key ideas:

1. A new sparse Gaussian process approximation based on a small set of M ‘pseudo-inputs’ (M ≪ N). This reduces the computational complexity to O(M²N)

2. A gradient-based learning procedure for finding the pseudo-inputs and hyperparameters of the Gaussian process, in one joint optimization


Page 8: Sparse Gaussian processes using pseudo-inputs

Two-stage generative model

[Figure: M pseudo-inputs X̄ with pseudo function values f̄ drawn from the prior]

pseudo-input prior: p(f̄|X̄) = N(0, K_M)

1. Choose any set of M (pseudo-) inputs X̄

2. Draw corresponding function values f̄ from the prior


Page 9: Sparse Gaussian processes using pseudo-inputs

Two-stage generative model

[Figure: function values f at the inputs x, drawn conditioned on the pseudo-points (X̄, f̄), with conditional covariance Σ]

3. Draw f conditioned on f̄

conditional: p(f|f̄) = N(μ, Σ)

μ = K_NM K_M^{-1} f̄
Σ = K_N − K_NM K_M^{-1} K_MN

• This two-stage procedure defines exactly the same GP prior

• We have not gained anything yet, but it inspires a sparse approximation ... (a small sampling sketch of the two-stage draw follows below)
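To make the two-stage construction concrete, here is a small NumPy sketch (my own, using the slide's notation): draw pseudo-targets f̄ at M pseudo-inputs X̄ from N(0, K_M), then draw f at the real inputs from the conditional above. A sample produced this way is distributed exactly according to the original GP prior.

    import numpy as np

    def se_cov(X1, X2, v, r):
        diff = (X1[:, None, :] - X2[None, :, :]) / r
        return v * np.exp(-0.5 * np.sum(diff ** 2, axis=2))

    rng = np.random.default_rng(1)
    X    = np.linspace(-3.0, 3.0, 200)[:, None]      # real inputs
    Xbar = np.linspace(-3.0, 3.0, 10)[:, None]       # M = 10 pseudo-inputs
    v, r = 1.0, np.array([1.0])

    K_M  = se_cov(Xbar, Xbar, v, r) + 1e-8 * np.eye(len(Xbar))
    K_NM = se_cov(X, Xbar, v, r)
    K_N  = se_cov(X, X, v, r)

    # Steps 1-2: draw pseudo-targets fbar ~ N(0, K_M)
    fbar = rng.multivariate_normal(np.zeros(len(Xbar)), K_M)

    # Step 3: draw f | fbar ~ N(K_NM K_M^{-1} fbar, K_N - K_NM K_M^{-1} K_MN)
    mu  = K_NM @ np.linalg.solve(K_M, fbar)
    Sig = K_N - K_NM @ np.linalg.solve(K_M, K_NM.T)
    f   = rng.multivariate_normal(mu, Sig + 1e-8 * np.eye(len(X)))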


Page 10: Sparse Gaussian processes using pseudo-inputs

Factorized approximation

[Figure: single-point conditionals given the pseudo-points (X̄, f̄), with diagonal Λ]

single point conditional: p(f_n|f̄) = N(μ_n, λ_n)

μ_n = K_nM K_M^{-1} f̄
λ_n = K_nn − K_nM K_M^{-1} K_Mn

Approximate: p(f|f̄) ≈ ∏_n p(f_n|f̄) = N(μ, Λ), Λ = diag(λ)

Minimum KL: min_{q_n} KL[ p(f|f̄) ‖ ∏_n q_n(f_n) ]
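The approximation keeps only the diagonal of the conditional covariance. A minimal sketch (again my own, not the talk's) of the single-point conditionals: μ_n and λ_n for all n come from one M×M factorization, and Λ = diag(λ) is what replaces the full Σ.

    import numpy as np

    def se_cov(X1, X2, v, r):
        diff = (X1[:, None, :] - X2[None, :, :]) / r
        return v * np.exp(-0.5 * np.sum(diff ** 2, axis=2))

    def factorized_conditional(X, Xbar, fbar, v, r):
        # mu_n = K_nM K_M^{-1} fbar,  lambda_n = K_nn - K_nM K_M^{-1} K_Mn
        K_M  = se_cov(Xbar, Xbar, v, r) + 1e-8 * np.eye(len(Xbar))
        K_NM = se_cov(X, Xbar, v, r)
        L    = np.linalg.cholesky(K_M)
        V    = np.linalg.solve(L, K_NM.T)              # so K_NM K_M^{-1} K_MN = V^T V
        mu   = K_NM @ np.linalg.solve(K_M, fbar)
        lam  = v - np.sum(V ** 2, axis=0)              # K_nn = v for the SE kernel
        return mu, lam                                 # Lambda = diag(lam)

Only the diagonal λ is kept, so the approximate conditional factorizes across the N points; this is where the sparse approximation departs from the exact two-stage model.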


Page 11: Sparse Gaussian processes using pseudo-inputs

Sparse pseudo-input Gaussian processes (SPGP)

Integrate out f̄ to obtain the SPGP prior: p(f) = ∫ df̄ ∏_n p(f_n|f̄) p(f̄)

GP prior N(0, K_N)  ≈  SPGP prior p(f) = N(0, K_NM K_M^{-1} K_MN + Λ)

[Figure: full N×N covariance ≈ low-rank matrix + diagonal]

• SPGP covariance inverted in O(M²N) ⇒ sparse

• SPGP = GP with non-stationary covariance parameterized by X̄

• Given data {X, y} with noise σ², predictive mean and variance can be computed in O(M) and O(M²) per test case respectively
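The talk only states the O(M²N) result; the sketch below (my own) shows where it comes from, evaluating the SPGP marginal likelihood N(0, K_NM K_M^{-1} K_MN + Λ + σ²I) with the Woodbury identity and the matrix determinant lemma so that only M×M systems are ever factorized.

    import numpy as np

    def se_cov(X1, X2, v, r):
        diff = (X1[:, None, :] - X2[None, :, :]) / r
        return v * np.exp(-0.5 * np.sum(diff ** 2, axis=2))

    def spgp_nlml(X, y, Xbar, v, r, sigma2, jitter=1e-6):
        # Negative log marginal likelihood of N(0, K_NM K_M^{-1} K_MN + Lambda + sigma^2 I),
        # evaluated in O(N M^2) via the Woodbury identity and matrix determinant lemma.
        N, M = len(X), len(Xbar)
        K_M  = se_cov(Xbar, Xbar, v, r) + jitter * np.eye(M)
        K_NM = se_cov(X, Xbar, v, r)
        L_M  = np.linalg.cholesky(K_M)
        V    = np.linalg.solve(L_M, K_NM.T)                   # (M, N); diag(Q) = sum(V^2)
        lam  = v - np.sum(V ** 2, axis=0)                     # Lambda diagonal (K_nn = v)
        d    = lam + sigma2                                   # D = Lambda + sigma^2 I (diagonal)
        B    = K_M + (K_NM / d[:, None]).T @ K_NM             # K_M + K_MN D^{-1} K_NM
        L_B  = np.linalg.cholesky(B)
        c    = np.linalg.solve(L_B, K_NM.T @ (y / d))
        quad = np.sum(y ** 2 / d) - np.sum(c ** 2)            # y^T (Q + D)^{-1} y (Woodbury)
        logdet = (2 * np.sum(np.log(np.diag(L_B)))
                  - 2 * np.sum(np.log(np.diag(L_M)))
                  + np.sum(np.log(d)))                        # log|Q + D| (determinant lemma)
        return 0.5 * (quad + logdet + N * np.log(2 * np.pi))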


Page 12: Sparse Gaussian processes using pseudo-inputs

How to find pseudo-inputs?

Pseudo-inputs are like extra hyperparameters: we jointly maximize the marginal likelihood w.r.t. (X̄, θ, σ²)

p(y|X, X̄, θ, σ²) = N(0, K_NM K_M^{-1} K_MN + Λ + σ²I)

Key advantages over many related sparse methods¹:

1. Pseudo-inputs not constrained to a subset of the data (‘active set’) ⇒ improved accuracy and flexibility

2. Joint optimization avoids discontinuities that arise when active set selection is interleaved with hyperparameter learning

¹ Tresp (2000), Smola & Bartlett (2001), Csato & Opper (2002), Seeger et al. (2003)
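A sketch of the joint optimization, assuming the se_cov and spgp_nlml helpers from the previous sketch are in scope. The talk uses analytic gradients with CG or L-BFGS (see the later slide on extensions); for brevity this version lets SciPy's L-BFGS-B fall back to finite-difference gradients, which is only practical for small demos, and the log-parameterization of the positive hyperparameters is my own choice.

    import numpy as np
    from scipy.optimize import minimize

    # Assumes se_cov and spgp_nlml from the previous sketch are already defined.

    def fit_spgp(X, y, M=20, seed=0):
        # Jointly optimize pseudo-inputs Xbar and hyperparameters (v, r, sigma^2)
        # by minimizing the SPGP negative log marginal likelihood.
        rng = np.random.default_rng(seed)
        N, D = X.shape
        Xbar0 = X[rng.choice(N, size=M, replace=False)]       # initialize on a random data subset
        theta0 = np.concatenate([Xbar0.ravel(),
                                 np.log([1.0]),               # log amplitude v
                                 np.zeros(D),                 # log lengthscales r
                                 np.log([0.1])])              # log noise sigma^2

        def unpack(theta):
            Xbar = theta[:M * D].reshape(M, D)
            v = np.exp(theta[M * D])
            r = np.exp(theta[M * D + 1: M * D + 1 + D])
            sigma2 = np.exp(theta[-1])
            return Xbar, v, r, sigma2

        def objective(theta):
            Xbar, v, r, sigma2 = unpack(theta)
            return spgp_nlml(X, y, Xbar, v, r, sigma2)

        # L-BFGS-B with finite-difference gradients (the paper derives analytic ones)
        res = minimize(objective, theta0, method="L-BFGS-B")
        return unpack(res.x)

    # usage: Xbar, v, r, sigma2 = fit_spgp(X, y, M=20)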


Page 13: Sparse Gaussian processes using pseudo-inputs

1D demo

[Figure: 1D demo data, y against x, with pseudo-input locations X̄ along the bottom and bars showing the amplitude, lengthscale, and noise hyperparameters]

Initialize adversarially: amplitude and lengthscale too big, noise too small, pseudo-inputs bunched up


Page 14: Sparse Gaussian processes using pseudo-inputs

1D demo

[Figure: 1D demo after optimization, y against x, with pseudo-input locations X̄ and bars showing the amplitude, lengthscale, and noise hyperparameters]

Pseudo-inputs and hyperparameters optimized


Page 15: Sparse Gaussian processes using pseudo-inputs

Selected results (more at poster)


Page 16: Sparse Gaussian processes using pseudo-inputs

kin40k¹ — SPGP vs random

[Plot: test mean squared error (log scale, 10⁻² to 10⁻¹) against size of ‘active set’ M (0 to 1200); kin40k: 10000 training points, 30000 test points, 9 attributes]

horizontal line – full GP on subset of size 2000 (∗)
black – random subset – hyperparameters obtained from ∗
red squares – SPGP – pseudo-inputs optimized, hyperparameters obtained from ∗
blue circles – SPGP – pseudo-inputs and hyperparameters optimized

¹ as tested in Seeger et al. (2003)


Page 17: Sparse Gaussian processes using pseudo-inputs

kin40k — SPGP vs info-gain

[Plot: test mean squared error (log scale, 10⁻² to 10⁻¹) against size of ‘active set’ M (0 to 1200); kin40k: 10000 training points, 30000 test points, 9 attributes]

horizontal line – full GP on subset of size 2000 (∗)
black – info-gain¹ – hyperparameters obtained from ∗
red squares – SPGP – pseudo-inputs optimized, hyperparameters obtained from ∗
blue circles – SPGP – pseudo-inputs and hyperparameters optimized

¹ Seeger et al. (2003)


Page 18: Sparse Gaussian processes using pseudo-inputs

kin40k — SPGP vs Smo-Bart

[Plot: test mean squared error (log scale, 10⁻² to 10⁻¹) against size of ‘active set’ M (0 to 1200); kin40k: 10000 training points, 30000 test points, 9 attributes]

horizontal line – full GP on subset of size 2000 (∗)
black – Smo-Bart¹ – hyperparameters obtained from ∗
red squares – SPGP – pseudo-inputs optimized, hyperparameters obtained from ∗
blue circles – SPGP – pseudo-inputs and hyperparameters optimized

¹ Smola & Bartlett (2001)


Page 19: Sparse Gaussian processes using pseudo-inputs

Local maxima and overfitting?

• Many local maxima, but pseudo-inputs can be initialized on a random subset of the data. Hyperparameter initialization is trickier

• Many parameters: MD + |θ| + 1 instead of |θ| + 1. Overfitting?

(D = input space dimension, M = no. of pseudo-inputs)

• Consider M = N and X̄ = X

– Here K_MN = K_M = K_N and Λ = 0, so the covariance becomes K_N + σ²I ⇒ SPGP collapses to the full GP

• However, interaction with hyperparameter learning can lead to overfitting behaviour

• For a full Bayesian treatment: sample pseudo-inputs and hyperparameters from p(X̄, θ, σ²|X, y) instead of optimizing


Page 20: Sparse Gaussian processes using pseudo-inputs

Modeling non-stationarity

[Figure: predictive distributions, y against x, for a standard GP and for the SPGP, with the learned pseudo-input locations X̄ shown]

Extra flexibility of the SPGP allows some non-stationary effects to be modeled


Page 21: Sparse Gaussian processes using pseudo-inputs

Limitations and possible extensions

• Large pseudo-set size M and/or high-dimensional input space D means the optimization can become impractically large

– possible solution: learning a projection of the input space into a lower-dimensional space [work in progress]
– may also prevent overfitting

• We used CG or L-BFGS, but many optimization schemes are available:

– Optimize subsets of variables iteratively (chunking)
– Stochastic gradient descent
– Hybrid — pick some points randomly, optimize others
– EM algorithm

• Extension to classification and other likelihood functions


Page 22: Sparse Gaussian processes using pseudo-inputs

Conclusions

• New method for sparse GP regression

• Significant decrease in test error, especially for very sparse solutions

• Added flexibility of moving pseudo-inputs, which are not constrained to lie on the data, leads to better solutions

• Hyperparameters jointly learned with pseudo-inputs in a single smooth optimization

• Matlab code, draft paper, and this talk available

(www.gatsby.ucl.ac.uk/~snelson)


Page 23: Sparse Gaussian processes using pseudo-inputs

Acknowledgements

Sheffield GP Round-table

Carl Rasmussen
Joaquin Quinonero-Candela
Peter Sollich
Matthias Seeger
Neil Lawrence
Chris Williams
Tom Minka
Sam Roweis

PASCAL European Network of Excellence


Page 24: Sparse Gaussian processes using pseudo-inputs

Relation of SPGP to PLV¹

SPGP

• Approximate conditional: p(f|f̄) ≈ ∏_n p(f_n|f̄) = N(μ, Λ) – minimum-KL fully factorized approximation

• Marginal likelihood: N(0, K_NM K_M^{-1} K_MN + Λ + σ²I) – marginal variances match the full GP everywhere

• Pseudo-inputs: not constrained to the data – optimized by gradient ascent on the marginal likelihood, together with the hyperparameters

PLV

• Approximate conditional: p(f|f̄) ≈ N(μ, 0) – uncertainty not taken into account, a deterministic approximation

• Marginal likelihood: N(0, K_NM K_M^{-1} K_MN + σ²I) – marginal variances decay to σ² away from ‘active set’ points

• Active set: chosen as a subset of the data using a greedy info-gain criterion; active set selection and hyperparameter learning interleaved

¹ Seeger et al. (2003)


Page 25: Sparse Gaussian processes using pseudo-inputs

PLV with pseudo-inputs

[Figure: three predictive-distribution panels (a), (b), (c), each showing y against x; see caption below]

Predictive distributions for: (a) full GP, (b) gradient ascent on the SPGP likelihood, (c) gradient ascent on the PLV likelihood.

Initial pseudo point positions — red crosses

Final pseudo point positions — blue crosses
