
Lecture 3 and 4: Gaussian Processes

Machine Learning 4F13, Michaelmas 2015

Zoubin Ghahramani

Department of Engineering, University of Cambridge

http://mlg.eng.cam.ac.uk/teaching/4f13/


Key concepts

• Distributions over parameters and over functions
  • motivation: representation of multiple hypotheses
  • concepts of priors over functions and over parameters
  • priors over functions are priors over long vectors
  • GP definition
  • joint generation and conditional generation

• Properties of Gaussian Processes
  • the predictive distribution
  • hyperparameters
  • the marginal likelihood for a GP

• Connections between the linear-in-the-parameters model and GPs
  • from finite linear models to GPs
  • weight space and function space views
  • infinite dimensional models and why finite dimensional models are dangerous


Old question, new marginal likelihood view

[Figure: the data (x_n, y_n), with one panel for each polynomial degree M = 0 through M = 17.]

• Should we choose a polynomial? (model structure) We will address this soon.
• What degree should we choose for the polynomial? (model structure) Let the marginal likelihood speak.
• For a given degree, how do we choose the weights? (model parameters) We consider many possible weights under the posterior.
• For now, let's find the single “best” polynomial: degree and weights. (We don't do this sort of thing anymore.)


Marginal likelihood (Evidence) of our polynomials

Marginal likelihood, or “evidence” of a finite linear model:

p(y|x, M) = \int p(f|x, M)\, p(y|f)\, df = N(y;\, 0,\, \sigma_w^2 \Phi \Phi^\top + \sigma^2_{\text{noise}} I)

For each polynomial degree, repeat the following infinitely many times:

1. Sample a function f_s from the prior: p(f|x, M).
2. Compute the likelihood of that function for the data: p(y|f_s).
3. Keep count of the number of samples so far: S.
4. The marginal likelihood is the average likelihood: \frac{1}{S} \sum_{s=1}^{S} p(y|f_s).

Luckily for Gaussian noise there is a closed-form analytical solution!
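The sampling recipe above can be checked numerically. Below is a minimal Octave sketch (not from the slides): it estimates the evidence of a degree-M polynomial model by averaging the likelihoods of prior samples and compares the result to the closed form N(y; 0, σ²_w ΦΦᵀ + σ²_noise I). The toy data and the settings M, sw2, sn2 and S are illustrative assumptions.

% Monte Carlo estimate of the marginal likelihood vs. the closed form.
% Toy data and all settings below are assumptions for illustration.
x = [-1; 0; 1; 2];  y = [0.5; -0.3; 0.8; 2.1];    % toy data
M = 3;  sw2 = 1;  sn2 = 0.1;  S = 1e5;            % degree, sigma_w^2, sigma_noise^2, samples
N = numel(x);
Phi = zeros(N, M+1);
for m = 0:M
  Phi(:, m+1) = x.^m;                             % polynomial basis functions
end
acc = 0;
for s = 1:S
  w  = sqrt(sw2) * randn(M+1, 1);                 % sample weights from the prior
  fs = Phi * w;                                   % corresponding function values
  acc = acc + exp(-0.5 * sum((y - fs).^2) / sn2) / (2*pi*sn2)^(N/2);   % p(y|fs)
end
mc_evidence = acc / S;                            % average likelihood
C = sw2 * (Phi * Phi') + sn2 * eye(N);            % closed-form covariance of y
closed_form = exp(-0.5 * y' * (C \ y)) / sqrt((2*pi)^N * det(C));
disp([mc_evidence, closed_form]);                 % the two numbers should be close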

[Figure: log evidence plotted against M, the degree of the polynomial.]

• The evidence prefers M = 3, not simpler, not more complex.
• Too simple models consistently miss most data.
• Too complex models frequently miss some data.


Multiple explanations of the data

[Figure: the data (x_n, y_n).]

Remember that a finite linear model f(x_n) = \phi(x_n)^\top w with prior on the weights p(w) = N(w; 0, \sigma_w^2 I) has a posterior distribution

p(w|x, y, M) = N(w; \mu, \Sigma), with

\Sigma = \big(\sigma_{\text{noise}}^{-2} \Phi^\top \Phi + \sigma_w^{-2} I\big)^{-1}, \qquad \mu = \Big(\Phi^\top \Phi + \frac{\sigma_{\text{noise}}^2}{\sigma_w^2} I\Big)^{-1} \Phi^\top y,

and predictive distribution

p(y_*|x_*, x, y, M) = N\big(y_*;\; \phi(x_*)^\top \mu,\; \phi(x_*)^\top \Sigma\, \phi(x_*) + \sigma_{\text{noise}}^2 I\big).
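As a concrete illustration, here is a minimal Octave sketch of these posterior and predictive formulas for a polynomial basis; the toy data, degree and variances are assumptions, not values from the slides.

% Posterior over the weights and predictive distribution of a finite
% linear model (toy data and settings assumed for illustration).
x = [-1; 0; 1; 2];  y = [0.5; -0.3; 0.8; 2.1];
M = 3;  sw2 = 1;  sn2 = 0.1;
N = numel(x);
Phi = zeros(N, M+1);
for m = 0:M
  Phi(:, m+1) = x.^m;                            % design matrix, one column per basis function
end
Sigma = inv(Phi' * Phi / sn2 + eye(M+1) / sw2);  % posterior covariance of the weights
mu    = Sigma * (Phi' * y) / sn2;                % posterior mean of the weights
xs = 1.5;                                        % a test input
phis = zeros(1, M+1);
for m = 0:M
  phis(m+1) = xs^m;
end
pred_mean = phis * mu;                           % phi(x*)' mu
pred_var  = phis * Sigma * phis' + sn2;          % phi(x*)' Sigma phi(x*) + sigma_noise^2
disp([pred_mean, pred_var]);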


Are polynomials a good prior over functions?

[Figure: functions drawn from the polynomial priors for degrees M = 0, 1, 2, 3, 5 and 17; the vertical range grows from about ±2 for M = 0 to about ±2 × 10^5 for M = 17.]


A prior over functions view

[Figure: functions drawn from the polynomial priors for degrees M = 0, 1, 2, 3, 5 and 17, as on the previous slide.]

We have learnt that linear-in-the-parameter models with priors on the weights indirectly specify priors over functions.

True... but those priors over functions might not be good.

[Figure: three sample functions over x ∈ [−6, 6], with values roughly within ±2.]

... why not try to specify priors over functions directly?
What? What does a probability density over functions even look like?


The Gaussian Distribution

The univariate Gaussian distribution is given by

p(x|\mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\Big(-\frac{1}{2\sigma^2}(x - \mu)^2\Big)

The multivariate Gaussian distribution for D-dimensional vectors is given by

p(x|\mu, \Sigma) = N(\mu, \Sigma) = (2\pi)^{-D/2} |\Sigma|^{-1/2} \exp\Big(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\Big)

where µ is the mean vector and Σ the covariance matrix.

Conditionals and Marginals of a Gaussian, pictorial

[Figure: pictorial view of the conditional (left) and the marginal (right) of a joint Gaussian.]

Both the conditionals p(x|y) and the marginals p(x) of a joint Gaussian p(x, y) are again Gaussian.


Conditionals and Marginals of a Gaussian, algebra

If x and y are jointly Gaussian

If x and y are jointly Gaussian

p(x, y) = p\Big(\begin{bmatrix} x \\ y \end{bmatrix}\Big) = N\Big(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\Big),

we get the marginal distribution of x, p(x), by

p(x, y) = N\Big(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\Big) \implies p(x) = N(a, A),

and the conditional distribution of x given y by

p(x, y) = N\Big(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\Big) \implies p(x|y) = N\big(a + B C^{-1}(y - b),\; A - B C^{-1} B^\top\big),

where x and y can be scalars or vectors.
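A minimal Octave sketch of the conditioning formula, using illustrative scalar values (a, b, A, B, C and the observed y are assumptions):

% Conditioning a jointly Gaussian (x, y) on an observed y, scalar case.
a = 0;  b = 1;                        % means of x and y
A = 2;  B = 0.8;  C = 1.5;            % blocks of the joint covariance
yobs = 2.0;                           % observed value of y (assumed)
cond_mean = a + B / C * (yobs - b);   % a + B C^-1 (y - b)
cond_var  = A - B / C * B;            % A - B C^-1 B'
disp([cond_mean, cond_var]);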


What is a Gaussian Process?

A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.

Informally: infinitely long vector ≃ function

Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.

A Gaussian distribution is fully specified by a mean vector, µ, and covariance matrix Σ:

f = (f_1, \ldots, f_N)^\top \sim N(\mu, \Sigma), \quad \text{indexes } n = 1, \ldots, N

A Gaussian process is fully specified by a mean function m(x) and covariance function k(x, x'):

f \sim GP(m, k), \quad \text{indexes: } x \in X

here f and m are functions on X, and k is a function on X × X


The marginalization property

Thinking of a GP as a Gaussian distribution with an infinitely long mean vector and an infinite by infinite covariance matrix may seem impractical...

... luckily we are saved by the marginalization property:

Recall:

p(x) = \int p(x, y)\, dy.

For Gaussians:

p(x, y) = N\Big(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\Big) \implies p(x) = N(a, A)


Random functions from a Gaussian Process

Example one dimensional Gaussian process:

p(f) \sim GP(m, k), \quad \text{where } m(x) = 0 \text{ and } k(x, x') = \exp\big(-\tfrac{1}{2}(x - x')^2\big).

To get an indication of what this distribution over functions looks like, focus on a finite subset of function values f = (f(x_1), f(x_2), \ldots, f(x_N))^\top, for which

f \sim N(0, \Sigma), \quad \text{where } \Sigma_{ij} = k(x_i, x_j).

Then plot the coordinates of f as a function of the corresponding x values.

[Figure: the sampled function values f(x) plotted against the inputs x ∈ [−5, 5].]
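This procedure is easy to implement. Below is a minimal Octave sketch; the input grid and the small jitter added before the Cholesky factorization are assumptions made for numerical stability, not part of the definition.

% Draw one random function from a GP with zero mean and squared
% exponential covariance, evaluated on a finite grid of inputs.
x = linspace(-5, 5, 101)';
N = numel(x);
Sigma = zeros(N);
for i = 1:N
  for j = 1:N
    Sigma(i, j) = exp(-0.5 * (x(i) - x(j))^2);   % k(x, x') = exp(-(x - x')^2 / 2)
  end
end
f = chol(Sigma + 1e-10 * eye(N))' * randn(N, 1); % f ~ N(0, Sigma), jitter for stability
plot(x, f); xlabel('input, x'); ylabel('output, f(x)');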


Joint Generation

To generate a random sample from a D dimensional joint Gaussian with covariance matrix K and mean vector m: (in octave or matlab)

z = randn(D, 1);          % D independent standard normal samples
y = chol(K)' * z + m;     % transform to a sample from N(m, K)

where chol is the Cholesky factor R such that R^\top R = K.

Thus, the covariance of y is:

E[(y - m)(y - m)^\top] = E[R^\top z z^\top R] = R^\top E[z z^\top] R = R^\top I R = K.


Sequential Generation

Factorize the joint distribution

p(f_1, \ldots, f_N | x_1, \ldots, x_N) = \prod_{n=1}^{N} p(f_n | f_{n-1}, \ldots, f_1, x_n, \ldots, x_1),

and generate function values sequentially. For Gaussians:

p(f_n, f_{<n}) = N\Big(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\Big) \implies p(f_n | f_{<n}) = N\big(a + B C^{-1}(f_{<n} - b),\; A - B C^{-1} B^\top\big).
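A minimal Octave sketch of sequential generation with the squared exponential covariance (the input grid and jitter are assumptions); each f_n is drawn from the Gaussian conditional given above.

% Sequentially sample f_1, f_2, ... from a zero-mean GP by conditioning
% each new value on all previously generated ones.
x = linspace(-5, 5, 60)';
N = numel(x);
K = zeros(N);
for i = 1:N
  for j = 1:N
    K(i, j) = exp(-0.5 * (x(i) - x(j))^2);
  end
end
f = zeros(N, 1);
f(1) = sqrt(K(1, 1)) * randn();
for n = 2:N
  B = K(n, 1:n-1);                         % cross-covariance with earlier values
  C = K(1:n-1, 1:n-1) + 1e-10 * eye(n-1);  % covariance of earlier values (+ jitter)
  m = B * (C \ f(1:n-1));                  % conditional mean (prior means are zero)
  v = K(n, n) - B * (C \ B');              % conditional variance
  f(n) = m + sqrt(max(v, 0)) * randn();
end
plot(x, f);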

[Figure: three stages of sequentially generating a sample function on x ∈ [−5, 5]; function values lie roughly within ±3.]


[Figure: a function over a two-dimensional input, drawn at random from a Gaussian process with Gaussian covariance.]


Non-parametric Gaussian process models

In our non-parametric model, the “parameters” are the function itself!

Gaussian likelihood, with noise variance σ²_noise:

p(y|x, f, M_i) \sim N(f,\, \sigma^2_{\text{noise}} I),

Gaussian process prior with zero mean and covariance function k:

p(f|M_i) \sim GP(m \equiv 0,\, k),

leads to a Gaussian process posterior

p(f|x, y, M_i) \sim GP(m_{\text{post}}, k_{\text{post}}), \quad \text{where}

m_{\text{post}}(x) = k(x, x)[K(x, x) + \sigma^2_{\text{noise}} I]^{-1} y,
k_{\text{post}}(x, x') = k(x, x') - k(x, x)[K(x, x) + \sigma^2_{\text{noise}} I]^{-1} k(x, x'),

and a Gaussian predictive distribution:

p(y_*|x_*, x, y, M_i) \sim N\big(k(x_*, x)^\top [K + \sigma^2_{\text{noise}} I]^{-1} y,\; k(x_*, x_*) + \sigma^2_{\text{noise}} - k(x_*, x)^\top [K + \sigma^2_{\text{noise}} I]^{-1} k(x_*, x)\big).
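A minimal Octave sketch of these predictive equations with the squared exponential covariance; the toy data, kernel and noise level are assumptions.

% GP predictive mean and variance at test inputs xs, given training
% data (xtr, ytr). All data and settings are assumed for illustration.
xtr = [-4; -2; 0; 1; 3];  ytr = [-1.2; 0.3; 0.8; 0.5; -0.9];
sn2 = 0.1;                                  % noise variance
xs  = linspace(-5, 5, 101)';                % test inputs
N = numel(xtr);  Ns = numel(xs);
K = zeros(N);  Ks = zeros(Ns, N);  Kss = zeros(Ns, 1);
for i = 1:N
  for j = 1:N
    K(i, j) = exp(-0.5 * (xtr(i) - xtr(j))^2);
  end
end
for i = 1:Ns
  for j = 1:N
    Ks(i, j) = exp(-0.5 * (xs(i) - xtr(j))^2);
  end
  Kss(i) = 1;                               % k(x*, x*) = 1 for this kernel
end
L = chol(K + sn2 * eye(N))';                % K + sigma_noise^2 I = L L'
alpha = L' \ (L \ ytr);
pred_mean = Ks * alpha;                     % k(x*,x)' [K + sn2 I]^-1 y
v = L \ Ks';
pred_var = Kss + sn2 - sum(v.^2, 1)';       % predictive variance of y*
plot(xs, pred_mean);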


Prior and Posterior

[Figure: functions drawn from the GP prior (left) and from the GP posterior (right); input x ∈ [−5, 5], output f(x) roughly within ±2.]

Predictive distribution:

p(y_*|x_*, x, y) \sim N\big(k(x_*, x)^\top [K + \sigma^2_{\text{noise}} I]^{-1} y,\; k(x_*, x_*) + \sigma^2_{\text{noise}} - k(x_*, x)^\top [K + \sigma^2_{\text{noise}} I]^{-1} k(x_*, x)\big)


Some interpretation

Recall our main result:

f_*|x_*, x, y \sim N\big(K(x_*, x)[K(x, x) + \sigma^2_{\text{noise}} I]^{-1} y,\; K(x_*, x_*) - K(x_*, x)[K(x, x) + \sigma^2_{\text{noise}} I]^{-1} K(x, x_*)\big).

The mean is linear in two ways:

\mu(x_*) = k(x_*, x)[K(x, x) + \sigma^2_{\text{noise}} I]^{-1} y = \sum_{n=1}^{N} \beta_n y_n = \sum_{n=1}^{N} \alpha_n k(x_*, x_n).

The last form is most commonly encountered in the kernel literature.

The variance is the difference between two terms:

V(x_*) = k(x_*, x_*) - k(x_*, x)[K(x, x) + \sigma^2_{\text{noise}} I]^{-1} k(x, x_*),

the first term is the prior variance, from which we subtract a (positive) term telling how much the data x has explained.
Note that the variance is independent of the observed outputs y.


The marginal likelihood

Log marginal likelihood:

\log p(y|x, M_i) = -\tfrac{1}{2} y^\top K^{-1} y - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log(2\pi)

is the combination of a data fit term and complexity penalty. Occam’s Razor is automatic.

Learning in Gaussian process models involves finding

• the form of the covariance function, and
• any unknown (hyper-) parameters θ.

This can be done by optimizing the marginal likelihood:

\frac{\partial \log p(y|x, \theta, M_i)}{\partial \theta_j} = \frac{1}{2} y^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} y - \frac{1}{2} \operatorname{trace}\Big(K^{-1} \frac{\partial K}{\partial \theta_j}\Big)
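A minimal Octave sketch (toy data and hyperparameters assumed) that evaluates the log marginal likelihood and its gradient with respect to the log length scale of a squared exponential covariance, following the trace formula above.

% Log marginal likelihood of a GP with SE covariance plus noise, and its
% derivative with respect to log(ell). Data and settings are assumptions.
xtr = [-4; -2; 0; 1; 3];  ytr = [-1.2; 0.3; 0.8; 0.5; -0.9];
ell = 1.0;  v2 = 1.0;  sn2 = 0.1;
N = numel(xtr);
K = zeros(N);  D = zeros(N);                 % D will hold dK/d(log ell)
for i = 1:N
  for j = 1:N
    r2 = (xtr(i) - xtr(j))^2;
    K(i, j) = v2 * exp(-0.5 * r2 / ell^2);
    D(i, j) = K(i, j) * r2 / ell^2;          % derivative of K w.r.t. log ell
  end
end
Ky = K + sn2 * eye(N);                       % covariance of the observations y
alpha = Ky \ ytr;
logev = -0.5 * ytr' * alpha - 0.5 * log(det(Ky)) - N/2 * log(2*pi);
grad  = 0.5 * alpha' * D * alpha - 0.5 * trace(Ky \ D);
disp([logev, grad]);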


Example: Fitting the length scale parameter

Parameterized covariance function: k(x, x') = v^2 \exp\Big(-\frac{(x - x')^2}{2\ell^2}\Big) + \sigma^2_{\text{noise}}\, \delta_{xx'}.

[Figure: Characteristic Lengthscales. Posterior mean functions for length scales that are too long, about right, and too short; input x ∈ [−10, 10], function value y roughly in [−0.5, 1.0].]

The mean posterior predictive function is plotted for 3 different length scales (the blue curve corresponds to optimizing the marginal likelihood). Notice that an almost exact fit to the data can be achieved by reducing the length scale, but the marginal likelihood does not favour this!
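One simple way to see this is to evaluate the log marginal likelihood over a grid of length scales and let the evidence pick one. The sketch below does this in Octave; the toy data, the grid, and the fixed signal and noise variances are assumptions.

% Select the length scale by the marginal likelihood (grid search).
xtr = [-4; -2; 0; 1; 3];  ytr = [-1.2; 0.3; 0.8; 0.5; -0.9];
v2 = 1.0;  sn2 = 0.1;  N = numel(xtr);
ells = logspace(-1, 1, 50);                  % candidate length scales
logev = zeros(size(ells));
for e = 1:numel(ells)
  K = zeros(N);
  for i = 1:N
    for j = 1:N
      K(i, j) = v2 * exp(-0.5 * (xtr(i) - xtr(j))^2 / ells(e)^2);
    end
  end
  Ky = K + sn2 * eye(N);
  logev(e) = -0.5 * ytr' * (Ky \ ytr) - 0.5 * log(det(Ky)) - N/2 * log(2*pi);
end
[best, idx] = max(logev);
disp([ells(idx), best]);                     % length scale preferred by the evidence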


How can Bayes’ rule help find the right model complexity? Marginal likelihoods and Occam’s Razor

[Figure: P(Y|M_i) plotted over all possible data sets Y, for models that are too simple, too complex, and "just right".]


An illustrative analogous example

Imagine the simple task of fitting the variance, σ², of a zero-mean Gaussian to a set of n scalar observations.

The log likelihood is \log p(y|\mu, \sigma^2) = -\tfrac{1}{2} y^\top I y / \sigma^2 - \tfrac{1}{2} \log |I\sigma^2| - \tfrac{n}{2} \log(2\pi)
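A minimal Octave sketch of this analogy (the toy observations are an assumption): the log likelihood as a function of σ², which peaks at the usual estimate mean(y.^2).

% Log likelihood of a zero-mean Gaussian as a function of sigma^2.
y = [0.4; -1.1; 0.7; 2.0; -0.3];
n = numel(y);
s2 = linspace(0.1, 4, 400);                         % grid of variances
ll = -0.5 * sum(y.^2) ./ s2 - 0.5 * n * log(s2) - n/2 * log(2*pi);
[llmax, idx] = max(ll);
disp([s2(idx), mean(y.^2)]);                        % grid maximizer vs. mean(y.^2)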


From finite linear models to Gaussian processes (1)

Finite linear model with Gaussian priors on the weights:

f(x) = \sum_{m=1}^{M} w_m \phi_m(x), \qquad p(w) = N(w; 0, A)

The joint distribution of any f = [f(x_1), \ldots, f(x_N)]^\top is a multivariate Gaussian; this looks like a Gaussian Process!

The prior p(f) is fully characterized by the mean and covariance functions.

m(x) = E_w\big(f(x)\big) = \int \Big(\sum_{m=1}^{M} w_m \phi_m(x)\Big) p(w)\, dw = \sum_{m=1}^{M} \phi_m(x) \int w_m\, p(w)\, dw = \sum_{m=1}^{M} \phi_m(x) \int w_m\, p(w_m)\, dw_m = 0

The mean function is zero.


From finite linear models to Gaussian processes (2)

Covariance function of a finite linear model

f(x) = \sum_{m=1}^{M} w_m \phi_m(x) = w^\top \phi(x), \qquad p(w) = N(w; 0, A), \qquad \phi(x) = [\phi_1(x), \ldots, \phi_M(x)]^\top \;\; (M \times 1)

k(x_i, x_j) = \operatorname{Cov}_w\big(f(x_i), f(x_j)\big) = E_w\big(f(x_i) f(x_j)\big) - \underbrace{E_w\big(f(x_i)\big) E_w\big(f(x_j)\big)}_{0}

= \int \cdots \int \Big(\sum_{k=1}^{M} \sum_{l=1}^{M} w_k w_l\, \phi_k(x_i) \phi_l(x_j)\Big) p(w)\, dw

= \sum_{k=1}^{M} \sum_{l=1}^{M} \phi_k(x_i) \phi_l(x_j) \underbrace{\iint w_k w_l\, p(w_k, w_l)\, dw_k\, dw_l}_{A_{kl}} = \sum_{k=1}^{M} \sum_{l=1}^{M} A_{kl}\, \phi_k(x_i) \phi_l(x_j)

k(x_i, x_j) = \phi(x_i)^\top A\, \phi(x_j)

Note: If A = \sigma_w^2 I then k(x_i, x_j) = \sigma_w^2 \sum_{k=1}^{M} \phi_k(x_i) \phi_k(x_j) = \sigma_w^2\, \phi(x_i)^\top \phi(x_j)
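This covariance can be checked by simulation. A minimal Octave sketch (the basis, the inputs and A are illustrative assumptions): draw many weight vectors from N(0, A), evaluate f at two inputs, and compare the empirical covariance with φ(x_i)ᵀ A φ(x_j).

% Empirical check that Cov(f(xi), f(xj)) = phi(xi)' A phi(xj)
% for f(x) = w' phi(x), w ~ N(0, A). Everything here is illustrative.
M = 4;  S = 1e5;
A = 0.5 * eye(M) + 0.5 * ones(M);          % an example positive definite A
phi = @(x) [1; x; x^2; x^3];               % an assumed basis, M = 4
xi = 0.5;  xj = -1.0;
W = chol(A)' * randn(M, S);                % S weight samples with covariance A
fi = phi(xi)' * W;                         % sampled f(xi), 1 x S
fj = phi(xj)' * W;                         % sampled f(xj), 1 x S
disp([mean(fi .* fj), phi(xi)' * A * phi(xj)]);   % empirical vs. analytic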


GPs and Linear in the parameters models are equivalent

We’ve seen that a linear-in-the-parameters model with a Gaussian prior on the weights is also a GP.

Note the different computational complexity: GP: O(N^3), linear model: O(NM^2), where M is the number of basis functions and N the number of training cases.

So, which representation is most efficient?

Might it also be the case that every GP corresponds to a linear-in-the-parameters model? (Mercer’s theorem.)


From infinite linear models to Gaussian processes

Consider the class of functions (sums of squared exponentials):

f(x) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=-\infty}^{\infty} \gamma_n \exp\big(-(x - \tfrac{n}{N})^2\big), \quad \text{where } \gamma_n \sim N(0, 1), \;\forall n

= \int_{-\infty}^{\infty} \gamma(u) \exp\big(-(x - u)^2\big)\, du, \quad \text{where } \gamma(u) \sim N(0, 1), \;\forall u.

The mean function is:

\mu(x) = E[f(x)] = \int_{-\infty}^{\infty} \exp\big(-(x - u)^2\big) \int_{-\infty}^{\infty} \gamma(u)\, p(\gamma(u))\, d\gamma(u)\, du = 0,

and the covariance function:

E[f(x) f(x')] = \int \exp\big(-(x - u)^2 - (x' - u)^2\big)\, du

= \int \exp\Big(-2\big(u - \tfrac{x + x'}{2}\big)^2 + \tfrac{(x + x')^2}{2} - x^2 - x'^2\Big)\, du \;\propto\; \exp\Big(-\frac{(x - x')^2}{2}\Big).

Thus, the squared exponential covariance function is equivalent to regression using infinitely many Gaussian shaped basis functions placed everywhere, not just at your training points!
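This limit can also be checked numerically. The Octave sketch below approximates the integral with a fine grid of Gaussian bumps and compares the sampled covariance ratio E[f(x)f(x')]/E[f(x)^2] with exp(−(x − x')²/2); the grid, range and sample count are assumptions.

% Approximate f(x) = int gamma(u) exp(-(x-u)^2) du on a fine grid of
% centres and check the squared exponential covariance (up to a constant).
u  = linspace(-10, 10, 401)';  du = u(2) - u(1);
x1 = 0.3;  x2 = 1.7;  S = 2e4;
phi1 = exp(-(x1 - u).^2);  phi2 = exp(-(x2 - u).^2);   % Gaussian bumps at x1, x2
f1 = zeros(1, S);  f2 = zeros(1, S);
for s = 1:S
  g = randn(numel(u), 1);                  % gamma(u) ~ N(0, 1) on the grid
  f1(s) = du * (phi1' * g);
  f2(s) = du * (phi2' * g);
end
c12 = mean(f1 .* f2);  c11 = mean(f1 .* f1);
disp([c12 / c11, exp(-0.5 * (x1 - x2)^2)]);   % ratio matches the SE covariance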


Using finitely many basis functions may be dangerous! (1)

Finite linear model with 5 localized basis functions

[Figure: three panels over x ∈ [−5, 5], function values roughly within ±3.]

Gaussian process with infinitely many localized basis functions

[Figure: three panels over x ∈ [−5, 5], function values roughly within ±3.]


Using finitely many basis functions may be dangerous! (2)

Finite linear model with 5 localized basis functions

[Figure: three panels over x ∈ [−5, 5], function values roughly within ±3.]

Gaussian process with infinitely many localized basis functions

[Figure: three panels over x ∈ [−5, 5], function values roughly within ±3.]


Using finitely many basis functions may be dangerous! (3)

Finite linear model with 5 localized basis functions

[Figure: three panels over x ∈ [−5, 5], function values roughly within ±3.]

Gaussian process with infinitely many localized basis functions

[Figure: three panels over x ∈ [−5, 5], function values roughly within ±3.]


Matrix and Gaussian identities cheat sheet

Matrix identities

• Matrix inversion lemma (Woodbury, Sherman & Morrison formula)

(Z + U W V^\top)^{-1} = Z^{-1} - Z^{-1} U (W^{-1} + V^\top Z^{-1} U)^{-1} V^\top Z^{-1}

• A similar equation exists for determinants

|Z + U W V^\top| = |Z|\, |W|\, |W^{-1} + V^\top Z^{-1} U|
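Both identities are easy to verify numerically; a minimal Octave sketch with random matrices (the sizes are arbitrary assumptions):

% Numerical check of the matrix inversion and determinant lemmas.
n = 5;  m = 2;
Z = randn(n);  Z = Z * Z' + eye(n);        % a positive definite Z
W = eye(m);  U = randn(n, m);  V = randn(n, m);
lhs = inv(Z + U * W * V');
rhs = inv(Z) - inv(Z) * U * inv(inv(W) + V' * inv(Z) * U) * V' * inv(Z);
disp(max(abs(lhs(:) - rhs(:))));           % should be near machine precision
d1 = det(Z + U * W * V');
d2 = det(Z) * det(W) * det(inv(W) + V' * inv(Z) * U);
disp(abs(d1 - d2));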

The product of two Gaussian density functions

N(x|a, A)\, N(P^\top x|b, B) = z_c\, N(x|c, C)

• is proportional to a Gaussian density function with covariance and mean

C = \big(A^{-1} + P B^{-1} P^\top\big)^{-1}, \qquad c = C\big(A^{-1} a + P B^{-1} b\big)

• and has a normalizing constant z_c that is Gaussian both in a and in b

z_c = (2\pi)^{-\frac{m}{2}} |B + P^\top A P|^{-\frac{1}{2}} \exp\Big(-\tfrac{1}{2} (b - P^\top a)^\top \big(B + P^\top A P\big)^{-1} (b - P^\top a)\Big)