GP CVPR'12, Session 1: GP and Regression
Urtasun and Lawrence


Outline

1 The Gaussian Density
2 Covariance from Basis Functions
3 Basis Function Representations
4 Constructing Covariance
5 GP Limitations
6 Conclusions


The Gaussian Density

Perhaps the most common probability density:

$$p(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) = \mathcal{N}\left(y \mid \mu, \sigma^2\right),$$

the Gaussian density.

Gaussian Density

Figure: The Gaussian PDF $p(h \mid \mu, \sigma^2)$ plotted against height $h$/m, with $\mu = 1.7$ and variance $\sigma^2 = 0.0225$. Mean shown as red line. It could represent the heights of a population of students.

Gaussian Density

$$\mathcal{N}\left(y \mid \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$
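As a small added illustration (not part of the original slides), a NumPy sketch of this density; the numerical integral checks the normalization:

```python
import numpy as np

def gaussian_density(y, mu, sigma2):
    """Evaluate N(y | mu, sigma2), the univariate Gaussian density."""
    return np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# The height example from the slides: mu = 1.7, sigma^2 = 0.0225.
h = np.linspace(1.0, 2.4, 2001)
p = gaussian_density(h, mu=1.7, sigma2=0.0225)
print(np.trapz(p, h))  # ~1.0: the density integrates to one
```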

Two Important Gaussian Properties

1 Sum of (independent) Gaussian variables is also Gaussian:
$$y_i \sim \mathcal{N}\left(\mu_i, \sigma_i^2\right) \quad \Rightarrow \quad \sum_{i=1}^{n} y_i \sim \mathcal{N}\left(\sum_{i=1}^{n} \mu_i,\; \sum_{i=1}^{n} \sigma_i^2\right)$$
(Aside: as the number of terms grows, the sum of non-Gaussian, finite-variance variables also tends to a Gaussian [central limit theorem].)

2 Scaling a Gaussian leads to a Gaussian:
$$y \sim \mathcal{N}\left(\mu, \sigma^2\right) \quad \Rightarrow \quad wy \sim \mathcal{N}\left(w\mu, w^2\sigma^2\right)$$
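A quick Monte Carlo check of both properties (an illustration added here, not in the slides); the empirical mean and variance of the sum and of the scaled variable should match the stated Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)
mus, sigma2s = np.array([1.0, -2.0, 0.5]), np.array([0.3, 1.2, 2.0])

# Property 1: sum of independent Gaussians.
samples = rng.normal(mus, np.sqrt(sigma2s), size=(100000, 3))
total = samples.sum(axis=1)
print(total.mean(), mus.sum())      # both ~ -0.5
print(total.var(), sigma2s.sum())   # both ~ 3.5

# Property 2: scaling by w.
w = 3.0
y = rng.normal(1.0, np.sqrt(2.0), size=100000)
print((w * y).mean(), w * 1.0)      # both ~ 3.0
print((w * y).var(), w**2 * 2.0)    # both ~ 18.0
```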


Two Simultaneous Equations

A system of two simultaneous equations with two unknowns:

$$y_1 = mx_1 + c$$
$$y_2 = mx_2 + c$$

Subtracting,
$$y_1 - y_2 = m(x_1 - x_2),$$
so
$$\frac{y_1 - y_2}{x_1 - x_2} = m,$$
giving
$$m = \frac{y_2 - y_1}{x_2 - x_1}, \qquad c = y_1 - mx_1.$$

Figure: the points $(x_1, y_1)$ and $(x_2, y_2)$ on the line $y = mx + c$, showing the intercept $c$ and the slope $m = \frac{y_2 - y_1}{x_2 - x_1}$.

Two Simultaneous Equations

How do we deal with three simultaneous equations with only two unknowns?

$$y_1 = mx_1 + c$$
$$y_2 = mx_2 + c$$
$$y_3 = mx_3 + c$$

Overdetermined System

With two unknowns and two observations:
$$y_1 = mx_1 + c, \qquad y_2 = mx_2 + c$$

An additional observation leads to an overdetermined system:
$$y_3 = mx_3 + c$$

This problem is solved through a noise model $\epsilon \sim \mathcal{N}\left(0, \sigma^2\right)$:
$$y_1 = mx_1 + c + \epsilon_1$$
$$y_2 = mx_2 + c + \epsilon_2$$
$$y_3 = mx_3 + c + \epsilon_3$$


Noise Models

We aren't modeling the entire system.
The noise model gives the mismatch between model and data.
The Gaussian model is justified by appeal to the central limit theorem.
Other models are also possible (Student-$t$ for heavy tails).
Maximum likelihood with Gaussian noise leads to least squares.
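To make that last link concrete, a short sketch (with made-up data values) fitting $m$ and $c$ by least squares, which is the maximum likelihood solution under Gaussian noise:

```python
import numpy as np

# Hypothetical noisy observations of y = m*x + c.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.9, 5.1, 6.9])

# Design matrix with a column of ones for the intercept c.
X = np.column_stack([x, np.ones_like(x)])
(m, c), *_ = np.linalg.lstsq(X, y, rcond=None)
print(m, c)  # least squares = maximum likelihood under Gaussian noise
```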


Underdetermined System

What about two unknowns and one observation?
$$y_1 = mx_1 + c$$

Underdetermined System

Can compute $m$ given $c$:
$$m = \frac{y_1 - c}{x_1}$$

For example (with $x_1 = 1$, $y_1 = 3$):
$c = 1.75 \Rightarrow m = 1.25$; $c = -0.777 \Rightarrow m = 3.78$; $c = -4.01 \Rightarrow m = 7.01$; $c = 2.45 \Rightarrow m = 0.545$; $c = -0.657 \Rightarrow m = 3.66$; $c = -3.13 \Rightarrow m = 6.13$.

Assume
$$c \sim \mathcal{N}(0, 4);$$
then we find a distribution of solutions.
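A sketch of that construction (illustrative, using $x_1 = 1$ and $y_1 = 3$ as in the worked numbers above): sample $c$ from its prior and propagate each sample through $m = (y_1 - c)/x_1$ to get a distribution over slopes:

```python
import numpy as np

rng = np.random.default_rng(1)
x1, y1 = 1.0, 3.0                       # the single observation
c = rng.normal(0.0, 2.0, size=10000)    # c ~ N(0, 4): std dev = 2
m = (y1 - c) / x1                       # each prior sample of c implies a slope m
print(m.mean(), m.var())                # ~3.0 and ~4.0: here m ~ N(3, 4)
```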

Probability for Under- and Overdetermined

To deal with the overdetermined system we introduced a probability distribution for the variable, $\epsilon_i$.
For the underdetermined system we introduced a probability distribution for the parameter, $c$.
This is known as a Bayesian treatment.

For general Bayesian inference we need multivariate priors. E.g. for multivariate linear regression:
$$y_i = \sum_j w_j x_{i,j} + \epsilon_i = \mathbf{w}^\top \mathbf{x}_{i,:} + \epsilon_i$$
(where we've dropped $c$ for convenience), we need a prior over $\mathbf{w}$.
This motivates a multivariate Gaussian density.
We will use the multivariate Gaussian to put a prior directly on the function (a Gaussian process).

Multivariate Regression Likelihood

Recall the multivariate regression likelihood:
$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - \mathbf{w}^\top \mathbf{x}_{i,:}\right)^2\right)$$

Now use a multivariate Gaussian prior:
$$p(\mathbf{w}) = \frac{1}{(2\pi\alpha)^{p/2}} \exp\left(-\frac{1}{2\alpha} \mathbf{w}^\top \mathbf{w}\right)$$

Posterior Density

Once again we want to know the posterior:
$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})$$

And we can compute it by completing the square:
$$\log p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} y_i^2 + \frac{1}{\sigma^2} \sum_{i=1}^{n} y_i \mathbf{x}_{i,:}^\top \mathbf{w} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \mathbf{w}^\top \mathbf{x}_{i,:} \mathbf{x}_{i,:}^\top \mathbf{w} - \frac{1}{2\alpha} \mathbf{w}^\top \mathbf{w} + \text{const.}$$

$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = \mathcal{N}(\mathbf{w} \mid \bar{\mathbf{w}}, \mathbf{C}_w), \qquad \mathbf{C}_w = \left(\sigma^{-2} \mathbf{X}^\top \mathbf{X} + \alpha^{-1} \mathbf{I}\right)^{-1} \text{ and } \bar{\mathbf{w}} = \mathbf{C}_w \sigma^{-2} \mathbf{X}^\top \mathbf{y}$$

Bayesian vs Maximum Likelihood

Note the similarity between the posterior mean
$$\bar{\mathbf{w}} = \left(\sigma^{-2}\mathbf{X}^\top\mathbf{X} + \alpha^{-1}\mathbf{I}\right)^{-1} \sigma^{-2} \mathbf{X}^\top \mathbf{y}$$
and the maximum likelihood solution
$$\hat{\mathbf{w}} = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}$$
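A compact sketch (with illustrative synthetic data) computing both estimates; as $\alpha \to \infty$ the prior term vanishes and the posterior mean approaches the maximum likelihood solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2, alpha = 50, 3, 0.1, 1.0
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + rng.normal(0.0, np.sqrt(sigma2), size=n)

# Posterior: C_w = (X^T X / sigma^2 + I / alpha)^{-1}, w_bar = C_w X^T y / sigma^2
C_w = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / alpha)
w_bar = C_w @ X.T @ y / sigma2

# Maximum likelihood: w_ml = (X^T X)^{-1} X^T y
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(w_bar, w_ml)  # close; the prior shrinks w_bar slightly toward zero
```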

Marginal Likelihood is Computed as Normalizer

$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{X})\, p(\mathbf{y} \mid \mathbf{X}) = p(\mathbf{y} \mid \mathbf{w}, \mathbf{X})\, p(\mathbf{w})$$

Two Dimensional Gaussian

Consider height, $h$/m, and weight, $w$/kg.
Could sample height from a distribution:
$$p(h) = \mathcal{N}(1.7,\, 0.0225)$$
And similarly weight:
$$p(w) = \mathcal{N}(75,\, 36)$$

Height and Weight Models

Figure: Marginal Gaussian distributions $p(h)$ and $p(w)$ for height ($h$/m) and weight ($w$/kg).

Sampling Two Dimensional Variables

Figure: Joint distribution over $w$/kg and $h$/m, alongside the marginal distributions $p(h)$ and $p(w)$. Sample height and weight one after the other and plot against each other.

Figure: Many samples from the joint distribution over $w$/kg and $h$/m, plotted together with the marginal distributions $p(h)$ and $p(w)$.

Independent Gaussians

$$p(w, h) = p(w)\, p(h)$$

$$p(w, h) = \frac{1}{\sqrt{2\pi\sigma_1^2}\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2}\left(\frac{(w - \mu_1)^2}{\sigma_1^2} + \frac{(h - \mu_2)^2}{\sigma_2^2}\right)\right)$$

$$p(w, h) = \frac{1}{2\pi\sqrt{\sigma_1^2 \sigma_2^2}} \exp\left(-\frac{1}{2} \left(\begin{bmatrix} w \\ h \end{bmatrix} - \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}\right)^{\!\top} \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}^{-1} \left(\begin{bmatrix} w \\ h \end{bmatrix} - \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}\right)\right)$$

$$p(\mathbf{y}) = \frac{1}{2\pi |\mathbf{D}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top \mathbf{D}^{-1} (\mathbf{y} - \boldsymbol{\mu})\right)$$

Correlated Gaussian

Form a correlated Gaussian from the original by rotating the data space using matrix $\mathbf{R}$:

$$p(\mathbf{y}) = \frac{1}{2\pi |\mathbf{D}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{R}^\top\mathbf{y} - \mathbf{R}^\top\boldsymbol{\mu})^\top \mathbf{D}^{-1} (\mathbf{R}^\top\mathbf{y} - \mathbf{R}^\top\boldsymbol{\mu})\right)$$

$$p(\mathbf{y}) = \frac{1}{2\pi |\mathbf{D}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top \mathbf{R}\mathbf{D}^{-1}\mathbf{R}^\top (\mathbf{y} - \boldsymbol{\mu})\right)$$

This gives a covariance matrix:
$$\mathbf{C}^{-1} = \mathbf{R} \mathbf{D}^{-1} \mathbf{R}^\top$$

$$p(\mathbf{y}) = \frac{1}{2\pi |\mathbf{C}|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^\top \mathbf{C}^{-1} (\mathbf{y} - \boldsymbol{\mu})\right)$$

This gives a covariance matrix:
$$\mathbf{C} = \mathbf{R} \mathbf{D} \mathbf{R}^\top$$
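A minimal sketch of this construction (rotation angle and variances are illustrative): build $\mathbf{C} = \mathbf{R}\mathbf{D}\mathbf{R}^\top$ from a rotation $\mathbf{R}$ and diagonal $\mathbf{D}$, sample, and check the empirical covariance:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.6                          # an arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
D = np.diag([0.0225, 36.0])          # independent variances (height, weight)
C = R @ D @ R.T                      # covariance of the correlated Gaussian

mu = np.array([1.7, 75.0])
samples = rng.multivariate_normal(mu, C, size=100000)
print(np.cov(samples.T))             # ~C: empirical covariance matches R D R^T
```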

Recall Univariate Gaussian Properties

1 Sum of Gaussian variables is also Gaussian:
$$y_i \sim \mathcal{N}\left(\mu_i, \sigma_i^2\right) \quad \Rightarrow \quad \sum_{i=1}^{n} y_i \sim \mathcal{N}\left(\sum_{i=1}^{n} \mu_i,\; \sum_{i=1}^{n} \sigma_i^2\right)$$

2 Scaling a Gaussian leads to a Gaussian:
$$y \sim \mathcal{N}\left(\mu, \sigma^2\right) \quad \Rightarrow \quad wy \sim \mathcal{N}\left(w\mu, w^2\sigma^2\right)$$

Multivariate Consequence

If
$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
and
$$\mathbf{y} = \mathbf{W}\mathbf{x},$$
then
$$\mathbf{y} \sim \mathcal{N}\left(\mathbf{W}\boldsymbol{\mu},\; \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^\top\right)$$

Sampling a Function

Multi-variate Gaussians

We will consider a Gaussian with a particular structure of covariance matrix.
Generate a single sample from this 25-dimensional Gaussian distribution, $\mathbf{f} = [f_1, f_2, \ldots, f_{25}]$.
We will plot these points against their index.

Figure: A sample from a 25 dimensional Gaussian distribution. (a) A 25 dimensional correlated random variable, values plotted against index $i$. (b) Colormap showing correlations between dimensions $i$ and $j$.
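A sketch reproducing this kind of sample (it assumes the exponentiated quadratic covariance introduced later in the session; the input spacing and length-scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 25)                      # 25 input locations (the index axis)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * 0.1**2))  # EQ covariance
K += 1e-10 * np.eye(25)                        # jitter for numerical stability
f = rng.multivariate_normal(np.zeros(25), K)   # one 25-dimensional sample
print(np.round(f, 2))                          # plot these against their index
```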


Prediction of f2 from f1

$$\mathbf{K} = \begin{bmatrix} 1 & 0.96587 \\ 0.96587 & 1 \end{bmatrix}$$

Figure: A single contour of the Gaussian density represents the joint distribution, $p(f_1, f_2)$.
We observe that $f_1 = 0.313$. Conditional density: $p(f_2 \mid f_1 = 0.313)$.


Prediction with Correlated Gaussians

Prediction of $f_2$ from $f_1$ requires the conditional density. The conditional density is also Gaussian:
$$p(f_2 \mid f_1) = \mathcal{N}\left(f_2 \,\middle|\, \frac{k_{1,2}}{k_{1,1}} f_1,\; k_{2,2} - \frac{k_{1,2}^2}{k_{1,1}}\right)$$
where the covariance of the joint density is given by
$$\mathbf{K} = \begin{bmatrix} k_{1,1} & k_{1,2} \\ k_{2,1} & k_{2,2} \end{bmatrix}$$
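The formula in code (a sketch using the slides' numbers for $p(f_1, f_2)$):

```python
import numpy as np

k11, k12, k22 = 1.0, 0.96587, 1.0
f1 = 0.313                        # the observed value from the slides
mean = k12 / k11 * f1             # conditional mean of f2
var = k22 - k12**2 / k11          # conditional variance of f2
print(mean, var)                  # strong correlation => small conditional variance
```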

Prediction of f5 from f1

$$\mathbf{K} = \begin{bmatrix} 1 & 0.57375 \\ 0.57375 & 1 \end{bmatrix}$$

Figure: A single contour of the Gaussian density represents the joint distribution, $p(f_1, f_5)$.
We observe that $f_1 = 0.313$. Conditional density: $p(f_5 \mid f_1 = 0.313)$.

Prediction with Correlated Gaussians

Prediction of $\mathbf{f}_*$ from $\mathbf{f}$ requires the multivariate conditional density. The multivariate conditional density is also Gaussian:
$$p(\mathbf{f}_* \mid \mathbf{f}) = \mathcal{N}\left(\mathbf{f}_* \,\middle|\, \mathbf{K}_{*,\mathbf{f}} \mathbf{K}_{\mathbf{f},\mathbf{f}}^{-1} \mathbf{f},\; \mathbf{K}_{*,*} - \mathbf{K}_{*,\mathbf{f}} \mathbf{K}_{\mathbf{f},\mathbf{f}}^{-1} \mathbf{K}_{\mathbf{f},*}\right)$$
Here the covariance of the joint density is given by
$$\mathbf{K} = \begin{bmatrix} \mathbf{K}_{\mathbf{f},\mathbf{f}} & \mathbf{K}_{\mathbf{f},*} \\ \mathbf{K}_{*,\mathbf{f}} & \mathbf{K}_{*,*} \end{bmatrix}$$
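A sketch of the multivariate conditional (the observed values are hypothetical, and the exponentiated quadratic covariance from the next slide is assumed):

```python
import numpy as np

def eq_cov(a, b, alpha=1.0, ell=2.0):
    """Exponentiated quadratic covariance between 1-D input vectors a and b."""
    return alpha * np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

x  = np.array([-3.0, 1.2, 1.4])     # training inputs (as in the worked example)
xs = np.array([0.0, 2.0])           # test inputs
f  = np.array([-0.2, 0.7, 0.75])    # hypothetical observed function values

Kff = eq_cov(x, x) + 1e-10 * np.eye(len(x))
Ksf, Kss = eq_cov(xs, x), eq_cov(xs, xs)
mean = Ksf @ np.linalg.solve(Kff, f)              # K_{*,f} K_{f,f}^{-1} f
cov  = Kss - Ksf @ np.linalg.solve(Kff, Ksf.T)    # posterior covariance
print(mean, np.diag(cov))
```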

Covariance Functions

Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian):
$$k(\mathbf{x}, \mathbf{x}') = \alpha \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|_2^2}{2\ell^2}\right)$$

The covariance matrix is built using the inputs to the function, $\mathbf{x}$.
For the example above it was based on Euclidean distance.
The covariance function is also known as a kernel.

A worked example: $x_1 = -3.0$, $x_2 = 1.20$, and $x_3 = 1.40$ with $\ell = 2.00$ and $\alpha = 1.00$. Entry by entry, e.g.
$$k_{2,1} = 1.00 \times \exp\left(-\frac{(1.20 - (-3.0))^2}{2 \times 2.00^2}\right) = 0.110,$$
the full covariance matrix is
$$\mathbf{K} = \begin{bmatrix} 1.00 & 0.110 & 0.0889 \\ 0.110 & 1.00 & 0.995 \\ 0.0889 & 0.995 & 1.00 \end{bmatrix}$$

Adding a fourth input, $x_4 = 2.0$ (still with $\ell = 2.0$ and $\alpha = 1.0$), gives
$$\mathbf{K} = \begin{bmatrix} 1.0 & 0.11 & 0.089 & 0.044 \\ 0.11 & 1.0 & 1.0 & 0.92 \\ 0.089 & 1.0 & 1.0 & 0.96 \\ 0.044 & 0.92 & 0.96 & 1.0 \end{bmatrix}$$

With the same three inputs but $\ell = 5.00$ and $\alpha = 4.00$:
$$\mathbf{K} = \begin{bmatrix} 4.00 & 2.81 & 2.72 \\ 2.81 & 4.00 & 4.00 \\ 2.72 & 4.00 & 4.00 \end{bmatrix}$$
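A sketch reproducing these matrices numerically:

```python
import numpy as np

def eq_kernel(x, alpha, ell):
    """Exponentiated quadratic covariance matrix for 1-D inputs x."""
    sq_dist = (x[:, None] - x[None, :])**2
    return alpha * np.exp(-sq_dist / (2 * ell**2))

x = np.array([-3.0, 1.2, 1.4])
print(np.round(eq_kernel(x, alpha=1.0, ell=2.0), 3))  # [[1, 0.110, 0.089], ...]
print(np.round(eq_kernel(x, alpha=4.0, ell=5.0), 2))  # [[4, 2.81, 2.72], ...]
```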

Outline

1 The Gaussian Density
2 Covariance from Basis Functions
3 Basis Function Representations
4 Constructing Covariance
5 GP Limitations
6 Conclusions

Basis Function Form

Radial basis functions commonly have the form
$$\phi_k(x_i) = \exp\left(-\frac{|x_i - \mu_k|^2}{2\ell^2}\right).$$

The basis function maps data into a feature space in which a linear sum is a nonlinear function.

Figure: A set of radial basis functions $\phi(x)$ with width $\ell = 2$ and location parameters $\boldsymbol{\mu} = [-4\;\, 0\;\, 4]^\top$.

Random Functions

Functions derived using
$$f(x) = \sum_{k=1}^{m} w_k \phi_k(x),$$
where the weights are sampled from a Gaussian density,
$$w_k \sim \mathcal{N}(0, \alpha).$$

Figure: Functions sampled using the basis set from the previous figure. Each line is a separate sample, generated by a weighted sum of the basis set. The weights $w_k$ are sampled from a Gaussian density with variance $\alpha = 1$.
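A sketch of this sampling procedure, using the basis from the previous slide ($\boldsymbol{\mu} = [-4, 0, 4]$, $\ell = 2$, $\alpha = 1$):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, ell, alpha = np.array([-4.0, 0.0, 4.0]), 2.0, 1.0
x = np.linspace(-8, 8, 200)

Phi = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * ell**2))  # n x m basis matrix
W = rng.normal(0.0, np.sqrt(alpha), size=(3, 5))  # five weight draws, w_k ~ N(0, alpha)
F = Phi @ W                            # each column is one sampled function f(x)
print(F.shape)                         # (200, 5): plot each column against x
```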

Direct Construction of Covariance Matrix

Use matrix notation to write the function,
$$f(x_i; \mathbf{w}) = \sum_{k=1}^{m} w_k \phi_k(x_i),$$
which computed at the training data gives a vector
$$\mathbf{f} = \boldsymbol{\Phi} \mathbf{w}.$$

$\mathbf{w}$ and $\mathbf{f}$ are only related by an inner product.
$\boldsymbol{\Phi}$ is fixed and non-stochastic for a given training set.
$\mathbf{f}$ is Gaussian distributed.
It is therefore straightforward to compute the distribution for $\mathbf{f}$.

Expectations

We use $\langle \cdot \rangle$ to denote expectations under prior distributions.
We have
$$\langle \mathbf{f} \rangle = \boldsymbol{\Phi} \langle \mathbf{w} \rangle.$$
The prior mean of $\mathbf{w}$ was zero, giving
$$\langle \mathbf{f} \rangle = \mathbf{0}.$$
The prior covariance of $\mathbf{f}$ is
$$\mathbf{K} = \langle \mathbf{f}\mathbf{f}^\top \rangle - \langle \mathbf{f} \rangle \langle \mathbf{f} \rangle^\top,$$
with
$$\langle \mathbf{f}\mathbf{f}^\top \rangle = \boldsymbol{\Phi} \langle \mathbf{w}\mathbf{w}^\top \rangle \boldsymbol{\Phi}^\top,$$
giving
$$\mathbf{K} = \alpha \boldsymbol{\Phi} \boldsymbol{\Phi}^\top.$$

Covariance between Two Points

The prior covariance between two points $x_i$ and $x_j$ is
$$k(x_i, x_j) = \alpha \sum_{k=1}^{m} \phi_k(x_i)\, \phi_k(x_j),$$
or in vector form,
$$k(x_i, x_j) = \alpha\, \boldsymbol{\phi}_:(x_i)^\top \boldsymbol{\phi}_:(x_j).$$
For the radial basis used this gives
$$k(x_i, x_j) = \alpha \sum_{k=1}^{m} \exp\left(-\frac{|x_i - \mu_k|^2 + |x_j - \mu_k|^2}{2\ell^2}\right).$$

Selecting Number and Location of Basis

Need to choose
1 location of centers
2 number of basis functions

Consider uniform spacing over a region:
$$k(x_i, x_j) = \alpha \sum_{k=1}^{m} \exp\left(-\frac{x_i^2 + x_j^2 - 2\mu_k(x_i + x_j) + 2\mu_k^2}{2\ell^2}\right).$$

Uniform Basis Functions

Set each center location to
$$\mu_k = a + \Delta\mu \cdot (k - 1).$$
Specifying the bases in terms of their indices,
$$k(x_i, x_j) = \alpha \sum_{k=1}^{m} \exp\left(-\frac{x_i^2 + x_j^2}{2\ell^2} + \frac{2(a + \Delta\mu \cdot k)(x_i + x_j) - 2(a + \Delta\mu \cdot k)^2}{2\ell^2}\right).$$

Infinite Basis Functions

Take \mu_1 = a and \mu_m = b, so b = a + \Delta\mu \, (m - 1).

Take the limit as \Delta\mu \to 0, so m \to \infty:

k(x_i, x_j) = \alpha' \int_a^b \exp\left( -\frac{x_i^2 + x_j^2}{2\ell^2} + \frac{\frac{1}{2}(x_i + x_j)^2 - 2\left(\mu - \frac{1}{2}(x_i + x_j)\right)^2}{2\ell^2} \right) \mathrm{d}\mu,

where we have used \mu_k \to \mu and scaled the prior variance as \alpha = \alpha' \Delta\mu so that the sum becomes an integral.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 46 / 74

Result

Performing the integration leads to

k(x_i, x_j) = \alpha' \frac{\sqrt{\pi \ell^2}}{2} \exp\left( -\frac{(x_i - x_j)^2}{4\ell^2} \right) \left[ \mathrm{erf}\left( \frac{b - \frac{1}{2}(x_i + x_j)}{\ell} \right) - \mathrm{erf}\left( \frac{a - \frac{1}{2}(x_i + x_j)}{\ell} \right) \right].

Now take the limit as a \to -\infty and b \to \infty:

k(x_i, x_j) = \alpha \exp\left( -\frac{(x_i - x_j)^2}{4\ell^2} \right),

where \alpha = \alpha' \sqrt{\pi \ell^2}.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 47 / 74
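The limit can be checked numerically; in this illustrative sketch the finite sum with densely packed uniform centres and \alpha = \alpha' \Delta\mu is compared against the limiting exponentiated quadratic:

import numpy as np

ell, alpha_prime = 1.0, 1.0
a, b, m = -20.0, 20.0, 4001          # wide region, densely packed centres (assumed)
mu = np.linspace(a, b, m)
dmu = mu[1] - mu[0]
alpha = alpha_prime * dmu            # scale the prior variance with the spacing

def k_sum(xi, xj):
    # Finite-m covariance: alpha * sum_k phi_k(xi) phi_k(xj).
    return alpha * np.sum(np.exp(-((xi - mu) ** 2 + (xj - mu) ** 2) / (2 * ell ** 2)))

def k_eq(xi, xj):
    # Limiting exponentiated quadratic with alpha = alpha' * sqrt(pi * ell^2).
    return alpha_prime * np.sqrt(np.pi * ell ** 2) * np.exp(-((xi - xj) ** 2) / (4 * ell ** 2))

print(k_sum(0.3, -0.7), k_eq(0.3, -0.7))   # the two values agree closely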

Infinite Feature Space

An RBF model with infinite basis functions is a Gaussian process.

The covariance function is the exponentiated quadratic.

Note: the functional forms of the covariance function and the basis functions are similar.

this is a special case,

in general they are very different.

Similar results can be obtained for multi-dimensional input networks (Neal, 1996; Williams, 1998).

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 48 / 74

The Parametric Bottleneck

Parametric models have a representation that does not respond to increasing training set size.

Bayesian posterior distributions over parameters contain the information about the training data.

Use Bayes' rule from training data, p(\mathbf{w} | \mathbf{y}, \mathbf{X}).

Make predictions on test data,

p(y^* | \mathbf{x}^*, \mathbf{y}, \mathbf{X}) = \int p(y^* | \mathbf{w}, \mathbf{x}^*) \, p(\mathbf{w} | \mathbf{y}, \mathbf{X}) \, \mathrm{d}\mathbf{w}.

\mathbf{w} becomes a bottleneck for information about the training set to pass to the test set.

Solution: increase m so that the bottleneck is so large that it no longer presents a problem.

How big is big enough for m? Non-parametrics says m \to \infty.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 50 / 74
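A compact sketch of this pipeline for the Gaussian case (all names and values illustrative): the posterior over w is computed once from the training set, and every prediction must then pass through those m numbers.

import numpy as np

def rbf_basis(x, mu, ell=1.0):
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(2)
mu = np.linspace(-2, 2, 10)                 # m = 10 basis functions (assumed)
alpha, noise = 1.0, 0.1
x = rng.uniform(-2, 2, 30)
y = np.sin(2 * x) + rng.normal(0, np.sqrt(noise), 30)

Phi = rbf_basis(x, mu)
# Gaussian posterior p(w | y, X): covariance and mean in closed form.
S = np.linalg.inv(Phi.T @ Phi / noise + np.eye(len(mu)) / alpha)
w_mean = S @ Phi.T @ y / noise

# Predictions use only (w_mean, S): the m-dimensional w is the bottleneck.
x_star = np.linspace(-2, 2, 5)
Phi_star = rbf_basis(x_star, mu)
f_mean = Phi_star @ w_mean
f_var = np.sum(Phi_star @ S * Phi_star, axis=1) + noise
print(f_mean, f_var)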

The Parametric Bottleneck

Now no longer possible to manipulate the model through the standard parametric form given in (1).

However, it is possible to express parametric models as GPs:

k(x_i, x_j) = \alpha \, \phi(x_i)^\top \phi(x_j).

These are known as degenerate covariance matrices.

Their rank is at most m; non-parametric models have full rank covariance matrices.

Most well known is the linear kernel, k(x_i, x_j) = x_i^\top x_j.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 51 / 74
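The rank claim is easy to verify numerically (illustrative sketch): a covariance built from m basis functions evaluated at n > m points has rank at most m.

import numpy as np

def rbf_basis(x, mu, ell=1.0):
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * ell ** 2))

x = np.linspace(-2, 2, 50)            # n = 50 inputs
mu = np.array([-1.0, 0.0, 1.0])       # m = 3 basis functions
K = rbf_basis(x, mu) @ rbf_basis(x, mu).T   # 50 x 50, but degenerate
print(np.linalg.matrix_rank(K))       # prints 3: rank is at most m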

Making Predictions

For non-parametrics, prediction at new points \mathbf{f}^* is made by conditioning on \mathbf{f} in the joint distribution.

In GPs this involves combining the training data with the covariance function and the mean function.

Parametric is a special case when conditional prediction can be summarized in a fixed number of parameters.

Complexity of a parametric model remains fixed regardless of the size of our training data set.

For a non-parametric model the required number of parameters grows with the size of the training data.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 52 / 74
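A sketch of the conditioning step, using the standard Gaussian conditional identities (the covariance function and inputs are illustrative):

import numpy as np

def k_eq(a, b, alpha=1.0, ell=0.5):
    # Exponentiated quadratic covariance between two sets of inputs.
    return alpha * np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 8)                  # training inputs
K = k_eq(x, x) + 1e-8 * np.eye(8)          # jitter for numerical stability
f = rng.multivariate_normal(np.zeros(8), K)  # stand-in for observed f
x_star = np.array([0.25, 1.5])             # test inputs

K_star = k_eq(x_star, x)
# Conditional of a joint Gaussian: mean K_* K^{-1} f, cov K_** - K_* K^{-1} K_*^T.
mean = K_star @ np.linalg.solve(K, f)
cov = k_eq(x_star, x_star) - K_star @ np.linalg.solve(K, K_star.T)
print(mean, np.diag(cov))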

Covariance Functions

RBF Basis Functions

k(x, x') = \alpha \, \phi(x)^\top \phi(x')

\phi_i(x) = \exp\left( -\frac{(x - \mu_i)^2}{2\ell^2} \right)

\mu = (-1, 0, 1)^\top

[Figure: the resulting covariance matrix and sample functions drawn from the process.]

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 53 / 74

Covariance Functions and Mercer Kernels

Mercer kernels and covariance functions are similar.

The kernel perspective does not make a probabilistic interpretation of the covariance function.

Algorithms can be simpler, but the probabilistic interpretation is crucial for kernel parameter optimization.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 54 / 74

Constructing Covariance Functions

Sum of two covariances is also a covariance function.

k(x, x') = k_1(x, x') + k_2(x, x')

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 56 / 74
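One way to check this numerically (illustrative sketch): the sum of two positive semi-definite covariance matrices remains positive semi-definite, matching the fact that sums of independent Gaussian processes are Gaussian processes with added covariances.

import numpy as np

def k_eq(a, b, alpha=1.0, ell=1.0):
    return alpha * np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell ** 2))

def k_lin(a, b):
    return a[:, None] * b[None, :]

x = np.linspace(-1, 1, 20)
K = k_eq(x, x) + k_lin(x, x)              # sum of two covariance functions
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)            # still positive semi-definite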

Covariance Functions

MLP Covariance Function

k(x, x') = \alpha \, \mathrm{asin}\left( \frac{w \, x^\top x' + b}{\sqrt{\left(w \, x^\top x + b + 1\right)\left(w \, x'^\top x' + b + 1\right)}} \right)

Based on infinite neural network model.

w = 40, b = 4

[Figure: the covariance matrix and sample functions over inputs in [-1, 1].]

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 59 / 74

Covariance Functions

Linear Covariance Function

k(x, x') = \alpha \, x^\top x'

Bayesian linear regression.

\alpha = 1

[Figure: the covariance matrix and sample functions over inputs in [-1, 1].]

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 60 / 74
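The three covariances shown in this part can be sketched as follows (parameter values follow the slides; everything else, including function names, is an illustrative assumption), drawing a sample from each process:

import numpy as np

def k_eq(a, b, alpha=1.0, ell=1.0):
    return alpha * np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell ** 2))

def k_mlp(a, b, alpha=1.0, w=40.0, bias=4.0):
    num = w * a[:, None] * b[None, :] + bias
    den = np.sqrt((w * a ** 2 + bias + 1)[:, None] * (w * b ** 2 + bias + 1)[None, :])
    return alpha * np.arcsin(num / den)

def k_lin(a, b, alpha=1.0):
    return alpha * a[:, None] * b[None, :]

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 100)
for k in (k_eq, k_mlp, k_lin):
    K = k(x, x) + 1e-8 * np.eye(len(x))   # jitter so the sampler is stable
    sample = rng.multivariate_normal(np.zeros(len(x)), K)
    print(k.__name__, sample[:3])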

Gaussian Process Interpolation

[Figure: GP interpolation, f(x) against x over [-2, 2]; successive frames condition on more observations.]

Figure: Real example: BACCO (see e.g. Oakley and O'Hagan, 2002). Interpolation through outputs from slow computer simulations (e.g. atmospheric carbon levels).

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 61 / 74

Noise Models

Graph of a GP

Relates input variables, X, to vector, y, through f given kernel parameters \theta.

Plate notation indicates independence of y_i | f_i.

Noise model, p(y_i | f_i), can take several forms.

Simplest is Gaussian noise.

Figure: The Gaussian process depicted graphically.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 62 / 74
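With Gaussian noise each observation is the latent function value plus independent noise, which can be folded into the covariance once f is marginalized; a minimal illustrative sketch:

import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 10, 0.1
f = rng.standard_normal(n)                 # stand-in for latent function values
y = f + rng.normal(0, np.sqrt(sigma2), n)  # y_i | f_i ~ N(f_i, sigma2), independently
# Equivalently, marginally y ~ N(0, K + sigma2 * I) when f ~ N(0, K).
print(y)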

Gaussian Process Regression

[Figure: GP regression with noise, y(x) against x over [-2, 2].]

Figure: Examples include WiFi localization, C14 calibration curve.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 64 / 74
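An end-to-end sketch of noisy GP regression (illustrative data and parameters; the Cholesky factorization handles the O(n^3) solve mentioned in the limitations):

import numpy as np

def k_eq(a, b, alpha=1.0, ell=0.5):
    return alpha * np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(6)
sigma2 = 0.05
x = rng.uniform(-2, 2, 40)
y = np.sin(2.0 * x) + rng.normal(0, np.sqrt(sigma2), x.size)

# Marginally y ~ N(0, K + sigma2 I); factor once with Cholesky.
L = np.linalg.cholesky(k_eq(x, x) + sigma2 * np.eye(x.size))
alpha_vec = np.linalg.solve(L.T, np.linalg.solve(L, y))

x_star = np.linspace(-2, 2, 200)
K_star = k_eq(x_star, x)
mean = K_star @ alpha_vec                                  # predictive mean
v = np.linalg.solve(L, K_star.T)
var = np.diag(k_eq(x_star, x_star)) - np.sum(v * v, axis=0) + sigma2  # error bars
print(mean[:3], var[:3])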

Learning Covariance Parameters

Can we determine length scales and noise levels from the data?

\mathcal{N}(\mathbf{y} | \mathbf{0}, \mathbf{K}) = \frac{1}{(2\pi)^{n/2} |\mathbf{K}|^{1/2}} \exp\left( -\frac{\mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}}{2} \right)

\log \mathcal{N}(\mathbf{y} | \mathbf{0}, \mathbf{K}) = -\frac{n}{2} \log 2\pi - \frac{1}{2} \log |\mathbf{K}| - \frac{\mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}}{2}

Dropping constants and negating gives the objective

E(\theta) = \frac{1}{2} \log |\mathbf{K}| + \frac{\mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}}{2}

The parameters are inside the covariance function (matrix):

k_{i,j} = k(x_i, x_j; \theta)

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 65 / 74
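E(\theta) is straightforward to implement and minimize. This sketch optimizes the length scale only, holding the noise fixed (scipy's bounded scalar minimizer; all values illustrative):

import numpy as np
from scipy.optimize import minimize_scalar

def k_eq(a, b, alpha=1.0, ell=1.0):
    return alpha * np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, 30)
y = np.sin(2 * x) + rng.normal(0, 0.1, 30)

def E(log_ell, sigma2=0.01):
    # Negative log marginal likelihood (up to constants) as a function of ell.
    K = k_eq(x, x, ell=np.exp(log_ell)) + sigma2 * np.eye(x.size)
    L = np.linalg.cholesky(K)
    logdet = 2 * np.sum(np.log(np.diag(L)))
    quad = y @ np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * logdet + 0.5 * quad

res = minimize_scalar(E, bounds=(np.log(0.01), np.log(10.0)), method="bounded")
print(np.exp(res.x))    # learned length scale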

Capacity control: log |K|

The determinant of the covariance is the product of its eigenvalues,

\Lambda = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}, \quad |\Lambda| = \lambda_1 \lambda_2,

and is unchanged by rotation,

\mathbf{R}\Lambda = \begin{pmatrix} w_{1,1} & w_{1,2} \\ w_{2,1} & w_{2,2} \end{pmatrix}, \quad |\mathbf{R}\Lambda| = \lambda_1 \lambda_2.

The log |K| term therefore measures the volume of the covariance ellipse, penalizing models with too much capacity.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 67 / 74
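A short numerical check of the rotation invariance (illustrative):

import numpy as np

lam1, lam2, theta = 3.0, 0.5, 0.7
Lam = np.diag([lam1, lam2])
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.linalg.det(Lam), np.linalg.det(R @ Lam))   # both lam1 * lam2 = 1.5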

Data Fit: \mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y} / 2

[Figure: contours of the Gaussian over (y_1, y_2), axes from -6 to 6, as the length scale varies; the data-fit term measures how well the observations sit within the covariance contours.]

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 68 / 74

Learning Covariance Parameters

Can we determine length scales and noise levels from the data?

E(\theta) = \frac{1}{2} \log |\mathbf{K}| + \frac{\mathbf{y}^\top \mathbf{K}^{-1} \mathbf{y}}{2}

[Figure: left, the GP fit against x for different length scales; right, E(\theta) against the length scale on a log axis from 10^{-1} to 10^{1}, with a minimum at the best trade-off between the data-fit and capacity terms.]

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 69 / 74

Gene Expression Example

Data from Della Gatta et al. (2008). Figure from Kalaitzis and Lawrence (2011).

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 70 / 74

    Outline

    1 The Gaussian Density

    2 Covariance from Basis Functions

    3 Basis Function Representations

    4 Constructing Covariance

    5 GP Limitations


    6 Conclusions

    Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 71 / 74

Limitations of Gaussian Processes

Inference is O(n^3) due to the matrix inverse (in practice use Cholesky).

Gaussian processes don't deal well with discontinuities (financial crises, phosphorylation, collisions, edges in images).

Widely used exponentiated quadratic covariance (RBF) can be too smooth in practice (but there are many alternatives!).

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 72 / 74

Summary

Broad introduction to Gaussian processes.

Started with the Gaussian distribution.

Motivated Gaussian processes through the multivariate density.

Emphasized the role of the covariance (not the mean).

Performs nonlinear regression with error bars.

Parameters of the covariance function (kernel) are easily optimized with maximum likelihood.

Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 73 / 74

References I

G. Della Gatta, M. Bansal, A. Ambesi-Impiombato, D. Antonini, C. Missero, and D. di Bernardo. Direct targets of the trp63 transcription factor revealed by a combination of gene expression profiling and reverse engineering. Genome Research, 18(6):939–948, Jun 2008. [URL]. [DOI].

A. A. Kalaitzis and N. D. Lawrence. A simple approach to ranking differentially expressed gene expression time courses through Gaussian process regression. BMC Bioinformatics, 12(180), 2011. [DOI].

R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996. Lecture Notes in Statistics 118.

J. Oakley and A. O'Hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 2002.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. [Google Books].

C. K. I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.


    Urtasun and Lawrence () Session 1: GP and Regression CVPR Tutorial 74 / 74