Gaussian Processes: Applications in Machine Learning

Abhishek Agarwal (05329022)

Under the Guidance of Prof. Sunita Sarawagi

KReSIT, IIT Bombay

Seminar Presentation, March 29, 2006

Outline

Introduction to Gaussian Processes (GP)

Prior & Posterior Distributions

GP Models: Regression

GP Models: Binary Classification

Covariance Functions

Conclusion.


Introduction

Supervised Learning

Gaussian Processes

Define a distribution over functions.

A collection of random variables, any finite number of which have a joint Gaussian distribution. [1] [2]

$f \sim \mathcal{GP}(m, k)$, where $m$ is the mean function and $k$ is the covariance function.

Hyperparameters and covariance function.

Predictions


Prior Distribution

Represents our belief about the distribution over functions before seeing any data; we specify it through the mean and covariance functions and their parameters.

Example: $\mathcal{GP}(m, k)$ with

$m(x) = \frac{1}{4}x^2, \quad k(x, x') = \exp\left(-\frac{1}{2}(x - x')^2\right)$

To draw a sample from the distribution:

Pick some data points $x_1, \dots, x_n$.

Find the distribution parameters at these points:

$\mu_i = m(x_i), \quad \Sigma_{ij} = k(x_i, x_j), \quad i, j = 1, \dots, n$

Draw the function values jointly from $\mathcal{N}(\mu, \Sigma)$.
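The sampling steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the presenter's code: the grid of 41 inputs, the jitter term added for numerical stability, the random seed, and the function names `mean_fn`, `se_kernel`, and `sample_prior` are all assumptions made for the example. The mean and covariance functions are the ones from the example slide.

```python
import numpy as np

def mean_fn(x):
    # Example mean function from the slide: m(x) = x^2 / 4
    return 0.25 * x ** 2

def se_kernel(a, b):
    # Example covariance from the slide: k(x, x') = exp(-0.5 * (x - x')^2)
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff ** 2)

def sample_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw n_samples functions from GP(m, k), evaluated at the points x."""
    rng = np.random.default_rng(seed)
    mu = mean_fn(x)                                   # mu_i = m(x_i)
    cov = se_kernel(x, x) + jitter * np.eye(len(x))   # Sigma_ij = k(x_i, x_j), plus assumed jitter
    chol = np.linalg.cholesky(cov)                    # cov = chol @ chol.T
    z = rng.standard_normal((len(x), n_samples))      # independent standard normals
    return mu[:, None] + chol @ z                     # joint draw from N(mu, Sigma)

x = np.linspace(-5.0, 5.0, 41)   # assumed input grid
samples = sample_prior(x)        # one column per sampled function
```

Each column of `samples` is one function drawn from the prior. The values are drawn jointly through the Cholesky factor, which is what makes neighbouring points vary smoothly together.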


Prior Distribution(contd.)

Figure: Prior distribution over functions using a Gaussian process (x-axis: data points; y-axis: function values)


Posterior Distribution

The distribution changes in the presence of training data D(x, y).

Functions that agree with D are given higher probability.

Figure: Posterior distribution over functions using Gaussian processes (x-axis: data points; y-axis: function values)


Posterior Distribution (contd.)

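Prediction for an unlabeled input $x_*$:

The GP outputs the distribution of the function value at $x_*$.

Let $\mathbf{f}$ be the function values at the data points in D and $f_*$ the function value at $x_*$.

$\mathbf{f}$ and $f_*$ have a joint Gaussian distribution, represented as:

$\begin{bmatrix} \mathbf{f} \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu \\ \mu_* \end{bmatrix}, \begin{bmatrix} \Sigma & \Sigma_* \\ \Sigma_*^T & \Sigma_{**} \end{bmatrix} \right)$

The conditional distribution of $f_*$ given $\mathbf{f}$ can be expressed as:

$f_* \mid \mathbf{f} \sim \mathcal{N}\left( \mu_* + \Sigma_*^T \Sigma^{-1} (\mathbf{f} - \mu),\; \Sigma_{**} - \Sigma_*^T \Sigma^{-1} \Sigma_* \right)$ (1)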


Posterior Distribution (contd.)

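Parameters of the posterior in Eq. 1 are:

$f_* \mid D \sim \mathcal{GP}(m_D, k_D)$,

where

$m_D(x) = m(x) + \Sigma(X, x)^T \Sigma^{-1} (\mathbf{f} - \mathbf{m})$

$k_D(x, x') = k(x, x') - \Sigma(X, x)^T \Sigma^{-1} \Sigma(X, x')$

Here $\Sigma(X, x)$ is the vector of covariances between the training inputs $X$ and $x$, and $\mathbf{m}$ is the prior mean evaluated at $X$.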
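The posterior mean and covariance above translate directly into code. The sketch below is illustrative only: it assumes noise-free observations, reuses the example mean and covariance functions from the prior slide, and the training values, test inputs, jitter term, and function name `gp_posterior` are made up for the example.

```python
import numpy as np

def mean_fn(x):
    # Prior mean from the earlier example: m(x) = x^2 / 4
    return 0.25 * x ** 2

def se_kernel(a, b):
    # Prior covariance from the earlier example: k(x, x') = exp(-0.5 * (x - x')^2)
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff ** 2)

def gp_posterior(X, f, x_star, jitter=1e-8):
    """Posterior mean m_D and covariance k_D at x_star, given noise-free values f at X."""
    K = se_kernel(X, X) + jitter * np.eye(len(X))   # Sigma (assumed jitter for stability)
    K_s = se_kernel(X, x_star)                      # Sigma(X, x*), one column per test point
    K_ss = se_kernel(x_star, x_star)                # k(x*, x*')
    alpha = np.linalg.solve(K, f - mean_fn(X))      # Sigma^{-1} (f - m)
    mean = mean_fn(x_star) + K_s.T @ alpha          # m_D(x*)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)    # k_D(x*, x*')
    return mean, cov

# Hypothetical training data and test inputs
X = np.array([-3.0, -1.0, 0.5, 2.0])
f = np.array([2.0, 0.5, 0.3, 1.5])
x_star = np.array([-3.0, 0.0, 4.0])
mean, cov = gp_posterior(X, f, x_star)
```

At a test point that coincides with a training input, the posterior mean reproduces the observed value and the variance collapses to nearly zero. Far from the data, the variance returns towards the prior variance.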

Figure: Prediction from GP (x-axis: data points; y-axis: function values)

GP Models: Regression

A GP can be applied directly to a Bayesian linear regression model such as:

$f(x) = \phi(x)^T w$, with prior $w \sim \mathcal{N}(0, \Sigma_p)$

The mean and covariance of this distribution are:

$E[f(x)] = \phi(x)^T E[w] = 0$,

$E[f(x) f(x')] = \phi(x)^T E[w w^T] \phi(x') = \phi(x)^T \Sigma_p \phi(x')$

So $f(x)$ and $f(x')$ are jointly Gaussian with zero mean and covariance $\phi(x)^T \Sigma_p \phi(x')$.
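The identity above can be checked numerically. In the sketch below the polynomial feature map `phi`, the prior covariance `Sigma_p`, and the three inputs are hypothetical choices made for illustration. The code builds the covariance matrix from the formula and compares it with a Monte Carlo estimate obtained by sampling weight vectors.

```python
import numpy as np

def phi(x):
    # Hypothetical feature map: phi(x) = [1, x, x^2]
    return np.stack([np.ones_like(x), x, x ** 2], axis=1)

Sigma_p = np.diag([1.0, 0.5, 0.25])   # assumed prior covariance of the weights
x = np.array([-1.0, 0.0, 2.0])        # assumed inputs
Phi = phi(x)                          # row i is phi(x_i)^T

# Covariance implied by the weight prior: K_ij = phi(x_i)^T Sigma_p phi(x_j)
K = Phi @ Sigma_p @ Phi.T

# Monte Carlo check: sample w ~ N(0, Sigma_p) and evaluate f(x) = phi(x)^T w
rng = np.random.default_rng(0)
W = rng.multivariate_normal(np.zeros(3), Sigma_p, size=200_000)
F = W @ Phi.T                         # row s holds f(x_1), f(x_2), f(x_3) for sample s
K_mc = F.T @ F / len(F)               # empirical E[f(x) f(x')]
```

With enough samples `K_mc` approaches `K` and the sample mean of `F` approaches zero, which is the sense in which the weight-space model defines a zero-mean GP.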


GP Models: Regression (contd.)

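In regression, the posterior distribution over the weights is given by Bayes' rule (9):

$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}$

Both the prior $p(\mathbf{f} \mid X)$ and the likelihood $p(\mathbf{y} \mid \mathbf{f}, X)$ are Gaussian:

prior: $\mathbf{f} \mid X \sim \mathcal{N}(0, K)$ (see Eq. 5)

likelihood: $\mathbf{y} \mid \mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma_n^2 I)$

The marginal likelihood $p(\mathbf{y} \mid X)$ is defined as (see Eq. 6):

$p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid \mathbf{f}, X)\, p(\mathbf{f} \mid X)\, d\mathbf{f}$ (2)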
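Because both factors are Gaussian, the integral in Eq. 2 has the closed form listed as Eq. 6 in the Extra slides. The sketch below evaluates that closed form using a Cholesky factorisation. It assumes a zero-mean prior as on this slide and a squared exponential covariance; the data, the noise variance, and the function names are hypothetical.

```python
import numpy as np

def se_kernel(a, b):
    # Squared exponential covariance: k(x, x') = exp(-0.5 * (x - x')^2)
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff ** 2)

def log_marginal_likelihood(X, y, noise_var):
    """log p(y | X) for a zero-mean GP with Gaussian noise of variance noise_var."""
    n = len(X)
    Ky = se_kernel(X, X) + noise_var * np.eye(n)           # K + sigma_n^2 I
    chol = np.linalg.cholesky(Ky)                          # Ky = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))   # Ky^{-1} y
    data_fit = -0.5 * y @ alpha                            # -1/2 y^T Ky^{-1} y
    complexity = -np.sum(np.log(np.diag(chol)))            # -1/2 log |Ky|
    constant = -0.5 * n * np.log(2.0 * np.pi)              # -n/2 log 2 pi
    return data_fit + complexity + constant

# Hypothetical data
X = np.array([-1.0, 0.0, 1.5])
y = np.array([0.2, -0.4, 1.0])
lml = log_marginal_likelihood(X, y, noise_var=0.1)
```

The Cholesky factor is used for both the linear solve and the log-determinant, which avoids forming the matrix inverse explicitly.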


GP Models: Classification

Modeling Binary Classifier

Squash the output of a regression model through a response function, such as the sigmoid.

Example: linear logistic regression model:

$p(C_1 \mid x) = \lambda(x^T w), \quad \lambda(z) = \frac{1}{1 + \exp(-z)}$

The likelihood is expressed as (see Eq. 7):

$p(y_i \mid x_i, w) = \sigma(y_i f_i), \quad f_i \triangleq f(x_i) = x_i^T w$

and it is therefore non-Gaussian.
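As a small illustration of this likelihood, the sketch below evaluates the logistic response and the log-likelihood for labels in {-1, +1}. The inputs, labels, weights, and function names are hypothetical, and the log-likelihood uses `np.logaddexp` for numerical stability instead of taking the logarithm of the sigmoid directly.

```python
import numpy as np

def sigmoid(z):
    # Logistic response function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """sum_i log sigma(y_i * f_i) with f_i = x_i^T w and labels y_i in {-1, +1}."""
    f = X @ w
    # log sigma(z) = -log(1 + exp(-z)), computed stably with logaddexp
    return -np.sum(np.logaddexp(0.0, -y * f))

# Hypothetical inputs, labels, and weights
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
y = np.array([1.0, -1.0, -1.0])
w = np.array([0.4, -0.2])
ll = log_likelihood(w, X, y)
```

Writing the likelihood as a function of the product `y * f` relies on the symmetry of the sigmoid, so a single expression covers both classes.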


GP Models: Classification (contd.)

The distribution of the latent function value at a test input $x_*$, given the training data, is:

$p(f_* \mid X, \mathbf{y}, x_*) = \int p(f_* \mid X, x_*, \mathbf{f})\, p(\mathbf{f} \mid X, \mathbf{y})\, d\mathbf{f}$, (3)

where $p(\mathbf{f} \mid X, \mathbf{y}) = p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X) / p(\mathbf{y} \mid X)$ is the posterior over the latent variables.

Computing the above integral is analytically intractable.

Both the likelihood and the posterior are non-Gaussian.

We need an analytic approximation of the integrals.


GP Models: Laplace Approximations

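Gaussian approximation of $p(\mathbf{f} \mid X, \mathbf{y})$:

Using a second-order Taylor expansion of $\log p(\mathbf{f} \mid X, \mathbf{y})$ around its maximum, we obtain:

$q(\mathbf{f} \mid X, \mathbf{y}) = \mathcal{N}(\mathbf{f} \mid \hat{\mathbf{f}}, A^{-1})$

where $\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} p(\mathbf{f} \mid X, \mathbf{y})$ and $A = -\nabla\nabla \log p(\mathbf{f} \mid X, \mathbf{y}) \big|_{\mathbf{f} = \hat{\mathbf{f}}}$.

To find $\hat{\mathbf{f}}$ we use Newton's method, because $\nabla \log p(\mathbf{f} \mid X, \mathbf{y})$ is non-linear in $\mathbf{f}$ (9).

The prediction is given as:

$\pi_* = p(y_* = +1 \mid X, \mathbf{y}, x_*) = \int \sigma(f_*)\, p(f_* \mid X, \mathbf{y}, x_*)\, df_*$ (4)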
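The Newton search for the posterior mode can be sketched as follows. This is an illustrative implementation under stated assumptions, not the presenter's code: it assumes a zero-mean GP prior with a squared exponential covariance, a logistic likelihood with labels in {-1, +1}, plain Newton steps without a line search, and a tiny hypothetical data set. The names `laplace_mode`, `se_kernel`, and `sigmoid` are introduced for the example. It uses the standard Newton update $\mathbf{f}_{new} = (K^{-1} + W)^{-1}(W\mathbf{f} + \nabla \log p(\mathbf{y} \mid \mathbf{f}))$, where $W$ is the negative Hessian of the log-likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_kernel(a, b):
    # Squared exponential covariance: k(x, x') = exp(-0.5 * (x - x')^2)
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff ** 2)

def laplace_mode(K, y, n_iter=50, tol=1e-8):
    """Newton iterations for f_hat = argmax_f p(f | X, y), logistic likelihood."""
    n = len(y)
    f = np.zeros(n)
    t = (y + 1.0) / 2.0                  # labels mapped from {-1, +1} to {0, 1}
    for _ in range(n_iter):
        pi = sigmoid(f)
        grad = t - pi                    # gradient of log p(y | f)
        W = pi * (1.0 - pi)              # diagonal of the negative Hessian of log p(y | f)
        b = W * f + grad
        # (K^{-1} + W)^{-1} b  ==  K (I + W K)^{-1} b, which avoids inverting K
        f_new = K @ np.linalg.solve(np.eye(n) + W[:, None] * K, b)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    return f

# Hypothetical one-dimensional training set
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = se_kernel(X, X)
f_hat = laplace_mode(K, y)

# Mean of the approximate latent predictive at new inputs: k_*^T grad log p(y | f_hat)
x_star = np.array([-1.5, 1.5])
grad_hat = (y + 1.0) / 2.0 - sigmoid(f_hat)
latent_mean = se_kernel(X, x_star).T @ grad_hat
```

At convergence the mode satisfies $\hat{\mathbf{f}} = K \nabla \log p(\mathbf{y} \mid \hat{\mathbf{f}})$, which gives a simple check on the result. The averaged prediction in Eq. 4 additionally needs the latent predictive variance and a one-dimensional integral, which this sketch leaves out.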


Covariance Function

Encodes our belief about the prior distribution over functions.

Some properties:

Stationary

Isotropic

Dot-product covariance

Example: squared exponential (SE) covariance function:

$\mathrm{cov}(f(x_p), f(x_q)) = \exp\left(-\frac{1}{2}|x_p - x_q|^2\right)$

Its parameters are learned along with the other hyperparameters.
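The listed properties can be illustrated numerically. The sketch below compares the SE covariance with a plain dot-product covariance $x_p^T x_q$, which is used here only as an assumed example of the dot-product family. The input vectors, the shift, and the rotation angle are arbitrary choices.

```python
import numpy as np

def se_cov(xp, xq):
    # Squared exponential: depends only on the distance |xp - xq|
    return np.exp(-0.5 * np.sum((xp - xq) ** 2))

def dot_cov(xp, xq):
    # Simple dot-product covariance (assumed example): depends on xp . xq
    return float(xp @ xq)

xp = np.array([1.0, 2.0])
xq = np.array([0.0, -1.0])
shift = np.array([3.0, 3.0])
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation about the origin

# Stationary: translating both inputs leaves the SE covariance unchanged
se_shift_invariant = np.isclose(se_cov(xp + shift, xq + shift), se_cov(xp, xq))
# Isotropic: rotating both inputs leaves the SE covariance unchanged
se_rotation_invariant = np.isclose(se_cov(R @ xp, R @ xq), se_cov(xp, xq))
# The dot-product covariance is rotation invariant but not translation invariant
dot_shift_invariant = np.isclose(dot_cov(xp + shift, xq + shift), dot_cov(xp, xq))
dot_rotation_invariant = np.isclose(dot_cov(R @ xp, R @ xq), dot_cov(xp, xq))
```

The SE covariance passes both checks, while the dot-product covariance passes only the rotation check, which is why it is listed as a separate family.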


Summary and Future Work

Current Research:

Fast sparse approximation algorithms for matrix inversion.

Approximation algorithms for non-Gaussian likelihoods.

The GP approach has outperformed traditional methods in many applications:

Gaussian Process based Positioning System (GPPS) [6]

Multi-user Detection (MUD) in CDMA [7]

GP models are more powerful and flexible than simple linear parametric models, and less complex than other models such as multi-layer perceptrons. [1]


[1] Rasmussen and Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[2] Matthias Seeger. Gaussian Processes for Machine Learning. International Journal of Neural Systems, 14(2):69-106, 2004.

[3] Christopher Williams. Bayesian Classification with Gaussian Processes. IEEE Trans. Pattern Analysis and Machine Intelligence, 1998.

[4] Rasmussen and Williams. Gaussian Processes for Regression. In Proceedings of NIPS, 1996.

[5] Rasmussen. Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. PhD thesis, Dept. of Computer Science, University of Toronto, 1996. Available from http://www.cs.utoronto.ca/~carl/

[6] Anton Schwaighofer et al. GPPS: A Gaussian Process Positioning System for Cellular Networks. In Proceedings of NIPS, 2003.

[7] Murillo-Fuentes et al. Gaussian Processes for Multiuser Detection in CDMA Receivers. Advances in Neural Information Processing Systems, 2005.

[8] David MacKay. Introduction to Gaussian Processes.

[9] C. Williams. Gaussian processes. In M. A. Arbib, editor, Handbook of Brain Theory and Neural Networks, pages 466-470. The MIT Press, second edition, 2002.


Thank You !!

Questions ??


Extra

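Prior:

$\log p(\mathbf{f} \mid X) = -\frac{1}{2}\mathbf{f}^T K^{-1}\mathbf{f} - \frac{1}{2}\log|K| - \frac{n}{2}\log 2\pi$ (5)

Marginal likelihood:

$\log p(\mathbf{y} \mid X) = -\frac{1}{2}\mathbf{y}^T (K + \sigma_n^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log|K + \sigma_n^2 I| - \frac{n}{2}\log 2\pi$ (6)

Likelihood:

$p(y = +1 \mid x, w) = \sigma(x^T w)$, (7)

For a symmetric likelihood, $\sigma(-z) = 1 - \sigma(z)$, so both classes can be written as

$p(y_i \mid x_i, w) = \sigma(y_i x_i^T w)$, (8)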


Extra (contd.)

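First derivative of the log posterior:

$\nabla \log p(\mathbf{f} \mid X, \mathbf{y}) = \nabla \log p(\mathbf{y} \mid \mathbf{f}) - K^{-1}\mathbf{f}$

Setting it to zero at the mode gives $\hat{\mathbf{f}} = K\, \nabla \log p(\mathbf{y} \mid \hat{\mathbf{f}})$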

Prediction (posterior over the weights):

$p(w \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, w)\, p(w)}{p(\mathbf{y} \mid X)}$

