Gaussian Process and Prediction
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)
Outline
- Gaussian Process and Bayesian Regression
  - Bayesian regression
  - Weight-space view
  - Function-space view
  - Spline smoothing
  - Neural networks
  - Classification problems
- Active Data Selection
  - Maximizing the expected information gain
  - Minimizing the regression error
  - Experimental results
- Mixtures of Gaussian Processes
Gaussian Process and Bayesian Regression (1)
- A distribution of y in Bayesian regression:
  $$p(y \mid D) = \sum_H p(y \mid H)\, p(H \mid D) \qquad (H: \text{a hypothesis},\ D: \text{data})$$
- Generalized linear regression:
  $$y(x) = \sum_{i=1}^{m} w_i \phi_i(x) = W^T \phi(x), \quad \text{for a fixed set of } m \text{ basis functions } \{\phi_i(x)\}$$
- Weight-space view:
  $$W \sim N(0, \Sigma_w), \qquad t_n = y(x_n) + e_n, \quad e_n \sim N(0, \sigma^2), \qquad D = \{(x_1, t_1), \dots, (x_n, t_n)\}$$
  $$p(W \mid D) = p(D \mid W)\, p(W) \Big/ \int p(D \mid W)\, p(W)\, dW$$
  Choose $W_{MP}$ by minimizing the negative log posterior:
  $$E(W) = \frac{1}{2\sigma^2} \sum_i \bigl(t_i - W^T \phi(x_i)\bigr)^2 + \frac{1}{2} W^T \Sigma_w^{-1} W, \qquad y_{MP}(x_*) = W_{MP}^T \phi(x_*)$$
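As a concrete illustration of the weight-space view, here is a minimal numpy sketch: Gaussian bump basis functions, a toy sine target, and the closed-form minimizer of E(W) via ridge-style normal equations. The basis, data, and hyperparameter values are assumptions for the example, not from the slides.

```python
import numpy as np

# Weight-space view: MAP weights for generalized linear regression
# with prior W ~ N(0, Sigma_w) and noise e_n ~ N(0, sigma^2).

rng = np.random.default_rng(0)

def phi(x, centers, width=0.5):
    """Fixed set of m Gaussian basis functions (illustrative choice)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

# toy data: t = sin(x) + noise
x = rng.uniform(-3, 3, 20)
t = np.sin(x) + 0.1 * rng.standard_normal(20)

centers = np.linspace(-3, 3, 7)          # m = 7 basis functions
Phi = phi(x, centers)                    # n x m design matrix
sigma2, Sigma_w = 0.1**2, np.eye(7)      # noise variance, weight prior

# Minimizing the negative log posterior E(W) is regularized least
# squares with the closed-form solution
# W_MP = (Phi^T Phi / sigma^2 + Sigma_w^{-1})^{-1} Phi^T t / sigma^2.
A = Phi.T @ Phi / sigma2 + np.linalg.inv(Sigma_w)
W_MP = np.linalg.solve(A, Phi.T @ t / sigma2)

x_star = np.array([0.5])
y_MP = phi(x_star, centers) @ W_MP       # y_MP(x*) = W_MP^T phi(x*)
print(y_MP)
```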
Gaussian Process and Bayesian Regression (2)
- Function-space view
  Since Y(x) is a linear combination of the Gaussian random variables $W \sim N(0, \Sigma_w)$, $\{Y_x\}$ is a Gaussian process with mean and covariance functions
  $$E[Y] = 0, \qquad E[Y_x Y_{x'}] = \phi(x)^T \Sigma_w \phi(x')$$
  $Y_* = Y(x_*)$ can be predicted from the conditional distribution $p(Y_* \mid t_1, \dots, t_n)$, where $t_n = y(x_n) + e_n$:
  $$E[Y_* \mid t] = \phi(x_*)^T P\, t, \qquad \mathrm{Var}[Y_* \mid t] = \phi(x_*)^T \Sigma_w \phi(x_*) - \phi(x_*)^T P\, \Phi^T \Sigma_w \phi(x_*)$$
  where $\Phi = \bigl(\phi(x_1), \dots, \phi(x_n)\bigr)$ is the $m \times n$ design matrix and $P = \Sigma_w \Phi \bigl(\Phi^T \Sigma_w \Phi + \sigma^2 I\bigr)^{-1}$.
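The same toy problem from the function-space side, as a minimal sketch: build the n x n covariance of the observed function values and read off E[Y* | t] and Var[Y* | t] using the matrix P defined above. The basis and data are the same illustrative assumptions as in the previous sketch; the predictive mean agrees with W_MP^T phi(x*).

```python
import numpy as np

# Function-space view: predictive mean and variance of Y* = Y(x*).

rng = np.random.default_rng(0)

def phi(x, centers, width=0.5):
    return np.exp(-0.5 * ((np.atleast_1d(x)[:, None] - centers[None, :]) / width) ** 2)

x = rng.uniform(-3, 3, 20)
t = np.sin(x) + 0.1 * rng.standard_normal(20)
centers = np.linspace(-3, 3, 7)
sigma2, Sigma_w = 0.1**2, np.eye(7)

Phi = phi(x, centers).T                              # m x n: (phi(x_1) ... phi(x_n))
K = Phi.T @ Sigma_w @ Phi + sigma2 * np.eye(len(x))  # covariance of the targets
phi_s = phi(0.5, centers).T                          # phi(x*), m x 1

P = Sigma_w @ Phi @ np.linalg.inv(K)                 # P = Sigma_w Phi (Phi^T Sigma_w Phi + sigma^2 I)^{-1}
mean = (phi_s.T @ P @ t).item()                      # E[Y* | t]
var = (phi_s.T @ Sigma_w @ phi_s
       - phi_s.T @ P @ Phi.T @ Sigma_w @ phi_s).item()
print(mean, var)                                     # mean matches W_MP^T phi(x*) above
```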
Gaussian Process and Bayesian Regression (3)
- The weight-space view and the function-space view give the same results.
- For a smaller number of basis functions the weight-space view is preferred, while for a larger number of basis functions the function-space view (Gaussian process view) is better.
- Cf. the nonparametric kernel estimator for a density p(y):
  $$\hat p(y) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{y - X_i}{h}\right), \qquad \text{where } h \text{ is the bandwidth}$$
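A minimal sketch of the kernel density estimator above, assuming a Gaussian kernel K and an arbitrary bandwidth h = 0.3:

```python
import numpy as np

# Nonparametric kernel density estimate p_hat(y) = (1/(n h)) sum_i K((y - X_i)/h).

def kde(y, X, h):
    K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return K((y - X[:, None]) / h).sum(axis=0) / (len(X) * h)

X = np.random.default_rng(1).standard_normal(500)   # sample from N(0, 1)
print(kde(np.array([0.0, 1.0]), X, h=0.3))          # compare with true N(0,1): 0.399, 0.242
```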
Spline Smoothing (1)
- Interpolating spline
  An interpolating spline is a cubic polynomial defined piecewise between adjacent knots, with continuous second derivatives (Schoenberg (1964)):
  $$\min_{r \in C^2[x_1, x_n]} \int_{x_1}^{x_n} r''(x)^2\, dx \qquad \text{s.t. } r(x_i) = y_i, \quad i = 1, \dots, n$$
- Smoothing spline
  The smoothing spline is also a cubic spline (Reinsch (1967)):
  $$\hat r(x) = \arg\min_r S(r), \qquad S(r) = \sum_{i=1}^{n} \bigl[y_i - r(x_i)\bigr]^2 + \lambda \int_{x_1}^{x_n} r''(x)^2\, dx$$
  $\lambda \to 0$: the interpolating spline; $\lambda \to \infty$: the least-squares linear fit.
Spline Smoothing (2)
- Linear smoothing property of the smoothing spline: the fit is linear in the data,
  $$\hat r(x) = \sum_{i=1}^{n} w(x, x_i)\, y_i \qquad \bigl(\text{so } y = y^{(1)} + y^{(2)} \Rightarrow \hat r = \hat r^{(1)} + \hat r^{(2)}\bigr)$$
  If the design is equally spaced, then all of the n component smoothing splines are identical in shape, and the shape converges to the kernel (Silverman (1984))
  $$K(t) = \frac{1}{2} \exp\!\left(-\frac{|t|}{\sqrt{2}}\right) \sin\!\left(\frac{|t|}{\sqrt{2}} + \frac{\pi}{4}\right)$$
- Cf. nonparametric kernel regression (Nadaraya (1964) and Watson (1964)):
  $$r(x) = E[Y \mid X = x] = \frac{\int y\, f(x, y)\, dy}{\int f(x, y)\, dy}$$
  $$\hat r_h(x) = \sum_{i=1}^{n} w_h(x, x_i)\, y_i, \qquad w_h(x, x_i) = K\!\left(\frac{x - x_i}{h}\right) \Big/ \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)$$
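A minimal sketch of the Nadaraya-Watson estimator above; the Gaussian kernel, bandwidth, and toy data are illustrative assumptions:

```python
import numpy as np

# Nadaraya-Watson regression: r_hat_h(x) = sum_i w_h(x, x_i) y_i,
# with kernel weights normalized over the sample.

def nadaraya_watson(x_query, x, y, h):
    K = lambda u: np.exp(-0.5 * u**2)            # unnormalized Gaussian kernel
    W = K((x_query[:, None] - x[None, :]) / h)   # K((x - x_i) / h)
    W /= W.sum(axis=1, keepdims=True)            # w_h(x, x_i): normalize over i
    return W @ y

rng = np.random.default_rng(2)
x = rng.uniform(0, 2 * np.pi, 200)
y = np.sin(x) + 0.2 * rng.standard_normal(200)
print(nadaraya_watson(np.array([np.pi / 2]), x, y, h=0.3))  # close to sin(pi/2) = 1
```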
Spline Smoothing (3)
- The spline estimation procedure can be interpreted as Bayesian MAP estimation:
  $$M(y(x)) = \frac{1}{2} \sum_{n=1}^{N} \bigl(y(x^{(n)}) - t_n\bigr)^2 + \frac{\lambda}{2} \int \bigl[y^{(p)}(x)\bigr]^2\, dx$$
- When p = 2, the resulting $\hat y(x)$ is a cubic spline (a piecewise cubic function that has knots at the data points $\{x^{(n)}\}$).
- Model: $y(x)$; Data: $(x^{(n)}, t_n)$, $n = 1, \dots, N$.
  Prior for $y(x)$:
  $$\log p(y(x) \mid \lambda) = -\frac{\lambda}{2} \int \bigl[y^{(p)}(x)\bigr]^2\, dx + \text{const.}$$
  Likelihood:
  $$\log p(t \mid y(x), \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl(y(x^{(n)}) - t_n\bigr)^2 + \text{const.}$$
  Posterior:
  $$M(y(x)) = -\log p(y(x) \mid t_1, \dots, t_N) + \text{const.} \qquad \text{(Bayesian MAP)}$$
Spline Smoothing (4)
- Spline priors are Gaussian processes.
- Gaussian process:
  $$p(y(x) \mid \mu(x), A) = \frac{1}{Z} \exp\Bigl[-\frac{1}{2} \bigl(y(x) - \mu(x)\bigr)^T A \bigl(y(x) - \mu(x)\bigr)\Bigr]$$
  where $\mu(x)$: the mean function, $A$: a linear operator, and the inner product is $y(x)^T z(x) = \int y(x)\, z(x)\, dx$.
- The spline prior
  $$\log p(y(x) \mid \lambda) = -\frac{\lambda}{2} \int \bigl[y^{(p)}(x)\bigr]^2\, dx + \text{const.} = -\frac{1}{2}\, y(x)^T A\, y(x) + \text{const.}$$
  is of this form with $\mu(x) = 0$ and $A = \lambda (D^p)^T D^p \ge 0$ (i.e., $y(x)^T A\, y(x) \ge 0$ for all $y(x)$).
Spline Smoothing (5)
Splines correspond to Gaussian processes with a particular choice of covariance function.
$$\bigl(Z_1, \dots, Z_n, Z_*\bigr) \sim N(0, K_+), \qquad K_+ = \begin{pmatrix} K & k_* \\ k_*^T & k_{**} \end{pmatrix}$$
Given $Z = (z_1, \dots, z_n)^T$:
$$E[Z_* \mid Z = z] = k_*^T K^{-1} z, \qquad \text{or } \hat Z_* = \sum_{i=1}^{n} c_i\, C(x_*, x_i) \ \text{with } c = K^{-1} z$$
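A minimal sketch of this conditional-mean prediction, assuming a squared-exponential covariance function (not the spline covariance itself) and toy data:

```python
import numpy as np

# GP prediction directly from a covariance function:
# E[Z* | z] = k*^T K^{-1} z, or Z*_hat = sum_i c_i C(x*, x_i) with c = K^{-1} z.

def cov(a, b, length=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 15)
z = np.sin(x) + 0.05 * rng.standard_normal(15)
noise2 = 0.05**2

K = cov(x, x) + noise2 * np.eye(15)   # add observation noise to the diagonal
c = np.linalg.solve(K, z)             # c = K^{-1} z
x_star = np.array([2.5])
z_star = cov(x_star, x) @ c           # Z*_hat = sum_i c_i C(x*, x_i)
print(z_star)
```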
Known covariance functions for modeling (examples)
Covariance Functions with Unknown Parameters
- For a smaller number of parameters: choose a parametric family of covariance functions and estimate the parameters by maximizing the log likelihood.
- For a larger number of parameters, or when local maxima are a concern: integrate over a prior distribution of the parameters numerically.
$$l = \log p(D \mid \theta) = -\frac{1}{2} \log |K| - \frac{1}{2} t^T K^{-1} t - \frac{n}{2} \log 2\pi$$
$$\frac{\partial l}{\partial \theta_i} = -\frac{1}{2} \mathrm{tr}\Bigl(K^{-1} \frac{\partial K}{\partial \theta_i}\Bigr) + \frac{1}{2}\, t^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} t$$
$$p(y_* \mid D) = \int p(y_* \mid D, \theta)\, p(\theta \mid D)\, d\theta$$
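A minimal sketch of the log likelihood and its gradient for a one-parameter family, assuming a squared-exponential covariance with length-scale parameter and a fixed noise level:

```python
import numpy as np

# GP hyperparameter log likelihood l and gradient dl/d(length) for
# K = exp(-d^2 / (2 length^2)) + noise^2 I (assumed parametric family).

def log_lik_and_grad(length, x, t, noise=0.1):
    n = len(x)
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * d2 / length**2) + noise**2 * np.eye(n)
    Kinv = np.linalg.inv(K)
    _, logdet = np.linalg.slogdet(K)
    l = -0.5 * logdet - 0.5 * t @ Kinv @ t - 0.5 * n * np.log(2 * np.pi)
    dK = np.exp(-0.5 * d2 / length**2) * d2 / length**3   # dK / d(length)
    # dl = -0.5 tr(K^{-1} dK) + 0.5 t^T K^{-1} dK K^{-1} t
    dl = -0.5 * np.trace(Kinv @ dK) + 0.5 * t @ Kinv @ dK @ Kinv @ t
    return l, dl

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 30)
t = np.sin(x) + 0.1 * rng.standard_normal(30)
print(log_lik_and_grad(1.0, x, t))   # climb the gradient to maximize l
```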
Multilayer Neural Networks and Gaussian Processes
- The properties of a neural network with one hidden layer converge to those of a Gaussian process as the number of hidden neurons tends to infinity, if standard weight priors are assumed (Neal (1996)).
- The covariance of the corresponding Gaussian process depends on the priors on the weights and on the activation functions of the hidden units in the network.
$$f(x) = b + \sum_{j=1}^{H} v_j\, h(x; u_j), \qquad b \sim N(0, \sigma_b^2), \quad v_j \sim N(0, \sigma_v^2), \quad u_j \sim \text{i.i.d.}$$
$$E[f(x)] = 0$$
$$E[f(x) f(x')] = \sigma_b^2 + \sum_j \sigma_v^2\, E[h(x; u_j)\, h(x'; u_j)] = \sigma_b^2 + H \sigma_v^2\, E_u[h(x; u)\, h(x'; u)]$$
If $H \to \infty$ with $\sigma_v = \omega H^{-1/2}$, then $f(x) \to N(0, C(\cdot))$ by the CLT, with
$$C(x, x') = \sigma_b^2 + \omega^2\, E_u[h(x; u)\, h(x'; u)]$$
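A minimal Monte Carlo sketch of Neal's construction: under the scaling σ_v = ωH^{-1/2}, the covariance E[f(x)f(x')] equals σ_b² + ω²E_u[h h'] for every H, so the estimates below should agree across widths while the CLT makes f increasingly Gaussian. The tanh units, the N(0, I) prior on u_j, and the scale values are illustrative assumptions.

```python
import numpy as np

# Sample many one-hidden-layer networks f(x) = b + sum_j v_j h(x; u_j)
# with b ~ N(0, sigma_b^2), v_j ~ N(0, omega^2 / H), and
# h(x; u) = tanh(u1 * x + u2), u ~ N(0, I) (assumed choices).

rng = np.random.default_rng(5)
x, x_prime = 0.3, 1.2
sigma_b, omega, S = 1.0, 1.0, 2000      # S = number of sampled networks

for H in (10, 100, 1000):
    b = sigma_b * rng.standard_normal(S)
    v = (omega / np.sqrt(H)) * rng.standard_normal((S, H))
    u = rng.standard_normal((S, H, 2))  # (input weight, bias) per hidden unit
    h = lambda xx: np.tanh(u[..., 0] * xx + u[..., 1])
    fx = b + (v * h(x)).sum(axis=1)
    fxp = b + (v * h(x_prime)).sum(axis=1)
    # Empirical E[f(x) f(x')]: the same for every H under the 1/sqrt(H)
    # scaling; only the Gaussianity of f improves as H grows (CLT).
    print(H, np.mean(fx * fxp))
```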
Classification Problems
- Estimate the posterior $\pi(x) = p(k \mid x)$ for each class $k$, with
  $$0 \le p(k \mid x) \le 1 \text{ for all } k, \qquad \sum_k p(k \mid x) = 1$$
- Find a distribution for $\pi(x)$ by placing a Gaussian process prior on the activation $y(x)$ and passing it through a logistic regression.
- Make a prediction for a test input $x_*$ by
  $$\hat\pi_* = \int \pi_*\, p(\pi_* \mid t, \dots)\, d\pi_*$$
  $$p(y_* \mid t) = \int p(y_*, y \mid t)\, dy = \frac{1}{p(t)} \int p(y_*, y)\, p(t \mid y)\, dy$$
  $$\text{Note: } p(y_*, y, t) = p(y_*, y)\, p(t \mid y_*, y) = p(y_*, y)\, p(t \mid y)$$
  (Apply the appropriate Jacobian to the above for a distribution of $\pi(x_*)$.)
- When $p(t \mid y)$ is Gaussian: an exact expression exists. When $p(t \mid y) = \prod_i \pi_i^{t_i} (1 - \pi_i)^{1 - t_i}$: no exact expression (use an analytic approximation or MCMC).
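A minimal sketch of the last step: once $p(y_* \mid t)$ has been approximated by a Gaussian N(mu, s2) (e.g., by an analytic approximation or MCMC, as the slide suggests), the predictive probability averages the logistic function over it. The values of mu and s2 below are assumed examples, not computed from data.

```python
import numpy as np

# Monte Carlo average of the logistic link under an (assumed)
# Gaussian approximation to p(y* | t).

rng = np.random.default_rng(6)
mu, s2 = 0.8, 0.5
y_star = mu + np.sqrt(s2) * rng.standard_normal(100_000)
pi_hat = np.mean(1.0 / (1.0 + np.exp(-y_star)))  # pi*_hat = E[sigmoid(y*)]
print(pi_hat)  # smaller than sigmoid(mu): the variance pulls it toward 0.5
```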
Active Data Selection (1)
- Maximizing the expected information gain criterion (MacKay (1992)): select the datum with the maximum predictive variance,
  $$\mathrm{Var}[Y_*] = \phi(x_*)^T \Sigma_w \phi(x_*) - \phi(x_*)^T P\, \Phi^T \Sigma_w \phi(x_*)$$
- Minimizing the error of $\hat y(x)$ (Cohn (1996)): minimize the overall variance.
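A minimal sketch of the MacKay-style criterion: greedily query the candidate input with the largest GP predictive variance. The squared-exponential covariance, the toy target sin(2x), and the noise level are illustrative assumptions.

```python
import numpy as np

# Greedy active selection by maximum predictive variance. Note that the
# GP predictive variance does not depend on the observed targets.

def cov(a, b, length=0.7):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

rng = np.random.default_rng(7)
candidates = np.linspace(0, 5, 200)
f = lambda xx: np.sin(2 * xx)
noise2 = 0.01

X = [rng.uniform(0, 5)]                       # first datum selected at random
t = [f(X[0]) + np.sqrt(noise2) * rng.standard_normal()]
for _ in range(10):
    Xa = np.array(X)
    K = cov(Xa, Xa) + noise2 * np.eye(len(Xa))
    ks = cov(candidates, Xa)                  # k* for every candidate
    # Var[Y*] = k** - k*^T K^{-1} k*, with k** = 1 for this kernel
    var = 1.0 - np.einsum('ij,ij->i', ks @ np.linalg.inv(K), ks)
    x_new = candidates[np.argmax(var)]        # maximum predictive variance
    X.append(x_new)
    t.append(f(x_new) + np.sqrt(noise2) * rng.standard_normal())
print(np.round(sorted(X), 2))                 # queries spread to cover [0, 5]
```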
Active Data Selection (2)
(Figure) (a) Target function drawn from a covariance function. (b) Expected change of the average variance over x for 100 reference points.
Active Data Selection (3)
- Experiments:
  - The first datum is selected at random.
  - 150 data points are selected actively.
  - 500 reference points are used for error evaluation.
  - The optimum query was selected using 300 random reference points.
Active Data Selection (4)
- For real data: pumadyn-8nm (Puma 560 robot arm).
- 250 data points for active selection, 400 reference points.