Gaussian Process and Prediction
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)
Outline
- Gaussian Process and Bayesian Regression
  - Bayesian regression
  - Weight-space view
  - Function-space view
  - Spline smoothing
  - Neural networks
  - Classification problems
- Active Data Selection
  - Maximizing the expected information gain
  - Minimizing the regression error
  - Experimental results
- Mixtures of Gaussian Processes
Gaussian Process and Bayesian Regression (1)
- A distribution of y in Bayesian regression:
  $$p(y \mid D) = \sum_H p(y \mid H)\, p(H \mid D) \qquad (H: \text{a hypothesis},\ D: \text{data})$$
- Generalized linear regression:
  $$y(x) = \sum_{i=1}^{m} w_i \phi_i(x) = W^T \phi(x), \quad \text{for a fixed set of } m \text{ basis functions } \{\phi_i(x)\}$$
- Weight-space view:
  $$W \sim N(0, \Sigma_w), \qquad t_n = y(x_n) + e_n, \quad e_n \sim N(0, \sigma^2), \qquad D = \{(x_1, t_1), \dots, (x_n, t_n)\}$$
  $$p(W \mid D) = p(D \mid W)\, p(W) \Big/ \int p(D \mid W)\, p(W)\, dW$$
  Choose $W_{MP}$ by minimizing the negative log posterior:
  $$E(W) = \frac{1}{2\sigma^2} \sum_i \bigl(t_i - W^T \phi(x_i)\bigr)^2 + \frac{1}{2} W^T \Sigma_w^{-1} W, \qquad y_{MP}(x_*) = W_{MP}^T \phi(x_*)$$
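As a concrete illustration of the weight-space view, here is a minimal numpy sketch: Gaussian bump basis functions, a toy sine target, and the closed-form minimizer of E(W) via ridge-style normal equations. The basis, data, and hyperparameter values are assumptions for the example, not from the slides.

```python
import numpy as np

# Weight-space view: MAP weights for generalized linear regression
# with prior W ~ N(0, Sigma_w) and noise e_n ~ N(0, sigma^2).

rng = np.random.default_rng(0)

def phi(x, centers, width=0.5):
    """Fixed set of m Gaussian basis functions (illustrative choice)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

# toy data: t = sin(x) + noise
x = rng.uniform(-3, 3, 20)
t = np.sin(x) + 0.1 * rng.standard_normal(20)

centers = np.linspace(-3, 3, 7)          # m = 7 basis functions
Phi = phi(x, centers)                    # n x m design matrix
sigma2, Sigma_w = 0.1**2, np.eye(7)      # noise variance, weight prior

# Minimizing the negative log posterior E(W) is regularized least
# squares with the closed-form solution
# W_MP = (Phi^T Phi / sigma^2 + Sigma_w^{-1})^{-1} Phi^T t / sigma^2.
A = Phi.T @ Phi / sigma2 + np.linalg.inv(Sigma_w)
W_MP = np.linalg.solve(A, Phi.T @ t / sigma2)

x_star = np.array([0.5])
y_MP = phi(x_star, centers) @ W_MP       # y_MP(x*) = W_MP^T phi(x*)
print(y_MP)
```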
Gaussian Process and Bayesian Regression (2)
- Function-space view
  Since Y(x) is a linear combination of the Gaussian random variables $W \sim N(0, \Sigma_w)$, $\{Y_x\}$ is a Gaussian process with mean and covariance functions
  $$E[Y] = 0, \qquad E[Y_x Y_{x'}] = \phi(x)^T \Sigma_w \phi(x')$$
  $Y_* = Y(x_*)$ can be predicted from the conditional distribution $p(Y_* \mid t_1, \dots, t_n)$, where $t_n = y(x_n) + e_n$:
  $$E[Y_* \mid t] = \phi(x_*)^T P\, t, \qquad \mathrm{Var}[Y_* \mid t] = \phi(x_*)^T \Sigma_w \phi(x_*) - \phi(x_*)^T P\, \Phi^T \Sigma_w \phi(x_*)$$
  where $\Phi = \bigl(\phi(x_1), \dots, \phi(x_n)\bigr)$ is the $m \times n$ design matrix and $P = \Sigma_w \Phi \bigl(\Phi^T \Sigma_w \Phi + \sigma^2 I\bigr)^{-1}$.
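The same toy problem from the function-space side, as a minimal sketch: build the n x n covariance of the observed function values and read off E[Y* | t] and Var[Y* | t] using the matrix P defined above. The basis and data are the same illustrative assumptions as in the previous sketch; the predictive mean agrees with W_MP^T phi(x*).

```python
import numpy as np

# Function-space view: predictive mean and variance of Y* = Y(x*).

rng = np.random.default_rng(0)

def phi(x, centers, width=0.5):
    return np.exp(-0.5 * ((np.atleast_1d(x)[:, None] - centers[None, :]) / width) ** 2)

x = rng.uniform(-3, 3, 20)
t = np.sin(x) + 0.1 * rng.standard_normal(20)
centers = np.linspace(-3, 3, 7)
sigma2, Sigma_w = 0.1**2, np.eye(7)

Phi = phi(x, centers).T                              # m x n: (phi(x_1) ... phi(x_n))
K = Phi.T @ Sigma_w @ Phi + sigma2 * np.eye(len(x))  # covariance of the targets
phi_s = phi(0.5, centers).T                          # phi(x*), m x 1

P = Sigma_w @ Phi @ np.linalg.inv(K)                 # P = Sigma_w Phi (Phi^T Sigma_w Phi + sigma^2 I)^{-1}
mean = (phi_s.T @ P @ t).item()                      # E[Y* | t]
var = (phi_s.T @ Sigma_w @ phi_s
       - phi_s.T @ P @ Phi.T @ Sigma_w @ phi_s).item()
print(mean, var)                                     # mean matches W_MP^T phi(x*) above
```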
Gaussian Process and Bayesian Regression (3)
- The weight-space view and the function-space view give the same results.
- For a smaller number of basis functions the weight-space view is preferred, while for a larger number of basis functions the function-space view (Gaussian process view) is better.
- Cf. the nonparametric kernel estimator for a density p(y):
  $$\hat p(y) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{y - X_i}{h}\right), \qquad \text{where } h \text{ is the bandwidth}$$
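A minimal sketch of the kernel density estimator above, assuming a Gaussian kernel K and an arbitrary bandwidth h = 0.3:

```python
import numpy as np

# Nonparametric kernel density estimate p_hat(y) = (1/(n h)) sum_i K((y - X_i)/h).

def kde(y, X, h):
    K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return K((y - X[:, None]) / h).sum(axis=0) / (len(X) * h)

X = np.random.default_rng(1).standard_normal(500)   # sample from N(0, 1)
print(kde(np.array([0.0, 1.0]), X, h=0.3))          # compare with true N(0,1): 0.399, 0.242
```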
Spline Smoothing (1)
- Interpolating spline
  An interpolating spline is a cubic polynomial defined piecewise between adjacent knots, with continuous second derivatives (Schoenberg (1964)):
  $$\min_{r \in C^2[x_1, x_n]} \int_{x_1}^{x_n} r''(x)^2\, dx \qquad \text{s.t. } r(x_i) = y_i, \quad i = 1, \dots, n$$
- Smoothing spline
  The smoothing spline is also a cubic spline (Reinsch (1967)):
  $$\hat r(x) = \arg\min_r S(r), \qquad S(r) = \sum_{i=1}^{n} \bigl[y_i - r(x_i)\bigr]^2 + \lambda \int_{x_1}^{x_n} r''(x)^2\, dx$$
  $\lambda \to 0$: the interpolating spline; $\lambda \to \infty$: the least-squares linear fit.
Spline Smoothing (2)
- Linear smoothing property of the smoothing spline: the fit is linear in the data,
  $$\hat r(x) = \sum_{i=1}^{n} w(x, x_i)\, y_i \qquad \bigl(\text{so } y = y^{(1)} + y^{(2)} \Rightarrow \hat r = \hat r^{(1)} + \hat r^{(2)}\bigr)$$
  If the design is equally spaced, then all of the n component smoothing splines are identical in shape, and the shape converges to the kernel (Silverman (1984))
  $$K(t) = \frac{1}{2} \exp\!\left(-\frac{|t|}{\sqrt{2}}\right) \sin\!\left(\frac{|t|}{\sqrt{2}} + \frac{\pi}{4}\right)$$
- Cf. nonparametric kernel regression (Nadaraya (1964) and Watson (1964)):
  $$r(x) = E[Y \mid X = x] = \frac{\int y\, f(x, y)\, dy}{\int f(x, y)\, dy}$$
  $$\hat r_h(x) = \sum_{i=1}^{n} w_h(x, x_i)\, y_i, \qquad w_h(x, x_i) = K\!\left(\frac{x - x_i}{h}\right) \Big/ \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)$$
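A minimal sketch of the Nadaraya-Watson estimator above; the Gaussian kernel, bandwidth, and toy data are illustrative assumptions:

```python
import numpy as np

# Nadaraya-Watson regression: r_hat_h(x) = sum_i w_h(x, x_i) y_i,
# with kernel weights normalized over the sample.

def nadaraya_watson(x_query, x, y, h):
    K = lambda u: np.exp(-0.5 * u**2)            # unnormalized Gaussian kernel
    W = K((x_query[:, None] - x[None, :]) / h)   # K((x - x_i) / h)
    W /= W.sum(axis=1, keepdims=True)            # w_h(x, x_i): normalize over i
    return W @ y

rng = np.random.default_rng(2)
x = rng.uniform(0, 2 * np.pi, 200)
y = np.sin(x) + 0.2 * rng.standard_normal(200)
print(nadaraya_watson(np.array([np.pi / 2]), x, y, h=0.3))  # close to sin(pi/2) = 1
```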
Spline Smoothing (3)
- The spline estimation procedure can be interpreted as Bayesian MAP estimation:
  $$M(y(x)) = \frac{1}{2} \sum_{n=1}^{N} \bigl(y(x^{(n)}) - t_n\bigr)^2 + \frac{\lambda}{2} \int \bigl[y^{(p)}(x)\bigr]^2\, dx$$
- When p = 2, the resulting $\hat y(x)$ is a cubic spline (a piecewise cubic function that has knots at the data points $\{x^{(n)}\}$).
- Model: $y(x)$; Data: $(x^{(n)}, t_n)$, $n = 1, \dots, N$.
  Prior for $y(x)$:
  $$\log p(y(x) \mid \lambda) = -\frac{\lambda}{2} \int \bigl[y^{(p)}(x)\bigr]^2\, dx + \text{const.}$$
  Likelihood:
  $$\log p(t \mid y(x), \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \bigl(y(x^{(n)}) - t_n\bigr)^2 + \text{const.}$$
  Posterior:
  $$M(y(x)) = -\log p(y(x) \mid t_1, \dots, t_N) + \text{const.} \qquad \text{(Bayesian MAP)}$$
Spline Smoothing (4)
- Spline priors are Gaussian processes.
- Gaussian process:
  $$p(y(x) \mid \mu(x), A) = \frac{1}{Z} \exp\Bigl[-\frac{1}{2} \bigl(y(x) - \mu(x)\bigr)^T A \bigl(y(x) - \mu(x)\bigr)\Bigr]$$
  where $\mu(x)$: the mean function, $A$: a linear operator, and the inner product is $y(x)^T z(x) = \int y(x)\, z(x)\, dx$.
- The spline prior
  $$\log p(y(x) \mid \lambda) = -\frac{\lambda}{2} \int \bigl[y^{(p)}(x)\bigr]^2\, dx + \text{const.} = -\frac{1}{2}\, y(x)^T A\, y(x) + \text{const.}$$
  is of this form with $\mu(x) = 0$ and $A = \lambda (D^p)^T D^p \ge 0$ (i.e., $y(x)^T A\, y(x) \ge 0$ for all $y(x)$).
Spline Smoothing (5)
Splines correspond to Gaussian processes with a particular choice of covariance function.
$$\bigl(Z_1, \dots, Z_n, Z_*\bigr) \sim N(0, K_+), \qquad K_+ = \begin{pmatrix} K & k_* \\ k_*^T & k_{**} \end{pmatrix}$$
Given $Z = (z_1, \dots, z_n)^T$:
$$E[Z_* \mid Z = z] = k_*^T K^{-1} z, \qquad \text{or } \hat Z_* = \sum_{i=1}^{n} c_i\, C(x_*, x_i) \ \text{with } c = K^{-1} z$$
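A minimal sketch of this conditional-mean prediction, assuming a squared-exponential covariance function (not the spline covariance itself) and toy data:

```python
import numpy as np

# GP prediction directly from a covariance function:
# E[Z* | z] = k*^T K^{-1} z, or Z*_hat = sum_i c_i C(x*, x_i) with c = K^{-1} z.

def cov(a, b, length=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 15)
z = np.sin(x) + 0.05 * rng.standard_normal(15)
noise2 = 0.05**2

K = cov(x, x) + noise2 * np.eye(15)   # add observation noise to the diagonal
c = np.linalg.solve(K, z)             # c = K^{-1} z
x_star = np.array([2.5])
z_star = cov(x_star, x) @ c           # Z*_hat = sum_i c_i C(x*, x_i)
print(z_star)
```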
Known covariance functions for modeling (examples)
Covariance Functions with Unknown Parameters
- For a smaller number of parameters: choose a parametric family of covariance functions and estimate the parameters by maximizing the log likelihood.
- For a larger number of parameters, or when local maxima are a concern: integrate over a prior distribution of the parameters numerically.
$$l = \log p(D \mid \theta) = -\frac{1}{2} \log |K| - \frac{1}{2} t^T K^{-1} t - \frac{n}{2} \log 2\pi$$
$$\frac{\partial l}{\partial \theta_i} = -\frac{1}{2} \mathrm{tr}\Bigl(K^{-1} \frac{\partial K}{\partial \theta_i}\Bigr) + \frac{1}{2}\, t^T K^{-1} \frac{\partial K}{\partial \theta_i} K^{-1} t$$
$$p(y_* \mid D) = \int p(y_* \mid D, \theta)\, p(\theta \mid D)\, d\theta$$
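A minimal sketch of the log likelihood and its gradient for a one-parameter family, assuming a squared-exponential covariance with length-scale parameter and a fixed noise level:

```python
import numpy as np

# GP hyperparameter log likelihood l and gradient dl/d(length) for
# K = exp(-d^2 / (2 length^2)) + noise^2 I (assumed parametric family).

def log_lik_and_grad(length, x, t, noise=0.1):
    n = len(x)
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-0.5 * d2 / length**2) + noise**2 * np.eye(n)
    Kinv = np.linalg.inv(K)
    _, logdet = np.linalg.slogdet(K)
    l = -0.5 * logdet - 0.5 * t @ Kinv @ t - 0.5 * n * np.log(2 * np.pi)
    dK = np.exp(-0.5 * d2 / length**2) * d2 / length**3   # dK / d(length)
    # dl = -0.5 tr(K^{-1} dK) + 0.5 t^T K^{-1} dK K^{-1} t
    dl = -0.5 * np.trace(Kinv @ dK) + 0.5 * t @ Kinv @ dK @ Kinv @ t
    return l, dl

rng = np.random.default_rng(4)
x = np.linspace(0, 5, 30)
t = np.sin(x) + 0.1 * rng.standard_normal(30)
print(log_lik_and_grad(1.0, x, t))   # climb the gradient to maximize l
```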
Multilayer Neural Networks and Gaussian Processes
- The properties of a neural network with one hidden layer converge to those of a Gaussian process as the number of hidden neurons tends to infinity, if standard weight priors are assumed (Neal (1996)).
- The covariance of the corresponding Gaussian process depends on the priors on the weights and on the activation functions of the hidden units in the network.
$$f(x) = b + \sum_{j=1}^{H} v_j\, h(x; u_j), \qquad b \sim N(0, \sigma_b^2), \quad v_j \sim N(0, \sigma_v^2), \quad u_j \sim \text{i.i.d.}$$
$$E[f(x)] = 0$$
$$E[f(x) f(x')] = \sigma_b^2 + \sum_j \sigma_v^2\, E[h(x; u_j)\, h(x'; u_j)] = \sigma_b^2 + H \sigma_v^2\, E_u[h(x; u)\, h(x'; u)]$$
If $H \to \infty$ with $\sigma_v = \omega H^{-1/2}$, then $f(x) \to N(0, C(\cdot))$ by the CLT, with
$$C(x, x') = \sigma_b^2 + \omega^2\, E_u[h(x; u)\, h(x'; u)]$$
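A minimal Monte Carlo sketch of Neal's construction: under the scaling σ_v = ωH^{-1/2}, the covariance E[f(x)f(x')] equals σ_b² + ω²E_u[h h'] for every H, so the estimates below should agree across widths while the CLT makes f increasingly Gaussian. The tanh units, the N(0, I) prior on u_j, and the scale values are illustrative assumptions.

```python
import numpy as np

# Sample many one-hidden-layer networks f(x) = b + sum_j v_j h(x; u_j)
# with b ~ N(0, sigma_b^2), v_j ~ N(0, omega^2 / H), and
# h(x; u) = tanh(u1 * x + u2), u ~ N(0, I) (assumed choices).

rng = np.random.default_rng(5)
x, x_prime = 0.3, 1.2
sigma_b, omega, S = 1.0, 1.0, 2000      # S = number of sampled networks

for H in (10, 100, 1000):
    b = sigma_b * rng.standard_normal(S)
    v = (omega / np.sqrt(H)) * rng.standard_normal((S, H))
    u = rng.standard_normal((S, H, 2))  # (input weight, bias) per hidden unit
    h = lambda xx: np.tanh(u[..., 0] * xx + u[..., 1])
    fx = b + (v * h(x)).sum(axis=1)
    fxp = b + (v * h(x_prime)).sum(axis=1)
    # Empirical E[f(x) f(x')]: the same for every H under the 1/sqrt(H)
    # scaling; only the Gaussianity of f improves as H grows (CLT).
    print(H, np.mean(fx * fxp))
```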
Classification Problems
- Estimate the posterior $\pi(x) = p(k \mid x)$ for each class $k$, with
  $$0 \le p(k \mid x) \le 1 \text{ for all } k, \qquad \sum_k p(k \mid x) = 1$$
- Find a distribution for $\pi(x)$ by placing a Gaussian process prior on the activation $y(x)$ and passing it through a logistic regression.
- Make a prediction for a test input $x_*$ by
  $$\hat\pi_* = \int \pi_*\, p(\pi_* \mid t, \dots)\, d\pi_*$$
  $$p(y_* \mid t) = \int p(y_*, y \mid t)\, dy = \frac{1}{p(t)} \int p(y_*, y)\, p(t \mid y)\, dy$$
  $$\text{Note: } p(y_*, y, t) = p(y_*, y)\, p(t \mid y_*, y) = p(y_*, y)\, p(t \mid y)$$
  (Apply the appropriate Jacobian to the above for a distribution of $\pi(x_*)$.)
- When $p(t \mid y)$ is Gaussian: an exact expression exists. When $p(t \mid y) = \prod_i \pi_i^{t_i} (1 - \pi_i)^{1 - t_i}$: no exact expression (use an analytic approximation or MCMC).
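A minimal sketch of the last step: once $p(y_* \mid t)$ has been approximated by a Gaussian N(mu, s2) (e.g., by an analytic approximation or MCMC, as the slide suggests), the predictive probability averages the logistic function over it. The values of mu and s2 below are assumed examples, not computed from data.

```python
import numpy as np

# Monte Carlo average of the logistic link under an (assumed)
# Gaussian approximation to p(y* | t).

rng = np.random.default_rng(6)
mu, s2 = 0.8, 0.5
y_star = mu + np.sqrt(s2) * rng.standard_normal(100_000)
pi_hat = np.mean(1.0 / (1.0 + np.exp(-y_star)))  # pi*_hat = E[sigmoid(y*)]
print(pi_hat)  # smaller than sigmoid(mu): the variance pulls it toward 0.5
```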
Active Data Selection (1)
- Maximizing the expected information gain criterion (MacKay (1992)): select the datum with the maximum predictive variance,
  $$\mathrm{Var}[Y_*] = \phi(x_*)^T \Sigma_w \phi(x_*) - \phi(x_*)^T P\, \Phi^T \Sigma_w \phi(x_*)$$
- Minimizing the error of $\hat y(x)$ (Cohn (1996)): minimize the overall variance.
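A minimal sketch of the MacKay-style criterion: greedily query the candidate input with the largest GP predictive variance. The squared-exponential covariance, the toy target sin(2x), and the noise level are illustrative assumptions.

```python
import numpy as np

# Greedy active selection by maximum predictive variance. Note that the
# GP predictive variance does not depend on the observed targets.

def cov(a, b, length=0.7):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

rng = np.random.default_rng(7)
candidates = np.linspace(0, 5, 200)
f = lambda xx: np.sin(2 * xx)
noise2 = 0.01

X = [rng.uniform(0, 5)]                       # first datum selected at random
t = [f(X[0]) + np.sqrt(noise2) * rng.standard_normal()]
for _ in range(10):
    Xa = np.array(X)
    K = cov(Xa, Xa) + noise2 * np.eye(len(Xa))
    ks = cov(candidates, Xa)                  # k* for every candidate
    # Var[Y*] = k** - k*^T K^{-1} k*, with k** = 1 for this kernel
    var = 1.0 - np.einsum('ij,ij->i', ks @ np.linalg.inv(K), ks)
    x_new = candidates[np.argmax(var)]        # maximum predictive variance
    X.append(x_new)
    t.append(f(x_new) + np.sqrt(noise2) * rng.standard_normal())
print(np.round(sorted(X), 2))                 # queries spread to cover [0, 5]
```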
Active Data Selection (2)
(Figure) (a) Target function drawn from a covariance function. (b) Expected change of the average variance over x for 100 reference points.
Active Data Selection (3)
- Experiments:
  - The first datum is selected at random.
  - 150 data points are selected actively.
  - 500 reference points are used for error evaluation.
  - The optimum query was selected using 300 random reference points.
Active Data Selection (4)
- For real data: pumadyn-8nm (Puma 560 robot arm).
- 250 data points for active selection, 400 reference points.