Upload
cornelia-walsh
View
220
Download
2
Embed Size (px)
Citation preview
Introduction to Radial Basis Function Networks
Content
Overview The Models of Function Approximator The Radial Basis Function Networks RBFN’s for Function Approximation The Projection Matrix Learning the Kernels Bias-Variance Dilemma The Effective Number of Parameters Model Selection Incremental Operations
Introduction to Radial Basis Function Networks
Overview
Typical Applications of NN
Pattern Classification
Function Approximation
Time-Series Forecasting
( )l f xl C N
( )fy x mY R y
mX R x
nX R x
1 2 3( ) ( , , , )t t tft x x x x
Function Approximation
mX R fnY R
:f X YUnknown
mX R nY R
ˆ :f X YApproximator
f
NeuralNetwork
Supervised Learning
UnknownFunction
ixiy
ˆ iy1,2,i +
+
ie
Neural Networks as Universal Approximators
Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy.
– Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.
Like feedforward neural networks with a single hidden layer of sigmoidal units, it can be shown that RBF networks are universal approximators.
– Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257.
– Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.
Statistics vs. Neural Networks
Statistics Neural Networksmodel networkestimation learningregression supervised learninginterpolation generalizationobservations training setparameters (synaptic) weightsindependent variables inputsdependent variables outputsridge regression weight decay
Introduction to Radial Basis Function Networks
The Model ofFunction Approximator
Linear Models
1
(( ) )ii
i
m
f w
x xFixed BasisFunctions
Weights
Linear Models
Feature VectorsInputs
HiddenUnits
OutputUnits
• Decomposition• Feature Extraction• Transformation
Linearly weighted output
1
(( ) )ii
i
m
f w
x x
y
11 22 mm
x1 x2 xn
w1 w2 wm
x =
Linear Models
Feature VectorsInputs
HiddenUnits
OutputUnits
• Decomposition• Feature Extraction• Transformation
Linearly weighted output
y
11 22 mm
x1 x2 xn
w1 w2 wm
x =
Can you say some bases?Can you say some bases?
1
(( ) )ii
i
m
f w
x x
Example Linear Models
Polynomial
Fourier Series
( ) ii
iwf xx
0exp 2( )k
k jf w k xx
Are they orthogonal bases?Are they orthogonal bases?
( ) , 0,1,2,ii ix x
0( ) exp 2 0,1,, 2, k x j k x k
1
(( ) )ii
i
m
f w
x x
y
11 22 mm
x1 x2 xn
w1 w2 wm
x =
Single-Layer Perceptrons as Universal Aproximators
HiddenUnits
With sufficient number of sigmoidal units, it can be a universal approximator.
1
(( ) )ii
i
m
f w
x x
y
11 22 mm
x1 x2 xn
w1 w2 wm
x =
Radial Basis Function Networks as Universal Aproximators
HiddenUnits
With sufficient number of radial-basis-function units, it can also be a universal approximator.
1
(( ) )ii
i
m
f w
x x
Non-Linear Models
1
(( ) )ii
i
m
f w
x xAdjusted by the
Learning process
Weights
1
(( ) )ii
i
m
f w
x x
Introduction to Radial Basis Function Networks
The Radial Basis Function Networks
Radial Basis Functions1
(( ) )ii
i
m
f w
x x
Center
Distance Measure
Shape
Three parameters for a radial function:
xi
r = ||x xi||
i(x)= (||x xi||)
Typical Radial Functions
Gaussian
Hardy Multiquadratic
Inverse Multiquadratic
2
22 0 and r
r e r
2 2 0 and r r c c c r
2 2 0 and r c r c c r
Gaussian Basis Function (=0.5,1.0,1.5)
2
22 0 and r
r e r
0.5 1.0
1.5
Inverse Multiquadratic
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
-10 -5 0 5 10
2 2 0 and r c r c c r
c=1
c=2c=3
c=4c=5
++
Most General RBF
1( ) ( )i iT
i μ x μx x
+1μ
2μ 3μ
Basis {i: i =1,2,…} is `near’ orthogonal.
Properties of RBF’s
On-Center, Off Surround
Analogies with localized receptive fields found in several biological structures, e.g.,– visual cortex;– ganglion cells
The Topology of RBF
Feature Vectorsx1 x2 xn
y1 ym
Inputs
HiddenUnits
OutputUnits
Projection
Interpolation
As a function approximator ( )y f x
The Topology of RBF
Feature Vectorsx1 x2 xn
y1 ym
Inputs
HiddenUnits
OutputUnits
Subclasses
Classes
As a pattern classifier.
Introduction to Radial Basis Function Networks
RBFN’s forFunction Approximation
The idea
x
y Unknown Function to Approximate
TrainingData
The idea
x
y Unknown Function to Approximate
TrainingData
Basis Functions (Kernels)
The idea
x
y
Basis Functions (Kernels)
Function Learned
1
)) ((m
iiiwy f
xx
The idea
x
y
Basis Functions (Kernels)
Function Learned
NontrainingSample
1
)) ((m
iiiwy f
xx
The idea
x
y Function Learned
NontrainingSample
1
)) ((m
iiiwy f
xx
Radial Basis Function Networks as Universal Aproximators
1
( ) ( )m
ii
iy wf
x x
x1 x2 xn
w1 w2 wm
x =
Training set ( )
1
( ),p
k
k
ky
xT
Goal (( ))k kfy x for all k
2
( )
1
( )min kp
k
k
SSE fy
x
2
)(
1
) (
1
p mk
ik
ki
i
y w
x
Learn the Optimal Weight Vector
w1 w2 wm
x1 x2 xnx =
Training set ( )
1
( ),p
k
k
ky
xT
Goal (( ))k kfy x for all k
1
( ) ( )m
ii
iy wf
x x
2
( )
1
( )min kp
k
k
SSE fy
x
2
)(
1
) (
1
p mk
ik
ki
i
y w
x
2
( )
1
( )min kp
k
k
SSE fy
x
2
)(
1
) (
1
p mk
ik
ki
i
y w
x
Regularization
Training set ( )
1
( ),p
k
k
ky
xT
Goal (( ))k kfy x for all k
2
1
m
iiiw
2
1
m
iiiw
0i
If regularization is unneeded, set 0i
C
Learn the Optimal Weight Vector
22( )
1
( )
1
p mk
ki
ii
k wfyC
xMinimize
1
( ) ( )m
ii
iwf
x x
(
( )
) ( )
1
2 2kp
jk
j
k
k
j j
wfC
fw w
y
x
x
( ) ( ) ( )
1
2 2p
k kj
kjj
k f wy
x x
0
( ) * ( ) ( )
1 1
)* (p p
k k kj j
k kjj
kwf y
x x x
*
1
*( ) ( )i
m
ii
wf
x x
Learn the Optimal Weight Vector
(1) ( ), ,T
pj j j φ x x
* * (1) * ( ), ,T
pf ff x xDefine
(1) ( ), ,Tpy yy
* *j
T Tj jjw φ f φ y 1, ,j m
( ) * ( ) ( )
1 1
)* (p p
k k kj j
k kjj
kwf y
x x x
1
( ) ( )m
ii
iwf
x x *
1
*( ) ( )i
m
ii
wf
x x
Learn the Optimal Weight Vector
*1 1
*2 2
*1
*
*
1
2 2
*
T T
T T
T Tm mmm
w
w
w
φ f φ y
φ f φ y
φ f φ y
* *j
T Tj jjw φ f φ y 1, ,j m
1 2, , , mΦ φ φ φ
1
2
m
Λ
* * *1 2* , , ,
T
mw w ww
Define
**T T wΦ Λf Φ y
Learn the Optimal Weight Vector
**T T wΦ Λf Φ y
* (1)
* (2)*
* ( )p
f
f
f
x
xf
x
*
1
*( ) ( )i
m
ii
wf
x x
(1)
1
(2)
1
(
*
*
1
*
)
k
k
m
kk
m
kk
mP
kk
k
w
w
w
x
x
x
(1) (1) (1)*1 21
(2) (2) (2) *1 2 2
*( ) ( ) ( )
1 2
m
m
p p pm
m
w
w
w
x x x
x x x
x x x
*Φw
* *T T ΛwΦ wΦ Φ y
Learn the Optimal Weight Vector
* *T T ΛwΦ wΦ Φ y
*T T wΦ ΛΦ Φ y
1* T T Λw Φ Φ Φ y
Variance Matrix
1 TA Φ y
1 :ADesign Matrix:Φ
Summary
1* T T Λw Φ Φ Φ y
1 TA Φ y
(1)
(2)
( )
j
jj
pj
x
xφ
x
(1)
(2)
( )p
y
y
y
y
1 2, , , mΦ φ φ φ
*1*2
*
*
m
w
w
w
w
1
2
m
Λ
1
)) ((m
iiiwy f
xx
Training set ( )
1
( ),p
k
k
ky
xT
Introduction to Radial Basis Function Networks
The Projection Matrix
The Empirical-Error Vector
1* Tw A Φ y
*1w *
2w *mw
( )kx ( )1
kx ( )2kx ( )k
nx( )kx ( )1
kx ( )2kx ( )k
nx
UnknownFunction
UnknownFunction
y ** f Φw
1φ 2φ mφ
The Empirical-Error Vector
1* Tw A Φ y
*1w *
2w *mw
( )kx ( )1
kx ( )2kx ( )k
nx( )kx ( )1
kx ( )2kx ( )k
nx
UnknownFunction
UnknownFunction
y ** f Φw
1φ 2φ mφ
Error Vector
* e y f * y Φw 1 T ΦAy Φ y
1 Tp
AI Φ Φ yPy
Sum-Squared-Error
e Py1 T
p P I Φ ΦA
T A Φ Φ Λ
1 2, , , mΦ φ φ φError Vector
T Py Py2Ty P y
2( ) * ( )
1
pk k
k
SSE y f
x
If =0, the RBFN’s learning algorithm is to minimize SSE (MSE).
The Projection Matrix
e Py1 T
p P I Φ ΦA
T A Φ Φ Λ
1 2, , , mΦ φ φ φError Vector
2TSSE y P y
1 2( , , , )mspane φ φ φΛ 0
( )P Py Pee Py
Ty Py
Introduction to Radial Basis Function Networks
Learning the Kernels
RBFN’s as Universal Approximators
x1 x2 xn
y1 yl
12 m
w11 w12w1m
wl1 wl2wlml
Training set
) ( )(
1,
pk
k
k
yxT
2
2( ) exp
2j
jj
x μx
Kernels
What to Learn?
Weights wij’s Centers j’s of j’s Widths j’s of j’s
Number of j’s Model Selection
x1 x2 xn
y1 yl
12 m
w11 w12w1m
wl1 wl2wlml
One-Stage Learning
( ) ( ) ( )1
k k kij i i jw y f x x
( ) ( )
1
lk k
i ij jj
f w
x x
( ) ( ) ( )2 2
1
ljk k k
j j ij i iij
w y f
x μμ x x
2
( ) ( ) ( )3 3
1
ljk k k
j j ij i iij
w y f
x μx x
One-Stage Learning
( ) ( )
1
lk k
i ij jj
f w
x xThe simultaneous updates of all three sets of parameters may be suitable for non-stationary environments or on-line setting.
The simultaneous updates of all three sets of parameters may be suitable for non-stationary environments or on-line setting.
( ) ( ) ( )1
k k kij i i jw y f x x
( ) ( ) ( )2 2
1
ljk k k
j j ij i iij
w y f
x μμ x x
2
( ) ( ) ( )3 3
1
ljk k k
j j ij i iij
w y f
x μx x
x1 x2 xn
y1 yl
12 m
w11 w12w1m
wl1 wl2wlml
Two-Stage Training
Step 1
Step 2
Determines Centers j’s of j’s. Widths j’s of j’s. Number of j’s.
Determines Centers j’s of j’s. Widths j’s of j’s. Number of j’s.
Determines wij’s.Determines wij’s.
E.g., using batch-learning.
Train the Kernels
+
+
+
+
+
Unsupervised Training
Subset Selection– Random Subset Selection– Forward Selection– Backward Elimination
Clustering Algorithms– KMEANS– LVQ
Mixture Models– GMM
Methods
Subset Selection
Random Subset Selection
Randomly choosing a subset of points from training set
Sensitive to the initially chosen points.
Using some adaptive techniques to tune– Centers– Widths– #points
Clustering Algorithms
Partition the data points into K clusters.
Partition the data points into K clusters.
Clustering Algorithms
Is such a partition satisfactory?
Is such a partition satisfactory?
Clustering Algorithms
How about this?How about this?
+
++
+
Clustering Algorithms
1
2
3
4
Introduction to Radial Basis Function Networks
Bias-Variance Dilemma
Goal Revisit
Minimize Prediction Error
• Ultimate Goal Generalization
• Goal of Our Learning Procedure
Minimize Empirical Error
Badness of Fit Underfitting
– A model (e.g., network) that is not sufficiently complex can fail to detect fully the signal in a complicated data set, leading to underfitting.
– Produces excessive bias in the outputs. Overfitting
– A model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting.
– Produces excessive variance in the outputs.
Underfitting/Overfitting Avoidance
Model selection Jittering Early stopping Weight decay
– Regularization– Ridge Regression
Bayesian learning Combining networks
Best Way to Avoid Overfitting
Use lots of training data, e.g.,– 30 times as many training cases as there are we
ights in the network.– for noise-free data, 5 times as many training ca
ses as weights may be sufficient.
Don’t arbitrarily reduce the number of weights for fear of underfitting.
Badness of Fit
Underfit Overfit
Badness of Fit
Underfit Overfit
Bias-Variance Dilemma
Underfit Overfit
Large bias Small bias
Small variance Large variance
However, it's not really a dilemma.
Underfit Overfit
Large bias Small bias
Small variance Large variance
Bias-Variance Dilemma
More on overfitting
Easily lead to predictions that are far beyond the range of the training data.
Produce wild predictions in multilayer perceptrons even with noise-free data.
More on overfitting
Easily lead to predictions that are far beyond the range of the training data.
Produce wild predictions in multilayer perceptrons even with noise-free data.
Bias-Variance Dilemma
It's not really a dilemma.
fitmore
overfitmore
underfit
Bias
Variance
Bias-Variance Dilemma
Sets of functionsE.g., depend on # hidden nodes used.
The true modelnoise
bias
Solution obtained with training set 2.
Solution obtained with training set 1.
Solution obtained with training set 3.
biasbias
The mean of the bias=?
The variance of the bias=?
Bias-Variance Dilemma
Sets of functionsE.g., depend on # hidden nodes used.
The true modelnoise
The mean of the bias=?
The variance of the bias=?
bias
Variance
Model Selection
Sets of functionsE.g., depend on # hidden nodes used.
The true modelnoise
bias
Variance
Reduce the effective number of parameters.
Reduce the number of hidden nodes.
Bias-Variance Dilemma
Sets of functionsE.g., depend on # hidden nodes used.
The true modelnoise
bias
( )g x( ) ( )y g x x
( )f x
Goal: 2min ( ) ( )E y f x x
Bias-Variance Dilemma
Goal: 2min ( ) ( )E y f x x
2( ) ( )E y f x x 2( ) ( )E g f x x
2 2( ) ( ) 2 ( ) ( )E g f g f x x x x
2 2( ) ( ) 2 ( ) ( )E g f E g f E x x x x
Goal: 2min ( ) ( )E g f x x 2min ( ) ( )E y f x x
0 constant
Bias-Variance Dilemma
2( ) ( )E g f x x 2[ ( )] [( ) () )( ]E f E fE g f x xx x
2[ ]( )) (E fE g x x 2[ ]( )) (E fE f x x
[ ( )] [ ( )) ( ]2 ( )E g fE f E f x xx x
2 2[ ( )] [ ( )]( ) ( ) [ ( )] [2 ( ) ( ) ( )]E f E f E f EE g g f ff x x x xx x x x
( ) ( )[ ( )] [ ( )]E E f Eg f f x xxx
( ) ( )E g f x x [ )]( ()EE fg x x [ ( ()] )E E f f x x [ ( )] [ ( )]E f E fE x x
( ) ( )E g E f x x [ )]( () EE fg x x [ ( )] ( )EE f f x x [ ( )] [ ( )]E f E f x x 0
2 22( ) ( ) ( ) ( )E y f E E g f x x x x
0
Bias-Variance Dilemma
2 22( ) ( ) ( ) ( )E y f E E g f x x x x
2 22 [ ( )] [ ( )( ( ]) )E fE E fE g E f x x xx
Goal: 2min ( ) ( )E y f x x
noise bias2 variance
Cannot be minimized
Minimize both bias2 and variance
Model Complexity vs. Bias-Variance
2 22( ) ( ) ( ) ( )E y f E E g f x x x x
2 22 [ ( )] [ ( )( ( ]) )E fE E fE g E f x x xx
Goal: 2min ( ) ( )E y f x x
noise bias2 variance
ModelComplexity(Capacity)
Bias-Variance Dilemma
2 22( ) ( ) ( ) ( )E y f E E g f x x x x
2 22 [ ( )] [ ( )( ( ]) )E fE E fE g E f x x xx
Goal: 2min ( ) ( )E y f x x
noise bias2 variance
ModelComplexity(Capacity)
Example (Polynomial Fits)
22exp( 16 )y x x
Example (Polynomial Fits)
Example (Polynomial Fits)
Degree 1 Degree 5
Degree 10 Degree 15
Introduction to Radial Basis Function Networks
The Effective Number of Parameters
Variance Estimation
Mean
Variance 22
1
1ˆ
p
ii
xp
In general, not available.
Variance Estimation
Mean1
1ˆ
p
ii
x xp
Variance 22 2
1
ˆ1
1ˆ
p
ii
s xp
Loss 1 degree of freedom
Simple Linear Regression
0 1y x 2~ (0, )N
01
y
x
y
x
Simple Linear Regression
01
y
x
y
x
01ˆ
ˆ
y
x
0 1y x 2~ (0, )N
21
ˆp
i ii
SSE y y
Minimize
Mean Squared Error (MSE)
01
y
x
y
x
01ˆ
ˆ
y
x
0 1y x 2~ (0, )N
21
ˆp
i ii
SSE y y
Minimize
2
2ˆ
SSEMSE
p
2
2ˆ
SSEMSE
p
Loss 2 degrees
of freedom
Variance Estimation
2ˆSSE
MSEmp
Loss m degrees of freedom
m: #parameters of the model
The Number of Parameters
w1 w2 wm
x1 x2 xnx =
1
( ) ( )m
ii
iy wf
x x
m#degrees of freedom:
The Effective Number of Parameters ()
w1 w2 wm
x1 x2 xnx =
1
( ) ( )m
ii
iy wf
x x The projection Matrix
1 Tp
P I Φ ΦA
T A Φ Φ Λ
Λ 0
( )p trace P m
The Effective Number of Parameters ()
The projection Matrix
1 Tp
P I Φ ΦA
T A Φ Φ Λ
Λ 0
1( ) Tptrace trace AP I Φ Φ
1 Tp trace ΦA Φ
( ) ( ) ( )trace trace trace A B A BFacts:( ) ( )trace traceAB BA
1 TTp trace
Φ Φ ΦΦ
Pf)
1T Tp trace
Φ ΦΦ Φ
mp trace I
p m ( )p trace P m
Regularization
((
1
)
2
)p
k
k
k fy
x2
1
m
iiiw
Empirial
Er
Model s
penalror ts
yCo t
2
( )
1 1
( )p m
ki
k
k
iiwy
x
Penalize models withlarge weightsSSE
( )p trace PThe effective number of parameters:
Regularization
((
1
)
2
)p
k
k
k fy
x2
1
m
iiiw
Empirial
Er
Model s
penalror ts
yCo t
2
( )
1 1
( )p m
ki
k
k
iiwy
x
Penalize models withlarge weightsSSE Without penalty
(i=0), there are m degrees of freedom
to minimize SSE (Cost).
The effective number of
parameters =m.
Without penalty (i=0), there are m
degrees of freedom to minimize SSE
(Cost).
The effective number of
parameters =m.
The effective number of parameters:
( )p trace P
Regularization
((
1
)
2
)p
k
k
k fy
x2
1
m
iiiw
Empirial
Er
Model s
penalror ts
yCo t
2
( )
1 1
( )p m
ki
k
k
iiwy
x
Penalize models withlarge weightsSSE
With penalty (i>0), the liberty to
minimize SSE will be reduced.
The effective number of
parameters <m.
With penalty (i>0), the liberty to
minimize SSE will be reduced.
The effective number of
parameters <m.
The effective number of parameters:
( )p trace P
Variance Estimation
2ˆ MSE
Loss degrees of freedom
The effective number of parameters:
( )p trace P
SSE
p
Variance Estimation
2ˆ MSE
The effective number of parameters:
( )p trace P
( )
SSE
trace
P
Introduction to Radial Basis Function Networks
Model Selection
Model Selection Goal
– Choose the fittest model Criteria
– Least prediction error Main Tools (Estimate Model Fitness)
– Cross validation– Projection matrix
Methods– Weight decay (Ridge regression)– Pruning and Growing RBFN’s
Empirical Error vs. Model Fitness
Minimize Prediction Error
• Ultimate Goal Generalization
• Goal of Our Learning Procedure
Minimize Empirical Error
Minimize Prediction Error
(MSE)
Estimating Prediction Error
When you have plenty of data use independent test sets– E.g., use the same training set to
train different models, and choose the best model by comparing on the test set.
When data is scarce, use– Cross-Validation– Bootstrap
Cross Validation
Simplest and most widely used method for estimating prediction error.
Partition the original set into several different ways and to compute an average score over the different partitions, e.g.,– K-fold Cross-Validation– Leave-One-Out Cross-Validation– Generalize Cross-Validation
K-Fold CV
Split the set, say, D of available input-output patterns into k mutually exclusive subsets, say D1, D2, …, Dk.
Train and test the learning algorithm k times, each time it is trained on D\Di and tested on Di.
K-Fold CV
Available DataAvailable Data
K-Fold CV
Available DataAvailable DataD1 D2 D3. . . Dk
D1 D2 D3. . . Dk
D1 D2 D3. . . Dk
D1 D2 D3. . . Dk
D1 D2 D3. . . Dk
Test Set
Training Set
Estimate 2
Leave-One-Out CV
Split the p available input-output patterns into a training set of size p1 and a test set of size 1.
Average the squared error on the left-out pattern over the p possible ways of partition.
A special case of k-fold CV.
Error Variance Predicted by LOO
A special case of k-fold CV.
22
1
1ˆ ( )
p
LOO i i ii
y fp
x
( , ) : 1,2, ,k kD y k p x
\ ( , )i i iD D y x
Available input-output patterns.
1, ,i p Training sets of LOO.
if Function learned using Di as training set.
The estimate for the variance of prediction error using LOO:
Error-square for the left-out element.
Error Variance Predicted by LOO
A special case of k-fold CV.
22
1
1ˆ ( )
p
LOO i i ii
y fp
x
( , ) : 1,2, ,k kD y k p x
\ ( , )i i iD D y x
Available input-output patterns.
1, ,i p Training sets of LOO.
if Function learned using Di as training set.
The estimate for the variance of prediction error using LOO:
Error-square for the left-out element.
Given a model, the function with least empirical error for Di.
Given a model, the function with least empirical error for Di.
As an index of model’s fitness.We want to find a model also minimize
this.
As an index of model’s fitness.We want to find a model also minimize
this.
Error Variance Predicted by LOO
A special case of k-fold CV.
22
1
1ˆ ( )
p
LOO i i ii
y fp
x
( , ) : 1,2, ,k kD y k p x
\ ( , )i i iD D y x
Available input-output patterns.
1, ,i p Training sets of LOO.
if Function learned using Di as training set.
The estimate for the variance of prediction error using LOO:
Error-square for the left-out element.
How to estimate?How to estimate?
Are there any efficient ways?
Are there any efficient ways?
Error Variance Predicted by LOO
22
1
1ˆ ( )
p
LOO i i ii
y fp
x
Error-square for the left-out element.
2 2(1
ˆ ˆ ( )) ˆTLOO di
pag y PP Py
Generalized Cross-Validation
2 2(1
ˆ ˆ ( )) ˆTLOO di
pag y PP Py
( )( ) p
tracediag
p
PP I
2
22
ˆ ˆˆ
( )
T
GCV
p
trace
y P y
P
2
2
ˆ ˆ
( )
Tp
p
y P y
More Criteria Based on CV
22
2
ˆ ˆˆ
( )
T
GCV
p
p
y P y GCV
(Generalized CV)
22 ˆ ˆ
ˆT
UEV p
y P y UEV
(Unbiased estimate of variance)
22 ˆ ˆ
ˆT
FPE
p
p p
y P y FPE(Final Prediction Error)
Akaike’sInformation Criter
ion
BIC(Bayesian Information Criterio)
22 ln( ) 1 ˆ ˆ
ˆT
BIC
p p
p p
y P y
More Criteria Based on CV
22
2
ˆ ˆˆ
( )
T
GCV
p
p
y P y
22 ˆ ˆ
ˆT
UEV p
y P y
22 ˆ ˆ
ˆT
FPE
p
p p
y P y
22 ln( ) 1 ˆ ˆ
ˆT
BIC
p p
p p
y P y
2ˆUEV
p
p
2ˆUEV
p
p
2ln( ) 1ˆUEV
p p
p
2 2 2 2ˆ ˆ ˆ ˆUEV FPE GCV BIC
More Criteria Based on CV
22
2
ˆ ˆˆ
( )
T
GCV
p
p
y P y
22 ˆ ˆ
ˆT
UEV p
y P y
22 ˆ ˆ
ˆT
FPE
p
p p
y P y
22 ln( ) 1 ˆ ˆ
ˆT
BIC
p p
p p
y P y
2ˆUEV
p
p
2ˆUEV
p
p
2ln( ) 1ˆUEV
p p
p
2 2 2 2ˆ ˆ ˆ ˆUEV FPE GCV BIC
?P
Regularization
((
1
)
2
)p
k
k
k fy
x2
1
m
iiiw
Empirial
Er
Model s
penalror ts
yCo t
2
( )
1 1
( )p m
ki
k
k
iiwy
x
Penalize models withlarge weightsSSE
Standard Ridge Regression,
,i i
Regularization
((
1
)
2
)p
k
k
k fy
x2
1
m
iiiw
Empirial
Er
Model s
penalror ts
yCo t
2
( )
1 1
( )p m
ki
k
k
iiwy
x
Penalize models withlarge weightsSSE
2
1
m
iiw
2
1
m
iiw
Standard Ridge Regression,
,i i
Solution Review
2
(( ) )
1 1
2
1
min p m m
ki
k
ki i
i i
C w wy
x
1* Tw A Φ yT
m Φ Φ IA1 T
p P I Φ ΦA
Used to compute model selection criteria
Example
22( ) (1 2 ) xy x x x e
50p
2~ (0,0.2 )N
Width of RBF r = 0.5Width of RBF r = 0.5
Example2 2 2 2ˆ ˆ ˆ ˆUEV FPE GCV BIC
2 2 2 2ˆ ˆ ˆ ˆUEV FPE GCV BIC
22( ) (1 2 ) xy x x x e
50p
2~ (0,0.2 )N
Width of RBF r = 0.5Width of RBF r = 0.5
Example2 2 2 2ˆ ˆ ˆ ˆUEV FPE GCV BIC
2 2 2 2ˆ ˆ ˆ ˆUEV FPE GCV BIC
22( ) (1 2 ) xy x x x e
50p
2~ (0,0.2 )N
Width of RBF r = 0.5Width of RBF r = 0.5
How the determine the
optimal regularization
parameter effectively?How the determine the
optimal regularization
parameter effectively?
Optimizing the Regularization Parameter
Re-Estimation Formula
2 1 2
1
ˆˆ
( )
T
T
trace
trace
A
w
Ay P y
A Pw
Local Ridge Regression
Re-Estimation Formula
2 1 2
1
ˆˆ
( )
T
T
trace
trace
A
w
Ay P y
A Pw
Example
sin(12 )y x 2~ (0,0.1 )N50m p
— Width of RBF
( )p trace P( )p trace P
0.05r
Example
sin(12 )y x 2~ (0,0.1 )N50m p
— Width of RBF
( )p trace P( )p trace P
0.05r
Example
sin(12 )y x 2~ (0,0.1 )N50m p
— Width of RBF0.05r
There are two local-minima.
2 1 2
1
ˆˆ
( )
T
T
trace
trace
A
w
Ay P y
A Pw
Using the about re-estimation formula, it will be stuck at the nearest local minimum.
That is, the solution depends on the initial setting.
Example
sin(12 )y x 2~ (0,0.1 )N50m p
— Width of RBF0.05r
There are two local-minima.
2 1 2
1
ˆˆ
( )
T
T
trace
trace
A
w
Ay P y
A Pw
ˆ(0) 0.01
lim 0.1ˆ( )t
t
Example
sin(12 )y x 2~ (0,0.1 )N50m p
— Width of RBF0.05r
There are two local-minima.
2 1 2
1
ˆˆ
( )
T
T
trace
trace
A
w
Ay P y
A Pw
5ˆ(0) 10
4ˆ( )lim 2.1 10t
t
Example
sin(12 )y x 2~ (0,0.1 )N50m p
— Width of RBF0.05r
RMSE: Root Mean Squared ErrorIn real case, it is not available.
RMSE: Root Mean Squared ErrorIn real case, it is not available.
Example
sin(12 )y x 2~ (0,0.1 )N50m p
— Width of RBF0.05r
RMSE: Root Mean Squared ErrorIn real case, it is not available.
RMSE: Root Mean Squared ErrorIn real case, it is not available.
Local Ridge Regression
2
(( ) )
1 1
2
1
min p m m
ki
k
ki i
i i
C w wy
x
Standard Ridge Regression
2
(( ) )
1 1
2
1
min p m m
i ik
ik i
ik
i
C w wy
x
Local Ridge Regression
Local Ridge Regression
2
(( ) )
1 1
2
1
min p m m
ki
k
ki i
i i
C w wy
x
Standard Ridge Regression
2
(( ) )
1 1
2
1
min p m m
i ik
ik i
ik
i
C w wy
x
Local Ridge Regression
j implies that j() can be removed.
j implies that j() can be removed.
The Solutions1* Tw A Φ y
Tm Φ Φ IA
1 Tp
P I Φ ΦAUsed to compute model selection criteria
T A Φ Φ Λ
TA Φ Φ Linear Regression
Standard Ridge Regression
Local Ridge Regression
Optimizing the Regularization Parameters
Incremental Operation
P: The current projection Matrix.
Pj: The projection Matrix obtained by removing j().
Tj j j j
j Tj j j j
P φ φ PP P
φ P φ
2
22
ˆ ˆˆ
( )
T
GCV
p
trace
y P y
P
Optimizing the Regularization Parameters
2
22
ˆ ˆˆ
( )
T
GCV
p
trace
y P y
P
Solve 2ˆ
0GCV
j
Subject to 0j
Tj j j j
j Tj j j j
P φ φ PP P
φ P φ
Optimizing the Regularization Parameters
2Tja y P y2T Tj j j jby P φ y P φ2 2( )T T
j j j j jc φ P φ y P φ
( )jtrace P2T
j j j φ P φ
if
if and ˆ0 if and
if
c bTj j j b a
c bTjj j j b a
c b c bT Tj j j j j jb a b a
a b
a b
a b
φ P φ
φ P φ
φ P φ φ P φ
if
if and ˆ0 if and
if
c bTj j j b a
c bTjj j j b a
c b c bT Tj j j j j jb a b a
a b
a b
a b
φ P φ
φ P φ
φ P φ φ P φ
Solve 2ˆ
0GCV
j
Subject to 0j
Optimizing the Regularization Parameters
2Tja y P y2T Tj j j jby P φ y P φ2 2( )T T
j j j j jc φ P φ y P φ
( )jtrace P2T
j j j φ P φ
if
if and ˆ0 if and
if
c bTj j j b a
c bTjj j j b a
c b c bT Tj j j j j jb a b a
a b
a b
a b
φ P φ
φ P φ
φ P φ φ P φ
if
if and ˆ0 if and
if
c bTj j j b a
c bTjj j j b a
c b c bT Tj j j j j jb a b a
a b
a b
a b
φ P φ
φ P φ
φ P φ φ P φ
Solve 2ˆ
0GCV
j
Subject to 0j Remove j()
Remove j()
The Algorithm
Initialize i’s. – e.g., performing standard ridge regression.
Repeat the following until GCV converges:– Randomly select j and compute– Perform local ridge regression– If GCV reduce & remove j()
ˆj
ˆj
References
Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI).
Mark J. L. Orr (April 1996), Introduction to Radial Basis Function Networks, http://www.anc.ed.ac.uk/~mjo/intro/intro.html.
Introduction to Radial Basis Function Networks
Incremental Operations