NONLINEAR CLASSIFICATION AND REGRESSION
Probability & Bayesian Inference
J. Elder, CSE 4404/5327: Introduction to Machine Learning and Pattern Recognition
Nonlinear Classification and Regression: Outline
- Multi-Layer Perceptrons
  - The Back-Propagation Learning Algorithm
- Generalized Linear Models
  - Radial Basis Function Networks
  - Sparse Kernel Machines
    - Nonlinear SVMs and the Kernel Trick
    - Relevance Vector Machines
Implementing Logical Relations
AND and OR operations are linearly separable problems
The XOR Problem
XOR is not linearly separable.
How can we use linear classifiers to solve this problem?
x1 x2 XOR Class
0 0 0 B
0 1 1 A
1 0 1 A
1 1 0 B
Combining two linear classifiers
Idea: use a logical combination of two linear classifiers.
$g_1(x) = x_1 + x_2 - \tfrac{1}{2}$
$g_2(x) = x_1 + x_2 - \tfrac{3}{2}$
Combining two linear classifiers
Let $f(x)$ be the unit step activation function:
$f(x) = 0$ for $x < 0$; $f(x) = 1$ for $x \geq 0$.
Observe that the classification problem is then solved by
$f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$, where $y_1 = f\!\left(g_1(x)\right)$ and $y_2 = f\!\left(g_2(x)\right)$,
with $g_1(x) = x_1 + x_2 - \tfrac{1}{2}$ and $g_2(x) = x_1 + x_2 - \tfrac{3}{2}$.
Combining two linear classifiers
This calculation can be implemented sequentially:
1. Compute $y_1 = f\!\left(g_1(x)\right)$ and $y_2 = f\!\left(g_2(x)\right)$ from $x_1$ and $x_2$.
2. Compute the decision $f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$ from $y_1$ and $y_2$.
Each layer in the sequence consists of one or more linear classifications.
This is therefore a two-layer perceptron.
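To make the construction concrete, here is a minimal Python sketch of this two-layer perceptron; the function and variable names are illustrative, and only the weights and thresholds come from the slides.

```python
def step(a):
    # Unit step activation: 0 for a < 0, 1 for a >= 0
    return 1 if a >= 0 else 0

def two_layer_xor(x1, x2):
    # Layer 1 (hidden layer): two linear discriminants
    y1 = step(x1 + x2 - 0.5)   # g1(x) = x1 + x2 - 1/2
    y2 = step(x1 + x2 - 1.5)   # g2(x) = x1 + x2 - 3/2
    # Layer 2 (output): linear discriminant on (y1, y2)
    return step(y1 - y2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, two_layer_xor(x1, x2))   # reproduces the XOR truth table
```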
The Two-Layer Perceptron
Layer 1: $y_1 = f\!\left(g_1(x)\right)$, $y_2 = f\!\left(g_2(x)\right)$, with $g_1(x) = x_1 + x_2 - \tfrac{1}{2}$ and $g_2(x) = x_1 + x_2 - \tfrac{3}{2}$.
Layer 2: $f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$.

x1  x2 | y1     y2    | Class
0   0  | 0 (-)  0 (-) | B (0)
0   1  | 1 (+)  0 (-) | A (1)
1   0  | 1 (+)  0 (-) | A (1)
1   1  | 1 (+)  1 (+) | B (0)
The Two-Layer Perceptron
The first layer performs a nonlinear mapping $\left(y_1, y_2\right) = \left(f(g_1(x)),\, f(g_2(x))\right)$ that makes the data linearly separable.
The Two-Layer Perceptron Architecture
Input Layer | Hidden Layer | Output Layer
Hidden units: $y_1 = f\!\left(x_1 + x_2 - \tfrac{1}{2}\right)$, $y_2 = f\!\left(x_1 + x_2 - \tfrac{3}{2}\right)$; output unit: $f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$. A constant $-1$ input to each unit supplies the bias terms.
The Two-Layer Perceptron
Note that the hidden layer maps the plane onto the vertices of the unit square: $\left(y_1, y_2\right) = \left(f(g_1(x)),\, f(g_2(x))\right) \in \{0, 1\}^2$.
Higher Dimensions
Each hidden unit realizes a hyperplane discriminant function. The output of each hidden unit is 0 or 1, depending upon the location of the input vector relative to the hyperplane.
$x \in \mathbb{R}^l \;\rightarrow\; y = \left[y_1, \ldots, y_p\right]^t,\quad y_i \in \{0, 1\},\; i = 1, 2, \ldots, p$
Higher Dimensions
Together, the hidden units map the input onto the vertices of a p-dimensional unit hypercube.
Two-Layer Perceptron
These p hyperplanes partition the l-dimensional input space into polyhedral regions
Each region corresponds to a different vertex of the p-dimensional hypercube represented by the outputs of the hidden layer.
Two-Layer Perceptron
In this example, the vertex (0, 0, 1) corresponds to the region of the input space where $g_1(x) < 0$, $g_2(x) < 0$, and $g_3(x) > 0$.
Limitations of a Two-Layer Perceptron
The output neuron realizes a hyperplane in the transformed space that partitions the p vertices into two sets.
Thus, the two layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions.
But NOT ANY union: which unions can be realized depends on the relative position of the corresponding vertices.
How can we solve this problem?
The Three-Layer Perceptron
Suppose that Class A consists of the union of K polyhedra in the input space.
Use K neurons in the 2nd hidden layer.
Train each to classify one region as positive, the rest negative.
Now use an output neuron that implements the OR function.
The Three-Layer Perceptron
Thus the three-layer perceptron can separate classes resulting from any union of polyhedral regions in the input space.
The Three-Layer Perceptron
The first layer of the network forms the hyperplanes in the input space.
The second layer of the network forms the polyhedral regions of the input space
The third layer forms the appropriate unions of these regions and maps each to the appropriate class.
Learning Parameters
Training Data
The training data consist of N input-output pairs:
$\left(y(i),\, x(i)\right),\quad i = 1, \ldots, N$
where
$y(i) = \left[y_1(i), \ldots, y_{k_L}(i)\right]^t$ and $x(i) = \left[x_1(i), \ldots, x_{k_0}(i)\right]^t$
Choosing an Activation Function
The unit step activation function means that the error rate of the network is a discontinuous function of the weights.
This makes it difficult to learn optimal weights by minimizing the error.
To fix this problem, we need to use a smooth activation function.
A popular choice is the sigmoid function we used for logistic regression:
Smooth Activation Function
$f(a) = \dfrac{1}{1 + \exp(-a)}$
[Figure 8.3 (Prince): the logistic regression model in 1D and 2D; the model has a sigmoid profile in the direction of the gradient and a linear decision boundary.]
Output: Two Classes
For a binary classification problem, there is a single output node with activation function given by
$f(a) = \dfrac{1}{1 + \exp(-a)}$
Since the output is constrained to lie between 0 and 1, it can be interpreted as the probability of the input vector belonging to Class 1.
Output: K > 2 Classes
For a K-class problem, we use K outputs and the softmax function
$y_k = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$
Since the outputs are constrained to lie between 0 and 1 and sum to 1, $y_k$ can be interpreted as the probability that the input vector belongs to class $k$.
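As an illustration (not from the slides), a numerically stable softmax in Python subtracts the maximum activation before exponentiating, which leaves the result unchanged:

```python
import numpy as np

def softmax(a):
    # exp(a_k - c) / sum_j exp(a_j - c) equals exp(a_k) / sum_j exp(a_j)
    # for any constant c; using c = max(a) avoids overflow.
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())
    return e / e.sum()

y = softmax([2.0, 1.0, 0.1])
print(y, y.sum())   # outputs lie in (0, 1) and sum to 1
```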
Non-Convex
Now each layer of our multi-layer perceptron is a logistic regressor.
Recall that optimizing the weights in logistic regression results in a convex optimization problem.
Unfortunately the cascading of logistic regressors in the multi-layer perceptron makes the problem non-convex.
This makes it difficult to determine an exact solution. Instead, we typically use gradient descent to find a locally optimal solution for the weights. The specific learning algorithm is called the backpropagation algorithm.
Nov 21, 2011
End of Lecture
The Backpropagation Algorithm
Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
[Photos: Werbos, Rumelhart, Hinton]
Notation
Assume a network with L layers:
$k_0$ nodes in the input layer,
$k_r$ nodes in the $r$'th layer.
Notation
Let $y_k^{r-1}$ be the output of the $k$th neuron of layer $r-1$.
Let $w_{jk}^{r}$ be the weight of the synapse on the $j$th neuron of layer $r$ from the $k$th neuron of layer $r-1$.
Input
$y_k^0(i) = x_k(i),\quad k = 1, \ldots, k_0$
Notation
Let $v_j^r$ be the total input to the $j$th neuron of layer $r$:
$v_j^r(i) = \left(w_j^r\right)^t y^{r-1}(i) = \sum_{k=0}^{k_{r-1}} w_{jk}^r\, y_k^{r-1}(i)$
where we define $y_0^r(i) = +1$ to incorporate the bias term.
Then $y_j^r(i) = f\!\left(v_j^r(i)\right) = f\!\left(\sum_{k=0}^{k_{r-1}} w_{jk}^r\, y_k^{r-1}(i)\right)$.
Cost Function
A common cost function is the squared error:
$J = \sum_{i=1}^{N} \varepsilon(i)$
where $\varepsilon(i) \triangleq \frac{1}{2}\sum_{m=1}^{k_L}\left(e_m(i)\right)^2 = \frac{1}{2}\sum_{m=1}^{k_L}\left(\hat{y}_m(i) - y_m(i)\right)^2$
and $\hat{y}_m(i) = y_m^L(i)$ is the output of the network.
Cost Function
To summarize, the error for input $i$ is given by
$\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L}\left(e_m(i)\right)^2 = \frac{1}{2}\sum_{m=1}^{k_L}\left(\hat{y}_m(i) - y_m(i)\right)^2$
where $\hat{y}_m(i) = y_m^L(i)$ is the output of the output layer, and each layer is related to the previous layer through
$y_j^r(i) = f\!\left(v_j^r(i)\right)$ and $v_j^r(i) = \left(w_j^r\right)^t y^{r-1}(i)$.
Gradient Descent
Gradient descent starts with an initial guess at the weights over all layers of the network.
We then use these weights to compute the network output for each input vector x(i) in the training data.
This allows us to calculate the error $\varepsilon(i)$ for each of these inputs.
Then, in order to minimize this error, we incrementally update the weights in the negative gradient direction:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu\,\frac{\partial J}{\partial w_j^r} = w_j^r(\text{old}) - \mu\sum_{i=1}^{N}\frac{\partial\varepsilon(i)}{\partial w_j^r}$
Gradient Descent
Since $v_j^r(i) = \left(w_j^r\right)^t y^{r-1}(i)$, the influence of the $j$th weight of the $r$th layer on the error can be expressed as
$\frac{\partial\varepsilon(i)}{\partial w_j^r} = \frac{\partial\varepsilon(i)}{\partial v_j^r(i)}\,\frac{\partial v_j^r(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$
where $\delta_j^r(i) \triangleq \frac{\partial\varepsilon(i)}{\partial v_j^r(i)}$.
Gradient Descent
!" (i )
!wj
r= #
j
r (i )yr $1 (i ),
where
#j
r (i ) !!" (i )
!vj
r (i )
For an intermediate layer r,
we cannot compute !jr (i ) directly.
However, !jr (i ) can be computed inductively,
starting from the output layer.
Backpropagation: The Output Layer
Thus at the output layer we have
$\frac{\partial\varepsilon(i)}{\partial w_j^L} = \delta_j^L(i)\, y^{L-1}(i)$, where $\delta_j^L(i) \triangleq \frac{\partial\varepsilon(i)}{\partial v_j^L(i)}$
and $\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L}\left(e_m(i)\right)^2 = \frac{1}{2}\sum_{m=1}^{k_L}\left(\hat{y}_m(i) - y_m(i)\right)^2$.
Recall that $\hat{y}_j(i) = y_j^L(i) = f\!\left(v_j^L(i)\right)$. Thus
$\delta_j^L(i) = \frac{\partial\varepsilon(i)}{\partial v_j^L(i)} = \frac{\partial\varepsilon(i)}{\partial e_j^L(i)}\,\frac{\partial e_j^L(i)}{\partial v_j^L(i)} = e_j^L(i)\, f'\!\left(v_j^L(i)\right)$
For the sigmoid, $f(a) = \frac{1}{1+\exp(-a)} \;\Rightarrow\; f'(a) = f(a)\left(1 - f(a)\right)$, so
$\delta_j^L(i) = e_j^L(i)\, f\!\left(v_j^L(i)\right)\left(1 - f\!\left(v_j^L(i)\right)\right)$
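A quick numerical sanity check of the sigmoid derivative identity used above; this is a sketch with an arbitrary test point, not part of the lecture.

```python
import numpy as np

f = lambda a: 1.0 / (1.0 + np.exp(-a))      # sigmoid

a, h = 0.7, 1e-6
numeric = (f(a + h) - f(a - h)) / (2 * h)   # central finite difference
analytic = f(a) * (1 - f(a))                # f'(a) = f(a)(1 - f(a))
print(numeric, analytic)                     # the two values agree closely
```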
Backpropagation: Hidden Layers
Observe that the dependence of the error on the total input to a neuron in a previous layer can be expressed in terms of the dependence on the total inputs of neurons in the following layer:
$\delta_j^{r-1}(i) = \frac{\partial\varepsilon(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r}\frac{\partial\varepsilon(i)}{\partial v_k^r(i)}\,\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r}\delta_k^r(i)\,\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$
where $v_k^r(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r\, y_m^{r-1}(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r\, f\!\left(v_m^{r-1}(i)\right)$.
Thus we have $\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = w_{kj}^r\, f'\!\left(v_j^{r-1}(i)\right)$,
and so $\delta_j^{r-1}(i) = f'\!\left(v_j^{r-1}(i)\right)\sum_{k=1}^{k_r}\delta_k^r(i)\, w_{kj}^r = f\!\left(v_j^{r-1}(i)\right)\left(1 - f\!\left(v_j^{r-1}(i)\right)\right)\sum_{k=1}^{k_r}\delta_k^r(i)\, w_{kj}^r$.
Thus once the $\delta_k^r(i)$ are determined, they can be propagated backward to calculate $\delta_j^{r-1}(i)$ using this inductive formula.
Backpropagation: Summary of Algorithm
1. Initialization: Initialize all weights with small random values.
2. Forward Pass: For each input vector, run the network in the forward direction, calculating
$v_j^r(i) = \left(w_j^r\right)^t y^{r-1}(i)$; $\quad y_j^r(i) = f\!\left(v_j^r(i)\right)$;
and finally $\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L}\left(e_m(i)\right)^2 = \frac{1}{2}\sum_{m=1}^{k_L}\left(\hat{y}_m(i) - y_m(i)\right)^2$.
3. Backward Pass: Starting with the output layer, use the inductive formula to compute the $\delta_j^r(i)$:
Output layer (base case): $\delta_j^L(i) = e_j^L(i)\, f'\!\left(v_j^L(i)\right)$
Hidden layers (inductive case): $\delta_j^{r-1}(i) = f'\!\left(v_j^{r-1}(i)\right)\sum_{k=1}^{k_r}\delta_k^r(i)\, w_{kj}^r$
4. Update Weights:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu\sum_{i=1}^{N}\frac{\partial\varepsilon(i)}{\partial w_j^r}$, where $\frac{\partial\varepsilon(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$.
Repeat until convergence.
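The following NumPy sketch runs this procedure on the XOR data from earlier, using a single hidden layer, sigmoid activations, and batch updates. The layer sizes, learning rate, and epoch count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda a: 1.0 / (1.0 + np.exp(-a))        # sigmoid activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs x(i)
T = np.array([[0], [1], [1], [0]], dtype=float)                # targets y(i)

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)      # input -> hidden
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)      # hidden -> output
mu = 1.0                                                        # learning rate

for epoch in range(20000):
    # Forward pass: v = total input, y = f(v)
    v1 = X @ W1 + b1; y1 = f(v1)
    v2 = y1 @ W2 + b2; y2 = f(v2)
    e = y2 - T                                # e(i) = network output - target

    # Backward pass: delta terms, using f'(v) = f(v)(1 - f(v))
    d2 = e * y2 * (1 - y2)                    # output layer (base case)
    d1 = (d2 @ W2.T) * y1 * (1 - y1)          # hidden layer (inductive case)

    # Batch weight update in the negative gradient direction
    W2 -= mu * y1.T @ d2; b2 -= mu * d2.sum(axis=0)
    W1 -= mu * X.T @ d1;  b1 -= mu * d1.sum(axis=0)

print(f(f(X @ W1 + b1) @ W2 + b2).round(2))   # typically close to [0, 1, 1, 0]
```

The outcome depends on the random initialization; a poor draw can leave the network in a local minimum, which is exactly the issue noted later under Remarks.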
Batch vs Online Learning
As described, on each iteration backprop updates the weights based upon all of the training data. This is called batch learning.
An alternative is to update the weights after each training input has been processed by the network, based only upon the error for that input. This is called online learning.
Batch: $w_j^r(\text{new}) = w_j^r(\text{old}) - \mu\sum_{i=1}^{N}\frac{\partial\varepsilon(i)}{\partial w_j^r}$
Online: $w_j^r(\text{new}) = w_j^r(\text{old}) - \mu\,\frac{\partial\varepsilon(i)}{\partial w_j^r}$
where in both cases $\frac{\partial\varepsilon(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$.
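The contrast can be seen on a toy least-squares problem; this is a sketch, and the data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
t = 3.0 * x + rng.normal(scale=0.1, size=50)      # true weight is 3

def grad(w, xi, ti):
    # d/dw of the per-example error 0.5 * (w*xi - ti)^2
    return (w * xi - ti) * xi

mu, n_epochs = 0.05, 100

w_batch = 0.0                                      # batch: one step per epoch
for _ in range(n_epochs):
    w_batch -= mu * sum(grad(w_batch, xi, ti) for xi, ti in zip(x, t))

w_online = 0.0                                     # online: one step per example
for _ in range(n_epochs):
    for xi, ti in zip(x, t):
        w_online -= mu * grad(w_online, xi, ti)

print(w_batch, w_online)    # both estimates approach the true weight (about 3)
```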
Batch vs Online Learning
One advantage of batch learning is that averaging over all inputs when updating the weights should lead to smoother convergence.
On the other hand, the randomness associated with online learning might help to prevent convergence toward a local minimum.
Changing the order of presentation of the inputs from epoch to epoch may also improve results.
Remarks
Local Minima: The objective function is in general non-convex, and so the solution may not be globally optimal.
Stopping Criterion: Typically stop when the change in weights or the change in the error function falls below a threshold.
Learning Rate: The speed and reliability of convergence depend on the learning rate μ.
Generalizing Linear Classifiers
One way of tackling problems that are not linearly separable is to transform the input in a nonlinear fashion prior to applying a linear classifier.
The result is that decision boundaries that are linear in the resulting feature space may be highly nonlinear in the original input space.
[Figure: a nonlinear decision boundary in the input space (x1, x2) corresponds to a linear boundary in the feature space (φ1, φ2)]
Nonlinear Basis Function Models
Generally,
$y(x, w) = \sum_{j} w_j\, \phi_j(x) = w^t \phi(x)$
where the $\phi_j(x)$ are known as basis functions. Typically, $\phi_0(x) = 1$, so that $w_0$ acts as a bias.
Nonlinear basis functions for classification
In the context of classification, the discriminant function in the feature space becomes
$g\!\left(y(x)\right) = w_0 + \sum_{i=1}^{M} w_i\, y_i(x) = w_0 + \sum_{i=1}^{M} w_i\, \phi_i(x)$
This formulation can be thought of as an input-space approximation of the true separating discriminant function $g(x)$ using a set of interpolation functions $\phi_i(x)$.
Dimensionality
The dimensionality M of the feature space may be less than, equal to, or greater than the dimensionality D of the original input space.
M < D: This may result in a factoring out of irrelevant dimensions, a reduction in the number of model parameters, and a resulting improvement in generalization (reduced overlearning).
M > D: Problems that are not linearly separable in the input space may become separable in the feature space, and the probability of linear separability generally increases with the dimensionality of the feature space. Thus choosing M >> D helps to make the problem linearly separable.
Cover’s Theorem
“A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.” — Cover, T.M., "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition," 1965
Example
Radial Basis Functions
Consider interpolation functions (kernels) of the form
$\phi_i\!\left(\left\|x - \mu_i\right\|\right)$
In other words, the feature value depends only upon the Euclidean distance to a 'centre point' in the input space.
A commonly used RBF is the isotropic Gaussian:
$\phi_i(x) = \exp\!\left(-\frac{1}{2\sigma_i^2}\left\|x - \mu_i\right\|^2\right)$
Relation to KDE
We can use Gaussian RBFs to approximate the discriminant function $g(x)$:
$g\!\left(y(x)\right) = w_0 + \sum_{i=1}^{M} w_i\, y_i(x) = w_0 + \sum_{i=1}^{M} w_i\, \phi_i(x)$
where
$\phi_i(x) = \exp\!\left(-\frac{1}{2\sigma_i^2}\left\|x - \mu_i\right\|^2\right)$
This is reminiscent of kernel density estimation, where we approximated probability densities as a normalized sum of Gaussian kernels.
Relation to KDE
For KDE we planted a kernel at each data point. Thus there were N kernels.
For RBF networks, we generally use far fewer kernels than the number of data points: M << N.
This leads to greater efficiency and generalization.
RBF Networks
The linear classifier with nonlinear radial basis functions can be considered an artificial neural network in which the hidden nodes are nonlinear (e.g., Gaussian) and the output node is linear.
RBF Network for 2 Classes
RBF Networks vs Perceptrons
Recall that for a perceptron, the output of a hidden unit is invariant on a hyperplane.
For an RBF, the output of a hidden unit is invariant on a circle centred on μi.
Thus hidden units are global in a perceptron, but local in an RBF network.
RBF Network for 2 Classes
RBF Networks vs Perceptrons
This difference has consequences:
Multilayer perceptrons tend to learn more slowly than RBFs.
However, multilayer perceptrons tend to have better generalization properties, especially in regions of the input space where training data are sparse.
Typically, more neurons are needed for an RBF than for a multilayer perceptron to solve a given problem.
Parameters
There are two options for choosing the parameters (centres and scales) of the RBFs:
1. Fixed. For example, randomly select a subset of M of the input vectors and use these as centres. Use a common scale based upon your judgement.
2. Learned. Note that when the RBF parameters are fixed, the weights could be learned using linear classifier techniques (e.g., least squares). Thus the RBF parameters could be learned in an outer loop, by gradient descent.
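A minimal sketch of option 1, with fixed randomly chosen centres, a common scale, and output weights fit by least squares; the toy data, M, and σ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class data: two noisy rings (not linearly separable in input space)
N = 200
theta = rng.uniform(0, 2 * np.pi, N)
r = np.where(np.arange(N) < N // 2, 0.5, 1.5) + rng.normal(scale=0.1, size=N)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
t = (np.arange(N) >= N // 2).astype(float)         # labels 0 / 1

M, sigma = 20, 0.5
centres = X[rng.choice(N, M, replace=False)]        # fixed, randomly chosen centres

def rbf_features(X, centres, sigma):
    # phi_i(x) = exp(-||x - mu_i||^2 / (2 sigma^2)), plus a constant bias column
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigma ** 2))
    return np.c_[np.ones(len(X)), Phi]

Phi = rbf_features(X, centres, sigma)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)          # least-squares weights

pred = (Phi @ w > 0.5).astype(float)
print("training accuracy:", (pred == t).mean())      # typically close to 1.0
```

Because the centres and scale are held fixed, fitting the weights reduces to an ordinary linear least-squares problem.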
The Kernel Function
Recall that an SVM is the solution to the problem
Maximize $\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m\, k(x_n, x_m)$
subject to $0 \leq a_n \leq C$ and $\sum_{n=1}^{N} a_n t_n = 0$.
A new input $x$ is classified by computing
$y(x) = \sum_{n \in S} a_n t_n\, k(x, x_n) + b$
where $S$ is the set of support vectors.
Here we introduced the kernel function $k(x, x')$, defined as
$k(x, x') = \phi(x)^t \phi(x')$
This is more than a notational convenience!!
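To make the classification rule concrete, the sketch below evaluates y(x) with a Gaussian kernel; the support vectors, multipliers a_n, targets t_n, and bias b are made-up values standing in for the output of the dual optimization.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # x_n, n in S
a = np.array([0.7, 0.7, 1.4])                                      # multipliers a_n
t = np.array([+1.0, +1.0, -1.0])                                   # targets t_n
b = 0.1

def predict(x):
    # y(x) = sum_{n in S} a_n t_n k(x, x_n) + b, thresholded at zero
    s = sum(a_n * t_n * gaussian_kernel(x, x_n)
            for a_n, t_n, x_n in zip(a, t, support_vectors))
    return np.sign(s + b)

print(predict(np.array([0.2, 0.9])))   # classify a new input
```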
The Kernel Trick
Note that the basis functions and individual training vectors are no longer part of the objective function.
Instead all we need is the kernel value (like a distance measure) for all pairs of training vectors.
The Kernel Function
The kernel function $k(x, x')$ measures the 'similarity' of input vectors $x$ and $x'$ as an inner product in a feature space defined by the feature space mapping $\phi(x)$: $k(x, x') = \phi(x)^t \phi(x')$.
If $k(x, x') = k(x - x')$ we say that the kernel is stationary.
If $k(x, x') = k\!\left(\left\|x - x'\right\|\right)$ we call it a radial basis function.
Constructing Kernels
We can construct a kernel by selecting a feature space mapping ϕ(x) and then defining
$k(x, x') = \phi(x)^t \phi(x')$
[Figure: Gaussian basis functions φ(x′) and the resulting kernel function k(x, x′)]
Constructing Kernels
Alternatively, we can construct the kernel function directly, ensuring that it corresponds to an inner product in some (possibly infinite-dimensional) feature space.
Constructing Kernels
$k(x, x') = \phi(x)^t \phi(x')$
Example 1: $k(x, z) = x^t z$
Example 2: $k(x, z) = x^t z + c,\; c > 0$
Example 3: $k(x, z) = \left(x^t z\right)^2$
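For Example 3, one valid feature map in two dimensions is φ(x) = (x1^2, √2·x1·x2, x2^2); the sketch below (the test vectors are arbitrary) checks numerically that the kernel equals the inner product of the mapped vectors.

```python
import numpy as np

def k(x, z):
    # k(x, z) = (x^t z)^2
    return float(x @ z) ** 2

def phi(x):
    # Explicit feature map for the 2-D quadratic kernel
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 1.0])
print(k(x, z), phi(x) @ phi(z))   # the two values agree
```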
Kernel Properties
Kernels obey certain properties that make it easy to construct complex kernels from simpler ones.
Combining Kernels
Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following kernels will also be valid:
$k(x, x') = c\,k_1(x, x')$ (6.13)
$k(x, x') = f(x)\,k_1(x, x')\,f(x')$ (6.14)
$k(x, x') = q\!\left(k_1(x, x')\right)$ (6.15)
$k(x, x') = \exp\!\left(k_1(x, x')\right)$ (6.16)
$k(x, x') = k_1(x, x') + k_2(x, x')$ (6.17)
$k(x, x') = k_1(x, x')\,k_2(x, x')$ (6.18)
$k(x, x') = k_3\!\left(\phi(x), \phi(x')\right)$ (6.19)
$k(x, x') = x^t A\, x'$ (6.20)
$k(x, x') = k_a(x_a, x'_a) + k_b(x_b, x'_b)$ (6.21)
$k(x, x') = k_a(x_a, x'_a)\,k_b(x_b, x'_b)$ (6.22)
where $c > 0$, $f(\cdot)$ is any function, $q(\cdot)$ is a polynomial with nonnegative coefficients, $\phi(x)$ is a mapping from $x$ to $\mathbb{R}^M$, $k_3$ is a valid kernel on $\mathbb{R}^M$, $A$ is a symmetric positive semidefinite matrix, $x_a$ and $x_b$ are variables such that $x = (x_a, x_b)$, and $k_a$, $k_b$ are valid kernels over their respective spaces.
(Bishop, Chapter 6: Kernel Methods; slide adapted from Vasil Khalidov and Alex Klaser.)
Constructing Kernels
Examples:
$k(x, x') = \left(x^t x' + c\right)^M,\; c > 0$ (use 6.18)
$k(x, x') = \exp\!\left(-\left\|x - x'\right\|^2 / 2\sigma^2\right)$ (use 6.14 and 6.16); corresponds to an infinite-dimensional feature vector.
Nonlinear SVM Example (Gaussian Kernel)
[Figure: decision boundary in the input space (x1, x2)]
SVMs for Regression
In standard linear regression, we minimize
$\frac{1}{2}\sum_{n=1}^{N}\left(y_n - t_n\right)^2 + \frac{\lambda}{2}\left\|w\right\|^2$
This penalizes all deviations from the model.
To obtain sparse solutions, we replace the quadratic error function by an $\epsilon$-insensitive error function, e.g.,
$E_\epsilon\!\left(y(x) - t\right) = \begin{cases} 0, & \text{if } \left|y(x) - t\right| < \epsilon \\ \left|y(x) - t\right| - \epsilon, & \text{otherwise} \end{cases}$
See text for details of the solution.
[Figure: the ε-insensitive tube y(x) ± ε; points outside the tube incur slack ξ > 0]
Example
[Figure: SVM regression example, target t versus input x]
Relevance Vector Machines
Some drawbacks of SVMs:
- They do not provide posterior probabilities.
- They are not easily generalized to K > 2 classes.
- Parameters (C, ε) must be learned by cross-validation.
The Relevance Vector Machine is a sparse Bayesian kernel technique that avoids these drawbacks.
RVMs also typically lead to sparser models.
RVMs for Regression
$p(t \mid x, w, \beta) = \mathcal{N}\!\left(t \mid y(x), \beta^{-1}\right)$, where $y(x) = w^t \phi(x)$.
In an RVM, the basis functions $\phi(x)$ are kernels $k(x, x_n)$:
$y(x) = \sum_{n=1}^{N} w_n\, k(x, x_n) + b$
However, unlike in SVMs, the kernels need not be positive definite, and the $x_n$ need not be the training data points.
RVMs for Regression
Note that each weight parameter has its own precision hyperparameter.
Likelihood: $p(t \mid X, w, \beta) = \prod_{n=1}^{N} p\!\left(t_n \mid x_n, w, \beta\right)$, where the $n$th row of $X$ is $x_n^t$.
Prior: $p(w \mid \alpha) = \prod_{i=1}^{M} \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right)$
RVMs for Regression
The conjugate prior for the precision of a Gaussian is a gamma distribution. Integrating out the precision parameter leads to a Student's t distribution over $w_i$. Thus the distribution over $w$ is a product of Student's t distributions. As a result, maximizing the evidence will yield a sparse $w$. Note that to achieve sparsity it is critical that each parameter $w_i$ has a separate precision $\alpha_i$.
$p(w_i \mid \alpha_i) = \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right), \quad p(\alpha_i) = \mathrm{Gam}\!\left(\alpha_i \mid a, b\right), \quad p(w_i) = \mathrm{St}\!\left(w_i \mid 2a\right)$
[Figure (Tipping, "Bayesian Inference: Principles and Practice in Machine Learning", Fig. 8): contour plots over $(w_1, w_2)$ of the Gaussian prior, the marginal prior with a single shared hyperparameter, and the marginal prior with independent hyperparameters; the independent-α prior has a sharp peak at zero and places most of its mass along the axes, which favours sparse solutions.]
RVMs for Regression
Recall the Student's t distribution:
$p(x \mid \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
and the Gamma distribution:
$p(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}$
Also recall the rule for transforming densities: if $y$ is a monotonic function of $x$, then $p_Y(y) = p_X(x)\left|\frac{dx}{dy}\right|$.
Thus if we let $a \to 0$, $b \to 0$, then $p(\log\alpha_i) \to$ uniform and $p(w_i) \propto 1/\left|w_i\right|$. Very sparse!!
RVMs for Regression
In practice, it is difficult to integrate α out exactly. Instead, we use an approximate maximum likelihood method, finding ML values for each $\alpha_i$. When we maximize the evidence with respect to these hyperparameters, many will go to $\infty$. As a result, the corresponding weights will go to 0, yielding a sparse solution.
RVMs for Regression
Since both the likelihood and prior are normal, the posterior over w will also be normal:
Posterior: $p(w \mid t, X, \alpha, \beta) = \mathcal{N}\!\left(w \mid m, \Sigma\right)$
where $m = \beta\,\Sigma\,\Phi^t t$, $\quad \Sigma = \left(A + \beta\,\Phi^t\Phi\right)^{-1}$, $\quad \Phi_{ni} = \phi_i(x_n)$, $\quad A = \mathrm{diag}(\alpha_i)$.
Note that when $\alpha_i \to \infty$, the $i$th row and column of $\Sigma \to 0$, and $p(w_i \mid t, X, \alpha, \beta) = \mathcal{N}\!\left(w_i \mid 0, 0\right)$.
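The posterior mean and covariance can be computed directly for fixed α and β. The sketch below uses a Gaussian kernel design matrix on toy data; the data, kernel width, and hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
x = np.sort(rng.uniform(0, 1, N))
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)

def design_matrix(x, centres, s=0.1):
    # Phi_ni = phi_i(x_n) = k(x_n, x_i), Gaussian kernel plus a bias column
    K = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s ** 2))
    return np.c_[np.ones(len(x)), K]

Phi = design_matrix(x, x)                  # kernels centred on the data points
alpha = np.ones(Phi.shape[1])              # one precision per weight
beta = 100.0                               # noise precision (1 / sigma^2)

A = np.diag(alpha)
Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # posterior covariance
m = beta * Sigma @ Phi.T @ t                    # posterior mean

print(m[:5])   # posterior mean weights for the first few basis functions
```

In the full RVM these fixed values of α and β would then be re-estimated by maximizing the evidence, as described next.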
RVMs for Regression
The values for α and β are determined using the evidence approximation, where we maximize
$p(t \mid X, \alpha, \beta) = \int p(t \mid X, w, \beta)\, p(w \mid \alpha)\, dw$
In general, this results in many of the precision parameters $\alpha_i \to \infty$, so that $w_i \to 0$.
Unfortunately, this is a non-convex problem.
Example
[Figure: RVM regression example, target t versus input x]