NONLINEAR CLASSIFICATION AND REGRESSION
Probability & Bayesian Inference
J. Elder, CSE 4404/5327: Introduction to Machine Learning and Pattern Recognition
Nonlinear Classification and Regression: Outline
- Multi-Layer Perceptrons
  - The Back-Propagation Learning Algorithm
- Generalized Linear Models
  - Radial Basis Function Networks
  - Sparse Kernel Machines
    - Nonlinear SVMs and the Kernel Trick
    - Relevance Vector Machines
Implementing Logical Relations
AND and OR operations are linearly separable problems
The XOR Problem
XOR is not linearly separable.
How can we use linear classifiers to solve this problem?
x1 x2 XOR Class
0 0 0 B
0 1 1 A
1 0 1 A
1 1 0 B
Combining two linear classifiers
Idea: use a logical combination of two linear classifiers.
$g_1(x) = x_1 + x_2 - \tfrac{1}{2}$
$g_2(x) = x_1 + x_2 - \tfrac{3}{2}$
Combining two linear classifiers
Let $f(x)$ be the unit step activation function:
$f(x) = 0$ for $x < 0$; $f(x) = 1$ for $x \geq 0$.
Observe that the classification problem is then solved by
$f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$, where $y_1 = f\!\left(g_1(x)\right)$ and $y_2 = f\!\left(g_2(x)\right)$,
with $g_1(x) = x_1 + x_2 - \tfrac{1}{2}$ and $g_2(x) = x_1 + x_2 - \tfrac{3}{2}$.
Combining two linear classifiers
This calculation can be implemented sequentially:
1. Compute $y_1 = f\!\left(g_1(x)\right)$ and $y_2 = f\!\left(g_2(x)\right)$ from $x_1$ and $x_2$.
2. Compute the decision $f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$ from $y_1$ and $y_2$.
Each layer in the sequence consists of one or more linear classifications.
This is therefore a two-layer perceptron.
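To make the construction concrete, here is a minimal Python sketch of this two-layer perceptron; the function and variable names are illustrative, and only the weights and thresholds come from the slides.

```python
def step(a):
    # Unit step activation: 0 for a < 0, 1 for a >= 0
    return 1 if a >= 0 else 0

def two_layer_xor(x1, x2):
    # Layer 1 (hidden layer): two linear discriminants
    y1 = step(x1 + x2 - 0.5)   # g1(x) = x1 + x2 - 1/2
    y2 = step(x1 + x2 - 1.5)   # g2(x) = x1 + x2 - 3/2
    # Layer 2 (output): linear discriminant on (y1, y2)
    return step(y1 - y2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, two_layer_xor(x1, x2))   # reproduces the XOR truth table
```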
The Two-Layer Perceptron
Layer 1: $y_1 = f\!\left(g_1(x)\right)$, $y_2 = f\!\left(g_2(x)\right)$, with $g_1(x) = x_1 + x_2 - \tfrac{1}{2}$ and $g_2(x) = x_1 + x_2 - \tfrac{3}{2}$.
Layer 2: $f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$.

x1  x2 | y1     y2    | Class
0   0  | 0 (-)  0 (-) | B (0)
0   1  | 1 (+)  0 (-) | A (1)
1   0  | 1 (+)  0 (-) | A (1)
1   1  | 1 (+)  1 (+) | B (0)
The Two-Layer Perceptron
The first layer performs a nonlinear mapping $\left(y_1, y_2\right) = \left(f(g_1(x)),\, f(g_2(x))\right)$ that makes the data linearly separable.
The Two-Layer Perceptron Architecture
Input Layer | Hidden Layer | Output Layer
Hidden units: $y_1 = f\!\left(x_1 + x_2 - \tfrac{1}{2}\right)$, $y_2 = f\!\left(x_1 + x_2 - \tfrac{3}{2}\right)$; output unit: $f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$. A constant $-1$ input to each unit supplies the bias terms.
The Two-Layer Perceptron
Note that the hidden layer maps the plane onto the vertices of the unit square: $\left(y_1, y_2\right) = \left(f(g_1(x)),\, f(g_2(x))\right) \in \{0, 1\}^2$.
Higher Dimensions
Each hidden unit realizes a hyperplane discriminant function. The output of each hidden unit is 0 or 1, depending upon the location of the input vector relative to the hyperplane.
$x \in \mathbb{R}^l \;\rightarrow\; y = \left[y_1, \ldots, y_p\right]^t,\quad y_i \in \{0, 1\},\; i = 1, 2, \ldots, p$
Higher Dimensions
Together, the hidden units map the input onto the vertices of a p-dimensional unit hypercube.
Two-Layer Perceptron
These p hyperplanes partition the l-dimensional input space into polyhedral regions
Each region corresponds to a different vertex of the p-dimensional hypercube represented by the outputs of the hidden layer.
Two-Layer Perceptron
In this example, the vertex (0, 0, 1) corresponds to the region of the input space where $g_1(x) < 0$, $g_2(x) < 0$, and $g_3(x) > 0$.
Limitations of a Two-Layer Perceptron
The output neuron realizes a hyperplane in the transformed space that partitions the p vertices into two sets.
Thus, the two layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions.
But NOT ANY union: which unions can be realized depends on the relative position of the corresponding vertices.
How can we solve this problem?
The Three-Layer Perceptron
Suppose that Class A consists of the union of K polyhedra in the input space.
Use K neurons in the 2nd hidden layer.
Train each to classify one region as positive, the rest negative.
Now use an output neuron that implements the OR function.
The Three-Layer Perceptron
Thus the three-layer perceptron can separate classes resulting from any union of polyhedral regions in the input space.
The Three-Layer Perceptron
The first layer of the network forms the hyperplanes in the input space.
The second layer of the network forms the polyhedral regions of the input space
The third layer forms the appropriate unions of these regions and maps each to the appropriate class.
Learning Parameters
Training Data
The training data consist of N input-output pairs:
$\left(y(i),\, x(i)\right),\quad i = 1, \ldots, N$
where
$y(i) = \left[y_1(i), \ldots, y_{k_L}(i)\right]^t$ and $x(i) = \left[x_1(i), \ldots, x_{k_0}(i)\right]^t$
Choosing an Activation Function
The unit step activation function means that the error rate of the network is a discontinuous function of the weights.
This makes it difficult to learn optimal weights by minimizing the error.
To fix this problem, we need to use a smooth activation function.
A popular choice is the sigmoid function we used for logistic regression:
Smooth Activation Function
$f(a) = \dfrac{1}{1 + \exp(-a)}$
[Figure 8.3 (Prince): the logistic regression model in 1D and 2D; the model has a sigmoid profile in the direction of the gradient and a linear decision boundary.]
Output: Two Classes
For a binary classification problem, there is a single output node with activation function given by
$f(a) = \dfrac{1}{1 + \exp(-a)}$
Since the output is constrained to lie between 0 and 1, it can be interpreted as the probability of the input vector belonging to Class 1.
Output: K > 2 Classes
For a K-class problem, we use K outputs and the softmax function
$y_k = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$
Since the outputs are constrained to lie between 0 and 1 and sum to 1, $y_k$ can be interpreted as the probability that the input vector belongs to class $k$.
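As an illustration (not from the slides), a numerically stable softmax in Python subtracts the maximum activation before exponentiating, which leaves the result unchanged:

```python
import numpy as np

def softmax(a):
    # exp(a_k - c) / sum_j exp(a_j - c) equals exp(a_k) / sum_j exp(a_j)
    # for any constant c; using c = max(a) avoids overflow.
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())
    return e / e.sum()

y = softmax([2.0, 1.0, 0.1])
print(y, y.sum())   # outputs lie in (0, 1) and sum to 1
```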
Non-Convex
Now each layer of our multi-layer perceptron is a logistic regressor.
Recall that optimizing the weights in logistic regression results in a convex optimization problem.
Unfortunately the cascading of logistic regressors in the multi-layer perceptron makes the problem non-convex.
This makes it difficult to determine an exact solution. Instead, we typically use gradient descent to find a locally optimal solution for the weights. The specific learning algorithm is called the backpropagation algorithm.
Nov 21, 2011
End of Lecture
The Backpropagation Algorithm
Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
[Photos: Werbos, Rumelhart, Hinton]
Notation
Assume a network with L layers:
$k_0$ nodes in the input layer,
$k_r$ nodes in the $r$'th layer.
Notation
Let $y_k^{r-1}$ be the output of the $k$th neuron of layer $r-1$.
Let $w_{jk}^{r}$ be the weight of the synapse on the $j$th neuron of layer $r$ from the $k$th neuron of layer $r-1$.
Input
$y_k^0(i) = x_k(i),\quad k = 1, \ldots, k_0$
Notation
Let $v_j^r$ be the total input to the $j$th neuron of layer $r$:
$v_j^r(i) = \left(w_j^r\right)^t y^{r-1}(i) = \sum_{k=0}^{k_{r-1}} w_{jk}^r\, y_k^{r-1}(i)$
where we define $y_0^r(i) = +1$ to incorporate the bias term.
Then $y_j^r(i) = f\!\left(v_j^r(i)\right) = f\!\left(\sum_{k=0}^{k_{r-1}} w_{jk}^r\, y_k^{r-1}(i)\right)$.
Cost Function
A common cost function is the squared error:
$J = \sum_{i=1}^{N} \varepsilon(i)$
where $\varepsilon(i) \triangleq \frac{1}{2}\sum_{m=1}^{k_L}\left(e_m(i)\right)^2 = \frac{1}{2}\sum_{m=1}^{k_L}\left(\hat{y}_m(i) - y_m(i)\right)^2$
and $\hat{y}_m(i) = y_m^L(i)$ is the output of the network.
Cost Function
To summarize, the error for input $i$ is given by
$\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L}\left(e_m(i)\right)^2 = \frac{1}{2}\sum_{m=1}^{k_L}\left(\hat{y}_m(i) - y_m(i)\right)^2$
where $\hat{y}_m(i) = y_m^L(i)$ is the output of the output layer, and each layer is related to the previous layer through
$y_j^r(i) = f\!\left(v_j^r(i)\right)$ and $v_j^r(i) = \left(w_j^r\right)^t y^{r-1}(i)$.
Gradient Descent
Gradient descent starts with an initial guess at the weights over all layers of the network.
We then use these weights to compute the network output for each input vector x(i) in the training data.
This allows us to calculate the error $\varepsilon(i)$ for each of these inputs.
Then, in order to minimize this error, we incrementally update the weights in the negative gradient direction:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu\,\frac{\partial J}{\partial w_j^r} = w_j^r(\text{old}) - \mu\sum_{i=1}^{N}\frac{\partial\varepsilon(i)}{\partial w_j^r}$
Gradient Descent
Since $v_j^r(i) = \left(w_j^r\right)^t y^{r-1}(i)$, the influence of the $j$th weight of the $r$th layer on the error can be expressed as
$\frac{\partial\varepsilon(i)}{\partial w_j^r} = \frac{\partial\varepsilon(i)}{\partial v_j^r(i)}\,\frac{\partial v_j^r(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$
where $\delta_j^r(i) \triangleq \frac{\partial\varepsilon(i)}{\partial v_j^r(i)}$.
Gradient Descent
!" (i )
!wj
r= #
j
r (i )yr $1 (i ),
where
#j
r (i ) !!" (i )
!vj
r (i )
For an intermediate layer r,
we cannot compute !jr (i ) directly.
However, !jr (i ) can be computed inductively,
starting from the output layer.
Backpropagation: The Output Layer
Thus at the output layer we have
$\frac{\partial\varepsilon(i)}{\partial w_j^L} = \delta_j^L(i)\, y^{L-1}(i)$, where $\delta_j^L(i) \triangleq \frac{\partial\varepsilon(i)}{\partial v_j^L(i)}$
and $\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L}\left(e_m(i)\right)^2 = \frac{1}{2}\sum_{m=1}^{k_L}\left(\hat{y}_m(i) - y_m(i)\right)^2$.
Recall that $\hat{y}_j(i) = y_j^L(i) = f\!\left(v_j^L(i)\right)$. Thus
$\delta_j^L(i) = \frac{\partial\varepsilon(i)}{\partial v_j^L(i)} = \frac{\partial\varepsilon(i)}{\partial e_j^L(i)}\,\frac{\partial e_j^L(i)}{\partial v_j^L(i)} = e_j^L(i)\, f'\!\left(v_j^L(i)\right)$
For the sigmoid, $f(a) = \frac{1}{1+\exp(-a)} \;\Rightarrow\; f'(a) = f(a)\left(1 - f(a)\right)$, so
$\delta_j^L(i) = e_j^L(i)\, f\!\left(v_j^L(i)\right)\left(1 - f\!\left(v_j^L(i)\right)\right)$
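A quick numerical sanity check of the sigmoid derivative identity used above; this is a sketch with an arbitrary test point, not part of the lecture.

```python
import numpy as np

f = lambda a: 1.0 / (1.0 + np.exp(-a))      # sigmoid

a, h = 0.7, 1e-6
numeric = (f(a + h) - f(a - h)) / (2 * h)   # central finite difference
analytic = f(a) * (1 - f(a))                # f'(a) = f(a)(1 - f(a))
print(numeric, analytic)                     # the two values agree closely
```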
Backpropagation: Hidden Layers
Observe that the dependence of the error on the total input to a neuron in a previous layer can be expressed in terms of the dependence on the total inputs of neurons in the following layer:
$\delta_j^{r-1}(i) = \frac{\partial\varepsilon(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r}\frac{\partial\varepsilon(i)}{\partial v_k^r(i)}\,\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r}\delta_k^r(i)\,\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$
where $v_k^r(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r\, y_m^{r-1}(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r\, f\!\left(v_m^{r-1}(i)\right)$.
Thus we have $\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = w_{kj}^r\, f'\!\left(v_j^{r-1}(i)\right)$,
and so $\delta_j^{r-1}(i) = f'\!\left(v_j^{r-1}(i)\right)\sum_{k=1}^{k_r}\delta_k^r(i)\, w_{kj}^r = f\!\left(v_j^{r-1}(i)\right)\left(1 - f\!\left(v_j^{r-1}(i)\right)\right)\sum_{k=1}^{k_r}\delta_k^r(i)\, w_{kj}^r$.
Thus once the $\delta_k^r(i)$ are determined, they can be propagated backward to calculate $\delta_j^{r-1}(i)$ using this inductive formula.
Backpropagation: Summary of Algorithm
1. Initialization: Initialize all weights with small random values.
2. Forward Pass: For each input vector, run the network in the forward direction, calculating
$v_j^r(i) = \left(w_j^r\right)^t y^{r-1}(i)$; $\quad y_j^r(i) = f\!\left(v_j^r(i)\right)$;
and finally $\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L}\left(e_m(i)\right)^2 = \frac{1}{2}\sum_{m=1}^{k_L}\left(\hat{y}_m(i) - y_m(i)\right)^2$.
3. Backward Pass: Starting with the output layer, use the inductive formula to compute the $\delta_j^r(i)$:
Output layer (base case): $\delta_j^L(i) = e_j^L(i)\, f'\!\left(v_j^L(i)\right)$
Hidden layers (inductive case): $\delta_j^{r-1}(i) = f'\!\left(v_j^{r-1}(i)\right)\sum_{k=1}^{k_r}\delta_k^r(i)\, w_{kj}^r$
4. Update Weights:
$w_j^r(\text{new}) = w_j^r(\text{old}) - \mu\sum_{i=1}^{N}\frac{\partial\varepsilon(i)}{\partial w_j^r}$, where $\frac{\partial\varepsilon(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$.
Repeat until convergence.
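The following NumPy sketch runs this procedure on the XOR data from earlier, using a single hidden layer, sigmoid activations, and batch updates. The layer sizes, learning rate, and epoch count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda a: 1.0 / (1.0 + np.exp(-a))        # sigmoid activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs x(i)
T = np.array([[0], [1], [1], [0]], dtype=float)                # targets y(i)

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)      # input -> hidden
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)      # hidden -> output
mu = 1.0                                                        # learning rate

for epoch in range(20000):
    # Forward pass: v = total input, y = f(v)
    v1 = X @ W1 + b1; y1 = f(v1)
    v2 = y1 @ W2 + b2; y2 = f(v2)
    e = y2 - T                                # e(i) = network output - target

    # Backward pass: delta terms, using f'(v) = f(v)(1 - f(v))
    d2 = e * y2 * (1 - y2)                    # output layer (base case)
    d1 = (d2 @ W2.T) * y1 * (1 - y1)          # hidden layer (inductive case)

    # Batch weight update in the negative gradient direction
    W2 -= mu * y1.T @ d2; b2 -= mu * d2.sum(axis=0)
    W1 -= mu * X.T @ d1;  b1 -= mu * d1.sum(axis=0)

print(f(f(X @ W1 + b1) @ W2 + b2).round(2))   # typically close to [0, 1, 1, 0]
```

The outcome depends on the random initialization; a poor draw can leave the network in a local minimum, which is exactly the issue noted later under Remarks.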
Batch vs Online Learning
As described, on each iteration backprop updates the weights based upon all of the training data. This is called batch learning.
An alternative is to update the weights after each training input has been processed by the network, based only upon the error for that input. This is called online learning.
Batch: $w_j^r(\text{new}) = w_j^r(\text{old}) - \mu\sum_{i=1}^{N}\frac{\partial\varepsilon(i)}{\partial w_j^r}$
Online: $w_j^r(\text{new}) = w_j^r(\text{old}) - \mu\,\frac{\partial\varepsilon(i)}{\partial w_j^r}$
where in both cases $\frac{\partial\varepsilon(i)}{\partial w_j^r} = \delta_j^r(i)\, y^{r-1}(i)$.
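The contrast can be seen on a toy least-squares problem; this is a sketch, and the data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
t = 3.0 * x + rng.normal(scale=0.1, size=50)      # true weight is 3

def grad(w, xi, ti):
    # d/dw of the per-example error 0.5 * (w*xi - ti)^2
    return (w * xi - ti) * xi

mu, n_epochs = 0.05, 100

w_batch = 0.0                                      # batch: one step per epoch
for _ in range(n_epochs):
    w_batch -= mu * sum(grad(w_batch, xi, ti) for xi, ti in zip(x, t))

w_online = 0.0                                     # online: one step per example
for _ in range(n_epochs):
    for xi, ti in zip(x, t):
        w_online -= mu * grad(w_online, xi, ti)

print(w_batch, w_online)    # both estimates approach the true weight (about 3)
```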
Batch vs Online Learning
One advantage of batch learning is that averaging over all inputs when updating the weights should lead to smoother convergence.
On the other hand, the randomness associated with online learning might help to prevent convergence toward a local minimum.
Changing the order of presentation of the inputs from epoch to epoch may also improve results.
Remarks
Local Minima: The objective function is in general non-convex, and so the solution may not be globally optimal.
Stopping Criterion: Typically stop when the change in weights or the change in the error function falls below a threshold.
Learning Rate: The speed and reliability of convergence depend on the learning rate μ.
Generalizing Linear Classifiers
One way of tackling problems that are not linearly separable is to transform the input in a nonlinear fashion prior to applying a linear classifier.
The result is that decision boundaries that are linear in the resulting feature space may be highly nonlinear in the original input space.
[Figure: a nonlinear decision boundary in the input space (x1, x2) corresponds to a linear boundary in the feature space (φ1, φ2)]
Nonlinear Basis Function Models
Generally,
$y(x, w) = \sum_{j} w_j\, \phi_j(x) = w^t \phi(x)$
where the $\phi_j(x)$ are known as basis functions. Typically, $\phi_0(x) = 1$, so that $w_0$ acts as a bias.
Nonlinear basis functions for classification
In the context of classification, the discriminant function in the feature space becomes
$g\!\left(y(x)\right) = w_0 + \sum_{i=1}^{M} w_i\, y_i(x) = w_0 + \sum_{i=1}^{M} w_i\, \phi_i(x)$
This formulation can be thought of as an input-space approximation of the true separating discriminant function $g(x)$ using a set of interpolation functions $\phi_i(x)$.
Dimensionality
The dimensionality M of the feature space may be less than, equal to, or greater than the dimensionality D of the original input space.
M < D: This may result in a factoring out of irrelevant dimensions, a reduction in the number of model parameters, and a resulting improvement in generalization (reduced overlearning).
M > D: Problems that are not linearly separable in the input space may become separable in the feature space, and the probability of linear separability generally increases with the dimensionality of the feature space. Thus choosing M >> D helps to make the problem linearly separable.
Cover’s Theorem
“A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.” — Cover, T.M., "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition," 1965
Example
Radial Basis Functions
Consider interpolation functions (kernels) of the form
$\phi_i\!\left(\left\|x - \mu_i\right\|\right)$
In other words, the feature value depends only upon the Euclidean distance to a 'centre point' in the input space.
A commonly used RBF is the isotropic Gaussian:
$\phi_i(x) = \exp\!\left(-\frac{1}{2\sigma_i^2}\left\|x - \mu_i\right\|^2\right)$
Relation to KDE
We can use Gaussian RBFs to approximate the discriminant function $g(x)$:
$g\!\left(y(x)\right) = w_0 + \sum_{i=1}^{M} w_i\, y_i(x) = w_0 + \sum_{i=1}^{M} w_i\, \phi_i(x)$
where
$\phi_i(x) = \exp\!\left(-\frac{1}{2\sigma_i^2}\left\|x - \mu_i\right\|^2\right)$
This is reminiscent of kernel density estimation, where we approximated probability densities as a normalized sum of Gaussian kernels.
Relation to KDE
For KDE we planted a kernel at each data point. Thus there were N kernels.
For RBF networks, we generally use far fewer kernels than the number of data points: M << N.
This leads to greater efficiency and generalization.
RBF Networks
The linear classifier with nonlinear radial basis functions can be considered an artificial neural network in which the hidden nodes are nonlinear (e.g., Gaussian) and the output node is linear.
RBF Network for 2 Classes
RBF Networks vs Perceptrons
Recall that for a perceptron, the output of a hidden unit is invariant on a hyperplane.
For an RBF, the output of a hidden unit is invariant on a circle centred on μi.
Thus hidden units are global in a perceptron, but local in an RBF network.
RBF Network for 2 Classes
RBF Networks vs Perceptrons
This difference has consequences:
Multilayer perceptrons tend to learn more slowly than RBFs.
However, multilayer perceptrons tend to have better generalization properties, especially in regions of the input space where training data are sparse.
Typically, more neurons are needed for an RBF than for a multilayer perceptron to solve a given problem.
Parameters
There are two options for choosing the parameters (centres and scales) of the RBFs:
1. Fixed. For example, randomly select a subset of M of the input vectors and use these as centres. Use a common scale based upon your judgement.
2. Learned. Note that when the RBF parameters are fixed, the weights could be learned using linear classifier techniques (e.g., least squares). Thus the RBF parameters could be learned in an outer loop, by gradient descent.
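A minimal sketch of option 1, with fixed randomly chosen centres, a common scale, and output weights fit by least squares; the toy data, M, and σ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class data: two noisy rings (not linearly separable in input space)
N = 200
theta = rng.uniform(0, 2 * np.pi, N)
r = np.where(np.arange(N) < N // 2, 0.5, 1.5) + rng.normal(scale=0.1, size=N)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
t = (np.arange(N) >= N // 2).astype(float)         # labels 0 / 1

M, sigma = 20, 0.5
centres = X[rng.choice(N, M, replace=False)]        # fixed, randomly chosen centres

def rbf_features(X, centres, sigma):
    # phi_i(x) = exp(-||x - mu_i||^2 / (2 sigma^2)), plus a constant bias column
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigma ** 2))
    return np.c_[np.ones(len(X)), Phi]

Phi = rbf_features(X, centres, sigma)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)          # least-squares weights

pred = (Phi @ w > 0.5).astype(float)
print("training accuracy:", (pred == t).mean())      # typically close to 1.0
```

Because the centres and scale are held fixed, fitting the weights reduces to an ordinary linear least-squares problem.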
The Kernel Function
Recall that an SVM is the solution to the problem
Maximize $\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m\, k(x_n, x_m)$
subject to $0 \leq a_n \leq C$ and $\sum_{n=1}^{N} a_n t_n = 0$.
A new input $x$ is classified by computing
$y(x) = \sum_{n \in S} a_n t_n\, k(x, x_n) + b$
where $S$ is the set of support vectors.
Here we introduced the kernel function $k(x, x')$, defined as
$k(x, x') = \phi(x)^t \phi(x')$
This is more than a notational convenience!!
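To make the classification rule concrete, the sketch below evaluates y(x) with a Gaussian kernel; the support vectors, multipliers a_n, targets t_n, and bias b are made-up values standing in for the output of the dual optimization.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

support_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # x_n, n in S
a = np.array([0.7, 0.7, 1.4])                                      # multipliers a_n
t = np.array([+1.0, +1.0, -1.0])                                   # targets t_n
b = 0.1

def predict(x):
    # y(x) = sum_{n in S} a_n t_n k(x, x_n) + b, thresholded at zero
    s = sum(a_n * t_n * gaussian_kernel(x, x_n)
            for a_n, t_n, x_n in zip(a, t, support_vectors))
    return np.sign(s + b)

print(predict(np.array([0.2, 0.9])))   # classify a new input
```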
The Kernel Trick
Note that the basis functions and individual training vectors are no longer part of the objective function.
Instead all we need is the kernel value (like a distance measure) for all pairs of training vectors.
The Kernel Function
The kernel function $k(x, x')$ measures the 'similarity' of input vectors $x$ and $x'$ as an inner product in a feature space defined by the feature space mapping $\phi(x)$: $k(x, x') = \phi(x)^t \phi(x')$.
If $k(x, x') = k(x - x')$ we say that the kernel is stationary.
If $k(x, x') = k\!\left(\left\|x - x'\right\|\right)$ we call it a radial basis function.
Constructing Kernels
We can construct a kernel by selecting a feature space mapping ϕ(x) and then defining
$k(x, x') = \phi(x)^t \phi(x')$
[Figure: Gaussian basis functions φ(x′) and the resulting kernel function k(x, x′)]
Constructing Kernels
Alternatively, we can construct the kernel function directly, ensuring that it corresponds to an inner product in some (possibly infinite-dimensional) feature space.
Constructing Kernels
$k(x, x') = \phi(x)^t \phi(x')$
Example 1: $k(x, z) = x^t z$
Example 2: $k(x, z) = x^t z + c,\; c > 0$
Example 3: $k(x, z) = \left(x^t z\right)^2$
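For Example 3, one valid feature map in two dimensions is φ(x) = (x1^2, √2·x1·x2, x2^2); the sketch below (the test vectors are arbitrary) checks numerically that the kernel equals the inner product of the mapped vectors.

```python
import numpy as np

def k(x, z):
    # k(x, z) = (x^t z)^2
    return float(x @ z) ** 2

def phi(x):
    # Explicit feature map for the 2-D quadratic kernel
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 1.0])
print(k(x, z), phi(x) @ phi(z))   # the two values agree
```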
Kernel Properties
Kernels obey certain properties that make it easy to construct complex kernels from simpler ones.
Combining Kernels
Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following kernels will also be valid:
$k(x, x') = c\,k_1(x, x')$ (6.13)
$k(x, x') = f(x)\,k_1(x, x')\,f(x')$ (6.14)
$k(x, x') = q\!\left(k_1(x, x')\right)$ (6.15)
$k(x, x') = \exp\!\left(k_1(x, x')\right)$ (6.16)
$k(x, x') = k_1(x, x') + k_2(x, x')$ (6.17)
$k(x, x') = k_1(x, x')\,k_2(x, x')$ (6.18)
$k(x, x') = k_3\!\left(\phi(x), \phi(x')\right)$ (6.19)
$k(x, x') = x^t A\, x'$ (6.20)
$k(x, x') = k_a(x_a, x'_a) + k_b(x_b, x'_b)$ (6.21)
$k(x, x') = k_a(x_a, x'_a)\,k_b(x_b, x'_b)$ (6.22)
where $c > 0$, $f(\cdot)$ is any function, $q(\cdot)$ is a polynomial with nonnegative coefficients, $\phi(x)$ is a mapping from $x$ to $\mathbb{R}^M$, $k_3$ is a valid kernel on $\mathbb{R}^M$, $A$ is a symmetric positive semidefinite matrix, $x_a$ and $x_b$ are variables such that $x = (x_a, x_b)$, and $k_a$, $k_b$ are valid kernels over their respective spaces.
(Bishop, Chapter 6: Kernel Methods; slide adapted from Vasil Khalidov and Alex Klaser.)
Constructing Kernels
Examples:
$k(x, x') = \left(x^t x' + c\right)^M,\; c > 0$ (use 6.18)
$k(x, x') = \exp\!\left(-\left\|x - x'\right\|^2 / 2\sigma^2\right)$ (use 6.14 and 6.16); corresponds to an infinite-dimensional feature vector.
Nonlinear SVM Example (Gaussian Kernel)
[Figure: decision boundary in the input space (x1, x2)]
SVMs for Regression
In standard linear regression, we minimize
$\frac{1}{2}\sum_{n=1}^{N}\left(y_n - t_n\right)^2 + \frac{\lambda}{2}\left\|w\right\|^2$
This penalizes all deviations from the model.
To obtain sparse solutions, we replace the quadratic error function by an $\epsilon$-insensitive error function, e.g.,
$E_\epsilon\!\left(y(x) - t\right) = \begin{cases} 0, & \text{if } \left|y(x) - t\right| < \epsilon \\ \left|y(x) - t\right| - \epsilon, & \text{otherwise} \end{cases}$
See text for details of the solution.
[Figure: the ε-insensitive tube y(x) ± ε; points outside the tube incur slack ξ > 0]
Example
[Figure: SVM regression example, target t versus input x]
Relevance Vector Machines
Some drawbacks of SVMs:
- They do not provide posterior probabilities.
- They are not easily generalized to K > 2 classes.
- Parameters (C, ε) must be learned by cross-validation.
The Relevance Vector Machine is a sparse Bayesian kernel technique that avoids these drawbacks.
RVMs also typically lead to sparser models.
RVMs for Regression
$p(t \mid x, w, \beta) = \mathcal{N}\!\left(t \mid y(x), \beta^{-1}\right)$, where $y(x) = w^t \phi(x)$.
In an RVM, the basis functions $\phi(x)$ are kernels $k(x, x_n)$:
$y(x) = \sum_{n=1}^{N} w_n\, k(x, x_n) + b$
However, unlike in SVMs, the kernels need not be positive definite, and the $x_n$ need not be the training data points.
RVMs for Regression
Note that each weight parameter has its own precision hyperparameter.
Likelihood: $p(t \mid X, w, \beta) = \prod_{n=1}^{N} p\!\left(t_n \mid x_n, w, \beta\right)$, where the $n$th row of $X$ is $x_n^t$.
Prior: $p(w \mid \alpha) = \prod_{i=1}^{M} \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right)$
RVMs for Regression
The conjugate prior for the precision of a Gaussian is a gamma distribution. Integrating out the precision parameter leads to a Student's t distribution over $w_i$. Thus the distribution over $w$ is a product of Student's t distributions. As a result, maximizing the evidence will yield a sparse $w$. Note that to achieve sparsity it is critical that each parameter $w_i$ has a separate precision $\alpha_i$.
$p(w_i \mid \alpha_i) = \mathcal{N}\!\left(w_i \mid 0, \alpha_i^{-1}\right), \quad p(\alpha_i) = \mathrm{Gam}\!\left(\alpha_i \mid a, b\right), \quad p(w_i) = \mathrm{St}\!\left(w_i \mid 2a\right)$
[Figure (Tipping, "Bayesian Inference: Principles and Practice in Machine Learning", Fig. 8): contour plots over $(w_1, w_2)$ of the Gaussian prior, the marginal prior with a single shared hyperparameter, and the marginal prior with independent hyperparameters; the independent-α prior has a sharp peak at zero and places most of its mass along the axes, which favours sparse solutions.]
RVMs for Regression
Recall the Student's t distribution:
$p(x \mid \nu) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
and the Gamma distribution:
$p(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}$
Also recall the rule for transforming densities: if $y$ is a monotonic function of $x$, then $p_Y(y) = p_X(x)\left|\frac{dx}{dy}\right|$.
Thus if we let $a \to 0$, $b \to 0$, then $p(\log\alpha_i) \to$ uniform and $p(w_i) \propto 1/\left|w_i\right|$. Very sparse!!
RVMs for Regression
In practice, it is difficult to integrate α out exactly. Instead, we use an approximate maximum likelihood method, finding ML values for each $\alpha_i$. When we maximize the evidence with respect to these hyperparameters, many will go to $\infty$. As a result, the corresponding weights will go to 0, yielding a sparse solution.
RVMs for Regression
Since both the likelihood and prior are normal, the posterior over w will also be normal:
Posterior: $p(w \mid t, X, \alpha, \beta) = \mathcal{N}\!\left(w \mid m, \Sigma\right)$
where $m = \beta\,\Sigma\,\Phi^t t$, $\quad \Sigma = \left(A + \beta\,\Phi^t\Phi\right)^{-1}$, $\quad \Phi_{ni} = \phi_i(x_n)$, $\quad A = \mathrm{diag}(\alpha_i)$.
Note that when $\alpha_i \to \infty$, the $i$th row and column of $\Sigma \to 0$, and $p(w_i \mid t, X, \alpha, \beta) = \mathcal{N}\!\left(w_i \mid 0, 0\right)$.
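The posterior mean and covariance can be computed directly for fixed α and β. The sketch below uses a Gaussian kernel design matrix on toy data; the data, kernel width, and hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
x = np.sort(rng.uniform(0, 1, N))
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)

def design_matrix(x, centres, s=0.1):
    # Phi_ni = phi_i(x_n) = k(x_n, x_i), Gaussian kernel plus a bias column
    K = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s ** 2))
    return np.c_[np.ones(len(x)), K]

Phi = design_matrix(x, x)                  # kernels centred on the data points
alpha = np.ones(Phi.shape[1])              # one precision per weight
beta = 100.0                               # noise precision (1 / sigma^2)

A = np.diag(alpha)
Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # posterior covariance
m = beta * Sigma @ Phi.T @ t                    # posterior mean

print(m[:5])   # posterior mean weights for the first few basis functions
```

In the full RVM these fixed values of α and β would then be re-estimated by maximizing the evidence, as described next.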
RVMs for Regression
The values for α and β are determined using the evidence approximation, where we maximize
$p(t \mid X, \alpha, \beta) = \int p(t \mid X, w, \beta)\, p(w \mid \alpha)\, dw$
In general, this results in many of the precision parameters $\alpha_i \to \infty$, so that $w_i \to 0$.
Unfortunately, this is a non-convex problem.
Example
[Figure: RVM regression example, target t versus input x]