NONLINEAR CLASSIFICATION AND REGRESSION

J. Elder, CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition


Nonlinear Classification and Regression: Outline

  Multi-Layer Perceptrons
    The Back-Propagation Learning Algorithm
  Generalized Linear Models
    Radial Basis Function Networks
    Sparse Kernel Machines
      Nonlinear SVMs and the Kernel Trick
      Relevance Vector Machines


Implementing Logical Relations

  AND and OR operations are linearly separable problems
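A single linear threshold unit suffices for each of these operations. The sketch below is a minimal illustration, not taken from the slides; the particular weights and thresholds (1.5 for AND, 0.5 for OR) are assumptions chosen so that the unit-step output reproduces the truth tables.

    # Minimal sketch: AND and OR as single linear threshold units (perceptrons).
    # Weights and thresholds are illustrative choices, not values from the slides.
    def step(a):
        # unit step activation: 1 if a >= 0, else 0
        return 1 if a >= 0 else 0

    def AND(x1, x2):
        return step(x1 + x2 - 1.5)   # fires only when both inputs are 1

    def OR(x1, x2):
        return step(x1 + x2 - 0.5)   # fires when at least one input is 1

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, AND(x1, x2), OR(x1, x2))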


The XOR Problem

  XOR is not linearly separable.

  How can we use linear classifiers to solve this problem?

x1  x2  XOR  Class
0   0   0    B
0   1   1    A
1   0   1    A
1   1   0    B


Combining two linear classifiers

  Idea: use a logical combination of two linear classifiers.

g_1(x) = x_1 + x_2 - \frac{1}{2}

g_2(x) = x_1 + x_2 - \frac{3}{2}


Combining two linear classifiers

Let f(x) be the unit step activation function:

f(x) = 0, \ x < 0
f(x) = 1, \ x \ge 0

g_1(x) = x_1 + x_2 - \frac{1}{2}
g_2(x) = x_1 + x_2 - \frac{3}{2}

Observe that the classification problem is then solved by

f\left( y_1 - y_2 - \frac{1}{2} \right)

where y_1 = f\left( g_1(x) \right) and y_2 = f\left( g_2(x) \right).


Combining two linear classifiers

  This calculation can be implemented sequentially:
  1.  Compute y1 and y2 from x1 and x2.
  2.  Compute the decision from y1 and y2.

  Each layer in the sequence consists of one or more linear classifications.

  This is therefore a two-layer perceptron.

g_1(x) = x_1 + x_2 - \frac{1}{2}, \quad g_2(x) = x_1 + x_2 - \frac{3}{2}

f\left( y_1 - y_2 - \frac{1}{2} \right), where y_1 = f\left( g_1(x) \right) and y_2 = f\left( g_2(x) \right)
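As a concrete check, the two-layer construction can be written directly in code. This is a minimal sketch that uses only the discriminants and unit step activation given above; nothing further is assumed.

    # Two-layer perceptron for XOR, using the discriminants from the slides:
    # layer 1: g1(x) = x1 + x2 - 1/2, g2(x) = x1 + x2 - 3/2
    # layer 2: f(y1 - y2 - 1/2)
    def step(a):
        return 1 if a >= 0 else 0

    def xor_net(x1, x2):
        y1 = step(x1 + x2 - 0.5)    # layer 1, first unit
        y2 = step(x1 + x2 - 1.5)    # layer 1, second unit
        return step(y1 - y2 - 0.5)  # layer 2 (output unit)

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor_net(x1, x2))   # prints 0, 1, 1, 0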


The Two-Layer Perceptron

Layer 1: y_1 = f\left( g_1(x) \right), y_2 = f\left( g_2(x) \right), with g_1(x) = x_1 + x_2 - \frac{1}{2} and g_2(x) = x_1 + x_2 - \frac{3}{2}

Layer 2: f\left( y_1 - y_2 - \frac{1}{2} \right)

x1  x2  y1    y2    Class
0   0   0(-)  0(-)  B(0)
0   1   1(+)  0(-)  A(1)
1   0   1(+)  0(-)  A(1)
1   1   1(+)  1(+)  B(0)


The Two-Layer Perceptron

  The first layer performs a nonlinear mapping that makes the data linearly separable.

Layer 1: y_1 = f\left( g_1(x) \right), y_2 = f\left( g_2(x) \right), with g_1(x) = x_1 + x_2 - \frac{1}{2} and g_2(x) = x_1 + x_2 - \frac{3}{2}

Layer 2: f\left( y_1 - y_2 - \frac{1}{2} \right)


The Two-Layer Perceptron Architecture

[Figure: Input Layer, Hidden Layer, Output Layer; a constant input of -1 supplies the bias terms.]

Hidden layer: g_1(x) = x_1 + x_2 - \frac{1}{2}, \quad g_2(x) = x_1 + x_2 - \frac{3}{2}

Output layer: f\left( y_1 - y_2 - \frac{1}{2} \right)


The Two-Layer Perceptron

  Note that the hidden layer maps the plane onto the vertices of a unit square.

y_1 = f\left( g_1(x) \right), y_2 = f\left( g_2(x) \right), with g_1(x) = x_1 + x_2 - \frac{1}{2} and g_2(x) = x_1 + x_2 - \frac{3}{2}


Higher Dimensions

  Each hidden unit realizes a hyperplane discriminant function.

  The output of each hidden unit is 0 or 1 depending upon the location of the input vector relative to the hyperplane.

x \in \mathbb{R}^l \rightarrow y = \left[ y_1, \ldots, y_p \right]^T, \quad y_i \in \{0, 1\}, \ i = 1, 2, \ldots, p


Higher Dimensions

  Together, the hidden units map the input onto the vertices of a p-dimensional unit hypercube.

x \in \mathbb{R}^l \rightarrow y = \left[ y_1, \ldots, y_p \right]^T, \quad y_i \in \{0, 1\}, \ i = 1, 2, \ldots, p


Two-Layer Perceptron

  These p hyperplanes partition the l-dimensional input space into polyhedral regions.

  Each region corresponds to a different vertex of the p-dimensional hypercube represented by the outputs of the hidden layer.


Two-Layer Perceptron

  In this example, the vertex (0, 0, 1) corresponds to the region of the input space where:
    g1(x) < 0
    g2(x) < 0
    g3(x) > 0


Limitations of a Two-Layer Perceptron

  The output neuron realizes a hyperplane in the transformed space that partitions the p vertices into two sets.

  Thus, the two-layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions.

  But NOT ANY union. It depends on the relative position of the corresponding vertices.

  How can we solve this problem?


The Three-Layer Perceptron

  Suppose that Class A consists of the union of K polyhedra in the input space.

  Use K neurons in the 2nd hidden layer.

  Train each to classify one region as positive, the rest negative.

  Now use an output neuron that implements the OR function.
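This construction can be sketched in code. In the minimal illustration below, the K second-hidden-layer outputs z_1, ..., z_K are assumed to each equal 1 exactly when the input falls in their region of Class A, and the output neuron is an OR unit with an assumed threshold; none of these numbers come from the slides.

    # Sketch: the output neuron of a three-layer perceptron as an OR over the K
    # second-hidden-layer units z_1..z_K (z_k = 1 iff x lies in region k of Class A).
    def step(a):
        return 1 if a >= 0 else 0

    def output_neuron(z):
        # OR of the region indicators: fires if any z_k is 1
        return step(sum(z) - 0.5)

    print(output_neuron([0, 0, 0]))  # 0 -> Class B
    print(output_neuron([0, 1, 0]))  # 1 -> Class A (input lies in region 2)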


The Three-Layer Perceptron

  Thus the three-layer perceptron can separate classes resulting from any union of polyhedral regions in the input space.


The Three-Layer Perceptron

  The first layer of the network forms the hyperplanes in the input space.

  The second layer of the network forms the polyhedral regions of the input space.

  The third layer forms the appropriate unions of these regions and maps each to the appropriate class.


Learning Parameters


Training Data

  The training data consist of N input-output pairs:

\left( y(i), x(i) \right), \quad i \in \{1, \ldots, N\}

where

y(i) = \left[ y_1(i), \ldots, y_{k_L}(i) \right]^t and x(i) = \left[ x_1(i), \ldots, x_{k_0}(i) \right]^t


Choosing an Activation Function

  The unit step activation function means that the error rate of the network is a discontinuous function of the weights.

  This makes it difficult to learn optimal weights by minimizing the error.

  To fix this problem, we need to use a smooth activation function.

  A popular choice is the sigmoid function we used for logistic regression:


Smooth Activation Function

f(a) = \frac{1}{1 + \exp(-a)}

[Figure: logistic regression fits in 1D and 2D; the model has a sigmoid profile in the direction of the gradient and a linear decision boundary. From Prince, Computer Vision: Models, Learning, and Inference, Figure 8.3.]
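A small sketch of this activation and of the derivative used later in backpropagation; the vectorized NumPy form is an implementation choice, not something specified on the slides.

    import numpy as np

    def sigmoid(a):
        # smooth activation f(a) = 1 / (1 + exp(-a))
        return 1.0 / (1.0 + np.exp(-a))

    def sigmoid_prime(a):
        # derivative f'(a) = f(a) (1 - f(a)), used in the backpropagation updates
        f = sigmoid(a)
        return f * (1.0 - f)

    print(sigmoid(0.0), sigmoid_prime(0.0))   # 0.5 0.25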


Output: Two Classes

  For a binary classification problem, there is a single output node with activation function given by

  Since the output is constrained to lie between 0 and 1, it can be interpreted as the probability of the input vector belonging to Class 1.

f(a) = \frac{1}{1 + \exp(-a)}


Output: K > 2 Classes

  For a K-class problem, we use K outputs, and the softmax function given by

  Since the outputs are constrained to lie between 0 and 1, and sum to 1, yk can be interpreted as the probability that the input vector belongs to Class k.

y_k = \frac{\exp\left( a_k \right)}{\sum_j \exp\left( a_j \right)}
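A minimal sketch of this output layer; subtracting the maximum activation is a standard numerical-stability detail added here as an assumption, not something discussed on the slides.

    import numpy as np

    def softmax(a):
        # y_k = exp(a_k) / sum_j exp(a_j); subtracting max(a) avoids overflow
        e = np.exp(a - np.max(a))
        return e / e.sum()

    y = softmax(np.array([2.0, 1.0, 0.1]))
    print(y, y.sum())   # probabilities over the K classes, summing to 1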


Non-Convex

  Now each layer of our multi-layer perceptron is a logistic regressor.

  Recall that optimizing the weights in logistic regression results in a convex optimization problem.

  Unfortunately the cascading of logistic regressors in the multi-layer perceptron makes the problem non-convex.

  This makes it difficult to determine an exact solution.

  Instead, we typically use gradient descent to find a locally optimal solution to the weights.

  The specific learning algorithm is called the backpropagation algorithm.


Nov 21, 2011

End of Lecture


Nonlinear Classification and Regression: Outline

  Multi-Layer Perceptrons
    The Back-Propagation Learning Algorithm
  Generalized Linear Models
    Radial Basis Function Networks
    Sparse Kernel Machines
      Nonlinear SVMs and the Kernel Trick
      Relevance Vector Machines


The Backpropagation Algorithm

Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.

Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533-536.

[Photos: Werbos, Rumelhart, Hinton]


Notation

  Assume a network with L layers:
    k_0 nodes in the input layer.
    k_r nodes in the r'th layer.


Notation

Let y_k^{r-1} be the output of the kth neuron of layer r-1.

Let w_{jk}^r be the weight of the synapse on the jth neuron of layer r from the kth neuron of layer r-1.


Input

y_k^0(i) = x_k(i), \quad k = 1, \ldots, k_0


Notation

Let v_j^r be the total input to the jth neuron of layer r:

v_j^r(i) = \left( w_j^r \right)^t y^{r-1}(i) = \sum_{k=0}^{k_{r-1}} w_{jk}^r \, y_k^{r-1}(i)

where we define y_0^r(i) = +1 to incorporate the bias term.

Then y_j^r(i) = f\left( v_j^r(i) \right) = f\left( \sum_{k=0}^{k_{r-1}} w_{jk}^r \, y_k^{r-1}(i) \right)
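A short sketch of this forward recursion for a single layer, with the +1 bias entry prepended to the previous layer's outputs; the NumPy representation (one weight matrix per layer, bias weights in column 0) is an implementation assumption.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def layer_forward(W, y_prev):
        # W: k_r x (k_{r-1} + 1) weight matrix for layer r (column 0 holds the biases)
        # y_prev: outputs y^{r-1} of the previous layer
        y_aug = np.concatenate([[1.0], y_prev])   # prepend the constant +1 bias input
        v = W @ y_aug                             # total inputs v_j^r
        return sigmoid(v)                         # outputs y_j^r = f(v_j^r)

    W1 = np.array([[-0.5, 1.0, 1.0],              # e.g. g1: x1 + x2 - 1/2
                   [-1.5, 1.0, 1.0]])             #      g2: x1 + x2 - 3/2
    print(layer_forward(W1, np.array([1.0, 0.0])))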


Cost Function

  A common cost function is the squared error:

J = \sum_{i=1}^{N} \varepsilon(i)

where \varepsilon(i) \equiv \frac{1}{2} \sum_{m=1}^{k_L} \left( e_m(i) \right)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left( y_m(i) - \hat{y}_m(i) \right)^2

and \hat{y}_m(i) = y_m^L(i) is the output of the network.


Cost Function

  To summarize, the error for input i is given by

\varepsilon(i) = \frac{1}{2} \sum_{m=1}^{k_L} \left( e_m(i) \right)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left( y_m(i) - \hat{y}_m(i) \right)^2

  where \hat{y}_m(i) = y_m^L(i) is the output of the output layer, and each layer is related to the previous layer through

y_j^r(i) = f\left( v_j^r(i) \right)

  and

v_j^r(i) = \left( w_j^r \right)^t y^{r-1}(i)


Gradient Descent

  Gradient descent starts with an initial guess at the weights over all layers of the network.

  We then use these weights to compute the network output for each input vector x(i) in the training data.

  This allows us to calculate the error \varepsilon(i) for each of these inputs.

  Then, in order to minimize this error, we incrementally update the weights in the negative gradient direction:

w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \frac{\partial J}{\partial w_j^r} = w_j^r(\text{old}) - \mu \sum_{i=1}^{N} \frac{\partial \varepsilon(i)}{\partial w_j^r}

where \varepsilon(i) = \frac{1}{2} \sum_{m=1}^{k_L} \left( e_m(i) \right)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left( y_m(i) - \hat{y}_m(i) \right)^2


Gradient Descent

  Since v_j^r(i) = \left( w_j^r \right)^t y^{r-1}(i), the influence of the jth weight vector of the rth layer on the error can be expressed as:

\frac{\partial \varepsilon(i)}{\partial w_j^r} = \frac{\partial \varepsilon(i)}{\partial v_j^r(i)} \frac{\partial v_j^r(i)}{\partial w_j^r} = \delta_j^r(i) \, y^{r-1}(i)

where \delta_j^r(i) \equiv \frac{\partial \varepsilon(i)}{\partial v_j^r(i)}


Gradient Descent

\frac{\partial \varepsilon(i)}{\partial w_j^r} = \delta_j^r(i) \, y^{r-1}(i), \quad where \delta_j^r(i) \equiv \frac{\partial \varepsilon(i)}{\partial v_j^r(i)}

For an intermediate layer r, we cannot compute \delta_j^r(i) directly.

However, \delta_j^r(i) can be computed inductively, starting from the output layer.


Backpropagation: The Output Layer

Thus at the output layer we have

\frac{\partial \varepsilon(i)}{\partial w_j^L} = \delta_j^L(i) \, y^{L-1}(i), where \delta_j^L(i) \equiv \frac{\partial \varepsilon(i)}{\partial v_j^L(i)}

and \varepsilon(i) = \frac{1}{2} \sum_{m=1}^{k_L} \left( e_m(i) \right)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left( y_m(i) - \hat{y}_m(i) \right)^2

Recall that \hat{y}_j(i) = y_j^L(i) = f\left( v_j^L(i) \right). Thus

\delta_j^L(i) = \frac{\partial \varepsilon(i)}{\partial v_j^L(i)} = \frac{\partial \varepsilon(i)}{\partial e_j^L(i)} \frac{\partial e_j^L(i)}{\partial v_j^L(i)} = e_j^L(i) \, f'\left( v_j^L(i) \right)

Since f(a) = \frac{1}{1 + \exp(-a)} \Rightarrow f'(a) = f(a)\left( 1 - f(a) \right), we have

\delta_j^L(i) = e_j^L(i) \, f\left( v_j^L(i) \right) \left( 1 - f\left( v_j^L(i) \right) \right)


Backpropagation: Hidden Layers

  Observe that the dependence of the error on the total input to a neuron in a previous layer can be expressed in terms of the dependence on the total input of neurons in the following layer:

\delta_j^{r-1}(i) = \frac{\partial \varepsilon(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \frac{\partial \varepsilon(i)}{\partial v_k^r(i)} \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \delta_k^r(i) \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}

where v_k^r(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r \, y_m^{r-1}(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r \, f\left( v_m^{r-1}(i) \right)

Thus we have \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = w_{kj}^r \, f'\left( v_j^{r-1}(i) \right)

and so \delta_j^{r-1}(i) = \frac{\partial \varepsilon(i)}{\partial v_j^{r-1}(i)} = f'\left( v_j^{r-1}(i) \right) \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r = f\left( v_j^{r-1}(i) \right) \left( 1 - f\left( v_j^{r-1}(i) \right) \right) \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r

Thus once the \delta_k^r(i) are determined they can be propagated backward to calculate \delta_j^{r-1}(i) using this inductive formula.


Backpropagation: Summary of Algorithm

1.  Initialization
      Initialize all weights with small random values.

2.  Forward Pass
      For each input vector, run the network in the forward direction, calculating:
      y_j^r(i) = f\left( v_j^r(i) \right), where v_j^r(i) = \left( w_j^r \right)^t y^{r-1}(i)
      and finally \varepsilon(i) = \frac{1}{2} \sum_{m=1}^{k_L} \left( e_m(i) \right)^2 = \frac{1}{2} \sum_{m=1}^{k_L} \left( y_m(i) - \hat{y}_m(i) \right)^2

3.  Backward Pass
      Starting with the output layer, use our inductive formula to compute the \delta_j^{r-1}(i):
      Output Layer (Base Case): \delta_j^L(i) = e_j^L(i) \, f'\left( v_j^L(i) \right)
      Hidden Layers (Inductive Case): \delta_j^{r-1}(i) = f'\left( v_j^{r-1}(i) \right) \sum_{k=1}^{k_r} \delta_k^r(i) \, w_{kj}^r

4.  Update Weights
      w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \sum_{i=1}^{N} \frac{\partial \varepsilon(i)}{\partial w_j^r}, where \frac{\partial \varepsilon(i)}{\partial w_j^r} = \delta_j^r(i) \, y^{r-1}(i)

Repeat until convergence.
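The sketch below puts these four steps together for a small one-hidden-layer sigmoid network trained on the XOR data from earlier. It is a minimal illustration under stated assumptions: the layer sizes, learning rate, number of epochs, random seed, and NumPy batch formulation are all choices made here, not values from the slides.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # XOR training set: inputs X (N x k0), targets T (N x kL)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)

    k0, k1, kL = 2, 3, 1      # input, hidden, output layer sizes (assumed)
    mu = 0.5                  # learning rate (assumed)

    # 1. Initialization: small random weights; column 0 is the bias weight,
    #    acting on a constant +1 input.
    W1 = rng.normal(scale=0.5, size=(k1, k0 + 1))
    W2 = rng.normal(scale=0.5, size=(kL, k1 + 1))

    def add_bias(Y):
        return np.hstack([np.ones((Y.shape[0], 1)), Y])

    for epoch in range(20000):
        # 2. Forward pass
        Y0 = add_bias(X)            # y^0 with bias entry
        Y1 = sigmoid(Y0 @ W1.T)     # hidden-layer outputs y^1
        Y1b = add_bias(Y1)
        Y2 = sigmoid(Y1b @ W2.T)    # network outputs y^L

        # 3. Backward pass (deltas), using f'(v) = f(v)(1 - f(v))
        E = T - Y2                               # errors e_m(i)
        D2 = E * Y2 * (1 - Y2)                   # output-layer deltas
        D1 = (D2 @ W2[:, 1:]) * Y1 * (1 - Y1)    # hidden-layer deltas

        # 4. Batch gradient-descent update, summed over all training inputs
        #    (the minus sign of the gradient cancels with the sign of E)
        W2 += mu * D2.T @ Y1b
        W1 += mu * D1.T @ Y0

    print(np.round(Y2.ravel(), 2))   # typically approaches [0, 1, 1, 0]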


Batch vs Online Learning

  As described, on each iteration backprop updates the weights based upon all of the training data. This is called batch learning.

  An alternative is to update the weights after each training input has been processed by the network, based only upon the error for that input. This is called online learning.

Batch: w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \sum_{i=1}^{N} \frac{\partial \varepsilon(i)}{\partial w_j^r}, where \frac{\partial \varepsilon(i)}{\partial w_j^r} = \delta_j^r(i) \, y^{r-1}(i)

Online: w_j^r(\text{new}) = w_j^r(\text{old}) - \mu \frac{\partial \varepsilon(i)}{\partial w_j^r}, where \frac{\partial \varepsilon(i)}{\partial w_j^r} = \delta_j^r(i) \, y^{r-1}(i)
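A small self-contained contrast of the two update schemes, shown on a linear least-squares problem rather than the full network so the code stays short; the data, learning rate, and number of epochs are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    t = X @ w_true + 0.1 * rng.normal(size=100)

    mu = 0.005

    # Batch: one update per epoch, using the gradient summed over all N examples.
    w = np.zeros(3)
    for epoch in range(200):
        grad = X.T @ (X @ w - t)
        w -= mu * grad
    print("batch :", np.round(w, 2))

    # Online: one update per training example, using only that example's error.
    w = np.zeros(3)
    for epoch in range(200):
        for x, tn in zip(X, t):
            w -= mu * (w @ x - tn) * x
    print("online:", np.round(w, 2))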


Batch vs Online Learning

  One advantage of batch learning is that averaging over all inputs when updating the weights should lead to smoother convergence.

  On the other hand, the randomness associated with online learning might help to prevent convergence toward a local minimum.

  Changing the order of presentation of the inputs from epoch to epoch may also improve results.


Remarks

  Local Minima
    The objective function is in general non-convex, and so the solution may not be globally optimal.

  Stopping Criterion
    Typically stop when the change in the weights or the change in the error function falls below a threshold.

  Learning Rate
    The speed and reliability of convergence depend on the learning rate μ.


Nonlinear Classification and Regression: Outline

  Multi-Layer Perceptrons
    The Back-Propagation Learning Algorithm
  Generalized Linear Models
    Radial Basis Function Networks
    Sparse Kernel Machines
      Nonlinear SVMs and the Kernel Trick
      Relevance Vector Machines


Generalizing Linear Classifiers

  One way of tackling problems that are not linearly separable is to transform the input in a nonlinear fashion prior to applying a linear classifier.

  The result is that decision boundaries that are linear in the resulting feature space may be highly nonlinear in the original input space.

[Figure: Input Space (x1, x2) and Feature Space (\phi_1, \phi_2); a linear boundary in the feature space corresponds to a nonlinear boundary in the input space.]


Nonlinear Basis Function Models

  Generally,

y(x) = \sum_{j} w_j \phi_j(x)

  where \phi_j(x) are known as basis functions.

  Typically, \phi_0(x) = 1, so that w_0 acts as a bias.
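A small sketch of a model built this way: polynomial basis functions of a scalar input followed by least-squares fitting of the weights. The choice of polynomial basis, the toy data, and the use of least squares are illustrative assumptions.

    import numpy as np

    def poly_basis(x, M):
        # phi_j(x) = x**j for j = 0..M; phi_0 = 1 supplies the bias w_0
        return np.vstack([x**j for j in range(M + 1)]).T

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 50)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

    Phi = poly_basis(x, M=5)                       # N x (M+1) design matrix
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # linear in w, nonlinear in x
    print(np.round(w, 2))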


Nonlinear basis functions for classification

  In the context of classification, the discriminant function in the feature space becomes:

g\left( y(x) \right) = w_0 + \sum_{i=1}^{M} w_i y_i(x) = w_0 + \sum_{i=1}^{M} w_i \phi_i(x)

  This formulation can be thought of as an input space approximation of the true separating discriminant function g(x) using a set of interpolation functions \phi_i(x).


Dimensionality

  The dimensionality M of the feature space may be less than, equal to, or greater than the dimensionality D of the original input space.

    M < D: This may result in a factoring out of irrelevant dimensions, reduction in the number of model parameters, and resulting improvement in generalization (reduced overlearning).

    M > D: Problems that are not linearly separable in the input space may become separable in the feature space, and the probability of linear separability generally increases with the dimensionality of the feature space. Thus choosing M >> D helps to make the problem linearly separable.


Cover’s Theorem

  “A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.” — Cover, T.M. , Geometrical and Statistical properties of systems of linear inequalities with applications in pattern recognition., 1965

Example


Nonlinear Classification and Regression: Outline

  Multi-Layer Perceptrons
    The Back-Propagation Learning Algorithm
  Generalized Linear Models
    Radial Basis Function Networks
    Sparse Kernel Machines
      Nonlinear SVMs and the Kernel Trick
      Relevance Vector Machines


Radial Basis Functions

  Consider interpolation functions (kernels) of the form

\phi_i\left( \left\| x - \mu_i \right\| \right)

  In other words, the feature value depends only upon the Euclidean distance to a 'centre point' in the input space.

  A commonly used RBF is the isotropic Gaussian:

\phi_i(x) = \exp\left( -\frac{1}{2\sigma_i^2} \left\| x - \mu_i \right\|^2 \right)
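A minimal sketch of this feature map; the example centres and scale are assumptions chosen for illustration.

    import numpy as np

    def gaussian_rbf(x, mu, sigma):
        # phi_i(x) = exp(-||x - mu_i||^2 / (2 sigma_i^2))
        return np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma ** 2))

    centres = np.array([[0.0, 0.0], [1.0, 1.0]])   # mu_1, mu_2 (illustrative)
    sigma = 0.5
    x = np.array([0.2, 0.1])
    phi = np.array([gaussian_rbf(x, mu, sigma) for mu in centres])
    print(np.round(phi, 3))   # feature vector [phi_1(x), phi_2(x)]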


Relation to KDE

  We can use Gaussian RBFs to approximate the discriminant function g(x):

g\left( y(x) \right) = w_0 + \sum_{i=1}^{M} w_i y_i(x) = w_0 + \sum_{i=1}^{M} w_i \phi_i(x)

  where

\phi_i(x) = \exp\left( -\frac{1}{2\sigma_i^2} \left\| x - \mu_i \right\|^2 \right)

  This is reminiscent of kernel density estimation, where we approximated probability densities as a normalized sum of Gaussian kernels.


Relation to KDE

  For KDE we planted a kernel at each data point. Thus there were N kernels.

  For RBF networks, we generally use far fewer kernels than the number of data points: M << N.

  This leads to greater efficiency and generalization.


RBF Networks

  The linear classifier with nonlinear radial basis functions can be considered an artificial neural network where:
    The hidden nodes are nonlinear (e.g., Gaussian).
    The output node is linear.

RBF Network for 2 Classes


RBF Networks vs Perceptrons

  Recall that for a perceptron, the output of a hidden unit is invariant on a hyperplane.

  For an RBF, the output of a hidden unit is invariant on a circle centred on μi.

  Thus hidden units are global in a perceptron, but local in an RBF network.

RBF Network for 2 Classes


RBF Networks vs Perceptrons

  This difference has consequences:
    Multilayer perceptrons tend to learn slower than RBFs.
    However, multilayer perceptrons tend to have better generalization properties, especially in regions of the input space where training data are sparse.
    Typically, more neurons are needed for an RBF than for a multilayer perceptron to solve a given problem.


Parameters

  There are two options for choosing the parameters (centres and scales) of the RBFs:

  1.  Fixed.
      For example, randomly select a subset of M of the input vectors and use these as centres. Use a common scale based upon your judgement.

  2.  Learned.
      Note that when the RBF parameters are fixed, the weights could be learned using linear classifier techniques (e.g., least squares).
      Thus the RBF parameters could be learned in an outer loop, by gradient descent.
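A short sketch of option 1: fix the centres to a random subset of the inputs, use a common scale, and fit the output weights by least squares. The toy data set, the number of centres M, and the scale are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 2D data set (illustrative): class 1 inside a disc, class 0 outside.
    X = rng.uniform(-1, 1, size=(200, 2))
    t = (np.sum(X**2, axis=1) < 0.5).astype(float)

    M, sigma = 20, 0.4                               # number of centres, common scale
    centres = X[rng.choice(len(X), M, replace=False)]

    def design_matrix(X, centres, sigma):
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        Phi = np.exp(-d2 / (2 * sigma**2))           # N x M Gaussian RBF features
        return np.hstack([np.ones((len(X), 1)), Phi])  # prepend phi_0 = 1 (bias)

    Phi = design_matrix(X, centres, sigma)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # least-squares output weights
    pred = (Phi @ w > 0.5).astype(float)
    print("training accuracy:", (pred == t).mean())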


Nonlinear Classification and Regression: Outline

  Multi-Layer Perceptrons
    The Back-Propagation Learning Algorithm
  Generalized Linear Models
    Radial Basis Function Networks
    Sparse Kernel Machines
      Nonlinear SVMs and the Kernel Trick
      Relevance Vector Machines


The Kernel Function

  Recall that an SVM is the solution to the problem

Maximize \tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k\left( x_n, x_m \right)

subject to 0 \le a_n \le C and \sum_{n=1}^{N} a_n t_n = 0

  A new input x is classified by computing

y(x) = \sum_{n \in S} a_n t_n k\left( x, x_n \right) + b

  where S is the set of support vectors.

  Here we introduced the kernel function k(x, x'), defined as

k\left( x, x' \right) = \phi(x)^t \phi(x')

  This is more than a notational convenience!!


The Kernel Trick

  Note that the basis functions and individual training vectors are no longer part of the objective function.

  Instead all we need is the kernel value (like a distance measure) for all pairs of training vectors.

Maximize \tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k\left( x_n, x_m \right)

subject to 0 \le a_n \le C and \sum_{n=1}^{N} a_n t_n = 0

where k\left( x, x' \right) = \phi(x)^t \phi(x')


The Kernel Function

The kernel function k(x, x') measures the 'similarity' of input vectors x and x' as an inner product in a feature space defined by the feature space mapping \phi(x):

k(x, x') = \phi(x)^t \phi(x')

If k(x, x') = k(x - x') we say that the kernel is stationary.

If k(x, x') = k\left( \left\| x - x' \right\| \right) we call it a radial basis function.


Constructing Kernels

  We can construct a kernel by selecting a feature space mapping ϕ(x) and then defining

k(x, x') = \phi(x)^t \phi(x')

[Figure: Gaussian basis functions \phi(x') and the resulting kernel k(x, x') as a function of x.]


Constructing Kernels

  Alternatively, we can construct the kernel function directly, ensuring that it corresponds to an inner product in some (possibly infinite-dimensional) feature space.


Constructing Kernels

k(x, x') = \phi(x)^t \phi(x')

Example 1: k(x, z) = x^t z

Example 2: k(x, z) = x^t z + c, \ c > 0

Example 3: k(x, z) = \left( x^t z \right)^2
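As a quick check of Example 3, for 2D inputs the kernel (x^t z)^2 equals the inner product of the explicit feature map \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2); the sketch below verifies this numerically (the test vectors are arbitrary).

    import numpy as np

    def k(x, z):
        return float(x @ z) ** 2                # Example 3: (x^t z)^2

    def phi(x):
        # explicit feature map for the 2D case
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(k(x, z), phi(x) @ phi(z))             # both equal (x^t z)^2 = 1.0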


Kernel Properties

  Kernels obey certain properties that make it easy to construct complex kernels from simpler ones.


Kernel Properties: Combining Kernels

Given valid kernels k_1(x, x') and k_2(x, x'), the following kernels will also be valid:

k(x, x') = c\,k_1(x, x')                          (6.13)
k(x, x') = f(x)\,k_1(x, x')\,f(x')                (6.14)
k(x, x') = q\left( k_1(x, x') \right)             (6.15)
k(x, x') = \exp\left( k_1(x, x') \right)          (6.16)
k(x, x') = k_1(x, x') + k_2(x, x')                (6.17)
k(x, x') = k_1(x, x')\,k_2(x, x')                 (6.18)
k(x, x') = k_3\left( \phi(x), \phi(x') \right)    (6.19)
k(x, x') = x^T A x'                               (6.20)
k(x, x') = k_a(x_a, x'_a) + k_b(x_b, x'_b)        (6.21)
k(x, x') = k_a(x_a, x'_a)\,k_b(x_b, x'_b)         (6.22)

where c > 0, f(\cdot) is any function, q(\cdot) is a polynomial with nonnegative coefficients, \phi(x) is a mapping from x to \mathbb{R}^M, k_3 is a valid kernel on \mathbb{R}^M, A is a symmetric positive semidefinite matrix, x_a and x_b are variables such that x = (x_a, x_b), and k_a, k_b are valid kernels over their respective spaces.

(Slide credit: Vasil Khalidov and Alex Klaser, Bishop Chapter 6: Kernel Methods)


Constructing Kernels

  Examples:

k(x, x') = \left( x^t x' + c \right)^M, \ c > 0   (Use 6.18)

k(x, x') = \exp\left( - \left\| x - x' \right\|^2 / 2\sigma^2 \right)   (Use 6.14 and 6.16.)
Corresponds to an infinite-dimensional feature vector.


Nonlinear SVM Example (Gaussian Kernel)

[Figure: nonlinear SVM decision boundary in the input space (x1, x2).]


SVMs for Regression

In standard linear regression, we minimize

\frac{1}{2} \sum_{n=1}^{N} \left( y_n - t_n \right)^2 + \frac{\lambda}{2} \left\| w \right\|^2

This penalizes all deviations from the model.

To obtain sparse solutions, we replace the quadratic error function by an \epsilon-insensitive error function, e.g.,

E_\epsilon\left( y(x) - t \right) = \begin{cases} 0, & \text{if } \left| y(x) - t \right| < \epsilon \\ \left| y(x) - t \right| - \epsilon, & \text{otherwise} \end{cases}

See text for details of solution.

[Figure: the \epsilon-insensitive tube y(x) \pm \epsilon around the regression curve, with slack \xi > 0 and \hat{\xi} > 0 for points lying outside the tube.]


Example

[Figure: SVM regression example, t plotted against x.]


Nonlinear Classification and Regression: Outline

  Multi-Layer Perceptrons
    The Back-Propagation Learning Algorithm
  Generalized Linear Models
    Radial Basis Function Networks
    Sparse Kernel Machines
      Nonlinear SVMs and the Kernel Trick
      Relevance Vector Machines


Relevance Vector Machines

  Some drawbacks of SVMs:
    Do not provide posterior probabilities.
    Not easily generalized to K > 2 classes.
    Parameters (C, ε) must be learned by cross-validation.

  The Relevance Vector Machine is a sparse Bayesian kernel technique that avoids these drawbacks.

  RVMs also typically lead to sparser models.


RVMs for Regression

p\left( t \mid x, w, \beta \right) = \mathcal{N}\left( t \mid y(x), \beta^{-1} \right), where y(x) = w^t \phi(x)

In an RVM, the basis functions \phi(x) are kernels k\left( x, x_n \right):

y(x) = \sum_{n=1}^{N} w_n k\left( x, x_n \right) + b

However, unlike in SVMs, the kernels need not be positive definite, and the x_n need not be the training data points.


RVMs for Regression

  Note that each weight parameter has its own precision hyperparameter.

Likelihood: p\left( t \mid X, w, \beta \right) = \prod_{n=1}^{N} p\left( t_n \mid x_n, w, \beta \right), where the nth row of X is x_n^t.

Prior: p(w \mid \alpha) = \prod_{i=1}^{M} \mathcal{N}\left( w_i \mid 0, \alpha_i^{-1} \right)


RVMs for Regression

  The conjugate prior for the precision of a Gaussian is a gamma distribution.

  Integrating out the precision parameter leads to a Student's t distribution over w_i.

  Thus the distribution over w is a product of Student's t distributions.

  As a result, maximizing the evidence will yield a sparse w.

  Note that to achieve sparsity it is critical that each parameter w_i has a separate precision \alpha_i.

p\left( w_i \mid \alpha_i \right) = \mathcal{N}\left( w_i \mid 0, \alpha_i^{-1} \right)
p\left( \alpha_i \right) = \text{Gam}\left( \alpha_i \mid a, b \right)
p\left( w_i \right) = \text{St}\left( w_i \mid 2a \right)

[Figure: contour plots of Gaussian and Student's t prior distributions over two parameters (w_1, w_2); the prior with independent hyperparameters has a sharp peak at zero and places most of its mass along the axes, which favours sparsity. From Tipping, Bayesian Inference: Principles and Practice in Machine Learning.]


RVMs for Regression

p\left( w_i \mid \alpha_i \right) = \mathcal{N}\left( w_i \mid 0, \alpha_i^{-1} \right)
p\left( \alpha_i \right) = \text{Gam}\left( \alpha_i \mid a, b \right)
p\left( w_i \right) = \text{St}\left( w_i \mid 2a \right)

Gamma Distribution: p(x \mid a, b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}

Student's t Distribution: p(x \mid \nu) = \frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\sqrt{\nu \pi} \, \Gamma\left( \frac{\nu}{2} \right)} \left( 1 + \frac{x^2}{\nu} \right)^{-\frac{\nu + 1}{2}}

Also recall the rule for transforming densities: if y is a monotonic function of x, then p_Y(y) = p_X(x) \left| \frac{dx}{dy} \right|

Thus if we let a \to 0, b \to 0, then p\left( \log \alpha_i \right) \to uniform and p\left( w_i \right) \propto \left| w_i \right|^{-1}. Very sparse!!

[Figure: Gaussian vs. Student's t priors over (w_1, w_2), as on the previous slide (Tipping, Bayesian Inference: Principles and Practice in Machine Learning).]


RVMs for Regression

  In practice, it is difficult to integrate \alpha out exactly.

  Instead, we use an approximate maximum likelihood method, finding ML values for each \alpha_i.

  When we maximize the evidence with respect to these hyperparameters, many will \to \infty.

  As a result, the corresponding weights will \to 0, yielding a sparse solution.

Likelihood: p\left( t \mid X, w, \beta \right) = \prod_{n=1}^{N} p\left( t_n \mid x_n, w, \beta \right), where the nth row of X is x_n^t.

Prior: p(w \mid \alpha) = \prod_{i=1}^{M} \mathcal{N}\left( w_i \mid 0, \alpha_i^{-1} \right)


RVMs for Regression

  Since both the likelihood and prior are normal, the posterior over w will also be normal:

Posterior: p\left( w \mid t, X, \alpha, \beta \right) = \mathcal{N}\left( w \mid m, \Sigma \right)

where m = \beta \Sigma \Phi^t t, \quad \Sigma = \left( A + \beta \Phi^t \Phi \right)^{-1}, \quad \Phi_{ni} = \phi_i\left( x_n \right), \quad A = \text{diag}\left( \alpha_i \right)

Note that when \alpha_i \to \infty, the ith row and column of \Sigma \to 0, and p\left( w_i \mid t, X, \alpha, \beta \right) = \mathcal{N}\left( w_i \mid 0, 0 \right)
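A small sketch of this posterior computation for fixed hyperparameters; the design matrix below uses Gaussian kernel basis functions on toy data, and the particular values of alpha and beta are illustrative assumptions (in a full RVM they would be re-estimated by maximizing the evidence).

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

    # Design matrix Phi_ni = phi_i(x_n): Gaussian kernels centred on the data points.
    s = 0.1
    Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * s**2))

    alpha = np.ones(Phi.shape[1])     # one precision per weight (held fixed here)
    beta = 100.0                      # noise precision (held fixed here)

    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # posterior covariance
    m = beta * Sigma @ Phi.T @ t                    # posterior mean

    print(np.round(m[:5], 2))         # first few posterior mean weights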


RVMs for Regression

  The values for \alpha and \beta are determined using the evidence approximation, where we maximize

p\left( t \mid X, \alpha, \beta \right) = \int p\left( t \mid X, w, \beta \right) p\left( w \mid \alpha \right) dw

  In general, this results in many of the precision parameters \alpha_i \to \infty, so that w_i \to 0.

  Unfortunately, this is a non-convex problem.


Example

[Figure: RVM regression example, t plotted against x.]