2806 Neural Computation Single Layer Perceptron Lecture 3 2005 Ari Visa


Page 1

2806 Neural Computation
Single Layer Perceptron

Lecture 3

2005 Ari Visa

Page 2

Agenda

Some historical notes
Linear adaptive filters
Rosenblatt’s perceptron
Conclusions

Page 3

Some historical notes

Rosenblatt’s perceptron 1958: a single neuron with adjustable synaptic weights and bias

Perceptron convergence theorem 1962 (Rosenblatt): If the patterns used to train the perceptron are drawn from two linearly separable classes, then the algorithm converges and positions the decision surface in the form of a hyperplane between the two classes.

Page 4

Some historical notes

A single neuron -> an adaptive filter

Widrow & Hoff (1960): least-mean square algorithm (LMS) = delta rule

The problem of designing an optimum linear filter: Kolmogorov 1942, Wiener 1949, Zadeh 1953, Gabor 1954

Page 5

Linear Adaptive Filters

Consider a dynamic system with an unknown mathematical characterization

A set of labeled input-output data generated by the system at discrete instants of time at some uniform rate is available

Page 6

Linear Adaptive Filters

The problem is also known as system identification (in control theory).

An adaptive filter consists of two processes:
1) Filtering process: involves the computation of two signals, an output y(i) and an error signal e(i) = d(i) - y(i).
2) Adaptive process: the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i).

Page 7

Linear Adaptive Filters

Note that y(i) = v(i) = Σk=1..m wk(i)xk(i), which is written in matrix form as y(i) = xT(i)w(i).

The manner in which the error signal e(i) = d(i) - y(i) is used to control the adjustments to the synaptic weights is determined by the cost function used to derive the adaptive filtering algorithm.

Page 8

Linear Adaptive Filters

Unconstrained optimization problem: minimize the cost function E(w) with respect to the weight vector w.

The necessary condition for optimality is ∇E(w*) = 0.

The strategy is iterative descent: starting with an initial guess w(0), generate a sequence of weight vectors w(1), w(2), ..., such that the cost function E(w) is reduced at each iteration of the algorithm, i.e. E(w(n+1)) < E(w(n)).

Page 9

Linear Adaptive Filters

1) Method of Steepest Descent

The direction of steepest descent is the direction opposite to the gradient vector g = ∇E(w):

w(n+1) = w(n) - ηg(n), where η is the step size or learning-rate parameter.
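As a rough illustration of the steepest-descent rule above (not part of the original slides), here is a minimal Python sketch on an assumed quadratic cost E(w) = ½wTAw - bTw; the matrix A, vector b and learning rate are illustrative choices.

```python
import numpy as np

# Steepest-descent sketch on an assumed quadratic cost E(w) = 1/2 w^T A w - b^T w,
# whose gradient is g = A w - b and whose minimum is w* = A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # illustrative positive-definite matrix
b = np.array([1.0, 1.0])
eta = 0.1                           # learning-rate parameter (eta)

w = np.zeros(2)                     # initial guess w(0)
for n in range(200):
    g = A @ w - b                   # gradient vector g(n)
    w = w - eta * g                 # w(n+1) = w(n) - eta * g(n)

print("steepest descent:", w)
print("exact optimum w*:", np.linalg.solve(A, b))
```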

Page 10

Linear Adaptive Filters

The method of steepest descent converges to the optimal solution w* slowly.

The learning-rate parameter η has an influence on the convergence behaviour:

- When η is small, the transient response of the algorithm is overdamped (3.2a)

- When η is large, the transient response of the algorithm is underdamped (3.2b)

- When η exceeds a certain critical value, the algorithm becomes unstable.

Page 11

Linear Adaptive Filters

2) Newton’s method: The idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n).

Using a second-order Taylor series expansion of the cost function around the point w(n):
ΔE(w(n)) ≈ gT(n)Δw(n) + ½ ΔwT(n)H(n)Δw(n)

g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n). The matrix H(n) is the m-by-m Hessian matrix of E(w) (the second derivatives), H = ∇²E(w).

Page 12

Linear Adaptive Filters

H = ∇²E(w) requires the cost function E(w) to be twice continuously differentiable with respect to the elements of w.

Differentiating ΔE(w(n)) ≈ gT(n)Δw(n) + ½ ΔwT(n)H(n)Δw(n) with respect to Δw, the change ΔE(n) is minimized when

g(n) + H(n)Δw(n) = 0 → Δw(n) = -H-1(n)g(n)
w(n+1) = w(n) + Δw(n) = w(n) - H-1(n)g(n)

where H-1(n) is the inverse of the Hessian of E(w).
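For comparison, a minimal sketch of the Newton update above on the same assumed quadratic cost, for which the Hessian is constant and a single Newton step reaches the minimum; A and b are again illustrative.

```python
import numpy as np

# Newton's method on an assumed quadratic cost E(w) = 1/2 w^T A w - b^T w:
# gradient g = A w - b, Hessian H = A (constant), so one step lands on the minimum.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

w = np.zeros(2)                     # initial guess w(0)
g = A @ w - b                       # gradient g(n)
H = A                               # Hessian H(n), positive definite here
w = w + np.linalg.solve(H, -g)      # w(n+1) = w(n) - H^{-1}(n) g(n)
print("Newton step:", w, " optimum:", np.linalg.solve(A, b))
```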

Page 13

Linear Adaptive Filters

Newton’s method converges quickly asymptotically and does not exhibit the zigzagging behavior of steepest descent.

Newton’s method requires the Hessian H(n) to be a positive definite matrix for all n!

Page 14

Linear Adaptive Filters

3) Gauss-Newton method: It is applicable to a cost function that is expressed as the sum of error squares,
E(w) = ½ Σi=1..n e²(i)
Note that all the error terms are calculated on the basis of a weight vector w that is fixed over the entire observation interval 1 ≤ i ≤ n.

The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by writing

e’(i,w) = e(i) + [∂e(i)/∂w]T|w=w(n) (w - w(n)), i = 1,2,...,n

Page 15

Linear Adaptive Filters

In matrix notation, e’(n,w) = e(n) + J(n)(w - w(n)), where e(n) is the error vector e(n) = [e(1), e(2), ..., e(n)]T and J(n) is the n-by-m Jacobian matrix of e(n). (The Jacobian J(n) is the transpose of the m-by-n gradient matrix ∇e(n), where ∇e(n) = [∇e(1), ∇e(2), ..., ∇e(n)].)

w(n+1) = arg min_w {½ ||e’(n,w)||²}, where
½ ||e’(n,w)||² = ½ ||e(n)||² + eT(n)J(n)(w - w(n)) + ½ (w - w(n))T JT(n)J(n)(w - w(n))

Differentiating this expression with respect to w and setting the result equal to zero:

Page 16

Linear Adaptive Filters

JT(n)e(n) + JT(n)J(n)(w - w(n)) = 0
w(n+1) = w(n) - (JT(n)J(n))-1 JT(n)e(n)

The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n).

For the Gauss-Newton iteration to be computable, the matrix product JT(n)J(n) must be nonsingular. JT(n)J(n) is always nonnegative definite, but to ensure that it is nonsingular the Jacobian J(n) must have full column rank m. → Add the diagonal matrix δI to the matrix JT(n)J(n); the parameter δ is a small positive constant.

Page 17

Linear Adaptive Filters

JT(n)J(n) + δI is positive definite for all n. -> The Gauss-Newton method is implemented in the following form:
w(n+1) = w(n) - (JT(n)J(n) + δI)-1 JT(n)e(n)

This is the solution to the modified cost function:
E(w) = ½{δ||w - w(0)||² + Σi=1..n e²(i)}
where w(0) is the initial value of w.
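A minimal sketch of the modified Gauss-Newton iteration above, applied to an assumed nonlinear curve-fitting problem (a sinusoid with unknown amplitude and phase); the synthetic data, initial guess, δ and iteration count are illustrative, not from the lecture.

```python
import numpy as np

# Gauss-Newton sketch with the delta*I modification: model y(t) = w0 * sin(t + w1),
# error e(i) = d(i) - y(t(i)). Data and constants are purely illustrative.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 6.0, 100)
d = 2.0 * np.sin(t + 0.5) + 0.05 * rng.normal(size=t.size)   # desired response d(i)

w = np.array([1.0, 0.0])       # initial guess w(0) = [amplitude, phase]
delta = 1e-3                   # small positive constant in delta * I
for n in range(20):
    e = d - w[0] * np.sin(t + w[1])                     # error vector e(n)
    J = np.column_stack([-np.sin(t + w[1]),             # de/dw0
                         -w[0] * np.cos(t + w[1])])     # de/dw1 -> n-by-m Jacobian J(n)
    A = J.T @ J + delta * np.eye(2)                     # J^T(n)J(n) + delta*I
    w = w - np.linalg.solve(A, J.T @ e)                 # update per the formula above
print("estimated (amplitude, phase):", w)               # close to (2.0, 0.5)
```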

Page 18

Linear Adaptive Filters

Linear Least-Squares Filter:
- the single neuron is linear
- the cost function E(w) used to design the filter consists of the sum of error squares

The error vector is e(n) = d(n) - X(n)w(n), where d(n) is the n-by-1 desired response vector and X(n) is the n-by-m data matrix.

Page 19

Linear Adaptive Filters

w(n+1) = w(n) + (XT(n)X(n))-1 XT(n)(d(n) - X(n)w(n))
       = (XT(n)X(n))-1 XT(n)d(n)

where (XT(n)X(n))-1 XT(n) is recognized as the pseudoinverse of the data matrix X(n).

The weight vector w(n+1) solves the linear least-squares problem defined over an observation interval of duration n.
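A minimal sketch of the linear least-squares solution above, computed with the pseudoinverse of an assumed synthetic data matrix X(n) and desired-response vector d(n).

```python
import numpy as np

# Linear least-squares filter: w = (X^T X)^{-1} X^T d, i.e. the pseudoinverse of X
# applied to d. The data matrix X and the "true" weights are synthetic assumptions.
rng = np.random.default_rng(1)
n, m = 200, 4
X = rng.normal(size=(n, m))                   # n-by-m data matrix X(n)
w_true = np.array([0.5, -1.0, 2.0, 0.3])
d = X @ w_true + 0.01 * rng.normal(size=n)    # n-by-1 desired response d(n)

w = np.linalg.pinv(X) @ d                     # pseudoinverse solution
# equivalently: w = np.linalg.solve(X.T @ X, X.T @ d)
print("least-squares weights:", w)
```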

Page 20

Linear Adaptive Filters

Wiener Filter: in an ergodic environment that is also stationary, we may substitute long-term sample averages (time averages) for expectations (ensemble averages).

Correlation matrix of the input vector x(i): Rx

Cross-correlation vector between the input vector x(i) and desired response d(i): rxd

Page 21

Linear Adaptive Filters

Rx = E[x(i)xT(i)] = lim n→∞ (1/n) XT(n)X(n)

rxd = E[x(i)d(i)] = lim n→∞ (1/n) XT(n)d(n)

wo = lim n→∞ w(n+1) = Rx-1 rxd

where Rx-1 is the inverse of the correlation matrix Rx. The weight vector wo is called the Wiener solution to the linear optimum filtering problem.
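A minimal sketch of the Wiener solution above, with Rx and rxd estimated by time averages over assumed synthetic stationary data.

```python
import numpy as np

# Wiener solution w_o = Rx^{-1} r_xd, with Rx and r_xd estimated by time averages
# (1/n) X^T X and (1/n) X^T d over synthetic stationary data (illustrative).
rng = np.random.default_rng(2)
n, m = 5000, 3
X = rng.normal(size=(n, m))                   # input vectors x(i) as rows
w_true = np.array([1.0, -0.5, 0.25])
d = X @ w_true + 0.1 * rng.normal(size=n)     # desired response d(i)

Rx = X.T @ X / n                              # sample correlation matrix of x(i)
rxd = X.T @ d / n                             # sample cross-correlation with d(i)
w_o = np.linalg.solve(Rx, rxd)                # Wiener solution
print("Wiener solution:", w_o)
```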

Page 22

Linear Adaptive Filters

For an ergodic process, the linear least-squares filter asymptotically approaches the Wiener filter as the number of observations approaches infinity.

The least-mean square (LMS) algorithm is based on the use of instantaneous values for the cost function.

E(w) = ½ e²(n), where e(n) is the error signal measured at time n.

Page 23

Linear Adaptive Filters

e(n) = d(n) - xT(n)w^(n) -> an estimate of the gradient vector: g^(n) = -x(n)e(n)

-> w^(n+1) = w^(n) + ηx(n)e(n)

where η is the learning-rate parameter.
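A minimal sketch of the LMS (delta-rule) update above, run sample by sample on assumed synthetic data; the learning rate η and the data dimensions are illustrative.

```python
import numpy as np

# LMS (delta rule): for each sample, e(n) = d(n) - x^T(n) w_hat(n) and
# w_hat(n+1) = w_hat(n) + eta * x(n) * e(n). Data and eta are illustrative.
rng = np.random.default_rng(3)
n, m = 2000, 3
X = rng.normal(size=(n, m))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.05 * rng.normal(size=n)

eta = 0.01                                  # learning-rate parameter
w_hat = np.zeros(m)                         # initial estimate w_hat(0)
for x_n, d_n in zip(X, d):
    e_n = d_n - x_n @ w_hat                 # error signal e(n)
    w_hat = w_hat + eta * x_n * e_n         # LMS weight update
print("LMS estimate:", w_hat)
```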

Page 24

Linear Adaptive Filters

The feedback loop around the weight vector w^(n) in the LMS algorithm behaves like a low-pass filter.

The inverse of the learning-rate parameter is a measure of the memory of the LMS algorithm.

Page 25

Linear Adaptive Filters

w^(n+1) = w^(n) + ηx(n)[d(n) - xT(n)w^(n)]

w^(n) = z-1[w^(n+1)], where z-1 is the unit-time delay operator

A stochastic feedback system.

The learning-rate parameter η and the input vector x(n) determine the transmittance of the feedback loop.

Page 26

Linear Adaptive Filters

The convergence issue that matters is convergence in the mean square:
E[e²(n)] → constant as n → ∞

Assumptions:
1) The successive input vectors are statistically independent of each other.
2) At time step n, the input vector is statistically independent of all previous samples of the desired response.

Page 27

Linear Adaptive Filters

3) At time step n, the desired response is dependent on the corresponding input vector, but statistically independent of all previous values of the desired response.
4) The input vector and the desired response are drawn from Gaussian-distributed populations.

Convergence in the mean square is guaranteed if 0 < η < 2/(sum of mean-square values of the sensor inputs).

The LMS algorithm is model independent and therefore robust, which means that small model uncertainty and small disturbances can only result in small estimation errors.
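As a rough numerical check of the stability bound above, assuming synthetic zero-mean inputs:

```python
import numpy as np

# Rough check of the LMS stability bound 0 < eta < 2 / (sum of mean-square values
# of the sensor inputs); the inputs here are synthetic and purely illustrative.
rng = np.random.default_rng(4)
X = rng.normal(size=(10_000, 3))                 # rows are input vectors x(n)
mean_square_sum = np.mean(X**2, axis=0).sum()    # equals the trace of the correlation matrix Rx
eta_max = 2.0 / mean_square_sum
print("upper bound on eta:", eta_max)
```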

Page 28

Linear Adaptive Filters

The LMS algorithm is slow to converge and sensitive to variations in the eigenstructure of the input.

The LMS algorithm typically requires a number of iterations equal to about ten times the dimensionality of the input space to reach a steady-state condition.

The LMS algorithm is particularly sensitive to variations in the condition number (eigenvalue spread) of the correlation matrix Rx of the input vector x:
χ(Rx) = λmax/λmin

Page 29

Linear Adaptive Filters

The learning curve is a plot of the mean-square value of the estimation error Eav(n) versus the number of iterations n.

Rate of convergence: the number of iterations n required for Eav(n) to decrease to some arbitrarily chosen value.

Misadjustment: M = Eav(∞)/Emin - 1

Page 30

Linear Adaptive Filters

Difficulties encountered with the LMS algorithm may be attributed to the fact that the learning-rate parameter η is maintained constant. A time-varying learning rate is a remedy:

η(n) = c/n, where c is a constant.

Search-then-converge schedule:

η(n) = η0/(1 + (n/τ)), where τ is a search time constant.
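A small sketch of the two annealing schedules above; the constants c, η0 and τ are illustrative.

```python
# Time-varying learning rates for the LMS algorithm:
#   1/n schedule:          eta(n) = c / n
#   search-then-converge:  eta(n) = eta0 / (1 + n / tau)
c, eta0, tau = 0.1, 0.1, 100.0        # illustrative constants

def eta_inverse_time(n: int) -> float:
    return c / n

def eta_search_then_converge(n: int) -> float:
    return eta0 / (1.0 + n / tau)

for n in (1, 10, 100, 1000):
    print(n, eta_inverse_time(n), eta_search_then_converge(n))
```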

Page 31

Perceptron

The perceptron is based on a nonlinear neuron.

v = Σi=1..m wixi + b

In the simplest form of the perceptron there are two decision regions separated by a hyperplane.

Page 32

Perceptron

x(n) = [+1, x1(n), x2(n), ..., xm(n)]T

w(n) = [b(n), w1(n), w2(n), ..., wm(n)]T

v(n) = wT(n)x(n); the equation wT(n)x(n) = 0 defines the separating hyperplane

Page 33

Perceptron

1. If the nth member of the training set, x(n), is correctly classified by the weight vector w(n) computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron in accordance with the rule:

w(n+1) = w(n) if wT(n)x(n) > 0 and x(n) belongs to class L1

w(n+1) = w(n) if wT(n)x(n) ≤ 0 and x(n) belongs to class L2

Page 34

Perceptron

2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule

w(n+1) = w(n) - η(n)x(n) if wT(n)x(n) > 0 and x(n) belongs to class L2

w(n+1) = w(n) + η(n)x(n) if wT(n)x(n) ≤ 0 and x(n) belongs to class L1

where the learning-rate parameter η(n) controls the adjustment applied to the weight vector at iteration n.

Page 35

Perceptron

Fixed-increment convergence theorem: Let the subsets of training vectors H1 and H2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that w(n0) = w(n0+1) = w(n0+2) = ... is a solution vector for n0 ≤ nmax.

Page 36

Perceptron

The Perceptron Convergence Algorithm
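The algorithm-summary table from this slide is not reproduced here; the following is a minimal sketch of the error-correction rules from the previous slides, run on an assumed linearly separable two-class toy set with targets +1 for class L1 and -1 for class L2 (data, learning rate and epoch limit are illustrative).

```python
import numpy as np

# Perceptron convergence algorithm: fixed-increment, error-correction form of the
# rules on pages 33-34. The two Gaussian clouds used as classes L1 and L2 are
# synthetic and chosen to be linearly separable.
rng = np.random.default_rng(5)
X1 = rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(50, 2))   # class L1
X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))   # class L2
X = np.vstack([X1, X2])
X = np.hstack([np.ones((X.shape[0], 1)), X])                 # prepend +1 for the bias b
d = np.hstack([np.ones(50), -np.ones(50)])                   # desired responses

eta = 1.0                      # learning-rate parameter eta(n), kept constant here
w = np.zeros(3)                # w(0) = [b, w1, w2]
for epoch in range(100):
    errors = 0
    for x_n, d_n in zip(X, d):
        v = w @ x_n                           # induced local field v(n) = w^T(n) x(n)
        y = 1.0 if v > 0 else -1.0            # hard-limiter output
        if y != d_n:                          # misclassified -> correct the weights
            w = w + eta * d_n * x_n           # +eta*x for class L1, -eta*x for class L2
            errors += 1
    if errors == 0:                           # converged: all patterns classified correctly
        break
print("weights:", w, "epochs:", epoch + 1)
```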

Page 37

Summary

The LMS algorithm uses a linear neuron.

Continuous learning takes place.

The perceptron uses the McCulloch-Pitts formal model of a neuron.

The learning process is performed for a finite number of iterations and then stops.