2806 Neural Computation Single Layer Perceptron Lecture 3 2005 Ari Visa


Page 1

2806 Neural Computation
Single Layer Perceptron

Lecture 3

2005 Ari Visa

Page 2

Agenda

Some historical notes
Linear adaptive filters
Rosenblatt’s perceptron
Conclusions

Page 3

Some historical notes

Rosenblatt’s perceptron 1958: a single neuron with adjustable synaptic weights and bias

Perceptron convergence theorem 1962 (Rosenblatt): If the patterns used to train the perceptron are drawn from two linearly separable classes, then the algorithm converges and positions the decision surface in the form of a hyperplane between the two classes.

Page 4

Some historical notes

A single neuron -> an adaptive filter

Widrow & Hoff (1960): least-mean square algorithm (LMS) = delta rule

The problem of designing an optimum linear filter: Kolmogorov 1942, Wiener 1949, Zadeh 1953, Gabor 1954

Page 5

Linear Adaptive Filters

Consider a dynamic system with an unknown mathematical characterization

A set of labeled input-output data generated by the system at discrete instants of time at some uniform rate is available

Page 6

Linear Adaptive Filters

The problem is also known as system identification (in control theory).

An adaptive filter consists of two processes:
1) Filtering process: involves the computation of two signals, an output y(i) and an error signal e(i) = d(i) - y(i).
2) Adaptive process: the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i).

Page 7

Linear Adaptive Filters

Note that y(i) = v(i) = Σk=1..m wk(i)xk(i), which is written in matrix form as y(i) = xT(i)w(i).

The manner in which the error signal e(i) = d(i) - y(i) is used to control the adjustments to the synaptic weights is determined by the cost function used to derive the adaptive filtering algorithm.

Page 8

Linear Adaptive Filters

Unconstrained optimization problem: minimize the cost function E(w) with respect to the weight vector w.

The necessary condition for optimality is ∇E(w*) = 0.

The strategy is iterative descent: starting with an initial guess w(0), generate a sequence of weight vectors w(1), w(2), ..., such that the cost function E(w) is reduced at each iteration of the algorithm, i.e. E(w(n+1)) < E(w(n)).

Page 9

Linear Adaptive Filters

1) Method of Steepest Descent

The direction of steepest descent is the direction opposite to the gradient vector g = ∇E(w):

w(n+1) = w(n) - ηg(n), where η is the step size or learning-rate parameter.
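As a rough illustration of the steepest-descent rule above (not part of the original slides), here is a minimal Python sketch on an assumed quadratic cost E(w) = ½wTAw - bTw; the matrix A, vector b and learning rate are illustrative choices.

```python
import numpy as np

# Steepest-descent sketch on an assumed quadratic cost E(w) = 1/2 w^T A w - b^T w,
# whose gradient is g = A w - b and whose minimum is w* = A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # illustrative positive-definite matrix
b = np.array([1.0, 1.0])
eta = 0.1                           # learning-rate parameter (eta)

w = np.zeros(2)                     # initial guess w(0)
for n in range(200):
    g = A @ w - b                   # gradient vector g(n)
    w = w - eta * g                 # w(n+1) = w(n) - eta * g(n)

print("steepest descent:", w)
print("exact optimum w*:", np.linalg.solve(A, b))
```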

Page 10

Linear Adaptive Filters

The method of steepest descent converges to the optimal solution w* slowly.

The learning-rate parameter η has an influence on the convergence behaviour:

- When η is small, the transient response of the algorithm is overdamped (3.2a)

- When η is large, the transient response of the algorithm is underdamped (3.2b)

- When η exceeds a certain critical value, the algorithm becomes unstable.

Page 11

Linear Adaptive Filters

2) Newton’s method: The idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n).

Using a second-order Taylor series expansion of the cost function around the point w(n):
ΔE(w(n)) ≈ gT(n)Δw(n) + ½ ΔwT(n)H(n)Δw(n)

g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n). The matrix H(n) is the m-by-m Hessian matrix of E(w) (the second derivatives), H = ∇²E(w).

Page 12

Linear Adaptive Filters

H = ∇²E(w) requires the cost function E(w) to be twice continuously differentiable with respect to the elements of w.

Differentiating ΔE(w(n)) ≈ gT(n)Δw(n) + ½ ΔwT(n)H(n)Δw(n) with respect to Δw, the change ΔE(n) is minimized when

g(n) + H(n)Δw(n) = 0 → Δw(n) = -H-1(n)g(n)
w(n+1) = w(n) + Δw(n) = w(n) - H-1(n)g(n)

where H-1(n) is the inverse of the Hessian of E(w).
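For comparison, a minimal sketch of the Newton update above on the same assumed quadratic cost, for which the Hessian is constant and a single Newton step reaches the minimum; A and b are again illustrative.

```python
import numpy as np

# Newton's method on an assumed quadratic cost E(w) = 1/2 w^T A w - b^T w:
# gradient g = A w - b, Hessian H = A (constant), so one step lands on the minimum.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

w = np.zeros(2)                     # initial guess w(0)
g = A @ w - b                       # gradient g(n)
H = A                               # Hessian H(n), positive definite here
w = w + np.linalg.solve(H, -g)      # w(n+1) = w(n) - H^{-1}(n) g(n)
print("Newton step:", w, " optimum:", np.linalg.solve(A, b))
```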

Page 13

Linear Adaptive Filters

Newton’s method converges quickly asymptotically and does not exhibit the zigzagging behavior of steepest descent.

Newton’s method requires the Hessian H(n) to be a positive definite matrix for all n!

Page 14

Linear Adaptive Filters

3) Gauss-Newton method: It is applicable to a cost function that is expressed as the sum of error squares,
E(w) = ½ Σi=1..n e²(i)
Note that all the error terms are calculated on the basis of a weight vector w that is fixed over the entire observation interval 1 ≤ i ≤ n.

The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by writing

e’(i,w) = e(i) + [∂e(i)/∂w]T|w=w(n) (w - w(n)), i = 1,2,...,n

Page 15

Linear Adaptive Filters

In matrix notation, e’(n,w) = e(n) + J(n)(w - w(n)), where e(n) is the error vector e(n) = [e(1), e(2), ..., e(n)]T and J(n) is the n-by-m Jacobian matrix of e(n). (The Jacobian J(n) is the transpose of the m-by-n gradient matrix ∇e(n), where ∇e(n) = [∇e(1), ∇e(2), ..., ∇e(n)].)

w(n+1) = arg min_w {½ ||e’(n,w)||²}, where
½ ||e’(n,w)||² = ½ ||e(n)||² + eT(n)J(n)(w - w(n)) + ½ (w - w(n))T JT(n)J(n)(w - w(n))

Differentiating this expression with respect to w and setting the result equal to zero:

Page 16

Linear Adaptive Filters

JT(n)e(n) + JT(n)J(n)(w - w(n)) = 0
w(n+1) = w(n) - (JT(n)J(n))-1 JT(n)e(n)

The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n).

For the Gauss-Newton iteration to be computable, the matrix product JT(n)J(n) must be nonsingular. JT(n)J(n) is always nonnegative definite, but to ensure that it is nonsingular the Jacobian J(n) must have full column rank m. → Add the diagonal matrix δI to the matrix JT(n)J(n); the parameter δ is a small positive constant.

Page 17

Linear Adaptive Filters

JT(n)J(n) + δI is positive definite for all n. -> The Gauss-Newton method is implemented in the following form:
w(n+1) = w(n) - (JT(n)J(n) + δI)-1 JT(n)e(n)

This is the solution to the modified cost function:
E(w) = ½{δ||w - w(0)||² + Σi=1..n e²(i)}
where w(0) is the initial value of w.
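A minimal sketch of the modified Gauss-Newton iteration above, applied to an assumed nonlinear curve-fitting problem (a sinusoid with unknown amplitude and phase); the synthetic data, initial guess, δ and iteration count are illustrative, not from the lecture.

```python
import numpy as np

# Gauss-Newton sketch with the delta*I modification: model y(t) = w0 * sin(t + w1),
# error e(i) = d(i) - y(t(i)). Data and constants are purely illustrative.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 6.0, 100)
d = 2.0 * np.sin(t + 0.5) + 0.05 * rng.normal(size=t.size)   # desired response d(i)

w = np.array([1.0, 0.0])       # initial guess w(0) = [amplitude, phase]
delta = 1e-3                   # small positive constant in delta * I
for n in range(20):
    e = d - w[0] * np.sin(t + w[1])                     # error vector e(n)
    J = np.column_stack([-np.sin(t + w[1]),             # de/dw0
                         -w[0] * np.cos(t + w[1])])     # de/dw1 -> n-by-m Jacobian J(n)
    A = J.T @ J + delta * np.eye(2)                     # J^T(n)J(n) + delta*I
    w = w - np.linalg.solve(A, J.T @ e)                 # update per the formula above
print("estimated (amplitude, phase):", w)               # close to (2.0, 0.5)
```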

Page 18

Linear Adaptive Filters

Linear Least-Squares Filter:
- the single neuron is linear
- the cost function E(w) used to design the filter consists of the sum of error squares

The error vector is e(n) = d(n) - X(n)w(n), where d(n) is the n-by-1 desired response vector and X(n) is the n-by-m data matrix.

Page 19

Linear Adaptive Filters

w(n+1) = w(n) + (XT(n)X(n))-1 XT(n)(d(n) - X(n)w(n))
       = (XT(n)X(n))-1 XT(n)d(n)

where (XT(n)X(n))-1 XT(n) is recognized as the pseudoinverse of the data matrix X(n).

The weight vector w(n+1) solves the linear least-squares problem defined over an observation interval of duration n.
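A minimal sketch of the linear least-squares solution above, computed with the pseudoinverse of an assumed synthetic data matrix X(n) and desired-response vector d(n).

```python
import numpy as np

# Linear least-squares filter: w = (X^T X)^{-1} X^T d, i.e. the pseudoinverse of X
# applied to d. The data matrix X and the "true" weights are synthetic assumptions.
rng = np.random.default_rng(1)
n, m = 200, 4
X = rng.normal(size=(n, m))                   # n-by-m data matrix X(n)
w_true = np.array([0.5, -1.0, 2.0, 0.3])
d = X @ w_true + 0.01 * rng.normal(size=n)    # n-by-1 desired response d(n)

w = np.linalg.pinv(X) @ d                     # pseudoinverse solution
# equivalently: w = np.linalg.solve(X.T @ X, X.T @ d)
print("least-squares weights:", w)
```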

Page 20

Linear Adaptive Filters

Wiener Filter: in an ergodic environment that is also stationary, we may substitute long-term sample averages (time averages) for expectations (ensemble averages).

Correlation matrix of the input vector x(i): Rx

Cross-correlation vector between the input vector x(i) and desired response d(i): rxd

Page 21

Linear Adaptive Filters

Rx = E[x(i)xT(i)] = lim n→∞ (1/n) XT(n)X(n)

rxd = E[x(i)d(i)] = lim n→∞ (1/n) XT(n)d(n)

wo = lim n→∞ w(n+1) = Rx-1 rxd

where Rx-1 is the inverse of the correlation matrix Rx. The weight vector wo is called the Wiener solution to the linear optimum filtering problem.
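A minimal sketch of the Wiener solution above, with Rx and rxd estimated by time averages over assumed synthetic stationary data.

```python
import numpy as np

# Wiener solution w_o = Rx^{-1} r_xd, with Rx and r_xd estimated by time averages
# (1/n) X^T X and (1/n) X^T d over synthetic stationary data (illustrative).
rng = np.random.default_rng(2)
n, m = 5000, 3
X = rng.normal(size=(n, m))                   # input vectors x(i) as rows
w_true = np.array([1.0, -0.5, 0.25])
d = X @ w_true + 0.1 * rng.normal(size=n)     # desired response d(i)

Rx = X.T @ X / n                              # sample correlation matrix of x(i)
rxd = X.T @ d / n                             # sample cross-correlation with d(i)
w_o = np.linalg.solve(Rx, rxd)                # Wiener solution
print("Wiener solution:", w_o)
```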

Page 22

Linear Adaptive Filters

For an ergodic process, the linear least-squares filter asymptotically approaches the Wiener filter as the number of observations approaches infinity.

The least-mean square (LMS) algorithm is based on the use of instantaneous values for the cost function.

E(w) = ½ e²(n), where e(n) is the error signal measured at time n.

Page 23

Linear Adaptive Filters

e(n) = d(n) - xT(n)w^(n) -> an estimate of the gradient vector: g^(n) = -x(n)e(n)

-> w^(n+1) = w^(n) + ηx(n)e(n)

where η is the learning-rate parameter.
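A minimal sketch of the LMS (delta-rule) update above, run sample by sample on assumed synthetic data; the learning rate η and the data dimensions are illustrative.

```python
import numpy as np

# LMS (delta rule): for each sample, e(n) = d(n) - x^T(n) w_hat(n) and
# w_hat(n+1) = w_hat(n) + eta * x(n) * e(n). Data and eta are illustrative.
rng = np.random.default_rng(3)
n, m = 2000, 3
X = rng.normal(size=(n, m))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.05 * rng.normal(size=n)

eta = 0.01                                  # learning-rate parameter
w_hat = np.zeros(m)                         # initial estimate w_hat(0)
for x_n, d_n in zip(X, d):
    e_n = d_n - x_n @ w_hat                 # error signal e(n)
    w_hat = w_hat + eta * x_n * e_n         # LMS weight update
print("LMS estimate:", w_hat)
```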

Page 24

Linear Adaptive Filters

The feedback loop around the weight vector w^(n) in the LMS algorithm behaves like a low-pass filter.

The inverse of the learning-rate parameter is a measure of the memory of the LMS algorithm.

Page 25

Linear Adaptive Filters

w^(n+1) = w^(n) + ηx(n)[d(n) - xT(n)w^(n)]

w^(n) = z-1[w^(n+1)], where z-1 is the unit-time delay operator

A stochastic feedback system.

The learning-rate parameter η and the input vector x(n) determine the transmittance of the feedback loop.

Page 26

Linear Adaptive Filters

The convergence issue that matters is convergence in the mean square:
E[e²(n)] → constant as n → ∞

Assumptions:
1) The successive input vectors are statistically independent of each other.
2) At time step n, the input vector is statistically independent of all previous samples of the desired response.

Page 27

Linear Adaptive Filters

3) At time step n, the desired response is dependent on the corresponding input vector, but statistically independent of all previous values of the desired response.
4) The input vector and the desired response are drawn from Gaussian-distributed populations.

Convergence in the mean square is guaranteed if 0 < η < 2/(sum of mean-square values of the sensor inputs).

The LMS algorithm is model independent and therefore robust, which means that small model uncertainty and small disturbances can only result in small estimation errors.
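As a rough numerical check of the stability bound above, assuming synthetic zero-mean inputs:

```python
import numpy as np

# Rough check of the LMS stability bound 0 < eta < 2 / (sum of mean-square values
# of the sensor inputs); the inputs here are synthetic and purely illustrative.
rng = np.random.default_rng(4)
X = rng.normal(size=(10_000, 3))                 # rows are input vectors x(n)
mean_square_sum = np.mean(X**2, axis=0).sum()    # equals the trace of the correlation matrix Rx
eta_max = 2.0 / mean_square_sum
print("upper bound on eta:", eta_max)
```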

Page 28

Linear Adaptive Filters

The LMS algorithm is slow to converge and sensitive to variations in the eigenstructure of the input.

The LMS algorithm typically requires a number of iterations equal to about ten times the dimensionality of the input space to reach a steady-state condition.

The LMS algorithm is particularly sensitive to variations in the condition number (eigenvalue spread) of the correlation matrix Rx of the input vector x:
χ(Rx) = λmax/λmin

Page 29

Linear Adaptive Filters

The learning curve is a plot of the mean-square value of the estimation error Eav(n) versus the number of iterations n.

Rate of convergence: the number of iterations n required for Eav(n) to decrease to some arbitrarily chosen value.

Misadjustment: M = Eav(∞)/Emin - 1

Page 30

Linear Adaptive Filters

Difficulties encountered with the LMS algorithm may be attributed to the fact that the learning-rate parameter η is maintained constant. A time-varying learning rate is a remedy:

η(n) = c/n, where c is a constant.

Search-then-converge schedule:

η(n) = η0/(1 + (n/τ)), where τ is a search time constant.
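A small sketch of the two annealing schedules above; the constants c, η0 and τ are illustrative.

```python
# Time-varying learning rates for the LMS algorithm:
#   1/n schedule:          eta(n) = c / n
#   search-then-converge:  eta(n) = eta0 / (1 + n / tau)
c, eta0, tau = 0.1, 0.1, 100.0        # illustrative constants

def eta_inverse_time(n: int) -> float:
    return c / n

def eta_search_then_converge(n: int) -> float:
    return eta0 / (1.0 + n / tau)

for n in (1, 10, 100, 1000):
    print(n, eta_inverse_time(n), eta_search_then_converge(n))
```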

Page 31

Perceptron

The perceptron is based on a nonlinear neuron.

v = Σi=1..m wixi + b

In the simplest form of the perceptron there are two decision regions separated by a hyperplane.

Page 32

Perceptron

x(n) = [+1, x1(n), x2(n), ..., xm(n)]T

w(n) = [b(n), w1(n), w2(n), ..., wm(n)]T

v(n) = wT(n)x(n); the equation wT(n)x(n) = 0 defines the separating hyperplane

Page 33

Perceptron

1. If the nth member of the training set, x(n), is correctly classified by the weight vector w(n) computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron in accordance with the rule:

w(n+1) = w(n) if wT(n)x(n) > 0 and x(n) belongs to class L1

w(n+1) = w(n) if wT(n)x(n) ≤ 0 and x(n) belongs to class L2

Page 34

Perceptron

2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule

w(n+1) = w(n) - η(n)x(n) if wT(n)x(n) > 0 and x(n) belongs to class L2

w(n+1) = w(n) + η(n)x(n) if wT(n)x(n) ≤ 0 and x(n) belongs to class L1

where the learning-rate parameter η(n) controls the adjustment applied to the weight vector at iteration n.

Page 35

Perceptron

Fixed-increment convergence theorem: Let the subsets of training vectors H1 and H2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that w(n0) = w(n0+1) = w(n0+2) = ... is a solution vector for n0 ≤ nmax.

Page 36

Perceptron

The Perceptron Convergence Algorithm
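The algorithm-summary table from this slide is not reproduced here; the following is a minimal sketch of the error-correction rules from the previous slides, run on an assumed linearly separable two-class toy set with targets +1 for class L1 and -1 for class L2 (data, learning rate and epoch limit are illustrative).

```python
import numpy as np

# Perceptron convergence algorithm: fixed-increment, error-correction form of the
# rules on pages 33-34. The two Gaussian clouds used as classes L1 and L2 are
# synthetic and chosen to be linearly separable.
rng = np.random.default_rng(5)
X1 = rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(50, 2))   # class L1
X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))   # class L2
X = np.vstack([X1, X2])
X = np.hstack([np.ones((X.shape[0], 1)), X])                 # prepend +1 for the bias b
d = np.hstack([np.ones(50), -np.ones(50)])                   # desired responses

eta = 1.0                      # learning-rate parameter eta(n), kept constant here
w = np.zeros(3)                # w(0) = [b, w1, w2]
for epoch in range(100):
    errors = 0
    for x_n, d_n in zip(X, d):
        v = w @ x_n                           # induced local field v(n) = w^T(n) x(n)
        y = 1.0 if v > 0 else -1.0            # hard-limiter output
        if y != d_n:                          # misclassified -> correct the weights
            w = w + eta * d_n * x_n           # +eta*x for class L1, -eta*x for class L2
            errors += 1
    if errors == 0:                           # converged: all patterns classified correctly
        break
print("weights:", w, "epochs:", epoch + 1)
```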

Page 37

Summary

The LMS algorithm uses a linear neuron.

Continuous learning takes place.

The perceptron uses the McCulloch-Pitts formal model of a neuron.

The learning process is performed for a finite number of iterations and then stops.