388_Class3_F06


  • 7/30/2019 388_Class3_F06


    Neural Networks: Single Neurons (continued)

    G. Extension of the Delta Rule: smooth f(z)

1. The delta rule is easily extendable to cases where the step function output function is not sufficient, i.e. if you want to better model a real neuron with a sigmoidal f(z).

2. Recall that for a given training vector, the output is

y = f(z) = f\left( w_0 + \sum_{j=1}^{n} w_j T_j \right)

Now, for a non-step-function activation function, we define the error using the true output:

E = \frac{1}{2}\,(y - t)^2

3. Again, the direction of steepest decrease of E is given by -\partial E/\partial w_i, so

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}

4. Differentiating,

\frac{\partial E}{\partial w_i} = (y - t)\frac{\partial y}{\partial w_i} = (y - t)\frac{\partial f(z)}{\partial w_i} = (y - t)\, f'(z)\,\frac{\partial z}{\partial w_i} = (y - t)\, f'(z)\, T_i

where f'(z) is the derivative of f(z) with respect to z. Hence, the weights are modified by

\Delta w_i = -\eta\, (y - t)\, f'(z)\, T_i = \eta\, (t - y)\, f'(z)\, T_i

The main differences from the original delta rule are the presence of y and the factor of f'(z). The same equation can be used for updating the bias weight, but the factor of T_i is replaced by 1.
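As a concrete illustration, here is a minimal NumPy sketch of the update above for a single neuron with a binary sigmoid. The function names and the learning rate eta = 0.1 are illustrative choices, not from the notes:

```python
import numpy as np

def sigmoid(z):
    """Binary sigmoid activation f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_step(w, b, T, t, eta=0.1):
    """One delta-rule update for a single sigmoid neuron.

    w: weight vector (n,), b: bias, T: input vector (n,), t: target.
    Returns the updated (w, b).
    """
    z = b + w @ T                    # activation z = w_0 + sum_j w_j T_j
    y = sigmoid(z)                   # output y = f(z)
    fprime = y * (1.0 - y)           # f'(z) for the binary sigmoid (sigma = 1)
    delta = eta * (t - y) * fprime   # common factor eta (t - y) f'(z)
    return w + delta * T, b + delta  # bias update: factor T_i replaced by 1

# usage: one update from zero weights
w = np.zeros(3); b = 0.0
T = np.array([1.0, -1.0, 0.5]); t = 1.0
w, b = delta_rule_step(w, b, T, t)
```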

5. Note that the step function is no longer a possibility for f(z), since its derivative is either 0 or undefined (explaining why z, rather than y, was used in the original delta rule error function). The function f must now be differentiable, like the sigmoid functions described earlier. Here are some typical examples:

Binary sigmoid: f(z) = \frac{1}{1 + e^{-\sigma z}} (asymptotes are f(z) = 0 and f(z) = 1). For this case f'(z) = \sigma\, f(z)\,[1 - f(z)], so the derivative is easily calculable from f(z) itself.


Bipolar sigmoid: f(z) = \frac{2}{1 + e^{-\sigma z}} - 1 (asymptotes are f(z) = -1 and f(z) = 1). For this case f'(z) = \frac{\sigma}{2}\,[1 + f(z)]\,[1 - f(z)], so the derivative is again easily calculable.

Hyperbolic tangent: f(z) = \tanh(z) (asymptotes are f(z) = -1 and f(z) = 1). The derivative is f'(z) = \mathrm{sech}^2(z) = 1 - f^2(z).
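The three activation functions and their derivatives can be collected in a short NumPy sketch. The helper names and the steepness parameter sigma are illustrative (with sigma = 1 the derivatives reduce to the forms in the notes):

```python
import numpy as np

sigma = 1.0  # steepness parameter (assumed; notes take sigma = 1 implicitly)

def binary_sigmoid(z):
    return 1.0 / (1.0 + np.exp(-sigma * z))

def binary_sigmoid_prime(z):
    f = binary_sigmoid(z)
    return sigma * f * (1.0 - f)          # f' = sigma f (1 - f)

def bipolar_sigmoid(z):
    return 2.0 / (1.0 + np.exp(-sigma * z)) - 1.0

def bipolar_sigmoid_prime(z):
    f = bipolar_sigmoid(z)
    return (sigma / 2.0) * (1.0 + f) * (1.0 - f)  # f' = (sigma/2)(1+f)(1-f)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2          # sech^2(z) = 1 - f^2(z)
```

Each derivative is written in terms of f(z) itself, which is exactly the property the notes highlight: once the forward value is computed, the gradient factor comes almost for free.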

    2. Multiple Neuron Networks

    I. Madalines (Multiple Adalines)

A. A Single Layer of Adalines

1. Let there be n inputs into m output neurons. Assume each input is connected to each output unit, so we'll have an n×m array of weights, w_ij, i = 1,...,n; j = 1,...,m. Then the outputs are given by

    outputs are given by

y_j = f(z_j) = f\left( w_{0j} + \sum_{k=1}^{n} T_k w_{kj} \right)

    Here's an example with n = 2 and m = 2:

[Figure: two inputs x1, x2, each connected to two output units y1, y2, with bias units b1, b2 and weights w01, w02, w11, w12, w21, w22.]


2. The error function to be minimized should now include all the outputs. For a step-function activation function:

E = \frac{1}{2} \sum_{k=1}^{m} (z_k - t_k)^2

The derivation of the weight changes is basically the same as for a single neuron, since

\frac{\partial E}{\partial w_{ij}} = \frac{1}{2} \sum_{k=1}^{m} \frac{\partial}{\partial w_{ij}} (z_k - t_k)^2 = \sum_{k=1}^{m} (z_k - t_k) \frac{\partial z_k}{\partial w_{ij}} = \sum_{k=1}^{m} (z_k - t_k) \frac{\partial}{\partial w_{ij}} \sum_{l=1}^{n} T_l w_{lk}

= \sum_{k=1}^{m} (z_k - t_k) \sum_{l=1}^{n} T_l \frac{\partial w_{lk}}{\partial w_{ij}} = \sum_{k=1}^{m} \sum_{l=1}^{n} (z_k - t_k)\, T_l\, \delta_{li} \delta_{kj} = (z_j - t_j)\, T_i

so the weights are modified by

\Delta w_{ij} = -\eta\, (z_j - t_j)\, T_i = \eta\, (t_j - z_j)\, T_i

The same equation holds for updating the bias weights, if we take i = 0 and T_0 = 1.
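One training step for the single layer can be sketched in NumPy as follows. The function name and the learning rate are illustrative; this is the step-function case, where the error is defined on the activations z_j rather than the outputs:

```python
import numpy as np

def train_step_single_layer(W, b, T, t, eta=0.1):
    """One delta-rule step for a single layer of Adalines.

    W: (n, m) weight array w_ij, b: (m,) bias weights,
    T: (n,) input vector, t: (m,) target vector.
    Returns the updated (W, b).
    """
    z = b + T @ W                   # activations z_j = w_0j + sum_k T_k w_kj
    err = t - z                     # (t_j - z_j)
    W = W + eta * np.outer(T, err)  # Delta w_ij = eta (t_j - z_j) T_i
    b = b + eta * err               # bias: i = 0, T_0 = 1
    return W, b
```

The outer product builds the full n×m update in one shot: row i carries the factor T_i, column j carries the factor (t_j - z_j).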

3. For smooth activation functions, the error function to be minimized is based on the outputs:

E = \frac{1}{2} \sum_{k=1}^{m} (y_k - t_k)^2

A similar calculation to that above yields

\Delta w_{ij} = -\eta\, (y_j - t_j)\, f'(z_j)\, T_i = \eta\, (t_j - y_j)\, f'(z_j)\, T_i

Again, the same equation holds for updating the bias weights, if we take i = 0 and T_0 = 1.

    B. Madaline Networks with one hidden layer and one output layer

1. Begin with the simple case of a single output neuron (m = 1). Let there be n inputs and l hidden neurons. We assume each input is connected to each hidden unit, and the outputs of the hidden units are the inputs to the output unit. Thus, we'll have an n×l array of input-hidden weights, w_ij, i = 1,...,n; j = 1,...,l, and l hidden-output weights, v_j, plus bias weights for each neuron. Here's an example with n = l = 2:


[Figure: two inputs x1, x2 connected to hidden units H1, H2 (bias units b1, b2; weights w01, w02, w11, w12, w21, w22), whose outputs h1, h2 feed a single output unit Y1 with output y (bias unit b; weights v0, v1, v2).]

The intermediate neurons, labeled with upper-case H's in the figure, are often called hidden units since they're not visible at input or output, but only play an internal processing role. Nonetheless, they are what makes it possible to solve non-linearly-separable problems and get around Minsky and Papert's theorem.

The hidden unit activations z_j and outputs h_j are given by

z_j = w_{0j} + \sum_{q=1}^{n} w_{qj} T_q \quad \text{and} \quad h_j = f(z_j)

The output neuron satisfies similar equations to the single neuron case we studied in chapter 1: the output unit activation g and output y are given by

g = v_0 + \sum_{p=1}^{l} v_p h_p \quad \text{and} \quad y = f(g)
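The forward pass through the hidden layer and output unit can be sketched as follows (hypothetical function name; tanh is chosen here as the activation f, one of the bipolar options above):

```python
import numpy as np

def forward(T, W, b_h, v, v0, f=np.tanh):
    """Forward pass: n inputs -> l hidden units -> single output.

    T: (n,) input; W: (n, l) input-hidden weights w_ij; b_h: (l,) hidden biases;
    v: (l,) hidden-output weights; v0: output bias; f: activation function.
    Returns (z, h, g, y) as defined in the notes.
    """
    z = b_h + T @ W    # hidden activations z_j = w_0j + sum_q w_qj T_q
    h = f(z)           # hidden outputs h_j = f(z_j)
    g = v0 + v @ h     # output activation g = v_0 + sum_p v_p h_p
    y = f(g)           # network output y = f(g)
    return z, h, g, y
```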

2. In the original form of Madaline, the output unit had fixed weights, v0, v1, v2 (usually a "majority rules" algorithm, or an OR for 2 inputs). Hence, these weights would not need to be trained. In addition, the activation function f was taken to be a step function.

    3. The original delta rule for weight update can be generalized for this single hidden layer

    case, using the hidden unit activities

  • 7/30/2019 388_Class3_F06

    5/5

    "wij =# t$z j( )Ti

    where now the activitieszj are now even further removed from the target output, t.

Nonetheless, this method can succeed if the parameters and algorithm are chosen carefully; that is, getting it to work requires some experimentation, and there may be different best methods for different problems.

    4. Before the backpropagation learning rule was devised, there were many attempts to

    improve learning using the delta rule. One variation that can be reasonably efficient is

    the following algorithm; instead of updating all the weights at each iteration try to

    short-circuit the process like so:

Epoch loop: While the stopping criterion is false, do the following:

Training vector loop: For each training vector (1, ..., n):

Compute z_j and h_j for each hidden unit, the activation g, and the overall output y; then update the weights as follows:

1. If y = t, no update is performed.

2. If y ≠ t and t = 1, then update weights only for the hidden unit H_c whose input sum is closest to zero:

\Delta w_{ic} = \eta\, (1 - z_c)\, T_i

3. If y ≠ t and t = -1, then update weights for all hidden units H_s whose input sums are positive:

\Delta w_{is} = \eta\, (-1 - z_s)\, T_i, for all s such that z_s > 0

End the training vector loop when all training vectors have been used.

Check the stopping condition using the updated weights after each epoch: if the stopping criterion is satisfied, terminate; else do a new epoch.
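One epoch of this update scheme might look like the following NumPy sketch, assuming step-function hidden units, fixed output weights v, v0 (e.g. a majority rule), and bipolar targets in {-1, +1}. All names are illustrative:

```python
import numpy as np

def step(z):
    """Bipolar step function."""
    return np.where(z >= 0, 1.0, -1.0)

def mri_epoch(W, b_h, v, v0, X, targets, eta=0.1):
    """One epoch of the Madaline-style rule sketched above.

    W: (n, l) input-hidden weights, b_h: (l,) hidden biases (trained).
    v: (l,) and v0: fixed output weights. X: (num_vectors, n) training
    vectors, targets: (num_vectors,) targets in {-1, +1}.
    """
    for T, t in zip(X, targets):
        z = b_h + T @ W                   # hidden input sums z_j
        h = step(z)                       # hidden outputs
        y = step(v0 + v @ h)              # overall output
        if y == t:
            continue                      # rule 1: correct, no update
        if t == 1:
            c = np.argmin(np.abs(z))      # rule 2: unit with z closest to 0
            W[:, c] += eta * (1 - z[c]) * T
            b_h[c] += eta * (1 - z[c])
        else:
            for s in np.where(z > 0)[0]:  # rule 3: all units with z_s > 0
                W[:, s] += eta * (-1 - z[s]) * T
                b_h[s] += eta * (-1 - z[s])
    return W, b_h
```

Rules 2 and 3 push the "most responsible" hidden units toward the desired side of zero, which is the heuristic, pre-backpropagation flavor the notes describe.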

This rather ad hoc method was no doubt the product of some experimentation as well as theory, and is typical of the efforts to make the delta rule work for complex networks in the era before backpropagation. In fact, one of the problems in the slow research period of the 1970s was that there was no uniformly good method of optimally modifying the weights, especially for multi-layer Madalines. It became a bit of an art to find rules that would converge for a given problem in a reasonable amount of time. Although many other rules were suggested, we will not cover most of them, since the delta rule leads directly into the backpropagation method, which is quite general.