7/30/2019 388_Class3_F06
Neural Networks: Single Neurons (continued)
G. Extension of the Delta Rule: smooth f(z)
1. The delta rule is easily extended to cases where the step output function is
not sufficient, i.e. if you want to better model a real neuron with a sigmoidal f(z).
2. Recall for a given training vector, the output is
$$ y = f(z) = f\!\left( w_0 + \sum_{j=1}^{n} w_j T_j \right) $$
Now, for a non-step activation function, we define the error using the true
output:
$$ E = \tfrac{1}{2}\,(y - t)^2 $$
3. Again, the direction of steepest decrease of E is given by $-\partial E / \partial w_i$, so
$$ \Delta w_i = -\alpha \frac{\partial E}{\partial w_i} $$
3. Differentiating
!E
!wi=
y"t( )
!y
!wi=
y"t( )
!f(z)
!wi=
y"t( )
#f (z)
!z
!wi=
y"t( )
#f (z)Ti
where $f'(z)$ is the derivative of $f(z)$ with respect to $z$. Hence, the weights are modified
by
$$ \Delta w_i = -\alpha\,(y - t)\,f'(z)\,T_i = \alpha\,(t - y)\,f'(z)\,T_i $$
The main differences from the original delta rule are the presence of $y$ and the factor
of $f'(z)$. The same equation can be used for updating the bias weight, but the factor of
$T_i$ is replaced by 1.
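As an illustration, here is a minimal sketch (not from the notes) of this extended delta rule for a single neuron with a binary sigmoid, using the identity $f'(z) = f(z)[1 - f(z)]$; the AND training set, learning rate, and epoch count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(X, t, alpha=0.5, epochs=1000):
    """Extended delta rule for one sigmoid neuron.
    X: (N, n) inputs; t: (N,) targets in (0, 1)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])  # input weights w_i
    w0 = 0.0                                     # bias weight
    for _ in range(epochs):
        for Ti, ti in zip(X, t):
            z = w0 + Ti @ w
            y = sigmoid(z)
            fprime = y * (1.0 - y)                 # f'(z) from f(z) itself
            w += alpha * (ti - y) * fprime * Ti    # Δw_i = α (t − y) f'(z) T_i
            w0 += alpha * (ti - y) * fprime        # bias: T_i replaced by 1
    return w0, w

# Learn AND, a linearly separable problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
t = np.array([0, 0, 0, 1], float)
w0, w = train_neuron(X, t)
preds = (sigmoid(w0 + X @ w) > 0.5).astype(int)
```

Thresholding the sigmoid output at 0.5 recovers a binary decision, so the trained neuron classifies all four AND patterns correctly.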
5. Note that the step function is no longer a possibility for $f(z)$, since its derivative is
either 0 or undefined (explaining why $z$, rather than $y$, was used in the original delta rule error
function). The function $f$ must now be differentiable, like the sigmoid functions
described earlier. Here are some typical examples:
Binary sigmoid: $f(z) = \dfrac{1}{1 + e^{-\beta z}}$ (asymptotes are $f(z) = 0$ and $f(z) = 1$). For this
case $f'(z) = \beta f(z)[1 - f(z)]$, so the derivative is easily calculable from $f(z)$ itself.
Bipolar sigmoid: $f(z) = \dfrac{2}{1 + e^{-\beta z}} - 1$ (asymptotes are $f(z) = -1$ and $f(z) = 1$). For
this case $f'(z) = \dfrac{\beta}{2}[1 + f(z)][1 - f(z)]$, so the derivative is again easily calculable.
Hyperbolic tangent: $f(z) = \tanh(z)$ (asymptotes are $f(z) = -1$ and $f(z) = 1$). The
derivative is $f'(z) = \operatorname{sech}^2(z) = 1 - f^2(z)$.
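These derivative identities are easy to check numerically; the sketch below (the steepness $\beta$ and the evaluation grid are arbitrary choices, not from the notes) compares each closed form against a central-difference derivative:

```python
import numpy as np

beta = 1.0            # steepness parameter (assumed value)
z = np.linspace(-4, 4, 101)
h = 1e-6              # step for central-difference derivatives

def binary(z):  return 1.0 / (1.0 + np.exp(-beta * z))
def bipolar(z): return 2.0 / (1.0 + np.exp(-beta * z)) - 1.0

# Numerical derivatives via (f(z+h) - f(z-h)) / 2h
num_bin  = (binary(z + h)  - binary(z - h))  / (2 * h)
num_bip  = (bipolar(z + h) - bipolar(z - h)) / (2 * h)
num_tanh = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)

# Closed forms expressed through f(z) itself
ok_bin  = np.allclose(num_bin,  beta * binary(z) * (1 - binary(z)))
ok_bip  = np.allclose(num_bip,  (beta / 2) * (1 + bipolar(z)) * (1 - bipolar(z)))
ok_tanh = np.allclose(num_tanh, 1 - np.tanh(z) ** 2)
```

All three checks pass, which is why these activations are convenient: the derivative needed by the learning rule comes for free from the forward-pass value.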
2. Multiple Neuron Networks
I. Madalines (Multiple Adalines)
A. A Single Layer of Adalines
1. Let there be n inputs into m output neurons. Assume each input is connected to each
output unit, so we'll have an n×m array of weights, w_ij, i = 1,...,n; j = 1,...,m. Then the
outputs are given by
$$ y_j = f(z_j) = f\!\left( w_{0j} + \sum_{k=1}^{n} T_k w_{kj} \right) $$
Here's an example with n = 2 and m = 2:
[Figure: a 2-input, 2-output single-layer network with inputs x1, x2, outputs y1, y2, weights w11, w12, w21, w22, and bias weights w01, w02 (bias inputs b1, b2).]
2. The error function to be minimized should now include all the outputs. For a step
activation function:
$$ E = \frac{1}{2} \sum_{k=1}^{m} (z_k - t_k)^2 $$
The derivation of the weight changes is basically the same as for a single neuron, since
$$ \frac{\partial E}{\partial w_{ij}} = \frac{1}{2}\,\frac{\partial}{\partial w_{ij}} \sum_{k=1}^{m} (z_k - t_k)^2 = \sum_{k=1}^{m} (z_k - t_k)\,\frac{\partial z_k}{\partial w_{ij}} = \sum_{k=1}^{m} (z_k - t_k)\,\frac{\partial}{\partial w_{ij}} \sum_{l=1}^{n} T_l w_{lk} $$
$$ = \sum_{k=1}^{m} (z_k - t_k) \sum_{l=1}^{n} T_l\,\frac{\partial w_{lk}}{\partial w_{ij}} = \sum_{k=1}^{m} (z_k - t_k) \sum_{l=1}^{n} T_l\,\delta_{li}\,\delta_{kj} = (z_j - t_j)\,T_i $$
so the weights are modified by
$$ \Delta w_{ij} = -\alpha\,(z_j - t_j)\,T_i = \alpha\,(t_j - z_j)\,T_i $$
The same equation holds for updating the bias weights, if we take i = 0 and $T_0 = 1$.
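A minimal sketch of this single-layer delta rule (the OR/AND targets, bipolar coding, learning rate, and epoch count are illustrative choices, not from the notes). Because the error is measured on the activations $z_j$, no derivative of the step function is ever needed:

```python
import numpy as np

def train_layer(X, T, alpha=0.05, epochs=300):
    """Delta rule for a single layer of Adalines.
    X: (N, n) bipolar inputs; T: (N, m) bipolar targets."""
    n, m = X.shape[1], T.shape[1]
    W = np.zeros((n, m))    # w_ij: input i to output j
    W0 = np.zeros(m)        # bias weights w_0j
    for _ in range(epochs):
        for Ti, ti in zip(X, T):
            z = W0 + Ti @ W                     # activations z_j
            W += alpha * np.outer(Ti, ti - z)   # Δw_ij = α (t_j − z_j) T_i
            W0 += alpha * (ti - z)              # bias: T_i replaced by 1
    return W0, W

# Two output units learning OR and AND of two bipolar inputs
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], float)
T = np.array([[-1, -1], [1, -1], [1, -1], [1, 1]], float)
W0, W = train_layer(X, T)
out = np.where(W0 + X @ W >= 0, 1, -1)   # step function applied to z
```

Applying the step function to the trained activations reproduces both target columns, since each is linearly separable.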
3. For smooth activation functions, the error function to be minimized is based on the
outputs:
$$ E = \frac{1}{2} \sum_{k=1}^{m} (y_k - t_k)^2 $$
A similar calculation to that above yields
$$ \Delta w_{ij} = -\alpha\,(y_j - t_j)\,f'(z_j)\,T_i = \alpha\,(t_j - y_j)\,f'(z_j)\,T_i $$
Again, the same equation holds for updating the bias weights, if we take i = 0 and
$T_0 = 1$.
B. Madaline Networks with one hidden layer and one output layer
1. Begin with the simple case of a single output neuron (m = 1). Let there be n inputs and
l hidden neurons. We assume each input is connected to each hidden unit and the
outputs of the hidden units are the inputs to the output unit. Thus, we'll have an n×l
array of input-hidden weights, w_ij, i = 1,...,n; j = 1,...,l, and l hidden-output weights, v_j,
plus bias weights for each neuron. Here's an example with n = l = 2:
[Figure: a network with inputs x1, x2 feeding hidden units H1, H2 (weights w11, w12, w21, w22; bias weights w01, w02; bias inputs b1, b2), whose outputs h1, h2 feed a single output unit Y with output y (weights v1, v2; bias weight v0; bias input b).]
The intermediate neurons, labeled with upper-case H's in the figure, are often called
hidden units since they're not visible at input or output, but only play an internal
processing role. Nonetheless, these are what make it possible to solve non-linearly-
separable problems and get around Minsky and Papert's theorem.
The hidden unit activations z_j and outputs h_j are given by
$$ z_j = w_{0j} + \sum_{q=1}^{n} w_{qj} T_q, \qquad h_j = f(z_j) $$
The output neuron satisfies similar equations to the single-neuron case we studied in
chapter 1: the output unit activation g and output y are given by
$$ g = v_0 + \sum_{p=1}^{l} v_p h_p, \qquad y = f(g) $$
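The forward pass above is straightforward to code. The sketch below uses tanh for f and hand-picked weights (my own illustrative choice, not from the notes) that solve XOR with an OR-like and an AND-like hidden unit, combined as "OR and not AND":

```python
import numpy as np

def forward(T, W0, W, v0, v, f=np.tanh):
    """Forward pass of a one-hidden-layer network.
    T: (n,) input; W0: (l,) hidden biases; W: (n, l) input-hidden weights;
    v0: scalar output bias; v: (l,) hidden-output weights."""
    z = W0 + T @ W      # hidden activations z_j = w_0j + Σ_q w_qj T_q
    h = f(z)            # hidden outputs h_j = f(z_j)
    g = v0 + h @ v      # output activation g = v_0 + Σ_p v_p h_p
    y = f(g)            # network output
    return z, h, g, y

# Hidden unit 1 computes a soft OR, hidden unit 2 a soft AND
W0 = np.array([5.0, -5.0])
W  = np.array([[5.0, 5.0],
               [5.0, 5.0]])
v0 = -5.0
v  = np.array([5.0, -5.0])   # y ≈ OR(x1, x2) and not AND(x1, x2) = XOR

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], float)
outs = [np.sign(forward(T, W0, W, v0, v)[3]) for T in X]
```

No single-layer network can produce this output pattern, which is exactly the non-linear separability the hidden units buy us.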
2. In the original form of Madaline, the output unit had fixed weights, v0, v1, v2 (usually a
"majority rules" algorithm, or an OR for 2 inputs). Hence, these weights would not
need to be trained. In addition, the activation function f was taken to be a step function.
3. The original delta rule for weight update can be generalized for this single-hidden-layer
case, using the hidden unit activations
$$ \Delta w_{ij} = \alpha\,(t - z_j)\,T_i $$
where the activations $z_j$ are now even further removed from the target output, t.
Nonetheless, this method can succeed if the parameters and algorithm are chosen
carefully; i.e., to get this to work requires some experimentation, and there may be
different best methods for different problems.
4. Before the backpropagation learning rule was devised, there were many attempts to
improve learning using the delta rule. One variation that can be reasonably efficient is
the following algorithm: instead of updating all the weights at each iteration, try to
short-circuit the process like so:
Epoch loop: while the stopping criterion is false, do the following:
Training vector loop: for each training vector (1, ..., n):
Compute z_j and h_j for each hidden unit, g, and the overall output y; update weights as follows:
1. If y = t, no update is performed.
2. If y ≠ t and t = 1, update weights only for the hidden unit Hc whose input sum is closest to zero:
$$ \Delta w_{ic} = \alpha\,(1 - z_c)\,T_i $$
3. If y ≠ t and t = -1, update weights for all hidden units Hs whose input sums are positive:
$$ \Delta w_{is} = \alpha\,(-1 - z_s)\,T_i, \quad \text{for all } s \text{ such that } z_s > 0 $$
End training vector loop when all training vectors have been used.
Check the stopping condition using the updated weights after each epoch.
If the stopping criterion is satisfied then terminate, else do a new epoch.
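A sketch of the scheme above (essentially Madaline Rule I), assuming bipolar step activations and a fixed OR output unit; the hidden-layer size, learning rate, initialization, and XOR task are illustrative choices of mine, and, as the notes warn, convergence is not guaranteed:

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1, -1)

def train_madaline(X, t, alpha=0.5, l=2, max_epochs=100, seed=0):
    """Madaline-I-style training: one hidden layer, fixed OR output unit.
    X: (N, n) bipolar inputs; t: (N,) bipolar targets."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], l))  # input-hidden weights
    W0 = rng.normal(scale=0.1, size=l)               # hidden bias weights
    v0, v = l - 1.0, np.ones(l)                      # fixed OR output weights
    for _ in range(max_epochs):
        changed = False
        for Ti, ti in zip(X, t):
            z = W0 + Ti @ W
            h = step(z)
            y = step(v0 + h @ v)
            if y == ti:
                continue                          # 1. correct: no update
            changed = True
            if ti == 1:
                c = np.argmin(np.abs(z))          # 2. unit with z closest to 0
                W[:, c] += alpha * (1 - z[c]) * Ti
                W0[c]   += alpha * (1 - z[c])
            else:
                for s in np.where(z > 0)[0]:      # 3. all units with z_s > 0
                    W[:, s] += alpha * (-1 - z[s]) * Ti
                    W0[s]   += alpha * (-1 - z[s])
        if not changed:
            break                                  # all vectors classified
    return W0, W, v0, v

# Attempt XOR with bipolar coding; success depends on the initialization
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], float)
t = np.array([-1, 1, 1, -1])
W0, W, v0, v = train_madaline(X, t)
preds = step(v0 + step(W0 + X @ W) @ v)
```

Note the asymmetry in the rule: a false negative nudges only the most borderline hidden unit toward firing, while a false positive pushes every firing unit toward silence, which keeps the OR output conservative.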
This rather ad hoc method was no doubt the product of some experimentation as well
as theory and is typical of the efforts to make the delta rule work for complex networks
in the era before backpropagation. In fact, one of the problems during the slow
research period of the 1970s was that there was no uniformly good method of optimally
modifying the weights, especially for multi-layer Madalines. It became a bit of an art
to find rules that would converge for a given problem in a reasonable amount of time.
Although many other rules were suggested, we will not cover most of them since the
Delta rule leads directly into the backpropagation method, which is quite general.