388_Class3_F06


  • 7/30/2019 388_Class3_F06


    Neural Networks: Single Neurons (continued)

    G. Extension of the Delta Rule: smooth f(z)

1. The delta rule is easily extendable to cases where the step function output function is not sufficient, i.e. if you want to better model a real neuron with a sigmoidal f(z).

2. Recall that for a given training vector, the output is

y = f(z) = f\left( w_0 + \sum_{j=1}^{n} w_j T_j \right)

Now, for a non-step-function activation function, we define the error using the true output:

E = \frac{1}{2}\,(y - t)^2

3. Again, the direction of steepest decrease of E is given by -\partial E/\partial w_i, so

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}

4. Differentiating,

\frac{\partial E}{\partial w_i} = (y - t)\frac{\partial y}{\partial w_i} = (y - t)\frac{\partial f(z)}{\partial w_i} = (y - t)\, f'(z)\,\frac{\partial z}{\partial w_i} = (y - t)\, f'(z)\, T_i

where f'(z) is the derivative of f(z) with respect to z. Hence, the weights are modified by

\Delta w_i = -\eta\, (y - t)\, f'(z)\, T_i = \eta\, (t - y)\, f'(z)\, T_i

The main differences from the original delta rule are the presence of y and the factor of f'(z). The same equation can be used for updating the bias weight, but the factor of T_i is replaced by 1.
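As a concrete illustration, here is a minimal NumPy sketch of the update above for a single neuron with a binary sigmoid. The function names and the learning rate eta = 0.1 are illustrative choices, not from the notes:

```python
import numpy as np

def sigmoid(z):
    """Binary sigmoid activation f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_step(w, b, T, t, eta=0.1):
    """One delta-rule update for a single sigmoid neuron.

    w: weight vector (n,), b: bias, T: input vector (n,), t: target.
    Returns the updated (w, b).
    """
    z = b + w @ T                    # activation z = w_0 + sum_j w_j T_j
    y = sigmoid(z)                   # output y = f(z)
    fprime = y * (1.0 - y)           # f'(z) for the binary sigmoid (sigma = 1)
    delta = eta * (t - y) * fprime   # common factor eta (t - y) f'(z)
    return w + delta * T, b + delta  # bias update: factor T_i replaced by 1

# usage: one update from zero weights
w = np.zeros(3); b = 0.0
T = np.array([1.0, -1.0, 0.5]); t = 1.0
w, b = delta_rule_step(w, b, T, t)
```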

5. Note that the step function is no longer a possibility for f(z), since its derivative is either 0 or undefined (explaining why z, rather than y, was used in the original delta rule error function). The function f must now be differentiable, like the sigmoid functions described earlier. Here are some typical examples:

Binary sigmoid: f(z) = \frac{1}{1 + e^{-\sigma z}} (asymptotes are f(z) = 0 and f(z) = 1). For this case f'(z) = \sigma\, f(z)\,[1 - f(z)], so the derivative is easily calculable from f(z) itself.


Bipolar sigmoid: f(z) = \frac{2}{1 + e^{-\sigma z}} - 1 (asymptotes are f(z) = -1 and f(z) = 1). For this case f'(z) = \frac{\sigma}{2}\,[1 + f(z)]\,[1 - f(z)], so the derivative is again easily calculable.

Hyperbolic tangent: f(z) = \tanh(z) (asymptotes are f(z) = -1 and f(z) = 1). The derivative is f'(z) = \mathrm{sech}^2(z) = 1 - f^2(z).
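The three activation functions and their derivatives can be collected in a short NumPy sketch. The helper names and the steepness parameter sigma are illustrative (with sigma = 1 the derivatives reduce to the forms in the notes):

```python
import numpy as np

sigma = 1.0  # steepness parameter (assumed; notes take sigma = 1 implicitly)

def binary_sigmoid(z):
    return 1.0 / (1.0 + np.exp(-sigma * z))

def binary_sigmoid_prime(z):
    f = binary_sigmoid(z)
    return sigma * f * (1.0 - f)          # f' = sigma f (1 - f)

def bipolar_sigmoid(z):
    return 2.0 / (1.0 + np.exp(-sigma * z)) - 1.0

def bipolar_sigmoid_prime(z):
    f = bipolar_sigmoid(z)
    return (sigma / 2.0) * (1.0 + f) * (1.0 - f)  # f' = (sigma/2)(1+f)(1-f)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2          # sech^2(z) = 1 - f^2(z)
```

Each derivative is written in terms of f(z) itself, which is exactly the property the notes highlight: once the forward value is computed, the gradient factor comes almost for free.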

    2. Multiple Neuron Networks

    I. Madalines (Multiple Adalines)

A. A Single Layer of Adalines

1. Let there be n inputs into m output neurons. Assume each input is connected to each output unit, so we'll have an n×m array of weights, w_ij, i = 1,...,n; j = 1,...,m. Then the outputs are given by

    outputs are given by

y_j = f(z_j) = f\left( w_{0j} + \sum_{k=1}^{n} T_k w_{kj} \right)

    Here's an example with n = 2 and m = 2:

[Figure: two inputs x1, x2, each connected to two output units y1, y2, with bias units b1, b2 and weights w01, w02, w11, w12, w21, w22.]


2. The error function to be minimized should now include all the outputs. For a step-function activation function:

E = \frac{1}{2} \sum_{k=1}^{m} (z_k - t_k)^2

The derivation of the weight changes is basically the same as for a single neuron, since

\frac{\partial E}{\partial w_{ij}} = \frac{1}{2} \sum_{k=1}^{m} \frac{\partial}{\partial w_{ij}} (z_k - t_k)^2 = \sum_{k=1}^{m} (z_k - t_k) \frac{\partial z_k}{\partial w_{ij}} = \sum_{k=1}^{m} (z_k - t_k) \frac{\partial}{\partial w_{ij}} \sum_{l=1}^{n} T_l w_{lk}

= \sum_{k=1}^{m} (z_k - t_k) \sum_{l=1}^{n} T_l \frac{\partial w_{lk}}{\partial w_{ij}} = \sum_{k=1}^{m} \sum_{l=1}^{n} (z_k - t_k)\, T_l\, \delta_{li} \delta_{kj} = (z_j - t_j)\, T_i

so the weights are modified by

\Delta w_{ij} = -\eta\, (z_j - t_j)\, T_i = \eta\, (t_j - z_j)\, T_i

The same equation holds for updating the bias weights, if we take i = 0 and T_0 = 1.
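One training step for the single layer can be sketched in NumPy as follows. The function name and the learning rate are illustrative; this is the step-function case, where the error is defined on the activations z_j rather than the outputs:

```python
import numpy as np

def train_step_single_layer(W, b, T, t, eta=0.1):
    """One delta-rule step for a single layer of Adalines.

    W: (n, m) weight array w_ij, b: (m,) bias weights,
    T: (n,) input vector, t: (m,) target vector.
    Returns the updated (W, b).
    """
    z = b + T @ W                   # activations z_j = w_0j + sum_k T_k w_kj
    err = t - z                     # (t_j - z_j)
    W = W + eta * np.outer(T, err)  # Delta w_ij = eta (t_j - z_j) T_i
    b = b + eta * err               # bias: i = 0, T_0 = 1
    return W, b
```

The outer product builds the full n×m update in one shot: row i carries the factor T_i, column j carries the factor (t_j - z_j).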

3. For smooth activation functions, the error function to be minimized is based on the outputs:

E = \frac{1}{2} \sum_{k=1}^{m} (y_k - t_k)^2

A similar calculation to that above yields

\Delta w_{ij} = -\eta\, (y_j - t_j)\, f'(z_j)\, T_i = \eta\, (t_j - y_j)\, f'(z_j)\, T_i

Again, the same equation holds for updating the bias weights, if we take i = 0 and T_0 = 1.

    B. Madaline Networks with one hidden layer and one output layer

1. Begin with the simple case of a single output neuron (m = 1). Let there be n inputs and l hidden neurons. We assume each input is connected to each hidden unit, and the outputs of the hidden units are the inputs to the output unit. Thus, we'll have an n×l array of input-hidden weights, w_ij, i = 1,...,n; j = 1,...,l, and l hidden-output weights, v_j, plus bias weights for each neuron. Here's an example with n = l = 2:


[Figure: two inputs x1, x2 connected to hidden units H1, H2 (bias units b1, b2; weights w01, w02, w11, w12, w21, w22), whose outputs h1, h2 feed a single output unit Y1 with output y (bias unit b; weights v0, v1, v2).]

The intermediate neurons, labeled with upper-case H's in the figure, are often called hidden units since they're not visible at input or output, but only play an internal processing role. Nonetheless, they are what makes it possible to solve non-linearly-separable problems and get around Minsky and Papert's theorem.

The hidden unit activations z_j and outputs h_j are given by

z_j = w_{0j} + \sum_{q=1}^{n} w_{qj} T_q \quad \text{and} \quad h_j = f(z_j)

The output neuron satisfies similar equations to the single neuron case we studied in chapter 1: the output unit activation g and output y are given by

g = v_0 + \sum_{p=1}^{l} v_p h_p \quad \text{and} \quad y = f(g)
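The forward pass through the hidden layer and output unit can be sketched as follows (hypothetical function name; tanh is chosen here as the activation f, one of the bipolar options above):

```python
import numpy as np

def forward(T, W, b_h, v, v0, f=np.tanh):
    """Forward pass: n inputs -> l hidden units -> single output.

    T: (n,) input; W: (n, l) input-hidden weights w_ij; b_h: (l,) hidden biases;
    v: (l,) hidden-output weights; v0: output bias; f: activation function.
    Returns (z, h, g, y) as defined in the notes.
    """
    z = b_h + T @ W    # hidden activations z_j = w_0j + sum_q w_qj T_q
    h = f(z)           # hidden outputs h_j = f(z_j)
    g = v0 + v @ h     # output activation g = v_0 + sum_p v_p h_p
    y = f(g)           # network output y = f(g)
    return z, h, g, y
```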

2. In the original form of Madaline, the output unit had fixed weights, v0, v1, v2 (usually a "majority rules" algorithm, or an OR for 2 inputs). Hence, these weights would not need to be trained. In addition, the activation function f was taken to be a step function.

    3. The original delta rule for weight update can be generalized for this single hidden layer

    case, using the hidden unit activities

  • 7/30/2019 388_Class3_F06

    5/5

    "wij =# t$z j( )Ti

    where now the activitieszj are now even further removed from the target output, t.

Nonetheless, this method can succeed if the parameters and algorithm are chosen carefully; that is, getting it to work requires some experimentation, and there may be different best methods for different problems.

    4. Before the backpropagation learning rule was devised, there were many attempts to

    improve learning using the delta rule. One variation that can be reasonably efficient is

    the following algorithm; instead of updating all the weights at each iteration try to

    short-circuit the process like so:

Epoch loop: While the stopping criterion is false, do the following:

Training vector loop: For each training vector (1, ..., n):

Compute z_j and h_j for each hidden unit, the activation g, and the overall output y; then update the weights as follows:

1. If y = t, no update is performed.

2. If y ≠ t and t = 1, then update weights only for the hidden unit H_c whose input sum is closest to zero:

\Delta w_{ic} = \eta\, (1 - z_c)\, T_i

3. If y ≠ t and t = -1, then update weights for all hidden units H_s whose input sums are positive:

\Delta w_{is} = \eta\, (-1 - z_s)\, T_i, for all s such that z_s > 0

End the training vector loop when all training vectors have been used.

Check the stopping condition using the updated weights after each epoch: if the stopping criterion is satisfied, terminate; else do a new epoch.
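One epoch of this update scheme might look like the following NumPy sketch, assuming step-function hidden units, fixed output weights v, v0 (e.g. a majority rule), and bipolar targets in {-1, +1}. All names are illustrative:

```python
import numpy as np

def step(z):
    """Bipolar step function."""
    return np.where(z >= 0, 1.0, -1.0)

def mri_epoch(W, b_h, v, v0, X, targets, eta=0.1):
    """One epoch of the Madaline-style rule sketched above.

    W: (n, l) input-hidden weights, b_h: (l,) hidden biases (trained).
    v: (l,) and v0: fixed output weights. X: (num_vectors, n) training
    vectors, targets: (num_vectors,) targets in {-1, +1}.
    """
    for T, t in zip(X, targets):
        z = b_h + T @ W                   # hidden input sums z_j
        h = step(z)                       # hidden outputs
        y = step(v0 + v @ h)              # overall output
        if y == t:
            continue                      # rule 1: correct, no update
        if t == 1:
            c = np.argmin(np.abs(z))      # rule 2: unit with z closest to 0
            W[:, c] += eta * (1 - z[c]) * T
            b_h[c] += eta * (1 - z[c])
        else:
            for s in np.where(z > 0)[0]:  # rule 3: all units with z_s > 0
                W[:, s] += eta * (-1 - z[s]) * T
                b_h[s] += eta * (-1 - z[s])
    return W, b_h
```

Rules 2 and 3 push the "most responsible" hidden units toward the desired side of zero, which is the heuristic, pre-backpropagation flavor the notes describe.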

This rather ad hoc method was no doubt the product of some experimentation as well as theory, and is typical of the efforts to make the delta rule work for complex networks in the era before backpropagation. In fact, one of the problems in the slow research period of the 1970s was that there was no uniformly good method of optimally modifying the weights, especially for multi-layer Madalines. It became a bit of an art to find rules that would converge for a given problem in a reasonable amount of time. Although many other rules were suggested, we will not cover most of them, since the delta rule leads directly into the backpropagation method, which is quite general.