
REPORT DOCUMENTATION PAGE (UNCLASSIFIED)

1. Report Number: ARO 25515.1-LS
2. Govt Accession No.: AD-A217 733
4. Title and Subtitle: Gain Modification Enhances High Momentum Backward Propagation
5. Type of Report and Period Covered: Technical
7. Author(s): Charles M. Bachmann
8. Contract or Grant Number(s): DAAL03-88-K-0116
9. Performing Organization Name and Address: Department of Physics and Center for Neural Science, Brown University, Providence, R.I. 02912
11. Controlling Office Name and Address: U.S. Army Research Office, Post Office Box 12211, Research Triangle Park, NC 27709
12. Report Date: November 30, 1989
13. Number of Pages: 52
15. Security Class. (of this report): Unclassified
16. Distribution Statement (of this Report): Approved for public release; distribution unlimited.
18. Supplementary Notes: The views, opinions, and/or findings contained in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy, or decision, unless so designated by other documentation.
19. Key Words: Backward Propagation; Gain Modification; Momentum; Effective Time-Dependent Step-Constant
20. Abstract: We present a backward propagation network which simultaneously modifies the gain parameters and the synaptic weights. Gain modification is shown to enhance the improvement in convergence rate obtained by high momentum in standard synaptic backward propagation. These improvements occur without degrading the generalization capabilities of the final solutions obtained by the network.

Gain Modification Enhances High Momentum Backward Propagation

Charles M. Bachmann^1

Physics Department and Center for Neural Science
Brown University, Providence, R.I. 02912

November 8, 1989

^1 This work was supported in part by the National Science Foundation, the Army Research Office, and the Office of Naval Research.


Contents

1 Theoretical Development of Gain Modification
  1.1 Introduction
  1.2 Review of Notation for Standard Backward Propagation
  1.3 Simultaneous Modification of Gain Parameters and Synapses by a Backward Propagation Algorithm
  1.4 Gain Modification Viewed as an Effective Time-Dependent Step-Constant
  1.5 Impact of the Effective Time-Dependent Step-Constant

2 Experimental Benchmarks for Gain Modification and Momentum
  2.1 Momentum and the Concentric Circle Paradigm
  2.2 Gain Modification Combined with High Momentum Synaptic Modification
  2.3 Future Directions
  2.4 Acknowledgement


Chapter 1

Theoretical Development of Gain Modification


1.1 Introduction

We present a backward propagation network which simultaneously modifies the gain parameters and the synaptic weights. The additional complexity is minimized by the fact that the error signal for modification of the gain of a neuron is proportional to the ordinary error signal for the incoming synaptic weights to the neuron. For a given neuron, the proportionality factor is the reciprocal of its gain. Thus, only the ordinary error signal must be propagated. The input to a synapse, which multiplies the error signal in the standard modification rule for synapses, is replaced by the net input to the cell in the gain modification rule. We also demonstrate that our algorithm can be viewed as a gradient descent in rescaled synaptic vectors with effective time-dependent step-constants which depend on both the magnitude of the gains and the magnitude of the ordinary synaptic vectors. In this paper, we show that the effect of gain modification by this algorithm can be used to enhance the improvements in convergence rate obtained by the use of high momentum in ordinary synaptic modification. These improvements occur without degrading the generalization capabilities of the final solutions obtained by the network.


1.2 Review of Notation for Standard Backward Propagation

We begin with a short summary of the notation which we have used for backward propagation in our previous work (Bachmann, 1988) [1]. The notation differs somewhat from that used by Rumelhart, Hinton, and Williams (1986) [6] and Werbos (1988) [7]. For simplicity, we consider a three-layer network. A typical network is illustrated in figure 1.1.

Figure 1.1: The o_k label the output units, the h_i label the "hidden units", and the f_j label the input units. Only some of the connections are shown. Superscripts on synaptic weights denote layer index.

The feedforward equations are defined by:

x_i^s = \sum_{j=1}^{N_1} w_{ij}^{(1)} f_j^s + \theta_i^{(1)}    (1.1)

h_i^s = \phi(x_i^s; \lambda_i^{(1)})    (1.2)

y_k^s = \sum_{i=1}^{N_2} w_{ki}^{(2)} h_i^s + \theta_k^{(2)}    (1.3)

o_k^s = \phi(y_k^s; \lambda_k^{(2)})    (1.4)

o_k^s is the output of the kth unit in the output layer, h_i^s is the output of the ith unit in the hidden layer, and f_j^s is the value of the jth input. s denotes the pattern index. Additionally, recall that y_k^s is the total input to the kth output unit, and x_i^s is the total input to the ith hidden unit. \theta_i^{(1)} is the bias for the ith cell in the hidden layer, and \theta_k^{(2)} is the bias for the kth cell in the output layer. \phi(z; \lambda) is the sigmoid input-output function defined by:

\phi(z; \lambda) = \frac{1}{1 + e^{-\lambda z}}    (1.5)

We have introduced explicitly the gains \lambda_k^{(2)} of the kth cell in the last layer and \lambda_i^{(1)} of the ith cell in the hidden layer.^{1,2} E_s, the energy of pattern s, is given by:

E_s = \frac{1}{2} \sum_{k=1}^{N_2} (o_k^s - \tau_k^s)^2    (1.6)
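As a concrete illustration, the feedforward equations 1.1-1.5 and the pattern energy of equation 1.6 can be written as a short NumPy sketch. The array names (W1, th1, lam1, W2, th2, lam2) are illustrative choices rather than notation used elsewhere in this report.

```python
import numpy as np

def sigmoid(z, lam):
    # Equation 1.5: phi(z; lambda) = 1 / (1 + exp(-lambda * z))
    return 1.0 / (1.0 + np.exp(-lam * z))

def forward(f, W1, th1, lam1, W2, th2, lam2):
    """Feedforward pass, equations 1.1-1.4, for a single input pattern f."""
    x = W1 @ f + th1        # eq. 1.1: net input to the hidden units
    h = sigmoid(x, lam1)    # eq. 1.2: hidden-unit outputs (per-unit gains lam1)
    y = W2 @ h + th2        # eq. 1.3: net input to the output units
    o = sigmoid(y, lam2)    # eq. 1.4: output-unit outputs (per-unit gains lam2)
    return x, h, y, o

def pattern_energy(o, tau):
    # Equation 1.6: E_s = (1/2) * sum_k (o_k - tau_k)^2
    return 0.5 * np.sum((o - tau) ** 2)
```

For the benchmarks of chapter 2, W1 would have shape (6, 2) and W2 shape (1, 6), with all gains initially set to 1.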

A pattern gradient, which is used to modify the synaptic weights following the presentation of each pattern to the network, is computed from the energy E_s. The backward propagation method is thus only an approximation to gradient descent, since the true Liapunov function is really:

E = \sum_{s=1}^{m} E_s = \frac{1}{2} \sum_{s=1}^{m} \sum_{k=1}^{N_2} (o_k^s - \tau_k^s)^2    (1.7)

To make a connection with Rumelhart's delta error signal notation (Rumelhart, 1986) [6], we note that by defining partial derivatives with respect to the "net", or net input to a cell, we have^3:

\delta_k^{(2)s} \equiv -\frac{\partial E_s}{\partial y_k^s} = -\lambda_k^{(2)} (o_k^s - \tau_k^s) o_k^s (1 - o_k^s)    (1.8)

^1 In Rumelhart's original model there is no gain parameter \lambda (\lambda = 1). Although initially all gains are set to 1, our proposed model allows the gains to vary individually.
^2 Hopfield (1984) [3] has used a gain parameter in the continuous version of his model to study the effect of changing the character of the nonlinearity of the input-output function.
^3 We have added superscripts to the error signals for greater clarity.


\delta_i^{(1)s} \equiv -\frac{\partial E_s}{\partial x_i^s} = -\sum_{k=1}^{N_2} \frac{\partial E_s}{\partial o_k^s} \frac{\partial o_k^s}{\partial y_k^s} \frac{\partial y_k^s}{\partial h_i^s} \frac{\partial h_i^s}{\partial x_i^s} = \lambda_i^{(1)} h_i^s (1 - h_i^s) \sum_{k=1}^{N_2} \left[ -(o_k^s - \tau_k^s) \lambda_k^{(2)} o_k^s (1 - o_k^s) \right] w_{ki}^{(2)}    (1.9)

This allows one to write:

\delta_i^{(1)s} = \lambda_i^{(1)} h_i^s (1 - h_i^s) \sum_{k=1}^{N_2} \delta_k^{(2)s} w_{ki}^{(2)}    (1.10)

The synaptic modifications are proportional to the negative gradient:

\Delta_s w_{ki}^{(2)} = -\eta \frac{\partial E_s}{\partial w_{ki}^{(2)}} = -\eta \frac{\partial E_s}{\partial y_k^s} \frac{\partial y_k^s}{\partial w_{ki}^{(2)}}    (1.11)

\Delta_s w_{ij}^{(1)} = -\eta \frac{\partial E_s}{\partial w_{ij}^{(1)}} = -\eta \frac{\partial E_s}{\partial x_i^s} \frac{\partial x_i^s}{\partial w_{ij}^{(1)}}    (1.12)

Therefore, with the above definitions for \delta_k^{(2)s} and \delta_i^{(1)s}, equations 1.11 and 1.12 become:^4

\Delta_s w_{ki}^{(2)} = \eta \delta_k^{(2)s} h_i^s    (1.13)

\Delta_s w_{ij}^{(1)} = \eta \delta_i^{(1)s} f_j^s    (1.14)

^4 As in Rumelhart et al. (1986) [6], the biases are modified by the same procedure; however, the input for biases is defined to be unity.

Note that equation 1.10 defines the backward propagation of the error signals. Recall also that in standard backward propagation, a "momentum" term is often used at each step of the modification procedure in the gradient descent search for the minimum. Heuristically, the momentum term can be viewed as a means of increasing the step-constant when the curvature of the energy surface is low and several successive modifications have the same sign (Jacobs, 1988) [5]. The momentum term consists of adding a small amount proportional to the previous modification, so that the actual modification at time step t for the nth layer of synaptic connections is:

\Delta^{(t)}(w_{ij}^{(n)}) = \eta \, \delta_i^{(n),s(t)} f_j^{(n),s(t)} + \kappa \, \Delta^{(t-1)}(w_{ij}^{(n)})    (1.15)

where s(t) denotes the index of the pattern presented at time step t and \kappa is a positive constant less than 1. One can also write this in a slightly different form:

\Delta^{(t)}(w_{ij}^{(n)}) = \eta \sum_{t'=0}^{t} \kappa^{t-t'} \delta_i^{(n),s(t')} f_j^{(n),s(t')}    (1.16)

This form is more suggestive in showing that the use of a momentum term is equivalent to a discrete approximation to a temporal integral average with an exponentially decaying kernel. The kernel has the effect of emphasizing the influence of the patterns most recently presented, assigning exponentially less weight to those patterns presented earlier in time.
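A minimal sketch of the momentum update illustrates the equivalence of equations 1.15 and 1.16 numerically; the per-pattern gradients here are arbitrary stand-ins, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, kappa = 0.4, 0.8          # step-constant and momentum (series ab in chapter 2 uses eta = 0.4)
grads = [rng.normal(size=(6, 3)) for _ in range(50)]   # stand-in per-pattern gradients dE/dw

# Equation 1.15, written in terms of the gradient (eta * delta * f = -eta * dE/dw)
delta = np.zeros((6, 3))
for g in grads:
    delta = -eta * g + kappa * delta

# Equation 1.16: the same recursion unrolled into an exponentially decaying kernel
t = len(grads) - 1
unrolled = -eta * sum(kappa ** (t - tp) * grads[tp] for tp in range(t + 1))

print(np.allclose(delta, unrolled))   # True: the two forms agree
```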

1.3 Simultaneous Modification of Gain Parameters and Synapses by a Backward Propagation Algorithm

In this section, we consider the possibility of modifying the gain parameters and the synaptic weights simultaneously. To accomplish this, we have formulated a backward propagation procedure which modifies the gain parameters in the network in a manner similar to the method used for the synaptic weights. The procedure can take advantage of quantities already calculated in the ordinary backward propagation procedure for the synaptic weights, thus minimizing the additional complexity. The error signal for the gain of a particular neuron is proportional to the ordinary synaptic error signal for the incoming synaptic weights connected to the neuron. The proportionality factor is just the reciprocal of the neuronal gain. Additionally, the input to a particular synapse, which multiplies the error signal in the standard synaptic modification formula, is replaced by the net input to the neuron for the modification of its gain.

We note in passing that other work has been done on gain modification procedures, for example Kruschke (1988) [4]; however, Kruschke's procedure does not use the gradient with respect to the gain. Rather, gains are modified in the context of a competitive learning scheme, which is combined with backward propagation modification of the synaptic weights. In contrast, our scheme uses a backward propagation procedure for both gains and weights, incorporating respectively the gradients with respect to the gains and the weights.

To derive our model, we begin by defining rescaled error signals \gamma_k^{(2)s} and \gamma_i^{(1)s} in terms of the error signals for the synaptic weights in equations 1.8 and 1.9:

\gamma_k^{(2)s} \equiv \frac{1}{\lambda_k^{(2)}} \delta_k^{(2)s} = -\frac{1}{\lambda_k^{(2)}} \frac{\partial E_s}{\partial y_k^s} = -(o_k^s - \tau_k^s) o_k^s (1 - o_k^s)    (1.17)

\gamma_i^{(1)s} \equiv \frac{1}{\lambda_i^{(1)}} \delta_i^{(1)s} = -\frac{1}{\lambda_i^{(1)}} \frac{\partial E_s}{\partial x_i^s} = -\frac{1}{\lambda_i^{(1)}} \sum_{k=1}^{N_2} \frac{\partial E_s}{\partial o_k^s} \frac{\partial o_k^s}{\partial y_k^s} \frac{\partial y_k^s}{\partial h_i^s} \frac{\partial h_i^s}{\partial x_i^s} = h_i^s (1 - h_i^s) \sum_{k=1}^{N_2} \left[ -(o_k^s - \tau_k^s) \lambda_k^{(2)} o_k^s (1 - o_k^s) \right] w_{ki}^{(2)}    (1.18)

Given these definitions, we may write a backward propagation equation for the rescaled error signals by combining equations 1.17 and 1.18. Alternatively, we could have derived this formula by simply replacing \delta_k^{(2)s} by \lambda_k^{(2)} \gamma_k^{(2)s} and \delta_i^{(1)s} by \lambda_i^{(1)} \gamma_i^{(1)s} in equation 1.10:

\gamma_i^{(1)s} = h_i^s (1 - h_i^s) \sum_{k=1}^{N_2} \lambda_k^{(2)} \gamma_k^{(2)s} w_{ki}^{(2)}    (1.19)

If we observe that:

\frac{\partial o_k^s}{\partial \lambda_k^{(2)}} = y_k^s o_k^s (1 - o_k^s)    (1.20)

then we have:

\frac{\partial o_k^s}{\partial \lambda_k^{(2)}} = \frac{y_k^s}{\lambda_k^{(2)}} \lambda_k^{(2)} o_k^s (1 - o_k^s) = \frac{y_k^s}{\lambda_k^{(2)}} \frac{\partial o_k^s}{\partial y_k^s}    (1.21)

The same line of reasoning also allows us to write:

\frac{\partial h_i^s}{\partial \lambda_i^{(1)}} = \frac{x_i^s}{\lambda_i^{(1)}} \frac{\partial h_i^s}{\partial x_i^s}    (1.22)

Therefore, we may write:

\Delta_s \lambda_k^{(2)} = -\alpha \frac{\partial E_s}{\partial \lambda_k^{(2)}} = -\alpha \frac{\partial E_s}{\partial o_k^s} \frac{\partial o_k^s}{\partial \lambda_k^{(2)}} = -\frac{\alpha}{\lambda_k^{(2)}} y_k^s \frac{\partial E_s}{\partial y_k^s}    (1.23)

\Delta_s \lambda_i^{(1)} = -\alpha \frac{\partial E_s}{\partial \lambda_i^{(1)}} = -\frac{\alpha}{\lambda_i^{(1)}} x_i^s \frac{\partial E_s}{\partial x_i^s}    (1.24)

where we have used equations 1.21 and 1.22 to write the derivatives with respect to the gains in equations 1.23 and 1.24 in terms of derivatives with respect to the net inputs y_k^s and x_i^s to the kth cell in the output layer and the ith cell in the hidden layer respectively. Using the definitions in equations 1.17 and 1.18, we can recast equations 1.23 and 1.24 as:

\Delta_s \lambda_k^{(2)} = \alpha \gamma_k^{(2)s} y_k^s    (1.25)

\Delta_s \lambda_i^{(1)} = \alpha \gamma_i^{(1)s} x_i^s    (1.26)

For comparison with equations 1.13 and 1.14 describing synaptic modification, we note that in equations 1.25 and 1.26 depicting gain modification, the ordinary error signals defined in equations 1.8 and 1.9 have been replaced by the rescaled error signals defined in equations 1.17 and 1.18, and the signal input to a particular incoming synapse (either h_i^s or f_j^s depending on the layer) has been replaced by the corresponding total integrated potential to the cell, i.e. y_k^s or x_i^s depending on the layer. Since the rescaled error signals, \gamma, are proportional to the ordinary synaptic error signals, \delta, we need only propagate the ordinary error signals \delta and obtain the rescaled error signals locally by dividing by the cell gain. Equivalently, we can view the error signal as a quantity which initiates changes in the input-output characteristics of the neuron at the same time that it modifies the incoming synapses to the neuron. From this perspective, the modification rule for a synapse couples the ordinary error signal and the input to the incoming synapse with coupling constant \eta, while the gain modification rule couples the ordinary error signal and the net input to a cell with coupling constant \alpha / \lambda.

The only difference is the strength of the coupling, and, in fact, since the gain parameter is a more sensitive parameter than the synaptic weights, we choose the initial coupling for the gains, \alpha, to be an order of magnitude smaller than that of the synapses in the simulations described here.^5 Alternatively, we will show in the next section that one can view simultaneous gain and synaptic modification as a gradient descent in rescaled synapses with effective step-constants which are dependent on the gain and magnitude of the ordinary synaptic vectors and are direction-dependent in the synaptic vector space. From this perspective, the effective step-constant is now also time-dependent through its dependence on the gain and the magnitude of the ordinary synaptic vectors.

^5 The gains are all set to one initially. Therefore, the order of magnitude of the initial coupling is determined by \alpha. Even though the gains are chosen to be initially the same, symmetry is still broken by choosing the initial synaptic weights randomly in a small hypercube.
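Collecting the rules derived above, one pattern presentation of the combined procedure might be sketched as follows. This is a paraphrase of equations 1.1-1.4, 1.8, 1.10, 1.13-1.14, 1.17-1.18, and 1.25-1.26, with momentum and the layer-dependent doubling of the step-constants used in chapter 2 omitted; the variable names and default values are illustrative.

```python
import numpy as np

def present_pattern(f, tau, W1, th1, lam1, W2, th2, lam2, eta=0.3, alpha=0.03):
    """One pattern presentation: synaptic updates (eqs. 1.13-1.14) together
    with gain updates (eqs. 1.25-1.26). Momentum is omitted for clarity."""
    # forward pass (eqs. 1.1-1.4)
    x = W1 @ f + th1
    h = 1.0 / (1.0 + np.exp(-lam1 * x))
    y = W2 @ h + th2
    o = 1.0 / (1.0 + np.exp(-lam2 * y))

    # ordinary error signals (eqs. 1.8 and 1.10)
    d2 = -lam2 * (o - tau) * o * (1.0 - o)
    d1 = lam1 * h * (1.0 - h) * (W2.T @ d2)

    # rescaled error signals obtained locally by dividing by the gain (eqs. 1.17-1.18)
    g2, g1 = d2 / lam2, d1 / lam1

    # synaptic and bias modifications (eqs. 1.13-1.14; the bias input is unity)
    W2 += eta * np.outer(d2, h); th2 += eta * d2
    W1 += eta * np.outer(d1, f); th1 += eta * d1

    # gain modifications couple the rescaled error signal to the net input (eqs. 1.25-1.26)
    lam2 += alpha * g2 * y
    lam1 += alpha * g1 * x
    return W1, th1, lam1, W2, th2, lam2
```

The default step-constants above merely echo the order-of-magnitude relationship described in the text (alpha roughly ten times smaller than eta).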

1.4 Gain Modification Viewed as an Effective Time-Dependent Step-Constant

In developing our theoretical arguments, it will be easiest to consider the effects of gain modification on ordinary backward propagation without momentum. It will be seen that the addition of momentum will not alter the development here. To simplify matters, we adopt a notation which includes synaptic weight vectors and biases in the same vector. We also change the notation to a slightly more general form than that used in sections 1.2 and 1.3. For the ith cell in the nth layer,^6 we define:

\vec{W}_i^{(n)} \equiv (\vec{w}_i^{(n)}, \theta_i^{(n)}), \qquad \vec{f}^{(n),s} \equiv (\vec{o}^{(n-1),s}, 1)    (1.27)

where the input to layer n, \vec{f}^{(n),s}, is just the output vector of the previous layer, augmented by a unit component for the bias.^7 The jth component of \vec{W}_i^{(n)}, w_{ij}^{(n)}, is then just the connection from the jth cell in the (n-1)th layer to the ith cell in the nth layer. This allows us to write the output for pattern s of the ith cell in the nth layer as:

o_i^{(n),s} = \frac{1}{1 + e^{-\lambda_i^{(n)} \vec{W}_i^{(n)} \cdot \vec{f}^{(n),s}}}    (1.28)

In the case of nonmodifiable gains set equal to one, it is apparent that we may view the norm of the vector \vec{W}_i^{(n)} as the gain. This follows by rewriting equation 1.28 as:

o_i^{(n),s} = \frac{1}{1 + e^{-|\vec{W}_i^{(n)}| \hat{W}_i^{(n)} \cdot \vec{f}^{(n),s}}}    (1.29)

^6 The input layer is defined to have the index n = 0.
^7 To compare the notation for neuron activation used earlier for a three-layer system, f_j^s, h_i^s, o_k^s, with the current notation, we see that the inputs to the network would be \vec{f}^{(1),s} \equiv \vec{o}^{(0),s} = \vec{f}^s, the hidden unit outputs would be \vec{f}^{(2),s} = \vec{h}^s, and the last layer output would be \vec{f}^{(3),s} \equiv \vec{o}^{(2),s} = \vec{o}^s.

Here i ") is a unit vector in the direction of . This leads us to ask ifmodifying the gains, A"), is truly advantageous. To answer this question,we will need to define rescaled synaptic vectors according to:

" i= iA W in (1.30)

Note that with this definition equation 1.28 becomes:o ,! ) ,, = 1

1 + (1.31)

Since \vec{u}_i^{(n)} is just a rescaling of the synaptic vector \vec{W}_i^{(n)}, we can ask how this vector changes when we modify \vec{W}_i^{(n)} and \lambda_i^{(n)} separately. In particular, we will find it useful to consider the changes in \vec{u}_i^{(n)} along the direction of \hat{W}_i^{(n)} and those in the hyperplane perpendicular to \hat{W}_i^{(n)}. We denote these two components by (\Delta \vec{u}_i^{(n)})_{\parallel} and (\Delta \vec{u}_i^{(n)})_{\perp}. We also define \hat{v}_i^{(n)} to be a unit vector in the direction of (\Delta \vec{u}_i^{(n)})_{\perp} in the hyperplane orthogonal to \hat{W}_i^{(n)}. If the changes in \lambda_i^{(n)} and \vec{W}_i^{(n)}, and thus in \vec{u}_i^{(n)}, are sufficiently small, then we may write:

\Delta \vec{u}_i^{(n)} \approx \lambda_i^{(n)} \Delta \vec{W}_i^{(n)} + (\Delta \lambda_i^{(n)}) \vec{W}_i^{(n)}    (1.32)

The change \Delta \vec{W}_i^{(n)} can be resolved into two orthogonal components:^8

\Delta \vec{W}_i^{(n)} = -\eta \left[ (\hat{W}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E_s) \hat{W}_i^{(n)} + (\hat{v}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E_s) \hat{v}_i^{(n)} \right]    (1.33)

Note that \Delta \lambda_i^{(n)} is just:

\Delta \lambda_i^{(n)} = -\alpha \frac{\partial E_s}{\partial \lambda_i^{(n)}}    (1.34)

However, we can write \partial E_s / \partial \lambda_i^{(n)} in terms of the component of \nabla_{\vec{W}_i^{(n)}} E_s along \hat{W}_i^{(n)}. To see this, observe that we may write:

\frac{\partial E_s}{\partial \lambda_i^{(n)}} = \frac{\partial E_s}{\partial (\lambda_i^{(n)} \vec{W}_i^{(n)} \cdot \vec{f}^{(n),s})} \, (\vec{W}_i^{(n)} \cdot \vec{f}^{(n),s})    (1.35)

^8 Here we neglect momentum.

At the same time, note that:

\vec{W}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E_s = \frac{\partial E_s}{\partial (\lambda_i^{(n)} \vec{W}_i^{(n)} \cdot \vec{f}^{(n),s})} \, \lambda_i^{(n)} (\vec{W}_i^{(n)} \cdot \vec{f}^{(n),s})    (1.36)

From equations 1.35 and 1.36, we obtain:

\frac{\partial E_s}{\partial \lambda_i^{(n)}} = \frac{1}{\lambda_i^{(n)}} \vec{W}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E_s = \frac{|\vec{W}_i^{(n)}|}{\lambda_i^{(n)}} \, \hat{W}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E_s    (1.37)

Having derived equation 1.37, we can now combine it with equations 1.33 and 1.34 to rewrite equation 1.32 as:

\Delta \vec{u}_i^{(n)} = -\eta \lambda_i^{(n)} \left\{ \left( 1 + \frac{\alpha |\vec{W}_i^{(n)}|^2}{\eta (\lambda_i^{(n)})^2} \right) (\hat{W}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E_s) \hat{W}_i^{(n)} + (\hat{v}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E_s) \hat{v}_i^{(n)} \right\}    (1.38)

where we have used the fact that \vec{W}_i^{(n)} = |\vec{W}_i^{(n)}| \hat{W}_i^{(n)}. This allows us to write:

(\Delta \vec{u}_i^{(n)})_{\parallel} = -\eta \lambda_i^{(n)} \left( 1 + \frac{\alpha |\vec{W}_i^{(n)}|^2}{\eta (\lambda_i^{(n)})^2} \right) (\hat{W}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E_s) \hat{W}_i^{(n)}, \qquad (\Delta \vec{u}_i^{(n)})_{\perp} = -\eta \lambda_i^{(n)} (\hat{v}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E_s) \hat{v}_i^{(n)}    (1.39)

We can recast this equation in terms of gradients with respect to \vec{u}_i^{(n)} by converting the gradients with respect to \vec{W}_i^{(n)}. In doing so, we note one subtlety about the gradients which were used to derive equation 1.39. The gradient, \nabla_{\vec{W}_i^{(n)}} E_s, which appears in ordinary backward propagation (equation 1.33), and therefore in equation 1.39, is a vector of partial derivatives evaluated at constant \lambda_i^{(n)}. We could denote this explicitly by writing (\nabla_{\vec{W}_i^{(n)}} E_s)_{\lambda_i^{(n)}}. Similarly, in the modification of the gains (equation 1.34), \partial E_s / \partial \lambda_i^{(n)} is calculated as a partial derivative to be taken at constant \vec{W}_i^{(n)}, which we could write as (\partial E_s / \partial \lambda_i^{(n)})_{\vec{W}_i^{(n)}}. In fact, equation 1.37 is an equation relating (\partial E_s / \partial \lambda_i^{(n)})_{\vec{W}_i^{(n)}} to (\nabla_{\vec{W}_i^{(n)}} E_s)_{\lambda_i^{(n)}}. Because the gradient of the energy with respect to \vec{W}_i^{(n)}, (\nabla_{\vec{W}_i^{(n)}} E_s)_{\lambda_i^{(n)}}, is calculated for fixed \lambda_i^{(n)}, we may easily express it in terms of the gradient with respect to \vec{u}_i^{(n)}, (\nabla_{\vec{u}_i^{(n)}} E_s)_{\lambda_i^{(n)}}. We see that the jth components of the two gradients are related by:

\left( \frac{\partial E_s}{\partial w_{ij}^{(n)}} \right)_{\lambda_i^{(n)}} = \left( \frac{\partial E_s}{\partial u_{ij}^{(n)}} \right)_{\lambda_i^{(n)}} \frac{\partial u_{ij}^{(n)}}{\partial w_{ij}^{(n)}} = \lambda_i^{(n)} \left( \frac{\partial E_s}{\partial u_{ij}^{(n)}} \right)_{\lambda_i^{(n)}}    (1.40)

We write this compactly as:

(\nabla_{\vec{W}_i^{(n)}} E_s)_{\lambda_i^{(n)}} = \lambda_i^{(n)} (\nabla_{\vec{u}_i^{(n)}} E_s)_{\lambda_i^{(n)}}    (1.41)

This allows us to rewrite equation 1.39 as:

(\Delta \vec{u}_i^{(n)})_{\parallel} = -\left[ \eta (\lambda_i^{(n)})^2 + \alpha |\vec{W}_i^{(n)}|^2 \right] (\hat{W}_i^{(n)} \cdot \nabla_{\vec{u}_i^{(n)}} E_s) \hat{W}_i^{(n)}, \qquad (\Delta \vec{u}_i^{(n)})_{\perp} = -\eta (\lambda_i^{(n)})^2 (\hat{v}_i^{(n)} \cdot \nabla_{\vec{u}_i^{(n)}} E_s) \hat{v}_i^{(n)}    (1.42)

We see that a backward propagation algorithm which simultaneously modifies the synaptic weights and the gains is equivalent to a gradient descent in the rescaled synapses \vec{u}_i^{(n)} with time-dependent step-constants. In the direction of the rescaled synaptic vector, the step-constant now has a quadratic dependence on the magnitudes of the gain and the ordinary synaptic vector, and in the hyperplane perpendicular to the direction of the rescaled synaptic vector, the step-constant has a quadratic dependence only on the magnitude of the gain. From equation 1.42, we obtain:

\eta_{\parallel} = \eta (\lambda_i^{(n)})^2 + \alpha |\vec{W}_i^{(n)}|^2    (1.43)

\eta_{\perp} = \eta (\lambda_i^{(n)})^2    (1.44)

The fact that the step-constants are positive-definite ensures that we always take steps in the direction opposite the gradient; therefore, we can classify this approach as a gradient descent algorithm. Momentum can be added to the above argument without loss of generality, since it simply adds a small amount proportional to the most recent change; the only difference is that now the step-constants are time-dependent.
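The equivalence expressed in equations 1.42-1.44 can be checked numerically for a single cell: the sketch below applies one small simultaneous update to an arbitrary weight vector and gain and compares the induced change in \vec{u} = \lambda \vec{W} with the prediction of the effective step-constants. The energy function and parameter values are stand-ins chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=5)            # ordinary synaptic vector (bias included)
lam = 1.3                         # gain
f = rng.normal(size=5)            # input pattern (last component plays the bias role)
tau = 0.9                         # target output
eta, alpha = 1e-4, 1e-5           # small steps so the first-order expansion holds

# gradient of E_s = 1/2 (o - tau)^2 with respect to u = lam * W, where o = sigmoid(u . f)
u = lam * W
o = 1.0 / (1.0 + np.exp(-u @ f))
grad_u = (o - tau) * o * (1.0 - o) * f
grad_W = lam * grad_u             # eq. 1.41
dE_dlam = W @ grad_u              # eq. 1.37 rewritten in terms of grad_u

# one simultaneous small update of W and lam, and the induced change in u
W_new, lam_new = W - eta * grad_W, lam - alpha * dE_dlam
du = lam_new * W_new - u

# the change predicted by the effective step-constants (eqs. 1.42-1.44)
What = W / np.linalg.norm(W)
eta_par = eta * lam**2 + alpha * (W @ W)
eta_perp = eta * lam**2
g_par = (What @ grad_u) * What
g_perp = grad_u - g_par
du_pred = -eta_par * g_par - eta_perp * g_perp

print(np.allclose(du, du_pred, atol=1e-8))   # True, up to second-order terms
```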

We note that the derivation of our model, resulting in equation 1.42, depended on the approximation in equation 1.32 that we take sufficiently small steps in changing the ordinary synapses and gains. If the synapses grow to be very large, it is possible that this approximation may break down. However, depending on our choices of \eta and \alpha, the approximation will be valid for at least part of the evolution of the network and certainly during the early development. In a continuum model, the derivation would be exact, provided that we replace the single pattern energy E_s with the total energy over all patterns E = \sum_s E_s. For such a continuum model, we would replace equation 1.32 with:

\frac{d\vec{u}_i^{(n)}}{dt} = \frac{d(\lambda_i^{(n)} \vec{W}_i^{(n)})}{dt} = \lambda_i^{(n)} \frac{d\vec{W}_i^{(n)}}{dt} + \frac{d\lambda_i^{(n)}}{dt} \vec{W}_i^{(n)}    (1.45)

Similarly, we would replace equation 1.33 with:

\frac{d\vec{W}_i^{(n)}}{dt} = -\eta \left[ (\hat{W}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E) \hat{W}_i^{(n)} + (\hat{v}_i^{(n)} \cdot \nabla_{\vec{W}_i^{(n)}} E) \hat{v}_i^{(n)} \right]    (1.46)

and equation 1.34 with:

\frac{d\lambda_i^{(n)}}{dt} = -\alpha \frac{\partial E}{\partial \lambda_i^{(n)}}    (1.47)

In equations 1.35, 1.36, and 1.37, we would need to sum both sides of the equation over the pattern index s to obtain the appropriate relationships for the total energy E. In the end, we would obtain an exact nonlinear differential equation of the form:

\frac{d\vec{u}_i^{(n)}}{dt} = -\eta_{\parallel} (\hat{W}_i^{(n)} \cdot \nabla_{\vec{u}_i^{(n)}} E) \hat{W}_i^{(n)} - \eta_{\perp} (\hat{v}_i^{(n)} \cdot \nabla_{\vec{u}_i^{(n)}} E) \hat{v}_i^{(n)}    (1.48)

from which we would obtain the effective time-dependent step-constants in equations 1.43 and 1.44.

1.5 Impact of the Effective Time-Dependent Step-Constant

In the previous section, we derived the form of the effective step-constant and determined that \eta_{\parallel} grows quadratically as a function of the magnitudes of the ordinary synaptic vector and the gain, and that \eta_{\perp}, the step-constant in the direction \hat{v}_i^{(n)}, grows quadratically with the gain. It may be particularly important in the early stages of synaptic development, when \vec{u}_i^{(n)} \cdot \vec{f}^{(n),s} is small^9 and o_i^{(n),s} can be approximated linearly as:

o_i^{(n),s} \approx \frac{1}{2} \left( 1 + \frac{\vec{u}_i^{(n)} \cdot \vec{f}^{(n),s}}{2} \right)    (1.49)

In this regime, therefore, the error signals can be approximated as polynomial functions of \vec{u}_i^{(n)} \cdot \vec{f}^{(n),s}, and the quadratic dependence of the step-constants on |\vec{W}_i^{(n)}| and \lambda_i^{(n)} is likely to be significant.^{10}

In the next chapter, we will demonstrate that the use of high momentum leads to shorter convergence time. When gain modification is combined with high momentum, there is further improvement in convergence time. Empirical results suggest that when high momentum is used, shorter convergence times result because the synapses develop more rapidly. When gain modification is added to high momentum, the rate of development of the effective synapses, \vec{u}_i^{(n)}, is accelerated further.

^9 The initial synaptic weights are small: between -0.1 and 0.1, and the connectivity of the networks we considered in the next chapter was small. Additionally, the input patterns were confined to the unit disk. Therefore, we can expect that this approximation will be valid at the beginning of synaptic development.
^{10} Furthermore, note that the ordinary error signals, as shown earlier, are explicitly dependent on the gains. This may also play a significant role in the time evolution of the network.

Chapter 2

Experimental Benchmarks for Gain Modification and Momentum

2.1 Momentum and the Concentric Circle Paradigm

We describe below some benchmarks for the concentric circle paradigm. In this paradigm, we train a backward propagation network with two inputs representing the x and y coordinates of a point in the unit disk. Within the unit disk, there are two pattern classes separated by a circular boundary at r = 1/\sqrt{2}. This divides the disk into two regions, an inner disk and an outer annulus, both of area \pi/2. For the single output unit, the target output for patterns in the outer annulus is 1 and for patterns in the inner disk is 0. In the simulations described below, the network had one hidden layer of 6 hidden units, and the network was trained on 40 patterns randomly selected from the unit disk. For each setting of the parameters, we repeated the experiment 90 times, starting the weights at different random points in a small hypercube; each initial weight was chosen randomly between -0.1 and 0.1. Additionally, a different randomly selected set of 40 patterns was used in each experiment. In all experiments the step-constant in the first layer of connections was two times the size of the step-constant in the second layer. We have found that this accelerates the learning procedure, a fact which is consistent with what Becker and LeCun (1988) [2] have observed. In their work, they determined that the step-constant should be scaled by a factor of 1/\sqrt{N_i}, where N_i, for a given layer of synapses, is the number of connections in that layer onto a neuron in the next layer of neurons. If this scaling is a general principle, then, for our particular network architecture, we should have increased the step-constant in the first layer of synapses by a factor of \sqrt{3}. However, by choosing a factor of 2, we are at least approximately close to what Becker and LeCun found to be ideal for the binary input paradigm which they investigated. As of yet, we have not studied in great detail the scaling of step-constants as a function of the number of synaptic connections in a layer. However, for consistency, in the simulations described in the next section, we chose \alpha, the step-constant for gain modification, to be twice as large for neurons in the hidden layer compared to its value in the output layer.
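A sketch of how such a training set can be generated follows; the function name and sampling details are illustrative, since the report does not specify its sampling procedure beyond random selection from the unit disk.

```python
import numpy as np

def concentric_circle_data(n_patterns=40, seed=0):
    """Random points in the unit disk with targets for the concentric circle
    paradigm: target 1 outside r = 1/sqrt(2), target 0 inside."""
    rng = np.random.default_rng(seed)
    # r = sqrt(U) gives a uniform area density over the disk
    r = np.sqrt(rng.uniform(0.0, 1.0, n_patterns))
    theta = rng.uniform(0.0, 2.0 * np.pi, n_patterns)
    xy = np.column_stack((r * np.cos(theta), r * np.sin(theta)))
    targets = (r > 1.0 / np.sqrt(2.0)).astype(float)
    return xy, targets

patterns, targets = concentric_circle_data()
```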

In one set of experiments (series ab), we took the step-constant to be \eta = 0.4. The momentum was varied by increments of 0.1 from 0.0 to 0.9, and 90 experiments were performed at each value of momentum. The mean and standard deviation of convergence time are plotted for each value of momentum in figure a.1 of appendix a. Only those experiments converging within 20000 epochs^{1,2} were used to compute the means and standard deviations. From the graph, it is apparent that the mean convergence time decreases with increasing momentum. The large error bars, however, are somewhat misleading and tend to underemphasize the rather dramatic improvement in convergence time at high momentum. The size of the errors is typically comparable to the mean and is due to the fact that there are a number of trials for which the convergence time is several times larger than for the majority of the trials. Given the limited number of trials, 90, this tends to make the error bars large and to mask the improvement of the central cluster of trials. With increasing momentum, the central cluster of convergence times becomes narrower and moves toward shorter times. We illustrate this point in a series of graphs for series ab, figures a.4-a.13, ordered by momentum, in which we plot a histogram of number of trials converging vs. convergence time.

^1 An epoch is defined to be a complete presentation of the data set sequentially. However, we modify the network following the presentation of each pattern.
^2 In our experiments, we defined the convergence of a network to occur when the network output was within 0.1 of the target output for all training patterns.

It is interesting to note that at low momentum, there is evidence of at least two distributions in the convergence times. These two clusters gradually move to shorter convergence times and merge to form a single cluster at low convergence time.

Figure a.2 of appendix a shows the fraction of trials converging within 20000 epochs. Over most of the range of \kappa, that is, for \kappa between 0.0 and 0.8, the percentage of trials converging is in the range of approximately 89-98%. At very high momentum, however, \kappa = 0.9, the percentage of trials converging drops to 79%. At the same time, the amount of generalization by the network is not momentum dependent. This is readily apparent in figure a.3 of appendix a, in which we plot the mean and standard deviation of the fraction of 5000 randomly selected test patterns which were correctly identified by the network after the network converged. We also graph the fraction which could be correctly identified by the network's "best guess", that is, the output which the network was closest to, 0 or 1. Means and standard deviations are depicted only for those trials which actually converged within 20000 epochs. As indicated by the small error bars, the generality of the solutions varies only weakly across trials for a given value of momentum. Furthermore, as we have noted, the mean generality is constant across momentum. We expect, therefore, that solution generality should not have a strong dependence on convergence time.
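The two generalization measures plotted in figure a.3 can be computed as in the sketch below. The report does not state the exact criterion for "correctly identified", so the 0.1 tolerance here simply mirrors the convergence criterion and is an assumption; the "best guess" takes whichever of 0 or 1 the output is closer to.

```python
import numpy as np

def generalization_fractions(outputs, targets, tol=0.1):
    """outputs, targets: 1-D arrays over test patterns (targets are 0 or 1)."""
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    strict = np.mean(np.abs(outputs - targets) < tol)               # within tol of target
    best_guess = np.mean((outputs > 0.5).astype(float) == targets)  # nearer 0 or 1
    return strict, best_guess
```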

We can conclude also that the length of convergence time has very little to do with the generality of the solution reached for this paradigm; we infer this from the fact that the error bars are small in the generalization graph in figure a.3. It is interesting to ask, then, which features might distinguish the network solutions with long convergence time from those with short convergence time. In appendix b, we have produced a scatter plot of the mean and standard deviation (over the neurons in a particular solution) of the magnitude of synaptic vectors vs. convergence time for series ab. As defined in the previous chapter, each of these vectors includes both the ordinary synapse and the bias. For each layer, there is a set of 3 graphs, ordered in increasing momentum, one at zero momentum (\kappa = 0), one at intermediate momentum (\kappa = 0.5), and one at high momentum (\kappa = 0.9). The graphs for layer 2 are in figures b.1-b.3, and those for layer 1 are in figures b.4-b.6. We note that as the momentum is increased, the longer convergence times move toward the cluster of convergence times at shorter times, this region of the graph becomes more dense, and itself moves simultaneously closer to the origin along the time axis. As a consequence, as the momentum is increased and the number of trials converging within the first several thousand iterations becomes more dense, we observe that, as a population, the synaptic norms rise more rapidly as a function of convergence time.

2.2 Gain Modification Combined with High Momentum Synaptic Modification

We have seen from the histograms of number of trials converging vs. convergence time for series ab in appendix a that high momentum can dramatically improve the convergence rate for the majority of the trials. With this in mind, we combined the gain modification procedure described earlier with high momentum synaptic modification. We chose \alpha, the gain modification step-constant, to be an order of magnitude smaller than the synaptic modification step-constant, \eta, since the gain is, in general, a more sensitive parameter. In series ag, we examined three different cases: no momentum (sag3), high momentum (sag2), and high momentum with gain modification (sag4, sag5, sag6). In each run, we performed 500 experiments. We summarize the parameters chosen for these runs:^3

Run Number   \eta   \kappa   \alpha
sag3         0.3    0.0      0.000
sag2         0.3    0.8      0.000
sag5         0.3    0.8      0.020
sag4         0.3    0.8      0.038
sag6         0.3    0.8      0.060

In figures c.1 and c.2 of appendix c, we plot the percentage of trials converging vs. convergence time on two different time scales for the runs listed above. Detailed histograms of sag2, sag3, and sag4, comparing the three cases on two different time scales, the first 6000 epochs (figures c.3-c.5) and the full 20000 epochs (figures c.6-c.8), are also provided in appendix c.

^3 As noted earlier, \eta \to 2\eta and \alpha \to 2\alpha for the first layer of synapses and the gains of the hidden units respectively. The values for \eta and \alpha in the table are those values used in the second layer of synapses and the gains of the output units respectively.

Examining the graphs in figures c.1 and c.2, we see that without momentum or gain modification, 80% of the simulations converge within approximately 10000 epochs. High momentum clearly leads to much more rapid convergence, for we obtain the 80% level within approximately 1500 epochs. When gain modification is added to high momentum, we achieve the 80% level within approximately 500 epochs. This is 3 times faster than high momentum without gain modification and 20 times faster than the bare algorithm. This is a tantalizing result, for on a more complex problem where longer convergence times are more probable, reaching the 80% level in a third of the time (compared with high momentum alone) could be quite significant. Asymptotically, the runs in which gain modification was combined with high momentum (sag4, sag5, sag6) also achieved the highest percentage of trials converging, approximately 95-98%. For high momentum without gain modification, the asymptotic convergence rate was slightly lower at approximately 93%, and for the bare algorithm, even less at approximately 90%.

A comparison of the synapses obtained in these three cases is illustrated in the graphs in appendix d. Figures d.1, d.2, and d.3 graph the synaptic vector norm^4 in layer 2 vs. convergence time. Figure d.1 is the no momentum case (sag3), figure d.2 the high momentum case (sag2), and figure d.3 the high momentum with gain modification case (sag4). As a population, there is a trend toward a longer synaptic norm in layer 2 as time increases when the bare algorithm is used (figure d.1). This increase in synaptic norm occurs more rapidly when high momentum (figure d.2) is added to the bare algorithm. When we combine high momentum with gain modification (figure d.3), there is some depression of the size of the synaptic norm. In this case, however, the length of the effective synaptic vector, or rescaled synaptic vector, depicted in figure d.4 in appendix d, develops much more rapidly than in the high momentum case without gain modification. This is due to the fact that the gains in this layer as a population are typically approximately 1.4-2.5 (see figure d.5 of appendix d, where the gains are plotted as a function of synaptic norm for layer 2). This seems to speed the process of learning, since solutions with longer synaptic norms are achieved without the requirement of the synapses' becoming large; a single multiplicative factor, the gain, hastens the change of the length of the synaptic vector.

^4 As before, this vector includes both the ordinary synaptic vector and the bias.

In layer 1, a similar trend obtains (figures d.6-d.9).^5 As in layer 2, longer effective synapses in layer 1 are achieved more rapidly by modifying the gain parameter. When we compare figures d.6 and d.7, we see that the increase in E(|\vec{W}_i^{(1)}|)^6 occurs more rapidly as a function of convergence time for the majority of the population as we go from no momentum to high momentum. However, the synaptic norms (figure d.8) are not significantly depressed in the case of high momentum with gain modification; this leads to rescaled synaptic vector norms which are much larger than in the other two cases (see figure d.9).^7 In figure d.8, the ordinary synapses are probably not significantly depressed in layer 1 because the step-constant was twice as large for layer 1, as we noted earlier.

The mean gain is plotted versus mean synaptic vector norm in figure d.10. The gains are principally between 1.0-2.0 in this layer and act to accelerate the development of the length of the effective synaptic vectors.

^5 As before, error bars represent the standard deviation with respect to the neurons in the layer.
^6 E is the expected value, or mean.
^7 Note that the scale of the graph in figure d.9 is different from that of the previous graphs.

2.3 Future Directions

In the work described above, we have shown how gain modification can enhance the improvements in convergence rate obtained with high momentum in synaptic modification for a simple concentric circle paradigm. For completeness, we plan to do simulations in which gain modification is used without momentum. In addition, further analysis and reduction of the data already obtained is in progress. We plan to establish similar benchmarks for higher dimensional problems, for instance for higher dimensional concentric hyperspheres. This will allow a comparison of how the improvements obtained with gain modification scale with dimensionality. Finally, we will establish benchmarks for gain modification for more complex problems such as parity.

2.4 Acknowledgement

We wish to acknowledge Professor Leon N Cooper, who first suggested the idea of modifying gains in a backward propagation network, and to thank Nathan Intrator, Dr. Christopher Scofield, David Ward, and Michael Perrone for a number of useful discussions and suggestions concerning this work.


Bibliography

[1] C. M. Bachmann, "Generalization and the Backward Propagation Neural Network", ARO Technical Report, Physics Dept. & Ctr. for Neural Science, Brown University, Providence, R.I., Jan. 14, 1988.

[2] S. Becker, Y. LeCun, "Improving the Convergence of Back-Propagation Learning with Second Order Methods", in Proceedings of the 1988 Connectionist Models Summer School, June 17-26, 1988, Carnegie Mellon University, eds. D. Touretzky, G. Hinton, T. Sejnowski, pp. 29-37, Morgan Kaufmann Publishers, San Mateo, CA, 1989.

[3] J. J. Hopfield, "Neurons with Graded Response Have Collective Computational Properties Like Those of Two-State Neurons", Proc. Natl. Acad. Sci. USA, Vol. 81, pp. 3088-3092, May 1984.

[4] J. K. Kruschke, "Creating Local and Distributed Bottlenecks in Hidden Layers of Back-Propagation Networks", in Proceedings of the 1988 Connectionist Models Summer School, June 17-26, 1988, Carnegie Mellon University, eds. D. Touretzky, G. Hinton, T. Sejnowski, Morgan Kaufmann Publishers, San Mateo, CA, 1989, pp. 120-126.

[5] R. A. Jacobs, "Increased Rates of Convergence Through Learning Rate Adaptation", Neural Networks, Vol. 1, pp. 295-307, 1988.

[6] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, eds. D. E. Rumelhart, J. L. McClelland, The MIT Press, Cambridge, Mass., 1986, pp. 318-362.


[7] P. J. Werbos, "Generalization of Backpropagation with Application to a Recurrent Gas Market Model", Neural Networks, Vol. 1, 1988, pp. 339-356.


Appendix a
Graphs of series ab: Convergence Properties as a Function of Momentum

Fig. a.1: Mean and standard deviation of convergence time (epochs) vs. momentum; series ab, eta = 0.4.
Fig. a.2: Fraction of 90 experiments (1) converged, (2) best-guess converged, vs. momentum; series ab, eta = 0.4.
Fig. a.3: Fraction of 5000 test patterns (1) correct, (2) best-guess correct, vs. momentum; series ab, eta = 0.4.
Figs. a.4-a.13: Histograms of the number of trials converging vs. convergence time (epochs) for series ab (eta = 0.4), one histogram for each momentum value from 0.0 to 0.9 in steps of 0.1. Concentric circle paradigm: 40 training patterns, 6 hidden units.

Appendix b
Graphs of series ab: Magnitude of Synaptic Vectors vs. Convergence Time

Figs. b.1-b.3: Mean and standard deviation (over the neurons in a solution) of the layer-2 synaptic vector magnitudes vs. convergence time for series ab, at momentum 0.0, 0.5, and 0.9.
Figs. b.4-b.6: The corresponding graphs for layer 1.

Appendix c
Graphs of series ag: Convergence Properties for (1) No Momentum, (2) High Momentum, (3) High Momentum with Gain Modification

Figs. c.1, c.2: Percentage of 500 trials converged vs. convergence time for runs sag2-sag6, shown on two different time scales.
Fig. c.3: Number of trials converging vs. convergence time; sag3.bns; eta = 0.3; momentum = 0.0; alpha = 0.0 (first 6000 epochs).
Fig. c.4: Number of trials converging vs. convergence time; sag2.bns; eta = 0.3; momentum = 0.8; alpha = 0.0 (first 6000 epochs).
Fig. c.5: Number of trials converging vs. convergence time; sag4.bns; eta = 0.3; momentum = 0.8; alpha = 0.038 (first 6000 epochs).
Figs. c.6-c.8: The same three histograms over the full 20000 epochs.
(Concentric circle paradigm: 40 training patterns, 6 hidden units.)

Appendix d
Graphs of series ag: Magnitude of Synaptic Vectors vs. Convergence Time

Figs. d.1-d.3: Layer-2 synaptic vector norm vs. convergence time for sag3 (no momentum), sag2 (high momentum), and sag4 (high momentum with gain modification).
Fig. d.4: Layer-2 rescaled (effective) synaptic vector norm vs. convergence time for the high momentum with gain modification case.
Fig. d.5: Layer-2 gains as a function of synaptic vector norm.
Figs. d.6-d.8: Layer-1 synaptic vector norms vs. convergence time for the three cases.
Fig. d.9: Layer-1 rescaled synaptic vector norms vs. convergence time for the high momentum with gain modification case (note the different scale).
Fig. d.10: Mean gain vs. mean synaptic vector norm for layer 1.