
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 8, AUGUST 2009 1281

BAM Learning of Nonlinearly Separable Tasks by Using an Asymmetrical Output Function

and Reinforcement Learning

Sylvain Chartier, Member, IEEE, Mounir Boukadoum, Senior Member, IEEE, and Mahmood Amiri, Member, IEEE

Abstract—Most bidirectional associative memory (BAM) networks use a symmetrical output function for dual fixed-point behavior. In this paper, we show that by introducing an asymmetry parameter into a recently introduced chaotic BAM output function, prior knowledge can be used to momentarily disable desired attractors from memory, hence biasing the search space to improve recall performance. This property allows control of chaotic wandering, favoring given subspaces over others. In addition, reinforcement learning can then enable a dual BAM architecture to store and recall nonlinearly separable patterns. Our results allow the same BAM framework to model three different types of learning: supervised, reinforcement, and unsupervised. This ability is very promising from the cognitive modeling viewpoint. The new BAM model is also useful from an engineering perspective; our simulation results reveal a notable overall increase in BAM learning and recall performance when using a hybrid model with the general regression neural network (GRNN).

Index Terms—Bidirectional associative memory (BAM), chaos control, cusp catastrophe, hybrid model, nonlinearly separable tasks, prior knowledge, reinforcement learning.

I. INTRODUCTION

THE bidirectional associative memory (BAM) neural network has been extensively studied since its introduction by Kosko [1]. Over the years, several variants have been proposed to overcome the original model's limited storage capacities and reduce its noise sensitivity, and most of today's BAM models can store and recall all the patterns in a learning set. This was the outcome of numerous sophisticated approaches that modify the learning rule or coding procedure, with the result of increasing both storage capacity and performance, and decreasing the number of spurious states (e.g., [2] and [3]). Some use genetic algorithms [4], [5] or perceptron learning [6], others use expanded pattern pairs [2], [4], [7], [8], incorporate a time

Manuscript received June 11, 2008; revised November 14, 2008 and March 27, 2009; accepted April 24, 2009. First published July 10, 2009; current version published August 05, 2009.

S. Chartier is with the School of Psychology, University of Ottawa, Ottawa, ON K1N 6N5 Canada (e-mail: [email protected]).

M. Boukadoum is with the Department of Computer Science, Université du Québec à Montréal, Montréal, QC H3C 3P8 Canada (e-mail: [email protected]).

M. Amiri is with the School of Electrical and Computer Engineering, University of Tehran, Tehran 67155-1616, Iran (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNN.2009.2023120

delay [9], [10], use impulses [11], [12], exploit interconnections among the units inside each layer [3], or utilize a nonalgebraic approach [13]. In all cases, the improvement in BAM performance comes with an increase in network complexity, and the majority of the proposed models depart drastically from the simple Hebbian covariance/correlation learning scheme that made the BAM popular. For example, Wu and Pados [14] and Eom et al. [15] add a hidden layer that increases the network's dimension in order to facilitate class separation, and they use different learning rules to update the weights of the hidden and output layers. In [16], the networks behave similarly to a feedforward network, with a resulting loss of internal consistency. Also, most of the modified networks still use an offline learning algorithm and can only memorize binary or bipolar patterns. The inability to learn real-valued, fixed-point attractors limits the models in both cognitive explanation power and engineering applications.

Recently, Chartier et al. [17] proposed a BAM that learns online to iteratively develop weight connections that converge to a stable solution, using nonlinear feedback provided by a novel output function. The proposed BAM learns by only using covariance matrices, and it is among a few models that can create real-valued attractors without preprocessing. It is also able to reduce the number of spurious attractors while maintaining performance in terms of noise degradation and storage capacity. In a more recent study, the model was extended to handle multistep pattern recognition as well as one-to-many association [18], tasks that were exclusive to multilayer perceptron architectures with local feedback. The new BAM can also switch from fixed points to dynamic orbits [19], with the ability to achieve aperiodic (including chaotic) association and constrained output. Finally, removing a set of external teacher connections enabled the model to become a feature-extracting bidirectional associative memory (FEBAM) [20]. Using autonomous perceptual feature creation, FEBAM performs competitively in tasks usually associated with principal component analysis (PCA) and independent component analysis (ICA) networks, such as image reconstruction. Moreover, FEBAM still keeps its attractor-like behavior, such as prototype development. It can learn in noisy environments and it creates and reorganizes its cluster-based categories in a flexible way [21], [22].

As BAM architectures stand today, each fixed point partitions the recall space. However, as the number of learned prototypes increases, the associated radii of attraction decrease. A typical BAM with units can associate up to patterns [23]. At the limit, the radii of attraction are null and the network

1045-9227/$26.00 © 2009 IEEE


can no longer process exemplars, because it loses its noise tolerance ability. In most BAM designs, the problem is avoided by setting the ratio of the learned patterns over the number of units to about 50% [24]. At that level, the model can still associatively recall stored patterns from noisy versions for low-to-moderate noise levels. An alternative solution to the noise problem is to allow the network to ignore particular fixed points during a particular recall without erasing them from memory; if the number of attraction basins is reduced while keeping the total surface of attraction constant (no new units added), then the remaining attractors benefit by having their radii of attraction increased. That should lead to a performance increase for the concerned recall.

The problem of controlling the radii of attraction can also be found in aperiodic (chaotic) recurrent associative memories (RAMs) [25]–[27]. Sometimes, aperiodic dynamics are the desired behavior, with the network able to store patterns in irregular orbits. Thus, the state vector is never trapped in a fixed point (or region), and it converges to an aperiodic orbit instead. This memory searching process is different from that of fixed-point association and helps understand instability [25]. However, as the number of stored patterns increases, so does the chaotic retrieval time. If the memory searching process could be influenced so that a subset of the learned items becomes temporarily unavailable, then the number of cycles needed to extract the desired output would decrease. Again, by controlling each fixed point's attraction domain, one can influence the chaotic memory search process.

One mechanism to control the BAM's basins of attraction is via reinforcement learning (RL), as will be explained later. In contrast to supervised learning, the feedback given by the teacher in RL [28] is of lower quality and can be composed of solely success–fail information. Thus, it is only evaluative and not instructive, being only a numerical value that evaluates how good a given output (behavior) is for the context. RL operates with trial–error search and delayed reward. The network tries what it already knows in order to obtain a reward, but it must also explore new behavior to make better action selections in the future [28]. The learning algorithm is not told which actions to take as in supervised learning; instead, it must discover which actions yield the most reward by trying them. Despite its simplicity, the knowledge thus obtained by interaction with the environment can lead to complex classification tasks. But apparently, no BAM has had the capacity to perform RL in the past. This can be explained by the difficulty of constraining network behavior so that, once a wrong output is detected, the probability of producing that output again decreases.

In this work, a mechanism that allows the BAM network to ignore fixed points during a particular recall without erasing them from memory is proposed. As a result, it gains the ability to provide different outputs for the same input, depending on context. Furthermore, a two-BAM architecture is proposed whereby a primary BAM can achieve nonlinear classification using an extra BAM that is properly trained to output the control values that will constrain the attractors. This paper proposes such a control mechanism by exploiting the properties of a cusp catastrophe [29]. The cusp represents a shift in the behavior of a dynamic system that arises from small parameter changes. It will be verified that the transmission function of the BAM described

Fig. 1. Output function in the symmetric setting, showing two stable and one unstable fixed point.

in [19] can be modified to include an asymmetry parameter that enables cusp catastrophes and consequently controls the qualitative behavior of the network. As will be shown, by only modifying the added parameter, the output function can pass from two fixed points to one, therefore enabling the BAM to increase its recall performance by including prior information, control its chaotic outputs, use reinforcement learning, and perform nonlinear classification.

The remainder of this paper is organized as follows. Section II presents the output function of the BAM and the cusp bifurcation properties. The architecture and learning function of the BAM are explained in Section III. Section IV presents various results about the increase of performance using priors, reinforcement learning capacity through the classic XOR classification, chaotic control, and hybrid implementation. Finally, the paper concludes with a discussion about the implication of introducing a cusp catastrophe in the BAM's phase behavior.

II. BAM OUTPUT FUNCTION

A. 1-D Symmetric Setting

The output function used in the model is based on the classic Verhulst equation [30]. Since this quadratic function has only one stable fixed point, it is extended to a cubic form described by the dynamic equation

(1)

where represents the weight connection, is the input value, is a bias or asymmetry parameter, and is a free parameter that affects the equilibrium states of (1). For , the output function is identical to the one in [19]. Fig. 1 illustrates its shape for . The roots of (1) are the fixed points of the system. For example, if and , the corresponding fixed points are , and . The stability properties of the fixed points are determined by computing the derivative of (1)

(2)

If the derivative at a given fixed point is greater than zero, then a small perturbation results in growth; else, if the derivative is negative, then a small perturbation results in decay. The first situation represents an unstable fixed point whereas the second represents a stable one. In the previous example, both and are stable fixed points while is an unstable fixed


Fig. 2. Energy landscape of a 1-D system in the symmetric setting, showing a double-welled potential.

point. This is illustrated in Fig. 1 by filled and empty circles, respectively.
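To make the fixed-point analysis concrete, the short Python sketch below iterates this kind of check numerically. The exact coefficients of (1) are not legible in this copy of the paper, so the cubic used here, dx/dt = delta*(w*x - (w*x)**3) + h, is only an assumed stand-in with the same qualitative structure, and the symbol names w, delta, and h (weight, free parameter, asymmetry parameter) are our own.

    # Illustrative 1-D flow of the cubic family described in Section II (assumed form).
    def f(x, w=1.0, delta=1.0, h=0.0):
        a = w * x
        return delta * (a - a**3) + h

    def df(x, w=1.0, delta=1.0, h=0.0, eps=1e-6):
        # Numerical derivative of f: a fixed point is stable when the derivative is
        # negative (perturbations decay) and unstable when it is positive (growth),
        # exactly as argued around (2).
        return (f(x + eps, w, delta, h) - f(x - eps, w, delta, h)) / (2 * eps)

    # With unit weight and no asymmetry, the roots are 0 and +/-1; 0 is unstable
    # and +/-1 are stable, matching the filled/empty circles of Fig. 1.
    for x_star in (-1.0, 0.0, 1.0):
        label = "stable" if df(x_star) < 0 else "unstable"
        print(f"x* = {x_star:+.1f}  f(x*) = {f(x_star):+.3f}  ({label})")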

Another way to visualize the dynamics of a recurrent neural network is based on the physical idea of energy [31]. The energy function, noted , can be defined by

(3)

where the negative sign indicates that the state vector moves downhill in the energy landscape. Using the chain rule, it follows from (3) that

(4)

Thus, decreases along trajectories or, in other words, the state vector globally converges towards lower energy states. Equilibrium occurs at locations of the vector field where local minima correspond to stable fixed points, and local maxima correspond to unstable ones. To find them, we need to find

such that

(5)

The general solution is

(6)

where is an arbitrary constant (we use for convenience). Fig. 2 illustrates the energy function when and . The system exhibits a double-welled potential with two stable equilibrium points ( and ).
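As a worked example of the energy construction in (3)–(6), consider the generic cubic flow used in the sketch above (again an assumed form, since the coefficients of (1) are not legible here, with w, delta, and h as our own symbols):

    \dot{x} = f(x) = \delta\bigl(wx - (wx)^{3}\bigr) + h,
    \qquad
    E(x) = -\int f(x)\,dx = -\frac{\delta w x^{2}}{2} + \frac{\delta w^{3} x^{4}}{4} - hx + k .

With w = 1, h = 0, and k = 0, this gives E(x) = (delta/4)x^4 - (delta/2)x^2, a double well whose minima at x = +/-1 are the stable fixed points and whose local maximum at x = 0 is the unstable one, and dE/dt = (dE/dx)\dot{x} = -\dot{x}^2 <= 0 along trajectories, which is the statement of (4) that the energy decreases toward the wells of Fig. 2.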

The parameter plays an important role in determining the number and positions of the fixed points. Fig. 3 illustrates the situation when and . For less than zero, the system has only one (stable) fixed point, ; for greater than zero, there are three fixed points: and , of which the first one is unstable and the other two are stable. Finally, for equal to zero, we have a pitchfork bifurcation point. We deduce from the preceding that we must have for the system to store binary stimuli.

The previous results were obtained with , which implies a symmetric neural output function. This type of function has been extensively studied and implemented in both BAM (e.g.,

Fig. 3. Phase diagram indicating the presence of a pitchfork bifurcation at zero.

Fig. 4. Stability of the fixed points as a function of the asymmetry parameter. In the triple-valued region, the middle branch is unstable and the upper and lower branches are stable; S-N bif represents a saddle-node bifurcation.

[17]–[19], and [32]) and other autoassociative memory architectures (e.g., [32]). When is different from zero, the symmetry is broken and (1) leads to a cusp catastrophe system.

B. 1-D Asymmetric Setting

To determine the values of and that result in one versus two stable fixed points, (1) is set to zero and then decomposed into two parts. Setting for simplicity, we obtain

(7a)

and

(7b)

The saddle-node bifurcation occurs for values of that are tangent to the local minima or maxima of the cubic part (7a). By finding the roots of the derivative of (7a), we have

(8)

and merging this result into (7a) yields

(9)

Thus, for different from 0 and positive, there will always exist at least one stable fixed point as illustrated in Fig. 4. For example, if and is less than , there will be one stable negative fixed point; if the value of is greater


Fig. 5. Stability diagram: depending on the values of the asymmetry and free parameters, the system will have one or two stable fixed points.

Fig. 6. Cusp catastrophe surface.

than , there will be one stable positive fixed point; and if the value of is in between those values, there will be two stable fixed points (one positive and one negative). At the

limits, a saddle-node bifurcation occurs and the two fixed points collide and annihilate each other.

We conclude from (7b) that the saddle-node bifurcation occurs when , where

(10)

Equivalently, it occurs when , with

(11)

As is shown in the stability diagram in Fig. 5, the number of stable fixed points will be one or two depending on the values of and . Saddle-node bifurcations occur along the entire boundaries of the region; at the cusp point , a codimension-two bifurcation is observed.

The situations expressed in Figs. 3–5 can be summarized by the cusp catastrophe surface shown in Fig. 6. For positive values of , and depending on the value of , there will be one negative or one positive stable fixed point, or two stable fixed points, one being positive and the other negative.
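For reference, the saddle-node boundary described around (7)–(11) can be written down explicitly for the standard cusp normal form. This is offered only as an illustration consistent with Figs. 4–6, not as the paper's exact expressions (which are not legible in this copy), and the symbols delta and h below are our own:

    \dot{x} = h + \delta x - x^{3}, \qquad
    \dot{x} = 0 \ \text{has a double (tangent) root} \iff 27h^{2} = 4\delta^{3}
    \iff h = \pm\, 2\left(\tfrac{\delta}{3}\right)^{3/2} .

Hence, for delta > 0, two stable fixed points coexist (with an unstable one between them) when |h| < 2(delta/3)^{3/2}, a single stable fixed point remains when |h| exceeds that bound, and the two saddle-node curves meet at the cusp point (delta, h) = (0, 0), the codimension-two point of Fig. 5.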

Another way to illustrate the impact of the asymmetry parameter on the fixed points is to use the energy function (5) as illustrated in Fig. 7. When , we have a double well; this is the situation described in previous studies (e.g., [19]). For instance, suppose that and that is given an

Fig. 7. Modification of the attraction domain as a function of the asymmetry parameter; if it is greater than the limit set by (10), then a stimulus will be able to move from one attractor to another.

initial negative value. After some time, will converge to fixed point as indicated by the filled circle on the left of Fig. 7(a). However, if the value of is different from zero, say , then the radius of the attraction domain of one attractor becomes greater than that of the other, as shown by the deeper well on the right in Fig. 7(b). The stimulus will remain in the left fixed point as long as is less than or equal to . If the value of continues to grow (e.g., ), then only one attractor will remain [Fig. 7(c)]. The stimulus will now be attracted by the right side attractor and will stabilize at a positive value. Thus, the effect of the asymmetry parameter is to force a stimulus to move from one attractor to another. This is akin to momentarily disabling a given attractor without erasing it from system memory. Fig. 7 also shows that the fixed point at will move to higher values as the value of increases. For example, if [Fig. 7(d)], the fixed point becomes . If needed, the signum function can be used to squash the value back to 1. The situation in an -dimensional space is explored next.

C. M-Dimensional Setting

In an -dimensional space, (1) takes a vector form and becomes

(12)

where the boldface quantities are the vector counterparts of the terms in (1) and is the network's dimension. When the weight matrix is diagonal, the system is uncoupled and the analysis is straightforward. As in the 1-D system, the fixed points are defined by the roots of (12). The stability properties of each one are determined by finding the eigenvalues of its Jacobian matrix. Depending on the eigenvalues, different types of fixed-point behaviors are obtained. For

example, if , , and there will


Fig. 8. Phase portrait of a 2-D system in the symmetric setting. The stable fixed points are represented by filled circles and the unstable ones by empty circles.

Fig. 9. Energy landscape of the same 2-D system as in Fig. 8.

be nine fixed points as illustrated in Fig. 8. Four of them are stable: , and ; they are indicated by filled circles. The other five are unstable and indicated by empty circles.

Here also, the stability of the fixed points can be determined from the energy of the system. In matrix notation, the energy function is

(13)

The function is plotted in Fig. 9 for a 2-D system, using the previous values of , , and . The stable fixed points correspond to the local minima in the plot and they partition the recall space into four equal wells. For example, if the desired correct output (fixed point) is , the probability that this fixed point attracts a uniformly distributed random pattern is 25%.

If the value of increases from zero, it makes the energy landscape “rotate.” The angular rotation depends on the element

Fig. 10. Energy landscape of the 2-D system with a nonzero asymmetry parameter: (a) two stable fixed points remain; (b) a single attractor remains.

values of . For example, if , then will be positive as shown in Fig. 10(a); the system has now only two stable fixed points. This is equivalent to biasing the system's recall ability by increasing the probability of finding a desired output. In this example, the probability that the fixed point attracts a random stimulus increases to 50%. In practice, this means that if prior knowledge about the desired output is available, it can be coded into a value of to improve the network's associative performance. Ultimately, if absolute knowledge about the desired output is provided via a specific value of (e.g., we know through some means that a particular value, say , will lead to the desired output), then there will be only one attractor in the system and the probability that the fixed point attracts an input will be 100%. This last situation is illustrated in Fig. 10(b).

In short, the parameter can be used to generate different outputs from a given input. As an application example, assume that input is normally attracted by fixed point , and suppose we know that the first element of the desired output cannot be positive; it is possible to enter that information into the network by setting to any value less than (e.g., ). As a result, will converge to a different attractor, in this case . Therefore, the network's behavior regarding an input is modified based on context.
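The following short Python sketch illustrates the behavior just described for an uncoupled 2-D system: the same noisy input converges to a different corner of the energy landscape once one element of the asymmetry vector is pushed past its threshold. The elementwise map used here, (delta+1)*a - delta*a**3 + h, is an assumed stand-in for (12) (the exact expression is not legible in this copy), and the symbols are our own.

    import numpy as np

    DELTA = 0.1  # output parameter (illustrative value of the kind used during learning)

    def output(a, h):
        # Assumed elementwise cubic map standing in for (12).
        return (DELTA + 1.0) * a - DELTA * a**3 + h

    def recall(x0, h, n_iter=200):
        x = np.array(x0, dtype=float)
        for _ in range(n_iter):
            x = output(x, h)      # identity (diagonal) weights: uncoupled 2-D system
        return np.sign(x)         # squash back to +/-1, as suggested in the text

    noisy_input = [0.8, -0.6]
    print(recall(noisy_input, h=np.array([0.0, 0.0])))    # -> [ 1. -1.] (nearest attractor)
    print(recall(noisy_input, h=np.array([-0.5, 0.0])))   # -> [-1. -1.] (first attractor disabled)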

Before introducing the various simulations that illustrate the properties induced by the parameter, a brief description of the specific BAM model used in this study will be given.

III. BIDIRECTIONAL ASSOCIATIVE MEMORY

A. Architecture

The network architecture is similar to the one proposed by Hassoun [33]. It is illustrated in Fig. 11, where and are the initial input states (stimuli), is the number of iterations over the network, and and are weight matrices. The network is composed of two interconnected layers that, together, allow a recurrent flow of information that is processed in a bidirectional way. The layer returns information to the layer and vice versa. Like any BAM, this network can be both an autoassociative and heteroassociative memory [1]. In this particular model, the two layers can be of different dimensions and, contrary to usual BAM designs, the weight matrix from one side is not necessarily the transpose of the other side.

Authorized licensed use limited to: University of Ottawa. Downloaded on August 24, 2009 at 15:18 from IEEE Xplore. Restrictions apply.

Page 6: IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 8, AUGUST 2009 …aix1.uottawa.ca/~schartie/Chartier-TNNd.pdf · 2009-08-29 · IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO

1286 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 8, AUGUST 2009

Fig. 11. Architecture illustration of the bidirectional heteroassociative memory.

B. Learning

The network tries to solve the following nonlinear constraints:

(14a)

(14b)

where is the output function defined in (12). The form of the constraints and the recurrent nature of the underlying network call for a learning process that is executed online; the weights are modified based on both input and output data. In addition, we want the model to be close to neuroscience, and variations of Hebbian learning are favored [34]. As a result, the learning is based on time-difference Hebbian association [17], [32], [35], [36]. This is formally expressed as follows:

(15)

(16)

where is a learning parameter. In (15) and (16), weight matrices and are updated according to the following procedure: the initial input vectors and are fed to the network and iterated times through it (Fig. 11). This results in outputs and , which are used for the weight updates. The weights will self-stabilize when the feedback is the same as the initial inputs [i.e., and ]; in other words, when the network has developed fixed points. This contrasts with most BAMs, where learning is performed offline, based on unit activation. Learning convergence is a function of the value of the learning parameter . To ensure weight convergence, must be set according to the following condition [19]:

(17)

and are the number of units in each layer and is the output parameter. Further details about the learning rule are provided in [17] and [19].
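For readers who want to experiment, here is a minimal sketch of the online learning loop of this section. Since (15)–(17) are not legible in this copy, the update below uses one time-difference Hebbian form consistent with the description, W <- W + eta*(y(0) - y(t))*(x(0) + x(t))^T and symmetrically for V; treat the exact rule, the transmission function, and the parameter values as assumptions rather than the authors' exact equations.

    import numpy as np

    ETA = 0.001    # learning parameter (value reported in Section IV-A)
    DELTA = 0.1    # output parameter during learning
    T = 1          # iterations through the network before each update

    def transmit(a, h=0.0):
        # Assumed elementwise cubic output; see the sketch in Section II.
        return (DELTA + 1.0) * a - DELTA * a**3 + h

    def learn(X, Y, n_trials=2000, seed=0):
        """Online time-difference Hebbian learning (hedged reconstruction of (15)-(16)).
        X, Y: arrays of shape (n_patterns, n_x) and (n_patterns, n_y)."""
        rng = np.random.default_rng(seed)
        W = np.zeros((Y.shape[1], X.shape[1]))   # x-layer -> y-layer weights
        V = np.zeros((X.shape[1], Y.shape[1]))   # y-layer -> x-layer weights
        for _ in range(n_trials):
            k = rng.integers(len(X))             # random pattern pair
            x0, y0 = X[k], Y[k]
            xt, yt = x0.copy(), y0.copy()
            for _ in range(T):                   # iterate the recurrent network
                yt = transmit(W @ xt)
                xt = transmit(V @ yt)
            # Updates vanish once the feedback equals the initial inputs (fixed points).
            W += ETA * np.outer(y0 - yt, x0 + xt)
            V += ETA * np.outer(x0 - xt, y0 + yt)
        return W, V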

In Section IV, several simulations are presented to show the behavior and performance of the novel network after the inclusion of the asymmetry parameter.

Fig. 12. Pattern set used in Section IV-A.

IV. SIMULATION

A. Recall Improvement With Prior Knowledge

This first simulation shows how an improvement in performance is observed when prior information is coded in the parameter. To facilitate demonstration, the BAM is used as an autoassociator and the same patterns, shown in Fig. 12, are presented to both the and layers. As a result, “A” will be associated with “A,” “B” with “B,” etc.

1) Methodology: The simulation patterns each consist of a 5 × 7 pixel matrix representing a letter of the alphabet, thus forming a 35-dimensional vector. White and black pixels were, respectively, assigned values of and , and the absolute correlations between patterns varied from 0.03 to 0.83. The memory loading ratio was 34% (12 patterns over 35 units). Learning was conducted as follows.

1) Initialization of the weight matrices to 0.
2) Random selection of a pair following a uniform distribution.
3) Iteration through the network according to (12); to limit computation time, only one iteration is performed.
4) Weight update according to (15) and (16).
5) Repetition of steps 2–4 until the desired number of learning trials is reached, or the mean squared error (MSE) between initial inputs and network feedback is sufficiently small for a processed item.

was set to 0.001, to , and the learning ended as soon as MSE was lower than 0.01. The output function was computed by Euler approximation, with the approximation parameter set to 0.1 [15], [17]. After some manipulation, (12) leads to . Following [19], output parameter was set to 0.1 during learning to let the network develop fixed points.

Then, in a second run, information about the desired output was provided to the network via the parameter. The information was that the output should have a vertical bar in it, as illustrated in Fig. 13. The vertical bar was converted to a vector and the parameter was set to that prior, with white and black pixels assigned values of and , respectively.

It was hypothesized that the recall performance would increase for the letters “I” and “T,” since the prior refers to those patterns. The performance was estimated using noisy versions of the patterns. The noise in each case was measured by the percentage of pixels that were flipped from to , and vice versa. For instance, a noise level of 20% meant that seven out of the 35 pixel values were randomly chosen and changed. For each percent of noise, 500 noisy patterns were randomly generated and the recall performance was determined in each case.
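The noise test just described can be reproduced with a few lines of Python. The function names below (flip_pixels, recall_fn) are our own; recall_fn stands for whatever recall procedure is being evaluated, here the trained BAM iterated to convergence, with or without the vertical-bar prior coded into the asymmetry vector.

    import numpy as np

    def flip_pixels(pattern, noise_pct, rng):
        """Flip noise_pct percent of the +/-1 pixels (20% of 35 ~= 7 pixels)."""
        noisy = pattern.copy()
        n_flip = int(round(noise_pct / 100.0 * pattern.size))
        idx = rng.choice(pattern.size, size=n_flip, replace=False)
        noisy[idx] *= -1
        return noisy

    def recall_accuracy(recall_fn, patterns, noise_pct, n_trials=500, seed=0):
        """Fraction of noisy probes mapped back to their original pattern."""
        rng = np.random.default_rng(seed)
        correct = 0
        for _ in range(n_trials):
            target = patterns[rng.integers(len(patterns))]
            probe = flip_pixels(target, noise_pct, rng)
            correct += np.array_equal(recall_fn(probe), target)
        return correct / n_trials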


Fig. 13. Asymmetry parameter corresponding to a vertical bar.

Fig. 14. Recall performance of the letters I and T as a function of noise level, with and without prior knowledge of a vertical bar in the output.

2) Results: Just as predicted, the performance for the “I” and “T” letters was increased by the inclusion of prior information via the parameter. Fig. 14 shows that, as the noise level increases, the performance is greater for the situation with prior knowledge than without it. On average, a 20% improvement was observed.

B. Learning Nonlinearly Separable Tasks

As mentioned previously, the parameter may be used to direct the network's output. In particular, if an input produces the wrong output during training, the corresponding fixed point can be momentarily disabled by adjusting the value of . This allows different behaviors to be tried until a satisfactory outcome is achieved. A second BAM may then learn the association between each input and the optimal value that leads to the correct output. It will subsequently feed that value to the first network when it sees the input again, so that the desired output is obtained. Reinforcement learning (RL) can be used to find the optimal values for .

The previous procedure can be used to learn nonlinearly separable tasks. The mapping between input and output is performed in interaction with the environment. The goal is to minimize the error in network recall [37]. Fig. 15 illustrates the different steps to accomplish a nonlinearly separable learning task, using the following algorithm.

Step 1) The network first attempts to learn the problem as

a linearly separable task by using standard BAM learning [17], [19]. Then, the training set is run through it again to detect inaccuracies [Fig. 15(a)].

Step 2) The wrong outputs are corrected by finding the values of that will favorably bias network behavior for those outputs [Fig. 15(b)]. This is accomplished by RL as follows (a minimal sketch of this search is given after the list).
1) A random value ( or ) is generated for one of the elements of .

Fig. 15. Proposed three-step BAM learning procedure for a nonlinearly separable problem: (a) normal BAM learning; (b) RL to find the asymmetry values that correct the wrong outputs; (c) use of a second BAM to output the learned correct asymmetry values for each input.

Fig. 16. Input and output patterns for OR and XOR functions.

2) The network output is computed with the new.

3) If the output is closer to the correct solution, then the new is kept; else the older is restored.

4) Repetition of steps 1–3 for the remaining elements of until the observed output is close enough to the desired output.

Step 3) The associations between the found values of and the corresponding inputs are learned by a second BAM that will output those values to the first BAM as needed during recall [Fig. 15(c)].
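A minimal sketch of the hill-climbing search in Step 2 follows. It assumes an elementwise asymmetry vector and a squared-error cost against the desired output, exactly as described; bam_output is a placeholder name for the trained network's recall, and the elementwise +/-1 values are those mentioned in substep 1.

    import numpy as np

    def rl_search_h(bam_output, x, desired, h_dim, rng=None):
        """Greedy RL search for an asymmetry vector (Step 2, substeps 1-4).
        bam_output(x, h) -> network output; placeholder for the trained BAM."""
        rng = rng or np.random.default_rng()
        h = np.zeros(h_dim)
        best_se = np.sum((bam_output(x, h) - desired) ** 2)
        for i in rng.permutation(h_dim):          # visit the elements in random order
            trial = h.copy()
            trial[i] = rng.choice([-1.0, 1.0])    # substep 1: random value for one element
            se = np.sum((bam_output(x, trial) - desired) ** 2)   # substep 2
            if se < best_se:                      # substep 3: keep only improvements
                h, best_se = trial, se
            if best_se == 0:                      # substep 4: stop when the output is correct
                break
        return h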

1) XOR Problem Example: To illustrate the efficiency of using the previous algorithm, the following shows how the network was trained to handle the classic XOR problem (see Fig. 16), one that no simple BAM described in the literature has been able to process.

2) Methodology: Fig. 16 shows the simulation patterns, encoded as in the previous section. The network first attempts to learn the problem as a regular linearly separable task, in this


Fig. 17. SE as a function of RL trial number and corresponding asymmetry parameter displayed in graphical form. A gray pixel indicates a value of 0, a white pixel a value of −1, and a black pixel a value of 1.

TABLE IOVERALL NETWORK BEHAVIOR TO SOLVE THE XOR PROBLEM

case the OR function. Then, it is tested against the XOR data. For the first three inputs shown in Fig. 16, the network generates the right outputs, but a problem arises with the fourth one: the network outputs “1” when it should have output “0.” As a result, should be modified so that the network temporarily ignores the “1” output attractor and converges to the “0” attractor instead. In order to do so, a cost function is defined for the RL algorithm. In this case, the squared error (SE) between the network output and the desired “0” behavior is used.

3) Results: Fig. 17 illustrates the obtained quadratic error as a function of the trial number, together with the corresponding parameter values converted to a graphic representation. On the first trial, the network selected a value of that left the error unchanged; therefore the value was ignored and was kept at . On the second trial, the randomly selected element produced a decrease of the squared error; therefore, this vector element was kept (white pixel). However, the network was still not able to produce the correct answer; it took another random vector selection (black pixel) trial to produce a decrease of the squared error that was big enough to make the network produce the correct answer. The second BAM was then trained to output that value of when it sees two “1” (thus, it learns the linearly separable AND function). Table I summarizes the overall network behavior during recall after successfully completing the training to solve the XOR problem.

By adding noise to each stimulus, the classification boundaries can be estimated (Fig. 18). The network has classification boundaries that are larger for opposite polarities (“1–0” or “0–1”) than for same polarities (“0–0” or “1–1”).

Fig. 18. Decision regions for noisy inputs.

Fig. 19. Stimuli used for learning in Section IV-C.

C. Chaos Control

In [19], it was shown that our BAM model can be used as an aperiodic associative memory. When it enters aperiodic mode, the model wanders from one stored stimulus region to another. By using the parameter, the wandering can be controlled and particular subspace regions may be favored over others. Consequently, the network acquires the ability to increase or decrease the probability of outputting a given memory trace. This behavior is illustrated next, where the same stimuli as the ones described in [19] are used.

1) Methodology: The stimuli consist of four 10 × 10 patterns, as shown in Fig. 19. The learning procedure was the same as previously, with the learning parameter set to . To observe the desired chaotic properties, Euler approximation was used for the output function [17], [19], leading to the expression given in Section IV-A. Here also, following [19], output parameter was set to 0.1 during learning to let the network develop fixed points. During recall, it was set to to let the network operate as an aperiodic associative memory [25].

Initially, each of the four patterns has a 25% chance to attract a random input. It is predicted that the more the distinctive features (pixels) of a given stimulus are adversely constrained by the parameter, the lower the likelihood that this stimulus will attract the memory trace. For example, assume that we want to decrease the network's probability of outputting the “X” pattern. Then, the pixels that allow the network to classify “X” patterns should be turned off by the parameter when the case arises. Fig. 20 illustrates the pixels of the “X” pattern that are used for this purpose. If a distinctive pixel is positive (black pixels), then the corresponding vector element will be negative. Conversely, if the distinctive pixel is negative (white pixels), then the corresponding vector element will be positive. So, as the number of distinctive pixels that are turned off increases, the probability of recalling the “X” pattern should decrease.
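The asymmetry vector used to suppress the “X” pattern can be built directly from its distinctive pixels, following the sign rule just stated. The sketch below is illustrative: distinctive_idx would hold the 12 pixel indices highlighted in Fig. 20, and the magnitude of the asymmetry elements is an assumed value, not one reported in the paper.

    import numpy as np

    def suppression_vector(pattern, distinctive_idx, n_remove, magnitude=0.5):
        """Build an asymmetry vector that turns off n_remove distinctive pixels of
        `pattern` (values in {-1, +1}); positive pixels get a negative element
        and negative pixels a positive one, as described in Section IV-C."""
        h = np.zeros_like(pattern, dtype=float)
        for i in distinctive_idx[:n_remove]:
            h[i] = -magnitude * np.sign(pattern[i])   # oppose the pixel's stored polarity
        return h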


Fig. 20. The 12 distinctive features of the “X” patterns.

Fig. 21. Probability of recalling the “X” pattern as a function of the number of removed distinctive features.

Fig. 22. Example outcomes (100) of aperiodic recall when 12 distinctive features have been removed from the “X” pattern.

2) Results: Fig. 21 provides the probability of recalling the “X” pattern in the chaotic associative recall mode. The effect of distinctive feature removal does not show until there are about four removed features. Afterward, there is a fairly linear decrease of the recall probability. Fig. 22 shows different outcomes of the chaotic associative recall process when 12 distinctive features (pixels) have been removed. From the 100 iterations shown, only four remain close enough to the “X” pattern to be classified in that category. Therefore, by using the parameter, it was possible to control chaotic wandering of the memory trace. However, it should be noted that removing distinctive features of a given pattern does not prevent the network

Fig. 23. GRNN architecture.

from recalling its complement; a white “X” on a black background is still an attractor in memory.

D. Performance Improvement With a Hybrid Architecture, by Using a GRNN to Determine the Asymmetry Parameter

For some applications, finding the appropriate values of the parameter can be accomplished by other means than a secondary BAM, and the resulting hybrid system has a potentially better overall performance. For instance, the general regression neural network (GRNN) model [38] can be used due to its simplicity and fast approximation procedure for multiclass problems. Many other models are applicable (e.g., multilayer perceptrons, support vector machines, radial basis function networks, evolutionary algorithms, etc.), but the optimization of a hybrid approach is beyond the scope of this paper.

GRNN estimates a regression surface of dependent variables (desired output vectors) given independent variables (input vectors). To compute the most probable values of an output , the network relies solely on a set of training vectors . In our simulations, both inputs and desired outputs are the same and, therefore, the model is used for autoregression. This allows it to output patterns that match the stored patterns as closely as possible. By outputting the appropriate parameter values, GRNN will orient the energy landscape (Fig. 10) of the BAM in favor of the correct attractor.

The architecture of GRNN, shown in Fig. 23, consists of four layers. The input layer is fully connected to the pattern layer, which has one unit for each pattern. Each unit in the pattern layer computes a Gaussian function expressed by

(18)

where denotes a smoothing parameter (set to 1 in this work), is the input vector to the network, and is a training vector.

The summation layer has two types of units and . The first type computes the weighted sum of the hidden layer outputs and the second type has its weights set to “1,” providing the sum of the exponential terms alone. Finally, each output unit divides an output of the first type by an output of the second type to provide the prediction result as [39]

(19)
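Equations (18) and (19) amount to a Gaussian-kernel weighted average of the stored training outputs. The compact sketch below implements that estimator with the smoothing parameter set to 1, as stated above; the function name and array shapes are our own conventions. In the hybrid system of this section, the GRNN output serves as the asymmetry vector fed to the BAM (here the training outputs are the patterns themselves, since the model is used for autoregression).

    import numpy as np

    def grnn_predict(x, train_in, train_out, sigma=1.0):
        """GRNN estimate: kernel-weighted average of the training outputs.
        x: (d,) probe; train_in: (n, d); train_out: (n, m)."""
        d2 = np.sum((train_in - x) ** 2, axis=1)   # squared distances to stored patterns
        k = np.exp(-d2 / (2.0 * sigma ** 2))       # pattern-layer activations, cf. (18)
        return (k @ train_out) / np.sum(k)         # summation/output layers, cf. (19)

    # Hybrid usage (sketch): h = grnn_predict(noisy_pattern, stored_patterns, stored_patterns)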

1) Methodology: The patterns used for the simulation are the same as those of Fig. 12 and learning is similar to the one used


Fig. 24. Percentage of correct categorization as a function of the number of pixel flips for the BAM without priors, the hybrid GRNN-BAM, the OLAM-BAM, and the QLBAM.

Fig. 25. Capacity performance for the BAM without priors, BAM with priors from GRNN, OLAM, and QLBAM. Each curve represents 95% of recall accuracy. The memory load is based on 64-dimensional random bipolar patterns.

in Section IV-A. Once the patterns are stored, the hybrid model performance is evaluated using noisy versions of the patterns. The noise in each case is measured by the percentage of pixels that are flipped from to , and vice versa. The resulting noisy versions are input to the GRNN to generate the corresponding values according to (19). Then, both the noisy pattern and the GRNN output are sent to the BAM for categorization. For each percent noise, 200 noisy patterns are randomly generated and their recall performance calculated.

2) Results: Fig. 24 shows the performances of the BAM without priors, the GRNN-BAM models, the optimal linear associative memory (OLAM-BAM) [3] based on the pseudoinverse, and the quick learning BAM (QLBAM) [40]. The last two were added for benchmark comparison (notice that since the memory load is greater than 0.15, the BAM by Kosko [1] cannot correctly learn the task). The figure shows that using the GRNN model to estimate the values of increases BAM performance by up to 50%.

3) Capacity Estimation: The capacity of the model was measured according to the approach described in [41]. For each test, 20 sets of random bipolar patterns of 64 dimensions were generated. The performance was evaluated using a recall accuracy of 95%. Fig. 25 shows that using a hybrid approach does increase the capacity of the BAM. If no priors are used and an acceptable normalized radius of attraction is desired, the capacity is about 0.5 (32 patterns of 64 dimensions). However, if priors are used via the GRNN network, then the network still has a normalized radius of attraction of about 0.32 with a memory load of 1 (64 patterns of 64 dimensions).

This increase in performance (Figs. 24 and 25) is comparable to that of using the GRNN model with another associative memory [39], [42]. However, the proposed BAM requires more computational time, and learning and recall are much slower mechanisms when compared to [39] and [42].

V. DISCUSSION

Several avenues may be explored to improve RL in the proposed BAM model. A simple RL scheme was considered in this study, where only the immediate last action was stored. However, most reinforcement models learn under delayed reinforcement, which more closely relates to real-world situations. This means that the weight adjustment should be a function of chronological sequences of stimuli. It was shown in a previous study that the BAM can perform multistep pattern recognition [18]. Hence, more complex temporal credit assignment problems can possibly be implemented. A second improvement to consider is the way behaviors are generated by the model. In this study, they were generated through a simple random process. A more interesting approach would be to use chaotic itinerancy [27]. In a previous study, it was shown that the BAM can perform such aperiodic associative behavior [19], and this study has shown how the cusp catastrophe can be used to orient the chaotic search process. A future study would look at how chaotic itinerancy can be combined with RL as a means to improve the generation of oriented behavior with the parameter.

A limitation of our model is that, since its learning is based on correlation/covariance matrices, each stored pattern also encodes its complement. This imposes an important restriction on what features can be removed to control the chaotic recall process. Specifically, it is difficult to temporarily remove both a pattern and its complement. Hence, to increase the versatility of the parameter, further studies should look into how asymmetry can be implemented during the learning process to prevent the model from storing complements [43].

Another aspect of the model that needs to be studied is its capacity to build its knowledge based on previous success. For example, we have shown that the model is able to solve the XOR problem (parity-2 problem). By building on a parity-2 solution and by including a growing architecture (e.g., cascade correlation [44]), the model might become able to solve a parity-3 problem. In other words, since the n-parity problem is symmetric and the solution for (n−1)-parity can be transferred to n-parity, the network could be used for arbitrary parity problems. In this regard, if a nonlinearly separable classification task cannot be decomposed into a simpler one (e.g., parity-2), then the model will not be able to solve it.

Finally, a comprehensive study should be performed to assess the gain in radii of attraction from the inclusion of priors as well as its effect on spurious memories.


VI. CONCLUSION

The introduction of a cusp catastrophe geometry into BAM dynamics results in various desired behaviors for modeling cognitive processes. The cusp catastrophe depends only on the value of the asymmetry parameter and is useful for temporarily removing attractors from the energy landscape. Since the stimuli iterate within the same network space, the probability of being captured by the remaining attractors increases as a result. The parameter guides the retrieval process in an associative memory by encoding prior indications about the outcome, thereby helping choose the right attractor during memory recall. More specifically, it was shown in this work that the proposed memory model has the ability to control chaotic wandering of the output and favor some network subspaces over others.

In addition, although the original BAM belongs to the class of supervised learning neural networks, it lacks the ability to perform nonlinear classification such as solving the well-known XOR problem. It was shown that, by incorporating RL into the model, two BAMs in interaction can achieve such classification. The resulting neural architecture may not boast optimal performance since its learning still relies on simple covariance matrices, but the goal is to develop a cognitive model and, in that sense, a good model must first be able to display various behaviors starting from a common base (architecture, learning, and output function). Therefore, being able to display the three basic types of learning (supervised, unsupervised, and reinforcement) under the same framework is an interesting development by itself. Finally, a notable performance improvement is achieved by using the hybrid BAM/GRNN.

In conclusion, since the output function used in this work is a generalization of the one described in [17]–[19], the new BAM model inherits several interesting features such as multistep pattern recognition and one-to-many association. In addition, through the removal of one pair of external links, the model acquires the ability to perform feature extraction and category development [20]–[22]. Therefore, with the added features described in this paper, we believe that the new model can be a building block for developing larger scale (i.e., more complex) cognitive models of human perceptual and categorical processes than currently available.

ACKNOWLEDGMENT

The authors would like to thank Dr. A. Sadeghian for his valuable comments and helpful suggestions.

REFERENCES

[1] B. Kosko, “Bidirectional associative memories,” IEEE Trans. Syst. Man Cybern., vol. SMC-18, no. 1, pp. 49–60, Jan./Feb. 1988.
[2] Y. F. Wang, J. B. Cruz, Jr., and J. H. Mulligan, Jr., “Two coding strategies for bidirectional associative memory,” IEEE Trans. Neural Netw., vol. 1, no. 1, pp. 81–92, Mar. 1990.
[3] Z. Wang, “A bidirectional associative memory based on optimal linear associative memory,” IEEE Trans. Comput., vol. 45, no. 10, pp. 1171–1179, Oct. 1996.
[4] D. Shen and J. B. Cruz, Jr., “Encoding strategy for maximum noise tolerance bidirectional associative memory,” IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 293–300, Mar. 2005.
[5] S. Du, Z. Chen, Z. Yuan, and X. Zhang, “Sensitivity to noise in bidirectional associative memory (BAM),” IEEE Trans. Neural Netw., vol. 16, no. 4, pp. 887–898, Jul. 2005.
[6] C.-S. Leung, “Optimum learning for bidirectional associative memory in the sense of capacity,” IEEE Trans. Syst. Man Cybern., vol. 24, no. 5, pp. 791–795, May 1994.
[7] T. Wang, X. Zhuang, and X. Xing, “Weighted learning of bidirectional associative memories by global minimization,” IEEE Trans. Neural Netw., vol. 3, no. 6, pp. 1010–1018, Nov. 1992.
[8] X. Zhuang, Y. Huang, and S. Chen, “Better learning for bidirectional associative memory,” Neural Netw., vol. 6, pp. 1131–1146, 1993.
[9] S. Arik, “Global asymptotic stability analysis of bidirectional associative memory neural networks with time delays,” IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 580–586, May 2005.
[10] L. Wang and Z. Zou, “Capacity of stable periodic solutions in discrete-time bidirectional associative memory neural networks,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 51, no. 6, pp. 315–319, Jun. 2004.
[11] Y. Xia, Z. Huang, and M. Han, “Existence and globally exponential stability of equilibrium for BAM neural networks with impulses,” Chaos Solitons Fractals, vol. 37, pp. 588–597, 2008.
[12] H. Gu, H. Jiang, and Z. Teng, “Existence and globally exponential stability of periodic solution of BAM neural networks with impulses and recent-history distributed delays,” Neurocomputing, vol. 71, pp. 813–822, 2008.
[13] M. E. Acevedo-Mosqueda, C. Yáñez-Márquez, and I. López-Yáñez, “Alpha-beta bidirectional associative memories: Theory and applications,” Neural Process. Lett., vol. 26, pp. 1–40, 2007.
[14] Y. Wu and D. A. Pados, “A feedforward bidirectional associative memory,” IEEE Trans. Neural Netw., vol. 11, no. 4, pp. 859–866, Jul. 2000.
[15] T. Eom, C. Choi, and J. Lee, “Generalized asymmetrical bidirectional associative memory for multiple association,” Appl. Math. Comput., vol. 127, pp. 221–233, 2002.
[16] H. Shi, Y. Zhao, and X. Zhuang, “A general model for bidirectional associative memories,” IEEE Trans. Syst. Man Cybern. B, Cybern., vol. 28, no. 4, pp. 511–519, Aug. 1998.
[17] S. Chartier and M. Boukadoum, “A bidirectional heteroassociative memory for binary and grey-level patterns,” IEEE Trans. Neural Netw., vol. 17, no. 2, pp. 385–396, Mar. 2006.
[18] S. Chartier and M. Boukadoum, “A sequential dynamic heteroassociative memory for multistep pattern recognition and one-to-many association,” IEEE Trans. Neural Netw., vol. 17, no. 1, pp. 59–68, Jan. 2006.
[19] S. Chartier, P. Renaud, and M. Boukadoum, “A nonlinear dynamic artificial neural network model of memory,” New Ideas Psychol., vol. 26, pp. 252–277, 2008.
[20] S. Chartier, G. Giguère, P. Renaud, R. Proulx, and J.-M. Lina, “FEBAM: A feature-extracting bidirectional associative memory,” in Proc. Int. Joint Conf. Neural Netw., Aug. 2007, pp. 1679–1684.
[21] G. Giguère, S. Chartier, R. Proulx, and J.-M. Lina, “Creating perceptual features using a BAM-inspired architecture,” in Proc. 29th Annu. Conf. Cogn. Sci. Soc., D. S. McNamara and J. G. Trafton, Eds., Austin, TX, 2007, pp. 1025–1030.
[22] G. Giguère, S. Chartier, R. Proulx, and J.-M. Lina, “Category development and reorganization using a bidirectional associative memory-inspired architecture,” in Proc. 8th Int. Conf. Cogn. Model., R. L. Lewis, T. A. Polk, and J. E. Laird, Eds., Ann Arbor, MI, 2007, pp. 97–102.
[23] I. Kanter and H. Sompolinsky, “Associative recall of memory without errors,” Phys. Rev. A, Gen. Phys., vol. 35, pp. 380–392, 1987.
[24] L. Personnaz, L. Guyon, and G. Dreyfus, “Collective computational properties of neural networks: New learning mechanisms,” Phys. Rev. A, Gen. Phys., vol. 34, pp. 4217–4228, 1986.
[25] M. Adachi and K. Aihara, “Associative dynamics in a chaotic neural network,” Neural Netw., vol. 10, pp. 83–98, 1997.
[26] K. Kaneko and I. Tsuda, “Chaotic itinerancy,” Chaos, vol. 13, pp. 926–936, 2003.
[27] I. Tsuda, “Towards an interpretation of dynamic neural activity in terms of chaotic dynamical systems,” Behav. Brain Sci., vol. 24, pp. 793–847, 2001.
[28] A. G. Barto, R. S. Sutton, and C. Anderson, “Neuron-like adaptive elements that can solve difficult learning control problems,” IEEE Trans. Syst. Man Cybern., vol. SMC-13, no. 5, pp. 834–846, Sep./Oct. 1983.
[29] R. Gilmore, Catastrophe Theory for Scientists and Engineers. New York: Dover, 1993.
[30] A. A. Koronovskii, D. I. Trubetskov, and A. E. Khramov, “Population dynamics as a process obeying the nonlinear diffusion equation,” Doklady Earth Sci., vol. 372, pp. 755–758, 2000.
[31] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proc. Nat. Acad. Sci., vol. 79, pp. 2554–2558, 1982.
[32] S. Chartier and R. Proulx, “NDRAM: Nonlinear dynamic recurrent associative memory for learning bipolar and nonbipolar correlated patterns,” IEEE Trans. Neural Netw., vol. 16, no. 6, pp. 1393–1400, Nov. 2005.
[33] M. H. Hassoun, “Dynamic heteroassociative neural memories,” Neural Netw., vol. 2, pp. 275–287, 1989.


[34] W. Gerstner and W. Kistler, Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[35] B. Kosko, “Unsupervised learning in noise,” IEEE Trans. Neural Netw., vol. 1, no. 1, pp. 44–57, Mar. 1990.
[36] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Mach. Learn., vol. 3, pp. 9–44, 1988.
[37] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[38] D. E. Specht, “A general regression neural network,” IEEE Trans. Neural Netw., vol. 2, no. 6, pp. 568–576, Nov. 1991.
[39] H. Davande, M. Amiri, A. Sadeghian, and S. Chartier, “Associative memory based on a new hybrid model of SFNN and GRNN: Performance comparison with NDRAM, ART2 and MLP,” in Proc. Int. Joint Conf. Neural Netw., 2008, pp. 1698–1703.
[40] M. Hagiwara, “Quick learning for bidirectional associative memory,” Inst. Electron. Inf. Commun. Eng., vol. E77-D, pp. 385–392, 1994.
[41] D. Liu, “A new synthesis approach for feedback neural networks based on the perceptron training algorithm,” IEEE Trans. Neural Netw., vol. 8, no. 6, pp. 1468–1482, Nov. 1997.
[42] M. Amiri, H. Davande, A. Sadeghian, and S. A. Seyyedsalehi, “Auto-associative neural network based on new hybrid model of SFNN and GRNN,” in Proc. Int. Joint Conf. Neural Netw., 2007, pp. 2664–2670.
[43] Z.-B. Xu, Y. Leung, and X.-W. He, “Asymmetric bidirectional associative memories,” IEEE Trans. Syst. Man Cybern., vol. 24, no. 10, pp. 1558–1564, Oct. 1994.
[44] S. E. Fahlman and C. Lebiere, “The cascade-correlation learning architecture,” Adv. Neural Inf. Process. Syst. II, pp. 524–532, 1990.

Sylvain Chartier (M'05) received the B.A. degree from the University of Ottawa, Ottawa, ON, Canada, in 1993 and the B.Sc. and Ph.D. degrees from the Université du Québec à Montréal, Montréal, QC, Canada, in 1996 and 2004, respectively, all in psychology.

He has been a Professor at the University of Ottawa since 2007 and he is currently the Director of the Laboratory for Computational Neurodynamics and Cognition (CONEC). He is the author or coauthor of over 40 journal and conference papers in the area of neural networks and quantitative methods. His research interests are in the development of unsupervised and supervised recurrent associative memories. He is also interested in nonlinear time series analysis as well as cognition, perception, and statistics.

Mounir Boukadoum (M'90–SM'05) received the M.E.E. degree in electrical engineering from the Stevens Institute of Technology, Hoboken, NJ, in 1978 and the Ph.D. degree in electrical engineering from the University of Houston, Houston, TX, in 1983.

He has been with the Université du Québec à Montréal, Montréal, QC, Canada, since 1984, where he is currently Professor of Microelectronics Engineering. He was elected Chairperson of Microelectronics Programs in 1995, and was reelected in 1998 and 2001. He is currently Director of the Microelectronics Prototyping Research Laboratory and Director of the Ph.D. Program in Cognitive Informatics. He is also an executive member of ReSMiQ. His research interests focus primarily on fluorescence-based instruments and intelligent signal processing.

Dr. Boukadoum is Vice-President of the IEEE Computational Intelligence Society chapter in Montréal.

Mahmood Amiri (M'08) received the B.Sc. and M.Sc. degrees (honors) in biomedical engineering from University of Isfahan and Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran, in 2004 and 2006, respectively. He is currently working towards the Ph.D. degree in biomedical engineering at the University of Tehran, Tehran, Iran.

His main research interests are in computational neuroscience, theoretical and practical aspects of neural networks, and neural modeling.
