

Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

Neural Network with Fuzzy Dynamic Logic

Leonid I. Perlovsky

Air Force Research Lab., 80 Scott Rd., Hanscom AFB, MA 01731
Tel. 781-377-1728; e-mail: [email protected]

Abstract - The paper describes a neural network utilizing models and extending Fuzzy Logic. The motivation is to explain the process of learning as joint model improvement and fuzziness reduction. An initial state of this neural network is highly fuzzy with uncertain knowledge; it dynamically evolves into a low-fuzzy state of certain knowledge. This neural system resembles several known mechanisms of the mind and overcomes certain long-standing difficulties in several application fields. We present an example and briefly discuss mechanisms of concepts, emotions, including aesthetic emotions, instincts, conscious, unconscious, imagination, perception, and cognition, and relate them to the introduced neural network and the novel mathematics of dynamic logic.

I. NEURAL NETWORKS, LEARNING, AND COMPLEXITY

Relating complexity and learning dates at least to the 1960s, when Bellman discussed "the curse of dimensionality" [1]. He considered learning in the context of statistical pattern recognition; learning required estimating probability distribution functions in feature spaces for various classes, and Bellman found that such estimation was often difficult in high-dimensional feature spaces. The following thirty years of developing adaptive statistical pattern recognition and neural network algorithms for learning led to a conclusion that these approaches often encountered combinatorial complexity (CC) of learning requirements: recognition of any object, it seemed, could be learned if "enough" training examples were used for algorithm learning. The required examples had to account for all possible variations of "an object", in all possible geometric positions and in combinations with other objects, sources of light, etc. The number of combinations quickly grows with the number of objects and conditions; say, a medium-complexity problem of recognizing 100 different objects in various combinations might lead to a need to learn 100! (about 10^157) combinations; this number is larger than the number of elementary particles in the Universe in its entire history. It would require a corresponding number of training examples [2].

During the 1960s and 70s a new paradigm was invented: interest shifted from learning algorithms to artificial intelligence relying on rule-based systems. Rule systems worked well when system goals, objects, and environments did not change, and when all aspects of the problem could be predetermined. When situations varied, learning was needed. For example, the first Chomsky ideas concerning the mind's mechanisms for learning language [3] were based on rules.

Some founders of rule-based artificial intelligence emphasized that rule systems cannot explain learning [4], but these warnings were not heeded. Attempts to add learning to rule systems dominated artificial intelligence for about thirty years. However, rule systems, in the presence of unexpected variability, encountered CC of rules: detailed sub-rules and sub-sub-rules, one contingent on another, had to be specified.

A new paradigm emerged in the 1980s: model-based systems were proposed to combine the advantages of learning and rules by using adaptive models. Existing knowledge was to be encapsulated in models, and unknown aspects of concrete situations were to be described by adaptive parameters. Learning consisted in estimating model parameters from training data. Complicated models could describe complex situations with relatively few unknown parameters, so that a limited amount of training data would suffice for learning [2]. In the process of learning, each model had to be associated with corresponding data or signals (we use "data" and "signals" interchangeably), so that the model parameters could be estimated. This, however, was not easy to accomplish: in complex situations, the objects or processes responsible for incoming signals, which are described by different models, are not clearly separable. For example, in a visual scene it might not be clear where different objects are and where shades are; similarly, when tracking in clutter, it is not clear which signal belongs to which track or to clutter. To find a solution, one has to define a measure of similarity between models and signals (say, mean square error), then consider many combinations between signal subsets and models, and estimate parameters by minimizing errors for each combination. These types of algorithms are called Multiple Hypothesis Testing or Multiple Hypothesis Tracking (MHT) [5, 6]. The number of subsets and the number of combinations are combinatorially large. Therefore, model-based systems encountered computational CC (N and NP complete algorithms). The CC became a ubiquitous feature of intelligent algorithms and, seemingly, a fundamental mathematical limitation.

The mathematical structure of many neural networks was shown to be equivalent to the algorithmic concepts analyzed above [7]. Among zillions of publications aimed at overcoming the CC, several fundamental mathematical ideas can be identified: Fuzzy Logic (FL), Genetic Algorithms (GA), and Hierarchical Decomposition (HD). GA attempt to conquer the CC by combinatorial accumulation of experience from generation to generation, similar to genetic evolution. However, it is not clear how to set the parameters of GA so that the system does not evolve into an evolutionary dead end. HD fights the CC by exponential reduction of complexity at multiple levels of the hierarchy.


However, there are no general rules for how to define hierarchical levels, how many are needed, and whether they have to be predetermined or evolving. Promising schemes are emerging from combining these ideas; in particular, granularity combines FL and HD [8, 9, 10].

II. COMPLEXITY AND LOGIC

Combinatorial complexity is related to the type of logic underlying various algorithms [2]. Formal logic is based on the "law of excluded middle" (also called the law of excluded third), according to which every statement is either true or false and nothing in between. Therefore, algorithms based on formal logic have to evaluate every little variation in data or models as a separate logical statement; combinations of subsets of signals have to be considered, and the large number of combinations of these variations causes combinatorial complexity. In fact, the combinatorial complexity of algorithms based on logic is related to Gödel theory: it is a finite-system manifestation of the incompleteness of logic [11]. Multivalued logic and fuzzy logic were proposed to overcome limitations related to the law of excluded third [12]. Yet the mathematics of multivalued logic is no different in principle from formal logic; excluded third is replaced by excluded n+1. Fuzzy logic encountered a difficulty related to the degree of fuzziness: if too much fuzziness is specified, the solution does not achieve the needed accuracy; if too little, it becomes similar to formal logic. FL eliminates the combinatorial complexity of subsets; this works, however, at a fixed degree of fuzziness. In complex cases fuzziness has to be properly selected at each algorithmic step in various parts of the system; this leads to combinatorial complexity of fuzziness optimization.

This paper concentrates on extending FL to Fuzzy Dynamic Logic, or, for short, Dynamic Logic (DL), which adds dynamic evolution of fuzziness to FL. At each algorithmic step in various parts of the system, the degree of fuzziness evolves according to the local similarity between models and signals, without CC.

III. MODELING FIELD THEORY (MFT) NEURAL NETWORK

The Modeling Field Theory (MFT) neural network, briefly summarized below, is based on parametric models [7]. During the learning process, new associations of input signals are formed, resulting in evolution of new concepts. Input signals {X(n)} form a field of input neuronal synapse activation levels; n = 1,..., N enumerates the input neurons and X(n) are the activation levels. A set of concept-models h = 1,..., H is characterized by the models (representations) {M_h(n)} of the signals X(n); each model depends on its parameters {S_h}, M_h(S_h, n). In a highly simplified description of a visual cortex, n enumerates the visual cortex neurons, X(n) are the "bottom-up" activation levels of these neurons coming from the retina through the visual nerve, and M_h(n) are the "top-down" activation levels (or priming) of the visual cortex neurons from previously learned object-models [13].

The learning process attempts to "match" these top-down and bottom-up activations by selecting the "best" models and their parameters. Mathematically, learning increases a similarity measure between the sets of models and signals, L({X(n)}, {M_h(n)}). The similarity measure is a function of model parameters and of associations between the input synapses and concept-models. It is constructed in such a way that any of a large number of objects can be recognized, no matter whether they appear on the left or on the right. Correspondingly, the similarity measure is designed so that it treats each concept-model as an alternative for each subset of signals. Models represent expected data in the following sense: each model depends on its parameters {S_h}, and for certain values of these parameters the model is the conditional expectation for the data:

M_h(S_h, n) = E{ X(n) | h }.   (1)

In the r.h.s. of this equation, h is a hypothesis (about the source of the data, say an object and its orientation, etc.) and E{·} is the statistical expectation (please note, for shortness, we use the same notation h on the l.h.s. to denote just a model number, not specific values of the parameters).

For these parameter values, the model M_h(S_h, n) is the expected value of the signal sample n, conditioned on the hypothesis h. The similarity measure can be defined as a likelihood; in the absence of a priori knowledge about signal-model associations, it treats each model as an alternative for each subset of signals:

L({X}, {M}) = Π_n Σ_h r(h) l(X(n) | M_h(n));   (2)

here, l(X(n) | M_h(n)) (or simply l(n|h)) is a conditional partial similarity between one signal sample X(n) and one model M_h(n). The parameters r(h) are proportional to the number of signals {n} associated with the model h and are normalized as

Σ_h' r(h') = 1.   (3)

In probability theory, r(h) are called priors; we call them class rates, or rates for short. They are usually unknown and have to be estimated along with the other model parameters. For L to be interpreted as a likelihood for the entire set of signals {X(n)}, the l(n|h) must be defined as conditional probability distribution functions (pdf). This probabilistic interpretation is valid if we use correct values for the model parameters; otherwise, these similarity measures can be considered as fuzzy similarities, parametrically dependent on the model parameters.
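For concreteness, a minimal numerical sketch of similarity (2) with normalization (3) is given below. The conditional similarities and rates are hypothetical example values, since their parametric form is only introduced later in the paper.

```python
import numpy as np

# Hypothetical example: N = 4 signal samples, H = 2 concept-models.
# l_table[n, h] stands for the conditional similarity l(X(n) | M_h(n)).
l_table = np.array([[0.9, 0.1],
                    [0.8, 0.3],
                    [0.2, 0.7],
                    [0.1, 0.9]])

# Class rates r(h), normalized so that sum_h r(h) = 1, as in (3).
r = np.array([0.5, 0.5])

# Similarity (2): product over samples n of the sum over models h of
# r(h) * l(n|h); every model is treated as an alternative for every sample.
L = np.prod(l_table @ r)
print(L)
```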


In probability and statistics, a product of probabilities or pdfs is used for independent random variables. Expression (2) is different in this regard: although it contains a product over signal samples n, the signal samples are not assumed independent. Dependencies among {X(n)} are due to the models M_h(n); each model determines conditional expected values of many signals. Note that all possible combinations of signals and models are accounted for in this expression. This can be seen by expanding the sum in (2) and multiplying all the terms; it would result in H^N items. This is the number of combinations between all signals (N) and all models (H). Here is the source of the CC of many algorithms used in the past, related to the idea of multiple hypothesis testing (MHT). MHT attempted to maximize L over model parameters and associations between signals and models in two steps: first by maximizing each of the H^N items over model parameters, and second by selecting the largest item. Such a program inevitably faces a wall of CC.
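To see where the H^N count comes from, one can expand the product of sums in (2) by brute force. The toy enumeration below (with made-up sizes) lists every assignment of one model to each signal sample, which is exactly what an MHT-style search would have to consider:

```python
from itertools import product

N, H = 5, 3                       # toy sizes: 5 signal samples, 3 models
assignments = list(product(range(H), repeat=N))
print(len(assignments), H ** N)   # 243 243: one term per signal-to-model assignment
```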

The likelihood interpretation of similarity measure (2) has an undesirable aspect: it requires that, for some values of the model parameters, the models are exact conditional expectations of the data in the sense of (1). In many applications, however, models are in principle approximate representations of reality. It is therefore more satisfactory to use another type of similarity measure, inspired by the concept of mutual information. Maximization of mutual information leads to extraction of maximum information from the data, given the available models. Such a similarity measure can be defined as follows (to preserve a notional correspondence between information and a logarithm of likelihood, we define a log-similarity measure, LL_in, related to information):

LL_in({X}, {M}) = Σ_n abs(X(n)) ln[ Σ_h r(h) l(X(n) | M_h(n)) ].   (4)

Formally, this expression can be considered a logarithm of (2) with the following modification: every data point n is accounted for with a weight proportional to the strength of the signal at this point. The relationship of this expression to mutual information in the models about the data is considered in [7].
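A short numerical sketch of (4) follows; the signal values, conditional similarities, and rates are again hypothetical, and the only new element relative to the previous sketch is the weighting by abs(X(n)).

```python
import numpy as np

# Sketch of the information-style log-similarity (4): each sample n is
# weighted by the signal strength abs(X(n)).  All values below are made up.
X = np.array([1.2, -0.4, 0.9, 2.1])            # signal samples X(n)
l_table = np.array([[0.9, 0.1],                # l(X(n) | M_h(n))
                    [0.8, 0.3],
                    [0.2, 0.7],
                    [0.1, 0.9]])
r = np.array([0.5, 0.5])                       # class rates r(h)

LL_in = np.sum(np.abs(X) * np.log(l_table @ r))
print(LL_in)
```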

IV. FUZZY DYNAMIC LOGIC (DL)

Fuzzy dynamic logic [14] maximizes similarity (2) or (4) without combinatorial complexity as follows. First, we define conditional similarities l(n|h) in such a way that their degrees of fuzziness correspond to the accuracies of the models. This could be done in several ways, for example by using a Gaussian parameterization for the conditional similarities,

l(n|h) = (2π)^(-d/2) (det C_h)^(-1/2) exp{ -0.5 (X(n) - M_h(n))^T C_h^(-1) (X(n) - M_h(n)) }.   (5)

Here, d is the dimensionality of the vectors X and M, and C_h is the covariance matrix. If model M_h(n) matches signals X(n) well, then covariance C_h is small, and expression (5) is close to a delta-function (low fuzziness). If the model does not match the signals, then covariance C_h is large, and expression (5) is a wide distribution (highly fuzzy). The value of the covariance is to be estimated along with the other parameter values. We must note that this parameterization is very different from the standard 'Gaussian assumption'. We do not assume that the signals are Gaussian. Expression (5) assumes only that deviations between signals and models, (X(n) - M_h(n)), are Gaussian. As will be clear from the following, this is only needed for a single model, the correct model for signal n. And even this assumption is not necessary; later we will consider modifications for the general case.
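A direct sketch of the Gaussian conditional similarity (5) is given below; the sample, model value, and covariances used in the example calls are made-up numbers chosen only to contrast a nearly crisp similarity with a highly fuzzy one.

```python
import numpy as np

def gaussian_similarity(x, m, C):
    """Conditional partial similarity (5): a Gaussian density of the
    deviation between a signal sample x = X(n) and a model value m = M_h(n),
    with covariance C = C_h.  Only the deviation is assumed Gaussian."""
    d = x.shape[0]
    diff = x - m
    inv_C = np.linalg.inv(C)
    norm = (2.0 * np.pi) ** (-d / 2.0) * np.linalg.det(C) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ inv_C @ diff)

# Small covariance -> sharp, low-fuzzy similarity; large covariance -> fuzzy.
x = np.array([1.0, 2.0])
m = np.array([1.1, 1.9])
print(gaussian_similarity(x, m, 0.01 * np.eye(2)))   # nearly crisp
print(gaussian_similarity(x, m, 10.0 * np.eye(2)))   # highly fuzzy
```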

Second, we introduce fuzzy class memberships f(h|n), which define the certainty of the fuzzy logical statement that signal n originated from an object represented by model h,

f(h|n) = r(h) l(n|h) / Σ_h' r(h') l(n|h').   (6)

These variables give a measure of correspondence between signal X(n) and model M_h, normalized within [0,1]. The definition (6) formally follows the Bayesian expression for a posteriori probabilities. We should remember, however, that the probabilistic interpretation is valid if we use correct values for the model parameters; otherwise the fuzzy class memberships f(h|n) can be called estimated probabilities.
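The membership computation (6) is a simple normalization over models; a minimal sketch follows, again with hypothetical conditional-similarity values.

```python
import numpy as np

def memberships(l_table, r):
    """Fuzzy class memberships (6): f(h|n) = r(h) l(n|h) / sum_h' r(h') l(n|h').
    Rows are signal samples n, columns are models h; each row sums to 1."""
    weighted = l_table * r                       # r(h) * l(n|h)
    return weighted / weighted.sum(axis=1, keepdims=True)

# Hypothetical conditional similarities for 3 samples and 2 models.
l_table = np.array([[0.9, 0.1],
                    [0.5, 0.5],
                    [0.1, 0.9]])
print(memberships(l_table, np.array([0.5, 0.5])))
```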

Third, we define a dynamic logic process leading to estimation of the class membership functions and model parameters, which maximize similarity (2) without combinatorial complexity. First, consider a straightforward maximization of the similarity (or, equivalently, of its logarithm) over the parameter values S_h and C_h^(-1) by gradient ascent (for now, we consider the class rates r(h) known),

dS_h/dt = ∂ln L / ∂S_h = Σ_n f(h|n) [∂ln l(n|h) / ∂M_h] [∂M_h / ∂S_h];   (7)

dC_h^(-1)/dt = ∂ln L / ∂C_h^(-1) = Σ_n f(h|n) [∂ln l(n|h) / ∂C_h^(-1)].   (8)
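The following Python sketch shows one explicit Euler step of the gradient equations (7)-(8) under simplifying assumptions that are not from the paper: each model is taken as a constant mean vector, M_h(S_h, n) = S_h (so ∂M_h/∂S_h is the identity), the conditional similarities are the Gaussians of (5), and dt is an arbitrary step size.

```python
import numpy as np

def dl_gradient_step(X, S, C, f, dt=0.1):
    """One explicit-Euler step of (7)-(8), specialized to constant-mean models
    M_h(S_h, n) = S_h with Gaussian similarities (5).
    X: (N, d) signals; S: (H, d) model means (float); C: (H, d, d) covariances;
    f: (N, H) fuzzy memberships f(h|n); dt: hypothetical step size."""
    for h in range(S.shape[0]):
        C_inv = np.linalg.inv(C[h])
        diff = X - S[h]                      # deviations X(n) - M_h(n), shape (N, d)
        # (7): dS_h/dt = sum_n f(h|n) [d ln l(n|h)/dM_h] [dM_h/dS_h]
        #             = C_h^{-1} sum_n f(h|n) (X(n) - S_h)
        S[h] = S[h] + dt * C_inv @ (f[:, h] @ diff)
        # (8): d ln l(n|h)/dC_h^{-1} = 0.5 (C_h - diff_n diff_n^T) for Gaussian (5)
        grad = sum(f[n, h] * 0.5 * (C[h] - np.outer(diff[n], diff[n]))
                   for n in range(X.shape[0]))
        C[h] = np.linalg.inv(C_inv + dt * grad)
    return S, C
```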

If this internal process operates on a given set of signals {X(n)} much faster than the external real time of the system operation, we can consider the dynamic logic process independent from the external environment.


In this case, the convergence of processes (7) and (8) is assured as long as the similarity is finite and its maximum is attained within a bounded region of parameter values. The first condition holds according to definitions (2) and (5), as long as none of the covariances turn to 0. In fact, C_h = 0 is a maximum of similarity measures (2) and (5). It is a degenerate non-fuzzy (crisp) solution in which just a few data points belong to class h (if M_h depends on p(h) parameters S_h, a degenerate solution might contain exactly p(h) data points). Such solutions are usually undesirable and must be prevented by limiting the minimal values of the covariances. The second condition can be imposed by limiting the maximal values of the parameters. Therefore we impose the following conditions:

C_h > C_h^min;   abs(S_h) < S_h^max.   (9)

Here the C_h^min and S_h^max values are based on the physical meanings of the parameters. For example, C_h^min might be determined by errors of the sensor measurements responsible for the data X(n). The procedure described above guarantees convergence to a local maximum of the similarity measure.

Qualitatively, the dynamic logic process can be characterized as follows. It is initiated using inaccurate models, large covariance matrices, and highly fuzzy membership functions. The process dynamics leads to improved parameter estimation, improved models, small covariances, and low-fuzzy membership functions (nearly crisp logical statements about the signal class memberships). We would like to emphasize again that in the dynamic logic process, the fuzziness of the class memberships f(h|n) is matched to the uncertainties of the model parameters S_h. This is due to the fact that the covariances in eq. (8) are defined by mismatches between the data and the models. This correspondence is maintained at every moment (every iteration step) in the dynamic process, and locally for each model (if model h matches its corresponding data, then its covariance is small, and the membership functions for its data are close to 1).
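A simplified end-to-end sketch of this vague-to-crisp process is given below for illustration only. It assumes constant-mean models, and instead of integrating the continuous-time gradient ascent (7)-(8) it alternates the membership computation (6) with the fixed point obtained by setting those gradients to zero (an EM-like re-estimation), with a covariance floor in the spirit of (9); the rate update simply makes r(h) proportional to the memberships associated with model h. All sizes and constants are illustrative.

```python
import numpy as np

def dynamic_logic(X, H, n_iter=50, C_min=1e-2):
    """Simplified dynamic-logic sketch for constant-mean models: start highly
    fuzzy (large covariances), then alternate memberships (6) with parameter
    re-estimation (zero of gradients (7)-(8)), enforcing a covariance floor."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    S = X[rng.choice(N, size=H, replace=False)].astype(float)        # initial means
    C = np.stack([10.0 * np.cov(X.T) + np.eye(d) for _ in range(H)]) # fuzzy start
    r = np.full(H, 1.0 / H)                                          # class rates
    for _ in range(n_iter):
        # conditional similarities (5)
        l = np.empty((N, H))
        for h in range(H):
            diff = X - S[h]
            inv_C = np.linalg.inv(C[h])
            norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(C[h]) ** (-0.5)
            l[:, h] = norm * np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv_C, diff))
        # fuzzy class memberships (6)
        f = l * r
        f /= f.sum(axis=1, keepdims=True)
        # re-estimation of means, covariances, and rates
        for h in range(H):
            w = f[:, h]
            S[h] = w @ X / w.sum()
            diff = X - S[h]
            C[h] = (w[:, None] * diff).T @ diff / w.sum() + C_min * np.eye(d)
        r = f.mean(axis=0)
    return S, C, r, f

# Hypothetical usage: two clusters of 2-D points standing in for two concepts.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
S, C, r, f = dynamic_logic(X, H=2)
```

As the iterations proceed, the covariances shrink and the memberships f(h|n) approach 0 or 1, mirroring the evolution from fuzzy to nearly crisp statements described above.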

V. IMAGE RECOGNITION EXAMPLE

Finding patterns below noise can be an exceedingly complex problem. If an exact pattern shape is not known and depends on unknown parameters, these parameters should be found by fitting the pattern model to the data. However, when the location and orientation of the patterns are not known, it is not clear which subset of the data points should be selected for fitting. A standard approach for solving this kind of problem, as discussed, is multiple hypothesis testing (MHT) [5, 6], which faces combinatorial complexity. In this example, we are looking for 'smile' and 'frown' patterns in noise; each pattern is characterized by a 3-parameter parabolic shape. The image size in this example is 100x100 points, and the true number of patterns is 3, which is not known.

Therefore, at least 4 patterns should be fit to the data, to decide that 3 patterns fit best. Fitting 4x3 = 12 parameters to a 100x100 grid by brute-force MHT would take about (10^6)^12 = 10^72 operations, a prohibitive computational complexity.

To apply MFT and dynamic logic to this problem, we need to develop parametric adaptive models of the expected patterns. We use a uniform model for noise, Gaussian blobs for highly fuzzy, poorly resolved patterns, and parabolic models for 'smiles' and 'frowns'. The conditional partial similarities l(n|h) for these models do not depend on the measured data directly; they predict the expected data in every image pixel n. The parabolic models and similarities are given by

l(n|h) = r_h Σ_k G(n | M_h(n,k), C_h);
M_h(n,k) = n0_h + a_h (k², k);   (10)

here n is a 2-D pixel position within the image; M_h(n,k) is a pattern shape in the image; k are points along the pattern; the item (k², k) is a 2-D vector in the image plane determining the parabolic shape; a_h is the curvature of the shape; n0_h is the center position of the shape; G are Gaussian shapes; the sum over k gives the pattern amplitude (height), represented as a sum of Gaussians along the pattern 2-D shape; and r_h is the pattern amplitude. Every parabolic model has 4 adaptive parameters found from the data: r_h, n0_h, a_h. The number of computer operations in this example (without optimization) was about 10^10. This formulation is fairly general for finding image or signal patterns in noise; only the specific shape of the parabolic models M_h in (10) is specific to this example and should be modified according to the expected pattern 2-D shape.
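A sketch of the parabolic pattern model (10) follows: the expected amplitude at pixel n is a sum of Gaussian blobs placed along a parabola with center n0_h and curvature a_h. The isotropic blob width, the range and number of k points, and all parameter values in the example call are illustrative assumptions, not values from the paper.

```python
import numpy as np

def parabolic_expected_field(image_shape, n0, a, r_h, sigma=1.5, K=20):
    """Expected-data field l(n|h) from (10): r_h * sum_k G(n | M_h(n,k), C_h),
    with M_h(n,k) = n0 + a*(k**2, k) tracing a parabolic 2-D shape.
    Returns an array over all pixels n of the image grid."""
    ny, nx = image_shape
    yy, xx = np.mgrid[0:ny, 0:nx]                    # pixel coordinates n
    field = np.zeros(image_shape)
    ks = np.linspace(-1.0, 1.0, K)                   # points k along the pattern
    for k in ks:
        m = np.asarray(n0) + a * np.array([k ** 2, k])   # M_h(n, k)
        g = np.exp(-0.5 * ((yy - m[0]) ** 2 + (xx - m[1]) ** 2) / sigma ** 2)
        field += g / (2 * np.pi * sigma ** 2)            # isotropic Gaussian blob
    return r_h * field

expected = parabolic_expected_field((100, 100), n0=(50, 50), a=30.0, r_h=1.0)
```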

The figure below illustrates a solution of this problem using the above models and dynamic logic: (a) true 'smile' and 'frown' patterns shown without noise; (b) actual image available for recognition (signal is below noise, signal-to-noise ratio is between -2 dB and -0.7 dB); (c) an initial fuzzy model, where the large fuzziness corresponds to uncertainty of knowledge; (d) through (h) show improved models at various iteration stages (a total of 22 iterations). Every five iterations the algorithm tried to increase or decrease the number of pattern-models. Between iterations (d) and (e) the algorithm decided that it needs three Gaussian models for the 'best' fit. There are three types of models: one uniform model describing noise (it is not shown) and a variable number of blob models and parabolic models, whose number, location, and curvature are estimated from the data. Until about stage (g) the algorithm used simple blob models; at (g) and beyond, the algorithm decided that it needs the more complex parabolic models to describe the data. Iterations stopped at (h), when the similarity measure stopped increasing.


[Figure: panels (a) through (h) of the image recognition example described in Section V.]

VI. ARISTOTLE, ZADEH, AND FUZZY DYNAMIC LOGIC

In a seminal paper in 1965, Lotfi Zadeh introduced fuzzy logic [15] to mirror the pervasive imprecision of the real world by providing a model for human reasoning in which everything, including truth, is a matter of degree. This is widely believed to be a sharp break with the traditions of classical Aristotelian logic. It is interesting, therefore, to note that the original Aristotelian thinking might have been closer to the fuzzy logic of Zadeh than usually appreciated. Aristotle closely tied logic to language; he emphasized that logical statements should not be formulated too specifically, otherwise meaning might be lost: "language contains necessary means for appropriate formulation of logical statements" and "common sense must be used to do it" [16]. However, Aristotle also formulated the "law of excluded third", which contradicted the uncertainty of language. This contradiction was noted in the 19th century by Boole, who decided to exclude any uncertainty from logic. A great school of logic formalization emerged, promising in the eyes of many to completely and forever formalize scientific discourse [17].

Soon, however, Russell discovered disturbing inconsistencies, and thirty years later Gödel proved that the entire enterprise was basically flawed [18, 19].

Returning to Aristotle, we note that he saw logic as an instrument of public discourse, a way to correctly argue for conclusions which have already been obtained by other means. This is clearly seen, for example, from [20], where he lists logical arguments that should be used in public speeches for or against any of dozens of political issues.

He never left any impression that logic was a mechanism of obtaining truth. Logic is a tool of politics, not of science. When Aristotle was seeking an explanation of human thinking, he developed a theory of Forms [16].

process in which "a priori Forms meet matter", that thisprocess is the foundation for all our experience, thisprocess creates concepts with which our mind thinks, andthat a priori Forms exist in our minds in a different formthan the 'final' form of concepts used for thinking. It is myopinion that if Aristotle new fuzzy logic, he might havesaid that the a priori Forms are fuzzy, the 'final' conceptsconsciously used by the mind are crisp (or low-fuzzy), andthe process by which "a priori Forms meet matter" is givenby fuzzy dynamic logic.

Returning from guesses about Aristotle to summarizing the results of the current paper: fuzzy dynamic logic was formulated. It is a process in which a priori fuzzy and uncertain knowledge (models) "meets" signals about the real world. The dynamic logic interaction between knowledge and signals leads to better knowledge without combinatorial complexity. An example was shown where patterns below noise were recognized in image data.

ACKNOWLEDGMENT

The author is grateful to R. Brockett, R. Linnehan, C. Mutz, J. Schindler, and B. Weijers for discussions and contributions.

REFERENCES


[1] Bellman, R.E. (1961). Adaptive Control Processes. Princeton University Press, Princeton, NJ.
[2] Perlovsky, L.I. (1998). Conundrum of Combinatorial Complexity. IEEE Trans. PAMI, 20(6), pp. 666-670.
[3] Chomsky, N. (1972). Language and Mind. Harcourt Brace Jovanovich, New York, NY.
[4] Minsky, M.L. (1975). A Framework for Representing Knowledge. In The Psychology of Computer Vision, ed. P.H. Winston, McGraw-Hill, New York, NY.
[5] Singer, R.A., Sea, R.G., and Housewright, R.B. (1974). Derivation and Evaluation of Improved Tracking Filters for Use in Dense Multitarget Environments. IEEE Transactions on Information Theory, IT-20, pp. 423-432.
[6] Blackman, S.S. (1986). Multiple Target Tracking with Radar Applications. Artech House, Norwood, MA.
[7] Perlovsky, L.I. (2001). Neural Networks and Intellect: Using Model-Based Concepts. Oxford University Press, New York, NY.
[8] Zadeh, L.A. (2002). Toward a Perception-Based Theory of Probabilistic Reasoning with F-Granular Probability Distributions. BISC Seminar, UC Berkeley.
[9] Pedrycz, W. (2001). Granular Computing: An Emerging Paradigm. Physica-Verlag.
[10] Caulfield, H.J. et al. (2002). Proc. 6th Joint Conf. on Information Sciences, Assoc. for Intelligent Machinery, Durham, NC.
[11] Perlovsky, L.I. (1996). Gödel Theorem and Semiotics. Proceedings of the Conference on Intelligent Systems and Semiotics '96, Gaithersburg, MD, v. 2, pp. 14-18.
[12] Pedrycz, W., Ruspini, E., Bonissone, P. (1998). Handbook of Fuzzy Computing. Oxford University Press.
[13] In fact there are many levels between the retina, visual cortex, and object-models.
[14] Perlovsky, L.I., Mutz, C., Weijers, B., Linnehan, R., Schindler, J., Brockett, R. (2004). Synthesis of Formal and Fuzzy Logic to Detect Patterns in Clutter. IEEE International Conference on Computational Intelligence for Measurement Systems and Applications (CIMSA'04), Boston, MA.
[15] Zadeh, L.A. (1965). Fuzzy Sets. Information and Control, vol. 8, pp. 338-353.
[16] Aristotle (IV BC). Metaphysics. In The Complete Works of Aristotle, ed. J. Barnes, Bollingen Series, Princeton, NJ, 1995.
[17] Hilbert, D. (1900). See The Honors Class: Hilbert's Problems and Their Solvers, Ben Yandell, AK Peters Ltd, Natick, MA, 2003.
[18] Russell, B. & Whitehead, A.N. (1908). Principia Mathematica. Cambridge University Press, 1962.
[19] Gödel, K. (1986). Kurt Gödel: Collected Works, Vol. I (ed. S. Feferman et al.). Oxford University Press, Oxford.
[20] Aristotle (IV BC). Rhetoric to Alexander. In The Complete Works of Aristotle, ed. J. Barnes, Bollingen Series, Princeton, NJ, 1995.
