Rule Evaluation on a Motorola SIMD

MOTOROLA CONFIDENTIAL PROPRIETARY

RULE EVALUATION ON A MOTOROLA SIMD

Melti n Bell: 512-505-8125, [email protected]

&

Rod Goke: 512-505-8121, [email protected]

Motorola Parallel Scalable Processors/Center for Emerging Computer Technology 505 Barton Spgs. Rd. Suite 1055, MD: F30, Austin, TX 78704

FAX: 512-505-8100

ABSTRACT

Fuzzification, rule evaluation and defuzzification in most fuzzy logic systems are computationally expensive tasks. Many systems using a sequential processor will scan the rules/knowledge base and fetch or recompute the fuzzy inputs even if one of them is zero. Due to the nature of fuzzy AND-OR inference processing, this leads to unnecessary fetches and/or computations negatively impacting execution time and hardware resources. This paper presents an algorithm applied to the Association Engine (AE) Single Instruction Multiple Data (SIMD) machine that attempts to make this fuzzy inference process more efficient by minimizing the number of fetches and computations when fuzzy inputs are zero. Although this algorithm may be applied to fuzzy logic systems using sequential processors, analyzing the fuzzy inputs before scanning the knowledge base will highlight the scalable computing power of the AE as well as support Motorola's data oriented processing excellence in the fuzzy logic market.

BACKGROUND

Although fuzzy logic has been around for more than 20 years, it has taken a long time for it to gain acceptance in the engineering community. Over time, many people have addressed the potential drawbacks of fuzzy logic so that it is now seen as an invaluable tool in many of today's systems. Even though fuzzy logic is not generally suited for use in linear systems, it is projected that the fuzzy logic market will increase by 76% every year into a billion dollar business through 1998 [St93]. The factors responsible for such market projections are related to what makes fuzzy logic invaluable in many nonlinear systems: faster and lower cost development, adaptiveness, smoother and simpler controls, fault tolerance, improved product performance, maintenance and extensibility, etc.

Fuzzy logic is also popular because it more closely emulates our reasoning abilities and knowledge modelling capabilities than traditional boolean logic systems [Ba93]. The first stage of a typical fuzzy logic system, fuzzification, deals with finding the degree/grade to which crisp system inputs fit within the membership functions (MF) of the fuzzy inputs. The second part, rule evaluation or fuzzy inference, uses these fuzzy input grades and the rules describing the desired behavior of the system to produce fuzzy output grades. This is the key stage of the process that models our knowledge reasoning capabilities and, consequently, is responsible for much of the computation in most fuzzy systems. The last part of a fuzzy logic system, defuzzification, takes the fuzzy output data of the second stage and converts it into a crisp output.

The process of taking the usually small set of fuzzy input grades and combining them with the rules for producing fuzzy outputs closely matches our reasoning abilities and partly explains why fuzzy logic systems often take less code and/or execute faster than traditional boolean logic systems. The basis for this second stage of the fuzzy logic process is the fuzzy MIN-MAX inference method most frequently applied to fuzzy set logical computation [Ar92]. This method computes the fuzzy AND of multiple fuzzy input grades by taking the minimum grade of each individual fuzzy input type used in a rule. The rule weight giving the grade of one of the fuzzy outputs for such a rule is the same as the minimum grade value of the fuzzy inputs. The method then computes the fuzzy OR of multiple rule weights by taking the maximum of the rule weights associated with a particular fuzzy output. Mathematically, this method may be summarized by

• fuzzy_out_typeX.ruleY = MIN(ruleY.fuzzy_in_type1.grade ... ruleY.fuzzy_in_typeN.grade)

• fuzzy_out_typeX = MAX(fuzzy_out_typeX.rule1 ... fuzzy_out_typeX.ruleN)

where rule.fuzzy_in_type.grade is the grade of a particular fuzzy input type associated with a rule, fuzzy_out_type.rule is the grade of a fuzzy output type associated with a particular rule and fuzzy_out_type is the highest grade for a particular fuzzy output type.
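The two MIN-MAX formulas can be illustrated with a minimal Python sketch. This is not AE code; the function name, the rule representation (a tuple of input type names plus one output type name) and the sample grades are all assumptions made for the illustration.

```python
def min_max_inference(rules, input_grades):
    """Return {output_type: weight} using fuzzy AND = min, fuzzy OR = max."""
    output_weights = {}
    for input_types, output_type in rules:
        # Fuzzy AND: the rule weight is the smallest grade among its inputs.
        rule_weight = min(input_grades[name] for name in input_types)
        # Fuzzy OR: keep the largest rule weight seen for this output type.
        previous = output_weights.get(output_type, 0.0)
        output_weights[output_type] = max(previous, rule_weight)
    return output_weights


# Hypothetical grades and rules, chosen only to exercise the two formulas.
grades = {"in_a": 0.25, "in_b": 0.75, "in_c": 0.5}
rules = [
    (("in_a", "in_c"), "out_x"),
    (("in_b", "in_c"), "out_x"),
    (("in_a", "in_b"), "out_y"),
]
weights = min_max_inference(rules, grades)
```

Here out_x receives the larger of its two rule weights, and out_y receives the weight of its single rule.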

MOTIVATION

Many fuzzy logic systems spend most of their computation time in the fuzzy inference stage because of the large number of fuzzy inputs and rules that must be scanned during the fuzzy AND-OR operations. Since a fuzzy input grade of zero for a rule means a corresponding zero fuzzy output value for that rule, and 75% of the fuzzy input grades of many fuzzy systems characteristically have zero values, significant computation time and resources are wasted scanning the rules and performing fuzzy AND-OR/MIN-MAX operations on zero values. This paper addresses this significant drawback of typical fuzzy logic systems with an algorithm written for a Motorola SIMD that improves the performance factor directly impacting Motorola's ability to successfully compete in the expanding fuzzy logic market.

The example fuzzy logic application for this algorithm is the Inverted Pendulum Problem, while the target architecture is the AE. The Inverted Pendulum Problem fuzzy logic parameters are given in the following section and derived in the reference [Ko92]. The section after the Inverted Pendulum Problem description gives information on the AE related to the example. The next section covers the specifics of the algorithm itself (the sorting of the fuzzy inputs, the representation of the rules/knowledge base format, and the knowledge base scanning/generation of fuzzy outputs) and illustrates data oriented processing's effect on algorithm design. The section following the algorithm description analyzes and summarizes the performance of this algorithm for the Inverted Pendulum Problem as well as larger fuzzy logic applications. The last section acknowledges those who have contributed to this paper.

INVERTED PENDULUM PROBLEM

Balancing an inverted pendulum in two dimensions is a classic control problem. A motor is used to move the base of the inverted pendulum. Motion in only one dimension is assumed for this example to simplify the problem to two inputs. These inputs are the angle the pendulum makes with the vertical (A) and the per second rate at which the angle changes (AC). The positive or negative amount of current (C) supplied to the motor is the output that will balance the pendulum. The system is shown in the following figure:

Figure 1: Inverted Pendulum (labels: PENDULUM, MOTOR)

There are seven triangular membership functions per input for this example. Three of the membership functions represent positive values: Positive_Large (PL), Positive_Medium (PM), and Positive_Small (PS). Three more membership functions represent negative values: Negative_Large (NL), Negative_Medium (NM) and Negative_Small (NS). The last membership function is Zero (ZZ). Each edge of these membership functions is prohibited from overlapping with more than one other membership function edge so that each crisp system input will be described by no more than 2 nonzero fuzzy inputs (out of 7 possible). Although three points are enough to define triangular membership functions, four points (P1, P2, P3, and P4) are used in this example so that the application will be general enough to be applied to fuzzy logic systems using trapezoidal membership functions as well as triangular ones. Unlike the input membership functions, singletons are used for the seven output membership functions (PL, PM, PS, ZZ, NL, NM, NS) so that only one point (P1) is needed.

With the input and output membership functions defined, common sense and some engineering analysis may be used to generate the rules and membership function point values describing the behavior of the system. For example, if the pendulum falls to the right, a negative current should make the motor compensate. Conversely, if the pendulum falls to the left, the output current should be positive. If the pendulum is balanced at the vertical, the output current should be zero. The full set of rules describing the behavior of the system follows:

(1) IF A IS NL AND AC IS ZZ THEN C IS PL

(2) IF A IS NM AND AC IS ZZ THEN C IS PM

(3) IF A IS NS AND AC IS ZZ THEN C IS PS

(4) IF A IS NS AND AC IS PS THEN C IS PS

(5) IF A IS ZZ AND AC IS NL THEN C IS PL

(6) IF A IS ZZ AND AC IS NM THEN C IS PM

(7) IF A IS ZZ AND AC IS ZZ THEN C IS ZZ

(8) IF A IS ZZ AND AC IS PS THEN C IS NS

(9) IF A IS ZZ AND AC IS PM THEN C IS NM

(10) IF A IS ZZ AND AC IS PL THEN C IS NL

(11) IF A IS PS AND AC IS NS THEN C IS NS

(12) IF A IS PS AND AC IS ZZ THEN C IS NS

(13) IF A IS PM AND AC IS ZZ THEN C IS NM

(14) IF A IS PL AND AC IS ZZ THEN C IS NL

(15) IF A IS ZZ AND AC IS NS THEN C IS PS

The following tables apply engineering analysis techniques for relating the crisp system input or output points to their respective membership functions:

Table 1: ANGLE MF POINTS

MF    P1    P2    P3    P4
NL   -90   -90   -54   -36
NM   -54   -36   -36   -16
NS   -36   -19   -18     0
ZZ   -18     0     0   +20
PS     0   +17   +18   +36
PM   +18   +36   +36   +56
PL   +36   +56   +90   +90

Table 2: ANGLE CHANGE MF POINTS

MF    P1    P2    P3    P4
NL   -90   -90   -72   -49
NM   -72   -49   -48   -25
NS   -48   -25   -24    -1
ZZ   -24    -1     0   +23
PS     0   +23   +24   +47
PM   +24   +47   +48   +71
PL   +48   +71   +90   +90

Table 3: CURRENT MF POINTS

MF    P1
NL   -18
NM   -12
NS    -6
ZZ     0
PS    +6
PM   +12
PL   +18

To summarize, the Inverted Pendulum Problem may be described as a 2-input, 1-output fuzzy logic system with 7 membership functions per input or output, a maximum of 4 nonzero fuzzy inputs and a total of 15 rules.
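As a rough illustration of how the four points P1 through P4 define a membership function, the following Python sketch grades a crisp angle change value against the points of Table 2. It is not AE code. The piecewise-linear trapezoid shape, the grade scale of 0.0 to 1.0, and the names trapezoid_grade and fuzzify are assumptions made for the illustration; the paper does not state how the fuzzification stage scales its grades.

```python
# Angle change membership function points (P1, P2, P3, P4) from Table 2.
ANGLE_CHANGE_MF = {
    "NL": (-90, -90, -72, -49),
    "NM": (-72, -49, -48, -25),
    "NS": (-48, -25, -24, -1),
    "ZZ": (-24, -1, 0, 23),
    "PS": (0, 23, 24, 47),
    "PM": (24, 47, 48, 71),
    "PL": (48, 71, 90, 90),
}


def trapezoid_grade(x, points):
    """Grade of crisp value x in a trapezoid rising P1..P2, flat P2..P3, falling P3..P4."""
    p1, p2, p3, p4 = points
    if x < p1 or x > p4:
        return 0.0
    if p2 <= x <= p3:
        return 1.0  # also covers the shoulders where P1 == P2 or P3 == P4
    if x < p2:
        return (x - p1) / (p2 - p1)
    return (p4 - x) / (p4 - p3)


def fuzzify(x, table):
    """Return the grade of x for every membership function in the table."""
    return {name: trapezoid_grade(x, pts) for name, pts in table.items()}


grades = fuzzify(10, ANGLE_CHANGE_MF)
nonzero = {name for name, g in grades.items() if g > 0}
```

For an angle change of 10, only ZZ and PS receive nonzero grades, which matches the statement that a crisp input is described by no more than 2 nonzero fuzzy inputs.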

THE ASSOCIATION ENGINE

The AE is a single-chip SIMD coprocessor intended for data oriented processing environments and parallel computing applications requiring significant compute power, such as pattern recognition, image compression and decompression, neural networks, and fuzzy logic [AE93]. Although many AEs may be linked together in arrays for MIMD and/or large SIMD processing, only one AE is required for the Inverted Pendulum example. This example will demonstrate the scalar engine, which handles sequential program execution, process control, exception processing and other traditional scalar operations, as well as the vector engine, consisting of 64 processing elements (PEs) for efficient execution of parallel or vector processing algorithms. The following figures show all of the major AE modules explained in this section:

Figure 2: Modules of the AE

Figure 3: A Vector Engine Row

Figure 4: The Scalar Engine PE

Each of the scalar and vector PEs (65 per AE) contains a dedicated 8-bit ALU, enabling each AE to deliver 1.3 billion signed, unsigned or multibyte operations per second at a 20 MHz clock frequency. The PEs receive their commands from the Sequence Controller, which in turn accesses them from the 256 byte Instruction Cache (IC). Vector engine PEs execute the same instruction simultaneously, in lock-step, each accessing the Input Data Register (IDR), Coefficient Memory Array (CMA), or vector data registers (v0-v7) associated with it, while the scalar engine PE executes instructions that access the IDR, CMA, and scalar global and pointer registers (g0-g7, p0-p7).

In combination with the scalar and vector engines, the CMA and IDR are other major AE modules that demonstrate the AE's flexibility. The 64 by 64 (= 4K) bytes of CMA SRAM function as the general memory storage for instructions, stack space, jump tables, working data and data arrays. A row of 64 bytes is allocated to each of the 64 PEs so that a CMA column of 64 bytes is available for vector/parallel operations. The CMA can also interact with the IDR when the AE is in Run (vs. Host) mode (i.e., the AE is processing instructions instead of interacting with a host processor for random and/or stream accesses).

The IDR is the only input data path for the AE when the AE is in Run mode. An input tagging feature allows the IDR to access individual bytes of data out of a byte stream, while an input replication feature allows the individual bytes to be copied to more than one of the 64 IDR elements. These individual bytes enter from any of the 4 AE ports (North, South, East and West) and go directly into the IDR. Up to 64 bytes of data may then be accessed from the IDR by the scalar and vector engines during AE program execution. The scalar engine can access an element/byte out of the IDR while the vector engine can access all 64 elements/bytes of the IDR.

Although other features of the AE include many control registers not yet defined and a rich instruction set where many operations take 1 clock cycle, the Vector Process Control Register (VPCR) and the instructions listed in this section are the ones used to solve the fuzzy inference portion of the Inverted Pendulum Problem. A VPCR is contained in each of the 64 PEs of the vector engine. Only two of the 8 bits in the VPCR apply to this example. Although the Vector Conditional True (VT) bit is usually used to evaluate if-then-else conditions, the locmax instruction uses it to deactivate PEs that don't have the highest value among all vector register (v0-v7 and IDR) elements. The Valid Input Data (VID) bit indicates that the associated IDR element has data that is valid for use.

Besides the locmax instruction, the following instructions may be used for implementing efficient rule evaluation on AEs:

• vmov

• movi

• mov

• dskip

• skipne

• skipnvt

• repeat

• repeate

• vwritel

• locmin

• rowmin

• rowmax

• colmin

• colmax

• bra

• vifgt

• vifne

• vifeq

• vendif

• vor

• add

• getpe

• get

• put

• inc

• dec

• dsrot

The reader should consult the reference [AE93] for instruction execution times and further explanation of instructions, registers, or other AE features.

THE ALGORITHM

As the second stage of a fuzzy logic system, rule evaluation requires

• the fuzzy input grades of the first stage and

• the rules describing the mapping of fuzzy inputs to the fuzzy outputs

in order to generate the fuzzy output weights required for the third stage. As implied earlier, most fuzzy logic systems start the fuzzy AND-OR/MIN-MAX operations by scanning the rules and then fetching or computing the fuzzy inputs. This means that rule processing will not only be proportional to the number of rules, but also to the number of fuzzy inputs possible in a system. With 7 membership functions per system input, 2 system inputs and 15 rules in the Inverted Pendulum Problem, rule processing will be proportional to 7 * 2 * 15 = 210 membership function * rules even though a majority of the fuzzy inputs are zero.

By analyzing the fuzzy inputs and their impact on the fuzzy AND-OR operations before the process of scanning the rules, this data oriented processing exercise changes the focus of computing from scanning all the rules and performing fuzzy MIN-MAX computations on every fuzzy input to determining the useful fuzzy inputs and then minimizing the amount of computation performed on them. With a maximum of 2 nonzero membership functions per system input, 2 system inputs and 15 rules, rule processing under such a data oriented paradigm reduces the execution time so that it is proportionally bounded by 2 * 2 * 15 = 60 membership function * rules.

The data oriented processing emphasis of this algorithm is atypical of many fuzzy logic systems because the processing and space limitations of Single Instruction Single Data (SISD) chips, no matter how well or highly pipelined, require that all fuzzy AND-OR computations, the scanning of rules, and the recomputation or storing and retrieving of intermediate results be performed by the single sequential processor. During all phases of this algorithm, the data flow architecture of the AE and the compute power available from its 65 processors stress the performance improvement over SISD chips of using Motorola data oriented processing engines, such as AEs, for fuzzy logic solutions.

The first part of this algorithm will sort the fuzzy inputs and maintain/track the relationship of fuzzy input to membership function so that the nonzero fuzzy inputs and rules using them will facilitate efficient scanning of the knowledge base. The second part of this algorithm, generating the fuzzy outputs from the sorted fuzzy inputs by efficiently scanning the rules/knowledge base, is closely related to the rules knowledge base format, so the format is given after the sorting and before the scanning. For the remainder of this discussion, the fuzzy input grades from the fuzzification stage will be stored in 14 elements of a vector register and in the IDR, the fuzzy input membership functions will be stored in a vector register, the rules will be stored in a 7 by 14 byte space within the CMA and the fuzzy output rule weights will be computed in a second vector register. The register map for these and other values used for intermediate calculations follows:

Table 4: Register Map

Fuzzy Input Grades                           v1, IDR
PE With the Largest Fuzzy Input Grade        p3
The Largest Fuzzy Input Grade                g4
Sorted Fuzzy Input Grades                    v0
Sorted Fuzzy Input Grades Index Pointer      p4
Number of Nonzero Fuzzy Inputs               g5
Tracked Fuzzy Input MFs                      v2
Fuzzy Input MF Pointer Into CMA              p0
Rules                                        CMA[0,3] to CMA[6,16]
Fuzzy Input MF Column Offset Into CMA        g7
Number of Fuzzy Input MFs for Example        g6
Zero                                         g3
Pointer Into IDR/Fuzzy Input Grades          p2
Latches Bit Vector                           v3
Fuzzy Output Weights                         v4

Sorting The Fuzzy Inputs

With the instructions listed above, many sorting options are available on the AE. Some of the options apply theory from sorting algorithms for conventional SISD processors, but offer a significant performance improvement when applied to the AE. For example, a good sorting algorithm for a sequential processor would have a performance proportional to O(N * log(N)) where N is the number of items to be sorted. Although there are theoretically faster sorting algorithms for sequential processors, the hardware or software overhead usually makes them undesirable or inefficient for small N. The application of an O(N * log(N)) conventional SISD sorting algorithm to an AE, however, can result in linear performance, O(N), practically impossible to achieve on any conventional sequential processor [Kn73, Be93]. Though either linear sorting algorithm would be sufficient for the Inverted Pendulum Problem, a routine based on the locmax instruction will be used to demonstrate the diversity and uniqueness of the AE instruction set and architecture.

The first part of this routine will initialize the Sorted Fuzzy Input Grades vector (v0), Tracked Fuzzy Input MFs vector (v2), Zero global (g3), and the Sorted Fuzzy Input Grades Index Pointer (p4) registers to zero. The next part of the routine is a loop that selects the largest fuzzy input grade from the Fuzzy Input Grades vector (v1) register, inserts that value into increasing locations of the Sorted Fuzzy Input Grades vector (v0), records the PE number of that grade in the Tracked Fuzzy Input MFs vector (v2), and then replaces the largest fuzzy input grade in the Fuzzy Input Grades vector (v1) with zero. The AE assembly code of this descending values sorting routine follows:

        vmov    #0, v0
        vmov    #0, v2
        movi    #0, g3
        movi    #0, p4
TOP:    locmax  #8, v1
        skipnvt
        bra     BOTTOM
        getpe   p3
        get     v1, pe[p3], g4
        put     g4, pe[p4], v0
        put     p3, pe[p4], v2
        put     g3, pe[p3], v1
        inc     #1, p4
        vendif
        bra     TOP
BOTTOM:
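The effect of the sorting routine can be modeled in plain Python. This is a sequential sketch of the data movement only, not AE code: each list element stands for one PE, and the variable names mirror the register map. Two details are assumptions, since the listing does not spell them out: the loop is taken to end once the largest remaining grade is zero, and when several PEs hold the same largest grade the lowest-numbered PE is taken first.

```python
def locmax_sort(fuzzy_input_grades):
    """Model of the locmax-based descending sort; returns (v0, v2, p4)."""
    v1 = list(fuzzy_input_grades)  # Fuzzy Input Grades (working copy)
    v0 = [0] * len(v1)             # Sorted Fuzzy Input Grades
    v2 = [0] * len(v1)             # Tracked Fuzzy Input MFs (source PE numbers)
    g3 = 0                         # Zero
    p4 = 0                         # Sorted Fuzzy Input Grades Index Pointer
    while True:
        g4 = max(v1)               # largest remaining grade (locmax)
        if g4 == 0:                # assumed exit: nothing nonzero is left
            break
        p3 = v1.index(g4)          # PE holding that grade (getpe)
        v0[p4] = g4                # put g4, pe[p4], v0
        v2[p4] = p3                # put p3, pe[p4], v2
        v1[p3] = g3                # put g3, pe[p3], v1
        p4 += 1                    # inc #1, p4
    return v0, v2, p4


# Hypothetical grades: 7 angle MFs followed by 7 angle change MFs.
sample = [0, 0, 0, 150, 105, 0, 0, 0, 0, 40, 215, 0, 0, 0]
sorted_grades, tracked_mfs, nonzero_count = locmax_sort(sample)
```

On exit, p4 equals the number of nonzero fuzzy inputs, which is the value the later code copies into g5.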

The locmax-based sorting routine given above may be easily modified for sorting across multiple AEs by substituting rowmax, rowmin, colmax, or colmin instructions for locmax and then writing the result out to a port for further processing by other AEs or other hardware.

Rules Knowledge Base Format

The format for representing the rules of a fuzzy logic application written for the AE further illustrates one of the data oriented processing edges over conventional function oriented processors. This format was chosen to make the storage of the knowledge base very compact and the scanning of these rules highly efficient. Each rule stored within the CMA will take up a subrow of bits in the CMA. The length of the subrow will be the number of fuzzy input MFs. All subrows contributing to a fuzzy output must be grouped together in a CMA row so that a total of 8 rules may affect a fuzzy output. For the Inverted Pendulum example, the fuzzy input MFs' relationship to fuzzy output MFs requires 14 columns and 7 rows of CMA space and can represent a maximum of 7 fuzzy outputs * 8 rules per fuzzy output = 56 rules, with the limitation that no more than 8 rules contribute to a fuzzy output.

Since there are only 15 rules describing the Inverted Pendulum Problem, there will be 56 - 15 = 41 subrows that will not contribute to a fuzzy output. These excess subrows must be filled with 1's to facilitate a latching mechanism described later. The other subrows identifying fuzzy input MFs contributing to a fuzzy output MF will be filled with 1's and 0's. For each of the 15 rules, the bits within each subrow set to 1 will identify the fuzzy input MFs contributing to a fuzzy output, while those fuzzy input MFs not contributing to the fuzzy output for this same subrow/rule will be set to 0. For this example, exactly two bits will be set in a subrow for a rule because each rule uses both fuzzy inputs. The following bitmap of the CMA representing this format for the Inverted Pendulum rules shows the CMA columns and subrows identifying the fuzzy inputs and associated MFs that contribute to a particular fuzzy output:

Table 5: Inverted Pendulum Problem Rules Knowledge Base

                    ANGLE                  ANGLE CHANGE
Input MF            NL NM NS ZZ PS PM PL   NL NM NS ZZ PS PM PL   Subrow  Rule  CMA Row/Output
CMA Col.             3  4  5  6  7  8  9   10 11 12 13 14 15 16

excess (all 1's)     1  1  1  1  1  1  1    1  1  1  1  1  1  1   0-5           0/NL
                     0  0  0  0  0  0  1    0  0  0  1  0  0  0   6       14
                     0  0  0  1  0  0  0    0  0  0  0  0  0  1   7       10
CMA bytes (hex)     FC FC FC FD FC FC FE   FC FC FC FE FC FC FD

excess (all 1's)     1  1  1  1  1  1  1    1  1  1  1  1  1  1   8-13          1/NM
                     0  0  0  0  0  1  0    0  0  0  1  0  0  0   14      13
                     0  0  0  1  0  0  0    0  0  0  0  0  1  0   15      9
CMA bytes (hex)     FC FC FC FD FC FE FC   FC FC FC FE FC FD FC

excess (all 1's)     1  1  1  1  1  1  1    1  1  1  1  1  1  1   16-20         2/NS
                     0  0  0  0  1  0  0    0  0  0  1  0  0  0   21      12
                     0  0  0  0  1  0  0    0  0  1  0  0  0  0   22      11
                     0  0  0  1  0  0  0    0  0  0  0  1  0  0   23      8
CMA bytes (hex)     F8 F8 F8 F9 FE F8 F8   F8 F8 FA FC F9 F8 F8

excess (all 1's)     1  1  1  1  1  1  1    1  1  1  1  1  1  1   24-30         3/ZZ
                     0  0  0  1  0  0  0    0  0  0  1  0  0  0   31      7
CMA bytes (hex)     FE FE FE FF FE FE FE   FE FE FE FF FE FE FE

excess (all 1's)     1  1  1  1  1  1  1    1  1  1  1  1  1  1   32-36         4/PS
                     0  0  0  1  0  0  0    0  0  1  0  0  0  0   37      15
                     0  0  1  0  0  0  0    0  0  0  0  1  0  0   38      4
                     0  0  1  0  0  0  0    0  0  0  1  0  0  0   39      3
CMA bytes (hex)     F8 F8 FB FC F8 F8 F8   F8 F8 FC F9 FA F8 F8

excess (all 1's)     1  1  1  1  1  1  1    1  1  1  1  1  1  1   40-45         5/PM
                     0  0  0  1  0  0  0    0  1  0  0  0  0  0   46      6
                     0  1  0  0  0  0  0    0  0  0  1  0  0  0   47      2
CMA bytes (hex)     FC FD FC FE FC FC FC   FC FE FC FD FC FC FC

excess (all 1's)     1  1  1  1  1  1  1    1  1  1  1  1  1  1   48-53         6/PL
                     0  0  0  1  0  0  0    1  0  0  0  0  0  0   54      5
                     1  0  0  0  0  0  0    0  0  0  1  0  0  0   55      1
CMA bytes (hex)     FD FC FC FE FC FC FC   FE FC FC FD FC FC FC

Although each subrow contributing to a fuzzy output is placed at the end of a CMA row, the actual order of subrows within a CMA row is not important (just that all subrows affecting a fuzzy output be grouped in the same row). With so many excess subrows, however, it helps in generating the hexadecimal CMA bytes if the upper or lower 4 bits are all 1's (i.e. F).

Besides being a compact representation of the rules knowledge base, this format allows for a latching mechanism to be employed when scanning the rules so that bits within the latch are set for excess subrows and when fuzzy input MFs contribute to a fuzzy output MF. The bits within the latch will never be cleared so that a fuzzy output MF weight is known when all bits in a byte of the latch are set to 1. This weight, however, may not be the correct weight because more than one fuzzy output MF weight is possible and the fuzzy OR operation requires that the highest weight be chosen.
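The packing of rules into CMA bytes can be sketched in Python as a cross-check of the hexadecimal rows in Table 5. This is not AE code. The function name build_knowledge_base and the tuple layout of RULES are assumptions, and so is the bit assignment: the lowest-numbered rule of each output is placed in the least significant bit, which is the ordering that reproduces the hexadecimal bytes of Table 5.

```python
MFS = ["NL", "NM", "NS", "ZZ", "PS", "PM", "PL"]

# (rule number, angle MF, angle change MF, current MF), from the rule list.
RULES = [
    (1, "NL", "ZZ", "PL"), (2, "NM", "ZZ", "PM"), (3, "NS", "ZZ", "PS"),
    (4, "NS", "PS", "PS"), (5, "ZZ", "NL", "PL"), (6, "ZZ", "NM", "PM"),
    (7, "ZZ", "ZZ", "ZZ"), (8, "ZZ", "PS", "NS"), (9, "ZZ", "PM", "NM"),
    (10, "ZZ", "PL", "NL"), (11, "PS", "NS", "NS"), (12, "PS", "ZZ", "NS"),
    (13, "PM", "ZZ", "NM"), (14, "PL", "ZZ", "NL"), (15, "ZZ", "NS", "PS"),
]


def build_knowledge_base(rules):
    """Return one list of 14 bytes per fuzzy output MF (CMA rows 0..6)."""
    rows = []
    for output_mf in MFS:
        row_rules = sorted(r for r in rules if r[3] == output_mf)
        assert len(row_rules) <= 8, "at most 8 rules may affect one output"
        # Excess subrows: every bit not used by a rule is 1 in every column.
        excess = 0xFF & ~((1 << len(row_rules)) - 1)
        row = [excess] * 14
        for bit, (_number, angle_mf, change_mf, _out) in enumerate(row_rules):
            row[MFS.index(angle_mf)] |= 1 << bit        # angle columns 0..6
            row[7 + MFS.index(change_mf)] |= 1 << bit   # angle change columns 7..13
        rows.append(row)
    return rows


knowledge_base = build_knowledge_base(RULES)
hex_rows = [" ".join(f"{byte:02X}" for byte in row) for row in knowledge_base]
```

Column index 0 of each generated row corresponds to CMA column 3, since the rules start at CMA[0,3].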

Generating Fuzzy Output MF Weights From Rules and Sorted Fuzzy Input Grades

As stated above, the latching mechanism supported by the rules knowledge base format is not enough to guarantee that the correct fuzzy output weight will be generated when scanning the knowledge base. The fuzzy input grades are sorted from highest value to lowest value partly because of this problem. The main concept behind this phase of the algorithm is a method of using the fuzzy inputs, sorted fuzzy inputs and associated MFs for efficiently scanning the knowledge base so that the fuzzy AND-OR operations are preserved and the correct fuzzy output weight is generated for each fuzzy output MF.

Since a majority of the fuzzy input grades are zero, this method must evaluate all of the rules depending on these zero fuzzy input grades and generate zero fuzzy output weights appropriately. With the IDR holding a copy of the fuzzy input grades, this operation is relatively easy to perform and understand compared to calculating the nonzero fuzzy output weights. For these reasons, and the fact that finding the zero fuzzy output weights facilitates calculating the nonzero fuzzy output weights, this part of the fuzzy MIN/AND evaluation is performed as the first computation for generating the fuzzy output weights.

Just as with the sorting routine, the first part of generating the fuzzy output weights will initialize a number of registers with appropriate values. The Fuzzy Output Weights (v4) vector, Pointer Into IDR/Fuzzy Input Grades (p2), and the Latches Bit Vector (v3) registers will be initialized to zero while the Fuzzy Input MF Column Offset Into CMA (g7) global and Fuzzy Input MF Pointer Into CMA (p0) registers will be set to 3. The Number of Fuzzy Input MFs for Example (g6) global register will be set to 14. After the initializations, the CMA will be scanned and bits within the Latches Bit Vector (v3) register will be set to reflect fuzzy input MFs with zero weights and other excess subrows not contributing to a fuzzy output MF weight. Any PEs containing a Latches Bit Vector (v3) element/byte with all bits set will be deactivated so that rule weights of zero will not be changed by subsequent processing. The AE assembly code for the first part of generating the fuzzy output MF weights follows (fuzzy MIN/AND operation):

        vmov    #0, v4
        movi    #3, g7
        movi    g7, p0
        movi    #0, p2
        vmov    #0, v3
        movi    #14, g6
        repeate #2, g6
        vifeq   IDR[p2++], v4
        vor     CMA[p0++], v3
        vifne   #-1, v3

The next part of generating the fuzzy output MF weights initializes the Number of Nonzero Fuzzy Inputs (g5) global register and sets the Sorted Fuzzy Input Grades Index Pointer (p4) register to point to the last element (i.e. lowest grade) of the Sorted Fuzzy Input Grades vector (v0) register. These initializations are done so that the Sorted Fuzzy Input Grades (v0) vector may be traversed from smallest grade to largest grade as part of this algorithm's fuzzy AND-OR/MIN-MAX inference processing. The fuzzy AND-OR/MIN-MAX inference processing loop involves

• extracting the MF number of the lowest fuzzy input grade not yet processed into the Fuzzy Input MF Pointer Into CMA (p0) register,

• extracting the lowest fuzzy input grade not yet processed (continuance of the fuzzy MIN/AND operation which started with computing the zero fuzzy output membership function weights),

• adding the Fuzzy Input MF Column Offset Into CMA (g7) register to the MF number of the lowest fuzzy input grade not yet processed (p0 register),

• ORing the rules using the lowest fuzzy input MF with the Latches Bit Vector (v3) register,

• moving the lowest fuzzy input grade into the active elements of the Fuzzy Output Weights (v4) vector register,

• setting up the Sorted Fuzzy Input Grades Index Pointer (p4) register to point to the next lowest fuzzy input grade not yet processed, and

• deactivating the PEs with all the bits set in their Latches Bit Vector (v3) register (fuzzy MAX/OR operation)

The AE assembly code for this last part of generating the fuzzy output MF weights follows:

        mov     p4, g5
        dec     #1, p4
        repeat  #7, g5
        get     v2, pe[p4], p0
        get     v0, pe[p4], g4
        add     g7, p0
        vor     CMA[p0], v3
        vmov    g4, v4
        dec     #1, p4
        vifne   #-1, v3
        vendif

The vendif reactivates all the PEs that were deactivated during the fuzzy AND-OR/MIN-MAX inference processing so that the third stage of fuzzy logic processing, defuzzification, doesn't have to worry about the state of the PEs.
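The two scanning phases can be modeled in Python and compared against a direct MIN-MAX evaluation of the rules. This is a sequential sketch of the latching idea, not AE code: each loop over range(7) stands for what the 7 active PEs do in lock-step, the list named active stands for the VT bits, and the function names, the knowledge base builder and the 8-bit sample grades are assumptions made for the illustration.

```python
MFS = ["NL", "NM", "NS", "ZZ", "PS", "PM", "PL"]
# (angle MF, angle change MF, current MF) for rules 1 to 15.
RULES = [
    ("NL", "ZZ", "PL"), ("NM", "ZZ", "PM"), ("NS", "ZZ", "PS"),
    ("NS", "PS", "PS"), ("ZZ", "NL", "PL"), ("ZZ", "NM", "PM"),
    ("ZZ", "ZZ", "ZZ"), ("ZZ", "PS", "NS"), ("ZZ", "PM", "NM"),
    ("ZZ", "PL", "NL"), ("PS", "NS", "NS"), ("PS", "ZZ", "NS"),
    ("PM", "ZZ", "NM"), ("PL", "ZZ", "NL"), ("ZZ", "NS", "PS"),
]


def build_rows(rules):
    """One row of 14 bytes per output MF; unused (excess) bits are set to 1."""
    rows = []
    for out in MFS:
        mine = [r for r in rules if r[2] == out]
        row = [0xFF & ~((1 << len(mine)) - 1)] * 14
        for bit, (a, ac, _out) in enumerate(mine):
            row[MFS.index(a)] |= 1 << bit
            row[7 + MFS.index(ac)] |= 1 << bit
        rows.append(row)
    return rows


def latch_inference(grades, rows):
    """Latch-based rule evaluation over 14 fuzzy input grades."""
    latch = [0] * 7       # Latches Bit Vector (v3)
    weight = [0] * 7      # Fuzzy Output Weights (v4)
    active = [True] * 7   # stands for the VT bits
    # Phase 1: OR in the column of every zero-grade fuzzy input MF.
    for col, grade in enumerate(grades):
        if grade == 0:
            for pe in range(7):
                latch[pe] |= rows[pe][col]
    for pe in range(7):
        if latch[pe] == 0xFF:
            active[pe] = False        # weight stays zero
    # Phase 2: nonzero grades from lowest to highest.
    for grade, col in sorted((g, c) for c, g in enumerate(grades) if g != 0):
        for pe in range(7):
            if active[pe]:
                latch[pe] |= rows[pe][col]
                weight[pe] = grade
                if latch[pe] == 0xFF:
                    active[pe] = False
    return weight


def min_max_reference(grades, rules):
    """Direct evaluation: min over a rule's inputs, max over an output's rules."""
    weight = [0] * 7
    for a, ac, out in rules:
        w = min(grades[MFS.index(a)], grades[7 + MFS.index(ac)])
        weight[MFS.index(out)] = max(weight[MFS.index(out)], w)
    return weight


# Hypothetical grades: angle ZZ=150, PS=105; angle change NS=40, ZZ=215.
sample = [0, 0, 0, 150, 105, 0, 0, 0, 0, 40, 215, 0, 0, 0]
result = latch_inference(sample, build_rows(RULES))
```

A rule's bit is latched by the first of its input MFs to be processed, which is the one with the lowest grade, and a PE stops updating once its last rule has latched, so the retained weight is the largest of the per-rule minimums.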

It should also be noted that the theoretical execution time estimate given for the algorithm earlier was under the assumption that the knowledge base would only be scanned once. Since the rules knowledge base is scanned twice during this last phase of the algorithm (once for processing zero fuzzy output weights and once for processing nonzero fuzzy output weights), the theoretical execution time is proportionally bounded by 2 nonzero membership functions/system input * 2 system inputs * 15 rules * 2 = 120 membership function * rules. This is still just under twice as fast as is possible on a conventional processor. In practice, however, the theoretical execution time can be proportional to as little as 60 membership function * rules when there is only 1 nonzero membership function/system input. For comparison's sake, let's assume that the average theoretical execution time of this algorithm will be proportional to (120 + 60) / 2 = 90 membership function * rules. This represents a theoretical performance ratio of 210 / 90 = 233% relative to most fuzzy logic systems.
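The arithmetic behind these estimates can be laid out as a short Python calculation. The variable names are chosen for this illustration only; the figures are the ones quoted in the text.

```python
MFS_PER_INPUT = 7   # membership functions per system input
SYSTEM_INPUTS = 2
RULE_COUNT = 15
SCANS = 2           # the knowledge base is scanned twice

# Conventional scan: every MF of every input against every rule.
conventional = MFS_PER_INPUT * SYSTEM_INPUTS * RULE_COUNT

# Data oriented scan with 2 (worst case) or 1 (best case) nonzero MFs per input.
worst_case = 2 * SYSTEM_INPUTS * RULE_COUNT * SCANS
best_case = 1 * SYSTEM_INPUTS * RULE_COUNT * SCANS
average = (worst_case + best_case) / 2

ratio_worst = conventional / worst_case   # 1.75, "just under twice as fast"
ratio_average = conventional / average    # about 2.33, quoted as 233%
```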

PERFORMANCE AND SUMMARY

Though the performance estimates given above for the algorithm are impressive, they do not give the exact amount of time it takes for the algorithm to execute on the AE, nor do they illustrate the AE's suitability for solving fuzzy logic problems of varying sizes. This section will give the algorithm's worst execution time in clock cycles for the Inverted Pendulum Problem and larger fuzzy logic systems based on an unpipelined AE and instruction cycle times given in the reference [AE93]. The calculations used to generate the number of clock cycles in the following table reduce to 2 * I * IO + 72 * I + 41, where I is the number of system inputs for the fuzzy logic problem and IO is the number of fuzzy input or output membership functions per system input or output. The maximum number of rules supported is IO * O * 8, where O is the number of system outputs. Since the number of CMA columns accessed by this algorithm is only dependent on the number of fuzzy input MFs, this algorithm has the added benefit of allowing for a constant execution time when the number of rules is less than the maximum number of rules supported.

Table 6: Performance of Algorithm for Different Fuzzy Logic Systems

Fuzzy Logic System    I    O    IO    Max Rules    Cycles
Inverted Pendulum     2    1    7     56           213
2/1                   2    1    8     64           217
4/2                   4    2    8     128          393
6/3                   6    3    8     192          569
8/4                   8    4    8     256          745
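The entries of Table 6 can be reproduced from the two expressions given in the text. The Python sketch below is only a check of that arithmetic; the function names are assumptions made for the illustration.

```python
def worst_case_cycles(inputs, mfs_per_io):
    """Worst-case clock cycles: 2 * I * IO + 72 * I + 41."""
    return 2 * inputs * mfs_per_io + 72 * inputs + 41


def max_rules(mfs_per_io, outputs):
    """Maximum rules supported: IO * O * 8 (8 subrows per CMA row)."""
    return mfs_per_io * outputs * 8


# (system, I, O, IO) for each row of Table 6.
systems = [
    ("Inverted Pendulum", 2, 1, 7),
    ("2/1", 2, 1, 8),
    ("4/2", 4, 2, 8),
    ("6/3", 6, 3, 8),
    ("8/4", 8, 4, 8),
]
table = [
    (name, max_rules(io, o), worst_case_cycles(i, io))
    for name, i, o, io in systems
]
```

With IO fixed at 8, each additional pair of system inputs adds 2 * 2 * 8 + 72 * 2 = 176 cycles, which is the constant step between the 2/1, 4/2, 6/3 and 8/4 rows.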


The difference between the 2/1 and 4/2, 4/2 and 6/3, and 6/3 and 8/4 fuzzy logic systems in the above table is exactly 176 clock cycles. This data shows that the AE scales linearly with the size of fuzzy logic systems and provides an excellent example of a chip well designed for scalable computing performance. Note also that even for the largest fuzzy logic system, 8/4, half of the CMA rows are empty. This implies that the AE can support larger fuzzy logic applications requiring more rules and/or fuzzy output MFs with slightly modified (if modified at all) code. This is important to note because although the problem size may increase, the code size may very well stay the same without adding significantly to execution time.

In summary, this algorithm is particularly exemplary of data oriented processing enhancements available with applications using the AE. It shows how solving smaller parts of a fuzzy logic problem on the AE with data oriented partitioning elegance creates an interdependence among all phases of a problem solution, allowing for greater overall efficiency and scalability than can be attained with conventional processors. These factors will give Motorola a clear performance advantage in fuzzy logic markets.

ACKNOWLEDGEMENTS

The authors would like to recognize William Archibald as the first and only other individual (to the authors' knowledge) to develop the basic algorithm and apply it to any other hardware [Ar92], as well as for his time in helping us to understand the algorithm. Alex DeCastro also provided the figure used in this paper and the source for one of the references.

BIBLIOGRAPHY

[AE93] Motorola Parallel Scalable Processors Group, "Association Engine (AE) Software Manual", Motorola MCTG Publications, 1993.

[Ar92] Archibald, W., "FLIPPER Architectural and Algorithmic Notes", Not yet published.

[Ba93] Barron, J., "Putting Fuzzy Logic Into Focus", Byte, April 1993, pp. 111 - 118.

[Be93] Bell, M., "Sorting on the AE", Not yet published.

[Kn73] Knuth, D., "Sorting and Searching", The Art of Computer Programming, Vol. 3, Addison-Wesley Publishing Company, Menlo Park, CA, 1973.

[Ko92] Kosko, B., "Neural Networks and Fuzzy Systems", Prentice-Hall, Inc., Englewood Cliffs, NJ, 1992.

[St93] Stevens, T., "Fuzzy Logic Makes Sense", Industry Week, March 1, 1993, pp. 36 - 42.