


NAFIPS 2005 - 2005 Annual Meeting of the North American Fuzzy Information Processing Society

Constructing a Fuzzy Rule Based Classification System Using Pattern Discovery

Andrew Hamilton-Wright
Systems Design Engineering
University of Waterloo
Waterloo, Ontario, Canada
andrewhw@ieee.org

Daniel W. Stashuk
Systems Design Engineering
University of Waterloo
Waterloo, Ontario, Canada
[email protected]

Abstract-Pattern Discovery (PD), an algorithm which discovers patterns based on a statistical analysis of training data, was used to generate rules for a fuzzy rule based classification system (FRBCS). The classification performance of the FRBCS using rules discovered by the PD algorithm was compared with that of the PD algorithm functioning directly as a classifier, on a number of linearly and non-linearly separable continuous-valued data sets.

The results indicate an increased performance for the FRBCS. The improvement comes through both an increase in correct classifications and a decrease in the error rate in the class distributions studied. The use of trapezoidal input membership functions applied to the input data values allowed vagueness in the input events to be modelled, and resulted in a more robust determination of the characteristics of the input data, which in turn resulted in more accurate classification. In addition, the standard use of a co-occurrence based weighting of the rules by the FRBCS outperformed the weight-of-evidence based selection and use of input patterns by the PD classifier.

I. INTRODUCTION

A continuing problem in fuzzy systems is the acquisition of a sound fuzzy rule base in the absence of high-quality expert knowledge. This work examines a means of incorporating rules, automatically discovered in a statistically valid manner by a "Pattern Discovery" (PD) algorithm, into a fuzzy rule based classification system (FRBCS).

PD is a contingency-table based pattern extraction algorithm developed by Wang and Wong [1], [2] to deal with ordinal/nominative data, and is used for the analysis of data from a variety of domains. As PD is an event-based algorithm, the rules extracted can be easily interpreted and their statistical validity can be confirmed. This, combined with the strong linguistic basis of fuzzy systems, ensures that a FRBCS based on such rules would be capable of explaining the rationale and statistical validity of its decisions. This transparency would allow the developed FRBCS to be part of a complex decision support system, such as those used in medical and financial applications, where the classifications of a number of subordinate classifiers must be combined in an intelligent manner to support decisions in a broader context.

The PD technique uses discrete data, so continuous data must be quantized before it can be used. Quantization can affect the performance of the PD algorithm. However, the PD algorithm has recently been successfully applied to the classification of continuous-valued data [3], and these results suggest that quantization does not necessarily prevent PD from being applied successfully to continuous data. In the current paper, the suitability of the PD algorithm as a mechanism for the automatic creation of rules for a FRBCS applied to a number of continuous-valued class distributions is explored.

II. PATTERN DISCOVERY ALGORITHM

The reader is referred to [2], [4] for a complete description of the algorithm; a brief overview of the major concepts is provided here.

Consider a set of discrete training data presented as an array of $N$ rows of length $M+1$. Each row or input vector contains $M$ input feature values and a single class label, $Y = y_k$, from a set of $K$ possible class labels.

Every input vector can be considered to be a single $(M+1)$th-order event in information space. Each element of an $M$-dimensional feature vector, $x_j$ ($j \in 1 \ldots M$), can have one of $v_j$ discrete observed values from the set of possible values or primary events describing feature $j$. Each possible combination of $m$ primary events selected from within a vector can be considered a sub-event of order $m$, $m \in I : 1 \le m \le M+1$.

Primary (or first-order) events are represented as $x_l^1$, while in general an event of order $m$ is represented as $x_l^m$, with $l$ indicating a particular sub-event within the list of all sub-events of order $m$ occurring in a particular input event $x$.

Events of interest with respect to classification must be of order 2 or greater and be an association of at least one input feature value (a primary event) and a specific class label.

PD analysis begins by counting the number of occurrences of all observed events among the $N$ vectors forming the training data. Statistically significant events (or "patterns") within this set are then discovered by using a residual analysis technique.

A. Pattern Identification Using Residual Analysis

The test performed on each event $x_l^m$ to determine whether it is "significant" simply compares the observed number of occurrences of the event with the expected number of occurrences under the null hypothesis that the probability of the occurrence of each component primary event is random and independent.


Terminology 1: Adjusted and Standardised Residual

Definition 1: Standardised Residual. The standardised residual is defined as the ratio of the simple residual to the square root of its expectation [5]:

$$z_{x_l^m} = \frac{o_{x_l^m} - e_{x_l^m}}{\sqrt{e_{x_l^m}}} \qquad (1)$$

where $e_{x_l^m}$ is the expected number of occurrences of $x_l^m$ given an assumed model (random chance) and $o_{x_l^m}$ is the observed number of occurrences in a training data set.

Definition 2: Adjusted Residual. The adjusted residual is a further normalization of the standardised residual [5]:

$$r_{x_l^m} = \frac{z_{x_l^m}}{\sqrt{v_{x_l^m}}} \qquad (2)$$

where $v_{x_l^m}$ is the maximum likelihood estimate of the variance of the $z_{x_l^m}$ value in (1); as given by Wang [1], this is:

$$v_{x_l^m} = \operatorname{var}\left( \frac{o_{x_l^m} - e_{x_l^m}}{\sqrt{e_{x_l^m}}} \right) = \prod_{j=1}^{m} \left( 1 - \frac{o_{x_j^1}}{N} \right) \qquad (3)$$

where $o_{x_j^1}$ is the number of occurrences of the primary event $x_j^1 \in x_l^m$ ($x_l^m$ is the current event being examined) and $N$ is the total number of observations made (i.e., the number of rows in the training data set).

The observed number of occurrences of $x_l^m$ is represented as $o_{x_l^m}$, and the expected number of occurrences, $e_{x_l^m}$, is

$$e_{x_l^m} = N \prod_{i=1}^{m} \left( \frac{o_{x_i^1}}{N} \right) \qquad (4)$$

where $o_{x_i^1}$ is the number of occurrences of $x_i^1$, itself a primary event drawn from the event $x_l^m$.

To select significant events, the $\mathcal{N}(0,1)$ distributed adjusted residual $r_{x_l^m}$ defined in (2) is used. The value $r_{x_l^m}$ defines the relative significance of the associated event $x_l^m$. The PD algorithm deems an event to be significant if $|r_{x_l^m}|$ exceeds 1.96, defeating the null hypothesis with 95% confidence.

Events capturing significant relationships between class labels and other feature values in the training data are termed "patterns". Patterns are used to suggest the class labels of new input feature vectors.

Significance is calculated in absolute terms because combinations of events which occur significantly less frequently than would be expected under the null hypothesis (patterns with a negative $r_{x_l^m}$) are just as significant and potentially discriminative as those that occur more frequently. Such patterns may be used to contra-indicate a specific class label.
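As a concrete illustration of this test, the following Python sketch (our own, with illustrative names, not part of the PD implementation) computes the adjusted residual of (2) from raw occurrence counts, using the expectation (4) and variance (3) given above, and applies the 1.96 threshold:

    import numpy as np

    def adjusted_residual(o_event, primary_counts, N):
        """Adjusted residual r for an order-m event observed o_event times.

        primary_counts holds the occurrence counts of the m component
        primary events; N is the number of training vectors.
        """
        p = np.asarray(primary_counts, dtype=float) / N
        e_event = N * np.prod(p)                    # expected count, eq. (4)
        z = (o_event - e_event) / np.sqrt(e_event)  # standardised residual, eq. (1)
        v = np.prod(1.0 - p)                        # variance estimate, eq. (3)
        return z / np.sqrt(v)                       # adjusted residual, eq. (2)

    def is_pattern(o_event, primary_counts, N, threshold=1.96):
        """An event is deemed a pattern if |r| exceeds the 95% threshold."""
        return abs(adjusted_residual(o_event, primary_counts, N)) > threshold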

B. Weight of Evidence

In order to measure the discriminative power of a pattern, PD uses the "weight of evidence" statistic (or WOE).

Letting $(Y = y_k)$ represent the label portion of a given pattern $x_l^m$, the remaining portion (consisting of the input feature values) is referred to as $x^*$. The mutual information between these two components can be calculated [1] using:

$$I(Y = y_k : x^*) = \ln \frac{\Pr(Y = y_k \mid x^*)}{\Pr(Y = y_k)} \qquad (5)$$

A WOE in favour of or against a particular labelling $y_k \in Y$ can be calculated as

$$\mathrm{WOE}(y_k \mid x^*) = I(Y = y_k : x^*) - I(Y \neq y_k : x^*) \qquad (6)$$

or

$$\mathrm{WOE}(y_k \mid x^*) = \ln \frac{\Pr(x^*, Y = y_k)\,\Pr(Y \neq y_k)}{\Pr(Y = y_k)\,\Pr(x^*, Y \neq y_k)} \qquad (7)$$

WOE thereby provides a measure of how discriminative a pattern $x^*$ is in relation to a label $y_k$, and gives a measure of the relative probability of the co-occurrence of $x^*$ and $y_k$ (i.e., the "odds" of labelling correctly).

The domain of WOE values is $[-\infty \ldots \infty]$, where $-\infty$ indicates those patterns $x^*$ that never occur in the training data with the specific class label $y_k$, and $\infty$ indicates patterns $x^*$ which only occur with the specific class label $y_k$. These infinite-valued WOE patterns are the most descriptive relationships found in the training data set.

C. Classification

Support for each $y_k$ (possible class label) is evaluated in turn by considering the highest-order pattern with the greatest adjusted residual from the set of all patterns occurring in an input data vector to be classified, and accumulating the WOE of this pattern in support of the associated label. All features of the input data vector matching this pattern are then excluded from further consideration for this $y_k$, and the next highest-order occurring pattern is considered. This continues until no patterns match the remaining input data vector, or all the features in the input data vector are excluded.

Once this has been repeated for each $y_k$, the label with the highest accrued WOE is taken as the highest-likelihood match and this class label is assigned to the input feature vector.
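The following sketch outlines this accumulation loop; the pattern representation used here (a feature dictionary, label, adjusted residual and WOE per pattern) is an assumption made for illustration:

    def pd_classify(x, patterns, labels):
        """Greedy WOE accumulation over matching patterns.

        x: dict mapping feature index -> quantized value.
        patterns: list of (features, label, residual, woe) tuples, where
        features is a dict mapping feature index -> quantized value.
        """
        support = {}
        for y_k in labels:
            remaining = dict(x)
            total = 0.0
            while remaining:
                # patterns for y_k wholly contained in the remaining features
                live = [p for p in patterns
                        if p[1] == y_k and p[0] and
                        all(remaining.get(j) == v for j, v in p[0].items())]
                if not live:
                    break
                # highest order first, then greatest adjusted residual
                best = max(live, key=lambda p: (len(p[0]), p[2]))
                total += best[3]
                for j in best[0]:          # exclude the matched features
                    remaining.pop(j, None)
            support[y_k] = total
        if all(v == 0.0 for v in support.values()):
            return None                    # no pattern matched at all
        return max(support, key=support.get)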

III. PATTERN ANALYSIS USING CONTINUOUS VALUES

In most real-world problems data is continuous-valued and must be quantized to be used by the PD algorithm. Based on training data and a marginal maximum entropy (MME) partitioning scheme, an optimal mapping of continuous-valued data points to bounded "bins" is determined for each feature. The general idea is that for a specific feature the data values assigned to each "bin" have a local similarity, and that each bin contains the same number of assigned feature data values.

This is achieved over the set of observed values by (a code sketch follows below):
* sorting all the values for a given feature $j$, $j \in 1 \ldots M$;
* dividing the sorted list into $q_j$ "bins" of $N/q_j$ values each;
* calculating a minimum cover or "core" of each bin;
* covering gaps between the calculated "cores" of adjacent bins by extending the bin intervals to the midpoint of the gaps.

[Fig. 1. MME Based Fuzzy Input Space Divisions. The rows show, for one input dimension domain (the fuzzy universe of discourse): the observed input data events, the MME quantization intervals, and the resulting fuzzy membership functions, with points a and b marked.]


Input features are then assigned to the bin whose interval contains their value. Feature values assigned to the same bin represent the same primary event.
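A sketch of this binning procedure, assuming the equal-count split and midpoint extension described above (numpy's array_split performs the division into bins):

    import numpy as np

    def mme_bins(values, q):
        """Return q bin intervals for one feature via MME partitioning."""
        v = np.sort(np.asarray(values, dtype=float))
        chunks = np.array_split(v, q)             # q bins of ~N/q values
        cores = [(c[0], c[-1]) for c in chunks]   # minimum cover of each bin
        bounds = []
        for k, (lo, hi) in enumerate(cores):
            # extend to the midpoint of the gap with each neighbouring core
            left = lo if k == 0 else (cores[k - 1][1] + lo) / 2.0
            right = hi if k == q - 1 else (hi + cores[k + 1][0]) / 2.0
            bounds.append((left, right))
        return bounds

    def quantize(x, bounds):
        """Assign a value to the bin whose interval contains it."""
        for k, (lo, hi) in enumerate(bounds):
            if lo <= x <= hi:
                return k
        return None                               # falls outside every bin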

The joint occurrence (or co-occurrence) of a number of primary input events along with a class label, when observed a significant number of times, forms a high-order pattern or rule.

IV. FUZZIFICATION OF PATTERN DISCOVERY

A scheme to implement a simple "fuzzified" version of a PD classifier was produced. The two main aspects of a PD classifier which had to be modified were the crisp nature of the input space, and the lack of any fuzzy framework for the use of patterns or rules which might exist in an input vector.

A. Fuzzification of Input Values

Continuous-valued data presented to a PD classifier are discretized using bin intervals which are usually defined by an MME algorithm. The creation of the "core" elements of MME bin intervals and the extension of the intervals to the midpoint of the resulting "gaps" is shown in Fig. 1. The first row shows the underlying data points. The heavy vertical lines shown in the second row define the MME bin boundaries, with the grey blocks indicating the minimum cover or "core" regions of the bins defined by the underlying data points.

The gaps between these "core" regions are a source of vagueness inherent in MME discretization. There is further vagueness in the measurement of the input data values themselves. The uncertainty of both of these can be captured by creating trapezoidal fuzzy membership functions based on the underlying "core" regions of the MME bins.

Trapezoidal membership functions are created by extending "skirts" out from the central plateau defined by each "core" region. A skirt is extended in each direction. The length of each skirt is set to the minimum of:
* the distance from the edge of the current bin plateau to the midpoint of the neighbouring bin plateau;
* 1/2 the width of the current bin plateau.

The resulting fuzzy membership functions may overlap completely, with skirts meeting in the center of a bin (as at point a in row 3 of Fig. 1), may leave a section in the center of a plateau without any conflict (as at point b), and could possibly leave a gap between membership functions, particularly if in the training data there are regions where data values are sparse and widely separated.
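A sketch of the construction and evaluation of these trapezoids, assuming the (lo, hi) core intervals produced by the MME sketch above; the (a, b, c, d) corner representation is the usual convention for trapezoidal fuzzy sets:

    def trapezoid_from_core(cores, k):
        """Trapezoid (a, b, c, d) for bin k, with plateau [b, c] = core k."""
        b, c = cores[k]
        plateau = c - b
        if k > 0:                                 # left skirt
            mid = (cores[k - 1][0] + cores[k - 1][1]) / 2.0
            left = min(b - mid, plateau / 2.0)
        else:
            left = plateau / 2.0
        if k < len(cores) - 1:                    # right skirt
            mid = (cores[k + 1][0] + cores[k + 1][1]) / 2.0
            right = min(mid - c, plateau / 2.0)
        else:
            right = plateau / 2.0
        return (b - left, b, c, c + right)

    def trapezoid_mu(x, abcd):
        """Membership of x in the trapezoid (a, b, c, d)."""
        a, b, c, d = abcd
        if b <= x <= c:
            return 1.0
        if a < x < b:
            return (x - a) / (b - a)
        if c < x < d:
            return (d - x) / (d - c)
        return 0.0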

For this reason, an extra fuzzy input set termed NOTMATCHED was created, which is the inversion of the t-norm of all other input sets for a particular feature; thus the domain of an input feature is covered, providing the necessary stability in mapping input features to output values.

B. Fuzzification of the Use of PD Rules

The patterns created by the PD algorithm are used as the fuzzy rule base. As is standard in fuzzy logic systems, all rules are fired. Each rule produces a function-style output indicating a single location in an assertion space; the point has an associated membership value computed from the input membership functions and a min t-norm, as is standard in fuzzy theory. This assertion of rule consequents can be considered an implementation of the techniques of Takagi, Sugeno and Kang [6], [7]; each consequent is the output of a weight function asserting support or refutation of a specific classification.

The direct use of the PD WOE values as rule weights is not possible in a standard fuzzy domain, as the weights created by (7) lie in the range $[-\infty \ldots \infty]$, preventing their use in a standard $[0 \ldots 1]$ weighted fuzzy output space.

Rule weights are therefore created using the combined co-occurrence of events, calculated through the number of occurrences:

$$w_l = \begin{cases} \dfrac{o_{x_l^m}}{o_{x^*}} & \text{if } r_{x_l^m} > 1.96 \\[1ex] \dfrac{o_{x_l^m} - e_{x_l^m}}{e_{x_l^m}} & \text{if } r_{x_l^m} < -1.96 \end{cases} \qquad (8)$$

where
$o_{x_l^m}$ indicates the number of occurrences of the event $x_l^m$ defining rule $l$, including the class label;
$e_{x_l^m}$ is the expected number of occurrences of the event defining rule $l$; and
$o_{x^*}$ indicates the number of occurrences of the input portion of the event defining rule $l$ excluding the class label, noting that this input event may also occur with other class labels;

and remembering that values $r_{x_l^m} \in (-1.96 \ldots 1.96)$ will not be observed for significant patterns or rules.
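A sketch of this weighting, assuming the piecewise form of (8): the positive case is a confidence-style ratio in [0, 1], and the negative case lies in [-1, 0) since the observed count falls below the expected count:

    def rule_weight(o_event, e_event, o_input, r):
        """Occurrence-based weight for the rule defined by one pattern.

        o_event, e_event: observed/expected counts of the full event;
        o_input: occurrences of the input portion with any class label;
        r: the pattern's adjusted residual.
        """
        if r > 1.96:
            return o_event / o_input               # in [0, 1]
        if r < -1.96:
            return (o_event - e_event) / e_event   # in [-1, 0)
        raise ValueError("non-significant events do not form rules")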

If no rules fire, the classifier leaves an input vector unclassified, which is detectable in the output as a special class called UNCLASSIFIED. Rules to cover the NOTMATCHED case are created as independent rules for each input, asserting that if the input value is NOTMATCHED, the output class is UNCLASSIFIED.

As this classifier creates assertions supporting or refuting a classification based purely on the observed occurrence of sub-events, we will term this classifier "occurrence-based".

V. EVALUATION

A. Linearly Separable Class Distributions

Covariance matrices for 4-feature $\mathcal{N}(0,1)$ data were generated using the variance values $V = \{160, 48, 16, 90\}$ arranged along the diagonal of a generating covariance matrix ($\mathrm{Cov}_{ii} = V_i$). Each off-diagonal element $\mathrm{Cov}_{ij}$ was calculated through

$$\mathrm{Cov}_{ij} = \tfrac{1}{2}\sqrt{(\mathrm{Cov}_{ii})\,(\mathrm{Cov}_{jj})} \qquad (9)$$

[Fig. 2. Sample Data Points: a) bimodal data (separation 4), showing classes A, B, C and D over Feature A; b) spiral data (separation 1/2 π), showing classes A and B over Feature A.]

This produced the covariance matrix for class A for each data type. The covariance matrix for class B was produced by setting

$$\mathrm{Cov}_{ii} = V_{(i+1) \bmod 4} \qquad (10)$$

Classes C and D were produced by substituting $i+2$ and $i+3$ respectively into (10).

These matrices generate clouds of data which intersect non-orthogonally and which have differing variances and covariances in each dimension, and in each class.

Class separations were produced using a combination of the variances of classes A and B within the set of covariance matrices, where the separation vector $c$ was calculated using:

$$c_i = \sqrt{(\mathrm{Cov}^A_{ii})\,(\mathrm{Cov}^B_{ii})} \qquad (11)$$

The four classes were separated into different quadrants in Euclidean space by projecting the mean vector of each class away from the origin by separation factors of:

$$s_i \in S, \quad S = \{\tfrac{1}{8}, \tfrac{1}{4}, \tfrac{1}{2}, 1, 2, 4, 8\} \qquad (12)$$

and combining $s_i$ with $c_i$ from (11) to produce centers for each factor of $s$ located at $(s_i c_i, s_i c_i)$, $(s_i c_i, -s_i c_i)$, $(-s_i c_i, s_i c_i)$ and $(-s_i c_i, -s_i c_i)$, respectively.
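A sketch of this generation procedure, assuming the off-diagonal form of (9) (half the geometric mean of the paired variances) and purely illustrative cluster centers; the actual centers follow (11) and (12):

    import numpy as np

    rng = np.random.default_rng(0)
    V = np.array([160.0, 48.0, 16.0, 90.0])

    def covaried_class(variances, n_points, center):
        """Sample points from a 4-feature Gaussian built per eqs. (9)-(10)."""
        cov = 0.5 * np.sqrt(np.outer(variances, variances))  # eq. (9)
        np.fill_diagonal(cov, variances)                     # Cov_ii = V_i
        return rng.multivariate_normal(np.asarray(center, dtype=float),
                                       cov, n_points)

    # class A uses V directly; class B rotates the variances, eq. (10)
    class_A = covaried_class(V, 1000, center=[10.0, 10.0, 0.0, 0.0])
    class_B = covaried_class(np.roll(V, -1), 1000, center=[10.0, -10.0, 0.0, 0.0])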

B. Non-Linearly Separable Class Distributions

Two types of non-linearly separable class distributions were produced: bimodal and spiral.

1) Bimodal Data: The "bimodal" data was created by beginning with the $s_i$ cluster locations and adding a second set of points for each class. The location of the second cluster is set by projecting the mean away from the origin in a diametrically opposite direction, with an extra translation of $4\sqrt{v_{max}}$, where $v_{max}$ is the maximum variance value specified in the set of variances, $V$. Thus, along with a cluster of points centred at $(s, s)$, a second cluster would be placed at $(-4s\sqrt{v_{max}}, -4s\sqrt{v_{max}})$. Each cluster contained half the total number of points.

This algorithm was repeated for all sets, generating a layout of pairs of clusters around the origin, shown in the sample data illustrated in Fig. 2a for $s = 4$.
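A sketch of the bimodal construction, reusing covaried_class and rng from the previous sketch; here c stands for the separation vector of (11), passed as a length-4 array:

    def bimodal_class(variances, n_points, s, c):
        """Two clusters per class: one at the s*c location, the second
        diametrically opposite with the 4*sqrt(max(V)) translation."""
        first = covaried_class(variances, n_points // 2, center=s * c)
        opposite = -4.0 * s * np.sqrt(variances.max()) * np.ones(4)
        second = covaried_class(variances, n_points - n_points // 2,
                                center=opposite)
        return np.vstack([first, second])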

2) Spiral Data: 2-class data was produced by using the spiral equation

$$r = \rho\,(2\pi\theta) + r_0 \qquad (13)$$

which relates the input variable $\theta$ to a radius, where $r_0$ is the base radius at which the spiral begins and $\rho$ is the scale, or acceleration, of the spiral.

The spiral defined in (13) accelerates out from (0,0). For each $(r, \theta)$ point chosen on this spiral, a value from a $\mathcal{N}(0,1)$ distribution was chosen to apply a scatter in radians to the $\theta$ value of each generated point, while maintaining the same radius. To generate 4-feature points, a second scatter is chosen to perturb the data in the third and fourth dimensions also.

The second class was generated by choosing a similar set of points, and separation was introduced by rotating the entire set around the origin by a specified amount in units of $\pi$ radians. Maximum separation therefore is 1.0.

Data was generated using $\sigma = 1.0$, $\tau = 0.5$, $\rho = 0.5$ and $r_0 = 0.125$. Two dimensions of a sample pair of class distributions with a separation of 0.5 (a quarter-turn, or $\pi/2$) is shown in Fig. 2b.
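A sketch of the spiral generator; the sampling range for θ and the way the second scatter perturbs the third and fourth dimensions are assumptions, as the text does not fully specify them:

    import numpy as np

    rng = np.random.default_rng(1)

    def spiral_class(n_points, rho=0.5, r0=0.125, rotate=0.0, sigma=1.0):
        """4-feature spiral points per eq. (13), rotated by rotate*pi."""
        theta = rng.uniform(0.0, 1.0, n_points)
        r = rho * (2.0 * np.pi * theta) + r0               # eq. (13)
        # angular scatter in radians; the radius is left unchanged
        angle = (2.0 * np.pi * theta + rotate * np.pi
                 + rng.normal(0.0, sigma, n_points))
        x1, x2 = r * np.cos(angle), r * np.sin(angle)
        # a second scatter perturbs the third and fourth dimensions
        x3 = x1 + rng.normal(0.0, sigma, n_points)
        x4 = x2 + rng.normal(0.0, sigma, n_points)
        return np.column_stack([x1, x2, x3, x4])

    class_A = spiral_class(1000)
    class_B = spiral_class(1000, rotate=0.5)   # separation of 0.5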

C. Data Analysis

For each class distribution, 11 sets of continuous-valued data for each class were produced to create training and testing data sets. The data were then combined using a jackknife procedure to generate 11 separate runs, with 10 randomly chosen data sets combined and used for training and the remaining set used for testing, for a total of 1000 training and 100 testing points per class in each case. This amount of training data reflects the quantity of data available for several real-world problems. All data were discretized with Q=10 quantization intervals.

The numbers of input test vectors classified correctly, classified incorrectly, and left unclassified were examined. Performance results of the PD classifier, the occurrence-based FRBCS using crisp input sets as defined by the MME quantization algorithm ("crisp"), and the FRBCS using trapezoidal input sets ("skirts") were compared. Along with these, the classification accuracy of an MICD (Minimum Inter-Class Distance) classifier (which is simply a naive Bayesian classifier with equal a priori probabilities) [8, pp. 41] is provided for reference.
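The jackknife procedure can be sketched as follows; train_fn and test_fn stand in for whichever classifier is being trained and scored:

    import numpy as np

    def jackknife_runs(data_sets, train_fn, test_fn):
        """One run per held-out set: train on the other 10, test on it."""
        results = []
        for i, held_out in enumerate(data_sets):
            training = np.vstack([s for j, s in enumerate(data_sets) if j != i])
            model = train_fn(training)
            results.append(test_fn(model, held_out))
        return results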


TABLE I
COVARIED CLASS DISTRIBUTION RESULTS
COVARIED DATA - FRACTION CORRECT, 4C/4F, 1000 POINTS

                           Separations
Classifier    0.125       0.250       0.500       1.000       2.000       4.000       8.000
PD            0.60±0.014  0.61±0.020  0.64±0.026  0.72±0.027  0.86±0.014  0.97±0.011  0.99±0.003
FZ (crisp)    0.63±0.021  0.63±0.020  0.67±0.024  0.72±0.026  0.87±0.019  0.96±0.009  0.99±0.003
FZ (skirts)   0.65±0.023  0.64±0.015  0.67±0.024  0.73±0.027  0.88±0.019  0.98±0.009  1.00±0.002
MICD          0.71±0.017  0.71±0.021  0.74±0.020  0.81±0.024  0.94±0.017  1.00±0.002  1.00±0.000

TABLE II
BIMODAL CLASS DISTRIBUTION RESULTS
BIMODAL DATA - FRACTION CORRECT, 4C/4F, 1000 POINTS

                           Separations
Classifier    0.125       0.250       0.500       1.000       2.000       4.000       8.000
PD            0.78±0.010  0.77±0.021  0.79±0.017  0.84±0.016  0.90±0.012  0.97±0.010  1.00±0.003
FZ (crisp)    0.81±0.017  0.80±0.023  0.81±0.012  0.85±0.014  0.92±0.011  0.98±0.011  1.00±0.001
FZ (skirts)   0.82±0.016  0.82±0.021  0.84±0.010  0.87±0.014  0.93±0.011  0.99±0.007  1.00±0.000
MICD          0.73±0.012  0.72±0.013  0.71±0.020  0.74±0.011  0.80±0.018  0.89±0.012  0.94±0.010

TABLE III
SPIRAL CLASS DISTRIBUTION RESULTS

                           Separations
Classifier    0.125       0.250       0.500       0.750       1.000
SPIRAL DATA - FRACTION CORRECT, 2C/4F, 1000 POINTS
PD            0.60±0.040  0.69±0.022  0.85±0.012  0.92±0.023  0.93±0.015
FZ (crisp)    0.60±0.043  0.69±0.029  0.85±0.018  0.91±0.021  0.93±0.016
FZ (skirts)   0.60±0.036  0.72±0.028  0.87±0.020  0.93±0.019  0.97±0.010
MICD          0.49±0.027  0.49±0.029  0.48±0.017  0.49±0.015  0.49±0.024
SPIRAL DATA - FRACTION INCORRECT, 2C/4F, 1000 POINTS
PD            0.39±0.040  0.30±0.021  0.14±0.012  0.08±0.023  0.07±0.015
FZ (crisp)    0.39±0.043  0.31±0.028  0.15±0.019  0.09±0.021  0.07±0.016
FZ (skirts)   0.40±0.036  0.28±0.028  0.13±0.020  0.07±0.019  0.03±0.010
MICD          0.51±0.027  0.51±0.029  0.52±0.017  0.51±0.015  0.51±0.024

TABLE IV
SPIRAL CLASS DISTRIBUTION UNCLASSIFIED RECORDS

Classifier    0.125         0.250         0.500         0.750  1.000
FZ (skirts)   none          none          none          none   none
FZ (crisp)    0.013±0.0049  0.003±0.0031  0.001±0.0029  none   none
PD            0.013±0.0049  0.004±0.0032  0.001±0.0029  none   none

VI. RESULTS

A. Covaried Class Distributions

Table I presents the fractions of correct classifications made by the four classifiers studied for the covaried class distributions. Compared to the PD classifier, the FRBCS with "crisp" input sets had a subtle increase in the number of correct classifications, and the FRBCS using fuzzy input "skirts" performed slightly better still. Note that the MICD performance is optimal for these linearly separable class distributions.

B. Bimodal Class Distributions

Table II also shows a small increasing trend in the number of correct classifications made by the FRBCS using "crisp" sets and the FRBCS using "skirts" relative to the PD classifier results.

C. Spiral Class Distributions

In Table III the MICD performance lies along the 50% line, as the data cannot be separated by a single hyper-plane at all. The performance difference between the FRBCS using "skirts" and the "crisp" FRBCS implementation is negligible; both are extremely close to the performance of the un-fuzzified PD classifier.

Table IV shows the number of records left unclassified by the three classifiers studied. Similar data is not presented for the other class distributions because for those distributions no records were left unclassified. The FRBCS using "skirts" was able to classify records which were left unclassified by the crisp methods.


TABLE V
WEIGHTED PERFORMANCE RESULTS

Classifier    Covaried  Bimodal  Spiral
MICD          0.7354    0.7318   0.4903
FZ (skirts)   0.6690    0.8359   0.7112
FZ (crisp)    0.6572    0.8170   0.6965
PD            0.6346    0.7922   0.7001

D. Weighted Performance Results

The weighted performance results shown in Table V demonstrate the overall performance of each algorithm.

To calculate weighted performance, the statistic

$$\bar{p} = \frac{1}{|S|} \sum_{s_i \in S} p_{class}(s_i) \qquad (14)$$

was used, where $s_i$ is a separation value, $|S|$ indicates the number of separations tested, and $p_{class}$ (the classification performance) is the product of (the fraction of records classified) $\times$ (the fraction of correct classifications).

These results show the FRBCS with "skirts" implementation to be consistently better than the other two implementations. The MICD algorithm assumes linearly separable class distributions and, as such, is not suitable for all of the class distributions studied.

VII. DISCUSSION

Each FRBCS examined here uses a crisp discovery of patterns (or rules) combined with a fuzzy rule weighting, producing a classification decision using the weighted firing of every rule, as is standard in fuzzy systems.

The results suggest that the firing of all the rules and the use of fuzzy inference improves performance relative to the base PD classifier, and that the use of fuzzy input membership functions with "skirts" improves performance relative to the "crisp" FRBCS. Specifically, modelling quantization vagueness contributed to the performance increases seen in Tables I, II and III. The fuzzification of the bin boundaries allows the FRBCS with "skirts" to make correct decisions through the use of additional information.

The reduction in the number of errors occurs through the firing and use of more (and possibly better) rules near inter-class feature value boundaries. These rules were created for use in the adjacent quantization bins; however, the validity of their assertions extends into neighbouring bins with a high degree of observed accuracy, improving the overall performance.

The data in Table IV also suggest that fuzzification of the input space models some of the vagueness associated with quantization of continuous-valued input data, allowing records which are left unclassified by the crisp classifiers to be classified by the FRBCS with "skirts" using rules which are evidently still valid.

In the spiral class distribution problem, adjacent bins frequently support differing labels, simply due to the topology of the data. Table IV is therefore particularly interesting, as it shows that the skirts extend the "region of effect" of a given rule into parts of the data space which are difficult to characterize based on adjusted-residual based quantization, but which can still be covered successfully using the "skirted" decreasing membership of rules in adjacent bins. The use of rules from adjacent bins provides good performance, even though the data topology would suggest that this is difficult.

VIII. CONCLUSIONS

The fuzzification of PD-created rules using the occurrence-based algorithm allows more input data vectors to be classified with similar rates of error compared to standard PD classifier performance. Modelling the vagueness in the input data feature values by using input membership functions with "skirts" allows classifications to be made for events in hyper-cells which are left unclassified by the PD classifier; these classifications are made with similar accuracy to classifications based directly on PD hyper-cell events.

The success of this preliminary work supports the use of a PD based FRBCS, and suggests that continuing work exploring alternate input membership sets and different strategies for using the PD provided rules, which make further use of the information provided by the statistics-based PD algorithm, is warranted. A description and evaluation of a framework for incorporating WOE directly in fuzzy rule weightings is provided in [9].

REFERENCES

[1] A. K. C. Wong and Y. Wang, "High-order pattern discovery from discrete-valued data," IEEE Trans. Knowledge Data Eng., vol. 9, no. 6, pp. 877-893, Nov-Dec 1997.
[2] A. K. C. Wong and Y. Wang, "Pattern discovery: A data driven approach to decision support," IEEE Trans. Syst., Man, Cybern. C, vol. 33, no. 1, pp. 114-124, Feb 2003.
[3] A. Hamilton-Wright and D. W. Stashuk, "Comparing 'pattern discovery' and back-propagation classifiers," in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2005.
[4] Y. Wang and A. K. C. Wong, "From association to classification: Inference using weight of evidence," IEEE Trans. Knowledge Data Eng., vol. 15, no. 3, pp. 764-767, May-June 2003.
[5] S. J. Haberman, "The analysis of residuals in cross-classified tables," Biometrics, vol. 29, no. 1, pp. 205-220, Mar 1973.
[6] T. Takagi and M. Sugeno, "Fuzzy identification of systems and its application to modeling," IEEE Trans. Syst., Man, Cybern., vol. 15, pp. 116-132, 1985.
[7] M. Sugeno and G. T. Kang, "Structure identification of fuzzy model," Fuzzy Sets and Systems, vol. 28, pp. 15-33, 1988.
[8] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley and Sons, 2001.
[9] A. Hamilton-Wright and D. W. Stashuk, "A framework for incorporating 'weight of evidence' within a fuzzy rule based classification system," in Proceedings of the North American Fuzzy Information Processing Society (NAFIPS), 2005.
