
A Neuro-Fuzzy Approach for Data Mining and Its Application to Medical Diagnosis

Mohamed Farouk Abdel Hady
Teaching Assistant
Information and Computer Science Department
Institute of Statistical Studies and Research
Cairo University
[email protected]

Prof. Dr. Adel Elmaghraby
Chair
Computer Engineering and Computer Science Department
J. B. Speed Scientific School
University of Louisville
[email protected]

Abstract: Data mining is a technique for discovering hidden knowledge of strategic importance from large databases, and it is currently being used successfully in a number of real-world applications. Data mining is a tool for increasing the productivity of people trying to build predictive models, building on long lines of work on pattern recognition and classification. Rule induction, a method for deriving a set of rules to classify cases, is one of the best-known data mining techniques. In this paper, we present a new approach that builds a fuzzy model from a numerical input-output database through rule induction, using a special type of backpropagation neural network (BPNN). The fuzzy model is derived in three phases: initialization, optimization, and simplification. Medical diagnosis of diseases is an important and difficult task in medicine; to verify the effectiveness of the proposed approach, it was applied to three well-known medical diagnosis problems.

Keywords: Data mining, feature selection, fuzzy modeling, rule induction, neuro-fuzzy.

Arabic abstract (translated): A New Method for Knowledge Extraction Using Fuzzy-Model Neural Networks and Its Applications in Medical Diagnosis.

The increasing use of neural networks in recent years has made the extraction of knowledge from them an important issue. In this paper, we present a new method for extracting fuzzy rules from numerical data, for use in pattern classification and medical diagnosis. The proposed method combines the merits of fuzzy logic theory and neural networks. The fuzzy rules are extracted in three phases: an initialization phase, an optimization phase, and finally a simplification phase for the fuzzy model. In the first phase, the data set is automatically partitioned into a set of clusters based on input-similarity and output-similarity tests. A membership function is associated with each cluster; this function is defined according to the arithmetic mean and the statistical variance of the data points falling in each cluster. A fuzzy rule is then extracted from each cluster, forming in the end a fuzzy model. In the second phase, the fuzzy model extracted in the first phase is used as a starting point for building a neural network, and the parameters of the fuzzy model are then refined, through analysis of the nodes of the trained network, using the backpropagation method. In the third phase, a subset of the inputs is selected; this increases accuracy and reduces the complexity of the knowledge acquisition process. Finally, the proposed method is evaluated by applying it to a number of well-known data sets according to known evaluation criteria, and the results are compared with the results of a number of other methods used in the same research field.


1. INTRODUCTION

Databases today can range in size into the terabytes. Within these masses of data lies hidden information of strategic importance. Data mining is the process used to discover patterns and relationships in data that may then be used to make valid predictions and decisions. Data mining offers value across a broad range of real-world applications. Medical applications are a particularly fruitful area: data mining can be used to predict the effectiveness of surgical procedures, medical tests, or medications. Recently, various soft computing methodologies have been applied to handle the different challenges posed by data mining.

Soft computing is a collection of methodologies that work synergistically and provide, in one form or another, flexible information processing capability for handling real-life ambiguous situations [1]. Its aim is to exploit the tolerance for imprecision, uncertainty, approximate reasoning, and partial truth in order to achieve tractability, robustness, and low-cost solutions [2]. The main constituents of soft computing include fuzzy logic, neural networks, genetic algorithms, and rough sets. Each of them contributes a distinct methodology for addressing problems in its domain, in a cooperative rather than a competitive manner. The result is a more intelligent and robust system providing a human-interpretable, low-cost, approximate solution, as compared to traditional techniques. The modeling of imprecise and qualitative knowledge, as well as the transmission and handling of uncertainty at various stages, is possible through the use of fuzzy sets. Fuzzy logic is capable of supporting, to a reasonable extent, human-type reasoning in natural form; its development led to the emergence of soft computing. Neural networks were earlier thought to be unsuitable for data mining because of their inherent black-box nature: no information was available from them in symbolic form, suitable for verification or interpretation by humans. Recently there has been widespread activity aimed at redressing this situation by extracting the knowledge embedded in trained networks in the form of symbolic rules [3]. Unlike fuzzy sets, the main contribution of neural networks toward data mining stems from rule induction [4] and clustering.

Neuro-fuzzy computing [2] is one of the most popular hybridizations widely reported in the literature (see [5] for a survey of the field). It integrates the merits of neural and fuzzy approaches, enabling one to build more intelligent decision-making systems: it incorporates the generic advantages of artificial neural networks, like massive parallelism, robustness, and learning in data-rich environments, while the modeling of imprecise and qualitative knowledge in natural/linguistic terms, as well as the transmission of uncertainty, is made possible through the use of fuzzy logic [15, 16]. While there are many methods for inducing rules from specialized networks, only a small number of published techniques extract rules from local basis function networks:

• Tresp, Hollatz and Ahmed [6] describe a method for extracting rules from a Gaussian Radial Basis Function (RBF) network.

• Berthold and Huber [7] describe a method for inducing rules from the Rectangular Basis Function (RecBF) network.

• Abe and Lan [8] describe a recursive method for constructing hyper-boxes and extracting fuzzy rules from them.

• Duch et al. [9] describe a method for the induction, optimization, and application of sets of fuzzy rules built from 'soft trapezoidal' membership functions.


• Lapedes and Faber [10] give a method for constructing locally responsive units using pairs of axis-parallel logistic sigmoid functions; subtracting the value of one sigmoid from the other constructs such a local response region. They did not, however, offer a training scheme for networks constructed of such units.

• Geva and Sitte [11] describe a parameterization and training scheme for networks composed of such sigmoid based hidden units.

• Andrews and Geva [12, 13] propose a method to extract and refine crisp rules from these networks.

We propose a neuro-fuzzy approach for building data mining models. The proposed approach is called FRULEX (Fuzzy RULe EXtractor). A fuzzy knowledge base (FKB) is induced from a given database in three phases, as shown in Figure 1. In the first phase, the data set is partitioned automatically into a set of clusters based on input-similarity and output-similarity tests. Membership functions associated with each cluster are defined according to statistical means and variances of the data points. Then, a fuzzy if-then rule is derived from each cluster to form a fuzzy model. In the second phase, the derived set of rules is used as a starting point to construct a BPNN, and the fuzzy model parameters are refined using the gradient descent method to increase the accuracy of the fuzzy model. In the third phase, a simplification method is used to reduce the antecedent parts of the derived fuzzy model.

[Figure 1 shows the three phases of the proposed approach: Initialization (SCRG), Optimization (neural network BP learning), and Simplification (feature subset selection), taking data as input and producing a fuzzy model.]

Figure 1: Steps in the Proposed Approach

The n-fold cross validation method is used. In this method, the data is randomly divided into n disjoint groups. For example, suppose the data is divided into ten groups. The first group is set aside for testing and the other nine are collected together for model building. The model built on the 90% group is then used to predict the group that was set aside. This process is repeated a total of 10 times as each group in turn is set aside, the model is built on the remaining 90% of the data, and then that model is used to predict the set-aside group. Finally, a model is built using all the data. The mean of the 10 independent error rate predictions is used as the error rate for this last model.
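As a minimal, self-contained illustration of this protocol (our sketch, not part of the original experiments; train_fn and eval_fn are hypothetical placeholders for the model-building and error-rate routines), n-fold cross validation can be written in Python as:

    import random

    def n_fold_cross_validation(data, train_fn, eval_fn, n=10, seed=0):
        # Randomly divide the data into n disjoint groups.
        data = list(data)
        random.Random(seed).shuffle(data)
        folds = [data[i::n] for i in range(n)]
        error_rates = []
        for i in range(n):
            test_set = folds[i]                      # the set-aside group
            train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
            model = train_fn(train_set)              # build the model on the other n-1 groups
            error_rates.append(eval_fn(model, test_set))
        final_model = train_fn(data)                 # finally, a model is built on all the data
        # The mean of the n independent error estimates is reported for this last model.
        return final_model, sum(error_rates) / n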

The rest of this paper is organized as follows. Section 2 illustrates the architecture of the neuro-fuzzy network used by our approach. The Self-Constructing Rule Generator (SCRG) is described in Section 3. Section 4 describes the learning method of the network. Simplification of the derived fuzzy model is described in Section 5. Experimental results are presented in Section 6. An evaluation of FRULEX is presented in Section 7. Finally, conclusions and future work are given in Section 8.

2. THE NEURO-FUZZY NETWORK ARCHITECTURE

Figure 2 illustrates the four-layer neural network used by our approach. The layers of the neuro-fuzzy neural network are described as follows:



Figure 2: Architecture of the proposed neuro-fuzzy neural network

Layer 1: This layer contains N nodes. Node i of this layer produces its output by transmitting its input signal directly to layer 2, i.e., for $1 \le i \le N$:

$$O_i^{(1)} = x_i \quad (1)$$

Layer 2: This layer contains J groups and each group contains N nodes. Each group represents the IF-part of a fuzzy rule. Node (i, j) produces its output by computing the value of the corresponding normalized ridge function, for $1 \le i \le N$ and $1 \le j \le J$:

$$O_{ij}^{(2)} = r_{ij} = r(x_i; c_{ij}, b_{ij}, k_{ij}) = \frac{\sigma(k_{ij},\, x_i - (c_{ij} - b_{ij})) - \sigma(k_{ij},\, x_i - (c_{ij} + b_{ij}))}{\sigma(k_{ij}, b_{ij}) - \sigma(k_{ij}, -b_{ij})} \quad (2)$$

where $\sigma(k, x) = 1/(1 + e^{-kx})$ denotes the logistic sigmoid with steepness k. The parameters c_ij, b_ij, and k_ij of the sigmoid functions $\sigma(k_{ij},\, x_i - (c_{ij} - b_{ij}))$ and $\sigma(k_{ij},\, x_i - (c_{ij} + b_{ij}))$ represent the center, breadth, and edge steepness, respectively, of the ridge function $r(x_i; c_{ij}, b_{ij}, k_{ij})$, where x_i is the ith input value (see Figure 3 and Figure 4).

Figure 3: Ridge function

Figure 4: Local response unit


Layer 3: This layer contains J nodes. Node j of this layer produces its output by computing the value of the logistic function, i.e., for $1 \le j \le J$:

$$O_j^{(3)} = \ell(x; c_j, b_j, K) = \sigma\!\left(K,\ \sum_{i=1}^{N} O_{ij}^{(2)} - B\right) \quad (3)$$

The parameter B is set to produce appreciable activation only when each of the x_i input values lies in the ridge defined in the ith dimension. The parameter K is chosen such that the output sigmoid $\ell(x; c, b, k)$ cuts off the secondary ridges outside the boundary of the local function. Experiments have shown that good network performance can be obtained if B is set equal to the input dimensionality, B = N, and K is set in the range 2-4.

Layer 4: This layer contains M nodes. Node k of this layer produces its output by centroid defuzzification, that is,

$$O_k^{(4)} = \frac{\sum_{j=1}^{J} O_j^{(3)} w_{jk}}{\sum_{j=1}^{J} O_j^{(3)}} \quad (4)$$

where w_jk is the output weight associated with each of the individual local response functions $\ell(x; c, b, k)$. Clearly, c_ij, b_ij, and w_jk are the parameters that can be tuned to improve the performance of the fuzzy model. We use the error backpropagation gradient descent method to refine these parameters.
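To make the four-layer computation concrete, the following Python sketch implements equations (1)-(4) directly (our rendering with NumPy; the array shapes and the name forward are our own choices, and B = N, K = 2 are used as suggested above):

    import numpy as np

    def sigmoid(k, x):
        # logistic sigmoid with steepness k: sigma(k, x) = 1 / (1 + exp(-k * x))
        return 1.0 / (1.0 + np.exp(-k * x))

    def ridge(x, c, b, k):
        # normalized ridge function of Eq. (2)
        num = sigmoid(k, x - (c - b)) - sigmoid(k, x - (c + b))
        den = sigmoid(k, b) - sigmoid(k, -b)
        return num / den

    def forward(x, c, b, k, w, K=2.0):
        # x: (N,) input; c, b, k: (N, J) premise parameters; w: (J, M) consequent weights
        N, J = c.shape
        O2 = ridge(x[:, None], c, b, k)                  # layer 2, Eq. (2)
        O3 = sigmoid(K, O2.sum(axis=0) - N)              # layer 3, Eq. (3), with B = N
        O4 = (O3[:, None] * w).sum(axis=0) / O3.sum()    # layer 4, Eq. (4)
        return O2, O3, O4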

3. SELF-CONSTRUCTING RULE GENERATOR

Unlike common clustering-based methods, such as the fuzzy c-means method, which require the number of clusters, and hence the number of rules, to be appropriately pre-selected, SCRG performs clustering with the ability to adapt the number of clusters as it proceeds.

• First, the given input-output data set is partitioned into fuzzy (overlapped) clusters. The degree of association is strong for data points within the same fuzzy cluster and weak for data points in different fuzzy clusters.

• Then, a fuzzy if-then rule describing the distribution of the data in each fuzzy cluster is obtained. Lee et al. [14] propose an approach for neuro-fuzzy system modeling using a similar method.

• For a system with N inputs and M outputs, we define a fuzzy cluster j as a pair $(\ell_j(x), w_j)$, where $\ell_j(x)$ is defined as

$$\ell_j(x) = \ell(x; c_j, b_j, k_j, K) = \sigma\!\left(K,\ \sum_{i=1}^{N} r(x_i; c_{ij}, b_{ij}, k_{ij}) - B\right) \quad (5)$$

where $x = [x_1, \ldots, x_N]$, $c_j = [c_{1j}, \ldots, c_{Nj}]$, $b_j = [b_{1j}, \ldots, b_{Nj}]$, $k_j = [k_{1j}, \ldots, k_{Nj}]$, K, and $w_j$ denote the input vector, the center vector, the width vector, the edge-steepness vector, the output steepness, and the height vector of cluster j, respectively.

• Let J be the number of existing fuzzy clusters and Sj be the size of cluster j. Clearly, J initially equals zero.


• For an input-output instance v, $(p_v, q_v)$, where $p_v = [p_{v1}, \ldots, p_{vN}]$ and $q_v = [q_{v1}, \ldots, q_{vM}]$, we calculate $\ell_j(p_v)$ for each existing cluster j, $1 \le j \le J$. We say that instance v passes the input-similarity test on cluster j if

$$\ell_j(p_v) \ge \rho \quad (6)$$

where ρ, $0 \le \rho \le 1$, is a predefined threshold. Then, for each cluster j on which instance v has passed the input-similarity test, we calculate

$$e_{vjk} = |q_{vk} - w_{jk}| \quad (7)$$

Let $d_k = q_{k\max} - q_{k\min}$, where $q_{k\max}$ and $q_{k\min}$ are the maximum and minimum values of the kth output, respectively, of the given data set.

• We say that instance v passes the output-similarity test on cluster j if

$$e_{vjk} \le \tau\, d_k \quad (8)$$

where τ, $0 \le \tau \le 1$, is another predefined threshold.

• We have two cases. First, there is no existing fuzzy cluster on which instance v has passed both the input-similarity and output-similarity tests. For this case, we assume that instance v is not close enough to any existing cluster, and a new fuzzy cluster k = J+1 is created with

$$c_k = p_v, \quad b_k = b_o, \quad w_k = q_v \quad (9)$$

such that $b_o = [b_{o1}, \ldots, b_{oi}, \ldots, b_{oN}]$ and $b_{oi} = \sigma_o (X_{i,\mathrm{upper}} - X_{i,\mathrm{lower}})$, where $X_{i,\mathrm{upper}}$ and $X_{i,\mathrm{lower}}$ are the upper and lower limits of the ith input, respectively, of the given data set, and $\sigma_o$ is a user-defined constant. Note that the new cluster k contains only one member, instance v, at this time. Of course, the number of existing clusters is increased by 1 and the size of cluster k should be initialized to 1, that is,

$$J = J + 1 \quad \text{and} \quad S_k = 1 \quad (10)$$

• Second, if there are a number of fuzzy clusters on which instance v has passed both the input-similarity and output-similarity tests, let these clusters be $j_1, j_2, \ldots, j_f$, and let cluster t be the cluster with the largest membership degree:

$$\ell_t(p_v) = \max(\ell_{j_1}(p_v), \ell_{j_2}(p_v), \ldots, \ell_{j_f}(p_v)) \quad (11)$$

• In this case, we assume that instance v is closest to cluster t, so cluster t should be modified, as shown below, to include instance v as its member. That is, for $1 \le i \le N$:

$$b_{ti} = \sqrt{\frac{(S_t - 1)(b_{ti} - b_{oi})^2 + S_t c_{ti}^2 + p_{vi}^2}{S_t} - \frac{S_t + 1}{S_t}\left(\frac{S_t c_{ti} + p_{vi}}{S_t + 1}\right)^2} + b_{oi} \quad (12)$$

$$c_{ti} = \frac{S_t c_{ti} + p_{vi}}{S_t + 1} \quad (13)$$

$$w_{tk} = \frac{S_t w_{tk} + q_{vk}}{S_t + 1} \quad (14)$$

$$S_t = S_t + 1 \quad (15)$$

Note that J is not changed in this case.

• The above-mentioned process is iterated until all the input-output instances have been processed. At the end, we have J fuzzy clusters. Note that each cluster j is described as $(\ell_j(x), w_j)$, where $\ell_j(x)$ contains the center vector $c_j$ and the width vector $b_j$. We can represent cluster j by a fuzzy rule having the form:

$$R_j: \text{IF } x_1 \text{ IS } \mu_{1j}(x_1) \text{ AND} \ldots \text{AND } x_i \text{ IS } \mu_{ij}(x_i) \text{ AND} \ldots \text{AND } x_N \text{ IS } \mu_{Nj}(x_N)$$
$$\text{THEN } y_1 \text{ IS } w_{j1} \text{ AND} \ldots \text{AND } y_k \text{ IS } w_{jk} \text{ AND} \ldots \text{AND } y_M \text{ IS } w_{jM} \quad (16)$$

such that

$$\mu_{ij}(x_i) = r(x_i; c_{ij}, b_{ij}, k_{ij}) = \frac{\sigma(k_{ij},\, x_i - (c_{ij} - b_{ij})) - \sigma(k_{ij},\, x_i - (c_{ij} + b_{ij}))}{\sigma(k_{ij}, b_{ij}) - \sigma(k_{ij}, -b_{ij})} \quad (17)$$

where $w_{jk}$ represents the kth consequent part and $\mu_{ij}(x_i)$ represents the ith antecedent part, for $1 \le k \le M$ and $1 \le i \le N$. The firing strength of rule j is

$$\alpha_j = \prod_{i=1}^{N} r(x_i; c_{ij}, b_{ij}, k_{ij}) \quad (18)$$

• Finally, we have a set of J initial fuzzy rules for the given input-output data set.

With this approach, when new training data are considered, the existing clusters can be adjusted or new clusters can be created, without the necessity of regenerating the whole set of rules from scratch.
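A compact Python sketch of this clustering loop is given below. It follows equations (5)-(15) under our reading of the source (in particular the width update of Eq. (12) and the choice k = K0/b for the edge steepness, cf. Eq. (29) in Section 4), so it should be read as an illustration rather than the authors' exact code:

    import numpy as np

    def sigmoid(k, x):
        return 1.0 / (1.0 + np.exp(-k * x))

    def scrg(P, Q, rho, tau, sigma_o, K=2.0, K0=2.0):
        # P: (num_patterns, N) inputs; Q: (num_patterns, M) outputs
        N = P.shape[1]
        b0 = sigma_o * (P.max(axis=0) - P.min(axis=0))   # initial widths, per dimension
        d = Q.max(axis=0) - Q.min(axis=0)                # output ranges for Eq. (8)
        clusters = []                                    # each cluster: dict with c, b, w, S

        def membership(p, c, b):                         # Eq. (5), with B = N
            k = K0 / b
            r = (sigmoid(k, p - (c - b)) - sigmoid(k, p - (c + b))) \
                / (sigmoid(k, b) - sigmoid(k, -b))
            return sigmoid(K, r.sum() - N)

        for p, q in zip(P, Q):
            # clusters passing both similarity tests, Eqs. (6)-(8)
            passed = [cl for cl in clusters
                      if membership(p, cl['c'], cl['b']) >= rho
                      and np.all(np.abs(q - cl['w']) <= tau * d)]
            if not passed:
                # case 1: create a new cluster, Eqs. (9)-(10)
                clusters.append({'c': p.copy(), 'b': b0.copy(), 'w': q.copy(), 'S': 1})
            else:
                # case 2: update the closest cluster, Eqs. (11)-(15)
                t = max(passed, key=lambda cl: membership(p, cl['c'], cl['b']))
                S, c, b, w = t['S'], t['c'], t['b'], t['w']
                c_new = (S * c + p) / (S + 1)            # Eq. (13)
                var = ((S - 1) * (b - b0) ** 2 + S * c ** 2 + p ** 2
                       - (S + 1) * c_new ** 2) / S       # Eq. (12), incremental variance
                t['b'] = np.sqrt(np.maximum(var, 0.0)) + b0
                t['c'] = c_new
                t['w'] = (S * w + q) / (S + 1)           # Eq. (14)
                t['S'] = S + 1                           # Eq. (15)
        return clusters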

4. ERROR BACKPROPAGATION TRAINING METHOD

Backpropagation training is simply a version of gradient descent, a type of algorithm that tries to reduce a target value (which is error, in the case of neural nets) at each step. A four-layer neural network is constructed, as shown in Figure 2, by turning each fuzzy rule, obtained in phase one, into a local response unit (LRU). The goal of this phase is to adjust both the premise and consequent parameters so as to minimize the mean squared error function shown below

$$E = \frac{1}{P} \sum_{v=1}^{P} E_v \quad (19)$$

where $E_v = \frac{1}{2} \sum_{k=1}^{M} (e_{vk})^2$, $e_{vk} = y_k(v) - q_{vk}$, and $y_k(v) = O_k^{(4)}(p_v)$ is the actual output for the vth training pattern.

• The update formula for a generic weight α is

$$\Delta\alpha = -\eta_\alpha \, \frac{\partial E}{\partial \alpha} \quad (20)$$

where $\eta_\alpha$ is the learning rate for that weight. In summary, we are given a training set T of P training patterns,

$$T = \{(p_v, q_v) : v = 1, \ldots, P\}, \qquad p_v = (p_{v1}, \ldots, p_{vN}), \quad q_v = (q_{v1}, \ldots, q_{vM}).$$

• For the sake of simplicity, the subscript v indicating the current sample will be dropped in the following derivation.

• Starting at the first layer, a forward pass is used to compute the activity levels of all the nodes in the network to obtain the current output values.

• Then, a backward pass is started at the output layer in which the error in the output is computed by finding the difference between the calculated output and the desired output.

• Next, the error from the output is assigned to the hidden layer nodes proportionally to their weights, $\partial E / \partial \alpha$.


• Finally, the error at each of the hidden and output nodes is used by the algorithm to adjust the weight coming into that node to reduce the error.

• This process is repeated for each row in the training set. Each pass through all rows in the training set is called an epoch. The training set will be used repeatedly, until the error no longer decreases.

• The complete learning algorithm is summarized as follows:

1. Initialize the weights {c_ij, b_ij, k_ij} (i = 1,...,N; j = 1,...,J) and {w_jk} (j = 1,...,J; k = 1,...,M) with the rule parameters obtained in the SCRG phase.

2. Select the next input vector p from T, propagate it through the network, and determine the output $y_k = O_k^{(4)}$.

3. Compute the error terms as follows:

$$\delta_k^{(4)} = O_k^{(4)} - q_k \quad (21)$$

$$\delta_j^{(3)} = \sum_{k=1}^{M} \delta_k^{(4)} \, \frac{w_{jk} - O_k^{(4)}}{\sum_{t=1}^{J} O_t^{(3)}} \quad (22)$$

$$\delta_{ij}^{(2)} = \delta_j^{(3)} \, K \, O_j^{(3)} \left(1 - O_j^{(3)}\right) \quad (23)$$

4. Update the gradients of {c_ij, b_ij} and {w_jk}, respectively, according to the following, where $\sigma^{+} = \sigma(k_{ij},\, x_i - (c_{ij} - b_{ij}))$ and $\sigma^{-} = \sigma(k_{ij},\, x_i - (c_{ij} + b_{ij}))$:

$$\frac{\partial E}{\partial c_{ij}} \;{+}{=}\; -\delta_{ij}^{(2)} \, k_{ij} \, \frac{\sigma^{+}(1-\sigma^{+}) - \sigma^{-}(1-\sigma^{-})}{\sigma(k_{ij}, b_{ij}) - \sigma(k_{ij}, -b_{ij})} \quad (24)$$

$$\frac{\partial E}{\partial b_{ij}} \;{+}{=}\; \delta_{ij}^{(2)} \, k_{ij} \, \frac{\sigma^{+}(1-\sigma^{+}) + \sigma^{-}(1-\sigma^{-})}{\sigma(k_{ij}, b_{ij}) - \sigma(k_{ij}, -b_{ij})} \quad (25)$$

$$\frac{\partial E}{\partial w_{jk}} \;{+}{=}\; \delta_k^{(4)} \, \frac{O_j^{(3)}}{\sum_{t=1}^{J} O_t^{(3)}} \quad (26)$$

5. After applying the whole training set T, update the weights {c_ij, b_ij, k_ij} and {w_jk}, respectively, according to:

$$\Delta c_{ij} = -\eta \, \frac{\partial E}{\partial c_{ij}} \quad (27)$$

$$\Delta b_{ij} = -\eta \, \frac{\partial E}{\partial b_{ij}} \quad (28)$$

$$k_{ij} = \frac{K_o}{b_{ij}} \quad (29)$$

$$\Delta w_{jk} = -\eta \, \frac{\partial E}{\partial w_{jk}} \quad (30)$$

where η is the learning rate (by a proper selection of η the speed of convergence can be varied) and K_o is the initial steepness.

6. If E < ε or the maximum number of iterations is reached, stop; else go to step 2 (where ε is the error goal).
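As a concrete reading of steps 3 and 4 (our sketch; it reuses the array conventions of the forward-pass sketch in Section 2, and the function name backward is ours), the error terms and the per-pattern gradient contributions can be computed as:

    import numpy as np

    def sigmoid(k, x):
        return 1.0 / (1.0 + np.exp(-k * x))

    def backward(x, q, c, b, k, w, O3, O4, K=2.0):
        # error terms, Eqs. (21)-(23)
        d4 = O4 - q                                                   # (M,)  Eq. (21)
        sum3 = O3.sum()
        d3 = ((w - O4[None, :]) * d4[None, :]).sum(axis=1) / sum3     # (J,)  Eq. (22)
        d2 = d3 * K * O3 * (1.0 - O3)                                 # (J,)  Eq. (23)
        # per-pattern gradient contributions, Eqs. (24)-(26)
        sp = sigmoid(k, x[:, None] - (c - b))                         # sigma+  (N, J)
        sm = sigmoid(k, x[:, None] - (c + b))                         # sigma-  (N, J)
        D = sigmoid(k, b) - sigmoid(k, -b)                            # ridge normalizer
        g_c = -d2[None, :] * k * (sp * (1 - sp) - sm * (1 - sm)) / D  # Eq. (24)
        g_b = d2[None, :] * k * (sp * (1 - sp) + sm * (1 - sm)) / D   # Eq. (25)
        g_w = d4[None, :] * O3[:, None] / sum3                        # (J, M)  Eq. (26)
        return g_c, g_b, g_w

    # After accumulating these over the whole training set, apply Eqs. (27)-(30):
    #   c -= eta * G_c;  b -= eta * G_b;  w -= eta * G_w;  k = K_o / b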

5. FEATURE SUBSET SELECTION BY RELEVANCE

Since in application areas like medicine not only the accuracy but also the simplicity and comprehensibility of the rules are important, the derived fuzzy model is reduced by applying a feature selection algorithm to cope with the high dimensionality of real-world databases.

• We first sort the features according to their relevance for the classification. That is, features are sorted from the most relevant one to the least relevant one, where the relevance of a feature is measured by the test-set classification accuracy obtained when that feature is removed: the lower the accuracy after removal, the more relevant the feature.

• Then, a neural network is created by using the best feature (the most relevant one). The classification accuracy of the network on the test dataset is saved for that subset.

• Next, the best two features are tested, followed by the best three features, and so on, until the best N features (the full feature set) are tested. For example, if the sorted list is {f1, f2, ..., fN}, we test the subsets {f1}, {f1, f2}, {f1, f2, f3}, ..., {f1, f2, ..., fN} and find the subset with the best test-set classification accuracy.

• Since we want the smallest feature subset, we take the full-feature-set accuracy (acc_full) as our base and find the smallest subset whose accuracy lies within a certain range of that accuracy: acc_next ≥ acc_full − β.

• For example, if the accuracy of the full feature set is 95%, the best current subset with 3 features has an accuracy of 97%, the next best subset with 2 features has an accuracy of 92%, and β = 5%, then we choose the subset with 2 features (because 92% ≥ 95% − 5%), and it becomes the best subset. An outline of the feature subset selection algorithm is shown below.

    visitedList = emptySet
    N = numFeats(fullFeatureSet)
    Construct a Network by using fullFeatureSet and the training set
    fullAcc = classification accuracy of that Network on the test set
    for (i = 0; i < N; i++) {
        currentSubSet = fullFeatureSet - feature_i
        Construct a Network by using currentSubSet and the training set
        Test the Network by using the test set
        Find the classification accuracy (acc_i) on the test set
        Add the pair (feature_i, acc_i) to the visitedList
    }
    sort the visitedList in ascending order according to accuracy
    // Now the visitedList runs from the most relevant feature to the least:
    // removing a relevant feature yields a low accuracy.
    bestAcc = fullAcc
    bestSubSet = fullFeatureSet   // start from the full set so that smaller subsets can win
    currentSubSet = emptySet
    for (i = 0; i < N; i++) {
        if (bestAcc == 100 AND numFeats(bestSubSet) == 1) STOP
        Add the next most relevant feature to the currentSubSet
        Construct a Network by using currentSubSet and the training set
        Test the Network by using the test set
        Find the classification accuracy (currentAcc) on the test set
        if ((currentAcc >= fullAcc - Beta) AND (numFeats(currentSubSet) < numFeats(bestSubSet))) {
            bestAcc = currentAcc
            bestSubSet = currentSubSet
        }
    }
    return bestSubSet
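For concreteness, the same search can also be written as a short, runnable Python sketch (our rendering, not the authors' code; eval_subset is a hypothetical placeholder for "construct a network from the training set and return its test-set classification accuracy in percent"):

    def select_feature_subset(features, eval_subset, beta):
        # Relevance of a feature = accuracy obtained when that feature is removed;
        # removing a relevant feature hurts, so sort ascending by that accuracy.
        full_acc = eval_subset(list(features))
        removed = [(f, eval_subset([g for g in features if g != f])) for f in features]
        ranked = [f for f, _ in sorted(removed, key=lambda fa: fa[1])]
        best_subset, best_acc = list(features), full_acc
        current = []
        for f in ranked:                       # grow subsets {f1}, {f1, f2}, ...
            current = current + [f]
            acc = eval_subset(current)
            # keep the smallest subset within beta of the full-feature accuracy
            if acc >= full_acc - beta and len(current) < len(best_subset):
                best_subset, best_acc = list(current), acc
            if best_acc == 100 and len(best_subset) == 1:
                break                          # a single perfect feature cannot be beaten
        return best_subset, best_acc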

6. EXPERIMENTAL RESULTS

Diagnosis of diseases is an important and difficult task in medicine. Detecting a disease from several factors or symptoms is a many-layered problem that may also lead to false assumptions with often unpredictable effects. Therefore, the attempt to use the knowledge and experience of many specialists, collected in databases, to support the diagnosis process seems reasonable.

A. Wisconsin Breast Cancer Database

The Wisconsin breast cancer dataset (WBCD) [17] contains 699 instances, with 458 benign (65.5%) and 241 malignant (34.5%) cases. Nine features with integer values in the range [1, 10] are used for each instance (see Table 1). For 16 instances one attribute is missing; it was replaced by an average value.

Table 1: Features and feature values for the WBCD

ID   Feature                       Feature values
F1   Clump thickness               [1, 10]
F2   Uniformity of cell size       [1, 10]
F3   Uniformity of cell shape      [1, 10]
F4   Marginal adhesion             [1, 10]
F5   Single epithelial cell size   [1, 10]
F6   Bare nuclei                   [1, 10]
F7   Bland chromatin               [1, 10]
F8   Normal nucleoli               [1, 10]
F9   Mitoses                       [1, 10]

To estimate the performance of the FKB extracted by the proposed approach, we carried out a 10-fold cross-validation. The whole dataset was divided into 10 equally sized groups (a group consists of 70 samples randomly drawn from the two classes). Each group was used as a test set for the fuzzy system trained with the remaining 629 data points.

A.1. For Initialization

The SCRG method is used to determine the initial centers and widths of the membership functions of the input features. Table 2 summarizes the results obtained after applying the SCRG phase for the ten trials. (We have σo = 0.3, ρ = 0.01, and τ = 0.195.)


Table 2: Results of the 10-fold cross validation after SCRG for the WBCD

After Initialization (WBCD)
              Training Set        Test Set           Whole Set
Run   Rules   Acc. [%]   Mis.    Acc. [%]   Mis.    Acc. [%]   Mis.
1     4       97.43      9       97.13      10      97.28      9.5
2     4       96.29      13      97.71      8       97.00      10.5
3     4       97.14      10      94.84      18      95.99      14
4     4       96.00      14      91.98      28      93.99      21
5     5       97.71      8       95.99      14      96.85      11
6     4       96.00      14      97.71      8       96.86      11
7     4       96.29      13      95.70      15      96.00      14
8     3       97.71      8       95.99      14      96.85      11
9     5       96.57      12      95.70      15      96.14      13.5
10    5       96.86      11      97.13      10      97.00      10.5
avg.  4.20    96.80      11.20   95.99      14.00   96.39      12.6

A.2. For Optimization

The gradient-descent backpropagation learning method is used to optimize the fuzzy classifier induced in phase one. A network with 9 inputs and 2 outputs, corresponding to the two classes, was constructed. Table 3 summarizes the results obtained after 100 epochs for the ten trials. (We have ε = 0.01 and η = 1.0.)

Table 3: Results of the 10-fold cross validation after optimization for the WBCD

After Optimization (WBCD)
              Training Set        Test Set           Whole Set
Run   Rules   Acc. [%]   Mis.    Acc. [%]   Mis.    Acc. [%]   Mis.
1     4       97.43      9       95.99      14      96.71      11.5
2     4       96.00      14      97.71      8       96.86      11
3     4       96.86      11      96.56      12      96.71      11.5
4     4       97.43      9       95.99      14      96.71      11.5
5     5       98.00      7       95.70      15      96.85      11
6     4       95.71      15      97.99      7       96.85      11
7     4       97.43      9       96.28      13      96.86      11
8     3       97.71      8       95.99      14      96.85      11
9     5       97.14      10      96.85      11      97.00      10.5
10    5       97.43      9       96.56      12      97.00      10.5
avg.  4.20    97.11      10.10   96.56      12.00   96.84      11.05

As an illustrative example, Figure 5 shows the graphical representation of the FKB obtained, after the optimization phase, in the first run of the 10 trials. (Using the MATLAB Fuzzy Toolbox.)


Figure 5: Graphical Representation of FRB obtained after Optimization for WBCD

A.3. For Simplification

The Feature Subset Selection by Relevance method is used to simplify the fuzzy classifier induced in phase one. Table 4 summarizes the results obtained after this phase for the ten trials. (We have β = 1.)

Table 4: Results of the 10-fold cross validation after Simplification for the WBCD

After Simplification (WBCD)
              Training Set       Test Set          Whole Set
Run   Rules   Acc. [%]  Mis.    Acc. [%]  Mis.    Acc. [%]  Mis.    Antec.  Best Feature Set
1     4       97.71     8       95.99     14      96.85     11      5       F1, F2, F3, F6, F9
2     4       93.14     24      97.13     10      95.14     17      4       F4, F5, F6, F8
3     4       94.57     19      95.99     14      95.28     16.5    2       F2, F6
4     4       97.43     9       95.42     16      96.43     12.5    5       F1, F3, F4, F5, F6
5     5       96.29     13      95.42     16      95.86     14.5    2       F2, F6
6     4       95.71     15      97.42     9       96.57     12      4       F1, F3, F4, F6
7     4       94.57     19      95.99     14      95.28     16.5    2       F3, F6
8     3       96.29     13      94.84     18      95.57     15.5    5       F2, F3, F4, F8, F9
9     5       95.71     15      95.70     15      95.71     15      3       F2, F6, F9
10    5       96.29     13      96.56     12      96.43     12.5    3       F1, F3, F6
Avg.  4.20    95.77     14.80   96.05     13.80   95.91     14.30   3.5     F6, F1, F3, F4, F5

Figure 6: Performance of the RBPN during removal of input features for the WBCD (test classification accuracy vs. removed feature, F1-F9)


For the tenth run of the ten trials, Figure 6 shows the performance of the networks constructed by the successive removal of input features and Figure 7 shows the performance of the networks constructed by the successive addition of the relevant features.

Figure 7: Performance of the RBPN with different features for the WBCD (test classification accuracy as the features F1, F6, F3 are added)

As an illustrative example, Figure 8 and Figure 9 show the graphical and textual representations, respectively, of the FKB obtained after the simplification phase in the seventh run of the 10 trials. (Using the MATLAB Fuzzy Toolbox.)

Figure 8: Graphical Representation of FRB obtained after Simplification for WBCD

Rule 1: IF (‘Uniformity of cell shape’ IS in3mf1) AND (‘Bare nuclei’ IS in6mf1),
THEN (‘benign’ IS out1mf1) AND (‘malignant’ IS out2mf1)

Rule 2: IF (‘Uniformity of cell shape’ IS in3mf2) AND (‘Bare nuclei’ IS in6mf2),
THEN (‘benign’ IS out1mf2) AND (‘malignant’ IS out2mf2)

Rule 3: IF (‘Uniformity of cell shape’ IS in3mf3) AND (‘Bare nuclei’ IS in6mf3),
THEN (‘benign’ IS out1mf3) AND (‘malignant’ IS out2mf3)

Rule 4: IF (‘Uniformity of cell shape’ IS in3mf4) AND (‘Bare nuclei’ IS in6mf4),
THEN (‘benign’ IS out1mf4) AND (‘malignant’ IS out2mf4)

Where:
in3mf1 = ridgemf (x3; 3.5316, 1.3819, 0.5663)
in3mf2 = ridgemf (x3; 3.4665, 9.7275, 0.5769)
in3mf3 = ridgemf (x3; 4.1170, 4.9983, 0.4858)
in3mf4 = ridgemf (x3; 5.0730, 5.9791, 0.3942)
in6mf1 = ridgemf (x6; 3.6180, 2.1320, 0.5528)
in6mf2 = ridgemf (x6; 5.6169, 8.3227, 0.3561)
in6mf3 = ridgemf (x6; 5.4982, 6.8500, 0.3638)
in6mf4 = ridgemf (x6; 5.5698, 7.7825, 0.3591)

and
out1mf1 = 1.5455    out2mf1 = -0.5455
out1mf2 = -0.0170   out2mf2 = 1.0170
out1mf3 = 0.5499    out2mf3 = 0.4501
out1mf4 = -0.4917   out2mf4 = 1.4917

Figure 9: Textual Representation of the FRB obtained after Simplification for the WBCD

To evaluate the effectiveness of the ten-fold cross-validation results (see Table 5 and Figure 10), they were compared with those obtained by other neural and traditional classifiers developed for the same dataset (see Table 6). It can be seen that the classification accuracy of our approach is comparable with most of the considered models.

Table 5: Summary of classification results of FRULEX for the WBCD

WBCD                         Phase 1    Phase 2    Phase 3
Training   Misclassified     11.2       10.1       14.8
           Accuracy          96.8 %     97.11 %    95.77 %
Test       Misclassified     14         12         13.8
           Accuracy          95.99 %    96.56 %    96.05 %
Average    Misclassified     12.6       11.05      14.3
           Accuracy          96.39 %    96.84 %    95.91 %

Figure 10: Summary of classification results of FRULEX for the WBCD (test accuracy per run for the initialization, optimization, and simplification phases)


Table 6: Comparing FRULEX to some other approaches for the WBCD

Method      Accuracy    Rules              Antecedents per rule    Reference
LOONN       95.6%       N/A                N/A                     [21]
XVNN        95.3%       N/A                N/A                     [21]
C4.5        94.74%      N/A                N/A                     [21]
RBF         96.7%       N/A                N/A                     [18]
FSM         96.5%       12 fuzzy rules     9                       [9]
NEFCLASS    95.06%      2 fuzzy rules      5.5                     [19]
BIO-RE      96.63%      11 crisp rules     2 to 4                  [20]
Full-RE     96.19%      5 crisp rules      1 to 2                  [20]
RULEX       94.4%       5 crisp rules      24                      [12]
FRULEX      95.91%      4.2 fuzzy rules    3.5                     Our result

B. Cleveland Heart Disease Dataset

The Cleveland heart disease dataset [17] (collected at the Cleveland Clinic Foundation by R. Detrano) contains 303 instances, 164 of them (54.1%) healthy; the rest are heart disease instances of varying severity. While the database has 76 raw attributes, only 13 of them are actually used in machine learning tests, comprising five continuous and eight discrete features (see Table 7).

Table 7: Features and feature values for the heart disease dataset

ID    Feature                    Feature values
F1    Age                        Continuous
F2    Sex                        0, 1 (male, female)
F3    Chest pain type            0, 1, 2, 3 (typical angina, atypical angina, non-angina, asymptomatic angina)
F4    Resting blood pressure     Continuous
F5    Serum cholesterol          Continuous
F6    Fasting blood sugar        0, 1 (yes, no)
F7    Resting ECG results        0, 1, 2
F8    Maximum heart rate         Continuous
F9    Exercise induced angina    0, 1 (yes, no)
F10   Peak depression            Continuous
F11   Slope of ST segment        0, 1, 2 (up sloping, flat, down sloping)
F12   Number of major vessels    0, 1, 2, 3
F13   Thal                       3, 6, 7 (normal, fixed defect, reversible defect)

To estimate the performance of the fuzzy classifier induced by the proposed approach, we carried out a 10-fold cross-validation. The whole dataset was divided into 10 equally sized parts (a part consists of 30 samples randomly drawn from the two classes). Each part was used as a test set for the network trained with the remaining 273 data points.


B.1. For Initialization

The SCRG method is used to determine the initial centers and widths of the membership functions of the input features. Table 8 summarizes the results after applying the SCRG phase for the ten trials. (We have σo = 0.5, ρ = 0.01, and τ = 0.195.)

Table 8: Results of the 10-fold cross validation after SCRG for the heart disease dataset

After Initialization (Heart)
              Training Set        Test Set           Whole Set
Run   Rules   Acc. [%]   Mis.    Acc. [%]   Mis.    Acc. [%]   Mis.
1     3       80.26      30      76.16      36      78.21      33
2     5       78.95      32      72.19      42      75.57      37
3     2       82.89      26      79.47      31      81.18      28.5
4     3       69.74      46      72.19      42      70.97      44
5     6       79.61      31      78.81      32      79.21      31.5
6     4       83.55      25      78.81      32      81.18      28.5
7     4       80.92      29      78.81      32      79.87      30.5
8     3       73.68      40      71.52      43      72.60      41.5
9     5       65.13      53      62.25      57      63.69      55
10    5       85.53      22      79.47      31      82.50      26.5
avg.  4.00    78.03      33.40   74.97      37.80   76.50      35.6

B.2. For Optimization

The backpropagation gradient descent learning method is used to optimize the FRB extracted in phase one. A network with 13 inputs and 2 outputs, corresponding to the two classes, was constructed. Table 9 summarizes the results obtained after 100 epochs for the ten trials. (We have ε = 0.01 and η = 1.0.)

Table 9: Results of the 10-fold cross validation after optimization for the Cleveland heart disease dataset

After Optimization (Heart)
              Training Set        Test Set           Whole Set
Run   Rules   Acc. [%]   Mis.    Acc. [%]   Mis.    Acc. [%]   Mis.
1     3       83.55      25      79.47      31      81.51      28
2     5       83.55      25      84.77      23      84.16      24
3     2       82.24      27      78.15      33      80.20      30
4     3       82.89      26      82.12      27      82.51      26.5
5     6       80.92      29      83.44      25      82.18      27
6     4       86.18      21      80.79      29      83.49      25
7     4       82.89      26      80.13      30      81.51      28
8     3       83.55      25      80.13      30      81.84      27.5
9     5       79.61      31      80.13      30      79.87      30.5
10    5       88.16      18      80.13      30      84.15      24
avg.  4.00    83.35      25.30   80.93      28.80   82.14      27.05


As an illustrative example, Figure 11 shows the graphical representation of the FKB obtained after the optimization phase in the first run of the 10 trials. (Using the MATLAB Fuzzy Toolbox.)

Figure 11: Graphical Representation of the FRB obtained after Optimization phase for the Cleveland heart disease dataset

B.3. For Simplification

The Feature Subset Selection by Relevance method is used to simplify the fuzzy classifier induced in phase one. Table 10 summarizes the results obtained after this phase for the ten trials. (We have β = 1.)

Table 10: Results of the 10-fold cross validation after the Simplification phase for the Cleveland heart disease dataset

After Simplification (Heart)
              Training Set       Test Set          Whole Set
Run   Rules   Acc. [%]  Mis.    Acc. [%]  Mis.    Acc. [%]  Mis.    Antec.  Best Feature Set
1     3       84.87     23      81.46     28      83.17     25.5    3       F3, F12, F13
2     5       78.29     33      85.43     22      81.86     27.5    5       F3, F6, F9, F10, F12
3     2       73.03     41      80.13     30      76.58     35.5    2       F3, F12
4     3       79.61     31      83.44     25      81.53     28      3       F3, F12, F13
5     6       84.21     24      86.09     21      85.15     22.5    6       F3, F8, F9, F11, F12, F13
6     4       83.55     25      82.12     27      82.84     26      7       F2, F3, F8, F9, F11, F12, F13
7     4       81.58     28      82.78     26      82.18     27      11      F1, F3, F5, F6, F7, F8, F9, F10, F11, F12, F13
8     3       79.61     31      80.79     29      80.20     30      3       F3, F8, F13
9     5       80.26     30      84.11     24      82.19     27      6       F3, F7, F9, F10, F12, F13
10    5       81.58     28      80.13     30      80.86     29      3       F11, F12, F13
avg.  4.00    80.66     29.40   82.65     26.20   81.65     27.80   4.9     F3, F9, F12, F13

For the first run of the ten trials, Figure 12 shows the performance of the networks constructed by the successive removal of input features, and Figure 13 shows the performance of the networks constructed by the successive addition of the relevant features.


Figure 12: Performance of the RBPN during removal of input features for the Cleveland heart disease dataset (test classification accuracy vs. removed feature, F1-F13)

Figure 13: Performance of the RBPN with different features for the Cleveland heart disease dataset (test classification accuracy as the features F12, F3, F13 are added)

As an illustrative example, Figure 14 and Figure 15 show the graphical and textual representations, respectively, of the fuzzy classifier obtained after the simplification phase in the first run of the 10 trials. (Using the MATLAB Fuzzy Toolbox.)

Figure 14: Graphical Representation of the FRB obtained after Simplification phase for the Cleveland heart disease dataset


Rule 1: IF (F3 IS in3mf1) AND (F12 IS in12mf1) AND (F13 IS in13mf1),

THEN (‘healthy’ IS out1mf1) AND (‘disease’ IS out2mf1)

Rule 2: IF (F3 IS in3mf2) AND (F12 IS in12mf2) AND (F13 IS in13mf2),

THEN (‘healthy’ IS out1mf2) AND (‘disease’ IS out2mf2)

Rule 3: IF (F3 IS in3mf3) AND (F12 IS in12mf3) AND (F13 IS in13mf3),

THEN (‘healthy’ IS out1mf3) AND (‘disease’ IS out2mf3)

Where: in3mf1 = ridgemf (x3; 0.7861, 0.6098, 2.5443)

in3mf2 = ridgemf (x3; 0.7742, 0.8458, 2.5833) in3mf3 = ridgemf (x3; 0.5000, 1.0000, 4.0000)

in12mf1 = ridgemf (x12; 0.6816, 0.0731, 2.9343) in12mf2 = ridgemf (x12; 0.8592, 0.3582, 2.3277) in12mf3 = ridgemf (x12; 0.6923, 0.7780, 2.8891)

in13mf1 = ridgemf (x13; 0.7204, 0.5195, 2.7763) in13mf2 = ridgemf (x13; 0.7357, 0.8572, 2.7183) in13mf3 = ridgemf (x13; 0.8297, 0.8097, 2.4106)

and
out1mf1 = 2.0056    out2mf1 = -1.0056
out1mf2 = -0.6194   out2mf2 = 1.6194
out1mf3 = -0.1228   out2mf3 = 1.1228

Figure 15: Textual Representation of the FRB obtained after Simplification phase for the Cleveland heart disease dataset

To evaluate the effectiveness of the ten-fold cross-validation results (see Figure 16 and Table 11), they were compared with those obtained by other neural and traditional classifiers developed for the same dataset (see Table 12). It can be seen that the classification accuracy of our approach is comparable with most of the considered models.

Figure 16: Summary of classification results of FRULEX for the Cleveland heart disease dataset (test accuracy per run for the initialization, optimization, and simplification phases)


Table 11: Summary of classification results of FRULEX for the Cleveland heart disease dataset

Heart Disease                Phase 1    Phase 2    Phase 3
Training   Misclassified     33.4       25.3       29.4
           Accuracy          78.03 %    83.35 %    80.66 %
Test       Misclassified     37.8       28.8       26.2
           Accuracy          74.97 %    80.93 %    82.65 %
Average    Misclassified     35.6       27.05      27.8
           Accuracy          76.5 %     82.14 %    81.65 %

Table 12: Comparing FRULEX to some other approaches for the Cleveland heart disease dataset

Method     Accuracy    Rules             Antecedents per rule    Reference
LOONN      76.2%       N/A               N/A                     [21]
XVNN       76.2%       N/A               N/A                     [21]
C4.5       77%         N/A               N/A                     [21]
RBF        81.3%       N/A               N/A                     [18]
FSM        82.0%       27 fuzzy rules    13                      [9]
RULEX      80.2%       3 crisp rules     5                       [12]
FRULEX     81.65%      4 fuzzy rules     4.9                     Our result

C. Pima Indians Diabetes Dataset

The “Pima Indians diabetes” dataset is stored in the UCI repository [17] and is frequently used as benchmark data. All patients were females at least 21 years old, of Pima Indian heritage. The data contains two classes, eight attributes, and 768 instances: 500 (65.1%) healthy and 268 (34.9%) diabetes cases (see Table 13).

Table 13: Features and feature values for the diabetes dataset

ID   Feature                                            Feature values
F1   Number of times pregnant                           Discrete
F2   Plasma glucose concentration                       Continuous
F3   Diastolic blood pressure (mm Hg)                   Continuous
F4   Triceps skin fold thickness (mm)                   Continuous
F5   2-hour serum insulin (mu U/ml)                     Continuous
F6   Body mass index (weight in kg / (height in m)^2)   Continuous
F7   Diabetes pedigree function                         Continuous
F8   Age                                                Discrete

To estimate the performance of the FKB extracted by the proposed approach, we carried out a 10-fold cross-validation. The whole dataset was divided into 10 equally sized groups (a group consists of 76 samples randomly drawn from the two classes). Each group was used as a test set for the network trained with the remaining 692 data points.


C.1. For Initialization

The SCRG method is used to determine the initial centers and widths of the membership functions of the input features. Table 14 summarizes the results after applying the SCRG phase for the ten trials. (We have σo = 0.3, ρ = 0.01, and τ = 0.195.)

Table 14: Results of the 10-fold cross validation after SCRG for the Pima diabetes dataset

After SCRG (Diabetes)
              Training Set         Test Set           Whole Set
Run   Rules   Acc. [%]   Mis.     Acc. [%]   Mis.    Acc. [%]   Mis.
1     2       72.51      188      68.42      24      70.47      106
2     2       71.35      196      71.05      22      71.20      109
3     2       72.08      191      65.79      26      68.94      108.5
4     2       72.22      190      63.16      28      67.69      109
5     2       71.05      198      76.32      18      73.69      108
6     2       71.78      193      67.11      25      69.45      109
7     2       71.20      197      77.63      17      74.42      107
8     2       70.47      202      78.95      16      74.71      109
9     2       71.93      192      69.74      23      70.84      107.5
10    2       70.76      200      75.00      19      72.88      109.5
avg.  2.00    71.54      194.70   71.32      21.80   71.43      108.3

C.2. For Optimization

The backpropagation gradient descent learning method is used to optimize the fuzzy rule base extracted in phase one. A network with 8 inputs and 2 outputs, corresponding to the two classes, was constructed. Table 15 summarizes the results obtained after 100 epochs for the ten trials. (We have ε = 0.01 and η = 1.0.)

Table 15: Results of the 10-fold cross validation after optimization for the Pima diabetes dataset

After Optimization (Diabetes)
              Training Set         Test Set           Whole Set
Run   Rules   Acc. [%]   Mis.     Acc. [%]   Mis.    Acc. [%]   Mis.
1     2       73.98      178      73.68      20      73.83      99
2     2       73.39      182      71.05      22      72.22      102
3     2       74.12      177      67.11      25      70.62      101
4     2       73.10      184      71.05      22      72.08      103
5     2       73.10      184      78.95      16      76.03      100
6     2       73.68      180      69.74      23      71.71      101.5
7     2       73.68      180      72.37      21      73.03      100.5
8     2       73.39      182      73.68      20      73.54      101
9     2       73.25      183      75.00      19      74.13      101
10    2       72.66      187      77.63      17      75.15      102
avg.  2.00    73.44      181.70   73.03      20.50   73.23      101.10


As an illustrative example, Figure 17 shows the graphical representation of the FKB obtained after the optimization phase in the first run of the 10 trials. (Using the MATLAB Fuzzy Toolbox.)

Figure 17: Graphical Representation of the FRB obtained after the Optimization phase for the Pima diabetes dataset

C.3. For Simplification

The Feature Subset Selection by Relevance method is used to simplify the fuzzy classifier extracted in phase one. Table 16 summarizes the results obtained after this phase for the ten trials. (We have β = 1.)

Table 16: Results of the 10-fold cross validation after Simplification for the Pima diabetes dataset

After Simplification (Diabetes)
              Training Set        Test Set          Whole Set
Run   Rules   Acc. [%]  Mis.     Acc. [%]  Mis.    Acc. [%]  Mis.     Antec.  Best Feature Set
1     2       72.66     187      75.00     19      73.83     103      2       F2, F8
2     2       73.68     180      75.00     19      74.34     99       2       F2, F6
3     2       66.23     231      65.79     26      66.01     128      3       F3, F6, F7
4     2       72.95     185      71.05     22      72.00     103      1       F2
5     2       72.08     191      72.37     21      72.23     106      1       F2
6     2       72.08     191      73.68     20      72.88     105      1       F2
7     2       72.08     191      72.37     21      72.23     106      1       F2
8     2       71.64     194      77.63     17      74.64     105      1       F2
9     2       72.66     187      73.68     20      73.17     103      3       F2, F7, F8
10    2       71.93     192      71.05     22      71.49     107      1       F2
avg.  2.00    71.80     192.9    72.76     20.7    72.28     106.8    1.6     F2

For the first run of the ten trials, Figure 18 shows the performance of the networks constructed by the successive removal of input features, and Figure 19 shows the performance of the networks constructed by the successive addition of the relevant features.


Figure 18: Performance of the RBPN during removal of input features for the Pima diabetes dataset (test classification accuracy vs. removed feature, F1-F8)

Figure 19: Performance of the RBPN with different features for the Pima diabetes dataset (test classification accuracy as the features F2, F8 are added)

As an illustrative example, Figure 20 and Figure 21 show the textual and graphical representations, respectively, of the FKB obtained after the simplification phase in the first run of the 10 trials. (Using the MATLAB Fuzzy Toolbox.)

Rule 1: IF (‘Plasma Glucose Conc’ IS in2mf1) AND (‘Age’ IS in8mf1),
THEN (‘negative’ IS out1mf1) AND (‘positive’ IS out2mf1)

Rule 2: IF (‘Plasma Glucose Conc’ IS in2mf2) AND (‘Age’ IS in8mf2),
THEN (‘negative’ IS out1mf2) AND (‘positive’ IS out2mf2)

Where:
in2mf1 = ridgemf (x2; 85.5060, 110.1578, 0.0234)
in2mf2 = ridgemf (x2; 91.8180, 141.1538, 0.0218)
in8mf1 = ridgemf (x8; 29.8390, 31.2511, 0.0670)
in8mf2 = ridgemf (x8; 28.9345, 36.5214, 0.0691)

and
out1mf1 = 2.3917    out2mf1 = -1.3917
out1mf2 = -1.1383   out2mf2 = 2.1383

Figure 20: Textual Representation of the FRB obtained after Simplification for the Pima diabetes dataset


Figure 21: Graphical Representation of the FRB obtained after Simplification for the Pima diabetes dataset

To evaluate the effectiveness of the ten-fold cross-validation results (see Table 17 and Figure 22), they were compared with those obtained by other neural and traditional classifiers developed for the same dataset (see Table 18). It can be seen that the classification accuracy of our approach is comparable with most of the considered models.

Table 17: Summary of classification results of FRULEX for the Pima diabetes dataset

Diabetes                     Phase 1    Phase 2    Phase 3
Training   Misclassified     194.7      181.7      192.9
           Accuracy          71.54 %    73.44 %    71.8 %
Test       Misclassified     21.8       20.5       20.7
           Accuracy          71.32 %    73.03 %    72.76 %
Average    Misclassified     108.3      101.1      106.8
           Accuracy          71.43 %    73.23 %    72.28 %

Figure 22: Summary of classification results of FRULEX for the Pima diabetes dataset (test accuracy per run for the initialization, optimization, and simplification phases)

Table 18: Comparing FRULEX with some other approaches for the Pima diabetes dataset

Method     Accuracy    Rules             Antecedents per rule    Reference
LOONN      70.4%       N/A               N/A                     [21]
XVNN       70.7%       N/A               N/A                     [21]
C4.5       73.0%       N/A               N/A                     [18]
RBF        75.7%       N/A               N/A                     [18]
FSM        73.8%       50 fuzzy rules    8                       [9]
RULEX      72.6%       5 crisp rules     5                       [12]
FRULEX     72.28%      2 fuzzy rules     1.6                     Our result

7. EVALUATION OF OUR APPROACH

The initialization module is linear in the number of fuzzy clusters (or fuzzy rules) and the number of training patterns, O(J·P). The optimization module is linear in the number of iterations, the number of training patterns, and the number of hidden nodes, O(I·P·J). The module associated with feature selection is linear in the number of features, the number of iterations, the number of training patterns, and the number of hidden nodes, O(N·I·P·J). Therefore, FRULEX is computationally efficient. Rule quality is assessed according to the accuracy, fidelity, and comprehensibility of the extracted rules. FRULEX is a consistent algorithm: although different training runs may generate different fuzzy models, the generated models exhibit the same accuracy. FRULEX is non-portable, having been specifically designed to work with a special type of network, namely the local function network. This means that it cannot be used as a general-purpose device for providing an explanation component for existing, trained neural networks. However, the BPNN is applicable to a broad range of problem domains (including continuous-valued and discrete-valued domains, as well as domains that include missing values).

8. CONCLUSIONS

Soft computing methodologies, involving fuzzy sets, neural networks, and their hybridization, have recently been used to solve data mining problems. We developed a new approach to build data mining models by fuzzy rule induction. It extracts a fuzzy model from a trained network and simplifies the fuzzy model in a way that maximizes the fidelity between the knowledge base and the neural network. Medical diagnosis of diseases is an important and difficult task in medicine; detecting a disease from several factors or symptoms is a many-layered problem that may also lead to false assumptions with unpredictable effects. Hence, the effectiveness of the new approach was verified by its application to three well-known benchmark medical databases. Overall, the reported results indicate that our approach is a valid tool for automatically inducing fuzzy rules from data, providing a good balance between accuracy and readability: the number of fuzzy rules is determined automatically, the membership functions match closely the real distribution of the training data points, and the selection of relevant features is automatic. Further extensions of the proposed approach may concern the use of genetic algorithms (GAs) instead of error backpropagation in training the neural network; the chromosome in this case would contain the weights. Alternatively, genetic algorithms might be used to find the best architecture, with chromosomes containing the number of hidden layers and the number of nodes in each layer. Also, future work aims to extend the proposed approach to deal with other types of fuzzy models, such as Mamdani-type fuzzy models. Finally, an Information Extraction (IE) technique could be integrated with our approach to extract, from semi-structured and free text, structured information that can then be used by our approach.


REFERENCES

[1] L. A. Zadeh, “Fuzzy logic, neural networks, and soft computing,” Commun. ACM, vol. 37, pp. 77-84, 1994.

[2] S. K. Pal and S. Mitra, Neuro-Fuzzy Pattern Recognition: Methods in Soft Computing. New York: Wiley, 1999.

[3] A. B. Tickle, R. Andrews, M. Golea, and J. Diederich, “The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks,” IEEE Trans. Neural Networks, vol. 9, pp. 1057-1068, 1998.

[4] R. Andrews, A. B. Tickle, and J. Diederich, “A survey and critique of techniques for extracting rules from trained artificial neural networks,” Knowledge-Based Systems, vol. 8, pp. 378-389, 1995.

[5] S. Mitra and Y. Hayashi, “Neuro-fuzzy rule generation: Survey in soft computing framework,” IEEE Trans. Neural Networks, vol. 11, no. 3, May 2000.

[6] V. Tresp, J. Hollatz, and S. Ahmed, “Network structuring and training using rule-based knowledge,” Advances in Neural Information Processing Systems (NIPS*6), pp. 871-878, 1993.

[7] M. Berthold and K. Huber, “Building precise classifiers with automatic rule extraction,” in Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, vol. 3, pp. 1263-1268, 1995.

[8] S. Abe and M. S. Lan, “A method for fuzzy rules extraction directly from numerical data and its application to pattern classification,” IEEE Trans. on Fuzzy Systems, vol. 3, no. 1, pp. 18-28, 1995.

[9] W. Duch, R. Adamczak, N. Jankowski, and A. Naud, “Feature Space Mapping: A neurofuzzy network for system identification,” in Proceedings of Engineering Applications of Neural Networks, Helsinki, pp. 221-224, 1995.

[10] A. Lapedes and R. Faber, “How neural networks work,” Neural Information Processing Systems, D. Z. Anderson (Ed.), American Institute of Physics, New York, pp. 442-456, 1987.

[11] S. Geva and J. Sitte, “Constrained gradient descent,” in Proceedings of the 5th Australian Conference on Neural Computing, Brisbane, Australia, 1994.

[12] R. Andrews and S. Geva, “RULEX & CEBP networks as the basis for a rule refinement system,” in Hybrid Problems, Hybrid Solutions, J. Hallam (Ed.), IOS Press, pp. 1-12, 1995.

[13] R. Andrews and S. Geva, “On the effects of initializing a neural network with prior knowledge,” in Proc. of the International Conference on Neural Information Processing, Perth, Western Australia, pp. 251-256, 1999.

[14] S. J. Lee and C. S. Ouyang, “A neuro-fuzzy system modeling with self-constructing rule generation and hybrid SVD-based learning,” IEEE Trans. on Fuzzy Systems, vol. 11, pp. 341-353, June 2003.

[15] Y. Lin, G. A. Cunningham III, and S. V. Coggeshall, “Using fuzzy partitions to create fuzzy systems from input-output data and set the initial weights in a fuzzy neural network,” IEEE Trans. Fuzzy Syst., vol. 5, pp. 614-621, August 1997.

[16] W. A. Farag, V. H. Quintana, and G. Lambert-Torres, “A genetic-based neuro-fuzzy approach for modeling and control of dynamical systems,” IEEE Trans. Neural Networks, vol. 9, pp. 756-767, Oct. 1998.

[17] P. M. Murphy and D. W. Aha, “UCI repository of machine learning databases,” machine-readable data repository, Department of Information and Computer Science, University of California, Irvine, CA, 1992. [Online]. Available: http://www.ics.uci.edu/pub/machine-learning-data-bases

[18] B. Ster and A. Dobnikar, “Neural networks in medical diagnosis: Comparison with other methods,” in Proc. of Int. Conf. EANN'96, A. Bulsari (Ed.), pp. 427-430, 1996.

[19] D. Nauck, U. Nauck, and R. Kruse, “Generating classification rules with the neuro-fuzzy system NEFCLASS,” in Proceedings of the Biennial Conference of the North American Fuzzy Information Processing Society (NAFIPS'96), Berkeley, CA, 1996.

[20] I. Taha and J. Ghosh, “Three techniques for extracting rules from feedforward networks,” in Intelligent Engineering Systems Through Artificial Neural Networks, vol. 6, pp. 23-28, 1996.

[21] R. Andrews and S. Geva, “Extracting rules from a constrained error backpropagation network,” in Proceedings of the 5th Australian Conference on Neural Networks, Brisbane, pp. 9-12.