Applied Soft Computing 11 (2011) 2279–2285
doi:10.1016/j.asoc.2010.08.008

An efficient classifier design integrating rough set and set oriented database operations

Asit Kumar Das, Jaya Sil
Bengal Engineering and Science University, Computer Science and Technology, Shibpur, Howrah, West Bengal 711-103, India
E-mail address: [email protected] (A. Kumar Das)

Article history: Received 21 April 2009; Received in revised form 23 June 2010; Accepted 1 August 2010; Available online 10 August 2010

Keywords: Data mining; Classification; Rough set theory; Decision tree; Graph theory

Abstract

Feature subset selection and dimensionality reduction of data are fundamental and among the most explored areas of research in the machine learning and data mining domains. Rough set theory (RST) constitutes a sound basis for data mining and can be used at different phases of the knowledge discovery process. In this paper, by integrating the concepts of RST and relational algebra operations, a new attribute reduction algorithm is presented to select the minimum sets of attributes, called reducts, required for classification of data. Firstly, the conditional attributes are partitioned into different groups according to their score, calculated using the projection (π) and division (÷) operations of relational algebra. The groups are sorted in ascending order of score, and the first group, which contains maximum information, is used first for generating the reducts. The non-reduct attributes are then combined with the elements of the next group and the modified group is considered for computing the reducts. The process continues until all groups are exhausted and thus a final set of reducts is obtained. Then, applying a decision tree algorithm on each reduct, decision rule sets are generated, which are later pruned by removing the extraneous components. Finally, by involving the concepts of probability theory and graph theory, a minimum number of rules is obtained and used for building an efficient classifier.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Association rule based pattern mining is an important knowledge discovery technique that has been addressed by many researchers [1–3] over quite a long period of time. However, management of this knowledge is challenging because rules have to be pruned and grouped efficiently to build the classifiers.

Rough set theory (RST) provides a sound basis for knowledge discovery, especially when dealing with incomplete and/or inconsistent data. In recent years, rough set methodology [4–6,30] has witnessed great success in data reduction, attribute selection, decision rule generation and pattern extraction in different phases of the knowledge discovery process. A large number of researchers from the mathematics, artificial intelligence and engineering communities are involved in theoretical and application studies, focused mainly on dimensionality reduction [7,8] and classification of data [9–11]. Dimensionality reduction, performed by determining a minimum set of attributes called a reduct, is an important aspect of classification, where the reduced attribute set has the same classification power as the entire set of attributes of an information system. There is usually more than one reduct, and generation of reducts for real-world data sets is an NP-hard [12] problem.
Exhaustive search for finding reducts is infeasible and, therefore, heuristic methods are applied, based on distinct measures of the significance of attributes [13–16]. Also, it is not very clear which subset of the reducts should be selected for classification. An extensive review of rough set methods in feature selection was presented in [17].

In reality, there are multiple reducts in a given information system used for building classifiers, and among the pool of reducts the best performer is chosen as the final solution to the problem. But this is not always true; according to Occam’s razor and the minimal description length principle [18–20], the minimal reduct is preferred. However, the authors of [21] found that the minimal reduct is good for ideal situations where a given data set fully represents a domain of interest, but for real-life situations and limited-size data sets, reducts other than the minimal ones might be better for prediction. Selecting a reduct with good performance is time expensive, as there might be many reducts of a given data set. Therefore, obtaining the best performing classifier is not practical; rather, an ensemble of different classifiers may lead to better classification accuracy. However, combining [22–25] a large number of classifiers increases the complexity of the system, so there must be a trade-off between these two approaches. Many methods for constructing ensembles of classifiers have been developed; some are general and some are specific to particular algorithms. For example, Bauer and Kohavi [26] use 25 classifiers; Freund and Schapire [27] use 100 classifiers, while this is extended up to 1000 in [28].



In order to overcome these limitations while constructing the classifiers, a novel technique has been proposed here for reduct formation by integrating rough set theory and set oriented database operations. As a next step, a set of classification rules is generated from each reduct by applying a decision tree algorithm [29]. Finally, using the concepts of probability theory and graph theory, a minimum number of rules is obtained to build an efficient classifier.

Rough set theory, as a powerful knowledge-mining tool, has been widely applied to acquire knowledge in different domains. However, its application is restricted to discrete domains only, whereas real-world classification tasks involve continuous features. Different discretization methods [36,38] are available and, in this paper, the ChiMerge [38] discretization algorithm has been applied to convert continuous attributes into discretized values.

In the proposed method, the relational algebra operations projection (π) and division (÷) are applied to find the indispensable conditional attributes required for categorizing the decision attributes of a given data set. Firstly, for each conditional attribute, a non-negative score is computed which is inversely proportional to its information gain. The attributes are partitioned into different groups according to their score and sorted in ascending order. The groups are then reformed as the power set of the respective attribute sets, whose elements could be members of a reduct. Since the first group contains maximum information, its members are selected for determining the reducts, which are stored (if any) in a set RED. The non-reduct elements of the group are combined with the elements of the next group, and the modified group is considered for determining the reducts. The process is repeated for all groups and finally RED contains an approximate set of reducts. From each possible reduct, decision rules are generated using a decision tree algorithm; these are later pruned by removing the extraneous components. A set of classifiers is then built using the decision rules, where each classifier is treated as a decision tree in the decision forest consisting of all classifiers. The paper proposes an algorithm to determine the central decision tree(s) using graph theory based probability factors. The rules generated from the central tree(s) are considered as core rules. The distance based probability of each decision tree with respect to the central tree is measured to choose a threshold based on which an important subset of rules is selected from the decision trees. The final classifier built using the selected rules is, therefore, accurate and complete and at the same time retains important information. In the proposed method, the final classifier is built by combining a minimal number of rules of each classifier, unlike the existing methods where ensembling of classifiers is required.

The paper is organized as follows: the data reduction and reduct formation method are described in Section 2. Section 3 presents the techniques for generating and minimizing the number of rules of each classifier for building the final classifier. Section 4 illustrates application of the proposed approach and the results are analyzed by comparing with other existing methods. Finally, conclusions are summarized in Section 5.

2. Dimensionality reduction

In many applications, it is often difficult to know exactly which features are relevant for a particular task. The irrelevant attributes are redundant and should be removed before classification to improve accuracy and reduce the complexity of the system. Data reduction plays a vital role in the knowledge discovery process to extract patterns hidden in the data. The aim of dimensionality reduction is to find a minimum set of relevant attributes that preserves all the essential information of the system sufficient for decision-making. This minimum set of attributes, called a reduct, is used instead of the entire attribute set for classification of data. A novel concept has been proposed in the paper that reduces the dimensionality of a given data set by computing an approximate set of reducts, used for building accurate classifiers.

2.1. Partitioning of attributes

Let the decision system be DS = (U, A, C, D), where U is the universe of discourse and A is the set of all attributes such that A = C ∪ D, with C the set of conditional attributes and D the decision attribute(s). Assume that P and Q (Q ⊂ P) are the sets of attributes of relations R1 and R2, respectively. In relational algebra, division (÷) is a binary operation applied on two relations R1(P) and R2(Q), producing another relation R3(P−Q). R3 contains the set of all tuples t such that, for every tuple t2 of R2, there exists a tuple t1 of R1 satisfying the following conditions.

• t[P−Q] = t1[P−Q]
• t1[Q] = t2[Q]

This division (÷) operation is used to compute the score of each conditional attribute as described in Eq. (1), where S(Ci) is the score function and Card stands for the cardinality operation, indicating the number of tuples in the output relation R3. The division result contains those values of Ci that occur together with every decision value; a small result therefore means that most objects with a given value of Ci can uniquely take their decisions. Hence, the attribute with minimum score is of maximum importance, and a lower score implies a higher possibility of becoming a member of a reduct.

S(Ci) = Card(π_{Ci∪D}(DS) ÷ π_D(DS))    (1)
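For illustration, a minimal Python sketch of Eq. (1) is given below; it assumes the decision system is held as a list of records (dicts), and the helper names project() and score() are ours, not part of the original formulation.

```python
def project(rows, attrs):
    """Projection pi_attrs(DS): the set of distinct tuples over attrs."""
    return {tuple(r[a] for a in attrs) for r in rows}

def score(rows, ci, d):
    """S(Ci) = Card( pi_{Ci u D}(DS) / pi_D(DS) ), using relational division."""
    r1 = project(rows, [ci, d])          # relation over (Ci, D)
    r2 = project(rows, [d])              # relation over (D,)
    # Division keeps each Ci-value that appears together with *every* D-value.
    return sum(1 for v in {t[0] for t in r1}
               if all((v, dec[0]) in r1 for dec in r2))
```

A lower count therefore means that fewer values of Ci are ambiguous across the decision classes, which matches the interpretation given above.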

To compute the reducts, the attributes are partitioned into g groups according to their scores and sorted in ascending order, forming groups G1, G2, ..., Gg, where the power set of the attribute set of each group is considered. Therefore, group G1 contains the most important attributes, including the core. The algorithm "Partition into Groups", presented below, partitions the conditional attributes into groups according to their score.

Algorithm: Partition into Groups
Begin

    d ← π_D(DS)
    for i = 1 to n do
        compute the score S(Ci) for Ci ∈ C, using (1)
    Arrange Ci, ∀ i = 1..n, in ascending order on the basis of their score
    Rename the attributes as C = {Cα1, Cα2, ..., Cαn}
    /* Partition the attributes of C into g groups */
    g = 1, j = 1
    for i = 1 to n−1 do
    {
        Gg[j++] = Cαi
        if (score(Cαi) ≠ score(Cα(i+1)))
        {
            Num[g] = j − 1    /* ‘Num’ array contains the number of elements in each group */
            g = g + 1
            j = 1
        }
    }
    Gg[j] = Cαn
    Num[g] = j
    /* Compute all combinations of elements of each group */
    for i = 1 to g do
    {
        Gi ← P(Gi)    /* P denotes the power set */
        Num[i] = 2^|Num[i]| − 1
    }
End.
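The grouping step can be pictured with the short sketch below, which takes precomputed scores (a dict mapping each attribute to S(Ci), e.g. obtained with the score() sketch above); partition_into_groups() and power_set() are illustrative names only.

```python
from itertools import combinations

def power_set(attrs):
    """All non-empty subsets of a group: the candidate reduct members."""
    return [set(c) for r in range(1, len(attrs) + 1)
            for c in combinations(attrs, r)]

def partition_into_groups(scores):
    """scores: {attribute: S(Ci)}. Returns G1..Gg, each as a list of subsets."""
    ordered = sorted(scores, key=scores.get)
    groups, current = [], [ordered[0]]
    for a in ordered[1:]:
        if scores[a] == scores[current[-1]]:
            current.append(a)          # same score: same group
        else:
            groups.append(current)     # score changed: close the group
            current = [a]
    groups.append(current)
    return [power_set(g) for g in groups]
```

For the scores of the worked example in Section 4, partition_into_groups({'e': 1, 'd': 2, 'f': 2, 'r': 3}) yields [[{e}], [{d}, {f}, {d, f}], [{r}]], i.e. the groups G1, G2 and G3 quoted there.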

2.2. Reduct formation

A reduct is a minimum set of attributes that preserves the partitioning of the universe of discourse U; it is denoted by R, where R ⊆ C. In other words, a reduct has the same classification power as


the entire set of conditional attributes C, as described by Constraint (1).

Constraint 1: Card(π_{R∪D}(DS)) = Card(π_R(DS))

Each element of group G1 that satisfies Constraint (1) is removed from G1 and inserted into the set of reducts, say RED. To obtain other reducts, the Cartesian product (×) operation is applied iteratively between Gi−1 and Gi such that Gi = Gi × Gi−1, which modifies group Gi by accommodating the new elements, where i = 2, 3, ..., g. For generation of the final set of reducts, the above procedure is repeated for all groups, and finally an approximate set of reducts is obtained in RED.
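Constraint (1) translates directly into a cardinality comparison; a minimal sketch (again over a list-of-dicts decision system, with satisfies_constraint1 as our own name) is:

```python
def satisfies_constraint1(rows, candidate, d):
    """True when Card(pi_{R u D}(DS)) == Card(pi_R(DS)) for R = candidate."""
    r = sorted(candidate)
    proj = lambda attrs: {tuple(row[a] for a in attrs) for row in rows}
    return len(proj(r + [d])) == len(proj(r))
```

Equality means that every combination of values of R occurs with exactly one decision value, i.e. R classifies the data as well as the full conditional attribute set.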

The algorithm “Multiple Reduct Generation”, presented below, computes an approximate set of reducts of the decision system. It is an iterative process that generates reducts in each iteration by invoking the sub-algorithm RED_GEN().

Algorithm: Multiple Reduct Generation
Begin
    no_of_reduct = 0
    /* Generate reducts, if any, from group G1 and store them in array RED */
    no_of_reduct = RED_GEN(G1, RED, &Num[1], no_of_reduct)
    /* Generate reducts from the other groups */
    for i = 1 to g−1 do
    {
        Gi+1 ← Gi × Gi+1
        Num[i+1] = Num[i] * Num[i+1]
        no_of_reduct = RED_GEN(Gi+1, RED, &Num[i+1], no_of_reduct)
    }
End.

RED_GEN(GRP, RED, no_in_GRP, no_of_reduct)
Begin
    /* Sub-algorithm RED_GEN() selects reducts from GRP and stores them in RED.
       The items of group Gi are copied into GRP before each call, so the
       sub-algorithm is called g times. */
    for i = 1 to no_in_GRP do
    {
        Let GRP[i] = {Cα1, Cα2, ..., Cαr}
        super = false
        for j = 1 to no_of_reduct do
        {
            if (GRP[i] ⊇ RED[j])
            {
                super = true
                break
            }
        }
        if (!super)
        {
            if (GRP[i] satisfies Constraint (1)) then
                RED[++no_of_reduct] ← GRP[i]
        }
        Let t = number of supersets of GRP[i] in GRP
        Delete GRP[i] and its supersets from GRP
        no_in_GRP = no_in_GRP − t
        i = i − 1
    }
    return(no_of_reduct)
End.
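The overall search can be sketched as follows, assuming groups produced by partition_into_groups() and the satisfies_constraint1() check above; find_reducts is our own name, and the sketch follows one straightforward reading of the two algorithms rather than reproducing them line by line.

```python
def find_reducts(rows, groups, d):
    reducts, carry = [], []
    for group in groups:
        # Gi+1 <- Gi x Gi+1: cross the non-reduct leftovers with the next group
        candidates = ([a | b for a in carry for b in group] if carry
                      else [set(g) for g in group])
        carry = []
        for cand in candidates:
            if any(red <= cand for red in reducts):   # superset of a known reduct
                continue
            if satisfies_constraint1(rows, cand, d):
                reducts.append(cand)
            else:
                carry.append(cand)
    return reducts
```

On the decision system of Table 1 (Section 4) this loop returns the reducts {e, d} and {e, f, r}, as reported there.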

3. Classifier construction

Generally, most decision systems have multiple reducts; therefore, multiple sets of classification rules are derived by applying the decision tree algorithm [29] on each element of RED. Each rule of a rule set in a classification model is merely a combination of conditional attribute values that together imply a class label. In order to obtain unique rule sets, extraneous components of the rules are removed by pruning the original rule sets. In the next step, the number of rules is minimized using a probability and graph theory based approach for building the final classifier.

Fig. 1. Representing parent–child relationship.

3.1. Rule set generation

The reducts and decision attributes of a decision system (DS) are considered for generating the decision rules using the decision tree method. The attribute having the lowest score contains maximum information and is therefore considered as the root of the decision tree. Taking the root as the parent node (vi), directed edge(s) are drawn from it representing the value(s) of the parent node attribute. Let node vi represent an attribute with p distinct values (say, a1, a2, ..., ap−1, ap); then p directed edges are drawn from vi. For each value of vi, a subsystem is obtained by removing the attribute mapped at vi. If all tuples (objects) of the subsystem belong to a single class (decision attribute value), then the child node of vi becomes a leaf node representing that class; otherwise, the attribute with the lowest score in the subsystem is considered as the child node (vj). Thus, a directed edge is obtained from vi to vj in the decision tree. In this way, for a parent node vi, p child nodes are obtained, which may be leaf (denoted by a rectangle) or intermediate (denoted by an ellipse) nodes, as shown in Fig. 1. The process is executed recursively until all child nodes emerge as leaf nodes. If multiple attributes are of equal score, any one of them is selected arbitrarily. Assume the reducts and their score values are stored in RED and SCORE, respectively. The formation of the decision tree using the "Classification Rules(DS)" algorithm is described below.

Algorithm: Classification Rules(DS)
Input: DS = {U, RED, D}

    /* RED is the set of conditional attributes */
Output: Decision tree providing classification rules
Begin
    j = 0
    PARENT = lowest-score attribute in DS
    Compute RPARENT = RED − PARENT
    For each attribute value ai (i = 1..p) of PARENT do
    {
        Create a new decision subsystem DSi = {U, RPARENT, D}
        If (all tuples of DSi contain the same D-value)
        {
            CHILD = Classj
            j = j + 1
            Draw an edge with label ai from PARENT to CHILD
            Label the CHILD node as a leaf node
        }
        Else
        {
            CHILD = lowest-score attribute in DSi
            Draw an edge with label ai from PARENT to CHILD
            Classification Rules(DSi)
        }
    }
End.
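A compact recursive sketch of this procedure is given below; it reuses score() from the Eq. (1) sketch and returns rules as (conditions, decision) pairs, where conditions is a list of (attribute, value) tests. The names are ours and consistent data is assumed.

```python
def generate_rules(rows, attrs, d, conditions=()):
    """Split on the lowest-score attribute of the current subsystem; emit a rule
    once every remaining row carries a single decision (consistent data assumed)."""
    decisions = {r[d] for r in rows}
    if len(decisions) == 1 or not attrs:
        return [(list(conditions), decisions.pop())]
    parent = min(attrs, key=lambda a: score(rows, a, d))   # root of the (sub)tree
    rules = []
    for value in {r[parent] for r in rows}:                # one branch per value
        subset = [r for r in rows if r[parent] == value]
        rest = [a for a in attrs if a != parent]
        rules += generate_rules(subset, rest, d, conditions + ((parent, value),))
    return rules
```

Applied to the reduct {Experience, Diploma} of Table 1, this yields rules equivalent to those listed in Table 2.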

3.2. Pruning classification rules

From each reduct, a rule set R is formed using the "Classification Rules(DS)" algorithm. Each rule r ∈ R is denoted by an implication r: {(Cα1 = vi1) ∧ (Cα2 = vi2) ∧ ... ∧ (Cαk = vik) → (D = di)}, where the left hand side and the right hand side of r are denoted by con(r) and dec(r), respectively. Here, con(r) is considered as a collection of rule-components {r1, r2, ..., rk} such that rj of r is of
the form (Cαj = vij). Before describing the optimum rule generation algorithm, the following propositions are introduced.

Extraneous component: A rule-component rj (rj ∈ con(r)) of r is extraneous in R if S logically implies R, where S = (R − {r}) ∪ {(con(r) − rj) → (D = di)}. In other words, if R partitions the training set into K classes {PR1, PR2, ..., PRK}, then S will also partition the training set into K classes {PS1, PS2, ..., PSK} such that PRI = PSI, ∀ I = 1, 2, ..., K.

Canonical cover: The collection of distinct rules of R without any extraneous component is called the canonical cover Rc of R. Obviously, Rc logically implies R.

Logically implied rule: Let r and s (con(s) ⊆ con(r)) be two rules in R. Rule s logically implies r if both classify the training data set equally. Thus both rules r and s contribute equally to building the classifier using R.

The ultimate aim of the pruning technique is to produce an optimum number of rules by removing the maximum number of conditional attribute values without sacrificing the classification accuracy governed by the rules. Hence, for rule set R, the rule-components are checked and all extraneous components are removed from R. Therefore, only the unique rules with non-extraneous components remain in R, which form the canonical cover Rc of R. Thus, rule set R is pruned and a minimal set of rules Rc is obtained. This method is applied to all decision rule sets, generating a compact rule set. The algorithm "Pruned Rules(R, n)", presented below, prunes the rule set by removing the extraneous components from the rules.

Algorithm: Pruned Rules(R, n)
Input: Rule set R with n rules
Output: Canonical cover Rc of R, known as the pruned rule set
Begin
    Rc = ∅
    For all n rules r do
    {
        Let con(r) = {r1, r2, ..., rk}    /* r1, r2, ..., rk are rule components */
        For i = 1 to k do
        {
            /* check whether ri is extraneous or not */
            Let β = con(r) − ri
            If ({β → dec(r)} logically implies r)
            {
                ri is an extraneous component
                con(r) = con(r) − ri
            }
        }
        Rc = Rc ∪ {r}
    }
End.
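One conservative reading of the extraneous-component test drops a condition only when the weakened rule still covers exactly the same training rows, which is sufficient for S to classify the training set exactly as R does. A sketch under that assumption, with rules in the (conditions, decision) shape used above:

```python
def covered(rows, conditions):
    """Indices of the training rows matched by the condition part of a rule."""
    return {i for i, r in enumerate(rows)
            if all(r[a] == v for a, v in conditions)}

def prune_rule(rows, rule):
    conditions, decision = list(rule[0]), rule[1]
    for cond in list(conditions):
        weakened = [c for c in conditions if c != cond]
        if covered(rows, weakened) == covered(rows, conditions):
            conditions = weakened             # cond is extraneous: drop it
    return (conditions, decision)

def canonical_cover(rows, rules):
    """Prune every rule and keep only the distinct results (Rc)."""
    seen, cover = set(), []
    for rule in rules:
        conds, dec = prune_rule(rows, rule)
        key = (frozenset(conds), dec)
        if key not in seen:
            seen.add(key)
            cover.append((conds, dec))
    return cover
```

Under this reading, the (French = Yes) components of r3 and r4 in Table 3 are dropped and nothing else changes, which is exactly the canonical cover shown in Table 4.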

3.3. Combination of rules

The proposed method uses the minimal cover rule sets and combines them to build a final classifier. As the rules are pruned, the decision trees associated with the rule sets are modified. Let the modified decision trees T1, T2, ..., Tn together form a decision forest (DF). Probability based graphical approaches are used to construct the final classifier, prior to which some definitions are presented.

efinition 1. The distance between two decision-trees Ti and Tjf a decision forest DF is defined as the number of edges present inither of the tree. This distance is denoted by d(Ti, Tj) as defined by2).

1

(Ti, Tj) = 2

N(Ti ⊕ Tj) (2)

here Ti⊕Tj, the ring sum of Ti and Tj be a subgraph containingdges of DF which are either in Ti or in Tj but not in both and NTi⊕Tj) denotes the number of edges in the subgraph.


Definition 2. For a decision tree Tc in a decision forest DF, let max_i{d(Tc, Ti)} be the maximal distance of Tc from the other n−1 trees of DF. Tc is called a central tree of DF if it satisfies (3).

max_i d(Tc, Ti) ≤ max_j d(T, Tj),  ∀ T ∈ DF    (3)

Definition 3. Let S be a set of objects among which c objects are correctly classified and w objects are either misclassified or not classified by the rule r (r ∈ R). The accuracy a of r with respect to S is computed by (4), where c + w = |S|.

a = (c / w) × 100    (4)

To build the classifier, the accuracy of the rules generated from each decision tree is calculated. Selection of a threshold is important, depending on which the significant rules are identified to build the final classifier. Based on the position of the decision trees in the DF, the corresponding rule sets are utilized to build the final classifier. The central tree, vide (3), becomes the core tree and its associated rule set represents the core rules. The core tree(s) situated at the center of the DF are most significant, and therefore the corresponding rule sets are included in the final classifier. The central tree of the DF is, in general, not unique; in that case, the combination of all central trees represents the core tree. A subset of the rules formed by the trees other than the central trees is included in building the final classifier based on probability theory. To include a subset of rules associated with Ti, the rule set is refined as follows:

(i) The rules that are already in the rule set of Tc are deleted from the rule set of Ti.

(ii) The rules that contradict the rules in the rule set of Tc are deleted from the rule set of Ti, giving importance to the rules of Tc.

The probability value pi is computed using (5) for each Ti; the higher the probability value, the closer the corresponding Ti is to the central tree Tc. This implies that more rules of Ti are to be considered for building the final classifier. Thus, the probability value pi plays an important role in determining the threshold based on which the rules of Ti are selected.

Assume that the rule set RCi = {riα1, riα2, ..., riαm} associated with Ti contains m rules, sorted in ascending order based on their accuracy (Eq. (4)). Since the probability of a tree indicates its importance, the first ni (< m) rules are selected using (6) and added to the final rule set.

pi = (1 / dci) / (Σ_{j=1}^{n} (1 / dcj)),  ∀ i = 1, 2, ..., n; i ≠ c    (5)

where dci denotes the distance d(Tc, Ti).

ni = ⌈pi × m⌉    (6)
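Definitions 1 and 2 and Eqs. (5)–(6) can be pictured with the following sketch, which assumes that each decision tree is represented simply by its set of (hashable) edges and that all trees are distinct; the sum in (5) is taken over j ≠ c to avoid the zero distance d(Tc, Tc). All function names are ours.

```python
from math import ceil

def distance(t_i, t_j):
    """Eq. (2): half the size of the ring sum (symmetric difference) of edge sets."""
    return len(t_i ^ t_j) / 2

def central_tree_index(forest):
    """Eq. (3): the tree whose maximal distance to all other trees is smallest."""
    return min(range(len(forest)),
               key=lambda i: max(distance(forest[i], t) for t in forest))

def rules_to_keep(forest, c, m):
    """Eqs. (5)-(6): m maps tree index -> size of its refined rule set;
    returns ni, the number of rules kept from each non-central tree Ti."""
    inv = {i: 1 / distance(forest[c], forest[i])
           for i in range(len(forest)) if i != c}
    total = sum(inv.values())
    return {i: ceil(inv[i] / total * m[i]) for i in inv}
```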

The algorithm "Combined Classifier(DF)", presented below, combines the rules for building the final classifier.


Table 1
A decision system.

     Diploma  Experience  French  References  Decision
x1   MBA      Medium      Yes     Excellent   Accept
x2   MBA      Low         Yes     Neutral     Reject
x3   MCE      Low         Yes     Good        Reject
x4   MSc      High        Yes     Neutral     Accept
x5   MSc      Medium      Yes     Neutral     Reject
x6   MSc      High        Yes     Excellent   Accept
x7   MBA      High        No      Good        Accept
x8   MCE      Low         No      Excellent   Reject


Fig. 3. Decision tree for reduct (e,f,r).

Table 2
Rule set for reduct (e, d).

r1: (Experience = Low) → (Decision = Reject)
r2: (Experience = High) → (Decision = Accept)
r3: (Experience = Medium) ∧ (Diploma = MBA) → (Decision = Accept)
r4: (Experience = Medium) ∧ (Diploma = MSc) → (Decision = Reject)

Table 3
Rule set for reduct (e, f, r).

r1: (Experience = Low) → (Decision = Reject)
r2: (Experience = High) → (Decision = Accept)
r3: (Experience = Medium) ∧ (French = Yes) ∧ (Reference = Excellent) → (Decision = Accept)
r4: (Experience = Medium) ∧ (French = Yes) ∧ (Reference = Neutral) → (Decision = Reject)

Table 4
Canonical cover rule set of Table 3.

r1: (Experience = Low) → (Decision = Reject)
r2: (Experience = High) → (Decision = Accept)
r3: (Experience = Medium) ∧ (Reference = Excellent) → (Decision = Accept)
r4: (Experience = Medium) ∧ (Reference = Neutral) → (Decision = Reject)

Algorithm: Combined Classifier(DF)
Begin
    For i = 1 to n do
        Compute Rulei = rules generated from Ti
    Determine the central tree Tc using (2) and (3)
    FinalRules = {Rulec}
    For i = 1 to n (i ≠ c) do
    {
        Compute pi using (5)
        Let Rulei = {ri1, ri2, ..., rim}
        /* refine Rulei with respect to the core rules, steps (i) and (ii) */
        Rulei = Rulei − {Rulei ∩ Rulec}
        Rulei = Rulei − {rij : rij contradicts a rule of Rulec}
        For j = 1 to m do
            Compute the accuracy aij of rij using (4)
        Arrange the elements of Rulei in increasing order of their accuracy aij
        Compute ni using (6)
        Final_Rulei = first ni rules of Rulei
        FinalRules = FinalRules ∪ Final_Rulei
    }
End.
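Putting the pieces together, a rough end-to-end sketch of this combination step is shown below. It reuses central_tree_index() and rules_to_keep() from the previous sketch, treats a "contradiction" as two rules with identical conditions but different decisions (our assumption), and keeps rules in the order prescribed by the algorithm.

```python
def rule_accuracy(rows, rule, d):
    """Eq. (4): c = rows the rule covers and labels correctly; w = |S| - c."""
    conds, dec = rule
    c = sum(1 for r in rows
            if all(r[a] == v for a, v in conds) and r[d] == dec)
    w = len(rows) - c
    return 100 * c / w if w else float("inf")

def combine_classifier(rows, forest, rule_sets, d):
    c = central_tree_index(forest)              # central tree of the forest
    final = list(rule_sets[c])                  # core rules of the central tree
    key = lambda rule: (frozenset(rule[0]), rule[1])
    core = {key(r) for r in final}
    refined = {}
    for i, rules in enumerate(rule_sets):
        if i == c:
            continue
        # steps (i)-(ii): drop rules already in the core set or contradicting it
        kept = [r for r in rules
                if key(r) not in core
                and not any(frozenset(s[0]) == frozenset(r[0]) and s[1] != r[1]
                            for s in final)]
        refined[i] = sorted(kept, key=lambda r: rule_accuracy(rows, r, d))
    ni = rules_to_keep(forest, c, {i: len(v) for i, v in refined.items()})
    for i, kept in refined.items():
        final += kept[:ni[i]]
    return final
```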

4. Results and discussions

We first consider a sample data set to illustrate the proposed method and then generalize the experimental results.

(a) Consider a sample data set of a decision system, as shown in Table 1. The last column represents the decision attribute while the remaining columns represent the conditional attributes of the decision system. For simplicity, the conditional attributes are renamed d, e, f and r for 'Diploma', 'Experience', 'French' and 'Reference', respectively. Using Eq. (1), the score of each attribute is obtained: score(d) = 2, score(e) = 1, score(f) = 2 and score(r) = 3. Using the "Partition into Groups" algorithm, three groups G1 = {e}, G2 = {d, f, {d,f}} and G3 = {r} are obtained. The "Multiple Reduct Generation" algorithm creates two reducts: RED = {{e,d}, {e,f,r}}. Thus two classifiers are obtained using the "Classification Rules(DS)" algorithm, and subsequently two decision trees are formed, shown in Figs. 2 and 3. Classification rules are generated from the decision trees and listed in Tables 2 and 3, respectively. In Table 3, the rule-component (French = Yes) of both r3 and r4 is an extraneous component and so removed from the rules; the canonical cover rule set is listed in Table 4. The decision trees associated with Tables 2 and 4 are central trees vide (3). The final classification rule set is the combination of both rule sets, listed in Table 5.

Fig. 2. Decision tree for reduct (e, d).

(b) Generally, there are four strategies for combining multiple classifiers: (i) sum of distributions, (ii) weighted or un-weighted voting, (iii) naïve Bayesian combination [25] and (iv) the decision table method. A detailed discussion of these methods is available in [32]. These four methods are complementary to each other; each has its strong and weak points depending on the

Table 5
Final rule set for the data set.

r1: (Experience = Low) → (Decision = Reject)
r2: (Experience = High) → (Decision = Accept)
r3: (Experience = Medium) ∧ (Diploma = MBA) → (Decision = Accept)
r4: (Experience = Medium) ∧ (Diploma = MSc) → (Decision = Reject)
r5: (Experience = Medium) ∧ (Reference = Excellent) → (Decision = Accept)
r6: (Experience = Medium) ∧ (Reference = Neutral) → (Decision = Reject)


Table 6
The comparison of classifiers on various datasets.

Dataset    Data size  #Decision attribute  #Conditional attribute  C4.5   ID3    SVM    Proposed method
Zoo        100        6                    9                       95.04  94.79  95.03  96.72
Mushroom   703        6                    10                      81.09  81.75  78.93  83.92
Glass      214        6                    9                       74.44  73.26  69.78  76.12
Ecoli      336        7                    7                       85.22  84.19  84.44  89.41
Hepatitis  155        2                    19                      86.45  85.80  84.97  88.95
Average                                                            84.45  83.96  82.63  87.02



domain. An in-depth analysis and comparison of these strategies can be found in [26,32–34]. A reduct classifier is built by involving only the information necessary to represent the given data set without losing essential information. Our approach is to learn a reliable model for each classifier. It is obvious that each classifier has a particular subdomain for which it is most accurate; thus it is very hard to say which one is better than the others in real applications. Depending on the subdomain, one reduct classifier can be more useful than another. So, the accuracy is measured for each rule of the classifier and only a certain number of rules are considered from each classifier based on probability theory. Ultimately, the final classifier is generated by inclusion of those rules.

After discretizing [38] the attribute values of the data sets of the UCI machine-learning repository [35], the proposed method and the C4.5, ID3 and SVM based classifiers are applied to classify the data sets. Five different data sets are chosen from the UCI repository and ten complete 10-fold cross-validations have been performed for each data set. Table 6 summarizes the accuracies of the proposed classifier and of the C4.5, ID3 and SVM classifiers on the various data sets.

Machine learning using the decision tree approach has become popular in data mining because of its high predictive accuracy and ease of implementation. A set of if-then rules generated from large trees may be preferred in many cases. Moreover, replication of subtrees in the decision tree and its corresponding rule set can be removed, and it is also easier to combine newly discovered rules with existing knowledge in a given domain.

To fulfill the domain requirements, the proposed method transforms the decision trees into rule sets, which are scalable, and finally combines them into a single classifier after pruning. Using a global optimization strategy, the C4.5 method becomes extremely slow on large datasets. On the other hand, rule post-pruning algorithms such as CN2, IREP or RIPPER [42–44] are scalable to large datasets, but they suffer from the crucial problem of over-pruning and often do not achieve as high an accuracy as C4.5. This paper proposes an algorithm which trades off these two problems and maintains both scalability and accuracy. Experimental results show that the new algorithm can produce rule sets that are even more accurate than those generated by C4.5.

As the results in Table 6 show, the average accuracies of almost all the classifiers, including the proposed method, are very similar, so statistical testing is needed to judge whether the differences between the methods are significant. Here, to explain the importance of the proposed method compared to the other methods, statistical testing on the 10-fold cross-validation data sets is performed. The Wilcoxon rank-sum test [39–41] is applied since it is valid for data of any distribution and is much less sensitive to outliers compared to the other testing method, namely the two-sample t-test [41]. The Wilcoxon rank-sum test computes the p-value between the proposed method and the other methods (mentioned in Table 6) for the given datasets. As an example, for the Zoo dataset the p-value between C4.5 and the proposed classifier is 0.0255, and those for ID3 and SVM are 0.0448 and 0.0633, respectively. Considering the p-value 0.0255, it shows


that though the average accuracies of C4.5 and the proposed classifier are very similar (95.04% and 96.72%, respectively), there is a 97.45% chance that the average accuracies would be truly different if one looked at the total data set. Similarly, for the ID3 and SVM classifiers there are 95.52% and 93.67% chances, respectively, of the average accuracies being truly different. For the other data sets, experimental results show that the p-values are similar and, on average, there is a more than 90% chance of the average accuracies being truly different. Thus, after executing the Wilcoxon rank-sum test, it has been observed that the proposed method is useful compared to the others for classification of discrete data sets.
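For reference, the kind of comparison described above can be reproduced with SciPy's rank-sum test; the function below is a hedged illustration and the argument names (per-fold accuracy lists from the 10-fold cross-validation) are ours, not the exact values used in the paper.

```python
from scipy.stats import ranksums

def compare_classifiers(folds_a, folds_b):
    """Two-sided Wilcoxon rank-sum p-value between two sets of fold accuracies."""
    statistic, p_value = ranksums(folds_a, folds_b)
    return p_value

# Usage (illustrative): p = compare_classifiers(proposed_fold_accuracies, c45_fold_accuracies)
```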

5. Conclusion

Ensembles of classifiers [31] are a very effective method to obtain highly accurate classifiers by combining less accurate ones. The proposed concept is comparatively simple and it offers straightforward interpretations of the results. The concept of rough set theory offers a sound theoretical foundation for constructing the reduct sets, which is very effective for dimensionality reduction. Controlling the number of classifiers in an ensemble is very important in data mining applications. Though ensembles provide very accurate classifiers, too many classifiers in an ensemble may limit their practical application. In the proposed method, the final classifier is formed by combining all the rules of the core classifiers and only a selected number of rules of the other classifiers, using probability theory.

Boosting experiments with many trials (e.g., 1000 as in [37]) are not feasible in data mining applications, because an ensemble with a large number of classifiers can require a large amount of memory and CPU time. This is a very serious problem in data mining applications because of the huge size of the data sets. Our approach to some extent alleviates this problem because it avoids generating highly correlated classifiers, and the number of classifiers generated in an ensemble is much smaller than in other methods. Further, the use of probability theory selects only the important rules for the final classifier, which improves the classifier accuracy, whereas in the case of equal voting all classifiers are ensembled with equal importance.

References

[1] C. Carter, H. Hamilton, Efficient attribute-oriented generalization for knowledge discovery from large databases, IEEE Transactions on Knowledge and Data Engineering 10 (1998) 193–208.
[2] G. Dong, J. Li, Efficient mining of emerging patterns: discovering trends and differences, in: Proceedings of the 1999 International Conference on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, August 1999, pp. 43–52.
[3] M. Klemettinen, H. Mannila, P. Ronkainen, H.A.I. Toivonen, Finding interesting rules from large sets of discovered association rules, in: Proceedings of the 3rd International Conference on Information and Knowledge Management (CIKM'94), ACM Press, New York, 1994, pp. 401–407.
[4] Z. Pawlak, Rough sets, International Journal of Information and Computer Sciences 11 (1982) 341–356.
[5] Z. Pawlak, Rough set theory and its applications to data analysis, Cybernetics and Systems 29 (1998) 661–688.
[6] Z. Pawlak, Rough Sets—Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers, Boston, London, Dordrecht, 1991, p. 229.
[7] X. Ru-Zhi, N. Pei-yao, L. Pei-guang, C. Dong-sheng, Cloud model based data attributes reduction for clustering, in: Proceedings of the 1st International Conference on Forensic Applications and Techniques in Telecommunications, Information and Multimedia and Workshop, ICST, January 2008.



[8] W. Zhu, F.Y. Wang, Reduction and axiomization of covering generalized rough sets, Information Sciences 152 (2003) 217–230.
[9] Y. Huang, X. Huang, N. Cercone, Feature selection with rough sets for web page classification, Transactions on Rough Sets, Springer, 2004.
[10] S. Tan, Y. Wang, X. Cheng, An efficient feature ranking for text categorization, in: Proceedings of the ACM Symposium on Applied Computing (SAC'08), 2008, pp. 407–413.
[11] Y. Li, N. Zhong, Rough association rule mining in text documents for acquiring web user information needs, in: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI'06), IEEE Computer Society, 2006, pp. 226–232.
[12] T.Y. Lin, N. Cercone (Eds.), Rough Sets and Data Mining: Analysis of Imprecise Data, Kluwer Academic Publishers, 1997.
[13] A. Skowron, C. Rauszer, The discernibility matrices and functions in information systems, in: R. Slowinski (Ed.), Intelligent Decision Support, 1992, pp. 331–362.
[14] R. Jensen, Q. Shen, Fuzzy-rough attribute reduction with application to web categorization, Fuzzy Sets and Systems 141 (2004) 469–485.
[15] G.Y. Wang, H. Yu, D.C. Yang, Decision table reduction based on conditional information entropy, Chinese Journal of Computers 25 (7) (2002) 1–9.
[16] J. Bazan, A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision systems, in: L. Polkowski, A. Skowron (Eds.), Rough Sets in Knowledge Discovery, vol. 1, Physica-Verlag, 1998, pp. 321–365.
[17] W. Swiniarski, A. Skowron, Rough set methods in feature selection and recognition, Pattern Recognition Letters 24 (6) (2003) 833–849.
[18] J. Quinlan, R. Rivest, Inferring decision trees using the minimum description length principle, Information and Computation 80 (1989) 227–248.
[19] M. Hansen, B. Yu, Model selection and the principle of minimum description length, Journal of the American Statistical Association 96 (2001) 746–774.
[20] J.R. Quinlan, The minimum description length and categorical theories, in: Proceedings of the 11th International Conference on Machine Learning, New Brunswick, Morgan Kaufmann, San Francisco, 1994, pp. 233–241.
[21] R.W. Swiniarski, L. Hargis, Rough sets as a front end of neural-networks texture classifiers, Neurocomputing 36 (2001) 85–102.
[22] G. Fumera, F. Roli, Analysis of error-reject trade-off in linearly combined multiple classifiers, Pattern Recognition 37 (2004) 1245–1265.
[23] J. Kittler, M. Hatef, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3) (1998) 139–226.
[24] Y. Daren, H. Qinghua, B. Wen, Combining multiple neural networks for classification based on rough set reduction, in: Proceedings of the IEEE International Conference on Neural Networks & Signal Processing, Nanjing, China, December 14–17, 2003.


[25] M. Koppen, S. Engerson, Integrating multiple classifiers by finding their area of expertise, in: Working Notes of the AAAI Workshop on Integrating Multiple Learned Models, 1999.
[26] E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Machine Learning (1998).
[27] Y. Freund, R. Schapire, Experiments with new boosting algorithms, in: Proceedings of the International Conference on Machine Learning, 1996.
[28] R.E. Schapire, Y. Freund, P. Bartlett, Explanation for the effectiveness of voting methods, Machine Learning (1998).
[29] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1) (1986) 81–106.
[30] J. Wang, D.Q. Miao, Analysis on attribute reduction strategies of rough set, Journal of Computer Science & Technology 13 (2) (1998) 189–193.
[31] X. Hu, Ensemble of classifiers using rough set theory, 0-7695-1119-8/01, IEEE, 2001.
[32] X. Hu, N. Cercone, W. Ziarko, Construction of multiple knowledge bases for data using rough set theory: analysis of imprecise data, Kluwer Academic Publishers, 1997.
[33] T.G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization, Machine Learning (1998).
[34] R. Maclin, D. Opitz, An empirical evaluation of bagging and boosting, in: Proceedings of the International Conference on Machine Learning, 1997.
[35] P. Murphy, W. Aha, UCI Repository of Machine Learning Databases, 1996, http://www.ics.uci.edu/mlearn/MLRepository.html.
[36] F.M. Brown, Boolean Reasoning, Kluwer, Dordrecht, 1990.
[37] R.E. Schapire, Y. Freund, Boosting the margin: a new explanation for the effectiveness of voting methods, in: Proceedings of the International Conference on Machine Learning, 1997.
[38] R. Kerber, ChiMerge: discretization of numeric attributes, in: Proceedings of AAAI-92, 9th International Conference on Artificial Intelligence, AAAI Press, 1992, pp. 123–128.
[39] G.W. Corder, D.I. Foreman, Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, Wiley, New Jersey, 2009.
[40] F. Wilcoxon, Individual comparisons by ranking methods, Biometrics 1 (1945) 80–83, http://sci2s.ugr.es/keel/pdf/algorithm/articulo/wilcoxon1945.pdf.
[41] M. Hollander, D.A. Wolfe, Nonparametric Statistical Methods, John Wiley & Sons, Inc., Hoboken, NJ, 1999.
[42] P. Clark, T. Niblett, The CN2 induction algorithm, Machine Learning 3 (1) (1989) 261–284.
[43] W. Cohen, Fast effective rule induction, in: Proceedings of the 12th International Conference on Machine Learning, 1995, pp. 115–123.
[44] J. Fürnkranz, G. Widmer, Incremental reduced error pruning, in: Proceedings of the 11th International Conference on Machine Learning, 1994, pp. 70–77.