
Proceedings of the 2012 International Conference on Wavelet Analysis and Pattern Recognition, Xian, 15-17 July, 2012

ASFIDT, ADAPTIVE STEP FORWARD DECISION TREE CONSTRUCTION

TAI-ZHE TAN¹, YING-YI LIANG¹

¹Faculty of Computer, Guangdong University of Technology, Guangzhou 510006, China
E-MAIL: [email protected]@gdut.edu.cn

Abstract: This paper presents a novel and efficient decision tree construction approach based on C4.5. C4.5 constructs a decision tree with the information gain ratio and deals with missing values and noise. ID3 and its improvement, C4.5, both select one attribute as the splitting criterion each time during decision tree construction, adopting one step forward. Compared with one step forward, the proposed algorithm, ASFIDT, uses either one attribute or two attributes as the splitting criterion for establishing tree nodes, adopting an adaptive step forward that improves the possibility of finding the optimum. On 3 UCI standard datasets, the experimental results demonstrate its performance and efficiency in constructing decision trees.

Keywords: Information entropy; Gain ratio; C4.5; Decision tree

1. Introduction

The decision tree is an understandable and popular classification algorithm in machine learning because of its flexibility and clarity in representing the classifying process [1]. Among the algorithms for building decision trees, ID3 [2], proposed by Quinlan, and its improved version C4.5 [3] are two of the most popular. Because C4.5 can handle missing values [4] and noise, avoid overfitting, and so on, it is commonly considered better than ID3 for generating decision trees. However, both select one attribute from the attribute list as the splitting criterion to generate one node when building the decision tree. To improve the accuracy rate of classification and reduce the depth of the tree, this paper proposes a novel approach that extends C4.5 by selecting one or two attributes as the splitting criterion(s), depending on which yields the bigger information gain ratio. The new algorithm is denoted ASFIDT, Adaptive Step Forward Decision Tree Construction.

The remainder of this paper is organized as follows: In section 2, we give some concepts of information entropy, ID3, and C4.5 with their mathematical descriptions. In section 3, we present the new approach to constructing decision trees, i.e., ASFIDT, Adaptive Step Forward Decision Tree Construction. In section 4, we present the experimental results and analyze them. Finally, the conclusions and our future direction of work are given in section 5.

2. Related Works

In constructing a decision tree, finding the splitting criterion used to generate a node is the most important and necessary step [5]. The information gain used by ID3 and the information gain ratio used by C4.5 are both common ways to select the splitting criterion from the attribute list; both come from information entropy theory. During tree building, the information gain and gain ratio measure each attribute's impurity, which determines the possibility of it being selected as the best splitting criterion: the lower the impurity, the better.

2.1. ID3

The "Iterative Dichotomiser Tree" (ID3) is a greedyinduction algorithm, constructing decision trees in atop-to-down recursive divide-and-conquer manner andworks only with categorical input and output data. ID3grows trees splitting on all categories of an attribute, thusproducing shallow and wide trees. It grows tree classifiersin three steps:

1. Split creation in the form of multi-way splits, i.e., for every attribute a single split is created in which the attribute's categories are the branches of the proposed split.

2. Evaluation of the best split for tree branching, based on the information gain measure, and

3. Checking of the stop criteria, and recursively applying the steps to new branches.

These three steps iterate and are executed at every node of the decision tree classifier. The information gain measure (2) is based on the well-known Shannon entropy shown in (1).



Entropy(S) = \sum_{i=1}^{k} -p_i \log_2 p_i    (1)



where k represents the number of classes of the output variable, p_i the probability of the i-th class, and S the dataset. ID3 uses the information gain (2) as a measure of split quality.

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)    (2)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of dataset S whose samples take value v on A. The summation term is the expected entropy of S after splitting on an input attribute A with k categories: Entropy(S_v) is the entropy of one category's subset with respect to the output attribute, and |S_v|/|S| is the proportion of samples in that subset. The information gain of an attribute is the difference between the entropy of the system, or node, and this expected entropy; it represents the amount of information the attribute holds for class disambiguation.

2.2. C4.5

The C4.5 algorithm is an improvement of ID3 that can also work with numerical input attributes [6]. It follows three steps during tree growth:

1. Split creation for categorical attributes is the same as in ID3; for numerical attributes, all possible binary splits have to be considered, and numerical splits are always binary.

2. Evaluation of the best split for tree branching, based on the gain ratio measure, and

3. Checking of the stop criteria, and recursively applying the steps to new branches.

This algorithm introduces a new, less biased split evaluation measure (the gain ratio). The algorithm can work with missing values, has a pruning option, can group attribute values, can generate rules, etc.

The gain ratio selection criterion is a measure that is less biased towards selecting attributes with more categories. Gain ratio divides the information gain of an attribute by its split information, a measure that depends on the number of categories k of the attribute:

SplitInformation(S, A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}    (3)

where S_1, ..., S_k are the k sample subsets of S divided by the k different values of attribute A. In summary, (1) calculates the information entropy, (2) calculates the information gain Gain(S, A) of attribute A in dataset S, (3) calculates the split information SplitInformation(S, A), and finally the information gain ratio of each attribute is obtained as

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A).

C4.5 can work with both categorical and numerical attributes.
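As a companion to the previous sketch, the following hedged Python fragment (again ours, not the authors' code; it reuses the assumed entropy() and information_gain() helpers defined above) computes the split information (3) and the gain ratio of a single attribute:

```python
# Minimal sketch of equation (3) and the C4.5 gain ratio;
# reuses entropy() and information_gain() from the previous sketch.
from collections import Counter
from math import log2

def split_information(rows, attr):
    """SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|)."""
    total = len(rows)
    counts = Counter(row[attr] for row in rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain_ratio(rows, labels, attr):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    split_info = split_information(rows, attr)
    if split_info == 0.0:                # attribute takes a single value in S
        return 0.0
    return information_gain(rows, labels, attr) / split_info
```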

Categorical attributes can produce multi-way splits, and numerical attributes binary splits. C4.5 includes three pruning algorithms, namely reduced error pruning, pessimistic error pruning and error-based pruning. In summary, it has the following merits:

1. It handles both continuous and discrete attributes;
2. It deals with missing attribute values;
3. It copes with attributes having differing costs;
4. It prunes the decision tree after it is created.

3. The Improved Algorithm, ASFIDT

The goal of classification is to assign data to given classes according to specific attributes and a priori knowledge; it is one of the important tools in data analysis. Decision tree induction classification [7] is a common kind of classification, a tree structure similar to a flow chart, with merits such as fast classification, high accuracy, the ability to deal with high-dimensional data, and suitability for inductive knowledge discovery. The classical ID3 algorithm selects attributes by adopting the information gain as the inductive function, but tends to select attributes that have many values. There are two criteria for evaluating the quality of a decision tree: (1) fewer leaf nodes, low depth and little redundancy; (2) high accuracy of classification. C4.5 selects splitting attributes by adopting the information gain ratio. It inherits all the merits of ID3 and adds techniques such as discretizing continuous attributes, dealing with missing attribute values [8], pruning the decision tree, etc. C4.5 constructs the decision tree top-down with a one-step greedy search strategy, so the algorithm only finds a locally optimal solution for classification. To improve the possibility of finding the globally optimal solution, we propose a new algorithm that constructs the decision tree with two forward steps, similar to [9]. When selecting attributes, the proposed algorithm takes into account the information gain of selecting two attributes simultaneously, not only the information gain of selecting one attribute. Thus, considering two optimal attributes rather than the single optimal attribute improves the possibility of finding the globally optimal solution, an idea also used in [10]. In the experiments on 3 UCI standard data sets, the results show that the proposed algorithm is clearly better than C4.5.


It can select splitting attributes more accurately, improve the classification result and construct a decision tree with lower average depth.

However, when the values of some attributes are distributed unevenly in the data set, the two-step algorithm has flaws: most samples may be concentrated on one branch, which makes it worse than C4.5 in performance. It also still cannot avoid being trapped in a locally optimal solution, although it improves the possibility of finding the globally optimal solution, a problem that needs further study. To address this, we propose a new way of constructing the decision tree with adaptive steps. The description of the ASFIDT algorithm is as follows. Given the training set and the label attribute set, the steps of ASFIDT are:

(a) Preprocess each item in the training set in case some attribute values are missing. Following the strategy for dealing with missing values used in C4.5, fill in the missing attribute values: each missing attribute is given a probabilistic value by calculating the weight of each class in the result from the multiple paths to the leaf nodes.

(b) According to the formulas below, calculate the information gain ratio of each single attribute (one step forward) and of each pair of attributes (two steps forward);

(c) Compare the information gain ratio with one step forward against the average information gain ratio with two steps forward, and select the greater one as the criterion for constructing the current tree node;

(d) For each branch of the current node, select the optimal attribute or attribute pair from the remaining attribute(s) by step (b) as the successor node and set it as the current node;

(e) Repeat step (d) until all attributes have been selected. If only one attribute is left, set it as the successor of the current node directly;

(f) Prune the constructed tree with the same rule post-pruning method used by C4.5: construct a decision tree from the training data set and grow it until it fits the training data best, allowing overfitting; convert the decision tree into the equivalent rule set and modify every rule by deleting any precondition whose removal improves the evaluation accuracy; then sort the pruned rules by their evaluation accuracy and classify samples with the rules in that order.
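The following rough Python sketch illustrates the adaptive choice in steps (b)-(d) for a single node. It is not the authors' implementation: gain_ratio() is the single-attribute sketch from section 2.2, pair_gain_ratio() is sketched after equations (4)-(6) below, and interpreting the "average" two-step gain ratio as the pair gain ratio divided by two is our assumption, since the paper does not spell out the averaging.

```python
# Rough sketch of the adaptive choice in steps (b)-(d) for one node; not the
# authors' implementation. gain_ratio() is from the section 2.2 sketch and
# pair_gain_ratio() is sketched after equations (4)-(6) below. Dividing the
# pair gain ratio by two to get an "average" is our assumption.
from itertools import combinations

def choose_split(rows, labels, attributes):
    """Return a 1-tuple (single attribute) or a 2-tuple (attribute pair)."""
    # One step forward: the best single attribute.
    best_single = max(attributes, key=lambda a: gain_ratio(rows, labels, a))
    single_score = gain_ratio(rows, labels, best_single)

    # Two steps forward: the best attribute pair, if at least two remain.
    best_pair, pair_score = None, float("-inf")
    for a_i, a_j in combinations(attributes, 2):
        score = pair_gain_ratio(rows, labels, a_i, a_j)
        if score > pair_score:
            best_pair, pair_score = (a_i, a_j), score

    # Step (c): keep whichever criterion has the greater (averaged) gain ratio.
    if best_pair is not None and pair_score / 2 > single_score:
        return best_pair
    return (best_single,)
```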

Assuming that the dataset is S and A = {A_1, ..., A_n} is the set of candidate attributes, the information gain of an attribute pair {A_i, A_j} in dataset S is defined as

Gain(S, A_i, A_j) = Entropy(S) - \sum_{v \in Values(A_i)} \sum_{u \in Values(A_j)} \frac{|S_{v,u}|}{|S|} Entropy(S_{v,u})    (4)

where Values(A_k) (k = i, j) is the set of all possible values of attribute A_k, and S_{v,u} is the subset of dataset S whose samples have value v of attribute A_i and value u of attribute A_j.

Assuming that attribute A_i has n different values and attribute A_j has m different values, the splitting information is defined as

SplitInformation(S, A_i, A_j) = -\sum_{i=1}^{m \times n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}    (5)

where S_1, ..., S_{m \times n} are the m × n sample subsets of dataset S divided by the combinations of all possible values of attributes A_i and A_j. The information gain ratio of the attribute pair {A_i, A_j} in dataset S is defined as:

GainRatio(S, A_i, A_j) = \frac{Gain(S, A_i, A_j)}{SplitInformation(S, A_i, A_j)}    (6)

(0 < i < n, i < j \le n)

where Gain(S, A_i, A_j) and SplitInformation(S, A_i, A_j) are calculated by (4) and (5), respectively.
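A minimal Python sketch of equations (4)-(6), reusing the entropy() helper from the section 2.1 sketch; the joint partitioning by value pairs and the zero guard on the split information are implementation choices of ours, not taken from the paper:

```python
# Minimal sketch of equations (4)-(6) for an attribute pair {A_i, A_j};
# reuses entropy() from the sketch in section 2.1. Not the authors' code.
from math import log2

def pair_gain_ratio(rows, labels, i, j):
    total = len(labels)
    partitions = {}                      # (v, u) of (A_i, A_j) -> labels of S_{v,u}
    for row, label in zip(rows, labels):
        partitions.setdefault((row[i], row[j]), []).append(label)

    # Equation (4): Gain(S, A_i, A_j).
    expected = sum(len(p) / total * entropy(p) for p in partitions.values())
    pair_gain = entropy(labels) - expected

    # Equation (5): SplitInformation(S, A_i, A_j); empty value combinations
    # contribute nothing, so summing over the non-empty subsets is enough.
    split_info = -sum((len(p) / total) * log2(len(p) / total)
                      for p in partitions.values())

    # Equation (6): GainRatio(S, A_i, A_j).
    return pair_gain / split_info if split_info > 0 else 0.0
```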

4. Experimental Results

C4.5 creates a decision tree by selecting the best attribute as a node of the tree after calculating the Gini index [11] as well as the information gain. The splitting continues until a leaf node is reached. Part of the experiment was done with the Weka tool, to obtain how many instances are correctly classified and how many are incorrectly classified. After preprocessing all training data, noisy information and outliers are removed through filtering. One of the data sets used in the experiment is the weather.nominal data set shown in Table 1.

TABLE 1. WEATHER.NOMINAL TRAINING SET

OUTLOOK TEMPERATURE HUMIDITY WINDY PLAY

sunny hot high FALSE no

sunny hot high TRUE no

overcast hot high FALSE yes

rainy mild high FALSE yes

rainy cool normal FALSE yes

rainy cool normal TRUE no

overcast cool normal TRUE yes


sunny mild high FALSE no

sunny cool normal FALSE yes

rainy mild normal FALSE yes

sunny mild normal TRUE yes

overcast mild high TRUE yes

overcast hot normal FALSE yes

rainy mild high TRUE no

4.1. Datasets

In this paper, we used 3 data sets (kv-vs-kp, tic-tac-toe and weather.nominal) as the experimental objects from UCI [12], i.e., the University of California Irvine Machine Learning Repository, which collects databases, domain theories, and data generators used by the machine learning community for the empirical analysis of machine learning algorithms.

Figure 1. The same decision tree generated by C4.5 and ASFIDT (figure not reproduced)

TABLE 2. TRADE-OFF OF DATASETS

Dataset Name      Sample #   Attribute #   Class #   C4.5 Acc. (%)   C4.5 Leaf #   ASFIDT Acc. (%)   ASFIDT Leaf #
kv-vs-kp          3196       37            2         99.44           31            100               49
tic-tac-toe       958        10            2         84.55           95            74.39             3
weather.nominal   14         5             2         50.00           5             50.00             5

4.2. Experimental Settings

The hardware and software used for ASFIDT are as follows:

• CPU: Pentium(R) Dual-Core E5300 @ 2.60 GHz
• RAM: 1.99 GB
• OS: Windows XP Professional SP3
• Running environment: Matlab R2010a, Weka

In order to use the datasets from UCI with Weka, conversion has to be done before the ASFIDT algorithm can run successfully. It is worth noting that some datasets put the class attribute in the first column and also have blank row(s) at the end of the original data file. As C4.5 is included in Weka, the experiment uses the Weka tool to generate the C4.5 trees, because it makes it easy to select the gain ratios of attributes, generate a tree and test its accuracy with k-fold cross validation in a few clicks. The representation of the decision tree generated by ASFIDT is similar to that in Weka, and both are easy to read.

4.3. Results and Analysis

The experimental results shown in Table 2 show that:

1. On the kv-vs-kp dataset, ASFIDT gets a more accurate result than C4.5, but with more leaves; conversely, on the tic-tac-toe dataset ASFIDT generates a decision tree with fewer leaves but is less accurate than C4.5; on the weather.nominal dataset the two algorithms have the same accuracy and the same number of leaves.

2. Because the size of the weather.nominal dataset is small and the values of each attribute are evenly distributed, the gain ratios calculated by ASFIDT are equal to the ones calculated by C4.5 when selecting the best attribute(s), so both give the same result and the same decision tree. This shows that ASFIDT cannot always select two best attributes.

3. ASFIDT sometimes has to sacrifice some accuracy in order to generate a small tree with fewer leaves, or gets a more accurate result but increases the number of leaves, depending on the conditions of the datasets.

5. Conclusion and Perspective

In this paper, ASFIDT was proposed to improve the classification accuracy rate and the depth of the tree in building decision trees, compared with C4.5. While selecting the splitting criterion, the improved algorithm ASFIDT considers not only one attribute but also two, which can find the bigger information gain ratio as the splitting criterion for a node of the decision tree. The splitting process recurses each time over the remaining attribute list, and the final decision tree is built once the list is empty.

The experimental results proved that the accuracy rate


and the depth of the built decision tree are improved. Even so, disadvantages remain: the approach costs more computational time, and it still cannot avoid running into a local optimum. Our future work is to optimize the algorithm's computation and extend the splitting multi-criterion to adapt to complicated situations such as very high dimensionality or a large number of attributes in the feature space.

Acknowledgements

This research was supported by Guangzhou Development Zone Science and Technology Projects (grant no. 2010Q-P200) and the open fund of the ART Key Laboratory of Defense Science and Technology of Shenzhen University (grant no. 110022).

References

[1] L. X. Jiang, C. Q. Li, et al., Learning decision tree for ranking, Knowl Inf Syst, 2009, vol. 20, pp. 123-135.

[2] J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1986, vol. 1, pp. 81-106.

[3] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[4] W. Z. Liu, A. P. White, et al., Techniques for dealing with missing values in classification, Advances in Intelligent Data Analysis Reasoning about Data, Springer Berlin, 2006, vol. 8, pp. 527-536.

[5] B. Chandra, R. Kothari, et al., A new node splitting measure for decision tree construction, Pattern Recognition, 2010, vol. 43, pp. 2725-2731.

[6] E. Yen and I.-W. M. Chu, Relaxing instance boundaries for the search of splitting points of numerical attributes in classification tree, Information Sciences, 2007, vol. 177, pp. 1276-1289.

[7] M. Suknovic, B. Delibasic, et al., Reusable components in decision tree induction algorithms, Comput Stat, 2012, vol. 27, pp. 127-148.

[8] S. Zhang, Q. Z., et al., "Missing is useful": missing values in cost-sensitive decision trees, Knowledge and Data Engineering, 2005, vol. 17, issue 12, pp. 1689-1693.

[9] B. Chandra, P. P. Varghese, Moving towards efficient decision tree construction, Information Sciences, 2009, vol. 179, pp. 1059-1069.

[10] S. W. Lin, S. C. Chen, Parameter determination and feature selection for C4.5 algorithm using scatter search approach, Soft Comput, 2012, vol. 16, pp. 63-75.

[11] G. Behera, Privacy preserving C4.5 using Gini index, Proceedings of the 2011 2nd National Conference on Emerging Trends and Applications in Computer Science, Shillong, India, 2011.

[12] A. Asuncion and D. J. Newman (2007), UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Sciences.
