forest-cover-type

Forest Cover Type Classification: A Decision Tree Approach

Laila van Ments, Kayleigh Beard, Anita Tran and Nathalie Post

March 31, 2015

Abstract

This paper describes our approach to a Kaggle classification task in which Forest Cover Types need tobe predicted. In order to solve this classification problem, three machine learning decision tree algorithmswere explored: Naive Bayes Tree (NBTree), Reduced Error Pruning Tree (REPTree) and J48 Tree. Themain focus of this research lies on two aspects. The first aspect concerns which of these trees yields thehighest accuracy for this classification task. The second aspect is finding which parameter settings andwhich category- or bin size of the partitioned training set yield the highest accuracy for each decision tree.Briefly, the experiments involve binning the training set, performing Cross Validated Parameter Selectionin Weka, calculating the true error and submitting the best performing trained decision trees to Kaggle inorder to determine the test set error. The results of this research show that bin sizes between 20 and 50yield higher accurateness for each decision tree. Moreover, the optimal parameters settings that yielded thehighest accuracy on the training set, were not necessarily the optimal parameter settings for the the test set.

Contents

Page

Introduction 2

1 Data inspection and Preparation 31.1 The available data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2 Becoming familiar with the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Determining missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.4 Data transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.5 Analyzing the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.6 Data visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.7 Evaluating the Information Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.8 Outlier and measurement error detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.9 Discretize values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.10 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Experimental Setup 82.1 General setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.2 Detailed experimental setup per tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.1 J48 decision tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.2 Naive Bayes decision tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.2.3 Reduced Error Pruning decision tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Results 133.1 Training Set Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2 Test Set Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Conclusion 16

5 Discussion 18

Bibliography 18

1

Introduction

This paper describes our approach to the Kaggle Competition ‘Forest Cover Type Prediction’. This compe-tition is a classification problem. The goal of this classification problem is to predict the correct forest covertype, based on 54 cartographic features. Cartographic features are the features that describe the landscape,such as elevation and soil type. The data was gathered in four wilderness areas in the Roosevelt NationalForest (Northern Colorado). [1]

With many large wilderness areas, it can be difficult and costly to determine the cover type of each areamanually. However, the information about the cover type is very useful for experts or ecosystem caretakerswho need to maintain the ecosystems of these areas. Thus, if a machine learning algorithm is capable ofaccurately predicting the cover type of a certain area with given cartographic features, it will be easier forexperts to make decisions when preserving ecological environments.

In order to solve this classification problem, three machine learning decision tree algorithms were ex-plored. Decision trees are often used in classification problems. The basic idea of a decision tree algorithmis as follows. Based on the data of a training set, a decision tree is constructed that maps the observationsabout features onto the class value. While constructing the tree, for each node, a decision needs to be madeon which feature to split on next, i.e. which feature will be placed on the current node. The chosen feature isthe feature that has the highest information gain at that moment. The information gain indicates how wella certain feature will predict the target class value. [2] After the tree is constructed based on the trainingdata, the class of every new and unknown data point can be predicted with the tree.

The three decision trees that will be evaluated for this classification problem are Naive Bayes Tree(NBtree), Reduced Error Pruning Tree (REPtree) and J48 Tree. The reason for testing these decision treesis because they are all similar to or based on the C4.5 statistical classifier, which is an extension of the ID3decision tree algorithm. [3] We are interested in the performance of these C4.5-like algorithms and will becomparing these trees with each other.

The three decision tree algorithms were explored and tuned in several experiments in order to find theparameters that yield the highest accuracy (highest percentage of correctly classified cover types) on boththe training set and the test set. We are particularly interested in assigning the real-valued features of thedataset into categories or bins, by testing several bin-sizes in order to discover which size yields the highestaccuracy. Moreover, several parameter settings are tested for each decision tree algorithm. Therefore, theresearch question that will be answered in this report is:

Which of the NBTree, REP Tree and J48 Tree yields the highest accuracy on theForest Cover Type classification problem?

With the following sub-question:

For each decision tree, which parameter settings and bin sizes of the features yield the highest accuracy?

The first step in answering these questions consists of gaining a deeper understanding of the data andpre-processing it where necessary, as described in Chapter 1. In the second chapter, the experimental setupfor each decision tree is discussed. The third chapter contains the accuracy results of the experiments onthe training set as well as the test set. Finally, the conclusion and discussion are reported.

2

Chapter 1

Data inspection and Preparation

1.1 The available data

The available data for this classification problem was provided by Kaggle, and consists of a training set anda test set. For the initial phases of the process, only the training set was utilized. The training set consistedof 15.120 instances with 54 attributes, all belonging to 1 of 7 forest cover type classes.

1.2 Becoming familiar with the data

The first step of the data analysis was to become familiar with the data. For that reason, the first step ofdata inspection was simply analyzing the data with the bare eye in Excel. From this it was observed that allthe provided data was numeric. Also, part of the data (the Wilderness Areas and the Soil Types) consistedof Boolean values.

1.3 Determining missing values

Excel was used in order to determine whether there were missing values in the data, which was not the case;the data was complete.

1.4 Data transformations

The Boolean features (Wilderness Areas and Soil Types) were further analyzed using a Python script. Itturned out that every instance in the data set belonged to exactly one of the four Wilderness Areas and exactlyone of the forty Soil Types. For that reason, another Python script was written in order to reshape the data.The four Wilderness Area Boolean features were merged into one feature: ‘Wilderness Areas Combined’. Inthis combined feature, a Class value is assigned to each of the Wilderness Areas (1,2,3 or 4). This was alsodone for the forty Boolean Soil Type Boolean features, resulting in one feature ‘Soil Type Combined’, withforty Classes (1, 2,...,40).

1.5 Analyzing the data

Weka (a machine learning software tool) was used to gain an understanding of the values and the ranges ofthe data. In Table 1.1 an overview of all the features is displayed.

3

Feature: Unit: Type: Range Median: Mode: Mean:StandardDeviation:

Elevation Meters Numerical [1863 , 3849] 2752 2830 2749,32 417,68Aspect Degrees Numerical [0 , 360] 126 45 156,68 110,09Slope Degrees Numerical [0 , 52] 15 11 16,5 8,45Horizontal Distance To Hydrology Meters Numerical [0 , 1343] 180 0 227,2 210,08Vertical Distance To Hydrology Meters Numerical [-146 , 554] 32 0 51,08 61,24Horizontal Distance To Roadways Meters Numerical [0 , 6890] 1316 150 1714,02 1325,07Hillshade 9 am Illumination value Numerical [0 , 254] 220 226 30,56Hillshade Noon Illumination value Numerical [99 , 254] 223 225 218,97 22,8Hillshade 3pm Illumination value Numerical [0 , 248] 138 143 135,09 45,9Horizontal Distance To Fire Points Meters Numerical [0 , 6993] 1256 618 1511,15 1099,94Wilderness Combined Wilderness Class Categorical [0 , 4] - 3 - -Soil Type Combined Soil Type Class Categorical [0 , 40] - 10 - -Cover Type Cover Type Class Categorical [1 , 7] - 5 - -

Table 1.1: Feature Overview

The unit, type, range, median, mode, mean and standard deviation of each feature are good indicatorsof the meaning and distribution of the dataset. For example, it is useful to know that the hill shades haveillumination values that range from 0 to 255, and that it would be wrong if this vector would contain valuesoutside this range.

Because some data types are categorical, the mean of these values were not calculated, because for thosevalues this is not a meaningful feature. The same holds for the median and the standard deviation of thesevalues.

Furthermore, Weka was used to visualize the shape of the distribution for each of the features.

1.6 Data visualizations

In order to gain insight on which variables are the most valuable for predicting the correct cover type, anumber of density plots were made in R, which are displayed in the figures below.

Figure 1.1: Density plot Elevation

From the Elevation density plot in Figure 1.1 can be observed that a clear distinction can be madebetween cover types that belong to class 3, 4 or 6 (elevation <2700) and cover types belonging to class 1, 2,5 or 7 (elevation >2700).

4

Figure 1.2: Density plot Aspect

The Aspect density plot, displayed in Figure 1.2 is less informative than the Elevation density plot.However, from this plot can be observed that if the aspect is between 100 and 150, there is a greater likelihoodthat the instance belongs to class 4. Also, instances belonging to class 6 are more likely to have an aspectbetween 0-100 and 250-350, than between 100-250.

Figure 1.3: Density plot Horizontal Distance To Hydrology

In the Horizontal Distance To Hydrology density plot, displayed in Figure 1.3, most of the values arestacked on top of each other, which makes it difficult to use this feature to distinguish the classes. However,instances belonging to class 6 are the most likely to fall in the lowest range (between 0 and 100 approximately).

The density plots of the other attributes were not as informative as the plots displayed above. Most ofthe classes in the rest of the density plots were stacked on top of each other, which makes it difficult toclearly make distinctions between the classes.

5

1.7 Evaluating the Information Gain

From looking at the plots displayed above, it can be observed that the Elevation is likely to be one of thebest predictors for the Cover Type. To confirm this assumption, Weka‘s information gain attribute evaluatortool was used. This tool calculates for all features of the dataset the information gain and ranks the featuresaccordingly. The results of the information gain ranking are displayed in Table 1.2. From these resultscan be observed that the earlier made assumption of the Elevation being a good predictor is confirmed.Furthermore, the Horizontal Distance To Roadways, Horizontal Distance To Fire Points, and WildernessArea4 all have quite a high information gain compared to the rest of the features.

Information Gain Feature1.656762 Elevation1.08141 Horizontal Distance To Roadways0.859312 Horizontal Distance To Fire Points0.601112 Wilderness Area40.270703 Horizontal Distance To Hydrology0.251287 Wilderness Area10.234928 Vertical Distance To Hydrology0.22248 Aspect0.203344 Hillshade 9am0.191957 Soil Type10

Table 1.2: Information Gain

1.8 Outlier and measurement error detection

With the information from the visualizations and information gain ranking, the data was inspected foroutliers and measurement errors.

The range of the feature Vertical Distance To Hydrology was explored, because the values of this featurerange from -146 to 554. The fact that this distance vector contains values below zero seemed odd. However,it turns out that forest cover types with Vertical Distance To Hydrology values below zero are situated inwater. These cover types should be distinguished from ‘non hydrology’ values, and therefore values belowzero are used in the data set.

The value ranges of other features are assumed not to have crucial measurement errors, since their rangesmatch perfectly with the possible ranges for these values in ‘the real world’. However, no detection techniqueis applied in order to detect outliers, thus no outliers were found. The reason for this, is that the only machinelearning algorithms used in this research are decision trees, and these algorithms can handle outliers verywell, by pruning for example.

1.9 Discretize values

Another data preparation step could be discretization of the feature values. Decision tree algorithms areessentially designed for data sets that contain nominal or categorical features. However, that does not meanthat they cannot handle real valued inputs. In order to enable decision trees to handle real valued inputsthe data has to be modified. An example of such a modification for a feature with real valued input is‘binning’. The real valued inputs of each feature are splitted into a number of categories, based on numericalthresholds. Each category then contains a range of real valued numbers.

The Forest Cover Type dataset contains real valued features that could be converted to categoricalfeatures (‘bins’) as a preprocessing step. For example, values Hillshade 3pm ranging from 0 to 50 belong tocategory 1, values ranging from 51 to 100 belong to category 2 et cetera. Nonetheless, it is interesting to

6

consider whether ‘binning’ the features of the Forest Cover Type dataset would result in a higher accuracy,and if so, what bin sizes perform the most optimal. The detailed process of finding the best (un)binnedfeature set is further explained in the experimental setup section.

1.10 Feature selection

After cleaning and preprocessing the features, it was decided not to eliminate or merge any features of thedataset. The first reason for this is that the amount of features present for classification is not extremelylarge. A classifier that makes decisions based on 12 features is not necessarily overly complex, expensiveor prone to overfitting.[4] Besides, a certain form of feature selection is already part of the decision treealgorithm in case pruning takes place.

7

Chapter 2

Experimental Setup

In this chapter an overview is given of the setup of the research experiments and the used techniques andalgorithms.

2.1 General setup

As mentioned in the introduction of this report, the research question is: Which of the Naive Bayes Tree,REP Tree and J48 Tree yields the highest accuracy on the Forest Cover Type classification problem? Withthe sub question: For each decision tree, which parameter settings and bin sizes of the features yield thehighest accuracy? Note that by using the accuracy as a performance measure, the performance on thetraining set, estimated future data and test set should be obtained and taken into account. This is becauseperformance on the training set alone is not a good estimator for the performance of the classifier on futuredata. Concerning answering the research questions, prior to the experiments, four variations of the trainingset need to be made:

• In addition to the original unbinned dataset, the set is divided into 5 bins, 20 bins and 50 bins. Thebin sizes for each feature are dependent on the range of the feature vector and splitted into intervals,which are certain percentages of the size of that range. To ensure efficiency this is achieved by writinga Python script.

Furthermore, concerning the performance measures, the following general setup for testing each decisiontree is as follows:

1. For each training set variant (unbinned, 5 bins, 20 bins and 50 bins), Weka‘s Cross Validated Param-eter Selection (‘CVParameterSelection’) is used in order to find the optimal values for two importantparameters. The values of the other parameters are kept as the Weka defaults, and are assumed to beoptimal in the pursue of preventing extremely complex and computationally expensive experiments.Naturally, it is important to consider that this method may not result in an optimal classifier be-ing found. Luckily, because of Weka‘s CVParameterSelection option, the optimal parameters for thedecision trees do not have to be sought for manually, which is a time consuming task.

2. For each training set variant (unbinned, 5 bins, 20 bins and 50 bins), the training set performance isrecorded, based on 10-fold cross validation in Weka with accuracy as result. Each fold in cross validationinvolves partitioning the dataset into a training set for training and fictional test set for validating.This validation method is chosen, because it only wastes 10 percent of the training data and can stillgive a good estimate of the training set performance. [5] The performance of the algorithms will bemeasured with accuracy, which is the percentage of correctly classified instances out of all instances.

3. The future data set performance is estimated for each training set variant (unbinned, 5 bins, 20 binsand 50 bins) and compared with the optimized values, as well as Weka‘s default values. This is donebecause the training set accuracy might be biased due to overfitting. To estimate the future data setperformance, the true error interval with confidence factor is calculated. The interval for true errorcan be calculated as displayed in Figure 2.1 [6].

8

Figure 2.1: Formula for estimating True Error Interval

Where errors(h) is the sample error, which is 1 minus the proportion of correctly classified trainingsamples. Where n is the number of samples and where Zn is the Z-index of the confidence factor. Theconfidence factor gives the certainty of the actual error lying in the true error interval. A confidencefactor of 68% is chosen for the upcoming calculations. Because of the risk of either overfitting orunderfitting that decision trees are prone to, the chosen confidence interval allows for a large deviationfrom the sample error.

4. The optimal version of the trained decision tree is used to classify the instances of the separate testset that is provided by Kaggle. The results are submitted to Kaggle.com to determine the test setaccuracy.

In the following section, the detailed parameter optimization process will be outlined per decision tree.

2.2 Detailed experimental setup per tree

In this section the three different decision trees and parameters that were experimented with are discussedin detail.

2.2.1 J48 decision tree

J48 is an implemented C4.5 algorithm (the C4.5 algorithm is an extension of the ID3 algorithm) in Wekaand can be used for classification. It generates a pruned or unpruned C4.5 decision tree. J48 builds decisiontrees from a set of labeled training data. It calculates which feature to split on by choosing the feature splitthat yields the highest information gain. After each split, the algorithm recurses, and calculates again foreach feature which next split will yield the highest information gain. The splitting procedure stops whenall instances in a subset belong to the same class. Furthermore, the J48 tree provides an option for pruningdecision trees. [7]

The parameters of the J48 decision tree that can be modified in Weka are displayed in Table 2.1.

confidenceFactor The confidence factor used for pruning (smaller values incur more pruning).minNumObj The minimum number of instances per leaf.

numFoldsDetermines the amount of data used for reduced-error pruning.One fold is used for pruning, the rest for growing the tree.

reducedErrorPruning Whether reduced-error pruning is used instead of C.4.5 pruning.seed The seed used for randomizing the data when reduced-error pruning is used.subtreeRaising Whether to consider the subtree raising operation when pruning.unpruned Whether pruning is performed.useLaplace Whether counts at leaves are smoothed based on Laplace.

Table 2.1: Parameters Adaptable for J48

The parameter settings displayed in Table 2.2 are the default settings for the J48 decision tree.

9

confidenceFactor FalseminNumObj 2numFolds 3reducedErrorPruning Falseseed 1subtreeRaising Trueunpruned FalseuseLaplace False

Table 2.2: Default Parameter Settings for J48

Parameters for Optimization

The chosen parameters for optimization are minimum number of instances per leaf and confidence factor.The rationale behind this parameter choice will be explained in this section.

The optimal size of a decision tree is crucial for the performance of the classification. When the decisiontree is too large (i.e. too many nodes), it may overfit on the training data and therefore generalize poorly tonew samples. On the other hand, if the tree is too small, it may not be able to capture important structuralinformation about the sample space. [9] A common strategy that is often used in order to prevent overfitting,is to grow the tree until each node contains a small number of instances and use pruning to remove nodesthat do not provide additional information. Pruning is used to reduce the size of a learning tree withoutreducing the predictive accuracy.

The parameter minimum number of instances per leaf influences the width of the tree. High values forthis parameter result in smaller trees.

The parameter confidence factor is used to discover at which level of pruning the prediction results improvemost. The smaller the confidence factor value, the more pruning will occur. Therefore, the combinationof optimal parameter values for the confidence factor and the minimum number of instances per leaf aresearched for using 10-fold ‘CVParameterSelection’ in Weka. This is done for each variant of the training set:unbinned, 5 bins, 20 bins and 50 bins.

Test values for Parameters

It was investigated what the regular values for both parameters are, in order to know which values the ‘CV-ParameterSelection’ in Weka should test for each (un)binned training set. [10] The values of the parameterconfidence factor that ‘CVParameterSelection’ tested ranged from 0.1 to 0.9, with a step size of 0.1. Thevalues of the parameter minimum number of instances per leaf that ‘CVParameterSelection’ tested rangedfrom 2 to 10, with a step size of 1.

2.2.2 Naive Bayes decision tree

Naive Bayes Tree (‘NBTree’) is an induced hybrid algorithm of decision-tree classifiers and Naive Bayesclassifiers. This algorithm generates a decision tree that splits the same way as regular decision trees, butwith Naive Bayes classifiers at the leaves. NBTree frequently achieves higher accuracy than either a naiveBayesian classifier or a decision tree learner. This approach is suitable in scenarios where many attributes arelikely to be relevant for a classification task, yet the attributes are not necessarily conditionally independentgiven the label. [8]

NBTree is only able to process nominal and binary classes. Therefore, the non-binned data sets couldnot be experimented with.

This algorithm does not contain any parameter settings in Weka. Therefore, this algorithm is only testedwith different bin sizes.

10

2.2.3 Reduced Error Pruning decision tree

Reduced Error Pruning Tree (‘REPTree’) is a fast decision tree learner and builds a decision tree basedon the information gain or reducing the variance. The basics of pruning of this algorithm are that it usesreduced-error pruning with backfitting. Starting at the leaves, each node is replaced by the class of majority.If the prediction accuracy is not affected, the change is kept. This results in the smallest version of the mostaccurate subtree with respect to the pruning set. [9]

The parameters of the REPTree that can be modified in Weka are displayed in Table 2.3

debug If set to true, classifier may output additional info to the console.maxDepth The maximum tree depth (-1 for no restriction).minNum The minimum number of instances per leaf

minVariancePropThe minimum proportion of the variance on all the data that needs to be presentat a node in order for splitting to be performed in regression trees.

noPruning Whether pruning is performed.

numFoldsDetermines the amount of data used for pruning. One fold is used for pruning,the rest for growing the tree.

seed The seed used for randomizing the data.

Table 2.3: Parameters Adaptable for REPtree

The parameter settings displayed in Table 2.4 are the default settings for the REPTree.

debug FalsemaxDepth -1minNum 2.0minVarianceProp 0.001noPruning FalsenumFolds 3seed 1

Table 2.4: Default Parameter Settings for REPtree

Parameters for Optimization

The parameters maximum depth of tree and minimum number of instances per leaf are optimized, becausethese are important parameters in the REPTree algorithm. The rationale behind this parameter choice willbe explained in this section.

The default setting for maximum depth of tree is set to -1, which means that there is no restriction fordepth. This parameter indicates the amount of pruning that takes place. Pruning is an important factorrelated to overfitting and underfitting, two things that both need to be prevented as much as possible. Theminimum number of instances per leaf influences the width of the tree. The higher this number, the smallerthe tree. Therefore, the combination of optimal parameter values for the maximum depth of tree and theminimum number of instances per leaf are searched for using 10-fold ‘CVParameterSelection’ in Weka. Thisis done for each variant of the training set: unbinned, 5 bins, 20 bins and 50 bins.

Test values for Parameters

It was investigated what the common values for both parameters are, in order to know which values the‘CVParameterSelection’ in Weka should test for each (un)binned training set. [10] The values of the param-eter maximum depth of tree that ‘CVParameterSelection’ tested ranged from 2 to 12 with step size 1. The

11

values of the parameter minimum number of instances per leaf that ‘CVParameterSelection’ tested rangedfrom 2 to 10, with a step size of 1.

12

Chapter 3

Results

In this chapter, the results of the performed experiments are given. As described in the previous chapter(The Experimental Setup), the results for each tree (J48 Tree, NBTree, REPtree) consist of:

Training Set Results:

• Default and optimal parameter settings found with Cross Validated Parameter Selection

• Training set accuracy based on 10-fold Cross Validation

• An estimation of the true error of a future dataset with 68% confidence

Test Set Results:

• Test set error, based on Kaggle submission

3.1 Training Set Results

The default and optimal parameter settings, training set accuracy and true error estimate for each decisiontree are summarized in Table 3.1 (J48 Tree), Table 3.2 (NBTree) and Table 3.3 (REPTree).

Optimized /Default

ConfidenceFactorValue

MinimumNumber ofInstances perLeaf Value

Training SetAccuracy

True Error(68%)Estimate

Not binnedDefault 0.25 2 59.33% 0.4067 (+- 0.004)Optimized* 0.4 8 59.84% 0.4016 (+- 0.004)

5 binsDefault 0.25 2 60.22% 0.3978 (+- 0.004)Optimized 0.2 5 61.17% 0.3883 (+- 0.004)



Table 3.1: J48 Tree Test Results



Not binned* - -5 bins 59.50% 0.405 (+- 0.004)10 bins 60.69% 0.3931 (+- 0.004)20 bins 60.81% 0.3919 (+- 0.004)50 bins 61.27% 0.3873 (+- 0.004)*Naive Bayes Tree cannot handle numeric inputs.

Table 3.2: NBTree Test Results

13

Optimized /Default

MaximumDepth ofTree Value

MinimumNumber ofInstances PerLeaf Value



Not binnedDefault -1 2 53.35% 0.4665 (+- 0.0041)Optimized* 2 2 53.35% 0.4665 (+- 0.0041)

5 binsDefault -1 2 59.57% 0.4043 (+- 0.004)Optimized 4 10 60.49% 0.3951 (+- 0.004)



Table 3.3: REPtree Test Results

Furthermore, in Figure 3.1 below, the accuracy of the decision trees is plotted against the used bin sizesin the training set to provide a brief visual summary of the results.

Figure 3.1: Plot of accuracy against bin sizes for each tree

3.2 Test Set Results

The optimal parameter settings for each type of decision tree (J48 Tree, NBTree, REPTree) are used inorder to classify the instances from the test set that was provided by Kaggle.

As was shown in Table 3.1, Table 3.2 and Table 3.3 above, the following parameter and bin settings areoptimal for the decision trees, i.e. result in highest accuracy and lowest true error. Therefore these are usedfor testing and submitting to Kaggle.com:

1. J48 Tree with confidence factor 0.3, a minimum of 5 instances per leaf and trained on 50 bins

2. REP Tree with maximum tree depth of 4, a minimum of 7 instances per leaf and trained on 20 bins

3. NBTree trained on 50 bins

These optimized trees will be referred to as ‘J48 Tree Optimized’, ‘REPtree Optimized’, and ‘NBtreeOptimized’ respectively. Table 3.4 displays the different performance measures of the optimized trees thatwere submitted to Kaggle. [13]

14

Decision TreeTraining SetAccuracy


Test SetAccuracy(Kaggle)

J48 Tree Optimized 62.26% 0.3774 (+- 0.0039) 43,67%NBtree Optimized 61.27% 0.3873 (+- 0.004) 43,35%REPtree Optimized 60.60% 0.394 (+- 0.004) 47,07%

Table 3.4: Different performance measures of optimized trees.

15

Chapter 4

Conclusion

In this section the results of the performed experiments are interpreted with respect to the research questionof this report:

Which of the NBTree, REP Tree and J48 Tree yields the highest accuracy on theForest Cover Type classification problem?

With the following sub-question:

For each decision tree, which parameter settings and bin sizes of the features yield the highest accuracy?

The accuracy on the training set alone is not a good predictor for the performance of the classifier onfuture data, so the estimated performance on future data and test set performance were also taken intoaccount.

In pursuit of answering these questions, experiments with unbinned, 5 binned, 20 binned and 50 binneddatasets were made. In addition, for each decision tree two important parameters were optimized. Accuracyresults based on training and test set performance were recorded, as well as the true error, which can befound in chapter 3.

Conclusions concerning the training set

Considering the sub-question, the following can be concluded based on the results from the training setexperiments.

• J48 Tree yields highest accuracy when trained on 50 bins, with confidence factor 0.3, a minimum of 5instances per leaf

• REPTree yields highest accuracy when trained on 20 bins, with maximum tree depth of 4 and aminimum of 7 instances per leaf

• NBTree yields highest accuracy when trained on 50 bins

With regard to the main research question, the results show that the J48 Tree yields the highest trainingset accuracy (62.26%) on the Forest Cover Type classification problem. The trees that follow respectively inaccuracy ranking are NBtree (61.27%) and REPtree (60.60%). The settings for the trees and the trainingset accuracy ranking also apply when considering the lowest estimated future error (true error). This meansthat J48 Tree is expected to have the least errors on future unknown data.

Even though the difference in accuracy between the three best performing decision trees is not significantlylarge, reasons for this accuracy ranking with respect to the properties of the tree are as follows. J48 is themost complex decision tree of explored ones. By complex is meant that it can fit the training data veryprecisely, which is beneficial for the training accuracy, but might be harmful when predicting new data withrespect to overfitting. NBtree is the simplest algorithm and generalizes more due to the fact that the NaiveBayes rule is used when constructing the tree. The Naive Bayes rule assumes that features are independentand this can cause more generalization of the data. This might be the reason for a low training set accuracy.

Consider Figure 3.1, where the accuracy of each optimized tree is plotted against the different (un)binnedtraining sets. For J48 Tree and NBtree it seems to show a correlation between accuracy and bin sizes. Hence,

16

one could state that the more bin sizes are used, the higher the accuracy. Even though it may appear so, itis not concluded that this correlation actually exists. For this we have conducted too few tests and it doesnot occur with the REPtree. Moreover, a correlation between two variables needs to be investigated further.However, based on the results it is assumed that bin sizes between 20 and 50 yield higher accuracy for thethree decision trees.

Conclusions concerning the test set

The test set accuracy results from Kaggle show something entirely different. REPtree yields the highest testset accuracy (47.07%) on the Forest Cover Type classification problem. The trees that follow respectivelyin accuracy ranking are J48 Tree (43,67%) and NBTree (43,35%). The occurrence of REPTree transcendingJ48 Tree on the test set can be because indeed, like mentioned before, J48 Tree was not generalizing enough.

Overall conclusion

The accuracy results of the trees on the training set experiments are significantly different from the accuracyresults on the test set. Firstly, the test set accuracy for each tree is roughly almost 20% lower than thetraining set accuracy. The reason for this can be that the decision trees are overfitting on the training setand incapable of generalizing on the new test set. Moreover, on the training set the J48 Tree performs best,and on the test set REPTree performs best. However, the results from the test set provide information aboutthe performance of the decision tree most accurately. This is because the test set has a larger number ofinstances. Moreover, the test set performance is a better estimator of how the classifier will predict newdata. Therefore, the final conclusion is that the Reduced Error Pruning Tree yields highest accuracy on theForest Cover Type classification problem.

17

Chapter 5

Discussion

When testing each decision tree and while drawing conclusions based on the results, several assumptions andsimplifications were made.

First of all, in the conducted experiments only four variants of the (un)binned dataset were used inpursuit of finding the optimal partitioning of the training set in bins. In case more bin sizes were tested,another training set with a certain bin size might have been found that would yield a higher accuracy.

Furthermore, the process of parameter optimization can be extended. In this research only two param-eters were tuned while assuming other parameters with default values were optimal. A combination of allparameter values being optimized might result in a decision tree with a much higher accuracy. In addition,the step sizes of the values each parameter is tested on (e.g. from 0.1 to 0.9 with step size 9) could bedecreased. Using that approach, more values for that parameter are tested and an even more precise andoptimal value for that parameter may be found.

Finally, it was assumed that the optimal settings for each decision tree would result likewise in the highestaccuracy on the test set submitted to Kaggle. Hence only the test set predictions from trees with optimaltraining settings were submitted. From the Kaggle results, it could be concluded that REPTree yields thehighest accuracy for the Forest Cover Type classification problem. However, another (version of a) treemight have performed better on the Kaggle test set, even though it performed worse on the training set.Therefore, the conclusion of this research would have been more reliable when test set results were recordedfor each default and optimized decision tree trained on each (un)binned variant.

18

Bibliography

[1] Kaggle, http://www.kaggle.com/c/forest-cover-type-prediction

[2] Quinlan, J.R. ‘Simplifying decision trees’ International Journal of Man-Machine Studies. Volume 27,Issue 3, September 1987, P 221-234.

[3] Zhao, Y., Zang, Y. ‘Comparison of decision tree methods for finding active objects’. Advances in SpaceResearch. Volume 41, Issue 12, 2008, P 1955-1959.

[4] Langley, P. Selection of relevant features in machine learning. 1994.

[5] Lecture Evert Haasdijk about Cross Validation. February 2015

[6] Lecture Evert Haasdijk about Evaluating Hypotheses. March 2015

[7] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo,CA

[8] Kohavi, R. (1996). Scaling Up the Accuracy of Naive Bayes Classifier: a Decision-Tree Hybrid (KDD-96Proceedings)

[9] Nor Haizan, W., Mohamed, W., Salleh, M. N. M., & Omar, A. H. (2012). A Comparative Study ofReduced Error Pruning Method in Decision Tree Algorithms (IEEE International Conference on ControlSystem, Computing and Engineering)

[10] Weka Documentation, http://www.weka.wikispaces.com

[11] Lewicki, M. S. (2007, April 12). Articifial Intelligence: Decision Trees 2

[12] Kumar, T. S. (2004, April 18). Introduction to Data Mining

[13] Kaggle submissions - http://www.kaggle.com/users/301943/kayleigh-beard

19

Documents

forest-cover-type