
CHAPTER 5. PAPER 1:

FAST RULE-BASED CLASSIFICATION USING P-TREES

5.1. Abstract

Lazy classification does not require construction of a classifier. Decisions can, therefore, be based on the properties of the particular unknown sample rather than the set of all possible samples. We introduce a rule-based classifier that uses neighborhood membership rather than static intervalization to break up continuous domains. A distance function that makes neighborhood evaluation efficient for the P-tree data structure is chosen. We discuss information gain and statistical significance in a unified framework and observe that the commonly used formulation of information gain can be seen as an approximation to the exact entropy. A new algorithm for constructing and combining multiple classifiers is introduced and its improved performance demonstrated, especially for data sets with many attributes.

5.2. Introduction

Many data mining tasks involve predicting a discrete class label, such as whether a patient is sick, whether an e-mail message is spam, etc. Eager classification techniques construct a classifier from the training data before considering unknown samples. Eager algorithms include decision tree induction, which fundamentally views data as categorical and discretizes continuous data as part of the classifier construction [1]. Lazy classification predicts the class label attribute of an unknown sample based on the entire training data. Lazy versions of decision tree induction improve classification accuracy by using the information gain criterion based on the actual values of the unknown sample [2]. This modification reduces the fragmentation, replication, and small disjunct problems of standard decision trees [3]. Previous work has, however, always assumed that the data were discretized as a preprocessing step.

Many other lazy techniques use distance information explicitly through window or kernel functions that select and weight training points according to their distance from the unknown sample. Parzen window [4], kernel density estimate [5], Podium [6], and k-nearest neighbor algorithms can be seen as implementations of this concept. The idea of using similarity with respect to training data directly is not limited to lazy classification. Many eager techniques exist that improve on the accuracy of a simple kernel density estimate by minimizing an objective function that is a combination of the classification error for the training data and a term that quantifies complexity of the classifier. These techniques include support vector machines and regularization networks as well as many regression techniques [7]. Support vector machines often have the additional benefit of reducing the potentially large number of training points that have to be used in classifying unknown data to a small number of support vectors.

Kernel- and window-based techniques, at least in their original implementation, do not normally weight attributes according to their importance. For categorical attributes, this limitation is particularly problematic because it only allows two possible distances for each attribute. In this paper, we assume distance 0 if the values of categorical attributes match and 1 otherwise. Solutions that have been suggested include weighting attribute dimensions according to their information gain [8]; optimizing attribute weighting using genetic algorithms [9]; selecting important attributes in a wrapper approach [10]; and, in a more general kernel formulation, boosting as applied to heterogeneous kernels [11].

We combine the concept of a window that identifies similarity in continuous attributes with the lazy decision tree concept of [2] that can be seen as an efficient attribute selection scheme. A constant window function (step function as a kernel function) is used to determine the neighborhood in each dimension with attributes combined by a product of window functions. The window size is selected based on information gain. A special distance function, the HOBbit distance, allows efficient determination of neighbors for the P-tree data structure [12] that is used in this paper. A P-tree is a bit-wise, column-oriented data structure that is self-indexing and allows fast evaluation of tuple counts with given attribute value or interval combinations. For P-tree algorithms, database scans are replaced by multi-way AND operations on compressed bit sequences.

Section 5.3 discusses the algorithm, with Section 5.3.1 summarizing the P-tree data structure, Section 5.3.2 motivating the HOBbit distance, and Section 5.3.3 explaining how we map to binary attributes. Section 5.3.4 gives new insights related to information gain; Section 5.3.5 discusses our pruning technique; and Section 5.3.6 introduces a new way of combining multiple classifiers. Section 5.4 presents our experimental setup and results, with Section 5.4.1 describing the data sets, Sections 5.4.2-5.4.4 discussing the accuracy of different classifiers, and Section 5.4.5 demonstrating performance. Section 5.5 concludes the paper.

5.3. Algorithm

Our algorithm selects attributes successively using information gain based on the subset of points selected by all previously considered attributes. The procedure is very similar to decision tree induction, e.g., [1], and, in particular, to lazy decision tree induction [2], but has a few important differences. The main difference lies in our treatment of continuous attributes using a window or kernel function based on the HOBbit distance function. Since a window function can be seen as a lazily constructed interval, it is straightforward to integrate into the algorithm.

Eager decision tree induction, such as [1], considers the entire domain of categorical attributes to evaluate information gain. Lazy decision tree induction, including [2], in contrast, calculates information gain and significance based on virtual binary attributes that are defined through similarity with respect to a test point. Instead of constructing a full tree, one branch is explored in a lazy fashion; i.e., a new branch is constructed for each test sample. The branch is part of a different binary tree for each test sample. Note that the other branches are never of interest because they correspond to one or more attributes being different from the test sample. We will, therefore, use the term rule-based classification since the branch can more easily be seen as a rule that is evaluated.
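The following minimal sketch, in Java (the language used for our implementation), outlines this lazy rule construction for a single test sample. The helper methods infoGain, isSignificant, and inNeighborhood are hypothetical placeholders; the following subsections discuss how these quantities can be computed, and the splitting of the training data used for significance testing (Section 5.3.5) is omitted here.

```java
// Minimal sketch of lazy rule construction for one test sample.
// All helper names are illustrative placeholders, not the thesis implementation.
import java.util.*;

class LazyRuleSketch {
    /** Builds one rule (branch) for a single test sample and returns the
     *  majority class label among the training points it selects. */
    static int classify(int[][] train, int[] labels, int[] testSample,
                        double significanceLevel) {
        // Start with all training points selected.
        boolean[] selected = new boolean[train.length];
        Arrays.fill(selected, true);
        Set<Integer> usedAttributes = new HashSet<>();

        while (true) {
            int bestAttr = -1;
            double bestGain = 0.0;
            for (int a = 0; a < testSample.length; a++) {
                if (usedAttributes.contains(a)) continue;
                // Information gain of the virtual binary attribute
                // ("inside the HOBbit neighborhood of the test value or not"),
                // evaluated only on the currently selected points.
                double gain = infoGain(train, labels, selected, a, testSample[a]);
                if (gain > bestGain) { bestGain = gain; bestAttr = a; }
            }
            // Stop when no remaining attribute gives a significant split.
            if (bestAttr < 0 ||
                !isSignificant(train, labels, selected, bestAttr,
                               testSample[bestAttr], significanceLevel)) break;

            // Restrict the selection to the neighborhood of the test sample.
            for (int i = 0; i < train.length; i++)
                if (selected[i] && !inNeighborhood(train[i][bestAttr], testSample[bestAttr]))
                    selected[i] = false;
            usedAttributes.add(bestAttr);
        }
        return majorityLabel(labels, selected);
    }

    // Placeholders; later sections outline how information gain, significance,
    // and HOBbit neighborhoods can be evaluated.
    static double infoGain(int[][] t, int[] l, boolean[] s, int a, int v) { return 0.0; }
    static boolean isSignificant(int[][] t, int[] l, boolean[] s, int a, int v, double p) { return false; }
    static boolean inNeighborhood(int value, int testValue) { return value == testValue; }

    static int majorityLabel(int[] labels, boolean[] selected) {
        int c0 = 0, c1 = 0;
        for (int i = 0; i < labels.length; i++)
            if (selected[i]) { if (labels[i] == 0) c0++; else c1++; }
        return c1 > c0 ? 1 : 0;
    }
}
```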

5.3.1. P-trees

The P-tree data structure was originally developed for spatial data [13] but has been successfully applied in many contexts [9,14]. The key ideas behind P-trees are to store data column-wise in hierarchically compressed bit-sequences that allow efficient evaluation of bit-wise AND operations, based on an ordering of rows that benefits compression. The row-ordering aspect can be addressed most easily for data that show inherent continuity, such as spatial and multi-media data. For spatial data, for example, neighboring pixels will often represent similar color values. Traversing an image in a sequence that keeps neighboring points close will preserve the continuity when moving to the one-dimensional storage representation. An example of a suitable ordering is Peano, or recursive raster, ordering. If data show no natural continuity, sorting them according to generalized Peano order can lead to significant speed improvements. The idea of generalized Peano order sorting is to traverse the space of data mining-relevant attributes in Peano order while including only existing data points. Whereas spatial data do not require storing spatial coordinates, all attributes have to be represented if generalized Peano sorting is used, as is the case in this paper. Only integer data types are traversed in Peano order since the concept of proximity does not apply to categorical data. The relevance of categorical data for the ordering depends on the number of attributes needed to represent them. Figure 5.1 shows how two numerical attributes are traversed together with the P-tree that is constructed from the sequence of the highest order bit of x.

Figure 5.1. Peano order sorting and P-tree construction.
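As an illustration, generalized Peano (recursive raster) ordering of two attributes can be realized by interleaving their bits to form a sort key. The following sketch assumes two 8-bit attributes; the class and method names are illustrative only.

```java
// Sketch of generalized Peano (Z-order / recursive raster) sorting for two
// 8-bit attributes by bit interleaving; implementation details may differ.
import java.util.*;

class PeanoOrderSketch {
    /** Interleaves the bits of x and y (x in the higher position of each pair)
     *  so that points close in (x, y) tend to be close in the key. */
    static int zOrderKey(int x, int y) {
        int key = 0;
        for (int bit = 7; bit >= 0; bit--) {
            key = (key << 1) | ((x >> bit) & 1);
            key = (key << 1) | ((y >> bit) & 1);
        }
        return key;
    }

    public static void main(String[] args) {
        int[][] points = { {3, 200}, {3, 201}, {130, 7}, {2, 198} };
        // Sorting by the interleaved key keeps similar attribute values adjacent,
        // which benefits the compression of the resulting bit columns.
        Arrays.sort(points, Comparator.comparingInt(p -> zOrderKey(p[0], p[1])));
        for (int[] p : points) System.out.println(Arrays.toString(p));
    }
}
```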

P-trees are data structures that use hierarchical compression. The P-tree graph shows a tree with fan-out 2 in which 1 stands for nodes that represent only 1 values, 0 stands for nodes that represent only 0 values, and m ("mixed") stands for nodes that represent a combination of 0 and 1 values. Only "mixed" nodes have children. For other nodes, the data sequence can be reconstructed based on the purity information and node level alone. The P-tree example is kept simple for demonstration purposes. The implementation has a fan-out of 16 and uses an array rather than pointers to represent the tree structure. It, furthermore, stores the count of 1-bits for all nodes to improve ANDing speed. For data mining purposes, we are mainly interested in the total number of 1-bits for the entire tree, which we will call root count in the following sections.
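The following simplified sketch illustrates the idea of a P-tree with fan-out 2, purity information, and per-node counts of 1-bits; it is an illustration only, not the array-based, fan-out-16 implementation described above.

```java
// Simplified P-tree sketch: each node is pure-0, pure-1, or "mixed" with two
// children. The count of 1-bits is stored per node ("root count" at the root)
// and is used to cut AND operations short on pure nodes.
class PTreeSketch {
    final int length;              // number of bits represented by this node
    final int count;               // number of 1-bits below this node
    final PTreeSketch left, right; // children; null for pure nodes

    private PTreeSketch(int length, int count, PTreeSketch l, PTreeSketch r) {
        this.length = length; this.count = count; this.left = l; this.right = r;
    }

    /** Builds a P-tree over bits[from, to). */
    static PTreeSketch build(int[] bits, int from, int to) {
        int len = to - from, ones = 0;
        for (int i = from; i < to; i++) ones += bits[i];
        if (ones == 0 || ones == len)                 // pure node: no children needed
            return new PTreeSketch(len, ones, null, null);
        int mid = from + len / 2;
        PTreeSketch l = build(bits, from, mid), r = build(bits, mid, to);
        return new PTreeSketch(len, l.count + r.count, l, r);
    }

    /** Bit-wise AND of two P-trees built over sequences of the same length. */
    static PTreeSketch and(PTreeSketch a, PTreeSketch b) {
        if (a.count == 0 || b.count == 0)             // either side pure 0
            return new PTreeSketch(a.length, 0, null, null);
        if (a.count == a.length) return b;            // a is pure 1: result is b
        if (b.count == b.length) return a;            // b is pure 1: result is a
        PTreeSketch l = and(a.left, b.left), r = and(a.right, b.right);
        return new PTreeSketch(a.length, l.count + r.count, l, r);
    }
}
```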

5.3.2. HOBbit Distance

The nature of a P-tree-based data representation with its bit-column structure has a strong impact on the kinds of algorithms that will be efficient. P-trees allow easy evaluation of the number of data points in intervals that can be represented by one bit pattern. For 8-bit numbers, the interval [0,128) can easily be specified by requiring the highest-order bit to be 0 and not putting any constraints on the lower-order bits. The number of points in this interval is the root count of the complement of the P-tree corresponding to the highest-order bit. The interval [128,192) can be defined by requiring the highest-order bit to be 1 and the next bit to be 0, and not putting constraints on any other bits. The number of points within this interval is evaluated by computing an AND of the basic P-tree corresponding to the highest-order bit and the complement of the P-tree of the next highest-order bit. Arbitrary intervals can be defined using P-trees but may require the evaluation of several ANDs.
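Using the hypothetical PTreeSketch class sketched in the previous subsection, the count of 8-bit values in the interval [128,192) could, for example, be obtained roughly as follows.

```java
// Rough illustration: count 8-bit values in [128, 192) by ANDing the P-tree of
// bit 7 (must be 1) with the complement of the P-tree of bit 6 (must be 0) and
// reading the root count. Uses the hypothetical PTreeSketch class from above.
class IntervalCountSketch {
    static int countIn128To192(int[] values) {
        int n = values.length;
        int[] bit7 = new int[n], notBit6 = new int[n];
        for (int i = 0; i < n; i++) {
            bit7[i]    = (values[i] >> 7) & 1;         // highest-order bit
            notBit6[i] = 1 - ((values[i] >> 6) & 1);   // complement of the next bit
        }
        PTreeSketch p7  = PTreeSketch.build(bit7, 0, n);
        PTreeSketch p6c = PTreeSketch.build(notBit6, 0, n);
        return PTreeSketch.and(p7, p6c).count;         // root count = points in [128,192)
    }
}
```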

This observation led to the definition of the HOBbit distance for numerical data. Note that, for categorical data, only two distances are possible, 0 if the attribute values are the same and 1 if they are different, so all bits can be treated as equivalent.

The HOBbit distance is defined as

$$
d_{HOBbit}(a_s, a_t) =
\begin{cases}
0 & \text{for } a_s = a_t \\[4pt]
\displaystyle \max_{j=0}^{\infty} \left( j + 1 \;\middle|\; \left\lfloor \frac{a_s}{2^j} \right\rfloor \neq \left\lfloor \frac{a_t}{2^j} \right\rfloor \right) & \text{for } a_s \neq a_t
\end{cases}
\qquad (1)
$$

where $a_s$ and $a_t$ are attribute values, and $\lfloor \cdot \rfloor$ denotes the floor function. The HOBbit distance is, thereby, the number of bits by which two values have to be right-shifted to make them equal. The numbers 32 (100000) and 37 (100101) have HOBbit distance 3 because only the first three digits are equal, and the numbers, consequently, have to be right-shifted by 3 bits.
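A direct translation of definition (1) into code makes this shifting interpretation explicit; the class name is illustrative.

```java
// Direct translation of definition (1): the HOBbit distance is the number of
// bit positions by which two non-negative integers must be right-shifted
// before they become equal (0 if they are already equal).
class HOBbitSketch {
    static int hobbitDistance(int as, int at) {
        int d = 0;
        while (as != at) {   // each shift discards one low-order bit
            as >>= 1;
            at >>= 1;
            d++;
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(hobbitDistance(32, 37)); // prints 3
    }
}
```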

5.3.3. Virtual Binary Attributes

For the purpose of the calculation of information gain, all attributes are treated as binary. Any attribute can be mapped to a binary attribute using the following prescription. For a given test point and a given attribute, a virtual attribute is assumed to be constructed, which we will call a. Virtual attribute a is 1 for the attribute value of the test sample and 0 otherwise. For numerical attributes, a is 1 for any value within the HOBbit neighborhood of interest and 0 otherwise. Consider, as an example, a categorical attribute that represents color and a sample point that has attribute value "red." Attribute a would (virtually) be constructed to be 1 for all "red" items and 0 for all others, independently of how large the domain is. For a 6-bit integer attribute, a HOBbit neighborhood of dHOBbit = 4, and a sample value of 37, binary attribute a will be imagined as being 1 for values in the interval [32,48) and 0 for values outside this interval. The class label attribute is a binary attribute for all data sets that we discuss in this paper and therefore does not require special treatment.
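For a numerical attribute, the membership test of the virtual attribute amounts to a prefix comparison after discarding the lowest-order bits. The sketch below uses the example values from the text (test value 37, dHOBbit = 4, interval [32,48)); the class name is illustrative.

```java
// Sketch of the virtual binary attribute for a numerical attribute: a value v
// lies in the HOBbit neighborhood of test value t (distance <= d) exactly when
// v and t agree after discarding the d lowest-order bits, i.e. when v falls
// into the interval [ (t >> d) << d , ((t >> d) + 1) << d ).
class VirtualAttributeSketch {
    static boolean inHOBbitNeighborhood(int v, int t, int d) {
        return (v >> d) == (t >> d);
    }

    public static void main(String[] args) {
        // For t = 37 and d = 4 the neighborhood is the interval [32, 48).
        System.out.println(inHOBbitNeighborhood(47, 37, 4)); // true
        System.out.println(inHOBbitNeighborhood(48, 37, 4)); // false
    }
}
```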

5.3.4. Information Gain

We can now formulate information and information gain in terms of imagined attribute a as well as the class label. The information concept was introduced by Shannon in analogy to entropy in physics [15]. The commonly used definition of the information of a binary system is

$$
Info(p_0, p_1) = -p_0 \log p_0 - p_1 \log p_1 , \qquad (2)
$$

where p0 is the probability of one state, such as class label 0, and p1 is the probability of the other state, such as class label 1. The information of the system before attribute a has been used can be seen as the expected information still necessary to classify a sample. Splitting the set into those tuples that have a = 0 and those that have a = 1 may give us useful information for classification and, thereby, reduce the remaining information necessary for classification. The following difference is commonly called the information gain of attribute a

$$
InfoGain(a) = Info(t, t_0, t_1) - \frac{x}{t}\, Info(x, x_0, x_1) - \frac{y}{t}\, Info(y, y_0, y_1) \qquad (3)
$$

with x (y) being the number of training points for which attribute a has value 0 (1) and t = x + y. x0 refers to the number of points for which attribute a and the class label are 0; y0 refers to the number of points for which attribute a is 1 and the class label is 0; etc. Correspondingly, t0 = x0 + y0 and t1 = x1 + y1 are the total numbers of points with class label 0 and 1, and Info(x, x0, x1) is shorthand for Info(x0/x, x1/x), the information computed from the class frequencies within the respective subset.
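The following sketch computes Info (2) and InfoGain (3) directly from the counts of the 2x2 contingency table; the names and the choice of the base-2 logarithm are illustrative.

```java
// Direct computation of Info (2) and InfoGain (3) from the counts of the 2x2
// contingency table. Base-2 logarithms are used; any fixed base only changes
// the result by a constant factor.
class InfoGainSketch {
    static double info(double n, double n0, double n1) {
        return entropyTerm(n0 / n) + entropyTerm(n1 / n);
    }
    private static double entropyTerm(double p) {
        return p == 0.0 ? 0.0 : -p * Math.log(p) / Math.log(2);
    }

    /** x0, x1: class counts where attribute a = 0; y0, y1: class counts where a = 1. */
    static double infoGain(double x0, double x1, double y0, double y1) {
        double x = x0 + x1, y = y0 + y1, t = x + y;
        return info(t, x0 + y0, x1 + y1)
             - (x / t) * info(x, x0, x1)
             - (y / t) * info(y, y0, y1);
    }

    public static void main(String[] args) {
        // Counts as in Figure 5.2 with y1 = 25: x = 250, y = 50, x1 = 50, y1 = 25.
        System.out.println(infoGain(200, 50, 25, 25));
    }
}
```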

Appendix B shows how this definition of information gain can be derived from the probability associated with a 2x2 contingency table. It is noted that, in this derivation, an approximation has to be made, namely Stirling's approximation of a factorial ($\log n! \approx n \log n - n$), which is only valid for large numbers. An exact expression for the information is derived as

$$
ExactInfo(x, x_0, x_1) = \frac{1}{x} \left( \log x! - \log x_0! - \log x_1! \right) \qquad (4)
$$

The derivation suggests that the natural logarithm be used in (4). Using the binary logarithm leads to a result that differs by a constant scaling factor that can be ignored. The corresponding definition of information gain is

$$
ExactInfoGain(a) = ExactInfo(t, t_0, t_1) - \frac{x}{t}\, ExactInfo(x, x_0, x_1) - \frac{y}{t}\, ExactInfo(y, y_0, y_1) \qquad (5)
$$

Figure 5.2 compares InfoGain (3) and ExactInfoGain (5) for a system with x = 250, y = 50, x1 = 50, and y1 varying from 0 to 50. Note that the expression for information (4) strictly follows from the equivalence between information and entropy that Shannon noted [15]. In a physical system, entropy is defined as

$$
S = k \log W , \qquad (6)
$$

where k is Boltzmann's constant and W is the number of microstates of the system, leading to an expression that is proportional to (4) for a binary system. Since Stirling's approximation is valid for most physical systems and log n! could not easily be evaluated for the 10^23 particles that are typical of physical systems, (2) is more commonly quoted than the exact expression. Note that, in decision tree induction, where information gain is used in branches with very few samples, numbers are not necessarily as large as they are in physical systems. It could, therefore, be practical from a computational perspective to use the exact definition. The logarithm of a factorial is, in fact, commonly used for significance calculations in decision tree induction and is, thereby, readily available. The difference in performance between both definitions is clearly noticeable for our algorithm, but unfortunately, we will see that the approximate definition performs better under most circumstances. This poor performance may, however, be due to the setup of our algorithm.
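For comparison with the InfoGain sketch above, the following is a sketch of ExactInfo (4) and ExactInfoGain (5) using exact log-factorials; the simple loop for log n! is illustrative and could be replaced by a log-gamma routine for large counts.

```java
// Sketch of ExactInfo (4) and ExactInfoGain (5) using exact log-factorials.
// Natural logarithms are used, as the derivation suggests; the choice of base
// only rescales the result by a constant factor.
class ExactInfoGainSketch {
    static double logFactorial(long n) {
        double s = 0.0;
        for (long i = 2; i <= n; i++) s += Math.log(i);
        return s;
    }

    static double exactInfo(long n, long n0, long n1) {
        return (logFactorial(n) - logFactorial(n0) - logFactorial(n1)) / n;
    }

    /** Same count conventions as for InfoGain: x0, x1 where a = 0; y0, y1 where a = 1. */
    static double exactInfoGain(long x0, long x1, long y0, long y1) {
        long x = x0 + x1, y = y0 + y1, t = x + y;
        return exactInfo(t, x0 + y0, x1 + y1)
             - ((double) x / t) * exactInfo(x, x0, x1)
             - ((double) y / t) * exactInfo(y, y0, y1);
    }
}
```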

Figure 5.2. InfoGain and ExactInfoGain for x = 250, y = 50, and x1 = 50.

One noteworthy difference between the two formulations of information gain is that (3) vanishes for $x_1/x = y_1/y = t_1/t$, whereas (5) does not, except in the limit of large numbers. In the exact formulation, information is, therefore, gained for any split, which is correct: If a set contains two data points with class label 0 and two with class label 1, then there are $\binom{4}{2} = 6$ ways in which these class labels could be distributed among the four data points. Assume that attribute a is 0 for one data point of each class label. The number of configurations after the split will then only be $\binom{2}{1}\binom{2}{1} = 4$. We have gained information about the precise configuration of the training points. Whether the exact formulation generally helps reduce the generalization error, which is the quantity of interest in classification of an unknown sample, is yet to be determined.

It is interesting to analyze systematically in what way ExactInfoGain differs from InfoGain. Figure 5.3 shows the difference ExactInfoGain(x, y, x1, y1) - InfoGain(x, y, x1, y1) for x = 250, y = 50, x1 = r·y1 with y1 varying from 1 to 50, and different values of r.

Figure 5.3. Difference between ExactInfoGain and InfoGain.

It can clearly be seen that ExactInfoGain favors attributes that split the training set evenly. Note that the difference depends very little on the absolute value of the information gain: r = 5 corresponds to a split for which InfoGain = 0 independently of y1, while r = 1 corresponds to a split with high information gain (both InfoGain and ExactInfoGain are high). The difference between the two formulations nevertheless varies by less than 10% over the entire range.

It should finally be noted that Shannon [15] strictly introduced the information Info based on its mathematical properties, i.e., as a function that satisfies desirable conditions, and only noted in passing that it also was the well-known formulation of physical entropy.

5.3.5. Pruning

It is well known that decision trees have to be pruned to work successfully [1]. Information gain alone is, therefore, not a sufficient criterion to decide which attributes to use for classification. We use statistical significance as a stopping criterion, which is similar to decision tree algorithms that prune during tree construction. The calculation of significance is closely related to that of information gain. Appendix B shows that not only significance, but also information gain can be derived from contingency tables. Information gain is a function of the probability of a particular split under the assumption that a is unrelated to the class label, whereas significance is commonly derived as the probability that the observed split or any more extreme split would be encountered; i.e., it represents the p-value. This derivation explains why information gain is commonly used to determine the best among different possible splits, whereas significance determines whether a particular split should be done at all.

In our algorithm, significance is calculated on different data than information gain to get a statistically sound estimate. The training set is split into two parts, with two-thirds of the data being used to determine information gain and one-third to test significance through Fisher's exact test, where the two-sided test was approximated by twice the value of the one-sided test [16]. A two-sided version was also implemented, but test runs showed that it was slower and did not lead to higher accuracy. An attribute is considered relevant only if it leads to a split that is significant, e.g., at the 1% level. The full set is then taken to determine the predicted class label through majority vote.
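The following is a sketch of the one-sided Fisher exact test on the 2x2 table, with the two-sided p-value approximated as twice the one-sided value as described above. It is an illustrative implementation, not the code used in our experiments, and it assumes the tail is summed in one fixed direction (increasing x0), i.e., the direction of the observed association.

```java
// Sketch of the Fisher exact test used as the stopping criterion, based on the
// hypergeometric distribution of the 2x2 table (x0, x1; y0, y1) of Section 5.3.4.
class FisherSketch {
    static double logFactorial(long n) {
        double s = 0.0;
        for (long i = 2; i <= n; i++) s += Math.log(i);
        return s;
    }

    /** Log of the hypergeometric probability of the table with fixed margins. */
    static double logHyperProb(long x0, long x1, long y0, long y1) {
        return logFactorial(x0 + x1) + logFactorial(y0 + y1)
             + logFactorial(x0 + y0) + logFactorial(x1 + y1)
             - logFactorial(x0) - logFactorial(x1) - logFactorial(y0)
             - logFactorial(y1) - logFactorial(x0 + x1 + y0 + y1);
    }

    /** One-sided p-value: the observed table or any more extreme table
     *  (larger x0), keeping all margins fixed. */
    static double oneSidedP(long x0, long x1, long y0, long y1) {
        double p = 0.0;
        while (x1 >= 0 && y0 >= 0) {
            p += Math.exp(logHyperProb(x0, x1, y0, y1));
            x0++; x1--; y0--; y1++;   // shift one point toward cell x0
        }
        return p;
    }

    static boolean isSignificant(long x0, long x1, long y0, long y1, double level) {
        return 2 * oneSidedP(x0, x1, y0, y1) <= level;  // two-sided ~ twice one-sided
    }
}
```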

5.3.6. Pursuing Multiple Paths

The number of attributes that can be considered in a decision-tree-like setting while maintaining a particular level of significance is limited. This limitation is a consequence of the "curse of dimensionality" [5]. To get better statistics, it may, therefore, be useful to construct different classifiers and combine them. A similar approach is taken by bagging algorithms [17]. We use a very simple alternative in which several branches are pursued, each starting with a different attribute. The attributes with the highest information gain are picked as starting attributes, and branches are constructed in the standard way from thereon. Note that the number of available attributes places a bound on the number of paths that can be pursued. The votes of all branches are combined into one final vote. This modification leads to a particularly high improvement for data sets with many attributes. For such data sets, each rule only contains a small subset of the attributes. Deriving several rules, each starting with a different attribute, gives a vote to some attributes that would not otherwise have one.
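A sketch of this multiple-path scheme: the k attributes with the highest initial information gain each start one rule, and the resulting votes are combined by a simple majority (one plausible reading of the combination step). buildRuleStartingWith is a hypothetical placeholder for the single-rule construction outlined in Section 5.3.

```java
// Sketch of the multiple-path scheme; names and the majority combination rule
// are illustrative, not the thesis implementation.
import java.util.*;

class MultiPathSketch {
    static int classifyWithPaths(int[] testSample, int k, double[] initialGain) {
        // Rank attributes by their information gain on the full selection.
        Integer[] attrs = new Integer[initialGain.length];
        for (int i = 0; i < attrs.length; i++) attrs[i] = i;
        Arrays.sort(attrs, (a, b) -> Double.compare(initialGain[b], initialGain[a]));

        int paths = Math.min(k, attrs.length);   // cannot pursue more paths than attributes
        int votesFor1 = 0;
        for (int p = 0; p < paths; p++)
            if (buildRuleStartingWith(attrs[p], testSample) == 1) votesFor1++;
        return votesFor1 * 2 > paths ? 1 : 0;    // simple majority of the path votes
    }

    // Placeholder for one lazy rule that is forced to use the given attribute first.
    static int buildRuleStartingWith(int firstAttribute, int[] testSample) { return 0; }
}
```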

5.4. Implementation and Results

We implemented all algorithms in Java and evaluated them on 7 data sets. Data sets were selected to have at least 3000 data points and a binary class label. Two-thirds of the data were taken as a training set and one-third as a test set. Due to the consistently large size of data sets, cross-validation was considered unnecessary. All experiments were done using the same parameter values for all data sets.

5.4.1. Data Sets

Five of the data sets were obtained from the UCI machine learning library [18], where full documentation on the data sets is available. These data sets include the following:

· adult data set: Census data are used to predict whether income is greater than $50,000.

· spam data set: Word and letter frequencies are used to classify e-mail as spam.

· sick-euthyroid data set: Medical data are used to predict sickness from thyroid disease.

· kr-vs.-kp (king-rook-vs.-king-pawn) data set: Configurations on a chess board are used to predict if "white can win."

· mushroom data set: Physical characteristics are used to classify mushrooms as edible or poisonous.

Two additional data sets were used. A gene-function data set was generated from yeast data available at the web site of the Munich Information Center for Protein Sequences [19]. This site contains hierarchical categorical information on gene function, localization, protein class, complexes, pathways, and phenotypes. One function was picked as a class label, "metabolism." The highest level of the hierarchical information of all properties except function was used to predict whether the protein was involved in "metabolism." Since proteins can have multiple localizations and other properties, each domain value was taken as a Boolean attribute that was 1 if the protein was known to have the respective property and 0 otherwise.

A second data set was generated from spatial data. The RGB colors in the photograph of a cornfield were used to predict the yield of the field. The data corresponded to the top half of the data set available at [20]. The class label is the highest-order bit of the 8-bit yield information; i.e., the class label is 1 if the yield is 128 or higher for a given pixel. Table 5.1 summarizes the properties of the data sets.

No preprocessing of the data was done. Some attributes, however, were identified as being logarithmic in nature, and the logarithm was encoded in P-trees. The following attributes were chosen as logarithmic: "capital-gain" and "capital-loss" of the adult data set, and all attributes of the "spam" data set.

Table 5.1. Properties of data sets.

Data set          Attributes (without class label)    Instances            Unknown   Prob. of minority
                  total    categorical   numerical    training     test    values    class label (%)
adult               14         8             6           30162    15060      no           24.57
spam                57         0            57            3067     1534      no           37.68
sick-euthyroid      25        19             6            2108     1055      yes           9.29
kr-vs-kp            36        36             0            2130     1066      no           47.75
mushroom            22        22             0            5416     2708      yes          47.05
gene-function      146       146             0            4312     2156      no           17.86
crop                 3         0             3          784080    87120      no           24.69

5.4.2. Results

Table 5.2 compares the results for our rule-based algorithm with the reported results for C4.5. It can be seen that the standard rule-based implementation tends to give slightly poorer results, whereas using the collective vote from 20 runs that use different attributes as first attributes leads to results that are at least comparable to C4.5. The performance on the spam data set is comparable with the reported result of approximately 7%. A comparison with C4.5 for all data sets is still to be done.

Table 5.2. Results of rule-based classification (error rates in percent).

Data set          C4.5             Rule-based   (+/-)       ExactInfoGain   20 paths
adult             15.54 [UCI]      15.976       0.325703    15.93           14.934
spam              -                11.536       0.867191    12.125          7.11
sick-euthyroid    -                2.849        0.519661    3.318           2.938
kr-vs-kp          0.8 [lazy BR]    1.689        0.398049    1.689           0.844
mushroom          0 [lazy DTI]     0            0           0               0
gene-function     -                15.538       0.848933    15.816          15.492
crop              -                18.796       0.156117    20.925          19.034

5.4.3. Rule-based Results Compared with 20 Paths

For the rule-based algorithm a significance level of 1% was chosen. Information gain was calculated on two-thirds of the training data with significance being evaluated on the rest. The final vote was based on the entire training set. Accuracy was generally higher when a maximum of 20 paths were pursued. In this setting, accuracy improved when information gain and significance were evaluated on the full data set with the significance level set to 0.1%.

It can be seen in Figure 5.4 that the difference is particularly large for the spam data set, which has many continuous attributes, a setting in which the algorithm is expected to work particularly well. For the crop data set, only three paths could be pursued, corresponding to the number of attributes.

Figure 5.4. Difference between a vote based on 20 paths and a vote based on 1 path in the rule-based algorithm in units of the standard error.

5.4.4. Exact Information Gain

The use of the exact formula for information gain proved to be disappointing: as Figure 5.5 shows, results tend to be poorer for the exact information gain formula (5). A reason for this poorer performance may be that our lazy algorithm does not normally benefit from an attribute that splits the data set into similarly large halves. It would, instead, benefit from a split that keeps the selected subset of the data large. A standard decision tree, on the other hand, would benefit from such an even split because it keeps the tree balanced. It would, therefore, be necessary to evaluate the exact formula within a standard decision tree algorithm to get a final answer as to its usefulness.

Figure 5.5. Difference between standard and exact information gain in the rule-based algorithm in units of the standard error.

5.4.5. Performance

Standard decision tree algorithms, whether they are eager or lazy, are based on database scans that scale linearly with the number of data points. This linear scaling is a serious problem for data mining tasks that deal with thousands or millions of data points. A main benefit of using P-trees lies in the fact that the AND operation that replaces database scans benefits from compression at every level. As a consequence, we see significantly sub-linear scaling. Figure 5.6 shows the execution time for rule-based classification as a function of the number of data points for the adult data set.

Figure 5.6. Scaling of execution time as a function of training set size.

5.5. Conclusions

We have introduced a lazy rule-based classifier that uses distance information within continuous attributes consistently by considering neighborhoods with respect to the unknown sample. Performance is achieved by using the P-tree data structure, which allows efficient evaluation of counts of training points with particular properties. Neighborhoods are defined using the HOBbit distance, which is particularly suited to the bit-wise nature of the P-tree representation. We derive information gain from a 2x2 contingency table and analyze the approximation that has to be made in the process. We, furthermore, show that the accuracy of our classifier can be improved by constructing multiple rules. The attributes with the highest information gain are determined, and each rule is required to use one of them. This leads to accuracies that are comparable to or better than the results of C4.5. Finally, we show that our algorithm scales sub-linearly, i.e., with less than O(N) cost, in the number of training points.

5.6. References

[1] J. R. Quinlan, "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers, San Mateo, CA, 1993.

[2] J. Friedman, R. Kohavi, and Y. Yun, "Lazy Decision Trees," 13th National Conf. on Artificial Intelligence and the 8th Innovative Applications of Artificial Intelligence Conference, 1996.

[3] Z. Zheng, G. Webb, and K. M. Ting, "Lazy Bayesian Rules: A Lazy Semi-naive Bayesian Learning Technique Competitive to Boosting Decision Trees," 16th Int. Conf. on Machine Learning, pp. 493-502, 1999.

[4] R. Duda, P. Hart, and D. Stork, "Pattern Classification," 2nd edition, John Wiley and Sons, New York, 2001.

[5] D. Hand, H. Mannila, and P. Smyth, "Principles of Data Mining," The MIT Press, Cambridge, MA, 2001.

[6] W. Perrizo, Q. Ding, A. Denton, K. Scott, Q. Ding, and M. Khan, "PINE-Podium Incremental Neighbor Evaluator for Spatial Data Using P-trees," Symposium on Applied Computing (SAC'03), Melbourne, FL, 2003.

[7] N. Cristianini and J. Shawe-Taylor, "An Introduction To Support Vector Machines And Other Kernel-Based Learning Methods," Cambridge University Press, Cambridge, England, 2000.

[8] S. Cost and S. Salzberg, "A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features," Machine Learning, Vol. 10, pp. 57-78, 1993.

[9] A. Perera, A. Denton, P. Kotala, W. Jackheck, W. Valdivia Granda, and W. Perrizo, "P-tree Classification of Yeast Gene Deletion Data," SIGKDD Explorations, Dec. 2002.

[10] R. Kohavi and G. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, Vol. 97, No. 1-2, pp. 273-324, 1997.

[11] K. Bennett, M. Momma, and M. Embrechts, "A Boosting Algorithm for Heterogeneous Kernel Models," SIGKDD '02, Edmonton, Canada, 2002.

[12] M. Khan, Q. Ding, and W. Perrizo, "K-nearest Neighbor Classification of Spatial Data Streams Using P-trees," Pacific Asian Conf. on Knowledge Discovery and Data Mining (PAKDD-2002), Taipei, Taiwan, May 2002.

[13] Q. Ding, W. Perrizo, and Q. Ding, “On Mining Satellite and Other Remotely Sensed Images,” Workshop on Data Mining and Knowledge Discovery (DMKD-2001), Santa Barbara, CA, pp. 33-40, 2001.

[14] W. Perrizo, W. Jockheck, A. Perera, D. Ren, W. Wu, and Y. Zhang, "Multimedia Data Mining Using P-trees," Multimedia Data Mining Workshop, Conf. for Knowledge Discovery and Data Mining, Edmonton, Canada, Sept. 2002.

[15] C. Shannon, "A Mathematical Theory of Communication," Bell Systems Technical Journal, Vol. 27, pp. 379-423 and 623-656, Jul. and Oct. 1948.

[16] W. Ledermann, "Handbook of Applicable Mathematics," Vol. 6, Wiley, Chichester, England, 1980.

[17] L. Breiman, "Bagging Predictors," Machine Learning, Vol. 24, No. 2, pp. 123-140, 1996.

[18] C. L. Blake and C. J. Merz, "UCI Repository of Machine Learning Databases," Irvine, CA, 1998, http://www.ics.uci.edu/~mlearn/MLSummary.html, accessed Oct. 2002.

[19] Munich Information Center for Protein Sequences, "Comprehensive Yeast Genome Database," http://mips.gsf.de/genre/proj/yeast/index.jsp, accessed Jul. 2002.

[20] W. Perrizo, "Satellite Image Data Repository," Fargo, ND, http://midas-10.cs.ndsu.nodak.edu/data/images/data_set_94/, accessed Dec. 2002.

