
Linear tree


João Gama*, Pavel Brazdil
LIACC-FEP, University of Porto, Rua Campo Alegre 823, 4150 Porto, Portugal

Received 19 June 1998; received in revised form 14 September 1998; accepted 22 October 1998

Abstract

In this paper we present the system Ltree for propositional supervised learning. Ltree is able to define decision surfaces both orthogonal and oblique to the axes defined by the attributes of the input space. This is done by combining a decision tree with a linear discriminant by means of constructive induction. At each decision node Ltree defines a new instance space by inserting new attributes that are the projections of the examples that fall at this node over the hyper-planes given by a linear discriminant function. This new instance space is propagated down through the tree. Tests based on those new attributes are oblique with respect to the original input space. Ltree is a probabilistic tree in the sense that it outputs a class probability distribution for each query example. The class probability distribution is computed at learning time, taking into account the different class distributions on the path from the root to the actual node. We have carried out experiments on twenty-one benchmark datasets and compared our system with other well-known decision tree systems (orthogonal and oblique) such as C4.5, OC1, LMDT, and CART. On these datasets we have observed that our system has advantages in terms of accuracy and learning times at statistically significant confidence levels. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Multivariate decision trees; Constructive induction; Machine learning

1. Introduction

In Machine Learning most research is related to building simple, small, and accurate models for a set of data. For propositional problems of supervised learning a large number of systems are now available. In this paper we focus on a special kind of system, widely used in the Machine Learning community: decision trees. We present a new multivariate tree, Ltree, that combines a decision tree with a discriminant function by means of constructive induction. In the following subsections we briefly describe these three issues.

1.1. Decision trees

A decision tree uses a divide-and-conquer strategy that attacks a complex problem by dividing it into simpler problems and recursively applying the same strategy to the sub-problems. The solutions of the sub-problems can then be combined to yield a solution of the complex problem.


This is the basic idea behind well-known decision tree algorithms: ID3 [20], ASSISTANT [6], CART [2], C4.5 [22], etc. The power of this approach comes from the ability to split the hyperspace into subspaces, with each subspace fitted with a different model. The main drawback of this approach is its instability with respect to small variations of the training set [11]. The hypothesis space of ID3 and its descendants is within the DNF formalism. Classifiers generated by those systems can be represented as a disjunction of rules. Each rule has a conditional part and a conclusion part. The conditional part consists of a conjunction of conditions based on attribute values. Conditions are tests on one of the attributes and one of the values of its domain. These kinds of tests correspond, in the input space, to a hyper-plane that is orthogonal to the axis of the tested attribute and parallel to all other axes. The regions produced by these classifiers are all hyper-rectangles.

1.2. Discriminant functions

A linear discriminant function is based on a linear combination of the attributes which maximizes the ratio of its between-group variance to its within-group variance. To minimize the number of misclassifications, the Bayes criterion assigns an object x to the most probable class C_i, that is,

$\arg\max_i P(C_i)\,P(x \mid C_i)$,   (1)

where P(C_i) is the prior probability of observing the class C_i, and P(x|C_i) is the conditional probability of observing the example x given that the example belongs to class C_i. Although this rule is optimal, its applicability is reduced due to the large number of conditional probabilities that need to be estimated. To overcome this problem several assumptions are usually made. It is usually assumed that the attribute vectors are independent and that each class follows a certain probability distribution f_i. For continuous variables, the normal distribution is usually assumed:

$f_i(x) = \frac{1}{\sqrt{|2\pi\Sigma|}}\,\exp\!\left(-\tfrac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)\right)$.   (2)

This leads to the linear discriminant function. This approach often works well even if multivariate normality is not satisfied [23].
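As a concrete illustration of Eqs. (1) and (2), the sketch below (a minimal example with made-up data, not the authors' code; NumPy is assumed) classifies a point by plugging Gaussian class-conditional densities with a shared covariance matrix into the Bayes rule.

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """Estimate class priors, class means and a pooled (shared) covariance matrix."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    n, d = X.shape
    pooled = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c] - means[c]
        pooled += Xc.T @ Xc
    pooled /= (n - len(classes))
    return classes, priors, means, pooled

def predict(x, classes, priors, means, pooled):
    """Eq. (1): argmax_i P(C_i) * f_i(x), with f_i the normal density of Eq. (2)."""
    inv = np.linalg.inv(pooled)
    _, logdet = np.linalg.slogdet(2 * np.pi * pooled)    # log |2 pi Sigma|
    best, best_score = None, -np.inf
    for c in classes:
        diff = x - means[c]
        log_density = -0.5 * logdet - 0.5 * diff @ inv @ diff
        score = np.log(priors[c]) + log_density          # log P(C_i) + log f_i(x)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy usage: two Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(predict(np.array([2.5, 2.5]), *fit_gaussian_bayes(X, y)))
```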

A discriminant function is a one-step procedure, since it does not make a recursive partitioning of the input space. The advantage of these kinds of systems is their ability to generate decision surfaces with arbitrary slopes. It is a parametric approach: it is assumed that the classes can be separated by hyper-planes, and the problem is to determine the coefficients of the hyper-plane. Although parametric approaches are often viewed as an arbitrary imposition of model assumptions on the data, discriminant functions tend to be stable with respect to small variations of the training set due to the small number of free parameters [1].

1.3. Constructive induction

Learning can be a hard task if the attribute space is inappropriate for describing the target concept. To overcome this problem some researchers have proposed the use of constructive learning. Constructive induction [14] defines new features from the training set and transforms the original instance space into a new, higher dimensional space by applying attribute constructor operators. The difficulty is how to choose the appropriate operators for the problem in question.


In this paper we argue that, in domains described at least partially by numerical features, discriminant analysis is a useful tool for constructive induction, which is in line with previous work presented in Ref. [27].

The system presented in this paper, Ltree, represents the confluence of these three areas. It explores the power of the divide-and-conquer methodology from decision trees and the ability to generate hyper-planes from the linear discriminant. It integrates both using constructive induction.

In the next section we describe the motivation behind Ltree and give an illustrative example using the Iris dataset. Section 3 presents in detail the process of tree building and pruning. Section 4 presents related work in the area of oblique decision trees. In Section 5 we perform a comparative study between our system and other oblique trees on two artificial datasets and twenty-one benchmark datasets from the StatLog and UCI repositories. The last section presents the conclusions of the paper.

2. An illustrative example

2.1. Motivation

Consider an artificial two-class problem defined by two numerical attributes, which is shown in Fig. 1.

Running C4.5, we obtain a decision tree with 65 nodes. Obviously this tree is much more complex than expected. By analyzing the paths in the tree we systematically find tests on the same attribute. One such path, from the root to a leaf, is:

IF at2 ≤ 0.398 AND at1 ≤ 0.281 AND at2 > 0.108 AND at1 > 0.184 AND at2 ≤ 0.267 AND at2 > 0.189 AND at1 ≤ 0.237 AND at2 ≤ 0.218 AND at1 > 0.2 THEN CLASS+

This means that the tree is approximating an oblique region by means of a staircase-like structure.

Fig. 1. The original instance space.


Running a linear discriminant procedure, we get one discriminant:

H = 0.0 + 11.0 * at1 − 11.0 * at2.

Line H is the hyper-plane generated by the linear discriminant. Projecting the examples over this hyper-plane generates a new attribute (at3). Fig. 2 shows the new instance space. It illustrates how two points (P+ and P−) of different classes are projected into the new space. In the new instance space a hyper-rectangle orthogonal to the new axis (at3) is able to split the classes using only one test. Running C4.5 on this new dataset, we obtain the following tree:

IF at3 > 0 THEN CLASS+
IF at3 ≤ 0 THEN CLASS−.

This is the smallest possible tree that discriminates the classes. The test is orthogonal to the axis defined by attribute at3 and parallel to the axes defined by at2 and at1. Rewriting the rule "If at3 > 0 THEN CLASS+" in terms of the original space (that is, in terms of at1 and at2) we get: "If 11 * at1 > 11 * at2 − 0 THEN CLASS+". This rule defines a relation between attributes and is thus oblique to the at1 and at2 axes.

This simple example illustrates one fundamental point: constructive induction based on combinations of attributes extends the representational power and overcomes limitations of the language bias of decision tree learning algorithms.

We have implemented a two-step algorithm that explores this idea in a pre-processing step. Given a dataset, a linear discriminant builds the hyper-planes. Projecting all the examples over the hyper-planes creates new attributes. The transformed dataset is then passed to C4.5. We refer to this algorithm as C45Oblique. It was also used in our experimental study, with quite good results.
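A minimal sketch of this pre-processing idea (illustrative only, not the released C45Oblique; scikit-learn is assumed, and a CART-style DecisionTreeClassifier stands in for C4.5): fit a linear discriminant on the training data, append the discriminant projections of every example as new columns, and hand the extended data to an axis-parallel tree learner.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

def add_discriminant_attributes(X_train, y_train):
    """Return a function that appends linear-discriminant projections as new columns."""
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

    def extend(X):
        proj = lda.decision_function(X)   # one score per hyper-plane (1-D for two classes)
        if proj.ndim == 1:
            proj = proj[:, None]
        return np.hstack([X, proj])

    return extend

# Usage sketch on an oblique two-class concept (at1 > at2).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (500, 2))
y = (X[:, 0] > X[:, 1]).astype(int)
extend = add_discriminant_attributes(X, y)
tree = DecisionTreeClassifier(random_state=0).fit(extend(X), y)
print(tree.get_depth())   # usually much shallower than a tree trained on the raw attributes
```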

Ltree explores the constructive step of C45Oblique dynamically. At each decision node Ltree builds new attributes based on linear combinations of the previous ones. If the best orthogonal test involves one of the new attributes, the decision surface is oblique with respect to the original axes.

Fig. 2. The new instance space.


The motivation is that, for each sub-space, local interdependencies between attributes can be captured.

There are two new aspects in our contribution to the state of the art on oblique decision trees. The first one is that the new attributes are propagated down through the tree. The new attributes built at lower nodes contain terms based on attributes built at previous nodes. This architecture allows building very complex decision surfaces, for example using a quadratic discriminant as constructive operator¹. The second aspect is that the number of attributes is variable along the tree, even for two nodes at the same level. Another aspect is that Ltree constructs a probabilistic decision tree. When classifying an example the tree outputs a class probability distribution that takes into account not only the distribution at the leaf where the example falls, but also a combination of the distributions of each ancestor node.

3. Growing the tree

Ltree is a top-down inductive decision tree system. The main algorithm is similar to many other programs from the TDIDT family, except for the construction of new attributes at each decision node.

3.1. New attributes

At each decision point, Ltree computes new attributes dynamically. The new attributes are linear combinations of all attributes of the examples that fall at this decision node. For each example the new attributes are computed by projecting the example over the hyper-planes built by the linear discriminant function. The linear discriminant function builds hyper-planes of the form:

$H_i = \alpha_i + x^T \beta_i$, where $\beta_i = S_{\mathrm{pooled}}^{-1}\,\bar{x}_i$ and $\alpha_i = \log(p(C_i)) - \tfrac{1}{2}\,\bar{x}_i^T S_{\mathrm{pooled}}^{-1}\,\bar{x}_i$.   (3)

Suppose that the prior probability of observing class i is p(C_i) and that f_i(x) is the probability density function of x in class i. The joint probability of observing class i and example x is p(C_i)·f_i(x) (Eq. (1)). Assuming that f_i(x) follows a normal distribution (Eq. (2)) and that all the classes have the same covariance matrix, the logarithm of the probability of observing class i and example x is, dropping terms that do not depend on the class, [7,16,19]

$\log(p(C_i)) + x^T \Sigma^{-1} \mu_i - \tfrac{1}{2}\,\mu_i^T \Sigma^{-1} \mu_i$,

where Σ denotes the covariance matrix common to the multivariate normal distributions of x in all the classes. In practice the population parameters μ_i and Σ are unknown and are estimated from the examples that fall at this node. The discriminant function is estimated using the sample means x̄_i and the pooled covariance matrix S_pooled. This leads to Eq. (3).

¹ One of the Ltree variants considered in the evaluation section uses a quadratic discriminant as constructive operator.


The number of hyper-planes is reduced by one by normalizing the coefficients α and β of each hyper-plane, subtracting the corresponding coefficients of the last hyper-plane, as suggested in Ref. [16].

To build the hyper-planes the system considers, at each decision point, a varying number of classes². Suppose that at a decision node Ltree considers q_node classes (q_node ≤ q). The linear discriminant procedure generates q_node − 1 hyper-planes and Ltree builds q_node − 1 new attributes. All the examples at this node are extended with the new attributes. Each new attribute is given by the projection of the example over the hyper-plane H_i. The projection is computed as the dot product of the example vector x with the coefficients of the hyper-plane H_i. Because the class distribution varies along the tree, the number of new attributes is variable: two different nodes (even at the same level) may have a different number of attributes. As the tree grows and the classes are discriminated, the number of new attributes decreases. New attributes are propagated down the tree. This is the most innovative feature of Ltree.
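The attribute-construction step at one node can be sketched as follows (an illustrative re-implementation under the stated assumptions, not the original Ltree code; NumPy assumed): estimate the class means and the pooled covariance from the examples at the node, build one hyper-plane per retained class as in Eq. (3), normalize against the last hyper-plane so that q_node classes yield q_node − 1 new attributes, and append the projections as new columns that are passed down to the children together with the original ones.

```python
import numpy as np

def node_hyperplanes(X, y, min_ratio=3):
    """Hyper-plane coefficients (alpha_i, beta_i) of Eq. (3) for the examples at one node."""
    # Keep only classes with enough examples at this node (cf. footnote 2).
    classes = [c for c in np.unique(y) if np.sum(y == c) > min_ratio * X.shape[1]]
    if len(classes) < 2:
        return []
    means = {c: X[y == c].mean(axis=0) for c in classes}
    d = X.shape[1]
    pooled = np.zeros((d, d))
    n_used = 0
    for c in classes:
        Xc = X[y == c] - means[c]
        pooled += Xc.T @ Xc
        n_used += Xc.shape[0]
    pooled /= max(n_used - len(classes), 1)
    inv = np.linalg.pinv(pooled)                # pseudo-inverse for robustness
    planes = []
    for c in classes:
        beta = inv @ means[c]                   # beta_i = S_pooled^{-1} * x_bar_i
        alpha = np.log(np.mean(y == c)) - 0.5 * means[c] @ inv @ means[c]
        planes.append((alpha, beta))
    # Normalize by subtracting the last hyper-plane: q classes give q - 1 new attributes.
    a_last, b_last = planes[-1]
    return [(a - a_last, b - b_last) for a, b in planes[:-1]]

def extend_with_projections(X, planes):
    """Append one new column per hyper-plane: the projection H_i = alpha_i + x^T beta_i."""
    new_cols = [alpha + X @ beta for alpha, beta in planes]
    return np.hstack([X] + [col[:, None] for col in new_cols]) if new_cols else X
```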

3.2. Splitting criteria

It is known that building the optimal tree (in terms of accuracy and size) for a given dataset is an NP-complete problem. In this situation we must use heuristics to guide the search. A splitting rule typically works as a one-step look-ahead heuristic: for each possible test, the system hypothetically considers the subsets of data obtained and chooses the test that maximizes (or minimizes) some heuristic function over the subsets. By default Ltree uses the Gain Ratio [22] as the splitting criterion. A test on a nominal attribute divides the data into as many subsets as the number of values of the attribute. A test on a continuous attribute divides the data into two subsets: "attribute value > cut point" and "attribute value ≤ cut point". To determine the cut point, we follow a process similar to C4.5.
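For reference, a compact sketch of the gain-ratio heuristic for a binary split on a continuous attribute (standard C4.5-style definitions; the exhaustive midpoint search below is a simplified stand-in for the actual cut-point procedure).

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(y, left_mask):
    """Gain ratio of a binary split given by a boolean mask (True = left subset)."""
    n, n_left = len(y), int(left_mask.sum())
    if n_left == 0 or n_left == n:
        return 0.0
    split_entropy = (n_left / n) * entropy(y[left_mask]) + \
                    ((n - n_left) / n) * entropy(y[~left_mask])
    gain = entropy(y) - split_entropy
    p = n_left / n
    split_info = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return gain / split_info

def best_cut_point(x, y):
    """Score midpoints between consecutive distinct values of a continuous attribute."""
    values = np.unique(x)
    cuts = (values[:-1] + values[1:]) / 2
    scored = [(gain_ratio(y, x <= c), c) for c in cuts]
    return max(scored) if scored else (0.0, None)
```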

3.3. Stopping criteria

The usual stopping criterion for a decision tree is to stop growing the tree when all the examples that fall at a decision node belong to the same class. In noisy domains a more relaxed criterion is needed. Ltree uses the following rule: stop growing the tree if the percentage of examples from the majority class is greater than a user-defined parameter (by default 98%). If there are no examples at a decision node, Ltree returns a leaf with the class distribution of the predecessor of this node.

3.4. Smoothing

In spite of the positive aspects of divide-and-conquer algorithms, one should be concerned about the statistical consequences of dividing the input space. Dividing the data can improve the bias of an estimator, because it allows a finer fit to the data, but in general it increases the variance. The use of soft thresholds is an example of a methodology that attempts to minimize the effects of splitting the data³. Ltree uses a smoothing process, following Buntine [5], which usually improves performance.

² Taking into account the class distribution at this node, Ltree only considers those classes for which the number of examples exceeds three times the number of attributes. This is done because otherwise the data underfit the concept, following Breiman et al. [2].
³ The downward propagation of the new attributes is also a more sophisticated form of soft thresholds.


When classifying a new example, the example traverses the tree from the root to a leaf. The class attached to the example takes into account not only the class distribution at the leaf, but also all the class distributions of the nodes on the path; that is, all nodes along the path contribute to the final classification. Instead of computing the class distribution for all paths in the tree at classification time, as is done in [21], Ltree computes a class distribution for every node while growing the tree. This is done recursively, taking into account the class distributions at the current node and at its predecessor. At each node n, Ltree combines both class distributions using the recursive Bayesian update formula [18]:

$P(C_i \mid e_n, e_{n+1}) = P(C_i \mid e_n)\,\frac{P(e_{n+1} \mid e_n, C_i)}{P(e_{n+1} \mid e_n)}$.

Here P(e_n) is the probability that a given example e falls in node n, which can be seen as shorthand for P(e ∈ E_n), where e represents the given example and E_n the set of examples in node n. Similarly, P(e_{n+1} | e_n) is the probability that an example that falls in node n falls in node n+1, and P(e_{n+1} | e_n, C_i) is the probability that an example from class C_i is passed from node n to node n+1. This recursive formulation for updating beliefs allows Ltree to compute the required class distributions efficiently. Classification using smoothed class distributions is more robust than classification based only on the examples that fall at the leaf [5].
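One update step of this smoothing can be sketched as follows (illustrative only; estimating the two conditional probabilities from the example counts at the parent and child nodes is an assumption of this sketch, and NumPy is assumed).

```python
import numpy as np

def smoothed_distribution(parent_smoothed, parent_counts, node_counts):
    """One step of the recursive Bayesian update applied while growing the tree."""
    parent_smoothed = np.asarray(parent_smoothed, dtype=float)
    parent_counts = np.asarray(parent_counts, dtype=float)
    node_counts = np.asarray(node_counts, dtype=float)
    # P(e_{n+1} | e_n, C_i): fraction of the parent's class-i examples reaching this node.
    pass_given_class = np.divide(node_counts, parent_counts,
                                 out=np.zeros_like(node_counts), where=parent_counts > 0)
    # P(e_{n+1} | e_n): fraction of all parent examples reaching this node.
    pass_overall = node_counts.sum() / parent_counts.sum()
    posterior = parent_smoothed * pass_given_class / pass_overall
    return posterior / posterior.sum()        # renormalize for numerical safety

# The root's smoothed distribution is its empirical one; children are updated recursively.
root = np.array([0.5, 0.3, 0.2])
print(smoothed_distribution(root, parent_counts=[50, 30, 20], node_counts=[40, 5, 1]))
```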

In a set of experiments presented later, in the section on the empirical evaluation, we have observed that smoothing generates more compact trees, although the benefits in error rate are not significant. Both the pruning of the tree and the processing of missing values exploit the smoothed class distributions.

3.5. Pruning

Pruning is considered to be the most crucial part of the tree building process, at least in noisy domains. Statistics computed at deeper nodes of a tree have a lower level of significance due to the smaller number of examples that fall at these nodes. Deeper nodes reflect the training set too closely (overfitting) and increase the error due to the variance of the classifier. Several methods for pruning decision trees have been presented in the literature [2,22,8]. The process that we use exploits the class probability distributions computed by Ltree at each node. Usually, and also in the case of C4.5, pruning is a process that increases the error rate on the training data. In our case this is not necessarily true: the class that is assigned to an example takes into account the path from the root to the leaf, and is often different from the majority class of the examples that fall at that leaf. At each node, Ltree considers the static error and the backed-up error. The static error is the number of misclassifications obtained if all the examples that fall at this node are classified by a majority vote. The backed-up error is the sum of the misclassifications of all sub-trees of the current node. If the backed-up error is greater than or equal to the static error, a terminal leaf node is created.
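A minimal sketch of this bottom-up comparison (the Node structure and its fields are hypothetical, introduced only for illustration).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    class_counts: List[int]                 # per-class example counts at this node
    children: List["Node"] = field(default_factory=list)

def static_error(node):
    """Misclassifications if every example at the node receives the majority class."""
    return sum(node.class_counts) - max(node.class_counts)

def prune(node):
    """Bottom-up pruning: collapse a subtree into a leaf whenever its backed-up error
    (the sum over its sub-trees) is not smaller than the static error at the node."""
    if not node.children:
        return static_error(node)
    backed_up = sum(prune(child) for child in node.children)
    if backed_up >= static_error(node):
        node.children = []                  # the split does not help: make this a leaf
        return static_error(node)
    return backed_up

# Usage: a split whose sub-trees do not reduce the error gets pruned away.
tree = Node([60, 40], [Node([35, 25]), Node([25, 15])])
prune(tree)
print(len(tree.children))                   # 0 -> the subtree was replaced by a leaf
```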

3.6. Missing values

At learning time, if an example has a missing value on the test attribute, a fractional example is passed down all the descendant branches.

When using the tree as a classifier, the example to be classified is passed down the tree. At each node a test is performed, and if the value of the tested attribute is not known (in real data some attribute values are often unknown or undetermined) the normal procedure cannot be used to determine the path to follow.


Since a decision tree constitutes a hierarchy of tests, the unknown attribute value is dealt with as follows. Ltree passes the example down all branches of the node where the unknown attribute value was detected, following [5]. Each branch outputs a class distribution. The output is a combination of the different class distributions, which sums to 1.
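The combination step can be sketched as follows (illustrative; weighting each branch by the fraction of training examples that followed it is a natural choice assumed here, while the text only requires that the result sums to 1).

```python
import numpy as np

def combine_branch_distributions(distributions, branch_counts):
    """Combine the class distributions returned by every branch of a node whose test
    attribute is unknown for the query example."""
    distributions = np.asarray(distributions, dtype=float)   # one row per branch
    weights = np.asarray(branch_counts, dtype=float)
    weights /= weights.sum()
    combined = weights @ distributions
    return combined / combined.sum()

# Example: two branches that received 70 and 30 training examples.
print(combine_branch_distributions([[0.9, 0.1], [0.2, 0.8]], [70, 30]))   # [0.69, 0.31]
```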

3.7. Classifying new examples

To be applied as a predictor, Ltree stores at each node the discriminant function generated there. When classifying a new example, the example traverses the tree in the usual way, but at each decision node it is extended by inserting the attributes generated by the discriminant function.

3.8. An illustrative example

Fig. 3 illustrates the decision tree built by Ltree on the Iris data. Each leaf shows the attached class probability distribution. The coefficients of the generated linear combinations are:

Linear5 = −31.5 − 3.2·Sepallen − 3.5·Sepalwid + 7.5·Petallen + 14.7·Petalwid,
Linear6 = −13.3 − 7.7·Sepallen + 16.6·Sepalwid − 21.5·Petallen − 24.3·Petalwid,
Linear7 = −16.6 − 3.5·Sepallen − 5.5·Sepalwid + 6.9·Petallen + 12.4·Petalwid + 0.03·Linear5 − 0.03·Linear6.

This is a tree with 5 nodes. It misclassifies 3.1% of the examples on the test data⁴. For the same data, C4.5 generates a tree with 11 nodes and 4.9% errors.

4. Related work

Brodley and Utgoff [4] have reviewed and evaluated several methods for learning the coefficients of a linear combination test, as well as methods for simplifying linear combinations. As for the methods for learning the coefficients, they have analyzed, among others, the following.

Fig. 3. Ltree on Iris data.

⁴ In a 10-fold cross validation.


Recursive Least Squares minimizes the mean-squared error over the training data; the Pocket Algorithm maximizes the number of correct classifications on the training data; and Thermal Training converges to a set of coefficients by paying decreasing attention to large errors. From their empirical evaluation they concluded that, on the datasets studied, Recursive Least Squares achieves the best accuracy on an independent test set and produces smaller trees than the other methods. As for the methods for simplifying the linear equations, they analyzed Sequential Backward Elimination and Forward Sequential Selection. The former begins with all the features and in each successive step removes the feature that contributes least to the performance of the linear combination. The latter begins with an empty bag of features and in each step adds the feature that contributes most to the performance of the linear combination.

4.1. Multivariate trees

In Ref. [2], Breiman et al. suggest the use of a linear combination of attributes instead of a single attribute. CART searches explicitly for a set of coefficients that minimizes the partition-merit criterion defined by the user. Only instances for which no values are missing are used for training. Each instance is normalized by centering each feature value at its median and dividing it by its range. The algorithm incorporated in CART cycles through the attributes at each step, searching for an improved linear combination split. After the final linear combination is determined, it is converted into a split test on the original, non-normalized features. CART performs a Sequential Backward Elimination that starts with all the features in the linear combination and tries to remove the feature that causes the smallest decrease in the partition-merit criterion. It continues removing features until a stopping criterion is met.

Similar to CART, OC1 [17] uses a hill-climbing search algorithm. Beginning with the best axis-parallel split, it randomly perturbs each of the coefficients until there is no improvement in the impurity of the hyper-plane. As in CART, OC1 adjusts the coefficients of the hyper-plane individually, finding a locally optimal value for one coefficient at a time. The innovation of OC1 is a randomization step used to get out of local minima. OC1 uses error-complexity pruning with a separate pruning set. Missing values are replaced, before learning, by the mean of the attribute. OC1 generates binary trees. The implementation that we have used in the comparative study accepts numerical attributes only.

The Linear Machine Decision Tree (LMDT) of Brodley and Utgoff [3] uses a different approach. Each internal node in an LMDT tree is a set of linear discriminant functions that are used to classify an example. The training algorithm repeatedly presents examples at each node until the linear machine converges. Because convergence cannot be guaranteed, LMDT uses heuristics to determine when the node has stabilized. Trees generated by LMDT are not binary: the number of descendants of each decision node is equal to the number of classes that fall at this node.

The FACT system [12] recursively partitions the input space using linear discriminant functions. The number of descendants of each node is equal to the number of classes. FACT obtains results similar to CART, but with lower running times. Mangasarian [13] presents a system that uses linear programming to compute the coefficients of each multivariate test.

Reconstruction of the input space by means of new attributes defined as combinations of the original ones appears as a preprocessing step in Yip and Webb [27]. Yip's system CAF incorporates new attributes based on canonical discriminant analysis⁵.


They show that such techniques can improve the performance of machine learning algorithms. Ittner and Schlosser [9] introduce, in a pre-processing step, all pairwise products and squares of the numerical primitive attributes. The inductive step is done by OC1. The constructive preprocessing step allows OC1 to build non-linear decision surfaces. There are two weak points in this approach. The first is that the number of new attributes grows quadratically with the number of primitive attributes. The second is that the type of functions was selected in an arbitrary way.

There are several differences between Ltree and other multivariate trees such as CART, OC1, and LMDT. With respect to the method for determining the hyper-planes, CART, OC1 and LMDT use a gradient-descent approach, while Ltree uses an analytical method. This aspect is relevant for learning time, and explains the comparatively fast performance of Ltree.

With respect to the number of hyper-planes, OC1, like CART, searches for only one hyper-plane, regardless of the number of classes. The goal of the search is to minimize the impurity function, which is also used as the attribute selection criterion. In LMDT, the number of hyper-planes is equal to the number of classes, while in Ltree it is equal to the number of classes minus one. The search for the hyper-planes in Ltree is guided by the quadratic error. Each hyper-plane corresponds to one new attribute, but it is chosen only if it also minimizes the entropy function. At each node, only one of the generated hyper-planes may be chosen. But all are propagated down the tree and may eventually be chosen at deeper nodes.

In Ltree, OC1, and CART, a multivariate test produces a binary split, while in LMDT the number of descendants is equal to the number of classes.

5. Empirical evaluation

We have performed an extensive empirical evaluation of Ltree on both artificial and real data. The performance of our system was compared with other well-known decision tree algorithms. We have used one univariate tree, C4.5, and three multivariate trees: LMDT and OC1, which are available via the Internet, and CART⁶.

The performance of the algorithms was evaluated by measuring the error rate and the number of leaves of the generated tree, that is, the number of different regions into which the instance space is partitioned. This last statistic is an indicator of concept fragmentation. As such it can give insights into how well the model adapts to the data. To compare the model complexity of univariate and multivariate trees we have used the Minimum Description Length principle.

5.1. Artificial data

The main interest of artificial data is the possibility of controlled experiments. We have used two artificial datasets. Both are two-class problems. The examples were generated randomly and classified using a known rule.

The first dataset is LS10, defined in Ref. [17]. The data is linearly separable by a 10-D hyper-plane defined by the equation x1 + x2 + x3 + x4 + x5 > x6 + x7 + x8 + x9 + x10. The attributes are uniformly distributed in the range [0,1].

⁵ This is similar to our C45Oblique. The difference is that we use a linear discriminant function.
⁶ California Statistical Software, Inc.


The second dataset consists of 1000 data points, defined by two continuous attributes uniformly distributed in the range [0,1]. There are two classes. Each class defines two convex regions, forming a kind of XOR problem. We use this simple dataset as a test of the ability of multivariate decision trees to deal with problems where the classes are not linearly separable.

The overall results, using 10-fold cross validation, in terms of error rates and number of leaves on both datasets are presented in Table 1. Ltree achieves a lower error rate than all the other algorithms considered on the XOR problem and is the second best on the LS10 problem.

We have also performed a set of experiments with learning curves. The experimental methodology was as follows: we used 10-fold stratified cross validation and, in each fold, the algorithm was trained 10 times, using an increasing number of examples from the training set. This corresponds to generating 10 × 10 classifiers. The process begins with 10% of the training set examples, and each time the number of training examples is increased by 10%. On the 10th run, the algorithm learns from all the training examples. Each classifier is used to classify all the examples in the test set. After completing this we compute the means of the error rates and of the number of leaves over the runs in which the algorithm uses the same number of training examples.
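A sketch of this protocol (illustrative only; scikit-learn's StratifiedKFold and a CART-style tree stand in for the systems actually evaluated in the paper).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def learning_curve(X, y, seed=0):
    """10-fold stratified CV; in each fold, train on 10%, 20%, ..., 100% of the
    training partition and measure the error on the held-out fold."""
    fractions = np.arange(0.1, 1.01, 0.1)
    errors = np.zeros((10, len(fractions)))
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for f, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        order = np.random.default_rng(seed + f).permutation(train_idx)
        for j, frac in enumerate(fractions):
            subset = order[:max(2, int(frac * len(order)))]
            clf = DecisionTreeClassifier(random_state=seed).fit(X[subset], y[subset])
            errors[f, j] = 1.0 - clf.score(X[test_idx], y[test_idx])
    return fractions, errors.mean(axis=0)   # mean error rate per training-set size
```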

Figs. 4–6 show the results on the LS10 dataset (learning curves for LS10 data). Fig. 5 shows the variation of the error rate and Fig. 6 shows the variation of the number of leaves as the number of training examples varies.

This artificial problem is clearly biased towards multivariate algorithms. The curve for C4.5 (Fig. 4) illustrates the typical behavior when the model class of the algorithm is not appropriate for the problem: the tree grows linearly with the number of examples and there are no significant benefits in performance. CART and LMDT generate a tree with 2 leaves in all of the 10 folds; that is, they always partition the instance space into two regions. LMDT performs quite well: the error rate of LMDT using 200 examples is similar to the error rate of CART or Ltree using 1800 examples. The performance of Ltree is similar to CART, while OC1 performs significantly worse than Ltree.

Results on the XOR problem are shown in Figs. 7–9 (learning curves for XOR data). In this problem, Ltree, CART, and OC1 perform similarly with respect to both error rates (Fig. 8) and number of leaves (Fig. 9). The exception is LMDT, which performs significantly worse than C4.5, both in terms of error rate and number of leaves.

Table 1
Error rates and number of leaves (10-fold cross validation)

               C4.5    CART   OC1    LMDT   Ltree
LS10
  Error rate   22.35   2.2    3.7    0.35   1.3
  Nr. leaves   137.6   2      6.3    2      2.5
XOR
  Error rate   4.0     3.1    3.4    9.5    2.7
  Nr. leaves   19.6    13.5   13.2   36     15.7


5.2. UCI and StatLog datasets

To evaluate our algorithm we performed a 3 × 10 stratified Cross Validation (CV) on twenty-one datasets, some of which come from the StatLog repository⁷ and others from the UCI repository⁸ [15]. Datasets were permuted once before each of the three CV procedures. All algorithms were used with default settings.

⁷ http://www.ncc.up.pt/liacc/ML/statlog/index.html.
⁸ http://www.ics.uci.edu/AI/ML/Machine-Learning.html.

Fig. 4. Error rates and nr. of leaves for C4.5.

Fig. 5. Error rates for multivariate trees.

Fig. 6. Nr. leaves for multivariate trees.


At each iteration of the CV, all algorithms were trained on the same training partition of the data. Classifiers were also tested on the same test partition of the data.

Fig. 8. Error rates for XOR data.

Fig. 9. Nr. of leaves for XOR data.

Fig. 7. The XOR data set.


In the first set of experiments we compared the univariate decision tree C4.5 with the multivariate decision trees Ltree and C45Oblique. To have statistical confidence in the differences we compared the results using paired t-tests with a confidence level of 95%. The null hypothesis is that C4.5 performs equally to each one of the other algorithms. A + or − sign means that there was a significant difference at the 95% confidence level. The results of the first set of experiments are shown in Table 2. Table 3 shows a summary of the average number of leaves of the three systems.

On these datasets Ltree performs significantly better than C4.5 on 10 datasets and worse on one. C45Oblique performs significantly better than C4.5 on 8 datasets and worse on two. These results clearly illustrate the benefits of using linear combinations of attributes.

In the second set of experiments we compared Ltree against other decision trees that use multivariate tests at the internal nodes.

Table 2
C4.5 versus Ltree and C45Oblique – error rates

Dataset      Examples  Classes  C4.5            Ltree             C45Oblique
Australian   690       2        14.49 ± 4.4       14.24 ± 3.6       13.80 ± 4.4
Balance      625       3        22.39 ± 4.1    +  6.49 ± 3.0     +  5.22 ± 2.7
Banding      238       2        24.28 ± 8.0       22.62 ± 7.0       22.42 ± 8.6
Breast (W)   699       2        5.52 ± 3.6     +  3.24 ± 2.6     +  4.19 ± 3.4
Cleveland    303       2        22.38 ± 7.0    +  17.84 ± 6.7       20.35 ± 7.8
Credit       690       2        14.16 ± 3.7    −  15.08 ± 4.4       14.01 ± 3.9
Diabetes     768       2        25.70 ± 5.1       26.00 ± 4.7       24.21 ± 4.5
German       1000      2        28.40 ± 3.2    +  25.33 ± 3.9    +  24.90 ± 3.2
Glass        213       6        30.61 ± 8.5       29.57 ± 8.5    −  34.90 ± 10.7
Heart        270       2        21.60 ± 9.4    +  17.28 ± 7.3    +  17.90 ± 7.6
Hepatitis    155       2        19.47 ± 9.1       19.50 ± 7.3       19.83 ± 8.5
Ionosphere   351       2        11.12 ± 4.3       10.80 ± 5.1       10.62 ± 4.6
Iris         150       3        4.89 ± 5.2     +  3.11 ± 4.5        4.22 ± 5.4
Mushroom     8124      2        0.00 ± 0.0        0.00 ± 0.0        0.00 ± 0.0
Satimage     6435      6        14.53 ± 2.4    +  12.57 ± 2.7    +  12.87 ± 2.4
Segment      2310      7        3.30 ± 1.6        3.42 ± 1.3        3.13 ± 1.3
Sonar        208       2        30.31 ± 10.8      32.16 ± 12.7      28.21 ± 11.3
Vehicle      846       4        27.00 ± 5.6    +  21.03 ± 5.5    +  21.92 ± 4.0
Votes        435       2        3.31 ± 3.0        3.54 ± 3.6     −  4.30 ± 3.0
Waveform     2581      3        24.46 ± 2.6    +  16.89 ± 2.4    +  16.70 ± 2.3
Wine         178       3        7.24 ± 7.6     +  1.84 ± 3.0     +  3.54 ± 4.1
Mean                            16.91             14.41             14.63
Geo. mean                       14.46             11.36             12.04

Table 3
Number of leaves: C4.5 versus Ltree

        Ltree   C4.5    C45Oblique
Mean    21.85   44.91   29.69


Slight syntactic modifications of the data were necessary to run OC1 and CART. CART was used with the linear combinations option active. It also uses an internal consistency-check procedure for the parameter settings. We used the following default parameters: Split = gini, Linear size = 20, param = 0.2 and linmax = 200. These settings work for most of the datasets, although on some of them we needed to increase the linmax option, as suggested by CART. LMDT was used reserving 10% of the training data for pruning.

Table 4 shows the mean and standard deviation of the error rates (in percentages). Results were compared using paired t-tests. The null hypothesis is that Ltree performs as well as the other algorithm. A plus sign means that Ltree was worse than the given algorithm at the 95% confidence level; a minus sign means that Ltree performs significantly better. Our system has a good overall performance in terms of error rate. On these datasets it never performs significantly worse than the other multivariate trees. In comparison with LMDT and OC1 it performs significantly better on 14 datasets, and against CART it is better on 9 datasets.

5.3. Discussion

In Ref. [12] the authors of CART discuss the use of linear combinations of attributes in tree learning. They express the view that "although linear combination splitting has strong intuitive appeal, it does not seem to achieve this promise in practice...".

Table 4
Error rates of multivariate trees

Dataset      Ltree             LMDT               OC1                CART
Australian   14.24 ± 3.6    −  16.71 ± 3.9        15.37 ± 5.1        14.77 ± 4.7
Balance      6.49 ± 3.0     −  10.07 ± 3.6     −  7.73 ± 3.6      −  8.57 ± 3.4
Banding      22.62 ± 7.0       23.65 ± 7.3     −  26.73 ± 10.5    −  28.71 ± 9.0
Breast (W)   3.24 ± 2.6     −  4.10 ± 2.8      −  5.15 ± 3.0      −  6.62 ± 3.2
Cleveland    17.84 ± 6.7    −  21.38 ± 7.5     −  21.89 ± 9.1        19.21 ± 8.0
Credit       15.08 ± 4.4    −  16.39 ± 3.7     −  17.02 ± 4.3        14.50 ± 4.5
Diabetes     26.00 ± 4.7       27.00 ± 3.9     −  28.17 ± 3.9        25.83 ± 4.7
German       25.33 ± 3.9    −  28.43 ± 4.1     −  27.80 ± 4.5        26.07 ± 4.0
Glass        29.57 ± 8.5    −  40.05 ± 11.3    −  34.14 ± 10.4       31.12 ± 11.8
Heart        17.28 ± 7.3    −  22.35 ± 8.3     −  24.44 ± 9.9     −  25.93 ± 8.0
Hepatitis    19.50 ± 7.3       18.14 ± 7.6        23.86 ± 9.3        22.31 ± 4.9
Ionosphere   10.80 ± 5.1    −  13.40 ± 5.3        12.31 ± 5.6        10.98 ± 5.5
Iris         3.11 ± 4.5     −  4.67 ± 4.9         3.11 ± 4.5      −  6.22 ± 5.8
Mushroom     0.00 ± 0.0        0.00 ± 0.0      −  0.22 ± 0.2         0.01 ± 0.0
Satimage     12.57 ± 2.7    −  15.06 ± 3.1        13.19 ± 2.2     −  14.88 ± 2.4
Segment      3.42 ± 1.3     −  4.32 ± 1.4      −  5.18 ± 1.6         3.88 ± 1.3
Sonar        32.16 ± 12.7      30.61 ± 11.4       32.48 ± 12.2       31.05 ± 12.9
Vehicle      21.03 ± 5.5       21.60 ± 4.5     −  30.65 ± 4.1     −  27.70 ± 5.5
Votes        3.54 ± 3.6     −  5.54 ± 3.9         4.83 ± 3.2      −  4.71 ± 3.9
Waveform     16.89 ± 2.4       16.82 ± 2.1     −  19.22 ± 2.9        17.50 ± 2.9
Wine         1.84 ± 3.0     −  4.49 ± 3.5      −  9.57 ± 7.2      −  7.62 ± 7.4
Mean         14.41             16.43              17.29              16.58
Geo. mean    11.36             13.97              14.69              14.52


This was confirmed in our experiments with systems such as LMDT and OC1 when compared with C4.5. The use of linear combinations of attributes increases the number of degrees of freedom and allows a better fit to the data, but in consequence it also increases the variance of the classifier.

In Ltree each new attribute is the definition, in extension, of the linear discriminant function and is therefore continuous. In other words, the new attribute is obtained by applying the linear discriminant function to each example. In typical decision tree systems, the original continuous attributes can be used more than once in each branch of the generated tree. The innovation in Ltree is that any new attribute generated during induction is treated in the same way as the original ones. This implies that new attributes generated at one decision node can be used in subsequent nodes.

The linear discriminant function that we use is not always the best method for determining the linear combinations, since it assumes equality of the covariance matrices. If this assumption does not hold, the method may not find the right linear combination, even if the data is linearly separable. This was verified, for instance, with the LS10 data. Although this is a drawback, the linear discriminant function is fast and always provides a good approximation to a linear combination.

5.4. Downward propagation of the new attributes

Ltree uses two heuristic rules that control the applicability of the constructive operator. The first rule defines the depth beyond which the constructive operator cannot be applied; this is a user-settable parameter whose default is 5. The second rule sets a dependence between the number of examples that fall at a node and the number of attributes: the constructive operator is applied only if the number of examples is greater than 3 times the number of attributes⁹.

There are three reasons that justify the downward propagation of the new attributes. The first one is related to the way continuous attributes are usually handled in decision tree induction. Nominal variables are used at most once in a decision tree, because in all subsets obtained by splitting the data on a nominal attribute the value of this attribute is constant. A continuous attribute can be used more than once, because a test on this attribute produces a binary split. In general, in both subsets the attribute is not constant and future splits on this attribute could be useful. The second reason is related to the method used by Ltree to generate the linear combinations. Each linear combination discriminates only one class and generates one new attribute. At each node only one attribute is selected. If one of the new attributes is selected, it discriminates only one of the classes. The other attributes, which discriminate the other classes, could be useful later when extending the tree. The third reason is that at deeper nodes the new attributes contain terms based on previously built attributes. That is, the linear combinations that we get at deeper nodes contain terms that are themselves linear combinations of linear combinations obtained at upper nodes. The downward propagation of the attributes gradually increases the complexity of the attributes generated at deeper nodes.

To test this hypothesis, we have developed two variants of Ltree. In the first variant, system Mtree, each new attribute is a linear composition of the original attributes; there is no downward propagation of the new attributes. In the second variant, system Qtree, we use a quadratic composition of the ordered attributes with downward propagation.

⁹ A user-settable parameter.


This means that deeper nodes contain quadratic compositions of quadratic terms.

A comparative summary of the results of Ltree, Qtree, and Mtree is shown in Table 5. Here we have used the same 21 datasets as in the previous experiments. As for the error rates, the performance of the three systems is quite similar, but with respect to the number of leaves there are significant improvements due to the use of more complex combinations in the deeper nodes. Ltree achieves significantly better performance than Mtree on six datasets. Column Ltree(ws) refers to running Ltree with the parameter that allows smoothed class distributions switched off. On these datasets smoothing does not seem to affect accuracy, but it is beneficial in terms of tree size. Qtree is able to define more complex decision surfaces, although the gains in accuracy vary more due to variance (it performs significantly better on 3 datasets and worse on 2).

5.5. Learning times

We have measured the learning times, that is, the time needed by the learning algorithm to generate a predictor. Comparisons here may depend on implementation details as well as on the underlying hardware. However, the order of magnitude of time complexity is still a useful indicator. Table 6 shows the average running time (in seconds) of each algorithm¹⁰ on the 21 datasets described earlier.

As Jordan et al. [10] have pointed out, the training time of divide-and-conquer algorithms is often orders of magnitude lower than that of gradient-based algorithms. This is one of the reasons why Ltree is significantly faster than other oblique decision trees such as CART, OC1 and LMDT, which use gradient-descent approaches to determine the hyper-planes.

5.6. How far from the best?¹¹

For each dataset we consider the algorithm with the lowest error rate. We define the error margin as the standard deviation of a Bernoulli distribution with the error rate of this algorithm:

$\mathrm{Error\ Margin} = \sqrt{E_{low}\,(1 - E_{low})/N}$,

¹⁰ All algorithms were compiled with the same optimisation options and all ran on a Pentium 166 MHz, 32 MB machine under Linux.
¹¹ The method described here is a variant of the method used in the StatLog project [16].

Table 5
Comparison between Ltree and Ltree variants (summary)

                            Ltree   Ltree (ws)   Mtree   Qtree
Error rates
  Mean                      14.41   14.40        14.55   14.58
  Nr. significant wins              2-1          1-1     2-3
Nr. of leaves
  Mean                      21.85   22.80        22.29   19.74
  Nr. of significant wins           8-0          6-0     5-6


where E_low represents the error rate of the best algorithm and N the number of examples in the test set. The distance of the error rate of an algorithm from the best algorithm on each dataset, in units of the error margin, $(E_{alg_i} - E_{low})/\mathrm{Error\ Margin}$, gives an indication of the algorithm's performance that takes problem difficulty into account. Table 7 shows the average of these distances across all datasets.
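A tiny helper making these two quantities concrete (the numbers in the usage line are made up).

```python
import math

def error_margin(e_low, n_test):
    """Standard deviation of a Bernoulli variable with rate e_low over n_test test examples."""
    return math.sqrt(e_low * (1.0 - e_low) / n_test)

def bernoulli_distance(e_alg, e_low, n_test):
    """Distance of an algorithm's error rate from the best one, in error-margin units."""
    return (e_alg - e_low) / error_margin(e_low, n_test)

# Made-up numbers: the best algorithm gets 10% error on a 500-example test set.
print(bernoulli_distance(0.14, 0.10, 500))   # about 3 error margins away from the best
```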

There is strong evidence that, if we need to use a learning classifier on new data and have no further information, we should first try Ltree.

5.7. Tree complexity – Minimum Description Length

Ltree, CART, LMDT, and OC1 generate trees with multivariate tests in their decision nodes. The tree sizes of these systems can be compared directly. Comparisons with univariate trees are problematic because we must take into account the increase in complexity of the combinations of attributes. Multivariate trees are more compact than univariate ones; although the nodes containing multivariate tests are more difficult to understand, the overall tree is substantially simpler.

In this study we analyze model complexity along two dimensions. One dimension is the number of leaves in each decision tree. This corresponds to the number of different regions into which the instance space is partitioned by the algorithm. Table 8 shows the results for the multivariate trees. From this point of view, LMDT is the algorithm that produces the smallest trees.

The other dimension is based on the Minimum Description Length (MDL) principle, following Rissanen [24]. We are interested in comparing the sizes of the different models. Because the models use different representation languages, a simple measure such as the number of nodes does not capture the complexity of each model.

Any theory or hypothesis about the data can be used to encode the data as a binary string. In the MDL setting, we have a sender and a receiver. It is usually assumed that the sender knows the full training set and the model, while the receiver only knows the attribute values¹². The sender must transmit the necessary information that allows the receiver to classify all the examples of the training set.

¹² Note that if we need to code the data, this is independent of the model. As such, the code for the data is constant regardless of the model.

Table 6
Average running times (seconds)

C4.5   Mtree   C45Obl   Ltree   Qtree   LMDT   CART   OC1
109    165     168      241     427     4626   4031   4858

Table 7
Average Bernoulli distances

Ltree   Qtree   Mtree   C45Obl   LMDT   CART   OC1    C4.5
0.90    0.95    0.96    1.15     2.8    3.16   3.69   3.89


As such, the sender must send the theory generated by the algorithm. He must also send information identifying the examples misclassified by the model and, moreover, their correct classification. This provides a trade-off between the simplicity of the model and the ability of the model to explain the data. There is no guarantee that the codification used is optimal; most codifications use smart tricks. We have used a basic schema. Although it may not be optimal, it is sufficient for comparison purposes. The codification schema follows [3,25] and is briefly described in Appendix A.

We have used this coding schema with the previously described experiments and compared the code sizes of the decision trees built by C4.5 and Ltree. Table 9 shows a summary of the results. For each dataset, we averaged the code size of the trees generated in all the cross validation runs and carried out comparisons using paired t-tests, with the significance level set at 95%. We observe that C4.5 generates smaller code sizes on 11 datasets, while Ltree generates smaller code sizes on 8 datasets.

6. Conclusions

We have described a new and efficient method for the construction of multivariate decision trees, which combines a decision tree with a linear discriminant by means of constructive induction. The method can be used either as a pre-processing step, as in C45Oblique, or incorporated into decision tree induction, as in Ltree. The overall results of C45Oblique, measured both in terms of error rate and learning times, compare favorably with other approaches. The advantage of C45Oblique is that it is a rather simple system incorporating two well-known techniques (C4.5 and the linear discriminant); it can easily be implemented without changing those systems. However, system Ltree obtains more consistent and somewhat better results, both in terms of error rate and number of leaves.

There are two main features that characterize Ltree. The first one is the use of constructive induction: when building the tree, new attributes are computed as linear combinations of the previous ones. The new attributes are propagated down the tree; new attributes built at lower nodes contain terms based on attributes built higher up. This schema allows building very complex decision surfaces, mainly when a quadratic function is used as the constructive operator.

Table 9
Minimum description length (summary)

                          Ltree   C4.5
Mean code size            2440    1780
Nr. of significant wins   8       11

Table 8
Number of leaves of multivariate trees

        Ltree   LMDT   OC1     CART
Mean    21.85   7.71   13.21   19.08


The second aspect is that the number of attributes is variable along the tree, even for two nodes at the same level. At each decision node Ltree performs two inductive steps: the first one consists of building the discriminant function, the second one consists of applying the decision tree criteria. In the first step the discriminant decision rule is not used (i.e., not used to produce classifications). All decisions, such as stopping and choosing the splitting attribute, are delayed to the second inductive step; the final decision is made by the decision tree criteria. In problems with numerical attributes, attribute combination extends the representational power and relaxes the language bias of univariate decision tree algorithms. In an analysis based on the bias-variance error decomposition [11], Ltree combines a linear discriminant, which is known to have high bias but low variance, with a decision tree, known to have low bias but high variance. This is the desirable composition. We use constructive induction as a way of extending bias. Using Wolpert's terminology [26], the constructive step performed at each decision node is a bi-stacked generalization. From this point of view, the proposed methodology can be seen as a general architecture for combining algorithms by means of constructive induction, a kind of local bi-stacked generalization. A logistic discriminant or a neural net could easily replace the constructive operator used. This architecture could also be used for regression problems by replacing our attribute constructor operator, the linear discriminant function, with an operator based on principal component analysis.

Another aspect is that the method generates a probabilistic decision tree. When classifying an example, the tree outputs a class probability distribution that takes into account not only the distribution at the leaf where the example falls, but also a combination of the distributions of the nodes on the path that the example follows.

We have shown that this methodology can improve both accuracy and tree size when compared with other oblique decision tree systems. This is done in an efficient manner and leads to reduced learning times.

Acknowledgements

Gratitude is expressed for financial support under PRAXIS XXI and FEDER, project ECO, and the Plurianual support attributed to LIACC. Thanks also to my colleagues from LIACC and to the anonymous reviewers for useful comments and suggestions.

Appendix A

A.1. Codification schema

The codification schema we used for univariate decision trees, such as those of C4.5, follows [3,25]. Each node in a tree can be a leaf or an internal node with a test on a discrete or continuous attribute. To distinguish these types of nodes we need log2(3) bits. To codify a leaf, we only need to codify the class attached to this leaf; the number of bits required is log2(#Classes). To codify a test on a discrete attribute, we need to codify the attribute, which requires log2(#Attributes) bits.


Each descendant node requires log2(V_Atti) bits, where V_Atti is the number of values of Attribute_i. The code of a test on a discrete attribute is log2(#Attributes) + V_Atti·log2(V_Atti). To codify a test on a continuous attribute, we need to identify the attribute and the cut point. The cut point is a real number and is codified as in Rissanen [24].

Any Ltree tree is coded similarly, taking into account that the number of attributes is variable along the tree. We codify the coefficients that define a hyper-plane as a vector of real numbers. The coefficients of eliminated attributes are coded as 0.

To codify a misclassified example, we need to identify the example and provide the correct class. To identify the example we need log2(#Examples) bits. As such, to codify the errors we need #errors·log2(#Examples) + #errors·log2(#Classes) bits.
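A sketch of the code-length bookkeeping for a univariate tree under this schema (the real-number coding of cut points, which follows Rissanen [24], is abstracted into a single assumed constant here).

```python
import math

REAL_BITS = 32   # assumed cost of coding one real number (cut point or hyper-plane coefficient)

def leaf_bits(n_classes):
    return math.log2(3) + math.log2(n_classes)                   # node type + class label

def discrete_test_bits(n_attributes, n_values):
    return math.log2(3) + math.log2(n_attributes) + n_values * math.log2(n_values)

def continuous_test_bits(n_attributes):
    return math.log2(3) + math.log2(n_attributes) + REAL_BITS    # attribute + cut point

def exception_bits(n_errors, n_examples, n_classes):
    """Cost of pointing out the misclassified examples and their correct classes."""
    return n_errors * (math.log2(n_examples) + math.log2(n_classes))

# Example: a stump with one continuous test and two leaves, 5 errors, 150 examples, 3 classes.
total = continuous_test_bits(4) + 2 * leaf_bits(3) + exception_bits(5, 150, 3)
print(round(total, 1))
```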

References

[1] L. Breiman, Bias, variance and arcing classifiers, Technical Report 460, Statistics Department, University of California, 1996.
[2] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth International Group, 1984.
[3] C. Brodley, P. Utgoff, Multivariate versus univariate decision trees, COINS Technical Report 92-8, University of Massachusetts, 1992.
[4] C. Brodley, P. Utgoff, Multivariate trees, Machine Learning 19 (1995).
[5] W. Buntine, A Theory of Learning Classification Rules, Ph.D. thesis, University of Sydney, 1990.
[6] B. Cestnik, I. Kononenko, I. Bratko, Assistant 86: A knowledge-elicitation tool for sophisticated users, in: I. Bratko, N. Lavrac (Eds.), European Working Session on Learning - EWSL87, Sigma Press, England, 1987.
[7] W. Dillon, M. Goldstein, Multivariate Analysis, Methods and Applications, Wiley, New York, 1984.
[8] F. Esposito, D. Malerba, G. Semeraro, Decision tree pruning as a search in the state space, in: P. Brazdil (Ed.), Machine Learning ECML93, LNAI 667, Springer, Berlin, 1993.
[9] A. Ittner, M. Schlosser, Non-linear decision trees, in: L. Saitta (Ed.), Machine Learning: Proceedings of the 13th International Conference, Morgan Kaufmann, 1996.
[10] M. Jordan, R. Jacobs, Hierarchical mixtures of experts and the EM algorithm, Neural Computation 6 (1994).
[11] R. Kohavi, D. Wolpert, Bias plus variance decomposition for zero-one loss functions, in: L. Saitta (Ed.), Machine Learning: Proceedings of the 13th International Conference, Morgan Kaufmann, 1996.
[12] W. Loh, N. Vanichsetakul, Tree-structured classification via generalized discriminant analysis, Journal of the American Statistical Association, 1988.
[13] O. Mangasarian, R. Setiono, W. Wolberg, Pattern recognition via linear programming: theory and applications to medical diagnosis, in: T. Coleman, Y. Li (Eds.), SIAM, 1990.
[14] C. Matheus, L. Rendell, Constructive induction on decision trees, in: Proceedings of IJCAI, Morgan Kaufmann, 1989.
[15] C. Merz, P. Murphy, UCI repository of machine learning databases, 1998.
[16] D. Michie, D. Spiegelhalter, C. Taylor, Machine Learning, Neural and Statistical Classification, Ellis Horwood, Chichester, UK, 1994.
[17] S. Murthy, S. Kasif, S. Salzberg, A system for induction of oblique decision trees, Journal of Artificial Intelligence Research, 1994.
[18] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, Los Altos, CA, 1988.
[19] P. Pompe, A. Feelders, Using machine learning, neural networks and statistics to predict corporate bankruptcy: a comparative study, in: Phillip Ein-Dor (Ed.), Artificial Intelligence in Economics and Management, Kluwer Academic Publishers, Dordrecht, 1996.
[20] R. Quinlan, Discovering rules by induction from large collections of examples, in: Donald Michie (Ed.), Expert Systems in the Microelectronic Age, Edinburgh University Press, 1979.
[21] R. Quinlan, Learning with continuous classes, in: Adams, Sterling (Eds.), Proceedings of AI'92, World Scientific, Singapore, 1992.
[22] R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, CA, 1993.
[23] B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996.
[24] R. Rissanen, A universal prior for integers and estimation by minimum description length, The Annals of Statistics 11 (1983).
[25] C.S. Wallace, J.D. Patrick, Coding decision trees, Machine Learning 11 (1993) 7–22.
[26] D. Wolpert, Stacked generalization, Neural Networks 5, Pergamon Press, 1992.
[27] S. Yip, G. Webb, Incorporating canonical discriminant attributes in classification learning, in: Proceedings of the 10th Canadian Conference on Artificial Intelligence, Morgan Kaufmann, Los Altos, CA, 1994.