
Classification Trees and MARS

STA450S/4000S: Topics in Statistics. Statistical Aspects of Data Mining

Ana-Maria Staicu

Recap: Regression Trees

CART (classification and regression trees) is a method developed by Breiman, Friedman, Olshen and Stone to classify data on the basis of some of the variables. It is also known as recursive partitioning.

Basic idea: construct a tree that separates the data in the "best" way by finding binary splits on the variables: at each stage, find the best splitting variable and the best splitting point. The routine is recursive. Usually the process stops when some minimum node size, say 5 observations per node, is reached.

Once the tree has been grown, a cost-complexity criterion is used to prune it. The tuning parameter α governs the tradeoff between tree size and goodness of fit to the data.
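For completeness (this is the standard criterion from Hastie et al., §9.2, not written out on the slide): for a subtree T with |T| terminal nodes, pruning minimizes

C_α(T) = ∑_{m=1}^{|T|} N_m Q_m(T) + α|T|,

where N_m is the number of observations in terminal node m and Q_m(T) is the node impurity (for regression, the mean squared deviation about the node mean). Larger α penalizes tree size more heavily and yields smaller trees.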

Classification Trees

For trees, R uses either the package tree or the package rpart.

The target variable Y takes values 1, 2, . . . , K.

One basic difference between classification and regression trees is the action that takes place at the splits:

Regression tree: we try to minimize the sum of squared errors between the true values and the "predicted" values (the "predicted" value is the mean of all responses on either side of the split).

Classification tree: we try to minimize a measure of node impurity (a loss function):

Node impurity measures

Misclassification error: (1/N_m) ∑_{i∈R_m} I(y_i ≠ k(m)), where k(m) = arg max_k p̂_{mk}.
Note that p̂_{mk} = (1/N_m) ∑_{i∈R_m} I(y_i = k) is the proportion of class-k observations in node m.

Gini index: ∑_{k≠k′} p̂_{mk} p̂_{mk′} = ∑_{k=1}^{K} p̂_{mk}(1 − p̂_{mk}).

Cross-entropy (deviance): −∑_{k=1}^{K} p̂_{mk} log p̂_{mk}.

When growing the tree: choose either the Gini index or cross-entropy; one reason is their differentiability. The Gini index is the default in R.

When pruning the tree: any of the three can be used; misclassification error is typically used.
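The three measures are simple functions of the class counts in a node. A minimal sketch (impurity is a helper written for this handout, not a library function):

# Node impurity measures computed from the class counts in a node.
impurity <- function(counts) {
  p <- counts / sum(counts)                       # class proportions p_mk
  miscl   <- 1 - max(p)                           # misclassification error
  gini    <- sum(p * (1 - p))                     # Gini index
  entropy <- -sum(ifelse(p > 0, p * log(p), 0))   # cross-entropy / deviance
  c(misclass = miscl, gini = gini, entropy = entropy)
}

impurity(c(40, 5, 5))    # a fairly pure node: all three measures are small
impurity(c(17, 17, 16))  # a very mixed node: all three near their maxima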

§9.2.4 Other Issues

Handling unordered inputs

If an input X_j has q ordered possible values, there are q − 1 possible partitions into 2 groups.

If an input X_j is categorical, with q unordered possible values, there are 2^{q−1} − 1 possible partitions into 2 groups.

Solution (for a 0-1 or quantitative outcome): order the predictor classes according to the proportion falling in outcome class 1, then split the predictor X_j as if the values were ordered. This yields the optimal split in terms of squared error or the Gini index; see Breiman et al., Classification and Regression Trees. A short sketch of the ordering trick follows.
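A sketch on simulated data (variable names are mine); after ordering the levels, only q − 1 candidate splits need to be examined:

# Order the levels of a factor x by the proportion of outcome class 1,
# then treat the ordered levels like a numeric input.
set.seed(1)
x <- factor(sample(letters[1:4], 100, replace = TRUE))
y <- rbinom(100, 1, ifelse(x %in% c("b", "d"), 0.8, 0.2))

prop1 <- tapply(y, x, mean)        # proportion of class 1 in each level
ord   <- names(sort(prop1))        # levels ordered by that proportion
x.ord <- as.numeric(factor(x, levels = ord))
# Candidate splits are now x.ord <= c for c = 1, ..., q - 1,
# instead of all 2^(q-1) - 1 subsets of the levels.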

§9.2.4 The Loss Matrix

The consequences of misclassifying an observation may vary with the classes.

Define a K × K loss matrix L, where L_{kk′} is the loss for misclassifying a class-k observation as one of class k′. Evidently L_{kk} = 0.

How to incorporate the losses into the modeling process?

Case K = 2: weight observations in class 1 by L_{12}, and observations in class 2 by L_{21}.

Case K > 2: if L_{kk′} is a function only of k, not of k′, weight observations in class k by L_{kk′}. In a terminal node m the class k(m) = arg min_k ∑_l L_{lk} p̂_{ml} is assigned. To incorporate the loss into the growing process, modify the Gini index to ∑_{k≠k′} L_{kk′} p̂_{mk} p̂_{mk′}.
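In rpart a loss matrix can be passed through the parms argument; a minimal sketch on simulated data (the 5-to-1 cost ratio is arbitrary):

library(rpart)

# Two-class problem where misclassifying class "1" as "2" is 5 times
# as costly as the reverse. Rows of the loss matrix index the true class,
# columns the predicted class; the diagonal must be zero.
set.seed(2)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(200) > 0, "1", "2"))

L <- matrix(c(0, 5,
              1, 0), nrow = 2, byrow = TRUE)
fit <- rpart(y ~ x1 + x2, data = d, parms = list(loss = L))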

§9.2.4 Missing Predictor Values

In general, two approaches:

1) discard the observations with missing values;

2) impute the missing values, e.g. by the mean of the predictor over the non-missing observations.

Tree-based methods offer two further options:

1) make a new category "NA" for the missing values of a categorical predictor;

2) use surrogate variables.

At any split, alternative splitting variables and corresponding splitting points are determined when the model is built. The first surrogate split is the one that best mimics the split of the training data achieved by the primary split, and so on. In the absence of the primary splitting predictor, the surrogate splits are used in order.
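rpart computes surrogate splits by default, and their use can be tuned through rpart.control; a sketch on simulated data with some values of x1 deleted:

library(rpart)

set.seed(5)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(200) > 0, "yes", "no"))
d$x1[sample(200, 30)] <- NA   # make 30 values of a splitting variable missing

fit <- rpart(y ~ x1 + x2, data = d,
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))
summary(fit)   # the printout lists the surrogate splits at each node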

Binary splits

Multi-way splits would split the data too quickly, leaving insufficient data at the next level down. Moreover, a multi-way split can be expressed as a series of binary splits.

Linear combination splits

Choose a split of the form ∑_j a_j X_j ≤ c instead of the form X_j ≤ c. Consequences:

1) it improves the predictive power of the tree;

2) it reduces its interpretability.

Alternative: HME (hierarchical mixtures of experts).

Advantages

Trees are easy to interpret.

Trees can handle multicollinearity.

Tree-based methods are nonparametric (essentially assumption-free).

Disadvantages

High variance, caused by the hierarchical nature of the process: an error in the top split is propagated down to all of the splits below it. Even a more stable split criterion does not remove the instability.

Lack of smoothness of the prediction surface (MARS alleviates this).

Difficulty in modeling additive structure (MARS captures this).

Some code for trees

library(MASS)
library(rpart)

cpus.rp <- rpart(log10(perf) ~ ., cpus[, 2:8], cp = 1e-3)
summary(cpus.rp)
printcp(cpus.rp)
# Regression tree:
# rpart(formula = log10(perf) ~ ., data = cpus[, 2:8], cp = 0.001)
# Variables actually used in tree construction:
# [1] cach chmax chmin mmax syct
# Root node error: 43.116/209 = 0.20629
#          CP nsplit rel error  xerror     xstd
# 1 0.5492697      0   1.00000 1.02128 0.098176
# 2 0.0893390      1   0.45073 0.47818 0.048282

cpus.rp.pr <- prune(cpus.rp, cp = 0.006)
post(cpus.rp.pr, title = "Plot of rpart object cpus.rp.pr",
     filename = "C:\\AM\\CpusTree.eps", horizontal = FALSE, pointsize = 8)

[Figure: "Plot of rpart object cpus.rp.pr", the pruned regression tree for the cpus data. Splits use cach, mmax, syct and chmin; the root node has n = 209 and mean log10(perf) 1.753, and the terminal-node fitted values range from 1.089 (n = 12) to 2.667 (n = 18).]

Some code for trees

library(MASS)
library(tree)

fgl.tr <- tree(type ~ ., fgl)
summary(fgl.tr)
# Classification tree: tree(formula = type ~ ., data = fgl)
# Number of terminal nodes: 20
# Residual mean deviance: 0.6853 = 133 / 194
# Misclassification error rate: 0.1542 = 33 / 214

fgl.tr1 <- snip.tree(fgl.tr, nodes = 9)
# The nodes could also be snipped off interactively, by clicking with
# the mouse on a terminal node: fgl.tr1 <- snip.tree(fgl.tr)

fgl.cv <- cv.tree(fgl.tr, , FUN = prune.tree, K = 10)
# The algorithm randomly divides the training set, so average several runs:
for(i in 2:5)
  fgl.cv$dev <- fgl.cv$dev + cv.tree(fgl.tr, , prune.tree)$dev
fgl.cv$dev <- fgl.cv$dev / 5
plot(fgl.cv)
title("Cross-validation plot for pruning")

[Figures: "Pruning: choosing parameter cp", a plot of cross-validated relative error against cp (Inf down to 0.0012) and tree size (1 to 16) for the cpus tree; and "Cross-validation plot for pruning", a plot of deviance against tree size (5 to 20) for the fgl tree.]

§9.4 MARS

For the regression tree process, the data were partitioned in the way that produced the "best" split, with reference to the deviances from the mean on either side of the split.

For MARS, a similar process is used to find the best split, with reference to the deviances from a spline function on either side of the split.

There are commercial versions with an interface to R on Jerome Friedman's home page; see Friedman, J. (1991): Multivariate Adaptive Regression Splines. Annals of Statistics, 19:1, 1-141. A free version comes with the package mda.

The spline functions used by MARS are

(X − t)_+ = X − t if X > t, and 0 otherwise;
(t − X)_+ = t − X if X < t, and 0 otherwise.

Each function is piecewise linear. By multiplying these splines together it is possible to produce quadratic or cubic curves. The pair of functions (X − t)_+, (t − X)_+ is called a reflected pair, while t is called a knot.

MARS uses the collection of basis functions

C = { (X_j − t)_+, (t − X_j)_+ : t ∈ {x_{1j}, . . . , x_{Nj}}, j = 1, 2, . . . , p }.

Recall: a regression tree uses the basis functions I(X_j > c) and I(X_j ≤ c).
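A reflected pair is one line of R each; a minimal sketch (the helper names are mine):

# Reflected pair of piecewise-linear ("hinge") functions with knot t.
hinge.pos <- function(x, t) pmax(x - t, 0)   # (x - t)+
hinge.neg <- function(x, t) pmax(t - x, 0)   # (t - x)+

x <- seq(0, 2, by = 0.5)
cbind(x, pos = hinge.pos(x, 1), neg = hinge.neg(x, 1))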

The model is of the form

f(X) = β_0 + ∑_{m=1}^{M} β_m h_m(X),   (1)

where each h_m(X) is a function in C or a product of functions in C. M = {h_0(X), . . . , h_M(X)} is the set of all functions included in the model.

How would the model be built if the model functions were known? If the functions h_m(X) were known, the coefficients β_0, . . . , β_M would be determined by minimizing the residual sum of squares. The model-building strategy is similar to stepwise linear regression, using functions of the form h_m(X) in place of the original inputs.
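Once a set of basis functions is fixed, the fit is ordinary least squares; a sketch on simulated data with a single knot at 0.5 (chosen for illustration):

# True model: y = 2 + 3*(x - 0.5)+ - 1*(0.5 - x)+ + noise.
set.seed(3)
x <- runif(100)
y <- 2 + 3 * pmax(x - 0.5, 0) - pmax(0.5 - x, 0) + rnorm(100, sd = 0.1)

fit <- lm(y ~ pmax(x - 0.5, 0) + pmax(0.5 - x, 0))
coef(fit)   # approximately (2, 3, -1)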

MARS model functions

Step 1
Start with h_0(X) = 1; f̂^(1) = β̂_0^(1). M^(1) = {h_0(X)}.

Step 2
Add to the model a function of the form b_1 (X_j − t)_+ + b_2 (t − X_j)_+, with t ∈ {x_{1j}, . . . , x_{Nj}}, that produces the largest decrease in training error. Say this is achieved by j = J and t = x_{kJ}.
Model: f̂^(2) = β̂_0^(2) + β̂_1^(2) (X_J − x_{kJ})_+ + β̂_2^(2) (x_{kJ} − X_J)_+.
M^(2) = {h_0(X), h_1(X), h_2(X)}, with h_1(X) = (X_J − x_{kJ})_+, etc.

Step m + 1
Add to the model a function of the form b_{2m−1} h_l(X)(X_j − t)_+ + b_{2m} h_l(X)(t − X_j)_+, with h_l(X) ∈ M^(m), that produces the largest decrease in training error. Say this is achieved by j = J′, t = x_{k′J′} and l = L. Then
M^(m+1) = M^(m) ∪ {h_{2m−1}(X), h_{2m}(X)}, where h_{2m−1}(X) = h_L(X)(X_{J′} − x_{k′J′})_+ and h_{2m}(X) = h_L(X)(x_{k′J′} − X_{J′})_+.

The algorithm stops when the model set contains some preset number of terms.
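The free implementation mentioned earlier is mda::mars; a minimal usage sketch on simulated data (the argument values and the component inspected are illustrative):

library(mda)

set.seed(4)
X <- matrix(runif(200 * 3), ncol = 3)
y <- sin(2 * pi * X[, 1]) + 2 * pmax(X[, 2] - 0.5, 0) + rnorm(200, sd = 0.1)

fit <- mars(X, y, degree = 2)   # allow products of up to two hinge terms
fit$gcv                         # GCV score of the selected model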

At the end of this process we have a large model of the form (1) (equation (9.19) in Hastie et al.), which most probably overfits the data, so a backward deletion procedure is applied.

At each stage, the term whose removal causes the smallest increase in the residual sum of squares is deleted from the model.

The tuning parameter λ governs the tradeoff between the size of the model and its goodness of fit to the data. The optimal value of λ is estimated by the generalized cross-validation criterion:

GCV(λ) = ∑_{i=1}^{N} (y_i − f̂_λ(x_i))² / (1 − M(λ)/N)².

M(λ) is the effective number of parameters used in the model, namely the number of terms in the model plus the number of parameters used to select the optimal positions of the knots (3 parameters per knot).
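The criterion transcribes directly into code; a sketch (the effective parameter count M.eff must be supplied by the caller):

# Generalized cross-validation score for observed y, fitted values yhat,
# and effective number of parameters M.eff (# terms + 3 per selected knot).
gcv <- function(y, yhat, M.eff) {
  N <- length(y)
  sum((y - yhat)^2) / (1 - M.eff / N)^2
}

# E.g. for a 5-term model with 2 selected knots: M.eff = 5 + 3 * 2.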

Advantages:

Using piecewise linear basis functions, the regression surface is built up parsimoniously.

MARS is not computationally intensive: for the piecewise linear functions, the reflected pair with the rightmost knot is fitted first, and the knot is then moved successively one position at a time to the left.

Limitations:

The hierarchical (forward) modeling strategy: the philosophy is that a higher-order interaction will likely exist only if some of its lower-order "footprints" exist as well.

A restriction in the formation of model terms: each input can appear at most once in a product.