
Classification Trees and MARS

STA450S/4000S: Topics in Statistics. Statistical Aspects of Data Mining

Ana-Maria Staicu

Recap: Regression Trees

CART (classification and regression trees) is a method developed by Breiman, Friedman, Olshen and Stone to classify data on the basis of some of the variables. It is also known as recursive partitioning.

Basic idea: construct a tree that separates the data in the "best" way by finding binary splits on the variables: at each stage, find the best splitting variable and the best splitting point. The routine is recursive. Usually the process stops when some minimum node size, say 5 observations per node, is reached.

Once the tree has been grown, a cost-complexity criterion is used to prune it. The tuning parameter α governs the tradeoff between tree size and goodness of fit to the data.
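For completeness (this is the standard criterion from Hastie et al., §9.2, not written out on the slide): for a subtree T with |T| terminal nodes, pruning minimizes

C_α(T) = ∑_{m=1}^{|T|} N_m Q_m(T) + α|T|,

where N_m is the number of observations in terminal node m and Q_m(T) is the node impurity (for regression, the mean squared deviation about the node mean). Larger α penalizes tree size more heavily and yields smaller trees.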

Classification Trees

For trees, R uses either the package tree or the package rpart.

The target variable Y takes values 1, 2, . . . , K.

One basic difference between classification and regression trees is the action that takes place at the splits:

Regression tree: we try to minimize the sum of squared errors between the true values and the "predicted" values (the "predicted" value is the mean of all responses on either side of the split).

Classification tree: we try to minimize a measure of node impurity (a loss function):

Node impurity measures

Misclassification error: (1/N_m) ∑_{i∈R_m} I(y_i ≠ k(m)), where k(m) = arg max_k p̂_{mk}.
Note that p̂_{mk} = (1/N_m) ∑_{i∈R_m} I(y_i = k) is the proportion of class-k observations in node m.

Gini index: ∑_{k≠k′} p̂_{mk} p̂_{mk′} = ∑_{k=1}^{K} p̂_{mk}(1 − p̂_{mk}).

Cross-entropy (deviance): −∑_{k=1}^{K} p̂_{mk} log p̂_{mk}.

When growing the tree: choose either the Gini index or cross-entropy; one reason is their differentiability. The Gini index is the default in R.

When pruning the tree: any of the three can be used; misclassification error is typically used.
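The three measures are simple functions of the class counts in a node. A minimal sketch (impurity is a helper written for this handout, not a library function):

# Node impurity measures computed from the class counts in a node.
impurity <- function(counts) {
  p <- counts / sum(counts)                       # class proportions p_mk
  miscl   <- 1 - max(p)                           # misclassification error
  gini    <- sum(p * (1 - p))                     # Gini index
  entropy <- -sum(ifelse(p > 0, p * log(p), 0))   # cross-entropy / deviance
  c(misclass = miscl, gini = gini, entropy = entropy)
}

impurity(c(40, 5, 5))    # a fairly pure node: all three measures are small
impurity(c(17, 17, 16))  # a very mixed node: all three near their maxima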

§9.2.4 Other Issues

Handling unordered inputs

If an input X_j has q ordered possible values, there are q − 1 possible partitions into 2 groups.

If an input X_j is categorical, with q unordered possible values, there are 2^{q−1} − 1 possible partitions into 2 groups.

Solution (for a 0-1 or quantitative outcome): order the predictor classes according to the proportion falling in outcome class 1, then split the predictor X_j as if the values were ordered. This yields the optimal split in terms of squared error or the Gini index; see Breiman et al., Classification and Regression Trees. A short sketch of the ordering trick follows.
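A sketch on simulated data (variable names are mine); after ordering the levels, only q − 1 candidate splits need to be examined:

# Order the levels of a factor x by the proportion of outcome class 1,
# then treat the ordered levels like a numeric input.
set.seed(1)
x <- factor(sample(letters[1:4], 100, replace = TRUE))
y <- rbinom(100, 1, ifelse(x %in% c("b", "d"), 0.8, 0.2))

prop1 <- tapply(y, x, mean)        # proportion of class 1 in each level
ord   <- names(sort(prop1))        # levels ordered by that proportion
x.ord <- as.numeric(factor(x, levels = ord))
# Candidate splits are now x.ord <= c for c = 1, ..., q - 1,
# instead of all 2^(q-1) - 1 subsets of the levels.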

§9.2.4 The Loss Matrix

The consequences of misclassifying an observation may vary with the classes.

Define a K × K loss matrix L, where L_{kk′} is the loss for misclassifying a class-k observation as one of class k′. Evidently L_{kk} = 0.

How to incorporate the losses into the modeling process?

Case K = 2: weight observations in class 1 by L_{12}, and observations in class 2 by L_{21}.

Case K > 2: if L_{kk′} is a function only of k, not of k′, weight observations in class k by L_{kk′}. In a terminal node m the class k(m) = arg min_k ∑_l L_{lk} p̂_{ml} is assigned. To incorporate the loss into the growing process, modify the Gini index to ∑_{k≠k′} L_{kk′} p̂_{mk} p̂_{mk′}.
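In rpart a loss matrix can be passed through the parms argument; a minimal sketch on simulated data (the 5-to-1 cost ratio is arbitrary):

library(rpart)

# Two-class problem where misclassifying class "1" as "2" is 5 times
# as costly as the reverse. Rows of the loss matrix index the true class,
# columns the predicted class; the diagonal must be zero.
set.seed(2)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(200) > 0, "1", "2"))

L <- matrix(c(0, 5,
              1, 0), nrow = 2, byrow = TRUE)
fit <- rpart(y ~ x1 + x2, data = d, parms = list(loss = L))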

§9.2.4 Missing Predictor Values

In general, two approaches:

1) discard the observations with missing values;

2) impute the missing values, e.g. by the mean of the predictor over the non-missing observations.

Tree-based methods offer two further options:

1) make a new category "NA" for the missing values of a categorical predictor;

2) use surrogate variables.

At any split, alternative splitting variables and corresponding splitting points are determined when the model is built. The first surrogate split is the one that best mimics the split of the training data achieved by the primary split, and so on. In the absence of the primary splitting predictor, the surrogate splits are used in order.
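rpart computes surrogate splits by default, and their use can be tuned through rpart.control; a sketch on simulated data with some values of x1 deleted:

library(rpart)

set.seed(5)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(200) > 0, "yes", "no"))
d$x1[sample(200, 30)] <- NA   # make 30 values of a splitting variable missing

fit <- rpart(y ~ x1 + x2, data = d,
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))
summary(fit)   # the printout lists the surrogate splits at each node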

Binary splits

Multi-way splits would split the data too quickly, leaving insufficient data at the next level down. Moreover, a multi-way split can be expressed as a series of binary splits.

Linear combination splits

Choose a split of the form ∑_j a_j X_j ≤ c instead of the form X_j ≤ c. Consequences:

1) it improves the predictive power of the tree;

2) it reduces its interpretability.

Alternative: HME (hierarchical mixtures of experts).

Advantages

Trees are easy to interpret.

Trees can handle multicollinearity.

Tree-based methods are nonparametric (essentially assumption-free).

Disadvantages

High variance, caused by the hierarchical nature of the process: an error in the top split is propagated down to all of the splits below it. Even a more stable split criterion does not remove the instability.

Lack of smoothness of the prediction surface (MARS alleviates this).

Difficulty in modeling additive structure (MARS captures this).

Some code for trees

library(MASS)
library(rpart)

cpus.rp <- rpart(log10(perf) ~ ., cpus[, 2:8], cp = 1e-3)
summary(cpus.rp)
printcp(cpus.rp)
# Regression tree:
# rpart(formula = log10(perf) ~ ., data = cpus[, 2:8], cp = 0.001)
# Variables actually used in tree construction:
# [1] cach chmax chmin mmax syct
# Root node error: 43.116/209 = 0.20629
#          CP nsplit rel error  xerror     xstd
# 1 0.5492697      0   1.00000 1.02128 0.098176
# 2 0.0893390      1   0.45073 0.47818 0.048282

cpus.rp.pr <- prune(cpus.rp, cp = 0.006)
post(cpus.rp.pr, title = "Plot of rpart object cpus.rp.pr",
     filename = "C:\\AM\\CpusTree.eps", horizontal = FALSE, pointsize = 8)

[Figure: "Plot of rpart object cpus.rp.pr", the pruned regression tree for the cpus data. Splits use cach, mmax, syct and chmin; the root node has n = 209 and mean log10(perf) 1.753, and the terminal-node fitted values range from 1.089 (n = 12) to 2.667 (n = 18).]

Some code for trees

library(MASS)
library(tree)

fgl.tr <- tree(type ~ ., fgl)
summary(fgl.tr)
# Classification tree: tree(formula = type ~ ., data = fgl)
# Number of terminal nodes: 20
# Residual mean deviance: 0.6853 = 133 / 194
# Misclassification error rate: 0.1542 = 33 / 214

fgl.tr1 <- snip.tree(fgl.tr, nodes = 9)
# The nodes could also be snipped off interactively, by clicking with
# the mouse on a terminal node: fgl.tr1 <- snip.tree(fgl.tr)

fgl.cv <- cv.tree(fgl.tr, , FUN = prune.tree, K = 10)
# The algorithm randomly divides the training set, so average several runs:
for(i in 2:5)
  fgl.cv$dev <- fgl.cv$dev + cv.tree(fgl.tr, , prune.tree)$dev
fgl.cv$dev <- fgl.cv$dev / 5
plot(fgl.cv)
title("Cross-validation plot for pruning")

[Figures: "Pruning: choosing parameter cp", a plot of cross-validated relative error against cp (Inf down to 0.0012) and tree size (1 to 16) for the cpus tree; and "Cross-validation plot for pruning", a plot of deviance against tree size (5 to 20) for the fgl tree.]

§9.4 MARS

For the regression tree process, the data were partitioned in the way that produced the "best" split, with reference to the deviances from the mean on either side of the split.

For MARS, a similar process is used to find the best split, with reference to the deviances from a spline function on either side of the split.

There are commercial versions with an interface to R on Jerome Friedman's home page; see Friedman, J. (1991): Multivariate Adaptive Regression Splines. Annals of Statistics, 19:1, 1-141. A free version comes with the package mda.

The spline functions used by MARS are

(X − t)_+ = X − t if X > t, and 0 otherwise;
(t − X)_+ = t − X if X < t, and 0 otherwise.

Each function is piecewise linear. By multiplying these splines together it is possible to produce quadratic or cubic curves. The pair of functions (X − t)_+, (t − X)_+ is called a reflected pair, while t is called a knot.

MARS uses the collection of basis functions

C = { (X_j − t)_+, (t − X_j)_+ : t ∈ {x_{1j}, . . . , x_{Nj}}, j = 1, 2, . . . , p }.

Recall: a regression tree uses the basis functions I(X_j > c) and I(X_j ≤ c).
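A reflected pair is one line of R each; a minimal sketch (the helper names are mine):

# Reflected pair of piecewise-linear ("hinge") functions with knot t.
hinge.pos <- function(x, t) pmax(x - t, 0)   # (x - t)+
hinge.neg <- function(x, t) pmax(t - x, 0)   # (t - x)+

x <- seq(0, 2, by = 0.5)
cbind(x, pos = hinge.pos(x, 1), neg = hinge.neg(x, 1))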

The model is of the form

f(X) = β_0 + ∑_{m=1}^{M} β_m h_m(X),   (1)

where each h_m(X) is a function in C or a product of functions in C. M = {h_0(X), . . . , h_M(X)} is the set of all functions included in the model.

How would the model be built if the model functions were known? If the functions h_m(X) were known, the coefficients β_0, . . . , β_M would be determined by minimizing the residual sum of squares. The model-building strategy is similar to stepwise linear regression, using functions of the form h_m(X) in place of the original inputs.
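Once a set of basis functions is fixed, the fit is ordinary least squares; a sketch on simulated data with a single knot at 0.5 (chosen for illustration):

# True model: y = 2 + 3*(x - 0.5)+ - 1*(0.5 - x)+ + noise.
set.seed(3)
x <- runif(100)
y <- 2 + 3 * pmax(x - 0.5, 0) - pmax(0.5 - x, 0) + rnorm(100, sd = 0.1)

fit <- lm(y ~ pmax(x - 0.5, 0) + pmax(0.5 - x, 0))
coef(fit)   # approximately (2, 3, -1)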

MARS model functions

Step 1
Start with h_0(X) = 1; f̂^(1) = β̂_0^(1). M^(1) = {h_0(X)}.

Step 2
Add to the model a function of the form b_1 (X_j − t)_+ + b_2 (t − X_j)_+, with t ∈ {x_{1j}, . . . , x_{Nj}}, that produces the largest decrease in training error. Say this is achieved by j = J and t = x_{kJ}.
Model: f̂^(2) = β̂_0^(2) + β̂_1^(2) (X_J − x_{kJ})_+ + β̂_2^(2) (x_{kJ} − X_J)_+.
M^(2) = {h_0(X), h_1(X), h_2(X)}, with h_1(X) = (X_J − x_{kJ})_+, etc.

Step m + 1
Add to the model a function of the form b_{2m−1} h_l(X)(X_j − t)_+ + b_{2m} h_l(X)(t − X_j)_+, with h_l(X) ∈ M^(m), that produces the largest decrease in training error. Say this is achieved by j = J′, t = x_{k′J′} and l = L. Then
M^(m+1) = M^(m) ∪ {h_{2m−1}(X), h_{2m}(X)}, where h_{2m−1}(X) = h_L(X)(X_{J′} − x_{k′J′})_+ and h_{2m}(X) = h_L(X)(x_{k′J′} − X_{J′})_+.

The algorithm stops when the model set contains some preset number of terms.
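The free implementation mentioned earlier is mda::mars; a minimal usage sketch on simulated data (the argument values and the component inspected are illustrative):

library(mda)

set.seed(4)
X <- matrix(runif(200 * 3), ncol = 3)
y <- sin(2 * pi * X[, 1]) + 2 * pmax(X[, 2] - 0.5, 0) + rnorm(200, sd = 0.1)

fit <- mars(X, y, degree = 2)   # allow products of up to two hinge terms
fit$gcv                         # GCV score of the selected model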

At the end of this process we have a large model of the form (1) (equation (9.19) in Hastie et al.), which most probably overfits the data, so a backward deletion procedure is applied.

At each stage, the term whose removal causes the smallest increase in the residual sum of squares is deleted from the model.

The tuning parameter λ governs the tradeoff between the size of the model and its goodness of fit to the data. The optimal value of λ is estimated by the generalized cross-validation criterion:

GCV(λ) = ∑_{i=1}^{N} (y_i − f̂_λ(x_i))² / (1 − M(λ)/N)².

M(λ) is the effective number of parameters used in the model, namely the number of terms in the model plus the number of parameters used to select the optimal positions of the knots (3 parameters per knot).
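The criterion transcribes directly into code; a sketch (the effective parameter count M.eff must be supplied by the caller):

# Generalized cross-validation score for observed y, fitted values yhat,
# and effective number of parameters M.eff (# terms + 3 per selected knot).
gcv <- function(y, yhat, M.eff) {
  N <- length(y)
  sum((y - yhat)^2) / (1 - M.eff / N)^2
}

# E.g. for a 5-term model with 2 selected knots: M.eff = 5 + 3 * 2.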

Advantages:

Using piecewise linear basis functions, the regression surface is built up parsimoniously.

MARS is not computationally intensive: for the piecewise linear functions, the reflected pair with the rightmost knot is fitted first, and the knot is then moved successively one position at a time to the left.

Limitations:

The hierarchical (forward) modeling strategy: the philosophy is that a higher-order interaction will likely exist only if some of its lower-order "footprints" exist as well.

A restriction in the formation of model terms: each input can appear at most once in a product.