Introduction to Machine Learning · c Introduction to Machine Learning –6 / 12. SPLITTING CRITERIA: CLASSIFICATION Typically uses either Brier score (so: L2 loss on probabilities)

Introduction to Machine Learning

CART: Splitting Criteria

compstat-lmu.github.io/lecture_i2ml

https://compstat-lmu.github.io/lecture_i2ml/

TREES

Classification Tree:

●●● ●●

●

●

●●

●

● ●

●●

●

●●

● ●●

●

●

●

●

●●

●

●● ●●

●

●

● ●● ●

●

● ●

●●

●

●

●

●

●● ●●

0.0

0.5

1.0

1.5

2.0

2.5

2 4 6Petal.Length

Pet

al.W

idth Species

● setosa

versicolor

virginica

Iris Data

Regression Tree:

-1.20 -0.42 0.98 -0.20 -0.01

c© Introduction to Machine Learning – 1 / 12

SPLITTING CRITERIA

How to find good splitting rules to define the tree?

=⇒ empirical risk minimization


SPLITTING CRITERIA: FORMALIZATION

Let N ⊆ D be the data that is assigned to a terminal node N of atree.

Let c be the predicted constant value for the data assigned to N :y ≡ c for all (x , y) ∈ N .

Then the risk R(N ) for a leaf is simply the average loss for thedata assigned to that leaf under a given loss function L:

R(N ) =1|N |

∑(x,y)∈N

L(y , c)

The prediction is given by the optimal constant c = arg mincR(N )


SPLITTING CRITERIA: FORMALIZATION

A split w.r.t. feature xj at split point t divides a parent nodeN into

N1 = {(x , y) ∈ N : xj ≤ t} and N2 = {(x , y) ∈ N : xj > t}.

In order to evaluate how good a split is, we compute the empiricalrisks in both child nodes and sum it up

R(N , j, t) =|N1||N |R(N1) +

|N2||N |R(N2)

=1|N |

∑(x,y)∈N1

L(y , c1) +∑

(x,y)∈N2

L(y , c2)

finding the best way to split N into N1,N2 means solving

arg minj,tR(N , j, t)


SPLITTING CRITERIA: REGRESSION

For regression trees, we usually use L2 loss:

R(N ) =1|N |

∑(x,y)∈N

(y − c)2

The best constant prediction under L2 is the mean

c = yN =1|N |

∑(x,y)∈N

y


SPLITTING CRITERIA: REGRESSION

This means the best split is the one that minimizes the (pooled)variance of the target distribution in the child nodes N1 and N2:

We can also interpret this as a way of measuring the impurity ofthe target distribution, i.e., how much it diverges from a constant ineach of the child nodes.

For L1 loss, c is the median of y ∈ N .


SPLITTING CRITERIA: CLASSIFICATION

Typically uses either Brier score (so: L2 loss on probabilities) orBernoulli loss (as in logistic regression) as loss functions

Predicted probabilities in node N are simply the class proportionsin the node:

π(N )k =

1|N |

∑(x,y)∈N

I(y = k)

This is the optimal prediction under both the logistic / Bernoulli lossand the Brier loss.

0.0

0.2

0.4

0.6

1 2 3Label

Cla

ss p

rob.


SPLITTING CRITERIA: COMMENTS

Splitting criteria for trees are usually defined in terms of "impurityreduction". Instead of minimizing empirical risk in the child nodesover all possible splits, a measure of “impurity” of the distribution ofthe target y in the child nodes is minimized.

For regression trees, the “impurity” of a node is usuallly defined asthe variance of the y (i) in the node. Minimizing this “varianceimpurity” is equivalent to minimizing the squared error loss for apredicted constant in the nodes.


SPLITTING CRITERIA: COMMENTS

Minimizing the Brier score is equivalent to minimizing the Giniimpurity

I(N ) =

g∑k=1

π(N )k (1− π(N )

k )

Minimizing the Bernoulli loss is equivalent to minimizing entropyimpurity

I(N ) = −g∑

k=1

π(N )k log π

(N )k

The approach based on loss functions instead of impuritymeasures is simpler and more straightforward, mathematicallyequivalent and shows that growing a tree can be understood interms of empirical risk minimization.


SPLITTING WITH MISCLASSIFICATION LOSS

Why don’t we use the misclassification loss for classification trees?I.e., always predict the majority class in each child node and counthow many errors we make.

In many other cases, we are interested in minimizing this kind oferror, but have to approximate it by some other criterion insteadsince the misclassification loss does not have derivatives that wecan use for optimization.We don’t need derivatives when we optimize the tree, so we couldgo for it!

This is possible, but Brier score and Bernoulli loss are moresensitive to changes in the node probabilities, and therefore oftenpreferred



Example: two-class problem with 400 obs in each class and twopossible splits:

Split 1:

class 0 class 1N1 300 100N2 100 300

Split 2:

class 0 class 1N1 400 200N2 0 200

Both splits are equivalent in terms of misclassification error, theyeach misclassify 200 observations.

But: Split 2 produces one pure node and is probably preferable.

Brier loss (Gini impurity) and Bernoulli loss (entropy impurity)prefer the second split



Calculation for Gini:

Split 1 :|N1||N |· 2 · π(N1)

0 π(N1)1 +

|N2||N |· 2 · π(N2)

0 π(N2)1 =

34· 1

4+

14· 3

4=

316

Split 2 :34· 2 · 2

3· 1

3+

14· 0 · 1 =

13


Documents

Introduction to Machine Learning · c Introduction to Machine Learning –6 / 12. SPLITTING CRITERIA: CLASSIFICATION Typically uses either Brier score (so: L2 loss on probabilities)