Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Introduction to Machine Learning
CART: Splitting Criteria
compstat-lmu.github.io/lecture_i2ml
TREES
Classification Tree:
●●● ●●
●
●
●●
●
● ●
●●
●
●●
● ●●
●
●
●
●
●●
●
●● ●●
●
●
● ●● ●
●
● ●
●●
●
●
●
●
●● ●●
0.0
0.5
1.0
1.5
2.0
2.5
2 4 6Petal.Length
Pet
al.W
idth Species
● setosa
versicolor
virginica
Iris Data
Regression Tree:
-1.20 -0.42 0.98 -0.20 -0.01
c© Introduction to Machine Learning – 1 / 12
SPLITTING CRITERIA
How to find good splitting rules to define the tree?
=⇒ empirical risk minimization
c© Introduction to Machine Learning – 2 / 12
SPLITTING CRITERIA: FORMALIZATION
Let N ⊆ D be the data that is assigned to a terminal node N of atree.
Let c be the predicted constant value for the data assigned to N :y ≡ c for all (x , y) ∈ N .
Then the risk R(N ) for a leaf is simply the average loss for thedata assigned to that leaf under a given loss function L:
R(N ) =1|N |
∑(x,y)∈N
L(y , c)
The prediction is given by the optimal constant c = arg mincR(N )
c© Introduction to Machine Learning – 3 / 12
SPLITTING CRITERIA: FORMALIZATION
A split w.r.t. feature xj at split point t divides a parent nodeN into
N1 = {(x , y) ∈ N : xj ≤ t} and N2 = {(x , y) ∈ N : xj > t}.
In order to evaluate how good a split is, we compute the empiricalrisks in both child nodes and sum it up
R(N , j, t) =|N1||N |R(N1) +
|N2||N |R(N2)
=1|N |
∑(x,y)∈N1
L(y , c1) +∑
(x,y)∈N2
L(y , c2)
finding the best way to split N into N1,N2 means solving
arg minj,tR(N , j, t)
c© Introduction to Machine Learning – 4 / 12
SPLITTING CRITERIA: REGRESSION
For regression trees, we usually use L2 loss:
R(N ) =1|N |
∑(x,y)∈N
(y − c)2
The best constant prediction under L2 is the mean
c = yN =1|N |
∑(x,y)∈N
y
c© Introduction to Machine Learning – 5 / 12
SPLITTING CRITERIA: REGRESSION
This means the best split is the one that minimizes the (pooled)variance of the target distribution in the child nodes N1 and N2:
We can also interpret this as a way of measuring the impurity ofthe target distribution, i.e., how much it diverges from a constant ineach of the child nodes.
For L1 loss, c is the median of y ∈ N .
c© Introduction to Machine Learning – 6 / 12
SPLITTING CRITERIA: CLASSIFICATION
Typically uses either Brier score (so: L2 loss on probabilities) orBernoulli loss (as in logistic regression) as loss functions
Predicted probabilities in node N are simply the class proportionsin the node:
π(N )k =
1|N |
∑(x,y)∈N
I(y = k)
This is the optimal prediction under both the logistic / Bernoulli lossand the Brier loss.
0.0
0.2
0.4
0.6
1 2 3Label
Cla
ss p
rob.
c© Introduction to Machine Learning – 7 / 12
SPLITTING CRITERIA: COMMENTS
Splitting criteria for trees are usually defined in terms of "impurityreduction". Instead of minimizing empirical risk in the child nodesover all possible splits, a measure of “impurity” of the distribution ofthe target y in the child nodes is minimized.
For regression trees, the “impurity” of a node is usuallly defined asthe variance of the y (i) in the node. Minimizing this “varianceimpurity” is equivalent to minimizing the squared error loss for apredicted constant in the nodes.
c© Introduction to Machine Learning – 8 / 12
SPLITTING CRITERIA: COMMENTS
Minimizing the Brier score is equivalent to minimizing the Giniimpurity
I(N ) =
g∑k=1
π(N )k (1− π(N )
k )
Minimizing the Bernoulli loss is equivalent to minimizing entropyimpurity
I(N ) = −g∑
k=1
π(N )k log π
(N )k
The approach based on loss functions instead of impuritymeasures is simpler and more straightforward, mathematicallyequivalent and shows that growing a tree can be understood interms of empirical risk minimization.
c© Introduction to Machine Learning – 9 / 12
SPLITTING WITH MISCLASSIFICATION LOSS
Why don’t we use the misclassification loss for classification trees?I.e., always predict the majority class in each child node and counthow many errors we make.
In many other cases, we are interested in minimizing this kind oferror, but have to approximate it by some other criterion insteadsince the misclassification loss does not have derivatives that wecan use for optimization.We don’t need derivatives when we optimize the tree, so we couldgo for it!
This is possible, but Brier score and Bernoulli loss are moresensitive to changes in the node probabilities, and therefore oftenpreferred
c© Introduction to Machine Learning – 10 / 12
SPLITTING WITH MISCLASSIFICATION LOSS
Example: two-class problem with 400 obs in each class and twopossible splits:
Split 1:
class 0 class 1N1 300 100N2 100 300
Split 2:
class 0 class 1N1 400 200N2 0 200
Both splits are equivalent in terms of misclassification error, theyeach misclassify 200 observations.
But: Split 2 produces one pure node and is probably preferable.
Brier loss (Gini impurity) and Bernoulli loss (entropy impurity)prefer the second split
c© Introduction to Machine Learning – 11 / 12
SPLITTING WITH MISCLASSIFICATION LOSS
Calculation for Gini:
Split 1 :|N1||N |· 2 · π(N1)
0 π(N1)1 +
|N2||N |· 2 · π(N2)
0 π(N2)1 =
34· 1
4+
14· 3
4=
316
Split 2 :34· 2 · 2
3· 1
3+
14· 0 · 1 =
13
c© Introduction to Machine Learning – 12 / 12