Hanyang University, Quest Lab.

Chapter 9
Classification and regression trees

Fall 2014, Department of IME, Hanyang University
9.2 Classification trees

• A data-driven method for classification (classification trees) and prediction (regression trees)
• The tree method was developed by Breiman et al. → CART
• Observations are separated into subgroups based on the predictors
• Goal: classify or predict an outcome based on a set of predictors
• The output is a set of rules
• Built by recursive partitioning and pruning, guided by the homogeneity of the resulting subgroups
9.1 Introduction

Example:
• Goal: classify a record as "will accept credit card offer" or "will not accept"
• A rule might be "IF (Income > 92.5) AND (Education < 1.5) AND (Family ≤ 2.5) THEN Class = 0 (nonacceptor)"
• Also called CART, decision trees, or just trees
• Rules are represented by tree diagrams
9.1 Introduction

[Figure: classification tree for the example. The number in a circle (decision) node is the splitting value, with the split predictor labeled at the node; a terminal (square) node is labeled acceptor (1) or nonacceptor (0); the number on each fork is the number of records.]
9.1 Introduction

"IF (Income > 92.5) AND (Education < 1.5) AND (Family ≤ 2.5) THEN Class = 0 (nonacceptor)"
9.2 Classification trees

Recursive partitioning
• Choose a predictor x_j and a split value s that divide the p-dimensional predictor space into two parts: {(x_1, …, x_p) : x_j ≤ s} and {(x_1, …, x_p) : x_j > s}
• Measure how "pure" or homogeneous each of the resulting parts is ("pure" = containing records of mostly one class)
• The algorithm tries different values of x_j and s to maximize the purity of the split
• After finding the "maximum purity" split, repeat the process for a second split, and so on
• Repeat until we get "pure" rectangles (each belonging to one class)
9.2 Classification trees

Example 1: Riding mowers
• 12 owners and 12 non-owners of riding mowers
• Predictors: Income (x1) and Lot_Size (x2)

Income   Lot_Size   Ownership
60.0     18.4       owner
85.5     16.8       owner
64.8     21.6       owner
61.5     20.8       owner
87.0     23.6       owner
110.1    19.2       owner
108.0    17.6       owner
82.8     22.4       owner
69.0     20.0       owner
93.0     20.8       owner
51.0     22.0       owner
81.0     20.0       owner
75.0     19.6       non-owner
52.8     20.8       non-owner
64.8     17.2       non-owner
43.2     20.4       non-owner
84.0     17.6       non-owner
49.2     17.6       non-owner
59.4     16.0       non-owner
66.0     18.4       non-owner
47.4     16.4       non-owner
33.0     18.8       non-owner
51.0     14.0       non-owner
63.0     14.8       non-owner
9.2 Classification trees

The first split: Lot_Size = 19,000
[Figure: scatter plot of Income vs. Lot_Size, split at Lot_Size = 19,000]

First split – the tree
[Figure: one-node tree asking "Is Lot_Size ≤ 19,000?", with a Yes branch and a No branch]
9.3 Measure of impurity

Gini impurity index
• Assume there are m classes
• p_k = the proportion of observations in rectangle A that belong to class k
• I(A) = 1 − Σ_{k=1..m} p_k², with 0 ≤ I(A) ≤ (m − 1)/m
• If all observations belong to the same class, I(A) = 0 (why?)
• If all classes are equally represented, I(A) = 1 − 1/m = (m − 1)/m
• In the two-class case, I(A) is maximized at p = 0.5
9.3 Measure of impurity

Entropy measure
• Assume there are m classes
• p_k = the proportion of observations in rectangle A that belong to class k
• Entropy(A) = −Σ_{k=1..m} p_k log₂(p_k), with 0 ≤ Entropy(A) ≤ log₂(m)
• If all observations belong to the same class, Entropy(A) = 0 (why?)
• If all classes are equally represented, Entropy(A) = log₂(m)
• In the two-class case, Entropy(A) is maximized at p = 0.5
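A minimal Python sketch of the two impurity measures, checking the endpoint values claimed above (it assumes class proportions are supplied as a list):

from math import log2

def gini(p):
    # Gini impurity index: I(A) = 1 - sum of p_k^2
    return 1 - sum(pk ** 2 for pk in p)

def entropy(p):
    # Entropy(A) = -sum of p_k * log2(p_k); a p_k of 0 contributes 0
    return -sum(pk * log2(pk) for pk in p if pk > 0)

# Pure node (all observations in one class): both measures are 0.
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # 0.0 0.0

# Equally distributed node with m classes: Gini = (m-1)/m, entropy = log2(m).
m = 4
uniform = [1 / m] * m
print(gini(uniform), (m - 1) / m)              # 0.75 0.75
print(entropy(uniform), log2(m))               # 2.0 2.0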
9.3 Measure of impurity

[Figure: node impurity as a function of p, peaking at p = 0.5]
Note that the impurity measures are at their peak when p = 0.5, that is, when the data contain 50% of each of the two classes.
9.3 Measure of impurity

In Example 1:
• Before the split (12 owners and 12 non-owners): Gini = 0.5, Entropy = 1
• After the split (x2 ≥ 19 and x2 < 19):
  – Upper rectangle (9 owners and 3 non-owners): Gini = 0.375, Entropy = 0.811
  – Lower rectangle (3 owners and 9 non-owners): Gini = 0.375, Entropy = 0.811
  – Combined impurity (weighted average): Gini = 0.375, Entropy = 0.811
• Both impurity indices decrease after the split.
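These numbers can be reproduced directly from the class counts in each rectangle; a small self-contained Python check:

from math import log2

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

before = [12, 12]   # 12 owners, 12 non-owners
upper  = [9, 3]     # x2 >= 19: 9 owners, 3 non-owners
lower  = [3, 9]     # x2 <  19: 3 owners, 9 non-owners

print(gini(before), entropy(before))   # 0.5 1.0
print(gini(upper), entropy(upper))     # 0.375 0.811...
# Weighted average over the two rectangles (12 records each out of 24):
print((12 * gini(upper) + 12 * gini(lower)) / 24)        # 0.375
print((12 * entropy(upper) + 12 * entropy(lower)) / 24)  # 0.811...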
9.2 Classification trees

Second split: Income = $84,750
Third split: Income = $57,150
9.2 Classification trees

[Figure: tree after three splits]
[Figure: after all splits (until when?)]
9.3 Measure of impurity

Tree after all splits
• Decision node: the number in the circle is the splitting value, and the split variable is written below the circle
• Fork: the left branch holds records less than or equal to the splitting value, the right branch those greater; the number on a fork is the number of records
• Leaf: a final rectangle
9.3 Measure of impurity

Determining the leaf node label
• Each leaf node's label is determined by a "vote" of the records within it, together with a cutoff value
• The records within each leaf node come from the training data
• The default cutoff of 0.5 means that the leaf node's label is the majority class
• Cutoff = 0.75: a majority of 75% or more "1" records in the leaf is required to label it a "1" node
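A one-function sketch of this voting rule (the leaf counts in the demo are hypothetical):

def leaf_label(n_ones, n_total, cutoff=0.5):
    # Label the leaf "1" iff the proportion of class-1 training records
    # in it reaches the cutoff; otherwise label it "0".
    return 1 if n_ones / n_total >= cutoff else 0

print(leaf_label(7, 10))                # 1: 70% >= 50%, majority vote
print(leaf_label(7, 10, cutoff=0.75))   # 0: 70% < 75%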
9.3 Measure of impurity

Then, at a node, how do we determine the split value?
① Generate candidate split questions
② Compute the information gain (IG) for each question
③ Choose the question with the largest IG

When do we stop?
① When impurity = 0
② When the number of records in a leaf falls below some threshold
③ When the IG falls below some threshold
④ Pruning
9.3 Measure of impurity

Then, at a node, how do we determine the split value?
① Generate candidate split questions
② Compute the information gain (IG) for each question, and choose the question with the largest IG

IG = the reduction in entropy caused by partitioning the data according to a given variable.

Ex. Suppose a node S is a collection of 14 examples [9 yes's, 5 no's]. Then
Entropy([9 yes, 5 no]) = −Σ_{i=1..2} p_i log₂(p_i) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.94
9.3 Measure of impurity

IG(S, x_i) = Entropy(S) − Σ_{v ∈ values(x_i)} (|S_v| / |S|) · Entropy(S_v)

• values(x_i): the set of all possible values of variable x_i
• S_v: the subset of S for which variable x_i has value v
9.3 Measure of impurity
Outlook Temperature Humidity Windy Play?
sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No
values(Windy) = {false, true}
S = [9 yes, 5 no], S_false = [6 yes, 2 no], S_true = [3 yes, 3 no]

IG(S, Windy) = Entropy(S) − Σ_{v ∈ {false, true}} (|S_v| / |S|) · Entropy(S_v)
             = Entropy(S) − (8/14) · Entropy(S_false) − (6/14) · Entropy(S_true)
             = 0.048

Similarly, IG(S, Humidity) = 0.151.
Humidity provides greater information gain than Windy.
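The following sketch recomputes both gains from the weather table above; it should reproduce IG(S, Windy) ≈ 0.048 and IG(S, Humidity) ≈ 0.152 (reported as 0.151 on the slide):

from math import log2
from collections import Counter

# (Outlook, Temperature, Humidity, Windy, Play?) rows from the table above.
data = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]
cols = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    # IG(S, x) = Entropy(S) - sum over values v of (|S_v|/|S|) * Entropy(S_v)
    ig = entropy([r[-1] for r in rows])
    for v in set(r[col] for r in rows):
        sv = [r[-1] for r in rows if r[col] == v]
        ig -= len(sv) / len(rows) * entropy(sv)
    return ig

print(info_gain(data, cols["Windy"]))     # 0.0481...
print(info_gain(data, cols["Humidity"]))  # 0.1518...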
9.4 Evaluating the performance of a classification tree

Example 2: Acceptance of a personal loan
• Analyze the previous campaign to design a better one
• What combination of factors makes customers accept a personal loan?
• Predictors: demographic information (age, income, etc.), response to the last campaign, relationship with the bank (mortgage, securities account, etc.)
• In the last campaign, 480 out of 5,000 customers accepted (9.6%)
9.4 Evaluating the performance of a classification tree

Partition: 2,500 training records, 1,500 validation records, 1,000 test records
9.4 Evaluating the performance of a classification tree
The full-grown tree is 100% accurate on the training set: every leaf of the full-grown tree is a pure class.
→ the problem of overfitting
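A hedged illustration of the same effect with scikit-learn on synthetic data (a stand-in for the XLMiner output shown on these slides; the data-generating process below is made up, and the 2,500/1,500 partition echoes the one above):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))                                    # synthetic predictors
y = (X[:, 0] + X[:, 1] + rng.normal(size=4000) > 0).astype(int)   # noisy outcome

# 2,500 training records, 1,500 validation records.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=1500, random_state=1)

full_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)  # grown until pure
print(full_tree.score(X_tr, y_tr))   # 1.0: every leaf is a pure class
print(full_tree.score(X_va, y_va))   # noticeably below 1.0 on the validation set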
9.4 Evaluating the performance of a classification tree

• Accuracy is not 100% on the validation and test sets
• We must either stop before the tree is full-grown, or prune the full-grown tree back
9.5 Avoiding overfitting

"Overfitting": caused by splits based on too small a number of observations
→ either stop tree growth early (CHAID) or prune the full-grown tree back (CART, C4.5)
9.5 Avoiding overfitting

Stopping tree growth: CHAID
• Stop when? At a maximum tree depth? A minimum number of observations in a node? A minimum reduction in impurity? etc.
• CHAID (chi-squared automatic interaction detection): splitting stops when the purity improvement is not statistically significant under a chi-square test
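A rough sketch of the CHAID-style stopping decision, using a chi-square test of independence on the (branch × class) counts of a candidate split; the counts are hypothetical, and real CHAID additionally merges categories and applies multiple-testing corrections:

from scipy.stats import chi2_contingency

# Hypothetical candidate split: rows = left/right branch, cols = class counts.
table = [[40, 15],
         [12, 33]]

chi2, p_value, dof, expected = chi2_contingency(table)
if p_value < 0.05:
    print(f"p = {p_value:.4f}: purity improvement is significant, keep splitting")
else:
    print(f"p = {p_value:.4f}: not significant, stop growing this branch")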
9.5 Avoiding overfitting

Pruning the tree
• Let the tree grow to its full extent, then prune it back
• In the riding-mower example, the rectangles produced by the last few splits contain only single records → they can be regarded as fitting noise
• Pruning means converting a decision node into a leaf
• The goal is to find the point at which the error rate on unseen data starts to increase
9.5 Avoiding overfitting

Pruning the tree (CART)
• Use the cost complexity (CC) of a tree T: CC(T) = err(T) + α · L(T), where
  – err(T) = the cost of the misclassifications made by tree T
  – L(T) = the number of leaves in T
  – α = a penalty factor (set by the user)
• As the tree grows, the error decreases but the penalty term grows
• The larger α is, the higher the CC of a large tree
• Among trees of a given size (number of decision nodes), choose the one with the lowest CC
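A small sketch of the CC comparison; the candidate subtrees' error rates, leaf counts, and the α below are all hypothetical:

def cost_complexity(err, n_leaves, alpha):
    # CC(T) = err(T) + alpha * L(T)
    return err + alpha * n_leaves

# Hypothetical subtrees of the full-grown tree: (training error, number of leaves).
candidates = [(0.00, 20), (0.01, 12), (0.03, 7), (0.08, 3)]
alpha = 0.005

for err, leaves in candidates:
    print(leaves, "leaves -> CC =", round(cost_complexity(err, leaves, alpha), 3))
# A larger alpha penalizes big trees more, shifting the minimum toward smaller trees.
print(min(candidates, key=lambda t: cost_complexity(*t, alpha)))   # (0.03, 7)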
9.5 Avoiding overfitting

• Do this for each tree size
  → for each size, we obtain the tree with the minimum CC
• Among these trees, choose the one whose misclassification error on the validation set is minimum (the "minimum error tree"),
• or choose the smallest tree whose validation error is within one standard error of the minimum error tree (the "best pruned tree"), where the standard error of an error rate p is √(p(1 − p)/n) and n is the number of records in the validation set
• The one-standard-error rule is a correction for sampling error (the validation data were also used in the pruning process, and a different sample could have given a different result)
9.5 Avoiding overfitting

[Figure: validation error versus tree size, marking the full-grown tree, the minimum error tree, and the best pruned tree]

Minimum error tree: validation error = 0.0147. Best pruned tree: the smallest tree with validation error within one standard error of this minimum:
0.0147 + √(0.0147 × (1 − 0.0147) / 1500) = 0.0147 + 0.0031 = 0.0178 = 1.78%
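This threshold can be reproduced directly (p = 0.0147 is the minimum validation error, n = 1,500 is the validation-set size):

from math import sqrt

p, n = 0.0147, 1500
se = sqrt(p * (1 - p) / n)      # standard error of the validation error rate
print(round(se, 4))             # 0.0031
print(round(p + se, 4))         # 0.0178: the best pruned tree is the smallest
                                # tree with validation error <= 1.78%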
9.6 Classification rules from trees

A tree translates directly into a set of classification rules.
Ex. A rule might be "IF (Income > 92.5) AND (Education < 1.5) AND (Family ≤ 2.5) THEN Class = 0 (nonacceptor)"
• Extends to the case of more than two classes
• Each leaf node is then labeled with one of the m classes
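A sketch of extracting such rules by walking a tree: the nested-dict representation below is made up for illustration, with splits echoing the quoted rule (leaf labels other than the quoted one are invented):

def rules(node, conditions=()):
    # Emit one IF-THEN rule per leaf by accumulating split conditions.
    if "class" in node:
        print("IF", " AND ".join(conditions), "THEN Class =", node["class"])
        return
    var, s = node["var"], node["split"]
    rules(node["left"], conditions + (f"({var} <= {s})",))
    rules(node["right"], conditions + (f"({var} > {s})",))

tree = {"var": "Income", "split": 92.5,
        "left": {"class": 1},
        "right": {"var": "Education", "split": 1.5,
                  "left": {"var": "Family", "split": 2.5,
                           "left": {"class": 0},      # the quoted rule's leaf
                           "right": {"class": 1}},
                  "right": {"class": 1}}}
rules(tree)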
9.8 Regression trees

• Regression trees are used with a continuous outcome variable
• The procedure is similar to that for a classification tree
• Many splits are attempted; choose the one that minimizes impurity

Ex. Predicting Toyota Corolla prices (see Ch. 6): 10 predictors, 600 training records
Best tree: only two predictors turn out to be useful, Age and Horse Power.
9.8 Regression trees

Prediction
• Cf. a classification tree, where the class of a leaf is decided by a "vote" of the data in that leaf
• In a regression tree, the value of a leaf node is the average of the data in that leaf

Measuring impurity
• Typical measure: the sum of squared deviations from the mean of the leaf

Evaluating performance
• The same way as for other predictive methods: RMSE (root mean squared error), lift charts, etc.
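A minimal sketch of these two leaf computations (the values in the example leaf are hypothetical Corolla prices):

def leaf_prediction(values):
    # A regression-tree leaf predicts the average of its training records.
    return sum(values) / len(values)

def leaf_impurity(values):
    # Typical impurity: sum of squared deviations from the leaf mean.
    mean = leaf_prediction(values)
    return sum((v - mean) ** 2 for v in values)

prices = [8500, 9100, 8800, 9400]    # hypothetical prices in one leaf
print(leaf_prediction(prices))       # 8950.0
print(leaf_impurity(prices))         # 450000.0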
9.9 Advantages, weaknesses, and extensions

• A good classifier, and useful for variable selection
• No need to transform the variables
• Variables are selected automatically through the splits (after pruning, only a subset of the variables may remain)
• Robust to outliers (a split depends only on the rank order of the records' values, not on their magnitude)
• But sensitive to changes in the data (a slight change can produce a very different set of splits)
• No particular relationship between the predictors and the response is assumed in advance (cf. linear regression, which assumes a linear relationship)
9.9 Advantages, weaknesses, and extensions

• Since each split uses only one predictor, interactions between predictors are not taken into account
• Performance suffers when horizontal and vertical splits do not fit the structure of the data
• In such cases, discriminant analysis (Ch. 12) is preferable
Ch. 9 Problems
• 9.1 (eBay.com)
• 9.2 (Flight delay)
• 9.3 (Regression tree – Toyota Corolla)