Decision Trees Jyh-Shing Roger Jang (張智星) CSIE Dept, National Taiwan University


Decision Trees

Jyh-Shing Roger Jang (張智星), CSIE Dept, National Taiwan University

2

Classification

Stages in classification:
- Model construction: given a collection of records (the training set), where each record has a set of attributes including the class, find a model (classifier) for predicting the class as a function of the other attributes.
- Model evaluation: use previously unseen records (the test set) to test the model; the model should assign classes as accurately as possible.
- Model application: apply the model to new data directly.

3

Stages in Classification

[Figure: the training set feeds a learning algorithm that learns (induces) a model; the model is then applied (deduction) to the test set.]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

4

Example

5

Examples of Classification/Regression Tasks

Classification:
- Predict the trend (up or down) of stock markets
- Predict tumors as benign or malignant
- Classify credit card transactions as legitimate or fraudulent
- Categorize news articles as finance, weather, entertainment, sports, etc.

Regression:
- Predict the temperature 3 hours from now
- Predict tomorrow's gold/oil price
- Estimate the paths of typhoons

6

Methods for Classification

Numerous methods for classification:
- Decision trees
- Minimum-distance classifiers
- Artificial neural networks
- Naïve Bayes classifiers
- Quadratic classifiers
- Gaussian-mixture-model classifiers
- Support vector machines
- Rule-based methods
- …

7

Decision Tree Induction

Again, many algorithms:
- Hunt's algorithm (one of the earliest)
- CART (classification and regression trees)
- ID3, C4.5
- SLIQ, SPRINT
- …


9

General Steps in Tree Induction

Idea: send all the training data down the tree until it reaches the leaves, where the data should be as "pure" as possible.

Let D be the set of records that reach a node. General procedure:
- If D contains only records belonging to the same class y, mark the node as a leaf with class y.
- Otherwise, use a test on an attribute to split D and create subtrees recursively.
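The general procedure above can be sketched as a small recursive function. This is a minimal illustration, not the deck's reference code: `choose_split` is a hypothetical callback (e.g. a greedy impurity-based search), and attributes are assumed continuous, so a test is an (attribute index, threshold) pair.

```python
from collections import Counter

def majority(labels):
    """Most frequent class among the records at a node."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, choose_split):
    """If all records at the node share one class y, emit a leaf for y;
    otherwise split the data on an attribute test and recurse."""
    if len(set(labels)) == 1:
        return ("leaf", labels[0])
    split = choose_split(rows, labels)     # hypothetical greedy search
    if split is None:                      # no attribute separates the data
        return ("leaf", majority(labels))
    attr, threshold = split
    left = [i for i, r in enumerate(rows) if r[attr] < threshold]
    right = [i for i, r in enumerate(rows) if r[attr] >= threshold]
    if not left or not right:              # degenerate split: stop here
        return ("leaf", majority(labels))
    def subset(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]
    return ("node", attr, threshold,
            build_tree(*subset(left), choose_split),
            build_tree(*subset(right), choose_split))
```

Any split-selection rule can be plugged in through `choose_split`; the later slides develop the Gini-based choice.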

10

Tree Induction

Issues in tree induction:
- How to split the dataset at a node: split based on a greedy search that optimizes a certain criterion/test.
- When to stop splitting: when the "impurity measure" falls below a threshold.

11

How to Specify Test?

Depends on the attribute type:
- Nominal (aka "factor"): car type: family, sports, luxury, etc.
- Ordinal: T-shirt size: small, medium, big, etc.
- Continuous: temperature: 10.3, 25.6, 38, etc.

Depends on the number of ways to split:
- Binary (2-way) split
- Multi-way split

12

Splitting Based on Nominal/Ordinal Attributes

Multi-way split: use as many partitions as distinct values.
  CarType → Family | Sports | Luxury

Binary split: divide the values into two subsets via optimal partitioning.
  CarType → {Family, Luxury} | {Sports}, or
  CarType → {Sports, Luxury} | {Family}

13

Splitting Based on Continuous Attributes

Multi-way split: discretize to form an ordinal categorical attribute.
  Taxable Income? → < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | >= 80K

Binary split (A < v or A >= v): consider all possible splits and find the best one.
  Taxable Income > 80K? → Yes | No
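The multi-way split by discretization amounts to mapping each continuous value to an ordinal bin index. A minimal sketch; the cut points mirror the Taxable Income example, while the helper name `discretize` and the exact boundary convention (`<` vs `<=`) are illustrative assumptions.

```python
def discretize(value, edges):
    """Return the ordinal bin index of a continuous value, given sorted
    cut points; values below edges[0] map to bin 0, and values at or
    above edges[-1] map to the last bin."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

# Cut points matching the Taxable Income example above (in dollars).
income_edges = [10_000, 25_000, 50_000, 80_000]
```

For instance, an income of 30,000 falls in the [25K, 50K) bin (index 2).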

14

To Determine the Best Split

Goal:
- Nodes with a homogeneous (pure) class distribution are preferred.
- We need a measure of node impurity, which should be kept as low as possible during split selection.

Examples:
  C0: 5, C1: 5 — non-homogeneous, high degree of impurity
  C0: 9, C1: 1 — homogeneous, low degree of impurity

15

Measures of Node Impurity

Numerous measures of node impurity, where p(j|t) is the relative frequency of class j at node t:

- Gini index: GINI(t) = 1 − Σ_j [p(j|t)]²
- Entropy: Entropy(t) = −Σ_j p(j|t) log₂ p(j|t)
- Classification error: Error(t) = 1 − max_j p(j|t)

[Figure: the three measures plotted for a 2-class problem as the class probability varies.]
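The three impurity measures translate directly into code, taking the per-class record counts at a node. A minimal sketch (the function names are my own):

```python
import math

def _probs(counts):
    """Per-class relative frequencies p(j|t) from record counts at node t."""
    n = sum(counts)
    return [c / n for c in counts if c > 0]

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2"""
    return 1.0 - sum(p * p for p in _probs(counts))

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t)"""
    return -sum(p * math.log2(p) for p in _probs(counts))

def classification_error(counts):
    """Error(t) = 1 - max_j p(j|t)"""
    return 1.0 - max(_probs(counts))
```

For a 2-class node with counts (3, 3), all three measures reach their maximum: Gini 0.5, entropy 1.0, and classification error 0.5.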

16

Impurity Measure: Gini Index

Gini index for a given node t:

  GINI(t) = 1 − Σ_j [p(j|t)]²

where p(j|t) is the relative frequency of class j at node t.

Extreme values:
- Minimum 0.000, when all records belong to the same class.
- Maximum 1 − 1/(# of classes), when records are equally distributed among all classes.

Examples (two classes):
  C1 = 0, C2 = 6: GINI = 1 − (0/6)² − (6/6)² = 0.000
  C1 = 1, C2 = 5: GINI = 1 − (1/6)² − (5/6)² = 0.278
  C1 = 2, C2 = 4: GINI = 1 − (2/6)² − (4/6)² = 0.444
  C1 = 3, C2 = 3: GINI = 1 − (3/6)² − (3/6)² = 0.500

(This quantity is called "confusion" in HW4.)

17

Splitting Based on Gini Index

The quality of splitting a node t into k children:

  GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(t_i)

where t_i is the node of child i, n_i is the number of records at t_i, and n is the number of records at node t.

(This weighted sum is called "total confusion" in HW4.)
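The split-quality formula is a weighted average of the children's Gini indices. A minimal sketch, assuming each child node is given as its list of per-class record counts:

```python
def gini(counts):
    """GINI(t) from per-class record counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(t_i), where each child t_i is
    given as its list of per-class record counts, and n_i = sum of them."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)
```

For example, `gini_split([[1, 4], [2, 1], [1, 1]])` reproduces the 0.393 of the multi-way CarType split on a later slide.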

18

Gini Index for General Binary Split

Example of computing the Gini index for a binary split on attribute B (Yes → node N1, No → node N2):

  Parent: C1 = 6, C2 = 6, GINI = 0.500

       N1  N2
  C1    5   1
  C2    2   4

  GINI(N1) = 1 − (5/7)² − (2/7)² = 0.408
  GINI(N2) = 1 − (1/5)² − (4/5)² = 0.320
  GINI_split(B) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

19

Gini Index for Nominal Attributes

For each child, obtain the counts for each class; compute the Gini index of each child; then compute the Gini index of the split.

Multi-way split:

  CarType  Family  Sports  Luxury
  C1       1       2       1
  C2       4       1       1
  GINI_split = 0.393

Two-way split (find the best partition of values):

  CarType  {Sports, Luxury}  {Family}
  C1       3                 1
  C2       2                 4
  GINI_split = 0.400

  CarType  {Sports}  {Family, Luxury}
  C1       2         2
  C2       1         5
  GINI_split = 0.419

20

Gini Index for Binary Split on Continuous Attributes

For each attribute:
- Sort the attribute values.
- Linearly scan these values, updating the count matrix and computing the Gini index at each new candidate split position.
- Choose the split that has the smallest Gini index.

Sorted values of Taxable Income with their class labels, the candidate split positions, the count matrix on each side, and the resulting Gini index:

  Cheat           No   No   No   Yes  Yes  Yes  No   No   No   No
  Taxable Income  60   70   75   85   90   95   100  120  125  220

  Split position   55    65    72    80    87    92    97    110   122   172   230
  Yes (<=, >)      0,3   0,3   0,3   0,3   1,2   2,1   3,0   3,0   3,0   3,0   3,0
  No  (<=, >)      0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
  Gini             0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split is at 97, with Gini = 0.300.
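The sort-then-scan procedure can be sketched as follows; class counts on each side are updated incrementally as the scan advances, and candidate thresholds are taken as midpoints between consecutive distinct values (so the best split lands at 97.5 rather than the slide's rounded 97). The function name is illustrative.

```python
def best_binary_split(values, labels):
    """Find the binary split threshold with the smallest weighted Gini
    index via a single linear scan over the sorted attribute values."""
    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}              # count matrix, "<=" side
    right = {c: 0 for c in classes}             # count matrix, ">" side
    for _, y in pairs:
        right[y] += 1

    n, best_gini, best_thr = len(pairs), float("inf"), None
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1                            # move one record across
        right[y] -= 1
        if v == pairs[i + 1][0]:
            continue                            # no threshold between equal values
        thr = (v + pairs[i + 1][0]) / 2
        w = (i + 1) / n
        g = (w * gini(list(left.values()))
             + (1 - w) * gini(list(right.values())))
        if g < best_gini:
            best_gini, best_thr = g, thr
    return best_gini, best_thr
```

Running this on the ten training records above returns Gini 0.300 at threshold 97.5, matching the table's minimum.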