
Page 1: SVM and Decision Tree

SVM and Decision Tree

Machine Learning I CSE 6740, Fall 2013

Le Song

Page 2: SVM and Decision Tree

Which decision boundary is better?

Suppose the training samples are linearly separable

We can find a decision boundary which gives zero training error

But there are many such decision boundaries

Which one is better?

2

[Figure: Class 1 and Class 2 points with several candidate zero-training-error decision boundaries]

Page 3: SVM and Decision Tree

Compare two decision boundaries

Suppose we perturb the data; which boundary is more susceptible to error?

3

Page 4: SVM and Decision Tree

Constraints on data points

For all 𝑥 in class 2, 𝑦 = 1 and 𝑤⊤𝑥 + 𝑏 ≥ 𝑐

For all 𝑥 in class 1, 𝑦 = −1 and 𝑤⊤𝑥 + 𝑏 ≤ −𝑐

Or more compactly, (𝑤⊤𝑥 + 𝑏)𝑦 ≥ 𝑐

4

[Figure: Class 1 and Class 2 separated by the hyperplane 𝑤⊤𝑥 + 𝑏 = 0, with margin lines 𝑤⊤𝑥 + 𝑏 = ±𝑐 and normal vector 𝑤]

Page 5: SVM and Decision Tree

Classifier margin

Pick two data points 𝑥1 and 𝑥2 which are on each dashed line respectively

The margin is 𝛾 = (1/‖𝑤‖) 𝑤⊤(𝑥1 − 𝑥2) = 2𝑐/‖𝑤‖

5

[Figure: the same two-class plot, with 𝑥1 and 𝑥2 lying on the margin lines 𝑤⊤𝑥 + 𝑏 = 𝑐 and 𝑤⊤𝑥 + 𝑏 = −𝑐 respectively]
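As a quick numerical check of the margin formula above (not from the slides), a minimal NumPy sketch with an assumed hyperplane and two points placed on its margin lines:

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 with margin lines at +/- c (illustrative values)
w = np.array([2.0, 1.0])
b = -1.0
c = 1.0

# Points chosen so that w^T x1 + b = +c and w^T x2 + b = -c
x1 = np.array([1.0, 0.0])
x2 = np.array([0.5, -1.0])

margin_projection = w @ (x1 - x2) / np.linalg.norm(w)  # (1/||w||) w^T (x1 - x2)
margin_formula = 2 * c / np.linalg.norm(w)             # 2c / ||w||
print(margin_projection, margin_formula)               # both ~0.894
```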

Page 6: SVM and Decision Tree

Maximum margin classifier

Find the decision boundary 𝑤 as far from the data points as possible

max𝑤,𝑏 2𝑐/‖𝑤‖

s.t. 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) ≥ 𝑐, ∀𝑖

6

[Figure: the same two-class plot with the maximum-margin hyperplane, margin lines at ±𝑐, and the margin-defining points 𝑥1, 𝑥2]

Page 7: SVM and Decision Tree

Support vector machines with hard margin

min𝑤,𝑏 ‖𝑤‖²

s.t. 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) ≥ 1, ∀𝑖

Convert to standard form

min𝑤,𝑏 (1/2) 𝑤⊤𝑤

s.t. 1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) ≤ 0, ∀𝑖

The Lagrangian function

𝐿(𝑤, 𝑏, 𝛼) = (1/2) 𝑤⊤𝑤 + ∑𝑖=1…𝑚 𝛼𝑖 (1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏))

7

Page 8: SVM and Decision Tree

Deriving the dual problem

𝐿(𝑤, 𝑏, 𝛼) = (1/2) 𝑤⊤𝑤 + ∑𝑖=1…𝑚 𝛼𝑖 (1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏))

Taking derivatives and setting them to zero

∂𝐿/∂𝑤 = 𝑤 − ∑𝑖=1…𝑚 𝛼𝑖𝑦𝑖𝑥𝑖 = 0

∂𝐿/∂𝑏 = ∑𝑖=1…𝑚 𝛼𝑖𝑦𝑖 = 0

8

Page 9: SVM and Decision Tree

Plug back the relations for 𝑤 and 𝑏

𝐿(𝛼) = (1/2) (∑𝑖=1…𝑚 𝛼𝑖𝑦𝑖𝑥𝑖)⊤ (∑𝑗=1…𝑚 𝛼𝑗𝑦𝑗𝑥𝑗) + ∑𝑖=1…𝑚 𝛼𝑖 (1 − 𝑦𝑖 ((∑𝑗=1…𝑚 𝛼𝑗𝑦𝑗𝑥𝑗)⊤ 𝑥𝑖 + 𝑏))

After simplification

𝐿(𝛼) = ∑𝑖=1…𝑚 𝛼𝑖 − (1/2) ∑𝑖,𝑗=1…𝑚 𝛼𝑖𝛼𝑗𝑦𝑖𝑦𝑗 𝑥𝑖⊤𝑥𝑗

9

Page 10: SVM and Decision Tree

The dual problem

max𝛼 ∑𝑖=1…𝑚 𝛼𝑖 − (1/2) ∑𝑖,𝑗=1…𝑚 𝛼𝑖𝛼𝑗𝑦𝑖𝑦𝑗 𝑥𝑖⊤𝑥𝑗

s.t. 𝛼𝑖 ≥ 0, 𝑖 = 1, … , 𝑚

∑𝑖=1…𝑚 𝛼𝑖𝑦𝑖 = 0

This is a constrained quadratic program

Nice and convex, and the global maximum can be found

𝑤 can be found as 𝑤 = ∑𝑖=1…𝑚 𝛼𝑖𝑦𝑖𝑥𝑖

How about 𝑏?

10
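To make the dual concrete, here is a minimal sketch (not part of the lecture) that assembles this quadratic program for a small toy dataset assumed to be linearly separable and hands it to SciPy's general-purpose SLSQP solver; the data, variable names, and solver choice are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data, assumed linearly separable; labels y are +1 / -1
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [-0.5, 1.0], [1.0, -0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

# Q_ij = y_i y_j x_i^T x_j, the matrix appearing in the dual objective
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(alpha):
    # minimize the negative of: sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]  # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * m                              # alpha_i >= 0 (hard margin)

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
w = (alpha * y) @ X   # w = sum_i alpha_i y_i x_i (b is recovered on a later slide)
print("alpha:", np.round(alpha, 3))
print("w:", w)
```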

Page 11: SVM and Decision Tree

Support vectors

Note that the KKT condition

𝛼𝑖 (1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏)) = 0

For data points with 1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) < 0, 𝛼𝑖 = 0

For data points with 1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) = 0, 𝛼𝑖 > 0

11

[Figure: the two classes with the max-margin boundary; most points have 𝛼𝑖 = 0, while the points on the margin carry nonzero multipliers, e.g. 𝛼1 = 0.8, 𝛼6 = 1.4, 𝛼8 = 0.6]

Call the training data points whose 𝛼𝑖's are nonzero the support vectors (SV)

Page 12: SVM and Decision Tree

Computing 𝑏 and obtaining the classifier

Pick any data point with 𝛼𝑖 > 0 and solve for 𝑏 with

1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) = 0

For a new test point 𝑧

Compute

𝑤⊤𝑧 + 𝑏 = ∑𝑖∈SV 𝛼𝑖𝑦𝑖 𝑥𝑖⊤𝑧 + 𝑏, where SV is the set of support vectors

Classify 𝑧 as class 1 if the result is positive, and class 2 otherwise

12
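Continuing the illustrative sketch from the dual-problem slide (it assumes the same hypothetical X, y, alpha and w are still in scope), 𝑏 can be recovered from one support vector and a new point scored like this; the tolerance used to detect nonzero 𝛼𝑖 is an arbitrary choice:

```python
import numpy as np

# Assumes X, y, alpha and w from the dual-QP sketch above are still in scope
sv = alpha > 1e-6   # "nonzero" alphas up to an arbitrary tolerance: the support vectors

# From any support vector i: 1 - y_i (w^T x_i + b) = 0  =>  b = y_i - w^T x_i
i = int(np.argmax(sv))
b = y[i] - w @ X[i]

# Score a new point z with the support-vector expansion: sum_{i in SV} alpha_i y_i x_i^T z + b
z = np.array([2.0, 1.5])
score = sum(alpha[j] * y[j] * (X[j] @ z) for j in np.where(sv)[0]) + b
print("score:", score, "-> predicted label:", 1 if score > 0 else -1)
```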

Page 13: SVM and Decision Tree

Interpretation of support vector machines

The optimal w is a linear combination of a small number of data points. This “sparse” representation can be viewed as data compression

To compute the weights 𝛼𝑖 and to use the support vector machine, we only need to specify the inner products (or kernel) between the examples, 𝑥𝑖⊤𝑥𝑗

We make decisions by comparing each new example 𝑧 with only the support vectors:

𝑦∗ = sign(∑𝑖∈SV 𝛼𝑖𝑦𝑖 𝑥𝑖⊤𝑧 + 𝑏)

13

Page 14: SVM and Decision Tree

Soft margin constraints

What if the data is not linearly separable?

We will allow points to violate the hard margin constraint (𝑤⊤𝑥 + 𝑏)𝑦 ≥ 1 − 𝜉

14

[Figure: the hyperplane 𝑤⊤𝑥 + 𝑏 = 0 with margin lines at ±1; a few points fall inside or across the margin, with slacks 𝜉1, 𝜉2, 𝜉3]

Page 15: SVM and Decision Tree

Soft margin SVM

min𝑤,𝑏,𝜉 ‖𝑤‖² + 𝐶 ∑𝑖=1…𝑚 𝜉𝑖

s.t. 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) ≥ 1 − 𝜉𝑖 , 𝜉𝑖 ≥ 0, ∀𝑖

Convert to standard form

min𝑤,𝑏,𝜉 (1/2) 𝑤⊤𝑤 + 𝐶 ∑𝑖=1…𝑚 𝜉𝑖

s.t. 1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) − 𝜉𝑖 ≤ 0, −𝜉𝑖 ≤ 0, ∀𝑖

The Lagrangian function

𝐿(𝑤, 𝑏, 𝜉, 𝛼, 𝛽) = (1/2) 𝑤⊤𝑤 + 𝐶 ∑𝑖=1…𝑚 𝜉𝑖 + ∑𝑖=1…𝑚 𝛼𝑖 (1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) − 𝜉𝑖) − ∑𝑖=1…𝑚 𝛽𝑖𝜉𝑖

15

Page 16: SVM and Decision Tree

Deriving the dual problem

𝐿(𝑤, 𝑏, 𝜉, 𝛼, 𝛽) = (1/2) 𝑤⊤𝑤 + 𝐶 ∑𝑖=1…𝑚 𝜉𝑖 + ∑𝑖=1…𝑚 𝛼𝑖 (1 − 𝑦𝑖(𝑤⊤𝑥𝑖 + 𝑏) − 𝜉𝑖) − ∑𝑖=1…𝑚 𝛽𝑖𝜉𝑖

Taking derivatives and setting them to zero

∂𝐿/∂𝑤 = 𝑤 − ∑𝑖=1…𝑚 𝛼𝑖𝑦𝑖𝑥𝑖 = 0

∂𝐿/∂𝑏 = ∑𝑖=1…𝑚 𝛼𝑖𝑦𝑖 = 0

∂𝐿/∂𝜉𝑖 = 𝐶 − 𝛼𝑖 − 𝛽𝑖 = 0

16

Page 17: SVM and Decision Tree

Plug back the relations for 𝑤, 𝑏 and 𝜉

𝐿(𝛼) = (1/2) (∑𝑖=1…𝑚 𝛼𝑖𝑦𝑖𝑥𝑖)⊤ (∑𝑗=1…𝑚 𝛼𝑗𝑦𝑗𝑥𝑗) + ∑𝑖=1…𝑚 𝛼𝑖 (1 − 𝑦𝑖 ((∑𝑗=1…𝑚 𝛼𝑗𝑦𝑗𝑥𝑗)⊤ 𝑥𝑖 + 𝑏))

After simplification

𝐿(𝛼) = ∑𝑖=1…𝑚 𝛼𝑖 − (1/2) ∑𝑖,𝑗=1…𝑚 𝛼𝑖𝛼𝑗𝑦𝑖𝑦𝑗 𝑥𝑖⊤𝑥𝑗

17

Page 18: SVM and Decision Tree

The dual problem

max𝛼 ∑𝑖=1…𝑚 𝛼𝑖 − (1/2) ∑𝑖,𝑗=1…𝑚 𝛼𝑖𝛼𝑗𝑦𝑖𝑦𝑗 𝑥𝑖⊤𝑥𝑗

s.t. 𝐶 − 𝛼𝑖 − 𝛽𝑖 = 0, 𝛼𝑖 ≥ 0, 𝛽𝑖 ≥ 0, 𝑖 = 1, … , 𝑚

∑𝑖=1…𝑚 𝛼𝑖𝑦𝑖 = 0

The constraints 𝐶 − 𝛼𝑖 − 𝛽𝑖 = 0, 𝛼𝑖 ≥ 0, 𝛽𝑖 ≥ 0 can be simplified to 0 ≤ 𝛼𝑖 ≤ 𝐶

This is a constrained quadratic program

Nice and convex, and the global maximum can be found

18
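In practice the soft-margin dual is solved by a library rather than by hand. A minimal usage sketch with scikit-learn's SVC (assuming scikit-learn is available; the toy data is hypothetical), which exposes the learned support vectors, the dual coefficients 𝛼𝑖𝑦𝑖, and the intercept 𝑏:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data; labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [-0.5, 1.0], [1.0, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# Linear soft-margin SVM; C is the penalty on the slack variables
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vector indices:", clf.support_)        # training points with alpha_i > 0
print("dual coefficients (alpha_i * y_i):", clf.dual_coef_)
print("w:", clf.coef_, "b:", clf.intercept_)
print("prediction for [2.0, 1.5]:", clf.predict([[2.0, 1.5]]))
```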

Page 19: SVM and Decision Tree

Learning nonlinear decision boundary

[Figure: a linearly separable dataset contrasted with nonlinearly separable ones, e.g. the XOR gate and speech recognition data]

19

Page 20: SVM and Decision Tree

A decision tree for Tax Fraud

Input: a vector of attributes 𝑋 = [Refund,MarSt,TaxInc]

Output: 𝑌 = Cheating or Not

𝐻 as a procedure:

20

[Decision tree]
Refund = Yes → NO
Refund = No → MarSt
    MarSt = Married → NO
    MarSt = Single, Divorced → TaxInc
        TaxInc < 80K → NO
        TaxInc > 80K → YES

Each internal node: tests one attribute 𝑋𝑖
Each branch from a node: selects one value for 𝑋𝑖
Each leaf node: predicts 𝑌
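Read as a procedure, the tree above is just nested conditionals. An illustrative Python transcription (argument names and value encodings are assumptions, not from the slides):

```python
def cheat_prediction(refund: str, marital_status: str, taxable_income_k: float) -> str:
    """Apply the tax-fraud decision tree from the slide to one record."""
    if refund == "Yes":
        return "No"                    # leaf under Refund = Yes
    # Refund = No: test marital status next
    if marital_status == "Married":
        return "No"                    # leaf under MarSt = Married
    # Single or Divorced: test taxable income (the slide leaves exactly 80K unspecified;
    # this transcription sends it to the YES leaf)
    return "No" if taxable_income_k < 80 else "Yes"

# The query record used on the following slides: Refund = No, Married, 80K
print(cheat_prediction("No", "Married", 80))   # -> "No"
```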

Page 21: SVM and Decision Tree

Apply model to test data I

21

[Figure: the decision tree from the previous slide with the query record below; traversal starts at the root (Refund) node]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of tree.

Page 22: SVM and Decision Tree

Apply model to test data II

22

[Figure: the same tree and query record; the traversal follows the Refund = No branch]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Page 23: SVM and Decision Tree

Apply model to test data III

23

[Figure: the same tree and query record; the traversal reaches the MarSt node]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Page 24: SVM and Decision Tree

Apply model to test data IV

24

[Figure: the same tree and query record; the traversal follows the MarSt = Married branch]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Page 25: SVM and Decision Tree

Apply model to test data V

25

[Figure: the same tree and query record; the traversal ends at the NO leaf under MarSt = Married]

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Assign Cheat to “No”

Page 26: SVM and Decision Tree

Expressiveness of decision tree

Decision trees can express any function of the input attributes.

E.g., for Boolean functions, truth table row → path to leaf:

Trivially, there is a consistent decision tree for any training set with one path to leaf for each example.

Prefer to find more compact decision trees

26

Page 27: SVM and Decision Tree

Hypothesis spaces (model spaces)

How many distinct decision trees with 𝑛 Boolean attributes?

= number of Boolean functions

= number of distinct truth tables with 2ⁿ rows = 2^(2ⁿ)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?

Each attribute can be in (positive), in (negative), or out ⇒ 3ⁿ distinct conjunctive hypotheses

A more expressive hypothesis space

increases the chance that the target function can be expressed

increases the number of hypotheses consistent with the training set

and may therefore give worse predictions

27
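A quick sanity check of these counts (purely illustrative):

```python
n = 6
num_boolean_functions = 2 ** (2 ** n)   # one output bit per truth-table row, 2^n rows
num_conjunctions = 3 ** n               # each attribute: positive, negated, or absent
print(num_boolean_functions)            # 18446744073709551616, as on the slide
print(num_conjunctions)                 # 729
```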

Page 28: SVM and Decision Tree

Decision tree learning

28

[Diagram: a Tree Induction algorithm learns a Model (a Decision Tree) from the Training Set (induction); the learned Model is then applied to the Test Set (deduction)]

Training Set

Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes

Test Set

Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?

Page 29: SVM and Decision Tree

Example of a decision tree

29

Training Data

Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund = Yes → NO
Refund = No → MarSt
    MarSt = Married → NO
    MarSt = Single, Divorced → TaxInc
        TaxInc < 80K → NO
        TaxInc > 80K → YES

Page 30: SVM and Decision Tree

Another example of a decision tree

30

MarSt = Married → NO
MarSt = Single, Divorced → Refund
    Refund = Yes → NO
    Refund = No → TaxInc
        TaxInc < 80K → NO
        TaxInc > 80K → YES

There could be more than one tree that fits the same data!

Training Data: the same ten records (Tid, Refund, Marital Status, Taxable Income, Cheat) as on the previous slide

Page 31: SVM and Decision Tree

Top-Down Induction of Decision tree

Main loop:

𝐴 ← the “best” decision attribute for next node

Assign A as the decision attribute for node

For each value of A, create new descendant of node

Sort training examples to leaf nodes

If training examples perfectly classified, then STOP;

ELSE iterate over new leaf nodes

31
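The main loop above is essentially the ID3 scheme. Below is a compact recursive sketch under simplifying assumptions (categorical attributes only, entropy-based information gain as the "best attribute" criterion, majority vote when attributes run out); the function and field names are hypothetical, not the course's code:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy from splitting on one categorical attribute."""
    n = len(labels)
    splits = {}
    for row, lab in zip(rows, labels):
        splits.setdefault(row[attr], []).append(lab)
    return entropy(labels) - sum(len(part) / n * entropy(part) for part in splits.values())

def build_tree(rows, labels, attrs):
    """Top-down induction: pick the best attribute, then recurse on each of its values."""
    if len(set(labels)) == 1:                 # perfectly classified -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "children": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree([rows[i] for i in idx],
                                             [labels[i] for i in idx],
                                             [a for a in attrs if a != best])
    return node

# Hypothetical dict-shaped records (a tiny subset of the running example)
rows = [{"Refund": "Yes", "MarSt": "Single"}, {"Refund": "No", "MarSt": "Married"},
        {"Refund": "No", "MarSt": "Single"}, {"Refund": "No", "MarSt": "Divorced"}]
labels = ["No", "No", "Yes", "Yes"]
print(build_tree(rows, labels, ["Refund", "MarSt"]))   # MarSt ends up at the root here
```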

Page 32: SVM and Decision Tree

Tree Induction

Greedy strategy.

Split the records based on an attribute test that optimizes a certain criterion.

Issues

Determine how to split the records

How to specify the attribute test condition?

How to determine the best split?

Determine when to stop splitting

32

Page 33: SVM and Decision Tree

Splitting Based on Nominal Attributes

Multi-way split: Use as many partitions as distinct values.

Binary split: Divides values into two subsets. Need to find optimal partitioning.

[Figure: CarType examples. Multi-way split: {Family}, {Sports}, {Luxury}. Binary splits: {Family, Luxury} vs. {Sports}, OR {Sports, Luxury} vs. {Family}]

Page 34: SVM and Decision Tree

Splitting Based on Ordinal Attributes

Multi-way split: Use as many partitions as distinct values.

Binary split: Divides values into two subsets. Need to find optimal partitioning.

[Figure: Size examples. Multi-way split: {Small}, {Medium}, {Large}. Binary splits: {Medium, Large} vs. {Small}, OR {Small, Medium} vs. {Large}]

Page 35: SVM and Decision Tree

Splitting Based on Continuous Attributes

Different ways of handling:

Discretization to form an ordinal categorical attribute

Static – discretize once at the beginning

Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.

Binary decision: (𝐴 < 𝑡) or (𝐴 ≥ 𝑡)

consider all possible splits and find the best cut

can be more compute intensive

[Figure: (i) binary split on Taxable Income > 80K? (Yes / No); (ii) multi-way split on Taxable Income into < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K]
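"Consider all possible splits and find the best cut" is usually done by trying a threshold between each pair of adjacent sorted values. A self-contained sketch (entropy-based gain; the data reuses the Taxable Income and Cheat columns of the example training data):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Scan midpoints between adjacent sorted values; return (threshold, information gain)."""
    pairs = sorted(zip(values, labels))
    parent = entropy(labels)
    best = (None, -1.0)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue                     # no cut between equal values
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [lab for v, lab in pairs if v < t]
        right = [lab for v, lab in pairs if v >= t]
        gain = parent - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best[1]:
            best = (t, gain)
    return best

# Taxable Income (in K) and Cheat labels from the example training data
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(income, cheat))    # best cut here is t = 97.5 with gain ~ 0.28
```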

Page 36: SVM and Decision Tree

How to determine the Best Split

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

Greedy approach: Nodes with homogeneous class distribution are preferred

Need a measure of node impurity

[Figure: a non-homogeneous node (high degree of impurity) vs. a homogeneous node (low degree of impurity)]

Page 37: SVM and Decision Tree

How to compare attributes?

Entropy H(X) of a random variable X

H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)

Information theory: the most efficient code assigns −log2 P(X=i) bits to encode the message X=i, so the expected number of bits to code one random X is:

H(X) = − ∑i P(X=i) log2 P(X=i)

Page 38: SVM and Decision Tree

Sample Entropy

S is a sample of training examples

p+ is the proportion of positive examples in S

p- is the proportion of negative examples in S

Entropy measures the impurity of S:

Entropy(S) = − p+ log2 p+ − p- log2 p-

Page 39: SVM and Decision Tree

Examples for computing Entropy

C1 = 0, C2 = 6:
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

C1 = 1, C2 = 5:
P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4:
P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
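A quick check of these numbers (illustrative):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)  # treat 0 log 0 as 0

print(entropy([0, 6]), entropy([1, 5]), entropy([2, 4]))   # 0.0, ~0.65, ~0.92
```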

Page 40: SVM and Decision Tree

How to compare attributes?

Conditional entropy of variable 𝑋 given variable 𝑌

Given a specific value Y=v, the entropy H(X|Y=v) of X is:

H(X|Y=v) = − ∑i P(X=i|Y=v) log2 P(X=i|Y=v)

Conditional entropy H(X|Y) of X: the average of H(X|Y=v)

H(X|Y) = ∑v P(Y=v) H(X|Y=v)

Mutual information (aka information gain) of X given Y:

I(X,Y) = H(X) − H(X|Y)

Page 41: SVM and Decision Tree

Information Gain

Information gain (after splitting a node):

GAIN_split = Entropy(𝑝) − ∑𝑖=1…𝑘 (𝑛𝑖/𝑛) Entropy(𝑖)

The 𝑛 samples in the parent node 𝑝 are split into 𝑘 partitions; 𝑛𝑖 is the number of records in partition 𝑖

Measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN)
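A direct, self-contained computation of GAIN_split for a hypothetical two-way partition:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def gain_split(parent_counts, partition_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(partition i)."""
    n = sum(parent_counts)
    children = sum(sum(part) / n * entropy(part) for part in partition_counts)
    return entropy(parent_counts) - children

# Parent node with 3 positives / 7 negatives, split into two partitions (illustrative)
print(gain_split([3, 7], [[3, 3], [0, 4]]))   # ~0.28
```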

Page 42: SVM and Decision Tree

Problem of splitting using information gain

Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

Gain Ratio:

Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher entropy partitioning (large number of small partitions) is penalized!

Used in C4.5

Designed to overcome the disadvantage of Information Gain

42

GainRATIO_split = GAIN_split / SplitINFO

SplitINFO = − ∑𝑖=1…𝑘 (𝑛𝑖/𝑛) log2 (𝑛𝑖/𝑛)
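Extending the previous sketch with SplitINFO and the gain ratio for the same hypothetical partition (using log base 2, as in the entropy slides):

```python
from math import log2

def split_info(partition_sizes):
    """SplitINFO = -sum_i (n_i / n) log2 (n_i / n): the entropy of the partition sizes."""
    n = sum(partition_sizes)
    return sum(-(s / n) * log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# Same illustrative split as above: partitions of size 6 and 4 with GAIN_split ~ 0.28
print(split_info([6, 4]))         # ~0.97
print(gain_ratio(0.28, [6, 4]))   # ~0.29
```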

Page 43: SVM and Decision Tree

Stopping Criteria for Tree Induction

Stop expanding a node when all the records belong to the same class

Stop expanding a node when all the records have similar attribute values

Early termination (to be discussed later)

Page 44: SVM and Decision Tree

Decision Tree Based Classification

Advantages:

Inexpensive to construct

Extremely fast at classifying unknown records

Easy to interpret for small-sized trees

Accuracy is comparable to other classification techniques for many simple data sets

Example: C4.5

Simple depth-first construction.

Uses Information Gain

Sorts Continuous Attributes at each node.

Needs entire data to fit in memory.

Unsuitable for Large Datasets. Needs out-of-core sorting.

You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
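C4.5 itself is the C implementation linked above; for a quick experiment in Python, scikit-learn's DecisionTreeClassifier with the entropy criterion is a rough stand-in (not identical to C4.5). A minimal usage sketch on a hypothetical numeric encoding of the example records:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical numeric encoding of the example records:
# [Refund (1 = Yes), Married (1 = Yes), Taxable Income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 0, 95],
     [0, 1, 60], [1, 0, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(clf, feature_names=["Refund", "Married", "TaxIncK"]))
print(clf.predict([[0, 1, 80]]))   # the query record from the earlier slides
```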