Decision tree LING 572 Fei Xia 1/10/06


Page 1

Decision tree

LING 572

Fei Xia

1/10/06

Page 2

Outline

• Basic concepts

• Main issues

• Advanced topics

Page 3

Basic concepts

Page 4

A classification problem

District | House type    | Income | Previous customer | Outcome
---------|---------------|--------|-------------------|--------
Suburban | Detached      | High   | No                | Nothing
Suburban | Semi-detached | High   | Yes               | Respond
Rural    | Semi-detached | Low    | No                | Respond
Urban    | Detached      | Low    | Yes               | Nothing

Page 5

Classification and estimation problems

• Given
  – a finite set of (input) attributes: features
    • Ex: District, House type, Income, Previous customer
  – a target attribute: the goal
    • Ex: Outcome: {Nothing, Respond}
  – training data: a set of classified examples in attribute-value representation
    • Ex: the previous table

• Predict the value of the goal given the values of the input attributes
  – The goal is a discrete variable → a classification problem
  – The goal is a continuous variable → an estimation problem

Page 6

Decision tree:

District?
├─ Suburban (3/5) → House type?
│    ├─ Detached (2/2) → Nothing
│    └─ Semi-detached (3/3) → Respond
├─ Rural (4/4) → Respond
└─ Urban (3/5) → Previous customer?
     ├─ Yes (3/3) → Nothing
     └─ No (2/2) → Respond

Page 7

Decision tree representation

• Each internal node is a test:
  – Theoretically, a node can test multiple attributes
  – In most systems, a node tests exactly one attribute

• Each branch corresponds to a test result:
  – A branch corresponds to an attribute value or a range of attribute values

• Each leaf node assigns
  – a class: decision tree
  – a real value: regression tree

Page 8

What’s the best decision tree?

• “Best”: You need a bias (e.g., prefer the “smallest” tree): least depth? Fewest nodes? Which trees are the best predictors of unseen data?

• Occam's Razor: we prefer the simplest hypothesis that fits the data.

Find a decision tree that is as small as possible and fits the data

Page 9

Finding the smallest decision tree

• A decision tree can represent any discrete function of the inputs: y = f(x_1, x_2, …, x_n)

• The space of decision trees is too large for a systematic search for the smallest decision tree.

• Solution: greedy algorithm

Page 10

Basic algorithm: top-down induction

• Find the “best” decision attribute, A, and assign A as the decision attribute for the node

• For each value of A, create a new branch and divide up the training examples

• Repeat the process until the gain is small enough (see the sketch below)
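A minimal sketch of this greedy loop, assuming examples encoded as dicts and a pluggable quality measure (all names here are illustrative, not from the lecture):

```python
from collections import Counter

def build_tree(examples, attributes, target, score):
    """Greedy top-down induction (sketch).

    examples   -- list of dicts mapping attribute name -> value
    attributes -- attributes still available for testing
    target     -- name of the goal attribute
    score      -- quality measure, e.g. information gain (Q1 below)
    """
    labels = [ex[target] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]

    # Stop if the node is pure or no attributes are left to test.
    if len(set(labels)) == 1 or not attributes:
        return majority                     # leaf: predict the majority class

    # Greedy step: pick the attribute with the best score.
    best = max(attributes, key=lambda a: score(examples, a, target))
    if score(examples, best, target) <= 0:  # gain too small: stop splitting
        return majority

    # One branch per observed value of the chosen attribute; recurse.
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest, target, score)
    return tree
```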

Page 11

Major issues

Page 12

Major issues

Q1: Choosing the best attribute: what quality measure to use?

Q2: Determining when to stop splitting: avoid overfitting

Q3: Handling continuous attributes

Q4: Handling training data with missing attribute values

Q5: Handling attributes with different costs

Q6: Dealing with a continuous goal attribute

Page 13

Q1: What quality measure

• Information gain

• Gain Ratio

• …

Page 14

Entropy of a training set

• S is a sample of training examples

• Entropy is one way of measuring the impurity of S

• $p_c$ is the proportion of examples in S whose target attribute has value c.

$Entropy(S) = -\sum_{c} p_c \log_2 p_c$
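A direct transcription of this formula (a sketch; the function name and toy labels are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(['+'] * 9 + ['-'] * 5))  # ~0.940, the [9+,5-] sample used below
```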

Page 15

Information Gain

• Gain(S,A)=expected reduction in entropy due to sorting on A.

• Choose the A with the max information gain.

(equivalently, choose the A with the minimum weighted average entropy)

$Gain(S, A) \stackrel{def}{=} Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$
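The same definition in code, reusing the entropy sketch from Page 14 (names are illustrative):

```python
def info_gain(examples, attribute, target):
    """Gain(S, A): entropy of S minus the size-weighted entropy of each S_v."""
    n = len(examples)
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy([ex[target] for ex in examples]) - remainder
```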

Page 16

An example

Humidity split: S = [9+,5-], E = 0.940
  High: [3+,4-], E = 0.985
  Normal: [6+,1-], E = 0.592
  InfoGain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Wind split: S = [9+,5-], E = 0.940
  Weak: [6+,2-], E = 0.811
  Strong: [3+,3-], E = 1.00
  InfoGain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
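These numbers can be checked with the entropy sketch from Page 14:

```python
E_S = entropy(['+'] * 9 + ['-'] * 5)                 # 0.940
humidity = E_S - 7/14 * entropy(['+']*3 + ['-']*4) \
               - 7/14 * entropy(['+']*6 + ['-']*1)
wind = E_S - 8/14 * entropy(['+']*6 + ['-']*2) \
           - 6/14 * entropy(['+']*3 + ['-']*3)
print(humidity, wind)  # ~0.15 and ~0.048: Humidity is the better split
```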

Page 17

Other quality measures

• Problem with information gain:
  – Information gain prefers attributes with many values.

• An alternative: Gain Ratio

$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$

$SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

where $S_i$ is the subset of S for which A has value $v_i$.
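A sketch of both quantities, reusing info_gain and the imports from the earlier sketches:

```python
def split_information(examples, attribute):
    """SplitInformation(S, A): entropy of the partition sizes themselves."""
    n = len(examples)
    sizes = Counter(ex[attribute] for ex in examples).values()
    return -sum((s / n) * math.log2(s / n) for s in sizes)

def gain_ratio(examples, attribute, target):
    """Penalize many-valued attributes by dividing gain by split information."""
    si = split_information(examples, attribute)
    return info_gain(examples, attribute, target) / si if si > 0 else 0.0
```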

Page 18

Q2: avoiding overfitting

• Overfitting occurs when the decision tree captures too much detail, or noise, in the training data.

• Consider the error of hypothesis h over
  – the training data: ErrorTrain(h)
  – the entire distribution D of the data: ErrorD(h)

• A hypothesis h overfits the training data if there is an alternative hypothesis h’ such that
  – ErrorTrain(h) < ErrorTrain(h’), and
  – ErrorD(h) > ErrorD(h’)

Page 19

How to avoid overfitting

• Stop growing the tree earlier
  – Ex: InfoGain < threshold
  – Ex: number of examples in a node < threshold
  – …

• Grow full tree, then post-prune

In practice, the latter works better than the former.

Page 20

Post-pruning

• Split the data into a training set and a validation set
• Do until further pruning is harmful:
  – Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  – Greedily remove the ones that don’t improve performance on the validation set

This produces a smaller tree with the best value of the chosen performance measure (one realization is sketched below).
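One way to realize this loop over the dict-encoded trees from the Page 10 sketch (a simplification: each subtree is scored on the validation examples routed to it, rather than re-scoring the whole tree after every candidate prune; Counter comes from the earlier imports):

```python
def classify(tree, example, default):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(example[attr], default)
    return tree

def prune(tree, train, validation, target):
    """Reduced-error pruning sketch: prune children bottom-up, then replace
    this subtree with the majority class of the training examples that
    reached it whenever that does not hurt validation accuracy."""
    if not isinstance(tree, dict):          # already a leaf
        return tree
    attr = next(iter(tree))
    majority = Counter(ex[target] for ex in train).most_common(1)[0][0]
    for value, subtree in list(tree[attr].items()):
        tr = [ex for ex in train if ex[attr] == value]
        va = [ex for ex in validation if ex[attr] == value]
        if tr:
            tree[attr][value] = prune(subtree, tr, va, target)
    subtree_hits = sum(classify(tree, ex, majority) == ex[target]
                       for ex in validation)
    leaf_hits = sum(majority == ex[target] for ex in validation)
    return majority if validation and leaf_hits >= subtree_hits else tree
```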

Page 21

Performance measure

• Accuracy:
  – on validation data
  – K-fold cross validation

• Misclassification cost: sometimes higher accuracy is desired for some classes than for others.

• MDL (minimum description length): size(tree) + errors(tree)

Page 22

Rule post-pruning

• Convert tree to equivalent set of rules

• Prune each rule independently of others

• Sort final rules into desired sequence for use

• Perhaps the most frequently used method (e.g., in C4.5)

Page 23

Q3: handling numeric attributes

• Continuous attribute → discrete attribute

• Example
  – Original attribute: Temperature = 82.5
  – New attribute: (Temperature > 72.3) = t, f

Question: how to choose thresholds?

Page 24

Choosing thresholds for a continuous attribute

• Sort the examples according to the continuous attribute.

• Identify adjacent examples that differ in their target classification → a set of candidate thresholds

• Choose the candidate with the highest information gain.
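A sketch of that procedure, taking candidate thresholds as midpoints between adjacent values whose classes differ (a common choice) and reusing the entropy sketch from Page 14:

```python
def best_threshold(examples, attribute, target):
    """Sort by the attribute, collect midpoints where the class changes,
    and return the candidate threshold with the highest information gain."""
    pts = sorted((ex[attribute], ex[target]) for ex in examples)
    candidates = [(a + b) / 2 for (a, la), (b, lb) in zip(pts, pts[1:])
                  if la != lb and a != b]
    if not candidates:                  # attribute never separates classes
        return None

    def gain_at(t):
        left = [ex[target] for ex in examples if ex[attribute] <= t]
        right = [ex[target] for ex in examples if ex[attribute] > t]
        n = len(examples)
        return (entropy([ex[target] for ex in examples])
                - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))

    return max(candidates, key=gain_at)
```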

Page 25

Q4: Unknown attribute values

• Assume an attribute can take the value “blank”.

• Assign the most common value of A among the training data at node n.

• Assign the most common value of A among the training data at node n that have the same target class.

• Assign probability p_i to each possible value v_i of A
  – Assign a fraction (p_i) of the example to each descendant in the tree
  – This method is used in C4.5 (sketched below).
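A sketch of the fractional-count idea (illustrative only: examples are carried as (example, weight) pairs, with “blank” standing in for a missing value; Counter comes from the earlier imports):

```python
def fractional_split(examples, attribute, missing='blank'):
    """Send an example with a missing value down every branch, weighted by
    the observed distribution p_i of the attribute at this node."""
    freq = Counter(ex[attribute] for ex in examples
                   if ex[attribute] != missing)
    total = sum(freq.values())
    branches = {v: [] for v in freq}
    for ex in examples:
        w = ex.get('weight', 1.0)
        if ex[attribute] != missing:
            branches[ex[attribute]].append((ex, w))
        else:
            for v, count in freq.items():   # fraction p_i = count / total
                branches[v].append((ex, w * count / total))
    return branches
```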

Page 26

Q5: Attributes with cost

• Consider medical diagnosis, where each test (e.g., a blood test) has a cost

• Question: how to learn a consistent tree with low expected cost?

• One approach (Tan and Schlimmer, 1990): replace the gain with

$\frac{Gain^2(S, A)}{Cost(A)}$
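As a quality measure, this drops straight into the induction sketch from Page 10, reusing info_gain from Page 15 (the cost lookup table is a hypothetical argument):

```python
def cost_sensitive_score(examples, attribute, target, cost):
    """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)."""
    return info_gain(examples, attribute, target) ** 2 / cost[attribute]
```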

Page 27

Q6: Dealing with a continuous goal attribute: regression trees

• A variant of decision trees

• Estimation problem: approximating real-valued functions (e.g., the crime rate)

• A leaf node is marked with a real value or a linear function: e.g., the mean of the target values of the examples at the node.

• Measure of impurity: e.g., variance, standard deviation, …
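For instance, variance as the impurity measure gives a regression-tree analogue of information gain (a sketch; names are illustrative):

```python
def variance(values):
    """Impurity of a set of real-valued targets."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(examples, attribute, target):
    """Drop in variance from splitting on the attribute."""
    n = len(examples)
    by_value = {}
    for ex in examples:
        by_value.setdefault(ex[attribute], []).append(ex[target])
    total = variance([ex[target] for ex in examples])
    return total - sum(len(v) / n * variance(v) for v in by_value.values())
```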

Page 28

Summary of Major issues

Q1: Choosing the best attribute: different quality measures.

Q2: Determining when to stop splitting: stop earlier or post-prune.

Q3: Handling continuous attributes: find the breakpoints.

Page 29

Summary of major issues (cont)

Q4: Handling training data with missing attribute values: blank value, most common value, or fractional count

Q5: Handling attributes with different costs: use a quality measure that includes the cost factors.

Q6: Dealing with continuous goal attribute: various ways of building regression trees.

Page 30

Common algorithms

• ID3

• C4.5

• CART

Page 31

ID3

• Proposed by Quinlan (as is C4.5)

• Can handle basic cases: discrete attributes, no missing information, etc.

• Information gain as quality measure

Page 32

C4.5

• An extension of ID3:
  – Several quality measures
  – Incomplete information (missing attribute values)
  – Numerical (continuous) attributes
  – Pruning of decision trees
  – Rule derivation
  – Random mode and batch mode

Page 33

CART

• CART (classification and regression tree)

• Proposed by Breiman et al. (1984)

• Constant numerical values in leaves

• Variance as measure of impurity

Page 34

Strengths of decision tree methods

• Ability to generate understandable rules

• Ease of calculation at classification time

• Ability to handle both continuous and categorical variables

• Ability to clearly indicate the best attributes

Page 35

The weaknesses of decision tree methods

• Greedy algorithm: no global optimization

• Error-prone with too many classes: the number of training examples per node shrinks quickly in a tree with many levels/branches.

• Expensive to train: sorting, combination of attributes, calculating quality measures, etc.

• Trouble with non-rectangular regions: decision trees carve the decision space into rectangular boxes, which may not correspond well with the actual distribution of records.

Page 36

Advanced topics

Page 37

Combining multiple models

• The inherent instability of top-down decision tree induction: different training datasets from a given problem domain will produce quite different trees.

• Techniques:– Bagging– Boosting

Page 38

Bagging

• Introduced by Breiman

• It creates multiple decision trees, each trained on a different bootstrap sample of the training set.

• The trees’ predictions are then combined, e.g., by majority vote.

• This addresses some of the instability problems inherent in regular ID3.

Page 39

Boosting

• Introduced by Freund and Schapire

• It trains trees iteratively: instances that are incorrectly classified are given higher weights, so later trees focus on the hard cases.

• The trees are then combined by a weighted vote, where each tree’s weight reflects how well it performs.

Page 40

Summary

• Basic case:
  – Discrete input attributes
  – Discrete goal attribute
  – No missing attribute values
  – Same cost for all tests and all kinds of misclassification

• Extended cases:
  – Continuous attributes
  – Real-valued goal attribute
  – Some examples miss some attribute values
  – Some tests are more expensive than others
  – Some misclassifications are more serious than others

Page 41

Summary (cont)

• Basic algorithm:
  – greedy algorithm
  – top-down induction
  – bias for small trees

• Major issues: Q1-Q6 above

Page 42

Uncovered issues

• Incremental decision tree induction?

• How can a decision relate to other decisions? What’s the order of making the decisions? (e.g., POS tagging)

• What’s the difference between a decision tree and a decision list?