Decision tree
LING 572
Fei Xia
1/10/06
Outline
• Basic concepts
• Main issues
• Advanced topics
Basic concepts
A classification problem
District    | House type     | Income | Previous customer | Outcome
------------|----------------|--------|-------------------|--------
Suburban    | Detached       | High   | No                | Nothing
Suburban    | Semi-detached  | High   | Yes               | Respond
Rural       | Semi-detached  | Low    | No                | Respond
Urban       | Detached       | Low    | Yes               | Nothing
…
Classification and estimation problems
• Given
  – a finite set of (input) attributes: features
    • Ex: District, House type, Income, Previous customer
  – a target attribute: the goal
    • Ex: Outcome: {Nothing, Respond}
  – training data: a set of classified examples in attribute-value representation
    • Ex: the previous table
• Predict the value of the goal given the values of the input attributes
  – The goal is a discrete variable → classification problem
  – The goal is a continuous variable → estimation problem
Decision tree for the example data:
• District = Suburban (3/5) → test House type
  – Detached (2/2): Nothing
  – Semi-detached (3/3): Respond
• District = Rural (4/4): Respond
• District = Urban (3/5) → test Previous customer
  – Yes (3/3): Nothing
  – No (2/2): Respond
Decision tree representation
• Each internal node is a test:
  – Theoretically, a node can test multiple attributes
  – In most systems, a node tests exactly one attribute
• Each branch corresponds to test results
  – A branch corresponds to an attribute value or a range of attribute values
• Each leaf node assigns
  – a class: decision tree
  – a real value: regression tree
What’s the “best” decision tree?
• “Best”: You need a bias (e.g., prefer the “smallest” tree): least depth? Fewest nodes? Which trees are the best predictors of unseen data?
• Occam's Razor: we prefer the simplest hypothesis that fits the data.
Find a decision tree that is as small as possible and fits the data
Finding a smallest decision tree
• A decision tree can represent any discrete function of the inputs: y=f(x1, x2, …, xn)
• The space of decision trees is too big for a systematic search for a smallest decision tree.
• Solution: greedy algorithm
Basic algorithm: top-down induction
• Find the “best” decision attribute, A, and assign A as the decision attribute for the node
• For each value of A, create a new branch and divide up the training examples
• Repeat the process on each branch until the gain is small enough (a minimal sketch follows below)
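A minimal sketch of this greedy, top-down induction loop in Python (not from the slides; the (features-dict, label) example representation, the majority_class helper, and the pluggable choose_attribute quality measure are assumptions for illustration):

```python
from collections import Counter

def majority_class(examples):
    # examples are assumed to be (features_dict, label) pairs
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, attributes, choose_attribute, min_gain=1e-3):
    """Greedy top-down induction: pick the best attribute, split, recurse.
    choose_attribute(examples, attributes) -> (attribute, gain) is a quality
    measure such as the information-gain scorer sketched under Q1 below."""
    labels = {label for _, label in examples}
    if len(labels) == 1:                       # node is pure: make a leaf
        return labels.pop()
    if not attributes:                         # no attribute left to test
        return majority_class(examples)
    best, gain = choose_attribute(examples, attributes)
    if gain < min_gain:                        # stop when the gain is too small
        return majority_class(examples)
    tree = {best: {}}
    for value in {feats[best] for feats, _ in examples}:
        subset = [(f, y) for f, y in examples if f[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, choose_attribute, min_gain)
    return tree
```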
Major issues
Major issues
Q1: Choosing best attribute: what quality measure to use?
Q2: Determining when to stop splitting: avoid overfitting
Q3: Handling continuous attributes
Q4: Handling training data with missing attribute values
Q5: Handling attributes with different costs
Q6: Dealing with a continuous goal attribute
Q1: What quality measure
• Information gain
• Gain Ratio
• …
Entropy of a training set
• S is a sample of training examples
• Entropy is one way of measuring the impurity of S
• p_c is the proportion of examples in S whose target attribute has value c.

Entropy(S) = - \sum_{c} p_c \log_2 p_c
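A small Python sketch of this entropy computation (the list-of-labels input format is an assumption for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c), where p_c is the proportion of
    examples whose target attribute has value c."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# e.g., entropy(9 * ['+'] + 5 * ['-']) is about 0.940
```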
Information Gain
• Gain(S,A)=expected reduction in entropy due to sorting on A.
• Choose the A with the max information gain.
(a.k.a. choose the A with the min average entropy)
Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
An example
• Split on Humidity: S = [9+, 5-], E(S) = 0.940
  – High: [3+, 4-], E = 0.985
  – Normal: [6+, 1-], E = 0.592
  – InfoGain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
• Split on Wind: S = [9+, 5-], E(S) = 0.940
  – Weak: [6+, 2-], E = 0.811
  – Strong: [3+, 3-], E = 1.00
  – InfoGain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
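A sketch of the information-gain computation, reusing the entropy function above; the (features-dict, label) representation is an assumption for illustration. Applied to a 14-example set that Humidity splits into [3+, 4-] and [6+, 1-], it reproduces the 0.151 figure above.

```python
from collections import defaultdict

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v),
    where examples are (features_dict, label) pairs."""
    total = len(examples)
    base = entropy([label for _, label in examples])
    by_value = defaultdict(list)
    for feats, label in examples:
        by_value[feats[attribute]].append(label)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in by_value.values())
    return base - remainder

def choose_attribute(examples, attributes):
    """Return the attribute with the maximum information gain, and that gain."""
    return max(((a, information_gain(examples, a)) for a in attributes),
               key=lambda pair: pair[1])
```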
Other quality measures
• Problem with information gain:
  – Information gain prefers attributes with many values.
• An alternative: Gain Ratio

GainRatio(S, A) \equiv \frac{Gain(S, A)}{SplitInformation(S, A)}

SplitInformation(S, A) \equiv - \sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}

where S_i is the subset of S for which A has value v_i.
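A companion sketch of SplitInformation and GainRatio under the same assumed representation, reusing information_gain from the previous sketch:

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|),
    i.e., the entropy of S with respect to the values of A."""
    total = len(examples)
    counts = Counter(feats[attribute] for feats, _ in examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def gain_ratio(examples, attribute):
    si = split_information(examples, attribute)
    if si == 0:                      # all examples share one value of A
        return 0.0
    return information_gain(examples, attribute) / si
```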
Q2: avoiding overfitting
• Overfitting occurs when the decision tree captures too much detail, or noise, in the training data.
• Consider the error of hypothesis h over
  – the training data: ErrorTrain(h)
  – the entire distribution D of the data: ErrorD(h)
• A hypothesis h overfits the training data if there is an alternative hypothesis h’ such that
  – ErrorTrain(h) < ErrorTrain(h’), and
  – ErrorD(h) > ErrorD(h’)
How to avoid overfitting
• Stop growing the tree earlier
  – Ex: InfoGain < threshold
  – Ex: Number of examples in a node < threshold
  – …
• Grow the full tree, then post-prune
In practice, the latter works better than the former.
Post-pruning
• Split the data into a training set and a validation set
• Do until further pruning is harmful:
  – Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  – Greedily remove the node whose removal most improves performance on the validation set
• This produces a smaller tree with the best performance on the validation set (a simplified sketch follows below)
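A simplified, bottom-up sketch of reduced-error pruning (the slides describe a greedy, iterative variant); it reuses majority_class and the dict-tree representation from the induction sketch above, and the classify/accuracy helpers are assumptions for illustration:

```python
def classify(tree, feats, default=None):
    """Follow the dict-tree from build_tree until a leaf (a class label)."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches.get(feats.get(attribute), default)
    return tree

def accuracy(tree, examples):
    return sum(classify(tree, f) == y for f, y in examples) / len(examples)

def reduced_error_prune(tree, train, validation):
    """Bottom-up: replace a subtree by the majority class of its training
    examples whenever that does not hurt accuracy on the validation set."""
    if not isinstance(tree, dict):
        return tree
    attribute, branches = next(iter(tree.items()))
    for value, subtree in list(branches.items()):
        sub_train = [(f, y) for f, y in train if f.get(attribute) == value]
        sub_val = [(f, y) for f, y in validation if f.get(attribute) == value]
        branches[value] = reduced_error_prune(subtree, sub_train or train, sub_val)
    leaf = majority_class(train)
    if validation and accuracy(leaf, validation) >= accuracy(tree, validation):
        return leaf                  # pruning does not hurt: collapse the node
    return tree
```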
Performance measure
• Accuracy:
  – on validation data
  – K-fold cross validation
• Misclassification cost: sometimes higher accuracy is desired for some classes than for others.
• MDL: size(tree) + errors(tree)
Rule post-pruning
• Convert tree to equivalent set of rules
• Prune each rule independently of others
• Sort final rules into desired sequence for use
• Perhaps most frequently used method (e.g., C4.5)
Q3: handling numeric attributes
• Continuous attribute → discrete attribute
• Example
  – Original attribute: Temperature = 82.5
  – New attribute: (Temperature > 72.3) = t, f
Question: how to choose thresholds?
Choosing thresholds for a continuous attribute
• Sort the examples according to the continuous attribute.
• Identify adjacent examples that differ in their target classification → a set of candidate thresholds
• Choose the candidate with the highest information gain.
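A sketch of this threshold-selection procedure, reusing information_gain and the assumed example representation from the earlier sketches:

```python
def best_threshold(examples, attribute):
    """Sort by the continuous attribute, take candidate thresholds at the
    midpoints between adjacent examples whose classes differ, and return the
    candidate with the highest information gain on the induced binary split."""
    ordered = sorted(examples, key=lambda ex: ex[0][attribute])
    candidates = []
    for (f1, y1), (f2, y2) in zip(ordered, ordered[1:]):
        if y1 != y2 and f1[attribute] != f2[attribute]:
            candidates.append((f1[attribute] + f2[attribute]) / 2)

    def gain_at(t):
        # turn the continuous attribute into a boolean test (value > t)
        binarized = [({attribute: feats[attribute] > t}, label)
                     for feats, label in examples]
        return information_gain(binarized, attribute)

    return max(candidates, key=gain_at) if candidates else None
```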
Q4: Unknown attribute values
• Assume an attribute can take the value “blank”.
• Assign most common value of A among training data at node n.
• Assign most common value of A among training data at node n which have the same target class.
• Assign a probability p_i to each possible value v_i of A
  – Assign a fraction (p_i) of the example to each descendant in the tree
  – This method is used in C4.5.
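A minimal sketch of the C4.5-style fractional-count idea in the last bullet; the weighted (features-dict, weight) example representation and the helper name are assumptions for illustration:

```python
from collections import defaultdict

def split_with_missing(examples, attribute):
    """Route (features_dict, weight) pairs into branches; a missing value is
    represented as None.  Known values go to their own branch; an example with
    a missing value is distributed across all branches in proportion to the
    observed value distribution (the fractions p_i)."""
    dist = defaultdict(float)
    for feats, w in examples:
        if feats.get(attribute) is not None:
            dist[feats[attribute]] += w
    total = sum(dist.values())
    branches = defaultdict(list)
    if total == 0:                   # every example misses the attribute
        return branches
    for feats, w in examples:
        value = feats.get(attribute)
        if value is not None:
            branches[value].append((feats, w))
        else:
            for v, mass in dist.items():
                branches[v].append((feats, w * mass / total))  # fraction p_i
    return branches
```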
Q5: Attributes with cost
• Consider medical diagnosis: each test (e.g., a blood test) has a cost
• Question: how to learn a consistent tree with low expected cost?
• One approach (Tan and Schlimmer, 1990): replace Gain(S, A) with

\frac{Gain^2(S, A)}{Cost(A)}
Q6: Dealing with a continuous goal attribute → Regression tree
• A variant of decision trees
• Estimation problem: approximate real-valued functions: e.g., the crime rate
• A leaf node is marked with a real value or a linear function: e.g., the mean of the target values of the examples at the node.
• Measure of impurity: e.g., variance, standard deviation, …
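A sketch of variance as the impurity measure and the corresponding split score (variance reduction); the (features-dict, target-value) representation follows the earlier sketches and is an assumption, not from the slides:

```python
from collections import defaultdict

def variance(values):
    """Variance of the real-valued targets at a node (the impurity measure)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(examples, attribute):
    """Analogue of information gain for regression trees:
    Var(S) - sum_v |S_v|/|S| * Var(S_v); a leaf predicts the mean target."""
    total = len(examples)
    targets = [y for _, y in examples]
    by_value = defaultdict(list)
    for feats, y in examples:
        by_value[feats[attribute]].append(y)
    remainder = sum(len(sub) / total * variance(sub) for sub in by_value.values())
    return variance(targets) - remainder
```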
Summary of Major issues
Q1: Choosing the best attribute: different quality measures.
Q2: Determining when to stop splitting: stop earlier or post-pruning
Q3: Handling continuous attributes: find the breakpoints
Summary of major issues (cont)
Q4: Handling training data with missing attribute values: blank value, most common value, or fractional count
Q5: Handling attributes with different costs: use a quality measure that includes the cost factors.
Q6: Dealing with continuous goal attribute: various ways of building regression trees.
Common algorithms
• ID3
• C4.5
• CART
ID3
• Proposed by Quinlan (as is C4.5)
• Can handle basic cases: discrete attributes, no missing information, etc.
• Information gain as quality measure
C4.5
• An extension of ID3:
  – Several quality measures
  – Incomplete information (missing attribute values)
  – Numerical (continuous) attributes
  – Pruning of decision trees
  – Rule derivation
  – Random mode and batch mode
CART
• CART (classification and regression tree)
• Proposed by Breiman et al. (1984)
• Constant numerical values in leaves
• Variance as measure of impurity
Strengths of decision tree methods
• Ability to generate understandable rules
• Ease of calculation at classification time
• Ability to handle both continuous and categorical variables
• Ability to clearly indicate best attributes
The weaknesses of decision tree methods
• Greedy algorithm: no global optimization
• Error-prone with too many classes: the number of training examples per node shrinks quickly in a tree with many levels/branches.
• Expensive to train: sorting, combination of attributes, calculating quality measures, etc.
• Trouble with non-rectangular regions: trees produce rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space.
Advanced topics
Combining multiple models
• The inherent instability of top-down decision tree induction: different training datasets from a given problem domain will produce quite different trees.
• Techniques:– Bagging– Boosting
Bagging
• Introduced by Breiman
• It first creates multiple decision trees, each built from a different training set drawn (with replacement) from the original data.
• The trees’ predictions are then combined, e.g., by majority vote (or by averaging for regression).
• This addresses the instability problems inherent in single trees such as those built by ID3.
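A minimal sketch of bagging; the build and classify helpers (e.g., like those sketched earlier) are assumptions, and since the slides do not prescribe a particular combination rule, majority voting is used here:

```python
import random
from collections import Counter

def bagged_predict(train_data, instance, build, classify, n_trees=25):
    """Bagging: train each tree on a bootstrap sample (drawn with replacement
    from the training set) and combine predictions by majority vote."""
    votes = []
    for _ in range(n_trees):
        bootstrap = [random.choice(train_data) for _ in train_data]
        tree = build(bootstrap)
        votes.append(classify(tree, instance))
    return Counter(votes).most_common(1)[0][0]
```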
Boosting
• Introduced by Freund and Schapire
• It builds trees in sequence; after each round, the instances that are classified incorrectly are given higher weight.
• These weights refocus the next tree on the examples the current trees handle poorly; the final prediction is a weighted vote over all the trees.
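A sketch of one AdaBoost-style round of instance reweighting; the particular update rule below is an assumption for illustration, since the slides describe the idea only informally:

```python
import math

def boost_round(weights, predictions, labels):
    """Given normalized instance weights and one tree's predictions, compute
    the tree's weighted error, its vote weight alpha, and the new instance
    weights: misclassified instances are up-weighted for the next round."""
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    error = min(max(error, 1e-10), 1 - 1e-10)     # avoid division by zero
    alpha = 0.5 * math.log((1 - error) / error)   # vote weight of this tree
    updated = [w * math.exp(-alpha if p == y else alpha)
               for w, p, y in zip(weights, predictions, labels)]
    z = sum(updated)
    return alpha, [w / z for w in updated]
```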
Summary
• Basic case:
  – Discrete input attributes
  – Discrete goal attribute
  – No missing attribute values
  – Same cost for all tests and all kinds of misclassification
• Extended cases:
  – Continuous attributes
  – Real-valued goal attribute
  – Some examples miss some attribute values
  – Some tests are more expensive than others
  – Some misclassifications are more serious than others
Summary (cont)
• Basic algorithm:
  – greedy algorithm
  – top-down induction
  – bias for small trees
• Major issues: Q1–Q6 above
Uncovered issues
• Incremental decision tree induction?
• How can a decision relate to other decisions? What is the order of making the decisions? (e.g., POS tagging)
• What's the difference between decision tree and decision list?