
6-Slide Example: Gene Chip Data
© Jude Shavlik 2006, David Page 2010
CS 760 – Machine Learning (UW-Madison)


6-Slide Example: Gene Chip Data

Person      A28202_ac    AB00014_at   AB00015_at   ...   CLASS
Person 1    P  1142.0    A  321.0     P  2567.2    ...   myeloma
Person 2    A  -586.3    P  586.1     P  759.0     ...   normal
Person 3    A  105.2     A  559.3     P  3210.7    ...   myeloma
Person 4    P  -42.8     P  692.1     P  812.0     ...   normal
...         ...          ...          ...          ...   ...


Decision Trees in One Picture

[Figure: a one-node decision tree that tests "AvgDiff of G5" and sends examples with values < 1000 down one branch and values > 1000 down the other; the two leaves are labeled Myeloma and Normal.]


Example: Gene Expression

Decision tree:

AD_X57809_at <= 20343.4: myeloma (74)

AD_X57809_at > 20343.4: normal (31)

Leave-one-out cross-validation accuracy estimate: 97.1%

X57809: IGL (immunoglobulin lambda locus)


Problem with Result

• Easy to predict accurately with genes related to immune function, such as IGL, but this gives us no new insight.
• Eliminate these genes prior to training.
• Possible because of the comprehensibility of decision trees.


Ignoring Genes Associated with Immune Function

Decision tree:

AD_X04898_rna1_at <= -1453.4: normal (30)

AD_X04898_rna1_at > -1453.4: myeloma (74/1)

X04898: APOA2 (Apolipoprotein AII)

Leave-one-out accuracy estimate: 98.1%.


Another Tree

AD_M15881_at > 992: normal (28)

AD_M15881_at <= 992:

    AC_D82348_at = A: normal (3)

    AC_D82348_at = P: myeloma (74)


A Measure of Node Purity

Let f+ = fraction of positive examples
Let f- = fraction of negative examples

    f+ = p / (p + n),   f- = n / (p + n),   where p = #pos and n = #neg

Under an optimal code, the information needed (expected number of bits) to label one example is

    Info(f+, f-) = - f+ lg(f+) - f- lg(f-)

This is also called the entropy of the set of examples (derived later).
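As an aside that is not part of the original slides, here is a minimal Python sketch of this two-class Info measure (the function name is ours):

```python
import math

def info(f_pos, f_neg):
    """Info(f+, f-) = -f+ lg(f+) - f- lg(f-), with 0 lg 0 taken as 0."""
    return -sum(f * math.log2(f) for f in (f_pos, f_neg) if f > 0)

print(info(1.0, 0.0))  # 0.0 -- a pure node
print(info(0.5, 0.5))  # 1.0 -- a maximally mixed node
```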


Another Commonly-Used Measure of Node Purity

• Gini index: (f+)(f-)
• Used in CART (Classification and Regression Trees, Breiman et al., 1984)
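Again as an illustration only (not from the slides), the product form above in Python; note that for two classes the more familiar 1 - (f+)^2 - (f-)^2 equals 2(f+)(f-), so the two definitions differ only by a constant factor:

```python
def gini(f_pos, f_neg):
    """Two-class Gini index as written on the slide: (f+)(f-)."""
    return f_pos * f_neg

print(gini(0.5, 0.5))  # 0.25 -- most impure two-class node
print(gini(1.0, 0.0))  # 0.0  -- pure node
```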


Info(f+, f-): Consider the Extreme Cases

• All same class (+, say):  Info(1, 0) = -1 lg(1) - 0 lg(0) = 0   (0 lg 0 is 0 by definition)
• 50-50 mixture:  Info(½, ½) = 2 [-½ lg(½)] = 1


Evaluating a Feature

• How much does it help to know the value of attribute/feature A?
• Assume A divides the current set of examples into N groups.
  Let q_i  = fraction of data on branch i
      f_i+ = fraction of +'s on branch i
      f_i- = fraction of -'s on branch i


Evaluating a Feature (con't)

    E(A) ≡ Σ_{i=1..N}  q_i × I(f_i+, f_i-)

• Info needed after determining the value of attribute A
• Another expected value calculation

Pictorially: [Figure: a node that tests A, with branches v1 … vN; the full set has purity I(f+, f-) and the examples on branch i have purity I(f_i+, f_i-).]
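A small Python sketch of E(A), reusing the hypothetical info helper from the earlier sketch; examples are assumed to be (feature-dict, label) pairs with '+'/'-' labels (this data format is ours, not the slides'):

```python
from collections import Counter, defaultdict

def expected_info(examples, feature):
    """E(A) = sum over branches i of q_i * Info(f_i+, f_i-)."""
    branches = defaultdict(list)
    for features, label in examples:
        branches[features[feature]].append(label)   # group examples by their value of A
    n_total = len(examples)
    e = 0.0
    for labels in branches.values():
        counts = Counter(labels)
        n = len(labels)
        e += (n / n_total) * info(counts['+'] / n, counts['-'] / n)
    return e
```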


Info Gain

    Gain(A) ≡ I(f+, f-) - E(A)

This is our scoring function in our hill-climbing (greedy) algorithm. Since I(f+, f-) is constant for all features, we simply pick the A with the smallest E(A). That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution.
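Continuing the illustrative sketch (and reusing the hypothetical info and expected_info helpers defined above):

```python
from collections import Counter

def info_gain(examples, feature):
    """Gain(A) = I(f+, f-) - E(A)."""
    counts = Counter(label for _, label in examples)
    n = len(examples)
    parent = info(counts['+'] / n, counts['-'] / n)   # constant across features
    return parent - expected_info(examples, feature)
```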


Example Info-Gain Calculation

Color     Shape        Size    Class
Red       (pictured)   BIG     +
Red       (pictured)   BIG     +
Yellow    (pictured)   SMALL   -
Red       (pictured)   SMALL   -
Blue      (pictured)   BIG     +

(The shape values were drawn graphically on the slide and are not reproduced here.)

I(f+, f-) = ?     E(color) = ?
E(shape)  = ?     E(size)  = ?


Info-Gain Calculation (cont.)

    I(f+, f-) = I(0.6, 0.4) = -0.6 log2(0.6) - 0.4 log2(0.4) ≈ 0.97
    E(color)  = 0.6 × I(2/3, 1/3) + 0.2 × I(1, 0) + 0.2 × I(0, 1)
    E(shape)  = 0.6 × I(2/3, 1/3) + 0.4 × I(1/2, 1/2)
    E(size)   = 0.6 × I(1, 0) + 0.4 × I(0, 1) = 0

Note that “Size” provides complete classification, so we are done.
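To check these numbers, a short script (illustrative only; it reuses the hypothetical info and expected_info helpers from the earlier sketches, and omits the shape feature because its values were only drawn graphically):

```python
data = [
    ({'color': 'Red',    'size': 'BIG'},   '+'),
    ({'color': 'Red',    'size': 'BIG'},   '+'),
    ({'color': 'Yellow', 'size': 'SMALL'}, '-'),
    ({'color': 'Red',    'size': 'SMALL'}, '-'),
    ({'color': 'Blue',   'size': 'BIG'},   '+'),
]

print(round(info(3/5, 2/5), 2))                # 0.97
print(round(expected_info(data, 'color'), 2))  # 0.55
print(round(expected_info(data, 'size'), 2))   # 0.0 -> size has the largest gain
```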


ID3 Info Gain Measure Justified
(Ref. C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)

Definition of information (due to Shannon): the info conveyed by a message M depends on its probability, i.e.,

    info(M) = -log2[ Prob(M) ]

Select an example from a set S and announce that it belongs to class C. The probability of this occurring is f_C, the fraction of C's in S. Hence the info in this announcement is, by definition, -log2(f_C).


ID3 Info Gain Measure (cont.)

Let there be K different classes in set S; call them C1, C2, …, CK.

What is the expected info from a message about the class of an example in set S?

    info(S) = -f_C1 log2(f_C1) - f_C2 log2(f_C2) - … - f_CK log2(f_CK)
            = -Σ_{j=1..K}  f_Cj log2(f_Cj),    where f_Cj = fraction of set S that is of class Cj

info(S) is the average number of bits of information (obtained by looking at feature values) needed to classify a member of set S.
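A direct translation (illustrative, not from the slides), generalizing the earlier two-class info sketch:

```python
import math

def info_set(class_fractions):
    """info(S) = -sum_j f_Cj * log2(f_Cj) over the K class fractions, with 0 log 0 = 0."""
    return -sum(f * math.log2(f) for f in class_fractions if f > 0)

print(info_set([0.5, 0.25, 0.25]))  # 1.5 bits
```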


Handling Hierarchical Features in ID3

Define a new feature for each level in the hierarchy, e.g.,

    Shape
      Circular    Polygonal

    Shape1 = {Circular, Polygonal}
    Shape2 = { the specific shapes at the next level down (drawn graphically on the slide) }

Let ID3 choose the appropriate level of abstraction!


Handling Numeric Features in ID3

On the fly, create binary features and choose the best.

Step 1: Plot the current examples (green = pos, red = neg).
Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g., feature_new1 = F < 8 and feature_new2 = F < 10.
Step 3: Choose the split with the best info gain.

[Figure: examples plotted along the value of the feature, at roughly 5, 7, 9, 11, and 13.]
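Step 2 can be sketched as follows (illustrative only; the labels here are made up and need not match the coloring on the slide's plot):

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values whose neighbours have different labels."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, y1), (v2, y2) in zip(pairs, pairs[1:])
            if y1 != y2 and v1 != v2]

print(candidate_thresholds([5, 7, 9, 11, 13], ['+', '+', '-', '-', '+']))  # [8.0, 12.0]
```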


Handling Numeric Features (cont.)

Note: you cannot discard a numeric feature after it has been used in one portion of the d-tree.

[Figure: a small tree that tests F < 10 at the root and then tests F < 5 again further down one branch, with +/- leaves on the True/False branches.]


Characteristic Property of Using the Info-Gain Measure

It FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values).

Extreme case: split on Student ID. With at most one example per leaf, all of the I(·,·) scores for the leaves equal zero, so the feature gets a perfect score! But it generalizes very poorly (i.e., it memorizes the data).

[Figure: a node testing Student ID, with branches such as 1, 99, and 999 each leading to a near-empty leaf like 1+ 0-, 0+ 0-, or 0+ 1-.]


Fix: Method 1

Convert all features to binary, e.g., Color = {Red, Blue, Green} goes from one N-valued feature to N binary features:

    Color = Red?    {True, False}
    Color = Blue?   {True, False}
    Color = Green?  {True, False}

• Used in neural nets and SVMs
• D-tree readability is probably less, but not necessarily
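A throwaway sketch of the conversion (the function and dictionary format are ours, not the slides'):

```python
def to_binary_features(example, feature, values):
    """Replace one N-valued feature with N Boolean 'feature = value?' features."""
    converted = {k: v for k, v in example.items() if k != feature}
    for value in values:
        converted[f'{feature}={value}?'] = (example[feature] == value)
    return converted

print(to_binary_features({'color': 'Red', 'size': 'BIG'}, 'color', ['Red', 'Blue', 'Green']))
# {'size': 'BIG', 'color=Red?': True, 'color=Blue?': False, 'color=Green?': False}
```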


Fix 2: Gain Ratio

Find the info content in the answer to: what is the value of feature A, ignoring the output category?

    IV(A) ≡ -Σ_{i=1..K}  [(p_i + n_i) / (p + n)]  log2[(p_i + n_i) / (p + n)]

where (p_i + n_i) / (p + n) is the fraction of all examples with A = i (this quantity is also known as the split information, SplitInfo).

Choose the A that maximizes

    GainRatio(A) = Gain(A) / IV(A)

Read the text (Mitchell) for exact details!
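Sketched in the same illustrative style as before (reusing the hypothetical info_gain helper and data format from the earlier sketches):

```python
import math
from collections import Counter

def split_info(examples, feature):
    """IV(A): entropy of the feature's own value distribution, ignoring the class."""
    counts = Counter(features[feature] for features, _ in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, feature):
    iv = split_info(examples, feature)
    return info_gain(examples, feature) / iv if iv > 0 else 0.0
```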


Fix: Method 3

Group values of nominal features, e.g., a multiway split

    Color?  with one branch per value: R, B, G, Y, …

versus a grouped binary split

    Color?  with one branch for a subset of values (e.g., R, B, …) and the other branch for the rest (G, Y, …)

• Done in CART (Breiman et al., 1984)
• Breiman et al. proved that, for the 2-category case, the optimal binary partition can be found by considering only O(N) possibilities instead of O(2^N)


Multiple Category Classification – Method 1 (used in SVMs)

Approach 1: Learn one tree (i.e., one model) per category:

    (Category_i)  versus  (not Category_i)

Pass test examples through each tree. What happens if a test example is predicted to lie in multiple categories? In none?


Multiple Category Classification – Method 2

Approach 2: Learn one tree in total.

It subdivides the full space such that every point belongs to one and only one category (the drawing on the slide is slightly misleading).


Where DT Learners Fail

• Correlation-immune functions like XOR (more in a later lecture)
• Target concepts where all or most features are needed (so a large tree is required), but we have limited data


Key Lessons from DT Research of the 1980s

• In general, simpler trees are better
• Avoid overfitting
• Link to the MDL/MML principle and Occam's Razor
• Pruning is better than early stopping


Advantages of Decision Tree Learning

• 1. Output is easily human-comprehensible
  • Can convert the tree to rules – one rule per branch leading to a positive class
  • Gives insight into the task
  • Lets us find errors in our approach
    • Using a “noise peak” in mass spec data
    • Using antibody (IG, HLA) genes in myeloma


Advantages of DT Learning

• 2. Good at Junta-learning problems
  • Tree learners focus on finding the smallest number of features that can be used to classify accurately
  • Tree learners ignore all features they don't select


Advantages of DT Learning

• 3. Really FAST!
  • Time to learn is fast: O(nm) to make a split, where nm is the size of the data set (n is the number of features, m is the number of data points)
  • Typically few splits are required, and the data analyzed for the recursive splits is even smaller
  • Time to label is linear in tree depth only


Advantages of DT Learners

• 4. Easily extended
  • Numerical prediction: put the average value at the leaf and score a split by the reduction in variance (regression trees); see the sketch below
  • Can instead have linear regression models at the leaves (model trees)
  • Easy to do multi-class prediction and multiple-instance learning (later lecture)
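As a side note that is not from the slides, the regression-tree split score mentioned above can be sketched as the weighted variance of the numeric target after the split (lower is better); everything here is illustrative:

```python
from statistics import pvariance

def variance_after_split(values_by_branch):
    """Weighted average of the per-branch variances of the numeric target."""
    n_total = sum(len(v) for v in values_by_branch)
    return sum(len(v) / n_total * pvariance(v) for v in values_by_branch if v)

print(variance_after_split([[1.0, 1.2, 0.9], [5.0, 5.5]]))  # ≈ 0.034
```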


Advantages of DT Learning

• 5. Complete representation language
  • DTs can represent any Boolean function
  • Note: this does not guarantee they will effectively learn any function


Disadvantages of DT Learners

• 1. The greedy algorithm means they fail on correlation-immune functions
• 2. They settle for the minimum feature set required for discrimination: they discriminate rather than characterize
• 3. Tendency to overfit if there are more features than data… little data is left for splits as we get deeper in the tree


Error

• Assume some probability distribution D over possible examples (e.g., all truth assignments over x1, …, xN)
• For a training set, assume D is uniform
• Assume a hypothesis space H (e.g., all decision trees over x1, …, xN)
• The error of a hypothesis h in H is the probability that h misclassifies an example drawn randomly according to D


Overfitting

• Given a hypothesis space H, a probability distribution D over examples, and a training set T
• A hypothesis h in H overfits T if there exists another h’ in H such that:
  • h has smaller error than h’ on T
  • h’ has smaller error than h over D


How we usually observe overfitting

• The accuracy (1 – error) of our hypothesis drops substantially on a new test set compared to the training set
• We observe this to a greater extent as the decision tree's size grows
• How can we combat overfitting in DTs?


Reduced Error Pruning

• Hold aside a tuning set (or validation set) from the training set
• For each internal node in the tree:
  • Try replacing the subtree rooted at that node with a leaf that predicts the majority class at that node in the training set
  • Keep this change if performance on the tuning set grows no worse than without the change
• Kearns & Mansour: do this bottom-up
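A compact sketch of the procedure (entirely illustrative: the dict-based tree format, with internal nodes carrying a 'majority' label and a 'pruned' flag, is an assumption of this sketch, not something defined on the slides):

```python
def predict(node, example):
    """Internal nodes: {'feature', 'children', 'majority', 'pruned'}; leaves: class labels."""
    while isinstance(node, dict):
        if node.get('pruned'):
            return node['majority']
        node = node['children'].get(example[node['feature']], node['majority'])
    return node

def accuracy(tree, dataset):
    return sum(predict(tree, x) == y for x, y in dataset) / len(dataset)

def reduced_error_prune(root, tune_set, node=None):
    """Bottom-up: tentatively turn each subtree into its majority leaf and keep the
    change whenever tuning-set accuracy gets no worse."""
    node = root if node is None else node
    if not isinstance(node, dict):
        return
    for child in node['children'].values():
        reduced_error_prune(root, tune_set, child)
    before = accuracy(root, tune_set)
    node['pruned'] = True
    if accuracy(root, tune_set) < before:
        node['pruned'] = False   # the replacement hurt tuning accuracy, so undo it
```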


From Mitchell (p. 70)

[The slide reproduces a figure from Mitchell, p. 70; the figure itself is not included in this transcript.]


Rule Post-Pruning

• Build the full decision tree
• Convert each root-to-leaf path into an IF-THEN rule, where the tests along the path are the PRECONDITIONS and the leaf label is the CONSEQUENT
• For each rule, remove any preconditions whose removal improves the estimated accuracy
• Sort the rules by their accuracy estimates and consider them in that order: a decision list


Estimating Accuracy

• By “accuracy” here, we really mean precision: the probability that the rule is correct when it fires
• Could use a validation (tune) set if enough data is available
• C4.5: evaluate on the training set and be pessimistic: use the binomial distribution to get a 95% confidence interval
• Example to follow…


The Binomial Distribution

• The distribution over the number of successes x in a fixed number n of independent trials (with the same probability of success p in each):

    Pr(x) = C(n, x) p^x (1 - p)^(n - x)

[Figure: the binomial distribution with p = 0.5, n = 10, plotting Pr(x) against x.]
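A quick sanity check of the formula in Python (illustrative, standard library only):

```python
from math import comb

def binom_pmf(x, n, p):
    """Pr(x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(binom_pmf(5, 10, 0.5))  # ≈ 0.246, the peak of the p = 0.5, n = 10 distribution above
```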


Using the Binomial Here

• Let each data point (example) be a trial, and let a success be a correct prediction
• We can then exactly compute the probability that the error-rate estimate p is off by more than some amount, say 0.025, in either direction
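For instance (an illustrative calculation reusing the binom_pmf sketch above; the particular numbers are made up):

```python
def prob_estimate_off(p_true, n, delta=0.025):
    """Exact probability that the observed rate X/n misses the true rate by more than delta."""
    return sum(binom_pmf(x, n, p_true) for x in range(n + 1) if abs(x / n - p_true) > delta)

print(prob_estimate_off(0.2, 100))  # chance a 100-example estimate of a 20% rate is off by > 2.5 points
```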


Example

• Rule: color = red & size = small -> class = positive (50/14)
• Want x s.t. Pr(precision < x) < 0.05
• b(50, 0.72) says x is 60%
• Revised rule: color = red -> class = positive (100/20)
• b(100, 0.8) says x is 72%
• Win by raising precision, or by raising coverage without a big loss in precision
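The slides don't spell the calculation out, but the numbers can be reproduced under the assumption that b(n, p) denotes the one-sided binomial lower confidence bound on the rule's precision, with (50/14) read as 50 covered examples and 14 errors (observed precision 36/50 = 0.72). A sketch:

```python
from math import comb

def binom_sf(k, n, p):
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

def precision_lower_bound(covered, errors, alpha=0.05, grid=1000):
    """Smallest precision still consistent (at level alpha) with the observed rule performance."""
    successes = covered - errors
    for i in range(1, grid + 1):
        p = i / grid
        if binom_sf(successes, covered, p) >= alpha:
            return p
    return 1.0

print(precision_lower_bound(50, 14))   # ≈ 0.60  (slide: 60%)
print(precision_lower_bound(100, 20))  # ≈ 0.72  (slide: 72%)
```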
