
6-Slide Example: Gene Chip Data
© Jude Shavlik 2006, David Page 2010
CS 760 – Machine Learning (UW-Madison)


6-Slide Example: Gene Chip Data

Person      A28202_ac    AB00014_at   AB00015_at   ...   CLASS
Person 1    P  1142.0    A  321.0     P  2567.2    ...   myeloma
Person 2    A  -586.3    P  586.1     P  759.0     ...   normal
Person 3    A  105.2     A  559.3     P  3210.7    ...   myeloma
Person 4    P  -42.8     P  692.1     P  812.0     ...   normal
...         ...          ...          ...          ...   ...


Decision Trees in One Picture

[Figure: a one-node decision tree that tests "AvgDiff of G5" and sends examples with values < 1000 down one branch and values > 1000 down the other; the two leaves are labeled Myeloma and Normal.]


Example: Gene Expression

Decision tree:

AD_X57809_at <= 20343.4: myeloma (74)

AD_X57809_at > 20343.4: normal (31)

Leave-one-out cross-validation accuracy estimate: 97.1%

X57809: IGL (immunoglobulin lambda locus)


Problem with Result

• Easy to predict accurately with genes related to immune function, such as IGL, but this gives us no new insight.
• Eliminate these genes prior to training.
• Possible because of the comprehensibility of decision trees.


Ignoring Genes Associated with Immune Function

Decision tree:

AD_X04898_rna1_at <= -1453.4: normal (30)

AD_X04898_rna1_at > -1453.4: myeloma (74/1)

X04898: APOA2 (Apolipoprotein AII)

Leave-one-out accuracy estimate: 98.1%.


Another Tree

AD_M15881_at > 992: normal (28)

AD_M15881_at <= 992:

    AC_D82348_at = A: normal (3)

    AC_D82348_at = P: myeloma (74)


A Measure of Node Purity

Let f+ = fraction of positive examples
Let f- = fraction of negative examples

    f+ = p / (p + n),   f- = n / (p + n),   where p = #pos and n = #neg

Under an optimal code, the information needed (expected number of bits) to label one example is

    Info(f+, f-) = - f+ lg(f+) - f- lg(f-)

This is also called the entropy of the set of examples (derived later).
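As an aside that is not part of the original slides, here is a minimal Python sketch of this two-class Info measure (the function name is ours):

```python
import math

def info(f_pos, f_neg):
    """Info(f+, f-) = -f+ lg(f+) - f- lg(f-), with 0 lg 0 taken as 0."""
    return -sum(f * math.log2(f) for f in (f_pos, f_neg) if f > 0)

print(info(1.0, 0.0))  # 0.0 -- a pure node
print(info(0.5, 0.5))  # 1.0 -- a maximally mixed node
```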


Another Commonly-Used Measure of Node Purity

• Gini index: (f+)(f-)
• Used in CART (Classification and Regression Trees, Breiman et al., 1984)
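Again as an illustration only (not from the slides), the product form above in Python; note that for two classes the more familiar 1 - (f+)^2 - (f-)^2 equals 2(f+)(f-), so the two definitions differ only by a constant factor:

```python
def gini(f_pos, f_neg):
    """Two-class Gini index as written on the slide: (f+)(f-)."""
    return f_pos * f_neg

print(gini(0.5, 0.5))  # 0.25 -- most impure two-class node
print(gini(1.0, 0.0))  # 0.0  -- pure node
```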


Info(f+, f-): Consider the Extreme Cases

• All same class (+, say):  Info(1, 0) = -1 lg(1) - 0 lg(0) = 0   (0 lg 0 is 0 by definition)
• 50-50 mixture:  Info(½, ½) = 2 [-½ lg(½)] = 1


Evaluating a Feature

• How much does it help to know the value of attribute/feature A?
• Assume A divides the current set of examples into N groups.
  Let q_i  = fraction of data on branch i
      f_i+ = fraction of +'s on branch i
      f_i- = fraction of -'s on branch i


Evaluating a Feature (con't)

    E(A) ≡ Σ_{i=1..N}  q_i × I(f_i+, f_i-)

• Info needed after determining the value of attribute A
• Another expected value calculation

Pictorially: [Figure: a node that tests A, with branches v1 … vN; the full set has purity I(f+, f-) and the examples on branch i have purity I(f_i+, f_i-).]
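A small Python sketch of E(A), reusing the hypothetical info helper from the earlier sketch; examples are assumed to be (feature-dict, label) pairs with '+'/'-' labels (this data format is ours, not the slides'):

```python
from collections import Counter, defaultdict

def expected_info(examples, feature):
    """E(A) = sum over branches i of q_i * Info(f_i+, f_i-)."""
    branches = defaultdict(list)
    for features, label in examples:
        branches[features[feature]].append(label)   # group examples by their value of A
    n_total = len(examples)
    e = 0.0
    for labels in branches.values():
        counts = Counter(labels)
        n = len(labels)
        e += (n / n_total) * info(counts['+'] / n, counts['-'] / n)
    return e
```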


Info Gain

    Gain(A) ≡ I(f+, f-) - E(A)

This is our scoring function in our hill-climbing (greedy) algorithm. Since I(f+, f-) is constant for all features, we simply pick the A with the smallest E(A). That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution.
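Continuing the illustrative sketch (and reusing the hypothetical info and expected_info helpers defined above):

```python
from collections import Counter

def info_gain(examples, feature):
    """Gain(A) = I(f+, f-) - E(A)."""
    counts = Counter(label for _, label in examples)
    n = len(examples)
    parent = info(counts['+'] / n, counts['-'] / n)   # constant across features
    return parent - expected_info(examples, feature)
```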


Example Info-Gain Calculation

Color     Shape        Size    Class
Red       (pictured)   BIG     +
Red       (pictured)   BIG     +
Yellow    (pictured)   SMALL   -
Red       (pictured)   SMALL   -
Blue      (pictured)   BIG     +

(The shape values were drawn graphically on the slide and are not reproduced here.)

I(f+, f-) = ?     E(color) = ?
E(shape)  = ?     E(size)  = ?


Info-Gain Calculation (cont.)

    I(f+, f-) = I(0.6, 0.4) = -0.6 log2(0.6) - 0.4 log2(0.4) ≈ 0.97
    E(color)  = 0.6 × I(2/3, 1/3) + 0.2 × I(1, 0) + 0.2 × I(0, 1)
    E(shape)  = 0.6 × I(2/3, 1/3) + 0.4 × I(1/2, 1/2)
    E(size)   = 0.6 × I(1, 0) + 0.4 × I(0, 1) = 0

Note that “Size” provides complete classification, so we are done.
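To check these numbers, a short script (illustrative only; it reuses the hypothetical info and expected_info helpers from the earlier sketches, and omits the shape feature because its values were only drawn graphically):

```python
data = [
    ({'color': 'Red',    'size': 'BIG'},   '+'),
    ({'color': 'Red',    'size': 'BIG'},   '+'),
    ({'color': 'Yellow', 'size': 'SMALL'}, '-'),
    ({'color': 'Red',    'size': 'SMALL'}, '-'),
    ({'color': 'Blue',   'size': 'BIG'},   '+'),
]

print(round(info(3/5, 2/5), 2))                # 0.97
print(round(expected_info(data, 'color'), 2))  # 0.55
print(round(expected_info(data, 'size'), 2))   # 0.0 -> size has the largest gain
```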


ID3 Info Gain Measure Justified
(Ref. C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)

Definition of information (due to Shannon): the info conveyed by a message M depends on its probability, i.e.,

    info(M) = -log2[ Prob(M) ]

Select an example from a set S and announce that it belongs to class C. The probability of this occurring is f_C, the fraction of C's in S. Hence the info in this announcement is, by definition, -log2(f_C).


ID3 Info Gain Measure (cont.)

Let there be K different classes in set S; call them C1, C2, …, CK.

What is the expected info from a message about the class of an example in set S?

    info(S) = -f_C1 log2(f_C1) - f_C2 log2(f_C2) - … - f_CK log2(f_CK)
            = -Σ_{j=1..K}  f_Cj log2(f_Cj),    where f_Cj = fraction of set S that is of class Cj

info(S) is the average number of bits of information (obtained by looking at feature values) needed to classify a member of set S.
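A direct translation (illustrative, not from the slides), generalizing the earlier two-class info sketch:

```python
import math

def info_set(class_fractions):
    """info(S) = -sum_j f_Cj * log2(f_Cj) over the K class fractions, with 0 log 0 = 0."""
    return -sum(f * math.log2(f) for f in class_fractions if f > 0)

print(info_set([0.5, 0.25, 0.25]))  # 1.5 bits
```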


Handling Hierarchical Features in ID3

Define a new feature for each level in the hierarchy, e.g.,

    Shape
      Circular    Polygonal

    Shape1 = {Circular, Polygonal}
    Shape2 = { the specific shapes at the next level down (drawn graphically on the slide) }

Let ID3 choose the appropriate level of abstraction!


Handling Numeric Features in ID3

On the fly, create binary features and choose the best.

Step 1: Plot the current examples (green = pos, red = neg).
Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g., feature_new1 = F < 8 and feature_new2 = F < 10.
Step 3: Choose the split with the best info gain.

[Figure: examples plotted along the value of the feature, at roughly 5, 7, 9, 11, and 13.]
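Step 2 can be sketched as follows (illustrative only; the labels here are made up and need not match the coloring on the slide's plot):

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values whose neighbours have different labels."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, y1), (v2, y2) in zip(pairs, pairs[1:])
            if y1 != y2 and v1 != v2]

print(candidate_thresholds([5, 7, 9, 11, 13], ['+', '+', '-', '-', '+']))  # [8.0, 12.0]
```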


Handling Numeric Features (cont.)

Note: you cannot discard a numeric feature after it has been used in one portion of the d-tree.

[Figure: a small tree that tests F < 10 at the root and then tests F < 5 again further down one branch, with +/- leaves on the True/False branches.]


Characteristic Property of Using the Info-Gain Measure

It FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values).

Extreme case: split on Student ID. With at most one example per leaf, all of the I(·,·) scores for the leaves equal zero, so the feature gets a perfect score! But it generalizes very poorly (i.e., it memorizes the data).

[Figure: a node testing Student ID, with branches such as 1, 99, and 999 each leading to a near-empty leaf like 1+ 0-, 0+ 0-, or 0+ 1-.]


Fix: Method 1

Convert all features to binary, e.g., Color = {Red, Blue, Green} goes from one N-valued feature to N binary features:

    Color = Red?    {True, False}
    Color = Blue?   {True, False}
    Color = Green?  {True, False}

• Used in neural nets and SVMs
• D-tree readability is probably less, but not necessarily
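A throwaway sketch of the conversion (the function and dictionary format are ours, not the slides'):

```python
def to_binary_features(example, feature, values):
    """Replace one N-valued feature with N Boolean 'feature = value?' features."""
    converted = {k: v for k, v in example.items() if k != feature}
    for value in values:
        converted[f'{feature}={value}?'] = (example[feature] == value)
    return converted

print(to_binary_features({'color': 'Red', 'size': 'BIG'}, 'color', ['Red', 'Blue', 'Green']))
# {'size': 'BIG', 'color=Red?': True, 'color=Blue?': False, 'color=Green?': False}
```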


Fix 2: Gain Ratio

Find the info content in the answer to: what is the value of feature A, ignoring the output category?

    IV(A) ≡ -Σ_{i=1..K}  [(p_i + n_i) / (p + n)]  log2[(p_i + n_i) / (p + n)]

where (p_i + n_i) / (p + n) is the fraction of all examples with A = i (this quantity is also known as the split information, SplitInfo).

Choose the A that maximizes

    GainRatio(A) = Gain(A) / IV(A)

Read the text (Mitchell) for exact details!
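Sketched in the same illustrative style as before (reusing the hypothetical info_gain helper and data format from the earlier sketches):

```python
import math
from collections import Counter

def split_info(examples, feature):
    """IV(A): entropy of the feature's own value distribution, ignoring the class."""
    counts = Counter(features[feature] for features, _ in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, feature):
    iv = split_info(examples, feature)
    return info_gain(examples, feature) / iv if iv > 0 else 0.0
```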


Fix: Method 3

Group values of nominal features, e.g., a multiway split

    Color?  with one branch per value: R, B, G, Y, …

versus a grouped binary split

    Color?  with one branch for a subset of values (e.g., R, B, …) and the other branch for the rest (G, Y, …)

• Done in CART (Breiman et al., 1984)
• Breiman et al. proved that, for the 2-category case, the optimal binary partition can be found by considering only O(N) possibilities instead of O(2^N)


Multiple Category Classification – Method 1 (used in SVMs)

Approach 1: Learn one tree (i.e., one model) per category:

    (Category_i)  versus  (not Category_i)

Pass test examples through each tree. What happens if a test example is predicted to lie in multiple categories? In none?


Multiple Category Classification – Method 2

Approach 2: Learn one tree in total.

It subdivides the full space such that every point belongs to one and only one category (the drawing on the slide is slightly misleading).


Where DT Learners Fail

• Correlation-immune functions like XOR (more in a later lecture)
• Target concepts where all or most features are needed (so a large tree is required), but we have limited data


Key Lessons from DT Research of the 1980s

• In general, simpler trees are better
• Avoid overfitting
• Link to the MDL/MML principle and Occam's Razor
• Pruning is better than early stopping


Advantages of Decision Tree Learning

• 1. Output is easily human-comprehensible
  • Can convert the tree to rules – one rule per branch leading to a positive class
  • Gives insight into the task
  • Lets us find errors in our approach
    • Using a “noise peak” in mass spec data
    • Using antibody (IG, HLA) genes in myeloma


Advantages of DT Learning

• 2. Good at Junta-learning problems
  • Tree learners focus on finding the smallest number of features that can be used to classify accurately
  • Tree learners ignore all features they don't select


Advantages of DT Learning

• 3. Really FAST!
  • Time to learn is fast: O(nm) to make a split, where nm is the size of the data set (n is the number of features, m is the number of data points)
  • Typically few splits are required, and the data analyzed for the recursive splits is even smaller
  • Time to label is linear in tree depth only


Advantages of DT Learners

• 4. Easily extended
  • Numerical prediction: put the average value at the leaf and score a split by the reduction in variance (regression trees); see the sketch below
  • Can instead have linear regression models at the leaves (model trees)
  • Easy to do multi-class prediction and multiple-instance learning (later lecture)
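As a side note that is not from the slides, the regression-tree split score mentioned above can be sketched as the weighted variance of the numeric target after the split (lower is better); everything here is illustrative:

```python
from statistics import pvariance

def variance_after_split(values_by_branch):
    """Weighted average of the per-branch variances of the numeric target."""
    n_total = sum(len(v) for v in values_by_branch)
    return sum(len(v) / n_total * pvariance(v) for v in values_by_branch if v)

print(variance_after_split([[1.0, 1.2, 0.9], [5.0, 5.5]]))  # ≈ 0.034
```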


Advantages of DT Learning

• 5. Complete representation language
  • DTs can represent any Boolean function
  • Note: this does not guarantee they will effectively learn any function


Disadvantages of DT Learners

• 1. The greedy algorithm means they fail on correlation-immune functions
• 2. They settle for the minimum feature set required for discrimination: they discriminate rather than characterize
• 3. Tendency to overfit if there are more features than data… little data is left for splits as we get deeper in the tree


Error

• Assume some probability distribution D over possible examples (e.g., all truth assignments over x1, …, xN)
• For a training set, assume D is uniform
• Assume a hypothesis space H (e.g., all decision trees over x1, …, xN)
• The error of a hypothesis h in H is the probability that h misclassifies an example drawn randomly according to D


Overfitting

• Given a hypothesis space H, a probability distribution D over examples, and a training set T
• A hypothesis h in H overfits T if there exists another h’ in H such that:
  • h has smaller error than h’ on T
  • h’ has smaller error than h over D


How we usually observe overfitting

• The accuracy (1 – error) of our hypothesis drops substantially on a new test set compared to the training set
• We observe this to a greater extent as the decision tree's size grows
• How can we combat overfitting in DTs?


Reduced Error Pruning

• Hold aside a tuning set (or validation set) from the training set
• For each internal node in the tree:
  • Try replacing the subtree rooted at that node with a leaf that predicts the majority class at that node in the training set
  • Keep this change if performance on the tuning set grows no worse than without the change
• Kearns & Mansour: do this bottom-up
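A compact sketch of the procedure (entirely illustrative: the dict-based tree format, with internal nodes carrying a 'majority' label and a 'pruned' flag, is an assumption of this sketch, not something defined on the slides):

```python
def predict(node, example):
    """Internal nodes: {'feature', 'children', 'majority', 'pruned'}; leaves: class labels."""
    while isinstance(node, dict):
        if node.get('pruned'):
            return node['majority']
        node = node['children'].get(example[node['feature']], node['majority'])
    return node

def accuracy(tree, dataset):
    return sum(predict(tree, x) == y for x, y in dataset) / len(dataset)

def reduced_error_prune(root, tune_set, node=None):
    """Bottom-up: tentatively turn each subtree into its majority leaf and keep the
    change whenever tuning-set accuracy gets no worse."""
    node = root if node is None else node
    if not isinstance(node, dict):
        return
    for child in node['children'].values():
        reduced_error_prune(root, tune_set, child)
    before = accuracy(root, tune_set)
    node['pruned'] = True
    if accuracy(root, tune_set) < before:
        node['pruned'] = False   # the replacement hurt tuning accuracy, so undo it
```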


From Mitchell (p. 70)

[The slide reproduces a figure from Mitchell, p. 70; the figure itself is not included in this transcript.]


Rule Post-Pruning

• Build the full decision tree
• Convert each root-to-leaf path into an IF-THEN rule, where the tests along the path are the PRECONDITIONS and the leaf label is the CONSEQUENT
• For each rule, remove any preconditions whose removal improves the estimated accuracy
• Sort the rules by their accuracy estimates and consider them in that order: a decision list


Estimating Accuracy

• By “accuracy” here, we really mean precision: the probability that the rule is correct when it fires
• Could use a validation (tune) set if enough data is available
• C4.5: evaluate on the training set and be pessimistic: use the binomial distribution to get a 95% confidence interval
• Example to follow…


The Binomial Distribution

• The distribution over the number of successes x in a fixed number n of independent trials (with the same probability of success p in each):

    Pr(x) = C(n, x) p^x (1 - p)^(n - x)

[Figure: the binomial distribution with p = 0.5, n = 10, plotting Pr(x) against x.]
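A quick sanity check of the formula in Python (illustrative, standard library only):

```python
from math import comb

def binom_pmf(x, n, p):
    """Pr(x) = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(binom_pmf(5, 10, 0.5))  # ≈ 0.246, the peak of the p = 0.5, n = 10 distribution above
```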


Using the Binomial Here

• Let each data point (example) be a trial, and let a success be a correct prediction
• We can then exactly compute the probability that the error-rate estimate p is off by more than some amount, say 0.025, in either direction
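For instance (an illustrative calculation reusing the binom_pmf sketch above; the particular numbers are made up):

```python
def prob_estimate_off(p_true, n, delta=0.025):
    """Exact probability that the observed rate X/n misses the true rate by more than delta."""
    return sum(binom_pmf(x, n, p_true) for x in range(n + 1) if abs(x / n - p_true) > delta)

print(prob_estimate_off(0.2, 100))  # chance a 100-example estimate of a 20% rate is off by > 2.5 points
```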


Example

• Rule: color = red & size = small -> class = positive (50/14)
• Want x s.t. Pr(precision < x) < 0.05
• b(50, 0.72) says x is 60%
• Revised rule: color = red -> class = positive (100/20)
• b(100, 0.8) says x is 72%
• Win by raising precision, or by raising coverage without a big loss in precision
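The slides don't spell the calculation out, but the numbers can be reproduced under the assumption that b(n, p) denotes the one-sided binomial lower confidence bound on the rule's precision, with (50/14) read as 50 covered examples and 14 errors (observed precision 36/50 = 0.72). A sketch:

```python
from math import comb

def binom_sf(k, n, p):
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

def precision_lower_bound(covered, errors, alpha=0.05, grid=1000):
    """Smallest precision still consistent (at level alpha) with the observed rule performance."""
    successes = covered - errors
    for i in range(1, grid + 1):
        p = i / grid
        if binom_sf(successes, covered, p) >= alpha:
            return p
    return 1.0

print(precision_lower_bound(50, 14))   # ≈ 0.60  (slide: 60%)
print(precision_lower_bound(100, 20))  # ≈ 0.72  (slide: 72%)
```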
