1

Data Mining I
Karl Young
Center for Imaging of Neurodegenerative Diseases, UCSF
2

The "Issues"
 Data Explosion Problem
  – Automated data collection tools + widely used database systems + a computerized society + the Internet lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, the WWW, and other information repositories
 We are drowning in data, but starving for knowledge!
 Solution: Data Warehousing and Data Mining
  – Data warehousing and on-line analytical processing (OLAP)
  – Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
3

Data Warehousing + Data Mining
(one of many schematic views)
 [Schematic: efficient and robust data storage and retrieval, plus efficient and robust data summary and visualization, drawing on database technology, statistics, computer science, high-performance computing, machine learning, visualization, …]
4

Machine learning and statistics
 Historical difference (grossly oversimplified):
  – Statistics: testing hypotheses
  – Machine learning: finding the right hypothesis
 But: huge overlap
  – Decision trees (C4.5 and CART)
  – Nearest-neighbor methods
 Today: perspectives have converged
  – Most ML algorithms employ statistical techniques
5

Schematically
 [KDD process diagram: Data Cleaning → Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation]
6

Schematically
 – Data warehouse — core of efficient data organization
 [Same KDD process diagram: Data Cleaning → Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation]
7

Schematically
 – Data mining — core of the knowledge discovery process
 [Same KDD process diagram: Data Cleaning → Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation]
8

Data mining
 Needed: programs that detect patterns and regularities in the data
 Strong patterns = good predictions
  – Problem 1: most patterns are not interesting
  – Problem 2: patterns may be inexact (or spurious)
  – Problem 3: data may be garbled or missing
 Want to learn a "concept", i.e. a rule or set of rules that characterizes the observed patterns in the data
9

Types of Learning
 Supervised – Classification
  – Know classes for examples
    Induction rules
    Decision trees
    Bayesian classification
      – Naïve
      – Networks
 Numeric Prediction
  – Linear regression
  – Neural nets
  – Support vector machines
 Unsupervised – Learn Natural Groupings
  – Clustering
    Partitioning methods
    Hierarchical methods
    Density-based methods
    Model-based methods
 Learn Association Rules – In principle learn all attributes
10

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
11

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
12

Simplicity first
 Simple algorithms often work very well!
 There are many kinds of simple structure, e.g.:
  – One attribute does all the work
  – All attributes contribute equally & independently
  – A weighted linear combination might do
  – Instance-based: use a few prototypes
  – Use simple logical rules
 Success of a method depends on the domain
13

The weather problem (used for illustration)
 Conditions for playing a certain game:

 Outlook    Temperature  Humidity  Windy  Play
 Sunny      Hot          High      False  No
 Sunny      Hot          High      True   No
 Overcast   Hot          High      False  Yes
 Rainy      Mild         Normal    False  Yes
 …          …            …         …      …

 If outlook = sunny and humidity = high then play = no
 If outlook = rainy and windy = true then play = no
 If outlook = overcast then play = yes
 If humidity = normal then play = yes
 If none of the above then play = yes
14

Weather data with mixed attributes
 Some attributes have numeric values:

 Outlook    Temperature  Humidity  Windy  Play
 Sunny      85           85        False  No
 Sunny      80           90        True   No
 Overcast   83           86        False  Yes
 Rainy      75           80        False  Yes
 …          …            …         …      …

 If outlook = sunny and humidity > 83 then play = no
 If outlook = rainy and windy = true then play = no
 If outlook = overcast then play = yes
 If humidity < 85 then play = yes
 If none of the above then play = yes
15

Inferring rudimentary rules
 1R: learns a 1-level decision tree
  – I.e., rules that all test one particular attribute
 Basic version
  – One branch for each value
  – Each branch assigns the most frequent class
  – Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
  – Choose the attribute with the lowest error rate
 (assumes nominal attributes)
16

Pseudo-code for 1R

 For each attribute,
   For each value of the attribute, make a rule as follows:
     count how often each class appears
     find the most frequent class
     make the rule assign that class to this attribute-value
   Calculate the error rate of the rules
 Choose the rules with the smallest error rate

 Note: "missing" is treated as a separate attribute value
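To make the pseudo-code concrete, here is a minimal Python sketch of 1R; the dictionary-based data layout, the attribute names, and the function name one_r are illustrative assumptions, not part of the original slides.

from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr="Play"):
    # Minimal 1R: for each attribute build one-level rules (value -> majority
    # class) and keep the attribute whose rules make the fewest errors.
    best = None
    for attr in attributes:
        value_counts = defaultdict(Counter)
        for inst in instances:
            value_counts[inst[attr]][inst[class_attr]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in value_counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in value_counts.items())
        if best is None or errors < best[1]:
            best = (attr, errors, rules)
    return best   # (attribute, error count, {value: predicted class})

# Toy usage on four instances of the weather data:
data = [
    {"Outlook": "Sunny",    "Windy": "False", "Play": "No"},
    {"Outlook": "Sunny",    "Windy": "True",  "Play": "No"},
    {"Outlook": "Overcast", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy",    "Windy": "False", "Play": "Yes"},
]
print(one_r(data, ["Outlook", "Windy"]))
# ('Outlook', 0, {'Sunny': 'No', 'Overcast': 'Yes', 'Rainy': 'Yes'})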
17

Evaluating the weather attributes

 Attribute  Rules            Errors  Total errors
 Outlook    Sunny → No       2/5     4/14
            Overcast → Yes   0/4
            Rainy → Yes      2/5
 Temp       Hot → No*        2/4     5/14
            Mild → Yes       2/6
            Cool → Yes       1/4
 Humidity   High → No        3/7     4/14
            Normal → Yes     1/7
 Windy      False → Yes      2/8     5/14
            True → No*       3/6

 * indicates a tie

 Outlook    Temp  Humidity  Windy  Play
 Sunny      Hot   High      False  No
 Sunny      Hot   High      True   No
 Overcast   Hot   High      False  Yes
 Rainy      Mild  High      False  Yes
 Rainy      Cool  Normal    False  Yes
 Rainy      Cool  Normal    True   No
 Overcast   Cool  Normal    True   Yes
 Sunny      Mild  High      False  No
 Sunny      Cool  Normal    False  Yes
 Rainy      Mild  Normal    False  Yes
 Sunny      Mild  Normal    True   Yes
 Overcast   Mild  High      True   Yes
 Overcast   Hot   Normal    False  Yes
 Rainy      Mild  High      True   No
18

Dealing with numeric attributes
 Discretize numeric attributes
 Divide each attribute's range into intervals
  – Sort instances according to the attribute's values
  – Place breakpoints where the class changes (the majority class)
  – This minimizes the total error
 Example: temperature from the weather data

 Outlook    Temperature  Humidity  Windy  Play
 Sunny      85           85        False  No
 Sunny      80           90        True   No
 Overcast   83           86        False  Yes
 Rainy      75           80        False  Yes
 …          …            …         …      …
19

Dealing with numeric attributes
 Example: temperature from the weather data

 64  65  68  69  70  71  72  72  75  75  80  81  83  85
 Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

 Outlook    Temperature  Humidity  Windy  Play
 Sunny      85           85        False  No
 Sunny      80           90        True   No
 Overcast   83           86        False  Yes
 Rainy      75           80        False  Yes
 …          …            …         …      …
20

The problem of overfitting
 This procedure is very sensitive to noise
  – One instance with an incorrect class label will probably produce a separate interval
 Also: a time stamp attribute will have zero errors
 Simple solution: enforce a minimum number of instances in the majority class per interval
 Example (with min = 3):

 64  65  68  69  70  71  72  72  75  75  80  81  83  85
 Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

 64  65  68  69  70  71  72  72  75  75  80  81  83  85
 Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
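The overfitting fix above can be sketched in a few lines of Python. This is a simplified illustration under the assumption that the values are already sorted, that ties between equal attribute values are ignored, and that the later merging of adjacent intervals with the same majority class is omitted; the function name is illustrative.

from collections import Counter

def discretize_1r(values, classes, min_majority=3):
    # Close the current interval only once it holds at least `min_majority`
    # instances of its majority class and the next instance breaks that run.
    breakpoints = []          # attribute values at which a new interval starts
    counts = Counter()
    for i, cls in enumerate(classes):
        counts[cls] += 1
        majority_cls, majority_n = counts.most_common(1)[0]
        is_last = (i == len(classes) - 1)
        if (not is_last and majority_n >= min_majority
                and classes[i + 1] != majority_cls):
            breakpoints.append((values[i] + values[i + 1]) / 2)
            counts = Counter()
    return breakpoints

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
         "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(discretize_1r(temps, play, min_majority=3))   # [70.5, 77.5]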
21

With overfitting avoidance
 Resulting rule set:

 Attribute    Rules                    Errors  Total errors
 Outlook      Sunny → No               2/5     4/14
              Overcast → Yes           0/4
              Rainy → Yes              2/5
 Temperature  ≤ 77.5 → Yes             3/10    5/14
              > 77.5 → No*             2/4
 Humidity     ≤ 82.5 → Yes             1/7     3/14
              > 82.5 and ≤ 95.5 → No   2/6
              > 95.5 → Yes             0/1
 Windy        False → Yes              2/8     5/14
              True → No*               3/6
22

Discussion of 1R
 1R was described in a paper by Holte (1993)
  – Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
  – Minimum number of instances was set to 6 after some experimentation
  – 1R's simple rules performed not much worse than much more complex decision trees
 Simplicity first pays off!

 Very Simple Classification Rules Perform Well on Most Commonly Used Datasets
 Robert C. Holte, Computer Science Department, University of Ottawa
23

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
24

Statistical modeling
 "Opposite" of 1R: use all the attributes
 Two assumptions: attributes are
  – equally important
  – statistically independent (given the class value)
    I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
 Independence assumption is never correct!
 But … this scheme works well in practice
25

Probabilities for weather data

 Outlook    Temp  Humidity  Windy  Play
 Sunny      Hot   High      False  No
 Sunny      Hot   High      True   No
 Overcast   Hot   High      False  Yes
 Rainy      Mild  High      False  Yes
 Rainy      Cool  Normal    False  Yes
 Rainy      Cool  Normal    True   No
 Overcast   Cool  Normal    True   Yes
 Sunny      Mild  High      False  No
 Sunny      Cool  Normal    False  Yes
 Rainy      Mild  Normal    False  Yes
 Sunny      Mild  Normal    True   Yes
 Overcast   Mild  High      True   Yes
 Overcast   Hot   Normal    False  Yes
 Rainy      Mild  High      True   No
26

Probabilities for weather data

 Outlook    Yes   No         Temperature  Yes   No
 Sunny      2     3          Hot          2     2
 Overcast   4     0          Mild         4     2
 Rainy      3     2          Cool         3     1
 Sunny      2/9   3/5        Hot          2/9   2/5
 Overcast   4/9   0/5        Mild         4/9   2/5
 Rainy      3/9   2/5        Cool         3/9   1/5

 Humidity   Yes   No         Windy        Yes   No         Play   Yes    No
 High       3     4          False        6     2                 9      5
 Normal     6     1          True         3     3
 High       3/9   4/5        False        6/9   2/5               9/14   5/14
 Normal     6/9   1/5        True         3/9   3/5
27

Probabilities for weather data
 (counts and relative frequencies for each attribute–class combination as on the previous slide)

 A new day:
 Outlook  Temp.  Humidity  Windy  Play
 Sunny    Cool   High      True   ?

 Likelihood of the two classes:
 For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
 For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
 Conversion into a probability by normalization:
 P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
 P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
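The arithmetic on this slide can be reproduced with a few lines of Python; the hand-typed count table and variable names below are assumptions made for the example.

# Conditional counts from the weather data: value -> (count given yes, count given no).
counts = {
    "Outlook=Sunny":      (2, 3),
    "Temperature=Cool":   (3, 1),
    "Humidity=High":      (3, 4),
    "Windy=True":         (3, 3),
}
n_yes, n_no = 9, 5              # class counts
total = n_yes + n_no

# Likelihoods for the new day Sunny / Cool / High / True.
like_yes = n_yes / total
like_no = n_no / total
for yes_c, no_c in counts.values():
    like_yes *= yes_c / n_yes
    like_no *= no_c / n_no

# Normalize to turn the likelihoods into probabilities.
p_yes = like_yes / (like_yes + like_no)
print(round(like_yes, 4), round(like_no, 4), round(p_yes, 3))
# -> 0.0053 0.0206 0.205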
28

Bayes's rule
 Probability of event H given evidence E:

   Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]

 Prior probability of H: Pr[H]
  – Probability of the event before evidence is seen
 Posterior probability of H: Pr[H | E]
  – Probability of the event after evidence is seen

 Thomas Bayes
 Born: 1702 in London, England
 Died: 1761 in Tunbridge Wells, Kent, England
29

Naïve Bayes for classification
 Classification learning: what's the probability of the class given an instance?
  – Evidence E = instance
  – Event H = class value for instance
 Naïve assumption: evidence splits into parts (i.e. attributes) that are independent:

   Pr[H | E] = Pr[E1 | H] Pr[E2 | H] … Pr[En | H] Pr[H] / Pr[E]
30

Weather data example
 Evidence E (a new day):

 Outlook  Temp.  Humidity  Windy  Play
 Sunny    Cool   High      True   ?

 Probability of class "yes":

   Pr[yes | E] = Pr[Outlook = Sunny | yes]
               × Pr[Temperature = Cool | yes]
               × Pr[Humidity = High | yes]
               × Pr[Windy = True | yes]
               × Pr[yes] / Pr[E]
             = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
31

The "zero-frequency problem"
 What if an attribute value doesn't occur with every class value?
 (e.g. "Humidity = High" for class "yes")
  – The conditional probability will be zero:  Pr[Humidity = High | yes] = 0
  – The a posteriori probability will also be zero:  Pr[yes | E] = 0
    (No matter how likely the other values are!)
 Remedy: add 1 to the count for every attribute value–class combination (Laplace estimator)
 Result: probabilities will never be zero!
 (also: stabilizes probability estimates)
32

Modified probability estimates
 In some cases adding a constant different from 1 might be more appropriate
 Example: attribute outlook for class yes (with constant μ):

   Sunny:    (2 + μ/3) / (9 + μ)
   Overcast: (4 + μ/3) / (9 + μ)
   Rainy:    (3 + μ/3) / (9 + μ)

 Weights don't need to be equal (but they must sum to 1):

   Sunny:    (2 + μ p1) / (9 + μ)
   Overcast: (4 + μ p2) / (9 + μ)
   Rainy:    (3 + μ p3) / (9 + μ)
33

Missing values
 Training: the instance is not included in the frequency count for that attribute value–class combination
 Classification: the attribute is omitted from the calculation
 Example:

 Outlook  Temp.  Humidity  Windy  Play
 ?        Cool   High      True   ?

 Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
 Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
 P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
 P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
34

Numeric attributes
 Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
 The probability density function for the normal distribution is defined by two parameters:
  – Sample mean:  μ = (1/n) Σ_i x_i
  – Standard deviation:  σ = √( (1/(n−1)) Σ_i (x_i − μ)² )
  – The density function is:  f(x) = 1/(√(2π) σ) · e^( −(x−μ)² / (2σ²) )
35

Statistics for weather data

 Outlook    Yes   No         Windy   Yes   No         Play   Yes    No
 Sunny      2     3          False   6     2                 9      5
 Overcast   4     0          True    3     3
 Rainy      3     2
 Sunny      2/9   3/5        False   6/9   2/5               9/14   5/14
 Overcast   4/9   0/5        True    3/9   3/5
 Rainy      3/9   2/5

 Temperature   Yes: 64, 68, 69, 70, 72, …     No: 65, 71, 72, 80, 85, …
               μ = 73,  σ = 6.2               μ = 75,  σ = 7.9
 Humidity      Yes: 65, 70, 70, 75, 80, …     No: 70, 85, 90, 91, 95, …
               μ = 79,  σ = 10.2              μ = 86,  σ = 9.7

 Example density value:

   f(temperature = 66 | yes) = 1/(√(2π)·6.2) · e^( −(66−73)² / (2·6.2²) ) = 0.0340
36

Classifying a new day
 A new day:

 Outlook  Temp.  Humidity  Windy  Play
 Sunny    66     90        True   ?

 Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
 Likelihood of "no"  = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
 P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%
 P("no")  = 0.000136 / (0.000036 + 0.000136) = 79.1%

 Missing values during training are not included in the calculation of the mean and standard deviation
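A small Python sketch of the numeric-attribute case: the gaussian_density helper implements the density function from the earlier slide, and the likelihood lines reuse the density values quoted above (small differences from the slide's numbers come from rounding in those quoted densities).

import math

def gaussian_density(x, mu, sigma):
    # f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Reproduces the example density value from the statistics slide:
print(round(gaussian_density(66, mu=73, sigma=6.2), 4))   # 0.034

# Likelihoods for the new day (Sunny, temperature 66, humidity 90, windy True),
# plugging in the density values quoted on the slide:
like_yes = (2/9) * 0.0340 * 0.0221 * (3/9) * (9/14)   # ≈ 0.000036
like_no  = (3/5) * 0.0291 * 0.0380 * (3/5) * (5/14)   # ≈ 0.00014 (slide: 0.000136)
print(round(like_yes / (like_yes + like_no), 2))       # ≈ 0.20, close to the slide's 20.9%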
37

Probability densities
 Relationship between probability and density:

   Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε · f(c)

 But: this doesn't change the calculation of a posteriori probabilities because ε cancels out
 Exact relationship:

   Pr[a ≤ x ≤ b] = ∫ from a to b of f(t) dt
38

Naïve Bayes: discussion
 Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
 Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class
 However: adding too many redundant attributes will cause problems (e.g. identical attributes)
 Note also: many numeric attributes are not normally distributed (→ kernel density estimators)
39

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
40

Constructing decision trees
 Strategy: top down, in recursive divide-and-conquer fashion
  – First: select an attribute for the root node; create a branch for each possible attribute value
  – Then: split the instances into subsets, one for each branch extending from the node
  – Finally: repeat recursively for each branch, using only the instances that reach the branch
 Stop if all instances have the same class
41

Which attribute to select?
42

Criterion for attribute selection
 Which is the best attribute?
  – Want to get the smallest tree
  – Heuristic: choose the attribute that produces the "purest" nodes
 Popular impurity criterion: information gain
  – Information gain increases with the average purity of the subsets
 Strategy: choose the attribute that gives the greatest information gain
43

Computing information
 Measure information in bits
  – Given a probability distribution, the info required to predict an event is the distribution's entropy
  – Entropy gives the information required in bits (can involve fractions of bits!)
 Recall, the formula for entropy:

   H(p1, p2, …, pn) = −p1 log p1 − p2 log p2 − … − pn log pn
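A tiny Python helper makes the entropy formula concrete; the function name and the count-based interface are illustrative.

import math

def entropy(counts):
    # Entropy in bits of a class distribution given as raw counts,
    # e.g. entropy([9, 5]) for the weather data's 9 yes / 5 no split.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # 0.940 bits
print(round(entropy([2, 3]), 3))   # 0.971 bits
print(round(entropy([4, 0]), 3))   # 0.0 bits (pure node)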
44
Claude Shannon, who has died aged 84, perhaps more than anyone laid the groundwork for today’s digital revolution. His exposition of information theory, stating that all information could be represented mathematically as a succession of noughts and ones, facilitated the digital manipulation of data without which today’s information society would be unthinkable.
Shannon’s master’s thesis, obtained in 1940 at MIT, demonstrated that problem solving could be achieved by manipulating the symbols 0 and 1 in a process that could be carried out automatically with electrical circuitry. That dissertation has been hailed as one of the most significant master’s theses of the 20th century. Eight years later, Shannon published another landmark paper, A Mathematical Theory of Communication, generally taken as his most important scientific contribution.
Claude Shannon, "Father of information theory"
Born: 30 April 1916
Died: 23 February 2001
Shannon applied the same radical approach to cryptography research, in which he later became a consultant to the US government.
Many of Shannon’s pioneering insights were developed before they could be applied in practical form. He was truly a remarkable man, yet unknown to most of the world.
45

Example: attribute Outlook
 Outlook = Sunny:
   info([2,3]) = H(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits
 Outlook = Overcast:
   info([4,0]) = H(1, 0) = −1 log(1) − 0 log(0) = 0 bits
   (Note: 0 log(0) is normally undefined.)
 Outlook = Rainy:
   info([3,2]) = H(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
 Expected information for the attribute:
   info([2,3],[4,0],[3,2]) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.693 bits
46

Computing information gain
 Information gain = information before splitting − information after splitting

   gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2])
                 = 0.940 − 0.693
                 = 0.247 bits

 Information gain for the attributes from the weather data:
   gain(Outlook)     = 0.247 bits
   gain(Temperature) = 0.029 bits
   gain(Humidity)    = 0.152 bits
   gain(Windy)       = 0.048 bits
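The gain computation can be sketched directly from the formula above; the counts below are the yes/no splits for Outlook, and the helper names are assumptions of this sketch.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, split_counts):
    # Gain = entropy before the split minus the weighted average entropy
    # of the subsets produced by the split.
    n = sum(parent_counts)
    after = sum(sum(subset) / n * entropy(subset) for subset in split_counts)
    return entropy(parent_counts) - after

# Splitting the weather data on Outlook:
# Sunny [2 yes, 3 no], Overcast [4 yes, 0 no], Rainy [3 yes, 2 no].
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247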
47

Continuing to split
   gain(Temperature) = 0.571 bits
   gain(Humidity)    = 0.971 bits
   gain(Windy)       = 0.020 bits
48

Final decision tree
 Note: not all leaves need to be pure; sometimes identical instances have different classes
 Splitting stops when the data can't be split any further
49

Wishlist for a purity measure
 Properties we require from a purity measure:
  – When a node is pure, the measure should be zero
  – When impurity is maximal (i.e. all classes equally likely), the measure should be maximal
  – The measure should obey the multistage property (i.e. decisions can be made in several stages):

      measure([2,3,4]) = measure([2,7]) + (7/9) · measure([3,4])

 Entropy is the only function that satisfies all three properties!
50

Properties of the entropy
 The multistage property:

   H(p, q, r) = H(p, q + r) + (q + r) · H( q/(q + r), r/(q + r) )

 Simplification of computation:

   info([2,3,4]) = −2/9 log(2/9) − 3/9 log(3/9) − 4/9 log(4/9)
                 = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9

 Note: instead of maximizing info gain we could just minimize information
51

Highly-branching attributes
 Problematic: attributes with a large number of values (extreme case: ID code)
 Subsets are more likely to be pure if there is a large number of values
  – Information gain is biased towards choosing attributes with a large number of values
  – This may result in overfitting (selection of an attribute that is non-optimal for prediction)
 Another problem: fragmentation
52

Weather data with ID code

 ID code  Outlook    Temp  Humidity  Windy  Play
 A        Sunny      Hot   High      False  No
 B        Sunny      Hot   High      True   No
 C        Overcast   Hot   High      False  Yes
 D        Rainy      Mild  High      False  Yes
 E        Rainy      Cool  Normal    False  Yes
 F        Rainy      Cool  Normal    True   No
 G        Overcast   Cool  Normal    True   Yes
 H        Sunny      Mild  High      False  No
 I        Sunny      Cool  Normal    False  Yes
 J        Rainy      Mild  Normal    False  Yes
 K        Sunny      Mild  Normal    True   Yes
 L        Overcast   Mild  High      True   Yes
 M        Overcast   Hot   Normal    False  Yes
 N        Rainy      Mild  High      True   No
53

Tree stump for ID code attribute
 Entropy of the split:

   info("ID code") = info([0,1]) + info([0,1]) + … + info([0,1]) = 0 bits

 Information gain is maximal for ID code (namely 0.940 bits)
54

Gain ratio
 Gain ratio: a modification of the information gain that reduces its bias
 Gain ratio takes the number and size of branches into account when choosing an attribute
  – It corrects the information gain by taking the intrinsic information of a split into account
 Intrinsic information: entropy of the distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)
55

Computing the gain ratio
 Example: intrinsic information for ID code:

   info([1,1,…,1]) = 14 × (−1/14 × log(1/14)) = 3.807 bits

 The value of an attribute decreases as the intrinsic information gets larger
 Definition of gain ratio:

   gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")

 Example:

   gain_ratio("ID code") = 0.940 bits / 3.807 bits = 0.246
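A short Python sketch of the gain ratio, reusing the same entropy helper; the function names are illustrative and the printed values are rounded.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, split_counts):
    # Gain ratio = information gain / intrinsic information of the split,
    # where the intrinsic information is the entropy of the branch sizes.
    n = sum(parent_counts)
    after = sum(sum(s) / n * entropy(s) for s in split_counts)
    gain = entropy(parent_counts) - after
    intrinsic = entropy([sum(s) for s in split_counts])
    return gain / intrinsic

# Outlook split on the weather data: branch sizes 5, 4, 5.
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))      # 0.156
# ID code split: 14 branches of one instance each.
print(round(gain_ratio([9, 5], [[1, 0]] * 9 + [[0, 1]] * 5), 3))   # ≈ 0.247 (slide rounds to 0.246)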
56

Gain ratios for weather data

 Outlook                                  Temperature
 Info:        0.693                       Info:        0.911
 Gain:        0.940 − 0.693 = 0.247       Gain:        0.940 − 0.911 = 0.029
 Split info:  info([5,4,5]) = 1.577       Split info:  info([4,6,4]) = 1.362
 Gain ratio:  0.247 / 1.577 = 0.156       Gain ratio:  0.029 / 1.362 = 0.021

 Humidity                                 Windy
 Info:        0.788                       Info:        0.892
 Gain:        0.940 − 0.788 = 0.152       Gain:        0.940 − 0.892 = 0.048
 Split info:  info([7,7]) = 1.000         Split info:  info([8,6]) = 0.985
 Gain ratio:  0.152 / 1.000 = 0.152       Gain ratio:  0.048 / 0.985 = 0.049
57

More on the gain ratio
 "Outlook" still comes out top
 However: "ID code" has a greater gain ratio
  – Standard fix: ad hoc test to prevent splitting on that type of attribute
 Problem with gain ratio: it may overcompensate
  – May choose an attribute just because its intrinsic information is very low
  – Standard fix: only consider attributes with greater than average information gain
58

Discussion
 Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan
  – Gain ratio is just one modification of this basic algorithm
 C4.5: deals with numeric attributes, missing values, noisy data
 Similar approach: CART
 There are many other attribute selection criteria!
 (But little difference in accuracy of result)
59

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
60

Covering algorithms
 Convert a decision tree into a rule set
  – Straightforward, but the rule set is overly complex
  – More effective conversions are not trivial
 Instead, can generate a rule set directly
  – For each class in turn, find a rule set that covers all instances in it (excluding instances not in the class)
 Called a covering approach:
  – at each stage a rule is identified that "covers" some of the instances
61

Example: generating a rule
 [Figure: scatter plot of instances of classes a and b in the x–y plane, covered in stages by the rules below, with splits at x = 1.2 and y = 2.6]

 If true then class = a
 If x > 1.2 then class = a
 If x > 1.2 and y > 2.6 then class = a

 Possible rule set for class "b":
 If x ≤ 1.2 then class = b
 If x > 1.2 and y ≤ 2.6 then class = b

 Could add more rules, get a "perfect" rule set
62

Rules vs. trees
 Corresponding decision tree: (produces exactly the same predictions)
 But: rule sets can be more perspicuous when decision trees suffer from replicated subtrees
 Also: in multiclass situations, a covering algorithm concentrates on one class at a time whereas a decision tree learner takes all classes into account
63

Simple covering algorithm
 Generates a rule by adding tests that maximize the rule's accuracy
 Similar to the situation in decision trees: the problem of selecting an attribute to split on
  – But: a decision tree inducer maximizes overall purity
 Each new test reduces the rule's coverage
 [Figure: space of examples, the rule so far, and the rule after adding a new term]
64

Selecting a test
 Goal: maximize accuracy
  – t: total number of instances covered by the rule
  – p: positive examples of the class covered by the rule
  – t − p: number of errors made by the rule
 Select the test that maximizes the ratio p/t
 We are finished when p/t = 1 or the set of instances can't be split any further
65

Rules vs. decision lists
 PRISM with the outer loop removed generates a decision list for one class
  – Subsequent rules are designed for instances that are not covered by previous rules
  – But: order doesn't matter because all rules predict the same class
 Outer loop considers all classes separately
  – No order dependence implied
 Problems: overlapping rules, default rule required
66

Pseudo-code for PRISM

 For each class C
   Initialize E to the instance set
   While E contains instances in class C
     Create a rule R with an empty left-hand side that predicts class C
     Until R is perfect (or there are no more attributes to use) do
       For each attribute A not mentioned in R, and each value v,
         Consider adding the condition A = v to the left-hand side of R
       Select A and v to maximize the accuracy p/t
       (break ties by choosing the condition with the largest p)
       Add A = v to R
     Remove the instances covered by R from E
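A compact Python sketch that follows the PRISM pseudo-code for nominal attributes; the dictionary-based data representation and the prism function name are assumptions of this sketch, and details such as tie handling are simplified.

def prism(instances, attributes, class_attr, target_class):
    # Grow one "perfect" rule at a time for the target class, then remove
    # the instances it covers and repeat (separate-and-conquer).
    rules, remaining = [], list(instances)
    covers = lambda rule, x: all(x[a] == v for a, v in rule)
    while any(x[class_attr] == target_class for x in remaining):
        rule, covered = [], remaining
        # Add conditions until the rule covers only the target class
        # (or every attribute has been used).
        while any(x[class_attr] != target_class for x in covered) and \
              len(rule) < len(attributes):
            best = None
            for a in attributes:
                if any(a == ra for ra, _ in rule):
                    continue
                for v in {x[a] for x in covered}:
                    subset = [x for x in covered if x[a] == v]
                    p = sum(x[class_attr] == target_class for x in subset)
                    t = len(subset)
                    # Maximize p/t, break ties on the larger p.
                    if best is None or (p / t, p) > best[0]:
                        best = ((p / t, p), (a, v), subset)
            rule.append(best[1])
            covered = best[2]
        rules.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules   # list of rules, each a list of (attribute, value) conditions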
67

Separate and conquer
 Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
  – First, identify a useful rule
  – Then, separate out all the instances it covers
  – Finally, "conquer" the remaining instances
 Difference to divide-and-conquer methods:
  – The subset covered by a rule doesn't need to be explored any further
68

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
69

Association rules
 Association rules…
  – … can predict any attribute and combinations of attributes
  – … are not intended to be used together as a set
 Problem: immense number of possible associations
  – Output needs to be restricted to show only the most predictive associations → only those with high support and high confidence
70

Support and confidence of a rule

 Outlook    Temp  Humidity  Windy  Play
 Sunny      Hot   High      False  No
 Sunny      Hot   High      True   No
 Overcast   Hot   High      False  Yes
 Rainy      Mild  High      False  Yes
 Rainy      Cool  Normal    False  Yes
 Rainy      Cool  Normal    True   No
 Overcast   Cool  Normal    True   Yes
 Sunny      Mild  High      False  No
 Sunny      Cool  Normal    False  Yes
 Rainy      Mild  Normal    False  Yes
 Sunny      Mild  Normal    True   Yes
 Overcast   Mild  High      True   Yes
 Overcast   Hot   Normal    False  Yes
 Rainy      Mild  High      True   No
71

Support and confidence of a rule
 Support: number of instances predicted correctly
 Confidence: number of correct predictions, as a proportion of all instances the rule applies to
 Example: 4 cool days with normal humidity

   If temperature = cool then humidity = normal

   Support = 4, confidence = 100%

 Normally: minimum support and confidence are pre-specified (e.g. 58 rules with support ≥ 2 and confidence ≥ 95% for the weather data)
72

Interpreting association rules
 Interpretation is not obvious:

   If windy = false and play = no then outlook = sunny and humidity = high

 is not the same as

   If windy = false and play = no then outlook = sunny
   If windy = false and play = no then humidity = high

 However, it means that the following also holds:

   If humidity = high and windy = false and play = no then outlook = sunny
73

Mining association rules
 Naïve method for finding association rules:
  – Use the separate-and-conquer method
  – Treat every possible combination of attribute values as a separate class
 Two problems:
  – Computational complexity
  – Resulting number of rules (which would have to be pruned on the basis of support and confidence)
 But: we can look for high-support rules directly!
74

Item sets
 Support: number of instances correctly covered by an association rule
  – The same as the number of instances covered by all tests in the rule (LHS and RHS!)
 Item: one test / attribute–value pair
 Item set: all items occurring in a rule
 Goal: only rules that exceed pre-defined support
  – Do it by finding all item sets with the given minimum support and generating rules from them!
75

Item Sets For Weather Data

 Outlook    Temp  Humidity  Windy  Play
 Sunny      Hot   High      False  No
 Sunny      Hot   High      True   No
 Overcast   Hot   High      False  Yes
 Rainy      Mild  High      False  Yes
 Rainy      Cool  Normal    False  Yes
 Rainy      Cool  Normal    True   No
 Overcast   Cool  Normal    True   Yes
 Sunny      Mild  High      False  No
 Sunny      Cool  Normal    False  Yes
 Rainy      Mild  Normal    False  Yes
 Sunny      Mild  Normal    True   Yes
 Overcast   Mild  High      True   Yes
 Overcast   Hot   Normal    False  Yes
 Rainy      Mild  High      True   No
76

Item sets for weather data
 Examples (support counts in parentheses):
  – One-item sets:   Outlook = Sunny (5);  Temperature = Cool (4); …
  – Two-item sets:   Outlook = Sunny, Temperature = Hot (2);  Outlook = Sunny, Humidity = High (3); …
  – Three-item sets: Outlook = Sunny, Temperature = Hot, Humidity = High (2);  Outlook = Sunny, Humidity = High, Windy = False (2); …
  – Four-item sets:  Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2);  Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2); …
 In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets (with minimum support of two)
77

Generating rules from an item set
 Once all item sets with minimum support have been generated, we can turn them into rules
 Example item set:

   Humidity = Normal, Windy = False, Play = Yes (4)

 Seven (2^N − 1) potential rules, with their accuracies:

   If Humidity = Normal and Windy = False then Play = Yes              4/4
   If Humidity = Normal and Play = Yes then Windy = False              4/6
   If Windy = False and Play = Yes then Humidity = Normal              4/6
   If Humidity = Normal then Windy = False and Play = Yes              4/7
   If Windy = False then Humidity = Normal and Play = Yes              4/8
   If Play = Yes then Humidity = Normal and Windy = False              4/9
   If True then Humidity = Normal and Windy = False and Play = Yes     4/12
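The enumeration of the 2^N − 1 rules can be sketched as follows; the function name and the dictionary representation of item sets and instances are illustrative assumptions. Confidence is computed as the item set's support divided by the antecedent's support, which yields the 4/4, 4/6, …, 4/12 fractions above when run on the full weather data.

from itertools import combinations

def rules_from_itemset(itemset, instances):
    # Enumerate the 2^N - 1 candidate rules for a frequent item set and report
    # each rule's confidence = support(item set) / support(antecedent).
    items = list(itemset.items())
    support = sum(all(x.get(a) == v for a, v in items) for x in instances)
    rules = []
    for r in range(len(items)):                 # antecedent size: 0 .. N-1
        for antecedent in combinations(items, r):
            ante_support = sum(all(x.get(a) == v for a, v in antecedent)
                               for x in instances)
            consequent = [it for it in items if it not in antecedent]
            rules.append((dict(antecedent), dict(consequent),
                          support, support / ante_support))
    return rules   # (antecedent, consequent, support, confidence) tuples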
78

Rules for weather data
 Rules with support > 1 and confidence = 100%:

   #    Association rule                                          Sup.  Conf.
   1    Humidity = Normal, Windy = False  ⇒  Play = Yes           4     100%
   2    Temperature = Cool  ⇒  Humidity = Normal                  4     100%
   3    Outlook = Overcast  ⇒  Play = Yes                         4     100%
   4    Temperature = Cool, Play = Yes  ⇒  Humidity = Normal      3     100%
   …    …                                                         …     …
   58   Outlook = Sunny, Temperature = Hot  ⇒  Humidity = High    2     100%

 In total: 3 rules with support four, 5 with support three, 50 with support two
79

Example rules from the same set
 Item set:

   Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)

 Resulting rules (all with 100% confidence):

   Temperature = Cool, Windy = False  ⇒  Humidity = Normal, Play = Yes
   Temperature = Cool, Windy = False, Humidity = Normal  ⇒  Play = Yes
   Temperature = Cool, Windy = False, Play = Yes  ⇒  Humidity = Normal

 due to the following "frequent" item sets:

   Temperature = Cool, Windy = False (2)
   Temperature = Cool, Humidity = Normal, Windy = False (2)
   Temperature = Cool, Windy = False, Play = Yes (2)
80

Generating item sets efficiently
 How can we efficiently find all frequent item sets?
 Finding one-item sets is easy
 Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …
  – If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  – In general: if X is a frequent k-item set, then all (k−1)-item subsets of X are also frequent
 Compute k-item sets by merging (k−1)-item sets
81

Example
 Given: five three-item sets

   (A B C), (A B D), (A C D), (A C E), (B C D)

 Lexicographically ordered!
 Candidate four-item sets:

   (A B C D)   OK because of (B C D)
   (A C D E)   Not OK because of (C D E)

 Final check by counting instances in the dataset!
 (k−1)-item sets are stored in a hash table
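A minimal sketch of this candidate-generation step, assuming item sets are kept as lexicographically sorted tuples; the function and variable names are illustrative.

def generate_candidates(frequent_sets):
    # Merge two frequent (k-1)-item sets that share their first k-2 items,
    # then prune any candidate that has an infrequent (k-1)-item subset.
    frequent = set(frequent_sets)
    candidates = []
    for i, a in enumerate(frequent_sets):
        for b in frequent_sets[i + 1:]:
            if a[:-1] == b[:-1]:                       # share all but the last item
                cand = tuple(sorted(set(a) | set(b)))
                subsets = [cand[:j] + cand[j + 1:] for j in range(len(cand))]
                if all(s in frequent for s in subsets):
                    candidates.append(cand)
    return candidates

three_sets = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
              ("A", "C", "E"), ("B", "C", "D")]
print(generate_candidates(three_sets))
# [('A', 'B', 'C', 'D')]  -- (A C D E) is pruned because (C D E) is not frequent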
82

Generating rules efficiently
 We are looking for all high-confidence rules
  – Support of the antecedent is obtained from the hash table
  – But: the brute-force method checks (2^N − 1) candidate rules per item set
 Better way: building (c + 1)-consequent rules from c-consequent ones
  – Observation: a (c + 1)-consequent rule can only hold if all corresponding c-consequent rules also hold
 Resulting algorithm is similar to the procedure for large item sets
83

Example
 1-consequent rules:

   If Outlook = Sunny and Windy = False and Play = No then Humidity = High (2/2)
   If Humidity = High and Windy = False and Play = No then Outlook = Sunny (2/2)

 Corresponding 2-consequent rule:

   If Windy = False and Play = No then Outlook = Sunny and Humidity = High (2/2)

 Final check of the antecedent against the hash table!
84

Association rules: discussion
 The above method makes one pass through the data for each different size of item set
  – Other possibility: generate (k+2)-item sets just after (k+1)-item sets have been generated
  – Result: more (k+2)-item sets than necessary will be considered, but fewer passes through the data
  – Makes sense if the data is too large for main memory
 Practical issue: generating a certain number of rules (e.g. by incrementally reducing min. support)
8585
Other issues
Standard ARFF format very inefficient for typical market basket data
– Attributes represent items in a basket and most items are usually missing
– Need a way of representing sparse data
Instances are also called transactions
Confidence is not necessarily the best measure
– Example: milk occurs in almost every supermarket transaction
– Other measures have been devised (e.g. lift)
86

Algorithms: The basic methods
Simplicity first: 1R
Use all attributes: Naïve Bayes
Decision trees: ID3
Covering algorithms: decision rules: PRISM
Association rules
Linear models
Instance-based learning
87
Linear models
Work most naturally with numeric attributes
Standard technique for numeric prediction: linear regression
– Outcome is a linear combination of attributes:
  x = w_0 + w_1 a_1 + w_2 a_2 + … + w_k a_k
– Weights are calculated from the training data
Predicted value for the first training instance a^(1) (with a_0 = 1):
  w_0 a_0^(1) + w_1 a_1^(1) + w_2 a_2^(1) + … + w_k a_k^(1) = Σ_{j=0..k} w_j a_j^(1)
88
Minimizing the squared error
Choose k + 1 coefficients to minimize the squared error on the training data
Squared error:
  Σ_{i=1..n} ( x^(i) − Σ_{j=0..k} w_j a_j^(i) )^2
Derive coefficients using standard matrix operations
Can be done if there are more instances than attributes (roughly speaking)
Minimizing the absolute error is more difficult
89
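As a rough illustration of this least-squares step (NumPy assumed; the function names are made up for the sketch), the weights can be obtained with a standard matrix routine:

```python
import numpy as np

# Sketch: fit w0..wk by minimizing the squared error with a standard matrix routine.
def fit_linear(A, x):
    A1 = np.column_stack([np.ones(len(A)), np.asarray(A, dtype=float)])  # a0 = 1 column for w0
    w, *_ = np.linalg.lstsq(A1, np.asarray(x, dtype=float), rcond=None)  # min_w ||A1 w - x||^2
    return w                                                             # w[0] is the intercept

def predict(w, a):
    return w[0] + float(np.dot(w[1:], a))
```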
Classification
Any regression technique can be used for classification
– Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
– Prediction: predict the class corresponding to the model with the largest output value (membership value)
For linear regression this is known as multi-response linear regression
91
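A small sketch of multi-response linear regression along these lines, assuming NumPy (the function names and data layout are illustrative):

```python
import numpy as np

# Sketch: one least-squares regression per class with a 0/1 membership target;
# predict the class whose model produces the largest output.
def fit_multiresponse(A, labels, classes):
    A1 = np.column_stack([np.ones(len(A)), np.asarray(A, dtype=float)])
    labels = np.asarray(labels)
    return {c: np.linalg.lstsq(A1, (labels == c).astype(float), rcond=None)[0]
            for c in classes}

def predict_class(models, a):
    a1 = np.concatenate([[1.0], np.asarray(a, dtype=float)])
    return max(models, key=lambda c: float(models[c] @ a1))   # largest membership value wins
```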
Pairwise regression
Another way of using regression for classification:
– A regression function for every pair of classes, using only instances from these two classes
– Assign an output of +1 to one member of the pair, −1 to the other
Prediction is done by voting
– Class that receives the most votes is predicted
– Alternative: "don't know" if there is no agreement
More likely to be accurate but more expensive
92
Logistic regression
Problem: some assumptions are violated when linear regression is applied to classification problems
Logistic regression: alternative to linear regression
– Designed for classification problems
– Tries to estimate class probabilities directly
  Does this using the maximum likelihood method
– Uses this linear model (P = class probability):
  log( P / (1 − P) ) = w_0 a_0 + w_1 a_1 + w_2 a_2 + … + w_k a_k
93
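One way to sketch this maximum-likelihood fit in plain NumPy is simple gradient ascent on the log-likelihood; this is only an illustration of the idea, not the optimizer any particular package uses:

```python
import numpy as np

# Sketch: fit log(P/(1-P)) = w0*a0 + w1*a1 + ... + wk*ak by gradient ascent
# on the average log-likelihood.
def fit_logistic(A, y, lr=0.1, n_iter=2000):
    X = np.column_stack([np.ones(len(A)), np.asarray(A, dtype=float)])  # a0 = 1
    y = np.asarray(y, dtype=float)                                      # 1 = class of interest
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-X @ w))      # current class-probability estimates
        w += lr * X.T @ (y - P) / len(y)      # gradient of the average log-likelihood
    return w
```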
Discussion of linear models
Not appropriate if data exhibits non-linear dependencies
But: can serve as building blocks for more complex schemes (e.g. model trees)
Example: multi-response linear regression defines a hyperplane for any two given classes:
  (w_0^(1) − w_0^(2)) a_0 + (w_1^(1) − w_1^(2)) a_1 + (w_2^(1) − w_2^(2)) a_2 + … + (w_k^(1) − w_k^(2)) a_k = 0
94

Algorithms: The basic methods
Simplicity first: 1R
Use all attributes: Naïve Bayes
Decision trees: ID3
Covering algorithms: decision rules: PRISM
Association rules
Linear models
Instance-based learning
95
Instance-based representation
Simplest form of learning: rote learning
– Training instances are searched for the instance that most closely resembles the new instance
– The instances themselves represent the knowledge
– Also called instance-based learning
Similarity function defines what's "learned"
Instance-based learning is lazy learning
Methods:
– nearest-neighbor
– k-nearest-neighbor
– …
96

The distance function
Simplest case: one numeric attribute
– Distance is the difference between the two attribute values involved (or a function thereof)
Several numeric attributes: normally, Euclidean distance is used and attributes are normalized
Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
Are all attributes equally important?
– Weighting the attributes might be necessary
97
Instance-based learning
Distance function defines what's learned
Most instance-based schemes use Euclidean distance:
  sqrt( (a_1^(1) − a_1^(2))^2 + (a_2^(1) − a_2^(2))^2 + … + (a_k^(1) − a_k^(2))^2 )
  where a^(1) and a^(2) are two instances with k attributes
Taking the square root is not required when comparing distances
Other popular metric: city-block metric
– Adds differences without squaring them
98
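A minimal 1-NN sketch along these lines (NumPy assumed; the helper name is illustrative), using squared Euclidean distance since the square root does not change the ranking:

```python
import numpy as np

# Sketch: 1-nearest-neighbor with (squared) Euclidean distance.
def nearest_neighbor(train_X, train_y, query):
    train_X = np.asarray(train_X, dtype=float)
    d2 = ((train_X - np.asarray(query, dtype=float)) ** 2).sum(axis=1)  # squared distances
    return train_y[int(np.argmin(d2))]                                  # class of the closest instance
```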
Normalization and other issues
Different attributes are measured on different scales → need to be normalized:
  a_i = ( v_i − min v_i ) / ( max v_i − min v_i )
  where v_i is the actual value of attribute i
Nominal attributes: distance either 0 or 1
Common policy for missing values: assumed to be maximally distant (given normalized attributes)
99
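A small sketch of this min–max normalization (NumPy assumed; the guard for constant attributes is an added assumption of the sketch):

```python
import numpy as np

# Sketch: rescale each numeric attribute to [0, 1] via (v - min) / (max - min).
def normalize(X):
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # guard against constant attributes
```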
Discussion of 1-NN
Often very accurate … but slow:
– simple version scans entire training data to derive a prediction
Assumes all attributes are equally important
– Remedy: attribute selection or weights
Possible remedies against noisy instances:
– Take a majority vote over the k nearest neighbors
– Removing noisy instances from dataset (difficult!)
Statisticians have used k-NN since the early 1950s
– If n → ∞ and k/n → 0, the error approaches the minimum
100

Comments on basic methods
Bayes' rule stems from his "Essay towards solving a problem in the doctrine of chances" (1763)
– Difficult bit: estimating prior probabilities
Extension of Naïve Bayes: Bayesian Networks
Algorithm for association rules is called APRIORI
Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. can't learn XOR
– But: combinations of them can (→ Neural Nets)
101
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
102

Evaluation: the key to success
How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data
– Otherwise 1-NN would be the optimum classifier!
Simple solution that can be used if lots of (labeled) data is available:
– Split data into training and test set
However: (labeled) data is usually limited
– More sophisticated techniques need to be used
103
Issues in evaluation
Statistical reliability of estimated differences in performance (→ significance tests)
Choice of performance measure:
– Number of correct classifications
– Accuracy of probability estimates
– Error in numeric predictions
Costs assigned to different types of errors
– Many practical applications involve costs
104

Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
105
Training and testing I
Natural performance measure for classification problems: error rate
– Success: instance's class is predicted correctly
– Error: instance's class is predicted incorrectly
– Error rate: proportion of errors made over the whole set of instances
Resubstitution error: error rate obtained from training data
Resubstitution error is (hopelessly) optimistic!
106

Training and testing II
Test set: independent instances that have played no part in formation of classifier
– Assumption: both training data and test data are representative samples of the underlying problem
Test and training data may differ in nature
– Example: classifiers built using subject data with two different diagnoses A and B
  To estimate the performance of a classifier built from diagnosis-A data on subjects with diagnosis B, test it on data for subjects diagnosed with B
107
Note on parameter tuning
It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
– Stage 1: build the basic structure
– Stage 2: optimize parameter settings
The test data can't be used for parameter tuning!
Proper procedure uses three sets: training data, validation data, and test data
– Validation data is used to optimize parameters
108

Making the most of the data
Once evaluation is complete, all the data can be used to build the final classifier
Generally, the larger the training data the better the classifier (but returns diminish)
The larger the test data the more accurate the error estimate
Holdout procedure: method of splitting original data into training and test set
– Dilemma: ideally both training set and test set should be large!
109
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
110

Predicting performance
Assume the estimated error rate is 25%. How close is this to the true error rate?
– Depends on the amount of test data
Prediction is just like tossing a (biased!) coin
– "Head" is a "success", "tail" is an "error"
In statistics, a succession of independent events like this is called a Bernoulli process
– Statistical theory provides us with confidence intervals for the true underlying proportion
111
Confidence intervals
We can say: p lies within a certain specified interval with a certain specified confidence
Example: S = 750 successes in N = 1000 trials
– Estimated success rate: 75%
– How close is this to the true success rate p?
– Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
Another example: S = 75 and N = 100
– Estimated success rate: 75%
– With 80% confidence, p ∈ [69.1%, 80.1%]
112

Mean and variance
Mean and variance for a Bernoulli trial: p, p(1 − p)
Expected success rate f = S/N
Mean and variance for f: p, p(1 − p)/N
For large enough N, f follows a Normal distribution
c% confidence interval [−z ≤ X ≤ z] for a random variable with 0 mean is given by:
  Pr[−z ≤ X ≤ z] = c
With a symmetric distribution:
  Pr[−z ≤ X ≤ z] = 1 − 2 · Pr[X ≥ z]
113
Confidence limits
Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X ≥ z]   z
  0.1%        3.09
  0.5%        2.58
  1%          2.33
  5%          1.65
  10%         1.28
  20%         0.84
  40%         0.25

Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
(figure: standard normal density with ±1.65 marked)
To use this we have to reduce our random variable f to have 0 mean and unit variance
114
Transforming f
Transformed value for f:
  ( f − p ) / sqrt( p(1 − p)/N )
(i.e. subtract the mean and divide by the standard deviation)
Resulting equation:
  Pr[ −z ≤ ( f − p ) / sqrt( p(1 − p)/N ) ≤ z ] = c
Solving for p:
  p = ( f + z^2/(2N) ± z · sqrt( f/N − f^2/N + z^2/(4N^2) ) ) / ( 1 + z^2/N )
115
Examples
f = 75%, N = 1000, c = 80% (so that z = 1.28):
  p ∈ [0.732, 0.767]
f = 75%, N = 100, c = 80% (so that z = 1.28):
  p ∈ [0.691, 0.801]
Note that the normal distribution assumption is only valid for large N (i.e. N > 100)
f = 75%, N = 10, c = 80% (so that z = 1.28):
  p ∈ [0.549, 0.881]
(should be taken with a grain of salt)
116
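These intervals can be reproduced with a short sketch of the formula from the previous slide (plain Python; the function name is illustrative):

```python
import math

# Sketch of p = ( f + z^2/2N ± z*sqrt(f/N - f^2/N + z^2/4N^2) ) / (1 + z^2/N)
def confidence_interval(f, N, z):
    center = f + z * z / (2 * N)
    spread = z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (center - spread) / denom, (center + spread) / denom

print(confidence_interval(0.75, 1000, 1.28))   # ~(0.732, 0.767)
print(confidence_interval(0.75, 100, 1.28))    # ~(0.691, 0.801)
print(confidence_interval(0.75, 10, 1.28))     # ~(0.549, 0.881)
```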
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
117

Holdout estimation
What to do if the amount of data is limited?
The holdout method reserves a certain amount for testing and uses the remainder for training
– Usually: one third for testing, the rest for training
Problem: the samples might not be representative
– Example: a class might be missing in the test data
Advanced version uses stratification
– Ensures that each class is represented with approximately equal proportions in both subsets
118
Repeated holdout method
Holdout estimate can be made more reliable by repeating the process with different subsamples
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
– The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
Still not optimum: the different test sets overlap
– Can we prevent overlapping?
119
Cross-validation
Cross-validation avoids overlapping test sets
– First step: split data into k subsets of equal size
– Second step: use each subset in turn for testing, the remainder for training
Called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
The error estimates are averaged to yield an overall error estimate
120
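A bare-bones sketch of k-fold cross-validation (plain Python, without stratification; `train_and_test` is an assumed callback that trains on one list of instances and returns the error rate on another):

```python
import random

# Sketch: split the data into k folds, test on each fold in turn, train on the rest.
def cross_validation_error(data, train_and_test, k=10, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        errors.append(train_and_test(train, folds[i]))   # error rate on the held-out fold
    return sum(errors) / k                               # averaged overall estimate
```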
More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation
Why ten?
– Extensive experiments have shown that this is the best choice to get an accurate estimate
– There is also some theoretical evidence for this
Stratification reduces the estimate's variance
Even better: repeated stratified cross-validation
– E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
121

Leave-One-Out cross-validation
Leave-One-Out: a particular form of cross-validation:
– Set number of folds to number of training instances
– I.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive
– (exception: NN)
122
Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible
– It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: random dataset split equally into two classes
– Best inducer predicts majority class
– 50% accuracy on fresh data
– Leave-One-Out-CV estimate is 100% error!
123

The bootstrap
CV uses sampling without replacement
– The same instance, once selected, can not be selected again for a particular training/test set
The bootstrap uses sampling with replacement to form the training set
– Sample a dataset of n instances n times with replacement to form a new dataset of n instances
– Use this data as the training set
– Use the instances from the original dataset that don't occur in the new training set for testing
124
The 0.632 bootstrap
Also called the 0.632 bootstrap
– A particular instance has a probability of 1 − 1/n of not being picked on any one draw
– Thus its probability of ending up in the test data is:
  ( 1 − 1/n )^n ≈ e^(−1) ≈ 0.368
– This means the training data will contain approximately 63.2% of the instances
125
Estimating error with the bootstrap
The error estimate on the test data will be very pessimistic
– Trained on just ~63% of the instances
Therefore, combine it with the resubstitution error:
  err = 0.632 · e_(test instances) + 0.368 · e_(training instances)
The resubstitution error gets less weight than the error on the test data
Repeat process several times with different replacement samples; average the results
126
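A sketch of one bootstrap repetition combining the two error estimates as above (plain Python; `train_and_test` is again an assumed callback returning an error rate):

```python
import random

# Sketch: one 0.632-bootstrap repetition.
def bootstrap_error(data, train_and_test, seed=0):
    rng = random.Random(seed)
    n = len(data)
    idx = [rng.randrange(n) for _ in range(n)]             # sample n indices with replacement
    chosen = set(idx)
    train = [data[i] for i in idx]
    test = [data[i] for i in range(n) if i not in chosen]  # instances never picked (~36.8%)
    e_test = train_and_test(train, test)
    e_train = train_and_test(train, train)                 # resubstitution error
    return 0.632 * e_test + 0.368 * e_train
```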
More on the bootstrap
Probably the best way of estimating performance for very small datasets
However, it has some problems
– Consider the random dataset from above
– A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data
– Bootstrap estimate for this classifier:
  err = 0.632 · 50% + 0.368 · 0% = 31.6%
– True expected error: 50%
127

Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
128
Comparing data mining schemes
Frequent question: which of two learning schemes performs better?
Note: this is domain dependent!
Obvious way: compare 10-fold CV estimates
Problem: variance in estimate
Variance can be reduced using repeated CV
However, we still don't know whether the results are reliable
129

Significance tests
Significance tests tell us how confident we can be that there really is a difference
Null hypothesis: there is no "real" difference
Alternative hypothesis: there is a difference
A significance test measures how much evidence there is in favor of rejecting the null hypothesis
Let's say we are using 10-fold CV
Question: do the two means of the 10 CV estimates differ significantly?
130
Paired t-test
Student's t-test tells whether the means of two samples are significantly different
Take individual samples using cross-validation
Use a paired t-test because the individual samples are paired
– The same CV is applied twice
William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Obtained a post as a chemist in the Guinness brewery in Dublin in 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".
131
Student's distribution
With small samples (k < 100) the mean follows Student's distribution with k − 1 degrees of freedom
Confidence limits:

  9 degrees of freedom          normal distribution
  Pr[X ≥ z]    z                Pr[X ≥ z]    z
  0.1%         4.30             0.1%         3.09
  0.5%         3.25             0.5%         2.58
  1%           2.82             1%           2.33
  5%           1.83             5%           1.65
  10%          1.38             10%          1.28
  20%          0.88             20%          0.84
132
Distribution of the means
x_1, x_2, …, x_k and y_1, y_2, …, y_k are the 2k samples for a k-fold CV
m_x and m_y are the means
With enough samples, the mean of a set of independent samples is normally distributed
Estimated variances of the means are σ_x^2/k and σ_y^2/k
If μ_x and μ_y are the true means, then
  ( m_x − μ_x ) / sqrt( σ_x^2/k )   and   ( m_y − μ_y ) / sqrt( σ_y^2/k )
are approximately normally distributed with mean 0, variance 1
133
Distribution of the differences
Let m_d = m_x − m_y
The difference of the means (m_d) also has a Student's distribution with k − 1 degrees of freedom
Let σ_d^2 be the variance of the difference
The standardized version of m_d is called the t-statistic:
  t = m_d / sqrt( σ_d^2 / k )
We use t to perform the t-test
134
Performing the test
• Fix a significance level α
• If a difference is significant at the α% level, there is a (100 − α)% chance that there really is a difference
• Divide the significance level by two because the test is two-tailed
  • I.e. the true difference can be +ve or −ve
• Look up the value for z that corresponds to α/2
• If t ≤ −z or t ≥ z then the difference is significant
  • I.e. the null hypothesis can be rejected
135
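A compact sketch of the paired t-test on k paired cross-validation estimates (plain Python; the critical value quoted in the comment is the usual two-tailed 5% value for 9 degrees of freedom):

```python
import math

# Sketch: paired t-test on k paired cross-validation estimates x and y.
def paired_t(x, y):
    k = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]
    m_d = sum(d) / k
    var_d = sum((di - m_d) ** 2 for di in d) / (k - 1)   # sample variance of the differences
    return m_d / math.sqrt(var_d / k)                    # compare with Student's table, k-1 d.o.f.

# With k = 10 folds and a two-tailed 5% significance level, reject the null
# hypothesis of "no real difference" if |t| exceeds roughly 2.26.
```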
Unpaired observations
If the CV estimates are from different randomizations, they are no longer paired
(or maybe we used k-fold CV for one scheme, and j-fold CV for the other one)
Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom
The t-statistic becomes:
  t = ( m_x − m_y ) / sqrt( σ_x^2/k + σ_y^2/j )
instead of
  t = m_d / sqrt( σ_d^2/k )
136
Interpreting the result
All our cross-validation estimates are based on the same dataset
Samples are not independent
Should really use a different dataset sample for each of the k estimates used in the test to judge performance across different training sets
Or, use a heuristic test, e.g. the corrected resampled t-test
137

Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
138
Predicting probabilities
Performance measure so far: success rate
Also called 0-1 loss function, summed over all instances i:
  0 if the prediction for instance i is correct, 1 if it is incorrect
Most classifiers produce class probabilities
Depending on the application, we might want to check the accuracy of the probability estimates
0-1 loss is not the right thing to use in those cases
139
Quadratic loss function
p_1 … p_k are probability estimates for an instance
c is the index of the instance's actual class
a_1 … a_k = 0, except for a_c, which is 1
Quadratic loss is:
  Σ_j ( p_j − a_j )^2 = Σ_{j ≠ c} p_j^2 + ( 1 − p_c )^2
Want to minimize
  E[ Σ_j ( p_j − a_j )^2 ]
Can show that this is minimized when p_j = p_j*, the true probabilities
140
Informational loss function
The informational loss function is −log₂(p_c), where c is the index of the instance's actual class
Number of bits required to communicate the actual class
Let p_1* … p_k* be the true class probabilities
Then the expected value for the loss function is:
  −p_1* log₂ p_1 − … − p_k* log₂ p_k
Justification: minimized when p_j = p_j*
Difficulty: zero-frequency problem
141
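The two loss functions for a single instance can be sketched directly from their definitions (plain Python; the example numbers are made up):

```python
import math

# Sketch: p = predicted class probabilities, c = index of the actual class.
def quadratic_loss(p, c):
    return sum((pj - (1.0 if j == c else 0.0)) ** 2 for j, pj in enumerate(p))

def informational_loss(p, c):
    return -math.log2(p[c])        # infinite if the actual class was given probability 0

print(quadratic_loss([0.7, 0.2, 0.1], 0))       # 0.14
print(informational_loss([0.7, 0.2, 0.1], 0))   # ~0.515 bits
```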
Discussion
Which loss function to choose?
– Both encourage honesty
– Quadratic loss function takes into account all class probability estimates for an instance
– Informational loss focuses only on the probability estimate for the actual class
– Quadratic loss is bounded: it can never exceed 2 (it is at most 1 + Σ_j p_j^2, and Σ_j p_j^2 ≤ 1)
– Informational loss can be infinite
Informational loss is related to the MDL principle [later]
142

Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
143
Counting the cost
In practice, different types of classification errors often incur different costs
Examples:
– Disease diagnosis
– Terrorist profiling
  "Not a terrorist" correct 99.99% of the time
– Loan decisions
– Oil-slick detection
– Fault diagnosis
– Promotional mailing
144
Counting the cost
The confusion matrix:

                         Predicted class
                         Yes              No
  Actual class    Yes    True positive    False negative
                  No     False positive   True negative

There are many other types of cost!
– E.g.: cost of collecting training data
145
Lift charts
In practice, costs are rarely known
Decisions are usually made by comparing possible scenarios
Example: promotional mailout to 1,000,000 households
• Mail to all; 0.1% respond (1000)
• Data mining tool identifies subset of 100,000 most promising, 0.4% of these respond (400)
  40% of responses for 10% of cost may pay off
• Identify subset of 400,000 most promising, 0.2% respond (800)
A lift chart allows a visual comparison
146
Generating a lift chart
Sort instances according to predicted probability of being positive:

  Rank   Predicted probability   Actual class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  …      …                       …

x axis is sample size, y axis is number of true positives
147
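A sketch of how the points of such a lift chart can be computed (plain Python; the encoding of the actual class as 1/0 is an assumption of the sketch):

```python
# Sketch: rank by predicted probability of "yes", accumulate true positives.
def lift_points(probs, actual):                       # actual: 1 for yes, 0 for no
    ranked = sorted(zip(probs, actual), key=lambda t: -t[0])
    points, tp = [], 0
    for i, (_, a) in enumerate(ranked, start=1):
        tp += a
        points.append((i / len(ranked), tp))          # x: fraction of sample, y: true positives
    return points
```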
A hypothetical lift chart
(figure: lift curve with two marked points — 40% of responses for 10% of cost, 80% of responses for 40% of cost)
148
ROC curves
ROC curves are similar to lift charts
– Stands for "receiver operating characteristic"
– Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
Differences to lift chart:
– y axis shows percentage of true positives in sample rather than absolute number
– x axis shows percentage of false positives in sample rather than sample size
149
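The corresponding ROC points use the same ranking as the lift chart but put rates on both axes; a small sketch under the same 1/0 encoding assumption (and assuming at least one positive and one negative instance):

```python
# Sketch: same ranking as the lift chart, but both axes become rates.
def roc_points(probs, actual):                        # actual: 1 for positive, 0 for negative
    ranked = sorted(zip(probs, actual), key=lambda t: -t[0])
    P = sum(actual)
    N = len(actual) - P
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, a in ranked:
        tp += a
        fp += 1 - a
        points.append((fp / N, tp / P))               # x: FP rate, y: TP rate
    return points
```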
A sample ROC curve
(figure: ROC curve)
Jagged curve — one set of test data
Smooth curve — use cross-validation
150

Cross-validation and ROC curves
Simple method of getting a ROC curve using cross-validation:
– Collect probabilities for instances in test folds
– Sort instances according to probabilities
This method is implemented in WEKA
However, this is just one possibility
– The method described in the WEKA book generates an ROC curve for each fold and averages them
151
ROC curves for two schemes
(figure: ROC curves for two schemes, A and B)
For a small, focused sample, use method A
For a larger one, use method B
In between, choose between A and B with appropriate probabilities
152

The convex hull
Given two learning schemes we can achieve any point on the convex hull!
TP and FP rates for scheme 1: t_1 and f_1
TP and FP rates for scheme 2: t_2 and f_2
If scheme 1 is used to predict 100·q % of the cases and scheme 2 for the rest, then
– TP rate for combined scheme: q·t_1 + (1 − q)·t_2
– FP rate for combined scheme: q·f_1 + (1 − q)·f_2
153
Cost-sensitive learning
Most learning schemes do not perform cost-sensitive learning
– They generate the same classifier no matter what costs are assigned to the different classes
– Example: standard decision tree learner
Simple methods for cost-sensitive learning:
– Resampling of instances according to costs
– Weighting of instances according to costs
Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes
154
Measures in information retrieval
Percentage of retrieved documents that are relevant: precision = TP/(TP + FP)
Percentage of relevant documents that are returned: recall = TP/(TP + FN)
Precision/recall curves have a hyperbolic shape
Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
F-measure = (2 · recall · precision) / (recall + precision)
155
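These measures follow directly from the confusion-matrix counts; a trivial sketch:

```python
# Sketch: the three retrieval measures from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * r * p / (r + p)
```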
Summary of measures

  Measure                  Domain                   Plot (y vs. x)                       Explanation
  Lift chart               Marketing                TP vs. subset size                   TP; (TP + FP)/(TP + FP + TN + FN)
  ROC curve                Disease classification   TP rate (sensitivity) vs. FP rate    TP/(TP + FN); FP/(FP + TN)
  Recall-precision curve   Information retrieval    Recall vs. precision                 TP/(TP + FN); TP/(TP + FP)
156
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
157

Evaluating numeric prediction
Same strategies: independent test set, cross-validation, significance tests, etc.
Difference: error measures
Actual target values: a_1, a_2, …, a_n
Predicted target values: p_1, p_2, …, p_n
Most popular measure: mean-squared error
  ( (p_1 − a_1)^2 + … + (p_n − a_n)^2 ) / n
– Easy to manipulate mathematically
158
Other measures
The root mean-squared error:
  sqrt( ( (p_1 − a_1)^2 + … + (p_n − a_n)^2 ) / n )
The mean absolute error is less sensitive to outliers than the mean-squared error:
  ( |p_1 − a_1| + … + |p_n − a_n| ) / n
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
159
Improvement on the mean
How much does the scheme improve on simply predicting the average?
The relative squared error is (ā is the average of the actual values):
  ( (p_1 − a_1)^2 + … + (p_n − a_n)^2 ) / ( (a_1 − ā)^2 + … + (a_n − ā)^2 )
The relative absolute error is:
  ( |p_1 − a_1| + … + |p_n − a_n| ) / ( |a_1 − ā| + … + |a_n − ā| )
160
Correlation coefficient
Measures the statistical correlation between the predicted values and the actual values:
  ρ = S_PA / sqrt( S_P · S_A )
where
  S_PA = Σ_i ( p_i − p̄ )( a_i − ā ) / ( n − 1 )
  S_P  = Σ_i ( p_i − p̄ )^2 / ( n − 1 )
  S_A  = Σ_i ( a_i − ā )^2 / ( n − 1 )
Scale independent, between −1 and +1
Good performance leads to large values!
161
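The numeric-prediction measures from the last few slides can be sketched directly from their formulas (plain Python; function names are illustrative):

```python
import math

# Sketch: error measures for predicted values p and actual values a.
def mean_squared_error(p, a):
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / len(a)

def mean_absolute_error(p, a):
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / len(a)

def relative_squared_error(p, a):
    mean_a = sum(a) / len(a)
    return (sum((pi - ai) ** 2 for pi, ai in zip(p, a))
            / sum((ai - mean_a) ** 2 for ai in a))

def correlation(p, a):
    n = len(a)
    mp, ma = sum(p) / n, sum(a) / n
    s_pa = sum((pi - mp) * (ai - ma) for pi, ai in zip(p, a)) / (n - 1)
    s_p = sum((pi - mp) ** 2 for pi in p) / (n - 1)
    s_a = sum((ai - ma) ** 2 for ai in a) / (n - 1)
    return s_pa / math.sqrt(s_p * s_a)
```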
Which measure?
Best to look at all of them
Often it doesn't matter
Example:

                            A       B       C       D
  Root mean-squared error   67.8    91.7    63.3    57.4
  Mean absolute error       41.3    38.5    33.4    29.2
  Root rel squared error    42.2%   57.2%   39.4%   35.8%
  Relative absolute error   43.1%   40.1%   34.8%   30.4%
  Correlation coefficient   0.88    0.88    0.89    0.91

D best, C second-best, A and B arguable
162
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
163

The MDL principle
MDL stands for minimum description length
The description length is defined as:
  space required to describe a theory
  +
  space required to describe the theory's mistakes
In our case the theory is the classifier and the mistakes are the errors on the training data
Aim: we seek a classifier with minimal DL
MDL principle is a model selection criterion
164
Model selection criteria
Model selection criteria attempt to find a good compromise between:
• The complexity of a model
• Its prediction accuracy on the training data
Reasoning: a good model is a simple model that achieves high accuracy on the given data
Also known as Occam's Razor: the best theory is the smallest one that describes all the facts
William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.
165
Elegance vs. errors
Theory 1: very simple, elegant theory that explains the data almost perfectly
Theory 2: significantly more complex theory that reproduces the data without mistakes
Theory 1 is probably preferable
Classical example: Kepler's three laws on planetary motion
– Less accurate than Copernicus's latest refinement of the Ptolemaic theory of epicycles
166

Elegance vs. errors
Kepler – "I have cleared the Augean stables of astronomy of cycles and spirals, and left behind me only a single cartload of dung"
167
MDL and compression
MDL principle relates to data compression:
– The best theory is the one that compresses the data the most
– I.e. to compress a dataset we generate a model and then store the model and its mistakes
We need to compute
  (a) the size of the model, and
  (b) the space needed to encode the errors
(b) is easy: use the informational loss function
(a) needs a method to encode the model
168
MDL and Bayes's theorem
L[T] = "length" of the theory
L[E|T] = training set encoded with respect to the theory (the "dung")
Description length = L[T] + L[E|T]
Bayes's theorem gives the a posteriori probability of a theory given the data:
  Pr[T|E] = Pr[E|T] · Pr[T] / Pr[E]
Equivalent to:
  −log Pr[T|E] = −log Pr[E|T] − log Pr[T] + log Pr[E]
where log Pr[E] is a constant
169
MDL and MAP
MAP stands for maximum a posteriori probability
Finding the MAP theory corresponds to finding the MDL theory
Difficult bit in applying the MAP principle: determining the prior probability Pr[T] of the theory
Corresponds to the difficult part in applying the MDL principle: the coding scheme for the theory
I.e. if we know a priori that a particular theory is more likely, we need fewer bits to encode it
170
Discussion of MDL principle
Advantage: makes full use of the training data when selecting a model
Disadvantage 1: appropriate coding scheme/prior probabilities for theories are crucial
Disadvantage 2: no guarantee that the MDL theory is the one which minimizes the expected error
Note: Occam's Razor is an axiom!
Epicurus's principle of multiple explanations: keep all theories that are consistent with the data
171
Bayesian model averaging
Reflects Epicurus's principle: all theories are used for prediction, weighted according to Pr[T|E]
Let I be a new instance whose class we must predict
Let C be the random variable denoting the class
Then BMA gives the probability of C given
– I
– training data E
– possible theories T_j:
  Pr[C | I, E] = Σ_j Pr[C | I, T_j] · Pr[T_j | E]
172
MDL and clustering
Description length of theory: bits needed to encode the clusters
– e.g. cluster centers
Description length of data given theory: encode cluster membership and position relative to cluster
– e.g. distance to cluster center
Works if coding scheme uses less code space for small numbers than for large ones
With nominal attributes, must communicate probability distributions for each cluster
173
Main References
Han, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques (2nd ed.). New York: Morgan Kaufmann.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). New York: Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed., 5th printing). New York: Springer.