1

Data Mining I
Karl Young
Center for Imaging of Neurodegenerative Diseases, UCSF
2

The "Issues"
 Data Explosion Problem
  – Automated data collection tools + widely used database systems + a computerized society + the Internet lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, the WWW, and other information repositories
 We are drowning in data, but starving for knowledge!
 Solution: Data Warehousing and Data Mining
  – Data warehousing and on-line analytical processing (OLAP)
  – Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
3

Data Warehousing + Data Mining
(one of many schematic views)
 [Schematic: efficient and robust data storage and retrieval, plus efficient and robust data summary and visualization, drawing on database technology, statistics, computer science, high-performance computing, machine learning, visualization, …]
4

Machine learning and statistics
 Historical difference (grossly oversimplified):
  – Statistics: testing hypotheses
  – Machine learning: finding the right hypothesis
 But: huge overlap
  – Decision trees (C4.5 and CART)
  – Nearest-neighbor methods
 Today: perspectives have converged
  – Most ML algorithms employ statistical techniques
5

Schematically
 [KDD process diagram: Data Cleaning → Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation]
6

Schematically
 – Data warehouse — core of efficient data organization
 [Same KDD process diagram: Data Cleaning → Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation]
7

Schematically
 – Data mining — core of the knowledge discovery process
 [Same KDD process diagram: Data Cleaning → Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation]
8

Data mining
 Needed: programs that detect patterns and regularities in the data
 Strong patterns = good predictions
  – Problem 1: most patterns are not interesting
  – Problem 2: patterns may be inexact (or spurious)
  – Problem 3: data may be garbled or missing
 Want to learn a "concept", i.e. a rule or set of rules that characterizes the observed patterns in the data
9

Types of Learning
 Supervised – Classification
  – Know classes for examples
    Induction rules
    Decision trees
    Bayesian classification
      – Naïve
      – Networks
 Numeric Prediction
  – Linear regression
  – Neural nets
  – Support vector machines
 Unsupervised – Learn Natural Groupings
  – Clustering
    Partitioning methods
    Hierarchical methods
    Density-based methods
    Model-based methods
 Learn Association Rules – In principle learn all attributes
10

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
11

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
12

Simplicity first
 Simple algorithms often work very well!
 There are many kinds of simple structure, e.g.:
  – One attribute does all the work
  – All attributes contribute equally & independently
  – A weighted linear combination might do
  – Instance-based: use a few prototypes
  – Use simple logical rules
 Success of a method depends on the domain
13

The weather problem (used for illustration)
 Conditions for playing a certain game:

 Outlook    Temperature  Humidity  Windy  Play
 Sunny      Hot          High      False  No
 Sunny      Hot          High      True   No
 Overcast   Hot          High      False  Yes
 Rainy      Mild         Normal    False  Yes
 …          …            …         …      …

 If outlook = sunny and humidity = high then play = no
 If outlook = rainy and windy = true then play = no
 If outlook = overcast then play = yes
 If humidity = normal then play = yes
 If none of the above then play = yes
14

Weather data with mixed attributes
 Some attributes have numeric values:

 Outlook    Temperature  Humidity  Windy  Play
 Sunny      85           85        False  No
 Sunny      80           90        True   No
 Overcast   83           86        False  Yes
 Rainy      75           80        False  Yes
 …          …            …         …      …

 If outlook = sunny and humidity > 83 then play = no
 If outlook = rainy and windy = true then play = no
 If outlook = overcast then play = yes
 If humidity < 85 then play = yes
 If none of the above then play = yes
15

Inferring rudimentary rules
 1R: learns a 1-level decision tree
  – I.e., rules that all test one particular attribute
 Basic version
  – One branch for each value
  – Each branch assigns the most frequent class
  – Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
  – Choose the attribute with the lowest error rate
 (assumes nominal attributes)
16

Pseudo-code for 1R

 For each attribute,
   For each value of the attribute, make a rule as follows:
     count how often each class appears
     find the most frequent class
     make the rule assign that class to this attribute-value
   Calculate the error rate of the rules
 Choose the rules with the smallest error rate

 Note: "missing" is treated as a separate attribute value
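To make the pseudo-code concrete, here is a minimal Python sketch of 1R; the dictionary-based data layout, the attribute names, and the function name one_r are illustrative assumptions, not part of the original slides.

from collections import Counter, defaultdict

def one_r(instances, attributes, class_attr="Play"):
    # Minimal 1R: for each attribute build one-level rules (value -> majority
    # class) and keep the attribute whose rules make the fewest errors.
    best = None
    for attr in attributes:
        value_counts = defaultdict(Counter)
        for inst in instances:
            value_counts[inst[attr]][inst[class_attr]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in value_counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in value_counts.items())
        if best is None or errors < best[1]:
            best = (attr, errors, rules)
    return best   # (attribute, error count, {value: predicted class})

# Toy usage on four instances of the weather data:
data = [
    {"Outlook": "Sunny",    "Windy": "False", "Play": "No"},
    {"Outlook": "Sunny",    "Windy": "True",  "Play": "No"},
    {"Outlook": "Overcast", "Windy": "False", "Play": "Yes"},
    {"Outlook": "Rainy",    "Windy": "False", "Play": "Yes"},
]
print(one_r(data, ["Outlook", "Windy"]))
# ('Outlook', 0, {'Sunny': 'No', 'Overcast': 'Yes', 'Rainy': 'Yes'})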
17

Evaluating the weather attributes

 Attribute  Rules            Errors  Total errors
 Outlook    Sunny → No       2/5     4/14
            Overcast → Yes   0/4
            Rainy → Yes      2/5
 Temp       Hot → No*        2/4     5/14
            Mild → Yes       2/6
            Cool → Yes       1/4
 Humidity   High → No        3/7     4/14
            Normal → Yes     1/7
 Windy      False → Yes      2/8     5/14
            True → No*       3/6

 * indicates a tie

 Outlook    Temp  Humidity  Windy  Play
 Sunny      Hot   High      False  No
 Sunny      Hot   High      True   No
 Overcast   Hot   High      False  Yes
 Rainy      Mild  High      False  Yes
 Rainy      Cool  Normal    False  Yes
 Rainy      Cool  Normal    True   No
 Overcast   Cool  Normal    True   Yes
 Sunny      Mild  High      False  No
 Sunny      Cool  Normal    False  Yes
 Rainy      Mild  Normal    False  Yes
 Sunny      Mild  Normal    True   Yes
 Overcast   Mild  High      True   Yes
 Overcast   Hot   Normal    False  Yes
 Rainy      Mild  High      True   No
18

Dealing with numeric attributes
 Discretize numeric attributes
 Divide each attribute's range into intervals
  – Sort instances according to the attribute's values
  – Place breakpoints where the class changes (the majority class)
  – This minimizes the total error
 Example: temperature from the weather data

 Outlook    Temperature  Humidity  Windy  Play
 Sunny      85           85        False  No
 Sunny      80           90        True   No
 Overcast   83           86        False  Yes
 Rainy      75           80        False  Yes
 …          …            …         …      …
19

Dealing with numeric attributes
 Example: temperature from the weather data

 64  65  68  69  70  71  72  72  75  75  80  81  83  85
 Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

 Outlook    Temperature  Humidity  Windy  Play
 Sunny      85           85        False  No
 Sunny      80           90        True   No
 Overcast   83           86        False  Yes
 Rainy      75           80        False  Yes
 …          …            …         …      …
20

The problem of overfitting
 This procedure is very sensitive to noise
  – One instance with an incorrect class label will probably produce a separate interval
 Also: a time stamp attribute will have zero errors
 Simple solution: enforce a minimum number of instances in the majority class per interval
 Example (with min = 3):

 64  65  68  69  70  71  72  72  75  75  80  81  83  85
 Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

 64  65  68  69  70  71  72  72  75  75  80  81  83  85
 Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
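The overfitting fix above can be sketched in a few lines of Python. This is a simplified illustration under the assumption that the values are already sorted, that ties between equal attribute values are ignored, and that the later merging of adjacent intervals with the same majority class is omitted; the function name is illustrative.

from collections import Counter

def discretize_1r(values, classes, min_majority=3):
    # Close the current interval only once it holds at least `min_majority`
    # instances of its majority class and the next instance breaks that run.
    breakpoints = []          # attribute values at which a new interval starts
    counts = Counter()
    for i, cls in enumerate(classes):
        counts[cls] += 1
        majority_cls, majority_n = counts.most_common(1)[0]
        is_last = (i == len(classes) - 1)
        if (not is_last and majority_n >= min_majority
                and classes[i + 1] != majority_cls):
            breakpoints.append((values[i] + values[i + 1]) / 2)
            counts = Counter()
    return breakpoints

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
         "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(discretize_1r(temps, play, min_majority=3))   # [70.5, 77.5]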
21

With overfitting avoidance
 Resulting rule set:

 Attribute    Rules                    Errors  Total errors
 Outlook      Sunny → No               2/5     4/14
              Overcast → Yes           0/4
              Rainy → Yes              2/5
 Temperature  ≤ 77.5 → Yes             3/10    5/14
              > 77.5 → No*             2/4
 Humidity     ≤ 82.5 → Yes             1/7     3/14
              > 82.5 and ≤ 95.5 → No   2/6
              > 95.5 → Yes             0/1
 Windy        False → Yes              2/8     5/14
              True → No*               3/6
22

Discussion of 1R
 1R was described in a paper by Holte (1993)
  – Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
  – Minimum number of instances was set to 6 after some experimentation
  – 1R's simple rules performed not much worse than much more complex decision trees
 Simplicity first pays off!

 Very Simple Classification Rules Perform Well on Most Commonly Used Datasets
 Robert C. Holte, Computer Science Department, University of Ottawa
23

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
24

Statistical modeling
 "Opposite" of 1R: use all the attributes
 Two assumptions: attributes are
  – equally important
  – statistically independent (given the class value)
    I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
 Independence assumption is never correct!
 But … this scheme works well in practice
25

Probabilities for weather data

 Outlook    Temp  Humidity  Windy  Play
 Sunny      Hot   High      False  No
 Sunny      Hot   High      True   No
 Overcast   Hot   High      False  Yes
 Rainy      Mild  High      False  Yes
 Rainy      Cool  Normal    False  Yes
 Rainy      Cool  Normal    True   No
 Overcast   Cool  Normal    True   Yes
 Sunny      Mild  High      False  No
 Sunny      Cool  Normal    False  Yes
 Rainy      Mild  Normal    False  Yes
 Sunny      Mild  Normal    True   Yes
 Overcast   Mild  High      True   Yes
 Overcast   Hot   Normal    False  Yes
 Rainy      Mild  High      True   No
26

Probabilities for weather data

 Outlook    Yes   No         Temperature  Yes   No
 Sunny      2     3          Hot          2     2
 Overcast   4     0          Mild         4     2
 Rainy      3     2          Cool         3     1
 Sunny      2/9   3/5        Hot          2/9   2/5
 Overcast   4/9   0/5        Mild         4/9   2/5
 Rainy      3/9   2/5        Cool         3/9   1/5

 Humidity   Yes   No         Windy        Yes   No         Play   Yes    No
 High       3     4          False        6     2                 9      5
 Normal     6     1          True         3     3
 High       3/9   4/5        False        6/9   2/5               9/14   5/14
 Normal     6/9   1/5        True         3/9   3/5
27

Probabilities for weather data
 (counts and relative frequencies for each attribute–class combination as on the previous slide)

 A new day:
 Outlook  Temp.  Humidity  Windy  Play
 Sunny    Cool   High      True   ?

 Likelihood of the two classes:
 For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
 For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
 Conversion into a probability by normalization:
 P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
 P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
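The arithmetic on this slide can be reproduced with a few lines of Python; the hand-typed count table and variable names below are assumptions made for the example.

# Conditional counts from the weather data: value -> (count given yes, count given no).
counts = {
    "Outlook=Sunny":      (2, 3),
    "Temperature=Cool":   (3, 1),
    "Humidity=High":      (3, 4),
    "Windy=True":         (3, 3),
}
n_yes, n_no = 9, 5              # class counts
total = n_yes + n_no

# Likelihoods for the new day Sunny / Cool / High / True.
like_yes = n_yes / total
like_no = n_no / total
for yes_c, no_c in counts.values():
    like_yes *= yes_c / n_yes
    like_no *= no_c / n_no

# Normalize to turn the likelihoods into probabilities.
p_yes = like_yes / (like_yes + like_no)
print(round(like_yes, 4), round(like_no, 4), round(p_yes, 3))
# -> 0.0053 0.0206 0.205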
28

Bayes's rule
 Probability of event H given evidence E:

   Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]

 Prior probability of H: Pr[H]
  – Probability of the event before evidence is seen
 Posterior probability of H: Pr[H | E]
  – Probability of the event after evidence is seen

 Thomas Bayes
 Born: 1702 in London, England
 Died: 1761 in Tunbridge Wells, Kent, England
29

Naïve Bayes for classification
 Classification learning: what's the probability of the class given an instance?
  – Evidence E = instance
  – Event H = class value for instance
 Naïve assumption: evidence splits into parts (i.e. attributes) that are independent:

   Pr[H | E] = Pr[E1 | H] Pr[E2 | H] … Pr[En | H] Pr[H] / Pr[E]
30

Weather data example
 Evidence E (a new day):

 Outlook  Temp.  Humidity  Windy  Play
 Sunny    Cool   High      True   ?

 Probability of class "yes":

   Pr[yes | E] = Pr[Outlook = Sunny | yes]
               × Pr[Temperature = Cool | yes]
               × Pr[Humidity = High | yes]
               × Pr[Windy = True | yes]
               × Pr[yes] / Pr[E]
             = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
31

The "zero-frequency problem"
 What if an attribute value doesn't occur with every class value?
 (e.g. "Humidity = High" for class "yes")
  – The conditional probability will be zero:  Pr[Humidity = High | yes] = 0
  – The a posteriori probability will also be zero:  Pr[yes | E] = 0
    (No matter how likely the other values are!)
 Remedy: add 1 to the count for every attribute value–class combination (Laplace estimator)
 Result: probabilities will never be zero!
 (also: stabilizes probability estimates)
32

Modified probability estimates
 In some cases adding a constant different from 1 might be more appropriate
 Example: attribute outlook for class yes (with constant μ):

   Sunny:    (2 + μ/3) / (9 + μ)
   Overcast: (4 + μ/3) / (9 + μ)
   Rainy:    (3 + μ/3) / (9 + μ)

 Weights don't need to be equal (but they must sum to 1):

   Sunny:    (2 + μ p1) / (9 + μ)
   Overcast: (4 + μ p2) / (9 + μ)
   Rainy:    (3 + μ p3) / (9 + μ)
33

Missing values
 Training: the instance is not included in the frequency count for that attribute value–class combination
 Classification: the attribute is omitted from the calculation
 Example:

 Outlook  Temp.  Humidity  Windy  Play
 ?        Cool   High      True   ?

 Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
 Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
 P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
 P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
34

Numeric attributes
 Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
 The probability density function for the normal distribution is defined by two parameters:
  – Sample mean:  μ = (1/n) Σ_i x_i
  – Standard deviation:  σ = √( (1/(n−1)) Σ_i (x_i − μ)² )
  – The density function is:  f(x) = 1/(√(2π) σ) · e^( −(x−μ)² / (2σ²) )
35

Statistics for weather data

 Outlook    Yes   No         Windy   Yes   No         Play   Yes    No
 Sunny      2     3          False   6     2                 9      5
 Overcast   4     0          True    3     3
 Rainy      3     2
 Sunny      2/9   3/5        False   6/9   2/5               9/14   5/14
 Overcast   4/9   0/5        True    3/9   3/5
 Rainy      3/9   2/5

 Temperature   Yes: 64, 68, 69, 70, 72, …     No: 65, 71, 72, 80, 85, …
               μ = 73,  σ = 6.2               μ = 75,  σ = 7.9
 Humidity      Yes: 65, 70, 70, 75, 80, …     No: 70, 85, 90, 91, 95, …
               μ = 79,  σ = 10.2              μ = 86,  σ = 9.7

 Example density value:

   f(temperature = 66 | yes) = 1/(√(2π)·6.2) · e^( −(66−73)² / (2·6.2²) ) = 0.0340
36

Classifying a new day
 A new day:

 Outlook  Temp.  Humidity  Windy  Play
 Sunny    66     90        True   ?

 Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
 Likelihood of "no"  = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
 P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%
 P("no")  = 0.000136 / (0.000036 + 0.000136) = 79.1%

 Missing values during training are not included in the calculation of the mean and standard deviation
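A small Python sketch of the numeric-attribute case: the gaussian_density helper implements the density function from the earlier slide, and the likelihood lines reuse the density values quoted above (small differences from the slide's numbers come from rounding in those quoted densities).

import math

def gaussian_density(x, mu, sigma):
    # f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Reproduces the example density value from the statistics slide:
print(round(gaussian_density(66, mu=73, sigma=6.2), 4))   # 0.034

# Likelihoods for the new day (Sunny, temperature 66, humidity 90, windy True),
# plugging in the density values quoted on the slide:
like_yes = (2/9) * 0.0340 * 0.0221 * (3/9) * (9/14)   # ≈ 0.000036
like_no  = (3/5) * 0.0291 * 0.0380 * (3/5) * (5/14)   # ≈ 0.00014 (slide: 0.000136)
print(round(like_yes / (like_yes + like_no), 2))       # ≈ 0.20, close to the slide's 20.9%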
37

Probability densities
 Relationship between probability and density:

   Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε · f(c)

 But: this doesn't change the calculation of a posteriori probabilities because ε cancels out
 Exact relationship:

   Pr[a ≤ x ≤ b] = ∫ from a to b of f(t) dt
38

Naïve Bayes: discussion
 Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
 Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class
 However: adding too many redundant attributes will cause problems (e.g. identical attributes)
 Note also: many numeric attributes are not normally distributed (→ kernel density estimators)
39

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
40

Constructing decision trees
 Strategy: top down, in recursive divide-and-conquer fashion
  – First: select an attribute for the root node; create a branch for each possible attribute value
  – Then: split the instances into subsets, one for each branch extending from the node
  – Finally: repeat recursively for each branch, using only the instances that reach the branch
 Stop if all instances have the same class
41

Which attribute to select?
42

Criterion for attribute selection
 Which is the best attribute?
  – Want to get the smallest tree
  – Heuristic: choose the attribute that produces the "purest" nodes
 Popular impurity criterion: information gain
  – Information gain increases with the average purity of the subsets
 Strategy: choose the attribute that gives the greatest information gain
43

Computing information
 Measure information in bits
  – Given a probability distribution, the info required to predict an event is the distribution's entropy
  – Entropy gives the information required in bits (can involve fractions of bits!)
 Recall, the formula for entropy:

   H(p1, p2, …, pn) = −p1 log p1 − p2 log p2 − … − pn log pn
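A tiny Python helper makes the entropy formula concrete; the function name and the count-based interface are illustrative.

import math

def entropy(counts):
    # Entropy in bits of a class distribution given as raw counts,
    # e.g. entropy([9, 5]) for the weather data's 9 yes / 5 no split.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # 0.940 bits
print(round(entropy([2, 3]), 3))   # 0.971 bits
print(round(entropy([4, 0]), 3))   # 0.0 bits (pure node)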
44
Claude Shannon, who has died aged 84, perhaps more than anyone laid the groundwork for today’s digital revolution. His exposition of information theory, stating that all information could be represented mathematically as a succession of noughts and ones, facilitated the digital manipulation of data without which today’s information society would be unthinkable.
Shannon’s master’s thesis, obtained in 1940 at MIT, demonstrated that problem solving could be achieved by manipulating the symbols 0 and 1 in a process that could be carried out automatically with electrical circuitry. That dissertation has been hailed as one of the most significant master’s theses of the 20th century. Eight years later, Shannon published another landmark paper, A Mathematical Theory of Communication, generally taken as his most important scientific contribution.
Claude Shannon, "Father of information theory"
Born: 30 April 1916
Died: 23 February 2001
Shannon applied the same radical approach to cryptography research, in which he later became a consultant to the US government.
Many of Shannon’s pioneering insights were developed before they could be applied in practical form. He was truly a remarkable man, yet unknown to most of the world.
45

Example: attribute Outlook
 Outlook = Sunny:
   info([2,3]) = H(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits
 Outlook = Overcast:
   info([4,0]) = H(1, 0) = −1 log(1) − 0 log(0) = 0 bits
   (Note: 0 log(0) is normally undefined.)
 Outlook = Rainy:
   info([3,2]) = H(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
 Expected information for the attribute:
   info([2,3],[4,0],[3,2]) = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971 = 0.693 bits
46

Computing information gain
 Information gain = information before splitting − information after splitting

   gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2])
                 = 0.940 − 0.693
                 = 0.247 bits

 Information gain for the attributes from the weather data:
   gain(Outlook)     = 0.247 bits
   gain(Temperature) = 0.029 bits
   gain(Humidity)    = 0.152 bits
   gain(Windy)       = 0.048 bits
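The gain computation can be sketched directly from the formula above; the counts below are the yes/no splits for Outlook, and the helper names are assumptions of this sketch.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, split_counts):
    # Gain = entropy before the split minus the weighted average entropy
    # of the subsets produced by the split.
    n = sum(parent_counts)
    after = sum(sum(subset) / n * entropy(subset) for subset in split_counts)
    return entropy(parent_counts) - after

# Splitting the weather data on Outlook:
# Sunny [2 yes, 3 no], Overcast [4 yes, 0 no], Rainy [3 yes, 2 no].
print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247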
47

Continuing to split
   gain(Temperature) = 0.571 bits
   gain(Humidity)    = 0.971 bits
   gain(Windy)       = 0.020 bits
48

Final decision tree
 Note: not all leaves need to be pure; sometimes identical instances have different classes
 Splitting stops when the data can't be split any further
49

Wishlist for a purity measure
 Properties we require from a purity measure:
  – When a node is pure, the measure should be zero
  – When impurity is maximal (i.e. all classes equally likely), the measure should be maximal
  – The measure should obey the multistage property (i.e. decisions can be made in several stages):

      measure([2,3,4]) = measure([2,7]) + (7/9) · measure([3,4])

 Entropy is the only function that satisfies all three properties!
50

Properties of the entropy
 The multistage property:

   H(p, q, r) = H(p, q + r) + (q + r) · H( q/(q + r), r/(q + r) )

 Simplification of computation:

   info([2,3,4]) = −2/9 log(2/9) − 3/9 log(3/9) − 4/9 log(4/9)
                 = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9

 Note: instead of maximizing info gain we could just minimize information
51

Highly-branching attributes
 Problematic: attributes with a large number of values (extreme case: ID code)
 Subsets are more likely to be pure if there is a large number of values
  – Information gain is biased towards choosing attributes with a large number of values
  – This may result in overfitting (selection of an attribute that is non-optimal for prediction)
 Another problem: fragmentation
52

Weather data with ID code

 ID code  Outlook    Temp  Humidity  Windy  Play
 A        Sunny      Hot   High      False  No
 B        Sunny      Hot   High      True   No
 C        Overcast   Hot   High      False  Yes
 D        Rainy      Mild  High      False  Yes
 E        Rainy      Cool  Normal    False  Yes
 F        Rainy      Cool  Normal    True   No
 G        Overcast   Cool  Normal    True   Yes
 H        Sunny      Mild  High      False  No
 I        Sunny      Cool  Normal    False  Yes
 J        Rainy      Mild  Normal    False  Yes
 K        Sunny      Mild  Normal    True   Yes
 L        Overcast   Mild  High      True   Yes
 M        Overcast   Hot   Normal    False  Yes
 N        Rainy      Mild  High      True   No
53

Tree stump for ID code attribute
 Entropy of the split:

   info("ID code") = info([0,1]) + info([0,1]) + … + info([0,1]) = 0 bits

 Information gain is maximal for ID code (namely 0.940 bits)
54

Gain ratio
 Gain ratio: a modification of the information gain that reduces its bias
 Gain ratio takes the number and size of branches into account when choosing an attribute
  – It corrects the information gain by taking the intrinsic information of a split into account
 Intrinsic information: entropy of the distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)
55

Computing the gain ratio
 Example: intrinsic information for ID code:

   info([1,1,…,1]) = 14 × (−1/14 × log(1/14)) = 3.807 bits

 The value of an attribute decreases as the intrinsic information gets larger
 Definition of gain ratio:

   gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")

 Example:

   gain_ratio("ID code") = 0.940 bits / 3.807 bits = 0.246
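A short Python sketch of the gain ratio, reusing the same entropy helper; the function names are illustrative and the printed values are rounded.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, split_counts):
    # Gain ratio = information gain / intrinsic information of the split,
    # where the intrinsic information is the entropy of the branch sizes.
    n = sum(parent_counts)
    after = sum(sum(s) / n * entropy(s) for s in split_counts)
    gain = entropy(parent_counts) - after
    intrinsic = entropy([sum(s) for s in split_counts])
    return gain / intrinsic

# Outlook split on the weather data: branch sizes 5, 4, 5.
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))      # 0.156
# ID code split: 14 branches of one instance each.
print(round(gain_ratio([9, 5], [[1, 0]] * 9 + [[0, 1]] * 5), 3))   # ≈ 0.247 (slide rounds to 0.246)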
56

Gain ratios for weather data

 Outlook                                  Temperature
 Info:        0.693                       Info:        0.911
 Gain:        0.940 − 0.693 = 0.247       Gain:        0.940 − 0.911 = 0.029
 Split info:  info([5,4,5]) = 1.577       Split info:  info([4,6,4]) = 1.362
 Gain ratio:  0.247 / 1.577 = 0.156       Gain ratio:  0.029 / 1.362 = 0.021

 Humidity                                 Windy
 Info:        0.788                       Info:        0.892
 Gain:        0.940 − 0.788 = 0.152       Gain:        0.940 − 0.892 = 0.048
 Split info:  info([7,7]) = 1.000         Split info:  info([8,6]) = 0.985
 Gain ratio:  0.152 / 1.000 = 0.152       Gain ratio:  0.048 / 0.985 = 0.049
57

More on the gain ratio
 "Outlook" still comes out top
 However: "ID code" has a greater gain ratio
  – Standard fix: ad hoc test to prevent splitting on that type of attribute
 Problem with gain ratio: it may overcompensate
  – May choose an attribute just because its intrinsic information is very low
  – Standard fix: only consider attributes with greater than average information gain
58

Discussion
 Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan
  – Gain ratio is just one modification of this basic algorithm
 C4.5: deals with numeric attributes, missing values, noisy data
 Similar approach: CART
 There are many other attribute selection criteria!
 (But little difference in accuracy of result)
59

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
60

Covering algorithms
 Convert a decision tree into a rule set
  – Straightforward, but the rule set is overly complex
  – More effective conversions are not trivial
 Instead, can generate a rule set directly
  – For each class in turn, find a rule set that covers all instances in it (excluding instances not in the class)
 Called a covering approach:
  – at each stage a rule is identified that "covers" some of the instances
61

Example: generating a rule
 [Figure: scatter plot of instances of classes a and b in the x–y plane, covered in stages by the rules below, with splits at x = 1.2 and y = 2.6]

 If true then class = a
 If x > 1.2 then class = a
 If x > 1.2 and y > 2.6 then class = a

 Possible rule set for class "b":
 If x ≤ 1.2 then class = b
 If x > 1.2 and y ≤ 2.6 then class = b

 Could add more rules, get a "perfect" rule set
62

Rules vs. trees
 Corresponding decision tree: (produces exactly the same predictions)
 But: rule sets can be more perspicuous when decision trees suffer from replicated subtrees
 Also: in multiclass situations, a covering algorithm concentrates on one class at a time whereas a decision tree learner takes all classes into account
63

Simple covering algorithm
 Generates a rule by adding tests that maximize the rule's accuracy
 Similar to the situation in decision trees: the problem of selecting an attribute to split on
  – But: a decision tree inducer maximizes overall purity
 Each new test reduces the rule's coverage
 [Figure: space of examples, the rule so far, and the rule after adding a new term]
64

Selecting a test
 Goal: maximize accuracy
  – t: total number of instances covered by the rule
  – p: positive examples of the class covered by the rule
  – t − p: number of errors made by the rule
 Select the test that maximizes the ratio p/t
 We are finished when p/t = 1 or the set of instances can't be split any further
65

Rules vs. decision lists
 PRISM with the outer loop removed generates a decision list for one class
  – Subsequent rules are designed for instances that are not covered by previous rules
  – But: order doesn't matter because all rules predict the same class
 Outer loop considers all classes separately
  – No order dependence implied
 Problems: overlapping rules, default rule required
66

Pseudo-code for PRISM

 For each class C
   Initialize E to the instance set
   While E contains instances in class C
     Create a rule R with an empty left-hand side that predicts class C
     Until R is perfect (or there are no more attributes to use) do
       For each attribute A not mentioned in R, and each value v,
         Consider adding the condition A = v to the left-hand side of R
       Select A and v to maximize the accuracy p/t
       (break ties by choosing the condition with the largest p)
       Add A = v to R
     Remove the instances covered by R from E
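A compact Python sketch that follows the PRISM pseudo-code for nominal attributes; the dictionary-based data representation and the prism function name are assumptions of this sketch, and details such as tie handling are simplified.

def prism(instances, attributes, class_attr, target_class):
    # Grow one "perfect" rule at a time for the target class, then remove
    # the instances it covers and repeat (separate-and-conquer).
    rules, remaining = [], list(instances)
    covers = lambda rule, x: all(x[a] == v for a, v in rule)
    while any(x[class_attr] == target_class for x in remaining):
        rule, covered = [], remaining
        # Add conditions until the rule covers only the target class
        # (or every attribute has been used).
        while any(x[class_attr] != target_class for x in covered) and \
              len(rule) < len(attributes):
            best = None
            for a in attributes:
                if any(a == ra for ra, _ in rule):
                    continue
                for v in {x[a] for x in covered}:
                    subset = [x for x in covered if x[a] == v]
                    p = sum(x[class_attr] == target_class for x in subset)
                    t = len(subset)
                    # Maximize p/t, break ties on the larger p.
                    if best is None or (p / t, p) > best[0]:
                        best = ((p / t, p), (a, v), subset)
            rule.append(best[1])
            covered = best[2]
        rules.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules   # list of rules, each a list of (attribute, value) conditions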
67

Separate and conquer
 Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
  – First, identify a useful rule
  – Then, separate out all the instances it covers
  – Finally, "conquer" the remaining instances
 Difference to divide-and-conquer methods:
  – The subset covered by a rule doesn't need to be explored any further
68

Algorithms: The basic methods
 Simplicity first: 1R
 Use all attributes: Naïve Bayes
 Decision trees: ID3
 Covering algorithms: decision rules: PRISM
 Association rules
 Linear models
 Instance-based learning
69

Association rules
 Association rules…
  – … can predict any attribute and combinations of attributes
  – … are not intended to be used together as a set
 Problem: immense number of possible associations
  – Output needs to be restricted to show only the most predictive associations → only those with high support and high confidence
70

Support and confidence of a rule

 Outlook    Temp  Humidity  Windy  Play
 Sunny      Hot   High      False  No
 Sunny      Hot   High      True   No
 Overcast   Hot   High      False  Yes
 Rainy      Mild  High      False  Yes
 Rainy      Cool  Normal    False  Yes
 Rainy      Cool  Normal    True   No
 Overcast   Cool  Normal    True   Yes
 Sunny      Mild  High      False  No
 Sunny      Cool  Normal    False  Yes
 Rainy      Mild  Normal    False  Yes
 Sunny      Mild  Normal    True   Yes
 Overcast   Mild  High      True   Yes
 Overcast   Hot   Normal    False  Yes
 Rainy      Mild  High      True   No
71

Support and confidence of a rule
 Support: number of instances predicted correctly
 Confidence: number of correct predictions, as a proportion of all instances the rule applies to
 Example: 4 cool days with normal humidity

   If temperature = cool then humidity = normal

   Support = 4, confidence = 100%

 Normally: minimum support and confidence are pre-specified (e.g. 58 rules with support ≥ 2 and confidence ≥ 95% for the weather data)
72

Interpreting association rules
 Interpretation is not obvious:

   If windy = false and play = no then outlook = sunny and humidity = high

 is not the same as

   If windy = false and play = no then outlook = sunny
   If windy = false and play = no then humidity = high

 However, it means that the following also holds:

   If humidity = high and windy = false and play = no then outlook = sunny
73

Mining association rules
 Naïve method for finding association rules:
  – Use the separate-and-conquer method
  – Treat every possible combination of attribute values as a separate class
 Two problems:
  – Computational complexity
  – Resulting number of rules (which would have to be pruned on the basis of support and confidence)
 But: we can look for high-support rules directly!
74

Item sets
 Support: number of instances correctly covered by an association rule
  – The same as the number of instances covered by all tests in the rule (LHS and RHS!)
 Item: one test / attribute–value pair
 Item set: all items occurring in a rule
 Goal: only rules that exceed pre-defined support
  – Do it by finding all item sets with the given minimum support and generating rules from them!
75

Item Sets For Weather Data

 Outlook    Temp  Humidity  Windy  Play
 Sunny      Hot   High      False  No
 Sunny      Hot   High      True   No
 Overcast   Hot   High      False  Yes
 Rainy      Mild  High      False  Yes
 Rainy      Cool  Normal    False  Yes
 Rainy      Cool  Normal    True   No
 Overcast   Cool  Normal    True   Yes
 Sunny      Mild  High      False  No
 Sunny      Cool  Normal    False  Yes
 Rainy      Mild  Normal    False  Yes
 Sunny      Mild  Normal    True   Yes
 Overcast   Mild  High      True   Yes
 Overcast   Hot   Normal    False  Yes
 Rainy      Mild  High      True   No
76

Item sets for weather data
 Examples (support counts in parentheses):
  – One-item sets:   Outlook = Sunny (5);  Temperature = Cool (4); …
  – Two-item sets:   Outlook = Sunny, Temperature = Hot (2);  Outlook = Sunny, Humidity = High (3); …
  – Three-item sets: Outlook = Sunny, Temperature = Hot, Humidity = High (2);  Outlook = Sunny, Humidity = High, Windy = False (2); …
  – Four-item sets:  Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2);  Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2); …
 In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets (with minimum support of two)
77

Generating rules from an item set
 Once all item sets with minimum support have been generated, we can turn them into rules
 Example item set:

   Humidity = Normal, Windy = False, Play = Yes (4)

 Seven (2^N − 1) potential rules, with their accuracies:

   If Humidity = Normal and Windy = False then Play = Yes              4/4
   If Humidity = Normal and Play = Yes then Windy = False              4/6
   If Windy = False and Play = Yes then Humidity = Normal              4/6
   If Humidity = Normal then Windy = False and Play = Yes              4/7
   If Windy = False then Humidity = Normal and Play = Yes              4/8
   If Play = Yes then Humidity = Normal and Windy = False              4/9
   If True then Humidity = Normal and Windy = False and Play = Yes     4/12
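The enumeration of the 2^N − 1 rules can be sketched as follows; the function name and the dictionary representation of item sets and instances are illustrative assumptions. Confidence is computed as the item set's support divided by the antecedent's support, which yields the 4/4, 4/6, …, 4/12 fractions above when run on the full weather data.

from itertools import combinations

def rules_from_itemset(itemset, instances):
    # Enumerate the 2^N - 1 candidate rules for a frequent item set and report
    # each rule's confidence = support(item set) / support(antecedent).
    items = list(itemset.items())
    support = sum(all(x.get(a) == v for a, v in items) for x in instances)
    rules = []
    for r in range(len(items)):                 # antecedent size: 0 .. N-1
        for antecedent in combinations(items, r):
            ante_support = sum(all(x.get(a) == v for a, v in antecedent)
                               for x in instances)
            consequent = [it for it in items if it not in antecedent]
            rules.append((dict(antecedent), dict(consequent),
                          support, support / ante_support))
    return rules   # (antecedent, consequent, support, confidence) tuples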
78

Rules for weather data
 Rules with support > 1 and confidence = 100%:

   #    Association rule                                          Sup.  Conf.
   1    Humidity = Normal, Windy = False  ⇒  Play = Yes           4     100%
   2    Temperature = Cool  ⇒  Humidity = Normal                  4     100%
   3    Outlook = Overcast  ⇒  Play = Yes                         4     100%
   4    Temperature = Cool, Play = Yes  ⇒  Humidity = Normal      3     100%
   …    …                                                         …     …
   58   Outlook = Sunny, Temperature = Hot  ⇒  Humidity = High    2     100%

 In total: 3 rules with support four, 5 with support three, 50 with support two
79

Example rules from the same set
 Item set:

   Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)

 Resulting rules (all with 100% confidence):

   Temperature = Cool, Windy = False  ⇒  Humidity = Normal, Play = Yes
   Temperature = Cool, Windy = False, Humidity = Normal  ⇒  Play = Yes
   Temperature = Cool, Windy = False, Play = Yes  ⇒  Humidity = Normal

 due to the following "frequent" item sets:

   Temperature = Cool, Windy = False (2)
   Temperature = Cool, Humidity = Normal, Windy = False (2)
   Temperature = Cool, Windy = False, Play = Yes (2)
80

Generating item sets efficiently
 How can we efficiently find all frequent item sets?
 Finding one-item sets is easy
 Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …
  – If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  – In general: if X is a frequent k-item set, then all (k−1)-item subsets of X are also frequent
 Compute k-item sets by merging (k−1)-item sets
81

Example
 Given: five three-item sets

   (A B C), (A B D), (A C D), (A C E), (B C D)

 Lexicographically ordered!
 Candidate four-item sets:

   (A B C D)   OK because of (B C D)
   (A C D E)   Not OK because of (C D E)

 Final check by counting instances in the dataset!
 (k−1)-item sets are stored in a hash table
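A minimal sketch of this candidate-generation step, assuming item sets are kept as lexicographically sorted tuples; the function and variable names are illustrative.

def generate_candidates(frequent_sets):
    # Merge two frequent (k-1)-item sets that share their first k-2 items,
    # then prune any candidate that has an infrequent (k-1)-item subset.
    frequent = set(frequent_sets)
    candidates = []
    for i, a in enumerate(frequent_sets):
        for b in frequent_sets[i + 1:]:
            if a[:-1] == b[:-1]:                       # share all but the last item
                cand = tuple(sorted(set(a) | set(b)))
                subsets = [cand[:j] + cand[j + 1:] for j in range(len(cand))]
                if all(s in frequent for s in subsets):
                    candidates.append(cand)
    return candidates

three_sets = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
              ("A", "C", "E"), ("B", "C", "D")]
print(generate_candidates(three_sets))
# [('A', 'B', 'C', 'D')]  -- (A C D E) is pruned because (C D E) is not frequent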
82

Generating rules efficiently
 We are looking for all high-confidence rules
  – Support of the antecedent is obtained from the hash table
  – But: the brute-force method checks (2^N − 1) candidate rules per item set
 Better way: building (c + 1)-consequent rules from c-consequent ones
  – Observation: a (c + 1)-consequent rule can only hold if all corresponding c-consequent rules also hold
 Resulting algorithm is similar to the procedure for large item sets
83

Example
 1-consequent rules:

   If Outlook = Sunny and Windy = False and Play = No then Humidity = High (2/2)
   If Humidity = High and Windy = False and Play = No then Outlook = Sunny (2/2)

 Corresponding 2-consequent rule:

   If Windy = False and Play = No then Outlook = Sunny and Humidity = High (2/2)

 Final check of the antecedent against the hash table!
84

Association rules: discussion
 The above method makes one pass through the data for each different size of item set
  – Other possibility: generate (k+2)-item sets just after (k+1)-item sets have been generated
  – Result: more (k+2)-item sets than necessary will be considered, but fewer passes through the data
  – Makes sense if the data is too large for main memory
 Practical issue: generating a certain number of rules (e.g. by incrementally reducing min. support)
8585
Other issues
Standard ARFF format very inefficient for typical market basket data
– Attributes represent items in a basket and most items are usually missing
– Need a way of representing sparse data
Instances are also called transactions
Confidence is not necessarily the best measure
– Example: milk occurs in almost every supermarket transaction
– Other measures have been devised (e.g. lift)
86

Algorithms: The basic methods
Simplicity first: 1R
Use all attributes: Naïve Bayes
Decision trees: ID3
Covering algorithms: decision rules: PRISM
Association rules
Linear models
Instance-based learning
87
Linear models
Work most naturally with numeric attributes
Standard technique for numeric prediction: linear regression
– Outcome is a linear combination of attributes:
  x = w_0 + w_1 a_1 + w_2 a_2 + … + w_k a_k
– Weights are calculated from the training data
Predicted value for the first training instance a^(1) (with a_0 = 1):
  w_0 a_0^(1) + w_1 a_1^(1) + w_2 a_2^(1) + … + w_k a_k^(1) = Σ_{j=0..k} w_j a_j^(1)
88
Minimizing the squared error
Choose k + 1 coefficients to minimize the squared error on the training data
Squared error:
  Σ_{i=1..n} ( x^(i) − Σ_{j=0..k} w_j a_j^(i) )^2
Derive coefficients using standard matrix operations
Can be done if there are more instances than attributes (roughly speaking)
Minimizing the absolute error is more difficult
89
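As a rough illustration of this least-squares step (NumPy assumed; the function names are made up for the sketch), the weights can be obtained with a standard matrix routine:

```python
import numpy as np

# Sketch: fit w0..wk by minimizing the squared error with a standard matrix routine.
def fit_linear(A, x):
    A1 = np.column_stack([np.ones(len(A)), np.asarray(A, dtype=float)])  # a0 = 1 column for w0
    w, *_ = np.linalg.lstsq(A1, np.asarray(x, dtype=float), rcond=None)  # min_w ||A1 w - x||^2
    return w                                                             # w[0] is the intercept

def predict(w, a):
    return w[0] + float(np.dot(w[1:], a))
```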
Classification
Any regression technique can be used for classification
– Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
– Prediction: predict the class corresponding to the model with the largest output value (membership value)
For linear regression this is known as multi-response linear regression
91
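A small sketch of multi-response linear regression along these lines, assuming NumPy (the function names and data layout are illustrative):

```python
import numpy as np

# Sketch: one least-squares regression per class with a 0/1 membership target;
# predict the class whose model produces the largest output.
def fit_multiresponse(A, labels, classes):
    A1 = np.column_stack([np.ones(len(A)), np.asarray(A, dtype=float)])
    labels = np.asarray(labels)
    return {c: np.linalg.lstsq(A1, (labels == c).astype(float), rcond=None)[0]
            for c in classes}

def predict_class(models, a):
    a1 = np.concatenate([[1.0], np.asarray(a, dtype=float)])
    return max(models, key=lambda c: float(models[c] @ a1))   # largest membership value wins
```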
Pairwise regression
Another way of using regression for classification:
– A regression function for every pair of classes, using only instances from these two classes
– Assign an output of +1 to one member of the pair, −1 to the other
Prediction is done by voting
– Class that receives the most votes is predicted
– Alternative: "don't know" if there is no agreement
More likely to be accurate but more expensive
92
Logistic regression
Problem: some assumptions are violated when linear regression is applied to classification problems
Logistic regression: alternative to linear regression
– Designed for classification problems
– Tries to estimate class probabilities directly
  Does this using the maximum likelihood method
– Uses this linear model (P = class probability):
  log( P / (1 − P) ) = w_0 a_0 + w_1 a_1 + w_2 a_2 + … + w_k a_k
93
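One way to sketch this maximum-likelihood fit in plain NumPy is simple gradient ascent on the log-likelihood; this is only an illustration of the idea, not the optimizer any particular package uses:

```python
import numpy as np

# Sketch: fit log(P/(1-P)) = w0*a0 + w1*a1 + ... + wk*ak by gradient ascent
# on the average log-likelihood.
def fit_logistic(A, y, lr=0.1, n_iter=2000):
    X = np.column_stack([np.ones(len(A)), np.asarray(A, dtype=float)])  # a0 = 1
    y = np.asarray(y, dtype=float)                                      # 1 = class of interest
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-X @ w))      # current class-probability estimates
        w += lr * X.T @ (y - P) / len(y)      # gradient of the average log-likelihood
    return w
```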
Discussion of linear models
Not appropriate if data exhibits non-linear dependencies
But: can serve as building blocks for more complex schemes (e.g. model trees)
Example: multi-response linear regression defines a hyperplane for any two given classes:
  (w_0^(1) − w_0^(2)) a_0 + (w_1^(1) − w_1^(2)) a_1 + (w_2^(1) − w_2^(2)) a_2 + … + (w_k^(1) − w_k^(2)) a_k = 0
94

Algorithms: The basic methods
Simplicity first: 1R
Use all attributes: Naïve Bayes
Decision trees: ID3
Covering algorithms: decision rules: PRISM
Association rules
Linear models
Instance-based learning
95
Instance-based representation
Simplest form of learning: rote learning
– Training instances are searched for the instance that most closely resembles the new instance
– The instances themselves represent the knowledge
– Also called instance-based learning
Similarity function defines what's "learned"
Instance-based learning is lazy learning
Methods:
– nearest-neighbor
– k-nearest-neighbor
– …
96

The distance function
Simplest case: one numeric attribute
– Distance is the difference between the two attribute values involved (or a function thereof)
Several numeric attributes: normally, Euclidean distance is used and attributes are normalized
Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
Are all attributes equally important?
– Weighting the attributes might be necessary
97
Instance-based learning
Distance function defines what's learned
Most instance-based schemes use Euclidean distance:
  sqrt( (a_1^(1) − a_1^(2))^2 + (a_2^(1) − a_2^(2))^2 + … + (a_k^(1) − a_k^(2))^2 )
  where a^(1) and a^(2) are two instances with k attributes
Taking the square root is not required when comparing distances
Other popular metric: city-block metric
– Adds differences without squaring them
98
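A minimal 1-NN sketch along these lines (NumPy assumed; the helper name is illustrative), using squared Euclidean distance since the square root does not change the ranking:

```python
import numpy as np

# Sketch: 1-nearest-neighbor with (squared) Euclidean distance.
def nearest_neighbor(train_X, train_y, query):
    train_X = np.asarray(train_X, dtype=float)
    d2 = ((train_X - np.asarray(query, dtype=float)) ** 2).sum(axis=1)  # squared distances
    return train_y[int(np.argmin(d2))]                                  # class of the closest instance
```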
Normalization and other issues
Different attributes are measured on different scales → need to be normalized:
  a_i = ( v_i − min v_i ) / ( max v_i − min v_i )
  where v_i is the actual value of attribute i
Nominal attributes: distance either 0 or 1
Common policy for missing values: assumed to be maximally distant (given normalized attributes)
99
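A small sketch of this min–max normalization (NumPy assumed; the guard for constant attributes is an added assumption of the sketch):

```python
import numpy as np

# Sketch: rescale each numeric attribute to [0, 1] via (v - min) / (max - min).
def normalize(X):
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # guard against constant attributes
```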
Discussion of 1-NN
Often very accurate … but slow:
– simple version scans entire training data to derive a prediction
Assumes all attributes are equally important
– Remedy: attribute selection or weights
Possible remedies against noisy instances:
– Take a majority vote over the k nearest neighbors
– Removing noisy instances from dataset (difficult!)
Statisticians have used k-NN since the early 1950s
– If n → ∞ and k/n → 0, the error approaches the minimum
100

Comments on basic methods
Bayes' rule stems from his "Essay towards solving a problem in the doctrine of chances" (1763)
– Difficult bit: estimating prior probabilities
Extension of Naïve Bayes: Bayesian Networks
Algorithm for association rules is called APRIORI
Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. can't learn XOR
– But: combinations of them can (→ Neural Nets)
101
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
102

Evaluation: the key to success
How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data
– Otherwise 1-NN would be the optimum classifier!
Simple solution that can be used if lots of (labeled) data is available:
– Split data into training and test set
However: (labeled) data is usually limited
– More sophisticated techniques need to be used
103
Issues in evaluation
Statistical reliability of estimated differences in performance (→ significance tests)
Choice of performance measure:
– Number of correct classifications
– Accuracy of probability estimates
– Error in numeric predictions
Costs assigned to different types of errors
– Many practical applications involve costs
104

Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
105
Training and testing I
Natural performance measure for classification problems: error rate
– Success: instance's class is predicted correctly
– Error: instance's class is predicted incorrectly
– Error rate: proportion of errors made over the whole set of instances
Resubstitution error: error rate obtained from training data
Resubstitution error is (hopelessly) optimistic!
106

Training and testing II
Test set: independent instances that have played no part in formation of classifier
– Assumption: both training data and test data are representative samples of the underlying problem
Test and training data may differ in nature
– Example: classifiers built using subject data with two different diagnoses A and B
  To estimate the performance of a classifier built from diagnosis-A data on subjects with diagnosis B, test it on data for subjects diagnosed with B
107
Note on parameter tuning
It is important that the test data is not used in any way to create the classifier
Some learning schemes operate in two stages:
– Stage 1: build the basic structure
– Stage 2: optimize parameter settings
The test data can't be used for parameter tuning!
Proper procedure uses three sets: training data, validation data, and test data
– Validation data is used to optimize parameters
108

Making the most of the data
Once evaluation is complete, all the data can be used to build the final classifier
Generally, the larger the training data the better the classifier (but returns diminish)
The larger the test data the more accurate the error estimate
Holdout procedure: method of splitting original data into training and test set
– Dilemma: ideally both training set and test set should be large!
109
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
110

Predicting performance
Assume the estimated error rate is 25%. How close is this to the true error rate?
– Depends on the amount of test data
Prediction is just like tossing a (biased!) coin
– "Head" is a "success", "tail" is an "error"
In statistics, a succession of independent events like this is called a Bernoulli process
– Statistical theory provides us with confidence intervals for the true underlying proportion
111
Confidence intervals
We can say: p lies within a certain specified interval with a certain specified confidence
Example: S = 750 successes in N = 1000 trials
– Estimated success rate: 75%
– How close is this to the true success rate p?
– Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
Another example: S = 75 and N = 100
– Estimated success rate: 75%
– With 80% confidence, p ∈ [69.1%, 80.1%]
112

Mean and variance
Mean and variance for a Bernoulli trial: p, p(1 − p)
Expected success rate f = S/N
Mean and variance for f: p, p(1 − p)/N
For large enough N, f follows a Normal distribution
c% confidence interval [−z ≤ X ≤ z] for a random variable with 0 mean is given by:
  Pr[−z ≤ X ≤ z] = c
With a symmetric distribution:
  Pr[−z ≤ X ≤ z] = 1 − 2 · Pr[X ≥ z]
113
Confidence limits
Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X ≥ z]   z
  0.1%        3.09
  0.5%        2.58
  1%          2.33
  5%          1.65
  10%         1.28
  20%         0.84
  40%         0.25

Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
(figure: standard normal density with ±1.65 marked)
To use this we have to reduce our random variable f to have 0 mean and unit variance
114
Transforming f
Transformed value for f:
  ( f − p ) / sqrt( p(1 − p)/N )
(i.e. subtract the mean and divide by the standard deviation)
Resulting equation:
  Pr[ −z ≤ ( f − p ) / sqrt( p(1 − p)/N ) ≤ z ] = c
Solving for p:
  p = ( f + z^2/(2N) ± z · sqrt( f/N − f^2/N + z^2/(4N^2) ) ) / ( 1 + z^2/N )
115
Examples
f = 75%, N = 1000, c = 80% (so that z = 1.28):
  p ∈ [0.732, 0.767]
f = 75%, N = 100, c = 80% (so that z = 1.28):
  p ∈ [0.691, 0.801]
Note that the normal distribution assumption is only valid for large N (i.e. N > 100)
f = 75%, N = 10, c = 80% (so that z = 1.28):
  p ∈ [0.549, 0.881]
(should be taken with a grain of salt)
116
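These intervals can be reproduced with a short sketch of the formula from the previous slide (plain Python; the function name is illustrative):

```python
import math

# Sketch of p = ( f + z^2/2N ± z*sqrt(f/N - f^2/N + z^2/4N^2) ) / (1 + z^2/N)
def confidence_interval(f, N, z):
    center = f + z * z / (2 * N)
    spread = z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (center - spread) / denom, (center + spread) / denom

print(confidence_interval(0.75, 1000, 1.28))   # ~(0.732, 0.767)
print(confidence_interval(0.75, 100, 1.28))    # ~(0.691, 0.801)
print(confidence_interval(0.75, 10, 1.28))     # ~(0.549, 0.881)
```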
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
117

Holdout estimation
What to do if the amount of data is limited?
The holdout method reserves a certain amount for testing and uses the remainder for training
– Usually: one third for testing, the rest for training
Problem: the samples might not be representative
– Example: a class might be missing in the test data
Advanced version uses stratification
– Ensures that each class is represented with approximately equal proportions in both subsets
118
Repeated holdout method
Holdout estimate can be made more reliable by repeating the process with different subsamples
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
– The error rates on the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method
Still not optimum: the different test sets overlap
– Can we prevent overlapping?
119
Cross-validation
Cross-validation avoids overlapping test sets
– First step: split data into k subsets of equal size
– Second step: use each subset in turn for testing, the remainder for training
Called k-fold cross-validation
Often the subsets are stratified before the cross-validation is performed
The error estimates are averaged to yield an overall error estimate
120
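A bare-bones sketch of k-fold cross-validation (plain Python, without stratification; `train_and_test` is an assumed callback that trains on one list of instances and returns the error rate on another):

```python
import random

# Sketch: split the data into k folds, test on each fold in turn, train on the rest.
def cross_validation_error(data, train_and_test, k=10, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        errors.append(train_and_test(train, folds[i]))   # error rate on the held-out fold
    return sum(errors) / k                               # averaged overall estimate
```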
More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation
Why ten?
– Extensive experiments have shown that this is the best choice to get an accurate estimate
– There is also some theoretical evidence for this
Stratification reduces the estimate's variance
Even better: repeated stratified cross-validation
– E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
121

Leave-One-Out cross-validation
Leave-One-Out: a particular form of cross-validation:
– Set number of folds to number of training instances
– I.e., for n training instances, build classifier n times
Makes best use of the data
Involves no random subsampling
Very computationally expensive
– (exception: NN)
122
Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible
– It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: random dataset split equally into two classes
– Best inducer predicts majority class
– 50% accuracy on fresh data
– Leave-One-Out-CV estimate is 100% error!
123

The bootstrap
CV uses sampling without replacement
– The same instance, once selected, can not be selected again for a particular training/test set
The bootstrap uses sampling with replacement to form the training set
– Sample a dataset of n instances n times with replacement to form a new dataset of n instances
– Use this data as the training set
– Use the instances from the original dataset that don't occur in the new training set for testing
124
The 0.632 bootstrap
Also called the 0.632 bootstrap
– A particular instance has a probability of 1 − 1/n of not being picked on any one draw
– Thus its probability of ending up in the test data is:
  ( 1 − 1/n )^n ≈ e^(−1) ≈ 0.368
– This means the training data will contain approximately 63.2% of the instances
125
Estimating error with the bootstrap
The error estimate on the test data will be very pessimistic
– Trained on just ~63% of the instances
Therefore, combine it with the resubstitution error:
  err = 0.632 · e_(test instances) + 0.368 · e_(training instances)
The resubstitution error gets less weight than the error on the test data
Repeat process several times with different replacement samples; average the results
126
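A sketch of one bootstrap repetition combining the two error estimates as above (plain Python; `train_and_test` is again an assumed callback returning an error rate):

```python
import random

# Sketch: one 0.632-bootstrap repetition.
def bootstrap_error(data, train_and_test, seed=0):
    rng = random.Random(seed)
    n = len(data)
    idx = [rng.randrange(n) for _ in range(n)]             # sample n indices with replacement
    chosen = set(idx)
    train = [data[i] for i in idx]
    test = [data[i] for i in range(n) if i not in chosen]  # instances never picked (~36.8%)
    e_test = train_and_test(train, test)
    e_train = train_and_test(train, train)                 # resubstitution error
    return 0.632 * e_test + 0.368 * e_train
```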
More on the bootstrap
Probably the best way of estimating performance for very small datasets
However, it has some problems
– Consider the random dataset from above
– A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data
– Bootstrap estimate for this classifier:
  err = 0.632 · 50% + 0.368 · 0% = 31.6%
– True expected error: 50%
127

Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
128
Comparing data mining schemes
Frequent question: which of two learning schemes performs better?
Note: this is domain dependent!
Obvious way: compare 10-fold CV estimates
Problem: variance in estimate
Variance can be reduced using repeated CV
However, we still don't know whether the results are reliable
129

Significance tests
Significance tests tell us how confident we can be that there really is a difference
Null hypothesis: there is no "real" difference
Alternative hypothesis: there is a difference
A significance test measures how much evidence there is in favor of rejecting the null hypothesis
Let's say we are using 10-fold CV
Question: do the two means of the 10 CV estimates differ significantly?
130
Paired t-test
Student's t-test tells whether the means of two samples are significantly different
Take individual samples using cross-validation
Use a paired t-test because the individual samples are paired
– The same CV is applied twice
William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England
Obtained a post as a chemist in the Guinness brewery in Dublin in 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".
131
Student's distribution
With small samples (k < 100) the mean follows Student's distribution with k − 1 degrees of freedom
Confidence limits:

  9 degrees of freedom          normal distribution
  Pr[X ≥ z]    z                Pr[X ≥ z]    z
  0.1%         4.30             0.1%         3.09
  0.5%         3.25             0.5%         2.58
  1%           2.82             1%           2.33
  5%           1.83             5%           1.65
  10%          1.38             10%          1.28
  20%          0.88             20%          0.84
132
Distribution of the means
x_1, x_2, …, x_k and y_1, y_2, …, y_k are the 2k samples for a k-fold CV
m_x and m_y are the means
With enough samples, the mean of a set of independent samples is normally distributed
Estimated variances of the means are σ_x^2/k and σ_y^2/k
If μ_x and μ_y are the true means, then
  ( m_x − μ_x ) / sqrt( σ_x^2/k )   and   ( m_y − μ_y ) / sqrt( σ_y^2/k )
are approximately normally distributed with mean 0, variance 1
133
Distribution of the differences
Let m_d = m_x − m_y
The difference of the means (m_d) also has a Student's distribution with k − 1 degrees of freedom
Let σ_d^2 be the variance of the difference
The standardized version of m_d is called the t-statistic:
  t = m_d / sqrt( σ_d^2 / k )
We use t to perform the t-test
134
Performing the test
• Fix a significance level α
• If a difference is significant at the α% level, there is a (100 − α)% chance that there really is a difference
• Divide the significance level by two because the test is two-tailed
  • I.e. the true difference can be +ve or −ve
• Look up the value for z that corresponds to α/2
• If t ≤ −z or t ≥ z then the difference is significant
  • I.e. the null hypothesis can be rejected
135
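A compact sketch of the paired t-test on k paired cross-validation estimates (plain Python; the critical value quoted in the comment is the usual two-tailed 5% value for 9 degrees of freedom):

```python
import math

# Sketch: paired t-test on k paired cross-validation estimates x and y.
def paired_t(x, y):
    k = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]
    m_d = sum(d) / k
    var_d = sum((di - m_d) ** 2 for di in d) / (k - 1)   # sample variance of the differences
    return m_d / math.sqrt(var_d / k)                    # compare with Student's table, k-1 d.o.f.

# With k = 10 folds and a two-tailed 5% significance level, reject the null
# hypothesis of "no real difference" if |t| exceeds roughly 2.26.
```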
Unpaired observations
If the CV estimates are from different randomizations, they are no longer paired
(or maybe we used k-fold CV for one scheme, and j-fold CV for the other one)
Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom
The t-statistic becomes:
  t = ( m_x − m_y ) / sqrt( σ_x^2/k + σ_y^2/j )
instead of
  t = m_d / sqrt( σ_d^2/k )
136
Interpreting the result
All our cross-validation estimates are based on the same dataset
Samples are not independent
Should really use a different dataset sample for each of the k estimates used in the test to judge performance across different training sets
Or, use a heuristic test, e.g. the corrected resampled t-test
137

Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
138
Predicting probabilities
Performance measure so far: success rate
Also called 0-1 loss function, summed over all instances i:
  0 if the prediction for instance i is correct, 1 if it is incorrect
Most classifiers produce class probabilities
Depending on the application, we might want to check the accuracy of the probability estimates
0-1 loss is not the right thing to use in those cases
139
Quadratic loss function
p_1 … p_k are probability estimates for an instance
c is the index of the instance's actual class
a_1 … a_k = 0, except for a_c, which is 1
Quadratic loss is:
  Σ_j ( p_j − a_j )^2 = Σ_{j ≠ c} p_j^2 + ( 1 − p_c )^2
Want to minimize
  E[ Σ_j ( p_j − a_j )^2 ]
Can show that this is minimized when p_j = p_j*, the true probabilities
140
Informational loss function
The informational loss function is −log₂(p_c), where c is the index of the instance's actual class
Number of bits required to communicate the actual class
Let p_1* … p_k* be the true class probabilities
Then the expected value for the loss function is:
  −p_1* log₂ p_1 − … − p_k* log₂ p_k
Justification: minimized when p_j = p_j*
Difficulty: zero-frequency problem
141
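The two loss functions for a single instance can be sketched directly from their definitions (plain Python; the example numbers are made up):

```python
import math

# Sketch: p = predicted class probabilities, c = index of the actual class.
def quadratic_loss(p, c):
    return sum((pj - (1.0 if j == c else 0.0)) ** 2 for j, pj in enumerate(p))

def informational_loss(p, c):
    return -math.log2(p[c])        # infinite if the actual class was given probability 0

print(quadratic_loss([0.7, 0.2, 0.1], 0))       # 0.14
print(informational_loss([0.7, 0.2, 0.1], 0))   # ~0.515 bits
```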
Discussion
Which loss function to choose?
– Both encourage honesty
– Quadratic loss function takes into account all class probability estimates for an instance
– Informational loss focuses only on the probability estimate for the actual class
– Quadratic loss is bounded: it can never exceed 2 (it is at most 1 + Σ_j p_j^2, and Σ_j p_j^2 ≤ 1)
– Informational loss can be infinite
Informational loss is related to the MDL principle [later]
142

Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
143
Counting the cost
In practice, different types of classification errors often incur different costs
Examples:
– Disease diagnosis
– Terrorist profiling
  "Not a terrorist" correct 99.99% of the time
– Loan decisions
– Oil-slick detection
– Fault diagnosis
– Promotional mailing
144
Counting the cost
The confusion matrix:

                         Predicted class
                         Yes              No
  Actual class    Yes    True positive    False negative
                  No     False positive   True negative

There are many other types of cost!
– E.g.: cost of collecting training data
145
Lift charts
In practice, costs are rarely known
Decisions are usually made by comparing possible scenarios
Example: promotional mailout to 1,000,000 households
• Mail to all; 0.1% respond (1000)
• Data mining tool identifies subset of 100,000 most promising, 0.4% of these respond (400)
  40% of responses for 10% of cost may pay off
• Identify subset of 400,000 most promising, 0.2% respond (800)
A lift chart allows a visual comparison
146
Generating a lift chart
Sort instances according to predicted probability of being positive:

  Rank   Predicted probability   Actual class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  …      …                       …

x axis is sample size, y axis is number of true positives
147
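A sketch of how the points of such a lift chart can be computed (plain Python; the encoding of the actual class as 1/0 is an assumption of the sketch):

```python
# Sketch: rank by predicted probability of "yes", accumulate true positives.
def lift_points(probs, actual):                       # actual: 1 for yes, 0 for no
    ranked = sorted(zip(probs, actual), key=lambda t: -t[0])
    points, tp = [], 0
    for i, (_, a) in enumerate(ranked, start=1):
        tp += a
        points.append((i / len(ranked), tp))          # x: fraction of sample, y: true positives
    return points
```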
A hypothetical lift chart
(figure: lift curve with two marked points — 40% of responses for 10% of cost, 80% of responses for 40% of cost)
148
ROC curves
ROC curves are similar to lift charts
– Stands for "receiver operating characteristic"
– Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
Differences to lift chart:
– y axis shows percentage of true positives in sample rather than absolute number
– x axis shows percentage of false positives in sample rather than sample size
149
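The corresponding ROC points use the same ranking as the lift chart but put rates on both axes; a small sketch under the same 1/0 encoding assumption (and assuming at least one positive and one negative instance):

```python
# Sketch: same ranking as the lift chart, but both axes become rates.
def roc_points(probs, actual):                        # actual: 1 for positive, 0 for negative
    ranked = sorted(zip(probs, actual), key=lambda t: -t[0])
    P = sum(actual)
    N = len(actual) - P
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, a in ranked:
        tp += a
        fp += 1 - a
        points.append((fp / N, tp / P))               # x: FP rate, y: TP rate
    return points
```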
A sample ROC curve
(figure: ROC curve)
Jagged curve — one set of test data
Smooth curve — use cross-validation
150

Cross-validation and ROC curves
Simple method of getting a ROC curve using cross-validation:
– Collect probabilities for instances in test folds
– Sort instances according to probabilities
This method is implemented in WEKA
However, this is just one possibility
– The method described in the WEKA book generates an ROC curve for each fold and averages them
151
ROC curves for two schemes
(figure: ROC curves for two schemes, A and B)
For a small, focused sample, use method A
For a larger one, use method B
In between, choose between A and B with appropriate probabilities
152

The convex hull
Given two learning schemes we can achieve any point on the convex hull!
TP and FP rates for scheme 1: t_1 and f_1
TP and FP rates for scheme 2: t_2 and f_2
If scheme 1 is used to predict 100·q % of the cases and scheme 2 for the rest, then
– TP rate for combined scheme: q·t_1 + (1 − q)·t_2
– FP rate for combined scheme: q·f_1 + (1 − q)·f_2
153
Cost-sensitive learning
Most learning schemes do not perform cost-sensitive learning
– They generate the same classifier no matter what costs are assigned to the different classes
– Example: standard decision tree learner
Simple methods for cost-sensitive learning:
– Resampling of instances according to costs
– Weighting of instances according to costs
Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes
154
Measures in information retrieval
Percentage of retrieved documents that are relevant: precision = TP/(TP + FP)
Percentage of relevant documents that are returned: recall = TP/(TP + FN)
Precision/recall curves have a hyperbolic shape
Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
F-measure = (2 · recall · precision) / (recall + precision)
155
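These measures follow directly from the confusion-matrix counts; a trivial sketch:

```python
# Sketch: the three retrieval measures from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * r * p / (r + p)
```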
Summary of measures

  Measure                  Domain                   Plot (y vs. x)                       Explanation
  Lift chart               Marketing                TP vs. subset size                   TP; (TP + FP)/(TP + FP + TN + FN)
  ROC curve                Disease classification   TP rate (sensitivity) vs. FP rate    TP/(TP + FN); FP/(FP + TN)
  Recall-precision curve   Information retrieval    Recall vs. precision                 TP/(TP + FN); TP/(TP + FP)
156
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
157

Evaluating numeric prediction
Same strategies: independent test set, cross-validation, significance tests, etc.
Difference: error measures
Actual target values: a_1, a_2, …, a_n
Predicted target values: p_1, p_2, …, p_n
Most popular measure: mean-squared error
  ( (p_1 − a_1)^2 + … + (p_n − a_n)^2 ) / n
– Easy to manipulate mathematically
158
Other measures
The root mean-squared error:
  sqrt( ( (p_1 − a_1)^2 + … + (p_n − a_n)^2 ) / n )
The mean absolute error is less sensitive to outliers than the mean-squared error:
  ( |p_1 − a_1| + … + |p_n − a_n| ) / n
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500)
159
Improvement on the mean
How much does the scheme improve on simply predicting the average?
The relative squared error is (ā is the average of the actual values):
  ( (p_1 − a_1)^2 + … + (p_n − a_n)^2 ) / ( (a_1 − ā)^2 + … + (a_n − ā)^2 )
The relative absolute error is:
  ( |p_1 − a_1| + … + |p_n − a_n| ) / ( |a_1 − ā| + … + |a_n − ā| )
160
Correlation coefficient
Measures the statistical correlation between the predicted values and the actual values:
  ρ = S_PA / sqrt( S_P · S_A )
where
  S_PA = Σ_i ( p_i − p̄ )( a_i − ā ) / ( n − 1 )
  S_P  = Σ_i ( p_i − p̄ )^2 / ( n − 1 )
  S_A  = Σ_i ( a_i − ā )^2 / ( n − 1 )
Scale independent, between −1 and +1
Good performance leads to large values!
161
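The numeric-prediction measures from the last few slides can be sketched directly from their formulas (plain Python; function names are illustrative):

```python
import math

# Sketch: error measures for predicted values p and actual values a.
def mean_squared_error(p, a):
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / len(a)

def mean_absolute_error(p, a):
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / len(a)

def relative_squared_error(p, a):
    mean_a = sum(a) / len(a)
    return (sum((pi - ai) ** 2 for pi, ai in zip(p, a))
            / sum((ai - mean_a) ** 2 for ai in a))

def correlation(p, a):
    n = len(a)
    mp, ma = sum(p) / n, sum(a) / n
    s_pa = sum((pi - mp) * (ai - ma) for pi, ai in zip(p, a)) / (n - 1)
    s_p = sum((pi - mp) ** 2 for pi in p) / (n - 1)
    s_a = sum((ai - ma) ** 2 for ai in a) / (n - 1)
    return s_pa / math.sqrt(s_p * s_a)
```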
Which measure?
Best to look at all of them
Often it doesn't matter
Example:

                            A       B       C       D
  Root mean-squared error   67.8    91.7    63.3    57.4
  Mean absolute error       41.3    38.5    33.4    29.2
  Root rel squared error    42.2%   57.2%   39.4%   35.8%
  Relative absolute error   43.1%   40.1%   34.8%   30.4%
  Correlation coefficient   0.88    0.88    0.89    0.91

D best, C second-best, A and B arguable
162
Credibility: Evaluating what's been learned
Issues: training, testing, tuning
Predicting performance: confidence limits
Holdout, cross-validation, bootstrap
Comparing schemes: the t-test
Predicting probabilities: loss functions
Cost-sensitive measures
Evaluating numeric prediction
The Minimum Description Length principle
163

The MDL principle
MDL stands for minimum description length
The description length is defined as:
  space required to describe a theory
  +
  space required to describe the theory's mistakes
In our case the theory is the classifier and the mistakes are the errors on the training data
Aim: we seek a classifier with minimal DL
MDL principle is a model selection criterion
164
Model selection criteria
Model selection criteria attempt to find a good compromise between:
• The complexity of a model
• Its prediction accuracy on the training data
Reasoning: a good model is a simple model that achieves high accuracy on the given data
Also known as Occam's Razor: the best theory is the smallest one that describes all the facts
William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.
165
Elegance vs. errors
Theory 1: very simple, elegant theory that explains the data almost perfectly
Theory 2: significantly more complex theory that reproduces the data without mistakes
Theory 1 is probably preferable
Classical example: Kepler's three laws on planetary motion
– Less accurate than Copernicus's latest refinement of the Ptolemaic theory of epicycles
166

Elegance vs. errors
Kepler – "I have cleared the Augean stables of astronomy of cycles and spirals, and left behind me only a single cartload of dung"
167
MDL and compression
MDL principle relates to data compression:
– The best theory is the one that compresses the data the most
– I.e. to compress a dataset we generate a model and then store the model and its mistakes
We need to compute
  (a) the size of the model, and
  (b) the space needed to encode the errors
(b) is easy: use the informational loss function
(a) needs a method to encode the model
168
MDL and Bayes's theorem
L[T] = "length" of the theory
L[E|T] = training set encoded with respect to the theory (the "dung")
Description length = L[T] + L[E|T]
Bayes's theorem gives the a posteriori probability of a theory given the data:
  Pr[T|E] = Pr[E|T] · Pr[T] / Pr[E]
Equivalent to:
  −log Pr[T|E] = −log Pr[E|T] − log Pr[T] + log Pr[E]
where log Pr[E] is a constant
169
MDL and MAP
MAP stands for maximum a posteriori probability
Finding the MAP theory corresponds to finding the MDL theory
Difficult bit in applying the MAP principle: determining the prior probability Pr[T] of the theory
Corresponds to the difficult part in applying the MDL principle: the coding scheme for the theory
I.e. if we know a priori that a particular theory is more likely, we need fewer bits to encode it
170
Discussion of MDL principle
Advantage: makes full use of the training data when selecting a model
Disadvantage 1: appropriate coding scheme/prior probabilities for theories are crucial
Disadvantage 2: no guarantee that the MDL theory is the one which minimizes the expected error
Note: Occam's Razor is an axiom!
Epicurus's principle of multiple explanations: keep all theories that are consistent with the data
171
Bayesian model averaging
Reflects Epicurus's principle: all theories are used for prediction, weighted according to Pr[T|E]
Let I be a new instance whose class we must predict
Let C be the random variable denoting the class
Then BMA gives the probability of C given
– I
– training data E
– possible theories T_j:
  Pr[C | I, E] = Σ_j Pr[C | I, T_j] · Pr[T_j | E]
172
MDL and clustering
Description length of theory: bits needed to encode the clusters
– e.g. cluster centers
Description length of data given theory: encode cluster membership and position relative to cluster
– e.g. distance to cluster center
Works if coding scheme uses less code space for small numbers than for large ones
With nominal attributes, must communicate probability distributions for each cluster
173
Main References
Han, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques (2nd ed.). New York: Morgan Kaufmann.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). New York: Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed., 5th printing). New York: Springer.