Chapter 4 - Classification: Basic Concepts, Decision Trees & Model Evaluation. Part 2: Model Overfitting & Classifier Evaluation

Slide 1: Title slide.

Slide 2: Classification Error

Slide 3: Example Data Set

Slide 4: Decision Trees

Slide 5: Model Overfitting

Slide 6: Overfitting due to Presence of Noise
- The decision boundary is distorted by a noise point.

Slide 7: Example: Mammal Classification

Slide 8: Effect of Noise

Slide 9: Overfitting due to Insufficient Examples
- The lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.

Slide 10: Lack of Representative Samples

Slide 11: Effect of Multiple Comparison Procedure / 1

Slide 12: Effect of Multiple Comparison Procedure / 2
- How does the multiple comparison procedure relate to model overfitting?

Slide 13: Effect of Multiple Comparison Procedure / 3

Slide 14: Notes on Overfitting

Slide 15: Estimating Generalization Errors

Slide 16: Resubstitution Estimates
- Note: the classification decision at each leaf node is based on the majority class.

Slide 17: Incorporating Model Complexity (IMC)

Slide 18: IMC: Pessimistic Error Estimates

Slide 19: Pessimistic Error Estimates: Example
- With penalty term Ω = 0.5 per leaf node:
    e(T_L) = (4 + 7 × 0.5) / 24 = 0.3125
    e(T_R) = (6 + 4 × 0.5) / 24 = 0.3333
- With Ω = 1.0:
    e(T_L) = (4 + 7 × 1) / 24 = 0.458
    e(T_R) = (6 + 4 × 1) / 24 = 0.417

Slide 20: IMC: Minimum Description Length / 1
- Based on an information-theoretic approach.
- In the example: A and B are given a set of records with known attribute values x. A knows the exact label of each record, while B knows none of them.
- B can obtain the classification of each record by requesting that A transmit the class labels sequentially; this requires Θ(n) bits, where n = total number of records.

Slide 21: IMC: Minimum Description Length / 2

Slide 22: Estimating Statistical Bounds
- Upper bound on the generalization error, estimated from the training error e:
    e_upper = ( e + z²/(2N) + z · sqrt( e/N − e²/N + z²/(4N²) ) ) / ( 1 + z²/N )
  where:
    N = total number of training records used to compute e
    α = confidence level
    z = z_{α/2} = standardized value from a standard normal distribution
    e_upper = upper bound of the error
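To make the bound above concrete, here is a minimal Python sketch (not part of the original slides); the function name upper_error_bound is my own choice, and scipy is assumed available to supply z_{α/2} via norm.ppf:

```python
from math import sqrt
from scipy.stats import norm

def upper_error_bound(e: float, N: int, alpha: float = 0.05) -> float:
    """Upper bound on the generalization error at confidence level 1 - alpha,
    given training error e measured over N records (normal approximation
    to the binomial, as on the 'Estimating Statistical Bounds' slide)."""
    z = norm.ppf(1 - alpha / 2)   # z_{alpha/2}, e.g. 1.96 for alpha = 0.05
    z2 = z * z
    num = e + z2 / (2 * N) + z * sqrt(e / N - e * e / N + z2 / (4 * N * N))
    return num / (1 + z2 / N)

# Example: training error 6/24 = 0.25 measured on 24 records, 95% confidence
print(upper_error_bound(e=0.25, N=24))   # roughly 0.45
```

As expected, the bound is loose for small N and tightens toward e as N grows.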
Slide 23: Using Validation Set

Slide 24: Handling Overfitting in Decision Trees / 1

Slide 25: Handling Overfitting in Decision Trees / 2

Slide 26: Example of Post-Pruning / 1

Slide 27: Example of Post-Pruning / 2

Slide 28: Evaluating the Performance of a Classifier

Slide 29: Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance among competing models?

Slide 30: Metrics for Performance Evaluation / 1
- Focus on the predictive capability of a model rather than on how fast it classifies or builds models, scalability, etc.
- Confusion matrix:

                             PREDICTED CLASS
                             Class=Yes   Class=No
    ACTUAL     Class=Yes         a           b
    CLASS      Class=No          c           d

    a: TP (true positive), b: FN (false negative),
    c: FP (false positive), d: TN (true negative)

Slide 31: Metrics for Performance Evaluation / 2

                             PREDICTED CLASS
                             Class=Yes   Class=No
    ACTUAL     Class=Yes      a (TP)      b (FN)
    CLASS      Class=No       c (FP)      d (TN)

- Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Slide 32: Limitation of Accuracy
- Consider a 2-class problem: 9990 examples of Class 0 and 10 examples of Class 1.
- If the model predicts everything to be Class 0, its accuracy is 9990/10000 = 99.9%.
- Accuracy is misleading here because the model does not detect a single Class 1 example.

Slide 33: Cost Matrix

                             PREDICTED CLASS
    C(i|j)                   Class=Yes    Class=No
    ACTUAL     Class=Yes     C(Yes|Yes)   C(No|Yes)
    CLASS      Class=No      C(Yes|No)    C(No|No)

- C(i|j): the cost of misclassifying a class j example as class i.

Slide 34: Computing the Cost of Classification

    Cost matrix:             PREDICTED CLASS
    C(i|j)                    +      -
    ACTUAL       +           -1    100
    CLASS        -            1      0

    Model M1:   PREDICTED        Model M2:   PREDICTED
                 +      -                     +      -
    ACTUAL  +  150     40        ACTUAL  +  250     45
            -   60    250                -    5    200

- M1: Accuracy = 80%, Cost = 3910
- M2: Accuracy = 90%, Cost = 4255

Slide 35: Cost vs. Accuracy

    Count:                   PREDICTED CLASS
                             Class=Yes   Class=No
    ACTUAL     Class=Yes         a           b
    CLASS      Class=No          c           d

    Cost:                    PREDICTED CLASS
                             Class=Yes   Class=No
    ACTUAL     Class=Yes         p           q
    CLASS      Class=No          q           p

- N = a + b + c + d; Accuracy = (a + d)/N
- Cost = p(a + d) + q(b + c)
       = p(a + d) + q(N − a − d)
       = qN − (q − p)(a + d)
       = N × [q − (q − p) × Accuracy]
- Accuracy is therefore proportional to cost if:
  1. C(Yes|No) = C(No|Yes) = q
  2. C(Yes|Yes) = C(No|No) = p

Slide 36: Cost-Sensitive Measures
- Precision p = a / (a + c) is biased towards C(Yes|Yes) & C(Yes|No).
- Recall r = a / (a + b) is biased towards C(Yes|Yes) & C(No|Yes).
- F-measure F = 2rp / (r + p) = 2a / (2a + b + c) is biased towards all cells except C(No|No).
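The metrics and costs above can be checked with a short Python sketch (not from the slides; the function name metrics is my own). It reproduces the M1/M2 figures from the "Computing the Cost of Classification" slide:

```python
import numpy as np

def metrics(tp, fn, fp, tn, cost_matrix):
    """Accuracy, precision, recall, F-measure and weighted cost from the
    confusion-matrix counts a=TP, b=FN, c=FP, d=TN (slide notation)."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)                  # a / (a + c)
    recall = tp / (tp + fn)                     # a / (a + b)
    f_measure = 2 * recall * precision / (recall + precision)
    # cost_matrix rows = actual class (+, -), columns = predicted class (+, -)
    counts = np.array([[tp, fn], [fp, tn]])
    cost = float((counts * cost_matrix).sum())
    return accuracy, precision, recall, f_measure, cost

cost = np.array([[-1, 100], [1, 0]])            # C(i|j) from the cost matrix slide
print(metrics(150, 40, 60, 250, cost))          # M1: accuracy 0.80, cost 3910
print(metrics(250, 45, 5, 200, cost))           # M2: accuracy 0.90, cost 4255
```

Note how M2 wins on accuracy yet loses on cost, which is exactly the point of the cost-sensitive slides.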
Slide 37: Methods for Performance Evaluation
- How can a reliable estimate of performance be obtained?
- The performance of a model may depend on factors other than the learning algorithm: the class distribution, the cost of misclassification, and the sizes of the training and test sets.

Slide 38: Learning Curve
- A learning curve shows how accuracy changes with varying sample size.
- Effects of a small sample size: bias in the estimate, and variance of the estimate.

Slide 39: Methods for Estimation
- Holdout method
- Random subsampling
- Cross-validation
- Bootstrap

Slide 40: Holdout Method
- The original data is partitioned into two disjoint sets, a training set and a test set: reserve k% of the data for training and the remaining (100 − k)% for testing.
- The accuracy of the classifier is estimated from the accuracy of the induced model on the test set.
- Limitations:
  - The induced model may not be as good as one trained on all the labeled examples (the smaller the training set, the larger the variance of the model).
  - If the training set is too large, the accuracy estimate computed from the smaller test set is less reliable (i.e., it has a wide confidence interval).
  - The training and test sets are not independent of each other, since both are subsets of the same original data.

Slide 41: Random Subsampling
- The holdout method is repeated several times to improve the estimated accuracy.
- The overall accuracy is acc_sub = (1/k) Σ_{i=1..k} acc_i, where k is the total number of iterations and acc_i is the model accuracy in the i-th iteration.
- Problems:
  - As with the holdout method, it does not use as much data as possible for training.
  - It has no control over the number of times each record is used for testing and training (some records may be used for training more often than others).

Slide 42: Cross-Validation
- Partition the data into k equal-sized disjoint subsets.
- k-fold: train on k − 1 partitions and test on the remaining one, repeated k times so that each partition is used for testing exactly once.
- If k = N (the size of the data set), this is called the leave-one-out method: each test set contains only one record.
- The total estimated accuracy is computed as for random subsampling.
- Advantages of leave-one-out: the data is used as much as possible for training, and the test sets are mutually exclusive and effectively cover the entire data set.
- Drawbacks of leave-one-out: it is computationally expensive, since the procedure is repeated N times, and the variance of the estimated performance metric tends to be high because each test set contains only one record.

Slide 43: Bootstrap
- Sampling with replacement: all the previous approaches sample the training records without replacement; the bootstrap samples them with replacement.
- If the original data contains N records, a bootstrap sample of size N contains on average about 63.2% of the records in the original data: the probability that a given record is chosen is 1 − (1 − 1/N)^N → 1 − e^(−1) ≈ 0.632 as N → ∞.
- Records not included in the bootstrap sample become part of the test set.
- The sampling procedure is repeated b times to generate b bootstrap samples.
- In the .632 bootstrap approach (one of the most widely used), the overall accuracy is
    acc_boot = (1/b) Σ_{i=1..b} (0.632 × ε_i + 0.368 × acc_s),
  where ε_i is the accuracy of the model built from the i-th bootstrap sample and acc_s is the accuracy of a model trained on all the labeled examples in the original data. (A small sketch follows this slide.)
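As announced on the Bootstrap slide, here is a small Python sketch (mine, not from the deck) that checks the 63.2% coverage figure empirically and implements the .632 accuracy formula; the function name acc_632_bootstrap is my own:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
boot = rng.integers(0, N, size=N)        # bootstrap sample: N record indices
                                         # drawn with replacement
coverage = np.unique(boot).size / N      # fraction of distinct records drawn
print(f"empirical: {coverage:.3f}, limit 1 - 1/e = {1 - np.exp(-1):.3f}")

def acc_632_bootstrap(eps, acc_s):
    """.632 bootstrap accuracy: eps = accuracies of the b bootstrap models,
    acc_s = accuracy of a model trained on the full original data."""
    eps = np.asarray(eps, dtype=float)
    return float(np.mean(0.632 * eps + 0.368 * acc_s))

# Hypothetical per-sample accuracies for b = 3 bootstrap samples
print(acc_632_bootstrap([0.80, 0.78, 0.83], acc_s=0.90))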
Slide 44: Methods for Comparing Classifiers
- The issue of estimating the confidence interval of a given model's accuracy.
- The issue of testing the statistical significance of an observed deviation.

Slide 45: Confidence Interval for Accuracy / 1

Slide 46: Confidence Interval for Accuracy / 2

Slide 47: Confidence Interval for Accuracy / 3

Slide 48: Confidence Interval for Accuracy / 4

Slide 49: Comparing Performance of 2 Models / 1

Slide 50: Comparing Performance of 2 Models / 2

Slide 51: An Illustrative Example

Slide 52: Comparing Performance of 2 Classifiers

Slide 53: An Illustrative Example

Slide 54: Handling Missing Attributes
- Missing values affect decision tree construction in three ways:
  - how impurity measures are computed;
  - how instances with missing values are distributed to child nodes;
  - how a test instance with a missing value is classified.

Slide 55: Computing the Impurity Measure
- Before splitting: Entropy(Parent) = −0.3 log₂(0.3) − 0.7 log₂(0.7) = 0.8813.
- Split on Refund (the missing Refund value is assumed to be No, the majority):
    Entropy(Refund=Yes) = −(3/3) log₂(3/3) = 0
    Entropy(Refund=No) = −(3/7) log₂(3/7) − (4/7) log₂(4/7) = 0.9852
    Entropy(Children) = 0.3 × 0 + 0.6 × 0.9852 = 0.5911
    Gain = 0.8813 − 0.5911 = 0.2902
  (A verification sketch appears at the end of this deck.)

Slide 56: Distribute Instances
- Split on Refund. The probability that Refund = Yes is 3/9; the probability that Refund = No is 6/9.
- Assign the record with the missing value to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9.

Slide 57: Classify Instances
- Decision tree:
    Refund?  Yes → NO;  No → MarSt?
    MarSt?   Married → NO;  Single, Divorced → TaxInc?
    TaxInc?  < 80K → NO;  > 80K → YES
- Class counts by Marital Status (the fraction 6/9 comes from the record distributed on the previous slide):

                 Married   Single   Divorced   Total
    Class=No        3         1         0        4
    Class=Yes      6/9        1         1       2.67
    Total          3.67       2         1       6.67

- For a new record with a missing Marital Status value:
    P(Marital Status = Married) = 3.67 / 6.67
    P(Marital Status = {Single, Divorced}) = 3 / 6.67

Slide 58: Chapter 4 Group Assignment
- Problems: 5, 6, 7, 9, 10, 11.
- In addition to the hardcopy, prepare a softcopy for discussion.
- Deadline: Tuesday, 17 March 2008.
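Closing aside, not part of the original deck: a minimal Python sketch that re-derives the impurity numbers from the "Computing the Impurity Measure" slide (Slide 55); the entropy helper is my own:

```python
from math import log2

def entropy(probs):
    """Entropy (base 2) of a class probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

parent = entropy([0.3, 0.7])           # 0.8813, before splitting
e_yes = entropy([3/3])                 # Refund=Yes: all one class -> 0
e_no = entropy([3/7, 4/7])             # Refund=No: 0.9852
children = 0.3 * e_yes + 0.6 * e_no    # weighted sum from the slide: 0.5911
gain = parent - children               # 0.2902
print(f"parent={parent:.4f}  children={children:.4f}  gain={gain:.4f}")
```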