CSCI 347, Data Mining
Evaluation: Training and Testing, Section 5.1, pages 147-150

Question
Given the many available learning algorithms, how do we compare their effectiveness?
Evaluation Process
Process:
- Have two large datasets: a training dataset and a testing dataset, both representative of the underlying problem.
- Run a wide variety of algorithms on the training dataset. Different algorithms will produce different models, i.e. different patterns; try many algorithms so that if there is a hidden pattern, it is likely that one of them will find it.
- Test each model on the testing dataset to see which performs best.
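As a sketch, the process above might look like the following in Python. The synthetic data, the two stand-in "learning algorithms" (a threshold rule and a majority-class rule), and all names here are illustrative assumptions, not part of the slides:

```python
import random

random.seed(0)

# A toy "underlying problem": x in [0, 1), true label is 1 iff x > 0.5,
# with 10% label noise. Purely illustrative, not from the slides.
def make_point():
    x = random.random()
    label = 1 if x > 0.5 else 0
    if random.random() < 0.1:  # 10% label noise
        label = 1 - label
    return x, label

train = [make_point() for _ in range(200)]  # representative training set
test = [make_point() for _ in range(100)]   # independent testing set

# Two stand-in "learning algorithms", each building a model from the
# training data: a threshold rule and a constant majority-class rule.
def learn_threshold(train):
    return lambda x: 1 if x > 0.5 else 0

def learn_majority(train):
    majority = 1 if sum(y for _, y in train) >= len(train) / 2 else 0
    return lambda x: majority

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Train every algorithm on the training set, then compare on the test set.
scores = {name: accuracy(learner(train), test)
          for name, learner in [("threshold", learn_threshold),
                                ("majority", learn_majority)]}
best = max(scores, key=scores.get)
```

Here the threshold rule matches the hidden pattern and should clearly beat the majority-class rule on the held-out test data, which is exactly the comparison the evaluation process is designed to reveal.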
Independent and Representative Data
As long as the training and testing datasets are independent and representative of the underlying problem, it is likely that the performance predicted on the testing dataset will match reality. That is, when the model is used on new data, the error rate will be roughly the same as predicted.
How Much Data
How much data is enough? It depends on:
- The algorithms being used
- The complexity of the data
- The required success rate (in some cases the cost of misclassification is much more serious than in others)
- The relative frequency of the possible outcomes
Statistics
Statisticians have spent years developing tests for determining the smallest sample that can be used to produce an accurate model.
Quote
"Data mining is useful when the sheer volume of data obscures patterns that might be detectable in smaller databases. Generally, start with tens of thousands, if not millions, of pre-classified records. If data is scarce, data mining is unlikely to be useful."
From Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management
Rare Class Values
Target variables might represent something relatively rare:
- Prospects responding to a direct mail offer
- Credit card holders committing fraud
- Newspaper subscribers canceling their subscription in a given month
Recommendation
The training set should be balanced, with equal numbers of each outcome. A smaller, balanced sample is preferable to a larger one with a very low proportion of rare outcomes.
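One common way to build such a balanced sample is to downsample the common class. A minimal sketch, where the record counts, the "responder"/"non-responder" labels, and the 1-in-100 rarity are all hypothetical:

```python
import random

random.seed(1)

# Hypothetical raw data: responders to a mail offer are the rare class
# (1 in 100); non-responders dominate. Labels are illustrative only.
records = ([("responder", i) for i in range(100)]
           + [("non-responder", i) for i in range(9900)])

rare = [r for r in records if r[0] == "responder"]
common = [r for r in records if r[0] == "non-responder"]

# Downsample the common class so both outcomes appear equally often:
# a smaller balanced sample instead of the full, heavily skewed dataset.
balanced = rare + random.sample(common, len(rare))
random.shuffle(balanced)
```

The result is a 200-record balanced training set drawn from 10,000 raw records, trading size for equal representation of the rare outcome.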
Limited Quality Data
Quality data that is representative of the underlying problem is hard to come by.
Satellite Imagery of Sea Ice
Electrical Load Forecasting
Want to predict future demand for power as far in advance as possible. With accurate predictions, one can fine-tune:
- operating reserves,
- maintenance scheduling, and
- fuel inventory management.
Data was collected over 15 years. Major holidays, such as Thanksgiving, Christmas, and New Year's Day, show significant variation from normal loads.
Diagnosis of Electromechanical Failures
Preventative maintenance of electromechanical devices such as motors and generators can forestall failures that disrupt industrial processes. The available data covered 600 faults, each comprising a set of measurements along with an expert's diagnosis, representing 20 years of experience. Half were unsatisfactory for various reasons and had to be discarded; the remainder were used as training examples.
Validation Data Set
In many cases three datasets are needed:
- Training dataset for selecting the learning algorithm
- Validation dataset for setting the parameters of the chosen learning algorithm
- Testing dataset for determining the accuracy
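A three-way split can be sketched as follows; the pool of 1,000 records and the 60/20/20 proportions are assumptions for illustration, not from the slides:

```python
import random

random.seed(42)

# Illustrative pool of 1,000 pre-classified records (indices stand in
# for real examples); the 60/20/20 proportions are an assumption.
data = list(range(1000))
random.shuffle(data)  # shuffle so each split is representative

n = len(data)
train = data[:int(0.6 * n)]                    # select the learning algorithm
validation = data[int(0.6 * n):int(0.8 * n)]   # tune its parameters
test = data[int(0.8 * n):]                     # measure final accuracy once
```

The key property is that the three slices are disjoint, so the test set stays untouched until the very end.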
Maximizing Training
Once the error rate has been measured, re-bundle all three datasets and train again, but don't re-measure the error rate!
How Close to the True Success Rate?
- Toss a coin 100 times and get 75 heads. Estimated success rate: f = 75/100 = 0.75
- Toss a coin 1,000 times and get 750 heads. Estimated success rate: f = 750/1000 = 0.75
Confidence Intervals
Statistical theory provides us with confidence intervals for the true underlying proportion.
- Toss a coin 100 times and get 75 heads: estimated success rate f = 75/100 = 0.75. With 80% confidence, the true success rate lies in [69.1%, 80.1%].
- Toss a coin 1,000 times and get 750 heads: estimated success rate f = 750/1000 = 0.75. With 80% confidence, the true success rate lies in [73.2%, 76.7%].
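The quoted intervals can be reproduced with the Wilson score interval for a proportion, where z = 1.2816 is the standard normal quantile for 80% confidence. A minimal sketch:

```python
import math

def wilson_interval(successes, n, z=1.2816):
    """Wilson score confidence interval for a true proportion.
    z = 1.2816 corresponds to 80% confidence."""
    f = successes / n
    center = f + z * z / (2 * n)
    spread = z * math.sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

lo100, hi100 = wilson_interval(75, 100)      # ≈ (0.691, 0.801)
lo1000, hi1000 = wilson_interval(750, 1000)  # ≈ (0.732, 0.767)
```

Note how the same observed proportion f = 0.75 yields a much narrower interval at n = 1,000 than at n = 100: the larger the test set, the tighter the bound on the true success rate.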