CSCI 347, Data Mining Evaluation: Training and Testing, Section 5.1, pages 147-150


Question
There are many learning algorithms; how do we compare their effectiveness?

Evaluation Process
Process:
- Have two large datasets: a training dataset and a testing dataset (both representative of the underlying problem).
- Run a wide variety of algorithms on the training dataset. Different algorithms will produce different models, i.e. different patterns; try many algorithms so that if there is a hidden pattern, it is likely that one of them will find it.
- Test each model on the testing dataset to see which performs best (a sketch of this workflow follows).
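A minimal sketch of this compare-on-a-held-out-test-set workflow, assuming scikit-learn; the synthetic dataset and the particular classifiers are placeholders for illustration, not part of the original slides:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder for a large, representative dataset.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Independent training and testing datasets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Run a variety of algorithms on the training data; each produces a different model.
    models = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "naive Bayes": GaussianNB(),
        "k-nearest neighbors": KNeighborsClassifier(),
    }

    # Test each model on the testing data to see which performs best.
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))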

Independent and Representative Data
As long as the training and testing datasets are independent and representative of the underlying problem, it is likely that the performance predicted on the test dataset will match reality. That is, when the model is used on new data, the error rate will be about the same as predicted.

How Much Data
How much data is enough? It depends on:
- The algorithms being used
- The complexity of the data
- The required success rate (in some cases the cost of misclassification is much more serious than in others)
- The relative frequency of the possible outcomes

Statistics
Statisticians have spent years developing tests for determining the smallest sample that can be used to produce an accurate model.

Quote
Data mining is useful when the sheer volume of data obscures patterns that might be detectable in smaller databases. Generally, start with tens of thousands, if not millions, of pre-classified records. If data is scarce, data mining is unlikely to be useful.

From Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management

Rare Class Values
Target variables might represent something relatively rare:
- Prospects responding to a direct mail offer
- Credit card holders committing fraud
- Newspaper subscribers canceling their subscription in a given month

Recommendation
The training set should be balanced, with equal numbers of each of the outcomes. A smaller, balanced sample is preferable to a larger one with a very low proportion of rare outcomes (a balancing sketch follows).
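One simple way to follow this recommendation is to downsample the common class until both outcomes appear equally often. A minimal sketch, assuming pandas; the DataFrame, column names, and 2% rare rate are hypothetical:

    import pandas as pd

    # Hypothetical data: 'target' is 1 for the rare outcome (e.g. fraud), 0 otherwise.
    df = pd.DataFrame({
        "amount": range(1000),
        "target": [1 if i % 50 == 0 else 0 for i in range(1000)],  # ~2% rare cases
    })

    rare = df[df["target"] == 1]
    common = df[df["target"] == 0]

    # Downsample the common class to match the number of rare cases.
    common_downsampled = common.sample(n=len(rare), random_state=0)

    # Smaller, balanced training set with equal numbers of each outcome.
    balanced = pd.concat([rare, common_downsampled]).sample(frac=1, random_state=0)
    print(balanced["target"].value_counts())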

Limited Quality Data
Quality data that is representative of the underlying problem is hard to come by.

Satellite Imagery of Sea Ice

Electrical Load Forecasting
The goal is to predict future demand for power as far in advance as possible.

With accurate predictions, it is possible to fine-tune operating reserves, maintenance scheduling, and fuel inventory management.

Data collected over 15 years

Major holidays, such as Thanksgiving, Christmas, and New Year's Day, show significant variation from normal loads.

Diagnosis of Electromechanical Failures
Preventative maintenance of electromechanical devices, such as motors and generators, can forestall failures that disrupt industrial processes.

The data comprised 600 faults, each consisting of a set of measurements along with an expert's diagnosis, representing 20 years of experience.

Half were unsatisfactory for various reasons and had to be discarded; the remainder were used as training examples.

Validation Data Set
In many cases three datasets are needed:
- A training dataset, for selecting the learning algorithm
- A validation dataset, for setting parameters on the chosen learning algorithm
- A testing dataset, for determining the accuracy
A sketch of this three-way split follows.
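A minimal sketch of the three-way split, again assuming scikit-learn; the synthetic dataset, the decision tree, and the max_depth values being tuned are illustrative stand-ins:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

    # First carve off the test set, then split the rest into training and validation.
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    # Use the validation set to pick a parameter for the chosen algorithm.
    best_depth, best_score = None, -1.0
    for depth in (2, 4, 8, 16):
        model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
        score = model.score(X_val, y_val)
        if score > best_score:
            best_depth, best_score = depth, score

    # The test set is used only once, to estimate the accuracy of the final choice.
    final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
    print("estimated accuracy:", final.score(X_test, y_test))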

Maximizing Training
Once the error rate is measured, re-bundle all three datasets and train again on the combined data, but don't re-measure the error rate!
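A minimal sketch of this re-bundling step, with the three datasets as placeholder NumPy arrays; the decision tree and the max_depth value stand in for whatever model and parameters were chosen earlier:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Placeholder training, validation, and test sets (features and labels).
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(600, 5))
    X_val = rng.normal(size=(200, 5))
    X_test = rng.normal(size=(200, 5))
    y_train = rng.integers(0, 2, 600)
    y_val = rng.integers(0, 2, 200)
    y_test = rng.integers(0, 2, 200)

    # ... the error rate has already been measured on the test set ...

    # Re-bundle all three datasets and train the chosen model one last time.
    X_all = np.concatenate([X_train, X_val, X_test])
    y_all = np.concatenate([y_train, y_val, y_test])
    deployed = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_all, y_all)
    # Do not re-measure the error rate: every record has now been used for training.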

How Close to the True Success Rate?
Toss a coin 100 times and get 75 heads.
Estimated success rate: f = 75/100 = 0.75
Toss a coin 1,000 times and get 750 heads.
Estimated success rate: f = 750/1000 = 0.75

Confidence Intervals
Statistical theory provides us with confidence intervals for the true underlying proportion.

Toss a coin 100 times and get 75 heads.
Estimated success rate: f = 75/100 = 0.75
With 80% confidence, the true success rate lies in [69.1%, 80.1%].

Toss a coin 1,000 times and get 750 heads.
Estimated success rate: f = 750/1000 = 0.75
With 80% confidence, the true success rate lies in [73.2%, 76.7%].
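One standard way to obtain such an interval is the normal-approximation (Wilson score) bound on a Bernoulli proportion. A minimal sketch that reproduces the two intervals quoted above, assuming an 80% two-sided confidence level (z is about 1.28):

    from math import sqrt
    from statistics import NormalDist

    def wilson_interval(successes, n, confidence=0.80):
        """Wilson score interval for the true proportion behind an observed success rate.

        p lies in [ (f + z^2/2N +/- z*sqrt(f/N - f^2/N + z^2/4N^2)) / (1 + z^2/N) ]
        """
        f = successes / n
        z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.28 for 80% confidence
        center = f + z * z / (2 * n)
        spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
        denom = 1 + z * z / n
        return (center - spread) / denom, (center + spread) / denom

    print(wilson_interval(75, 100))    # ~ (0.691, 0.801)
    print(wilson_interval(750, 1000))  # ~ (0.732, 0.767)

Note how ten times as many observations with the same estimated success rate give a much narrower interval.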