An Exercise in Machine Learning
http://www.cs.iastate.edu/~cs573x/bbsilab.html
• Machine Learning Software
• Preparing Data
• Building Classifiers
• Interpreting Results
• Test-driving WEKA
Machine Learning Software
• Suites (general purpose): WEKA (source: Java), MLC++ (source: C++), SIPINA, and the list from KDnuggets (various)
• Specific tools: classification (C4.5, SVMlight), association rule mining, Bayesian nets, …
• Commercial vs. free vs. programming your own
What does WEKA do?
• Implements state-of-the-art learning algorithms
• Main strengths in classification; regression, association rule, and clustering algorithms are also included
• Extensible to try new learning schemes
• Large variety of handy tools (transforming datasets, filters, visualization, etc.)
WEKA resources
• API documentation, tutorial, source code
• WEKA mailing list
• Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
• Weka-related projects:
  – Weka-Parallel: parallel processing for Weka
  – RWeka: linking R and Weka
  – YALE: Yet Another Learning Environment
  – many others…
Getting StartedGetting Started
Installation (Java runtime +WEKA)Installation (Java runtime +WEKA) Setting up the environment Setting up the environment
((CLASSPATHCLASSPATH)) Reference Book and online API Reference Book and online API
documentdocument Preparing Data setsPreparing Data sets Running WEKARunning WEKA Interpreting ResultsInterpreting Results
ARFF Data Format
• Attribute-Relation File Format
• Header: describes the attribute types
• Data: instances (examples) as comma-separated lists
• Use the right data format: Filestem, CSV, or ARFF
• Use C45Loader and CSVLoader to convert other formats to ARFF
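As a sketch of the format, here is a trimmed version of the classic weather dataset that WEKA ships with (the same data the J48 output later in this exercise is built from); the header lines are standard ARFF syntax, and the data rows shown are illustrative:

```text
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,FALSE,no
sunny,70,TRUE,yes
overcast,83,FALSE,yes
rainy,96,FALSE,yes
rainy,91,TRUE,no
```

The last attribute in the header is taken as the class by default in the command-line interface; missing values are written as `?`.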
Launching WEKA
Load Dataset into WEKA
Data Filters
• Useful support for data preprocessing: removing or adding attributes, resampling the dataset, removing examples, etc.
• Can create stratified cross-validation folds of a given dataset, so that class distributions are approximately retained within each fold
• Typically, split the data as 2/3 training and 1/3 testing
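The stratified 2/3–1/3 split can be sketched in plain Java (no WEKA classes; `StratifiedSplit` and its method name are invented for illustration): group instance indices by class label and take the first two thirds of each group for training, so each class keeps roughly its original proportion in both parts.

```java
import java.util.*;

public class StratifiedSplit {
    // Split instance indices ~2/3 train / 1/3 test while keeping the
    // class distribution approximately the same in both parts.
    public static List<List<Integer>> split(String[] labels) {
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++)
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);

        List<Integer> train = new ArrayList<>(), test = new ArrayList<>();
        for (List<Integer> idx : byClass.values()) {
            int cut = (2 * idx.size()) / 3;   // 2/3 of this class to training
            train.addAll(idx.subList(0, cut));
            test.addAll(idx.subList(cut, idx.size()));
        }
        return List.of(train, test);
    }

    public static void main(String[] args) {
        // 9 "yes" / 5 "no", as in the weather data
        String[] y = {"yes","yes","yes","yes","yes","yes","yes","yes","yes",
                      "no","no","no","no","no"};
        List<List<Integer>> parts = split(y);
        System.out.println("train=" + parts.get(0).size()
                         + " test=" + parts.get(1).size());
    }
}
```

In practice you would shuffle within each class first; WEKA's own filters take care of this for you.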
Building Classifiers
• A classifier model is a mapping from dataset attributes to the class (target) attribute; how it is created and what form it takes differ by algorithm
• Decision tree and Naïve Bayes classifiers
• Which one is the best? No free lunch!
Building Classifiers
(1) weka.classifiers.rules.ZeroR
    Builds and uses a 0-R classifier: predicts the mean (for a numeric class) or the mode (for a nominal class).
(2) weka.classifiers.bayes.NaiveBayes
    Class for building a Naive Bayesian classifier.
(3) weka.classifiers.trees.J48
    Class for generating an unpruned or a pruned C4.5 decision tree.
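For intuition, 0-R's nominal-class behavior fits in a few lines of plain Java (a sketch, not WEKA's implementation; the class name `ZeroRSketch` is invented):

```java
import java.util.*;

public class ZeroRSketch {
    // 0-R ignores every attribute and always predicts the majority
    // (modal) class observed in the training data.
    public static String fit(String[] classValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : classValues) counts.merge(c, 1, Integer::sum);
        return Collections.max(counts.entrySet(),
                               Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        String[] y = {"yes", "yes", "yes", "no", "no"};
        System.out.println(fit(y));   // the majority class
    }
}
```

Any learned classifier should at least beat this baseline; for a numeric class, 0-R would return the mean instead of the mode.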
Test Options
• Percentage split (2/3 training; 1/3 testing)
• Cross-validation: estimates the generalization error by resampling when data are limited; the per-fold error estimates are averaged
  – stratified
  – 10-fold
  – leave-one-out (LOO)
  – 10-fold vs. LOO
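The fold bookkeeping behind k-fold cross-validation can be sketched as follows (plain Java with hypothetical names; stratification is omitted for brevity): each instance is assigned to one of k folds, each fold serves once as the test set, and with k equal to the number of instances the scheme reduces to leave-one-out.

```java
public class KFold {
    // Assign each of n instances to one of k folds, round-robin.
    // k == n gives leave-one-out.
    public static int[] foldOf(int n, int k) {
        int[] fold = new int[n];
        for (int i = 0; i < n; i++) fold[i] = i % k;
        return fold;
    }

    // Average the per-fold error estimates into a single number.
    public static double averageError(double[] foldErrors) {
        double sum = 0;
        for (double e : foldErrors) sum += e;
        return sum / foldErrors.length;
    }

    public static void main(String[] args) {
        int[] folds = foldOf(14, 10);   // 14 instances, 10-fold CV
        System.out.println(java.util.Arrays.toString(folds));
    }
}
```

10-fold is the usual compromise: LOO has low bias but trains n models and its estimate can have high variance, while 10-fold trains only 10.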
Understanding Output
Decision Tree Output (1)

J48 pruned tree
------------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8

=== Error on training data ===
Correctly Classified Instances    14      100      %
Incorrectly Classified Instances   0        0      %
Kappa statistic                    1
Mean absolute error                0
Root mean squared error            0
Relative absolute error            0      %
Root relative squared error        0      %
Total Number of Instances         14

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
  1        0        1          1       1        yes
  1        0        1          1       1        no

=== Confusion Matrix ===
 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no
Decision Tree Output (2)

=== Stratified cross-validation ===
Correctly Classified Instances     9       64.2857 %
Incorrectly Classified Instances   5       35.7143 %
Kappa statistic                    0.186
Mean absolute error                0.2857
Root mean squared error            0.4818
Relative absolute error           60      %
Root relative squared error       97.6586 %
Total Number of Instances         14

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
 0.778    0.6      0.7        0.778   0.737     yes
 0.4      0.222    0.5        0.4     0.444     no

=== Confusion Matrix ===
 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no
Performance Measures
• Accuracy & error rate
• Mean absolute error
• Root mean squared error (square root of the average quadratic loss)
• Confusion matrix (contingency table)
• True positive rate & false positive rate
• Precision & F-measure
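All of these can be read straight off the confusion matrix. The stdlib-Java sketch below (illustrative names, not WEKA code) reproduces the cross-validation numbers from the J48 output above: the matrix 7 2 / 3 2 gives accuracy 9/14 ≈ 0.643, kappa ≈ 0.186, and precision 0.7 / recall 0.778 for class yes.

```java
public class Metrics {
    // Confusion matrix convention: rows = actual class, cols = predicted.

    public static double accuracy(int[][] m) {
        double diag = 0, total = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m.length; j++) {
                total += m[i][j];
                if (i == j) diag += m[i][j];
            }
        return diag / total;
    }

    // Cohen's kappa: observed agreement corrected for chance agreement.
    public static double kappa(int[][] m) {
        int n = m.length;
        double total = 0;
        double[] rowSum = new double[n], colSum = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                total += m[i][j];
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
            }
        double pe = 0;   // expected agreement by chance
        for (int i = 0; i < n; i++) pe += rowSum[i] * colSum[i] / (total * total);
        return (accuracy(m) - pe) / (1 - pe);
    }

    public static double precision(int[][] m, int c) {
        double col = 0;   // everything predicted as class c
        for (int i = 0; i < m.length; i++) col += m[i][c];
        return m[c][c] / col;
    }

    public static double recall(int[][] m, int c) {
        double row = 0;   // everything actually in class c
        for (int j = 0; j < m.length; j++) row += m[c][j];
        return m[c][c] / row;
    }

    public static void main(String[] args) {
        int[][] m = {{7, 2}, {3, 2}};   // rows = actual (yes, no)
        System.out.printf("acc=%.4f kappa=%.3f prec(yes)=%.3f rec(yes)=%.3f%n",
                accuracy(m), kappa(m), precision(m, 0), recall(m, 0));
    }
}
```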
Decision Tree Pruning
• Overcomes over-fitting
• Pre-pruning and post-pruning
• Reduced-error pruning
• Subtree raising with different confidence levels
• Compare tree size and accuracy
Subtree Replacement
• Bottom-up: a tree is considered for replacement once all of its subtrees have been considered
Subtree Raising
• Deletes a node and redistributes its instances
• Slower than subtree replacement
Naïve Bayesian Classifier
• Outputs a CPT and the same set of performance measures
• By default, uses a normal distribution to model numeric attributes
• A kernel density estimator can improve performance when the normality assumption is incorrect (-K option)
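Under the default normality assumption, the class-conditional likelihood of a numeric attribute value is just the normal density evaluated at that value, using the per-class mean and standard deviation estimated from the training data. A minimal sketch (hypothetical class name, not WEKA's code):

```java
public class GaussianLikelihood {
    // Normal density N(x; mean, sd): how a Naive Bayes classifier scores
    // a numeric attribute value under the default distribution assumption.
    public static double pdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    public static void main(String[] args) {
        // e.g. a humidity reading of 85 under a class whose training
        // humidity values have mean 79.1 and standard deviation 10.2
        System.out.println(pdf(85, 79.1, 10.2));
    }
}
```

A kernel density estimator instead places a small Gaussian on each training value and averages them, which helps when the attribute is skewed or multi-modal.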
Data Sets to Work On
• Data sets were preprocessed into ARFF format
• Three data sets from the UCI repository
• Two data sets from computational biology:
  – protein function prediction
  – surface residue prediction
Protein Function Prediction
• Build a decision tree classifier that assigns protein sequences to functional families based on characteristic motif compositions
• Each attribute (motif) has a Prosite accession number: PS####
• Class labels use Prosite doc IDs: PDOC####
• 73 attributes (binary) & 10 classes (PDOC)
• Suggested method: use 10-fold CV and prune the tree with the subtree-raising method
Surface Residue Prediction
• Prediction is based on the identity of the target residue and its 4 sequence neighbors (window size = 5)
• Is the target residue on the surface or not?
• 5 attributes and binary classes
• Suggested method: use a Naïve Bayesian classifier with no kernels

[Sliding window: X1 X2 X3 X4 X5]
Your Turn to Test Drive!