An Exercise in Machine Learning
http://www.cs.iastate.edu/~cs573x/bbsilab.html
• Machine Learning Software
• Preparing Data
• Building Classifiers
• Interpreting Results
• Test-driving WEKA
Machine Learning Software
• Suites (general purpose): WEKA (source: Java), MLC++ (source: C++), SIPINA, and the list from KDnuggets (various)
• Specific tools: classification (C4.5, SVMlight), association rule mining, Bayesian nets, …
• Commercial vs. free vs. programming your own
What does WEKA do?
• Implements state-of-the-art learning algorithms
• Main strengths in classification; regression, association rule, and clustering algorithms are also included
• Extensible to try new learning schemes
• Large variety of handy tools (transforming datasets, filters, visualization, etc.)
WEKA resources
• API documentation, tutorial, source code
• WEKA mailing list
• Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
• Weka-related projects:
  – Weka-Parallel: parallel processing for Weka
  – RWeka: linking R and Weka
  – YALE: Yet Another Learning Environment
  – many others…
Getting StartedGetting Started
Installation (Java runtime +WEKA)Installation (Java runtime +WEKA) Setting up the environment Setting up the environment
((CLASSPATHCLASSPATH)) Reference Book and online API Reference Book and online API
documentdocument Preparing Data setsPreparing Data sets Running WEKARunning WEKA Interpreting ResultsInterpreting Results
ARFF Data Format
• Attribute-Relation File Format
• Header: describes the attribute types
• Data: instances (examples) as comma-separated lists
• Use the right data format: Filestem, CSV, or ARFF
• Use C45Loader and CSVLoader to convert other formats to ARFF
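As a sketch of the format, here is a trimmed version of the classic weather dataset that WEKA ships with (the same data the J48 output later in this exercise is built from); the header lines are standard ARFF syntax, and the data rows shown are illustrative:

```text
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,FALSE,no
sunny,70,TRUE,yes
overcast,83,FALSE,yes
rainy,96,FALSE,yes
rainy,91,TRUE,no
```

The last attribute in the header is taken as the class by default in the command-line interface; missing values are written as `?`.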
Launching WEKA
Load Dataset into WEKA
Data Filters
• Useful support for data preprocessing: removing or adding attributes, resampling the dataset, removing examples, etc.
• Can create stratified cross-validation folds of a given dataset, so that class distributions are approximately retained within each fold
• Typically, split the data as 2/3 training and 1/3 testing
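The stratified 2/3–1/3 split can be sketched in plain Java (no WEKA classes; `StratifiedSplit` and its method name are invented for illustration): group instance indices by class label and take the first two thirds of each group for training, so each class keeps roughly its original proportion in both parts.

```java
import java.util.*;

public class StratifiedSplit {
    // Split instance indices ~2/3 train / 1/3 test while keeping the
    // class distribution approximately the same in both parts.
    public static List<List<Integer>> split(String[] labels) {
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++)
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);

        List<Integer> train = new ArrayList<>(), test = new ArrayList<>();
        for (List<Integer> idx : byClass.values()) {
            int cut = (2 * idx.size()) / 3;   // 2/3 of this class to training
            train.addAll(idx.subList(0, cut));
            test.addAll(idx.subList(cut, idx.size()));
        }
        return List.of(train, test);
    }

    public static void main(String[] args) {
        // 9 "yes" / 5 "no", as in the weather data
        String[] y = {"yes","yes","yes","yes","yes","yes","yes","yes","yes",
                      "no","no","no","no","no"};
        List<List<Integer>> parts = split(y);
        System.out.println("train=" + parts.get(0).size()
                         + " test=" + parts.get(1).size());
    }
}
```

In practice you would shuffle within each class first; WEKA's own filters take care of this for you.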
Building Classifiers
• A classifier model is a mapping from dataset attributes to the class (target) attribute; how it is created and what form it takes differ by algorithm
• Decision tree and Naïve Bayes classifiers
• Which one is the best? No free lunch!
Building Classifiers
(1) weka.classifiers.rules.ZeroR
    Builds and uses a 0-R classifier: predicts the mean (for a numeric class) or the mode (for a nominal class).
(2) weka.classifiers.bayes.NaiveBayes
    Class for building a Naive Bayesian classifier.
(3) weka.classifiers.trees.J48
    Class for generating an unpruned or a pruned C4.5 decision tree.
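For intuition, 0-R's nominal-class behavior fits in a few lines of plain Java (a sketch, not WEKA's implementation; the class name `ZeroRSketch` is invented):

```java
import java.util.*;

public class ZeroRSketch {
    // 0-R ignores every attribute and always predicts the majority
    // (modal) class observed in the training data.
    public static String fit(String[] classValues) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : classValues) counts.merge(c, 1, Integer::sum);
        return Collections.max(counts.entrySet(),
                               Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        String[] y = {"yes", "yes", "yes", "no", "no"};
        System.out.println(fit(y));   // the majority class
    }
}
```

Any learned classifier should at least beat this baseline; for a numeric class, 0-R would return the mean instead of the mode.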
Test Options
• Percentage split (2/3 training; 1/3 testing)
• Cross-validation: estimates the generalization error by resampling when data are limited; the per-fold error estimates are averaged
  – stratified
  – 10-fold
  – leave-one-out (LOO)
  – 10-fold vs. LOO
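The fold bookkeeping behind k-fold cross-validation can be sketched as follows (plain Java with hypothetical names; stratification is omitted for brevity): each instance is assigned to one of k folds, each fold serves once as the test set, and with k equal to the number of instances the scheme reduces to leave-one-out.

```java
public class KFold {
    // Assign each of n instances to one of k folds, round-robin.
    // k == n gives leave-one-out.
    public static int[] foldOf(int n, int k) {
        int[] fold = new int[n];
        for (int i = 0; i < n; i++) fold[i] = i % k;
        return fold;
    }

    // Average the per-fold error estimates into a single number.
    public static double averageError(double[] foldErrors) {
        double sum = 0;
        for (double e : foldErrors) sum += e;
        return sum / foldErrors.length;
    }

    public static void main(String[] args) {
        int[] folds = foldOf(14, 10);   // 14 instances, 10-fold CV
        System.out.println(java.util.Arrays.toString(folds));
    }
}
```

10-fold is the usual compromise: LOO has low bias but trains n models and its estimate can have high variance, while 10-fold trains only 10.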
Understanding Output
Decision Tree Output (1)

J48 pruned tree
------------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8

=== Error on training data ===
Correctly Classified Instances    14      100      %
Incorrectly Classified Instances   0        0      %
Kappa statistic                    1
Mean absolute error                0
Root mean squared error            0
Relative absolute error            0      %
Root relative squared error        0      %
Total Number of Instances         14

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
  1        0        1          1       1        yes
  1        0        1          1       1        no

=== Confusion Matrix ===
 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no
Decision Tree Output (2)

=== Stratified cross-validation ===
Correctly Classified Instances     9       64.2857 %
Incorrectly Classified Instances   5       35.7143 %
Kappa statistic                    0.186
Mean absolute error                0.2857
Root mean squared error            0.4818
Relative absolute error           60      %
Root relative squared error       97.6586 %
Total Number of Instances         14

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
 0.778    0.6      0.7        0.778   0.737     yes
 0.4      0.222    0.5        0.4     0.444     no

=== Confusion Matrix ===
 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no
Performance Measures
• Accuracy & error rate
• Mean absolute error
• Root mean squared error (square root of the average quadratic loss)
• Confusion matrix (contingency table)
• True positive rate & false positive rate
• Precision & F-measure
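All of these can be read straight off the confusion matrix. The stdlib-Java sketch below (illustrative names, not WEKA code) reproduces the cross-validation numbers from the J48 output above: the matrix 7 2 / 3 2 gives accuracy 9/14 ≈ 0.643, kappa ≈ 0.186, and precision 0.7 / recall 0.778 for class yes.

```java
public class Metrics {
    // Confusion matrix convention: rows = actual class, cols = predicted.

    public static double accuracy(int[][] m) {
        double diag = 0, total = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m.length; j++) {
                total += m[i][j];
                if (i == j) diag += m[i][j];
            }
        return diag / total;
    }

    // Cohen's kappa: observed agreement corrected for chance agreement.
    public static double kappa(int[][] m) {
        int n = m.length;
        double total = 0;
        double[] rowSum = new double[n], colSum = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                total += m[i][j];
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
            }
        double pe = 0;   // expected agreement by chance
        for (int i = 0; i < n; i++) pe += rowSum[i] * colSum[i] / (total * total);
        return (accuracy(m) - pe) / (1 - pe);
    }

    public static double precision(int[][] m, int c) {
        double col = 0;   // everything predicted as class c
        for (int i = 0; i < m.length; i++) col += m[i][c];
        return m[c][c] / col;
    }

    public static double recall(int[][] m, int c) {
        double row = 0;   // everything actually in class c
        for (int j = 0; j < m.length; j++) row += m[c][j];
        return m[c][c] / row;
    }

    public static void main(String[] args) {
        int[][] m = {{7, 2}, {3, 2}};   // rows = actual (yes, no)
        System.out.printf("acc=%.4f kappa=%.3f prec(yes)=%.3f rec(yes)=%.3f%n",
                accuracy(m), kappa(m), precision(m, 0), recall(m, 0));
    }
}
```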
Decision Tree Pruning
• Overcomes over-fitting
• Pre-pruning and post-pruning
• Reduced-error pruning
• Subtree raising with different confidence levels
• Compare tree size and accuracy
Subtree Replacement
• Bottom-up: a tree is considered for replacement once all of its subtrees have been considered
Subtree Raising
• Deletes a node and redistributes its instances
• Slower than subtree replacement
Naïve Bayesian Classifier
• Outputs a CPT and the same set of performance measures
• By default, uses a normal distribution to model numeric attributes
• A kernel density estimator can improve performance when the normality assumption is incorrect (-K option)
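Under the default normality assumption, the class-conditional likelihood of a numeric attribute value is just the normal density evaluated at that value, using the per-class mean and standard deviation estimated from the training data. A minimal sketch (hypothetical class name, not WEKA's code):

```java
public class GaussianLikelihood {
    // Normal density N(x; mean, sd): how a Naive Bayes classifier scores
    // a numeric attribute value under the default distribution assumption.
    public static double pdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    public static void main(String[] args) {
        // e.g. a humidity reading of 85 under a class whose training
        // humidity values have mean 79.1 and standard deviation 10.2
        System.out.println(pdf(85, 79.1, 10.2));
    }
}
```

A kernel density estimator instead places a small Gaussian on each training value and averages them, which helps when the attribute is skewed or multi-modal.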
Data Sets to Work On
• Data sets were preprocessed into ARFF format
• Three data sets from the UCI repository
• Two data sets from computational biology:
  – protein function prediction
  – surface residue prediction
Protein Function Prediction
• Build a decision tree classifier that assigns protein sequences to functional families based on characteristic motif compositions
• Each attribute (motif) has a Prosite accession number: PS####
• Class labels use Prosite doc IDs: PDOC####
• 73 attributes (binary) & 10 classes (PDOC)
• Suggested method: use 10-fold CV and prune the tree with the subtree-raising method
Surface Residue Prediction
• Prediction is based on the identity of the target residue and its 4 sequence neighbors (window size = 5)
• Is the target residue on the surface or not?
• 5 attributes and binary classes
• Suggested method: use a Naïve Bayesian classifier with no kernels

[Sliding window: X1 X2 X3 X4 X5]
Your Turn to Test Drive!