Weka Tutorial. D. De Cao, R. Basili. Web Mining and Retrieval course (Corso di Web Mining e Retrieval), academic year 2008-9. April 15, 2009

Weka Tutorial

D. De Cao, R. Basili

Web Mining and Retrieval course (Corso di Web Mining e Retrieval), academic year 2008-9

April 15, 2009

What is WEKA?

Collection of ML algorithms - open-source Java package
  Site: http://www.cs.waikato.ac.nz/ml/weka/
  Documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html

Schemes for classification include:
  decision trees, rule learners, naive Bayes, decision tables, locally weighted regression, SVMs, instance-based learners, logistic regression, voted perceptrons, multi-layer perceptron

Schemes for numeric prediction include:
  linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, multi-layer perceptron

Meta-schemes include:
  bagging, boosting, stacking, regression via classification, classification via regression, cost-sensitive classification

Schemes for clustering:
  EM and Cobweb

ARFF File Format

An ARFF file requires @RELATION, @ATTRIBUTE and @DATA declarations.

The @RELATION declaration associates a name with the dataset:
  @RELATION <relation-name>

The @ATTRIBUTE declaration specifies the name and type of an attribute:
  @ATTRIBUTE <attribute-name> <datatype>
The datatype can be numeric, nominal, string or date.

  @ATTRIBUTE sepallength NUMERIC
  @ATTRIBUTE petalwidth NUMERIC
  @ATTRIBUTE class {Setosa,Versicolor,Virginica}

The @DATA declaration is a single line denoting the start of the data segment.

Missing values are represented by ?

  @DATA
  1.4, 0.2, Setosa
  1.4, ?, Versicolor
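The three declarations above are enough to read a small dense ARFF file by hand. The following is a minimal sketch in Python, not Weka's own parser: it assumes unquoted values without embedded commas, keeps every value as a string, and maps ? to None. The function name parse_arff and the relation name iris are illustrative assumptions.

```python
def parse_arff(text):
    """Parse a small, dense ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and % comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # A question mark denotes a missing value.
            rows.append([None if v == "?" else v for v in values])
            continue
        keyword = line.split(None, 1)[0].upper()
        if keyword == "@RELATION":
            relation = line.split(None, 1)[1]
        elif keyword == "@ATTRIBUTE":
            _, name, datatype = line.split(None, 2)
            attributes.append((name, datatype))
        elif keyword == "@DATA":
            in_data = True  # everything after this line is data
    return relation, attributes, rows


# Hypothetical relation name; attributes and data rows are taken from the slide.
sample = """@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Setosa,Versicolor,Virginica}
@DATA
1.4, 0.2, Setosa
1.4, ?, Versicolor
"""

relation, attributes, rows = parse_arff(sample)
print(relation, attributes, rows)
```

The second data row comes back with None in the petalwidth position, which keeps a missing value distinct from any real number.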

ARFF Sparse File Format

Similar to ARFF files, except that data values of 0 are not represented.

Non-zero attributes are specified by attribute number (counted from 0) and value.

Full:
  @DATA
  0, X, 0, Y, "class A"
  0, 0, W, 0, "class B"

Sparse:
  @DATA
  {1 X, 3 Y, 4 "class A"}
  {2 W, 4 "class B"}

Note that the omitted values in a sparse instance are 0; they are not missing values! If a value is unknown, you must explicitly represent it with a question mark (?).
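The correspondence between the full and sparse forms can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Weka code: the helper names to_sparse and to_dense are hypothetical, zeros are the integer 0, and values must not contain commas.

```python
def to_sparse(row):
    """Format a dense row as a sparse ARFF instance, omitting values equal to 0."""
    items = [f"{index} {value}" for index, value in enumerate(row) if value != 0]
    return "{" + ", ".join(items) + "}"


def to_dense(sparse, width):
    """Expand a sparse instance to `width` values; omitted entries become 0."""
    row = [0] * width
    body = sparse.strip()[1:-1]  # drop the surrounding braces
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        index, value = item.split(None, 1)
        row[int(index)] = value
    return row


full_rows = [
    [0, "X", 0, "Y", '"class A"'],
    [0, 0, "W", 0, '"class B"'],
]
for row in full_rows:
    print(to_sparse(row))

# An unknown value is written explicitly as "?", so it is never confused with 0.
print(to_sparse([0, "?", "W", 0, '"class B"']))
```

Because "?" is not equal to 0, it stays in the sparse output, which is exactly the distinction the note above insists on.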

ARFF Sparse File Format: Problem

There is a known problem saving SparseInstance objects from datasets that have string attributes.

In Weka, string and nominal data values are stored as numbers;
the string at position 0 is mapped to the value "0", and the sparse format does not write values of 0.
If the file is read back in, the first string is missing from Instances.

Solution
Put a dummy string in position 0 when writing a SparseInstance with strings;
the dummy will be ignored while writing, and the actual instance will be written properly.
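The effect can be imitated with a toy model in Python. This is only a sketch of the mechanism described above, not Weka's implementation: write_sparse and read_sparse are hypothetical helpers, a single string attribute is assumed, and the strings alpha, beta and dummy are made up for the example.

```python
def write_sparse(string_table, stored_index):
    """Toy sparse writer for one string attribute (attribute number 0).

    The instance stores an index into string_table. An index of 0 looks
    like the value 0, so it is omitted, as any zero is in sparse format.
    """
    if stored_index == 0:
        return "{}"
    return "{0 " + string_table[stored_index] + "}"


def read_sparse(lines):
    """Toy reader: rebuilds the string table from the values it actually sees."""
    table, values = [], []
    for line in lines:
        body = line.strip()[1:-1].strip()
        if not body:
            values.append(None)  # nothing was written, so the string is lost
            continue
        _, text = body.split(None, 1)
        if text not in table:
            table.append(text)
        values.append(text)
    return table, values


# Problem: "alpha" sits at position 0 and silently disappears.
strings = ["alpha", "beta"]
lines = [write_sparse(strings, i) for i in (0, 1)]
lost_table, lost_values = read_sparse(lines)

# Workaround: a dummy occupies position 0, so real strings get indices >= 1.
padded = ["dummy", "alpha", "beta"]
lines = [write_sparse(padded, i) for i in (1, 2)]
kept_table, kept_values = read_sparse(lines)

print(lost_table, lost_values)
print(kept_table, kept_values)
```

In the first run the reader never sees alpha; in the second run both real strings survive and the dummy never appears in the output.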

Running Learning Schemes

java -Xmx512m -cp weka.jar <learner class> [options]

Example learner classes:
  Decision Tree: weka.classifiers.trees.J48 (Quinlan 1993 [1])
  Naive Bayes: weka.classifiers.bayes.NaiveBayes
  k-NN: weka.classifiers.lazy.IBk

Important generic options:
  -t <training file>  Specify the training file
  -T <test file>  Specify the test file. If none is given, a cross-validation is performed on the training data
  -x <number of folds>  Number of folds for cross-validation
  -l <input file>  Use a saved model
  -d <output file>  Output the model to a file
  -split-percentage <train size>  Percentage of the data to use as the training set
  -c <class index>  Index of the attribute to use as class (NB: the index starts from 1)
  -p <attribute index>  Only output the predictions and one attribute (0 for none) for all test instances
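When experiments are scripted, the command line above can be assembled programmatically. The sketch below builds the argument list in Python using only options listed on this slide; the function name weka_command, its parameter names and the file name iris.arff are assumptions made for illustration, and nothing is executed.

```python
def weka_command(learner, train=None, test=None, folds=None, class_index=None,
                 model_out=None, model_in=None, jar="weka.jar", heap="512m"):
    """Build the argument list for: java -Xmx<heap> -cp <jar> <learner> [options]."""
    argv = ["java", f"-Xmx{heap}", "-cp", jar, learner]
    options = [
        ("-t", train),        # training file
        ("-T", test),         # test file
        ("-x", folds),        # number of cross-validation folds
        ("-c", class_index),  # class attribute index, counted from 1
        ("-d", model_out),    # write the trained model to this file
        ("-l", model_in),     # load a previously saved model
    ]
    for flag, value in options:
        if value is not None:
            argv += [flag, str(value)]
    return argv


# Hypothetical run: 10-fold cross-validation of J48 on iris.arff,
# with the third attribute as the class.
argv = weka_command("weka.classifiers.trees.J48",
                    train="iris.arff", folds=10, class_index=3)
print(" ".join(argv))
```

Provided Java and weka.jar are available, the resulting list could be handed to subprocess.run; keeping it as a list avoids shell quoting problems with file names that contain spaces.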

[1] Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 1993.