charles-preston
The Weka
The weka is a well-known bird of New Zealand.
W(aikato) E(nvironment) for K(nowledge) A(nalysis)
Developed by the University of Waikato in New Zealand
It is a comprehensive suite of Java class libraries
Implements many state-of-the-art machine learning and data mining algorithms
It supports data files like CSV (comma-separated values), ARFF (Attribute-Relation File Format)…
Collection of ML(Machine Learning) algorithms – open-source Java package
Schemes for classification include: decision trees, rule learners, naive Bayes, decision tables, locally weighted regression, SVMs, instance-based learners, logistic regression, voted perceptrons, multi-layer perceptron
Schemes for numeric prediction include: linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, multi-layer perceptron
Meta-schemes include: Bagging, boosting, stacking, regression via classification, classification via regression, cost sensitive classification
Schemes for clustering: EM and Cobweb
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators + 10 search algorithms for feature selection
3 algorithms for finding association rules
3 graphical user interfaces:
“The Explorer” (exploratory data analysis)
“The Experimenter” (experimental environment)
“The Knowledge Flow” (new process-model-inspired interface)
ARFF files require declarations of @RELATION, @ATTRIBUTE and @DATA
@RELATION declaration associates a name with the dataset
Syntax: @RELATION <relation-name>
E.g. @RELATION stud
@ATTRIBUTE declaration specifies the name and type of an attribute
Syntax: @ATTRIBUTE <attribute-name> <datatype>
Datatype can be numeric, nominal, string or date
E.g. @ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA declaration is a single line denoting the start of the data segment
Missing values are represented by ?
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, ?, 1.4, ?, Iris-versicolor
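The three declarations can be illustrated with a small reader. The following is a minimal Python sketch, not Weka's own parser: the function name `parse_arff` and the trimmed two-attribute sample are assumptions made for illustration, and the sketch deliberately ignores quoted values that contain commas, sparse rows, and date formats.

```python
def parse_arff(text):
    """Parse a small dense ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []   # (name, type) pairs; a nominal type is a list of labels
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and % comments
            continue
        if in_data:
            # '?' marks a missing value; it is represented here as None.
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])
            continue
        keyword = line.split(None, 1)[0].lower()   # keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, type_spec = line.split(None, 2)
            if type_spec.startswith("{"):
                labels = [s.strip() for s in type_spec.strip("{}").split(",")]
                attributes.append((name, labels))
            else:
                attributes.append((name, type_spec.lower()))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Hypothetical sample, reduced to two numeric attributes plus the class.
SAMPLE = """@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1, 0.2, Iris-setosa
4.9, ?, Iris-versicolor
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

Here the second data row comes back with `None` in place of the `?`, which is one simple way to carry missing values through later processing.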
In addition to nominal and numeric attributes, exemplified by the weather data, the ARFF format has two further attribute types: string attributes and date attributes. String attributes have values that are textual. Suppose you have a string attribute that you want to call description. In the block defining the attributes, it is specified as follows:
@attribute description string
Then, in the instance data, include any character string in quotation marks (to include quotation marks in your string, use the standard convention of preceding each one by a backslash, \). Strings are stored internally in a string table and represented by their address in that table. Thus two strings that contain the same characters will have the same value.
Date attributes are strings with a special format and are introduced like this:
@attribute today date
(for an attribute called today). Weka, the machine learning software discussed in Part II of this book, uses the ISO-8601 combined date and time format yyyy-MM-ddTHH:mm:ss with four digits for the year, two each for the month and day, then the letter T followed by the time with two digits for each of hours, minutes, and seconds. In the data section of the file, dates are specified as the corresponding string representation of the date and time, for example, 2004-04-03T12:00:00. Although they are specified as strings, dates are converted to numeric form when the input file is read. Dates can also be converted internally to different formats, so you can have absolute timestamps in the data file and use transformations to forms such as time of day or day of the week to detect periodic behavior.
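The conversion from date string to number, and from there to a periodic feature such as day of the week, might look like the sketch below. It is written in Python rather than with Weka's classes, and it rests on two assumptions that the text does not state: that the timestamp is interpreted as UTC, and that the numeric form is milliseconds since 1970-01-01. The helper names are hypothetical.

```python
from datetime import datetime, timezone

# Python equivalent of the pattern yyyy-MM-ddTHH:mm:ss
ARFF_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_arff_date(text):
    """Parse a date string and attach UTC (an assumption of this sketch)."""
    parsed = datetime.strptime(text, ARFF_DATE_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def to_millis(moment):
    """Numeric form: milliseconds since the Unix epoch (assumed here)."""
    return int(moment.timestamp() * 1000)


def day_of_week(moment):
    """Periodic feature: 0 = Monday ... 6 = Sunday."""
    return moment.weekday()


moment = parse_arff_date("2004-04-03T12:00:00")
millis = to_millis(moment)      # 1080993600000
weekday = day_of_week(moment)   # 5, a Saturday
hour = moment.hour              # 12, the time-of-day feature
```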
Sparse ARFF files are similar to ARFF files except that data values of 0 are not represented
Non-zero attributes are specified by attribute number and value
For examples of ARFF files see $WEKAHOME/data
Dense format:
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
Sparse format:
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
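The mapping from a dense row to its sparse form can be sketched in a few lines of Python. The function name `to_sparse` is hypothetical, attribute numbers are taken to start at 0 as in the example above, and values containing a space are quoted; other quoting rules are left out of this sketch.

```python
def to_sparse(values, zero="0"):
    """Render one dense row as a sparse ARFF row, omitting zero values."""
    parts = []
    for index, value in enumerate(values):
        if value == zero:
            continue                      # zeros are simply not written
        text = f'"{value}"' if " " in value else value
        parts.append(f"{index} {text}")   # attribute number, then value
    return "{" + ", ".join(parts) + "}"


row_a = to_sparse(["0", "X", "0", "Y", "class A"])   # {1 X, 3 Y, 4 "class A"}
row_b = to_sparse(["0", "0", "W", "0", "class B"])   # {2 W, 4 "class B"}
```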
-t <training file> Specify training file
-T <test file> Specify test file; if none, CV is performed on training data
-x <number of folds> Number of folds for cross-validation
-s <random number seed> For CV
-l <input file> Use saved model
-d <output file> Output model to file
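As a rough illustration of how these options combine, the Python sketch below assembles an argument list for a classifier run. The helper `build_weka_command`, the classpath `weka.jar`, and the file names are assumptions; `weka.classifiers.trees.J48` is used only as an example classifier class, and the sketch builds the command without executing it.

```python
def build_weka_command(classifier, train_file, test_file=None, folds=None,
                       seed=None, model_out=None, classpath="weka.jar"):
    """Assemble the argument list for a command-line classifier run."""
    command = ["java", "-cp", classpath, classifier, "-t", train_file]
    if test_file is not None:
        command += ["-T", test_file]        # otherwise CV runs on the training data
    if folds is not None:
        command += ["-x", str(folds)]       # number of cross-validation folds
    if seed is not None:
        command += ["-s", str(seed)]        # random seed for cross-validation
    if model_out is not None:
        command += ["-d", model_out]        # save the trained model to a file
    return command


# Hypothetical run: 10-fold cross-validation with seed 1, no separate test file.
cv_command = build_weka_command("weka.classifiers.trees.J48", "train.arff",
                                folds=10, seed=1)
```

The resulting list could be passed to `subprocess.run` on a machine where Java and the Weka jar are available.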
Internal variables are private; they should have protected or package-level access
SparseInstance for Strings requires dummy at index 0
Problem: Strings are mapped into internal indices to an array. The String at position 0 is mapped to value “0”. When written out as a SparseInstance, it will not be written (0 value). If read back in, the first String is missing from Instances.
Solution: Put a dummy string in position 0 when writing a SparseInstance with strings. The dummy will be ignored while writing, and the actual instance will be written properly.
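The problem and the workaround can be simulated without Weka. The Python sketch below models a string table as a plain list and a sparse writer as a filter that drops any value stored as index 0; the function name and the dummy label are hypothetical, and this is a simplified model, not Weka's actual SparseInstance code.

```python
def sparse_round_trip(strings, use_dummy):
    """Return the strings that survive a sparse write followed by a read."""
    # With the workaround, a placeholder occupies index 0 of the string table.
    table = ["dummy"] if use_dummy else []
    survivors = []
    for value in strings:
        if value not in table:
            table.append(value)
        index = table.index(value)   # internal numeric value of the string
        if index != 0:               # a sparse writer omits values equal to 0
            survivors.append(value)
    return survivors


lost = sparse_round_trip(["red", "green"], use_dummy=False)   # ["green"]
kept = sparse_round_trip(["red", "green"], use_dummy=True)    # ["red", "green"]
```

Without the dummy, the first string sits at index 0 and disappears on output; with the dummy in place, every real string has a non-zero index and is written.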