DMDW Lab - Bilal Ahmed Shaik


EXPERIMENT NO: 01

AIM:- Create a data file in ARFF format.

Problem Statement:- Create a data file in ARFF format manually, using any editor such as Notepad or the Turbo editor.

Theory: - An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software.

ARFF files have two distinct sections, Header information and Data information. The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. The ARFF Header section of the file contains the relation declaration and the attribute declarations.

The @relation Declaration:- The relation name is defined as the first line in the ARFF file. The format is: @relation <relation-name>, where <relation-name> is a string. The string must be quoted if the name includes spaces.

    Examples:-

@RELATION iris
@RELATION bank

The @attribute Declarations:- Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement, which uniquely defines the name of that attribute and its data type. The order in which the attributes are declared indicates the column position in the data section of the file. For example, if an attribute is the third one declared, then Weka expects that all of that attribute's values will be found in the third comma-delimited column. The format for the @attribute statement is: @attribute <attribute-name> <datatype>, where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name then the entire name must be quoted. The <datatype> can be any of the four types currently (version 3.2.1) supported by Weka: numeric, <nominal-specification>, string, date [<date-format>].

Numeric attributes can be real or integer numbers. Nominal attributes:- Nominal values are defined by providing a <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}. For example, @ATTRIBUTE gender {male, female}. String attributes allow us to create attributes containing arbitrary textual values. Date attribute declarations take the form @attribute <name> date [<date-format>], where <name> is the name for the attribute and <date-format> is an optional string specifying how date values should be parsed and printed (this is the same format used by SimpleDateFormat). The default format string accepts the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss".

    Data Types:

    1. Numeric: Integer and Real.

    2. String

3. Nominal (Boolean values or a range of values; multi-valued attributes)

    4. Date

Examples:-

@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {First, Second, Third}
@ATTRIBUTE name STRING
@ATTRIBUTE dob DATE
@ATTRIBUTE doj DATE "yyyy-MM-dd HH:mm:ss"

ARFF Data Section:- The ARFF Data section of the file contains the data declaration line and the actual instance lines. The @data declaration is a single line denoting the start of the data segment in the file. The format is: @data. Each instance is represented on a single line, with carriage returns denoting the end of the instance. Attribute values for each instance are delimited by commas. They must appear in the order that they were declared in the header section (i.e., the data corresponding to the nth @attribute declaration is always the nth comma-delimited field).

    Examples:-

@DATA
5.1,3.5,1.4,0.2,Iris-setosa

    Example

@relation bank_data
@attribute name string
@attribute sex {FEMALE,MALE}
@attribute region {INNER_CITY,TOWN,RURAL,SUBURBAN}
@attribute income numeric
@attribute married {NO,YES}
@attribute car {NO,YES}
@data
Xyz,FEMALE,INNER_CITY,17546,NO,YES
Abc,MALE,RURAL,100000,YES,YES

    RESULT:-

    The data file in ARFF format is successfully created.


    Viva-voce questions:-

1. How are missing values represented?

2. Write the sparse ARFF output for the given data: @data 0, X, 0, Y, "class A"

    3. Explain Nominal attributes?

4. In your lab, which editor is used for creating an ARFF file?

    5. What is the file extension of an ARFF file?


EXPERIMENT NO: 02

AIM:- Transform Excel data into ARFF.

Problem Statement:- Convert the Excel data set into an ARFF data file and verify it with the WEKA tool.

    Procedure: -

    1. Convert the Excel data set into CSV (comma separated value) format.

    2. One easy way to do this is to load it into Excel and use Save As to save the file in

    CSV format.

    3. Edit the CSV file, and add the ARFF header information to the file.

4. This involves creating the @relation line, one @attribute line per attribute, and @data to signify the start of data.

    5. Finally save this file with (.arff) extension.

    6. It is also considered good practice to add comments at the top of the file describing

    where you obtained this data set, what its summary characteristics are, etc.

    7. A comment in the ARFF format is started with the percent character % and continues

    until the end of the line.

    8. Open this file with WEKA tool.
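The same conversion can also be scripted with Weka's converter classes instead of hand-editing the header. A minimal Java sketch (the file names bank.csv and bank.arff are assumptions; weka.jar must be on the classpath):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class Csv2Arff {
    public static void main(String[] args) throws Exception {
        // Load the CSV file exported from Excel (file name is an assumption).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank.csv"));
        Instances data = loader.getDataSet();

        // Save in ARFF format; the @relation/@attribute header is generated automatically.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("bank.arff"));
        saver.writeBatch();
    }
}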

RESULT:- The Excel data is transformed into ARFF.

    Viva-voce questions:-

    1. What is the meaning of CSV?

    2. How to write comments in ARFF file?

    3. How to open ARFF file?


EXPERIMENT NO: 03

AIM: - Study of the WEKA Tool.

Problem Statement: - Study the various functionalities of the WEKA tool.

Theory: - Waikato Environment for Knowledge Analysis (WEKA) is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a data set or called from our own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

Simple CLI: The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters, clusterers, etc., but without the hassle of the CLASSPATH (it uses the one with which Weka was started). It offers a simple Weka shell with separated command line and output. The following commands are available in the Simple CLI:

java <classname> [<args>]: invokes a Java class with the given arguments.
break: stops the current thread, e.g., a running classifier, in a friendly manner.
kill: stops the current thread in an unfriendly fashion.
cls: clears the output area.
exit: exits the Simple CLI.
help [<command>]: provides an overview of the available commands if used without a command name as argument, otherwise more help on the specified command.
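For example, typing the following in the Simple CLI trains the J48 decision tree learner on a data file (the file name iris.arff is an assumption; -t names the training file):

java weka.classifiers.trees.J48 -t iris.arff
help java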

Weka Knowledge Explorer:- The Weka Knowledge Explorer is an easy-to-use graphical user interface that harnesses the power of the Weka software. Each of the major Weka packages (Filters, Classifiers, Clusterers, Associations, and Attribute Selection) is represented in the Explorer, along with a Visualization tool which allows datasets and the predictions of classifiers and clusterers to be visualized in two dimensions.

Explorer tabs are as follows:

Preprocess: Choose and modify the data being acted on.

Opening Files: The first four buttons at the top of the preprocess section enable you to load data into WEKA:

Open File: Brings up a dialog box allowing you to browse for the data file on the local system.

Open URL: Asks for a Uniform Resource Locator address for where the data is stored.

    Open DB: Reads data from a database.

    Generate: Enables you to generate artificial data from a variety of Data Generators.

Using the Open file button you can read files in a variety of formats: Weka's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.

The Current Relation: Once some data has been loaded, the preprocess panel shows a variety of information. The Current relation box (the current relation is the currently loaded data, which can be interpreted as a single relational table in database terminology) has three entries:

1. Relation: The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.

    2. Instances: The number of instances (data points/records) in the data

    3. Attributes: The number of attributes (features) in the data.

Working with Attributes: Below the Current relation box is a box titled Attributes. There are four buttons, and beneath them is a list of the attributes in the current relation. The list has three columns:

    1. No. A number that identifies the attribute in order they are specified in data files

2. Selection tick boxes. These allow you to select which attributes are present in the relation.

    3. Name. The name of attribute, as it was declared in the data file.

When you click on different rows in the list of attributes, the fields change in the box to the right titled Selected attribute. This box displays the characteristics of the currently highlighted attribute in the list:


    1. Name. The name of the attribute, the same as that given in the attribute list.

    2. Type. The type of the attribute, most commonly Nominal or Numeric.

3. Missing. The number (and percentage) of instances in the data for which this attribute is missing (unspecified).

    4. Distinct. The number of different values that the data contains for this attribute.

5. Unique. The number (and percentage) of instances in the data having a value for this attribute that no other instances have.

Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled on/off by clicking on them individually. The four buttons above can also be used to change the selection:

    1. All: All boxes are ticked

    2. None: All boxes are cleared (unticked)

3. Invert: Boxes that are ticked become unticked and vice versa.

4. Pattern: Enables the user to select attributes based on a Perl 5 regular expression, e.g., .*_id selects all attributes whose names end with _id.

Once the desired attributes have been selected, they can be removed by clicking the Remove button below the list of attributes. Note that this can be undone by clicking the Undo button, which is located next to the Edit button in the top-right corner of the Preprocess panel.

Working with Filters: The preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the filters that are required. At the left of the Filter box is the Choose button. By clicking this button it is possible to select one of the filters in Weka. Once a filter has been selected, its name and options are shown in the field next to the Choose button. Clicking on this box with the left mouse button brings up a GenericObjectEditor dialog box. A click with the right mouse button (or Alt+Shift+left click) brings up a menu where you can choose either to display the properties in a GenericObjectEditor dialog box, or to copy the current setup string to the clipboard.

The GenericObjectEditor Dialog Box: The GenericObjectEditor dialog box lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers (see below). The fields in the window reflect the available options. Right-clicking (or Alt+Shift+Left-click) on such a field will bring up a popup menu, listing the following options:

1. Show properties... has the same effect as left-clicking on the field, i.e., a dialog appears allowing you to alter the settings.

2. Copy configuration to clipboard... copies the currently displayed configuration string to the system's clipboard, so it can be used anywhere else in WEKA or in the console. This is rather handy if you have to set up complicated, nested schemes.

3. Enter configuration... is the receiving end for configurations that got copied to the clipboard earlier on. In this dialog you can enter a classname followed by options (if the class supports these). This also allows you to transfer a filter setting from the Preprocess panel to a FilteredClassifier used in the Classify panel.

4. Applying Filters... once you have selected and configured a filter, you can apply it to the data by pressing the Apply button at the right end of the Filter panel in the Preprocess panel. The Preprocess panel will then show the transformed data. The change can be undone by pressing the Undo button. You can also use the Edit button to modify your data manually in a dataset editor. Finally, the Save button at the top right of the Preprocess panel saves the current version of the relation in file formats that can represent the relation, allowing it to be kept for future use.


Editing: You can also view the current dataset in a tabular format via the Edit button. Clicking this button opens the ArffViewer dialog, displaying the currently loaded data. You can edit the data, delete and rename attributes, delete instances, and undo modifications. The modifications are only applied if you click the OK button and return to the main Explorer window.

Explorer - Classification:- Train and test learning schemes that classify or perform regression.

Selecting a Classifier: - At the top of the Classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier and its options. Clicking on the text box with the left mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. With a right click (or Alt+Shift+left click) you can once again copy the setup string to the clipboard or display the properties in a GenericObjectEditor dialog box. The Choose button allows you to choose one of the classifiers that are available in WEKA.

Test Options: - The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:

1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.

    the instance it was trained on.

2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.

3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field (see the sketch after this list).

4. Percentage split. The classifier is evaluated on how well it predicts on a certain percentage of the data, which is held out for testing. The amount of data held out depends on the value entered in the % field.
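The cross-validation test mode can also be run from Java code. A minimal sketch (the file name weather.arff and the use of J48 are assumptions; the class is taken to be the last attribute):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1); // class = last attribute
        Evaluation eval = new Evaluation(data);
        // 10-fold cross-validation with a fixed random seed,
        // matching the Folds and Random seed fields described here.
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}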

Further testing options can be set by clicking on the More options button:

1. Output model. The classification model on the full training set is output so that it can be viewed, visualized, etc. This option is selected by default.

2. Output per-class stats. The precision/recall and true/false positive statistics for each class are output. This option is also selected by default.

3. Output entropy evaluation measures. Entropy evaluation measures are included in the output. This option is not selected by default.

4. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output. This option is selected by default.

5. Store predictions for visualization. The classifier's predictions are remembered so that they can be visualized. This option is selected by default.

6. Output predictions. The predictions on the evaluation data are output. Note that in the case of a cross-validation the instance numbers do not correspond to the location in the data!

7. Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix. The Set... button allows you to specify the cost matrix used.

8. Random seed for xval / % Split. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.

    9. Preserve order for % Split. This suppresses the randomization of the data

    before splitting into train and test set.

    10. Output source code. If the classifier can output the built model as Java source

    code, you can specify the class name here. The code will be printed in the

    classifier output area.


The Class Attribute:- The classifiers in WEKA are designed to be trained to predict a single class attribute, which is the target for prediction. Some classifiers can only learn nominal classes; others can only learn numeric classes (regression problems); still others can learn both. By default, the class is taken to be the last attribute in the data. If you want to train a classifier to predict a different attribute, click on the box below the Test options box to bring up a drop-down list of attributes to choose from.

Training a Classifier: - Once the classifier, test options and class have all been set, the learning process is started by clicking on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the training process at any time by clicking on the Stop button. When training is complete, several things happen. The classifier output area to the right of the display is filled with text describing the results of training and testing.

The Classifier Output Text: - The text in the classifier output area has scroll bars allowing you to browse the results. Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:

1. Run information. A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.

    name, instances, attribute and test mode that were involved in the process.

2. Classifier model (full training set). A textual representation of the classification model that was produced on the full training data.

3. The results of the chosen test mode are broken down thus:

4. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.

5. Detailed Accuracy By Class. A more detailed per-class break-down of the classifier's prediction accuracy.

6. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.

7. Source code (optional). This section lists the Java source code if one chose Output source code in the More options dialog.

The Result List:- After training several classifiers, the result list will contain several entries. Left-clicking the entries flicks back and forth between the various results that have been generated. Right-clicking an entry invokes a menu containing these items:

    1. View in main window. Shows the output in the main window (just like left-

    clicking the entry).

    2. View in separate window. Opens a new independent window for viewing the

    results.

3. Save result buffer. Brings up a dialog allowing you to save a text file containing the textual output.

    4. Load model. Loads a pre-trained model object from a binary file.

5. Save model. Saves a model object to a binary file. Objects are saved in Java serialized object form.

6. Re-evaluate model on current test set. Takes the model that has been built and tests its performance on the data set that has been specified with the Set button under the Supplied test set option.

7. Visualize classifier errors. Brings up a visualization window that plots the results of classification. Correctly classified instances are represented by crosses, whereas incorrectly classified ones show up as squares.

8. Visualize tree or Visualize graph. Brings up a graphical representation of the structure of the classifier model, if possible (i.e., for decision trees or Bayesian networks). The graph visualization option only appears if a Bayesian network


classifier has been built. In the tree visualizer, you can bring up a menu by right-clicking a blank area, pan around by dragging the mouse, and see the training instances at each node by clicking on it. CTRL-clicking zooms the view out, while SHIFT-dragging a box zooms the view in. The graph visualizer should be self-explanatory.

9. Visualize margin curve. Generates a plot illustrating the prediction margin. The margin is defined as the difference between the probability predicted for the actual class and the highest probability predicted for the other classes. For example, boosting algorithms may achieve better performance on test data by increasing the margins on the training data.

10. Visualize threshold curve. Generates a plot illustrating the tradeoffs in prediction that are obtained by varying the threshold value between the classes. For example, with the default threshold value of 0.5, the predicted probability of positive must be greater than 0.5 for the instance to be predicted as positive. The plot can be used to visualize the precision/recall tradeoff, for ROC curve analysis (true positive rate vs. false positive rate), and for other types of curves.

11. Visualize cost curve. Generates a plot that gives an explicit representation of the expected cost, as described by Drummond and Holte (2000).

12. Plugins. This menu item only appears if there are visualization plugins available (by default: none). More about these plugins can be found in the WekaWiki article "Explorer visualization plugins".

Explorer - Clustering: Learn clusters for the data.

Selecting a Clusterer: By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.

Cluster Modes: The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split (see Selecting a Classifier), except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.

An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines whether or not it will be possible to visualize the clusters once training is complete. When dealing with datasets that are so large that memory becomes a problem, it may be helpful to disable this option.

Ignoring Attributes: Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes are ignored.

Working with Filters: The FilteredClusterer meta-clusterer offers the user the possibility to apply filters directly before the clusterer is learned. This approach eliminates the manual application of a filter in the Preprocess panel, since the data gets processed on the fly. It is useful if one needs to try out different filter setups.

Learning Clusters: The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list. These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up a similar menu, except that it shows only two visualization options: Visualize cluster assignments and Visualize tree. The latter is grayed out when it is not applicable.


Explorer - Associating: Learn association rules for the data.

Setting Up: This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.

Learning Associations: Once appropriate parameters for the association rule learner have been set, click the Start button. When complete, right-clicking on an entry in the result list allows the results to be viewed or saved.

Explorer - Selecting Attributes: Select the most relevant attributes in the data.

Searching and Evaluating: Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evaluator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.

Options: The Attribute Selection Mode box has two options:

1. Use full training set. The worth of the attribute subset is determined using the full set of training data.

2. Cross-validation. The worth of the attribute subset is determined by a process of cross-validation. The Folds and Seed fields set the number of folds to use and the random seed used when shuffling the data.

As with Classify (see Selecting a Classifier), there is a drop-down box that can be used to specify which attribute to treat as the class.

Performing Selection: Clicking Start starts running the attribute selection process. When it is finished, the results are output into the result area, and an entry is added to the result list. Right-clicking on the result list gives several options. The first three (View in main window, View in separate window and Save result buffer) are the same as for the Classify panel. It is also possible to Visualize reduced data, or, if you have used an attribute transformer such as PrincipalComponents, Visualize transformed data. The reduced/transformed data can be saved to a file with the Save reduced data or Save transformed data option.

Explorer - Visualizing: View an interactive 2D plot of the data. WEKA's visualization section allows you to visualize 2D plots of the current relation.

The scatter plot matrix: When you select the Visualize panel, it shows a scatter plot matrix for all attributes, color coded according to the currently selected class. It is possible to change the size of each individual 2D plot and the point size, and to randomly jitter the data (to uncover obscured points). It is also possible to change the attribute used to color the plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to subsample the data. Note that changes will only come into effect once the Update button has been pressed.

Selecting an individual 2D scatter plot: When you click on a cell in the scatter plot matrix, this will bring up a separate window with a visualization of the scatter plot you selected. (We described above how to visualize particular results in a separate window, for example classifier errors; the same visualization controls are used here.) Data points are plotted in the main area of the window. At the top are two drop-down list buttons for selecting the axes to plot. The one on the left shows which attribute is used for the x-axis, the one on the right which is used for the y-axis. Below the x-axis selector is a drop-down list for choosing the color scheme. This allows you to color the points based on the attribute selected. Below the plot area, a legend describes what values the colors correspond to. If the values are discrete, you can modify the color used for each one by clicking on them and making an appropriate selection in the window that pops up. To the right of the plot area is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. These values are randomly scattered vertically to help you see concentrations of points. You can choose what axes are used in the graph by clicking on these strips. Left-clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking changes the y-axis. The 'X' and 'Y' written beside the strips show what the current axes are ('B' is used for both X and Y). Above the attribute strips is a slider labeled Jitter, which adds a random displacement to all points in the plot. Dragging it to the right increases the amount of jitter, which is useful for spotting


concentrations of points. Without jitter, a million instances at the same point would look no different from just a single lonely instance.

Selecting Instances: There may be situations where it is helpful to select a subset of the data using the visualization tool. (A special case of this is the UserClassifier in the Classify panel, which lets you build your own classifier by interactively selecting instances.) Below the y-axis selector button is a drop-down list button for choosing a selection method. A group of data points can be selected in four ways:

    1. Select Instance. Clicking on an individual data point brings up a window listing its

    attributes. If more than one point appears at the same location, more than one

    set of attributes is shown

    2. Rectangle. You can create a rectangle, by dragging, that selects the points

    inside it

    3. Polygon. You can build a free-form polygon that selects the points inside it. Left-

    click to add vertices to the polygon, right-click to complete it. The polygon will

    always be closed off by connecting the first point to the last

    4. Polyline. You can build a polyline that distinguishes the points on one side from

    those on the other. Left-click to add vertices to the polyline, right-click to finish.

    The resulting shape is open (as opposed to a polygon, which is always closed)

Once an area of the plot has been selected using Rectangle, Polygon or Polyline, it turns grey. At this point, clicking the Submit button removes all instances from the plot except those within the grey selection area. Clicking on the Clear button erases the selected area without affecting the graph. Once any points have been removed from the graph, the Submit button changes to a Reset button. This button undoes all previous removals and returns you to the original graph with all points included. Finally, clicking the Save button allows you to save the currently visible instances to a new ARFF file.

Experimenter:- The Experimenter, which can be run from both the command line and a GUI (easier to use), is a tool that allows you to perform more than one experiment at a time, perhaps applying different techniques to datasets, or the same technique repeatedly with different parameters. The setup of experiments is divided into two parts, standard and remote experiments. The first is how to set up experiments in general, whereas the latter is how to distribute experiments over several machines to speed up the execution time. The WEKA experiment environment enables the user to create, run, modify, and analyze experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyze the results to determine if one of the schemes is (statistically) better than the other schemes.

Knowledge Flow:- The Knowledge Flow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The Knowledge Flow is a work in progress, so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the Knowledge Flow but not in the Explorer. The Knowledge Flow presents a data-flow inspired interface to Weka. The user can select Weka components from a tool bar, place them on a layout canvas and connect them together in order to form a Knowledge Flow for processing and analyzing data. At present, all of Weka's classifiers and filters are available in the Knowledge Flow along with some extra tools. Components for clustering will be available in a later release. The Knowledge Flow can handle data either incrementally or in batches (the Explorer handles batch data only). Of course, learning from data incrementally requires a classifier that can be updated on an instance-by-instance basis. Currently in Weka there are classifiers that can handle data incrementally, such as NaiveBayesUpdateable, IB1, IBk, and LWR (locally weighted regression). There is also one meta classifier, Raced Incremental Logit Boost, that can use any regression base learner to learn from discrete class data incrementally.

Features of the Knowledge Flow:
- Intuitive data-flow style layout
- Process data in batches or incrementally


- Process multiple batches or streams in parallel (each separate flow executes in its own thread)
- Chain filters together
- View models produced by classifiers for each fold in a cross-validation
- Visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions, etc.)

Components available in the Knowledge Flow:

Evaluation:
- TrainingSetMaker: make a data set into a training set
- TestSetMaker: make a data set into a test set
- CrossValidationFoldMaker: split any data set, training set or test set into folds
- TrainTestSplitMaker: split any data set, training set or test set into a training set and a test set
- ClassAssigner: assign a column to be the class for any data set, training set or test set
- ClassValuePicker: choose a class value to be considered as the positive class. This is useful when generating data for ROC style curves (see below)
- ClassifierPerformanceEvaluator: evaluate the performance of batch trained/tested classifiers
- IncrementalClassifierEvaluator: evaluate the performance of incrementally trained classifiers
- PredictionAppender: append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions

Visualization:
- DataVisualizer: component that can pop up a panel for visualizing data in a single large 2D scatter plot
- ScatterPlotMatrix: component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot)
- AttributeSummarizer: component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data
- ModelPerformanceChart: component that can pop up a panel for visualizing threshold (i.e. ROC-style) curves
- TextViewer: component for showing textual data. Can show data sets, classification performance statistics, etc.
- GraphViewer: component that can pop up a panel for visualizing tree-based models
- StripChart: component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers)

Filters: All of Weka's filters are available.
Classifiers: All of Weka's classifiers are available.
DataSources: All of Weka's loaders are available.

RESULT:- The WEKA tool is studied.

VIVA-VOCE QUESTIONS:-

1) What does the acronym WEKA stand for?

2) In which language is the WEKA tool implemented?

3) What is the maximum data set size in the WEKA tool?

4) What are the main components in the WEKA tool?

5) List any 5 data mining tools?


EXPERIMENT NO: 04

    AIM: - Perform Linear Regression.

Problem Statement:- Consider students' subject marks (m1, m2, m3, and final marks). Perform Linear Regression and find the relation between the final marks and the m1, m2, and m3 marks.

Theory: - Linear regression is a statistical technique used to find the best-fitting linear relationship between a target (dependent) variable and its predictors (independent variables). You have a set of data on two variables, X and Y, represented in a scatter plot. You wish to find a simple, convenient mathematical function that comes close to most of the points, thereby describing succinctly the relationship between X and Y.

Linear Regression: - Involves a response variable Y and a single predictor variable X:

Y = W0 + W1X

where W0 (Y-intercept) and W1 (slope) are the regression coefficients.

Method of Least Squares: - estimates the best-fitting straight line:

W1 = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)²    (sums over i = 1 to |D|)

W0 = ȳ - W1·x̄

where x̄ and ȳ are the means of the x and y values in the training data D.
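As a quick check of these formulas, take three hypothetical points (1,2), (2,3), (3,5). Then x̄ = 2 and ȳ = 10/3, so W1 = [(1-2)(2-10/3) + (2-2)(3-10/3) + (3-2)(5-10/3)] / [(1-2)² + (2-2)² + (3-2)²] = (4/3 + 0 + 5/3) / 2 = 1.5, and W0 = 10/3 - 1.5·2 ≈ 0.33, giving the fitted line Y ≈ 0.33 + 1.5X.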

Multiple Linear Regression: - Involves more than one predictor variable. The training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|). For example, for 2-D data we may have y = w0 + w1x1 + w2x2, solvable by an extension of the least squares method or by using SAS or S-Plus. Many nonlinear functions can be transformed into a linear regression model. For example, y = w0 + w1x + w2x² + w3x³ is convertible to linear form with the new variables x2 = x² and x3 = x³: y = w0 + w1x + w2x2 + w3x3. Other functions can also be transformed to a linear model. Some models are intractably nonlinear (e.g., a sum of exponential terms); for these it is possible to obtain least squares estimates through extensive calculation on more complex formulae.

    Procedure: -

1. Prepare the data set for the given problem.
2. Go to the Weka tool and select Explorer.
3. Click the Open button and select the required data set.
4. Select the Classify tab.
5. Click on the Choose button and select the Linear Regression function (Weka -> classifiers -> functions -> LinearRegression).
6. Click the Start button.
7. The result is displayed on the right side.
8. Take the relation output and verify it manually.
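The same experiment can be run from Java code instead of the Explorer. A minimal sketch (marks.arff is a hypothetical data set with numeric attributes m1, m2, m3 and final, with final as the last attribute):

import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MarksRegression {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("marks.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);   // final marks is the target
        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr); // e.g. final = w1*m1 + w2*m2 + w3*m3 + w0
    }
}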

RESULT:- Linear Regression is performed.


VIVA-VOCE QUESTIONS:-

1) Which of the following can be used for data cleaning and data reduction?
(a) Linear regression (b) Multiple linear regression (c) Log-linear regression


EXPERIMENT NO: 05

    AIM: - Implement Apriori algorithm.

Problem Statement:- Write a program, in your favorite programming language, that takes as parameters the minimum support, the minimum confidence, and the name of a transaction file, and produces all association rules which can be mined from the transaction file and satisfy the minimum support and confidence requirements.

Theory: - Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested.

Method:
- Initially, scan the DB once to get the frequent 1-itemsets.
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
- Test the candidates against the DB.
- Terminate when no frequent or candidate set can be generated.

Pseudo-code:

Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning

How to count supports of candidates?

Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
    abcd from abc and abd
    acde from acd and ace
Pruning:
    acde is removed because ade is not in L3
C4 = {abcd}

Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
    for all itemsets c in Ck do
        for all (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
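A Java sketch of the candidate-generation step, reproducing the worked example above (itemsets are kept as sorted lists of item names; this illustrates only self-join plus prune, not a full Apriori implementation):

import java.util.*;

public class AprioriGen {

    // Self-join: merge two frequent k-itemsets that share their first k-1 items,
    // then prune candidates that have an infrequent k-subset.
    static Set<List<String>> aprioriGen(Set<List<String>> Lk) {
        Set<List<String>> Ck1 = new LinkedHashSet<>();
        for (List<String> p : Lk) {
            for (List<String> q : Lk) {
                int k = p.size();
                if (p.subList(0, k - 1).equals(q.subList(0, k - 1))
                        && p.get(k - 1).compareTo(q.get(k - 1)) < 0) {
                    List<String> c = new ArrayList<>(p);
                    c.add(q.get(k - 1));
                    if (allSubsetsFrequent(c, Lk)) Ck1.add(c);
                }
            }
        }
        return Ck1;
    }

    // Prune step: every k-subset of candidate c must itself be frequent.
    static boolean allSubsetsFrequent(List<String> c, Set<List<String>> Lk) {
        for (int i = 0; i < c.size(); i++) {
            List<String> s = new ArrayList<>(c);
            s.remove(i);
            if (!Lk.contains(s)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // L3 = {abc, abd, acd, ace, bcd} from the example above.
        Set<List<String>> L3 = new LinkedHashSet<>(Arrays.asList(
                Arrays.asList("a", "b", "c"), Arrays.asList("a", "b", "d"),
                Arrays.asList("a", "c", "d"), Arrays.asList("a", "c", "e"),
                Arrays.asList("b", "c", "d")));
        // Prints [[a, b, c, d]]: acde is pruned because ade is not in L3.
        System.out.println(aprioriGen(L3));
    }
}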

RESULT:- The Apriori algorithm is implemented.


Viva-voce questions:-
1. Define support?
2. Define confidence?
3. List different methods for generating association rules?
4. Define the Apriori property?
5. List the applications of association rules?


EXPERIMENT NO: 06

    AIM: - Perform Apriori algorithm.

Problem Statement:- Run the Apriori algorithm in the WEKA tool, taking as parameters the minimum support, the minimum confidence, and the name of the transaction file, and produce the best association rules which can be mined from the transaction file and satisfy the minimum support and confidence requirements.

    Procedure: -

1) Prepare the data set for the given problem.
2) Go to the Weka tool and select Explorer.
3) Click the Open button and select the data set.
4) Select the Associate tab.
5) Click the Choose button and select the Apriori function (Weka -> associations -> Apriori).
6) Click the Start button.
7) The result is displayed on the right side.
8) Observe the best rules generated by this function.
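The same run can be reproduced from Java code using Weka's Apriori class. A minimal sketch (the file name weather.nominal.arff is an assumption; the data set must be fully nominal):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.15); // minimum support
        apriori.setMinMetric(0.9);             // minimum confidence
        apriori.buildAssociations(data);
        System.out.println(apriori);           // prints the best rules found
    }
}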

    RESULT:- The APRIORI algorithm is verified.

    Sample Input:-

No.  outlook   temperature  humidity  windy  play
1    sunny     hot          high      FALSE  no
2    sunny     hot          high      TRUE   no
3    overcast  hot          high      FALSE  yes
4    rainy     mild         high      FALSE  yes
5    rainy     cool         normal    FALSE  yes
6    rainy     cool         normal    TRUE   no
7    overcast  cool         normal    TRUE   yes
8    sunny     mild         high      FALSE  no
9    sunny     cool         normal    FALSE  yes
10   rainy     mild         normal    FALSE  yes
11   sunny     mild         normal    TRUE   yes
12   overcast  mild         high      TRUE   yes
13   overcast  hot          normal    FALSE  yes
14   rainy     mild         high      TRUE   no

    Sample Output:-

Apriori
=======
Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12


    Size of set of large itemsets L(2): 47

    Size of set of large itemsets L(3): 39

    Size of set of large itemsets L(4): 6

    Best rules found:

1. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. outlook=overcast 4 ==> play=yes 4 conf:(1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
8. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2 conf:(1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2 conf:(1)

    Viva-voce questions:-

    1. List Different Methods for generating association rules in WEKA tool?

    2. List various kinds of Association Rules?


EXPERIMENT NO: 07

AIM: - Perform Classification by Decision Tree Induction.

Problem Statement: - Run Decision Tree Induction in the WEKA tool, take the tree representation, and note down the classifications/misclassifications.

    Procedure: -

1. Prepare the data set for the given problem.
2. Go to the Weka tool and select Explorer.
3. Click the Open button and select the required data set.
4. Select the Classify tab.
5. Click on the Choose button and select the J48 function (Weka -> classifiers -> trees -> J48).
6. Select the Use training set radio button.
7. Click the Start button.
8. The result is displayed on the right side.
9. Observe the confusion matrix.
10. Right-click on the entry in the Result list.
11. Select Visualize tree.
12. Observe the tree construction.
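A minimal Java sketch of the same steps, building J48 and printing the confusion matrix for the Use training set mode (buycomputer.arff refers to the sample input below):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("buycomputer.arff");
        data.setClassIndex(data.numAttributes() - 1); // buys_computer is the class
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);                  // textual tree representation

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);            // evaluate on the training set
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}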

    RESULT:-The classification of decision Tree Induction is performed.

Sample Input:-

@relation buycomputer

    @attribute age {le30, 31to40, gt40}

    @attribute income {high, medium, low}

    @attribute student {yes, no}

    @attribute credit_rating {fair, excellent}

    @attribute buys_computer {yes, no}

    @data

    le30,high,no,fair,no

    le30,high,no,excellent,no

    31to40,high,no,fair,yes

    gt40,medium,no,fair,yes

    gt40,low,yes,fair,yes

    gt40,low,yes,excellent,no

    31to40,low,yes,excellent,yes

    le30,medium,no,fair,no

    le30,low,yes,fair,yes

    gt40,medium,yes,fair,yes

    le30,medium,yes,excellent,yes

31to40,medium,no,excellent,yes
31to40,high,yes,fair,yes

  • 8/2/2019 Bilal Ahmed ShaikDMDW Lab

    20/24

Sample Output:-

=== Confusion Matrix ===

a b


EXPERIMENT NO: 08

AIM: - Perform Classification by Naive Bayesian classification.

Problem Statement: - Run the Naive Bayesian classifier in the WEKA tool, and note down the classifications/misclassifications.

    Procedure: -

1. Prepare the data set for the given problem.
2. Go to the Weka tool and select Explorer.
3. Click the Open button and select the required data set.
4. Select the Classify tab.
5. Click on the Choose button and select Naive Bayes (Weka -> classifiers -> bayes -> NaiveBayes).
6. Select the Use training set radio button.
7. Click the Start button.
8. The result is displayed on the right side.
9. Observe the confusion matrix.
10. Observe the Naive Bayesian classifier output.
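A minimal Java sketch of the same steps with the NaiveBayes classifier (buycomputer.arff refers to the input below):

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("buycomputer.arff");
        data.setClassIndex(data.numAttributes() - 1); // buys_computer is the class
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        System.out.println(nb);                    // per-attribute class distributions

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(nb, data);              // evaluate on the training set
        System.out.println(eval.toMatrixString()); // confusion matrix
    }
}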

RESULT:- The classification by Naive Bayesian classification is performed.

    Input:-

    @relation buycomputer

    @attribute age {le30, 31to40, gt40}

    @attribute income {high, medium, low}

    @attribute student {yes, no}

    @attribute credit_rating {fair, excellent}

    @attribute buys_computer {yes, no}

    @data

    le30,high,no,fair,no

    le30,high,no,excellent,no

    31to40,high,no,fair,yes

    gt40,medium,no,fair,yes

    gt40,low,yes,fair,yes

    gt40,low,yes,excellent,no

    31to40,low,yes,excellent,yes

    le30,medium,no,fair,no

    le30,low,yes,fair,yes

    gt40,medium,yes,fair,yes

    le30,medium,yes,excellent,yes

    31to40,medium,no,excellent,yes31to40,high,yes,fair,yes


EXPERIMENT NO: 09

AIM: - Perform Classification by Multilayer Perceptron.

Problem Statement: - Run the Multilayer Perceptron function in the WEKA tool, and note down the classifications/misclassifications.

    Procedure: -

1. Prepare the data set for the given problem.
2. Go to the Weka tool and select Explorer.
3. Click the Open button and select the required data set.
4. Select the Classify tab.
5. Click on the Choose button and select Multilayer Perceptron (Weka -> classifiers -> functions -> MultilayerPerceptron).
6. Click on MultilayerPerceptron, set the GUI option to True, then click OK.
7. Click the Start button and observe the neural network graph.
8. Observe whether the error is (close to) zero; if not, continue training until it approaches zero.
9. The result is displayed on the right side.
10. Observe the confusion matrix.
11. Observe the Multilayer Perceptron classifier output.
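A minimal Java sketch of the same steps with the MultilayerPerceptron classifier (buycomputer.arff refers to the input below; the parameter values shown are assumptions):

import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MlpDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("buycomputer.arff");
        data.setClassIndex(data.numAttributes() - 1); // buys_computer is the class
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setGUI(true);         // show the network GUI, as in step 6
        mlp.setTrainingTime(500); // number of training epochs
        mlp.buildClassifier(data);
        System.out.println(mlp);  // weights of the trained network
    }
}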

RESULT:- The classification by Multilayer Perceptron is performed.

    Input:-

    @relation buycomputer

    @attribute age {le30, 31to40, gt40}

    @attribute income {high, medium, low}

    @attribute student {yes, no}

    @attribute credit_rating {fair, excellent}

    @attribute buys_computer {yes, no}

    @data

    le30,high,no,fair,no

    le30,high,no,excellent,no

    31to40,high,no,fair,yes

    gt40,medium,no,fair,yes

    gt40,low,yes,fair,yes

    gt40,low,yes,excellent,no

    31to40,low,yes,excellent,yes

    le30,medium,no,fair,no

    le30,low,yes,fair,yes

    gt40,medium,yes,fair,yes

    le30,medium,yes,excellent,yes

    31to40,medium,no,excellent,yes


EXPERIMENT NO: 10

AIM: - Perform Clustering by the K-Means Algorithm.

Problem Statement: - Run SimpleKMeans in the WEKA tool, take the cluster visualization, and note down the squared errors and clustered instances.

    Procedure: -

1. Prepare the data set for the given problem.
2. Go to the Weka tool and select Explorer.
3. Click the Open button and select the required data set.
4. Select the Cluster tab.
5. Click on the Choose button and select the SimpleKMeans algorithm.
6. Set the number of clusters required in the SimpleKMeans properties.
7. Select the Use training set radio button.
8. Click the Start button.
9. The result is displayed on the right side.
10. Observe the clustered instances.
11. Right-click on the entry in the Result list and select Visualize cluster assignments.
12. Observe the clustering pattern.
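A minimal Java sketch of the same steps with SimpleKMeans (weather.arff refers to the input below; -N 3 and -S 10 match the sample output):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(3); // number of clusters, as in step 6
        km.setSeed(10);       // random seed (-S 10 in the sample output)
        km.buildClusterer(data);
        System.out.println(km); // centroids and within-cluster squared errors
    }
}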

RESULT:- The clustering by the K-Means algorithm is performed.

Input:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@data
sunny,85,85,FALSE
sunny,80,90,TRUE
overcast,83,86,FALSE
rainy,70,96,FALSE
rainy,68,80,FALSE
rainy,65,70,TRUE
overcast,64,65,TRUE
sunny,72,95,FALSE
sunny,69,70,FALSE
rainy,75,80,FALSE
sunny,75,70,TRUE
overcast,72,90,TRUE

    Output:

    === Run information ===

Scheme: weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: weather
Instances: 14
Attributes: 4
            outlook
            temperature
            humidity
            windy


    Test mode: evaluate on training data

    === Model and evaluation on training set ===

kMeans
======

Number of iterations: 3
Within cluster sum of squared errors: 8.928485612238717
Missing values globally replaced with mean/mode

Cluster centroids:
                          Cluster#
Attribute      Full Data         0         1         2
                    (14)       (7)       (3)       (4)
=======================================================
outlook            sunny     sunny  overcast     rainy
temperature      73.5714   77.8571   70.3333      68.5
humidity         81.6429        83        75     84.25
windy              FALSE     FALSE      TRUE      TRUE

    Clustered Instances

0    7 (50%)
1    3 (21%)
2    4 (29%)

Clustering Pattern: (screenshot of the cluster-assignment visualization)