mukeshnt
Data Mining Journal - MSc Part 1
University of Mumbai
DATA MINING
Software Requirement: WEKA
Weka is written in Java and therefore runs on most operating systems.
Weka's native data format is ARFF (Attribute-Relation File Format).
The data for these practicals is in an MS Excel sheet, bank-data.xls.
The first step is to convert the Excel file into comma-separated format (.csv).
The data contains the following fields:
id            a unique identification number
age           age of customer in years (numeric)
sex           MALE / FEMALE
region        inner_city / rural / suburban / town
income        income of customer (numeric)
married       is the customer married (YES/NO)
children      number of children (numeric)
car           does the customer own a car (YES/NO)
save_acct     does the customer have a savings account (YES/NO)
current_acct  does the customer have a current account (YES/NO)
mortgage      does the customer have a mortgage (YES/NO)
pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
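To make the relationship between the .csv file and the ARFF format concrete, the following is a minimal Python sketch that turns CSV text with the fields above into ARFF text. It is only an illustration of the file layout, not what WEKA does internally. The sample rows, the relation name and the function name csv_to_arff are assumptions made for this example.

```python
import csv
import io

# Hypothetical sample rows in the layout described above (values invented for illustration).
SAMPLE_CSV = """id,age,sex,region,income,married,children,car,save_acct,current_acct,mortgage,pep
ID00001,48,FEMALE,inner_city,17546.0,NO,1,NO,NO,NO,NO,YES
ID00002,40,MALE,town,30085.1,YES,3,YES,NO,YES,YES,NO
"""

# Fields described as numeric in the field list; all others are treated as nominal.
NUMERIC_FIELDS = {"age", "income", "children"}


def csv_to_arff(csv_text, relation="bank-data"):
    """Build ARFF text: a @relation line, one @attribute line per column, then @data."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        if name in NUMERIC_FIELDS:
            lines.append("@attribute %s numeric" % name)
        else:
            # A nominal attribute lists every distinct value seen in its column.
            values = sorted({row[col] for row in data})
            lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


arff_text = csv_to_arff(SAMPLE_CSV)
print(arff_text)
```

The header section declares each attribute as either numeric or as a set of allowed values, and the data section repeats the CSV rows unchanged.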
Practical 1: Perform the steps of Data Preprocessing in WEKA
Step 1:
• Convert the Excel file into comma-separated format (.csv).
• Open the Excel file and select Save As from the File pull-down menu.
• In the ensuing dialog box, select CSV as the file type and save the file.
Step 2:
• The .csv file can be opened in a text editor to check its contents.
Step 3: Start WEKA
Step 4: Loading the data
• In addition to its native ARFF data format, WEKA can also read .csv files.
Assuming WEKA is installed properly, click on the Explorer button.
• In the Preprocess tab, click on "Open file..." and open the data file (.csv or .arff).
Step 5: Filtering attributes
• In our data set, each customer has a unique id. This attribute has to be removed
before the data mining step.
• First tick the check box next to the id attribute, which is attribute number 1.
• At the bottom left, click on the Remove button. This removes the id
attribute and all its values.
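The effect of removing attribute number 1 can be sketched outside WEKA as dropping the first column of the CSV file. The snippet below is a minimal illustration under that assumption; the sample rows and the function name remove_column are made up for this example.

```python
import csv
import io

# Hypothetical three-column excerpt of the data (values invented for illustration).
SAMPLE_CSV = "id,age,sex\nID00001,48,FEMALE\nID00002,40,MALE\n"


def remove_column(csv_text, index):
    """Drop the column at 1-based position `index` from every row, header included."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        writer.writerow(row[:index - 1] + row[index:])
    return out.getvalue()


without_id = remove_column(SAMPLE_CSV, 1)
print(without_id)
```

The position is counted from 1, matching the attribute numbering shown in the Preprocess tab.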
Step 6: Saving the new data set
• Now save this data set by clicking on the Save button and naming the file
bank-data2.arff or bank-data2.csv.
Step 7: Now open the new file bank-data2.arff or bank-data2.csv in a text editor
(WordPad). As seen, the attribute id and its corresponding values have been removed.
• In the ARFF file, note the line
@relation bank-data-weka.filters.unsupervised.attribute.Remove-R1
This statement simply records the operations that have been performed on the data
set so far. As seen, attributes can be of either numeric or nominal type.
Step 8: Discretization
Techniques such as association rule mining can only be performed on categorical data,
so numeric or continuous attributes must first be discretized.
There are 3 such attributes in our data set: age, children and income.
• The "children" attribute only takes the values 0, 1, 2 and 3, so it can be converted by
hand. In the text editor, change its declaration in bank-data2.arff from
"@attribute children numeric" to "@attribute children {0,1,2,3}".
• This removes the keyword "numeric" from the "children" attribute and replaces it
with a set of values.
• Now save the file in WordPad. This might produce a warning message; dismiss it
by clicking OK.
Step 9: Discretization in WEKA
• To discretize the "age" and "income" attributes, divide each of
them into 3 intervals.
• Open bank-data2.arff or bank-data2.csv using the "Open" command.
• Select the filter weka.filters.unsupervised.attribute.Discretize.
• The text box in the filter panel will then show something like
Discretize -B 10 -M -1.0 -R first-last
Step 10: Discretization in WEKA (continued)
• Click on the text box to open the Discretize filter dialog box.
• Enter 1,4 in the text box for attributeIndices. These are the positions of "age" and
"income" now that "id" has been removed.
• Enter 3 as the number of intervals (bins).
• Since this is simple equal-width binning, leave all the other options set to "False".
• Click on OK and then Apply. This results in a new working relation with the two
selected attributes each partitioned into 3 intervals/bins.
• To examine the result, save the new working relation in the file "bank-data3.arff"
or "bank-data3.csv".
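The binning used here can be sketched in a few lines of Python: with equal-width bins, the range between the smallest and largest value is divided into intervals of the same width. This is an illustrative sketch, not WEKA's code. The age range of 18 to 67 is an assumption, chosen because it reproduces the cut points quoted in Step 11; the function names are hypothetical.

```python
from bisect import bisect_left


def equal_width_cut_points(lo, hi, bins):
    """Return the bins - 1 interior cut points that split [lo, hi] into equal widths."""
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def bin_index(value, cuts):
    """Index of the bin a value falls into; a value equal to a cut point
    goes into the lower bin, matching interval labels of the form (a-b]."""
    return bisect_left(cuts, value)


# Assumed minimum and maximum of "age" (not stated in this journal).
age_cuts = equal_width_cut_points(18, 67, 3)
print([round(c, 6) for c in age_cuts])
print(bin_index(30, age_cuts), bin_index(40, age_cuts), bin_index(60, age_cuts))
```

Each value is then replaced by the label of its bin, which is what turns the numeric attribute into a nominal one.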
Step 11: Evaluating the discretized data
• For example, the lowest range of the attribute "age" is labeled "(-inf-34.333333]"
and the middle range "(34.333333-50.666667]".
• Now rename these labels using the Replace option of WordPad:
Attribute age
  '\'(-inf-34.333333]\''              ->  0_34
  '\'(34.333333-50.666667]\''         ->  35_51
  '\'(50.666667-inf)\''               ->  52_max
Attribute income
  '\'(-inf-24388.173333]\''           ->  0_24386
  '\'(24388.173333-43758.136667]\''   ->  24386_43758
  '\'(43758.136667-inf)\''            ->  43759_max
• After replacing the values, save the changes as "bank-data-final.arff" or
"bank-data-final.csv".
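The manual find-and-replace on the bin labels could also be scripted. The sketch below applies the age renaming to one attribute line. The exact quoting of the generated labels is an assumption about how the ARFF file looks, and the names AGE_LABELS and relabel are hypothetical.

```python
# Old bin labels (assumed quoting) mapped to the readable names used in Step 11.
AGE_LABELS = {
    r"'\'(-inf-34.333333]\''": "0_34",
    r"'\'(34.333333-50.666667]\''": "35_51",
    r"'\'(50.666667-inf)\''": "52_max",
}


def relabel(arff_text, mapping):
    """Replace every occurrence of each old label with its new name."""
    for old, new in mapping.items():
        arff_text = arff_text.replace(old, new)
    return arff_text


# One attribute line as it might appear after discretization (assumed form).
line = r"@attribute age {'\'(-inf-34.333333]\'','\'(34.333333-50.666667]\'','\'(50.666667-inf)\''}"
print(relabel(line, AGE_LABELS))
```

Because the same labels appear in every data row, running the replacement over the whole file renames the values in the data section as well.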
Practical 2: Perform Association Rule Mining with WEKA
Step 1:
• Open the file “bank-data-final.arff” or “bank-data-final.csv”
Step 2:
• Clicking on the "Associate" tab brings up the interface for the association rule
algorithms. Apriori, the algorithm used here, is selected by default.
Click on the text box immediately to the right of the "Choose" button to edit its parameters.
• Choose lift as the criterion and enter 1.5 as its minimum value. Lift is
computed as the confidence of the rule divided by the support of the right-hand
side (RHS). In simplified form, given a rule L => R, lift is the ratio of the probability
that L and R occur together to the product of the two individual probabilities for L
and R, i.e.,
lift = Pr(L,R) / (Pr(L) × Pr(R)).
• If this value is 1, then L and R are independent. The higher this value, the more
likely it is that the existence of L and R together in a transaction is not just a random
occurrence, but is due to some relationship between them.
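The definition of lift can be checked on a small example. The sketch below uses eight invented transactions (not taken from the bank data) and computes lift both from the probabilities and as confidence divided by the support of the right-hand side; the two agree. The item names and function names are assumptions for illustration.

```python
# Eight toy transactions (invented for illustration).
# L = "married=YES" occurs in 4, R = "pep=NO" occurs in 4, both together in 3.
transactions = [
    {"married=YES", "pep=NO"},
    {"married=YES", "pep=NO"},
    {"married=YES", "pep=NO"},
    {"married=YES"},
    {"pep=NO"},
    {"car=YES"},
    {"car=YES"},
    {"car=YES"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Pr(R | L): support of L and R together divided by the support of L."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Pr(L,R) / (Pr(L) * Pr(R))."""
    joint = support(lhs | rhs, transactions)
    return joint / (support(lhs, transactions) * support(rhs, transactions))


L = {"married=YES"}
R = {"pep=NO"}
print(lift(L, R, transactions))
print(confidence(L, R, transactions) / support(R, transactions))
```

Here Pr(L) = Pr(R) = 0.5 and Pr(L,R) = 0.375, so the lift is 0.375 / 0.25 = 1.5, which would just meet the minimum value entered above.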
• Change the default number of rules (10) to 100, so that the
program reports no more than the top 100 rules. The upper bound for
minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%).
• Apriori in WEKA starts with the upper bound for support and incrementally decreases
it. The algorithm halts when either the specified number of rules has been
generated or the lower bound for minimum support is reached.
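The stopping behaviour described above can be sketched as a simple loop. This is a rough illustration of the idea only, and WEKA's actual implementation differs in its details. The step size of 0.05 is an assumption, and toy_count is a made-up stand-in for "number of rules found at this minimum support".

```python
def lower_support_until(count_rules, num_rules, upper=1.0, lower=0.1, delta=0.05):
    """Lower the minimum support step by step until enough rules are found
    or the lower bound is reached. Returns (min_support, rules_found)."""
    min_support = upper
    while True:
        found = count_rules(min_support)
        if found >= num_rules or min_support <= lower:
            return min_support, found
        # Rounding keeps repeated subtraction from drifting away from 0.95, 0.90, ...
        min_support = max(lower, round(min_support - delta, 10))


def toy_count(min_support):
    """Hypothetical rule counter: more rules qualify as the support threshold drops."""
    return round((1.0 - min_support) * 200)


print(lower_support_until(toy_count, num_rules=100))
print(lower_support_until(toy_count, num_rules=1000))
```

With the toy counter, the first call stops once 100 rules are available, while the second never finds 1000 rules and stops at the lower bound instead.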
Step 3:
• Once the parameters have been set, the command-line text box shows the new
command line. Now click "Start" to run the program.
Step 4:
• The panel on the left ("Result list") now shows an item indicating the algorithm
that was run and the time of the run.
• Clicking on one of the results in this list will bring up the details of the run,
including the discovered rules in the right panel. In addition, right-clicking on the
result set allows us to save the result buffer into a separate file. Now save the
output in the file bank-data-ar1.txt
Practical 3: Classification via decision tree
Step 1:
• WEKA has implementations of numerous classification and prediction algorithms.
The basic ideas behind using all of these are similar. (Use "bank.arff")
Step 2:
• Next, select the "Classify" tab and click the "Choose" button to select the J48
classifier.
• Various parameters can be specified. These can be specified by clicking in the text
box to the right of the "Choose" button.
Step 3:
• Under "Test options" in the main panel, select 10-fold cross-validation as the
evaluation approach. Now click "Start" to generate the model. The ASCII version of
the tree, as well as the evaluation statistics, will appear in the right panel when
model construction is completed.
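For readers unfamiliar with 10-fold cross-validation, the sketch below shows how the instances can be split: the data is shuffled and divided into 10 folds, and each fold in turn serves as the test set while the other nine are used for training. This is a plain, unstratified sketch; WEKA may additionally balance the class values across folds, which is not reproduced here. The data set size of 50 and the function name are assumptions.

```python
import random


def k_fold_indices(n, k=10, seed=1):
    """Yield (train, test) index lists; every instance is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 50 is an arbitrary illustrative number of instances.
splits = list(k_fold_indices(50, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

The reported accuracy is then based on the predictions made for the held-out fold in each of the 10 rounds.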
• To view this information in a separate window, right-click the last result set
(inside the "Result list" panel on the left) and select "View in separate window"
from the pop-up menu.
Step 4:
• To view a graphical rendition of the classification tree, right-click the last result set
and select "Visualize tree" from the pop-up menu.
• The new, unclassified instances are in the file "bank-new.arff". Note that its attribute
section is identical to that of the training data. However, in the data
section, the value of the "pep" attribute is "?" (unknown).
Step 5:
• In the main panel, under "Test options" click the "Supplied test set" radio button,
and then click the "Set..." button. This will pop up a window which allows you to
open the file containing test instances.
• Open the file "bank-new.arff" and, upon returning to the main window, click the
"Start" button.
• This once again generates the model from our training data, but this time it applies
the model to the new unclassified instances in the "bank-new.arff" file in order to
predict the value of the "pep" attribute.
Step 6:
• The next step is to create a file containing all the new instances along with their
predicted class values resulting from the application of the model.
• First, right-click the most recent result set in the left "Result list" panel. In the
resulting pop-up menu, select the item "Visualize classifier errors".
Step 7:
• To save the classification results from which the graph is generated, click the
"Save" button in the new window and save the result as the file
"bank-predicted.arff".
Practical 4: K-Means Clustering
Step 1:
• The sample data set used for this example is based on "bank-data.csv". This
practical assumes that appropriate data pre-processing has been performed. In this
case, a version of the initial data set has been created in which the id field has been
removed and the "children" attribute has been converted to categorical.
• To perform clustering, select the "Cluster" tab in the Explorer and click on the
"Choose" button. This brings up a drop-down list of available clustering algorithms.
In this case, select "SimpleKMeans".
• Next, click on the text box to the right of the "Choose" button to open the pop-up
window for editing the clustering parameters.
Step 2:
• In the pop-up window, enter 6 as the number of clusters and leave the value of
"seed" as is. The seed value is used in generating a random number which is, in turn,
used for making the initial assignment of instances to clusters.
• Once the options have been specified, we can run the clustering algorithm. Here we
make sure that in the "Cluster Mode" panel, the "Use training set" option is selected,
and we click "Start". We can right click the result set in the "Result list" panel and
view the results of clustering in a separate window.
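The role of the number of clusters and of the seed can be seen in a small k-means sketch. This is a generic illustration on six invented 2-D points with 2 clusters, not WEKA's SimpleKMeans and not the bank data. It assumes the seed is used to pick the initial centroids at random from the data; all names are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=10, iterations=100):
    """Return (centroids, assignment) after alternating the assign and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the seed fixes the random starting centroids
    assignment = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, assignment


# Two well-separated groups of invented points.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```

Running it again with the same seed gives the same clusters, which is why leaving the seed unchanged makes the results reproducible.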
• The result window shows the centroid of each cluster as well as statistics on the
number and percentage of instances assigned to different clusters.
• Another way of understanding the characteristics of each cluster is through
visualization. Do this by right-clicking the result set in the left "Result list" panel and
selecting "Visualize cluster assignments".
• You can choose the cluster number or any of the other attributes for each of the three
dimensions available (x-axis, y-axis, and color).
• Different combinations of choices result in a visual rendering of different
relationships within each cluster. In this example, we chose the cluster number
as the x-axis, the instance number (assigned by WEKA) as the y-axis, and the "sex"
attribute as the color dimension.
• This results in a visualization of the distribution of males and females in each
cluster. For instance, you can note that clusters 2 and 3 are dominated by males,
while clusters 4 and 5 are dominated by females. By changing the color
dimension to other attributes, we can see how they are distributed within each cluster.
Step 3:
• Now, click the "Save" button in the visualization window and save the result as the
file "bank-kmeans.arff"
----END----