Journal Data Mining

University of Mumbai


DATA MINING

Software Requirement: WEKA

WEKA is written in Java and runs on different operating systems.

WEKA expects the data to be in ARFF (Attribute-Relation File Format).

The data is in an MS Excel sheet, i.e. bank-data.xls.

The first step is to convert the Excel file into comma-separated format (.csv).

The data contains the following fields:

id             a unique identification number
age            age of customer in years (numeric)
sex            MALE / FEMALE
region         inner_city / rural / suburban / town
income         income of customer (numeric)
married        is the customer married (YES/NO)
children       number of children (numeric)
car            does the customer own a car (YES/NO)
save_acct      does the customer have a savings account (YES/NO)
current_acct   does the customer have a current account (YES/NO)
mortgage       does the customer have a mortgage (YES/NO)
pep            did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)



Practical 1: Perform the steps of Data Preprocessing in WEKA

Step 1:

• Convert the Excel file into comma-separated format (.csv).

• Open the Excel file and select Save As from the File pull-down menu.

• In the ensuing dialog box, select CSV and save the file.



Step 2:

• The .csv file can now be opened in a text editor.

Step 3: Start WEKA



Step 4: Loading the data

• In addition to the native ARFF data format, WEKA can also read .csv files. Assuming WEKA is installed properly, click the Explorer button.

• In the Preprocess tab, click Open file... and open the data file (.csv or .arff).
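When a .csv file is loaded, each column has to be given an ARFF attribute type. The following is only a rough sketch of that idea in Python, not WEKA's own converter: it assumes a column is numeric when every value parses as a number and nominal otherwise. The function name csv_to_arff and the sample rows (including the ID001-style values) are made up for illustration.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text, guessing each column's type."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        try:
            # Treat the column as numeric only if every value parses as a number.
            for v in values:
                float(v)
            lines.append("@attribute " + name + " numeric")
        except ValueError:
            # Otherwise make it nominal, listing distinct values in order of appearance.
            labels = list(dict.fromkeys(values))
            lines.append("@attribute " + name + " {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Made-up sample rows using a few of the field names from the data description.
sample = "id,age,sex,pep\nID001,48,FEMALE,YES\nID002,40,MALE,NO\n"
arff = csv_to_arff(sample, "bank-data")
print(arff)
```

The printed text has the two parts of an ARFF file: an attribute section (the @attribute lines) followed by a data section (after @data).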



Step 5: Filtering Attributes:

• In our data set, each customer has a unique id. This attribute has to be removed before the data mining step.

• First tick the check box next to the id attribute, which is attribute number 1.

• At the bottom left, click the Remove button. This removes the id attribute and all its values.
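For readers who want to see what removing an attribute amounts to outside the GUI, here is a minimal Python sketch that drops one column from CSV text. It is an illustration only, not what WEKA runs internally; the function name remove_column and the sample rows are hypothetical, and the index is 1-based to match the attribute numbering used above.

```python
import csv
import io


def remove_column(csv_text, index):
    """Return CSV text without the column at the given 1-based index."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Keep every field except the one at position index (1-based).
        writer.writerow(row[:index - 1] + row[index:])
    return out.getvalue()


# Made-up sample rows; the first column plays the role of the id attribute.
sample = "id,age,sex\nID001,48,FEMALE\nID002,40,MALE\n"
without_id = remove_column(sample, 1)
print(without_id)
```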

Step 6: Saving new data set:

• Now save this data set as bank-data2.arff or bank-data2.csv by clicking the Save button.



Step 7: Now open the new file bank-data2.arff or bank-data2.csv in a text editor (WordPad). As seen, the id attribute and its corresponding values have been removed.

• Note the line @relation bank-data-weka.filters.unsupervised.attribute.Remove-R1. This statement simply records the operations that have been performed on the data set so far. As seen, attributes can be of either numeric or nominal type.



Step 8: Discretization

Techniques like association rule mining can only be performed on categorical data. This requires discretizing the numeric or continuous attributes.

There are 3 such attributes in our data set: age, children and income.

The children attribute takes only the values 0 to 3, so it can be converted by hand: in the attribute section of the file, change its declaration to @attribute children {0,1,2,3}

• This removes the keyword “numeric” from the “children” attribute and replaces it with a set of values.

• Now save the file in WordPad. This may produce a warning message; dismiss it by clicking OK.
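The same hand edit can be sketched as a small text substitution. This is a hedged illustration, assuming the declaration appears in the file as a line of the form "@attribute children numeric"; the helper name make_nominal is hypothetical.

```python
def make_nominal(arff_text, attribute, labels):
    """Replace '@attribute <name> numeric' with a nominal declaration."""
    new_decl = "@attribute " + attribute + " {" + ",".join(labels) + "}"
    out = []
    for line in arff_text.splitlines():
        parts = line.split()
        # Match only the declaration line of the requested attribute.
        if (len(parts) == 3 and parts[0].lower() == "@attribute"
                and parts[1] == attribute and parts[2].lower() == "numeric"):
            out.append(new_decl)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# A two-line excerpt standing in for the attribute section of the file.
header = "@attribute age numeric\n@attribute children numeric\n"
edited = make_nominal(header, "children", ["0", "1", "2", "3"])
print(edited)
```

Only the children line changes; age stays numeric until it is discretized in WEKA.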



Step 9: Discretization in WEKA

• To discretize the “age” and “income” attributes, divide each of them into 3 intervals.

• Open bank-data2.arff or bank-data2.csv using the “Open” command.

• Select weka.filters.unsupervised.attribute.Discretize.

• The text box in the filter panel will show something like Discretize -B 10 -M -1.0 -R first-last.



Step 10: Discretization in WEKA (continued)

• Click on the text box to open the Discretize filter dialog box.

• Enter 1,4 in the text box for attributeIndices (after removing id, age is attribute 1 and income is attribute 4).

• Enter 3 as the number of intervals (bins).

• As this is simple binning, all the other options remain “False”.

• Click OK and then Apply. This produces a new working relation in which the two selected attributes are each partitioned into 3 intervals (bins).

• To examine the result, save the new working relation in the file “bank-data3.arff” or “bank-data3.csv”.
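Simple binning with 3 bins means equal-width intervals: the range from the minimum to the maximum value is cut into three parts of the same width. The sketch below illustrates this in Python under stated assumptions: the ages are made up, and the minimum (18) and maximum (67) were chosen only because they reproduce the cut points 34.333333 and 50.666667 quoted in Step 11. The function names are hypothetical.

```python
def equal_width_cuts(values, bins):
    """Return the bins-1 cut points that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def bin_label(value, cuts):
    """Label a value with interval notation such as (-inf-34.333333]."""
    bounds = ["-inf"] + ["%.6f" % c for c in cuts] + ["inf"]
    for i, cut in enumerate(cuts):
        if value <= cut:
            return "(%s-%s]" % (bounds[i], bounds[i + 1])
    return "(%s-%s)" % (bounds[-2], bounds[-1])


# Assumed ages: only the minimum (18) and maximum (67) affect the cut points.
ages = [18, 23, 35, 48, 51, 67]
cuts = equal_width_cuts(ages, 3)
print(["%.6f" % c for c in cuts])
print(bin_label(30, cuts), bin_label(40, cuts), bin_label(60, cuts))
```

With these assumed values the width is (67 - 18) / 3 = 16.333333, so the cut points are 18 + 16.333333 = 34.333333 and 18 + 32.666667 = 50.666667.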



Step 11: Examining the discretized data

• For example, the lowest range of the attribute “age” is labeled “(-inf-34.333333]” and the middle range “(34.333333-50.666667]”.

• Now replace the following labels using the Replace option of WordPad.

Attribute age

'\'(-inf-34.333333]\''  →  0_34

'\'(34.333333-50.666667]\''  →  35_51

'\'(50.666667-inf)\''  →  52_max

Attribute income

'\'(-inf-24386.173333]\''  →  0_24386

'\'(24386.173333-43758.136667]\''  →  24386_43758

'\'(43758.136667-inf)\''  →  43759_max



• After replacing the values, save the changes as “bank-data-final.arff” or “bank-data-final.csv”.
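The find-and-replace pass over the interval labels can also be sketched in a few lines of Python. This is an illustration only, and it assumes the labels are stored in the file in the quoted form '\'(-inf-34.333333]\''; if the saved file writes them differently, the keys of the mapping must be adjusted. The names AGE_LABELS and rename_labels are hypothetical.

```python
# Assumed form of the generated interval labels, mapped to the short names from Step 11.
AGE_LABELS = {
    r"'\'(-inf-34.333333]\''": "0_34",
    r"'\'(34.333333-50.666667]\''": "35_51",
    r"'\'(50.666667-inf)\''": "52_max",
}


def rename_labels(arff_text, mapping):
    """Apply every old -> new label substitution to the ARFF text."""
    for old, new in mapping.items():
        arff_text = arff_text.replace(old, new)
    return arff_text


# A single attribute declaration standing in for the whole file.
line = r"@attribute age {'\'(-inf-34.333333]\'','\'(34.333333-50.666667]\'','\'(50.666667-inf)\''}"
renamed = rename_labels(line, AGE_LABELS)
print(renamed)
```

Because the same labels occur in both the attribute section and the data section, applying the substitution to the whole file text renames them consistently in both places.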



Practical 2: Perform Association Rule Mining with WEKA

Step 1:

• Open the file “bank-data-final.arff” or “bank-data-final.csv”

Step 2:

• Clicking on the "Associate" tab brings up the interface for the association rule algorithms. The Apriori algorithm, which is used here, is selected by default. Click on the text box immediately to the right of the "Choose" button.

• Choose lift as the criterion. Now enter 1.5 as the minimum value for lift, which is computed as the confidence of the rule divided by the support of the right-hand side (RHS). In a simplified form, given a rule L => R, lift is the ratio of the probability that L and R occur together to the product of the two individual probabilities for L and R, i.e.,

lift = Pr(L,R) / (Pr(L) × Pr(R)).

• If this value is 1, then L and R are independent. The higher this value, the more likely it is that the occurrence of L and R together in a transaction is not just random, but due to some relationship between them.
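As a worked illustration of the lift formula, the following Python sketch computes lift for one rule over four made-up transactions. The data and the function name lift are assumptions for illustration and do not come from the bank data set.

```python
def lift(transactions, lhs, rhs):
    """Compute lift of the rule lhs => rhs over a list of item sets."""
    n = len(transactions)
    p_l = sum(lhs <= t for t in transactions) / n            # Pr(L)
    p_r = sum(rhs <= t for t in transactions) / n            # Pr(R)
    p_lr = sum((lhs | rhs) <= t for t in transactions) / n   # Pr(L,R)
    return p_lr / (p_l * p_r)


# Four made-up customers; each set holds attribute=value items.
data = [
    {"married=NO", "pep=YES"},
    {"married=NO", "pep=YES"},
    {"married=YES", "pep=NO"},
    {"married=YES", "pep=YES"},
]
value = lift(data, {"married=NO"}, {"pep=YES"})
print(round(value, 4))
```

Here Pr(L) = 0.5, Pr(R) = 0.75 and Pr(L,R) = 0.5, so lift = 0.5 / (0.5 × 0.75) ≈ 1.3333. The same number results from dividing the confidence of the rule (0.5 / 0.5 = 1.0) by the support of the right-hand side (0.75).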



• Here change the default number of rules (10) to 100; this means the program will report no more than the top 100 rules. The upper bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%).

• Apriori in WEKA starts with the upper bound for support and decreases it incrementally. The algorithm halts when either the specified number of rules has been generated or the lower bound for minimum support is reached.
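The stopping behaviour described above can be sketched as a loop. The Python below is a simplified illustration, not WEKA's implementation: it enumerates item sets by brute force instead of using Apriori's candidate generation, it only builds rules with a single item on the right-hand side, and the step size delta=0.05 and the toy transactions are assumptions. The function names are hypothetical.

```python
from itertools import combinations


def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    return sum(items <= t for t in transactions) / len(transactions)


def rules_at(transactions, min_support, min_lift):
    """Brute-force rules L => R (single-item R) from item sets meeting min_support."""
    all_items = sorted(set().union(*transactions))
    found = []
    for size in range(2, len(all_items) + 1):
        for combo in combinations(all_items, size):
            s = support(transactions, set(combo))
            if s < min_support:
                continue
            for r in combo:
                lhs = set(combo) - {r}
                value = s / (support(transactions, lhs) * support(transactions, {r}))
                if value >= min_lift:
                    found.append((sorted(lhs), r, s, value))
    # Report the rules with the highest lift first.
    return sorted(found, key=lambda rule: -rule[3])


def mine(transactions, num_rules, upper=1.0, lower=0.1, delta=0.05, min_lift=1.5):
    """Lower the minimum support from upper until enough rules exist or lower is hit."""
    min_support = upper
    while True:
        rules = rules_at(transactions, min_support, min_lift)
        next_support = round(min_support - delta, 10)
        if len(rules) >= num_rules or next_support < lower:
            return min_support, rules[:num_rules]
        min_support = next_support


# Five made-up transactions of attribute=value items.
data = [
    {"car=YES", "pep=YES"},
    {"car=YES", "pep=YES"},
    {"car=NO", "pep=NO"},
    {"car=NO", "pep=NO"},
    {"car=NO", "pep=YES"},
]
final_support, rules = mine(data, num_rules=2)
print(final_support, rules)
```

On this toy data no pair of items occurs in more than 40% of the transactions, so the loop keeps lowering the minimum support until it reaches 0.4, where enough rules with lift of at least 1.5 appear and the search stops.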



Step 3:

• Once the parameters have been set, the command line text box shows the new command line. Now click "Start" to run the program.



Step 4:

• The panel on the left ("Result list") now shows an item indicating the algorithm

that was run and the time of the run.

• Clicking on one of the results in this list will bring up the details of the run,

including the discovered rules in the right panel. In addition, right-clicking on the

result set allows us to save the result buffer into a separate file. Now save the

output in the file bank-data-ar1.txt.



Practical 3: Classification via decision tree

Step 1:

• WEKA has implementations of numerous classification and prediction algorithms.

The basic ideas behind using all of these are similar. (Use "bank.arff")



Step 2:

• Next, select the "Classify" tab and click the "Choose" button to select the J48

classifier.



• Various parameters can be specified by clicking in the text box to the right of the "Choose" button.



Step 3:

• Under "Test options" in the main panel, select 10-fold cross-validation as the evaluation approach. Now click "Start" to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the right panel when model construction is completed.



• To view this information in a separate window, right-click the last result set (inside the "Result list" panel on the left) and select "View in separate window" from the pop-up menu.
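To make the idea of 10-fold cross-validation in Step 3 concrete, the sketch below shows how instances can be dealt into 10 folds so that each fold serves as the test set exactly once. It is a plain, unstratified simplification written for illustration, not WEKA's own procedure; the data set size of 600 and the function names are assumptions.

```python
import random


def k_fold_indices(n, k, seed=1):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n, k, seed=1):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 600 instances is an assumed data set size used only for illustration.
splits = list(cross_validation_splits(600, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In each of the 10 rounds a model would be built on the 540 training instances and evaluated on the 60 held-out instances, and the reported statistics summarize all 10 rounds.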



Step 4:

• To view a graphical rendition of the classification tree, right-click the last result set and select "Visualize tree" from the pop-up menu.



• Note that in the test file "bank-new.arff" the attribute section is identical to that of the training data. However, in the data section, the value of the "pep" attribute is "?" (unknown).

Step 5:

• In the main panel, under "Test options" click the "Supplied test set" radio button,

and then click the "Set..." button. This will pop up a window which allows you to

open the file containing test instances.



• Open the file "bank-new.arff" and, upon returning to the main window, click the "Start" button.

• This once again generates the model from our training data, but this time it applies the model to the new unclassified instances in the "bank-new.arff" file in order to predict the value of the "pep" attribute.



Step 6:

• The next task is to create a file containing all the new instances along with their predicted class values resulting from the application of the model.

• First, right-click the most recent result set in the left "Result list" panel. In the resulting pop-up menu, select the item "Visualize classifier errors".



Step 7:

• To save the classification results from which the graph is generated, click the "Save" button in the new window and save the result as the file "bank-predicted.arff".



Practical 4: K-Means Clustering

Step 1:

• The sample data set used for this example is based on "bank-data.csv". This document assumes that appropriate data pre-processing has been performed. In this case, a version of the initial data set has been created in which the ID field has been removed and the "children" attribute has been converted to categorical.

• To perform clustering, select the "Cluster" tab in the Explorer and click the "Choose" button. This brings up a drop-down list of available clustering algorithms. In this case, select "SimpleKMeans".

• Next, click on the text box to the right of the "Choose" button to get the pop-up window for editing the clustering parameters.



Step 2:

• In the pop-up window, enter 6 as the number of clusters and leave the value of

"seed" as is. The seed value is used in generating a random number which is, in turn,

used for making the initial assignment of instances to clusters.

• Once the options have been specified, we can run the clustering algorithm. Here we

make sure that in the "Cluster Mode" panel, the "Use training set" option is selected,

and we click "Start". We can right click the result set in the "Result list" panel and

view the results of clustering in a separate window.



• The result window shows the centroid of each cluster as well as statistics on the

number and percentage of instances assigned to different clusters.

• Another way of understanding the characteristics of each cluster is through visualization. Do this by right-clicking the result set in the left "Result list" panel and selecting "Visualize cluster assignments".

• You can choose the cluster number and any of the other attributes for each of the three dimensions available (x-axis, y-axis, and color).

• Different combinations of choices result in a visual rendering of different relationships within each cluster. In this example, the cluster number is chosen as the x-axis, the instance number (assigned by WEKA) as the y-axis, and the "sex" attribute as the color dimension.

• This results in a visualization of the distribution of males and females in each cluster. For instance, you can note that clusters 2 and 3 are dominated by males, while clusters 4 and 5 are dominated by females. By changing the color dimension to other attributes, their distribution within each cluster can be examined in the same way.
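The clustering run in Step 2 can be pictured with a minimal k-means sketch. The Python below is an illustration under stated assumptions, not WEKA's SimpleKMeans: it handles numeric attributes only, it uses the seed to pick random instances as the initial centroids, and the (age, income) points, the value k=2 and the seed value are made up to keep the example small (the practical itself uses 6 clusters).

```python
import random


def k_means(points, k, seed=10, iterations=100):
    """Plain k-means on numeric tuples; the seed fixes the random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the centroids are stable.
        assignment = new_assignment
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Made-up (age, income) pairs forming two well-separated groups.
points = [(25, 15000.0), (30, 18000.0), (28, 16000.0),
          (55, 52000.0), (60, 58000.0), (58, 55000.0)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```

The returned centroids correspond to the cluster centroids shown in the result window, and the assignment list plays the role of the cluster number that is plotted in "Visualize cluster assignments". Running the sketch with a different seed can change which cluster gets which number.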



Step 3:

• Now click the "Save" button in the visualization window and save the result as the file "bank-kmeans.arff".
