data mining technique using weka tool

8/10/2019 data mining technique using weka tool


DATA MINING TECHNIQUES

Abstract

With the large amount of data stored in databases and data warehouses, it is increasingly important to develop powerful tools for analysing such data and mining interesting knowledge from it. Data mining is the process of inferring knowledge from such large volumes of data. The main problem in retrieving information from the World Wide Web is the enormous number of unstructured documents and resources, i.e., the difficulty of locating and tracking appropriate sources. This report describes a customer satisfaction survey carried out at San Francisco Airport and shows how its analysis can support decisions on improving the airport's facilities.

Introduction

We live in an information age in which everyone has access to large amounts of data on almost any topic. Some of this data is useful and some is not. Many corporations have changed their management style and adopted a customer relationship management approach, so every company wants its customers to be fully satisfied. To achieve this, companies conduct surveys that gather large volumes of data about their customer base. The main problem is extracting patterns from this data: it is so large that analysing all of it manually is next to impossible. Data mining tools help companies understand the data and extract patterns from it.

What is data mining?

Data mining is the extraction of useful information from large data sets, followed by analysing it and restructuring it into a useful form. It includes identifying previously unseen patterns and the relationships among those patterns in order to predict future behaviour. It has its origins in computer science. In business, it can be used to identify relationships among the various factors that affect the business. For example, retailers such as Wal-Mart use data mining to identify patterns in their customers' buying behaviour and to shape future marketing strategies that attract them. Data mining is used extensively in business, science and engineering, medicine, and sports.

Stages in data mining.

Data mining involves the following basic stages:

1. Defining the problem - This involves identifying the objectives and requirements of the project.

2. Collecting the data - This involves collecting appropriate data that can address the problem and identifying any patterns in it. It also involves cleansing and transforming the data.

3. Creating, testing, evaluating and interpreting a model - This includes building the model with the help of algorithms.

4. Applying the model - This involves integrating the data mining models into applications.



VARIOUS TASKS OF DATA MINING

Data mining involves six different types of tasks. These tasks fall into two broad categories: (1) directed data mining and (2) undirected data mining.

1. Directed data mining - The data is used to build a model that describes one or more particular attributes of interest.

2. Undirected data mining - The goal is to establish relationships among all the attributes.

The six tasks of data mining are as follows:

1. Classification
2. Estimation
3. Prediction
4. Affinity grouping or association rules
5. Clustering
6. Description and visualization

[1] CLASSIFICATION:

In classification, the features of the presented data are examined and each record is assigned to one of several pre-determined classes. Various methods are used to build a model, which is then applied to unclassified data to classify it. Examples of classification tasks are:

Loan applicants are classified as low, medium or high. Income groups are classified as lower class, middle class and high income group.

[2] ESTIMATION:

Classification yields discrete outcomes such as low, medium or high, whereas estimation yields continuously valued outcomes. Given some input data, estimation techniques are used to come up with a value for an unknown continuous variable, or an acceptable value or range for an unknown parameter. For example:

Estimating the number of children in a family from the literacy rate. Estimating a person's profession from the gadgets that person owns.

[3] PREDICTION:

Prediction is the same task as classification or estimation, except that the records are classified according to some predicted future value. The only way to check the accuracy of the result is to wait and see.



The main reason for treating prediction separately is that it raises additional issues about the temporal relationship among the input variables, which do not arise in classification and estimation. Any method used for classification or estimation can also be used for prediction. Historical data is used to build a model that explains currently observed behaviour. Examples of predictive tasks are:

Predicting which employees will leave within six months. Predicting the size of the balance that will be transferred.

[4] ASSOCIATION RULES

This task is also known as affinity grouping. The aim is to find out which things go together, for example which items appear together in a shopping cart. This task is the core of market basket analysis. Retail chains often use it to arrange items on shelves and in catalogues.

Association rule mining generates rules directly from the data. If two items, say a cell phone and a memory card, frequently occur together, two association rules can be generated:

People who buy a cell phone also buy a memory card with probability x.

People who buy a memory card also buy a cell phone with probability y.
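
The two probabilities x and y can be estimated by counting baskets. The following is a minimal sketch in Python; the basket contents and the function name confidence are invented for illustration and do not come from any real data set.

```python
def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket) by counting."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    with_both = [b for b in with_antecedent if consequent in b]
    return len(with_both) / len(with_antecedent)


# Hypothetical shopping baskets, made up for this example only.
baskets = [
    {"cell phone", "memory card"},
    {"cell phone", "memory card", "charger"},
    {"cell phone"},
    {"cell phone", "charger"},
    {"memory card"},
    {"cell phone", "memory card"},
]

x = confidence(baskets, "cell phone", "memory card")
y = confidence(baskets, "memory card", "cell phone")
print(f"cell phone -> memory card: x = {x:.2f}")
print(f"memory card -> cell phone: y = {y:.2f}")
```

The two values generally differ, because each rule is conditioned on a different set of baskets.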

[5] CLUSTERING

Clustering means dividing the data into a number of groups such that the records within each group are homogeneous, while the groups themselves are heterogeneous with respect to one another. What distinguishes clustering from classification is that classification relies on pre-determined classes: all records are assigned to those classes on the basis of a model.

In clustering no such pre-defined classes are available. It is up to the user to determine what meaning to attach to the resulting clusters. For example, clusters of employees might turn out to correspond to different backgrounds. Clustering is often done as a prelude to some other form of data mining or modelling. For example, when planning a promotional strategy for a product, instead of devising one strategy for every customer, you can divide the customers into clusters with similar buying habits and then determine which strategy suits each cluster.

[6] DESCRIPTION AND VISUALIZATION

The main purpose of description and visualization is to describe what is going on in a complicated database in a way that increases our understanding. Visualization is one of the most powerful tools of data mining, although producing a good visualization is not an easy task. It is useful for providing visual representations of data, such as a company's customers plotted on a map of a country.



Techniques of Data Mining

Each of the above tasks can be carried out with various model-building techniques. Some of the most important techniques are described below:

[1] STATISTICS

Various statistical techniques are used for data mining, such as Bayesian networks, regression analysis, correlation analysis and cluster analysis. A statistical model is built from a training data set. A Bayesian network is a directed graph that represents the causal relationships among variables using Bayes' theorem. Correlation analysis is used to find the correlation between two variables. Cluster analysis finds groups in a set of objects based on distance measures. The following is an example of linear regression.

The correlation coefficient, a parameter used in correlation analysis, is a measure of the linear association between two variables. It lies between -1 and +1. A value of +1 indicates that the two variables are perfectly related in a positive linear sense, a value of -1 indicates that they are perfectly related in a negative linear sense, and a value of 0 indicates that there is no linear relationship between them.
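
As an illustration of the range described above, the sketch below computes the Pearson correlation coefficient directly from its definition. The function name pearson and the sample values are assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: one series rises with the input, one falls.
inputs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

print(pearson(inputs, rising))   # perfectly positive linear relation
print(pearson(inputs, falling))  # perfectly negative linear relation
```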

The following is an example of a Bayesian network, in which nodes represent variables and edges represent dependencies. The diagram shows that age, occupation and diet lead to disease, and disease leads to symptoms.

[2] MACHINE LEARNING

Statistical methods do not work efficiently with subjective, non-quantifiable information in their models. They also have to assume particular distributions for the parameters and independence among the attributes. Machine learning models can achieve good predictive accuracy because they are free from the parametric and structural assumptions that underlie statistical methods. Some machine learning techniques are as follows:



Artificial Neural Network - The neural network is a very popular data mining technique. It is a general-purpose tool applied to prediction, classification and clustering. The model has many non-linear processing elements arranged in a pattern similar to biological networks of neurons. The technique is applied across a broad range of industries, from financial institutions to medical companies.

Genetic Algorithm - This algorithm is based on natural selection and natural genetics, and is also called an evolutionary algorithm. It is used to solve optimisation problems in various industries, such as complex scheduling and resource optimisation problems. It is also used in combination with other data mining techniques.

Decision tree - A decision tree is a structure that divides a large collection of data into successively smaller sets of records by applying simple decision rules. It is a flow-chart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution. The topmost node in the tree is called the root node. The following decision tree indicates whether a customer of a company is likely to buy a computer or not.

[3] FUZZY LOGIC:

This technique is an extension of the classical logic system. It provides a conceptual framework for dealing with the problem of knowledge representation in an environment of uncertainty and imprecision. Fuzzy logic in its pure form is not useful for classification, but it is used within hybrid classification techniques. In fuzzy logic, any logical system can be fuzzified.



[4] ROUGH SETS TECHNIQUE:

In rough set theory, sets or concepts are approximated by means of binary relations constructed from empirical data, based on the notion of indiscernibility, that is, the inability to distinguish between objects. The model is applied in two directions: first, decision rule induction from an attribute-value table, and second, data filtration by template generation.

Techniques of Clustering

Clustering means grouping objects in such a way that the objects in one group are more similar to each other than to the objects in other groups. In data mining this is done to present data in an organised way. Clustering techniques are divided into four categories:

1. Hierarchical method
2. Density based method
3. Grid based method
4. Partitioning method

Each category contains several techniques. Each is discussed separately below.

1. HIERARCHICAL METHOD

In this method the data is organised into a tree of clusters, or dendrogram. Two approaches are used: (1) agglomerative and (2) divisive. The agglomerative approach works bottom-up by growing clusters. The process starts with each item in its own cluster and repeatedly merges clusters until a single cluster encompasses all items or some terminating condition is met. The divisive approach works top-down. The process starts with a root cluster containing all items, divides it into sub-clusters, and divides those again, ending when every cluster contains only one item or some terminating condition is met.
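
The agglomerative (bottom-up) approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the points are one-dimensional, the distance between two clusters is taken as the distance between their closest members (single linkage), and the terminating condition is a target number of clusters. None of these choices is prescribed by the text above.

```python
def agglomerative(points, target_clusters):
    """Bottom-up clustering of 1-D points using single linkage."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair of clusters into one.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical measurements, invented for this example.
values = [1.0, 1.5, 5.0, 5.2, 9.0]
result = agglomerative(values, 3)
print(result)
```

Recording the order of the merges, instead of stopping at a target count, would give the dendrogram mentioned above.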



2. DENSITY BASED CLUSTERING METHOD

Density based clustering techniques are designed to find clusters of arbitrary shape. DBSCAN and OPTICS are two techniques of this kind.

DBSCAN (Density Based Spatial Clustering of Applications with Noise)

This technique is designed to find clusters of arbitrary shape. The density of an object is measured by the number of objects close to it. DBSCAN uses two input parameters: ε (epsilon) and MinPts. ε defines the radius of the neighbourhood of an object. MinPts defines the minimum number of points that must lie in the neighbourhood of an object for it to be considered a core object. The algorithm works as follows:

(1) Mark all objects as unvisited.

(2) Randomly select an unvisited object x and mark it as visited. If x has at least MinPts objects in its neighbourhood, create a new cluster C for it. Otherwise mark it as a noise point.

(3) If a cluster was created, iteratively visit each point y in the neighbourhood collected for C. If y is unvisited, mark it as visited, and if y has at least MinPts points in its own neighbourhood, add those points to the set of points to visit. If y is not yet a member of any cluster, add it to C.

(4) Repeat steps 2 and 3 until all objects have been visited.
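
The four steps can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, calls the neighbourhood radius eps, counts a point as part of its own neighbourhood, and visits points in index order rather than randomly. The sample coordinates are invented.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "unvisited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1  # points[i] is a core object: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # noise point reachable from a core point
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also core: keep expanding
    return labels


# Hypothetical 2-D points: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Points labelled -1 are the noise points of step (2).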

OPTICS (Ordering Points To Identify the Clustering Structure)

DBSCAN has several disadvantages. It burdens the user with choosing the input parameters, and different parts of the data may require different parameter values. OPTICS was designed to overcome these challenges. It computes two additional attributes for each object, the core distance and the reachability distance, which are used to derive an ordering of the objects such that clusters with higher density are finished first. Clusters of differing density are a further problem for DBSCAN that this ordering addresses.



3. GRID BASED CLUSTERING METHOD

Grid based methods divide the attribute space into cells, and all clustering operations are performed on these cells. The main advantage of this approach is its fast processing time. STING and CLIQUE are two techniques of this kind.

a. STING (Statistical Information Grid)

This technique mainly uses numerical attributes. First, statistical information such as the mean, maximum and minimum is stored for each rectangular cell. The parameters of higher-level cells are then derived from the parameters of the bottom-level cells. To answer a query, a layer is chosen from which query processing begins; this layer consists of a small number of cells. For each cell, its relevance is checked by computing a confidence interval. Irrelevant cells are removed, and the process continues with the next lower layer until the bottom layer is reached.

b. CLIQUE (Clustering in Quest)

Some of the data is irrelevant and complicates the whole process. CLIQUE divides each dimension into non-overlapping intervals, which partition the data space into cells. If the number of objects that map to a cell exceeds a threshold, the cell is called dense; otherwise it is sparse. The procedure is as follows:

i. Divide the data space into non-overlapping units called cells.
ii. Identify the dense and sparse cells.
iii. Use the dense cells to assemble the clusters.
iv. Starting with an arbitrary dense cell, find the maximal region of connected dense cells in all dimensions.
v. Repeat step iv until all dense cells are covered.
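
The idea of dense cells and connected regions can be illustrated with a simplified two-dimensional grid. The sketch below is not the full CLIQUE algorithm, which also searches lower-dimensional subspaces; the function name, cell size, threshold and sample points are all assumptions made for this example.

```python
from collections import Counter


def dense_cell_clusters(points, cell_size, threshold):
    """Group connected dense grid cells of a 2-D data set into clusters."""
    # Steps i and ii: map each point to its cell and count points per cell.
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    dense = {cell for cell, n in counts.items() if n > threshold}

    # Steps iii to v: grow a region of connected dense cells from each
    # dense cell that has not been covered yet.
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        region, stack = [], [start]
        while stack:
            cx, cy = stack.pop()
            region.append((cx, cy))
            for nxt in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nxt in dense and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(region))
    return clusters


# Hypothetical points: two adjacent dense cells, one separate dense cell,
# and one point in a sparse cell.
points = [(0.1, 0.2), (0.4, 0.3), (0.8, 0.9),
          (1.2, 0.5), (1.5, 0.1), (1.9, 0.8),
          (5.1, 5.2), (5.5, 5.5), (5.9, 5.1),
          (3.3, 3.3)]
clusters = dense_cell_clusters(points, cell_size=1.0, threshold=2)
print(clusters)
```

Each cluster is reported as a list of cell coordinates; the point in the sparse cell belongs to no cluster.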



4. PARTITIONING METHODS

In this method the data is divided into K clusters, where K is less than or equal to N, the number of objects, and each cluster contains at least one object. The clustering is improved by relocating objects from one cluster to a more suitable one. This continues until the clusters stabilise and no more objects migrate from one cluster to another. This method is used for small and medium-sized databases. The techniques used are K-means and K-medoids.

K-MEANS TECHNIQUE

This is one of the most commonly used clustering algorithms. Here K refers to the number of clusters. Each cluster must contain at least one element, and the clusters must not overlap. Descriptions of K-means and related algorithms gloss over the selection of K. In many cases there is no a priori reason to select a particular value, so there is really an outermost loop to these algorithms that occurs during analysis rather than in the computer program. This outer loop consists of performing automatic cluster detection using one value of K, evaluating the results, then trying again with another value of K or perhaps modifying the data. After each trial, the strength of the resulting clusters can be evaluated by comparing the average distance between records in a cluster with the average distance between clusters, and by other procedures. These tests can be automated, but the clusters must also be evaluated on a more subjective basis to determine their usefulness for a given application. As shown in the figure, different values of K may lead to very different clusterings that are equally valid. The figure shows clusterings of a deck of playing cards for K = 2 and K = 4. Is one better than the other? It depends on the use to which the clusters will be put.



PROCESS OF DOING K-MEANS

1. First, choose the value of K. As discussed above, K can be chosen according to our requirements. In the example above it is 2 or 4.

2. Assign each element to the cluster for which it is most suitable. In the example above, the playing cards are divided into 2 clusters according to their colour.

3. Compute the mean of each cluster.

4. Compare each element with the cluster means and reassign it to the cluster for which it is most suitable.

5. Repeat steps 3 and 4 until the most suitable clusters are obtained, that is, until no element changes cluster.
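
These steps can be sketched in Python as follows. The sketch makes several assumptions that the text does not fix: points are numeric tuples, suitability is measured by Euclidean distance to the cluster mean, the initial means are K randomly sampled points, and the seed and max_iterations parameters and the sample data are hypothetical.

```python
import math
import random


def mean_point(cluster):
    """Component-wise mean of a non-empty list of equally long tuples."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def k_means(points, k, seed=10, max_iterations=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial means: k distinct data points
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no element changed cluster: the clusters are stable
        assignment = new_assignment
        # Recompute the mean of each non-empty cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return centroids, assignment


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```

Running the function with different seeds or different values of K corresponds to the outer loop of trying a value, evaluating the result and trying again.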

ADVANTAGES:

This technique is simple compared with other techniques and easy to implement. It is also very efficient at processing large databases.

WEAKNESSES:

This method is applicable only when the mean of a cluster is defined.

This method is not applicable to categorical data.

The value of K must be specified before applying the technique.

This method cannot handle outliers and noisy data well.

The total run time depends on the initial partition.

K-MEDOIDS ALGORITHM

This is a variation of the K-means technique that is less sensitive to outliers because, instead of using means, each cluster is represented by one of its own points, called the medoid. Clusters are formed from the points closest to the respective medoids. The objective function measures the dissimilarity between the points in a cluster and their representative. The partitioning is done by minimising the sum of the dissimilarities between each object and its cluster representative. This criterion is called the absolute-error criterion. Two main types of K-medoids clustering are PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications).
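
A small example shows why a medoid is less sensitive to outliers than a mean. The sketch below assumes one-dimensional values and absolute difference as the dissimilarity; the function names and data are invented, and it illustrates only the criterion, not the PAM or CLARA search procedures.

```python
def medoid(cluster):
    """The member of the cluster with the smallest total distance to the rest."""
    return min(cluster, key=lambda c: sum(abs(c - p) for p in cluster))


def absolute_error(clusters):
    """Sum of dissimilarities between each object and its cluster's medoid."""
    total = 0
    for cluster in clusters:
        m = medoid(cluster)
        total += sum(abs(p - m) for p in cluster)
    return total


# Hypothetical cluster containing one extreme outlier (100).
cluster = [1, 2, 3, 4, 100]
cluster_mean = sum(cluster) / len(cluster)
cluster_medoid = medoid(cluster)

print("mean:  ", cluster_mean)    # pulled far towards the outlier
print("medoid:", cluster_medoid)  # stays among the typical values
print("absolute error:", absolute_error([cluster]))
```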



K-MEANS USING WEKA (SURVEY OF SAN FRANCISCO AIRPORT)

We have now covered how K-means clustering works. The main question is how to apply the technique to practical problems. It helps with problems such as analysing customer satisfaction surveys in marketing or geographical demand surveys aimed at reducing distribution cost. K-means is an easy and widely used clustering technique in practice.

The aviation industry has been struggling for the past 3-4 months, and every airport is trying to maximise the satisfaction of its customers. San Francisco International Airport therefore conducted a survey to analyse its customers' satisfaction level. It gathered data from 3500 customers who answered several questions about the airport's facilities. The airport then wanted to analyse this data to understand the patterns in it and the various requirements of its customers, and data mining clustering techniques can be used to solve this problem.

For the airport it is important to understand how many customers use the services it offers, whether customers visit the airport frequently or only once a year, and how they feel about and rate the various services offered by the airport authority. Several data mining software packages are available for sorting this data and using it to make decisions, such as Orange, Weka and Tanagra. Here we analyse the data using the Weka tool.

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.

Weka contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to this functionality. Weka supports several standard data mining tasks, specifically data pre-processing, clustering, classification, regression, visualization, and feature selection. In keeping with our topic, we deal here with the clustering technique using the Weka software.

When you install Weka and open it, the above dialog box appears with several options for launching applications. For our clustering project we click the Explorer option. The functions of the options are:



Explorer: - A platform for exploring data. It supports data pre-processing, attribute selection, learning and visualization.

Experimenter: - A platform for performing experiments and conducting statistical tests between machine learning tools.

Knowledge Flow: - Similar to Explorer but with a drag-and-drop interface. It also gives a visual design.

Simple CLI: - Provides a simple command line interface for executing Weka commands.

Clicking Explorer opens another dialog box in which you load the data that you want to analyse. The data should be in either CSV or ARFF format. The Explorer window appears as shown in fig (2). Select Open file and then choose the data file to be analysed. Weka provides filters, but they are not needed for the K-means technique because it automatically handles both numerical and categorical data.

The algorithm normalizes numerical attributes automatically when computing distances. The Explorer window lists all the attributes present in the dataset. We can select any of them for inclusion, or select all.
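
Normalizing numerical attributes keeps an attribute with a large range from dominating the distance computation. The sketch below shows min-max rescaling to the range 0 to 1 as one common way of doing this; treating it as the kind of normalization applied here is an assumption, and the function name and sample ages are invented.

```python
def min_max_normalize(column):
    """Rescale a list of numbers linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical ages of four respondents.
ages = [18, 30, 42, 66]
normalized = min_max_normalize(ages)
print(normalized)
```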

FIGURE (2)



Our data contains 3535 instances, where each instance is the response of one customer, and 16 attributes on which decisions about how to improve the service can be based. The 16 attributes are:

1. Art work and exhibition
2. Restaurant
3. Shop
4. Signage and direction
5. Escalators
6. Info on screen
7. Info booth at lower level
8. Info booth at upper level
9. Wifi
10. Parking
11. Long term parking
12. Rental centre
13. SFO as a whole
14. Age
15. Sex
16. Income

Figure 2 shows Art work and exhibition selected as the attribute. The table on the right side of the Weka window shows the number of people who responded average, good, outstanding, below average, blank, never visited, or unacceptable. It can be seen from the data that 696 people responded average, 1066 responded good, and so on, as shown in the figure.

To compare attributes we can use the bar graph in the bottom right corner of the Weka window. Here we selected Income as the second attribute, so the graph shows the relation between Art work and exhibition and Income. The blue colour in the first bar shows that, of the 696 people who responded average for Art work & exhibition, around 18 per cent also chose the same for Income. So, around 128 people responded average for both attributes.

Similarly, the red colour indicates that, of the 696 responses of good for Art work & exhibition, around 20 per cent also responded the same for Income.

The main drawback of this view is that we cannot get the exact number of people who are common to both attributes. Other attributes can be related to Income in the same way. Figure 2(i) shows the relation of all the attributes to the Income attribute.



FIGURE 2(i)



To see how clustering works on this type of data, we open the clustering panel. Choose Cluster from the tabs along the top, click the Choose button on the left, and select the clustering algorithm to apply. Weka provides various algorithms such as hierarchical clustering, Cobweb, DBSCAN and EM. As the focus here is on K-means clustering, we select SimpleKMeans. This is shown in the figure below.

Then click the text box to the right of the Choose button. A pop-up window opens, shown below, in which you can edit the clustering parameters such as the distance function, the maximum number of iterations, the number of clusters and the seed. All these parameters are important because they affect the final output.

The effect of each parameter is as follows. For the distance function we can choose among Euclidean, Manhattan and Chebyshev distance. The choice is important because the function is used to calculate the distance between each centroid and the data instances. For K-means we need to select Euclidean distance, because choosing Manhattan distance gives K-medians clustering instead.
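
The three distance functions differ only in how they combine the per-attribute differences. The following sketch defines them for numeric tuples; the function names and the two sample points are chosen for illustration.

```python
def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute difference along any single attribute."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical instance and centroid.
p, q = (0, 0), (3, 4)
print("Euclidean:", euclidean(p, q))
print("Manhattan:", manhattan(p, q))
print("Chebyshev:", chebyshev(p, q))
```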



The maximum number of iterations limits how many times the assignment is repeated while finding the final clusters. The seed value is used to generate the initial cluster centres from which the clustering process starts. Finally, the number of clusters is selected. We choose 5 clusters for our data.

FIGURE (4)

Once all parameters have been specified, start the clustering algorithm. Weka offers several cluster modes: use training set, supplied test set, percentage split, and classes to clusters evaluation. Use training set builds the clusters from all the data. Percentage split takes a value in the % field that specifies the share of the data used for building the clusters; the default value is 66%. Classes to clusters evaluation compares the resulting cluster assignment with the known classes of the instances to determine whether instances of the same class have been assigned to the same cluster. We select use training set as the cluster mode and click the Start button. To view the clustering results in a separate window, right-click the result set in the Result list panel. The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to each cluster. Cluster centroids are the mean vectors of each cluster (so each dimension value in the centroid represents the mean value for that dimension in the cluster). Thus, centroids can be used to characterise the clusters.



Finally, we get a result that divides the whole data set into 5 different clusters.

In this result the number of iterations is 5, which means that the data was processed 5 times before the final output was produced. The sum of squared errors is 24399; it is the sum of the squared distances between each instance and the centroid of its cluster. The lower the sum of squared errors, the tighter the clusters are. For a given data set this value depends on two parameters: (1) the number of clusters and (2) the seed value. The result also shows the total number of instances and the percentage of instances assigned to each cluster.
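
The sum of squared errors can be computed directly from the instances, the centroids and the cluster assignment. The sketch below uses a tiny invented data set, not the airport survey, and assumes plain squared Euclidean distance on numeric attributes.

```python
def sum_squared_error(points, centroids, assignment):
    """Sum of squared distances between each point and its cluster's centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centroids[cluster]))
        for point, cluster in zip(points, assignment)
    )


# Hypothetical data: four points clustered two ways.
points = [(1.0, 1.0), (3.0, 1.0), (10.0, 10.0), (12.0, 10.0)]

# Two clusters, each centroid being the mean of its two points.
sse_two = sum_squared_error(points, [(2.0, 1.0), (11.0, 10.0)], [0, 0, 1, 1])

# One cluster whose centroid is the mean of all four points.
sse_one = sum_squared_error(points, [(6.5, 5.5)], [0, 0, 0, 0])

print("SSE with 2 clusters:", sse_two)
print("SSE with 1 cluster: ", sse_one)
```

Because the value falls as clusters are added, it is most useful for comparing runs that use the same number of clusters, for example runs with different seed values.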

Cluster 0 contains 384 customer responses, which include those who responded average for all the parameters and are males in the 45-54 age group.

Cluster 1 contains 900 customer responses, which include those who gave no response for Art work and exhibition, Restaurant, Shop, Info booth at lower level and upper level, Wifi, Parking, Long term parking and Rental centre, responded good for SFO as a whole, and responded outstanding for Escalators, Info on screen, and Signage and direction. The majority of this cluster are females in the 25-34 age group.



It can also be inferred that the majority of females in the 25-34 age group are satisfied with the services of SFO airport, while males in the 45-54 age group are not fully satisfied. SFO should therefore focus more on the latter group and try to improve its facilities in a way that satisfies them.

FIGURE (6)



This is one way of analysing the clusters. Another way to analyse the clustering result is visualization. Visualization is easy to understand and judge, so many people use it to analyse clusters.

To do this, right-click the result set in the "Result list" panel on the left and select "Visualize cluster assignments". The visualization window pops up as shown in Fig 6. In it, choose the cluster number and any other attributes for each of the three available dimensions (x-axis, y-axis, and colour). Different combinations of choices result in visual renderings of different relationships within each cluster.

In the above example we choose the cluster number on the x-axis, the instance number on the y-axis and Art work and exhibition as the colour. Visualizing other attributes gives further results. The figure for this is given below.

FIGURE (7)

This is how K-means clustering works in the Weka tool. We get the final result in graphical or numerical form, and it can be used for further business goals. In our case we analyse whether customers are satisfied and, if they are not, which attributes we are lacking in and which types of customers are unhappy, for example whether they are rich or middle-class customers. All this information can be derived from the graphs and the numerical output.



The same analysis can also be done in the Orange software. The main difference between Weka and Orange is that Orange provides only a graphical interpretation, whereas Weka provides both graphical and numerical output.

This is how the K-means technique works in open source software, and how clustering tasks are carried out in practice using software.

Conclusion

Thus, by using various tools we can segregate data in ways that are useful for making the decisions required to improve a service or for learning about the different requirements of a firm. Weka is one such simple tool that can be used to classify, cluster and interpret large data sets.


