A Comparative study on Classification and Clustering Techniques Using Assorted Data Mining Tools

International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 2, Issue 8, August - 2015. ISSN 2348 4853, Impact Factor 1.317

55 | 2015, IJAFRC All Rights Reserved www.ijafrc.org

Dr. S. Prasath

Assistant Professor, Department of Computer Science,

Erode Arts and Science College (Autonomous), Erode.


ABSTRACT

Data mining is the discovery of important information and patterns in very large collections of
available data. Two important data mining strategies are clustering and classification: the
latter uses a set of pre-classified examples to build a model that can classify the population of
records at large, while the former partitions the data into groups of similar items. This paper
proposes a new method that integrates these two data mining strategies, clustering and
classification. A comparative study is then carried out between simple classification and the
proposed integrated clustering-classification technique. Four popular data mining tools were used
for both techniques, with six different classifiers and one clusterer for all sets. Across all
the tools used, the integrated clustering-classification technique was found to be better than the
simple classification technique. This result was consistent for all six classifiers used. For
both techniques, the best classifier was found to be SVM. Of the four tools used, WEKA was found to
be the best in terms of flexibility of computation. All comparisons were drawn by comparing the
percentage accuracy of each classifier used.

Keywords: Data mining, Classification, Clustering, Data mining tools, WEKA, Orange, Fuzzy.

I. INTRODUCTION

Data mining centers on the automated discovery of new facts and relationships in already existing
data. The various techniques of data mining include association, regression, prediction,
clustering and classification. Clustering is the division of data into groups of similar objects.
Clustering is an example of unsupervised learning, as it learns by observation [1]. Classification
is a data mining function that assigns items in a collection to target categories or classes. The
goal of classification is to accurately predict the target class for each case in the data [2].
This paper deals with the use of the integrated clustering-classification technique on some of the
free data mining tools available today. The tools on which the integrated
clustering-classification technique has been implemented are KNIME (Konstanz Information Miner),
Tanagra [3], Orange and WEKA (Waikato Environment for Knowledge Analysis) [4]. The classifiers used
for this purpose are Naive Bayes, Support Vector Machine, K Nearest Neighbor, Zero Rule, Decision
Tree and One Rule.

Data mining is the process of automatic classification of cases based on data patterns obtained
from a dataset. A number of algorithms have been developed and implemented to extract information
and discover knowledge patterns that may be useful for decision support. Data mining is also known
as KDD (Knowledge Discovery in Databases); data preprocessing, pattern recognition, clustering and
classification are the popular techniques in data mining. This paper discusses in detail the data
preprocessing that takes place in the database server module, and then discusses the database
module, characterized as classification and clustering by the data mining tools.

II. KNOWLEDGE DISCOVERY PROCESS

The terms Knowledge Discovery in Databases (KDD) and data mining are often used interchangeably.
KDD is the process of transforming low-level data into high-level knowledge. Accordingly, KDD
refers to the nontrivial extraction of implicit, previously unknown and potentially useful
information from data in databases. Although data mining and KDD are often treated as synonyms,
data mining is in fact an important step in the KDD process.

The Knowledge Discovery in Databases process comprises a few steps leading from raw data
collections to some form of new knowledge. In the data cleaning stage, noisy data and irrelevant
data are removed from the collection. In the data integration stage, multiple data sources, often
heterogeneous, may be combined into a common source. In the data selection stage, the data relevant
to the analysis is chosen and retrieved from the data collection. Data transformation, also known
as data consolidation, is the stage in which the selected data is transformed into forms
appropriate for the mining procedure. Data mining is the stage in which intelligent techniques are
applied to extract potentially useful patterns. In pattern evaluation, interesting patterns
representing knowledge are identified based on given measures. Knowledge representation is the
final stage, in which the discovered knowledge is visually presented to the user. In this step,
visualization techniques are used to help users understand and interpret the data mining results.
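The first four stages of this process (cleaning, integration, selection and transformation) can be illustrated with a minimal sketch. It is not the procedure used in this study: the records, the field names and the helper functions `clean`, `integrate`, `select` and `transform` are all hypothetical, and min-max scaling is only one possible transformation.

```python
# Illustrative KDD preprocessing pipeline.
# All records, field names and helpers below are hypothetical.

def clean(records):
    """Data cleaning: drop records that contain missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: min-max scale one numeric field to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant fields
    return [{**r, field: (r[field] - lo) / span} for r in records]

source_a = [{"id": 1, "age": 30, "glucose": 120, "note": "x"},
            {"id": 2, "age": None, "glucose": 95, "note": "y"}]
source_b = [{"id": 3, "age": 50, "glucose": 160, "note": "z"}]

# Clean each source, combine them, keep two fields, then rescale one of them.
combined = integrate(clean(source_a), clean(source_b))
prepared = transform(select(combined, ["age", "glucose"]), "glucose")
print(prepared)
```

The record with the missing age is dropped during cleaning, so only two records reach the mining stage.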

III. DATA MINING PROCESS

In the KDD process, data mining methods are used for extracting patterns from data. The patterns
that can be discovered depend on the data mining tasks applied. In general, there are two kinds of
data mining tasks: descriptive data mining tasks, which describe the general properties of the
existing data, and predictive data mining tasks, which attempt to make predictions based on the
available data. Data mining can be done on data in quantitative, textual or multimedia forms.

Data mining applications can use different kinds of parameters to examine the data. They include
association (patterns where one event is connected to another event), sequence or path analysis
(patterns where one event leads to another event), classification (identification of new patterns
with predefined targets) and clustering (grouping of identical or similar objects).

Problem definition: The first step is to identify the goals, based on which the correct series of
tools can be applied to the data to build the corresponding behavioral model.

Data exploration: If the quality of the data is not suitable for an accurate model, recommendations
on future data collection and storage strategies can be made. For analysis, all data needs to be
consolidated so that it can be treated consistently.

Data preparation: The purpose is to clean and transform the data. Missing and invalid values are
treated, and all known valid values are made consistent for more robust analysis.

Modeling: Based on the data and the desired outcomes, a data mining algorithm or combination of
algorithms is selected for analysis. These algorithms include classical techniques such as
statistics, neighborhoods and clustering, as well as next generation techniques such as decision
trees, networks and rule based algorithms. The specific algorithm is selected based on the
particular objective to be achieved and the quality of the data to be analyzed.

Evaluation and Deployment: Based on the results of the data mining algorithms, an analysis is
conducted to determine key conclusions and create a series of recommendations for further
consideration.
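As a rough illustration of the data preparation step, the sketch below treats missing values and makes valid values consistent. Mean imputation and the label mapping are assumptions chosen for the example, not the treatment applied in this study; the helper names `impute_mean` and `normalize_label` and the sample values are hypothetical.

```python
from statistics import mean

def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def normalize_label(value):
    """Map several spellings of the same valid value onto one consistent form."""
    mapping = {"y": "yes", "yes": "yes", "n": "no", "no": "no"}
    return mapping[value.strip().lower()]

# Hypothetical columns: one numeric with a gap, one categorical with mixed spellings.
bmi = [22.0, None, 30.0, 26.0]
labels = ["Yes", "n", " NO ", "y"]

print(impute_mean(bmi))
print([normalize_label(v) for v in labels])
```

Other treatments, such as dropping incomplete records or imputing the median, fit the same step; the choice depends on the data and the objective.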

Fig.1 Steps involved in Data mining

IV. DATA MINING METHODS

Classification: A supervised learning strategy in which the classes are known.

Clustering: An unsupervised learning strategy in which the classes are unknown.

Association Rule Mining: Identifying hidden, previously unknown relationships between the elements.

Temporal mining: The use of temporal data, modeling of temporal events, time series, pattern
detection, sequences and temporal association rules.

Time Series Analysis: Describes the nature and behavior of time series data and predicts the
future trend and behavior of the data.

Web Mining: Mining web data, including Web content mining, Web structure mining and Web usage
mining.

Spatial Mining: Used with GIS for mining knowledge from spatial databases. It covers spatial
classification, clustering and rule generation tasks.
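To make association rule mining concrete, the sketch below computes the two standard measures used to rank a rule "A implies C": support and confidence. These measures are not named in this paper and are brought in here as common background; the transactions and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))       # 2 of 4 transactions
print(confidence(transactions, {"bread"}, {"milk"}))  # 2 of the 3 with bread
```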

V. DATA MINING CLASSIFICATION ALGORITHMS

The different classification algorithms available are:

Naive Bayes (NB): A probabilistic classifier based on Bayes' theorem that assumes the features are
independent of one another.

Decision tree (C4.5): A statistical classifier developed by Ross Quinlan that classifies data by
generating decision trees.

Support Vector Machine (SVM): An example of a non-probabilistic binary linear classifier. From the
set of input data, it predicts which of two possible classes forms the output.

K Nearest Neighbor (KNN): An example of instance-based learning. KNN is sensitive to the local
structure of the data: the function is approximated only locally, and all computation is deferred
until classification.
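The instance-based behavior of KNN described above can be sketched in a few lines. This is a minimal illustration, not the implementation found in any of the tools compared here; the function name `knn_predict`, the Euclidean distance, the value k=3 and the training points are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (features, label) pairs; distance is Euclidean.
    No model is built beforehand: all work happens at prediction time.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set with two classes.
train = [((1.0, 1.0), "negative"), ((1.2, 0.8), "negative"),
         ((0.9, 1.1), "negative"), ((4.0, 4.2), "positive"),
         ((4.1, 3.9), "positive"), ((3.8, 4.0), "positive")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (4.0, 4.0)))
```

Because only the k closest points vote, the prediction depends on the local structure of the data around the query point.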

VI. CLUSTERING

This technique partitions the records in a database into different groups. Records in the same
group have similar properties: the differences between groups should be as large as possible, and
the differences within the same group should be as small as possible. There are no predefined
classes in this grouping, so it falls under unsupervised learning. Techniques included in cluster
analysis are partitioning methods, hierarchical methods, density based methods, grid based methods,
model-based methods, clustering of high-dimensional data, constraint based clustering and outlier
analysis.

i. K-means clustering
ii. Hierarchical clustering
iii. Density based clustering
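As an illustration of the first technique in this list, the sketch below follows the usual K-means iteration: assign each point to its nearest centroid, then move each centroid to the mean of its cluster. It is a simplified sketch, not the implementation used by the tools in this study; taking the first k points as initial centroids is an assumption made to keep the example deterministic, and the sample points are hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """K-means by alternating an assignment step and a centroid update step."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # converged: assignments can no longer change
            break
        centroids = updated
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.0, 2.0), (5.0, 6.0), (2.0, 1.0), (6.0, 5.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Practical implementations usually choose the initial centroids at random or with a seeding heuristic, so results can vary between runs and between tools.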

VII. DATA MINING TOOLS

The data mining tools on which the integrated clustering-classification technique has been
implemented are described below.

WEKA

WEKA (Waikato Environment for Knowledge Analysis) is a data mining and machine learning tool
developed by the Department of Computer Science, University of Waikato, New Zealand. It is an open
source collection of many data mining and machine learning algorithms, including data
pre-processing, classification, regression, clustering, association rule extraction and feature
selection, and it supports the .arff (attribute relation file format) file format.
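An .arff file is plain text: a header declares the relation and its attributes, and a data section lists one comma-separated record per line. The sketch below reads a small file of that shape. It is a simplified reader written for illustration, not WEKA's own parser: it ignores sparse rows, quoted values and date attributes, and the sample file and the function name `parse_arff` are hypothetical.

```python
def parse_arff(text):
    """Parse a small, dense ARFF document into (relation, attributes, rows).

    Simplified sketch: no sparse rows, quoted values or date attributes.
    """
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lowered = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lowered.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lowered.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lowered.startswith("@data"):
            in_data = True
    return relation, attributes, rows

# Hypothetical file contents, not taken from the dataset used in this study.
sample = """% hypothetical example file
@relation demo_patients
@attribute glucose numeric
@attribute bmi numeric
@attribute class {negative,positive}
@data
120,30.5,positive
90,24.0,negative
"""

relation, attributes, rows = parse_arff(sample)
print(relation)
print(attributes)
print(rows)
```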

Tanagra

Tanagra was written as an aid to education and research on data mining by Ricco Rakotomalala. The
entire user operation of Tanagra is based on the stream diagram paradigm. Under this paradigm, a
user builds a graph specifying the data sources and the operations on the data. Paths through the
graph describe the flow of data through manipulations and analysis. Tanagra simplifies this
paradigm by restricting the graph to be a tree, so there is only one parent for each node and
therefore only one data source for each operation.

KNIME




KNIME (Konstanz Information Miner) is an open source data analytics, reporting and integration
platform. KNIME integrates various components for machine learning and data mining through its
modular data pipelining concept. A graphical user interface allows the assembly of nodes for data
preprocessing (ETL: Extraction, Transformation, Loading), modeling, data analysis and
visualization.

Orange

Orange is a component-based data mining and machine learning software suite, featuring a visual
programming front-end for explorative data analysis and visualization, and Python bindings and
libraries for scripting. It includes a set of components for data preprocessing, feature scoring
and filtering, modeling, model evaluation and exploration techniques. It is implemented in C++ and
Python.

Fig.2. Process Flow

VIII. EXPERIMENTAL RESULTS

The "Pima Indian Diabetes" dataset is considered with the K-means, Hierarchical and Density based
clustering techniques and the different classification algorithms available in the data mining
tools. The Pima Indian Diabetes dataset is available from the UCI machine learning repository.
Table I shows the accuracy measure of the K-means clustering technique for the different
classifiers used. SVM provides the highest accuracy, in the range of 76-78%, followed by Naive
Bayes with accuracy in the range of 73-76%, KNN with accuracy between 72-73%, and C4.5 with
accuracy in the range of 69-74%. The pictorial representation is shown in Fig.3.

Table I: Accuracy for K-means clustering

Classifier    WEKA      Tanagra   Orange    KNIME
NB            76.32%    74.87%    75.38%    73.17%
C4.5          73.82%    74.21%    70.05%    69.27%
KNN           73.17%    72.11%    72.90%    72.26%
SVM           78.34%    76.45%    76.17%    77.60%




Fig.3. Accuracy for K-means clustering

Table II shows the accuracy measure of the Hierarchical clustering technique for the different
classifiers used. The SVM classifier gives the best accuracy overall, between 73-77%. KNN gives
accuracy between 68-78%, C4.5 between 62-71% and Naive Bayes between 64-68%. The pictorial
representation is shown in Fig.4.

Table II: Accuracy for Hierarchical Clustering

Classifier    WEKA      Tanagra   Orange    KNIME
NB            65.22%    66.34%    68.74%    64.01%
C4.5          62.32%    71.08%    70.65%    68.75%
KNN           71.08%    68.84%    69.38%    78.21%
SVM           77.67%    76.24%    77.68%    73.01%

Fig.4. Accuracy for Hierarchical Clustering

Table III shows the accuracy measure of the Density based clustering technique for the different
classifiers used. The SVM classifier gives the best accuracy overall, between 73-77%. KNN has
accuracy between 69-75%, C4.5 between 62-69% and Naive Bayes between 63-64%. The pictorial
representation is shown in Fig.5.

Table III: Accuracy for Density based clustering

Classifier    WEKA      Tanagra   Orange    KNIME
NB            63.58%    63.34%    64.89%    64.22%
C4.5          62.11%    68.71%    69.28%    68.59%
KNN           69.65%    70.84%    70.95%    75.52%
SVM           77.00%    74.89%    74.87%    73.81%




Fig.5. Accuracy for Density based clustering

Comparing the data in Tables I, II and III shows that the SVM classifier is the best for the
K-means, Hierarchical and Density based clustering-classification techniques. The percentage
accuracy using the SVM classifier is in the range of 76-78% with K-means clustering and 73-77% with
the other two clustering techniques. The comparison of the tables also shows that the results of
the K-means clustering technique are more accurate than those of the other clustering techniques.
Overall, the accuracy of the K-means clustering technique is about 2-12% greater than that of the
other clustering techniques across the range of tools and algorithms used.
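The comparison across the three tables can be reproduced with a short script. The accuracy values below are transcribed from Tables I, II and III, one value per tool; averaging over the four tools is a summary chosen for this sketch and is not a statistic reported in the paper. The names `ACCURACY` and `best_classifier` are hypothetical.

```python
from statistics import mean

# Accuracy values (%) transcribed from Tables I-III, one value per tool.
ACCURACY = {
    "K-means": {
        "NB": [76.32, 74.87, 75.38, 73.17],
        "C4.5": [73.82, 74.21, 70.05, 69.27],
        "KNN": [73.17, 72.11, 72.90, 72.26],
        "SVM": [78.34, 76.45, 76.17, 77.60],
    },
    "Hierarchical": {
        "NB": [65.22, 66.34, 68.74, 64.01],
        "C4.5": [62.32, 71.08, 70.65, 68.75],
        "KNN": [71.08, 68.84, 69.38, 78.21],
        "SVM": [77.67, 76.24, 77.68, 73.01],
    },
    "Density based": {
        "NB": [63.58, 63.34, 64.89, 64.22],
        "C4.5": [62.11, 68.71, 69.28, 68.59],
        "KNN": [69.65, 70.84, 70.95, 75.52],
        "SVM": [77.00, 74.89, 74.87, 73.81],
    },
}

def best_classifier(table):
    """Return the classifier with the highest mean accuracy across the tools."""
    return max(table, key=lambda name: mean(table[name]))

for technique, table in ACCURACY.items():
    best = best_classifier(table)
    low, high = min(table[best]), max(table[best])
    print(f"{technique}: best = {best}, "
          f"mean = {mean(table[best]):.2f}%, range = {low:.2f}-{high:.2f}%")
```

On these values, SVM has the highest mean accuracy under all three clustering techniques, and every classifier has its highest mean accuracy under K-means.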

IX. CONCLUSIONS

Data mining is the extraction of useful patterns and relationships from data sources such as
databases, texts and the web. This research discussed different data mining tools and the
importance of these tools by considering various aspects. The experimental results were compared
with existing clustering and classification techniques; the comparison shows improved accuracy and
that SVM performs best compared to the other methods.

X. REFERENCES

[1] David Heckerman, "Bayesian Network for Data Mining and Knowledge Discovery", 1997.

[2] David Hand, Heikki Mannila and Padhraic Smyth, "Principles of Data Mining", The MIT Press, 2001.

[3] Ritu Chauhan, Harleen Kaur, M. Afshar Alam, "Data Clustering Method for Discovering Clusters in
Spatial Cancer Databases", International Journal of Computer Applications, Volume 10, No. 6,
November 2010.

[4] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.

[5] S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Preprocessing for Supervised Learning",
International Journal of Computer Science, Vol. 1, No. 2, pp. 111-117, 2006.

[6] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations",
Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, University of
California Press, pp. 281-297, 1967.

[7] S. P. Lloyd, "Least squares quantization in PCM", IEEE Transactions on Information Theory 28,
pp. 129-137, 1982.

[8] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar and Nidhi Gupta, "A Comparative
Study of Various Clustering Algorithms in Data Mining", International Journal of Engineering
Research and Applications, Vol. 2, Issue 3, 2012.

[9] Timothy C. Havens, "Clustering in relational data and ontologies", July 2010.

AUTHOR PROFILE

Dr. S. Prasath is currently working as an Assistant Professor in the Department of Computer
Science, Erode Arts & Science College (Autonomous), Erode, Tamilnadu, India. He received his Ph.D.
degree from Bharathiar University, Coimbatore, Tamilnadu, India in 2015. He obtained his Masters
degree in Software Engineering from M. Kumarasamy College of Engineering, Karur, under Anna
University, Chennai in 2008 and his M.Phil degree in Computer Science in 2009. His areas of
interest include Image Processing and Data Mining. He has presented 6 papers at National and 2 at
International level conferences. He has published 10 papers in National and International
journals.
