A Comparative Study on Classification and Clustering Techniques Using Assorted Data Mining Tools


  • International Journal of Advance Foundation and Research in Computer (IJAFRC)

    Volume 2, Issue 8, August - 2015. ISSN 2348 4853, Impact Factor 1.317

    55 | 2015, IJAFRC All Rights Reserved www.ijafrc.org

    A Comparative Study on Classification and Clustering Techniques Using Assorted Data Mining Tools

    Dr.S.Prasath

    Assistant Professor, Department of Computer Science,

    Erode Arts and Science College (Autonomous), Erode.

    [email protected]

    ABSTRACT

    Data mining is essentially the discovery of important data and patterns in extremely large amounts of available information. Two very important data mining techniques are clustering and classification: the latter uses a set of pre-classified examples to build a model that can classify the population of records at large, while the former partitions the data into groups of similar objects. This paper proposes a new method that integrates these two data mining techniques, viz. clustering and classification. A comparative study is then carried out between simple classification and the newly proposed integrated clustering-classification technique. Four popular data mining tools were used for both techniques, with six different classifiers and one clusterer for all sets. It was found that across all the tools used, the integrated clustering-classification technique was better than the simple classification technique. This result was consistent for all six classifiers used. For both techniques, the best classifier was found to be SVM. Of the four tools used, WEKA was found to be the best in terms of algorithmic flexibility. All comparisons were drawn by comparing the percentage accuracy of each classifier used.

    Keywords: Data mining, Classification, Clustering, Data mining tools, WEKA, Orange, Fuzzy.

    I. INTRODUCTION

    Data mining focuses on the automated discovery of new facts and relationships in already existing data. The various techniques of data mining include association, regression, prediction, clustering and classification. Clustering is the division of data into groups of similar objects. Clustering is an example of unsupervised learning, as it learns by observation [1]. Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data [2]. This paper deals with the use of the integrated clustering-classification technique on some of the free data mining tools available today. The tools on which the integrated clustering-classification technique has been implemented are KNIME (Konstanz Information Miner), Tanagra [3], Orange and WEKA (Waikato Environment for Knowledge Analysis) [4]. The classifiers used for this purpose are Naive Bayes, Support Vector Machine, K Nearest Neighbor, Zero Rule, Decision Tree and One Rule.

    Data mining is the process of automatically classifying cases based on data patterns obtained from a dataset. A number of algorithms have been developed and implemented to extract information and discover knowledge patterns that may be useful for decision support. Data mining is also known as KDD (Knowledge Discovery in Databases); data preprocessing, pattern recognition, clustering and classification are the popular techniques in data mining. This paper examines in detail the data preprocessing that takes place in the database server module, and then discusses the database module, characterized as classification and clustering, using the data mining tools.

    II. KNOWLEDGE DISCOVERY PROCESS

    The terms Knowledge Discovery in Databases (KDD) and data mining are often used interchangeably. KDD is the process of turning low-level data into high-level knowledge. Hence, KDD refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. Although data mining and KDD are often treated as equivalent terms, in reality data mining is an important step in the KDD process.

    The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. In the data cleaning stage, noisy and irrelevant data are removed from the collection. In the data integration stage, multiple data sources, often heterogeneous, may be combined into a common source. In the data selection stage, the data relevant to the analysis is decided on and retrieved from the data collection. Data transformation, also known as data consolidation, is the stage in which the selected data is transformed into forms appropriate for the mining procedure. Data mining is the stage in which intelligent techniques are applied to extract potentially useful patterns. In the pattern evaluation stage, interesting patterns representing knowledge are identified based on given measures. Knowledge representation is the final stage, in which the discovered knowledge is visually presented to the user. This step uses visualization techniques to help users understand and interpret the data mining results.
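    The preprocessing stages described above (integration, cleaning, selection and transformation) can be illustrated with a minimal sketch. The records, field names and function names below are hypothetical and chosen only for illustration; min-max scaling is assumed as one possible transformation, not necessarily the one used in the experiments.

```python
# Hypothetical toy records from two sources; None marks a missing value.
source_a = [
    {"id": 1, "age": 50, "glucose": 148},
    {"id": 2, "age": None, "glucose": 85},
]
source_b = [
    {"id": 3, "age": 32, "glucose": 183},
    {"id": 3, "age": 32, "glucose": 183},  # exact duplicate of the previous record
    {"id": 4, "age": 21, "glucose": 89},
]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [dict(record) for source in sources for record in source]


def clean(records):
    """Data cleaning: drop records with missing values and exact duplicates."""
    seen, kept = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if None in record.values() or key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: record[f] for f in fields} for record in records]


def transform(records, fields):
    """Data transformation: min-max scale each field to the range [0, 1]."""
    scaled = [dict(record) for record in records]
    for f in fields:
        lo = min(record[f] for record in records)
        hi = max(record[f] for record in records)
        span = (hi - lo) or 1  # avoid division by zero for constant fields
        for record in scaled:
            record[f] = (record[f] - lo) / span
    return scaled


fields = ["age", "glucose"]
integrated = integrate(source_a, source_b)
cleaned = clean(integrated)
selected = select(cleaned, fields)
prepared = transform(selected, fields)
print(prepared)
```

    The output of `transform` is what a mining algorithm would then consume; pattern evaluation and knowledge representation are outside the scope of this sketch.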

    III. DATA MINING PROCESS

    In the KDD process, data mining methods are used for extracting patterns from data. The patterns that can be discovered depend on the data mining tasks applied. Generally, there are two types of data mining tasks: descriptive data mining tasks, which describe the general properties of the existing data, and predictive data mining tasks, which attempt to make predictions based on the available data. Data mining can be done on data in quantitative, textual or multimedia forms.

    Data mining applications can use different kinds of parameters to examine the data. They include association (patterns where one event is connected to another event), sequence or path analysis (patterns where one event leads to another event), classification (identification of new patterns with predefined targets) and clustering (grouping of identical or similar objects).

    Problem definition: The first step is to identify the goals, based on which the correct series of tools can be applied to the data to build the corresponding behavioral model.

    Data exploration: If the quality of the data is not suitable for an accurate model, recommendations on future data collection and storage strategies can be made. For analysis, all data needs to be consolidated so that it can be treated consistently.

    Data preparation: The purpose is to clean and transform the data: missing and invalid values are treated, and all known valid values are made consistent for more robust analysis.


    Modeling: Based on the data and the desired outcomes, a data mining algorithm or combination of algorithms is selected for analysis. These algorithms include classical techniques such as statistics, neighborhoods and clustering, as well as next-generation techniques such as decision trees, networks and rule-based algorithms. The specific algorithm is selected based on the particular objective to be achieved and the quality of the data to be analyzed.

    Evaluation and Deployment: Based on the results of the data mining algorithms, an analysis is conducted to determine key conclusions and to create a series of recommendations for further consideration.

    Fig.1 Steps involved in Data mining

    IV. DATA MINING METHODS

    Classification: A supervised learning technique in which the classes are known.

    Clustering: An unsupervised learning technique in which the classes are unknown.

    Association Rule Mining: Identifying hidden, previously unknown relationships between the items.

    Temporal Mining: The use of temporal data, modeling temporal events, time series, pattern detection, sequences and temporal association rules.

    Time Series Analysis: Describes the nature and behavior of time series data and predicts the future trend and behavior of the data.

    Web Mining: Mining web data, including web content mining, web structure mining and web usage mining.

    Spatial Mining: Used with GIS for mining knowledge from spatial databases, including spatial classification, clustering and rule generation tasks.

    V. DATA MINING CLASSIFICATION ALGORITHMS

    The different classification algorithms available are:

    Naive Bayes (NB): A probabilistic classifier based on Bayes' theorem with an independent-feature probability model.


    Decision tree (C4.5): A statistical classifier developed by Ross Quinlan that classifies data by generating decision trees.

    Support Vector Machine (SVM): An example of a non-probabilistic binary linear classifier; from a set of input data, it predicts which of the two possible classes forms the output.

    K Nearest Neighbor (KNN): An example of instance-based learning. KNN is sensitive to the local structure of the data; the function is only approximated locally, and all computation is deferred until classification.
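    As an illustration of instance-based learning and of the percentage-accuracy measure used for comparison in this paper, the following is a minimal KNN sketch. The data points, labels and function names are hypothetical, and Euclidean distance with majority voting is assumed; the tools discussed later may use different defaults.

```python
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Hypothetical two-class toy data: class 0 near (1, 1), class 1 near (5, 5).
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (5.0, 5.0), (6.0, 5.5), (5.5, 6.5)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(1.2, 1.4), (5.8, 5.9), (2.2, 1.8), (5.0, 6.0)]
test_y = [0, 1, 0, 1]

predictions = [knn_predict(train_X, train_y, q, k=3) for q in test_X]
print(predictions, accuracy(test_y, predictions))
```

    No model is built in advance: the training set is simply stored, and distances are computed only when a query arrives.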

    VI. CLUSTERING

    This technique partitions the records in a database into different groups. Records in the same group have similar properties: the differences between groups should be as large as possible, and the differences within the same group should be as small as possible. There are no predefined classes in this grouping, so it falls under unsupervised learning. Techniques included in cluster analysis are partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, clustering of high-dimensional data, constraint-based clustering and outlier analysis.

    i. K-means clustering
    ii. Hierarchical clustering
    iii. Density-based clustering
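    To make the partitioning idea concrete, here is a minimal sketch of K-means (Lloyd's algorithm). The points are hypothetical, and the centroids are initialised from the first k points purely so that the example is reproducible; real tools typically use random or smarter initialisation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iterations=100):
    """Return (centroids, labels) after alternating assignment and update steps."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: deterministic initialisation
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical toy data with two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.5, 2.0), (6.0, 5.5), (2.0, 1.0), (5.5, 6.5)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

    Points that end up with the same label are closer to their own centroid than to any other, which is the "small differences within a group" property described above.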

    VII. DATA MINING TOOLS

    The integrated clustering-classification technique has been implemented on the following data mining tools.

    WEKA

    WEKA (Waikato Environment for Knowledge Analysis) is a data mining and machine learning tool developed by the Department of Computer Science, University of Waikato, New Zealand. It is an open-source collection of many data mining and machine learning algorithms, including data pre-processing, classification, regression, clustering, association rule extraction and feature selection. It supports the .arff (Attribute-Relation File Format) file format.
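    For readers unfamiliar with the .arff format, the sketch below writes a minimal ARFF document as text. The relation name, attribute names and data rows are hypothetical; only the `@relation`, `@attribute` and `@data` keywords and the `{a,b}` nominal syntax belong to the format itself.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and data rows.

    A type is either a string such as "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"  # nominal attribute
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical sample loosely modelled on a diabetes dataset.
arff_text = to_arff(
    "diabetes_sample",
    [("glucose", "numeric"), ("age", "numeric"), ("class", ["negative", "positive"])],
    [(148, 50, "positive"), (85, 31, "negative")],
)
print(arff_text)
```

    The header declares the attributes in order, and each line after `@data` lists one comma-separated instance in that same order.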

    Tanagra

    Tanagra was written by Ricco Rakotomalala as an aid to education and research on data mining. The entire user operation of Tanagra is based on the stream diagram paradigm, under which a user builds a graph specifying the data sources and the operations on the data. Paths through the graph describe the flow of data through manipulations and analysis. Tanagra simplifies this paradigm by restricting the graph to be a tree, so that each node has only one parent and each operation has only one data source.

    KNIME


    KNIME (Konstanz Information Miner) is an open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface allows the assembly of nodes for data preprocessing (ETL: Extraction, Transformation, Loading), modeling, data analysis and visualization.

    Orange

    Orange is a component-based data mining and machine learning software suite, featuring a visual programming front-end for explorative data analysis and visualization, together with Python bindings and libraries for scripting. It includes a set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation and exploration techniques. It is implemented in C++ and Python.

    Fig.2. Process Flow
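    The paper does not spell out how the clustering output is passed on to the classifiers. One common arrangement, assumed here purely for illustration, is to append each record's cluster index as an extra attribute before classification. The sketch below uses a simplified K-means and a 1-nearest-neighbour classifier on hypothetical data; all names are illustrative and none of this is taken from the tools themselves.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the entry in `centers` closest to `point`."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def kmeans_centroids(points, k, iterations=50):
    """Simplified K-means; the first k points seed the centroids (assumption)."""
    centroids = list(points[:k])
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def add_cluster_feature(points, centroids):
    """Append the index of the nearest centroid as an extra attribute."""
    return [tuple(p) + (float(nearest(p, centroids)),) for p in points]


def one_nn(train_X, train_y, query):
    """1-nearest-neighbour classifier on the augmented records."""
    return train_y[nearest(query, train_X)]


# Hypothetical training and test data.
train_X = [(1.0, 1.0), (5.0, 5.0), (1.5, 2.0), (6.0, 5.5)]
train_y = ["neg", "pos", "neg", "pos"]
test_X = [(1.2, 1.1), (5.5, 5.2)]
test_y = ["neg", "pos"]

centroids = kmeans_centroids(train_X, k=2)           # step 1: cluster the training data
train_aug = add_cluster_feature(train_X, centroids)  # step 2: add the cluster attribute
test_aug = add_cluster_feature(test_X, centroids)
predictions = [one_nn(train_aug, train_y, q) for q in test_aug]  # step 3: classify
acc = 100.0 * sum(t == p for t, p in zip(test_y, predictions)) / len(test_y)
print(predictions, acc)
```

    Swapping the clusterer or the classifier changes only one step of this flow, which is how the different clustering techniques and classifiers can be compared on equal terms.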

    VIII. EXPERIMENTAL RESULTS

    The "Pima Indian Diabetes" dataset, available from the UCI Machine Learning Repository, is considered with the K-means, hierarchical and density-based clustering techniques and the different classification algorithms available in the data mining tools. Table I shows the accuracy of the K-means clustering technique for the different classifiers used. SVM provides the highest accuracy, in the range of 76-78%, followed by Naive Bayes with accuracy in the range of 73-76%, KNN with accuracy ranging between 72-73%, and then closely by C4.5 with accuracy in the range of 69-74%. A pictorial representation is shown in Fig.3.

    Table I: Accuracy for K-means clustering

    Classifier   Weka      Tanagra   Orange    KNIME
    NB           76.32%    74.87%    75.38%    73.17%
    C4.5         73.82%    74.21%    70.05%    69.27%
    KNN          73.17%    72.11%    72.90%    72.26%
    SVM          78.34%    76.45%    76.17%    77.60%


    Fig.3. Accuracy for K-means clustering

    Table II shows the accuracy of the hierarchical clustering technique for the different classifiers used. The SVM classifier gives accuracy between 73-77%, and Naive Bayes gives accuracy between 64-68%. According to the table, KNN gives accuracy between 68-78% and C4.5 between 62-71%. A pictorial representation is shown in Fig.4.

    Table II: Accuracy for Hierarchical Clustering

    Classifier   Weka      Tanagra   Orange    KNIME
    NB           65.22%    66.34%    68.74%    64.01%
    C4.5         62.32%    71.08%    70.65%    68.75%
    KNN          71.08%    68.84%    69.38%    78.21%
    SVM          77.67%    76.24%    77.68%    73.01%

    Fig.4. Accuracy for Hierarchical Clustering

    Table III shows the accuracy of the density-based clustering technique for the different classifiers used. The SVM classifier gives accuracy between 73-77%. According to the table, Naive Bayes gives accuracy between 63-64%, KNN between 69-75% and C4.5 between 62-69%. A pictorial representation is shown in Fig.5.

    Table III: Accuracy for Density based clustering

    Classifier   Weka      Tanagra   Orange    KNIME
    NB           63.58%    63.34%    64.89%    64.22%
    C4.5         62.11%    68.71%    69.28%    68.59%
    KNN          69.65%    70.84%    70.95%    75.52%
    SVM          77.00%    74.89%    74.87%    73.81%


    Fig.5. Accuracy for Density based clustering

    Comparing the data in Tables I, II and III shows that the SVM classifier is the best for the K-means, hierarchical and density-based clustering-classification techniques. However, the percentage accuracy using the SVM classifier is in the range of 76-78% and the range of 76-77%. The comparison of the tables shows that the K-means clustering technique gives more accurate results than the other clustering techniques. Overall, the accuracy of the K-means clustering technique is about 2-12% greater than that of the other clustering techniques across the range of tools and algorithms used.
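    The per-classifier ranges and the overall comparison can be recomputed directly from the tabulated values. The sketch below transcribes the accuracies from Tables I to III (column order: Weka, Tanagra, Orange, KNIME); the dictionary layout and function names are illustrative choices, and the mean over all cells is only one possible summary.

```python
# Accuracy values (%) transcribed from Tables I-III.
# Each list is ordered as: Weka, Tanagra, Orange, KNIME.
TABLES = {
    "K-means": {
        "NB": [76.32, 74.87, 75.38, 73.17],
        "C4.5": [73.82, 74.21, 70.05, 69.27],
        "KNN": [73.17, 72.11, 72.90, 72.26],
        "SVM": [78.34, 76.45, 76.17, 77.60],
    },
    "Hierarchical": {
        "NB": [65.22, 66.34, 68.74, 64.01],
        "C4.5": [62.32, 71.08, 70.65, 68.75],
        "KNN": [71.08, 68.84, 69.38, 78.21],
        "SVM": [77.67, 76.24, 77.68, 73.01],
    },
    "Density based": {
        "NB": [63.58, 63.34, 64.89, 64.22],
        "C4.5": [62.11, 68.71, 69.28, 68.59],
        "KNN": [69.65, 70.84, 70.95, 75.52],
        "SVM": [77.00, 74.89, 74.87, 73.81],
    },
}


def accuracy_ranges(table):
    """Lowest and highest accuracy of each classifier across the four tools."""
    return {clf: (min(values), max(values)) for clf, values in table.items()}


def mean_accuracy(table):
    """Mean accuracy over every classifier and tool in one table."""
    values = [a for row in table.values() for a in row]
    return sum(values) / len(values)


ranges = {name: accuracy_ranges(table) for name, table in TABLES.items()}
means = {name: mean_accuracy(table) for name, table in TABLES.items()}
best_clustering = max(means, key=means.get)

for name in TABLES:
    print(name, round(means[name], 2), ranges[name])
print("Highest mean accuracy:", best_clustering)
```

    Printing the ranges next to the means makes it easy to check each quoted range against the corresponding table row.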

    IX. CONCLUSIONS

    Data mining is the extraction of useful patterns and relationships from data sources such as databases, texts and the web. This research discussed different data mining tools and the importance of these tools from various aspects. The experimental results were compared across existing clustering and classification techniques; combining them gives better accuracy, and SVM performs best compared with the other methods.

    X. REFERENCES

    [1] David Heckerman, "Bayesian Networks for Data Mining", Data Mining and Knowledge Discovery, 1997.

    [2] David Hand, Heikki Mannila and Padhraic Smyth, "Principles of Data Mining", The MIT Press, 2001.

    [3] Ritu Chauhan, Harleen Kaur, M. Afshar Alam, "Data Clustering Method for Discovering Clusters in Spatial Cancer Databases", International Journal of Computer Applications, Volume 10, No. 6, November 2010.

    [4] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.

    [5] S. Kotsiantis, D. Kanellopoulos, P. Pintelas, "Data Preprocessing for Supervised Learning", International Journal of Computer Science, Vol. 1, No. 2, pp. 111-117, 2006.

    [6] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, pp. 281-297, 1967.

    [7] S. P. Lloyd, "Least Squares Quantization in PCM", IEEE Transactions on Information Theory, Vol. 28, pp. 129-137, 1982.


    [8] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar and Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining", International Journal of Engineering Research and Applications, Vol. 2, Issue 3, 2012.

    [9] Timothy C. Havens, "Clustering in Relational Data and Ontologies", July 2010.

    AUTHOR PROFILE

    Dr.S.Prasath is currently working as an Assistant Professor in the Department of Computer Science, Erode Arts & Science College (Autonomous), Erode, Tamilnadu, India. He received his Ph.D. degree from Bharathiar University, Coimbatore, Tamilnadu, India in 2015. He obtained his Master's degree in Software Engineering from M.Kumarasamy College of Engineering, Karur, under Anna University, Chennai in 2008, and his M.Phil. degree in Computer Science in 2009. His areas of interest include image processing and data mining. He has presented 6 papers at national and 2 at international level conferences. He has published 10 papers in national and international journals.