Integrating Clustering with Different Data Mining Techniques in the Diagnosis of Heart Disease

    JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 20, ISSUE 1, AUGUST 2013

    Integrating Clustering with Different Data Mining Techniques in the Diagnosis of Heart Disease

    Mai Shouman, Tim Turner and Rob Stocker

    Abstract: Heart disease has been the leading cause of death in the world over the past 10 years. The research presented here is part of work to develop tools to assist healthcare practitioners to diagnose heart disease earlier, in the hope of earlier interventions in this preventable killer. The relative accuracy of common data mining techniques in heart disease diagnosis is difficult to assess from the literature. This research investigates Decision Tree, Naïve Bayes, and K-nearest Neighbour performance in the diagnosis of heart disease patients. It then further assesses performance enhancement through integrating clustering techniques. The testing was conducted over a standardized dataset used widely in the literature. The results show that integrating clustering with decision tree, naïve Bayes, and k-nearest neighbour could enhance their accuracies in diagnosing heart disease patients. Importantly, the results establish that the ensemble of two-cluster inlier k-means clustering with the k-nearest neighbour technique was the most effective at heart disease diagnosis.

    Index Terms: Data Mining, Decision Tree, Naïve Bayes, K-nearest Neighbour, K-Means Clustering, Heart Disease Diagnosis.

    1 INTRODUCTION

    Heart disease has been the leading cause of death in the world over the past 10 years. Moreover, the World Health Organization has reported that heart disease is the leading cause of death in both high and low income countries [1]. The European Public Health Alliance reports that heart attacks and other circulatory diseases account for 41% of all deaths [2]. The Economic and Social Commission for Asia and the Pacific found that in one fifth of Asian countries, most lives are lost to non-communicable diseases such as cardiovascular disease, cancer, and diabetes [3]. Statistics South Africa reports that heart and circulatory system diseases are the third leading cause of death in Africa [4]. The Australian Bureau of Statistics reported that heart and circulatory system diseases are the leading cause of death in Australia, causing 33.7% of all deaths [5]. This high level of mortality is a tragedy, particularly because heart diseases are typically eminently treatable if detected early [6].

    Motivated by the increasing mortality of heart disease patients worldwide each year and the availability of a huge amount of patient data that could be used to extract useful knowledge, researchers have been using data mining techniques to help health care professionals in the diagnosis of heart disease patients [7-8]. Data mining is an essential step in knowledge discovery. It involves the exploration of large datasets to extract hidden and previously unknown patterns, relationships and knowledge that are difficult to detect with traditional statistical methods [9-13]. The application of data mining is rapidly spreading in a wide range of sectors such as analysis of organic compounds, financial forecasting, weather forecasting, and healthcare [14].

    Data mining in healthcare is an emerging field of high importance for providing prognosis and a deeper understanding of medical data. Researchers are using data mining techniques in the medical diagnosis of several diseases such as diabetes [15], stroke [16], cancer [17], and heart disease [18]. Several data mining techniques are used in the diagnosis of heart disease, such as naïve Bayes, decision tree, k-nearest neighbour, neural network, kernel density, bagging algorithm, and support vector machine, showing different levels of accuracy [19-24]. Recently, researchers have suggested that integrating more than one data mining technique can enhance performance in the diagnosis of heart disease patients [18, 25-27].

    Although heart diseases are among the most common chronic diseases causing a high incidence of death all over the world, they have also been identified among the most preventable ones [28]. Early detection and healthy behaviours play an important role in preventing and controlling the effects of these diseases [6, 28-29]. There is a need for accurate systematic tools that identify patients at high risk and provide information for early intervention in these diseases [30].

    This research investigates applying different single and hybrid data mining techniques in the diagnosis of heart disease patients. It investigates whether integrating clustering with different data mining techniques can provide better performance than single data mining techniques in the diagnosis of heart disease patients.

    The rest of the paper is organized as follows: the background section briefly reviews applying data mining techniques in the diagnosis of heart disease; the methodology section explains the different single and hybrid data mining techniques used in the diagnosis of heart disease patients; the validation and evaluation section presents the procedure used to measure the stability of the proposed model; the heart disease data section introduces the de facto standard dataset used; and the results section presents the results of a systematic investigation of single and hybrid data mining techniques used in the diagnosis of heart disease patients, which is followed by the discussion of the presented results. Finally, we draw conclusions from this preliminary investigation to emphasize the importance of this systematic study in establishing a baseline for research and practice, and point to future work.

    2 BACKGROUND

    Researchers have been investigating the use of statistical analysis and data mining techniques to help healthcare professionals in the diagnosis of heart disease for many years. Statistical analysis has identified the risk factors associated with heart disease to be age, blood pressure, smoking [31], cholesterol [32], diabetes [33], family history of heart disease [34], obesity, and lack of physical activity [35]. Knowledge of the risk factors associated with heart disease helps health care professionals to identify patients at high risk of having heart disease.

    Although statistical methods help in identifying the risk factors associated with heart disease, they are not sufficient for the diagnosis of patients. Data mining can play an important role in the diagnosis of heart disease patients. Researchers have been applying different data mining techniques over different heart disease datasets to help health care professionals in the diagnosis of heart disease [18-19, 22-24, 36]. Unfortunately, the results of the different data mining studies generally cannot be compared because they use different datasets [37]. However, over time, a benchmark data set has arisen in the literature: the Cleveland Heart Disease Dataset (CHDD) [38].

    Researchers have been using the CHDD to apply different data mining techniques such as decision tree, naïve Bayes, bagging algorithm, support vector machine, and neural network, showing different levels of accuracy [19, 23, 36, 39-40]. Recently, researchers have been investigating whether integrating more than one data mining technique can enhance data mining performance in the diagnosis of heart disease patients on the CHDD. Polat et al. [41] used an artificial immune recognition system (AIRS), and other research used fuzzy AIRS and k-nearest neighbour [26]. Kavitha et al. presented a new system for the detection of heart disease that uses a neural network and a genetic algorithm for forward feedback. The results showed that the proposed hybridization is more stable [42]. This research investigates integrating clustering with different data mining techniques to identify whether this integration can enhance their performance in the diagnosis of heart disease patients.

    K-means clustering is one of the most popular and well known clustering techniques. Its simplicity and reliable behaviour make it popular in many applications [43]. Initial centroid selection is a critical issue in k-means clustering and strongly affects its results [44].

    This research investigates applying three common data mining techniques, namely decision tree, naïve Bayes, and k-nearest neighbour, as single data mining techniques. It then investigates integrating k-means clustering with different initial centroid selection methods as well as different numbers of clusters in the diagnosis of heart disease patients. Importantly, our research thoroughly investigates each hybridization technique, testing which technique and which initial centroid selection method can provide better performance in diagnosing heart disease patients, and whether applying different numbers of clusters can provide different performance in diagnosing heart disease patients.

    3 METHODOLOGY

    This section discusses the single and hybrid data mining techniques used in the diagnosis of heart disease patients, which are shown in Figure 1. For the single data mining techniques, decision tree, naïve Bayes, and k-nearest neighbour are discussed. For the hybrid techniques, k-means clustering with five different initial centroid selection methods is discussed.

    FIGURE 1: PROPOSED APPROACH

    3.1 Single Data Mining Techniques

    3.1.1 Decision Tree

    The decision tree technique cannot deal with continuous attributes, so they need to be converted into discrete attributes, a process called discretization. Dougherty et al. carried out a comparative study between two unsupervised and two supervised discretization methods using 16 data sets showing that differences between the

    single and hybrid data mining techniques are calculated.

    5 HEART DISEASE DATA

    The data used in this study is the Cleveland Clinic Foundation heart disease data set, available at http://archive.ics.uci.edu/ml/datasets/Heart+Disease. The data set has 76 raw attributes. However, all of the published experiments only refer to 13 of them. The data set contains 303 rows, of which 297 are complete. Six rows contain missing values and are removed from the experiment. The attributes used in this study are shown in Table 1. In the Type column, C denotes continuous and D denotes discrete.

    TABLE 1: SELECTED CLEVELAND HEART DISEASE DATASET ATTRIBUTES

    Name         Type  Description
    Age          C     Age in years
    Sex          D     1 = male; 0 = female
    Cp           D     Chest pain type: 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic
    Trestbps     C     Resting blood pressure (in mm Hg)
    Chol         C     Serum cholesterol in mg/dl
    Fbs          D     Fasting blood sugar > 120 mg/dl: 1 = true; 0 = false
    Restecg      D     Resting electrocardiographic results: 0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy
    Thalach      C     Maximum heart rate achieved
    Exang        D     Exercise induced angina: 1 = yes; 0 = no
    Old peak ST  C     Depression induced by exercise relative to rest
    Slope        D     The slope of the peak exercise segment: 1 = up sloping; 2 = flat; 3 = down sloping
    Ca           D     Number of major vessels colored by fluoroscopy, ranging between 0 and 3
    Thal         D     3 = normal; 6 = fixed defect; 7 = reversible defect
    Diagnosis    D     Diagnosis classes: 0 = healthy; 1 = patient who is subject to possible heart disease
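The row filtering described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes a comma-separated file in which missing values are written as `?`, and the function name, the lower-case column names and the three sample rows are hypothetical.

```python
import csv
import io

# Column names follow Table 1; "diagnosis" is the class attribute.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "diagnosis"]

MISSING = "?"  # assumed missing-value marker


def load_complete_rows(text):
    """Parse comma-separated rows and drop any row with a missing value."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        fields = [f.strip() for f in fields]
        if len(fields) != len(COLUMNS) or MISSING in fields:
            continue  # incomplete row: excluded from the experiment
        rows.append(dict(zip(COLUMNS, map(float, fields))))
    return rows


# Illustrative rows only (not real patient records).
SAMPLE = """\
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
52,1,3,138,223,0,0,169,0,0.0,1,?,3,0
57,0,4,120,354,0,0,163,1,0.6,1,0,3,1
"""

complete = load_complete_rows(SAMPLE)
print(len(complete))  # 2 of the 3 sample rows are complete
```

Applied to the full data set, the same filter would be expected to keep the 297 complete rows reported above.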

    6 RESULTS

    This section discusses the results achieved when integrating clustering with decision tree, naïve Bayes and k-nearest neighbour techniques in the diagnosis of heart disease patients.

    6.1 Single Data Mining Techniques

    The sensitivity, specificity, and accuracy in the diagnosis of heart disease using the gain ratio decision tree, naïve Bayes, and k-nearest neighbour are shown in Table 2. Naïve Bayes achieves the best results, followed by k-nearest neighbour and decision tree, as shown in Figure 2. Naïve Bayes as a single technique shows a mean accuracy of 83.5% (standard deviation of 5.2%), as shown in Table 2.

    TABLE 2: DIFFERENT SINGLE DATA MINING TECHNIQUES RESULTS

    Technique Name             Sensitivity      Specificity      Accuracy
                               Mean    StDev    Mean    StDev    Mean    StDev
    Gain Ratio Decision Tree   75.6    6.1      81.6    12.1     79.1    5.8
    Naïve Bayes                78      13.8     80.8    12.6     83.5    5.2
    KNN K=19                   76.7    10.7     85.1    7.5      83.2    4.1

    FIGURE 2: DIFFERENT SINGLE DATA MINING TECHNIQUES
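The three measures reported in Table 2 can be computed from predicted and actual diagnoses as sketched below. The sketch uses the standard definitions (sensitivity as the true positive rate, specificity as the true negative rate) and assumes that diagnosis class 1 is the positive class; the function name and the example labels are hypothetical.

```python
def diagnostic_metrics(actual, predicted, positive=1):
    """Return (sensitivity, specificity, accuracy) for two label sequences."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # sick patient correctly flagged
            else:
                fp += 1  # healthy patient wrongly flagged
        else:
            if a == positive:
                fn += 1  # sick patient missed
            else:
                tn += 1  # healthy patient correctly cleared
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(actual) if actual else 0.0
    return sensitivity, specificity, accuracy


# Hypothetical labels for eight test instances.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(diagnostic_metrics(actual, predicted))  # (0.75, 0.75, 0.75)
```

The Mean and StDev columns in the tables would then be the mean and standard deviation of these values over repeated test partitions.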

    6.2 Clustering With Data Mining Techniques

    Different clustering columns are used, involving age, blood pressure and cholesterol. Using age as the clustering column showed the best results. The sensitivity, specificity, and accuracy in the diagnosis of heart disease using k-means clustering and different data mining techniques with different initial centroid selection methods and different numbers of clusters are presented next.
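Clustering on a single column such as age can be illustrated with a small k-means sketch. It is a generic implementation of Lloyd's algorithm that takes the initial centroids as an argument, so any initial centroid selection method can be plugged in; it does not reproduce the paper's selection methods or the way clusters are combined with the classifiers. The ages and starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Run k-means on one numeric column from the given initial centroids.

    Returns the final centroids and the cluster index of every value.
    """
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [v for v, label in zip(values, labels) if label == j]
            if members:
                new_centroids.append(sum(members) / len(members))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical ages, with the two extreme values as starting centroids.
ages = [29, 35, 41, 44, 58, 62, 67, 70]
final_centroids, clusters = kmeans_1d(ages, [29, 70])
print(final_centroids)  # [37.25, 64.25]
print(clusters)         # [0, 0, 0, 0, 1, 1, 1, 1]
```

Different starting centroids can lead to different final clusters, which is why the initial centroid selection method is varied in the experiments that follow.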

    [Figure 2 chart: accuracy (0 to 100) of Gain Ratio, Naïve Bayes, and KNN K=19]

    6.2.1 Integrating Clustering With Gain Ratio Decision Tree

    Table 3 presents the results of integrating the gain ratio decision tree with k-means clustering with different initial centroid selection methods and with different numbers of clusters. Integrating the gain ratio decision tree with k-means clustering could enhance gain ratio accuracy in the diagnosis of heart disease patients. The best result for the gain ratio decision tree is achieved by the two-cluster inlier initial centroid selection method, showing a mean accuracy of 81.2% (standard deviation of 6.2%), as shown in Table 3. This is an increase of 2.1% in mean accuracy compared with the single gain ratio decision tree, which shows 79.1%, as shown in Table 3.

    TABLE 3: INTEGRATING CLUSTERING WITH GAIN RATIO DECISION TREE

    Technique          Clusters   Sensitivity      Specificity      Accuracy
                       No         Mean    StDev    Mean    StDev    Mean    StDev
    Gain Ratio                    75.6    6.1      81.6    12.1     79.1    5.8
    Inlier             2          75.9    7.2      85.1    11.4     81.2    6.2
                       3          75.8    8.5      83.4    14.6     80.8    8.6
                       4          71.5    11.8     78.1    12.3     77.5    8.6
                       5          71.5    11.8     78.1    12.3     77.5    8.6
    Outlier            2          76      9.6      80.3    13.4     78.7    6.7
                       3          75.8    8.5      83.4    14.6     80.8    8.6
                       4          73.2    11.2     84.1    11.8     79.7    8.2
                       5          73.2    11.2     84.1    11.8     79.7    8.2
    Range              2          76      9.6      80.3    13.4     78.7    6.7
                       3          76.2    10.8     76      13.2     78.9    8.5
                       4          71.9    12       78.1    12.3     77.8    8.6
                       5          71.9    12       78.1    12.3     77.8    8.6
    Random Row         2          76.1    8.9      82.8    9.6      80.1    4.7
                       3          69.8    16.5     77      13.1     76.3    9.3
                       4          73.2    9        79.1    10.5     78.4    8.3
                       5          69.3    13.9     79.5    9.6      77.1    7.8
    Random Attribute   2          74.3    8        84      12.2     80.1    7
                       3          70.5    10.8     81.7    12.4     79.4    8.8
                       4          68.1    15.9     80.8    11.3     77      9.1
                       5          69.3    13.9     79.5    9.6      77.1    7.8

    Figure 3 shows the mean accuracy of integrating the gain ratio decision tree with k-means clustering with different initial centroid selection methods and with different numbers of clusters. There is no specific trend between increasing the number of clusters when integrating the gain ratio decision tree and the increase or decrease of the accuracy, as shown in Figure 3.

    FIGURE 3: INTEGRATING CLUSTERING WITH GAIN RATIO DECISION TREE
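For readers unfamiliar with the splitting criterion behind the gain ratio decision tree, the sketch below computes the gain ratio of one discrete attribute. It assumes the standard C4.5-style definition (information gain divided by split information), which may differ in detail from the implementation used in the experiments; the function names and example values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(attribute_values, labels):
    """Information gain of the attribute divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many small partitions.
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical binary attribute against a binary diagnosis.
diagnosis = [0, 0, 1, 1]
print(gain_ratio([0, 0, 1, 1], diagnosis))  # perfectly informative attribute
print(gain_ratio([0, 1, 0, 1], diagnosis))  # uninformative attribute
```

A tree builder would evaluate this score for every candidate attribute at a node and split on the one with the highest value.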

    6.2.2 Clustering and Naïve Bayes

    Table 4 presents the results of integrating naïve Bayes with k-means clustering with different initial centroid selection methods and with different numbers of clusters. Integrating naïve Bayes with k-means clustering could enhance naïve Bayes accuracy in the diagnosis of heart disease patients. The best result for naïve Bayes is achieved by the three-cluster random row initial centroid selection method, showing a mean accuracy of 84.8% (standard deviation of 4.8%), as shown in Table 4. This is an increase of 1.3% in mean accuracy compared with the single naïve Bayes, which shows 83.5%, as shown in Table 4.

    TABLE 4: INTEGRATING CLUSTERING WITH NAÏVE BAYES

    Technique          Clusters   Sensitivity      Specificity      Accuracy
                       No         Mean    StDev    Mean    StDev    Mean    StDev
    Naïve Bayes                   78      13.8     80.8    12.6     83.5    5.2
    Inlier             2          73.2    22       83.3    12.2     83.1    4.7
                       3          75.9    14.7     83.7    12.6     83.6    4.4
                       4          76.9    13.5     81.7    13.1     83.5    3.1
                       5          76.9    13.5     81.7    13.1     83.5    3.1
    Outlier            2          76.4    15.2     82.3    13.1     83.4    4.5
                       3          75.9    14.7     83.7    12.6     83.6    4.4
                       4          70.1    17.4     83.1    12.9     80.7    6.8
                       5          70.1    17.4     83.1    12.9     80.7    6.8
    Range              2          76.4    15.2     82.3    13.1     83.4    4.5
                       3          78      13.1     81.3    14.6     84.1    4.3
                       4          76.9    13.5     81.7    13.1     83.5    3.1
                       5          76.9    13.5     81.7    13.1     83.5    3.1
    Random Row         2          76.5    14.6     82      12.7     83.1    3.4
                       3          76.6    14.4     85      13.2     84.8    4.7
                       4          71.4    19.6     83      13.1     81.6    6.3
                       5          71.1    14.8     81.1    13.1     80.4    4.7
    Random Attribute   2          76.4    15.2     83.8    12.5     84      4.4
                       3          74.3    17.6     82.2    14.1     82.3    6
                       4          70.9    20.5     83.5    12.3     82      5.7
                       5          71.1    14.8     81.1    13.1     80.4    4.7

    Figure 4 shows the mean accuracy of integrating naïve Bayes with k-means clustering with different initial centroid selection methods and with different numbers of clusters. There is not a specific trend

    [Chart: mean accuracy (0 to 100) for the single technique and for 2, 3, 4 and 5 clusters under each of the Inlier, Outlier, Range, Random Row, and Random Attribute initial centroid selection methods]

    another. The best increase in accuracy among the hybrid data mining techniques is achieved by k-nearest neighbour, whose mean accuracy rises by 2.5%, from 83.2% to 85.7%, as shown in Figure 6.

    FIGURE 6: SINGLE AND HYBRID DATA MINING TECHNIQUES

    Table 6 shows the increase in accuracy of different data mining techniques when integrated with k-means clustering. The best increase in accuracy is achieved by k-nearest neighbour, followed by the gain ratio decision tree and naïve Bayes, showing increases of 2.5%, 2.1%, and 1.3% respectively. Table 6 also shows the t-test significance when comparing each single data mining technique with its integration with k-means clustering. The t-test is calculated at the 95% confidence level. The t-test shows a significant increase in accuracy between each single data mining technique and its integration with k-means clustering, as shown in Table 6. It also shows a significant difference when comparing k-nearest neighbour integrated with k-means clustering against naïve Bayes and decision tree integrated with k-means clustering.
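The kind of comparison reported in Table 6 can be illustrated with a paired t statistic over per-run accuracies. The text does not state which t-test variant was used, so the paired form below is an assumption, and the function name and accuracy values are hypothetical.

```python
import math
import statistics


def paired_t_statistic(single, hybrid):
    """t statistic for the mean paired difference (hybrid minus single)."""
    if len(single) != len(hybrid) or len(single) < 2:
        raise ValueError("need two equally sized samples of at least 2 runs")
    diffs = [h - s for s, h in zip(single, hybrid)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("differences have zero variance")
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Hypothetical accuracies (%) of a single and a hybrid technique on three runs.
single_acc = [80.0, 82.0, 84.0]
hybrid_acc = [81.0, 84.0, 87.0]
t = paired_t_statistic(single_acc, hybrid_acc)
print(round(t, 3))  # 3.464
```

The statistic is then compared with the critical value of Student's t distribution with n - 1 degrees of freedom at the chosen significance level.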

    Although the decision tree showed the lowest accuracy in the diagnosis of heart disease patients, it can still be very useful because it provides rules that explain how patients are diagnosed as healthy or sick. Naïve Bayes and k-nearest neighbour are based on probability and on distance measurement between instances, respectively. They do not provide any diagnostic explanation of why each testing instance is diagnosed as healthy or sick.
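The distance-based nature of k-nearest neighbour can be seen in a minimal sketch: the prediction is a majority vote among the closest training instances and carries no rule that explains it. The sketch assumes Euclidean distance on unscaled features; the function name, the (age, resting blood pressure) pairs and the small k are hypothetical, whereas the experiments above use K=19.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Pair every training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (age, resting blood pressure) instances; 1 = sick, 0 = healthy.
train_x = [(45, 120), (50, 130), (62, 160), (66, 170), (70, 150)]
train_y = [0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (64, 165), k=3))  # 1
print(knn_predict(train_x, train_y, (48, 125), k=3))  # 0
```

In practice the attributes would usually be scaled first, since an attribute with a large numeric range otherwise dominates the distance.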

    TABLE 6: SINGLE AND HYBRID DATA MINING TECHNIQUES

    Techniques               Accuracy         Increase in   T-test
                             Mean    StDev    Accuracy      Significance
    Decision Tree   Single   79.1    5.8      2.1           Yes (t=4.28, p

    disease using an artificial immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest neighbour) based weighting preprocessing. Expert Systems with Applications, 2007. 32: p. 625-631.

    27. Xiong, W. and C. Wang, Feature Selection: A Hybrid Approach Based on Self-adaptive Ant Colony and Support Vector Machine. International Conference on Computer Science and Software Engineering, IEEE, 2008.

    28. Centers for Disease Control and Prevention (CDC), C.D.P.a.H.P. August 2012; Available from: http://www.cdc.gov/nccdphp/.

    29. American Cancer Society, B.C.F.F. August 2012; Available from: http://www.cancer.org/downloads/STT/F861009_final 9-08-09.pdf.

    30. Paladugu, S. and C. Shyu, Temporal mining framework for risk reduction and early detection of chronic diseases. 2010.

    31. Heller, R.F., et al., How well can we predict coronary heart disease? Findings in the United Kingdom Heart Disease Prevention Project. British Medical Journal, 1984.

    32. Wilson, P.W.F., et al., Prediction of Coronary Heart Disease Using Risk Factor Categories. American Heart Association Journal, 1998.

    33. Simons, L.A., et al., Risk functions for prediction of cardiovascular disease in elderly Australians: the Dubbo Study. Medical Journal of Australia, 2003. 178.

    34. Salahuddin and F. Rabbi, Statistical Analysis of Risk Factors for Cardiovascular disease in Malakand Division. Pak. j. stat. oper. res., 2006. Vol. II: p. 49-56.

    35. Shahwan-Akl, L., Cardiovascular Disease Risk Factors among Adult Australian-Lebanese in Melbourne. International Journal of Research in Nursing, 2010. 6 (1).

    36. Tu, M.C., D. Shin, and D. Shin, Effective Diagnosis of Heart Disease through Bagging Approach. Biomedical Engineering and Informatics, IEEE, 2009.

    37. Shouman, M., T. Turner, and R. Stocker, Using data mining techniques in heart disease diagnosis and treatment. Japan-Egypt Conference on Electronics, Communications and Computers (JEC-ECC), 2012: p. 173-177.

    38. Blake, C.L. and C.J. Merz, UCI repository of machine learning databases. University of California, Irvine, Department of Information and Computer Sciences, 1998.

    39. Parthiban, L. and R. Subramanian, Intelligent Heart Disease Prediction System using CANFIS and Genetic Algorithm. International Journal of Biological and Life Sciences, 2007.

    40. Patil, S.B. and Y.S. Kumaraswamy, Extraction of Significant Patterns from Heart Disease Warehouses for Heart Attack Prediction. International Journal of Computer Science and Network Security (IJCSNS), 2009. Vol. 9, No. 2.

    41. Polat, K., et al., A new classification method to diagnosis heart disease: Supervised artificial immune system (AIRS). In Proceedings of the Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN), 2005.

    42. Kavitha, K.S., K.V. Ramakrishnan, and M.K. Singh, Modeling and design of evolutionary neural network for heart disease detection. International Journal of Computer Science Issues (IJCSI), 2010. Vol. 7, Issue 5.

    43. Wu, X., et al., Top 10 algorithms in data mining analysis. Knowl. Inf. Syst., 2007.

    44. Tajunisha, N. and V. Saravanan, A new approach to improve the clustering accuracy using informative genes for unsupervised microarray data sets. International Journal of Advanced Science and Technology, 2011.

    45. Bramer, M., Principles of data mining. 2007: Springer.

    46. Esposito, F., D. Malerba, and G. Semeraro, A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997. Vol. 19, No. 5.

    47. Alpaydin, E., Voting over Multiple Condensed Nearest Neighbors. Artificial Intelligence Review, 1997. 11: p. 115-132.

    48. Pavan, K.K., et al., Robust seed selection algorithm for k-means type algorithms. International Journal of Computer Science & Information Technology (IJCSIT), 2011. Vol. 3, No. 5.

    49. Poomagal, S. and T. Hamsapriya, Optimized k-means clustering with intelligent initial centroid selection for web search using URL and tag contents, in Proceedings of the International Conference on Web Intelligence, Mining and Semantics. 2011, ACM: Sogndal, Norway. p. 1-8.

    50. Santhi, M.V.B.T., et al., Enhancing K-Means Clustering Algorithm. International Journal of Computer Science & Technology, 2011.

    51. Immaculate Mary, C. and D.S.V. Kasmir Raja, Refinement of clusters from k-means with ant colony optimization. Journal of Theoretical and Applied Information Technology, 2009.

    52. Khan, D.M. and N. Mohamudally, A Multiagent System (MAS) for the Generation of Initial Centroids for k-means Clustering Data Mining Algorithm based on Actual Sample Datapoints. Journal of Next Generation Information Technology, August, 2010. Volume 1, Number 2.