Integrating Clustering with Different Data Mining Techniques in the Diagnosis of Heart Disease


JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 20, ISSUE 1, AUGUST 2013

Mai Shouman, Tim Turner and Rob Stocker

Abstract: Heart disease has been the leading cause of death in the world over the past 10 years. The research presented here is part of work to develop tools to assist healthcare practitioners to diagnose heart disease earlier, in the hope of earlier interventions in this preventable killer. The relative accuracy of common data mining techniques in heart disease diagnosis is difficult to assess from the literature. This research investigates Decision Tree, Naïve Bayes, and K-nearest Neighbour performance in the diagnosis of heart disease patients. It then further assesses performance enhancement through integrating clustering techniques. The testing was conducted over a standardized dataset used widely in the literature. The results show that integrating clustering with decision tree, naïve bayes, and k-nearest neighbour could enhance their accuracies in diagnosing heart disease patients. Importantly, the results establish that the ensemble of two-cluster inlier k-means clustering with the k-nearest neighbour technique was the most effective at heart disease diagnosis.

Index Terms: Data Mining, Decision Tree, Naïve Bayes, K-nearest Neighbour, K-Means Clustering, Heart Disease Diagnosis.


1 INTRODUCTION

Heart disease has been the leading cause of death in the world over the past 10 years. Moreover, the World Health Organization has reported that heart disease is the leading cause of death in both high and low income countries [1]. The European Public Health Alliance reports that heart attacks and other circulatory diseases account for 41% of all deaths [2]. The Economic and Social Commission for Asia and the Pacific found that in one fifth of Asian countries, most lives are lost to non-communicable diseases such as cardiovascular diseases, cancers, and diabetes [3]. Statistics South Africa reports that heart and circulatory system diseases are the third leading cause of death in Africa [4]. The Australian Bureau of Statistics reported that heart and circulatory system diseases are the leading cause of death in Australia, causing 33.7% of all deaths [5]. This high level of mortality is a tragedy, most particularly because heart diseases are typically eminently treatable if detected early [6].

Motivated by the increasing mortality of heart disease patients worldwide each year and the availability of a huge amount of patient data that could be used to extract useful knowledge, researchers have been using data mining techniques to help health care professionals in the diagnosis of heart disease patients [7-8]. Data mining is an essential step in knowledge discovery. It involves the exploration of large datasets to extract hidden and previously unknown patterns, relationships and knowledge that are difficult to detect with traditional statistical methods [9-13]. The application of data mining is rapidly spreading in a wide range of sectors such as analysis of organic compounds, financial forecasting, weather forecasting, and healthcare [14].

Data mining in healthcare is an emerging field of high importance for providing prognosis and a deeper understanding of medical data. Researchers are using data mining techniques in the medical diagnosis of several diseases such as diabetes [15], stroke [16], cancer [17], and heart disease [18]. Several data mining techniques are used in the diagnosis of heart disease, such as naïve bayes, decision tree, k-nearest neighbour, neural network, kernel density, bagging algorithm, and support vector machine, showing different levels of accuracy [19-24]. Recently, researchers have suggested that integrating more than one data mining technique can enhance performance in the diagnosis of heart disease patients [18, 25-27].

Although heart diseases are among the most common chronic diseases causing a high incidence of death all over the world, they have also been identified among the most preventable ones [28]. Early detection and healthy behaviours play an important role in preventing and controlling the effects of these diseases [6, 28-29]. There is a need for accurate systematic tools that identify patients at high risk and provide information for early intervention in these diseases [30].

This research investigates applying different single and hybrid data mining techniques in the diagnosis of heart disease patients. It investigates whether integrating clustering with different data mining techniques can provide better performance than single data mining techniques in the diagnosis of heart disease patients.

The rest of the paper is organized as follows: the background section briefly reviews applying data mining techniques in the diagnosis of heart disease; the methodology section explains the different single and hybrid data mining techniques used in the diagnosis of heart disease patients; the validation and evaluation section presents the procedure used to measure the stability of the proposed model; the heart disease data section introduces the de facto standard dataset used; and the results section presents the results of a systematic investigation of single and hybrid data mining techniques used in the diagnosis of heart disease patients, followed by a discussion of the presented results. Finally, we draw conclusions from this preliminary investigation to emphasize the importance of this systematic study in establishing a baseline for research and practice, and point to future work.

2 BACKGROUND

Researchers have been investigating the use of statistical analysis and data mining techniques to help healthcare professionals in the diagnosis of heart disease for many years. Statistical analysis has identified the risk factors associated with heart disease to be age, blood pressure, smoking [31], cholesterol [32], diabetes [33], family history of heart disease [34], obesity, and lack of physical activity [35]. Knowledge of the risk factors associated with heart disease helps health care professionals to identify patients at high risk of having heart disease.

Although statistical methods help in identifying the risk factors associated with heart disease, they are not sufficient to support the diagnosis of patients. Data mining can play an important role in the diagnosis of heart disease patients. Researchers have been applying different data mining techniques over different heart disease datasets to help health care professionals in the diagnosis of heart disease [18-19, 22-24, 36]. Unfortunately, the results of the different data mining studies generally cannot be compared because they use different datasets [37]. However, over time, a benchmark data set has arisen in the literature: the Cleveland Heart Disease Dataset (CHDD) [38].

Researchers have been using the CHDD to apply different data mining techniques such as decision tree, naïve bayes, bagging algorithm, support vector machine, and neural network, showing different levels of accuracy [19, 23, 36, 39-40]. Recently, researchers have been investigating whether integrating more than one data mining technique can enhance data mining performance in the diagnosis of heart disease patients on the CHDD. Polat et al. [41] used an artificial immune recognition system (AIRS), and other research used fuzzy AIRS and k-nearest neighbour [26]. Kavitha et al. presented a new system for the detection of heart disease that uses neural network and genetic algorithm for forward feedback. The results showed that the proposed hybridization is more stable [42]. This research investigates integrating clustering with different data mining techniques to identify whether this integration can enhance their performance in the diagnosis of heart disease patients.

K-means clustering is one of the most popular and well known clustering techniques. Its simplicity and reliable behaviour make it popular in many applications [43]. Initial centroid selection is a critical issue in k-means clustering and strongly affects its results [44].

This research investigates applying three common data mining techniques, namely decision tree, naïve bayes, and k-nearest neighbour, as single data mining techniques. It then investigates integrating k-means clustering with different initial centroid selection methods as well as different numbers of clusters in the diagnosis of heart disease patients. Importantly, our research thoroughly investigates each hybridization technique, testing which technique and which initial centroid selection method can provide better performance in diagnosing heart disease patients, and whether applying different numbers of clusters can provide different performance in diagnosing heart disease patients.
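As a minimal sketch of why the initial centroids matter, the following one-dimensional k-means routine takes the initial centroids as an argument, so that different selection methods can be compared on the same clustering column. The function name `kmeans_1d` and the sample ages are hypothetical and are not taken from the study; the loop is the standard assign-then-update k-means iteration and is not necessarily the exact variant used in this research.

```python
def kmeans_1d(values, initial_centroids, max_iter=100):
    """Standard k-means on a single numeric column.

    `initial_centroids` is supplied by the caller, so any initial centroid
    selection method can be plugged in (hypothetical interface).
    """
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: the assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Illustrative ages only, not records from the dataset.
ages = [29, 35, 41, 58, 62, 67]
labels, centroids = kmeans_1d(ages, initial_centroids=[29, 67])
```

Calling the same function with a different `initial_centroids` list is enough to compare selection methods on identical data.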

3 METHODOLOGY

This section discusses the single and hybrid data mining techniques used in the diagnosis of heart disease patients, which are shown in Figure 1. For the single data mining techniques, decision tree, naïve bayes, and k-nearest neighbour are discussed. For the hybridization technique, k-means clustering with five different initial centroid selection methods is discussed.

FIGURE 1: PROPOSED APPROACH

3.1 Single Data Mining Techniques

3.1.1 Decision Tree

The decision tree technique cannot deal with continuous attributes, so they need to be converted into discrete attributes, a process called discretization. Dougherty et al. carried out a comparative study between two unsupervised and two supervised discretization methods using 16 data sets showing that differences between the




single and hybrid data mining techniques are calculated.

5 HEART DISEASE DATA

The data used in this study is the Cleveland Clinic Foundation Heart disease data set available at http://archive.ics.uci.edu/ml/datasets/Heart+Disease. The data set has 76 raw attributes. However, all of the published experiments only refer to 13 of them. The data set contains 303 rows, of which 297 are complete. Six rows contain missing values and they are removed from the experiment. The attributes used in this study are shown in Table 1. In the Type column, C represents continuous and D represents discrete.

TABLE 1: SELECTED CLEVELAND HEART DISEASE DATASET ATTRIBUTES

Name          Type   Description
Age           C      Age in years
Sex           D      1 = male; 0 = female
Cp            D      Chest pain type: 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic
Trestbps      C      Resting blood pressure (in mm Hg)
Chol          C      Serum cholesterol in mg/dl
Fbs           D      Fasting blood sugar > 120 mg/dl: 1 = true; 0 = false
Restecg       D      Resting electrocardiographic results: 0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy
Thalach       C      Maximum heart rate achieved
Exang         D      Exercise induced angina: 1 = yes; 0 = no
Old peak ST   C      Depression induced by exercise relative to rest
Slope         D      The slope of the peak exercise segment: 1 = up sloping; 2 = flat; 3 = down sloping
Ca            D      Number of major vessels colored by fluoroscopy, ranging between 0 and 3
Thal          D      3 = normal; 6 = fixed defect; 7 = reversible defect
Diagnosis     D      Diagnosis classes: 0 = healthy; 1 = patient who is subject to possible heart disease
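The preparation step described above (dropping rows with missing values and using a two-class diagnosis) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: it assumes a comma-separated file with the 13 attributes followed by the diagnosis field, missing values marked with `?`, and a diagnosis field in which any value above 0 is mapped to class 1. The column names, the function `load_complete_rows`, and the three sample rows are hypothetical.

```python
import csv
import io

# Assumed column order: the 13 attributes of Table 1 plus the diagnosis field.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]


def load_complete_rows(text, missing="?"):
    """Parse CSV text, drop incomplete rows, and binarize the diagnosis."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        fields = [f.strip() for f in record]
        # Rows with a wrong field count or a missing-value marker are removed.
        if len(fields) != len(COLUMNS) or missing in fields:
            continue
        row = dict(zip(COLUMNS, (float(f) for f in fields)))
        # Assumption: 0 = healthy, anything above 0 = possible heart disease.
        row["diagnosis"] = 1 if row.pop("num") > 0 else 0
        rows.append(row)
    return rows


# Illustrative rows only; the third has a missing value and is dropped.
SAMPLE = """63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
53,0,3,128,216,0,2,115,0,0.0,1,0,?,0
"""
rows = load_complete_rows(SAMPLE)
```

Applied to the full data set, a filter of this kind would be expected to leave the 297 complete rows reported above, provided the file follows the assumed layout.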

6 RESULTS

This section discusses the results achieved when integrating clustering with decision tree, naïve bayes and k-nearest neighbour techniques in the diagnosis of heart disease patients.

6.1 Single Data Mining Techniques

The results of sensitivity, specificity, and accuracy in the diagnosis of heart disease using gain ratio decision tree, naïve bayes, and k-nearest neighbour are shown in Table 2. Naïve bayes achieves the best accuracy, followed by k-nearest neighbour and decision tree, as shown in Figure 2. Naïve bayes as a single technique shows a mean accuracy of 83.5% (standard deviation of 5.2%), as shown in Table 2.

TABLE 2: DIFFERENT SINGLE DATA MINING TECHNIQUES RESULTS

Technique                   Sensitivity       Specificity       Accuracy
                            Mean    StDev     Mean    StDev     Mean    StDev
Gain Ratio Decision Tree    75.6    6.1       81.6    12.1      79.1    5.8
Naïve Bayes                 78      13.8      80.8    12.6      83.5    5.2
KNN K=19                    76.7    10.7      85.1    7.5       83.2    4.1
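For reference, the three measures reported in Table 2 can be computed from predicted and actual classes as in the sketch below. It uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / N) and treats class 1 as the positive class; the function name and the sample labels are hypothetical and do not reproduce any figure in the table.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Return sensitivity, specificity, and accuracy as fractions in [0, 1]."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    def ratio(numerator, denominator):
        # Undefined when a class is absent; report NaN instead of raising.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # sick patients correctly flagged
        "specificity": ratio(tn, tn + fp),   # healthy patients correctly cleared
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Illustrative labels: 4 sick (1) and 6 healthy (0) patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = diagnostic_metrics(y_true, y_pred)
```

The values in the tables are means and standard deviations of such measures, so the function would be applied once per test partition and the results then aggregated.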

FIGURE 2: DIFFERENT SINGLE DATA MINING TECHNIQUES

6.2 Clustering With Data Mining Techniques

Different clustering columns were tried, involving age, blood pressure and cholesterol. Using age as the clustering column showed the best results. The results of sensitivity, specificity, and accuracy in the diagnosis of heart disease using k-means clustering and different data mining techniques with different initial centroid selection methods and different numbers of clusters are presented next.
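One plausible way to combine a clustering column with a classifier is sketched below: each patient's cluster on the chosen column is appended as an extra feature, and a k-nearest neighbour vote is taken over the extended features. This is only an assumption about how such an integration could work; the text available here does not specify the exact mechanism, and the helper names, centroids, and sample values are all hypothetical.

```python
import math
from collections import Counter


def nearest_centroid(value, centroids):
    """Index of the centroid closest to a single numeric value."""
    return min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))


def with_cluster_feature(features, column, centroids):
    """Append the cluster id of `features[i][column]` to every row."""
    return [row + [float(nearest_centroid(row[column], centroids))] for row in features]


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training rows nearest to `query`."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Illustrative [age, cholesterol] rows; centroids would come from k-means on age.
# Features are left unscaled here for brevity, which real use would revisit.
age_centroids = [35.0, 62.0]
extended = with_cluster_feature([[29.0, 200.0], [60.0, 240.0]], 0, age_centroids)

# Tiny two-class example for the k-nearest neighbour vote.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
train_y = [0, 0, 1, 1, 1]
near_sick = knn_predict(train_x, train_y, [5.5, 5.5], k=3)
near_healthy = knn_predict(train_x, train_y, [0.0, 0.5], k=3)
```

Swapping `knn_predict` for a decision tree or naïve bayes learner would leave the cluster-feature step unchanged, which is the point of the sketch.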

[Figure 2 chart: accuracy (0 to 100) for Gain Ratio, Naïve Bayes, and KNN K=19; title: Different Single Data Mining Techniques]




6.2.1 Integrating Clustering With Gain Ratio Decision Tree

Table 3 presents the results of integrating gain ratio decision tree with k-means clustering with different initial centroid selection methods and with different numbers of clusters. Integrating gain ratio decision tree with k-means clustering could enhance gain ratio accuracy in the diagnosis of heart disease patients. The best result for the gain ratio decision tree is achieved by the two-cluster inlier initial centroid selection method, with a mean accuracy of 81.2% (standard deviation of 6.2%), as shown in Table 3. This is a 2.1% increase in mean accuracy over the single gain ratio decision tree, which shows 79.1%, as shown in Table 3.

TABLE 3: INTEGRATING CLUSTERING WITH GAIN RATIO DECISION TREE

Technique            Clusters   Sensitivity       Specificity       Accuracy
                     No.        Mean    StDev     Mean    StDev     Mean    StDev
Gain Ratio           -          75.6    6.1       81.6    12.1      79.1    5.8
Inlier               2          75.9    7.2       85.1    11.4      81.2    6.2
                     3          75.8    8.5       83.4    14.6      80.8    8.6
                     4          71.5    11.8      78.1    12.3      77.5    8.6
                     5          71.5    11.8      78.1    12.3      77.5    8.6
Outlier              2          76      9.6       80.3    13.4      78.7    6.7
                     3          75.8    8.5       83.4    14.6      80.8    8.6
                     4          73.2    11.2      84.1    11.8      79.7    8.2
                     5          73.2    11.2      84.1    11.8      79.7    8.2
Range                2          76      9.6       80.3    13.4      78.7    6.7
                     3          76.2    10.8      76      13.2      78.9    8.5
                     4          71.9    12        78.1    12.3      77.8    8.6
                     5          71.9    12        78.1    12.3      77.8    8.6
Random Row           2          76.1    8.9       82.8    9.6       80.1    4.7
                     3          69.8    16.5      77      13.1      76.3    9.3
                     4          73.2    9         79.1    10.5      78.4    8.3
                     5          69.3    13.9      79.5    9.6       77.1    7.8
Random Attribute     2          74.3    8         84      12.2      80.1    7
                     3          70.5    10.8      81.7    12.4      79.4    8.8
                     4          68.1    15.9      80.8    11.3      77      9.1
                     5          69.3    13.9      79.5    9.6       77.1    7.8

Figure 3 shows the mean accuracy of integrating gain ratio decision tree with k-means clustering with different initial centroid selection methods and with different numbers of clusters. There is no specific trend relating the number of clusters used with the gain ratio decision tree to an increase or decrease in accuracy, as shown in Figure 3.

FIGURE 3: INTEGRATING CLUSTERING WITH GAIN RATIO DECISION TREE

6.2.2 Clustering and Naïve Bayes

Table 4 presents the results of integrating naïve bayes with k-means clustering with different initial centroid selection methods and with different numbers of clusters. Integrating naïve bayes with k-means clustering could enhance naïve bayes accuracy in the diagnosis of heart disease patients. The best result for naïve bayes is achieved by the three-cluster random row initial centroid selection method, with a mean accuracy of 84.8% and a standard deviation of 4.8%, as shown in Table 4. This is a 1.3% increase in mean accuracy over the single naïve bayes, which shows 83.5%, as shown in Table 4.

TABLE 4: INTEGRATING CLUSTERING WITH NAÏVE BAYES

Technique            Clusters   Sensitivity       Specificity       Accuracy
                     No.        Mean    StDev     Mean    StDev     Mean    StDev
Naïve Bayes          -          78      13.8      80.8    12.6      83.5    5.2
Inlier               2          73.2    22        83.3    12.2      83.1    4.7
                     3          75.9    14.7      83.7    12.6      83.6    4.4
                     4          76.9    13.5      81.7    13.1      83.5    3.1
                     5          76.9    13.5      81.7    13.1      83.5    3.1
Outlier              2          76.4    15.2      82.3    13.1      83.4    4.5
                     3          75.9    14.7      83.7    12.6      83.6    4.4
                     4          70.1    17.4      83.1    12.9      80.7    6.8
                     5          70.1    17.4      83.1    12.9      80.7    6.8
Range                2          76.4    15.2      82.3    13.1      83.4    4.5
                     3          78      13.1      81.3    14.6      84.1    4.3
                     4          76.9    13.5      81.7    13.1      83.5    3.1
                     5          76.9    13.5      81.7    13.1      83.5    3.1
Random Row           2          76.5    14.6      82      12.7      83.1    3.4
                     3          76.6    14.4      85      13.2      84.8    4.7
                     4          71.4    19.6      83      13.1      81.6    6.3
                     5          71.1    14.8      81.1    13.1      80.4    4.7
Random Attribute     2          76.4    15.2      83.8    12.5      84      4.4
                     3          74.3    17.6      82.2    14.1      82.3    6
                     4          70.9    20.5      83.5    12.3      82      5.7
                     5          71.1    14.8      81.1    13.1      80.4    4.7

Figure 4 shows the mean accuracy of integrating naïve bayes with k-means clustering with different initial centroid selection methods and with different numbers of clusters. There is not a specific trend

[Figure chart: mean accuracy (0 to 100) for the single technique and for 2 to 5 clusters under the Inlier, Outlier, Range, Random Row, and Random Attribute initial centroid selection methods]




another. The largest increase in accuracy for a hybrid data mining technique is achieved by k-nearest neighbour, with a 2.5% enhancement of the mean accuracy, from 83.2% to 85.7%, as shown in Figure 6.

FIGURE 6: SINGLE AND HYBRID DATA MINING TECHNIQUES

Table 6 shows the increase in accuracy of different data mining techniques when integrated with k-means clustering. The best increase in accuracy is achieved by k-nearest neighbour, followed by gain ratio decision tree and naïve bayes, showing increases of 2.5%, 2.1%, and 1.3% respectively. Table 6 also shows the t-test significance when comparing each single data mining technique with its integration with k-means clustering. The t-test is calculated at the 95% confidence level. The t-test calculation shows that there is a significant increase in accuracy between each single data mining technique and its integration with k-means clustering, as shown in Table 6. It also shows that there is a significant difference when comparing k-nearest neighbour integrated with k-means clustering against naïve bayes and decision tree integrated with k-means clustering.
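A comparison of this kind can be illustrated with a paired t statistic over matched accuracy measurements, as in the sketch below. Whether the study used a paired or an unpaired test, and over which partitions, is not stated in the text available here, so the pairing is an assumption; the function name and the accuracy values are hypothetical and do not reproduce Table 6.

```python
import math
import statistics


def paired_t_statistic(single_scores, hybrid_scores):
    """Return (t, degrees of freedom) for hybrid minus single accuracies."""
    if len(single_scores) != len(hybrid_scores) or len(single_scores) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [h - s for s, h in zip(single_scores, hybrid_scores)]
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Illustrative per-partition accuracies (percent), not the study's data.
single = [80.0, 82.0, 78.0, 85.0, 81.0]
hybrid = [82.0, 85.0, 79.0, 88.0, 83.0]
t_value, dof = paired_t_statistic(single, hybrid)

# Two-sided critical value for 4 degrees of freedom at the 5% level.
CRITICAL_T_DF4 = 2.776
is_significant = abs(t_value) > CRITICAL_T_DF4
```

The critical value depends on the degrees of freedom, so a different number of pairs requires a different threshold from a t table.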

Although decision tree showed the lowest accuracy in the diagnosis of heart disease patients, it can still be very useful because it provides rules that explain how patients are diagnosed as healthy or sick. The resulting rule sets offer the rules for explaining the diagnosis. Naïve bayes and k-nearest neighbour are based on probability and on distance measurement between instances, respectively. They do not provide any diagnostic explanation of why each testing instance is diagnosed as healthy or sick.

TABLE 6: SINGLE AND HYBRID DATA MINING TECHNIQUES

Techniques               Accuracy          Increase in   T-test
                         Mean    StDev     Accuracy      Significance
Decision Tree, Single    79.1    5.8       2.1           Yes (t=4.28, p




disease using an artificial immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest neighbour) based weighting preprocessing. Expert Systems with Applications, 2007. 32: p. 625-631.

27. Xiong, W. and C. Wang, Feature Selection: A Hybrid Approach Based on Self-adaptive Ant Colony and Support Vector Machine. International Conference on Computer Science and Software Engineering, IEEE, 2008.
28. Centers for Disease Control and Prevention (CDC), C.D.P.a.H.P. August 2012; Available from: http://www.cdc.gov/nccdphp/.
29. American Cancer Society, B.C.F.F. August 2012; Available from: http://www.cancer.org/downloads/STT/F861009_final 9-08-09.pdf.
30. Paladugu, S. and C. Shyu, Temporal mining framework for risk reduction and early detection of chronic diseases. 2010.
31. Heller, R.F., et al., How well can we predict coronary heart disease? Findings in the United Kingdom Heart Disease Prevention Project. British Medical Journal, 1984.
32. Wilson, P.W.F., et al., Prediction of Coronary Heart Disease Using Risk Factor Categories. American Heart Association Journal, 1998.

33. Simons, L.A., et al., Risk functions for prediction of cardiovascular disease in elderly Australians: the Dubbo Study. Medical Journal of Australia, 2003. 178.
34. Salahuddin and F. Rabbi, Statistical Analysis of Risk Factors for Cardiovascular disease in Malakand Division. Pak. j. stat. oper. res., 2006. Vol. II: p. 49-56.
35. Shahwan-Akl, L., Cardiovascular Disease Risk Factors among Adult Australian-Lebanese in Melbourne. International Journal of Research in Nursing, 2010. 6 (1).
36. Tu, M.C., D. Shin, and D. Shin, Effective Diagnosis of Heart Disease through Bagging Approach. Biomedical Engineering and Informatics, IEEE, 2009.
37. Shouman, M., T. Turner, and R. Stocker, Using data mining techniques in heart disease diagnosis and treatment. Japan-Egypt Conference on Electronics, Communications and Computers (JEC-ECC), 2012: p. 173-177.
38. Blake, C.L. and C.J. Merz, UCI repository of machine learning databases. University of California, Irvine, Department of Information and Computer Sciences, 1998.

39. Parthiban, L. and R. Subramanian, Intelligent Heart Disease Prediction System using CANFIS and Genetic Algorithm. International Journal of Biological and Life Sciences, 2007.
40. Patil, S.B. and Y.S. Kumaraswamy, Extraction of Significant Patterns from Heart Disease Warehouses for Heart Attack Prediction. International Journal of Computer Science and Network Security (IJCSNS), 2009. Vol. 9, No. 2.
41. Polat, K., et al., A new classification method to diagnosis heart disease: Supervised artificial immune system (AIRS). In Proceedings of the Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN), 2005.
42. Kavitha, K.S., K.V. Ramakrishnan, and M.K. Singh, Modeling and design of evolutionary neural network for heart disease detection. International Journal of Computer Science Issues (IJCSI), 2010. Vol. 7, Issue 5.
43. Wu, X., et al., Top 10 algorithms in data mining analysis. Knowl. Inf. Syst., 2007.
44. Tajunisha, N. and V. Saravanan, A new approach to improve the clustering accuracy using informative genes for unsupervised microarray data sets. International Journal of Advanced Science and Technology, 2011.

45. Bramer, M., Principles of data mining. 2007: Springer.
46. Esposito, F., D. Malerba, and G. Semeraro, A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997. Vol. 19, No. 5.
47. Alpaydin, E., Voting over Multiple Condensed Nearest Neighbors. Artificial Intelligence Review, 1997. 11: p. 115-132.
48. Pavan, K.K., et al., Robust seed selection algorithm for k-means type algorithms. International Journal of Computer Science & Information Technology (IJCSIT), 2011. Vol. 3, No. 5.
49. Poomagal, S. and T. Hamsapriya, Optimized k-means clustering with intelligent initial centroid selection for web search using URL and tag contents, in Proceedings of the International Conference on Web Intelligence, Mining and Semantics. 2011, ACM: Sogndal, Norway. p. 1-8.
50. Santhi, M.V.B.T., et al., Enhancing K-Means Clustering Algorithm. International Journal of Computer Science & Technology, 2011.
51. Immaculate Mary, C. and D.S.V. Kasmir Raja, Refinement of clusters from k-means with ant colony optimization. Journal of Theoretical and Applied Information Technology, 2009.
52. Khan, D.M. and N. Mohamudally, A Multiagent System (MAS) for the Generation of Initial Centroids for k-means Clustering Data Mining Algorithm based on Actual Sample Datapoints. Journal of Next Generation Information Technology, August, 2010. Volume 1, Number 2.
