JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 20, ISSUE 1, AUGUST 2013
Integrating Clustering with Different Data Mining Techniques in the Diagnosis of Heart Disease
Mai Shouman, Tim Turner and Rob Stocker
Abstract: Heart disease has been the leading cause of death in the world over the past 10 years. The research presented here is part of work to develop tools to assist healthcare practitioners to diagnose heart disease earlier, in the hope of earlier interventions in this preventable killer. The relative accuracy of common data mining techniques in heart disease diagnosis is difficult to assess from the literature. This research investigates Decision Tree, Naïve Bayes, and K-nearest Neighbour performance in the diagnosis of heart disease patients. It then further assesses performance enhancement through integrating clustering techniques. The testing was conducted over a standardized dataset used widely in the literature. The results show that integrating clustering with decision tree, naïve bayes, and k-nearest neighbour could enhance their accuracies in diagnosing heart disease patients. Importantly, the results establish that the ensemble of two-cluster inlier k-means clustering with the k-nearest neighbour technique was the most effective at heart disease diagnosis.
Index Terms: Data Mining, Decision Tree, Naïve Bayes, K-nearest Neighbour, K-Means Clustering, Heart Disease Diagnosis.
1 INTRODUCTION
Heart disease has been the leading cause of death in the world over the past 10 years. Moreover, the World Health Organization has reported that heart disease is the leading cause of death in both high and low income countries [1]. The European Public Health Alliance reports that heart attacks and other circulatory diseases account for 41% of all deaths [2]. The Economic and Social Commission for Asia and the Pacific found that in one fifth of Asian countries, most lives are lost to non-communicable diseases such as cardiovascular diseases, cancers, and diabetes [3]. Statistics South Africa reports that heart and circulatory system diseases are the third leading cause of death in Africa [4]. The Australian Bureau of Statistics reported that heart and circulatory system diseases are the leading cause of death in Australia, causing 33.7% of all deaths [5]. This high level of mortality is a tragedy, particularly because heart diseases are typically eminently treatable if detected early [6].

Motivated by the increasing mortality of heart disease patients world-wide each year and the availability of a huge amount of patient data that could be used to extract useful knowledge, researchers have been using data mining techniques to help health care professionals in the diagnosis of heart disease patients [7-8]. Data mining is an essential step in knowledge discovery. It involves the exploration of large datasets to extract hidden and previously unknown patterns, relationships and knowledge that are difficult to detect with traditional statistical methods [9-13]. The application of data mining is rapidly spreading in a wide range of sectors such as analysis of organic compounds, financial forecasting, weather forecasting, and healthcare [14].

Data mining in healthcare is an emerging field of high importance for providing prognosis and a deeper understanding of medical data. Researchers are using data mining techniques in the medical diagnosis of several diseases such as diabetes [15], stroke [16], cancer [17], and heart disease [18]. Several data mining techniques are used in the diagnosis of heart disease, such as naïve bayes, decision tree, k-nearest neighbour, neural network, kernel density, bagging algorithm, and support vector machine, showing different levels of accuracy [19-24]. Recently, researchers have suggested that integrating more than one data mining technique can enhance performance in the diagnosis of heart disease patients [18, 25-27].
Although heart diseases are among the most common chronic diseases causing a high incidence of death all over the world, they have also been identified among the most preventable ones [28]. Early detection and healthy behaviours play an important role in preventing and controlling the effects of these diseases [6, 28-29]. There is a need for accurate systematic tools that identify patients at high risk and provide information for early intervention of these diseases [30].
This research investigates applying different single and hybrid data mining techniques in the diagnosis of heart disease patients. It investigates whether integrating clustering with different data mining techniques can provide better performance than single data mining techniques in the diagnosis of heart disease patients.
The rest of the paper is organized as follows: the background section briefly reviews applying data mining techniques in the diagnosis of heart disease; the methodology section explains the different single and hybrid data mining techniques used in the diagnosis of heart disease patients; the validation and evaluation section presents the procedure used to measure the stability of the proposed model; the heart disease data section introduces the de facto standard dataset used; and the results section presents the results of a systematic investigation of single and hybrid data mining techniques used in the diagnosis of heart disease patients, which is followed by the discussion of the presented results. Finally, we draw conclusions from this preliminary investigation to emphasize the importance of this systematic study in establishing a baseline for research and practice, and point to future work.
2 BACKGROUND
Researchers have been investigating the use of statistical analysis and data mining techniques to help healthcare professionals in the diagnosis of heart disease for many years. Statistical analysis has identified the risk factors associated with heart disease to be age, blood pressure, smoking [31], cholesterol [32], diabetes [33], family history of heart disease [34], obesity, and lack of physical activity [35]. Knowledge of the risk factors associated with heart disease helps health care professionals to identify patients at high risk of having heart disease.
Although statistical methods help in identifying the risk factors associated with heart disease, they are not sufficient on their own to help in the diagnosis of patients. Data mining can play an important role in the diagnosis of heart disease patients. Researchers have been applying different data mining techniques over different heart disease datasets to help health care professionals in the diagnosis of heart disease [18-19, 22-24, 36]. Unfortunately, the results of the different data mining studies generally cannot be compared because they use different datasets [37]. However, over time, a benchmark data set has arisen in the literature: the Cleveland Heart Disease Dataset (CHDD) [38].
Researchers have been using the CHDD to apply different data mining techniques such as decision tree, naïve bayes, bagging algorithm, support vector machine, and neural network, showing different levels of accuracy [19, 23, 36, 39-40]. Recently, researchers have been investigating whether integrating more than one data mining technique can enhance data mining performance in the diagnosis of heart disease patients on the CHDD. Polat et al. [41] used an artificial immune recognition system (AIRS), and other research used fuzzy AIRS and k-nearest neighbour [26]. Kavitha et al. presented a new system for the detection of heart disease that uses a neural network and a genetic algorithm for forward feedback. The results showed that the proposed hybridization is more stable [42]. This research investigates integrating clustering with different data mining techniques to identify whether this integration can enhance their performance in the diagnosis of heart disease patients.

K-means clustering is one of the most popular and well known clustering techniques. Its simplicity and reliable behaviour make it popular in many applications [43]. Initial centroid selection is a critical issue in k-means clustering and strongly affects its results [44].

This research investigates applying three common data mining techniques, namely decision tree, naïve bayes, and k-nearest neighbour, as single data mining techniques. It then investigates integrating k-means clustering with different initial centroid selection methods as well as different numbers of clusters in the diagnosis of heart disease patients. Importantly, our research thoroughly investigates each hybridization technique, testing which technique and which initial centroid selection method can provide better performance in diagnosing heart disease patients, and whether applying different numbers of clusters leads to different performance.
3 METHODOLOGY
This section discusses the single and hybrid data mining techniques used in the diagnosis of heart disease patients, which are shown in Figure 1. For the single data mining techniques, decision tree, naïve bayes, and k-nearest neighbour are discussed. For the hybrid technique, k-means clustering with five different initial centroid selection methods is discussed.
FIGURE 1: PROPOSED APPROACH
3.1 Single Data Mining Techniques
3.1.1 Decision Tree
The decision tree technique cannot deal with continuous attributes, so they need to be converted into discrete attributes, a process called discretization. Dougherty et al. carried out a comparative study between two unsupervised and two supervised discretization methods using 16 data sets, showing that differences between the ...
... single and hybrid data mining techniques are calculated.
5 HEART DISEASE DATA
The data used in this study is the Cleveland Clinic Foundation heart disease data set, available at http://archive.ics.uci.edu/ml/datasets/Heart+Disease. The data set has 76 raw attributes. However, all of the published experiments refer to only 13 of them. The data set contains 303 rows, of which 297 are complete. Six rows contain missing values and are removed from the experiment. The attributes used in this study are shown in Table 1. In the Type column, C represents continuous and D represents discrete.
TABLE 1: SELECTED CLEVELAND HEART DISEASE DATASET ATTRIBUTES

| Name | Type | Description |
|---|---|---|
| Age | C | Age in years |
| Sex | D | 1 = male; 0 = female |
| Cp | D | Chest pain type: 1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic |
| Trestbps | C | Resting blood pressure (in mm Hg) |
| Chol | C | Serum cholesterol in mg/dl |
| Fbs | D | Fasting blood sugar > 120 mg/dl: 1 = true; 0 = false |
| Restecg | D | Resting electrocardiographic results: 0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy |
| Thalach | C | Maximum heart rate achieved |
| Exang | D | Exercise induced angina: 1 = yes; 0 = no |
| Old peak | C | ST depression induced by exercise relative to rest |
| Slope | D | The slope of the peak exercise segment: 1 = up sloping; 2 = flat; 3 = down sloping |
| Ca | D | Number of major vessels colored by fluoroscopy, ranging between 0 and 3 |
| Thal | D | 3 = normal; 6 = fixed defect; 7 = reversible defect |
| Diagnosis | D | Diagnosis classes: 0 = healthy; 1 = patient who is subject to possible heart disease |
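The row filtering described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the column order, the comma-separated layout, and the use of "?" as the missing-value marker are assumptions, and the three sample rows are illustrative values rather than a copy of the data set.

```python
import csv
import io

# Column names follow Table 1; the file layout (comma-separated, "?" for a
# missing value, diagnosis in the last column) is an assumption.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "diagnosis"]


def load_complete_rows(text, missing_marker="?"):
    """Parse CSV text and keep only rows that have no missing values."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(COLUMNS):
            continue  # skip blank or malformed lines
        if any(value.strip() == missing_marker for value in record):
            continue  # drop incomplete rows, as the study does
        rows.append({name: float(value) for name, value in zip(COLUMNS, record)})
    return rows


# Illustrative rows in the assumed layout; the second one is missing "ca".
sample = """63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,?,3,1
41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
"""

rows = load_complete_rows(sample)
print(len(rows), "complete rows kept")
```

Applied to the full file under the same assumptions, this kind of filter is what would reduce the 303 rows to the 297 complete rows used in the experiments.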
6 RESULTS
This section discusses the results achieved when integrating clustering with decision tree, naïve bayes and k-nearest neighbour techniques in the diagnosis of heart disease patients.
6.1 Single Data Mining Techniques
The sensitivity, specificity, and accuracy in the diagnosis of heart disease using gain ratio decision tree, naïve bayes, and k-nearest neighbour are shown in Table 2. Naïve bayes achieves the best accuracy, followed by k-nearest neighbour and decision tree, as shown in Figure 2. Naïve bayes as a single technique shows a mean accuracy of 83.5% (standard deviation of 5.2%), as shown in Table 2.
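TABLE 2: DIFFERENT SINGLE DATA MINING TECHNIQUES RESULTS

| Technique Name | Sensitivity Mean (%) | Sensitivity StDev (%) | Specificity Mean (%) | Specificity StDev (%) | Accuracy Mean (%) | Accuracy StDev (%) |
|---|---|---|---|---|---|---|
| Gain Ratio Decision Tree | 75.6 | 6.1 | 81.6 | 12.1 | 79.1 | 5.8 |
| Naïve Bayes | 78 | 13.8 | 80.8 | 12.6 | 83.5 | 5.2 |
| KNN K=19 | 76.7 | 10.7 | 85.1 | 7.5 | 83.2 | 4.1 |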
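For readers who want to reproduce figures of this kind, the three measures and their mean and standard deviation can be computed as in the sketch below. It assumes that each measure is computed once per test fold and then summarised across folds, and that the sample standard deviation is used; neither detail is confirmed by this excerpt. The fold data are hypothetical.

```python
from statistics import mean, stdev


def diagnosis_metrics(actual, predicted):
    """Return (sensitivity, specificity, accuracy) in percent.

    Labels follow Table 1: 1 = sick, 0 = healthy. Each fold is assumed to
    contain at least one patient of each class.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # sick, flagged sick
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # healthy, flagged healthy
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # healthy, flagged sick
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # sick, flagged healthy
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Two hypothetical folds of (actual, predicted) diagnoses.
folds = [
    ([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]),
    ([1, 0, 0, 1, 0], [1, 0, 1, 1, 0]),
]
per_fold = [diagnosis_metrics(actual, predicted) for actual, predicted in folds]
accuracies = [metrics[2] for metrics in per_fold]
print("accuracy mean:", mean(accuracies), "stdev:", stdev(accuracies))
```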
FIGURE 2: DIFFERENT SINGLE DATA MINING TECHNIQUES
6.2 Clustering With Data Mining Techniques
Different clustering columns were tried, namely age, blood pressure and cholesterol. Using age as the clustering column gave the best results. The sensitivity, specificity, and accuracy in the diagnosis of heart disease using k-means clustering and different data mining techniques, with different initial centroid selection methods and different numbers of clusters, are presented next.
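Clustering on a single column such as age can be sketched as follows. This is a plain Lloyd-style k-means on one numeric column, offered as an illustration only: this excerpt does not define the five initial centroid selection methods, so `random_row_centroids` is one assumed reading of the "random row" method, and the ages are hypothetical. How the cluster assignments are then combined with each classifier is not shown here.

```python
import random


def kmeans_1d(values, initial_centroids, max_iterations=100):
    """K-means on one numeric column; returns (centroids, labels)."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [v for v, label in zip(values, labels) if label == j]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, labels


def random_row_centroids(values, k, seed=0):
    """Pick k rows at random as starting centroids (assumed 'random row' reading)."""
    return random.Random(seed).sample(values, k)


# Hypothetical ages used as the clustering column.
ages = [29, 34, 35, 41, 44, 58, 60, 62, 65, 67, 70]

centroids, labels = kmeans_1d(ages, initial_centroids=[29, 70])
print(centroids, labels)

start = random_row_centroids(ages, k=2, seed=1)
print(kmeans_1d(ages, start))
```

Because the final clusters depend on the starting centroids, running the same data with different starting points is a direct way to observe the sensitivity to initialization noted in the background section [44].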
6.2.1 Integrating Clustering With Gain Ratio Decision Tree
Table 3 presents the results of integrating gain ratio decision tree with k-means clustering, using different initial centroid selection methods and different numbers of clusters. Integrating gain ratio decision tree with k-means clustering could enhance gain ratio accuracy in the diagnosis of heart disease patients. The best result for the gain ratio decision tree is achieved by the two-cluster inlier initial centroid selection method, with a mean accuracy of 81.2% (standard deviation of 6.2%), as shown in Table 3. This is an increase of 2.1 percentage points in mean accuracy over the single gain ratio decision tree, which achieves 79.1%, as shown in Table 3.
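TABLE 3: INTEGRATING CLUSTERING WITH GAIN RATIO DECISION TREE

| Technique | Clusters No | Sensitivity Mean (%) | Sensitivity StDev (%) | Specificity Mean (%) | Specificity StDev (%) | Accuracy Mean (%) | Accuracy StDev (%) |
|---|---|---|---|---|---|---|---|
| Gain Ratio | - | 75.6 | 6.1 | 81.6 | 12.1 | 79.1 | 5.8 |
| Inlier | 2 | 75.9 | 7.2 | 85.1 | 11.4 | 81.2 | 6.2 |
| Inlier | 3 | 75.8 | 8.5 | 83.4 | 14.6 | 80.8 | 8.6 |
| Inlier | 4 | 71.5 | 11.8 | 78.1 | 12.3 | 77.5 | 8.6 |
| Inlier | 5 | 71.5 | 11.8 | 78.1 | 12.3 | 77.5 | 8.6 |
| Outlier | 2 | 76 | 9.6 | 80.3 | 13.4 | 78.7 | 6.7 |
| Outlier | 3 | 75.8 | 8.5 | 83.4 | 14.6 | 80.8 | 8.6 |
| Outlier | 4 | 73.2 | 11.2 | 84.1 | 11.8 | 79.7 | 8.2 |
| Outlier | 5 | 73.2 | 11.2 | 84.1 | 11.8 | 79.7 | 8.2 |
| Range | 2 | 76 | 9.6 | 80.3 | 13.4 | 78.7 | 6.7 |
| Range | 3 | 76.2 | 10.8 | 76 | 13.2 | 78.9 | 8.5 |
| Range | 4 | 71.9 | 12 | 78.1 | 12.3 | 77.8 | 8.6 |
| Range | 5 | 71.9 | 12 | 78.1 | 12.3 | 77.8 | 8.6 |
| Random Row | 2 | 76.1 | 8.9 | 82.8 | 9.6 | 80.1 | 4.7 |
| Random Row | 3 | 69.8 | 16.5 | 77 | 13.1 | 76.3 | 9.3 |
| Random Row | 4 | 73.2 | 9 | 79.1 | 10.5 | 78.4 | 8.3 |
| Random Row | 5 | 69.3 | 13.9 | 79.5 | 9.6 | 77.1 | 7.8 |
| Random Attribute | 2 | 74.3 | 8 | 84 | 12.2 | 80.1 | 7 |
| Random Attribute | 3 | 70.5 | 10.8 | 81.7 | 12.4 | 79.4 | 8.8 |
| Random Attribute | 4 | 68.1 | 15.9 | 80.8 | 11.3 | 77 | 9.1 |
| Random Attribute | 5 | 69.3 | 13.9 | 79.5 | 9.6 | 77.1 | 7.8 |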
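The selection of the best configuration from Table 3 can be checked with a few lines of code. The sketch below simply re-enters the mean accuracy column of Table 3 and compares each configuration with the single gain ratio baseline; the variable names are illustrative.

```python
# Mean accuracy (%) from Table 3, keyed by (initial centroid method, clusters).
SINGLE_GAIN_RATIO = 79.1
TABLE_3_ACCURACY = {
    ("Inlier", 2): 81.2, ("Inlier", 3): 80.8,
    ("Inlier", 4): 77.5, ("Inlier", 5): 77.5,
    ("Outlier", 2): 78.7, ("Outlier", 3): 80.8,
    ("Outlier", 4): 79.7, ("Outlier", 5): 79.7,
    ("Range", 2): 78.7, ("Range", 3): 78.9,
    ("Range", 4): 77.8, ("Range", 5): 77.8,
    ("Random Row", 2): 80.1, ("Random Row", 3): 76.3,
    ("Random Row", 4): 78.4, ("Random Row", 5): 77.1,
    ("Random Attribute", 2): 80.1, ("Random Attribute", 3): 79.4,
    ("Random Attribute", 4): 77.0, ("Random Attribute", 5): 77.1,
}

# Configuration with the highest mean accuracy and its gain over the baseline.
best_config = max(TABLE_3_ACCURACY, key=TABLE_3_ACCURACY.get)
gain = round(TABLE_3_ACCURACY[best_config] - SINGLE_GAIN_RATIO, 1)

# Configurations whose mean accuracy exceeds the single-technique baseline.
improved = sorted(config for config, accuracy in TABLE_3_ACCURACY.items()
                  if accuracy > SINGLE_GAIN_RATIO)

print(best_config, gain, len(improved))
```

On these figures, the two-cluster inlier configuration is the maximum, and 8 of the 20 configurations exceed the 79.1% baseline, so clustering does not help the decision tree in every configuration.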
Figure 3 shows the mean accuracy of integrating gain ratio decision tree with k-means clustering for different initial centroid selection methods and different numbers of clusters. There is no specific trend linking the number of clusters to an increase or decrease in accuracy, as shown in Figure 3.
FIGURE 3: INTEGRATING CLUSTERING WITH GAIN RATIO DECISION TREE
6.2.2 Clustering and Naïve Bayes
Table 4 presents the results of integrating naïve bayes with k-means clustering, using different initial centroid selection methods and different numbers of clusters. Integrating naïve bayes with k-means clustering could enhance naïve bayes accuracy in the diagnosis of heart disease patients. The best result for naïve bayes is achieved by the three-cluster random row initial centroid selection method, with a mean accuracy of 84.8% and a standard deviation of 4.8%, as shown in Table 4. This is an increase of 1.3 percentage points in mean accuracy over the single naïve bayes, which achieves 83.5%, as shown in Table 4.

TABLE 4: INTEGRATING CLUSTERING WITH NAÏVE BAYES

| Technique | Clusters No | Sensitivity Mean (%) | Sensitivity StDev (%) | Specificity Mean (%) | Specificity StDev (%) | Accuracy Mean (%) | Accuracy StDev (%) |
|---|---|---|---|---|---|---|---|
| Naïve Bayes | - | 78 | 13.8 | 80.8 | 12.6 | 83.5 | 5.2 |
| Inlier | 2 | 73.2 | 22 | 83.3 | 12.2 | 83.1 | 4.7 |
| Inlier | 3 | 75.9 | 14.7 | 83.7 | 12.6 | 83.6 | 4.4 |
| Inlier | 4 | 76.9 | 13.5 | 81.7 | 13.1 | 83.5 | 3.1 |
| Inlier | 5 | 76.9 | 13.5 | 81.7 | 13.1 | 83.5 | 3.1 |
| Outlier | 2 | 76.4 | 15.2 | 82.3 | 13.1 | 83.4 | 4.5 |
| Outlier | 3 | 75.9 | 14.7 | 83.7 | 12.6 | 83.6 | 4.4 |
| Outlier | 4 | 70.1 | 17.4 | 83.1 | 12.9 | 80.7 | 6.8 |
| Outlier | 5 | 70.1 | 17.4 | 83.1 | 12.9 | 80.7 | 6.8 |
| Range | 2 | 76.4 | 15.2 | 82.3 | 13.1 | 83.4 | 4.5 |
| Range | 3 | 78 | 13.1 | 81.3 | 14.6 | 84.1 | 4.3 |
| Range | 4 | 76.9 | 13.5 | 81.7 | 13.1 | 83.5 | 3.1 |
| Range | 5 | 76.9 | 13.5 | 81.7 | 13.1 | 83.5 | 3.1 |
| Random Row | 2 | 76.5 | 14.6 | 82 | 12.7 | 83.1 | 3.4 |
| Random Row | 3 | 76.6 | 14.4 | 85 | 13.2 | 84.8 | 4.7 |
| Random Row | 4 | 71.4 | 19.6 | 83 | 13.1 | 81.6 | 6.3 |
| Random Row | 5 | 71.1 | 14.8 | 81.1 | 13.1 | 80.4 | 4.7 |
| Random Attribute | 2 | 76.4 | 15.2 | 83.8 | 12.5 | 84 | 4.4 |
| Random Attribute | 3 | 74.3 | 17.6 | 82.2 | 14.1 | 82.3 | 6 |
| Random Attribute | 4 | 70.9 | 20.5 | 83.5 | 12.3 | 82 | 5.7 |
| Random Attribute | 5 | 71.1 | 14.8 | 81.1 | 13.1 | 80.4 | 4.7 |
Figure 4 shows the mean accuracy of integrating naïve bayes with k-means clustering for different initial centroid selection methods and different numbers of clusters. There is not a specific trend ...
... another. The largest increase in accuracy among the hybrid data mining techniques is achieved by k-nearest neighbour, whose mean accuracy rises by 2.5 percentage points, from 83.2% to 85.7%, as shown in Figure 6.
FIGURE 6: SINGLE AND HYBRID DATA MINING TECHNIQUES
Table 6 shows the increase in accuracy of the different data mining techniques when integrated with k-means clustering. The largest increase in accuracy is achieved by k-nearest neighbour, followed by gain ratio decision tree and naïve bayes, with increases of 2.5%, 2.1%, and 1.3% respectively. Table 6 also shows the t-test significance when comparing each single data mining technique with its integration with k-means clustering. The t test is calculated at the 95% confidence level. It shows a significant increase in accuracy between each single data mining technique and its integration with k-means clustering, as shown in Table 6. It also shows a significant difference when comparing the integrated k-means clustering k-nearest neighbour with the integrated k-means clustering naïve bayes and decision tree.
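A significance check of this kind can be sketched as below. The excerpt does not state which t-test variant was used or how many accuracy samples were compared, so the sketch assumes a paired t test over ten per-fold accuracies; the accuracy values are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for paired samples, testing after - before."""
    differences = [a - b for b, a in zip(before, after)]
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical per-fold accuracies (%) for a single and a hybrid technique.
single = [78.0, 80.0, 76.5, 82.0, 79.0, 77.5, 81.0, 80.5, 78.5, 79.0]
hybrid = [80.5, 81.5, 79.0, 83.0, 82.0, 79.0, 83.5, 82.0, 81.0, 80.5]

t, df = paired_t_statistic(single, hybrid)

# Two-tailed critical value of Student's t at the 5% level for 9 degrees of freedom.
T_CRITICAL_DF9 = 2.262
significant = abs(t) > T_CRITICAL_DF9
print(f"t = {t:.2f}, df = {df}, significant = {significant}")
```

Pairing by fold is a natural choice when both techniques are evaluated on the same partitions, because it removes fold-to-fold variation from the comparison.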
Although decision tree showed the lowest accuracy in the diagnosis of heart disease patients, it can still be very useful because it provides rules that explain how patients are diagnosed as healthy or sick. The resulting rule sets offer an explanation for each diagnosis. Naïve bayes and k-nearest neighbour are based on probability and on distance measurement between instances, respectively. They do not provide any diagnostic explanation of why each testing instance is diagnosed as healthy or sick.
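The distance-based nature of k-nearest neighbour can be illustrated with a short sketch. It assumes Euclidean distance and a simple majority vote, which are common choices but are not confirmed by this excerpt; the two features and all patient values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(
        range(len(train_rows)),
        key=lambda i: math.dist(train_rows[i], query),  # Euclidean distance
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (age, maximum heart rate) pairs; 0 = healthy, 1 = sick.
train_rows = [(45, 170), (50, 165), (48, 172), (62, 120), (65, 115), (60, 125)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_rows, train_labels, query=(63, 118), k=3))
```

The prediction is only a vote among nearby patients. Unlike a decision tree, nothing in the output states which attribute values led to the diagnosis.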
TABLE 6: SINGLE AND HYBRID DATA MINING TECHNIQUES

| Techniques | | Accuracy Mean | Accuracy StDev | Increase in Accuracy | T-test Significance |
|---|---|---|---|---|---|
| Decision Tree | Single | 79.1 | 5.8 | 2.1 | Yes (t=4.28, p ... |
... disease using an artificial immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest neighbour) based weighting preprocessing. Expert Systems with Applications, 2007. 32: p. 625-631.
27. Xiong, W. and C. Wang, Feature Selection: A Hybrid Approach Based on Self-adaptive Ant Colony and Support Vector Machine. International Conference on Computer Science and Software Engineering, IEEE, 2008.
28. Centers for Disease Control and Prevention (CDC), C.D.P.a.H.P. August 2012; Available from: http://www.cdc.gov/nccdphp/.
29. American Cancer Society, B.C.F.F. August 2012; Available from: http://www.cancer.org/downloads/STT/F861009_final 9-08-09.pdf.
30. Paladugu, S. and C. Shyu, Temporal mining framework for risk reduction and early detection of chronic diseases. 2010.
31. Heller, R.F., et al., How well can we predict coronary heart disease? Findings in the United Kingdom Heart Disease Prevention Project. British Medical Journal, 1984.
32. Wilson, P.W.F., et al., Prediction of Coronary Heart Disease Using Risk Factor Categories. American Heart Association Journal, 1998.
33. Simons, L.A., et al., Risk functions for prediction of cardiovascular disease in elderly Australians: the Dubbo Study. Medical Journal of Australia, 2003. 178.
34. Salahuddin and F. Rabbi, Statistical Analysis of Risk Factors for Cardiovascular disease in Malakand Division. Pak. j. stat. oper. res., 2006. Vol. II: p. 49-56.
35. Shahwan-Akl, L., Cardiovascular Disease Risk Factors among Adult Australian-Lebanese in Melbourne. International Journal of Research in Nursing, 2010. 6 (1).
36. Tu, M.C., D. Shin, and D. Shin, Effective Diagnosis of Heart Disease through Bagging Approach. Biomedical Engineering and Informatics, IEEE, 2009.
37. Shouman, M., T. Turner, and R. Stocker, Using data mining techniques in heart disease diagnosis and treatment. Japan-Egypt Conference on Electronics, Communications and Computers (JEC-ECC), 2012: p. 173-177.
38. Blake, C.L. and C.J. Merz, UCI repository of machine learning databases. University of California, Irvine, Department of Information and Computer Sciences, 1998.
39. Parthiban, L. and R. Subramanian, Intelligent Heart Disease Prediction System using CANFIS and Genetic Algorithm. International Journal of Biological and Life Sciences, 2007.
40. Patil, S.B. and Y.S. Kumaraswamy, Extraction of Significant Patterns from Heart Disease Warehouses for Heart Attack Prediction. International Journal of Computer Science and Network Security (IJCSNS), 2009. Vol. 9, No. 2.
41. Polat, K., et al., A new classification method to diagnosis heart disease: Supervised artificial immune system (AIRS). In Proceedings of the Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN), 2005.
42. Kavitha, K.S., K.V. Ramakrishnan, and M.K. Singh, Modeling and design of evolutionary neural network for heart disease detection. International Journal of Computer Science Issues (IJCSI), 2010. Vol. 7, Issue 5.
43. Wu, X., et al., Top 10 algorithms in data mining analysis. Knowl. Inf. Syst., 2007.
44. Tajunisha, N. and V. Saravanan, A new approach to improve the clustering accuracy using informative genes for unsupervised microarray data sets. International Journal of Advanced Science and Technology, 2011.
45. Bramer, M., Principles of data mining. 2007: Springer.
46. Esposito, F., D. Malerba, and G. Semeraro, A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997. Vol. 19, No. 5.
47. Alpaydin, E., Voting over Multiple Condensed Nearest Neighbors. Artificial Intelligence Review, 1997. 11: p. 115-132.
48. Pavan, K.K., et al., Robust seed selection algorithm for k-means type algorithms. International Journal of Computer Science & Information Technology (IJCSIT), 2011. Vol. 3, No. 5.
49. Poomagal, S. and T. Hamsapriya, Optimized k-means clustering with intelligent initial centroid selection for web search using URL and tag contents, in Proceedings of the International Conference on Web Intelligence, Mining and Semantics. 2011, ACM: Sogndal, Norway. p. 1-8.
50. Santhi, M.V.B.T., et al., Enhancing K-Means Clustering Algorithm. International Journal of Computer Science & Technology, 2011.
51. Immaculate Mary, C. and D.S.V. Kasmir Raja, Refinement of clusters from k-means with ant colony optimization. Journal of Theoretical and Applied Information Technology, 2009.
52. Khan, D.M. and N. Mohamudally, A Multiagent System (MAS) for the Generation of Initial Centroids for k-means Clustering Data Mining Algorithm based on Actual Sample Datapoints. Journal of Next Generation Information Technology, August, 2010. Volume 1, Number 2.