CHAPTER I
INTRODUCTION
1.1 BACKGROUND
Data and information have become major assets for most businesses.
Knowledge discovery in medical databases is a well-defined process, and data
mining is an essential step in it. Databases are collections of data with a specific,
well-defined structure and purpose. The programs used to create and manipulate
these data are called database management systems (DBMS). Knowledge discovery
in databases is the overall process that is
involved in unearthing knowledge from data. Data mining is concerned with the
process of computationally extracting hidden knowledge structures represented in
models and patterns from large data repositories.
Fayyad et al. (1996) define KDD as a non-trivial process of identifying
valid, novel, potentially useful and ultimately understandable patterns in data.
According to this definition, data are any facts, numbers or text that can be
processed by a computer. The term pattern indicates models and regularity which
can be observed within the data. The patterns, associations or relationships among
all this data can provide information and it can be converted into knowledge about
historical patterns and future trends. Other steps, such as data
preprocessing, data selection, data cleaning and data visualization, are also part
of the KDD process.
‘Data mining is an interdisciplinary field of study in databases, machine
learning and visualization’. It helps to identify the patterns of successful medical
therapies for different illnesses and also it aims to find useful information from
large collections of data. (http://en.wikipedia.org/wiki/Data_mining)
According to Frawley et al. (1991) data mining is the core of KDD which is
used to extract interesting patterns from data that are easy to perceive, interpret and
manipulate. It is the science of finding patterns in huge reserves of data, in order to
generate useful information. The KDD process comprises a few steps leading from
raw data collections to new knowledge.
Figure 1.1 Knowledge discovery as a process
As shown in Figure 1.1, Subbalakshmi et al. (2011) have described the
knowledge discovery process as consisting of an iterative sequence of data
cleaning, data integration, data selection, data mining, pattern recognition and
knowledge presentation. Data mining is the search for the relationships and global
patterns that exist in large databases hidden among large amounts of data. A target
data set must be assembled before data mining algorithms can be used. A common
source for data is a data mart or data warehouse and pre-processing is essential to
analyze these multivariate data sets. The final step of knowledge discovery from
data is to verify that the patterns produced by the data mining algorithms occur in
the wider data set. The discovered knowledge may contain rules that describe the
properties of the data, patterns that occur frequently and objects that are found to be
in clusters in the database.
According to Shelly Gupta et al. (2011) the motivation for handling data
and performing computation is the discovery of knowledge. The KDD process
employs data mining methods to identify patterns at some measure of
interestingness and it is the process of turning the low-level data into high-level
knowledge.
Hand et al. (2001) defined data mining as “the analysis of observational
datasets to find unsuspected relationships and to summarize the data in novel ways
that are both understandable and useful to the data owner”.
According to Connolly et al. (2005) data mining is a process of extracting
valid, previously unknown, comprehensible and actionable information from large
databases and using it to make crucial business decisions.
According to Luminita State et al. (2007) data mining is the process of
discovering meaningful new correlations, patterns and trends by sifting through
large amounts of data stored in repositories, using pattern recognition technologies
as well as statistical and mathematical techniques. It is an evolving and growing
area of research and development, both in academia as well as in industry.
Data mining is a challenging area in the field of medical research.
Extraction of useful knowledge from the database and providing scientific decision-
making for the diagnosis and treatment of diseases increasingly becomes necessary.
Data mining in medicine can deal with this problem. Medical data mining has great
potential for exploring the hidden patterns in the data sets of the medical domain.
As Aqueel Ahmed et al. (2006) have stated, data mining is a logical process
that is used to search through large amounts of data in order to find useful data. The
wide availability of huge amounts of data and the need to convert such data to useful
information necessitate the use of data mining techniques. It has become an
established method for improving statistical tools to predict future trends. The
necessity for using data mining techniques to develop and improve risk models
arises from the need for clinicians to improve their prediction models for individual
patients. Various data mining techniques are available with their suitability
dependent on the domain application. Data mining applications in health can have a
tremendous potential and usefulness. It automates the process of finding predictive
information in large databases.
Data mining tools include Weka (Waikato Environment for Knowledge
Analysis), a collection of machine learning algorithms for data mining tasks written
in Java and developed at the University of Waikato, New Zealand, and RapidMiner,
formerly YALE (Yet Another Learning Environment), also written in Java. Such
tools perform data analysis and may uncover important data patterns, contributing to
business strategies, knowledge bases and scientific and medical research.
(http://www.cs.waikato.ac.nz/ml/weka)
Smita Malik et al. (2012) have stated that the successful application of data
mining in highly visible fields like retail, marketing and e-business has led to the
popularity of its use in knowledge discovery in databases in other industries and
sectors. The data generated by healthcare transactions are huge, and these medical
data about large patient populations are analyzed to perform medical research.
1.2 DATA MINING TASKS
Classification of data is a common task in machine learning. Artificial
intelligence first achieved recognition as a discipline in the mid-1950s. One of the
fundamental requirements for any intelligent behavior is learning. Most of the
researchers today agree that there is no intelligence without learning. In artificial
intelligence research, machine learning has been central to its development from the
very beginning. Machine learning is a branch of computer science that is concerned
with the development of algorithms that allow computers to learn. It can be used to
develop systems that can ensure increased efficiency and effectiveness of the
system.
Han and Kamber (2006) have stated that data mining tasks are used to
specify the kind of patterns to be found in the data mining process. Basically, the
algorithms try to fit a model closest to the characteristics of the data under
consideration, and models can be either predictive or descriptive. Predictive models
are used to make predictions, for example, of the diagnosis of a particular disease.
In a business setting, such models analyze past performance to assess the likelihood
of a customer exhibiting a specific behavior in order to improve marketing effectiveness.
Descriptive models are used to identify the patterns in data. For example, a
physician might be interested in discovering the influence of climate among
typhoid patients by grouping patients in different climate zones. Unlike the
predictive models that focus on predicting a single customer behavior, descriptive
models identify many different relationships between customers or products.
As shown in Figure 1.2, classification, regression and time series analysis
are some of the tasks of predictive modeling. Clustering, association rules,
visualization are some of the tasks of descriptive modeling.
Figure 1.2 Data mining tasks
In machine learning, classification is the task of identifying to which of a set
of categories a new observation belongs. This is done on the basis of a training
set of data containing observations or instances whose category membership is
known. According to Han and Kamber (2006) classification is the process of
finding a model or function that describes and distinguishes data classes so that
the model can be used to predict the class of objects whose class label
is unknown. It is a learning function that maps or classifies a data item into one of
several predefined groups or classes, and it comes under supervised learning. A
training data set is used to build the classification model, and a separate test data
set is used to measure its classification efficiency. Classification can be divided
into two separate problems: binary classification and multiclass
classification. In binary classification,
only two classes are involved, whereas multiclass classification involves assigning
an object to one of several classes.
Prediction is achieved with the help of regression. It is the process of
analyzing the current and past states of an attribute and predicting its future
state. Regression is a data mining technique that is used to predict a value. It takes a
numeric dataset and develops a mathematical formula to fit the data. A regression
task begins with a dataset of known target values and regression analysis can be
used to model the relationship between one or more independent or predictor
variables and a dependent or response variable. The types of regression methods are
linear regression, multivariate linear regression, nonlinear regression and
multivariate nonlinear regression.
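As a minimal sketch of the simplest case listed above, linear regression with a single predictor variable, the following fits a line by ordinary least squares. The function name fit_line and the data values are illustrative assumptions and are not taken from the present work.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up predictor and response values that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted formula can then be used to predict the response for a new value of the predictor as slope * x + intercept.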
Time series is a sequence of data points, measured typically at successive
points in time spaced at uniform time intervals. Time series analysis comprises
methods for analyzing time series data in order to extract meaningful statistics and
other characteristics of such data. Methods for time series analysis may be divided
into two classes, namely, frequency-domain methods which include spectral
analysis and time-domain methods which include auto-correlation and cross-
correlation analysis. There are several types of motivation and data analysis
available for time series which are appropriate for different purposes. In the context
of data mining, pattern recognition and machine learning, time series analysis can
be used for clustering, classification, query by content, anomaly detection as well as
forecasting.
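One of the time-domain methods mentioned above, auto-correlation, can be sketched as follows. This is a minimal illustration that assumes a uniformly spaced series and uses the common estimator that divides the lagged covariance by the lag-0 variance; the function name and the sample series are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a uniformly spaced series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# A made-up series that alternates between two values with period 2.
series = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
for lag in (0, 1, 2):
    print(lag, autocorrelation(series, lag))
```

For this alternating series the value is negative at lag 1 and positive at lag 2, which reflects the period of two time steps.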
A cluster is a collection of objects which are similar to one another and dissimilar
to the objects belonging to other clusters. Clustering has no predefined classes and identifies
groups of items that share specific characteristics which come under unsupervised
learning. It analyzes data objects without consulting a known class label. The
objects are clustered or grouped based on the principle of maximizing the
intra-class similarity and minimizing the inter-class similarity. It is the main task of
exploratory data mining and a common technique for statistical data analysis used
in many fields including machine learning, pattern recognition, image analysis,
information retrieval and bioinformatics. Clustering can be roughly distinguished as
hard clustering, in which each object either belongs to a cluster or does not, and soft
clustering, in which each object belongs to each cluster to a certain degree. One of
the most used clustering algorithms is the K-means clustering algorithm.
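A minimal sketch of K-means as a hard clustering method is given below. It assumes two-dimensional points, squared Euclidean distance and caller-supplied initial centroids so that the result is reproducible; the points themselves are made up for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Two visually separate groups of made-up points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, final_clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
print(final_clusters)
```

Each point ends up in exactly one cluster, which is what distinguishes this from soft clustering. In practice the initial centroids are usually chosen at random, so results can differ between runs.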
Association rule learning is a popular and well researched method for
discovering interesting relations between variables in large databases. It is intended
to identify strong rules discovered in databases using different measures of
interest. According to Jagjeevan Rao et al. (2012) there are several association rule
algorithms which are mainly useful in summarizing and identifying the patterns.
They also use correlation along with support and confidence in order to find the
right patterns. These are usually required to satisfy a user-specified minimum
support and a user-specified minimum confidence at the same time. Rule generation
is split into two separate steps: first, minimum support is applied to find all
frequent item sets in a database; second, these frequent item sets and the minimum
confidence constraint are used to form rules. Association and correlation are usually
meant for locating frequent item sets among large data sets. Association rules
differ from classification rules in that they can predict any attribute, not just the
class, and can predict more than one attribute’s value at a time. The types of association rules
are multilevel association rule, multidimensional association rule and quantitative
association rule.
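The two steps described above (finding frequent item sets under a minimum support, then forming rules under a minimum confidence) can be sketched as follows. This is a brute-force illustration limited to item sets of size one and two, not the Apriori algorithm or any specific algorithm from the cited work; the transactions and thresholds are made up.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Step 1: enumerate item sets whose support meets the threshold."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[combo] = s
    return result


def rules(frequent, transactions, min_confidence):
    """Step 2: form rules A -> B from frequent pairs using confidence."""
    found = []
    for itemset, pair_support in frequent.items():
        if len(itemset) != 2:
            continue
        for antecedent, consequent in (itemset, itemset[::-1]):
            confidence = pair_support / support([antecedent], transactions)
            if confidence >= min_confidence:
                found.append((antecedent, consequent, confidence))
    return found


# Hypothetical patient records, each a set of recorded conditions.
transactions = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "obesity"},
    {"diabetes", "obesity"},
    {"hypertension"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
found = rules(frequent, transactions, min_confidence=0.8)
print(found)
```

Here confidence of A -> B is computed as support(A and B) divided by support(A), so a rule can hold in one direction and fail in the other.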
A supervised learning algorithm analyzes the training data and produces an
inferred function. If the output is discrete or categorical, the function is called a
classifier and the task is classification; if the output is numerical or continuous,
the task is termed regression. Unsupervised learning refers to the
problem of trying to find hidden structures in unlabeled data.
1.3 DATA MINING IN HEALTH INFORMATICS
Data mining is an integration of multiple disciplines such as statistics,
machine learning, neural networks and pattern recognition. It is concerned with the
process of computationally extracting hidden knowledge structures represented in
models and patterns from large data repositories.
Healthcare is a data intensive process. Many processes run simultaneously
producing new data every second. It is a research intensive field and the largest
consumer of public funds. With the emergence of computers and new algorithms,
health care has seen an increase of computer tools and could no longer ignore these
emerging tools. This has resulted in the unification of healthcare and computing to
form health informatics, which is an emerging field. Health informatics systems
typically work through an analysis of medical data and a knowledge base of
clinical expertise.
Peyman Mohammadi et al. (2013) described data collection about different
diseases as very important in medical areas today.
Medical and health areas are among the most important sections in industrial
societies. The extraction of knowledge from a massive volume of data related to
diseases and medical records using the data mining process can lead to identifying
the laws governing the creation and development of epidemic diseases.
Some medical applications of data mining are:
Prediction of health care costs.
Determination of disease treatment.
Diagnosis and prediction of diseases of most kinds.
Health informatics is defined as an evolving scientific discipline that deals
with the collection, storage, retrieval, communication and optimal use of health
related data, information and knowledge. It is the field of study applied to clinical
care, nursing, public health and biomedical research all dedicated to the
improvement of patient care and population health.
It is one of the fastest growing areas within the health sector and covers a
wide range of applications and research. It deals with biomedical information, data
and knowledge. With the help of smart algorithms and machine intelligence, quality
healthcare can be provided through problem solving and decision-making systems.
In the domain of health informatics, Decision Support Systems are defined as
knowledge based systems that support information sciences and assist decision
making activities. Physicians can input the patient data through electronic health
forms and can run a decision support system on the data input to get an opinion on
the patient’s health and the care required.
According to Pragnyaban Mishra et al. (2012) the success of healthcare data
mining hinges on the availability of clean healthcare data. Possible directions
include the standardization of clinical vocabulary and the sharing of data across
organizations to enhance the benefits of healthcare data mining applications.
Data mining for healthcare is useful in evaluating the effectiveness of
medical treatments. Through comparing and contrasting various causes, symptoms,
and treatment methodologies, data mining can produce an analysis of treatments
that can correct specific symptoms most effectively. It is widely used in healthcare
fields due to its descriptive and predictive power. It can predict health insurance
fraud, healthcare cost, disease prognosis, disease diagnosis, and length of stay
needed in a hospital. It also obtains frequent patterns from biomedical and
healthcare databases such as relationships between health conditions and a disease,
relationships among diseases and relationships among drugs etc.
Data mining today has successful applications in various fields including
health care. This industry generates large amounts of complex data about patient
records, hospital resources, disease diagnosis, medical devices etc.
These data are a key resource to be processed and analyzed for knowledge
extraction and data mining in various areas, as follows:
Healthcare insurers detect fraud and abuse
Healthcare organizations make customer relationship management decisions
Physicians identify effective treatments and best practices
Patients receive better and more affordable healthcare services.
Some of the challenges of data mining in the medical domain are in the following
areas:
Extraction of useful knowledge and provision of scientific decision-
making ability for the diagnosis and treatment of disease.
Identification of the patterns of successful medical therapies for different
ailments.
Too many disease markers (attributes) now available for decision making.
Voluminous data now being collected with the help of computerization
(text, graphs, images)
Handling noisy (containing errors or outliers), inconsistent (containing
discrepancies in codes or names) and incomplete (lacking attribute
values or containing only aggregates) medical data, which must be
preprocessed.
The current applications of data mining in healthcare and medicine are:
Prognosis, to predict future outcomes based on previous experience and
present conditions.
Therapy, to select from available treatment methods based on
effectiveness, suitability to the patient etc.
Diagnosis, to recognize and classify patterns in multivariate patient
attributes.
Data mining not only focuses on collecting and managing data, it also
includes analysis and prediction. The wide range of applications from business
tasks to scientific tasks has led to a huge variety of learning methods and
algorithms for rule extraction and prediction. For medical diagnosis, there are many
expert systems based on logical rules for decision making and prediction. Even
though many data mining techniques exist for the prediction of heart disease, little
data is available for the prediction of heart disease on diabetes
datasets. Prediction of risk using data mining can help in understanding the
possible risk of developing the disease, and so has prophylactic value. With the
advancement of tools like RapidMiner, which can be used with ease on large datasets
with a large number of attributes, the difficulty today lies in determining the
appropriate machine learning technique that can ensure accuracy.
Hian Chye Koh et al. (2010) state that data mining is becoming increasingly
popular, if not increasingly essential, in healthcare. Several factors have motivated
the use of data mining applications in healthcare. These can greatly benefit the
healthcare industry. However, they are not without limitations. Healthcare data
mining can be limited by the accessibility of data, because the raw inputs for data
mining often exist in different settings and systems, such as administration, clinics,
laboratories and more. Hence, the data have to be collected and integrated before
data mining can be done.
According to Salim Diwani et al. (2013) data mining applications can be
developed to evaluate the effectiveness of medical treatments. They can also benefit
healthcare providers such as hospitals, clinics and physicians, as well as patients, by
identifying effective treatments and best practices.
The aims of quality healthcare services are:
Provision of safe healthcare treatments
Use of scientific medical knowledge to provide healthcare services to
everyone
Provision of various healthcare treatments based on the patient’s needs,
symptoms and preferences
Minimizing the time to wait for the medical treatment
In the healthcare industry, the available data are
voluminous and sensitive, and they require careful handling without any
mismanagement. Various data mining classification techniques have
been used in the healthcare industry, and the best among them can be chosen.
1.4 DATA MINING CLASSIFICATION TECHNIQUES
In the early days of data warehousing, data mining was viewed as a subset
of the activities associated with the warehouse. Today, a warehouse may be a good
source for the data to be mined and data mining is recognized as an independent
activity. One of the greatest strengths of data mining lies in its wide range of
methodologies and techniques that can be applied to various problem sets. Data
mining is a natural activity to be performed on large datasets. The data classification
process involves two steps: learning and classification. In learning, the training data are
analyzed by classification algorithms and in classification, test data are used to
estimate the accuracy of the classification rules.
According to Vikas Chaurasia et al. (2013) many researchers have used data
mining techniques in the diagnosis of diseases such as tuberculosis, diabetes, cancer
and heart disease. Several data mining techniques are used in the diagnosis
of heart disease, such as neural networks, Bayesian classification, classification
based on clustering, genetic algorithms, naive Bayes and decision trees, which
show accuracy at different levels. Each data mining technique serves a different
purpose depending on the modeling objective.
According to Lashari et al. (2013) classification in data mining is used to
predict group membership for data instances. Data mining involves the use of
sophisticated data analysis tools to discover the relationship in large datasets.
Decision tree based classification methods are widely used in data mining for the
decision support application. Thus, there is a great potential for the use of data
mining techniques for medical data classification.
Figure 1.3 Data mining classification methods
As shown in Figure 1.3, classification is the discovery of a predictive learning
function that classifies a data item into one of several predefined classes. These
techniques have been widely applied with great success in the field of medical
databases. The types of classification models are Bayesian classification, Support
vector machine, neural networks, classification by decision tree induction and
classification based on associations.
Datta et al. (2011) have stated that classification and association rules
play a major part in data mining. Classification is the process of dividing a dataset
into mutually exclusive groups and enables us to categorize records in a large
database into a predefined set of classes. Association rules provide a process to find
relationships among data items in a given dataset. Association rule mining enables us
to establish associations and relationships between large unclassified data items based
on certain attributes and characteristics. It defines certain rules of associability
between data items and then uses those rules to establish relationships.
According to Varun Kumar et al. (2011) association analysis is the
discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. It is also widely used for market basket or
transaction data analysis.
Sunita Beniwal et al. (2012) have described the classification process as
divided into two phases: training, when a classification model is built from the
training set, and testing, when the model is evaluated on the test set. One of the
major goals of a classification algorithm is to maximize the predictive accuracy
obtained by the classification model.
The Bayesian classifiers have a structural model and a set of conditional
probabilities. They assume that the contribution of all variables is independent. The
classifier first estimates the prior probability of each class and the probability of
each variable value within each class, and then applies these estimates to an
unknown case. A Bayes network classifier is based on a
Bayesian network which represents a joint probability distribution over a set of
categorical attributes.
The naive Bayes classifier is a simple probabilistic classifier.
This method is based on probabilistic knowledge and on supervised learning. It
reads a set of examples from the training set and uses Bayes' theorem to estimate
the probabilities of all classifications. For each instance, the classification with the
highest probability is chosen as the prediction class.
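The procedure described above can be sketched for categorical attributes as follows. This is a minimal illustration, not the classifier used in the present work: the training rows are made up, and add-one smoothing with an assumed number of possible values per attribute (n_values) is introduced here so that unseen values do not produce a zero probability.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(row, class_counts, value_counts, n_values=2):
    """Return the class with the highest (unnormalized) posterior score."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for i, value in enumerate(row):
            # Naive independence assumption: multiply per-attribute
            # likelihoods, each estimated with add-one smoothing.
            score *= (value_counts[(i, label)][value] + 1) / (count + n_values)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training set: (blood pressure, smoker) -> class label.
rows = [("high", "yes"), ("high", "no"), ("low", "no"), ("low", "no")]
labels = ["sick", "sick", "healthy", "healthy"]
class_counts, value_counts = train(rows, labels)
print(predict(("high", "yes"), class_counts, value_counts))
print(predict(("low", "no"), class_counts, value_counts))
```

The scores are proportional to the class probabilities given the instance, so choosing the largest score corresponds to choosing the classification with the highest probability.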
Support Vector Machine (SVM) is a concept in statistics and computer science for
a set of related supervised learning methods that analyze data and recognize
patterns. Introduced by Corinna Cortes and Vladimir Vapnik (1995), it is used for
classification and regression analysis. It constructs a hyperplane or set of
hyperplanes in a high- or infinite-dimensional space, which can be used for classification,
regression, or other tasks. In addition to performing linear classification, SVMs can
efficiently perform non-linear classification using what are called kernel functions,
which implicitly map their inputs into high-dimensional feature spaces. It is a
learning machine that plots the training vectors in high-dimensional space and
labels each vector by its class. (http://en.wikipedia.org/wiki/Support_vector_machine)
According to Barakat et al. (2007) it is based on the principle of risk
minimization, which aims to minimize the error rate. SVM uses a supervised learning
approach for classifying data. It uses kernel functions to map the data set to a
high-dimensional data space for performing classification, and the major advantage of
SVM is its classification accuracy.
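The idea that a kernel function implicitly maps its inputs into a higher-dimensional feature space can be checked numerically. The sketch below assumes two-dimensional inputs and the degree-2 polynomial kernel k(x, y) = (x . y)^2, chosen here only because its explicit feature map is short to write down; the function names are illustrative.

```python
from math import isclose, sqrt


def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(x, y):
    """Degree-2 polynomial kernel, evaluated in the original 2-D space."""
    return dot(x, y) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)


x = (1.0, 2.0)
y = (3.0, 4.0)
implicit = poly_kernel(x, y)
explicit = dot(feature_map(x), feature_map(y))
print(implicit, explicit, isclose(implicit, explicit))
```

Both routes give the same value, but the kernel never constructs the three-dimensional vectors, which is what makes kernels practical when the feature space is very high-dimensional.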
Decision trees belong to the classification methods and construct a hierarchical,
tree-like structure. Their goal is to create a model that predicts the value of a
target variable based on several input variables. Each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and each leaf
node holds a class label. The decision tree is a popular classifier and prediction
method for handling high dimensional data. Construction of a decision tree is the
training step of classification, and the criterion used to choose the splitting
attribute during construction is called the attribute selection measure (ASM).
Decision trees used in data mining are of two main types: classification tree
analysis, which is used when the predicted outcome is the class to which the data
belongs, and regression tree analysis, which is used when the predicted outcome can be
considered a real number. C4.5 is an algorithm used to generate a decision tree
developed by Ross Quinlan (1996) and it is an extension of Quinlan's earlier ID3
algorithm. The decision trees generated by C4.5 can be used for classification, and
for this reason, C4.5 is often referred to as a statistical classifier.
(http://en.wikipedia.org/wiki/Decision_tree_learning)
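One widely used attribute selection measure for decision tree induction is information gain, the measure used by ID3 (C4.5 refines it into the gain ratio). The following is a minimal sketch of how the gain of each candidate attribute could be computed; the attribute names and records are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on the given attribute."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [
            label for row, label in zip(rows, labels) if row[attribute] == value
        ]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical records: "bp" separates the classes perfectly, "smoker" does not.
rows = [
    {"bp": "high", "smoker": "yes"},
    {"bp": "high", "smoker": "no"},
    {"bp": "low", "smoker": "yes"},
    {"bp": "low", "smoker": "no"},
]
labels = ["yes", "yes", "no", "no"]
gain_bp = information_gain(rows, labels, "bp")
gain_smoker = information_gain(rows, labels, "smoker")
print(gain_bp, gain_smoker)
```

The attribute with the largest gain would be chosen as the test at the current node, and the procedure would then be repeated on each resulting subset.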
Tina Patil et al. (2013) have described classification as an important data
mining technique with broad applications to classify various kinds of data. It is
used to classify an item according to its features with respect to a
predefined set of classes. The performance of a classification algorithm is usually
examined by evaluating the accuracy of the classification. For classifications, the
Bayesian networks are used to construct classifiers from a given set of training
examples with class labels.
Shruti Ratnakar et al. (2013) have explained that, in the field of artificial
intelligence, a genetic algorithm is a search heuristic that imitates the process of
natural evolution. This heuristic is routinely used to generate useful solutions to
optimization and search problems. Genetic algorithms
belong to the larger class of evolutionary algorithms, which generate optimized
solutions using techniques inspired by natural evolution, such as inheritance,
mutation, selection, and crossover. The evolution usually starts from a population
of randomly generated individuals, and is an iterative process, with the population
in each iteration called a generation. In each generation, the fitness of every
individual in the population is evaluated; the fitness is usually the value of the
objective function in the optimization problem being solved.
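The loop described above can be sketched on a toy problem. The example assumes bit-string individuals and an objective function that simply counts the 1 bits; the selection scheme (parents drawn from the fitter half), single-point crossover, elitism and all parameter values are illustrative choices, not those of the cited work.

```python
import random


def fitness(individual):
    """Objective function: number of 1 bits in the individual."""
    return sum(individual)


def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Start from a population of randomly generated individuals.
    population = [
        [rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)
    ]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = [population[0][:]]  # elitism: keep the best unchanged
        while len(next_gen) < pop_size:
            # Selection: draw two distinct parents from the fitter half.
            mother, father = rng.sample(population[: pop_size // 2], 2)
            # Crossover: the child inherits a prefix from one parent
            # and the remaining suffix from the other.
            cut = rng.randrange(1, length)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [
                1 - bit if rng.random() < mutation_rate else bit for bit in child
            ]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print(fitness(best), history[0])
```

Because the best individual of each generation is copied unchanged into the next one, the best fitness recorded in history can never decrease from one generation to the next.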
According to Chitra et al. (2013) the application of Artificial Neural
Networks (ANN) can be time-consuming due to the selection of input features for
the multi-layer perceptron. The number of layers and the number of neurons in each
layer are also determined by the input attributes. An ANN is inspired by attempts to
simulate biological neural systems. Each node or neuron is interconnected with other
nodes via weighted links. During the learning phase, the network learns by
adjusting weights to enable prediction of the correct class labels of the input
tuples. The nodes are organized into three categories of layers: input, hidden and
output. Neural networks are ideal for identifying patterns or trends in data and
well suited for prediction or forecasting needs; the most widely used is the
multi-layer perceptron with the back-propagation algorithm. Some of the disadvantages
of neural networks are that they require many parameters that are empirically
determined, and classification performance is sensitive to the parameters selected.
Training is very slow, and clinicians find it difficult to understand how
classification decisions are taken and cannot interpret the results easily.
1.5 DATA MINING TOOLS
According to Witten (2011) the Waikato Environment for Knowledge Analysis
(Weka) tool is used for data mining, which finds valuable information hidden in large
volumes of data. It was developed at the University of Waikato in New Zealand, and
the easiest way to use it is through a graphical user interface called Explorer. Weka
is a collection of machine learning algorithms for data mining tasks, written in Java,
and contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization.
According to Matthew North (2012) RapidMiner is the world-leading
open-source system for knowledge discovery and data mining. It is easy to install,
can run on just about any computer and provides specific data mining
functions. It offers various machine learning algorithms for data mining
experiments.
The basic approach taken with this tool is the preparation of a process
model which uses 10-fold cross-validation along with the machine learning algorithm
to obtain a reliable estimate of the accuracy of the model. X-Validation works by
splitting the data into 10 partitions, each containing 10% of the original data.
For each partition in turn, it builds a model on the other 90% of the data and
applies it to the held-out 10% to measure performance. It does this for all 10
partitions and averages the results.
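The partitioning behind 10-fold cross-validation can be sketched as follows. This is a plain illustration of the index bookkeeping, not RapidMiner's X-Validation operator: it assumes the records are already in a suitable order and omits shuffling and stratification, and the evaluate callback is a hypothetical placeholder for training and scoring a model.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# With 1000 records, every fold trains on 900 and tests on 100.
folds = list(k_fold_indices(1000))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Every record appears in exactly one test fold, so each record is used once for evaluation and nine times for model building.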
Figure 1.4 Main process using Rapid Miner
As shown in Figure 1.4, the main process takes care of the execution of the
classification model: retrieving the data, setting the role of the attribute for
learning, running 10-fold cross-validation and providing the results. The model is
set inside the 10-fold cross-validation operator. Any RapidMiner application starts
with a main process, which is of the class com.rapidminer.Process. Once all the
operators are defined, the run method is called to execute the process.
According to Lakshmi et al. (2013), a classifier for heart disease is judged
on its performance and accuracy. The performance of a chosen classifier is validated
based on error rate and computation time. The classification accuracy is assessed in
terms of sensitivity and specificity, and the computation time of each classifier is
also taken into account. A confusion matrix displays the frequency of correct and
incorrect predictions.
Consider a two-class prediction problem (binary classification), in which the
outcomes are labeled either as positive (p) or as negative (n).
A binary classifier has four possible outcomes, namely True Positive,
False Positive, True Negative and False Negative. Suppose an experiment has
P positive instances and N negative instances. The four outcomes can be
formulated in a 2×2 contingency table or confusion matrix.
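As an illustration, the four cells of the confusion matrix and the measures derived from them (accuracy, error rate, sensitivity and specificity) can be computed as in the sketch below. The function names and the ten example labels are made up for illustration and are not taken from the experiments in this work; the sketch assumes at least one positive and one negative instance.

```python
def confusion_counts(actual, predicted, positive="p"):
    """Count TP, FP, TN and FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for a, y in zip(actual, predicted):
        if y == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summary_metrics(tp, fp, tn, fn):
    """Derive common measures from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate, TP / P
        "specificity": tn / (tn + fp),  # true negative rate, TN / N
    }


# Made-up labels, for illustration only (P = 4, N = 6).
actual    = ["p", "p", "p", "p", "n", "n", "n", "n", "n", "n"]
predicted = ["p", "p", "p", "n", "n", "n", "n", "n", "p", "p"]
tp, fp, tn, fn = confusion_counts(actual, predicted)
metrics = summary_metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print(metrics)
```

Here sensitivity is the fraction of the P positive instances that are correctly identified, and specificity is the fraction of the N negative instances that are correctly identified.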
The preprocessing steps followed in the experiment in the present work are
as follows:
Step 1: Import the data into the repository. The records of about 1000
diabetic patients have been collected, and the Rapid Miner tool is applied to them.
Step 2: Ensure that all the defined attributes are available along with the
class label, since the class label is the attribute that drives the
machine learning process.
Step 3: Connect the data source to the X-Validation
operator. X-Validation encapsulates a cross-validation process. It
applies k-fold cross validation; in our case k=10. (Ten is about the right number of
folds to get the best estimate of error.)
Step 4: The X-Validation operator splits the data into separate partitions
to evaluate the model. It builds the model on 90% of the data and applies the model to
the remaining 10% of the data to evaluate the performance. The data is selected using
stratified sampling. In contrast to the simple sampling operator, this operator
performs a stratified sampling for data sets with nominal label attributes, i.e. the
class distributions remain (almost) the same after sampling.
Step 5: Use the appropriate model operator, such as Naïve Bayes,
Decision Tree or SVM.
Step 6: Connect the outputs of the model and use the apply model
operator to apply the model to the test data made available by the X-Validation
operator.
Step 7: Use a performance classifier operator to determine the
performance output.
Step 8: The X-Validation operator averages the performance over all the
iterations inside the operator.
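The stratified sampling and averaging described in the steps above can be sketched in plain Python as follows. This is a minimal illustration and not the Rapid Miner operator itself: the function names, the synthetic class split of 300 "yes" and 700 "no" labels, and the majority-class learner (a placeholder for Naïve Bayes, a decision tree or an SVM) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign record indices to k folds class by class, so that every
    fold keeps (almost) the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validate(records, labels, train, predict, k=10, seed=0):
    """Train on k - 1 folds, test on the held-out fold, average accuracy."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train([records[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        hits = sum(predict(model, records[j]) == labels[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k


# Placeholder learner: always predicts the majority class of the training data.
def train_majority(records, labels):
    return max(set(labels), key=labels.count)


def predict_majority(model, record):
    return model


# Synthetic labels, for illustration only.
labels = ["yes"] * 300 + ["no"] * 700
records = list(range(1000))
folds = stratified_folds(labels)
mean_accuracy = cross_validate(records, labels, train_majority, predict_majority)
print(len(folds[0]), mean_accuracy)
```

With these synthetic labels every fold holds 30 "yes" and 70 "no" records, mirroring the 30/70 split of the whole data set, which is the property that stratified sampling is meant to preserve.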
1.6 BIOLOGICAL LINK BETWEEN DIABETES AND HEART DISEASE
There is a strong link between diabetes and heart disease. Diabetes by itself is
now regarded as the strongest risk factor for heart disease. Managing diabetes is
about blood glucose control, and managing heart disease is about blood pressure and
cholesterol control. The two diseases have insulin resistance in common; it increases
the chances of developing both type 2 diabetes and heart disease. Both type 1 diabetes
and type 2 diabetes are independent risk factors for CHD. In fact, from the point of
view of cardiovascular medicine, it may be appropriate to say, ‘Diabetes is a
cardiovascular disease’.
Mai Shouman et al. (2011) say heart disease has been the leading cause of death
in the world over the past 10 years. The European Public Health Alliance reports that
heart attacks, strokes and other circulatory diseases account for 41% of all deaths.
The Australian Bureau of Statistics reports that heart and circulatory system diseases
are the leading cause of death in Australia, accounting for 33.7% of all deaths.
Motivated by the world-wide increase in mortality among heart disease patients each
year and the availability of huge amounts of patient data from which to extract useful
knowledge, researchers have been using data mining techniques to help health care
professionals in the diagnosis of heart disease.
Muhamad Hariz et al. (2012) point out that diabetes is a metabolic disorder
in which the body cannot make proper use of carbohydrate, and that it is greatly
affected by the patient’s lifestyle. CHD is a serious disease that causes many deaths,
especially in China.
Researchers have found that high blood sugar (hyperglycemia) activates a
biological pathway that causes irregular heartbeats, a condition called cardiac
arrhythmia, which can trigger heart failure and sudden cardiac death.
(http://www.medicalnewstoday.com/articles/266891.php)
People who suffer from diabetes are two to four times more likely to
develop cardiovascular disease, compared to non-diabetics.
(http://www.world-heart-federation.org/cardiovascular-health/cardiovascular-disease-risk-factors/diabetes)
The American Heart Association says that around 65% of diabetics die from
heart disease or stroke, emphasizing the need for new research looking at links
between the conditions. There is also evidence that obesity, a sedentary
lifestyle and poor blood glucose control contribute to increased chances of high
blood pressure. Premenopausal women usually have less risk of heart
disease than men of the same age. However, women of all ages with diabetes have
an increased risk of heart disease because diabetes cancels out the protective effects
of being a woman in her child-bearing years.
In fact, cardiovascular disease leading to heart attack or stroke is by far
the leading cause of death in both men and women with diabetes. Another major
component of cardiovascular disease is poor circulation in the legs, which
contributes to a greatly increased risk of foot ulcers and amputations.
Control of the ABCs of diabetes can reduce the risk of heart disease and stroke.
A stands for A1C, a test that measures blood glucose control and shows
the average blood glucose level over the past 3 months. B stands for blood pressure
and C stands for cholesterol. The best way to deal with cardiovascular disease is to
prevent or delay its development. Weight control and smoking cessation
are two important lifestyle measures that have an impact on preventing heart
disease. In addition, good control of blood glucose levels and low-dose aspirin can
enhance these benefits. (http://diabetes.niddk.nih.gov/dm/pubs/stroke/)
1.7 STATEMENT OF THE PROBLEM
Discovery of new information in terms of patterns or rules from large
amounts of data is based on machine learning techniques. Disease prediction
plays an important role in data mining. Diagnosis of a disease requires a
number of tests to be performed on the patient. However, the use of data mining
techniques can reduce the number of tests. This reduced test set plays an important
role in time and performance. Diabetes data mining is important because it allows
doctors to see which features or attributes, such as age and weight, are more
important for diagnosis. This will help doctors diagnose diabetes more efficiently.
There are various data mining techniques in use in the healthcare industry, but
research is still needed on the performance of the various classification
techniques, so that the best among them can be chosen.
The research presented in this thesis is intended to address the challenge of
improving the prediction model used to predict heart disease in diabetic patients and
of providing a timely response in predicting the disease. Briefly, the important
research questions are therefore stated as follows:
How can various data mining techniques be used in the health care industry, and
how do they perform in prediction?
How do classification techniques help in developing a prediction
model that accurately predicts the risk of heart disease among diabetic
patients?
1.8 OBJECTIVES OF THE RESEARCH
Application of data mining in analyzing medical data is a good method
for investigating the existing relationships between variables. Nowadays, the data
stored in medical databases are growing at an increasingly rapid rate. It has been
widely recognized that medical data analysis can lead to an enhancement of
health care.
The primary objective of the research work is the effective development of a
prediction model using various classification techniques to predict heart disease,
and the evaluation of their performance in prediction. The work also shows that data
mining can be applied to medical databases to predict or classify the data with
reasonable accuracy.
The following objectives lead to the achievement of the primary
objective stated above:
To identify the best classification model which can help the physicians in
predicting the risk of heart disease using diabetic attributes.
To recognize and classify patterns in multivariate patient attributes.
To predict the future outcomes based on previous experiences and present
conditions.
To identify the patients at risk, with the aim of increasing the quality of care
and reducing the cost of care.
To build a prediction model using appropriate classification techniques such
as naïve Bayes, decision trees and support vector machines.
1.9 ORGANIZATION OF THE THESIS
Chapter two reviews the literature on data mining, its major
predictive techniques and applications, surveys the comparative analyses by other
researchers and sets out the criteria to be used for model comparison in this work.
Chapter three presents the experimentation and methodology based on Naïve Bayes,
Support Vector Machine and Decision Tree, and the data sets of the proposed
diagnosis system.
Chapter four gives the analysis of the experiments done by combining three
data mining techniques. The various heart disease risk prediction models are
created by categorizing the dataset based on certain attribute value pairs.
Chapter five summarizes the results and compares the performance of
the techniques on the data sets in terms of
accuracy, sensitivity, specificity and F-score.
Chapter six gives the conclusions and future enhancements.
The appendices, Appendix I to Appendix XIV, describe the
sample view of the data and their attributes, Naïve Bayes sample distributions, the
C4.5 decision tree algorithm, sample decision tree graph views, text views and
Rapid Miner screenshots of the main process for the overall NB, SVM and DT models.