CHAPTER I

INTRODUCTION

1.1 BACKGROUND

Data and information have become major assets for most businesses. Knowledge discovery in medical databases is a well-defined process in which data mining is an essential step. Databases are collections of data with a specific, well-defined structure and purpose. The programs used to develop and manipulate these data are called Database Management Systems (DBMS). Knowledge Discovery in Databases (KDD) is the overall process of unearthing knowledge from data. Data mining is concerned with computationally extracting hidden knowledge structures, represented as models and patterns, from large data repositories.

Fayyad et al. (1996) define KDD as a non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. According to this definition, data are any facts, numbers or text that can be processed by a computer. The term pattern indicates models and regularities which can be observed within the data. The patterns, associations or relationships among all these data can provide information, which can be converted into knowledge about historical patterns and future trends. Other steps such as data preprocessing, data selection, data cleaning and data visualization are also part of the KDD process.

‘Data mining is an interdisciplinary field of study in databases, machine learning and visualization’. It helps to identify patterns of successful medical therapies for different illnesses and aims to find useful information in large collections of data. (http://en.wikipedia.org/wiki/Data_mining)

According to Frawley et al. (1991), data mining is the core of KDD and is used to extract interesting patterns from data that are easy to perceive, interpret and manipulate. It is the science of finding patterns in huge reserves of data in order to generate useful information. The KDD process comprises a few steps leading from raw data collections to new knowledge.

Figure 1.1 Knowledge discovery as a process

As shown in Figure 1.1, Subbalakshmi et al. (2011) have described the knowledge discovery process as an iterative sequence of data cleaning, data integration, data selection, data mining, pattern recognition and knowledge presentation. Data mining is the search for relationships and global patterns hidden among the large amounts of data in large databases. A target data set must be assembled before data mining algorithms can be used. A common source of data is a data mart or data warehouse, and pre-processing is essential for analyzing these multivariate data sets. The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. The discovered knowledge may contain rules that describe the properties of the data, patterns that occur frequently, objects that are found to be in clusters in the database, and so on.

According to Shelly Gupta et al. (2011), the motivation for handling data and performing computation is the discovery of knowledge. The KDD process employs data mining methods to identify patterns according to some measure of interestingness; it is the process of turning low-level data into high-level knowledge.

Hand et al. (2001) defined data mining as “the analysis of observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner”.

According to Connolly et al. (2005), data mining is a process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions.

According to Luminita State et al. (2007), data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. It is an evolving and growing area of research and development, both in academia and in industry.

Data mining is a challenging area in the field of medical research. Extracting useful knowledge from databases and supporting scientific decision-making for the diagnosis and treatment of diseases are becoming increasingly necessary, and data mining in medicine can address this need. Medical data mining has great potential for exploring the hidden patterns in the data sets of the medical domain.

As Aqueel Ahmed et al. (2006) have stated, data mining is a logical process that is used to search through large amounts of data in order to find useful data. The wide availability of huge amounts of data and the need to convert such data into useful information necessitate the use of data mining techniques. It has become an established method for improving statistical tools to predict future trends. The necessity for using data mining techniques to develop and improve risk models arises from the need for clinicians to improve their prediction models for individual patients. Various data mining techniques are available, with their suitability depending on the application domain. Data mining applications in health have tremendous potential and usefulness, since data mining automates the process of finding predictive information in large databases.

Data mining tools such as Weka (Waikato Environment for Knowledge Analysis), a collection of machine learning algorithms for data mining tasks written in Java and developed at the University of Waikato, New Zealand, and RapidMiner, formerly YALE (Yet Another Learning Environment), also written in Java, perform data analysis and may uncover important data patterns, contributing to business strategies, knowledge bases and scientific and medical research. (http://www.cs.waikato.ac.nz/ml/weka)

Smita Malik et al. (2012) have stated that the successful application of data mining in highly visible fields like retail, marketing and e-business has led to the popularity of its use for knowledge discovery in databases in other industries and sectors. The data generated by healthcare transactions are huge, and these medical data about large patient populations are analyzed to perform medical research.

1.2 DATA MINING TASKS

Classification of data is a common task in machine learning. Artificial intelligence first achieved recognition as a discipline in the mid-1950s. One of the fundamental requirements for any intelligent behavior is learning, and most researchers today agree that there is no intelligence without learning. Machine learning has therefore been central to artificial intelligence research from the very beginning. Machine learning is a branch of computer science that is concerned with the development of algorithms that allow computers to learn. It can be used to develop systems with increased efficiency and effectiveness.

Han and Kamber (2006) have stated that data mining tasks are used to specify the kind of patterns to be found in the data mining process. Basically, the algorithms try to fit a model closest to the characteristics of the data under consideration, and models can be either predictive or descriptive. Predictive models are used to make predictions, for example, of the diagnosis of a particular disease. In a business setting, they analyze past performance to assess the likelihood of a customer exhibiting a specific behavior in order to improve marketing effectiveness. Descriptive models are used to identify the patterns in data. For example, a physician might be interested in discovering the influence of climate on typhoid patients by grouping patients in different climate zones. Unlike predictive models, which focus on predicting a single customer behavior, descriptive models identify many different relationships between customers or products.

As shown in Figure 1.2, classification, regression and time series analysis are some of the tasks of predictive modeling. Clustering, association rules and visualization are some of the tasks of descriptive modeling.

Figure 1.2 Data mining tasks

In machine learning, classification is the task of identifying to which of a set of categories a new observation belongs. This is done on the basis of a training set of data containing observations or instances whose category membership is known. According to Han and Kamber (2006), classification is the process of finding a model or function that describes and distinguishes data classes, so that the model can be used to predict the class of objects whose class label is unknown. It is a learning function that maps or classifies a data item into one of several predefined groups or classes, and it comes under supervised learning. A training data set is used to build the classification model, and a separate test data set is used to assess its classification performance. Classification covers two distinct problems, binary classification and multiclass classification. In binary classification only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes.
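The train-then-test procedure can be illustrated with a minimal sketch. The classifier used here, a 1-nearest-neighbour rule, is chosen only for brevity and is not a method named in this chapter; the function names and the toy data are invented for illustration.

```python
import math

def nearest_neighbour_predict(train, point):
    """Return the label of the training instance closest to `point`."""
    closest = min(train, key=lambda item: math.dist(item[0], point))
    return closest[1]

def accuracy(train, test):
    """Fraction of test instances whose predicted label equals the true label."""
    correct = sum(1 for features, label in test
                  if nearest_neighbour_predict(train, features) == label)
    return correct / len(test)

# Invented toy data: (features, class label). Three labels make this a
# multiclass problem; keeping only two labels would make it binary.
train_set = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
    ((5.0, 5.0), "medium"), ((5.2, 4.9), "medium"),
    ((9.0, 9.0), "high"), ((8.8, 9.1), "high"),
]
test_set = [((0.9, 1.1), "low"), ((5.1, 5.1), "medium"), ((9.2, 8.9), "high")]

print(accuracy(train_set, test_set))
```

The labels of the test set are used only to score the predictions, never to build the model, which is what makes the accuracy figure an estimate of performance on unseen data.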

Prediction is achieved with the help of regression, a data mining technique that is used to predict a numeric value. It analyzes the current and past states of an attribute in order to predict its future state. It takes a numeric dataset and develops a mathematical formula to fit the data. A regression task begins with a dataset of known target values, and regression analysis can be used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable. The types of regression methods are linear regression, multivariate linear regression, nonlinear regression and multivariate nonlinear regression.
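As an illustration of fitting a formula to numeric data, the following sketch computes the ordinary least-squares line for a single predictor variable. The function name `fit_line` and the data are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 6.0 + intercept  # predict the response for an unseen x
print(slope, intercept, predicted)
```

Here `xs` plays the role of the independent (predictor) variable and `ys` the dependent (response) variable; the fitted line is then used to predict a value outside the observed data.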

A time series is a sequence of data points, typically measured at successive points in time spaced at uniform intervals. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of such data. Methods for time series analysis may be divided into two classes: frequency-domain methods, which include spectral analysis, and time-domain methods, which include auto-correlation and cross-correlation analysis. Several types of data analysis are available for time series, each appropriate for a different purpose. In the context of data mining, pattern recognition and machine learning, time series analysis can be used for clustering, classification, query by content, anomaly detection and forecasting.
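A time-domain method such as auto-correlation can be sketched in a few lines. The estimator below is the common sample auto-correlation (covariance at a given lag divided by the variance); the function name and the series are invented for illustration.

```python
def autocorrelation(series, lag):
    """Sample auto-correlation of `series` at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((v - mean) ** 2 for v in series)
    covariance = sum((series[t] - mean) * (series[t + lag] - mean)
                     for t in range(n - lag))
    return covariance / variance

# Invented series that repeats every three time steps.
series = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

lag1 = autocorrelation(series, 1)
lag3 = autocorrelation(series, 3)
print(lag1, lag3)
```

For this periodic series the value at lag 3 is clearly larger than the value at lag 1, which is how auto-correlation reveals a repeating pattern in the data.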

A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. Clustering has no predefined classes; it identifies groups of items that share specific characteristics and therefore comes under unsupervised learning. It analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. It is the main task of exploratory data mining and a common technique for statistical data analysis used in many fields including machine learning, pattern recognition, image analysis, information retrieval and bioinformatics. Clustering can be roughly divided into hard clustering, in which each object either belongs to a cluster or does not, and soft clustering, in which each object belongs to each cluster to a certain degree. One of the most widely used clustering algorithms is the K-means clustering algorithm.
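A minimal sketch of K-means (a hard clustering method) is given below. It alternates between assigning every point to its nearest centroid and moving each centroid to the mean of its assigned points. The initial centroids are passed in explicitly and the iteration count is fixed, both simplifying assumptions; the names and data are invented.

```python
import math

def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, centroids, iterations=10):
    """Run a fixed number of assignment/update rounds of K-means."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, assign(points, centroids)

# Invented 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, [(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

No class labels are supplied at any point, which is what distinguishes this unsupervised procedure from classification.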

Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interest. According to Jagjeevan Rao et al. (2012), there are several association rule algorithms which are mainly useful in summarizing and identifying patterns. They also use correlation along with support and confidence in order to find the right patterns. Rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Rule generation is split into two separate steps: first, the minimum support is applied to find all frequent item sets in a database; second, these frequent item sets and the minimum confidence constraint are used to form rules. Association and correlation are usually meant for finding frequent item sets among large data sets. Association differs from classification in that association rules can predict any attribute, not just the class, and can predict more than one attribute’s value at a time. The types of association rules are multilevel association rules, multidimensional association rules and quantitative association rules.
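The two-step procedure (frequent item sets first, then rules) can be sketched as follows. For brevity the sketch enumerates every candidate item set by brute force rather than using a pruning algorithm such as Apriori, so it is suitable only for tiny examples. The function names and the transactions are invented for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Step 1: all item sets whose support reaches the minimum support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                frequent[frozenset(candidate)] = s
    return frequent

def rules(frequent, transactions, min_confidence):
    """Step 2: rules antecedent -> consequent that reach the minimum confidence."""
    found = []
    for itemset, s in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                confidence = s / support(antecedent, transactions)
                if confidence >= min_confidence:
                    found.append((set(antecedent), set(itemset - antecedent), s, confidence))
    return found

# Invented transactions: symptoms recorded per (hypothetical) patient visit.
transactions = [
    frozenset({"fever", "cough"}),
    frozenset({"fever", "cough", "fatigue"}),
    frozenset({"fever", "fatigue"}),
    frozenset({"cough"}),
]
frequent = frequent_itemsets(transactions, min_support=0.5)
strong_rules = rules(frequent, transactions, min_confidence=0.8)
print(strong_rules)
```

Confidence is computed as the support of the whole item set divided by the support of the antecedent, so a rule can have high confidence only if its antecedent rarely occurs without the consequent.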

A supervised learning algorithm analyzes the training data and produces an inferred function. If the output is a discrete or categorical attribute, the task is called classification and the inferred function is called a classifier; if the output is a numerical or continuous attribute, the task is termed regression. Unsupervised learning refers to the problem of trying to find hidden structures in unlabeled data.

1.3 DATA MINING IN HEALTH INFORMATICS

Data mining is an integration of multiple disciplines such as statistics,

machine learning, neural networks and pattern recognition. It is concerned with the

process of computationally extracting hidden knowledge structures represented in

models and patterns from large data repositories.

Healthcare is a data-intensive process. Many processes run simultaneously, producing new data every second. It is a research-intensive field and the largest consumer of public funds. With the emergence of computers and new algorithms, healthcare has seen an increase in computer tools and can no longer ignore them. This has resulted in the unification of healthcare and computing to form health informatics, an emerging field. Health informatics tools typically work through an analysis of medical data and a knowledge base of clinical expertise.

Peyman Mohammadi et al. (2013) described data collection about different diseases as very important in medical areas today. Medical and health areas are among the most important sectors in industrial societies. The extraction of knowledge from the massive volume of data related to diseases and medical records using the data mining process can lead to identifying the laws governing the creation and development of epidemic diseases.

Some medical applications of data mining are:

Prediction of healthcare costs

Determination of disease treatment

Diagnosis and prediction of diseases of most kinds

Health informatics is defined as an evolving scientific discipline that deals

with the collection, storage, retrieval, communication and optimal use of health

related data, information and knowledge. It is the field of study applied to clinical

care, nursing, public health and biomedical research all dedicated to the

improvement of patient care and population health.

Health informatics is one of the fastest growing areas within the health sector and covers a wide range of applications and research. It deals with biomedical information, data and knowledge. With the help of smart algorithms and machine intelligence, quality healthcare can be provided through problem-solving and decision-making systems. In the domain of health informatics, Decision Support Systems are defined as knowledge-based systems that support information sciences and assist decision-making activities. Physicians can input patient data through electronic health forms and run a decision support system on the data to get an opinion on the patient’s health and the care required.

According to Pragnyaban Mishra et al. (2012) the success of healthcare data

mining hinges on the availability of clean healthcare data. Possible directions

include the standardization of clinical vocabulary and the sharing of data across

organizations to enhance the benefits of healthcare data mining applications.

Data mining for healthcare is useful in evaluating the effectiveness of

medical treatments. Through comparing and contrasting various causes, symptoms,

and treatment methodologies, data mining can produce an analysis of treatments

that can correct specific symptoms most effectively. It is widely used in healthcare

fields due to its descriptive and predictive power. It can predict health insurance

fraud, healthcare cost, disease prognosis, disease diagnosis, and length of stay

needed in a hospital. It also obtains frequent patterns from biomedical and

healthcare databases such as relationships between health conditions and a disease,

relationships among diseases and relationships among drugs etc.

Data mining today has successful applications in various fields including healthcare. This industry generates large amounts of complex data about patient records, hospital resources, disease diagnosis, medical devices etc.

These data are a key resource to be processed and analyzed for knowledge extraction, and data mining helps in various areas as follows:

Healthcare insurers detect fraud and abuse

Healthcare organizations make customer relationship management decisions

Physicians identify effective treatments and best practices

Patients receive better and more affordable healthcare services.

Some of the challenges of data mining in the medical domain are in the following areas:

Extraction of useful knowledge and provision of scientific decision-making ability for the diagnosis and treatment of disease.

Identification of the patterns of successful medical therapies for different ailments.

The large number of disease markers (attributes) now available for decision making.

The voluminous data (text, graphs, images) now being collected with the help of computerization.

Preprocessing of medical data that are noisy (containing errors or outliers), inconsistent (containing discrepancies in codes or names) or incomplete (lacking attribute values or containing only aggregates).

The current applications of data mining in healthcare and medicine are:

Prognosis, to predict future outcomes based on previous experience and present conditions.

Therapy, to select from available treatment methods based on effectiveness, suitability to the patient etc.

Diagnosis, to recognize and classify patterns in multivariate patient attributes.

Data mining not only focuses on collecting and managing data; it also includes analysis and prediction. The wide range of applications, from business tasks to scientific tasks, has led to a huge variety of learning methods and algorithms for rule extraction and prediction. For medical diagnosis, there are many expert systems based on logical rules for decision making and prediction. Even though many data mining techniques exist for the prediction of heart disease, there is a lack of available data for the prediction of heart disease on the diabetes dataset. Prediction of risk using data mining can help in understanding the possible risk of getting that disease, and therefore has prophylactic capability. With the advancement of tools like RapidMiner, which can be used with ease on large datasets with a large number of attributes, the difficulty today lies in determining the appropriate machine learning technique that can ensure accuracy.

Hian Chye Koh et al. (2010) say that data mining is becoming increasingly popular, if not increasingly essential, in healthcare. Several factors have motivated the use of data mining applications in healthcare. These applications can greatly benefit the healthcare industry; however, they are not without limitations. Healthcare data mining can be limited by the accessibility of data, because the raw inputs for data mining often exist in different settings and systems, such as administration, clinics, laboratories and more. Hence, the data have to be collected and integrated before data mining can be done.

According to Salim Diwani et al. (2013), data mining applications can be developed to evaluate the effectiveness of medical treatments. They can also benefit healthcare providers such as hospitals, clinics and physicians, as well as patients, by identifying effective treatments and best practices.

The aims of quality healthcare services are:

Provision of safe healthcare treatments

Use of scientific medical knowledge to provide healthcare services to everyone

Provision of various healthcare treatments based on the patient’s needs, symptoms and preferences

Minimizing the time to wait for the medical treatment

The healthcare industry is one in which the available data are voluminous and sensitive. The data require careful handling without any mismanagement. Various data mining classification techniques have been used in the healthcare industry, and the best among them can be chosen.

1.4 DATA MINING CLASSIFICATION TECHNIQUES

In the early days of data warehousing, data mining was viewed as a subset of the activities associated with the warehouse. Today, a warehouse may be a good source of the data to be mined, but data mining is recognized as an independent activity. One of the greatest strengths of data mining lies in its wide range of methodologies and techniques that can be applied to various problem sets. Data mining is a natural activity to be performed on large datasets. The data classification process involves two steps, learning and classification. In learning, the training data are analyzed by classification algorithms; in classification, test data are used to estimate the accuracy of the classification rules.

According to Vikas Chaurasia et al. (2013), many researchers have used data mining techniques in the diagnosis of diseases such as tuberculosis, diabetes, cancer and heart disease. Several data mining techniques are used in the diagnosis of heart disease, such as neural networks, Bayesian classification, classification based on clustering, genetic algorithms, naive Bayes and decision trees, and they show different levels of accuracy. Each data mining technique serves a different purpose depending on the modeling objective.

According to Lashari et al. (2013), classification in data mining is used to predict group membership for data instances. Data mining involves the use of sophisticated data analysis tools to discover relationships in large datasets. Decision tree based classification methods are widely used in data mining for decision support applications. Thus, there is great potential for the use of data mining techniques for medical data classification.

Figure 1.3 Data mining classification methods

As shown in Figure 1.3, classification is the discovery of a predictive learning function that classifies a data item into one of several predefined classes. Classification techniques have been widely applied with great success in the field of medical databases. The types of classification models are Bayesian classification, support vector machines, neural networks, classification by decision tree induction and classification based on associations.

Datta et al. (2011) have stated that classification and association rules play a major part in data mining. Classification is the process of dividing a dataset into mutually exclusive groups and enables us to categorize records in a large database into a predefined set of classes. Association rules provide a process to find relationships among data items in a given dataset. They enable us to establish associations and relationships between large numbers of unclassified data items based on certain attributes and characteristics. Association rule mining defines certain rules of associability between data items and then uses those rules to establish relationships.

According to Varun Kumar et al. (2011) association analysis is the

discovery of association rules showing attribute-value conditions that occur

frequently together in a given set of data. It is also widely used for market basket or

transaction data analysis.

Sunita Beniwal et al. (2012) have illustrated the classification process as one divided into two phases: training, when a classification model is built from the training set, and testing, when the model is evaluated on the test set. One of the major goals of a classification algorithm is to maximize the predictive accuracy obtained by the classification model.

Bayesian classifiers have a structural model and a set of conditional probabilities. They assume that the contributions of all variables are independent. A Bayesian classifier first estimates the prior probability of each class and the probability of occurrence of each variable value, and then applies these estimates to an unknown case. A Bayes network classifier is based on a Bayesian network, which represents a joint probability distribution over a set of categorical attributes.

The naive Bayes classifier is a simple probabilistic classifier. The method is based on probabilistic knowledge and on supervised learning. It reads a set of examples from the training set and uses Bayes' theorem to estimate the probabilities of all classifications. For each instance, the classification with the highest probability is chosen as the predicted class.
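The procedure can be sketched for categorical attributes as follows. The sketch multiplies the class prior by the per-attribute conditional probabilities, which is the independence assumption, and adds one to every count (Laplace smoothing, an addition not mentioned in the text) so that unseen values do not force a probability of zero. All names and the toy records are invented.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count classes, attribute values per class, and distinct values per attribute."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (label, attribute index) -> value counts
    distinct = defaultdict(set)           # attribute index -> set of seen values
    for features, label in examples:
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct

def posterior(model, features):
    """Normalised P(class | features) under the naive independence assumption."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            # Laplace-smoothed estimate of P(value | class)
            score *= (value_counts[(label, i)][value] + 1) / (count + len(distinct[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

def predict_class(model, features):
    """Choose the class with the highest posterior probability."""
    scores = posterior(model, features)
    return max(scores, key=scores.get)

# Invented records: (chest pain?, high blood pressure?) -> label.
examples = [
    (("yes", "yes"), "disease"), (("yes", "no"), "disease"), (("yes", "yes"), "disease"),
    (("no", "no"), "healthy"), (("no", "yes"), "healthy"), (("no", "no"), "healthy"),
]
model = train_naive_bayes(examples)
print(predict_class(model, ("yes", "yes")))
print(posterior(model, ("yes", "yes")))
```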

Support Vector Machine (SVM) is a concept in statistics and computer science for a set of related supervised learning methods that analyze data and recognize patterns. Introduced by Corinna Cortes and Vladimir Vapnik (1995), it is used for classification and regression analysis. It constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression or other tasks. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what are called kernel functions, which implicitly map their inputs into high-dimensional feature spaces. It is a learning machine that plots the training vectors in high-dimensional space and labels each vector by its class. (http://en.wikipedia.org/wiki/Support_vector_machine)

According to Barakat et al. (2007), SVM is based on the principle of risk minimization, which aims to minimize the error rate. SVM uses a supervised learning approach for classifying data. It uses kernel functions to map the data set to a high-dimensional data space for performing classification, and its major advantage is its classification accuracy.
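The idea that a kernel function maps inputs into a higher-dimensional space implicitly can be checked numerically. The sketch below does not train an SVM; it only shows, for the degree-2 polynomial kernel chosen here as an example, that evaluating the kernel in the original 2-D space gives the same number as an ordinary dot product in an explicit 3-D feature space. All function names are invented.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel: K(x, y) = (x . y) ** 2."""
    return dot(x, y) ** 2

def explicit_feature_map(x):
    """Explicit 3-D map for 2-D inputs whose dot product reproduces the kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 0.5)

implicit = polynomial_kernel(x, y)                               # computed in 2-D
explicit = dot(explicit_feature_map(x), explicit_feature_map(y)) # computed in 3-D
print(implicit, explicit)
```

The two values agree up to floating-point rounding, which is why a kernel-based method never needs to construct the high-dimensional vectors themselves.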

Decision trees are classification methods that construct a hierarchical, tree-like structure. Their goal is to create a model that predicts the value of a target variable based on several input variables. Each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The decision tree is a popular classification and prediction method for handling high-dimensional data. Construction of a decision tree is the training step of classification, and the criterion used to select the splitting attribute during construction is called the Attribute Selection Measure (ASM).
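One widely used attribute selection measure is information gain, the reduction in class entropy obtained by splitting on an attribute (the measure used by ID3). The sketch below computes it for each attribute and picks the attribute to test at the root node; it does not build the full tree. The function names and the toy records are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute_index):
    """Entropy of all labels minus the weighted entropy after splitting on one attribute."""
    labels = [label for _, label in rows]
    partitions = {}
    for attributes, label in rows:
        partitions.setdefault(attributes[attribute_index], []).append(label)
    remainder = sum(len(part) / len(rows) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows):
    """Index of the attribute with the highest information gain."""
    n_attributes = len(rows[0][0])
    return max(range(n_attributes), key=lambda i: information_gain(rows, i))

# Invented records: (symptom present?, age group) -> class label.
rows = [
    (("yes", "old"), "sick"), (("yes", "young"), "sick"),
    (("no", "old"), "well"), (("no", "young"), "well"),
]
print(information_gain(rows, 0), information_gain(rows, 1), best_attribute(rows))
```

In this toy data the first attribute separates the classes perfectly while the second carries no information, so the first would be chosen for the root test.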

Decision trees used in data mining are of two main types: classification tree analysis, which is used when the predicted outcome is the class to which the data belongs, and regression tree analysis, which is used when the predicted outcome can be considered a real number. C4.5 is an algorithm developed by Ross Quinlan (1996) to generate a decision tree, and it is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. (http://en.wikipedia.org/wiki/Decision_tree_learning)

Tina Patil et al. (2013) have described classification as an important data mining technique with broad applications for classifying various kinds of data. It is used to classify an item according to its features with respect to a predefined set of classes. The performance of a classification algorithm is usually examined by evaluating the accuracy of the classification. For classification, Bayesian networks are used to construct classifiers from a given set of training examples with class labels.

Shruti Ratnakar et al. (2013) have illustrated that, in the field of artificial intelligence, a genetic algorithm is a search heuristic that imitates the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization problems. Genetic algorithms belong to the larger class of evolutionary algorithms, which generate optimized solutions using techniques inspired by natural evolution, such as inheritance, mutation, selection and crossover. The evolution usually starts from a population of randomly generated individuals and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved.
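The generational loop described above can be sketched on a toy problem. The objective chosen here (maximize the number of 1-bits in a bit string, often called OneMax), the truncation selection scheme, the elitism step and all parameter values are assumptions made for illustration, not details taken from the cited work.

```python
import random

def fitness(individual):
    """Objective function of the toy problem: the number of 1-bits."""
    return sum(individual)

def evolve(pop_size=20, length=12, generations=30, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    # Start from a population of randomly generated individuals.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = [max(fitness(ind) for ind in population)]
    for _ in range(generations):
        # Selection: the fitter half of the population become parents.
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        children = [parents[0][:]]  # elitism: copy the best individual unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)        # single-point crossover
            child = mother[:cut] + father[cut:]     # inheritance from both parents
            child = [1 - g if rng.random() < mutation_rate else g for g in child]  # mutation
            children.append(child)
        population = children
        history.append(max(fitness(ind) for ind in population))
    return max(population, key=fitness), history

best, history = evolve()
print(fitness(best), history)
```

Because the best individual is always carried over, the best fitness recorded in `history` can never decrease from one generation to the next.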

According to Chitra et al. (2013) the application of Artificial Neural

Networks (ANN) can be time-consuming due to the selection of input features for

the multi layer perceptron. The number of layers, number of neurons in each layer

was also determined by the input attributes. It is inspired by attempts to simulate


biological neural systems. Each node or neuron is interconnected with other

nodes via weighted links. During the learning phase, the network learns by

adjusting the weights to enable prediction of the correct class labels of the input

tuples. The nodes are organized into three kinds of layers: input, hidden and output.

Neural networks are ideal for identifying patterns or trends in data and are

well suited to prediction or forecasting needs; the most widely used is the

multilayer perceptron with the back-propagation algorithm. Some of the disadvantages of

neural networks are that they require many parameters that are empirically determined,

and classification performance is sensitive to the parameters selected. Training is very

slow, and clinicians find it difficult to understand how

classification decisions are taken and cannot interpret the results easily.
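
To make the weight-adjustment idea concrete, the following is a minimal sketch of a multilayer perceptron trained with back-propagation in plain Python. The XOR toy data, the three hidden neurons, the learning rate, the epoch count and the random seed are all assumptions made for illustration; none of them comes from the work cited above, and whether the network fully separates the two classes depends on those choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    # Input layer -> hidden layer -> output layer; a constant 1.0 acts as bias.
    inputs = x + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, inputs))) for ws in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

def loss(data, w_hidden, w_out):
    # Mean squared error between network outputs and class labels.
    return sum((forward(x, w_hidden, w_out)[1] - y) ** 2 for x, y in data) / len(data)

def init_weights(n_in, n_hidden, seed=1):
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return w_hidden, w_out

def train(data, w_hidden, w_out, epochs=5000, lr=0.5):
    n_hidden, n_in = len(w_hidden), len(data[0][0])
    for _ in range(epochs):
        grad_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        grad_out = [0.0] * (n_hidden + 1)
        for x, y in data:
            hidden, output = forward(x, w_hidden, w_out)
            # Error signal at the output neuron.
            delta_out = (output - y) * output * (1.0 - output)
            for j, h in enumerate(hidden + [1.0]):
                grad_out[j] += delta_out * h
            # Propagate the error signal back to each hidden neuron.
            for j in range(n_hidden):
                delta_h = delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                for i, v in enumerate(x + [1.0]):
                    grad_hidden[j][i] += delta_h * v
        # Adjust every weight against its averaged gradient.
        for j in range(n_hidden + 1):
            w_out[j] -= lr * grad_out[j] / len(data)
        for j in range(n_hidden):
            for i in range(n_in + 1):
                w_hidden[j][i] -= lr * grad_hidden[j][i] / len(data)

# Assumed toy training tuples: two binary inputs with an XOR class label.
XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

w_hidden, w_out = init_weights(n_in=2, n_hidden=3)
loss_before = loss(XOR, w_hidden, w_out)
train(XOR, w_hidden, w_out)
loss_after = loss(XOR, w_hidden, w_out)
print("loss before training:", loss_before)
print("loss after training:", loss_after)
```

Even this tiny network already has several empirically chosen parameters (hidden-layer size, learning rate, number of epochs, initial weights), which is the sensitivity the paragraph above refers to.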

1.5 DATA MINING TOOLS

According to Witten (2011), the Waikato Environment for Knowledge Analysis

(Weka) tool is used for data mining, which finds valuable information hidden in large

volumes of data. It was developed at the University of Waikato in New Zealand, and

the easiest way to use it is through a graphical user interface called Explorer. Weka is a

collection of machine learning algorithms for data mining tasks, written in Java and

contains tools for data pre-processing, classification, regression, clustering,

association rules, and visualization.

According to Matthew North (2012), RapidMiner is the world-leading

open-source system for knowledge discovery and data mining. It is easy to install,

can run on just about any computer and provides specific data mining

functions. It offers various machine learning algorithms for carrying out data mining

experiments.

The basic approach taken with this tool is the preparation of a process

model which uses 10-fold cross-validation along with the machine learning algorithm

to obtain a reliable estimate of the accuracy of the model. X-Validation works by

splitting the data into 10 partitions, each containing 10% of the original data.

In each of the 10 iterations it builds a model on the remaining 90% of the data

and applies it to the held-out 10% to measure performance. It does this for all

10 partitions and averages the results.
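
The partition-and-average scheme can be sketched in Python as follows. This is not RapidMiner's implementation: the interleaved fold assignment, the majority-class stand-in "model" and the hypothetical label list are assumptions used only to show the 90%/10% mechanics.

```python
def k_fold_indices(n_records, k=10):
    # Partition record indices into k folds of (almost) equal size; each fold
    # serves once as the held-out test set, the other folds form the training set.
    folds = [list(range(start, n_records, k)) for start in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_records) if i not in held_out]
        yield train_idx, test_idx

def majority_class(labels):
    # Stand-in "model": always predicts the most frequent training label.
    return max(set(labels), key=labels.count)

def cross_validate(labels, k=10):
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        prediction = majority_class([labels[i] for i in train_idx])
        correct = sum(labels[i] == prediction for i in test_idx)
        accuracies.append(correct / len(test_idx))
    # The reported performance is the average over all k iterations.
    return sum(accuracies) / k

# Hypothetical class labels for 100 records.
labels = ["yes"] * 30 + ["no"] * 70
print("mean accuracy over 10 folds:", cross_validate(labels))
```

Every record is used for testing exactly once and for training in the other nine iterations, so the average is an estimate of performance on unseen data rather than a way of raising the accuracy itself.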

Figure 1.4 Main process using Rapid Miner

As shown in Figure 1.4, the main process takes care of the execution of the

classification model: retrieving the data, setting the role of the attribute for learning,

performing 10-fold cross-validation and providing the results. The model is set inside the

10-fold cross-validation operator. Any RapidMiner application starts with a main

process, which is of the class com.rapidminer.Process. Once all the

operators have been defined, the run method is called to execute the process.

According to Lakshmi et al. (2013), the basic criteria for choosing a classifier

for heart disease classification are its performance and accuracy. The

performance of a chosen classifier is validated based on error rate and computation

time. The classification accuracy is measured in terms of sensitivity and specificity.

The computation time noted for each classifier is also taken into account. The confusion

matrix displays the frequency of correct and incorrect predictions.

Consider a two-class prediction problem (binary classification) in which the

outcomes are labeled either as positive (p) or as negative (n).

There are four possible outcomes from a binary classifier namely True Positive,

False Positive, True Negative and False Negative. Suppose in an experiment there

are P positive instances and N negative instances. The four outcomes can be

formulated in a 2×2 contingency table or confusion matrix.
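
As a small worked example, the four outcomes and the measures derived from them can be computed as below. The ten actual and predicted labels are made up for illustration, and the function names are hypothetical; the formulas are the standard ones, with P = TP + FN and N = FP + TN.

```python
def confusion_counts(actual, predicted, positive="p"):
    # Count the four possible outcomes of a binary classifier.
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # true positive
            else:
                fp += 1   # false positive
        else:
            if a == positive:
                fn += 1   # false negative
            else:
                tn += 1   # true negative
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    # Assumes at least one positive and one negative instance (P > 0, N > 0).
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate, TP / P
        "specificity": tn / (tn + fp),   # true negative rate, TN / N
    }

# Made-up labels: P = 4 positive and N = 6 negative instances.
actual = list("ppppnnnnnn")
predicted = list("pppnpnnnnn")

tp, fp, tn, fn = confusion_counts(actual, predicted)
print("            predicted p  predicted n")
print(f"actual p    {tp:11d}  {fn:11d}")
print(f"actual n    {fp:11d}  {tn:11d}")
print(metrics(tp, fp, tn, fn))
```

Here three of the four positive instances and five of the six negative instances are classified correctly, giving a sensitivity of 0.75 and a specificity of 5/6.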

The preprocessing steps detailed in the experiment in the present work are

as follows:

Step 1: Import of the data to the repository. The records of about 1000

diabetic patients have been collected, and the RapidMiner tool is applied to them.

Step 2: Ensuring the availability of all the attributes defined along with

the class label, since the class label is the attribute that the

machine learning process learns to predict.

Step 3: Once the data source is in place, it needs to be connected to the

X-Validation operator. X-Validation encapsulates a cross-validation process. It

applies k-fold cross-validation; in our case k=10. (Ten is about the right number of

folds to get the best estimate of error.)

Step 4: The X-Validation operator splits the data into separate partitions

to evaluate the model. It builds the model on 90% of the data and applies the model to the

remaining 10% of the data to evaluate the performance. The data is selected using

stratified sampling. In contrast to the simple sampling operator, this operator

performs a stratified sampling for data sets with nominal label attributes, i.e. the

class distributions remain (almost) the same after sampling.

Step 5: Use of the appropriate model operator such as Naïve Bayes,

Decision Tree and SVM.

Step 6: Connection of the output of the model to the Apply Model

operator, which applies the model to the test data made available by the X-Validation

operator.

Step 7: Use of a performance classifier operator to determine the

performance output.

Step 8: X-Validation operator averages the performance for all the iterations

inside the operator.
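
The stratified sampling mentioned in Step 4 can be illustrated with a short Python sketch. It is not RapidMiner's implementation: the round-robin assignment, the function name and the 300/700 class split over 1000 records are assumptions chosen only to show that every fold keeps (almost) the same class distribution as the full data set.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=10, seed=0):
    # Group record indices by class label.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Shuffle within each class, then deal the records out round-robin so
    # that every fold receives its proportional share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical class labels for 1000 patient records.
labels = ["risk"] * 300 + ["no_risk"] * 700
folds = stratified_folds(labels)
for number, fold in enumerate(folds, start=1):
    print(number, dict(Counter(labels[i] for i in fold)))
```

Each fold then serves once as the 10% test partition while the other nine folds form the 90% training partition, as in Steps 3 to 8.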

1.6 BIOLOGICAL LINK BETWEEN DIABETES AND HEART DISEASE

There is a strong link between diabetes and heart disease. Diabetes by itself is

now regarded as the strongest risk factor for heart disease. Diabetes is about blood

glucose control, and heart disease is about blood pressure and cholesterol control.

Both diseases have insulin resistance in common, which increases the chances of

developing type 2 diabetes and heart disease. Both type 1 diabetes and type 2

diabetes are independent risk factors for coronary heart disease (CHD). In fact, from the

point of view of cardiovascular medicine, it may be appropriate to say that diabetes is

a cardiovascular disease.

Mai Shouman et al. (2011) say heart disease has been the leading cause of death in

the world over the past 10 years. The European Public Health Alliance reports that

heart attacks, strokes and other circulatory diseases account for 41% of all deaths.

The Australian Bureau of Statistics reports that heart and circulatory system diseases

are the leading cause of death in Australia, causing 33.7% of all deaths. Motivated by the

worldwide increase in mortality of heart disease patients each year and the

availability of huge amounts of patient data from which to extract useful knowledge,

researchers have been using data mining techniques to help health care

professionals in the diagnosis of heart disease.

Muhamad Hariz et al. (2012) point out that diabetes is a metabolic disorder

in which the body cannot make proper use of carbohydrates and which is greatly affected by the

patient’s lifestyle. CHD is a serious disease that causes many deaths, especially in

China.

Researchers have found that high blood sugar (hyperglycemia) activates a

biological pathway that causes irregular heartbeats, a condition called cardiac

arrhythmia that triggers heart failure and sudden cardiac death.

(http://www.medicalnewstoday.com/articles/266891.php)

People who suffer from diabetes are two to four times more likely to

develop cardiovascular disease, compared to non-diabetics.

(http://www.world-heart-federation.org/cardiovascular-health/cardiovascular-disease-risk-factors/diabetes)

The American Heart Association says that around 65% of diabetics die from

heart disease or stroke, emphasizing the need for new research looking at links

between the conditions. There is also evidence that obesity, a sedentary

lifestyle and poor blood glucose control contribute to increased chances of high

blood pressure. Women prior to menopause usually have a lower risk of heart

disease than men of the same age. However, women of all ages with diabetes have

an increased risk of heart disease because diabetes cancels out the protective effects

of being a woman in her child-bearing years.

In fact, cardiovascular disease leading to heart attack or stroke is by far

the leading cause of death in both men and women with diabetes. Another major

component of cardiovascular disease is poor circulation in the legs, which

contributes to a greatly increased risk of foot ulcers and amputations.

Control of the ABCs of diabetes can reduce the risk of heart disease and stroke.

A stands for A1C, a test that measures blood glucose control by showing

the average blood glucose level over the past 3 months. B stands for blood pressure

and C stands for cholesterol. The best approach to

cardiovascular disease lies in its prevention. Weight control and smoking cessation

are two important lifestyle measures that have an impact on preventing heart

disease. In addition, good control of blood glucose levels and low-dose aspirin can

enhance these benefits. (http://diabetes.niddk.nih.gov/dm/pubs/stroke/)

1.7 STATEMENT OF THE PROBLEM

Discovery of new information in terms of patterns or rules from large

amounts of data is based on the machine learning technique. Disease prediction

plays an important role in data mining. Diagnosis of a disease requires the

performance of a number of tests on the patient. However, the use of data mining

techniques can reduce the number of tests. This reduced test set plays an important

role in saving time and improving performance. Diabetes data mining is important because it allows

doctors to see which features or attributes are more important for diagnosis such as

age, weight, etc. This will help the doctors diagnose diabetes more efficiently.

There are various data mining techniques in use in the healthcare industry, but

research is still needed on the performance of the various classification

techniques so that the best among them can be chosen.

The research presented in this thesis is intended to address the challenge of

improving the prediction model to predict the heart disease for diabetic patients and

providing timely response in predicting the disease. Briefly, the important research

questions are therefore stated as follows:

How can various data mining techniques be used in the health care industry, and

how do they perform in prediction?

How do classification techniques help in developing the prediction

model so as to predict accurately the risk of heart disease among diabetic

patients?

1.8 OBJECTIVES OF THE RESEARCH

Application of data mining in analyzing the medical data is a good method

for investigating the existing relationships between variables. Nowadays, data

stored in medical databases are growing at an increasingly rapid rate. It has been

widely recognized that medical data analysis can lead to an enhancement of

health care.

The primary objective of the research work is the effective development of

a prediction model using various classification techniques to predict heart disease

and to assess their performance in prediction. It also shows that data mining can be applied to the

medical databases to predict or classify the data with reasonable accuracy.

The following are the objectives leading to achievement of the primary

objective mentioned above:

To identify the best classification model which can help the physicians in

predicting the risk of heart disease using diabetic attributes.

To recognize and classify patterns in multivariate patient attributes.

To predict the future outcomes based on previous experiences and present

conditions.

To identify the patients at risk, with the aim of increasing quality of care

and reducing the cost of care.

To build a prediction model using appropriate classification techniques such

as naïve Bayes, decision trees and support vector machines.

1.9 ORGANIZATION OF THE THESIS

Chapter two describes the literature review on data mining, its major

predictive techniques, applications, survey of the comparative analysis by other

researchers and the criteria to be used for model comparison in this work.

Chapter three presents the methodology, the data sets and the Naïve Bayes,

support vector machine and decision tree based experimentation of the proposed

diagnosis system.

Chapter four gives the analysis of the experiments done by combining three

data mining techniques. The various heart disease risk prediction models are

created by categorizing the dataset based on certain attribute value pairs.

Chapter five summarizes the results and compares the performance of

the techniques on the data sets in terms of

accuracy, sensitivity, specificity and F-score.

Chapter six gives the conclusions and future enhancement.

The appendices, Appendix I to Appendix XIV, present

sample views of the data and their attributes, Naïve Bayes sample distributions,

the C4.5 decision tree algorithm, sample decision tree graph views and text views, and

RapidMiner screenshots of the main process for the overall NB, SVM and DT models.