
AN EFFECTIVE FEATURE SUBSET SELECTION BASED DATA CLASSIFICATION MODEL USING STOCHASTIC ANT COLONY OPTIMIZATION IN HEALTHCARE INDUSTRY

M. Arasakumar 1, Dr. P. Sudhakar 2

1 Assistant Professor, Department of Information Technology, Faculty of Engineering and Technology, Annamalai University
2 Associate Professor, Department of Computer Science and Engineering, Faculty of Engineering and Technology, Annamalai University

[email protected], [email protected]

ABSTRACT

At present, the growing medical sector generates a massive quantity of data related to patient demographics, medications, payments and insurance coverage, which has attracted physicians and academicians to a great extent. Several studies have been presented on various aspects of data mining applications in the medical sector. This paper proposes a feature selection (FS) based classification model for data mining in the healthcare industry. A set of four FS approaches is employed, namely ant colony optimization (ACO) based FS, genetic algorithm (GA) based FS, gain ratio and principal component analysis (PCA). Besides, to prevent the local optima problem of ACO, an improved version named the stochastic ACO algorithm is derived by incorporating periodic partial reinitialization of the population. The presented model is validated using three benchmark medical datasets, namely the chronic kidney disease (CKD), Indian Liver Patient (ILP) and Wisconsin Breast Cancer (WBC) datasets. An extensive comparative analysis verifies the superiority of the presented model on all the applied datasets.

Keywords: Healthcare, Data mining, Feature selection, Data classification

1. INTRODUCTION

At present, the healthcare domain is gaining importance in several aspects all over the globe [1]. With its growth, various issues arise, namely cost, inefficiency, poor quality and high complexity [2]. The expense of healthcare in the US rose by 123% between 2010 and 2015 [3]. Ineffective and non-value-added processes account for 21-47% of these massive expenses [4]. A study showed that around 200,000 patients die in the US because of medical errors [5]. Effective decision making using the available details can resolve these issues and offer better treatment to patients. Nowadays, healthcare sectors adopt information technology in their management systems [6]. A massive quantity of data gets gathered by these systems in a periodic manner. Analytics offers different models for extracting details from the complex and massive data and transforming them into the information needed to assist decision making in the healthcare sector. Over the past decade, several studies have


been presented on the development of data classification models for the diagnosis of diseases

utilizing the patient data.

FS is an important part of the data classification process and helps to attain a small collection of rules from the training set. Different models, such as machine learning algorithms and biologically inspired algorithms, have been utilized for feature selection.

In [7], a hybrid genetic algorithm with support vector machine (GA-SVM) method is proposed for efficient selection of attributes. The elimination of repeated attributes by the hybrid GA-SVM enhances the classification accuracy. The GA-SVM method operates on two levels. In the first level, the attributes are selected by evolutionary algorithms; in the next level, they are applied to an SVM to obtain a fitness measure for every attribute set. The obtained fitness values are applied to select the best set of attributes using the GA. The SVM and GA are easily integrated using a wrapper method. Finally, the hybrid method is enhanced by the use of a correlation measure among attributes in place of the fitness measure. It substitutes the weaker members of the population with newly created chromosomes, which produces good diversity and improves the total fitness of the population. The hybrid GA-SVM is tested against five datasets (iris, diabetes, breast cancer, heart disease (HD) and hepatitis) from the UCI repository. The outcomes indicate that the GA-SVM performs well and produces significantly better results.

[8] proposed a method to identify and classify CKD and non-CKD patients. The proposed method involves three steps: 1) a framework is created to mine the data, 2) a wrapper subset attribute evaluator and best-first search method are applied for attribute selection, and 3) three classifiers are used to classify the CKD and non-CKD patients. The attribute selector eliminates the unwanted attributes to reduce the size of the dataset. The attribute evaluator model achieves better selection by decreasing the number of attributes. The observations reveal that the accuracy is higher for the reduced dataset when compared to the original dataset. In [9], a new method is developed to enhance the diagnosis quality of CKD. The proposed method comprises three steps, namely feature selection, ensemble learning and classification. The original dataset includes 400 instances with 25 attributes, which are minimized using correlation-based feature selection (CFS). Classifiers such as k-nearest neighbor (kNN), SVM and naive Bayes (NB) were used as base classifiers, and AdaBoost is employed for ensemble learning to enhance the detection of CKD. The results show that the ensemble with the kNN base classifier achieves the highest accuracy of 0.981.

In [10], several data mining techniques such as SVM, decision tree (DT), NB and the kNN algorithm are used to investigate the CKD dataset gathered from the UCI repository to predict kidney disease. The performance of the classifiers is measured in terms of accuracy, root mean squared error, mean absolute error and the receiver operating characteristic curve. A ranking algorithm provides vital improvements in classification with a proper number of attributes. The experimental results show that the DT accomplishes the best results with an accuracy of 99%, while the SVM achieves an accuracy of 97.75%.


[11] proposed an artificial neural network (ANN) to increase the performance of heart disease identification in terms of cost, response time and accuracy. The approach utilizes the ANN to select essential features from the input layer of the network. An MLP-NN method is employed to select features from the Ischemic Heart Disease (IHD) dataset. Initially, there are 17 attributes for 712 patient records, and the number of attributes is minimized to 12 after feature selection. With 12 features, the prediction accuracy for the training and testing processes is 89.4% and 82.2% respectively. When the number of features is reduced further, the performance degrades, so the number of features is set to 12 for the IHD dataset.

[12] presented a method to identify CKD using two kinds of feature selection techniques, namely the wrapper and filter approaches. It is observed that a reduction in the number of features does not guarantee higher classification accuracy. For example, the accuracy of the SVM before and after attribute selection is found to be 98.5% and 98% respectively.

In [13], a multilayer perceptron (MLP) with back-propagation learning is integrated with a feature selection algorithm to predict HD. Information gain is used for feature selection to minimize the number of attributes. Initially, 13 attributes are included to classify HD, and the MLP is used for data classification. Without reducing the number of attributes, the accuracy on the training and validation datasets is 88.46% and 80.17% respectively. When feature selection is involved and the number of attributes is decreased from 13 to 8, the accuracy on the training and validation datasets is 89.56% and 80.99% respectively. Thus, the resultant accuracy is improved by 1.1% and 0.82% for the training and validation datasets.

Several studies have been presented on various aspects of data mining applications in the medical sector. This paper proposes a feature selection based classification model for data mining in the healthcare industry. A set of four FS approaches is employed, namely ACO based FS, GA based FS, gain ratio and PCA. Besides, to prevent the local optima problem of ACO, an improved version named the stochastic ACO algorithm is derived by incorporating periodic partial reinitialization of the population. The presented model is validated using three benchmark medical datasets, namely the CKD, ILP and WBC datasets, and an extensive comparative analysis verifies its superiority on all the applied datasets.

2. PROPOSED MODEL

The overall working operation of the proposed model is shown in Fig. 1. Initially, the medical data undergo pre-processing and an FS process to choose the needed features along with the removal of unwanted ones. Then, the stochastic ACO algorithm is applied to classify the medical dataset into the occurrence of disease or not.


2.1. FEATURE SELECTION

2.1.1. ACO Based feature selection (ACO-FS)

Given a feature set of size n, the FS problem is to identify a minimal feature subset of size s (s < n) while retaining a high accuracy in representing the original features. A partial solution does not imply any ordering among the selected features, and the next feature to be selected does not necessarily depend on the preceding feature added to the partial solution. Moreover, the solutions of the FS problem need not all be of the same size. The mapping of the FS problem onto the ACO technique involves the following stages (a minimal sketch combining them is given after the list):

Represent graph,

Heuristic desirability,

Update pheromone, and

Construct result.
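The sketch below is only illustrative: it assumes a uniform heuristic, a fixed subset size, and a hypothetical evaluate() callback returning the cost of a subset (for instance, 1 minus cross-validated accuracy); none of these choices are prescribed by the paper.

    # Minimal sketch of ACO-based feature selection under the stated assumptions.
    import random

    def aco_feature_selection(n_features, evaluate, n_ants=10, n_iters=20,
                              rho=0.1, subset_size=None):
        tau = [1.0] * n_features                 # initial pheromone per feature
        eta = [1.0] * n_features                 # heuristic desirability (uniform here)
        best_subset, best_cost = None, float("inf")
        k = subset_size or max(1, n_features // 2)
        for _ in range(n_iters):
            for _ in range(n_ants):
                # construct a solution: sample k distinct features by tau*eta weight
                weights = [t * e for t, e in zip(tau, eta)]
                subset = set()
                while len(subset) < k:
                    total = sum(w for i, w in enumerate(weights) if i not in subset)
                    r, acc = random.uniform(0, total), 0.0
                    for i, w in enumerate(weights):
                        if i in subset:
                            continue
                        acc += w
                        if acc >= r:
                            subset.add(i)
                            break
                cost = evaluate(sorted(subset))  # lower cost = better subset
                if cost < best_cost:
                    best_subset, best_cost = sorted(subset), cost
                # pheromone update: evaporate, then reinforce this ant's features
                for i in range(n_features):
                    tau[i] *= (1 - rho)
                for i in subset:
                    tau[i] += 1.0 / (1.0 + cost)
        return best_subset, best_cost

    # toy usage: the cost is lowest when features {0, 2, 4} are chosen
    target = {0, 2, 4}
    cost_fn = lambda s: len(target.symmetric_difference(s)) / 5.0
    print(aco_feature_selection(8, cost_fn, subset_size=3))

In practice, eta would encode a problem-dependent relevance score per feature rather than the uniform value used here.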

Fig. 1. Overall process of proposed work


2.1.2. GA Based FS

The essential EA for FS is the GA, a stochastic method for optimizing functions that mimics the mechanics of genetics and biological evolution. Usually, genes tend to evolve over successive generations to better adapt to the environment. The GA operates on a population of individuals to produce increasingly better approximations to the optimum. In each generation, a new population is created by selecting individuals according to their fitness in the problem domain and recombining them using operators borrowed from natural genetics. The offspring may also undergo mutation, which results in generations of individuals better adjusted to their environment than the individuals they were created from, just as in natural adaptation. Each individual in the population specifies a predictive model: the number of genes is the number of available features in the dataset, and genes are binary values that denote the inclusion or exclusion of particular features in the model. The population size, to be chosen for every application, is set to 10 by default (N = 10). A minimal sketch of this procedure is given below.
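The sketch uses binary inclusion/exclusion masks, tournament selection, one-point crossover and bit-flip mutation; the fitness callback and the toy scoring pattern are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of GA-based feature selection with binary masks.
    import random

    def ga_feature_selection(n_features, fitness, pop_size=10, n_gens=20,
                             cx_rate=0.8, mut_rate=0.05):
        # individuals are binary masks: 1 = feature included, 0 = excluded
        pop = [[random.randint(0, 1) for _ in range(n_features)]
               for _ in range(pop_size)]
        def tournament():
            a, b = random.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        for _ in range(n_gens):
            new_pop = [max(pop, key=fitness)]          # elitism: keep the best mask
            while len(new_pop) < pop_size:
                p1, p2 = tournament(), tournament()
                if random.random() < cx_rate:          # one-point crossover
                    cut = random.randrange(1, n_features)
                    child = p1[:cut] + p2[cut:]
                else:
                    child = p1[:]
                # bit-flip mutation on each gene
                child = [1 - g if random.random() < mut_rate else g for g in child]
                new_pop.append(child)
            pop = new_pop
        return max(pop, key=fitness)

    # toy usage: reward masks that match a hypothetical "useful" feature pattern
    useful = [1, 0, 1, 0, 1, 0, 1, 0]
    score = lambda mask: sum(1 for m, u in zip(mask, useful) if m == u)
    print(ga_feature_selection(8, score))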

2.1.3. PCA Based FS

The PCA-based FS approach to machine condition monitoring was based on the insight that the amplitude of the vibration signals of imperfect machine components increases as the severity of the fault increases.

Commonly, the PCA method transforms n vectors (p_1, p_2, ..., p_u, ..., p_n) from a d-dimensional space to n vectors (p'_1, p'_2, ..., p'_u, ..., p'_n) in a new d'-dimensional space as

p'_u = \sum_{z=1}^{d'} a_{z,u} e_z, \quad d' \le d \qquad (1)

where e_z are the eigenvectors corresponding to the d' leading eigenvalues of the scatter matrix S, and a_{z,u} is the projection of the actual vector p_u onto the eigenvector e_z. These projections are known as the principal components of the actual data set. Both d and d' are positive integers, and the dimension d' cannot be larger than d. The d-by-d scatter matrix S of the actual data set (p_1, p_2, ..., p_u, ..., p_n) is described as

S = E[p_u p_u^T], \quad u = 1 \text{ to } n \qquad (2)

where E[p_u p_u^T] is the statistical expectation operator applied to the outer product of p_u and its transpose. The representation in (1) minimizes the error between the actual and transformed vectors. This is shown by considering the variance of the principal components a_{z,u}, given as

\sigma^2(e_z) = E[a_{z,u}^2] = e_z^T S e_z \qquad (3)


where e_z denotes the d-by-1 vector e_z = [e_{1,z} e_{2,z} ... e_{d,z}]^T. It is apparent that the variance of the principal components is a function of the magnitude of the components of the vectors e_z. At a local maximum or minimum of the variance function in (3), the following relation holds:

\sigma^2(e_z + \delta e_z) = \sigma^2(e_z) \qquad (4)

Equation (4) is satisfied when

(\delta e_z)^T S e_z - \lambda (\delta e_z)^T e_z = 0 \qquad (5)

where \lambda is a scaling factor. This leads to

S e_z = \lambda e_z \qquad (6)

This can be identified as an eigenvalue problem with non-trivial solutions only when \lambda is an eigenvalue of the scatter matrix S. So the corresponding vectors e_z (z = 1 to d') are the eigenvectors. If the condition d' < d is fulfilled, then the above representation also reduces the dimensionality of the vectors.

Since the features transformed by the principal components are not directly associated with the physical nature of the fault, the fault classification method proposed in this study was based on the original features themselves. The eigenvectors of the transformed data were only utilized for selecting the most receptive features from the original feature set.
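A minimal sketch of this PCA-guided selection is given below: it builds the scatter matrix of Eq. (2), takes the eigenvectors of the d' leading eigenvalues, and ranks the original features by their absolute loadings on those eigenvectors. The exact ranking rule (summing absolute loadings) is an assumption, since the text does not specify how the eigenvectors are turned into a feature ranking.

    import numpy as np

    def pca_feature_ranking(X, d_prime):
        # scatter matrix S = E[p_u p_u^T] of the (mean-centered) data, Eq. (2)
        Xc = X - X.mean(axis=0)
        S = (Xc.T @ Xc) / Xc.shape[0]
        vals, vecs = np.linalg.eigh(S)           # eigh: S is symmetric
        order = np.argsort(vals)[::-1]           # leading eigenvalues first
        leading = vecs[:, order[:d_prime]]       # e_z for z = 1..d'
        # rank original features by total loading on the leading eigenvectors
        loading = np.abs(leading).sum(axis=1)
        return np.argsort(loading)[::-1]         # most "receptive" features first

    # toy usage: feature 0 carries most of the variance, so it should rank first
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) * np.array([5.0, 1.0, 0.5, 0.5, 0.1])
    print(pca_feature_ranking(X, d_prime=2))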

2.1.4. Gain Ratio Based FS

GR is a modification of information gain (IG) that reduces its bias. GR takes the number and size of branches into account when selecting an attribute: it corrects the IG by taking the intrinsic information of a split into account. Intrinsic information is the entropy of the distribution of instances into branches (i.e. how much information is needed to tell which branch an instance belongs to). The value of an attribute decreases as its intrinsic information gets larger.

Gain Ratio(Attribute) = Gain(Attribute) / Intrinsic_info(Attribute) \qquad (7)

A ranking process is then utilized to choose the subset of required attributes from the full attribute set of the actual dataset.
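The gain ratio of Eq. (7) can be computed directly from an attribute column and the class labels; a minimal sketch under the assumption of discrete attribute values follows.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(attr_values, labels):
        n = len(labels)
        base = entropy(labels)
        cond, intrinsic = 0.0, 0.0
        for v in set(attr_values):
            subset = [l for a, l in zip(attr_values, labels) if a == v]
            p = len(subset) / n
            cond += p * entropy(subset)          # conditional entropy of the split
            intrinsic -= p * math.log2(p)        # intrinsic information of the split
        gain = base - cond                       # information gain
        return gain / intrinsic if intrinsic > 0 else 0.0

    # toy usage: an attribute that splits the classes cleanly has a high gain ratio
    attr = ["low", "low", "high", "high"]
    cls = ["ckd", "ckd", "notckd", "notckd"]
    print(gain_ratio(attr, cls))   # 1.0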

2.2. STOCHASTIC ACO FOR DATA CLASSIFICATION

Here, the ACO model is applied to extract classifier rules from the data, based on the behavior of ants and data mining approaches. The intention of this model is to allocate every individual instance to a class label from the available classes by the use of specific parameters. In general, in the classification process, the discovered knowledge can be defined in the form of if-then rules as mentioned below.


IF <conditions> THEN <class> \qquad (8)

The rule antecedent (IF part) holds a set of criteria which are normally connected by an AND operator. The rule consequent (THEN part) defines the class predicted for cases whose predictor variables satisfy every term of the rule antecedent. The application of the ACO method for disease prediction comprises the following processes:

Represent the structure

Create rules

Heuristic function

Prune rules

Pheromone update

Use explored rules

2.2.1. Represent the structure

The structure of the ACO model is represented in Fig. 2. A variable is defined by variable_k, where k is the index of the variable, and Val_{kl} defines the l-th discrete value of that variable. The consequent part follows the variables and is defined by the class C_z, where z is the index within the class. An ant begins its traversal from the source and selects a value for every variable. Once it completes the traversal of all the parameters, it chooses a value for the class. As depicted in Fig. 2, the elected route is indicated by solid lines: Val_{1,2}, Val_{2,1}, Val_{3,3}, C_3, destination. To discover the rules, a certain number of ants should traverse individual paths as defined in the following subsections.

Fig. 2. Structural representation of ACO method

2.2.2. Create rules

For the exploration of the sequence of classifier rules, a sequential covering technique is applied. Initially, the collection of explored rules is empty and the training set holds all the training cases. As classifier rules are explored at each round, they are moved to the classifier rule list and the cases they cover are discarded from the training set. The exploration of rules is carried out while the below mentioned conditions are fulfilled.


i. A term is added to the rule only while the rule still covers at least a fixed number of cases, known as minimum_cases_per_rule.

ii. When every ant has exploited the available parameters, the rule creation procedure stops. The ants utilize a probability function (Pro_{uv}), as provided in Eq. (9), for selecting a variable value during rule creation.

Pro_{uv} = \frac{\eta_{uv} \cdot \tau_{uv}(l)}{\sum_{u=1}^{a} \sum_{v=1}^{b_u} \eta_{uv} \cdot \tau_{uv}(l)} \qquad (9)

where \eta_{uv} is the problem-dependent heuristic function and \tau_{uv} defines the quantity of pheromone.
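A minimal sketch of sampling a term with the probability of Eq. (9) is shown below; eta and tau are assumed to be nested lists indexed by attribute and value, and available lists the (attribute, value) terms not yet used in the rule.

    import random

    def choose_term(eta, tau, available):
        # Eq. (9): P_uv is proportional to eta_uv * tau_uv over available terms
        weights = {(u, v): eta[u][v] * tau[u][v] for (u, v) in available}
        total = sum(weights.values())
        r, acc = random.uniform(0, total), 0.0
        for term, w in weights.items():
            acc += w
            if acc >= r:
                return term

    # toy usage: 2 attributes, each with 2 values; attribute 0 / value 1 dominates
    eta = [[0.1, 0.9], [0.5, 0.5]]
    tau = [[0.25, 0.25], [0.25, 0.25]]
    print(choose_term(eta, tau, available=[(0, 0), (0, 1), (1, 0), (1, 1)]))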

2.2.3. Heuristic function

For each term_{uv} that can be appended to the current rule, the ACO model determines a value \eta_{uv} indicating the term quality, based on its capability of enhancing the predictive accuracy of the rule. In particular, the value of \eta_{uv} for term_{uv} involves a measure of the entropy (or amount of information) linked with the term. For each term_{uv}, the entropy is determined by

H(W \mid A_u = V_{uv}) = -\sum_{w=1}^{z} P(w \mid A_u = V_{uv}) \cdot \log_2 P(w \mid A_u = V_{uv}) \qquad (10)

where W denotes the class variable, z represents the class count, and P(w \mid A_u = V_{uv}) is the empirical likelihood of observing class w conditional on having observed A_u = V_{uv}.
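The entropy of Eq. (10) can be estimated from the training cases covered by a term; a minimal sketch follows, with the row format (a list of dictionaries with a "class" key) and the toy albumin attribute being illustrative assumptions.

    import math
    from collections import Counter

    def term_entropy(rows, attr, value, class_attr="class"):
        # H(W | A_u = V_uv) over the training rows covered by the term, Eq. (10)
        covered = [r[class_attr] for r in rows if r[attr] == value]
        n = len(covered)
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log2(c / n) for c in Counter(covered).values())

    # toy usage on hypothetical CKD-style records
    rows = [{"albumin": "high", "class": "ckd"},
            {"albumin": "high", "class": "ckd"},
            {"albumin": "low", "class": "notckd"},
            {"albumin": "low", "class": "ckd"}]
    print(term_entropy(rows, "albumin", "high"))  # 0.0 -> highly informative term
    print(term_entropy(rows, "albumin", "low"))   # 1.0 -> uninformative term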

2.2.4. Prune Rules

Rule pruning intends to discard irrelevant terms from the rule. It mainly enhances the predictive ability of the rule and helps to resolve the issue of overfitting the training data. It also improves the simplicity of the rules, since a short rule is easier to interpret than a long one. Once a rule is generated by an ant, the pruning task begins: it eliminates the unwanted terms generated by the ants, increasing the rule quality. The rule quality Q lies in the range 0 \le Q \le 1 and is given by Eq. (11):

Q = \frac{TP}{TP + FN} \cdot \frac{TN}{FP + TN} \qquad (11)

where TP is the number of true positives, TN true negatives, FN false negatives and FP false positives.
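Eq. (11) reduces to a one-line computation, shown here with hypothetical confusion-matrix counts:

    def rule_quality(tp, fn, tn, fp):
        # Q = sensitivity * specificity, Eq. (11); lies in [0, 1]
        return (tp / (tp + fn)) * (tn / (fp + tn))

    print(rule_quality(tp=45, fn=5, tn=40, fp=10))  # 0.9 * 0.8 = 0.72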

2.2.5. Update pheromones

Pheromone update models the volatile nature of the pheromone laid by ants in real life. Due to the positive feedback procedure, errors in the heuristic parameter can be resolved, leading to an enhanced classifier outcome. The ants make use of this process to explore simple and effective classifier rules. At the beginning, all trails are provided with an identical quantity of pheromone, as defined by

\tau_{uv}(l = 0) = \frac{1}{\sum_{u=1}^{a} b_u} \qquad (12)

where a denotes the attribute count and b_u is the number of possible values of attribute u. The quantity of pheromone is then updated, since the ants lay pheromone while exploring the paths. On the other hand, the volatility of the pheromone has to be modeled as well. As a result, an iterative procedure takes place using Eq. (13).


\tau_{uv}(l) = (1 - \rho)\,\tau_{uv}(l-1) + \left(1 - \frac{1}{1 + Q}\right)\tau_{uv}(l-1) \qquad (13)

where \rho is the pheromone evaporation rate, Q is the rule quality as given in Eq. (11) and l is the iteration index. On the other hand, nodes which are not utilized by the current rule undergo only pheromone evaporation, represented as

\tau_{uv}(l) = \frac{\tau_{uv}(l-1)}{\sum_{u=1}^{a} \sum_{v=1}^{b_u} \tau_{uv}(l-1)} \qquad (14)

where a is the number of attributes, b_u is the number of values of attribute u and l represents the iteration index. It indicates that the quantity of pheromone of the unexplored nodes gets minimized as time increases.
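A minimal sketch of Eqs. (12)-(14) is given below. The exact bookkeeping (reinforcing used nodes per Eq. (13), then renormalizing all nodes so unused nodes lose pheromone share, as in Eq. (14)) follows one common reading of the equations and is an assumption.

    def init_pheromone(b):
        # Eq. (12): equal initial pheromone 1 / sum_u b_u on every (attribute, value) node
        total = sum(b)
        return [[1.0 / total] * b_u for b_u in b]

    def update_pheromone(tau, used, rho, Q):
        # Eq. (13) for the nodes on the current rule's path, then a
        # normalization so unused nodes effectively evaporate (Eq. (14))
        for (u, v) in used:
            tau[u][v] = (1 - rho) * tau[u][v] + (1 - 1.0 / (1 + Q)) * tau[u][v]
        total = sum(sum(row) for row in tau)
        return [[t / total for t in row] for row in tau]

    # toy usage: 3 attributes with 2, 3 and 2 discrete values
    tau = init_pheromone([2, 3, 2])
    tau = update_pheromone(tau, used=[(0, 1), (1, 0), (2, 1)], rho=0.15, Q=0.72)
    print(tau[0][1] > tau[0][0])   # True: the used node gained pheromone share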

2.2.6. Use explored rules

For the classification of fresh testing instances, the explored rules are applied in the order of their exploration, as they are saved in an ordered list. The first rule that covers the testing instance fires, and the instance is allocated to the class predicted by that rule. In a few cases, no rule in the list covers the testing instance; such instances are classified using a default rule that predicts the majority class among the uncovered training cases.
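A minimal sketch of applying the ordered rule list with a default class follows; the rule representation (a list of (conditions, label) pairs) is an assumption.

    def classify(instance, rules, default_class):
        # apply discovered rules in order; the first covering rule fires,
        # otherwise fall back to the default rule's majority class
        for conditions, label in rules:
            if all(instance.get(a) == v for a, v in conditions):
                return label
        return default_class

    rules = [([("albumin", "high")], "ckd"), ([("albumin", "low")], "notckd")]
    print(classify({"albumin": "high"}, rules, default_class="notckd"))  # ckd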

2.2.7. Stochastic process

Though the ACO algorithm provides several benefits such as parallelism, self-learning and efficient information feedback, it shows poor convergence due to the absence of information at the earlier stages. To resolve this issue, a new stochastic ACO algorithm is presented by incorporating periodic partial reinitialization of the population into the ACO to improve its total efficiency. The global convergence of the stochastic ACO algorithm is ensured by the periodic restart process, which helps to avoid premature convergence. Once the learning agents get trapped in a local optimum, they learn without observations, that is, with complete randomness, through a stochastic searching process. This assists in improving the global ergodicity of the knowledge library and eliminates premature convergence. Under the population reinitialization strategy, the average fitness oscillates between a very low value just after a reinitialization and a maximum value when convergence takes place. The maximum fitness advances in steps between the reinitializations, which correspond to better solutions evolving from the recombination of the best individuals with the randomly generated ones.
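A minimal sketch of the periodic partial reinitialization step is given below; the restart period, the fraction of elite individuals kept, and the bit-vector individuals are illustrative assumptions.

    import random

    def maybe_reinitialize(population, random_individual, period, iteration,
                           keep_fraction=0.2, fitness=None):
        # periodic partial reinitialization: every `period` iterations, keep the
        # best `keep_fraction` of the population and replace the rest with
        # randomly generated individuals to restore diversity
        if iteration % period != 0:
            return population
        ranked = sorted(population, key=fitness, reverse=True)
        n_keep = max(1, int(keep_fraction * len(population)))
        n_new = len(population) - n_keep
        return ranked[:n_keep] + [random_individual() for _ in range(n_new)]

    # toy usage with 5-bit individuals scored by their bit sum
    rand_ind = lambda: [random.randint(0, 1) for _ in range(5)]
    pop = [rand_ind() for _ in range(10)]
    pop = maybe_reinitialize(pop, rand_ind, period=5, iteration=5, fitness=sum)
    print(len(pop), pop[0])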

3. PERFORMANCE VALIDATION

Table 1 provides a brief description of the datasets employed in this study. For experimentation, three datasets, namely the CKD [14], ILP [15] and WBC [16] datasets, are employed. The CKD dataset contains a total of 400 instances with 24 attributes under two classes; 250 instances come under class 1 and the remaining 150 instances fall under class 2. The ILP dataset contains a total of 583 instances with 10 attributes under two classes.


Table 1 Dataset Description

Dataset   Source   # of instances   # of attributes   # of classes   Class 1/Class 2
CKD       UCI      400              24                2              250/150
ILP       UCI      583              10                2              416/167
WBC       UCI      699              10                2              458/241

Table 2 Dataset Description of CKD Dataset

Table 3 Dataset Description of ILP Dataset


Table 4 Dataset Description of WBC Dataset

In the ILP dataset, 416 instances come under class 1 and the remaining 167 instances fall under class 2. The WBC dataset contains a total of 699 instances with 10 attributes under two classes; 458 instances come under class 1 and the remaining 241 instances fall under class 2. Tables 2-4 provide descriptions of the attributes present in each dataset.

3.1. Results analysis under CKD dataset

Table 5 provides a comparison of the results offered by different FS models on the applied CKD dataset. The table values indicate that the gain ratio model selects a total of 10 features with a best cost of 0.104, which is significantly higher than the other methods. Similarly, the PCA model also selects a total of 10 features, with a best cost of 0.041. Though the best cost of PCA is superior to that of the gain ratio, it does not outperform ACO-FS and GA-FS. Next, GA-FS shows slightly better performance than the other methods except ACO-FS by attaining a moderate best cost of 0.0187653 with the selection of 9 features. However, the ACO-FS model offers superior results by attaining the least best cost of 0.0084956 with the selection of 16 features. These values indicate the effective FS performance of the presented ACO-FS model on the applied CKD dataset.

Table 5 Comparative analysis of state-of-the-art methods with the proposed method for the CKD dataset

Methods      Best Cost    Selected Features
ACO-FS       0.0084956    15, 4, 1, 12, 18, 10, 13, 16, 2, 6, 7, 23, 11, 22, 20, 8
GA-FS        0.0187653    13, 4, 8, 9, 10, 12, 11, 22, 20
PCA          0.0410000    1, 5, 6, 4, 8, 9, 10, 12, 15, 17
Gain Ratio   0.1040000    2, 5, 4, 3, 8, 7, 10, 14, 21, 24

Fig. 3. Best cost analysis of presented model on CKD dataset


Table 6 Selected Features of CKD using ACO

(The table records, for each of the 24 CKD attributes (Age, Blood Pressure, Specific Gravity, Albumin, Sugar, Red Blood Cells, Pus Cell, Pus Cell Clumps, Bacteria, Blood Glucose Random, Blood Urea, Serum Creatinine, Sodium, Potassium, Haemoglobin, Packed Cell Volume, White Blood Cell Count, Red Blood Cell Count, Hypertension, Diabetes Mellitus, Coronary Artery Disease, Appetite, Pedal Edema, Anaemia), whether it was selected in each of the 20 iterations I-1 to I-20.)


Fig. 3 shows the best cost analysis of the ACO-FS model on the applied CKD dataset. The set of features chosen by the ACO-FS model over the 20 iterations is provided in Table 6.

Table 7 and Fig. 4 provide a comparison of the results offered by the presented model under several measures on the CKD dataset. It is noted that the RF classifier offers the lowest performance with a sensitivity of 95.14, specificity of 90.17, accuracy of 93.86, F-score of 94.19, MCC of 0.87 and kappa value of 89.92. It is also depicted that the DT classifier manages well with a sensitivity of 96.45, specificity of 91.43, accuracy of 95.18, F-score of 95.23, MCC of 0.89 and kappa value of 91.46. Similarly, the LR model offers a moderate classifier outcome with a sensitivity of 98.24, specificity of 92.96, accuracy of 95.89, F-score of 96.32, MCC of 0.91 and kappa value of 91.84. Likewise, the RBFNetwork model shows a competitive result against the stochastic ACO algorithm by attaining a sensitivity of 98.35, specificity of 92.99, accuracy of 96.25, F-score of 96.95, MCC of 0.92 and kappa value of 92.07. Finally, the stochastic ACO model shows a superior outcome with the maximum sensitivity of 98.81, specificity of 93.16, accuracy of 96.84, F-score of 97.13, MCC of 0.93 and kappa value of 92.14.

Table 7 Performance of CKD using Proposed method with various classifiers

Classifier Sensitivity Specificity Accuracy F-score MCC Kappa

Stochastic-ACO 98.81 93.16 96.84 97.13 0.93 92.14

RBFNetwork 98.35 92.99 96.25 96.95 0.92 92.07

LR 98.24 92.96 95.89 96.32 0.91 91.84

DT 96.45 91.43 95.18 95.23 0.89 91.46

RF 95.14 90.17 93.86 94.19 0.87 89.92

Fig. 4. Classifier results analysis of diverse models on CKD dataset


For further verification of the presented model on the applied CKD dataset, a comparison is made with recent methods in terms of accuracy, as shown in Table 8 and Fig. 5. As illustrated in the figure, the MLP offers the worst classification by attaining a minimum accuracy of 51.50. The SVM model shows somewhat better results than the MLP with a slightly higher accuracy of 60.70. The NN model exhibits a manageable outcome over the SVM and MLP models by attaining an accuracy of 87.00. The PNN model shows an effective outcome by offering a near-optimal accuracy of 96.70. At last, the stochastic ACO model offers optimal results by attaining a maximum accuracy of 96.84. These values prove that the stochastic model is an effective tool for the classification of the medical CKD dataset.

Table 8 Performance of stochastic ACO with recent methods on CKD dataset

Classifier Accuracy

Stochastic-ACO 96.84

PNN 96.70

SVM 60.70

NN 87.00

MLP 51.50

Fig. 5. Accuracy analysis of different recent methods on CKD dataset

3.2. Results analysis under ILP dataset

Table 9 provides a comparison of the results offered by different FS models on the applied ILP dataset. The table values indicate that the gain ratio model selects a total of 6 features with a best cost of 0.17382, while the GA-FS model attains the highest best cost of 0.19837 with the selection of 6 features. The PCA model selects a total of 7 features and reports the lowest best cost value of 0.03850. The ACO-FS model attains a best cost of 0.16279 with the selection of 8 features, outperforming both GA-FS and the gain ratio on the applied ILP dataset.

Table 9 Comparative analysis of state-of-the-art methods with the proposed method for the ILP dataset

Methods      Best Cost   Selected Features
ACO-FS       0.16279     9, 7, 10, 5, 8, 1, 3, 6
GA-FS        0.19837     8, 3, 7, 2, 6, 9
PCA          0.03850     1, 2, 3, 4, 5, 6, 7
Gain Ratio   0.17382     1, 5, 4, 3, 6, 8


Table 10 Selected Features of ILP using ACO

(The table records, for each of the 10 ILP attributes (Age of Patient, Gender of Patient, Total Bilirubin (TB), Direct Bilirubin (DB), Alkphos, SGPT Alamine, SGOT Aspartate, Total Proteins (TP), ALB Albumin, A/G Ratio), whether it was selected in each of the 20 iterations I-1 to I-20.)


Fig. 6. Best cost analysis of presented model on ILP dataset

Fig. 6 shows the best cost analysis of the ACO-FS model on the applied ILP dataset. The set of features chosen by the ACO-FS model over the 20 iterations is provided in Table 10.

Table 11 and Fig. 7 provide a comparison of the results offered by the presented model under several measures on the ILP dataset. It is noted that the RBFNetwork classifier offers the lowest performance with a sensitivity of 71.35, specificity of 0, accuracy of 71.35, F-score of 83.28, MCC of 0 and kappa value of 0. It is also depicted that the DT classifier manages with a sensitivity of 75.65, specificity of 44.09, accuracy of 68.78, F-score of 79.13, MCC of 0.18 and kappa value of 17.74. Similarly, the LR model offers a moderate classifier outcome with a sensitivity of 75.91, specificity of 53.93, accuracy of 72.55, F-score of 82.41, MCC of 0.23 and kappa value of 21.95. Likewise, the RF model shows a competitive result by attaining a sensitivity of 76.33, specificity of 49.12, accuracy of 71.01, F-score of 80.90, MCC of 0.22 and kappa value of 21.64. Finally, the stochastic ACO model shows a superior outcome with the maximum sensitivity of 92.19, specificity of 83.75, accuracy of 89.88, F-score of 92.97, MCC of 0.74 and kappa value of 74.93.

Table 11 Performance on the ILP dataset using the proposed method with various classifiers

Classifier       Sensitivity   Specificity   Accuracy   F-score   MCC    Kappa
Stochastic-ACO   92.19         83.75         89.88      92.97     0.74   74.93
RBFNetwork       71.35         0             71.35      83.28     0      0
LR               75.91         53.93         72.55      82.41     0.23   21.95
DT               75.65         44.09         68.78      79.13     0.18   17.74
RF               76.33         49.12         71.01      80.90     0.22   21.64

Fig. 7. Classifier results analysis of diverse models on ILP dataset

For further verification of the presented model on the applied ILP dataset, a comparison is made with recent methods in terms of accuracy, as shown in Table 12 and Fig. 8. As illustrated in the figure, the Bayesian Network offers the worst classification by attaining a minimum accuracy of 66.09, closely followed by the NBTree with 67.01. The KStar model shows somewhat better results with an accuracy of 73.07, and the SVM model exhibits a manageable outcome over the KStar and NBTree models by attaining an accuracy of 75.10. At last, the stochastic ACO model offers optimal results by attaining a maximum accuracy of 89.88. These values prove that the stochastic model is an effective tool for the classification of the medical ILP dataset.

Table 12 Performance of stochastic ACO compared with recent methods on the ILP dataset

Classifier         Accuracy
Stochastic-ACO     89.88
KStar              73.07
NBTree             67.01
SVM                75.10
Bayesian Network   66.09

Fig. 8. Accuracy analysis of different recent methods on ILP dataset

3.3. Results analysis under WBC dataset

Table 13 provides a comparison of the results offered by different FS models on the applied WBC dataset. The table values indicate that the gain ratio model selects a total of 6 features with a best cost of 0.128392, which is significantly higher than the other methods. Similarly, the PCA model selects a total of 7 features with a best cost of 0.099000. Though the best cost of PCA is superior to that of the gain ratio, it does not outperform ACO-FS and GA-FS. Next, GA-FS shows slightly better performance than the other methods except ACO-FS by attaining a moderate best cost of 0.098372 with the selection of 5 features. However, the ACO-FS model offers superior results by attaining the least best cost of 0.068607 with the selection of 6 features. These values indicate the effective FS performance of the presented ACO-FS model on the applied WBC dataset.

Table 13 Comparative analysis of state-of-the-art methods with the proposed method for the WBC dataset

Methods      Best Cost   Selected Features
ACO-FS       0.068607    7, 8, 1, 6, 2, 4
GA-FS        0.098372    1, 3, 2, 7, 6
PCA          0.099000    1, 2, 3, 4, 5, 6, 7
Gain Ratio   0.128392    9, 3, 2, 1, 6, 8


Fig. 9. Best cost analysis of presented model on WBC dataset

Fig. 9 shows the best cost analysis of the ACO-FS model on the applied WBC dataset. The set of features chosen by the ACO-FS model over the 20 iterations is provided in Table 14.


Table 14 Selected Features of WBC using ACO

(The table records, for each of the 10 WBC attributes (Sample ID Number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses), whether it was selected in each of the 20 iterations I-1 to I-20.)


Table 15 and Fig. 10 provide a comparison of the results offered by the presented model under several measures on the WBC dataset. It is noted that the DT classifier offers the lowest performance with a sensitivity of 96.05, specificity of 91.76, accuracy of 94.56, F-score of 95.84, MCC of 0.87 and kappa value of 87.99. It is also depicted that the LR classifier manages well with a sensitivity of 97.37, specificity of 95.02, accuracy of 96.56, F-score of 97.37, MCC of 0.92 and kappa value of 92.40. Similarly, the RBFNetwork model offers a moderate classifier outcome with a sensitivity of 98.20, specificity of 91.73, accuracy of 95.85, F-score of 96.79, MCC of 0.91 and kappa value of 90.95. Likewise, the RF model shows a competitive result against the stochastic ACO algorithm by attaining a sensitivity of 98.23, specificity of 94.33, accuracy of 96.85, F-score of 97.58, MCC of 0.93 and kappa value of 93.07. Finally, the stochastic ACO model shows a superior outcome with a sensitivity of 98.02, the maximum specificity of 95.47, the maximum accuracy of 97.14, F-score of 97.81, MCC of 0.93 and kappa value of 93.67.

Table 15 Performance of WBC using Proposed method with various classifiers

Classifier Sensitivity Specificity Accuracy F-score MCC Kappa

Stochastic-ACO 98.02 95.47 97.14 97.81 0.93 93.67

RBFNetwork 98.20 91.73 95.85 96.79 0.91 90.95

LR 97.37 95.02 96.56 97.37 0.92 92.40

DT 96.05 91.76 94.56 95.84 0.87 87.99

RF 98.23 94.33 96.85 97.58 0.93 93.07


Fig. 10. Classifier results analysis of diverse models on WBC dataset

For further verification of the presented model on the applied WBC dataset, a comparison is made with recent methods in terms of accuracy, as shown in Table 16 and Fig. 11. As illustrated in the figure, the BP offers the worst classification by attaining a minimum accuracy of 94.40. The SVM model shows somewhat better results than the BP with a slightly higher accuracy of 94.50. The GAW+BP and GAW+CSSSVM models exhibit a manageable outcome over the SVM and BP models by attaining an identical accuracy of 95.00. The IGSAGAW+CSSSVM and IGSAGAW+BP models show effective outcomes by offering near-optimal accuracies of 95.80 and 96.30 respectively. At last, the stochastic ACO model offers optimal results by attaining a maximum accuracy of 97.14. These values prove that the stochastic model is an effective tool for the classification of the medical WBC dataset.

Table 16 Performance of stochastic ACO with recent methods on WBC dataset

Classifier Accuracy

Stochastic-ACO 97.14

SVM 94.50

GAW+CSSSVM 95.00

IGSAGAW+CSSSVM 95.80

BP 94.40

GAW+BP 95.00

IGSAGAW+BP 96.30


Fig. 11. Accuracy analysis of different recent methods on WBC dataset

4. CONCLUSION

This paper has presented an optimal FS based classification model for data mining in the healthcare industry. Several studies have employed the ACO algorithm for data classification. Though the ACO algorithm provides several benefits such as parallelism, self-learning, and efficient information feedback, it shows poor convergence due to the absence of information at the earlier stages. To resolve this issue, a new stochastic ACO algorithm is presented by incorporating periodic partial reinitialization of the population into the ACO to improve its total efficiency. A detailed experimentation takes place using three benchmark datasets, namely the CKD, ILP and WBC datasets. The experimental results clearly showcase the superior performance of the proposed method over the compared methods.

References

[1] Koh, H.C. and Tan, G., 2011. Data mining applications in healthcare. Journal of Healthcare Information Management, 19(2), p.65.

[2] Tomar, D. and Agarwal, S., 2013. A survey on data mining approaches for healthcare. International Journal of Bio-Science and Bio-Technology, 5(5), pp.241-266.

[3] Yoo, I., Alafaireet, P., Marinov, M., Pena-Hernandez, K., Gopidi, R., Chang, J.F. and Hua, L., 2012. Data mining in healthcare and biomedicine: a survey of the literature. Journal of Medical Systems, 36(4), pp.2431-2448.

[4] Srinivas, K., Rani, B.K. and Govrdhan, A., 2010. Applications of data mining techniques in healthcare and prediction of heart attacks. International Journal on Computer Science and Engineering (IJCSE), 2(02), pp.250-255.

[5] Herland, M., Khoshgoftaar, T.M. and Wald, R., 2014. A review of data mining using big data in health informatics. Journal of Big Data, 1(1), pp.1-35.

[6] Banaee, H., Ahmed, M.U. and Loutfi, A., 2013. Data mining for wearable sensors in health monitoring systems: a review of recent trends and challenges. Sensors, 13(12), pp.17472-17500.

[7] Tan, K.C., Teoh, E.J., Yu, Q. and Goh, K.C., 2009. A hybrid evolutionary algorithm for attribute selection in data mining. Expert Systems with Applications, 36(4), pp.8616-8630.

[8] Chetty, N., Vaisla, K.S. and Sudarsan, S.D., 2015, December. Role of attributes selection in classification of chronic kidney disease patients. In Computing, Communication and Security (ICCCS), 2015 International Conference on (pp. 1-6). IEEE.

[9] Wibawa, M.S., Maysanjaya, I.M.D. and Putra, I.M.A.W., 2017, August. Boosted classifier and features selection for enhancing chronic kidney disease diagnosis. In Cyber and IT Service Management (CITSM), 2017 5th International Conference on (pp. 1-6). IEEE.

[10] Tazin, N., Sabab, S.A. and Chowdhury, M.T., 2016, December. Diagnosis of chronic kidney disease using effective classification and feature selection technique. In Medical Engineering, Health Informatics and Technology (MediTec), 2016 International Conference on (pp. 1-6). IEEE.

[11] Rajeswari, K., Vaithiyanathan, V. and Neelakantan, T.R., 2012. Feature selection in ischemic heart disease identification using feed forward neural networks. Procedia Engineering, 41, pp.1818-1823.

[12] Polat, H., Mehr, H.D. and Cetin, A., 2017. Diagnosis of chronic kidney disease based on support vector machine by feature selection methods. Journal of Medical Systems, 41(4), 55.

[13] Khemphila, A. and Boonjing, V., 2011, August. Heart disease classification using neural network and feature selection. In Systems Engineering (ICSEng), 2011 21st International Conference on (pp. 406-409). IEEE.

[14] https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease

[15] https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

[16] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)
