Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data
Kashif Javed, Haroon A. Babri, and Mehreen Saeed
Abstract—Data and knowledge management systems employ feature selection algorithms for removing irrelevant, redundant, and
noisy information from the data. There are two well-known approaches to feature selection, feature ranking (FR) and feature subset
selection (FSS). In this paper, we propose a new FR algorithm, termed as class-dependent density-based feature elimination (CDFE),
for binary data sets. Our theoretical analysis shows that CDFE computes the weights, used for feature ranking, more efficiently as
compared to the mutual information measure. Effectively, the rankings obtained from the two criteria approximate each other. CDFE
uses a filtrapper approach to select a final subset. For data sets having hundreds of thousands of features, feature selection with FR
algorithms is simple and computationally efficient but redundant information may not be removed. On the other hand, FSS algorithms
analyze the data for redundancies but may become computationally impractical on high-dimensional data sets. We address these
problems by combining FR and FSS methods in the form of a two-stage feature selection algorithm. When introduced as a
preprocessing step to the FSS algorithms, CDFE not only presents them with a feature subset, good in terms of classification, but also
relieves them from heavy computations. Two FSS algorithms are employed in the second stage to test the two-stage feature selection
idea. We carry out experiments with two different classifiers (naive Bayes’ and kernel ridge regression) on three different real-life data
sets (NOVA, HIVA, and GINA) of the “Agnostic Learning versus Prior Knowledge” challenge. As a stand-alone method, CDFE shows
up to about 92 percent reduction in the feature set size. When combined with the FSS algorithms in two stages, CDFE significantly
improves their classification accuracy and exhibits up to 97 percent reduction in the feature set size. We also compared CDFE against
the winning entries of the challenge and found that it outperforms the best results on NOVA and HIVA while obtaining a third position in
case of GINA.
Index Terms—Feature ranking, binary data, feature subset selection, two-stage feature selection, classification.
1 INTRODUCTION
THE advancements in data and knowledge management systems have made data collection easier and faster. Raw data are collected by researchers and scientists working in diverse application domains such as engineering (robotics), pattern recognition (face, speech), internet applications (anomaly detection), and medical applications (diagnosis). These data sets may consist of thousands of observations or instances, where each instance may be represented by tens or hundreds of thousands of variables, also known as features. The number of instances and the number of variables determine the size and the dimension of a data set. Data sets such as NOVA [1], a text classification data set consisting of 16,969 features and 19,466 instances, and DOROTHEA [2], a data set used for drug discovery consisting of 100,000 features and 1,950 instances, are not uncommon these days. Intuitively, having more features implies more discriminative power in classification [3]. However, this does not always hold in practice, because not all the features present in high-dimensional data sets help in class prediction.
Many features might be irrelevant and possibly detrimental to classification. Also, redundancy among the features is not uncommon [4], [5]. The presence of irrelevant and redundant features not only slows down the learning algorithm but also confuses it by causing it to overfit the training data [4]. In other words, eliminating irrelevant and redundant features makes the classifier's design simple, and improves its prediction performance and computational efficiency [6], [7].
High-dimensional data sets are inherently sparse and hence can be transformed to lower dimensions without losing too much information about the classes [8]. This phenomenon, known as the empty space phenomenon [9], is responsible for the well-known issue of the "curse of dimensionality," a term first coined by Bellman [10] in 1961 to describe the problems faced in the analysis of high-dimensional data. He proved that, to estimate multivariate density functions up to a given degree of accuracy, an increase in data dimensions leads to an exponential growth in the number of required data samples. While studying small sample size effects on classifier design, Raudys and Jain [11] observed a phenomenon related to the curse of dimensionality, termed the "peaking phenomenon": for a given sample size, the accuracy of a classifier first increases with the number of features, approaches its optimal value, and then starts decreasing.
Problems faced by learning algorithms with high-dimensional data sets have been intensively worked on by researchers. The algorithms that have been developed can be categorized into two broad groups. The algorithms that
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 3, MARCH 2012 465
. K. Javed and H.A. Babri are with the Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan. E-mail: {kashif.javed, babri}@uet.edu.pk.
. M. Saeed is with the Department of Computer Science, National University of Computer and Emerging Sciences, Block-B, Faisal Town, Lahore, Pakistan. E-mail: [email protected].
Manuscript received 24 Oct. 2009; revised 11 May 2010; accepted 14 Aug. 2010; published online 21 Dec. 2010. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2009-10-0734. Digital Object Identifier no. 10.1109/TKDE.2010.263.
1041-4347/12/$31.00 2012 IEEE Published by the IEEE Computer Society
select, from the original feature set, a subset of features which are highly effective in discriminating classes, are categorized as feature selection (FS) methods. Relief [12] is a popular FS method that filters out irrelevant features using the nearest neighbor approach. Another well-known method is the recursive feature elimination support vector machine (RFE-SVM) [13], which selects useful features while training the SVM classifier. The FS algorithms are further discussed in Section 2. On the other hand, algorithms that create a new set of features from the original features, through the application of some transformation or combination of the original features, are termed feature extraction (FE) methods. Among them, principal component analysis (PCA) and linear discriminant analysis (LDA) are the two well-known linear algorithms, widely used because of their simplicity and effectiveness [3]. Nonlinear FE algorithms include isomap [14] and locally linear embedding (LLE) [15]. For a comparative review of FE methods, interested readers are referred to [16]. In this paper, our focus is on the problem of supervised feature selection, and we propose a solution that is suitable for binary data sets.
The remainder of the paper is organized as follows. Section 2 describes the theory related to feature selection and presents a literature survey of the existing methods. In Section 3, we propose a new feature ranking (FR) algorithm, termed class-dependent density-based feature elimination (CDFE). Section 4 discusses how to combine CDFE with other feature selection algorithms in two stages. Experimental results on three real-life data sets are discussed in Section 5. The conclusions are drawn in Section 6.
2 FEATURE SELECTION
This section describes the theory related to the feature selection problem and surveys the various methods presented in the literature for its solution. Suppose we are given a labeled data set $\{\mathbf{x}^t, C^t\}_{t=1}^{N}$ consisting of $N$ instances and $M$ features such that $\mathbf{x}^t \in \mathbb{R}^M$ and $C^t$ denotes the class variable of instance $t$. There can be $L$ classes. Each vector $\mathbf{x}^t$ is thus an $M$-dimensional vector of features; hence, $\mathbf{x}^t = \{F_1^t, F_2^t, \ldots, F_M^t\}$. We use $F$ to denote the set comprising all features of a data set, whereas $G$ denotes a feature subset. The feature selection problem is to find a subset $G$ of $m$ features from the set $F$ having $M$ features with the smallest classification error [17], or at least without a significant degradation in performance [6].
A straightforward solution to the feature selection problem is to explore all $\binom{M}{m}$ possible subsets of size $m$. However, this kind of search is computationally expensive for even moderate values of $M$ and $m$. Therefore, alternative search strategies have to be designed.
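The size of this search space is easy to appreciate with a quick stdlib computation; the sizes below are illustrative only, far smaller than the data sets discussed later:

```python
from math import comb

# Number of distinct size-m subsets of M features: C(M, m)
M, m = 100, 10
print(comb(M, m))   # an enormous count of candidate subsets, even at these modest sizes
```

Even for 100 features and subsets of size 10, the count exceeds 10^13, which is why exhaustive search is ruled out.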
Generally speaking, the feature selection process mayconsist of four basic steps, namely, subset generation, subsetevaluation, stopping criterion, and result validation [6]. Inthe subset generation step, the feature space is searchedaccording to a search strategy for a candidate subset, whichis evaluated later. The search can begin with either anempty set, which is then successively built up (forward
selection) or it starts with the entire feature set and thenfeatures are successively eliminated (backward selection).Different search strategies have been devised such ascomplete search, sequential search, and random search
[18]. The newly generated subset is evaluated either withthe help of the classifier performance or some criterion thatdoes not involve the classifier feedback. These two steps arerepeated until a stopping criterion is met.
Two well-known classes of feature selection algorithms are feature ranking and feature subset selection (FSS) [7], [19]. Feature ranking methods typically assign weights to features by assessing each feature individually according to some criterion, such as the degree of relevance to the class variable. Correlation, information theoretic, and probabilistic ranking criteria are discussed in [20]. Features are then sorted according to their weights in descending order. A fixed number of the top ranked features can comprise the optimal subset; alternatively, a threshold value provided by the user can be set on the ranking criterion to retain/discard features. Thus, FR methods do not perform an explicit search for the smallest optimal set. They are highly attractive for microarray analysis and text-categorization domains because of their computational efficiency and simplicity [7]. Kira and Rendell's "Relief" algorithm [12] estimates the relevance of a feature using the values of the features of its nearest neighbors. Hall [21] proposes a ranking criterion that evaluates and ranks subsets of features rather than assessing features individually. In [22], a comparison of four feature ranking methods is given. Presenting the most relevant features to a classifier may not produce an optimal result, as the set may contain redundant features. In other words, the selected m best features may not result in the highest classification accuracy achievable with the best m features. Yu and Liu [23] suggest analyzing the subset obtained by the feature ranking methods for feature redundancy in a separate stage.
Unlike feature ranking methods, feature subset selection methods select subsets of features which together have good predictive power. Guyon and Elisseeff present theoretical examples in [7] to illustrate the superiority of FSS methods over FR methods. Feature subset selection methods are divided into three broad categories: filter, wrapper, and embedded methods [7], [19], [20]. A filter acts as a preprocessing step to a learning algorithm and assesses feature subsets without the algorithm's involvement. Fleuret [24] proposes a filtering criterion based on conditional mutual information (MI) for binary data sets. A feature $F_i$ among the unselected features is selected if its mutual information with the class, $I(C; F_i \mid F_k)$, conditioned on every feature $F_k$ already in the selected subset of features, is the largest. This conditional mutual information maximization (CMIM) criterion discards features similar to the already selected ones, as they do not carry additional information about the class. In [25], Peng et al. propose the minimal-redundancy-maximal-relevance (mRMR) criterion, which adds a feature to the final subset if it maximizes the difference between its mutual information with the class and the sum of its mutual information with each of the individual features already selected. Qu et al. [26] suggest a new redundancy measure and a feature subset merit measure based on mutual information concepts to quantify the relevance and redundancy among features. The proposed filter first finds a subset of highly relevant features. Among these features, the most relevant feature
having the least redundancy with the already selected features is added to the final subset.
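As an illustration of the CMIM idea described above, the following sketch greedily picks the unselected feature whose worst-case conditional mutual information with the class, given each already-selected feature, is largest. The function names and the brute-force counting estimates are ours, not Fleuret's implementation:

```python
import numpy as np

def cond_mi(c, fi, fk):
    """I(C; F_i | F_k) in bits, estimated by counting over binary arrays."""
    mi = 0.0
    for vk in (0, 1):
        mask = fk == vk
        pk = mask.mean()
        if pk == 0:
            continue
        for vc in (0, 1):
            for vf in (0, 1):
                p_joint = np.mean(mask & (c == vc) & (fi == vf))  # p(vk, vc, vf)
                p_c = np.mean(mask & (c == vc)) / pk              # p(vc | vk)
                p_f = np.mean(mask & (fi == vf)) / pk             # p(vf | vk)
                if p_joint > 0:
                    mi += p_joint * np.log2(p_joint / (pk * p_c * p_f))
    return mi

def cmim_select(X, y, n_select):
    """Greedy CMIM: score an unselected F_i by min_k I(C; F_i | F_k)."""
    selected = []
    for _ in range(n_select):
        best_i, best_score = None, -1.0
        for i in range(X.shape[1]):
            if i in selected:
                continue
            if selected:
                score = min(cond_mi(y, X[:, i], X[:, k]) for k in selected)
            else:
                # conditioning on a constant reduces to plain I(C; F_i)
                score = cond_mi(y, X[:, i], np.zeros(len(y), dtype=int))
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```

A feature that duplicates an already-selected one gets a near-zero minimum score and is passed over, which is exactly the redundancy-discarding behavior described above.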
Wrapper methods, motivated by Kohavi and John [17], use the performance of a predetermined learning algorithm for searching for an optimal feature subset. They suggest using n-fold cross validation for evaluating feature subsets, and find that the best-first search strategy outperforms the hill-climbing technique for forward selection. In practice, wrappers are considered computationally more expensive than filters.
In the embedded approach, the feature selection process is integrated into the training process of a given classifier. An example is the recursive feature elimination (RFE) algorithm [13], in which a support vector machine (SVM) is used as the classifier. Features are assigned weights that are estimated by the SVM classifier after it is trained on the data set. During each iteration, the feature(s) that decrease the margin of class separation the least are eliminated.
Another class of feature selection algorithms uses the concepts of Bayesian networks [27]. The Markov blanket (MB) of a target variable is a minimal set of variables such that, conditioned on them, all other variables are probabilistically independent of the target. The optimal feature subset for classification is the Markov blanket of the class variable. One way of identifying the Markov blanket is through learning the Bayesian network [28]. Another way is to discover the Markov blanket directly from the data [4]. The Markov blanket filtering (MBF) algorithm of Koller and Sahami [4] calculates pairwise correlations between all the features and assumes the K most highly correlated features of a feature to comprise its Markov blanket. During each iteration, expected cross entropy is used to estimate the MB of a feature, and the feature whose MB is best approximated is eliminated. For large values of K, MBF runs into computational and data fragmentation problems. To address these problems, in [29], we propose a Bernoulli mixture model-based Markov blanket filtering (BMM-MBF) algorithm for binary data sets that estimates the expected cross entropy measure via Bernoulli mixture models rather than from the training data set.
3 CLASS-DEPENDENT DENSITY-BASED FEATURE ELIMINATION
In this section, we propose a new feature ranking algorithm, termed class-dependent density-based feature elimination, for binary data sets. Binary data sets are found in a wide variety of applications including document classification [30], binary image recognition [31], drug discovery [32], databases [33], and agriculture [34]. It may be possible to binarize nonbinary data sets in many cases (e.g., binarization of the GINA data set; see Section 5). CDFE uses a measure termed the diff-criterion to estimate the relevance of features. The diff-criterion is a probabilistic measure and assigns weights to features by determining their density value in each class. Mathematically, we show that the computational cost of estimating the weights by the diff-criterion is less than the cost of calculating weights by mutual information. Feature rankings obtained by the two criteria are similar to each other. Instead of using a user-provided threshold value, CDFE determines the final subset with the help of a classifier.
Guyon et al. [35] proposed the Zfilter method to rank the features of a sparse-integer data set. The filter counts the nonzero values of a feature irrespective of the class variable and assigns the sum of the count as its weight. Features having a weight less than a given threshold value are then removed to obtain the final subset. In earlier work [36], [37], we proposed a similar density-based elimination strategy for high-dimensional binary data sets using the max-criterion (see Definition 3.2). In the following, we suggest a new and more effective density-based ranking criterion (the diff-criterion) and present a formal analysis of its working. The discussion that follows is for binary features and two-class classification problems unless stated otherwise.
Definition 3.1. The density of a binary feature for a given class is the fraction of the instances of that class in which the feature's value is 1.
The density of the ith feature, $F_i$, in the lth class $C_l$ having $N_{C_l}$ instances, is calculated as

$$d_{1l}^{i} = \frac{\sum_{t=1}^{N_{C_l}} F_i^t}{N_{C_l}} = p(F_i = 1 \mid C_l), \quad \forall i,\ 1 \le i \le M;\ \forall l,\ 1 \le l \le L. \qquad (1)$$
Remark 3.1. $0 \le d_{1l}^{i} \le 1$, as can be seen intuitively. The extrema are the cases when a feature's value remains identical over all the instances in a given class.
Definition 3.2. The max-criterion [36] calculates the density value of a feature in each class and then scores it with the maximum density value over all the classes.
The weight of the ith feature, $F_i$, using the max-criterion is

$$W(F_i)_{max} = \max_{l}\ d_{1l}^{i}. \qquad (2)$$
Remark 3.2. $0 \le W(F_i)_{max} \le 1$, as can be seen intuitively. Irrelevant features will be assigned a value $W(F_i)_{max} = 0$.
Definition 3.3. The diff-criterion calculates the density value of a feature in each class and then scores it with the difference of the density values over the two classes $C_0$ and $C_1$.
The weight of the ith feature, $F_i$, using the diff-criterion is

$$W(F_i)_{diff} = \left| d_{11}^{i} - d_{10}^{i} \right| = \left| p(F_i = 1 \mid C_1) - p(F_i = 1 \mid C_0) \right|. \qquad (3)$$
Remark 3.3. $0 \le W(F_i)_{diff} \le 1$, as can be seen intuitively. A feature having $W(F_i)_{diff} = 0$ is irrelevant, whereas $W(F_i)_{diff} = 1$ means $F_i$ is the most relevant feature of a data set. Features with lower weights are thus less relevant than features with higher values of $W_{diff}$.
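Equation (3) translates directly into a few lines of numpy. This is our own sketch with illustrative function names, assuming an N x M binary matrix X and a 0/1 label vector y:

```python
import numpy as np

def diff_criterion(X, y):
    """W(F_i)_diff = |p(F_i=1|C_1) - p(F_i=1|C_0)| for an N x M binary matrix X."""
    d1 = X[y == 1].mean(axis=0)   # per-feature density in the positive class
    d0 = X[y == 0].mean(axis=0)   # per-feature density in the negative class
    return np.abs(d1 - d0)

def rank_features(X, y):
    """Feature indices sorted by decreasing relevance (decreasing W_diff)."""
    return np.argsort(-diff_criterion(X, y))
```

Both densities for all M features come from two passes over the data, matching the counting-based cost analysis given later in this section.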
The class-dependent density-based feature elimination strategy ranks the features using the diff-criterion given by (3) and sorts them according to decreasing relevance. In feature selection algorithms such as [5], [23], [26], where relevance
and redundancy are analyzed separately in two steps, a preliminary subset of relevant features is chosen in the first step using a threshold value provided by the user. A high threshold value may result in a very small subset of highly relevant features, whereas with a low value the subset may consist of too many features, including highly relevant features along with less relevant ones. In the former case, a lot of information about the class may be lost, whereas in the latter case the subset will still contain a lot of information irrelevant to the class, thus requiring a computationally expensive second stage for selecting the best features. Feature ranking algorithms such as "Relief" [12] suffer from the same problem with a user-provided threshold value. To address this problem, CDFE uses a filtrapper approach [20] and defines nested sets of features $S_1 \supseteq S_2 \supseteq \cdots \supseteq S_{N_T}$ in search of the optimal subset. Here, $N_T$ denotes the number of threshold levels. A sequence of increasing $W_{diff}$ values can be used as thresholds to progressively eliminate more and more features of decreasing relevance in the nested subsets. Each feature subset thus generated is evaluated with a classifier, and the final subset is chosen according to the application requirement: either the smallest subset having the same accuracy as attained by the entire feature set is selected, or the one with the best classification accuracy is chosen.
The $W(F_i)_{diff}$ value of the ith feature, $F_i$, is determined by counting the number of 1s over the instances of the two classes. Therefore, the time complexity of the diff-criterion is $O(NM)$, where $N$ is the number of instances and $M$ is the number of features in the training set. Consequently, the time complexity of CDFE is $O(NMN_T V)$, where $V$ is the computing time of the classifier used.
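The filtrapper loop described above can be sketched as follows. The names are ours, and `evaluate` stands in for the cross-validated classifier accuracy that an actual CDFE run would use:

```python
import numpy as np

def cdfe(X, y, evaluate, thresholds):
    """Sketch of CDFE's filtrapper loop: rank by W_diff, sweep increasing
    thresholds to generate nested subsets, score each subset with a classifier
    callback `evaluate(X_subset, y) -> accuracy`, and keep the best subset."""
    w = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))  # diff-criterion (3)
    best_feats, best_acc = None, -1.0
    for t in thresholds:                  # nested subsets S_1 ⊇ S_2 ⊇ ... ⊇ S_{N_T}
        feats = np.flatnonzero(w >= t)
        if feats.size == 0:
            continue
        acc = evaluate(X[:, feats], y)
        if acc > best_acc:
            best_feats, best_acc = feats, acc
    return best_feats, best_acc
```

Keeping the best-accuracy subset is one of the two stopping policies mentioned above; the other (the smallest subset matching the full-set accuracy) is an easy variant of the same loop.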
3.1 Rationale of the Diff-Criterion Measure
In the remainder of this section, theoretical justification for the diff-criterion is provided. We express mutual information in terms of the diff-criterion and show that the diff-criterion is computationally more efficient.
Definition 3.4. Mutual information is a measure of the amount of information that one variable contains about another variable [38].
It is calculated by finding the relative entropy, or Kullback-Leibler distance, between the joint distribution $p(C, F_i)$ of two random variables $C$ and $F_i$ and their product distribution $p(C)\,p(F_i)$ [38]. Being consistent with our notation used in (2) and (3), the weight of the ith feature, $F_i$, using mutual information is

$$W(F_i)_{mi} = D_{KL}\big(p(C, F_i) \,\|\, p(C)\,p(F_i)\big) = \sum_{C}\sum_{F_i} p(C, F_i) \log_2 \frac{p(C, F_i)}{p(C)\,p(F_i)}. \qquad (4)$$
Remark 3.4. Because of the properties of the Kullback-Leibler divergence, $W(F_i)_{mi} \ge 0$, with equality if and only if $C$ and $F_i$ are independent. A larger $W_{mi}$ value means a feature is more important.
Writing the mutual information given in (4) in terms of class-conditional probabilities,

$$W(F_i)_{mi} = \sum_{C}\sum_{F_i} p(F_i \mid C)\,p(C) \log_2 \frac{p(F_i \mid C)}{p(F_i)}. \qquad (5)$$
Since $p(F_i = f) = p(F_i = f \mid C = 0)\,p(C = 0) + p(F_i = f \mid C = 1)\,p(C = 1)$ and $p(C = 0) + p(C = 1) = 1$, and using the notation $p(C = c) = P_c$ for the prior probability of the class variable and $d_{fc}^{i} = p(F_i = f \mid C = c)$ for the class-conditional probabilities of the ith feature, where $f, c \in \{0, 1\}$, in (5), we get

$$\begin{aligned}
W(F_i)_{mi} = {} & -P_0\, d_{00}^{i} \log_2 \frac{P_0\big(d_{00}^{i} - d_{01}^{i}\big) + d_{01}^{i}}{d_{00}^{i}} - P_0\, d_{10}^{i} \log_2 \frac{P_0\big(d_{10}^{i} - d_{11}^{i}\big) + d_{11}^{i}}{d_{10}^{i}} \\
& - (1 - P_0)\, d_{01}^{i} \log_2 \frac{P_0\big(d_{00}^{i} - d_{01}^{i}\big) + d_{01}^{i}}{d_{01}^{i}} - (1 - P_0)\, d_{11}^{i} \log_2 \frac{P_0\big(d_{10}^{i} - d_{11}^{i}\big) + d_{11}^{i}}{d_{11}^{i}}.
\end{aligned} \qquad (6)$$
Putting $d_{00}^{i} + d_{10}^{i} = 1$ and $d_{01}^{i} + d_{11}^{i} = 1$ in (6), suppressing the index $i$, and rearranging the terms, we get

$$\begin{aligned}
W(F_i)_{mi} = {} & P_0 \log_2 \big[ (d_{10})^{d_{10}} (1 - d_{10})^{(1 - d_{10})} \big] + (1 - P_0) \log_2 \big[ (d_{11})^{d_{11}} (1 - d_{11})^{(1 - d_{11})} \big] \\
& - \log_2 \big[ \big(d_{11} - P_0(d_{11} - d_{10})\big)^{\left(d_{11} - P_0(d_{11} - d_{10})\right)} \big(1 - \big(d_{11} - P_0(d_{11} - d_{10})\big)\big)^{\left(1 - \left(d_{11} - P_0(d_{11} - d_{10})\right)\right)} \big].
\end{aligned} \qquad (7)$$
Equation (7) indicates that the mutual information between a feature and the class variable depends on $P_0$, $d_{11}$, $d_{10}$, and the diff-criterion measure $(d_{11} - d_{10})$. The first term in (7) ranges over $[-P_0, 0]$, with its minimum at $d_{10} = 0.5$ and maxima at $d_{10} = 0, 1$. Similarly, the second term lies in the range $[-(1 - P_0), 0]$, with a minimum at $d_{11} = 0.5$ and maxima at $d_{11} = 0, 1$. The third term lies in the range $[0, 1]$ and depends on $P_0$ and $(d_{11} - d_{10})$. The most significant contribution to the mutual information comes from the third term, which contains the diff-criterion measure. Fig. 1 shows the relationship between mutual information and the diff-criterion for a balanced data set ($P_0 = 0.5$), a partially unbalanced data set ($P_0 = 0.25$), and an unbalanced data set ($P_0 = 0.035$). Mutual information increases as a function of the diff-criterion. The change in mutual information due to different values of $d_{11}$ and $d_{10}$ but the same value of $(d_{11} - d_{10})$ is relatively small, as evident from the standard deviation bars on the three plots in Fig. 1. It is also observed that the maximum value of mutual information, obtained with diff-criterion $(d_{11} - d_{10}) = 1$, decreases as $P_0$ decreases.
Remark 3.5. The mutual information of a feature, $F_i$, whose value remains the same over the two classes is 0.
Proof. In this case, the diff-criterion $(d_{11} - d_{10})$ becomes 0, i.e., $W(F_i)_{diff} = 0$. Putting this value in (7) and using $\lim_{y \to 0,1} \log_2\!\big(y^y (1 - y)^{(1-y)}\big) = 0$, we get $W(F_i)_{mi} = 0$. $\square$
Theorem 3.1. Mutual information is upper bounded by the entropy of the class variable.
Proof. The most relevant feature has $(d_{11} - d_{10}) = 1$, i.e., $W(F_i)_{diff} = 1$. Putting this value in (7) and using $\lim_{y \to 0,1} \log_2\!\big(y^y (1 - y)^{(1-y)}\big) = 0$, we get

$$W(F_i)_{mi} = -P_0 \log_2 P_0 - P_1 \log_2 P_1 = -p(C = 0) \log_2 p(C = 0) - p(C = 1) \log_2 p(C = 1) = \mathrm{Entropy}(C). \qquad \square$$
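Remark 3.5 and Theorem 3.1 are easy to check numerically. The helper below is our own sketch: it computes $W(F_i)_{mi}$ from $P_0$, $d_{10}$, and $d_{11}$ via definitions (4)-(5) and reproduces the two extremes of the diff-criterion:

```python
import math

def mi_from_densities(p0, d10, d11):
    """W(F_i)_mi in bits from the class prior P_0 and the densities
    d10 = p(F_i=1|C_0), d11 = p(F_i=1|C_1), via definition (4)-(5)."""
    pc = [p0, 1.0 - p0]
    d = [d10, d11]                        # p(F_i=1 | C=c)
    pf1 = p0 * d10 + (1.0 - p0) * d11     # p(F_i=1)
    pf = [1.0 - pf1, pf1]
    mi = 0.0
    for c in (0, 1):
        for f in (0, 1):
            pfc = d[c] if f == 1 else 1.0 - d[c]
            pj = pc[c] * pfc              # joint p(C=c, F_i=f)
            if pj > 0:
                mi += pj * math.log2(pj / (pc[c] * pf[f]))
    return mi

# diff = 0 gives zero mutual information (Remark 3.5);
# diff = 1 gives the class entropy (Theorem 3.1);
# for a fixed prior, MI grows with the diff-criterion.
print(mi_from_densities(0.5, 0.3, 0.3))   # 0.0
print(mi_from_densities(0.5, 0.0, 1.0))   # 1.0 = Entropy(C) for P_0 = 0.5
```

For $P_0 = 0.25$ the same helper returns the entropy of an unbalanced class at diff = 1, matching the observation in Fig. 1 that the attainable maximum shrinks as $P_0$ decreases.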
Remark 3.6. The range of the diff-criterion measure is $[0, 1]$, whereas mutual information lies within $[0, \mathrm{Entropy}(C)]$.
In other words, features with higher $W_{diff}$ will reduce the uncertainty of the class variable more than features with lower $W_{diff}$, and a feature whose $W_{diff}$ is 1 will contain all the information required to predict the class variable.
Remark 3.7. The diff-criterion is a computationally less expensive measure than mutual information.
If we assume that it takes $t_1$ units of time to calculate a density term $p(F_i = 1 \mid C)$, a subtraction operation is performed in $t_2$, and an absolute operation takes $t_3$ units of time, then the computational cost of $W(F_i)_{diff}$ given in (3) is $2t_1 + t_2 + t_3$. Further, if we assume that $p(C)$ and $p(F_i)$ take $t_4$, $\log_2$ takes $t_5$, a division takes $t_6$, and a multiplication takes $t_7$ units of time, then the computational cost of $W(F_i)_{mi}$ given in (5) is $4t_1 + 4t_2 + 4t_4 + 4t_5 + 4t_6 + 8t_7$. Comparing the two costs, and keeping in mind that logarithm, multiplication, and division are expensive operations, we find that the diff-criterion is a relatively less expensive measure.
4 TWO-STAGE FEATURE SELECTION ALGORITHMS
Feature ranking algorithms ignore redundancies among the features while selecting a final subset. Without any search strategy, they choose features that are highly relevant to the class variable. Due to their simplicity and computational efficiency, they are highly popular in application domains involving high-dimensional data. On the other hand, feature subset selection algorithms take the redundancies among features into consideration while selecting features, but are computationally expensive on data having a very large number of features. In this section, we suggest combining an FR algorithm using a filtrapper approach [20] with an FSS algorithm to overcome these limitations. The idea of designing dimensionality reduction algorithms with more than one stage is not new [25], [39]. However, this kind of combination of FR and FSS algorithms for high-dimensional binary data has not yet been explored. The first stage of the two-stage algorithm is based on a computationally cheap FR measure and selects a preliminary subset with the best classification accuracy. A potentially large number of irrelevant and redundant features are discarded in this phase. This makes the job of an FSS algorithm relatively easy. In the second stage, a "higher performance," computationally more expensive FSS algorithm is employed to select the most useful features from the reduced feature set produced in the first stage.
4.1 First Stage: Selection of the Preliminary Feature Subset
To evaluate its effectiveness as a preprocessor to FSS algorithms, CDFE is employed in the first stage of our two-stage algorithm. In this capacity, it provides them with a reduced initial feature subset having good classification accuracy compared to the entire feature set. Besides the irrelevant features, a large number of redundant features are eliminated by CDFE during this stage. The subset thus generated is not only easier to manipulate by the FSS algorithm in the second stage but also improves its classification performance.
4.2 Second Stage: Selection of the Final Feature Subset
In this paper, we have tested two FSS algorithms in the second stage: Koller and Sahami's Markov blanket filtering algorithm [4], which is an approximation to the theoretically optimal feature selection criterion, and our Bernoulli mixture model-based Markov blanket filtering algorithm [29], which makes MBF computationally more efficient. The two algorithms are briefly described here.
4.2.1 Koller and Sahami's Markov Blanket Filtering Algorithm [4]
Koller and Sahami show that a feature $F_i$ can be safely eliminated from a set without an increase in the divergence from the true class distribution if its Markov blanket, $\mathbf{M}$, can be identified. Practically, it is not possible to exactly pinpoint the true MB of $F_i$; hence, heuristics have to be applied. MBF is a backward elimination algorithm and is outlined in Table 1. For each feature $F_i$, a candidate set $\mathbf{M}_i$ consisting of those $K$ features which have the highest
Fig. 1. Relationship between diff-criterion and mutual information for balanced (left), partially unbalanced (middle), and highly unbalanced (right) data.
correlation with $F_i$ is selected. The value of $K$ should be as large as possible to subsume all the information $F_i$ contains about the class and the other features. Then, MBF estimates how close $\mathbf{M}_i$ is to being the MB of $F_i$ using the following expected cross entropy measure:

$$G(F_i \mid \mathbf{M}_i) = \sum_{\mathbf{f}_{\mathbf{M}_i},\, f_i} P(\mathbf{M}_i = \mathbf{f}_{\mathbf{M}_i}, F_i = f_i)\; D_{KL}\big(P(C \mid \mathbf{M}_i = \mathbf{f}_{\mathbf{M}_i}, F_i = f_i) \,\|\, P(C \mid \mathbf{M}_i = \mathbf{f}_{\mathbf{M}_i})\big). \qquad (8)$$

The feature $F_i$ having the smallest value of $G(F_i \mid \mathbf{M}_i)$ is omitted. The output of this algorithm can also be a list of features sorted according to relevance to the class variable. Its time complexity is $O(rMKN2^K L)$, where $r$ is the number of features to eliminate and $L$ is the total number of classes.
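For small $K$, the measure in (8) can be estimated directly by counting. The sketch below is our own simplification (it enumerates only the $(\mathbf{M}_i, F_i)$ assignments that actually occur in the data, so unobserved assignments contribute zero):

```python
import numpy as np

def expected_cross_entropy(X, y, i, mb):
    """G(F_i | M_i) of (8), estimated from an N x M binary matrix X with
    binary labels y; `mb` lists the candidate Markov-blanket feature indices."""
    cols = X[:, mb + [i]]
    g = 0.0
    for pattern in np.unique(cols, axis=0):        # observed (M_i, F_i) assignments
        rows = np.all(cols == pattern, axis=1)
        rows_mb = np.all(X[:, mb] == pattern[:-1], axis=1)
        p_pat = rows.mean()                        # P(M_i = f_M, F_i = f_i)
        for c in (0, 1):
            p_full = np.mean(y[rows] == c)         # P(C=c | M_i, F_i)
            p_mb = np.mean(y[rows_mb] == c)        # P(C=c | M_i)
            if p_full > 0:
                g += p_pat * p_full * np.log2(p_full / p_mb)
    return g
```

A feature whose candidate blanket already carries all of its class information scores $G \approx 0$ and is therefore the one MBF eliminates first.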
4.2.2 Bernoulli Mixture Model-Based Markov Blanket Filtering Algorithm [29]
Larger values of $K$ in the MBF algorithm demand heavy computations for calculating the expected cross entropy measure given in (8) from the training data. This issue is addressed by the BMM-MBF algorithm for binary data sets, which estimates the cross entropy measure from a Bernoulli mixture model instead of the training set. A Bernoulli mixture model can be seen as a tool for partitioning an $M$-dimensional hypercube, identifying regions of high data density on the corners of the hypercube. BMM-MBF is the same as the MBF algorithm given in Table 1, except that Step 2b is replaced by the steps given in Table 2.
BMM-MBF first determines the Bernoulli mixtures ($Q_1$ and $Q_0$) from the training data for the positive and negative classes ($C_1$ and $C_0$), respectively. The qth mixture is specified by its prior $\pi_q$ and probability vector $\mathbf{p}_q \in [0, 1]^M$, $1 \le q \le Q = Q_1 + Q_0$. These two parameters can be determined by the expectation maximization (EM) algorithm [40]. Then, BMM-MBF thresholds the values of the probability vector to see which corner of the hypercube is represented by the mixture: a probability value greater than 0.5 is taken as 1, and 0 otherwise. This converts $\mathbf{p}_q$ into a feature vector $\mathbf{x}$ whose probability of occurrence can be estimated as

$$p(\mathbf{x} \mid q) = \pi_q \prod_{i=1}^{M} p_{qi}^{x_i} (1 - p_{qi})^{(1 - x_i)}, \qquad (9)$$
where $p_{qi} \in [0, 1]$, $1 \le i \le M$, denotes the probability of success of the ith feature in the qth mixture. The feature vector having the highest probability of occurrence according to (9) in the mixture density is termed the "main vector" and is denoted by $\mathbf{v}$:

$$\mathbf{v} = \operatorname*{arg\,max}_{\mathbf{x} \in X}\ p(\mathbf{x} \mid q),$$
where $X$ represents the set of all binary vectors in $\{0, 1\}^M$. Once the main vectors are estimated, the steps given in Table 2 are then followed. Here, the kth mixtures for the positive and negative classes are denoted by $q_k^1$ and $q_k^0$, respectively.
The BMM-MBF algorithm has a time complexity of $O(rMKQ^2 L)$. The cross entropy measure in MBF is computed from $N \times K$ sized data, and we need to look at $2^K$ combinations of values. On the other hand, BMM-MBF computes the cross entropy measure from $Q \times K$ sized data, where we are only looking at the $Q$ main vectors in each Bernoulli mixture, resulting in a dramatic reduction in time.
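Equation (9) and the thresholding step can be sketched as follows (function names ours). Because the coordinates are independent given the mixture, thresholding $\mathbf{p}_q$ at 0.5 maximizes each Bernoulli factor separately, so the thresholded corner is exactly the main vector:

```python
import numpy as np

def mixture_prob(x, pi_q, p_q):
    """p(x | q) = pi_q * prod_i p_qi^{x_i} (1 - p_qi)^{1 - x_i}  -- equation (9)."""
    return pi_q * np.prod(np.where(x == 1, p_q, 1.0 - p_q))

def main_vector(p_q):
    """Hypercube corner represented by the mixture: threshold p_q at 0.5."""
    return (p_q > 0.5).astype(int)
```

Enumerating all $2^M$ corners is never needed: the per-coordinate threshold yields the maximizer directly, which is what makes restricting attention to the $Q$ main vectors cheap.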
TABLE 1: MBF Algorithm [4]
TABLE 2: BMM-MBF Algorithm [29]
5 EXPERIMENTAL RESULTS
This section first measures the effectiveness of our class-dependent density-based feature elimination algorithm used as a stand-alone method. Then, we evaluate CDFE as a preprocessor to the FSS algorithms described in Section 4, as part of our two-stage algorithm for selecting features from high-dimensional binary data. Experiments are carried out on three different real-life benchmark data sets using two different classifiers. The three data sets are NOVA, GINA, and HIVA, which were collected from the text-mining, handwriting, and medicine domains, respectively, and were introduced in the agnostic learning track of the "Agnostic Learning versus Prior Knowledge" challenge organized by the International Joint Conference on Neural Networks in 2007 [1]. The data sets are summarized in Table 3.
Designed for the text classification task, NOVA classifies emails into two classes: politics and religion. The data are a sparse binary representation of a vocabulary of 16,969 words and hence consist of 16,969 features. The positive class is 28.5 percent of the total instances. Thus, NOVA is a partially unbalanced data set.
HIVA is used for predicting the compounds that are active against the AIDS HIV infection. The data are represented as 1,617 sparse binary features, and 3.5 percent of the instances belong to the positive class. HIVA is, thus, an unbalanced data set.
The GINA data set is used for the handwritten digit recognition task, which consists of separating the two-digit even numbers from the two-digit odd numbers. With sparse continuous input variables, it is designed such that only the unit digit provides the information about the classes. The GINA features are integers quantized to 256 grayscale levels. We converted these 256 gray levels into 2 by substituting 1 for the values greater than 0. This is equivalent to converting a grayscale image to a binary image. Data sets with GINA-like feature values can be binarized with this strategy, which does not affect the sparsity of the data. The positive class is 49.2 percent of the total instances. In other words, GINA is balanced between the positive and negative classes.
The class labels of the test sets of these data sets are not publicly available, but one can make an online submission to know the prediction accuracy on the test set. In our experiments, the training and validation sets are combined to train the naive Bayes' and kernel ridge regression (kridge) classifiers. The software implementation given in the Challenge Learning Object Package (CLOP) [35] was used for both classifiers. The classification performance is evaluated by the balanced error rate (BER) over fivefold cross validation. BER is the average of the error rates of the positive and negative classes [20]. Given two classes, if $tn$ denotes the number of negative instances that are correctly labeled by the classifier and $fp$ refers to the number of negative instances that are incorrectly labeled, then the false positive rate is defined as $fpr = fp/(tn + fp)$. Similarly, we can define the false negative rate as $fnr = fn/(tp + fn)$, where $fn$ is the number of positive instances that are incorrectly labeled by the classifier and $tp$ denotes the number of positive instances that are correctly labeled. Thus, BER is given by
$$BER = 0.5 \times (fpr + fnr).$$
For data sets that are unbalanced in cardinality, BER gives a better picture of the error than the simple error rate [20].
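The BER computation just defined can be written directly from the confusion counts; a minimal sketch using the definitions above:

```python
def balanced_error_rate(tp, fp, tn, fn):
    """BER = 0.5 * (fpr + fnr), per the definitions in the text."""
    fpr = fp / (tn + fp)  # fraction of negatives labeled positive
    fnr = fn / (tp + fn)  # fraction of positives labeled negative
    return 0.5 * (fpr + fnr)

# On an unbalanced set, BER penalizes ignoring the minority class:
# a classifier that labels everything negative on a 1%-positive set
# has a simple error rate of 0.01 but a BER of 0.5.
```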
5.1 Class-Dependent Density-Based Feature Elimination as a Stand-Alone Feature Selection Algorithm
Experiments described in this section test the performance of CDFE as a stand-alone feature selection algorithm. Features are scored using the diff-criterion and are then sorted in descending order according to their weights. Fig. 2 shows the weights (sorted in descending order) assigned to the features by the max-criterion, diff-criterion, and mutual information measures using (2), (3), and (4), respectively. Although each measure assigns a different value to a
JAVED ET AL.: FEATURE SELECTION BASED ON CLASS-DEPENDENT DENSITIES FOR HIGH-DIMENSIONAL BINARY DATA 471
TABLE 3: Summary of the Data Sets [1]
The number of classes for each data set is 2. The train, valid and test columns show the total number of instances in the corresponding data sets.
Fig. 2. Comparison of weights assigned to the features for NOVA (left), HIVA (middle), and GINA (right).
feature, we are actually interested in looking at their patterns. The curve of the diff-criterion for the three data sets lies in the middle of the curves of the other two measures, while the curve of the max-criterion lies on top of the three curves. It is evident from these patterns that the diff-criterion behaves in a fashion more similar to mutual information as compared to the max-criterion. For NOVA, the $W_{diff}$ values lie in the range [0, 0.231], with most of the features having a value close to zero, as shown by its diff-criterion pattern. Thus, most NOVA features have poor discriminating power. The $W_{diff}$ values of HIVA range over [0, 0.272]. Compared to NOVA, a larger fraction of HIVA features have good class separation capability, as seen in Fig. 2. In the case of GINA, the $W_{diff}$ values lie within the range [0, 0.471], with a fairly large fraction of features having good discriminating power.
To find the final subset, the space of $M$ features is searched with a filtrapper-like approach [20]. We define nested sets of features, progressively eliminating more and more features of decreasing relevance with the help of a sequence of increasing threshold values on $W_{diff}$. For a given threshold value, we discard a number of features and retain the remaining features. The usefulness of every feature subset thus generated is tested using the classification accuracy of a classifier.
In Fig. 3, we look at the effectiveness of a feature ranking method. The CDFE algorithm is compared against a baseline method that generates nested feature subsets but selects features randomly from the data set, which is not ranked. From the plots, we observe that ranking the NOVA, HIVA, and GINA features significantly improves the classification accuracy. Besides this, a feature subset of smaller size attains the BER value that is obtained with the set containing all the features.
Next, we compare CDFE against three feature selection algorithms: the mutual information-based ranking method, Koller and Sahami's MBF algorithm, and the BMM-MBF algorithm, using the kridge classifier. The MI-based ranking method assigns weights to features according to the mutual information measure given in (4) and sorts them in order of decreasing weights. The plots are given in Fig. 4, and the results are tabulated in Table 4. Among these algorithms, CDFE is the least expensive. Both MBF and BMM-MBF applied on the entire NOVA feature set become computationally infeasible, as each algorithm involves calculating the correlation matrix of size $M \times M$. For data sets with a large number of features, such calculations render these algorithms impractical. For this reason, we could not compare the performance of CDFE against that of the MBF and BMM-MBF algorithms for NOVA. However, when compared with the MI-based ranking method, CDFE gives better results, as shown in Fig. 4. CDFE reduces the original dimensionality to a set of 3,135 features (81.53 percent reduction) having a classification accuracy as good as that attained with the entire feature set. On the other hand, the MI-based ranking method selects a subset of 4,950 features (70.83 percent reduction). For the HIVA data set, CDFE results in higher classification accuracy as compared to the other three feature selection algorithms. It generates a subset with about 8.6 percent of the original features. Here, CDFE's performance is close to MBF and outperforms the MI-based ranking method and the BMM-MBF algorithm. In the case of GINA, the classification accuracy patterns of CDFE, MBF, and the MI-based ranking method are similar. CDFE generates a subset whose dimensionality is 33 percent of the original feature set and comes third in terms of feature reduction.
[Fig. 3 panels plot BER against the size of the feature subsets; the full-feature-set BERs shown are 0.070175 for NOVA (16,969 features), 0.26778 for HIVA (1,617 features), and 0.14044 for GINA (970 features).]
Fig. 3. Comparison of CDFE against a baseline method of selecting random features without feature ranking for NOVA (left), HIVA (middle), and GINA (right) using the kridge classifier.
Fig. 4. Comparison of the CDFE algorithm against the MI-based ranking, MBF, and BMM-MBF algorithms for NOVA (left), HIVA (middle), and GINA (right) using the kridge classifier.
5.2 Two-Stage Feature Selection Algorithms
This section measures the performance of the two-stage algorithm with CDFE used as a preprocessor to an FSS algorithm (MBF or BMM-MBF) in the second stage. For this purpose, we compare the performance of the two stages used in unison against the second-stage feature selection algorithm alone.
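The two-stage idea can be sketched as below. Every callable here is a placeholder of our own, standing in for CDFE's ranking criterion, the second-stage FSS routine (MBF or BMM-MBF), and a cross-validated BER estimator, respectively; it is not the authors' implementation.

```python
import numpy as np

def two_stage_select(X, y, stage1_rank, stage2_fss, classifier_ber, thresholds):
    # Stage 1: rank features, then pick the nested subset with minimum BER.
    weights = stage1_rank(X, y)
    best_ber, best_idx = float("inf"), None
    for t in thresholds:
        idx = np.flatnonzero(weights > t)
        if idx.size == 0:
            continue
        ber = classifier_ber(X[:, idx], y)
        if ber < best_ber:
            best_ber, best_idx = ber, idx
    # Stage 2: run the FSS algorithm only on the reduced subset,
    # which is where the computational relief comes from.
    keep = stage2_fss(X[:, best_idx], y)
    return best_idx[keep]
```

Because stage 2 only ever sees the stage-1 subset, any quadratic cost in the number of features (such as an $M \times M$ correlation matrix) is paid on the reduced dimensionality rather than the original one.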
5.2.1 Stage-1: Class-Dependent Density-Based Feature Elimination
When CDFE is used as a preprocessor, we choose the feature subset resulting in the minimum BER value for the next stage. Table 5 summarizes the minimum BER results for the three data sets obtained from Fig. 4. The NOVA plot indicates that a BER of 6.38 percent is obtained when we eliminate features using a threshold on $W_{diff}$
features selected by the two-stage algorithm result in a classification accuracy as good as the one achieved by all the features. From the HIVA plot, a shift in the optimum BER point of MBF toward the left is evident when it is combined with CDFE. MBF alone results in an optimum BER of 26.8 percent with 185 features, while it attains an optimum BER of 26.4 percent with 140 features in two stages. The smallest subset which attains a BER value equal to that attained with all the HIVA features consists of 96 features when MBF is used as a stand-alone method. It consists of 76 features when MBF is combined with CDFE. In the case of GINA, MBF alone performs the classification task with 128 features with an accuracy equal to that obtained with the entire feature set. The size of this subset is reduced to 32 features when CDFE and MBF are combined in two stages.
Fig. 6 investigates the performance of the two-stage algorithm against that of MBF using the kridge classifier. When applied on NOVA, the smallest subset selected by the two-stage algorithm which attains a BER value equal to that obtained with all the features consists of 1,792 features. For HIVA, MBF selects a subset of 64 features, while CDFE and MBF in two stages select 58 features to perform the classification task without any degradation in the accuracy that is obtained with all the features. The GINA results indicate that 165 features selected by MBF result in a BER that is obtained with the entire feature set. On the other hand, the subset selected by the two-stage algorithm consists of 150 features.
5.2.3 Results of the Two-Stage Algorithm with the BMM-MBF Algorithm in Second Stage
In this section, the performance of the two-stage algorithm is discussed when CDFE is combined with the BMM-MBF algorithm. We experimented with the BMM-MBF algorithm using different values of $K$ and found that, unlike Koller and Sahami's MBF algorithm, it remains computationally efficient even if it has to search for the Markov blanket of a feature using values of $K$ as large as 40. For each data set, we use the optimal value of $K$. As with MBF, we evaluated the performance of our two-stage algorithm against the classification accuracy of the entire feature set obtained by the naive Bayes' and kridge classifiers for the NOVA data set. For HIVA and GINA, the two-stage algorithm was compared against the performance of the BMM-MBF algorithm. The empirical results are shown in Figs. 7 and 8 and are summarized in Table 7. We find that the performance of the BMM-MBF algorithm, both in terms of feature reduction and classification accuracy, is significantly improved with the addition of the CDFE algorithm as a first stage.
[Fig. 6 panels plot BER against the size of the feature subsets with the kridge classifier, comparing CDFE + MBF (k = 2 for NOVA and HIVA, k = 1 for GINA) against MBF alone; the full-feature-set BERs shown are 0.070175 for NOVA, 0.26778 for HIVA, and 0.14044 for GINA.]
Fig. 6. Comparison of the two-stage (CDFE + MBF) algorithm against the MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using the kridge classifier.
TABLE 6: Comparison of the Two-Stage (CDFE + MBF) Algorithm against the MBF Algorithm
F is the entire feature set, G is the selected feature subset, and BER is the balanced error rate.
Fig. 7. Comparison of the two-stage (CDFE + BMM-MBF) algorithm against the BMM-MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using the naive Bayes' classifier.
Fig. 7 compares the performance of the two-stage algorithm and that of the BMM-MBF algorithm using the naive Bayes' classifier. For the NOVA data set, our two-stage algorithm leads to an optimum BER value of 2 percent with 2,048 features, while it selects a subset of 605 features with classification accuracy as good as that obtained by all the features. The HIVA plot indicates that the classification accuracy of BMM-MBF is improved with the introduction of the CDFE stage in such a manner that almost 8 percent of the original features result in an accuracy obtained with all the features. In the case of GINA, we find that BMM-MBF alone performs the classification task with 279 features with an accuracy equal to that attained with all the features. The addition of CDFE to BMM-MBF reduces this subset to 165 features.
In Fig. 8, results of the kridge classifier when applied on the three data sets are shown. The dimensionality of the NOVA subset selected from the first stage is reduced further to 780 by BMM-MBF without compromising the classification accuracy that is obtained with all the features. From the HIVA plot, we find that the smallest subset selected by BMM-MBF to perform the classification task with a BER value equal to that attained by all the features consists of 817 features. The size of this subset is reduced to 140 when CDFE and BMM-MBF are combined in two stages. When the experiment was run on the GINA data set, BMM-MBF selected 550 features while the two-stage algorithm selected 279 features.
5.3 Comparison of CDFE Performance against the Top 3 Winning Entries of the Agnostic Learning Track [1]
The organizers of the agnostic learning track of the "Agnostic Learning versus Prior Knowledge" challenge evaluated all the entrants on the basis of the BER on the test sets. We tested CDFE in both capacities, as a stand-alone method and as part of the two-stage algorithm (i.e., as a preprocessor to MBF or BMM-MBF), with the kridge classifier and the classification method given in [36] for NOVA, HIVA, and GINA. Table 8 gives a comparison of CDFE's performance against the top 3 winning entries of the agnostic learning track of the challenge.
In the case of NOVA, both our methods, the CDFE stand-alone algorithm and the two-stage algorithm, outperform the top 3 results. We also find that CDFE performs better in two stages (CDFE + MBF) as compared to the CDFE stand-alone case. For the HIVA data set, we observe that the BER value obtained by CDFE with the kridge classifier outperforms the top 3 results. When combined with MBF in two stages, CDFE results in a performance that is comparable to the three winning BER results. Results obtained on GINA indicate that the two-stage (CDFE + MBF) algorithm beats the second and third winning entries. As a stand-alone method, CDFE obtains the third position in the ranking of the top 3 entries of the challenge.
Feature selection algorithms may behave differently on data sets from different application domains. The main factors that affect the performance include the number of features and samples and the balance of the classes of the training data [20]. NOVA, HIVA, and GINA belong to different application domains. The ratio of features to samples is 0.103, 2.378, and 3.251, and the positive class is 28.5, 3.5, and 49.2 percent of the total samples, respectively. Results in Table 8 indicate that CDFE, which is currently limited to the domain of two-class classification with binary-valued features, performs consistently better as compared to the other feature selection algorithms used in the challenge.
6 CONCLUSIONS
This paper is devoted to feature selection in high-dimensional binary data sets. We proposed a ranking criterion, called the diff-criterion, to estimate the relevance of features using their density values over the classes. We showed that it is equivalent to the mutual information measure but is
TABLE 7: Comparison of the Two-Stage (CDFE + BMM-MBF) Algorithm against the BMM-MBF Algorithm
Fig. 8. Comparison of the two-stage (CDFE + BMM-MBF) algorithm against the BMM-MBF algorithm for NOVA (left), HIVA (middle), and GINA (right) using the kridge classifier.
computationally more efficient. Based on the diff-criterion, we proposed a supervised feature selection algorithm, termed class-dependent density-based feature elimination, to select a subset of useful binary features. CDFE uses a classifier instead of a user-provided threshold value to select the final subset. Our experiments on three real-life data sets demonstrate that CDFE, in spite of its simplicity and computational efficiency, either outperforms other well-known feature selection algorithms or is comparable to them in terms of classification and feature selection performance.
We also found that CDFE can be effectively used as a preprocessing step for other feature selection algorithms for determining compact subsets of features without compromising the accuracy on a classification task. It thus provides them with a substantially smaller feature subset having better class separability. Feature selection algorithms, such as MBF and BMM-MBF, involving square matrices of size equal to the number of features, become computationally intractable for high-dimensional data sets. It was shown empirically that CDFE adequately relieves them from this problem and significantly improves their classification and feature selection performance.
Furthermore, we analyzed CDFE's performance by comparing it against the winning entries of the agnostic learning track of the "Agnostic Learning versus Prior Knowledge" challenge. Results indicate that CDFE outperforms the best entries obtained on the NOVA and HIVA data sets and attains the third position on the GINA data set.
ACKNOWLEDGMENTS
Kashif Javed was supported by a doctoral fellowship at the University of Engineering and Technology, Lahore. The authors would like to thank the anonymous reviewers for their helpful comments.
REFERENCES
[1] I. Guyon, A. Saffari, G. Dror, and G. Cawley, "Agnostic Learning vs. Prior Knowledge Challenge," Proc. Int'l Joint Conf. Neural Networks (IJCNN), http://www.agnostic.inf.ethz.ch, 2007.
[2] "Feature Selection Challenge by Neural Information Processing Systems Conference (NIPS)," http://www.nipsfsc.ecs.soton.ac.uk, 2003.
[3] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, second ed. Wiley, 2001.
[4] D. Koller and M. Sahami, "Toward Optimal Feature Selection," Proc. 13th Int'l Conf. Machine Learning, pp. 284-292, 1996.
[5] L. Yu and H. Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy," J. Machine Learning Research, vol. 5, pp. 1205-1224, 2004.
[6] M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis, Elsevier Science B.V., vol. 1, no. 3, pp. 131-156, 1997.
[7] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," J. Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[8] L. Jimenez and D. Landgrebe, "Supervised Classification in High Dimensional Space: Geometrical, Statistical and Asymptotical Properties of Multivariate Data," IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Rev., vol. 28, no. 1, pp. 39-54, Feb. 1998.
[9] D. Scott and J. Thompson, "Probability Density Estimation in Higher Dimensions," Proc. 15th Symp. Interface, Elsevier Science Publishers, pp. 173-179, 1983.
[10] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton Univ. Press, 1961.
[11] S. Raudys and A. Jain, "Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252-264, Mar. 1991.
[12] K. Kira and L.A. Rendell, "A Practical Approach to Feature Selection," Proc. Ninth Int'l Conf. Machine Learning, pp. 249-256, 1992.
[13] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene Selection for Cancer Classification Using Support Vector Machines," Machine Learning, vol. 46, pp. 389-422, 2002.
[14] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.
[15] L.K. Saul and S.T. Roweis, "Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds," J. Machine Learning Research, vol. 4, pp. 119-155, 2003.
[16] L. van der Maaten, E. Postma, and H. van den Herik, "Dimensionality Reduction: A Comparative Review," Technical Report TiCC-TR 2009-005, Tilburg Univ., 2009.
[17] R. Kohavi and G. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, vol. 97, pp. 273-324, Dec. 1997.
[18] H. Liu and L. Yu, "Toward Integrating Feature Selection Algorithms for Classification and Clustering," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 4, pp. 491-502, Apr. 2005.
TABLE 8: Comparison of CDFE Performance against Top 3 Winning Entries of the Agnostic Learning Track [1]
BMM is Bernoulli mixture model, PCA is principal component analysis, PSO is particle swarm optimization, and SVM is support vector machine.
[19] A.L. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, Elsevier B.V., vol. 97, pp. 245-271, 1997.
[20] I. Guyon, S. Gunn, M. Nikravesh, and L.A. Zadeh, Feature Extraction: Foundations and Applications. Springer, 2006.
[21] M. Hall, "Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning," Proc. 17th Int'l Conf. Machine Learning, 2000.
[22] R. Ruiz and J.S. Aguilar-Ruiz, "Analysis of Feature Rankings for Classification," Proc. Int'l Symp. Intelligent Data Analysis (IDA), pp. 362-372, 2005.
[23] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," Proc. 20th Int'l Conf. Machine Learning, 2003.
[24] F. Fleuret, "Fast Binary Feature Selection with Conditional Mutual Information," J. Machine Learning Research, vol. 5, pp. 1531-1555, 2004.
[25] H. Peng, F. Long, and C. Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, Aug. 2005.
[26] G. Qu, S. Hariri, and M. Yousaf, "A New Dependency and Correlation Analysis for Features," IEEE Trans. Knowledge and Data Eng., vol. 17, no. 9, pp. 1199-1207, Sept. 2005.
[27] J. Pearl, Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[28] A. Freno, "Selecting Features by Learning Markov Blankets," Proc. 11th Int'l Conf. Knowledge-Based Intelligent Information and Eng. Systems and XVII Italian Workshop on Neural Networks, Part I (KES/WIRN), pp. 69-76, 2007.
[29] M. Saeed, "Bernoulli Mixture Models for Markov Blanket Filtering and Classification," J. Machine Learning Research, vol. 3, pp. 77-91, 2008.
[30] A. Juan and E. Vidal, "On the Use of Bernoulli Mixture Models for Text Classification," Pattern Recognition, vol. 35, pp. 2705-2710, 2002.
[31] A. Juan and E. Vidal, "Bernoulli Mixture Models for Binary Images," Proc. 17th Int'l Conf. Pattern Recognition (ICPR '04), 2004.
[32] "Annual KDD Cup 2001," http://www.sigkdd.org/kddcup/, 2001.
[33] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Databases (VLDB '94), 1994.
[34] J. Wilbur, J. Ghosh, C. Nakatsu, S. Brouder, and R. Doerge, "Variable Selection in High-Dimensional Multivariate Binary Data with Application to the Analysis of Microbial Community DNA Fingerprints," Biometrics, vol. 58, pp. 378-386, 2002.
[35] I. Guyon et al., "CLOP," http://ymer.org/research/files/clop/clop.zip, 2011.
[36] M. Saeed, "Hybrid Learning Using Mixture Models and Artificial Neural Networks," Hands-on Pattern Recognition: Challenges in Data Representation, Model Selection, and Performance Prediction, http://www.clopinet.com/ChallengeBook.html, Microtome, 2008.
[37] M. Saeed and H. Babri, "Classifiers Based on Bernoulli Mixture Models for Text Mining and Handwriting Recognition," Proc. IEEE Int'l Joint Conf. Neural Networks, 2008.
[38] T.M. Cover and J.A. Thomas, Elements of Information Theory. John Wiley and Sons, 1991.
[39] L. Jimenez and D.A. Landgrebe, "Projection Pursuit in High Dimensional Data Reduction: Initial Conditions, Feature Selection and the Assumption of Normality," Proc. IEEE Int'l Conf. Systems, Man and Cybernetics, 1995.
[40] C.M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[41] R.W. Lutz, "Doubleboost," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[42] V. Nikulin, "Classification with Random Sets, Boosting and Distance-Based Clustering," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[43] V. Franc, "Modified Multi-Class SVM Formulation; Efficient LOO Computation," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[44] H.J. Escalante, "Particle Swarm Optimization for Neural Networks," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[45] J. Reunanen, "Cross-Indexing," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
[46] I.C. ASML team, "Feature Selection with Redundancy Elimination + Gradient Boosted Trees," Fact Sheet, http://clopinet.com/isabelle/Projects/agnostic/, 2007.
Kashif Javed received the BSc and MSc degrees in electrical engineering in 1999 and 2004, respectively, from the University of Engineering and Technology (UET), Lahore, Pakistan, where he is currently working toward the PhD degree. He joined the Department of Electrical Engineering at UET in 1999, where he is currently an assistant professor. His research interests include machine learning, pattern recognition, and ad hoc network security.
Haroon A. Babri received the BSc degree in electrical engineering from the University of Engineering and Technology (UET), Lahore, Pakistan, in 1981, and the MS and PhD degrees in electrical engineering from the University of Pennsylvania in 1991 and 1992, respectively. He was with the Nanyang Technological University, Singapore, from 1992 to 1998, with Kuwait University from 1998 to 2000, and with the Lahore University of Management Sciences (LUMS) from 2000 to 2004. He is currently a professor of electrical engineering at UET. He has written two book chapters and has more than 60 publications in machine learning, pattern recognition, neural networks, and software reverse engineering.
Mehreen Saeed received the doctorate degree from the Department of Engineering Mathematics, University of Bristol, United Kingdom, in 1999. She is currently working as an assistant professor in the Department of Computer Science, FAST National University of Computer and Emerging Sciences, Lahore Campus, Pakistan. Her main areas of interest include artificial intelligence, machine learning, and statistical pattern recognition.