

Supervised feature subset selection with ordinal optimization

Dingcheng Feng, Feng Chen *, Wenli Xu
Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing 100084, China

Knowledge-Based Systems 56 (2014) 123–140
http://dx.doi.org/10.1016/j.knosys.2013.11.004

* Corresponding author. Tel.: +86 1062797145; fax: +86 1062786911. E-mail address: [email protected] (F. Chen).

Article info

Article history: Received 12 January 2013; Received in revised form 5 November 2013; Accepted 7 November 2013; Available online 28 November 2013.

Keywords: Supervised feature selection; Uniform sampling; Ordinal optimization; Ordered performance curve; Feature scoring; Order-good-enough; Value-good-enough.

Abstract

The ultimate goal of supervised feature selection is to select a feature subset such that the prediction accuracy is maximized. Most of the existing methods, such as filter and embedded models, can be viewed as using approximate objectives which differ from the prediction accuracy. The wrapper models maximize the prediction accuracy directly, but the optimization has very high computational complexity. To address these limitations, we present an ordinal optimization perspective for feature selection (OOFS). Feature subset evaluation is formulated as a system simulation process with randomness. Supervised feature selection then becomes maximizing the expected performance of the system, where ordinal optimization can be applied to identify a set of order-good-enough solutions with much reduced complexity and parallel computing. These solutions correspond to the really good enough (value-good-enough) solutions when the solution space structure, characterized by the ordered performance curve (OPC), exhibits concave shapes. We show that this happens in some important applications such as image classification, where a large number of features have relatively similar discriminative abilities. We further improve the OOFS method with a feature scoring algorithm, called OOFSs. We prove that, when the performance difference of solutions increases monotonically with respect to the solution difference, the expectation of the scores reflects useful information for estimating the globally optimal solution. Experimental results on sixteen real-world datasets show that our method provides a good trade-off between prediction accuracy and computational complexity.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

The rapid development of information technology has produced more and more high dimensional data, and posed a great challenge to many realistic learning tasks. As a fundamental problem in statistical machine learning, feature selection (or variable selection) reduces the dimensionality by identifying the most informative subset from a large number of variables [1]. The advantages of performing feature selection include reducing the computational cost, saving the storage space, facilitating model selection procedures for accurate prediction, and interpreting complex dependencies among variables. Due to its importance, feature selection has received persistent attention over the past several decades [2,1,3–5].

Existing feature selection algorithms have been designed with different strategies [6,1,3,7]. According to the degree of using label information, feature selection algorithms are designed as supervised, semi-supervised and unsupervised methods, which correspond to labeled, partially labeled and unlabeled training data [8,5]. According to the degree of evaluating feature variables, feature selection algorithms can either return a feature subset or return the scores of individual features [2,1]. According to the degree of using learning algorithms during the selection process, feature selection algorithms can be classified into filter, wrapper, and embedded models [6,5]. The filter models rank the features by the characteristics of the data without using any learning algorithm, which is usually fast but not directly linked to subsequent learning tasks such as classification. Typical filter models include mutual information, the Pearson product-moment correlation coefficient, and the inter/intra class distance [1]. The wrapper models evaluate and select feature subsets with a predetermined predictor [2,9,10], which maximizes the learning objective (e.g., classification accuracy) directly, but the optimization is usually slow. The embedded models incorporate feature selection into some specific training processes, which often provide a trade-off between learning speed and prediction accuracy [11,12]. Recently, many feature selection algorithms have been developed together with other interesting machine learning topics, such as manifold learning in nonlinear dimensionality reduction, sparse learning in large-scale regression models, and structure learning in probabilistic graphical models [13–18]. For example, the framework of spectral feature selection [19], which is based on spectral graph theory, studies how to select features to preserve the intrinsic manifold structure among samples [7,13,14,19,20]. Sparse learning provides a way of shrinking dependencies among random variables, which is more continuous and does not exhibit high variance compared with feature selection [15]. Feature selection with sparse learning can usually be formulated as regularization problems [11,1,21,16,17].


Probabilistic graphical models provide a solution for feature selection where conditional independence structures and causal influence diagrams can be used to represent and infer complex dependencies among different features and labels [22,18].

In this paper, we consider the supervised feature selection problem, where the "ultimate objective" is to select a feature subset such that the generalized classification accuracy (prediction accuracy) is maximized [2,23]. Most previous feature selection methods, such as filter and embedded models, and spectral, sparse and causal feature selection formulations, can be viewed as using indirect objectives to maximize the supervised learning accuracy [2,1,24,13,11]. The indirectness often facilitates optimization, but the optimal solutions of these indirect objectives are not necessarily optimal with respect to the prediction accuracy. For example, filter models rank features by correlation coefficients, which is independent of the feature evaluation procedure [2,1]. Embedded models and sparse feature selection methods provide a trade-off between goodness-of-fit maximization (in terms of regression) and feature number minimization, but this does not directly optimize the prediction accuracy [11,21,16,17]. The objectives of spectral feature selection models are closely related to data manifold assumptions, such as forming the best lower dimensional manifold representation [24], preserving the multi-cluster structure among samples [13], or minimizing the within-class affinity while maximizing the between-class affinity [25]. Though these objectives are related to the accuracy maximization objective, it is difficult to measure their gaps. Feature selection based on probabilistic graphical models can effectively identify the most correlated features [22,18], but the "optimal" feature subset in terms of conditional independence or causality may not be effective in optimizing the classification performance [3].

In supervised feature selection, optimizing the prediction accuracy directly is more advantageous for achieving the ultimate goal. This is the basic idea of wrapper models, which directly optimize the classification accuracy with respect to a given predictor (classifier). However, the central difficulty is that the complexity of the wrapper models is very high, and the computation needs to be performed in a serial way. Because supervised feature selection is a combinatorial optimization problem (in fact, it is NP-hard [1]), the prediction accuracy is often maximized by heuristic search strategies such as greedy forward selection, greedy backward elimination, simulated annealing and genetic algorithms [6,2,9,26]. Evaluating the prediction accuracy of a feature subset involves partitioning the data into training and test sets randomly and averaging the test accuracies by training classifiers repeatedly, which has considerable complexity. Search-based optimization requires performing the feature evaluation sequentially many times, which is usually computationally infeasible. For example, the next solution update in greedy forward selection is based on evaluating many local neighbors of the current solution; the next population evolution in genetic algorithms is based on evaluating each solution in the current population. Another limitation of the wrapper models is that they usually converge to solutions whose goodness relative to other solutions in the entire space is unknown. Improvements to these wrapper models, such as adding random perturbations or using approximate performance lists [27,10,4,28], do not essentially overcome these limitations.

To address the limitations, we present an ordinal optimization (OO) perspective [29,30], which solves the "hard" feature selection problem in a "soft" way. The general idea is to view feature subset evaluation as a simulation process of a complex system with randomness. Then the supervised feature selection problem becomes maximizing the expected performance of the system. Since the performance "order" is usually much more robust against noise than the "value", and finding "good enough" solutions is much more computationally efficient than identifying the "best" solutions, it is possible to find order-good-enough solutions (defined as top-g% solutions, e.g., top-5%, according to the underlying performance order) with much smaller and controllable computational complexity [29,30]. Specifically, OO samples a set of solutions uniformly and independently from the solution space, which can be performed in a parallel way. Then OO computes the approximate performances of these solutions with a crude but computationally fast model. After estimating the ordered performance curve (OPC) of the simulation process and the noise level of the crude model, OO systematically selects a subset of solutions according to the ordered performances, which ensures that the selected solution subset and the underlying order-good-enough solution set have a large intersection with high probability (called the alignment probability).

Performing OO procedures to maximize the prediction accuracy leads to a new ordinal optimization based feature selection method (called OOFS). Similar to ordinal optimization, OOFS does not require evaluating feature subsets accurately by training classifiers repeatedly, which saves considerable complexity. Because the performance "order" is more robust against noise than the "value", OOFS only requires sorting approximate performances estimated by a crude but computationally fast model. The performance of the crude model, for example, can be the test accuracy obtained by partitioning the data into training and test sets a small number of times. Moreover, OOFS does not require evaluating feature subsets sequentially, since it is based on analyzing a set of solutions which are sampled uniformly and independently from the solution space. Therefore, compared with wrapper models, OOFS greatly reduces the complexity and facilitates parallel computing, at the price of achieving only order-good-enough solutions.

We show that OOFS not only identifies order-good-enough solutions, but can also find really good enough (value-good-enough) solutions. A value-good-enough solution means that the relative difference of its performance with respect to the optimal performance is smaller than g%. Note that OO guarantees to find order-good-enough solutions, but the supervised feature selection task is a numerical optimization problem, which requires maximizing the prediction accuracy value. To address this issue, we study the structures of the solution space in unconstrained (the number of selected features is not constrained) and constrained (the number of selected features is constrained) feature selection [31]. The solution space structure is defined as the shape of the ordered performance curve, which characterizes close connections between ordinal and numerical optimization. In some applications such as image classification, a large number of features usually have similar discriminative abilities. In these applications, we show that OPCs in unconstrained solution spaces exhibit a concave shape, and OPCs in constrained solution spaces exhibit an inverse-S shape, which correspond to the Steep-class and Bell-class OPCs in OO theory [29,30], respectively. These conclusions imply that order-good-enough solutions obtained by OOFS can be close to value-good-enough solutions. Therefore, although the OO method is designed for general complex simulation optimization problems, it can have especially good performance in these supervised feature selection applications.

We further explore the conditions under which the OOFS method can be used to estimate the globally optimal solution. To achieve this, we propose an OOFS based scoring algorithm (called OOFSs), which measures the importance (scores) of individual features in maximizing the prediction accuracy. The importance of individual features, represented as a score vector, is computed by the voting of the top ranked solutions in OOFS. We prove that, when the performance difference of solutions increases monotonically with respect to the solution difference, the expectation of the scores can locate the globally optimal solution by thresholding. The difference between two solutions can be defined as their Hamming distance. Theoretical results in both unconstrained and constrained solution spaces are provided.


Table 1. Summary of notations.

d: Number of all variables.
k: Number of selected variables.
m: Number of all samples.
m_tr / m_ts: Number of training/test samples.
l: Number of classes (labels).
X: Variable matrix; x_i^(j) is feature i in sample j.
y: Label vector, y ∈ {1, ..., l}.
C: Classifier type; C is set to a specific classifier c, e.g., SVM.
h: Selection vector, h ∈ {0,1}^d.
h_j: j-th element of the selection vector h.
h^o: Optimal selection vector in the solution space.
J(h): Objective function (performance value of selection vector h).
G_o: Order-good-enough solution set.
G_v: Value-good-enough solution set.
mean(M,2): Column mean of a matrix M.
M(a,b): Sub-matrix of a matrix M with row index a and column index b.
D(h, h^o): Hamming distance between h and h^o.
H: Unconstrained solution space, H = {h | h ∈ {0,1}^d}.
H_k: Constrained solution space, H_k = {h | h ∈ {0,1}^d, Σ_{i=1}^d h_i = k}.
H_{k,i}: Subset of the constrained solution space, H_{k,i} = {h | h ∈ H_k, D(h, h^o) = 2i}.
H_N: A solution set, H_N = {h_1, ..., h_N}, or sorted as H_N = {h_(1), ..., h_(N)}.
H(t): Concatenation of the top t elements in H_N: H(t) = [h_(N−t+1), ..., h_(N)].
H(−t): Concatenation of the bottom t elements in H_N: H(−t) = [h_(1), ..., h_(t)].


When the condition holds roughly in real-world datasets, OOFSs can be applied to estimate the globally optimal solution approximately. Experimental results on sixteen real-world datasets show that different variants of OOFS provide a better trade-off between prediction accuracy and computational complexity than popular feature selection methods.

The innovations of this study can be summarized as follows. First, previous works usually formulate supervised feature selection as combinatorial optimization [2,23,3]. In our paper, we formulate the task as a simulation process (with randomness) of a complex system, which describes the feature selection process more reasonably. Second, previous works solve supervised feature selection either by optimizing the prediction accuracy with great complexity to achieve the optimal solution [6,2,9,26], or by optimizing some approximate objectives with less complexity and accuracy [11,21,24,25,13,16,17]. In our paper, we solve supervised feature selection by "softening" the prediction accuracy maximization to achieve good enough solutions, which provides a new trade-off between complexity and accuracy. Third, we discuss the relationships between order-good-enough and value-good-enough solutions in supervised feature selection, which provides a new perspective for ordinal optimization theory and its applications [29,10,30].

The organization of this paper is as follows. Section 2 presents the supervised feature selection problem. Section 3 provides an ordinal optimization based feature selection method (OOFS). Section 4 analyzes the connections between ordinal optimization and feature selection in OOFS. Section 5 proposes a scoring algorithm (OOFSs) which improves OOFS in estimating information about globally optimal solutions. In Section 6, experimental results on sixteen real-world datasets compare the performance of previous feature selection algorithms and different OOFS variants. Section 7 summarizes our work and discusses future work.

2. The supervised feature selection problem

2.1. Notations

We first introduce some notations used in the paper. Scalars are written as italic lowercase letters (e.g., x ∈ R). Vectors are written as boldface italic lowercase letters (e.g., h ∈ R^d). 0 and 1 represent a vector whose elements are all 0 and all 1, respectively. Matrices are written as boldface italic uppercase letters (e.g., X ∈ R^{d×n}). Sets are denoted by italic uppercase letters (e.g., S = {1,2,3}, T = {1,2,3} and H = {0,1}^d), and the cardinality of a set is denoted by |·| (e.g., |S| = k). EJ and VarJ denote the expectation and variance of a random variable J, respectively.

In feature subset selection, we use X = [x_i^(j)] ∈ R^{d×m} (i = 1, ..., d; j = 1, ..., m) to denote the feature matrix, where x_i^(j) is feature i in sample j, d is the number of features, and m is the number of samples. The label vector is denoted by y ∈ {1, ..., l}^m, where l is the number of classes. A classifier is denoted by c (e.g., c denotes the nearest neighbor classifier or a support vector machine). We use the MATLAB function "mean" to denote the column mean of a matrix, and M(a,b) to denote a sub-matrix extracted by row and column indices a and b. We use the notation of order statistics and denote the sorted solutions as h_(1), ..., h_(N). Table 1 summarizes some important variables of the paper. See the following sections for details.

2.2. Problem statement

The feature selection problem for classification can be formulated as follows. Given a d×m data matrix X ∈ R^{d×m} and the corresponding label vector y, identify a feature subset such that the prediction accuracy with respect to a given classifier is maximized. The selection process can be performed in the unconstrained case (there is no constraint on the number of selected features) or in the constrained case (the number of selected features is constrained to be k) [31]. Specifically, we want to

max_{h ∈ H} J(h)    (1)

where J(h) ∈ [0,1] denotes the evaluation function for the selected features, the selection vector h = [h_1, h_2, ..., h_d]^T denotes a selected feature subset: h_i = 1 (or h_i = 0) means the i-th feature is selected (or not), and H = {0,1}^d denotes the set of all possible solutions. Note that the solution space cardinality is |H| = 2^d, which grows exponentially with respect to the total number of features. When we select k features, the solution space becomes H_k = {h | h ∈ {0,1}^d, Σ_{i=1}^d h_i = k} with cardinality |H_k| = C(d,k) (the binomial coefficient), which also grows very fast with d.

Usually, there is no closed form expression for J(h). Moreover, J(h) can be viewed as a function of many factors which is difficult to optimize [23]:

J(h) = E[ J( h, D_Xy(h, X, y), C(D_Xy(h, X, y), c), ξ ) ]    (2)

where D_Xy denotes the data (depending on h, X, y), C denotes the estimated classifier (depending on the data D_Xy and a predetermined classifier type c), ξ denotes the evaluation settings (depending on the way of partitioning the data into training, validation and test sets), and J denotes the test accuracy when everything (selected subset, data, classifier, partition setting) is provided. Since the partition brings randomness, J(h) is defined as the expectation of J with respect to the random partition. To simplify the following discussions, we write J(h, D_Xy(h, X, y), C(D_Xy(h, X, y), c), ξ) as J(h, ξ). Fig. 1 illustrates the process of evaluating a feature subset. The evaluation process is very similar to the standard wrapper approach [2]. The "Data matrix" is used to evaluate feature subsets (selection vectors), and feature selection should be performed on other data; performing feature selection and evaluation on the same data leads to over-optimistic results due to over-fitting [32].


Fig. 1. Supervised feature evaluation process. The selection vector h selects variables in the data matrix [X; y] to form a data submatrix, which is then partitioned into training, validation and test sets. Given a classifier type, a classifier can be trained using the training data (and the validation data, if the classifier has free parameters). The classification accuracy is computed on the test data. Since there is randomness in the partition, the training-test process is repeated several times and the average test accuracy is reported.
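To make this evaluation process concrete, the following is a minimal sketch (ours, not the authors' code) of the crude and accurate models: a single random training-test split yields one noisy sample J(h, ξ), and averaging several splits approximates J(h). It assumes NumPy and scikit-learn with a 1-nearest-neighbor classifier, a d×m feature matrix X, and the helper name evaluate_subset is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def evaluate_subset(h, X, y, n_splits=1, test_size=0.4, seed=None):
    """Estimate J(h): test accuracy of the features selected by h.

    h : 0/1 vector of length d (selection vector)
    X : d x m feature matrix, y : length-m label vector
    n_splits=1 gives the crude model (one J(h, xi) sample); a larger
    n_splits averages several random partitions (the accurate model).
    """
    rng = np.random.RandomState(seed)
    Xsub = X[np.asarray(h, dtype=bool), :].T      # m x k submatrix
    accs = []
    for _ in range(n_splits):
        Xtr, Xts, ytr, yts = train_test_split(
            Xsub, y, test_size=test_size,
            random_state=rng.randint(2**31 - 1))
        clf = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
        accs.append(accuracy_score(yts, clf.predict(Xts)))
    return float(np.mean(accs))

# crude model: one split; accurate model: average of, say, 10 splits
# J_crude = evaluate_subset(h, X, y, n_splits=1)
# J_acc   = evaluate_subset(h, X, y, n_splits=10)
```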

Fig. 2. Comparison of the effect of random partitioning and individual feature addition or removal. [Four panels: datasets "usps" and "yaleb32"; x-axis: remove/add one feature; y-axis: accuracy; curves: random, remove/add, random average 1, random average 2, remove average, add average, border.] See the text for details.




In many wrapper feature selection methods, heuristic search such as greedy forward selection and greedy backward elimination is very popular. When there are a lot of features (hundreds, thousands, or more), the contribution of each feature to the prediction can be very small. In the simulation-process formulation, these wrapper methods can be ineffective since it is difficult to distinguish the effect of randomness from the effect of sequential feature inclusion or exclusion. This is the reason why wrapper methods require repeated simulations to remove the effect of randomness. Fig. 2 shows the comparison of random partitioning and individual feature addition or removal. Suppose the task is to select k (from d) features. Given the data matrix X ∈ R^{d×m} and the label vector y ∈ {1, ..., l}^m, we randomly pick a solution h ∈ H_k = {h | h ∈ {0,1}^d, Σ_{i=1}^d h_i = k}. With this solution, we compute the test accuracies with d different partitions of the data. We set m_tr/m_ts = 0.6/0.4 and k ≈ 0.4d, where m_tr and m_ts denote the numbers of training and test samples, respectively. To analyze the effects of random partitioning, we choose the default setting of predictors from the PMTK package (http://code.google.com/p/pmtk3/), which does not require a validation set to learn the hyperparameters. In the experiments, we find that the classification accuracy is not sensitive to different settings of m_tr and m_ts. Since the training of predictors is important and also determines the test accuracy, we set m_tr to be a little bigger than m_ts. Other settings, e.g., m_tr/m_ts = 0.5/0.5, return similar results in our intermediate tests. With the deterministic partition m_tr/m_ts = 0.6/0.4, we compute the accuracies of removing or adding one feature. The removal is performed in the k selected features, and the addition is performed in the remaining d−k features. Then, we get d new solutions: k solutions which select k−1 features, and d−k solutions which select k+1 features. We compute the accuracies of these d solutions with the deterministic partition. Fig. 2 shows the comparison of the effect of random partitioning (change of ξ in J(h, ξ)) and individual feature addition or removal (change of h in J(h, ξ)):


• Given h, the black squares are the accuracies of one solution in d randomly partitioned training-test datasets: J(h, ξ_1), J(h, ξ_2), ..., J(h, ξ_d). The dashed and solid black lines are the averages over the first k partitions, namely (1/k) Σ_{i=1}^k J(h, ξ_i), and over the following d−k random partitions, namely (1/(d−k)) Σ_{i=k+1}^d J(h, ξ_i).
• Given ξ, the red squares are the accuracies of d solutions where a feature is removed: J(h_{r_1}, ξ), J(h_{r_2}, ξ), ..., J(h_{r_k}, ξ), or added: J(h_{a_1}, ξ), J(h_{a_2}, ξ), ..., J(h_{a_{d−k}}, ξ), where the green line is the border between removing a feature (J(h_{r_i}, ξ)) and adding a feature (J(h_{a_j}, ξ)). The dashed and solid red lines are the average accuracies over the k solutions (selecting k−1 features), namely (1/k) Σ_{i=1}^k J(h_{r_i}, ξ), and over the d−k solutions (selecting k+1 features), namely (1/(d−k)) Σ_{j=1}^{d−k} J(h_{a_j}, ξ).

The figures in the first row correspond to the results on the datasets ("usps", "yaleb32") with the nearest neighbor classifier. The figures in the second row correspond to the results with the SVM classifier.

In datasets with a smaller number of features (e.g., several hundred), adding one feature increases the prediction accuracy more than removing one feature on average, e.g., "usps" (d = 256, m = 9298). In datasets with a larger number of features (e.g., over one thousand), the difference between adding and removing one feature is not obvious even at the average level, e.g., "yaleb32" (d = 1024, m = 2414). In both cases, the variance of the accuracy under individual feature addition or removal is smaller than that under random training-test partitioning. The result also holds for different classifiers (e.g., nearest neighbor and SVM). Therefore, wrapper models such as greedy feature selection have high complexity even in each step, since they require training classifiers repeatedly to average the prediction accuracies.
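The following sketch (ours, not from the paper) mimics this comparison: it fixes one randomly drawn subset and varies the partition d times, then fixes one partition and flips each feature in turn, and compares the two accuracy variances. It assumes scikit-learn's 1-NN as the predictor (the paper uses PMTK defaults); the function names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def test_accuracy(h, X, y, random_state):
    """One-shot test accuracy of the subset selected by h (crude model)."""
    Xsub = X[np.asarray(h, dtype=bool), :].T
    Xtr, Xts, ytr, yts = train_test_split(Xsub, y, test_size=0.4,
                                          random_state=random_state)
    clf = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
    return clf.score(Xts, yts)

def partition_vs_neighbor_variance(X, y, k=None, seed=0):
    d = X.shape[0]
    k = k or int(0.4 * d)
    rng = np.random.RandomState(seed)
    h = np.zeros(d, dtype=int)
    h[rng.choice(d, size=k, replace=False)] = 1

    # (a) fix h, vary the training-test partition d times
    acc_partition = [test_accuracy(h, X, y, random_state=i) for i in range(d)]

    # (b) fix one partition, remove each selected / add each unselected feature
    acc_neighbor = []
    for j in range(d):
        h2 = h.copy()
        h2[j] = 1 - h2[j]            # remove if selected, add otherwise
        acc_neighbor.append(test_accuracy(h2, X, y, random_state=0))

    return np.var(acc_partition), np.var(acc_neighbor)
```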

The properties of the above supervised feature selection problem can be summarized as follows.

1. The evaluation process J(h, ξ) is performed by simulation with randomness, and there is no analytical model for an accurate description. Computing E[J(h, ξ)] should be performed by averaging repeated simulations, which brings high computational complexity to search-based wrapper models.

2. The solution space H is discrete, and grows fast with the number of features. This makes it very difficult to search for good solutions over the entire space.

3. Softening the feature selection goal with ordinal optimization

3.1. Ordinal optimization preliminaries

This section provides a short introduction to ordinal optimization. We refer interested readers to [29,30] for details. Without loss of generality, we modify some notations and formulations to make them consistent with this paper.

Fig. 3. Important concepts in ordinal optimization. The left figure represents the relationships of different sets in the selection process of OO. The right figures illustrate some typical shapes of the OPC (first row): Steep-class, Neutral-class, and Bell-class [29,30]. The names correspond to the shapes of the performance distributions (second row).


Ordinal optimization (OO) is a methodology for performance evaluation and optimization in complex human-made systems, usually known as Discrete Event Dynamic Systems (DEDS). Specifically, DEDS optimization deals with maximizing the expected performance, namely max_{h∈H} J(h), where J(h) = E[J(h, ξ)] ≈ (1/n) Σ_{i=1}^n J(h, ξ_i), h is the vector of system design parameters, H denotes the search or design space for h, J represents the performance of the simulation model, ξ denotes the randomness during a simulation process, and ξ_i denotes the randomness during the i-th simulation. The properties of this kind of optimization problem are as follows.

1. The evaluation process J(h, ξ) is performed by random simulation, which is usually complex, and there is no analytical model for an appropriate description. Moreover, accurate evaluation (computing E[J(h, ξ)]) should be performed by averaging repeated simulations to remove randomness (e.g., Fig. 1), where the computational burden is very heavy.

2. The solution space H is discrete, and grows fast (usually exponentially) with the size of the problem (e.g., the number of variables). This is similar to the class of NP-hard problems, where the difficulty cannot currently be overcome by mathematics and computers.

The OO analysis is usually based on a solution subset H_N = {h_1, h_2, ..., h_N}, where N denotes the number of designs uniformly chosen from H. The OO task is to select a subset S such that the alignment probability P(|G ∩ S| ≥ k) is big enough (e.g., 0.95), where the order-good-enough set G denotes the top-g% (e.g., top-5%) designs in H_N (or sometimes H), and S denotes the s solutions selected from H_N using some selection rule (Fig. 3). The solutions in H_N can be sorted as J(h_(1)) ≤ J(h_(2)) ≤ ... ≤ J(h_(N)), where h_(i) ∈ H_N is the i-th solution. The order-good-enough solution set is defined as

G_o = { h_(i) | i/N ≥ 1 − g% }.    (3)

Typical selection rules include the blind pick rule, which selects s solutions randomly from H_N, and the horse race rule, which selects the top-s solutions from H_N. Note that both rules only require evaluating the performance of a design using a crude model, e.g., using J(h) ≈ (1/n) Σ_{i=1}^n J(h, ξ_i) where n is small. In the blind pick rule, the alignment probability can be computed analytically. In the horse race rule, the alignment probability has no closed form solution, and it also depends on the nature of the problem, which is usually characterized by the shape of the ordered performance curve (OPC). The OPC is the plot of the ordered performances against the designs.



Practically, we can estimate it via the sorted J(h_(1)) ≤ J(h_(2)) ≤ ... ≤ J(h_(N)) against the corresponding designs h_(1), h_(2), ..., h_(N). If we normalize both axes to [0,1] by b_i = [J(h_(i)) − J(h_(1))] / [J(h_(N)) − J(h_(1))] and a_i = (i−1)/(N−1), we get the normalized OPC (the plot of b_i against a_i), based on which typical types of problems can be defined, such as Steep-class and Bell-class, as illustrated in Fig. 3.

Given the alignment probability P, N, the type of normalized OPC and the variance of the performance simulation, the selection set S and its size can be identified with a function Z(g, k), which has been tabulated (the universal alignment tables, UAPs) in the literature [30]. Existing OO research shows that H_N can be used to represent the entire design space, since it has been proved that the top-g% designs in H_N covered by S have a high overlap with the really top-g% designs in H with large probability [30]. The size N of H_N depends on g%, and N ≥ 1000 is sufficiently large for g% to be very small, e.g., 2%. Let us take a simple example. Consider the N solutions in H_N. The probability that at least one of them is in the top-g% solution set in H is p = 1 − (1 − g%)^N. Letting p > 0.99 and g% = 0.02, we have N ≥ 228.
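The arithmetic of this example can be checked with a one-line calculation; the snippet below (ours) computes the smallest N for a given g% and confidence p.

```python
import math

def min_samples_for_top_g(g=0.02, p=0.99):
    """Smallest N with P(at least one of N uniform samples is top-g%) > p."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - g))

print(min_samples_for_top_g(0.02, 0.99))  # -> 228
```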

3.2. Feature selection as ordinal optimization

The supervised feature selection problem and the problems that OO aims to solve exhibit a lot of similarity. Note that we have already changed the notations in OO to make them consistent with the feature selection problem. Moreover, the difficulties in supervised feature selection are almost identical to those in OO problems. A natural idea is therefore to apply OO directly to supervised feature selection.

To achieve this, the first step is to soften the goal of feature selection. The task of supervised feature selection is changed as follows. Given a d×m data matrix X ∈ R^{d×m} and the corresponding label vector y, identify a subset of features such that the prediction accuracy with respect to a given classifier is order-good-enough among all possible values. Following the OO idea, "order-good-enough" means that the test accuracy of the selected feature subset ranks in the top-g% (e.g., top-5%) among the test accuracies of all possible subsets. The horse race rule is preferred since it utilizes more knowledge and usually selects better solutions than the blind pick rule. In practice, the crude and accurate models in the feature selection problem can be defined as Ĵ(h) = J(h, ξ) (e.g., a one-time test accuracy) and J(h) = (1/n) Σ_{i=1}^n J(h, ξ_i) (the average of multiple test accuracies), respectively. Algorithm 1 illustrates the framework of using OO for supervised feature selection. The solution space can denote the unconstrained space H or the constrained space H_k. In the last step, accurate evaluation of the solutions in S can be performed to identify the selected features.

Algorithm 1. Feature selection with ordinal optimization.

1: Select N solutions uniformly from the solution space: H_N = {h_1, h_2, ..., h_N}.
2: for i = 1 : N do
3:   Evaluate the performance of solution h_i using the crude model: Ĵ(h_i).
4: end for
5: Sort the solutions according to their estimated performances as Ĵ(h_(1)) ≤ Ĵ(h_(2)) ≤ ... ≤ Ĵ(h_(N)).
6: According to the normalized OPC, the noise level of the crude model and the universal alignment tables in OO theory, select the top s solutions from the N solutions as the selected set S.
7: Estimate the optimal feature subset from the selected set S by accurate performance evaluation.

It is easy to analyze the complexity of Algorithm 1. Given the data, a predictor and the number of selected features, the complexity of evaluating the performance of one solution using the crude model can be viewed as a constant C. Suppose the complexity of the last step is E; then the complexity of Algorithm 1 is O(NC + N log N + E). Since the cardinality of S is small, evaluating the solutions in S has small complexity. We can see that the loop in Algorithm 1 can be performed in parallel, which reduces the time complexity greatly if we have multiple evaluation machines.

4. Analysis of feature selection solution space

In the previous section, we have shown that the supervised feature selection problem can be solved by OO when the goal is softened from numerical optimization to ordinal optimization. However, the original supervised feature selection is a numerical optimization problem, where the goal is to select a feature subset such that the prediction accuracy value is maximized. In this section, we show that the selected feature subset which is order-good-enough can also be value-good-enough when the solution space exhibits some structure. The value-good-enough solution set is defined as

G_v = { h : |J(h) − J(h^o)| / [J(h^o) − J(h_o)] < g% }

where h^o and h_o are the optimal solutions to max_{h∈H} J(h) and min_{h∈H} J(h), respectively. For a solution h in G_v, we have |J(h) − J(h^o)| < g% [J(h^o) − J(h_o)] ≤ g%, or J(h) > J(h^o) − g%, which is good enough when g% is small (e.g., 5%) in many applications. The central idea is to analyze the OPC class of the feature selection solution space, which builds a close connection between order-good-enough and value-good-enough solutions.

4.1. The structure of the unconstrained solution space

The OPC class in OO is usually estimated empirically by computing the performances of a set of solutions. It is very difficult to analyze the shape of the OPC theoretically in the solution space. To overcome this difficulty, we analyze the theoretical OPC for the solution space H by decomposing it as

H = ∪_{k=0}^{d} H_k    (4)

where H_k = {h | h ∈ {0,1}^d, Σ_{i=1}^d h_i = k}. We then define the average performance of the solutions in H_k as

J(H_k) = (1/|H_k|) Σ_{h ∈ H_k} J(h).    (5)

The average performances can be sorted as J(H_(0)) ≤ J(H_(1)) ≤ ... ≤ J(H_(d)). In many pattern recognition datasets, additional features usually provide more discriminative information for classification (on average). Statistically speaking, the monotonicity condition J(H_i) ≤ J(H_j) (for all 0 ≤ i < j ≤ d), which is often used in some feature selection methods (e.g., branch and bound based algorithms [33]), can be assumed under some bias-variance trade-off. In the experiments, we will check this assumption on real-world datasets. We define the Decomposition OPC (DOPC) as the piecewise linear curve connecting the values J(H_k) (0 ≤ k ≤ d) on the y-axis, where the corresponding x-axis position is the center of the space H_k. Fig. 4 shows the DOPC when d = 5.

Another difficulty in analyzing the theoretical OPC lies in the normalization. Note that the normalized OPC plays a central role in both OO theory and this paper. In OO, the horse race rule depends on the structure of the problem, i.e., the shape of the normalized OPC. In the following, the connection between numerical and ordinal optimization is also based on the normalized OPC.


Since a normalized OPC with a concave shape indicates that order-good-enough solutions (top-g% in performance order) are also value-good-enough solutions (the relative difference from the optimal performance is bounded by g%), we study the conditions under which the normalized OPC is concave. Theorem 4.1 shows that when the increase of J(H_i) is bounded by some function, the DOPC is concave.

Theorem 4.1 (concavity of the DOPC). Suppose J(H_i) ≤ J(H_j) for all 0 ≤ i < j ≤ d, and denote the ratio of performance increase as

R_k ≜ [J(H_{k+1}) − J(H_k)] / [J(H_k) − J(H_{k−1})].

When R_k ≤ D_k, the DOPC is concave, where D_k = −1 + (d+2)/(k+1) (1 ≤ k ≤ d−1).

Proof. The slope of the k-th (k = 1, ..., d) line segment is

l_k = [J(H_k) − J(H_{k−1})] / { (1/2) [ C(d,k) + C(d,k−1) ] }.

So we can compute

l_{k+1} / l_k = (1/D_k) · [J(H_{k+1}) − J(H_k)] / [J(H_k) − J(H_{k−1})]

where D_k = −1 + (d+2)/(k+1) (1 ≤ k ≤ d−1). Note that we have D_k D_{d−k} = 1, and D_k decreases with k:

d/2 = D_1 > D_2 > ... > D_{d−2} > D_{d−1} = 2/d.

The assumption in the theorem implies that l_{k+1}/l_k ≤ 1, which is sufficient for the DOPC to be concave. □
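As a quick numerical sanity check of the theorem (ours, not from the paper), the sketch below builds a synthetic sequence J(H_0), ..., J(H_d) whose increase ratios satisfy R_k ≤ D_k and verifies that the DOPC slopes are non-increasing, i.e., the curve is concave. The values d = 10 and the factor 0.9 are arbitrary choices.

```python
import numpy as np
from math import comb

def dopc_is_concave(J, d):
    """Check that the DOPC built from J(H_0),...,J(H_d) has non-increasing slopes."""
    widths = [0.5 * (comb(d, k) + comb(d, k - 1)) for k in range(1, d + 1)]
    slopes = [(J[k] - J[k - 1]) / widths[k - 1] for k in range(1, d + 1)]
    return all(slopes[k + 1] <= slopes[k] + 1e-12 for k in range(len(slopes) - 1))

d = 10
D = [-1 + (d + 2) / (k + 1) for k in range(1, d)]        # D_k, 1 <= k <= d-1
# build an increasing sequence whose increments satisfy R_k = 0.9 * D_k <= D_k
J = [0.5, 0.6]
for k in range(1, d):
    J.append(J[-1] + 0.9 * D[k - 1] * (J[-1] - J[-2]))
print(dopc_is_concave(J, d))   # True under the theorem's condition
```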

Theorem 4.1 characterizes the approximate OPC analytically for some supervised feature selection problems. It can be seen that a stronger condition for the DOPC to be concave is R_k ≤ 2/d (1 ≤ k ≤ d−1), which implies that the performance increase decays exponentially with k: J(H_{k+1}) − J(H_k) ≤ [J(H_1) − J(H_0)] (2/d)^k. This can be true in some applications, where the performance increase is very slow with respect to new features since the discriminative ability of the old features is already "saturated". Fig. 5 shows the effect of increasing the number of pixels on visual discrimination. When there are enough pixels, we can distinguish different classes without difficulty. The classification performances with respect to different numbers of pixels are compared in the experiments.

Fig. 4. Illustration of the DOPC (d = 5). The size of each subspace H_k is |H_k| = C(d,k), k = 0, 1, ..., d.

Based on the concavity assumption on the approximate OPC, we can analyze the property of OOFS from the perspective of numerical optimization. Since concavity is preserved under affine mappings (if b = f(a) is concave, then b = β f(a/α) is concave for α > 0, β > 0) [34], a concave OPC also implies a concave normalized OPC. Both axes of the OPC can be normalized to [0,1] by b_i = [J(h_(i)) − J(h_(1))] / [J(h_(|H|)) − J(h_(1))] = f(a_i), a_i = (i−1)/(|H|−1) (i = 1, ..., |H|), where b = f(a) (a, b ∈ [0,1]) is a continuous function which is assumed to be strictly increasing and approximates the normalized OPC.

The local shape of the normalized OPC b = f(a) plays an important role in connecting ordinal and numerical optimization. Specifically, df(a)/da ≤ 1 (for all 1 − g% ≤ a ≤ 1) implies that a top-g% solution in order must be in the top-g% solution set in value. According to the mean value theorem, there exists a unique point a_0 such that df(a)/da |_{a=a_0} = 1 when b = f(a) (a, b ∈ [0,1]) is strictly increasing and concave (Fig. 6, left). Generally speaking, a_0 can be arbitrarily close to 1, which makes the numerical optimization difficult (Fig. 6, middle). However, in the feature selection problem we usually have f(1/2) − f(0) ≥ f(1) − f(1/2). This is because the performance increase from the empty feature set to half of the features is usually bigger than that from half of the features to the full feature set, which is similar to the assumptions in Theorem 4.1. Therefore, f(1/2) > 1/2. If b = f(a) (a, b ∈ [0,1]) is strictly increasing, concave, and f(1/2) > 1/2, it is easy to show that there exists a unique point a_0 < 1/2 such that df(a)/da |_{a=a_0} = 1 (Fig. 6, right). This means that for all 0 ≤ g ≤ 50, a top-g% solution in order must be in the top-g% solution set in value.

4.2. The structure of the constrained solution subspace

In constrained feature selection, the number of selected features is constrained to a predetermined number k. The solution space becomes H_k = {h | h ∈ {0,1}^d, Σ_{i=1}^d h_i = k}, which is a subspace of H. The size of the subspace is |H_k| = C(d,k) = d! / [k!(d−k)!]. It is extremely big for some k when d increases. It is also easy to select solutions uniformly from the subspace H_k. Similarly to the previous analysis, OOFS can be used to select a top-g% (in order) feature subset from H_k.

To estimate the numerical optimality of these order-good-enough solutions, it is necessary to analyze the structure of the subspace in constrained feature selection. Generally, it is very difficult to determine the shape of the OPC since it depends heavily on the dataset type and on d and k. This section provides an approximate analysis of the theoretical OPC in H_k. The conclusion is that the OPC in H_k is usually similar to the Bell-class. This means that the OO solutions to constrained feature selection may be inferior to the top-g% (in value) solutions, but are still good in the sense that they are at least moderate solutions.

Define the local change of a solution subset by the gradient of the performance as

[∇J(a)]_i = ∂J(a)/∂a_i ≈ [J(a_{−i}, ā_i) − J(a_{−i}, a_i)] / (ā_i − a_i)    (6)

where a ∈ H and ā_i is the Boolean complement of a_i. Then J(h) can be approximated locally as

J(h) ≈ Ĵ(h) = J(a) + [∇J(a)]^T (h − a)    (7)

which is similar to the first order Taylor expansion of a continuously differentiable function. Eq. (7) incorporates some previous ideas as special examples.

• a = 0 corresponds to some filter methods, where the importance of each feature is measured by its individual contribution to the classification accuracy [1,35].

Fig. 5. Illustration of the monotonicity condition in the entire solution space. [Panels: "coil100" with k/d = 0.1, 0.3, 0.5, 0.7.] The figures visualize different numbers of selected features (k/d) in the "coil100" dataset (d = 1024). In each figure, the horizontal direction denotes ten different selected feature subsets in H_k, and the vertical direction denotes ten samples from different classes.

Fig. 6. Different concave functions. a_0 denotes the point where f′(a_0) = 1.


• a = 1 corresponds to the leave-one-out strategy, where the importance of each feature is measured by the performance change when it is removed from the full feature set [36,37].

Since Eq. (7) is a local approximation of J(h), and the performances of the subspaces can be different (J(H_k) changes with k), it is more accurate to restrict h, a ∈ H_k. [∇J(a)]_i (denoted as g_i for simplicity in the following) describes the relative importance of each feature. In the following, we show that the distribution of Ĵ(h) = g^T h + C (where C = J(a) − g^T a is a constant and h ∈ H_k) is usually "unimodal", which implies a Bell-class OPC for J(h).

In Ĵ(h), g is deterministic and h is random in H_k, which means selecting k entries of g randomly. From another perspective, the g_i can be viewed as samples from some underlying distribution, and computing Ĵ(h) becomes estimating the sum of k samples from that distribution. When there are a large number of features (e.g., in images), the contribution of each feature (e.g., each pixel) to the classification performance is very small and similar, so the distribution of the g_i can be viewed as uniform or Normal. When g_i (i = 1, ..., d) follows a uniform distribution, Ĵ(h) follows a Normal distribution approximately (by the central limit theorem). When g_i (i = 1, ..., d) follows a Normal distribution, Ĵ(h) is also Normal. In both cases, the distribution of Ĵ(h) exhibits a "unimodal" pattern, and the corresponding OPC exhibits an inverse-S shape (Fig. 7).
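The following sketch (ours) reproduces this kind of simulation in the spirit of Fig. 7: it draws the per-feature contributions g_i from a uniform or Normal distribution, or uses a 0–1 sequence with s important features, and returns the normalized OPC of Ĵ(h) = g^T h + C over random h ∈ H_k. The distribution parameters are arbitrary choices.

```python
import numpy as np

def simulated_opc(d=1024, k=400, N=1000, dist="uniform", s=800, seed=0):
    """Normalized OPC of J_hat(h) = g^T h + C for random h in H_k."""
    rng = np.random.RandomState(seed)
    if dist == "uniform":
        g = rng.uniform(0.0, 1.0, size=d)
    elif dist == "normal":
        g = rng.normal(1.0, 0.2, size=d)
    else:                                   # 0-1 sequence with s important features
        g = np.concatenate([np.ones(s), np.zeros(d - s)])
    vals = []
    for _ in range(N):
        h = np.zeros(d)
        h[rng.choice(d, size=k, replace=False)] = 1.0
        vals.append(g @ h)                  # constant C omitted: it only shifts the curve
    vals = np.sort(vals)
    return (vals - vals[0]) / (vals[-1] - vals[0])   # normalized OPC

# opc = simulated_opc(dist="normal")  # an inverse-S (Bell-class) shape is expected
```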

When some important features contribute obviously more than others, Ĵ(h) still exhibits a "unimodal" pattern. Suppose g_i = a (or g_i = 0) if feature i is important (or not) for classification, and there are s important features. Then Ĵ(h) follows a unimodal hypergeometric distribution

P( Ĵ(h) = C + a k_0 ) = C(s, k_0) C(d−s, k−k_0) / C(d, k)    (8)

where max(0, k − (d − s)) ≤ k_0 ≤ min(s, k). It can be shown that P(Ĵ(h) = C + a k_0) is a unimodal function of k_0, which is maximal when k_0 is the largest integer that does not exceed (s+1)(k+1)/(d+2). We usually have s > k > d − s, and therefore k − (d − s) ≤ k_0 ≤ k. So P(Ĵ(h) = C + a k_0) is increasing for most k_0, which means that the OPC of Ĵ(h) is between the Bell-class and the Steep-class.

We visualize some typical OPCs estimated from real-world datasets. Given the data matrix X ∈ R^{d×m} and label vector y ∈ {1, ..., l}^m, we randomly pick N solutions H_N = {h_1, h_2, ..., h_N}, where h_i ∈ H_k, i = 1, 2, ..., N. Given a classifier c, the prediction accuracy of each feature subset is estimated as Ĵ(h_1), Ĵ(h_2), ..., Ĵ(h_N) using a one-time partitioning of the data. These performances are then sorted as Ĵ(h_(1)) ≤ Ĵ(h_(2)) ≤ ... ≤ Ĵ(h_(N)). The estimated OPC is the plot of Ĵ(h_(i)) against the index i. We estimate the OPCs for different k (k ≤ d) and visualize these OPCs together. Specifically, we set k = [k_1, k_2, ..., k_I], which forms an approximately arithmetic sequence. In the experiments, we set k_1 = 1, k_I = d, and k_{i+1} − k_i ≈ (k_I − k_1)/I (i = 1, 2, ..., I). For example, when d = 1024, we have k = 1, 115, 228, 342, 456, 569, 683, 797, 910, 1024.
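A sketch of this estimation procedure is given below (ours, not the authors' code); it assumes scikit-learn's 1-NN classifier and approximates the arithmetic grid of k values with numpy.linspace.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def estimated_opcs(X, y, I=10, N=1000, seed=0):
    """For each k on an (approximately) arithmetic grid, sample N subsets
    from H_k, score each with a one-time split, and return the sorted
    accuracies (one estimated OPC per k)."""
    d = X.shape[0]
    rng = np.random.RandomState(seed)
    ks = np.unique(np.linspace(1, d, I).astype(int))
    opcs = {}
    for k in ks:
        accs = []
        for _ in range(N):
            h = np.zeros(d, dtype=bool)
            h[rng.choice(d, size=int(k), replace=False)] = True
            Xtr, Xts, ytr, yts = train_test_split(
                X[h, :].T, y, test_size=0.4,
                random_state=rng.randint(2**31 - 1))
            clf = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
            accs.append(clf.score(Xts, yts))
        opcs[int(k)] = np.sort(accs)
    return opcs
```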

Fig. 8 shows the estimated OPCs of different constrained spaces in real-world datasets. It can be seen that the average performance increases with k_i approximately, though a solution in H_{k_i} can be better than a solution in H_{k_j} even when k_i < k_j. For each k_i, the OPC often exhibits a Bell-class shape. The results are consistent with our analysis of the solution spaces. Moreover, it can be seen that for some k_i the performances vary a lot over the N solutions (e.g., "ionosphere", "mfeat-fou"), while for other k_i the performances change little (e.g., "usps", "yaleb32"). On the one hand, this variance of performance reflects the flexibility of the OO approach in identifying representative solutions. On the other hand, the variance also shows the "saturation" phenomenon in performance: when k increases, it becomes more and more difficult to increase the prediction accuracy.

5. Toward better feature subset construction

In the previous section, we have analyzed the OPC class of the feature selection space, including the unconstrained space and the subspaces where the number of features is constrained. The analysis shows that the order-good-enough solutions returned by OOFS may also have good classification accuracies. Similar to ordinal optimization theory, OOFS computes the noise level of the crude model, which reflects the confidence interval of the performance estimate, and then calculates the size of the selected solution set from the normalized OPC class and the universal alignment tables [30]. The drawback of this method is that the best solution that OOFS can achieve is no better than the really best solution in H_N = {h_1, h_2, ..., h_N}. To overcome this limitation, we aim to construct solutions that may outperform all the solutions in H_N.

Fig. 7. Simulation results of the OPCs in constrained spaces. The figures show the normalized OPCs of Ĵ(h) = g^T h + C, where h ∈ H_k = {h | h ∈ {0,1}^d, Σ_{i=1}^d h_i = k}; g_i is drawn from a distribution (left: uniform; middle: Normal) or g ∈ R^{d×1} = [1_{1×s}, 0_{1×(d−s)}]^T is a 0–1 sequence (right). We set d = 1024, s = 800, and k = 50, 100, 150, ..., 750, 800.



The idea is to estimate the location of the globally optimal solution in the solution space. The assumption is that solutions with similar performances should exhibit similar patterns in the selected feature subsets, which means that these solutions are close in the solution space. Specifically, the globally optimal solution (denoted as h^o) can be viewed as an unknown point in the solution space. The uniformly selected solutions H_N = {h_1, h_2, ..., h_N} can be viewed as random points around h^o, where the similarity between h_i and h^o is roughly described by the performance Ĵ(h_i) (i = 1, ..., N). The feature selection task can then be viewed as identifying the point h^o from a set of points h_1, h_2, ..., h_N whose distances (inverse similarities) to h^o have been estimated approximately by the performances. Algorithm 2 implements the above idea using a simple voting strategy. Similar to many existing feature selection algorithms, Algorithm 2 returns the scores of features, where the scores quantitatively describe which features should be selected.

Fig. 8. Estimated OPCs of different constrained spaces in real-world datasets. From left to right: "ionosphere" (d = 34, m = 351), "usps" (d = 256, m = 9298), "yaleb32" (d = 1024, m = 2414), "mfeat-fou" (d = 76, m = 2000). In each dataset, the estimated OPCs (solid lines) of different constrained spaces H_k (k = [k_1, k_2, ..., k_I] = [1, ..., d], I = 10) are shown between each two dashed lines. For each k_i, N = 1000 solutions are uniformly generated from H_{k_i}, and the sorted performances are connected as an approximation of the underlying OPCs. The dash-dot lines in each figure are plots of all the sorted NI performances.

Algorithm 2. Feature Scoring Algorithm.

1: Run Algorithm 1 and get the sorted solutions and their estimated performances: $\hat{J}(\theta_{(1)}) \le \hat{J}(\theta_{(2)}) \le \cdots \le \hat{J}(\theta_{(N)})$.
2: Concatenate the top $t$ solution vectors as a matrix $H(t) = [\theta_{(N-t+1)}, \theta_{(N-t+2)}, \ldots, \theta_{(N)}]$.
3: Compute the column mean of $H(t)$ as $s = \mathrm{mean}(H(t), 2)$.
4: Return the $d$-dimensional vector $s$ as the scores of the features.
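For concreteness, the following is a small Python rendering of Algorithm 2 (the paper's experiments use MATLAB; this sketch, with names of our own choosing, only assumes that the $N$ sampled solution vectors and their estimated performances are already available as NumPy arrays):

import numpy as np

def score_features(solutions, perf, t):
    """Algorithm 2 as described above: average the top-t ranked solutions.

    solutions: (N, d) 0/1 matrix, one candidate feature subset per row.
    perf:      (N,) estimated performances J_hat of the candidates.
    t:         number of top ranked solutions used for voting.
    Returns a d-dimensional score vector s with entries in [0, 1].
    """
    order = np.argsort(perf)          # ascending, as in the algorithm
    top = solutions[order[-t:]]       # the t best candidates
    return top.mean(axis=0)           # per-feature voting frequency

# toy usage: d = 20 features, N = 500 random candidates
rng = np.random.default_rng(0)
cand = rng.integers(0, 2, size=(500, 20))
perf = rng.random(500)                # stand-in for estimated accuracies
s = score_features(cand, perf, t=50)
selected = np.argsort(s)[-5:]         # e.g. keep the 5 highest-scoring features

Here each solution is stored as a row rather than as a column of $H(t)$, so the column mean of the algorithm corresponds to the mean over axis 0.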

The score vector $s$ plays an important role in constructing better solutions than OOFS. Intuitively, larger entries in $s$ imply more votes from the top ranked solutions, which suggests a higher possibility of selecting the corresponding features. Specifically, when $s_i$ is large (or small), it is reasonable to estimate that the $i$th element of the globally optimal solution, $\theta^o_i$, is 1 (or 0). This means that the difference between the score $s_i$ and the globally optimal solution $\theta^o_i$ is small. We aim to provide conditions under which the expectation of the difference between $s_i$ and $\theta^o_i$ can be bounded, which characterizes the reasonableness of Algorithm 2 in achieving better solutions. For example, if

$E|s_i - \theta^o_i| < b$


holds, then we have

$Es_i - b < \theta^o_i < Es_i + b,$

which provides very useful information for estimating $\theta^o_i$ when $b$ is small. We prove that, when the performance difference increases with the solution difference (distance), the expectation of the difference between $s_i$ and $\theta^o_i$ can be bounded. In the following subsections, we discuss the results in unconstrained and constrained solution spaces.

5.1. Scoring features in the unconstrained space

Algorithm 2 summarizes the contribution of each feature to maximizing the prediction performance by the voting of the top ranked solutions. These contributions are represented as a score vector $s$. In this subsection we show that, when the performance differences of solutions increase with the solution differences, the expectation of the score vector $s$ in Algorithm 2 is close to the globally optimal solution. This explains why it is reasonable to use $s$ for supervised feature selection. In the unconstrained case, we provide Theorem 5.1 and Corollary 5.1.

Theorem 5.1. Let $\theta^o$ be the optimal solution to $\max_{\theta\in\Theta} J(\theta)$, $\Theta = \{0,1\}^d$. Suppose $J(\theta) = g(D(\theta, \theta^o))$, where $g$ is a decreasing function and $D(\theta, \theta^o)$ is the Hamming distance between $\theta$ and $\theta^o$. Then we have

$E\left\{\frac{1}{t}\sum_{j=1}^{t} D\left[(\theta_{(N-j+1)})_i, \theta^o_i\right]\right\} = E\left|s_i - \theta^o_i\right| < \frac{1}{2} \qquad (9)$

where $\frac{1}{t}\sum_{j=1}^{t} D[(\theta_{(N-j+1)})_i, \theta^o_i]$ is the average Hamming distance between $\theta_{(N-j+1)}$ ($j = 1, 2, \ldots, t$) and $\theta^o$ in the $i$th component.

Proof. Consider the average Hamming distance between $\theta_{(N-j+1)}$ ($j = 1, 2, \ldots, t$) and $\theta^o$ in the $i$th feature:

$\frac{1}{t}\sum_{j=1}^{t} D\left[(\theta_{(N-j+1)})_i, \theta^o_i\right] = \begin{cases} \frac{1}{t}\sum_{j=1}^{t}\left[(\theta_{(N-j+1)})_i - \theta^o_i\right] & \text{if } \theta^o_i = 0 \\ \frac{1}{t}\sum_{j=1}^{t}\left[\theta^o_i - (\theta_{(N-j+1)})_i\right] & \text{if } \theta^o_i = 1 \end{cases} = \begin{cases} s_i - \theta^o_i & \text{if } \theta^o_i = 0 \\ \theta^o_i - s_i & \text{if } \theta^o_i = 1 \end{cases} = \left|s_i - \theta^o_i\right|.$

We partition the space $\Theta$ as

$\Theta = \cup_{i=0}^{d}\Theta_i \qquad (10)$

where $\Theta_i = \{\theta \mid \theta \in \{0,1\}^d, D(\theta, \theta^o) = i\}$ ($i = 0, \ldots, d$). We can see that $|\Theta_i| = \binom{d}{i}$. For the $N$ solutions $\Theta_N = \{\theta_1, \theta_2, \ldots, \theta_N\}$, denote $n_i = |\Theta_N \cap \Theta_i|$ ($i = 0, \ldots, d$) as the number of solutions from $\Theta_i$. Then the random vector $(n_0, n_1, \ldots, n_d)$ follows a multinomial distribution

$P(n_0, n_1, \ldots, n_d) = \frac{\left(\sum_{i=0}^{d} n_i\right)!}{\prod_{i=0}^{d} n_i!}\prod_{i=0}^{d} p_i^{n_i} \qquad (11)$

where $\sum_{i=0}^{d} n_i = N$ and $p_i = \frac{1}{2^d}\binom{d}{i}$ ($i = 0, \ldots, d$).

Since $g$ is a decreasing function and the performances $J(\theta)$ are sorted in ascending order, we can assume the top $t$ solutions are from $\Theta_i$ with $n_i$ solutions ($i = 0, \ldots, h$), and $t = \sum_{i=0}^{h} n_i$ ($h \le d$). Note that $s_i$ is a function of $h$, so we write it as $s_i(h)$. The expectation of the average distance is

$E\left|s_i(h) - \theta^o_i\right| = \sum_{n_0+\cdots+n_d=N}\left[\frac{1}{\sum_{i=0}^{h} n_i}\sum_{i=0}^{h}\frac{i}{d}\, n_i\, P(n_0, n_1, \ldots, n_d)\right].$

Since

$f(h) = \frac{\sum_{i=0}^{h} i\, n_i}{\sum_{i=0}^{h} n_i}$

is an increasing function, $0 = f(0) < 1 = f(1) < f(2) < \cdots < f(d)$ (when $n_i \ne 0$, $i = 1, \ldots, h$), we have

$E\left|s_i(h) - \theta^o_i\right| < E\left|s_i(h+1) - \theta^o_i\right| < \cdots < E\left|s_i(d) - \theta^o_i\right|$

and

$E\left|s_i(d) - \theta^o_i\right| = \sum_{n_0+\cdots+n_d=N}\left[\frac{1}{\sum_{i=0}^{d} n_i}\sum_{i=0}^{d}\frac{i}{d}\, n_i\, P(n_0, n_1, \ldots, n_d)\right] = \frac{1}{Nd}\sum_{i=0}^{d} i\left[\sum_{n_0+\cdots+n_d=N} n_i\, P(n_0, n_1, \ldots, n_d)\right] = \frac{1}{Nd}\sum_{i=0}^{d} i\,\frac{N\binom{d}{i}}{2^d} = \frac{1}{d}\sum_{i=0}^{d}\frac{i\binom{d}{i}}{2^d} = \frac{1}{d}\cdot\frac{d}{2} = \frac{1}{2},$

where we use the equality $\sum_{i=1}^{d} i\binom{d}{i} = d\,2^{d-1}$. When $g$ is a decreasing function and the performances $J(\theta)$ are sorted in ascending order, Algorithm 2 averages the solutions which are close to $\theta^o$. This orientation of selecting close solutions leads to $E|s_i(t) - \theta^o_i| < \frac{1}{2}$. $\square$

Remark 5.1. Generally, it is intuitive to conclude that $E|s_i(N) - \theta^o_i| = \frac{1}{2}$, and $E|s_i(t) - \theta^o_i| < \frac{1}{2}$ when $t < N$. To compute the expectation of $|s_i - \theta^o_i|$, it is necessary to run Algorithm 2 many times and average the estimates of $|s_i - \theta^o_i|$, which brings computational complexity and analytical difficulty. The proof of Theorem 5.1 quantitatively characterizes the monotonicity between $E|s_i(h) - \theta^o_i|$ and $h$ by decomposing the solution space $\Theta$. This means that in Algorithm 2, fewer top ranked solutions (small $t$) imply a more accurate estimation of $\theta^o$, which corresponds to ‘‘less bias’’. However, a smaller $t$ implies more variance in estimating $E|s_i - \theta^o_i|$, since fewer solution samples are used. Therefore, Theorem 5.1 and its proof suggest a trade-off in choosing $t$ to estimate $\theta^o$ efficiently.

Remark 5.2. The assumption in Theorem 5.1 can be written as

$|J(\theta^o) - J(\theta)| = J(\theta^o) - g(D(\theta, \theta^o)),$

which means that the performance difference $|J(\theta^o) - J(\theta)|$ increases monotonically with the solution distance $D(\theta, \theta^o)$. This can be approximately true in many applications. The reason is that some features (called discriminative features) are relatively more important than others (called indiscriminative features) in supervised learning, which is also the motivation for performing feature selection. Suppose the globally optimal solution corresponds to the situation where the discriminative features are selected and the indiscriminative features are not selected. The performance of a solution will be close to the global optimum when the discriminative features are selected, and far from the global optimum when the indiscriminative features are selected, which correspond to small and large solution differences with respect to the optimal solution, respectively.

Corollary 5.1. With the same notations and assumptions as in Theorem 5.1, if we select the bottom $t$ solutions in Algorithm 2, then we have $E|s_i - \theta^o_i| > \frac{1}{2}$.

In Theorem 5.1, $|s_i - \theta^o_i| < \frac{1}{2}$ implies $s_i - \frac{1}{2} < \theta^o_i < s_i + \frac{1}{2}$, which provides very useful information for estimating $\theta^o_i$. One simple strategy is taking a threshold:

$\theta^o_i = \begin{cases} 0 & \text{if } s_i < \frac{1}{2} \\ 1 & \text{if } s_i > \frac{1}{2} \end{cases} \qquad (12)$

In Corollary 5.1, $E|s_i - \theta^o_i| > \frac{1}{2}$ implies $\theta^o_i > s_i + \frac{1}{2}$ or $\theta^o_i < s_i - \frac{1}{2}$. Since $\theta^o_i \in \{0,1\}$, this provides another thresholding method to estimate $\theta^o_i$:

$\theta^o_i = \begin{cases} 1 & \text{if } s_i < \frac{1}{2} \\ 0 & \text{if } s_i > \frac{1}{2} \end{cases} \qquad (13)$

Fig. 9. Numerical validations for Theorem 5.1 and Corollary 5.1. Suppose $\theta^o \in \Theta = \{0,1\}^d$ ($d = 100$). Choose $\Theta_N = \{\theta_1, \theta_2, \ldots, \theta_N\}$ ($N = 1000$) uniformly from $\Theta$, and sort it as $[\theta_{(1)}, \theta_{(2)}, \ldots, \theta_{(N)}]$ where $D(\theta_{(1)}, \theta^o) \ge \cdots \ge D(\theta_{(N)}, \theta^o)$; $D(\theta_{(i)}, \theta^o)$ is the Hamming distance between $\theta_{(i)}$ and $\theta^o$. Denote the top $t$ solutions as $H(t) = [\theta_{(N-t+1)}, \theta_{(N-t+2)}, \ldots, \theta_{(N)}]$ and the bottom $t$ solutions as $H(-t) = [\theta_{(1)}, \theta_{(2)}, \ldots, \theta_{(t)}]$, respectively. The figure shows the plot of some statistics against different $t$. From left to right, they are: the top-score $s_i(t) = \mathrm{mean}(H(t), 2)$, the top-distance $d_i(t) = \frac{1}{t}\sum_{j=1}^{t} D[(\theta_{(N-j+1)})_i, \theta^o_i]$, the bottom-score $\bar{s}_i(t) = \mathrm{mean}(H(-t), 2)$, and the corresponding bottom-distance computed over the bottom $t$ solutions. The black curves denote those features $i$ where $\theta^o_i = 1$; the red curves denote those features where $\theta^o_i = 0$.

Fig. 9 shows numerical validations for Theorem 5.1 and Corollary 5.1 (a small simulation in the spirit of this setup is sketched after the following list):

• Theorem 5.1. $\theta^o_i = 0$ and $\theta^o_i = 1$ can be distinguished from $s_i$ by the thresholding: $\theta^o_i = 0$ if $s_i < \frac{1}{2}$ (first panel, black lines), and $\theta^o_i = 1$ if $s_i \ge \frac{1}{2}$ (first panel, red lines). $E|s_i(t) - \theta^o_i| < \frac{1}{2}$ when $1 \le t < d$; $E|s_i(t) - \theta^o_i|$ increases with $t$, and $E|s_i(d) - \theta^o_i| = \frac{1}{2}$ (second panel).
• Corollary 5.1. $\theta^o_i = 0$ and $\theta^o_i = 1$ can be distinguished from $s_i$ by the thresholding: $\theta^o_i = 1$ if $s_i < \frac{1}{2}$, and $\theta^o_i = 0$ if $s_i \ge \frac{1}{2}$ (third panel). $E|s_i(t) - \theta^o_i| > \frac{1}{2}$ when $1 \le t < d$; $E|s_i(t) - \theta^o_i|$ decreases with $t$, and $E|s_i(d) - \theta^o_i| = \frac{1}{2}$ (fourth panel).
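As a quick sanity check of Theorem 5.1 and Corollary 5.1 in the spirit of the Fig. 9 setup, the following Python sketch (our own illustration; ranking by Hamming distance to $\theta^o$ stands in for ranking by performance under a decreasing $g$) estimates the two expectations:

import numpy as np

def simulate_scores(d=100, N=1000, t=100, seed=0):
    """Monte-Carlo check: top-t score error should be below 0.5 (Theorem 5.1),
    bottom-t score error should be above 0.5 (Corollary 5.1)."""
    rng = np.random.default_rng(seed)
    theta_o = rng.integers(0, 2, size=d)
    thetas = rng.integers(0, 2, size=(N, d))          # uniform over {0,1}^d
    dist = (thetas != theta_o).sum(axis=1)            # Hamming distances to theta_o
    ranked = thetas[np.argsort(-dist)]                # theta_(1) ... theta_(N)
    s_top = ranked[-t:].mean(axis=0)                  # voting over the t closest
    s_bot = ranked[:t].mean(axis=0)                   # voting over the t farthest
    return np.abs(s_top - theta_o).mean(), np.abs(s_bot - theta_o).mean()

top_err, bot_err = simulate_scores()
print(top_err, bot_err)   # typically roughly 0.3 vs. 0.7 for these defaults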

5.2. Scoring features in the constrained space

We have similar results in the constrained feature selection problem, where the number of selected features is constrained to be $k$. The solution space becomes more complex in the constrained case. However, we can still partition the space $\Theta_k = \{\theta \mid \theta \in \{0,1\}^d, \sum_{i=1}^{d}\theta_i = k\}$ according to solution differences. The results are as follows.

Theorem 5.2. Let $\theta^o$ be the optimal solution to $\max_{\theta\in\Theta_k} J(\theta)$. Suppose $J(\theta) = g(D(\theta, \theta^o))$, where $g$ is a decreasing function and $D(\theta, \theta^o)$ is the Hamming distance between $\theta$ and $\theta^o$. Then we have

$E\left\{\frac{1}{t}\sum_{j=1}^{t} D\left[(\theta_{(N-j+1)})_i, \theta^o_i\right]\right\} = E\left|s_i - \theta^o_i\right| < \begin{cases} \frac{k}{d} & \text{if } \theta^o_i = 0 \\ \frac{d-k}{d} & \text{if } \theta^o_i = 1 \end{cases} \qquad (14)$

where $\frac{1}{t}\sum_{j=1}^{t} D[(\theta_{(N-j+1)})_i, \theta^o_i]$ is the average Hamming distance between $\theta_{(N-j+1)}$ ($j = 1, 2, \ldots, t$) and $\theta^o$ in the $i$th component.

Proof. Similar to Theorem 5.1, the average Hamming distance between $\theta_{(N-j+1)}$ ($j = 1, 2, \ldots, t$) and $\theta^o$ in the $i$th feature can be written as $\frac{1}{t}\sum_{j=1}^{t} D[(\theta_{(N-j+1)})_i, \theta^o_i] = |s_i - \theta^o_i|$. We partition the space $\Theta_k$ as

$\Theta_k = \cup_{i=0}^{\min(k, d-k)}\Theta_{k,i} \qquad (15)$

where $\Theta_{k,i} = \{\theta \mid \theta \in \Theta_k, D(\theta, \theta^o) = 2i\}$ and $\theta^o$ is the optimal solution to $\max_{\theta\in\Theta_k} J(\theta)$. We can see that $|\Theta_k| = \binom{d}{k}$, $|\Theta_{k,i}| = \binom{k}{i}\binom{d-k}{i}$, and $\binom{d}{k} = \sum_{i=0}^{\min(k,d-k)}\binom{k}{i}\binom{d-k}{i}$. For the $N$ solutions $\Theta_N = \{\theta_1, \theta_2, \ldots, \theta_N\}$, denote $n_i = |\Theta_N \cap \Theta_{k,i}|$ ($i = 0, \ldots, \min(k, d-k)$) as the number of solutions from $\Theta_{k,i}$. Then the random vector $(n_0, n_1, \ldots, n_{\min(k,d-k)})$ follows a multinomial distribution

$P(n_0, n_1, \ldots) = \frac{\left(\sum_{i=0}^{\min(k,d-k)} n_i\right)!}{\prod_{i=0}^{\min(k,d-k)} n_i!}\prod_{i=0}^{\min(k,d-k)} p_i^{n_i} \qquad (16)$

where $\sum_{i=0}^{\min(k,d-k)} n_i = N$ and $p_i = \binom{k}{i}\binom{d-k}{i}\big/\binom{d}{k}$ ($i = 0, \ldots, \min(k, d-k)$).

Since $g$ is a decreasing function and the performances $J(\theta)$ are sorted in ascending order, we can assume the top $t$ solutions are from $\Theta_{k,i}$ with $n_i$ solutions ($i = 0, \ldots, h$), and $t = \sum_{i=0}^{h} n_i$ ($h \le \min(k, d-k)$). Note that $s_i$ is a function of $h$, so we write it as $s_i(h)$. When $\theta^o_i = 0$, the expectation of the average distance is

$E\left|s_i(h) - \theta^o_i\right| = \sum_{n_0+\cdots=N}\left[\frac{1}{\sum_{i=0}^{h} n_i}\sum_{i=0}^{h}\frac{i}{d-k}\, n_i\, P(n_0, n_1, \ldots)\right].$

Since $f(h) = \sum_{i=0}^{h} i\, n_i\big/\sum_{i=0}^{h} n_i$ is an increasing function, $0 = f(0) < 1 = f(1) < f(2) < \cdots < f(\min(k, d-k))$ (when $n_i \ne 0$, $i = 1, \ldots, h$), we have

$E\left|s_i(h) - \theta^o_i\right| < E\left|s_i(h+1) - \theta^o_i\right| < \cdots < E\left|s_i(\min(k, d-k)) - \theta^o_i\right|$

and

$E\left|s_i(\min(k, d-k)) - \theta^o_i\right| = \sum_{n_0+\cdots=N}\left[\frac{1}{\sum_{i=0}^{\min(k,d-k)} n_i}\sum_{i=0}^{\min(k,d-k)}\frac{i}{d-k}\, n_i\, P(n_0, n_1, \ldots)\right] = \frac{1}{(d-k)N}\sum_{i=0}^{\min(k,d-k)} i\left[\sum_{n_0+\cdots=N} n_i\, P(n_0, n_1, \ldots)\right] = \frac{1}{(d-k)N}\sum_{i=0}^{\min(k,d-k)} i\, N\,\frac{\binom{k}{i}\binom{d-k}{i}}{\binom{d}{k}} = \frac{1}{d-k}\cdot\frac{k(d-k)}{d} = \frac{k}{d},$

where we use the equality $\sum_{i=0}^{\min(k,d-k)} i\,\binom{k}{i}\binom{d-k}{i}\big/\binom{d}{k} = \frac{k(d-k)}{d}$. Similarly, we can compute $E|s_i(\min(k, d-k)) - \theta^o_i| = \frac{d-k}{d}$ when $\theta^o_i = 1$. In summary, we have

$E\left|s_i(h) - \theta^o_i\right| < \cdots < E\left|s_i(\min(k, d-k)) - \theta^o_i\right| = \begin{cases} \frac{k}{d} & \text{if } \theta^o_i = 0 \\ \frac{d-k}{d} & \text{if } \theta^o_i = 1 \end{cases}.$

When $g$ is a decreasing function and the performances $J(\theta)$ are sorted in ascending order, Algorithm 2 averages the solutions which are close to $\theta^o$. This orientation of selecting close solutions leads to

$E\left|s_i - \theta^o_i\right| < \begin{cases} \frac{k}{d} & \text{if } \theta^o_i = 0 \\ \frac{d-k}{d} & \text{if } \theta^o_i = 1 \end{cases}. \qquad \square$

Remark 5.3. The results in Theorem 5.2 are also consistent with the intuition. When we average all the solutions in Algorithm 2 (set $t = N$), we have $Es_i = 1\times\frac{k}{d} + 0\times\frac{d-k}{d} = \frac{k}{d}$, $\forall i = 1, 2, \ldots, d$. Therefore, we have $E|s_i - \theta^o_i| = Es_i = \frac{k}{d}$ when $\theta^o_i = 0$, and $E|s_i - \theta^o_i| = 1 - Es_i = \frac{d-k}{d}$ when $\theta^o_i = 1$. Since we are selecting the top $t$ solutions ($0 < t < N$), which are close to $\theta^o$, it is reasonable to conclude that $E|s_i - \theta^o_i| < \frac{k}{d}$ when $\theta^o_i = 0$, and $E|s_i - \theta^o_i| < \frac{d-k}{d}$ when $\theta^o_i = 1$. The proof of Theorem 5.2 quantitatively implements the above intuition by partitioning the solution space $\Theta_k$ according to solution differences and averaging over all the possibilities of the $N$ solutions in these partitioned spaces. Similar to the case of Theorem 5.1, the monotonicity between $E|s_i(h) - \theta^o_i|$ and $h$ in the proof suggests selecting $0 < t < N$ top ranked solutions to achieve a trade-off between bias (reducing the bound of $E|s_i - \theta^o_i|$) and variance (reducing the variance $\mathrm{Var}|s_i - \theta^o_i|$).


Corollary 5.2. With the same notations and assumptions as in Theorem 5.2, if we select the bottom $t$ solutions in Algorithm 2, then we have

$E\left|s_i - \theta^o_i\right| > \begin{cases} \frac{k}{d} & \text{if } \theta^o_i = 0 \\ \frac{d-k}{d} & \text{if } \theta^o_i = 1 \end{cases}.$

Theorem 5.2 and Corollary 5.2 can be interpreted and used in the same way as Theorem 5.1 and Corollary 5.1 for selecting features. Theorem 5.2 and Corollary 5.2 estimate upper and lower bounds for $E|s_i - \theta^o_i|$, which provide very useful information for estimating $\theta^o_i$. In Theorem 5.2, if $k < d-k$, we have $E|s_i - \theta^o_i| < \frac{k}{d} < \frac{1}{2}$ for $\theta^o_i = 0$. Therefore a thresholding method can be used to estimate $\theta^o_i$:

$\theta^o_i = \begin{cases} 0 & \text{if } s_i < \frac{k}{d} \\ 1 & \text{if } s_i > \frac{k}{d} \end{cases} \qquad (17)$

If $k > d-k$, we have $E|s_i - \theta^o_i| < \frac{d-k}{d} < \frac{1}{2}$ for $\theta^o_i = 1$. Therefore a thresholding method can still be used to estimate $\theta^o_i$:

$\theta^o_i = \begin{cases} 1 & \text{if } |s_i - 1| < \frac{d-k}{d} \\ 0 & \text{if } |s_i - 1| > \frac{d-k}{d} \end{cases} \qquad (18)$
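As a concrete instance (our own illustration, using the setting of the numerical validation below): with $d = 100$ and $k = 35$ we have $k < d-k$, so Eq. (17) applies and the estimate reduces to $\hat{\theta}^o_i = 0$ if $s_i < 0.35$ and $\hat{\theta}^o_i = 1$ otherwise.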

Similar thresholding methods can be used for Corollary 5.2. Fig. 10 shows numerical validations for Theorem 5.2 and Corollary 5.2:

Fig. 10. Numerical validations for Theorem 5.2 and Corollary 5.2. Suppose $\theta^o \in \Theta_k = \{\theta \mid \theta \in \{0,1\}^d, \sum_{i=1}^{d}\theta_i = k\}$ ($d = 100$, $k = 35$). Choose $\Theta_N = \{\theta_1, \theta_2, \ldots, \theta_N\}$ ($N = 1000$) uniformly from $\Theta_k$, and sort it as $[\theta_{(1)}, \theta_{(2)}, \ldots, \theta_{(N)}]$ where $D(\theta_{(1)}, \theta^o) \ge \cdots \ge D(\theta_{(N)}, \theta^o)$; $D(\theta_{(i)}, \theta^o)$ is the Hamming distance between $\theta_{(i)}$ and $\theta^o$. Denote the top $t$ solutions as $H(t) = [\theta_{(N-t+1)}, \theta_{(N-t+2)}, \ldots, \theta_{(N)}]$ and the bottom $t$ solutions as $H(-t) = [\theta_{(1)}, \theta_{(2)}, \ldots, \theta_{(t)}]$, respectively. The figure shows the plot of some statistics against different $t$. From left to right, they are: the top-score $s_i(t) = \mathrm{mean}(H(t), 2)$, the top-distance $d_i(t) = \frac{1}{t}\sum_{j=1}^{t} D[(\theta_{(N-j+1)})_i, \theta^o_i]$, the bottom-score $\bar{s}_i(t) = \mathrm{mean}(H(-t), 2)$, and the corresponding bottom-distance computed over the bottom $t$ solutions. The black curves denote those features where $\theta^o_i = 1$; the red curves denote those features where $\theta^o_i = 0$.

• Theorem 5.2. $\theta^o_i = 0$ and $\theta^o_i = 1$ can be distinguished from $s_i$ by the thresholding $\theta^o_i = 0$ if $s_i < \frac{k}{d}$ ($d = 100$, $k = 35$), and $\theta^o_i = 1$ if $s_i \ge \frac{k}{d}$ (first panel). When $\theta^o_i = 0$: $E|s_i(t) - \theta^o_i| < \frac{k}{d}$ when $1 \le t < d$, $E|s_i(t) - \theta^o_i|$ increases with $t$, and $E|s_i(d) - \theta^o_i| = \frac{k}{d}$ (second panel, red lines). When $\theta^o_i = 1$: $E|s_i(t) - \theta^o_i| < \frac{d-k}{d}$ when $1 \le t < d$, $E|s_i(t) - \theta^o_i|$ increases with $t$, and $E|s_i(d) - \theta^o_i| = \frac{d-k}{d}$ (second panel, black lines).
• Corollary 5.2. $\theta^o_i = 0$ and $\theta^o_i = 1$ can be distinguished from $s_i$ by the thresholding $\theta^o_i = 0$ if $s_i > \frac{k}{d}$ (third panel, red) and $\theta^o_i = 1$ if $s_i \le \frac{k}{d}$ (third panel, black). When $\theta^o_i = 0$: $E|s_i(t) - \theta^o_i| > \frac{k}{d}$ when $1 \le t < d$, $E|s_i(t) - \theta^o_i|$ decreases with $t$, and $E|s_i(d) - \theta^o_i| = \frac{k}{d}$ (fourth panel, red lines). When $\theta^o_i = 1$: $E|s_i(t) - \theta^o_i| > \frac{d-k}{d}$ when $1 \le t < d$, $E|s_i(t) - \theta^o_i|$ decreases with $t$, and $E|s_i(d) - \theta^o_i| = \frac{d-k}{d}$ (fourth panel, black lines).

Theorem 5.1, Corollary 5.1, Theorem 5.2 and Corollary 5.2 show that the scores in Algorithm 2 can be used to identify the optimal solution when the distances among solutions are reflected in the performance values. Therefore, the score vector $s$ characterizes the importance of each feature in terms of maximizing the classification performance. Similar to OO, it emphasizes identifying good solutions using performance order rather than performance values. Different from OO, which computes the probability of achieving the top-$g\%$ solutions for general black-box optimization, the theorems and corollaries analyze the possibility of recovering the optimal solution in supervised feature selection. Theoretically, selecting fewer top solutions from $\Theta_N$ leads to less bias (smaller $E|s_i - \theta^o_i|$), but it also implies more variance (bigger $\mathrm{Var}|s_i - \theta^o_i|$). Practically, one needs to select an appropriate number of top solutions to achieve a good trade-off, based on cross validation or domain knowledge. We will use Algorithm 2 instead of standard OO procedures in the experiments. Theorem 5.1, Theorem 5.2 and their corollaries analyze Algorithm 2 from the perspective of ‘‘ordinal’’ optimization, where the performance orders of solutions are emphasized more than the performance values. Since performance orders are usually more robust than values, it is reasonable to use the voting of the top ranked solutions in Algorithm 2. In applications, the assumptions in Theorem 5.1 and Theorem 5.2 may not hold (at least not hold strictly). However, it is intuitive to use the scores as an approximate estimation of good solutions.

The main theoretical results in Section 4 and Section 5 can be summarized as follows.

• If the value-good-enough solutions are not extremely sparse in the solution space, then the ordinal optimization based feature selection method, which returns order-good-enough solutions, will also return value-good-enough solutions with large probability.
• If the distance among solutions changes monotonically with respect to the performance differences of solutions, then the ordinal optimization based feature scoring method can approximately locate the position of the globally optimal selection vector in the solution space.

We will verify the assumptions in these conclusions by comparing our method with other feature selection methods in the experiments.

6. Experiments

In the previous sections, we have analyzed the theoretical soundness of the OO perspective for supervised feature selection. Our analysis shows that our feature selection method can provide an order-good-enough feature subset, which has moderate or even better performance in classification. This method can be viewed as a direct approach for maximizing the classification performance on the test data, while many standard feature selection methods can be viewed as indirect approaches. Since the ultimate goal is to maximize the performance on the test data, it is possible for our method to outperform those standard feature selection methods. In this section, we verify these conclusions empirically by comparing the classification accuracies of different feature selection methods with the same number of selected features. Experiments are performed on sixteen real-world datasets (Table 2; available at http://www.cad.zju.edu.cn/home/dengcai/Data/data.html and http://archive.ics.uci.edu/ml/), most of which are widely used for feature selection and supervised learning [25,13].


Table 2
Real-world datasets.

Index  Dataset       Size (m)  Dimensions (n)  Classes (l)
1      glass           214         9             6
2      dermatology     366        34             6
3      ionosphere      351        34             2
4      mfeat-kar      2000        64            10
5      mfeat-fou      2000        76            10
6      mfeat-fac      2000       216            10
7      mfeat-pix      2000       240            10
8      usps           9298       256            10
9      isolet1        1560       617            26
10     mnist          4000       784            10
11     yale32          165      1024            15
12     yaleb32        2414      1024            38
13     orl32           400      1024            40
14     coil20         1440      1024            20
15     coil100        7200      1024           100
16     pie32         11554      1024            68

Experimental settings are as follows. Given a data matrix $X \in R^{d\times m}$ and the label vector $y \in \{1, \ldots, l\}^m$, we compare the classification accuracies of different methods under the same number of selected features. Specifically, we set $k = [k_1, k_2, \ldots, k_{10}] \approx [0.1d, 0.2d, \ldots, 1.0d]$, and compare the average classification accuracies over $n = 5$ training-test partitions of the data. For feature selection algorithms which return the weight of each feature, we select the top $k$ features. For feature selection algorithms which return a feature subset, we set the number of features as $k$ and run the algorithm. For our method, we randomly pick $N$ solutions $\Theta_N = \{\theta_1, \theta_2, \ldots, \theta_N\}$, where $\theta_i \in \Theta = \{0,1\}^d$, $i = 1, 2, \ldots, N$. Their performances are computed (using the nearest neighbor classifier) and sorted as $\hat{J}(\theta_{[1]}) \le \hat{J}(\theta_{[2]}) \le \cdots \le \hat{J}(\theta_{[N]})$. We then randomly pick $N$ solutions $\Theta_{k,N} = \{\theta_{k,1}, \theta_{k,2}, \ldots, \theta_{k,N}\}$, where $\theta_{k,i} \in \Theta_k$, $i = 1, 2, \ldots, N$. Their performances are computed (using the nearest neighbor classifier) and sorted as $\hat{J}(\theta_{(1)}) \le \hat{J}(\theta_{(2)}) \le \cdots \le \hat{J}(\theta_{(N)})$. We compare several implementations of the OOFS method: ‘‘OOSkb’’, ‘‘OOSka’’ and ‘‘OOW0a’’. ‘‘OOSkb’’ and ‘‘OOSka’’ return a feature subset, and ‘‘OOW0a’’ returns a feature score vector. ‘‘OOSkb’’ denotes selecting features using the best estimated solution in $\Theta_{k,N}$; the output is the selected feature indices in $\theta_{(N)}$. ‘‘OOSka’’ denotes selecting features using a score vector $s$ computed by Algorithm 2 ($H = \Theta_{k,N}$, $t \approx 0.1N$, $k = k_i$); the output is the feature indices of the top $k$ scores. ‘‘OOW0a’’ denotes selecting features using a score $s$ computed by Algorithm 2 ($H = \Theta_N$, $t \approx 0.1N$).
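To make these settings concrete, the following Python sketch (our own schematic rendering, assuming scikit-learn's nearest neighbor classifier; it is not the authors' MATLAB implementation and omits the repetition over $k_1, \ldots, k_{10}$ and the five data partitions) illustrates the sampling-and-ranking loop behind ‘‘OOSkb’’ and ‘‘OOSka’’:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_subset(X, y, mask, seed=0):
    """Crude performance estimate J_hat(theta): one train/test split and
    1-nearest-neighbor accuracy on the selected features only."""
    Xs = X[:, mask.astype(bool)]
    Xtr, Xte, ytr, yte = train_test_split(Xs, y, test_size=0.5, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
    return clf.score(Xte, yte)

def oofs_sample(X, y, k, N=1000, t=100, seed=0):
    """Uniformly sample N subsets of size k, rank them by estimated accuracy,
    and return (best estimated subset, Algorithm-2 style score vector)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    masks = np.zeros((N, d), dtype=int)
    for n in range(N):
        masks[n, rng.choice(d, size=k, replace=False)] = 1
    perf = np.array([evaluate_subset(X, y, m, seed) for m in masks])
    order = np.argsort(perf)
    best_mask = masks[order[-1]]             # "OOSkb"-style choice
    scores = masks[order[-t:]].mean(axis=0)  # "OOSka"-style voting scores
    return best_mask, scores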

We compare the results with some popular feature selection methods. ‘‘fisher’’ (Fisher Score [35]), ‘‘infogain’’ (Information Gain [38,7]) and ‘‘relieff’’ (ReliefF algorithm [39,5]) are supervised feature scoring algorithms from the filter model. ‘‘fisher’’ selects features which assign similar values to the samples in the same class and different values to the samples from different classes [35]. ‘‘infogain’’ selects features which have higher dependencies with the class label than other features [38,7]. ‘‘relieff’’ selects features which contribute more to the separation of samples from different classes [39,5]. ‘‘mrmr’’ (Minimum Redundancy Maximum Relevance [40]) and ‘‘traceRatio’’ (Trace Ratio criterion [25]) are supervised filter models which return a feature subset. ‘‘mrmr’’ selects features which are mutually far away from each other (minimizing the redundancy) and have high correlation to the classification label (maximizing the relevance) [40]. ‘‘traceRatio’’ selects features which maximize the between-class scatter and minimize the within-class scatter of the samples at the same time, where the objective is simplified as maximizing the ratio of two matrix traces [25]. ‘‘sbmlr’’ (sparse multinomial logistic regression via Bayesian L1 regularization [41]) is an embedded method which selects features using a sparse logistic regression model. For these methods, we use all samples to select a feature subset, which is then evaluated and compared with the OOFS methods in maximizing the prediction accuracy. Other wrapper models such as greedy forward selection are not included in this comparison since they have much higher complexity than all the methods used in our experiments. We emphasize that OOFS serves as one implementation of wrapper models with much lower complexity. In the following it is shown that the performances of OOFS are already comparable to or better than those of previous feature selection methods, which means that OOFS overcomes the difficulty of wrapper models in high complexity but still preserves enough accuracy.

Figs. 11 and 12 show the classification accuracies of different feature selection methods on the real-world datasets using the nearest neighbor classifier and SVM, respectively. It can be seen that the OOFS methods have similar performances, which are competitive with the best results of the popular feature selection methods. In many datasets, the OOFS methods achieve the maximal prediction accuracy. This is particularly true when we select a small number of features from a dataset which has a large number of features, e.g., ‘‘mnist’’, ‘‘coil20’’, ‘‘coil100’’. The results hold for different classifiers such as the nearest neighbor classifier and SVM. Note that the OOFS accuracies with the nearest neighbor classifier are a little higher on average than those with SVM, since we use the former to estimate the performances. However, the comparison of the different feature selection methods is very similar. This verifies our analysis in the previous sections that the OOFS methods directly optimize the classification accuracy on the test data, while the other methods are built on optimizing other specific objectives. These objectives are usually related to the classification objective, but the gap between them can be large in some datasets.

Fig. 11. Comparison of classification accuracies of different feature selection methods in real-world datasets (using the nearest neighbor classifier). In each dataset, the classification accuracies are compared for different numbers of selected features.

It can also be seen that ‘‘OOSkb’’ can be unstable in some situations. For example, some accuracies of ‘‘OOSkb’’ are obviously lower than those of the other OOFS methods on ‘‘glass’’, ‘‘yaleb32’’, ‘‘orl32’’ and ‘‘mfeat-kar’’. The reason is that the solution which has the best estimated performance is not necessarily optimal among the $N$ solutions, since the estimated performances can be disturbed by the randomness in partitioning the data. Moreover, the truly optimal solution among the $N$ solutions is not necessarily value-good-enough in the feature space. The experimental results of ‘‘OOSka’’ and ‘‘OOW0a’’ show that constructing better solutions from the top solutions can return a solution that is value-good-enough in the solution space. These solutions are more stable since they are based on the voting and averaging of many good solutions. We can see that the OOFS methods are inferior to previous methods on some datasets (e.g., ‘‘yale32’’, ‘‘yaleb32’’). The reason is that the OPCs in these datasets do not have the desired shapes, which means that the value-good-enough solutions are very sparse. This reflects the essential difficulty of selecting features in these datasets, where previous methods also have relatively lower performances than on other datasets. In practice, it is better to estimate the property of the data (the shape of the OPC) first, and then choose the feature selection algorithm.

We test the performance of wrapper methods (Figs. 13 and 14), and compare the running time of all methods (Tables 3 and 4). The comparison is performed on the first six datasets since they have controllable computation. A wrapper method requires a searching strategy, the number of test evaluations (of classifiers), and the classifier type. The popular greedy forward selection and greedy backward elimination methods are used. Suppose we want to select $k$ features from a dataset (with $d$ features) using wrapper methods, and $n$ is the number of training-test partitions of the data for subset evaluation. Then the complexity of forward selection is $n\sum_{i=0}^{k-1}(d-i)\,O(c, i+1)$, and the complexity of backward elimination is $n\sum_{i=k+1}^{d} i\,O(c, i-1)$, where $O(c, j)$ is the complexity of training classifier $c$ on a dataset with $j$ features. In the experiments, ‘‘F’’ (forward) denotes greedy forward selection, ‘‘B’’ (backward) denotes greedy backward elimination, ‘‘C’’ (crude) denotes using the crude model (one-time test accuracy) $\hat{J}(\theta) = J(\theta, \xi_1)$, ‘‘A’’ (accurate) denotes using the accurate model ($n$-time test accuracy) $\hat{J}(\theta) = \frac{1}{n}\sum_{i=1}^{n} J(\theta, \xi_i)$ (we set $n = 5$ in the experiments), ‘‘K’’ denotes using the nearest neighbor classifier, and ‘‘S’’ denotes using SVM.
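To make the complexity gap concrete, the following short Python sketch counts the classifier trainings implied by the expression above for greedy forward selection and compares it with the fixed budget of $N$ evaluations used by the OO sampling (the per-training cost $O(c, j)$ is ignored; the numbers are only illustrative):

def forward_selection_trainings(d, k, n=5):
    # n * sum_{i=0}^{k-1} (d - i) subset evaluations, each one classifier training
    return n * sum(d - i for i in range(k))

def oofs_trainings(N=1000, n=1):
    # OOFS evaluates each of the N sampled subsets once (crude model) or n times
    return n * N

# e.g. d = 216 features ("mfeat-fac") and k = 0.5 * d selected features
d, k = 216, 108
print(forward_selection_trainings(d, k, n=5))   # 5 * (216 + 215 + ... + 109)
print(oofs_trainings(N=1000, n=1))              # 1000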

Figs. 13 and 14 compare the classification accuracies of different wrapper methods on the first six real-world datasets. Similar to the comparison between the OO based selection methods and the other variable selection methods, all selected variable sets are evaluated using the nearest neighbor classifier (Fig. 13) and SVM (Fig. 14) with five training-test partitions of the data. These results show that the settings of wrapper methods affect the accuracies in a reasonable way:

• Accuracies of greedy forward selection (green lines) and greedy backward elimination (magenta lines) are similar.
• Accuracies of accurate evaluation (solid lines) are much better than those of crude evaluation (dashed lines). This verifies the previous conclusion that accurate variable subset evaluation is very important, since it directly determines the performance of wrapper methods. However, accurate evaluation of subsets leads to very high computational complexity in wrapper methods.
• Accuracies of the nearest neighbor classifier and SVM depend on the classifier type used in evaluation. If the nearest neighbor classifier is used in subset evaluation, then the wrapper method which sets the classifier type as nearest neighbor will outperform the other setting. This is because the optimal subset selected for one type of classifier might not be optimal with respect to another one. This also explains the advantages of wrapper models (including OO), since they can optimize the subset evaluation process (for a given classifier) directly.

Fig. 12. Comparison of classification accuracies of different feature selection methods in real-world datasets (using SVM). In each dataset, the classification accuracies are compared for different numbers of selected features.

Fig. 13. Comparison of classification accuracies of different wrapper methods in real-world datasets (using the nearest neighbor classifier).

Fig. 14. Comparison of classification accuracies of different wrapper methods in real-world datasets (using SVM).

Table 3
Running time of different selection methods (in seconds).

Dataset      mrmr     traceRatio  sbmlr    fisher   infogain  relieff  OOSkb    OOSka    OOW0a
glass        3.83e-1  2.94e-2     9.67e-1  5.67e-3  1.77e-2   3.47e-2  1.94e+1  1.98e+1  2.06e+0
dermatology  7.99e-1  6.22e-2     2.22e-1  1.98e-2  3.20e-2   1.82e-1  2.51e+1  2.60e+1  2.66e+0
ionosphere   8.69e-1  6.24e-2     1.07e+1  7.48e-3  3.32e-2   1.15e-1  2.01e+1  2.04e+1  1.95e+0
mfeat-kar    6.09e+0  1.49e+0     3.06e+0  5.35e-2  3.56e-1   6.07e+0  3.03e+2  3.06e+2  2.94e+1
mfeat-fou    7.12e+0  1.57e+0     1.10e+1  6.52e-2  4.17e-1   7.35e+0  3.15e+2  3.15e+2  3.02e+1
mfeat-fac    1.95e+1  2.81e+0     7.17e+1  2.02e-1  4.85e-1   2.01e+1  4.60e+2  4.64e+2  4.13e+1
mfeat-pix    2.19e+1  2.78e+0     1.67e+1  2.21e-1  3.49e-1   2.22e+1  4.93e+2  4.98e+2  4.56e+1
usps         1.22e+2  5.61e+1     7.00e+1  2.79e-1  6.50e+0   3.86e+2  7.33e+3  7.33e+3  7.01e+2
isolet1      1.33e+2  5.46e+0     6.64e+1  1.26e+0  2.71e+0   4.30e+1  7.46e+2  7.50e+2  6.65e+1
mnist        2.78e+2  2.86e+1     2.18e+2  7.16e-1  1.95e+0   2.45e+2  3.37e+3  3.37e+3  3.30e+2
yale32       2.02e+2  7.47e+0     4.90e+0  1.22e+0  3.19e-1   1.73e+0  6.32e+1  6.60e+1  5.63e+0
yaleb32      3.61e+2  1.96e+1     1.97e+2  3.06e+0  2.66e+0   1.78e+2  2.15e+3  2.16e+3  2.03e+2
orl32        2.19e+2  7.45e+0     2.13e+1  3.20e+0  8.76e-1   9.08e+0  2.34e+2  2.36e+2  1.82e+1
coil20       3.44e+2  1.25e+1     8.77e+1  1.71e+0  5.14e+0   5.87e+1  1.02e+3  1.03e+3  9.81e+1
coil100      7.13e+2  1.16e+2     3.01e+3  7.82e+0  1.02e+1   1.48e+3  1.22e+4  1.21e+4  1.13e+3
pie32        1.61e+2  6.37e+2     4.75e+3  5.56e+0  7.69e+0   1.32e+1  2.89e+4  3.03e+4  2.62e+3

Table 4
Running time of different wrapper methods (in seconds).

Dataset      F-C-K    F-C-S    F-A-K    F-A-S    B-C-K    B-C-S    B-A-K    B-A-S
glass        1.19e-1  3.39e-1  3.02e-1  1.22e+0  9.76e-2  2.90e-1  3.15e-1  1.26e+0
dermatology  1.87e+0  5.32e+0  6.57e+0  2.77e+1  3.85e+0  9.09e+0  1.09e+1  3.51e+1
ionosphere   1.83e+0  5.10e+0  7.01e+0  2.47e+1  2.82e+0  5.88e+0  7.70e+0  2.69e+1
mfeat-kar    8.40e+1  4.76e+2  3.99e+2  2.35e+3  9.52e+1  7.99e+2  4.65e+2  4.10e+3
mfeat-fou    1.44e+2  8.35e+2  6.42e+2  3.85e+3  1.38e+2  1.35e+3  6.71e+2  6.71e+3
mfeat-fac    1.35e+3  8.46e+3  6.48e+3  4.11e+4  2.62e+3  1.53e+4  9.63e+3  7.59e+4


To facilitate comparison, we merge both sets of plots of the wrapper methods and our ordinal optimization method in Figs. 11 and 12 (the first six datasets). It can be seen that the wrapper methods usually achieve the highest accuracy compared with the other methods, and our ordinal optimization method has very close performance to the wrapper methods in most situations.

The OO method serves as a good trade-off between accuracy and complexity, which can be seen by comparing Figs. 13 and 14 with Figs. 11 and 12, and by comparing the running times of the different methods in Tables 3 and 4. The running time is the sum over selecting different numbers of variables ($k = 0.1d, \ldots, 1.0d$), and was measured with MATLAB 7.7.0.471 (R2008b) on an Intel(R) Core i5 CPU. For example, 3.83e-1 means $3.83\times 10^{-1}$ s, and 1.22e+2 means $1.22\times 10^{2}$ s. The running time is the time of obtaining all variable subsets (e.g., ‘‘mrmr’’, ‘‘traceRatio’’) or of computing the variable scores, where the former usually takes more time since it needs to deal with each constrained space. Some of the non-wrapper methods also have high complexity, e.g., ‘‘sbmlr’’. Some of the non-wrapper methods depend heavily on the dataset, e.g., ‘‘relieff’’. We have the following comparisons:

• The best accuracy values of the wrapper methods in Figs. 13 and 14 are similar to those of the OO methods in Figs. 11 and 12. This means that the OO methods have already achieved good enough solutions which are comparable with those from the wrapper methods. These good enough solutions are ‘‘saturated’’, since it takes a lot of computational effort to gain a little improvement.
• The running time of the OO methods is much more comparable with the non-wrapper methods, especially the ‘‘OOW0a’’ method, which selects features with the scores estimated from the unconstrained space. Wrapper methods have obviously much higher complexity than all other methods, and can only be run on datasets with at most several hundred variables. For example, on the ‘‘mfeat-fac’’ data, the running time of the wrapper methods is about 10–100 times that of the OO methods.

Fig. 15 visualizes the selected features of different methods on typical real-world datasets. It can be seen that most of the previous methods tend to select feature variables which have close connections; for example, these feature variables represent some local areas of the samples. The OOFS methods tend to select features which are spread uniformly over the feature space. These features do not represent local areas of the samples; they represent the global structures of the samples. For example, in ‘‘mnist’’ and ‘‘coil100’’ ($k/d = 0.3$), the selected features from the OO methods can reconstruct the global shape of the characters or objects, while the selected features from the previous methods only reconstruct some local areas. This difference is very important when we select a small number of features for classification (e.g., $k/d \le 0.3$). Intuitively, the global shapes are more discriminative than the local areas. Experimentally, the classification accuracies of the OOFS methods are competitive or higher on these datasets. This also explains why the OOFS methods can be competitive with the previous methods in these situations.

The experimental results also illustrate the advantages of performing feature selection. In some datasets, a small number of features contains enough information for the representation of the objects, since their prediction accuracy is close to (e.g., ‘‘coil20’’, ‘‘coil100’’) or even better than (e.g., ‘‘ionosphere’’, ‘‘mfeat-fou’’) that of the full feature sets. The visualization of the selected features recovers the source of the performance gain, which interprets the dependencies between different features and the labels in some sense. The datasets in this section serve to compare different algorithms and to show the effectiveness of the OOFS perspective. From these datasets we have seen that performing feature selection can reduce the computational cost, save storage space, preserve or even improve the prediction accuracy, and interpret the relationships among variables.

Fig. 15. Visualization of the selected features from different methods. The three rows are the selected features in images which contain characters (‘‘mnist’’), faces (‘‘orl32’’), and objects (‘‘coil100’’). The four columns correspond to different numbers of selected features ($k/d$ = 0.1, 0.3, 0.5, 0.7). In each figure, the horizontal direction denotes thirteen feature selection methods: ‘‘mrmr’’, ‘‘traceRatio’’, ‘‘MCFSSup’’, ‘‘sbmlr’’, ‘‘fisher’’, ‘‘infogain’’, ‘‘relieff’’, ‘‘laplacianScore’’, ‘‘spectrum’’, ‘‘OOSkb’’, ‘‘OOSka’’, ‘‘OOWra’’, and ‘‘OOW0a’’. The vertical direction denotes ten samples which are from different classes.

7. Discussions and conclusions

In supervised feature subset selection, the ultimate goal is to maximize the generalized classification performance. It is a numerical optimization problem since the goal is to maximize the prediction accuracy values. The main difficulty of previous wrapper models is that the optimization is based on evaluating feature subsets sequentially for many steps, where each step requires partitioning the data and averaging the test accuracies repeatedly. This brings a very heavy computational burden, and makes wrapper models infeasible for large numbers of features. We present an ordinal optimization perspective for supervised feature subset selection, and propose the OOFS algorithm, which can systematically identify an order-good-enough solution set. OOFS reduces the computational complexity greatly and facilitates parallel computing. The price is that the feature selection objective is softened from finding value-good-enough solutions to finding order-good-enough solutions. We build close connections between the feature selection problem (numerical optimization) and the ordinal optimization method (ordinal optimization). The idea is to determine the shape of the ordered performance curves in the solution space. With the feature selection background, we prove that these curves usually correspond to the situation where the order-good-enough solutions and the value-good-enough solutions have a large intersection. This means that solutions from ordinal optimization can be good in numerical optimization. Moreover, we improve the OOFS method by proposing a scoring algorithm. The scoring algorithm, called OOFSs, measures the contributions of individual feature variables to the classification accuracy. We prove that the algorithm can approximate the underlying optimal solution when the distances among solutions are reflected in their performance differences. We empirically verify the effectiveness of the proposed feature selection methods in sixteen real-world datasets.

The feature selection method in this paper can be extended with other methods. The OOFS procedures can be viewed as an approximate initialization of some optimization algorithms, such as genetic algorithms and simulated annealing, where the differences are that the generated solutions do not have to be evaluated accurately, and the initialized population size should be suitably large (e.g., $N = 1000$). Compared with these optimization algorithms, OOFS provides a new trade-off between performance accuracy and computational complexity: the accuracy is not sacrificed very much, but the complexity is greatly reduced. The underlying assumption is that the features have relatively similar abilities in discrimination, which corresponds to some important areas such as image classification. OOFS can be improved by updating the solutions with search-based optimization algorithms when more computational resources are available. Moreover, OOFS can be combined with optimization methods such as genetic algorithms to generate better solutions [30]. Compared with these black-box optimization procedures, it is more interesting to apply the prior information from applications such as images to facilitate the feature selection task. It is also straightforward to generalize our method to regression-oriented and unsupervised feature selection problems. Our future work includes combining OO with other optimization methods in pattern analysis applications.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Project Nos. 61071131 and 61271388), the Research Project of Tsinghua University (No. 2012Z01011), the Beijing Natural Science Foundation (No. 4122040), and the United Technologies Research Center (UTRC).

References

[1] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[2] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artif. Intell. 97 (1997) 273–324.
[3] I. Guyon, A. Elisseeff, C. Aliferis, Causal Feature Selection, Technical report, 2007.
[4] H. Liu, H. Motoda, Computational Methods of Feature Selection, Chapman & Hall/CRC, 2007.
[5] Z. Zhao, F. Morstatter, S. Sharma, S. Alelyani, A. Anand, H. Liu, Advancing Feature Selection Research, Technical report, ASU Feature Selection Repository, 2011.
[6] P. Langley, Selection of relevant features in machine learning, in: Proceedings of the AAAI Fall Symposium on Relevance, AAAI Press, 1994, pp. 140–144.
[7] Z. Zhao, L. Wang, H. Liu, Efficient spectral feature selection with minimum redundancy, in: Proceedings of the 24th AAAI Conference on Artificial Intelligence, AAAI.
[8] Y. Yang, H.T. Shen, Z. Ma, Z. Huang, X. Zhou, l2,1-norm regularized discriminative feature selection for unsupervised learning, in: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, AAAI Press, 2011, pp. 1589–1594.
[9] I.-S. Oh, J.-S. Lee, B.-R. Moon, Hybrid genetic algorithms for feature selection, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1424–1437.
[10] F. Gao, Y. Ho, Random approximated greedy search for feature subset selection, Asian J. Control 6 (2004) 439–446.
[11] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Stat. Soc. Series B (Methodological) (1996) 267–288.
[12] F. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint l2,1-norms minimization, Advan. Neural Inform. Process. Syst. 23 (2010) 1813–1821.
[13] D. Cai, C. Zhang, X. He, Unsupervised feature selection for multi-cluster data, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, USA, 2010, pp. 333–342.
[14] Y. Jiang, J. Ren, Eigenvector sensitive feature selection for spectral clustering, in: Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 114–129.
[15] R. Jenatton, J.-Y. Audibert, F. Bach, Structured variable selection with sparsity-inducing norms, J. Mach. Learn. Res. (2011) 2777–2824.
[16] D. Witten, R. Tibshirani, A framework for feature selection in clustering, J. Am. Stat. Assoc. 105 (2010) 713–726.
[17] S. Kim, E. Xing, Feature selection via block-regularized regression, arXiv preprint arXiv:1206.3268, 2012.
[18] S.R. de Morais, A. Aussem, A novel Markov boundary based feature subset selection algorithm, Neurocomputing 73 (2010) 578–584.
[19] Z. Zhao, H. Liu, Spectral Feature Selection for Data Mining, Chapman and Hall/CRC Press, 2011.
[20] Z. Xu, I. King, M.-T. Lyu, R. Jin, Discriminative semi-supervised feature selection via manifold regularization, IEEE Trans. Neural Networks 21 (2010) 1033–1047.
[21] T. Hastie, R. Tibshirani, J. Friedman, J. Franklin, The Elements of Statistical Learning: Data Mining, Inference and Prediction, vol. 27, Springer, 2005.
[22] D. Koller, M. Sahami, Toward Optimal Feature Selection, Morgan Kaufman, 1996, pp. 284–292.
[23] I. Tsamardinos, C. Aliferis, Towards principled feature selection: relevancy, filters and wrappers, in: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Morgan Kaufman Publishers, 2003.
[24] Z. Zhao, H. Liu, Spectral feature selection for supervised and unsupervised learning, in: Proceedings of the 24th International Conference on Machine Learning, ACM, New York, USA, 2007, pp. 1151–1157.
[25] F. Nie, S. Xiang, Y. Jia, C. Zhang, S. Yan, Trace ratio criterion for feature selection, in: Proceedings of the 23rd National Conference on Artificial Intelligence.
[26] R. Meiri, J. Zahavi, Using simulated annealing to optimize the feature selection problem in marketing applications, Euro. J. Oper. Res. 171 (2006) 842–858.
[27] T. Feo, M. Resende, Greedy randomized adaptive search procedures, J. Global Optim. 6 (1995) 109–133.
[28] T. Lau, Y. Ho, Super-heuristics and their applications to combinatorial problems, Asian J. Control 1 (2008) 1–13.
[29] Y.-C. Ho, An explanation of ordinal optimization: soft computing for hard problems, Inform. Sci. 113 (1999) 169–192.
[30] Y. Ho, Q. Zhao, Q. Jia, Ordinal Optimization: Soft Optimization for Hard Problems, Springer, 2007.
[31] J. Pena, R. Nilsson, On the complexity of discrete feature selection for optimal classification, IEEE Trans. Pattern Anal. Mach. Intell. (2010).
[32] J. Reunanen, Overfitting in making comparisons between variable selection methods, J. Mach. Learn. Res. 3 (2003) 1371–1382.
[33] P. Somol, P. Pudil, J. Kittler, Fast branch & bound algorithms for optimal feature selection, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 900–912.
[34] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge Univ. Pr., 2004.
[35] R. Duda, P. Hart, D. Stork, Pattern Classification, John Wiley, New York, 2001 (Section 10).
[36] K. Shen, C. Ong, X. Li, E. Wilder-Smith, Feature selection via sensitivity analysis of SVM probabilistic outputs, Mach. Learn. 70 (2008) 1–20.
[37] A. Danyluk, N. Arnosti, Feature selection via probabilistic outputs, in: Proceedings of the 29th International Conference on Machine Learning, New York, USA, pp. 1791–1798.
[38] T. Cover, J. Thomas, Elements of Information Theory, Wiley, 2006.
[39] I. Kononenko, Estimating attributes: analysis and extensions of Relief, Springer-Verlag, 1994, pp. 171–182.
[40] H.C. Peng, F.H. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1226–1238.
[41] G. Cawley, N. Talbot, M. Girolami, Sparse multinomial logistic regression via Bayesian L1 regularisation, Advan. Neural Inform. Process. Syst. 19 (2007) 209.