

Local Feature Selection with Gaussian Process Regression

Karim Pichara · Alvaro Soto


Keywords Feature Selection · Local discriminative subspaces · Gaussian Process · Expectation Maximization · Nearest Neighbor · Classification

Abstract

Most feature selection algorithms determine a global subset of features, where all

data instances are projected in order to improve classification accuracy. An attractive

alternative solution is to adaptively find a local subset of features for each data instance,

such that the classification of each instance is performed according to its own selective

subspace. This paper presents a novel application of Gaussian Processes that improves

classification performance by learning discriminative local subsets of features for each

instance in a dataset. Gaussian Processes are used to build for each available feature

a function that estimates the discriminative power of the feature over all the input

space. Using these functions, we are able to determine a discriminative subspace for

each possible instance by locally joining the features that present the highest levels

of discriminative power. New instances are then classified by using a K-NN classifier

that operates in the local subspaces. Experimental results show that by using local

discriminative subspaces, we are able to reach higher levels of accuracy than alternative

state-of-the-art feature selection approaches.

1 Introduction

Performing classification in high-dimensional spaces has strong limitations for most classification models, mainly due to the curse of dimensionality and the usual high computational load [3] [29]. For distance-based models, it is hard to reach high levels of accuracy because distance metrics lose part of their discriminative power in high dimensions [3]. For probabilistic models that estimate class-conditional density functions, the ability to perform a good fit using a training set decreases exponentially with the number of features [5].

Pontificia Universidad Catolica de Chile, Vicuna Mackenna 4860, Macul, Santiago, Chile. Tel.: +562-3544440. E-mail: [email protected]

Traditional dimensionality reduction techniques, such as feature selection and feature transformation, partially overcome the previous problem [21] [18]. Global feature selection removes irrelevant and redundant dimensions by finding a global subset of features onto which all the data instances are projected. Feature transformation reduces dimensionality by summarizing the dataset by means of algebraic combinations of dimensions. As a relevant weakness, in both cases all the data instances are projected into the same subspace in order to improve classification. Furthermore, in the case of feature transformation, the resulting new features usually do not have a clear interpretation.

In many applications, particularly in high dimensions, projecting all data instances

into the same subspace does not produce satisfactory results. For example, in a face

recognition application each individual usually has particular visual characteristics,

such as a particular nose shape, that are more suitably represented by a specific subset

of features. In this case, a natural extension of global feature selection is to select a

particular local subset of discriminative features for each data instance. Unfortunately,

an exhaustive search for suitable local subspaces is generally not possible, particularly

in high dimensions, where there are 2^n − 1 possible subsets of features to characterize each data instance in an n-dimensional space.

In this paper, we present a local feature selection algorithm that for each data

instance selects a subset of features, such that by projecting the instance to the resulting

subspace, we facilitate its classification. Our model starts by finding for each feature a

function that estimates the discriminative power of the feature over all the input space.

We estimate these functions using Gaussian Processes (GP) regression [35][26]. To

achieve the estimation we propose an iterative optimization method to jointly select the

observations included in the regressions and the GP hyperparameters. After learning

the GPs for all features, we are able to find a discriminative subspace for each possible

input by locally joining the features presenting higher levels of discriminative power.

Accordingly, the main contributions of our work are: i) A new local feature selection

algorithm, able to find a particular discriminative subset of features for any position of

the input space, ii) A new view that poses the feature selection problem as a regression

task using a discriminative score based on GPs, and iii) A new optimization method

that uses an iterative strategy to jointly determine the informative observations and

the hyperparameters of the GPs.

This paper is organized as follows. Section 2 presents a brief overview of GPs.

Section 3 describes related work on global and local feature selection techniques. Section

4 presents the main details of our method. Section 5 shows experimental results. Finally,

Section 6 presents the main conclusions of this work.

2 Gaussian Processes

A GP is a collection of random variables V = {v1, . . . , vn} such that any finite subset of variables from V has a joint Gaussian distribution. A GP is fully specified by a mean function M(x) and a covariance function K(xi, xj), becoming a distribution over functions with a possibly infinite-dimensional mean and covariance matrix. The covariance function K(xi, xj) typically depends on some hyperparameters that have to be determined. GPs have proven to be a powerful tool for regression, being able to model smooth functions while providing the uncertainty of the estimations [35]. In a standard regression problem, we estimate a function F(x) : x ∈ X → y ∈ Y given a set of observations Ay ∈ Y for a set of values Ax ∈ X. In GP regression, the jointly Gaussian distributed variables are any subset of predictions Sy ⊂ Y, with mean M(Sy) and covariance matrix Kij = K(Sxi, Sxj), where Sx ∈ X are the respective preimages of Sy under F. Note that the covariance functions for the output variables Sy depend only on the input variables Sx. The posterior mean and variance of Sy given observations Ay with respective points Ax are [35][26]:

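$$\mu_{S_y | S_x, A_x, A_y} = \mu_{S_y} + \Sigma_{S_x A_x} \Sigma_{A_x A_x}^{-1} (A_y - \mu_{A_y}) \tag{1}$$

$$\sigma^2_{S_y | S_x, A_x, A_y} = K(S_x, S_x) - \Sigma_{S_x A_x} \Sigma_{A_x A_x}^{-1} \Sigma_{S_x A_x}^{T} \tag{2}$$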

where ΣSxAx is a covariance vector with one entry for each m ∈ Ax with value

K(Sx,m). It is important to note that the posterior variance of Sy does not depend

on the observations Ay. In other words, the uncertainty about the predicted values for

Sy only depends on the points Ax related to the observations Ay.
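As an illustration, Equations (1) and (2) can be sketched with NumPy as follows. This is a minimal sketch, not the implementation used in this work: it assumes a constant prior mean, a squared exponential kernel with unit hyperparameters, and a small noise term added for numerical stability. The function names `sq_exp_kernel` and `gp_posterior` are hypothetical.

```python
import numpy as np

def sq_exp_kernel(a, b, theta=1.0, length=1.0):
    """Squared exponential covariance between point sets a (n, d) and b (m, d)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return theta * np.exp(-0.5 * sq_dist / length**2)

def gp_posterior(s_x, a_x, a_y, noise_var=1e-6, mean=0.0):
    """Posterior mean (Eq. 1) and variance (Eq. 2) at the query points s_x.

    Assumption: the prior mean is the constant `mean` at every location.
    """
    k_aa = sq_exp_kernel(a_x, a_x) + noise_var * np.eye(len(a_x))
    k_sa = sq_exp_kernel(s_x, a_x)
    # Solve linear systems instead of forming the explicit inverse.
    post_mean = mean + k_sa @ np.linalg.solve(k_aa, a_y - mean)
    prior_var = np.diag(sq_exp_kernel(s_x, s_x))
    post_var = prior_var - np.einsum("ij,ji->i", k_sa, np.linalg.solve(k_aa, k_sa.T))
    return post_mean, post_var
```

Note that `a_y` enters only the computation of `post_mean`, which mirrors the remark above that the posterior variance does not depend on the observed values.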

3 Related Work

An extensive literature presents many algorithms to select global subsets of features

for classification [20][17][21][22][11].

In supervised scenarios, common methods are divided into two main types, filters and wrappers [17]. In filter models, feature selection is performed as a preprocessing step that searches for features using a criterion that does not depend on the classifier and ranks the features according to that global criterion. Close to our ideas, many methods find features that are discriminative to classify data points. Guided by this idea, the Relief algorithm [20] selects features in two-class classification problems by calculating a weight for each dimension. To calculate the weights, it randomly samples data points and updates the weight of each dimension based on the difference between the sampled data point and its nearest elements from the same and from a different class (nearest hit and nearest miss, respectively). This results in high weights for features that are more discriminative for classification. Unfortunately, this algorithm works only for two-class problems and does not detect redundant features. Some extensions of the Relief algorithm have been proposed [23][36]. The Relief-F algorithm [23] improves the robustness of Relief: it considers the k nearest misses and hits and extends the approach to the multi-class case, also showing better results when dealing with noisy data. Regressional Relief-F (R-Relief-F) [36], instead of using the class of each data point to find nearest hits and misses, estimates the probability that the predicted class values of two instances are different. It models this probability with a relative distance between the predicted classes of the two instances. Conditioning each probability on the predicted class value, the weight of each attribute is estimated using Bayes' rule. Relief approaches do not consider cases where some attributes are discriminative for only some instances.
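To make the nearest-hit and nearest-miss idea concrete, the basic Relief weight update can be sketched as follows. This is a simplified sketch of a common textbook formulation, not the exact procedure of [20]: it assumes two classes with at least two instances each, continuous features scaled to comparable ranges, and absolute differences in the update. The function name `relief` and its parameters are hypothetical.

```python
import random

def relief(data, labels, n_samples=20, seed=0):
    """Return one weight per feature; larger means more discriminative."""
    rng = random.Random(seed)
    n_features = len(data[0])
    weights = [0.0] * n_features

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    for _ in range(n_samples):
        i = rng.randrange(len(data))
        hits = [k for k in range(len(data)) if k != i and labels[k] == labels[i]]
        misses = [k for k in range(len(data)) if labels[k] != labels[i]]
        hit = min(hits, key=lambda k: sq_dist(data[i], data[k]))     # nearest hit
        miss = min(misses, key=lambda k: sq_dist(data[i], data[k]))  # nearest miss
        for j in range(n_features):
            # Reward separation from the miss, penalize separation from the hit.
            weights[j] += (abs(data[i][j] - data[miss][j])
                           - abs(data[i][j] - data[hit][j])) / n_samples
    return weights
```

The weights are global: a feature that separates classes only in one region of the input space receives a single averaged weight, which is the limitation noted above.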

A different filter approach to search for relevant features is to evaluate relations between each attribute and the attribute representing the class of the instances. The method in [14] performs a best-first search guided by a score that depends on the statistical correlation between the features and the attribute representing the class of the instances. The inconsistency criterion proposed in [13] gives a score to features depending on the number of “inconsistencies”, that is, cases where similar elements belong to different classes. The work in [8] uses an information-theoretic criterion to select variables in a text categorization problem, calculating the mutual information between each feature and the class label to estimate a relevance score. This criterion is suitable for problems where the probabilities needed to calculate the mutual information can be empirically estimated from frequencies of occurrence in the data. The heuristic algorithm proposed in [22] estimates the Markov blanket of each feature in order to eliminate features that do not add information about the class label.

An alternative to filters are wrapper methods. In wrapper methods, feature selection is guided by a specific classification algorithm, searching for the features that improve the accuracy of that particular classifier [17] [21]. In these models we need to define: (i) how to search the space of all possible subsets of variables; (ii) how to assess the accuracy of a classification algorithm in order to guide the search and halt it; and (iii) which classifier to use. Some algorithms use genetic approaches to search over the hypothesis space. The algorithm proposed in [30] uses an evolutionary search method with leave-one-out cross-validation to select subsets of features. It evaluates features in parallel until one feature is better than the others, and uses a 1-nearest-neighbor classifier as the predictor model. It is common to find ad hoc heuristics related to specific classifiers in wrapper approaches. For example, the method in [25] selects features for a Naive Bayes classifier by removing variables that introduce dependencies and using leave-one-out cross-validation to validate the candidate subsets of attributes. The method in [32] searches for dependencies among features, creating groups of variables that share dependencies and adding new features represented by these compounds in order to improve a Naive Bayes classifier. The method in [39] selects features for a Bayesian network classifier: it first selects the class label as the initial feature, and then iteratively adds the variable that yields the highest increment in accuracy of a Bayesian network estimated from the test set. The method in [40] selects subsets of features for a nearest neighbor classifier, using a Monte Carlo technique to choose the subset of prototype training instances as well as the features. SVM classifiers can also help the feature selection process: the recursive wrapper feature elimination algorithm proposed in [12] uses an SVM as the underlying classifier to recursively eliminate the features that make a minimal contribution to the margin of the classifier.

In general, wrapper models have a high computational cost because they have to run an induction algorithm multiple times. Moreover, guiding the feature selection to improve the accuracy of one particular classifier usually increases the risk of overfitting [21].

Because the heuristics that evaluate the performance of subsets of features are global (in the sense that they return scores related to all data points), filter and wrapper approaches are not able to select these subsets for particular data points, and different strategies are required to drive a local subset selection.

A variant of wrapper methods for feature selection are embedded methods. In these methods, feature selection is performed within the training process of the classifier. As in wrapper methods, the selection is linked to a particular classifier; the main difference is that the selection of features and the training of the classifier are linked and dependent processes. This characteristic constitutes an advantage in time complexity. Examples of embedded methods are decision trees [34] and boosting methods [9]. Decision tree classification algorithms select as nodes of the tree the variables that discriminate well among elements from different classes. The variables are placed from root to leaves in the order in which they are selected, as an indicator of the discriminative power of each variable. Boosting methods build ensembles of classifiers, where each classifier is an expert at classifying some areas of the input space. For each of these “weak classifiers”, the algorithm selects the variables that improve its particular classification. The final decision is taken by the ensemble, which combines the decisions of the weak classifiers. SVM classifiers have also been used as underlying classifiers for embedded approaches [4][31]. The method in [31] adds a penalization term on the number of selected features to the standard cost function of the SVM; by optimizing the modified cost function, features are selected simultaneously with the model construction. ConcaVe feature selection, proposed in [4], minimizes an approximate zero norm of the standard weight vector in the SVM optimization, using an iterative method called the Successive Linearization Algorithm [4].

Distance metric learning can also be cast as a feature selection approach, where a distance matrix is learned in order to improve classification. Commonly, these methods find linear transformations of the input space that minimize distances among elements from the same class and maximize distances among elements from different classes. Some of these methods perform the linear transformation by computing a Mahalanobis distance, whose calculation depends on a positive semidefinite matrix that can be determined by numerical optimization of a predefined loss function. Other methods perform the linear transformation by finding a matrix that directly reduces the dimensionality of the input space. Among the distance matrix methods, the Pseudometric Online Learning Algorithm (POLA) is proposed in [37]. POLA receives at each step pairs of input data points and attempts to learn a Mahalanobis metric M and a scalar threshold b such that similarly labeled inputs are at most a distance of b − 1 apart, while differently labeled inputs are at least a distance of b + 1 apart. The distance metric M and the threshold b are updated after each tuple to correct any violation of the desired relation. Another related distance metric algorithm is Neighborhood Component Analysis (NCA) [10], which computes the expected leave-one-out classification error of a stochastic variant of K-NN classification. The stochastic classifier uses a Mahalanobis distance metric parameterized by the linear transformation. The algorithm attempts to estimate the linear transformation that minimizes the expected classification error. Unfortunately, many of these algorithms obtain poor results in cases where there are many groups of data points from the same class far from each other (multimodal labeled data). A Local Fisher Discriminant Analysis technique is proposed in [41] to deal with multimodal data. The model merges ideas from an unsupervised technique called Locality Preserving Projections (LPP) [16] and Fisher discriminant analysis, which is a discrimination indicator for features that performs poorly in the presence of multimodal data.

Also related to NCA and POLA, a more recent work proposed in [42] is the Large Margin Nearest Neighbor Classifier (LMNN). It defines a safety perimeter for each instance, within which only instances of the same class (target neighbors) are expected; any elements of a different class within the perimeter are impostors. Before learning, a training input has both target neighbors and impostors within its safety perimeter; during learning, impostors are pushed outside; after learning, there exists a finite margin between the perimeter and the impostors. To push out the impostors, the method defines a loss function that consists of two terms, one that acts to pull target neighbors closer together and another that acts to push differently labeled examples further apart. Each of these terms depends on a distance matrix, and the goal is to find the matrix that minimizes the loss function. The loss function is expressed in terms of a positive semidefinite distance matrix, which makes the optimization convex, so that a global optimum can be found. Many other similar methods have been proposed [6][2] [43] [38]. Most distance metric learning methods also find a single distance matrix to be applied over all the instances; unfortunately, in some cases the learned matrix is suitable for only a subset of the data points.

Many other kinds of approaches can be viewed as feature selection techniques. For example, in object recognition by image analysis, the method proposed in [33] groups the images taken from a specific view and builds a feature space for each view. This idea is extended in [19] with Locally Linear Discriminant Analysis, which finds linear transformations that map data points to lower-dimensional spaces where the inter-class covariance is maximized and the intra-class covariance is minimized. In [15], the Optimal Local Basis method uses a reinforcement learning approach where each state is represented by the last set of features selected for a given point, the action corresponds to the selection of a new feature, and the reward depends on the classification accuracy given the selected pool of features.

As mentioned before, most of the approaches that attempt to select features do not consider selecting a specific set of features for each data point. Our model constitutes a function that is able to determine a set of features for any test data point. This function does not suffer from most of the problems mentioned above, such as redundant features and multimodal data, among others.

4 Our Approach

One of the key steps behind our approach is the estimation of a set of functions that evaluate the discriminative power of each feature over the entire input space. Each of these functions is estimated with GP regression, using as observations the discriminative score of the feature at some strategically selected data points. We give the details about the calculation of the discriminative score and the strategy for selecting observations in the next sections. Figure 1 illustrates the process of building functions to estimate the discriminative score of two features using GPs. In this case, for two hypothetical features F1 and F3, dimension 1 is discriminative to classify a data instance xi, while dimension 3 is not. We can observe (black dots) that we add a high score value for the observation relative to xi on GP1 and a low score value for the observation relative to xi on GP3. Gray dots correspond to previous observations.

Next, we first present the discriminative score used in this work. Then, we describe our application of GPs to estimate the scoring functions related to each feature. Afterwards, we present our technique to locally join these functions in order to achieve local feature selection. Finally, we present a classification scheme that uses our approach.


Fig. 1: Example of observations for GPs according to the discriminative score of dimensions 1 and 3 for data instance xi. Dotted lines represent the GP uncertainty.

4.1 Discriminative Score

This section describes how we calculate the discriminative score of a feature for a given data point. Recall that this calculation is used to generate the observations needed to estimate the functions described above. In this work, we measure the discriminative power of each variable using a score that resembles the near-hit and near-miss strategy proposed by the Relief feature selection algorithm [20]. Let xi = [xi(1), xi(2), . . . , xi(N), Cxi], i ∈ [1 . . . M], be a data instance that belongs to class Cxi, and let j ∈ [1 . . . N] be a feature. The discriminative score of feature j with respect to xi is calculated by considering how close xi is to elements of its own class and how far it is from elements of other classes along dimension j.

Accordingly, the discriminative score Fj(xi) for dimension j at the input location

xi is given by:

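$$F_j(x_i) = \frac{1}{\eta} \sum_{k=1}^{M} z(x_i, x_k)\, K(d^{j}_{ik}), \tag{3}$$

where K(·) is a Gaussian kernel given by

$$K(d) = \frac{e^{-d^2/2}}{\sqrt{2\pi}},$$

and

$$d^{j}_{ik} = \begin{cases} |x_i(j) - x_k(j)| & \text{if } j \text{ is continuous} \\ \delta_{x_i(j) x_k(j)} & \text{if } j \text{ is categorical} \end{cases}$$

also

$$\delta_{x_i(j) x_k(j)} = \begin{cases} 0 & \text{if } x_i(j) = x_k(j) \\ 1 & \text{otherwise} \end{cases}$$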


where z(xi, xk) = −1 if xi and xk belong to the same class and z(xi, xk) = 1 otherwise. Finally, η is a normalization constant that keeps the discriminative score in the range [0..1]. Note that function d(·, ·) can be extended to other types of variables (ordinal, binary, scaled, etc.). With Equation (3) we are able to add observations to the regressions {F1, F2, . . . , FN} at any location of the input space.
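The score of Equation (3) can be sketched in plain Python as follows. This is only an illustrative sketch: the sign convention for z is taken verbatim from the text above, the sum runs over all M instances exactly as written, and the normalization constant η is left as an argument because its computation is not specified here. The function and parameter names are hypothetical.

```python
import math

def gaussian_kernel(d):
    """K(d) = exp(-d^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-d * d / 2.0) / math.sqrt(2.0 * math.pi)

def feature_distance(a, b, categorical):
    """Distance d along one feature: |a - b| if continuous, 0/1 mismatch if categorical."""
    if categorical:
        return 0.0 if a == b else 1.0
    return abs(a - b)

def discriminative_score(i, j, data, labels, categorical=False, eta=1.0):
    """Score F_j(x_i) of Eq. (3); eta is an assumed, caller-supplied constant."""
    total = 0.0
    for k, x_k in enumerate(data):
        # z = -1 for same-class pairs and +1 otherwise, as stated in the text.
        z = -1.0 if labels[k] == labels[i] else 1.0
        total += z * gaussian_kernel(feature_distance(data[i][j], x_k[j], categorical))
    return total / eta
```

Each evaluation scans the dataset once along a single dimension, so producing one observation for one GP costs O(M) kernel evaluations.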

4.2 Estimation of regressions for discriminative scores

As mentioned, we use GPs to estimate the discriminative scoring function associated with each feature. GPs allow us not only to estimate a regression, but also to obtain the level of uncertainty of the estimation in different parts of the input space. This last property is very important for the practical implementation of our method. In general, there are important limitations on the number of observations that can be included in a GP regression. Specifically, for m observations in a regression model, the training cost for each GP is O(m^3). Fortunately, using the level of uncertainty provided by the GP, we can implement efficient strategies to include observations (evaluations of the discriminative score) in key informative areas of the input space.

A popular solution to select observations for a GP regression is provided in [24]. That work proposes an active learning scheme to select locations in a sensor placement problem, where there is a limited number of positions to allocate the sensors. It selects points that reduce the entropy of the posterior distribution of the parameters given new observations. Unfortunately, this method does not scale properly with the number of possible observations and is therefore computationally infeasible in our case, where we can potentially observe at any point in the input domain. In fact, for a discretization of a two-parameter covariance function into d possible values, the calculation of the expected reduction of entropy for one given point takes O(mNd^2) for a dataset of m points and N variables.

An additional difficulty in applying a GP-based solution to our case is the estimation of hyperparameters. In a GP regression problem the hyperparameters of the GP are usually unknown; therefore, we also need to select observations to estimate their values.

Accordingly, we need to solve two main problems to perform the GP regressions: i) determine the set of points to be included as observations in each of the GPs, and ii) determine the kernel hyperparameters of each of the GPs. This is a chicken-and-egg problem, because we need the observations to estimate the hyperparameters and vice versa. To solve this problem, we use a strategy that resembles the operation of the Expectation Maximization algorithm [7]. We iterate between two steps: selecting observations given the GP hyperparameters, and finding the GP hyperparameters given a set of observations. We start the iterations by randomly selecting initial values for the GP hyperparameters. Afterwards, we iterate the following two steps:

4.2.1 Determination of the set of observations to include in the regression given the hyperparameters

Assuming a known set of hyperparameters for a GP, McKay et al. [27] show that, for a fixed covariance function, we can obtain a suitable estimation of the GP by greedily selecting observations at the position with the highest posterior variance, as given in Equation (2). The advantage is that this strategy requires only one data scan to determine the next point to be observed. Using this strategy we can also determine a suitable number of observations to be included in the regression. This is achieved by calculating the mean squared error (MSE) [35] of the regression on a test set, given the discriminative scores (observations) of the points selected so far. The test set is simply a set of points where we calculate the discriminative score using Equation (3). We verified in our experiments that, after adding a certain number of observations, the MSE remains almost constant. This suggests adding observations until the MSE no longer changes with the addition of more observations.
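The greedy selection with the MSE-based stopping rule can be sketched as follows. This is a minimal sketch under stated assumptions: a zero-mean GP, a squared exponential kernel with fixed unit signal variance and length scale, an arbitrary first observation (all prior variances are equal), and a hypothetical tolerance `tol` for deciding that the MSE has stopped changing. None of the names below come from the original work.

```python
import numpy as np

def sq_exp(a, b, theta=1.0, length=1.0):
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return theta * np.exp(-0.5 * sq / length**2)

def select_until_mse_flat(points, scores, test_x, test_y,
                          noise_var=1e-2, tol=1e-4, max_obs=50):
    """Greedy max-variance selection, stopped when the test MSE stops changing."""
    chosen, prev_mse = [0], np.inf
    while True:
        a_x, a_y = points[chosen], scores[chosen]
        k_aa = sq_exp(a_x, a_x) + noise_var * np.eye(len(chosen))
        # Zero-mean GP prediction on the held-out test locations (Eq. 1).
        pred = sq_exp(test_x, a_x) @ np.linalg.solve(k_aa, a_y)
        mse = float(np.mean((pred - test_y) ** 2))
        if abs(prev_mse - mse) < tol or len(chosen) >= min(max_obs, len(points)):
            return chosen, mse
        prev_mse = mse
        # Posterior variance (Eq. 2) at every candidate location.
        k_sa = sq_exp(points, a_x)
        var = 1.0 - np.einsum("ij,ji->i", k_sa, np.linalg.solve(k_aa, k_sa.T))
        var[chosen] = -np.inf  # never select the same location twice
        chosen.append(int(np.argmax(var)))
```

Here `scores` holds the discriminative score of one feature at every candidate point; in practice the score would only need to be evaluated at the locations that end up being selected and at the test points.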

4.2.2 Determination of hyperparameters given observations

Given a set of observations, a suitable approach to determine the kernel hyperparameters of a GP is the maximum likelihood (ML) method. As shown in [35], an efficient strategy to find this maximum is to use a conjugate gradient technique that optimizes the marginal log-likelihood given a set of observations. Specifically, considering a set Ay of n observations relative to a set of data points Ax, the marginal log-likelihood, as a function of the hyperparameters Θ of a GP, is given by:

$$\log p(A_y | A_x, \Theta) = -\frac{1}{2} A_y^{T} \Sigma_{A_y}^{-1} A_y - \frac{1}{2} \log |\Sigma_{A_y}| - \frac{n}{2} \log 2\pi \tag{4}$$

The first term in Equation (4) corresponds to the data fit, i.e., how well the regression fits the observations; the second term is the complexity penalty, which depends on the number of inputs and the covariance function; and the third term is a normalization constant. We use the conjugate gradient method to numerically find the Θ that maximizes Equation (4).
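The three terms of Equation (4) can be evaluated as in the following sketch. It assumes zero-mean observations and takes the covariance matrix of the observations as an input, so it is independent of the particular covariance function; the Cholesky factorization is a standard numerical choice and not something prescribed by the text. The function name is hypothetical.

```python
import numpy as np

def log_marginal_likelihood(a_y, cov):
    """Eq. (4) for zero-mean observations a_y with covariance matrix cov."""
    n = len(a_y)
    chol = np.linalg.cholesky(cov)  # cov = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, a_y))
    data_fit = -0.5 * a_y @ alpha
    complexity = -np.log(np.diag(chol)).sum()  # equals -0.5 * log|cov|
    constant = -0.5 * n * np.log(2.0 * np.pi)
    return data_fit + complexity + constant
```

A gradient-based optimizer would call this function repeatedly, rebuilding `cov` from the current hyperparameters Θ at each step.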

In this work, we use a covariance based on a squared exponential isotropic function with Gaussian noise, a commonly used covariance function for machine learning problems [35]. This covariance function is given by:

$$k(x_p, x_q) = \theta \exp\left(-\frac{(x_p - x_q)^{T} P^{-1} (x_p - x_q)}{2}\right) + \sigma_n^2 \delta_{pq}$$

where P = l^2 I and I is the identity matrix, θ is the signal variance, σ_n^2 is the noise variance, l is the characteristic length scale, and δpq is the Kronecker delta, which is one if p = q and zero otherwise. We use the same covariance function and parameters for all the GPs.

Algorithm 1 summarizes the main steps of the strategy used to estimate hyperparameters and to select observations. The algorithm iterates until there are no significant changes in the hyperparameter values. We evaluate this by checking the changes between consecutive iterations. We empirically determine a suitable threshold to stop the iterations.

To guarantee convergence of Algorithm 1, we should check that the likelihood function in Equation (4) increases at each iteration. The step that finds new hyperparameters is clearly a maximization step; therefore, the log-likelihood increases. For the step that selects new observations, we can consider that the next selected observation always reduces the variance of the regression. As a consequence, this also reduces the penalization term in Equation (4), which in turn increases the log-likelihood.

Algorithm 1 Strategy to estimate hyperparameters and to select observations

Randomly select an initial set of hyperparameters.
while significant change in hyperparameters do
    Given the hyperparameters, select the observations (Sec. 4.2.1).
    Given the observations, determine the hyperparameters (Sec. 4.2.2).
end while
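The control flow of Algorithm 1 can be sketched as a generic alternation loop. This is only a skeleton: the two steps are passed in as functions, hyperparameters are represented as a tuple of numbers, and the stopping threshold `tol` and the iteration cap `max_iter` are hypothetical safeguards rather than values from this work.

```python
def alternate(select_step, fit_step, init_params, tol=1e-3, max_iter=50):
    """Alternate observation selection and hyperparameter fitting.

    select_step(params) -> observations        (Sec. 4.2.1)
    fit_step(observations) -> new params tuple (Sec. 4.2.2)
    """
    params = tuple(init_params)
    observations = None
    for _ in range(max_iter):
        observations = select_step(params)
        new_params = tuple(fit_step(observations))
        # Largest absolute change between consecutive iterations.
        change = max(abs(a - b) for a, b in zip(new_params, params))
        params = new_params
        if change < tol:  # no significant change: stop iterating
            break
    return params, observations
```

In a full implementation, `select_step` would run the greedy maximum-variance selection and `fit_step` would maximize Equation (4) over the hyperparameters.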

4.3 Construction of local discriminative subspaces

Once we have estimated the GP regression for each feature, we are able to select a discriminative subspace for each possible input. We achieve this by locally selecting the features presenting the highest levels of discriminative power. To avoid selecting redundant features, we include features in the set sequentially, avoiding features that are highly correlated with the features already selected. Formally, let Lxi be the discriminative subspace for xi. Lxi is built by repeatedly adding the feature with the greatest discriminative power, penalized by its correlation with respect to the features already in Lxi. The discriminative power of variable j at point xi is denoted by Dxi(j) and is calculated by considering the height of the estimation (mean of Fj(xi)) and the uncertainty (variance of Fj(xi)).

Given that Fj(xi) is itself a Gaussian distribution, we can evaluate Dxi(j) by calculating how far Fj(xi) is from a fixed distribution y∗ ∼ G(µ∗, (σ∗)^2). To calculate the difference between Fj(xi) and y∗ we use the Kullback-Leibler (KL) divergence. We manually set low values for µ∗ and σ∗ (the same for all the GPs). We use y∗ just as a reference distribution to obtain a relative value among all the features at the point xi.
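For two univariate Gaussians the KL divergence has a closed form, sketched below. The direction of the divergence, KL(Fj(xi) || y∗), is an assumption of this sketch, since the text does not state which argument comes first; the function name is hypothetical.

```python
import math

def kl_gaussian(mu, var, mu_ref, var_ref):
    """KL( N(mu, var) || N(mu_ref, var_ref) ) for univariate Gaussians."""
    return 0.5 * (math.log(var_ref / var)
                  + (var + (mu - mu_ref) ** 2) / var_ref
                  - 1.0)
```

With a reference distribution of low mean and low variance, a GP estimate with a higher mean at xi yields a larger divergence and hence a larger discriminative power under this convention.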

Figure 2 illustrates this idea for a case where there are two features: j and k. In

this case, distribution yj(xi) is farther from distribution y∗ compared to distribution

yk(xi), indicating that variable j is more discriminative than variable k with respect

to input xi.

Accordingly, the final score of a variable at a specific instance location xi is given by:

$$\mathrm{Score}_{x_i}(j) = D_{x_i}(j) - \frac{1}{\eta} \sum_{k=1}^{|L_{x_i}|} \left|\mathrm{Corr}(j, L_{x_i}(k))\right|,$$

where Dxi(j) is the discriminative power of variable j for xi and Corr(j, Lxi(k)) is the correlation between j and the k-th variable in Lxi. The constant η is used to normalize the sum in order to compare scores. The selection process is detailed in Algorithm 2. Note that γ is a parameter that regulates the number of features we select. In this work we set that parameter manually for each dataset. In Section 5.6 we show how the results change with different values of this parameter.


Fig. 2: In the example, for input xi variable j is more discriminative than variable k

due to its greater KL divergence score with respect to the reference distribution y∗.

Algorithm 2 Selection of the discriminative subspace for xi

Calculation of the discriminative subspace Lxi
A: Let m be the feature with the highest discriminative power Dxi(m)
   Initialize Lxi = {m}
B: Calculate Scores = {Scorexi(1), . . . , Scorexi(m − 1), Scorexi(m + 1), . . . , Scorexi(N)} for the remaining features
   Let T = Mean(Scores) + γ ∗ Std(Scores)
   Let h be the feature with the highest score Scorexi(h)
   if Scorexi(h) ≥ T then
      Lxi = Lxi ∪ {h}
      return to B
   else
      Stop
   end if
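The greedy loop of Algorithm 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the score of a feature is its discriminative power minus the normalized sum of absolute correlations with the features already selected, it uses the population standard deviation for Std, and the function and argument names are hypothetical.

```python
import statistics

def select_local_subspace(disc_power, corr, gamma, eta=1.0):
    """Greedy selection of a local subspace in the spirit of Algorithm 2.

    disc_power: discriminative power of each feature at the instance x_i.
    corr: square matrix (list of lists) of feature-feature correlations.
    gamma: threshold parameter; eta: normalization constant (assumed value).
    """
    n_features = len(disc_power)
    # Step A: start from the most discriminative feature.
    best = max(range(n_features), key=lambda j: disc_power[j])
    subspace = [best]
    while True:
        remaining = [j for j in range(n_features) if j not in subspace]
        if len(remaining) < 2:
            break  # too few scores left to compare against a threshold
        # Step B: assumed score = power minus normalized redundancy.
        scores = {
            j: disc_power[j] - sum(abs(corr[j][k]) for k in subspace) / eta
            for j in remaining
        }
        values = list(scores.values())
        threshold = statistics.mean(values) + gamma * statistics.pstdev(values)
        h = max(scores, key=scores.get)
        if scores[h] >= threshold:
            subspace.append(h)  # accept h and return to step B
        else:
            break
    return subspace
```

Larger values of γ raise the threshold T, so fewer features pass the test and the resulting subspace is smaller.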

4.4 Classification of new data instances

For a new test instance xt, we proceed to build its local subspace Lxt by using the

method described in Section 4.3. Afterwards, we project all the training instances to

Lxt , and we apply a K-NN classifier in this subspace.
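A minimal sketch of this classification step is shown below. It assumes Euclidean distance and majority voting, neither of which is specified in the text; the function name and arguments are hypothetical, and the default K = 8 is the value reported in Section 5.

```python
from collections import Counter
import math

def knn_in_subspace(x_test, train_X, train_y, subspace, k=8):
    """Classify x_test with K-NN using only the features listed in `subspace`."""
    def dist(a, b):
        # Euclidean distance restricted to the local subspace (assumed metric).
        return math.sqrt(sum((a[j] - b[j]) ** 2 for j in subspace))
    # Indices of the k training instances closest to x_test in the subspace.
    nearest = sorted(range(len(train_X)),
                     key=lambda i: dist(x_test, train_X[i]))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Because the projection depends on the test instance, the distances to the training set are recomputed for every query with its own feature subset.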

4.5 Running time of the method

In the training process, selecting locations requires estimating the posterior variance at each of the candidate locations, which takes O(n^3) due to the inversion of the covariance matrix in Equation (2). It is important to note that the posterior variance depends mainly on sums of distances between the candidate location and the locations already selected (because we use a squared exponential isotropic covariance function). Moreover, we do not need to calculate the inverse of the covariance matrix in order to select the next location, given that this matrix is constant when we compare the posterior variance among the candidate locations. This result can be obtained from Equation (2). Suppose we have to choose the next location Sx (whose respective observation is Sy) that maximizes the posterior variance σ^2_{Sy|Sx,Ax,Ay}. Also, suppose that we have already selected B locations Ax (with respective observations Ay). The posterior variance of Sy can be expressed as:

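$$
\begin{aligned}
\sigma^2_{S_y \mid S_x, A_x, A_y} &= K(S_x, S_x) - \Sigma_{S_x A_x}\, \Sigma^{-1}_{A_x A_x}\, \Sigma^{T}_{S_x A_x} \\
&= K(S_x, S_x) - \sum_{i=1}^{B} \Big[ K(S_x, A_{x_i}) \sum_{t=1}^{B} C_{i,t}\, K(S_x, A_{x_t}) \Big],
\end{aligned}
\tag{5}
$$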

where C_{i,t} = Σ^{-1}_{AxAx}(i, t) does not depend on Sx. Using the definition of the squared exponential isotropic covariance function, we can write:

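$$
\begin{aligned}
\sigma^2_{S_y \mid S_x, A_x, A_y} &= K(S_x, S_x) - \sum_{i=1}^{B} \Big[ K(S_x, A_{x_i}) \sum_{t=1}^{B} C_{i,t}\, K(S_x, A_{x_t}) \Big] \\
&= \eta - \sum_{i=1}^{B} \Big[ \theta\, e^{-\frac{(S_x - A_{x_i})^T P^{-1} (S_x - A_{x_i})}{2}} \sum_{t=1}^{B} C_{i,t}\, \theta\, e^{-\frac{(S_x - A_{x_t})^T P^{-1} (S_x - A_{x_t})}{2}} \Big]
\end{aligned}
\tag{6}
$$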

The constant η is the sum of the constant K(Sx, Sx) and the noise variances. We can see that the instance Sx that maximizes the posterior variance σ^2_{Sy|Sx,Ax,Ay} is the instance that minimizes the subtracted term on the right-hand side of Equation (6). This term requires computing a sum over the differences between the candidate elements in the training set and the locations already selected in Ax. This operation takes O(n^2) instead of the cubic cost of inverting the covariance matrix ΣAxAx.
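The comparison among candidates implied by Equation (6) can be sketched as follows. This is an illustrative sketch only: it assumes P is the identity scaled by a squared length scale, takes the matrix C as precomputed input, and uses hypothetical function and parameter names.

```python
import math

def se_kernel(a, b, theta, length_scale):
    """Isotropic squared exponential covariance, assuming P = length_scale^2 * I."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return theta * math.exp(-sq_dist / (2.0 * length_scale ** 2))

def next_location(candidates, selected, C, theta=1.0, length_scale=1.0):
    """Return the index of the candidate with the largest posterior variance.

    C is the precomputed inverse covariance of the selected locations. It is
    the same for every candidate, so only the subtracted term of Equation (6)
    has to be compared.
    """
    def subtracted_term(s):
        k = [se_kernel(s, a, theta, length_scale) for a in selected]
        return sum(k[i] * sum(C[i][t] * k[t] for t in range(len(k)))
                   for i in range(len(k)))
    # The smallest subtracted term corresponds to the largest posterior variance.
    return min(range(len(candidates)),
               key=lambda c: subtracted_term(candidates[c]))
```

Candidates far from every selected location have kernel values close to zero, so their subtracted term is small and they are preferred, which is the expected behavior of an uncertainty-driven selection.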

Given that the selection process is performed for each of the features, the total cost of learning all of the GPs is O(mn^2) for a dataset with m variables.

At testing time, the estimation of the local discriminative space for a query instance requires the estimation of the respective posterior mean and variance at each of the GPs. This takes O(mB^3) operations for a set of B observations included in the GPs.

Unfortunately, the computational cost calculated above is still too high for some datasets. This forces us to add a preprocessing step that reduces the number of candidate locations n at each iteration: we cluster the data with a k-medoids algorithm and use the cluster centers as the candidate locations.
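As a rough illustration of this preprocessing step, the sketch below implements a simple alternating k-medoids procedure. The text does not say which k-medoids variant is used, so the update rule, the random initialization and all names here are assumptions.

```python
import math
import random

def k_medoids(points, k, n_iter=100, seed=0):
    """Alternating k-medoids: assign points, then re-pick each cluster's medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    for _ in range(n_iter):
        # Assignment step: attach every point to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            closest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[closest].append(i)
        # Update step: the new medoid minimizes the total distance in its cluster.
        new_medoids = [
            min(members,
                key=lambda c: sum(math.dist(points[c], points[i]) for i in members))
            if members else m
            for m, members in clusters.items()
        ]
        if set(new_medoids) == set(medoids):
            break  # converged
        medoids = new_medoids
    # The medoids are actual data points and serve as candidate locations.
    return [points[m] for m in medoids]
```

Unlike k-means centroids, medoids are members of the dataset, so each one can be used directly as a candidate observation location for the GPs.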


5 Experimental Results

We evaluate our approach using synthetic and real datasets. We compare the classification accuracy against four global feature selection algorithms: 1) Best First with CFS subset evaluator [14], 2) Greedy Stepwise with Consistency Subset Evaluator [13], 3) Relief-F [23], and 4) a wrapper approach with a Naive Bayes classifier [25]; and against two distance metric learning algorithms: 1) Local Fisher Discriminant Analysis (LFDA) [41] and 2) Large Margin Nearest Neighbor classifier (LMNN) [42]. We use a K-NN classifier with the set of features obtained by each of the feature selection algorithms. In our experiments we noticed that different values of K do not substantially change the relative results among the algorithms, so we choose K = 8 in all the experiments. To evaluate accuracy, we use 10-fold cross-validation. We also provide an analysis of the strategic selection of observations, the levels of accuracy with respect to the percentage of data used in the regressions, and the robustness of our strategy for finding GP hyperparameters with respect to different starting values.
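The evaluation protocol can be sketched as below. This is a generic 10-fold cross-validation loop, not the authors' experimental code; the fold construction, the seed and the `classify` callback signature are assumptions made for illustration.

```python
import random
import statistics

def k_fold_indices(n, k=10, seed=0):
    """Split range(n) into k shuffled folds of (almost) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cross_validate(X, y, classify, k=10):
    """Mean and standard deviation of the accuracy of
    `classify(x, train_X, train_y)` over k folds (requires len(X) >= k)."""
    accuracies = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        hits = sum(classify(X[i], train_X, train_y) == y[i] for i in fold)
        accuracies.append(hits / len(fold))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

The returned pair corresponds to the "mean ± standard deviation" format used in the result tables.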

5.1 Synthetic Dataset

In this case we generate synthetic data according to the main hypothesis of this work.

The generation process starts by generating a random set of candidate subspaces, then

each data point is generated from one of the candidate subspaces. Points generated

from the same subspace have the same label. Different subspaces can also generate data

points with the same label. As underlying distributions we use Gaussian distributions.

To add noise we generate some points with uniform distributions. Algorithm 3 shows

the main steps of the data generation process:

Algorithm 3 Generation of synthetic data

To generate n instances xi, i ∈ [1 . . . n], from c different classes in a d-dimensional space:
Randomly generate a set B = {S1, S2, . . . , Sn} of subspaces, with |Si| ≤ d
Generate a normal distribution Gi, i ∈ [1 . . . d], for each variable in the subspaces from B
Let B′ = B ∪ S∅
Associate to each Si ∈ B′ one class from 1 to c. Let c(Si) be the associated class for Si
for i = 1 to n do
   Randomly select one element Sp from B′
   Set the label of xi equal to c(Sp)
   if Sp = S∅ then
      Sample xi from a uniform distribution in the d-dimensional space
   else
      for j = 1 to d do
         if j ∈ Sp then
            Sample xi(j) from G(j)
         else
            Sample xi(j) from a uniform distribution
         end if
      end for
   end if
end for
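A possible rendering of Algorithm 3 in Python is given below. The ranges of the Gaussian means, their standard deviation and the uniform interval are not given in the text and are chosen arbitrarily here; classes are numbered from 0 instead of 1, and all names are hypothetical.

```python
import random

def generate_synthetic(n, d, n_classes, n_subspaces, seed=0):
    """Generate n labeled points in d dimensions following Algorithm 3."""
    rng = random.Random(seed)
    # Random candidate subspaces (sets of feature indices).
    subspaces = [frozenset(rng.sample(range(d), rng.randint(1, d)))
                 for _ in range(n_subspaces)]
    subspaces.append(frozenset())  # empty subspace: pure uniform noise
    # One Gaussian (mean, std) per variable; the ranges are assumptions.
    gaussians = [(rng.uniform(0.0, 1.0), 0.05) for _ in range(d)]
    # Each subspace, including the empty one, is tied to one class.
    labels = [rng.randrange(n_classes) for _ in subspaces]
    X, y = [], []
    for _ in range(n):
        p = rng.randrange(len(subspaces))
        sp = subspaces[p]
        # Features inside the subspace are Gaussian, the rest are uniform.
        x = [rng.gauss(*gaussians[j]) if j in sp else rng.uniform(0.0, 1.0)
             for j in range(d)]
        X.append(x)
        y.append(labels[p])
    return X, y, subspaces
```

A call such as `generate_synthetic(500, 50, 4, 6)` mirrors the sizes reported for the synthetic experiment, although the resulting data will differ from the dataset used in the paper.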


In this experiment we generate 500 data points with 50 features, four different classes and six candidate subspaces. Table 1 shows the classification performance for each of the algorithms under test. We can observe that our method outperforms the other six algorithms.

Table 1: Mean accuracy and standard deviation for different feature selection models

and datasets using 10-Fold Cross-validation.

Method                   Accuracy
Gaussian Process         0.8542 ± 0.0343
Best First - CFS         0.7625 ± 0.0298
Greedy Stepwise - CSE    0.7479 ± 0.0533
Relief-F                 0.7500 ± 0.0529
Wrapper-NB               0.8063 ± 0.0598
LFDA                     0.8000 ± 0.0623
LMNN                     0.7896 ± 0.0632

5.2 Real Datasets

We use the following real datasets: the Breast Cancer dataset (digitized images of a fine needle aspirate (FNA) of breast masses [1]), the Isolet dataset (Isolated Letter Speech Recognition [1]), the Spectrometer dataset (Infra-Red Astronomy Satellite Project Database [1]), and the X-ray image dataset (X-ray images of aluminum wheels [28]). See Table 2 for details of the datasets. In the Isolet dataset, we use a preprocessing km-medoids clustering step with km = 1000.

Table 2: Details of real datasets.

Name             Instances   Features   Classes   Instances per class
Breast Cancer    569         31         2         357-212
Isolet           7797        617        26        ≈ 300 per class
Spectrometer     531         103        5         12-90-273-38-96
X-ray image      1780        336        2         869-911

Table 3 shows the classification performance for each of the algorithms under test. We can observe that our method outperforms the other six algorithms. We run a t-test to check whether the differences between the results, given their means and standard deviations, could be a random occurrence (Behrens-Fisher problem). We found that in all cases our improvement is statistically significant at the 90% confidence level, except for the X-ray images in two cases (Best First CFS and LMNN), where our results are significantly better at the 80% confidence level, and for the Isolet dataset in three cases (Relief-F, Wrapper-NB and LMNN), where our results are significantly better at only the 60% confidence level. Note that the algorithm works well despite the unbalanced proportion of classes, as in the Spectrometer dataset.
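For reference, a common approximate solution to the Behrens-Fisher problem is Welch's unequal-variance t-test. The sketch below computes its statistic and degrees of freedom from summary values; whether this exact variant was used is an assumption, as is the sample size of 10 (one accuracy per fold). The example values come from the Breast Cancer column of Table 3.

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = std_a ** 2 / n_a, std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, dof

# GP vs. Best First - CFS on Breast Cancer, assuming n = 10 folds per method.
t_stat, dof = welch_t(0.9693, 0.0216, 10, 0.9029, 0.0376, 10)
```

The statistic is then compared against the critical value of a t distribution with `dof` degrees of freedom at the chosen confidence level.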


Table 3: Mean accuracy and standard deviation for different feature selection models

and datasets using 10-Fold Cross-validation.

Method                  Breast Cancer      Isolet             Spectrometer       X-Ray Images
GPs                     0.9693 ± 0.0216    0.954 ± 0.0357     0.9062 ± 0.0422    0.8627 ± 0.0327
Best First - CFS        0.9029 ± 0.0376    0.8726 ± 0.0232    0.8733 ± 0.0427    0.8452 ± 0.0542
Greedy Stepwise - CSE   0.9124 ± 0.0435    0.8613 ± 0.0324    0.8831 ± 0.0375    0.8294 ± 0.0363
Relief-F                0.9441 ± 0.0344    0.9531 ± 0.0217    0.8712 ± 0.0527    0.8215 ± 0.0490
Wrapper-NB              0.9429 ± 0.0479    0.9472 ± 0.0399    0.8694 ± 0.0492    0.8271 ± 0.0405
LFDA                    0.9476 ± 0.0251    0.6734 ± 0.0315    0.8515 ± 0.0334    0.7927 ± 0.1106
LMNN                    0.9476 ± 0.0333    0.9481 ± 0.0289    0.8693 ± 0.0463    0.8483 ± 0.0245

5.3 Analysis of the generalization power of the regression.

Here we analyze the number of observations that our method uses to reach the accuracy values shown in Table 3. Figure 3 shows the percentage of data used as observations versus classification accuracy. For example, in the Breast Cancer dataset we use less than 20% of the available data to obtain good classification accuracy. In other cases, such as the X-ray image dataset, we need less than 70% of the available data to obtain good classification accuracy.

5.4 Analysis of the observation selection strategy

In this section, we evaluate the performance of our strategy for selecting observations considering the uncertainty of the estimation provided by the GPs. Figure 4 shows the regression error versus the number of observations included. The error corresponds to the MSE between the discriminative score predicted by the model and the real discriminative score. Note that using Equation (3) we can directly evaluate the real discriminative score. We can see that selecting points considering uncertainty achieves low error levels with fewer observations. This is true for all the datasets considered. Given that this analysis can be performed for each of the features available in the datasets, we present the average error over all of them. It is important to note that after adding a certain number of observations, the models overfit the training data, increasing the error levels in both cases. This is the main reason to be careful about the number of points we select as observations for the GPs.

Fig. 3 (panels (a)-(d)): Evaluation of the generalization capabilities for each real dataset. We can see that we need just a fraction of the available data as observations to reach suitable accuracy levels.

5.5 Analysis of the strategy to select hyperparameters and observations

In this section we evaluate the sensitivity of our model to different values of the hyperparameters used to initialize Algorithm 1. To do this, we use the same synthetic dataset as in Section 5.1. We run our model with 6 different starting values of the GP hyperparameters. Table 4 shows the results. We can see that the final values of the hyperparameters are slightly different for distinct starting points; however, the final classification results are not significantly affected. Note that the hyperparameters correspond to the characteristic length scale and the signal variance, respectively.

Fig. 4: Comparison of the mean squared error when selecting observations considering uncertainty versus selecting observations randomly, for (a) Breast Cancer, (b) Isolet, (c) Spectrometer and (d) X-Ray Wheels. The X axis corresponds to the number of observations included in the regression.

Table 4: Classification accuracy for different starting values of the hyperparameters.

Starting value   Final value       Accuracy
[0.3; 1.0]       [0.913; 0.991]    0.8496
[3.0; 1.0]       [1.137; 0.993]    0.8732
[1.0; 0.3]       [0.897; 0.833]    0.8561
[1.0; 1.0]       [0.972; 0.957]    0.8691
[1.0; 3.0]       [1.204; 1.093]    0.8605

5.6 Accuracy in the detection of discriminative subspaces.

Given that for the synthetic dataset in Section 5.1 we know exactly the local subspaces from which the data were generated, we can compare those subspaces with the ones detected by our algorithm. To do so, we compute two indicators for each data point: 1) Recall: how many features from the correct subspace we detect, and 2) Precision: of the features in the detected subspace, how many belong to the correct subspace. We analyze our results for different values of the parameter γ. Table 5 shows the mean ± standard deviation of precision and recall over the test data points. We can see that for higher values of γ the algorithm tends to choose fewer features per local subspace, which yields higher precision but lower recall. If we decrease the value of γ, the algorithm starts choosing more features per subspace, which increases recall but decreases precision.
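The two indicators can be computed per data point as in the following sketch, where both subspaces are represented as collections of feature indices (a representation assumed here for illustration, as is the function name).

```python
def precision_recall(detected, true):
    """Precision and recall of a detected feature subspace against the true one."""
    detected, true = set(detected), set(true)
    hits = len(detected & true)  # features that are both detected and correct
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(true) if true else 0.0
    return precision, recall
```

Averaging these values over all test points gives the kind of mean ± standard deviation summary reported in Table 5.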


Table 5: Comparison between the detected subspaces and the real subspaces.

γ       Precision        Recall
2.5     0.925 ± 0.201    0.55 ± 0.293
2       0.863 ± 0.232    0.66 ± 0.204
1.88    0.831 ± 0.239    0.69 ± 0.202
1.81    0.802 ± 0.239    0.71 ± 0.297
1.66    0.613 ± 0.271    0.86 ± 0.211
1.42    0.601 ± 0.243    0.89 ± 0.246
1.25    0.563 ± 0.285    0.94 ± 0.246

6 Discussion

We present a novel method to find local discriminative subspaces onto which data instances are projected in order to improve classification results. Our experiments show that the proposed method outperforms several traditional feature selection algorithms. The proposed iterative strategy allows us to approximately solve the well-known problems of hyperparameter learning and observation selection. Our analysis also shows that selecting observations considering uncertainty requires fewer observations to achieve low error levels in regression, which is desirable to reduce computational costs and to avoid overfitting. Our method contributes to eliminating noisy and redundant attributes that deteriorate the results of traditional distance-based classifiers. A relevant contribution of our approach is to cast as a regression problem the representation of a score related to the discriminative power of each attribute. This allows us to take advantage of the properties of GPs, achieving an efficient estimation of the classification relevance of each feature over the whole input space. An important advantage of our method is that it is not affected by the presence of unbalanced classes, because if any class is larger than the others, the model just selects the most informative instances, discarding the ones that do not contribute extra information. As future work, we are developing new techniques for GP regression in order to deal with a higher number of instances without losing accuracy in regression. Another important improvement is to evaluate the discriminative power of subsets of features instead of single features, since the greedy approach of the best-first search heuristic misses many subspaces where features become discriminative only when they are joined with other features.

References

1. Asuncion, A., Newman, D.: UCI Machine Learning Repository (2007). URL http://archive.ics.uci.edu/ml/
2. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research 6(1), 937–965 (2006)
3. Bellman, R.: Adaptive control processes - A guided tour. Princeton University Press, Princeton, New Jersey, U.S.A. (1961)
4. Bradley, P., Mangasarian, O.: Feature selection via concave minimization and support vector machines. In: Machine Learning Proceedings of the Fifteenth International Conference (ICML 98), pp. 82–90. Morgan Kaufmann (1998)
5. Chickering, D.: Learning from data: Artificial intelligence and statistics. Learning Bayesian networks is NP-Complete. Springer-Verlag (1996)
6. Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 349–356 (2005)
7. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39(1), 1–38 (1977)
8. Dumais, S., Platt, J., Heckerman, D., Sahami, M.: Inductive learning algorithms and representations for text categorization. In: Proc. of 7th Int. Conf. on Information and knowledge management, CIKM ’98, pp. 148–155. ACM (1998)
9. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Proc. of the Second European Conference on Computational Learning Theory, pp. 23–37. Springer-Verlag, London, UK (1995)
10. Goldberger, J., Roweis, S., Hinton, G., Salakhutdinov, R.: Neighbourhood components analysis. Advances in Neural Information Processing Systems 17, 513–520 (2005)
11. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning (3), 1157–1182 (2003)
12. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning pp. 389–422 (2002)
13. Liu, H., Setiono, R.: A probabilistic approach to feature selection. A filter solution. In: Proc. of 13th International Conference on Machine Learning, pp. 319–327 (1996)
14. Hall, M., Smith, L.: Feature subset selection: a correlation based filter approach. In: N. Kasabov, et al. (eds.) Proc. of Fourth Int. Conf. on Neural Information Processing and Intelligent Information Systems, pp. 855–858 (1998)
15. Harandi, M., Ahmadabadi, M., Araabi, B.: Optimal local basis: A reinforcement learning approach for face recognition. Int. Journal of Computer Vision (IJCV) 81(2), 191–204 (2009)
16. He, X., Niyogi, P.: Locality preserving projections. Advances in Neural Information Processing Systems 16 (2004)
17. John, G., Kohavi, R., Pfleger, K.: Irrelevant features and the subset selection problem. In: Proc. of Int. Conf. on Machine Learning, pp. 121–129. Morgan Kaufmann (1994)
18. Jolliffe, I.: Principal Component Analysis. Springer-Verlag (1986)
19. Kim, T., Kittler, J.: Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(3), 318–327 (2005)
20. Kira, K., Rendell, L.: The feature selection problem: Traditional methods and a new algorithm. In: 10th National Conf. on Artificial Intelligence, pp. 129–134 (1992)
21. Kohavi, R., John, G.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2), 273–324 (1997)
22. Koller, D., Sahami, M.: Toward optimal feature selection. In: Int. Conf. on Machine Learning, pp. 284–292 (1996)
23. Kononenko, I.: Estimating attributes: Analysis and extensions of relief. In: Proc. of European Conf. on Machine Learning, pp. 171–182 (1994)
24. Krause, A., Guestrin, C.: Nonmyopic active learning of Gaussian processes: An exploration-exploitation approach. In: Proc. of 24th Int. Conf. on Machine Learning (ICML), pp. 518–523 (2007)
25. Langley, P., Sage, S.: Induction of selective Bayesian classifiers. In: Proc. of 10th Conf. on Uncertainty in Artificial Intelligence, pp. 399–406 (1994)
26. Mackay, D.: Introduction to Gaussian processes. In: Book Neural Networks and Machine Learning, Springer-Verlag, pp. 84–92 (1998)
27. McKay, M., Beckman, R., Conover, W.: A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239–245 (1979)
28. Mery, D., Filbert, D.: Automated flaw detection in aluminum castings based on the tracking of potential defects in a radioscopic image sequence. IEEE Transactions on Robotics and Automation 18(6), 890–901 (2002)
29. Mitchell, T.: Machine Learning. McGraw Hill (1997)
30. Moore, A., Lee, M.: Efficient algorithms for minimizing cross validation error. In: Proc. of 11th Int. Conf. on Machine Learning, pp. 190–198 (1994)
31. Neumann, J., Schnörr, C., Steidl, G.: Combined SVM-based feature selection and classification. Machine Learning (61), 129–150 (2005)
32. Pazzani, M.: Searching for dependencies in Bayesian classifiers. In: Proc. of 5th Int. Workshop on Artificial Intelligence and Statistics, pp. 239–248 (1995)
33. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. In: IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 84–91 (1994)
34. Quinlan, R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986)
35. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. The MIT Press (2006)
36. Robnik Sikonja, M., Kononenko, I.: An adaptation of relief for attribute estimation in regression. In: Proc. of 14th Int. Conf. on Machine Learning, pp. 296–304 (1997)
37. Shalev-Shwartz, S., Singer, Y., Andrew, N.: In: Proc. of the Twenty First International Conference on Machine Learning, pp. 94–101 (2004)
38. Shental, N., Hertz, T., Weinshall, D., Pavel, M.: Adjustment learning and relevant component analysis. In: Proc. of the Seventh European Conference on Computer Vision, vol. 4, pp. 776–792. Springer-Verlag (2002)
39. Singh, M., Provan, G.: A comparison of induction algorithms for selective and non-selective Bayesian classifiers. In: Proc. of 12th Int. Conf. on Machine Learning, pp. 497–505. Morgan Kaufmann (1995)
40. Skalak, D.: Prototype and feature selection by sampling and random mutation hill climbing algorithms. In: Machine Learning: Proc. of 11th Int. Conf., pp. 293–301. Morgan Kaufmann (1994)
41. Sugiyama, M.: Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research 8, 1027–1061 (2007)
42. Weinberger, K., Blitzer, J., Saul, L.: Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, 207–244 (2009)
43. Xing, E., Andrew, N., Jordan, M., Russell, S.: Distance metric learning, with application to clustering with side-information. 14, 521–528 (2002)
