Journal of Statistical Planning and Inference 141 (2011) 3256–3266

Bayesian feature selection for classification with possibly large number of classes


Justin Davis a, Marianna Pensky a, William Crampton b

a Department of Mathematics, University of Central Florida, Orlando, USA
b Department of Biology, University of Central Florida, Orlando, USA

Article info

Article history:
Received 26 August 2010
Received in revised form 6 April 2011
Accepted 8 April 2011
Available online 16 April 2011

Keywords:
Classification
High-dimensional data
Bayesian feature selection
ANOVA


Abstract

In what follows, we introduce two Bayesian models for feature selection in high-dimensional data, specifically designed for the purpose of classification. We use two approaches to the problem: one which discards the components which have "almost constant" values (Model 1) and another which retains the components for which variations between the groups are larger than those within the groups (Model 2). We assume that p ≫ n, i.e. the number of components p is much larger than the number of samples n, and that only a few of those p components are useful for subsequent classification. We show that particular cases of the above two models recover familiar variance-based or ANOVA-based component selection. When one has only two classes and features are a priori independent, Model 2 reduces to the Feature Annealed Independence Rule (FAIR) introduced by Fan and Fan (2008) and can be viewed as a natural generalization of FAIR to the case of L > 2 classes. The performance of the methodology is studied via simulations and using a biological dataset of animal communication signals comprising 43 groups of electric signals recorded from tropical South American electric knife fishes.


1. Introduction

In the last decade, classification of high-dimensional vectors on the basis of a small number of samples has become very important in applications of statistics in engineering, biology, genetics and many other areas. It is well known that dimension reduction can significantly improve the precision of classification procedures. In fact, Fan and Fan (2008) argue that when the dimension of the vectors to be classified tends to infinity and the sample size remains finite, almost all linear discriminant rules can perform as badly as random guessing. Feature selection for classification becomes even more important when the number of classes is also large, since with a large number of classes one is usually resigned to simple linear decision-theoretical classification rules. Classification problems with a large number of classes arise, for example, in biology when classification of species is carried out on the basis of animal communication signals (see, e.g. Crampton et al., 2008).

Consider the following problem. One needs to classify a p-dimensional vector into one of the L classes ω_1, ..., ω_L on the basis of n training samples d^{(1)}, d^{(2)}, ..., d^{(n)} ∈ R^p from those classes, where p ≫ n. Here, for simplicity, we assume that the vectors d^{(1)}, ..., d^{(n_1)} belong to class ω_1, the vectors d^{(n_1+1)}, ..., d^{(n_1+n_2)} are from class ω_2, and so on. This set-up is common to many areas of application such as classification of gene expression data on the basis of microarray experiments, classification and recognition of targets for the military, or classification of various signals in engineering and the biological sciences.

The common feature of all those settings is that only very few components of the vectors d^{(1)}, d^{(2)}, ..., d^{(n)} contain useful information while the rest should be discarded. A reasonable choice is to discard components which are constant across the classes. Indeed, this approach is used in many natural science applications where the components with the highest variances are retained. It is especially useful in signal processing when the data represent low-dimensional objects embedded in a high-dimensional space and the majority of components are zeros. Model 1, which we consider below, follows this idea and allows one to select the components with the highest variances; for this reason we refer to it as the VARiance SELector (VARSEL). Tai and Speed (2006) used a similar procedure for identifying differentially expressed genes on the basis of microarray experiments. VERTISHRINK of Chang and Vidakovic (2002) applies essentially the same strategy as long as it is applied to components with zero sums. These methods, however, are not designed specifically for subsequent classification of the vectors and do not take into account any classwise structure.

A less obvious approach is to identify components which have relatively large variations between the classes in comparison with the in-class variations, similarly to ANOVA. This approach is also used in the natural sciences (see, e.g. Johnson and Synovec, 2002; Michel et al., 2008), though mostly in an ad hoc fashion, especially when it comes to selecting the number of components retained for classification. Model 2 follows this approach and identifies components which are "almost constant" within the classes. This method of feature selection is not only designed for classification, but actually becomes more accurate as the number of classes grows. In what follows, we refer to the method based on Model 2 as the CONstant FEature Selection Strategy (CONFESS). When one has only two classes and the features are assumed to be a priori independent, CONFESS reduces to the Feature Annealed Independence Rule (FAIR) introduced by Fan and Fan (2008) and can be viewed as a natural generalization of FAIR to the case of L > 2 classes. Unlike FAIR, CONFESS is naturally designed to accommodate any number of classes.

In what follows, we introduce two Bayesian models for feature selection in high-dimensional data, specifically designed for the purpose of classification. We use two approaches to the problem: one which discards the components which have "almost constant" values (VARSEL) and another which retains the components for which variations between the groups are larger than those within the groups (CONFESS). We assume that p ≫ n, i.e. the number p of components is much larger than the number n of samples, and that only a few of those p components are useful for subsequent classification. We show that particular cases of the above two models recover familiar variance-based or ANOVA-based component selection. The Bayesian approach allows one not only to locate the most significant features but also to select the number of features to be kept.

The Bayesian set-up used below is similar to that of George and Foster (2000) and Abramovich and Angelini (2006). The theory is supplemented by a simulation study using synthetic data. It is also applied to the classification of a biological dataset of animal communication signals comprising 21 groups of electric signals recorded from tropical South American electric knife fishes.

2. Bayesian models

For convenience, we arrange the row vectors d^{(1)}, d^{(2)}, ..., d^{(n)} into an (n × p)-dimensional matrix D and denote its columns by d_i ∈ R^n, i = 1, ..., p. The objective is to select a sparse subset of these p vectors which enables classification of the vectors d^{(1)}, d^{(2)}, ..., d^{(n)} into the classes ω_1, ..., ω_L. For this purpose, we introduce a binary vector x ∈ R^p with x_i = 1 if the vector component d_i is "informative" and should be retained in subsequent discriminatory analysis, and x_i = 0 if d_i should be discarded. The goal of the analysis, then, is to draw conclusions about the vector x on the basis of the matrix D. Here and in what follows, matrices and vectors are denoted in bold while their components are not.

We consider the following Bayesian set-up. Let d_i be a noisy measurement of the "true" i-th component l_i, i.e.

    d_i = l_i + ε_i,    (2.1)

where the errors ε_i are multivariate normal, ε_i ~ N(0, σ_i^2 I_n). The distribution of l_i depends on whether the component is informative (x_i = 1) or not (x_i = 0). We assume a priori that x_1, ..., x_p are identically distributed and that the number of informative components

    X = ∑_{i=1}^p x_i

is such that P(X = k) = π(k) ≥ 0 and ∑_{k=0}^p π(k) = 1. If x_1, ..., x_p are also independent with P(x_i = 1) = π, then X has the binomial distribution with parameters p and π. In general, we assume that π(k) depends on a parameter π, i.e. π(k) = π_π(k).

As mentioned above, we introduce two models describing two possible situations in which components are noninformative. On the one hand, a component may be noninformative if it takes an essentially constant value across the samples. On the other hand, non-constant vectors may be useless for classification if the information they contain does not allow for discrimination among the classes; e.g. individuals in a class may demonstrate some meaningful variation, but none of this variation is specific to the class.

In order to account for these two cases, we introduce the following notation. Let e ∈ R^n be the column vector with unit components, e_i = 1, i = 1, ..., n. We shall also need vectors g_l ∈ R^n, l = 1, ..., L, which act as "indicators" of class l. In particular, g_l is the column vector with j-th component (g_l)_j = 1 if the j-th sample corresponds to class l, i.e. if n_1 + ... + n_{l-1} + 1 ≤ j ≤ n_1 + ... + n_{l-1} + n_l, and (g_l)_j = 0 otherwise. We also denote by G the (n × L)-dimensional matrix with columns g_l, l = 1, ..., L.

Now we are ready to introduce the two models.

In VARSEL (Model 1), we assume that constant vectors are noninformative, i.e. for some scalar values μ_i one has

    l_i = n^{-1/2} μ_i e + w_i,  i = 1, ..., p,    (2.2)

where e^T w_i = 0 and

    μ_i ~ N(0, σ_i^2 τ^2),
    (w_i | x_i = 0) ~ δ(0),
    (w_i | x_i = 1) ~ N(0, σ_i^2 Σ_w).    (2.3)

The matrix Σ_w here characterizes the correlation among the informative columns and will be chosen so that it enforces orthogonality (and, therefore, independence) between μ_i e and w_i. Note that in VARSEL noninformative components a priori exhibit no variation across the samples: all variation is due only to the random errors in (2.1).

In CONFESS (Model 2), we search for the vectors which are constant within the classes but vary between the classes. This is consistent with the intuitive idea that the means of observations from the same class should be closer than the means of observations from different classes. In particular, we assume that l_i can be decomposed into an orthogonal sum of three components:

    l_i = n^{-1/2} μ_i e + u_i + v_i,  i = 1, ..., p,    (2.4)

where the u_i are vectors whose components are constant within a class but vary between classes, and the vectors v_i are chosen to be orthogonal to (and, thus, a priori independent of) μ_i e and u_i. We further assume that

    μ_i ~ N(0, σ_i^2 τ^2),
    (u_i | x_i = 0) ~ δ(0),
    (u_i | x_i = 1) ~ N(0, σ_i^2 Σ_u),
    v_i ~ N(0, σ_i^2 Σ_v),    (2.5)

where, for x_i = 1, the vectors u_i are constant within classes but vary between classes, and orthogonality (and, hence, independence) of μ_i e, u_i and v_i is further enforced by the covariance matrices Σ_u and Σ_v.

Next, we ought to construct the covariance matrices Σ_w in (2.3) and Σ_u and Σ_v in (2.5). For this purpose, we introduce S_C = Span(e), the linear subspace of constant vectors, and S_N = R^n ∖ S_C, its complement in R^n. It is easy to see that the matrices

    P_C = n^{-1} e e^T  and  P_N = I_n - P_C    (2.6)

are projection matrices for the spaces S_C and S_N, respectively. Therefore, for any positive definite matrix Σ, the matrix Σ_w = P_N Σ P_N will ensure that μ_i e and w_i are a priori independent. It is easy to check that, in this case, σ_i^2 Σ is the covariance matrix of the vector l_i given x_i = 1, i = 1, ..., p.

Construction of the matrices Σ_u and Σ_v is more involved. Consider the linear spaces S_G = Span(g_1, ..., g_L), S_1 = S_G ∖ S_C and S_0 = R^n ∖ S_G, so that R^n = S_C ⊕ S_0 ⊕ S_1. We then want to construct the matrices Σ_u and Σ_v so as to ensure the inclusions u_i ∈ S_1 and v_i ∈ S_0. The latter will guarantee orthogonality and independence, since the intersection of S_1 and S_0 contains only the zero vector. Note that the matrix

    P_G = G (G^T G)^{-1} G^T

projects an arbitrary vector onto S_G. Define the matrices

    P_0 = I_n - P_G,  P_1 = P_G - P_C.    (2.7)

Then the following statement is true.

Lemma 1. The matrices P_C, P_0 and P_1 are projection matrices for the linear spaces S_C, S_0 and S_1, respectively, so that, for any positive definite matrix Σ, the matrices Σ_u = P_1 Σ P_1 and Σ_v = P_0 Σ P_0 ensure that u_i ∈ S_1 and v_i ∈ S_0 and that the vectors u_i and v_i are independent. In this case, σ_i^2 Σ is the covariance matrix of the vector l_i given x_i = 1, i = 1, ..., p.

Proofs of this and the later statements are given in Appendix A.
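To make the constructions above concrete, the following minimal NumPy sketch builds the projectors P_C, P_N, P_G, P_0 and P_1 of (2.6)-(2.7) from the class sizes n_1, ..., n_L; it is an illustration written for this exposition (function and variable names are ours), not code from the paper.

```python
import numpy as np

def projection_matrices(class_sizes):
    """Return P_C, P_N, P_G, P_0, P_1 for contiguous classes of the given sizes."""
    n = int(np.sum(class_sizes))
    e = np.ones((n, 1))
    # G holds the class indicator vectors g_1, ..., g_L as columns.
    G = np.zeros((n, len(class_sizes)))
    start = 0
    for l, n_l in enumerate(class_sizes):
        G[start:start + n_l, l] = 1.0
        start += n_l
    P_C = e @ e.T / n                          # projector onto Span(e)
    P_N = np.eye(n) - P_C                      # projector onto its complement
    P_G = G @ np.linalg.solve(G.T @ G, G.T)    # projector onto Span(g_1, ..., g_L)
    P_0 = np.eye(n) - P_G                      # within-class ("residual") subspace S_0
    P_1 = P_G - P_C                            # between-class subspace S_1
    return P_C, P_N, P_G, P_0, P_1

if __name__ == "__main__":
    P_C, P_N, P_G, P_0, P_1 = projection_matrices([3, 4, 3])
    # Sanity checks implied by Lemma 1: idempotence and P_1 P_0 = 0.
    assert np.allclose(P_1 @ P_1, P_1) and np.allclose(P_0 @ P_0, P_0)
    assert np.allclose(P_1 @ P_0, np.zeros_like(P_1))
```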

3. Bayesian inference

To select "informative" vectors d_i, one needs to evaluate P(x_i = 1 | D), i = 1, ..., p. However, for each i = 1, ..., p, only a part of each vector d_i carries information about x_i, namely, the part associated with w_i ∈ S_N for VARSEL (Model 1) and with u_i ∈ S_1 for CONFESS (Model 2). Therefore, one needs to extract the "informative" parts of the vectors d_i, or, in other words, construct sufficient statistics for x_i. For this purpose, one could write out the likelihood function and try to single out the parts which do and do not interact with the x_i's. This, however, turns out to be extremely elaborate and messy, and, in order to bypass this difficulty, we construct orthogonal transformations of the data which naturally decompose them into components with the required properties.

To this end, we construct matrices R ∈ R^{n×n} and Q ∈ R^{n×n} which satisfy the following conditions. The matrix R has n^{-1/2} e^T as its first row and a matrix H_N ∈ R^{(n-1)×n} as its next (n-1) rows. The matrix Q has n^{-1/2} e^T as its first row, a matrix H_1 ∈ R^{(L-1)×n} as its next (L-1) rows and a matrix H_0 ∈ R^{(n-L)×n} as its last (n-L) rows.

In what follows, we extract y_i = H_N d_i for Model 1, or y_i = H_1 d_i and z_i = H_0 d_i for Model 2, and show that the y_i form sufficient statistics for either model, so that model selection is carried out on the basis of the vectors y_i only. For this to be true, one needs matrices R and Q, for Models 1 and 2, respectively, with the properties stated below.

Proposition 1. Let the matrix R ∈ R^{n×n} described above be such that R R^T = I_n and also H_N^T H_N = P_N. Let y_{i0} = n^{-1/2} e^T d_i and y_i = H_N d_i. Then, under conditions (2.1)-(2.3), y_{i0} and y_i, i = 1, ..., p, are independent with

    y_{i0} ~ N(0, σ_i^2 (1 + τ^2)),
    (y_i | x_i = 0) ~ N(0, σ_i^2 I_{n-1}),
    (y_i | x_i = 1) ~ N(0, σ_i^2 (I_{n-1} + H_N Σ H_N^T)).    (3.1)

Proposition 2. Let the matrix Q ∈ R^{n×n} described above be such that Q Q^T = I_n and also H_1^T H_1 = P_1 and H_0^T H_0 = P_0. Let y_{i0} = n^{-1/2} e^T d_i, y_i = H_1 d_i and z_i = H_0 d_i. Then, under conditions (2.1), (2.4) and (2.5), one has

    y_{i0} ~ N(0, σ_i^2 (1 + τ^2)),
    (y_i | x_i = 0) ~ N(0, σ_i^2 I_{L-1}),
    (y_i | x_i = 1) ~ N(0, σ_i^2 (I_{L-1} + H_1 Σ H_1^T)),
    z_i ~ N(0, σ_i^2 (I_{n-L} + H_0 Σ H_0^T)).    (3.2)

Also, y_{i0}, y_i and z_i are independent, i = 1, ..., p.

Note that the matrices R and Q carry out orthogonal transformations of the data d_i, i = 1, ..., p, replacing the vectors d_i by y_{i0} and y_i for Model 1, and by y_{i0}, y_i and z_i for Model 2. It follows from Propositions 1 and 2 that the vectors y_i alone contain information about the x_i. With some abuse of notation, denote I = I_{n-1}, Σ_y = H_N Σ H_N^T, z_i = 0 and Σ_z = 1 for Model 1, and I = I_{L-1}, Σ_y = H_1 Σ H_1^T, Z = {z_1, ..., z_p} and Σ_z = I_{n-L} + H_0 Σ H_0^T for Model 2. Let Ω be the diagonal matrix Ω = diag(σ_1^2, σ_2^2, ..., σ_p^2). Then, since all configurations of zeros and ones in the vector x are a priori equally likely, the joint pdf of Y_0 = (y_{10}, y_{20}, ..., y_{p0}), Y = {y_1, ..., y_p}, Z = {z_1, ..., z_p} and x is of the form

    p(Y_0, Y, Z, x | Ω, Σ_y, Σ_z, τ, π) = C(p,k)^{-1} I(X = k) π_π(k) (2π)^{-np/2} (1 + τ^2)^{-p/2} |Ω|^{-n/2} |I + Σ_y|^{-k/2} |Σ_z|^{-p/2}
        × exp{ - ∑_{i=1}^p (1/(2σ_i^2)) [ (1 + τ^2)^{-1} y_{i0}^2 + x_i y_i^T (I + Σ_y)^{-1} y_i + (1 - x_i) y_i^T y_i + z_i^T Σ_z^{-1} z_i ] },

where C(p,k) denotes the binomial coefficient,

so that, indeed, the following corollary holds:

Corollary 1. In both Models 1 and 2, matrix Y is the sufficient statistic for vector x.

The posterior distribution of each configuration x is given by

    p(x, k | Y, Ω, Σ_y, π) ∝ C(p,k)^{-1} I(X = k) π_π(k) |I + Σ_y|^{-k/2} exp{ (1/2) ∑_{i=1}^p x_i y_i^T (I + Σ_y^{-1})^{-1} y_i / σ_i^2 },

and it is independent of Y_0 and Z. Following Abramovich and Angelini (2006), we apply a maximum a posteriori (MAP) rule to choose the most likely configuration of zeros and ones in the vector x. The MAP rule implies that, for a given value of k, x_i = 1 for the k largest values of D_i, where

    D_i = σ_i^{-2} y_i^T (I + Σ_y^{-1})^{-1} y_i,  i = 1, ..., p,    (3.3)

and x_i = 0 otherwise. Let D_{(i)} be the i-th largest value, i.e. D_{(1)} ≥ D_{(2)} ≥ ... ≥ D_{(p)}. Then, denoting the chosen configuration by x(k), we derive the MAP value of k:

    k̂ = argmax_k [ 2 ln C(p,k)^{-1} + 2 ln π_π(k) - k ln|Σ_y + I| + ∑_{i=1}^k D_{(i)} ].    (3.4)


In order to carry out model selection according to (3.3) and (3.4), one needs to construct the matrices R and Q described above and to estimate the unknown parameters σ_1^2, σ_2^2, ..., σ_p^2 and π and the unknown matrix Σ_y.
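As a concrete illustration of the selection rule (3.3)-(3.4), the sketch below computes the statistics D_i and the MAP value of k under a binomial prior on the number of informative components. It is our own illustration under the assumption that the transformed data y_i and estimates of σ_i^2, Σ_y and π (Sections 4 and 5) are already available; the variable names are hypothetical.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import binom

def map_select(Y, sigma2, Sigma_y, prior_pi):
    """Y: (p, m) array with rows y_i^T; returns the MAP inclusion vector x and k."""
    p, m = Y.shape
    A = np.linalg.inv(np.eye(m) + np.linalg.inv(Sigma_y))     # (I + Sigma_y^{-1})^{-1}
    D = np.einsum('ij,jk,ik->i', Y, A, Y) / sigma2            # D_i of (3.3)
    order = np.argsort(D)[::-1]                               # D_(1) >= D_(2) >= ...
    log_det = np.linalg.slogdet(np.eye(m) + Sigma_y)[1]
    ks = np.arange(p + 1)
    log_choose = gammaln(p + 1) - gammaln(ks + 1) - gammaln(p - ks + 1)
    log_prior = binom.logpmf(ks, p, prior_pi)                 # binomial prior pi_pi(k)
    cum_D = np.concatenate(([0.0], np.cumsum(D[order])))
    crit = -2 * log_choose + 2 * log_prior - ks * log_det + cum_D   # objective of (3.4)
    k_hat = int(np.argmax(crit))
    x = np.zeros(p, dtype=int)
    x[order[:k_hat]] = 1
    return x, k_hat
```

Other priors π_π(k) can be substituted by replacing the `binom.logpmf` line.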

4. Construction of matrices R and Q

To construct the matrices R and Q of Propositions 1 and 2, we need a matrix H_N for Model 1 and matrices H_1 and H_0 for Model 2 with the following properties:

    H_N e = 0,  H_N H_N^T = I_{n-1},  H_N^T H_N = P_N,    (4.1)

    H_1 e = 0,  H_0 e = 0,  H_1 H_0^T = 0,  H_1 H_1^T = I_{L-1},
    H_0 H_0^T = I_{n-L},  H_1^T H_1 = P_1,  H_0^T H_0 = P_0,    (4.2)

since the first rows of both matrices R and Q are n^{-1/2} e^T.

For this purpose, we introduce diagonal n × n matrices K_N, K_1 and K_0, where K_N has (n-1) consecutive ones and a zero on the diagonal, K_1 has (L-1) consecutive ones and (n-L+1) zeros on the diagonal and, finally, K_0 has (L-1) consecutive zeros followed by (n-L) consecutive ones and then a zero on the diagonal. Introduce also the matrices T_N ∈ R^{(n-1)×n} with I_{n-1} in its first (n-1) columns, the remaining column being identically zero; T_1 ∈ R^{(L-1)×n} with I_{L-1} in its first (L-1) columns, the rest being identically zero; and T_0 ∈ R^{(n-L)×n} with its first (L-1) columns identically zero, then the matrix I_{n-L} in the next (n-L) columns, and the last column zero. By construction, we have

    T_N T_N^T = I_{n-1},  T_1 T_1^T = I_{L-1},  T_0 T_0^T = I_{n-L},    (4.3)
    T_N^T T_N = K_N,  T_1^T T_1 = K_1,  T_0^T T_0 = K_0.

Now, recall that P_N is a symmetric, idempotent matrix of rank (n-1); hence, there exists an orthogonal matrix U such that P_N = U^T K_N U. Let

    H_N = T_N U.    (4.4)

Then H_N H_N^T = T_N U U^T T_N^T = I_{n-1} and H_N^T H_N = U^T T_N^T T_N U = P_N due to (4.3). Also, observe that

    ‖H_N e‖^2 = e^T H_N^T H_N e = ‖P_N e‖^2 = 0,

so that H_N e = 0, and H_N satisfies all the conditions of Proposition 1.

so that HNe¼ 0, and HN satisfies all conditions of Proposition 1.To construct matrices H1 and H0, note that according to formula (2.7) and Lemma 1, matrices P1 and P0 are symmetric

idempotent matrices of ranks ðL�1Þ and ðn�LÞ, respectively, and commute pairwise, i.e. P1P0 ¼ P0P1 ¼ 0. For this reason,matrices P0 and P1 are simultaneously diagonalizable (see, e.g. Rao and Rao, 1998, p. 192) and there exists an orthogonalmatrix V such that

P1 ¼VTK1V, P0 ¼VTK0V:

Construct matrices H1 and H0 as

H1 ¼ T1V, H0 ¼ T0V: ð4:5Þ

Then, the last four equalities in (4.2) can be verified using (4.3) in a manner similar to the proof for HN . To show that thefirst two equalities in (4.2) hold, recall that PGe¼ PCe¼ e, so that JH1eJ2

¼ eT P1e¼ eT PGe�eT PCe¼ 0, and similarly,JH0eJ2

¼ 0. Now, the remaining equality H1HT0 ¼ 0 follows directly from T1TT

0 ¼ 0, and, hence, matrices P1 and P0 satisfy allconditions of Proposition 2.

We should mention here that some versions of the matrices R and Q can be constructed explicitly. In particular, R can be taken to be the n-dimensional Helmert matrix, so that H_N = H^{(n)} with elements of the form

    (H^{(n)})_{ji} = [ j(j+1) ]^{-1/2},     1 ≤ j ≤ n-1, 1 ≤ i ≤ j,
                   = -[ j/(j+1) ]^{1/2},    1 ≤ j ≤ n-1, i = j+1,
                   = 0,                     1 ≤ j ≤ n-1, j+2 ≤ i ≤ n.

The matrix H_0 can be constructed as a block matrix with the matrix H^{(n_1)} in its first (n_1 - 1) rows, H^{(n_2)} in the next (n_2 - 1) rows and so on. The matrix H_1 can be recovered by direct Gram-Schmidt orthogonalization. Set n_0 = 0. Then the elements of the matrix H_1 are of the form

    (H_1)_{ji} = 0,           1 ≤ i ≤ n_1 + ... + n_{j-1},
               = h_{jj},      n_1 + ... + n_{j-1} + 1 ≤ i ≤ n_1 + ... + n_j,
               = h_{j,j+1},   i > n_1 + ... + n_j,

for 1 ≤ j ≤ L-1, where

    h_{jj} = [ (n - n_1 - ... - n_j) / ( n_j (n - n_1 - ... - n_{j-1}) ) ]^{1/2},
    h_{j,j+1} = -[ n_j / ( (n - n_1 - ... - n_j)(n - n_1 - ... - n_{j-1}) ) ]^{1/2},    1 ≤ j ≤ L-1.

Note that formulae (4.4) and (4.5) deliver some orthogonal transformations of the explicit forms mentioned above.
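For completeness, here is a small NumPy sketch (ours, not the authors' code) of the construction (4.4)-(4.5): orthonormal rows spanning the ranges of P_N, P_1 and P_0 are obtained from an eigendecomposition, which yields matrices H_N, H_1 and H_0 satisfying (4.1)-(4.2); they may differ from the explicit Helmert-type forms above by an orthogonal rotation of the rows.

```python
import numpy as np

def rows_spanning(P, rank):
    """Orthonormal rows spanning the range of a projector P: H H^T = I and H^T H = P."""
    _, eigvec = np.linalg.eigh(P)               # ascending; the top `rank` eigenvalues are 1
    return eigvec[:, -rank:].T

def build_H(class_sizes):
    """Return H_N, H_1, H_0 satisfying (4.1)-(4.2) for contiguous classes of given sizes."""
    n, L = int(np.sum(class_sizes)), len(class_sizes)
    G = np.zeros((n, L))
    start = 0
    for l, n_l in enumerate(class_sizes):
        G[start:start + n_l, l] = 1.0
        start += n_l
    P_C = np.ones((n, n)) / n
    P_G = G @ np.linalg.solve(G.T @ G, G.T)
    H_N = rows_spanning(np.eye(n) - P_C, n - 1)   # Model 1 (VARSEL)
    H_1 = rows_spanning(P_G - P_C, L - 1)         # between-class rows, Model 2 (CONFESS)
    H_0 = rows_spanning(np.eye(n) - P_G, n - L)   # within-class rows, Model 2
    return H_N, H_1, H_0

# Usage: given the (n, p) data matrix D, the transformed data of Proposition 2 are
# Y = (H_1 @ D).T (rows y_i^T) and Z = (H_0 @ D).T (rows z_i^T).
```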

5. Estimation of parameters

Observe that both models reduce to

    y_{i0} ~ N(0, σ_i^2 (1 + τ^2)),
    (y_i | x_i = 0) ~ N(0, σ_i^2 I_m),
    (y_i | x_i = 1) ~ N(0, σ_i^2 (I_m + Σ_y)),
    z_i ~ N(0, σ_i^2 (I_r + Σ_z)),    (5.1)

where m = n-1, Σ_y = H_N Σ H_N^T, z_i = 0 and r = 0 for Model 1, and m = L-1, r = n-L, Σ_y = H_1 Σ H_1^T and Σ_z = H_0 Σ H_0^T for Model 2. To apply the model selection procedure described above, one needs to estimate the unknown parameters σ_i^2, i = 1, ..., p, the matrix Σ_y and the parameter π associated with the prior π_π(k). Note that for model selection one does not need to know τ or Σ_z.

Since the vector x is unknown, single-step estimation of the parameters is usually intractable. Therefore, we apply the EM algorithm, treating the vector x as a latent variable and alternating between computing the expectation of the log-likelihood, given the transformed data y_i, i = 1, ..., p, and the current values of the parameters (E-step), and estimating the parameters by maximizing the expected value of the log-likelihood (M-step).

The algorithm starts by assigning initial values σ_{i,[0]}^2, i = 1, ..., p, a matrix Σ_{y,[0]} and a parameter π_{[0]}. At the h-th iteration of the algorithm, given the current values σ_{i,[h]}^2, i = 1, ..., p, Σ_{y,[h]} and π_{[h]} of the unknown parameters, the E-step requires the posterior expectation of the latent vector x given the data. If the number k of nonzero components of the vector x has the binomial distribution (and, thus, the components x_i are independent), then, following Abramovich and Angelini (2006), one can find the posterior expectations x̂_i of x_i given y_i as x̂_i = (1 + B_i(y_i))^{-1}, where the B_i(y_i) are the Bayes factors

    B_i(y_i) = p(y_i | x_i = 0)(1 - π_{[h]}) / [ p(y_i | x_i = 1) π_{[h]} ]
             = [ (1 - π_{[h]}) / π_{[h]} ] |I_m + Σ_{y,[h]}|^{1/2} exp{ - y_i^T (I + Σ_{y,[h]}^{-1})^{-1} y_i / (2 σ_{i,[h]}^2) }.

Alternatively, if π_π(k) is not binomial, then, following George and Foster (2000), one can replace the posterior mean estimators of the x_i's by the posterior mode, which leads to choosing x_i = 1 for the k largest values of D_i and then estimating k by (3.4).

At the M-step, one needs to maximize the log-likelihood with respect to the matrix Σ_y, the entries σ_i^2, i = 1, ..., p, of the diagonal matrix Ω and the parameter π, given the data and the latent vector x:

    ℓ(Ω, Σ_y, π; Y, x) = const + log π_π(k) - log C(p,k) + log I( ∑_i x_i = k ) - (k/2) log|I_m + Σ_y|
        - ∑_{i=1}^p [ n log(σ_i^2) + ( x_i y_i^T (I_m + Σ_y)^{-1} y_i + (1 - x_i) y_i^T y_i ) / (2σ_i^2) ].    (5.2)

Since the general model is too general and includes a very large number of parameters, we consider two special cases of the model above. In particular, we study Case 1, where all the σ_i's are equal to each other, σ_i = σ, i = 1, ..., p, and Case 2, where the matrix Σ is proportional to the identity: Σ = ρ^2 I_n.

Case 1: σ_i = σ, i = 1, ..., p. If k ≪ p, then one can estimate σ by a variety of methods, e.g. by the median of the absolute deviations of the y_i divided by 0.6745 (Donoho and Johnstone, 1994). If the assumption k ≪ p does not hold, then σ^2 can be estimated by

    σ̂^2 = (1/(mp)) ∑_{i=1}^p [ x_i y_i^T (I_m + Σ_y)^{-1} y_i + (1 - x_i) y_i^T y_i ].

Then the matrix Σ_y is estimated by

    Σ̂_y = [ (1/(k σ̂^2)) ∑_{i=1}^p x_i y_i y_i^T - I_m ]_+ .

Here, in order to evaluate (A)_+ for a symmetric matrix A, we carry out its singular value decomposition A = U B U^T with an orthogonal matrix U and a diagonal matrix B, replace all negative entries of B by zeros, obtaining B_+, and finally set A_+ = U B_+ U^T. If Σ_y is assumed to be diagonal, then its diagonal entries are estimated by

    (Σ̂_y)_{jj} = ( k^{-1} σ̂^{-2} ∑_{i=1}^p x_i (y_i)_j^2 - 1 )_+ ,  j = 1, ..., m,    (5.3)

where (y_i)_j denotes the j-th coordinate of y_i and t_+ = max(t, 0) for any t ∈ R.
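The following sketch puts the Case 1 updates together with the Bayes-factor E-step for a diagonal Σ_y. It reflects our reading of this subsection rather than the authors' implementation, and the variable names are ours; in particular, the σ̂^2 update uses both the informative and the noninformative terms, as in the display above.

```python
import numpy as np

def em_step_case1(Y, sigma2, Sigma_y_diag, prior_pi):
    """One EM pass for Case 1 (common sigma^2, diagonal Sigma_y). Y: (p, m) rows y_i^T."""
    p, m = Y.shape
    # E-step: posterior expectations xhat_i = 1 / (1 + B_i(y_i)).
    quad = np.sum(Y**2 * (Sigma_y_diag / (1.0 + Sigma_y_diag)), axis=1)  # y^T (I + Sigma_y^{-1})^{-1} y
    log_B = (np.log((1 - prior_pi) / prior_pi)
             + 0.5 * np.sum(np.log1p(Sigma_y_diag))
             - quad / (2.0 * sigma2))
    xhat = 1.0 / (1.0 + np.exp(log_B))
    k = xhat.sum()
    # M-step: common sigma^2, then the diagonal entries of Sigma_y as in (5.3).
    whitened = np.sum(Y**2 / (1.0 + Sigma_y_diag), axis=1)               # y^T (I + Sigma_y)^{-1} y
    sigma2 = np.sum(xhat * whitened + (1 - xhat) * np.sum(Y**2, axis=1)) / (m * p)
    Sigma_y_diag = np.maximum((xhat[:, None] * Y**2).sum(axis=0) / (k * sigma2) - 1.0, 0.0)
    return xhat, sigma2, Sigma_y_diag
```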


Case 2: Σ = ρ^2 I_n. It follows from Propositions 1 and 2 that I_m + Σ_y = (1 + ρ^2) I_m. Hence, in Model 1, maximization of (5.2) with respect to σ_i^2 and ρ yields

    σ̂_i^2 = [ y_i^T y_i / (n-1) ] [ x_i / (1 + ρ̂^2) + (1 - x_i) ],

where ρ̂ is the solution of the equation

    ρ^2 = (1/k) ∑_{i=1}^p [ x_i (1 + ρ^2) / ( x_i + (1 + ρ^2)(1 - x_i) ) - 1 ].

In Model 2, one can use the separate portion of the likelihood associated with the z_i for estimation of a_i = (1 + ρ^2) σ_i^2, i = 1, ..., p. Since z_i ~ N(0, a_i I_{n-L}), one can estimate a_i by â_i = (n-L)^{-1} z_i^T z_i. Then ρ^2 can be estimated by plugging â_i, i = 1, ..., p, into (5.2) and maximizing (5.2) with respect to ρ for a given vector x. Therefore,

    ρ̂^2 = [ (L-1)(p-k) - ∑_{i=1}^p â_i^{-1} (1 - x_i) y_i^T y_i ] / [ ∑_{i=1}^p â_i^{-1} (1 - x_i) y_i^T y_i ].

Then σ_i^2 can be estimated by

    σ̂_i^2 = [ (1 + ρ̂^2)(n-L) ]^{-1} z_i^T z_i.    (5.4)

After Ω and Σ_y are estimated, the parameter π can be found as the solution of the following optimization problem:

    π̂ = argmax_π ℓ(Ω̂, Σ̂_y, π; Y, x),

where ℓ(Ω, Σ_y, π; Y, x) is defined in (5.2).
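For Model 2 under Case 2, the estimators above are simple enough to write down directly; the sketch below is our illustration (names are ours), with the ratio formula taken from the displays of this subsection and ρ̂^2 clipped at zero for numerical safety.

```python
import numpy as np

def case2_model2_estimates(Y, Z, x):
    """Y: (p, L-1) rows y_i^T; Z: (p, n-L) rows z_i^T; x: length-p inclusion weights."""
    p, m = Y.shape                      # m = L - 1
    r = Z.shape[1]                      # r = n - L
    a_hat = np.sum(Z**2, axis=1) / r                         # estimates of a_i = (1 + rho^2) sigma_i^2
    k = x.sum()
    s = np.sum((1 - x) * np.sum(Y**2, axis=1) / a_hat)       # sum of a_i^{-1} (1 - x_i) y_i^T y_i
    rho2_hat = max((m * (p - k) - s) / s, 0.0)               # ratio formula above, clipped at 0
    sigma2_hat = np.sum(Z**2, axis=1) / ((1.0 + rho2_hat) * r)   # (5.4)
    return a_hat, rho2_hat, sigma2_hat
```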

6. Interpretation

The model selection schemes suggested above offer one a wide variety of choices. For some particular choices of parameters, they also reduce to procedures studied earlier in the literature. Let Σ = ρ^2 I_n, so that I_{n-1} + Σ_y = (1 + ρ^2) I_{n-1} for Model 1 and I_{L-1} + Σ_y = (1 + ρ^2) I_{L-1} for Model 2. In both models, one selects the k vectors d_i with the highest values of the D_i defined in (3.3). With some abuse of notation, denote by (d_i)_{lj} the j-th component of the vector d_i associated with class ω_l. Let d̄_i^{(l)} be the mean of the l-th class components of the vector d_i and let d̄_i be its overall sample mean:

    d̄_i^{(l)} = n_l^{-1} ∑_{j=1}^{n_l} (d_i)_{lj},    d̄_i = n^{-1} ∑_{l=1}^L ∑_{j=1}^{n_l} (d_i)_{lj}.

Let also SSB(d_i) and SSW(d_i) be the sum of squares between classes and the sum of squares within classes in a standard ANOVA model, i.e.

    SSB(d_i) = ∑_{l=1}^L n_l ( d̄_i^{(l)} - d̄_i )^2,    SSW(d_i) = ∑_{l=1}^L ∑_{j=1}^{n_l} [ (d_i)_{lj} - d̄_i^{(l)} ]^2,    (6.1)

where n_l is the number of training samples in class ω_l, l = 1, ..., L.

Let us show that VARSEL (Model 1) leads to the selection of the vectors d_i with the highest variances, while CONFESS (Model 2) reduces to ANOVA if the σ_i's are different and can be viewed as a generalization of FAIR of Fan and Fan (2008) to the case of L > 2 classes when all the σ_i's are identical. In particular, the following statements are true.

Proposition 3. If Σ = ρ^2 I_n and σ_i = σ, i = 1, ..., p, then in VARSEL (Model 1) the D_i's are proportional to the variances of the vectors d_i:

    D_i ∝ Var(d_i) = n^{-1} d_i^T d_i - (d̄_i)^2.    (6.2)

Proposition 4. If Σ = ρ^2 I_n and σ_i = σ, i = 1, ..., p, then in CONFESS (Model 2) the D_i's are proportional to

    D_i ∝ SSB(d_i),    (6.3)

where SSB(d_i) is defined in (6.1).

Proposition 5. If Σ = ρ^2 I_n and the σ_i, i = 1, ..., p, are not necessarily the same, then in CONFESS (Model 2) the D_i's are proportional to

    D_i ∝ SSB(d_i) / SSW(d_i),    (6.4)

where SSB(d_i) and SSW(d_i) are defined in (6.1).

Proposition 3 implies that the selection procedure based on VARSEL (Model 1) is related to the VERTISHRINK procedure of Chang and Vidakovic (2002), which selects the vectors d_i such that n^{-1} d_i^T d_i ≥ σ̂^2, where σ̂^2 is an estimator of σ^2. If one subtracts the mean from each vector d_i, the latter reduces to D_i ≥ σ̂^2. The difference between VERTISHRINK and VARSEL then lies in the number, not the order, of the selected features.

Proposition 4 shows that, when L = 2 and Σ = ρ^2 I_n, CONFESS reduces to the Feature Annealed Independence Rule (FAIR) of Fan and Fan (2008). Indeed, it is easy to check that in this case

    D_i ∝ n^{-1} n_1 n_2 ( d̄_i^{(1)} - d̄_i^{(2)} )^2,

so that the D_i are proportional to the t_i statistics in FAIR. CONFESS retains the k components d_i with the highest values of D_i ∝ t_i, which is equivalent to hard thresholding of the t_i. Thus, CONFESS can be viewed as a natural generalization of the FAIR rule when one has a classification problem with L > 2 classes.

Finally, Proposition 5 demonstrates that, in the case of different σ_i's, Model 2 selects the features with the highest ANOVA ratios.
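The three screening scores identified by Propositions 3-5 are easy to compute directly from the data matrix; the sketch below (ours, for illustration) evaluates the variance of each d_i, SSB(d_i) and the ratio SSB(d_i)/SSW(d_i).

```python
import numpy as np

def screening_scores(D, labels):
    """D: (n, p) data matrix with rows d^(j); labels: length-n array of class ids."""
    n, p = D.shape
    overall = D.mean(axis=0)
    var_score = D.var(axis=0)                      # Proposition 3 (VARSEL ordering)
    ssb = np.zeros(p)
    ssw = np.zeros(p)
    for lab in np.unique(labels):
        block = D[labels == lab]                   # rows belonging to class `lab`
        class_mean = block.mean(axis=0)
        ssb += block.shape[0] * (class_mean - overall) ** 2
        ssw += ((block - class_mean) ** 2).sum(axis=0)
    return var_score, ssb, ssb / ssw               # Propositions 3, 4 and 5

# Selecting the k components with the largest values of a score reproduces the
# corresponding ordering of the D_i's.
```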

7. Simulation study

Recall that classification of the vectors, not dimension reduction, is the ultimate goal of the analysis. Therefore, the accuracy with which a method selects informative vectors, while an interesting metric in itself, cannot be the final measure of the precision of the method. For this reason, we use the classification error (the percentage of incorrectly classified data) as the measure of quality of the model selection procedures.

In the simulation study described below, we generated data according to the assumptions of CONFESS, subjected the data to three dimension reduction techniques, VARSEL, CONFESS and VERTISHRINK of Chang and Vidakovic (2002), and then used the retained vectors to train standard linear discriminant analysis (LDA). Specifically, we generated data with σ^2 = 1, Σ = ρ^2 I_n and pα "informative" components. Although these choices do not represent the most general model, they allow one to interpret the value of ρ as a loose analogue of the signal-to-noise ratio (SNR). Simulations were carried out with p = 1000, α = 0.05 and n = 10 individuals per class. We allowed ρ to vary between 1 and 5 and the number of classes L = 2, 3, 4, 5, 7, 10. For each combination of ρ and L we carried out M = 1000 runs. A randomly selected subset of the vectors, 40% of the individuals from any given group, was used for model selection and construction of the classification rule; the classification rule was then applied to the remaining 60% for validation. Classification errors are summarized in Table 1.

Although we report simulations with n = 10 only, our extensive studies show that moderate variations in the value of n have very little effect on the accuracy of the model selection techniques. On the other hand, it follows from Table 1 that the precision of the procedures depends heavily on the percentage α of informative vectors.

Table 1 shows that, if the assumptions of CONFESS are satisfied, the CONFESS model selection technique leads to a much smaller classification error than VARSEL and VERTISHRINK, while the latter two techniques lead to comparable classification errors. One should also mention that the classification errors for VARSEL and VERTISHRINK increase monotonically with the number of classes, while the classification errors for CONFESS decrease when the number of classes becomes large. This can be a very desirable feature when one needs to carry out classification with a large number of classes, as happens in the real-life example below.

Table 1. Classification error and (standard deviation), α = 0.05, n = 10.

ρ    L = 2          L = 3          L = 4          L = 5          L = 7          L = 10

CONFESS
1    0.249 (0.157)  0.449 (0.092)  0.519 (0.050)  0.554 (0.033)  0.578 (0.016)  0.588 (0.008)
2    0.143 (0.087)  0.293 (0.048)  0.392 (0.050)  0.449 (0.030)  0.516 (0.023)  0.549 (0.014)
3    0.071 (0.041)  0.184 (0.056)  0.255 (0.040)  0.324 (0.045)  0.380 (0.042)  0.352 (0.058)
4    0.026 (0.029)  0.071 (0.041)  0.114 (0.040)  0.140 (0.041)  0.111 (0.041)  0.072 (0.028)
5    0.005 (0.013)  0.021 (0.019)  0.029 (0.022)  0.028 (0.023)  0.012 (0.011)  0.009 (0.007)

VARSEL
1    0.485 (0.037)  0.561 (0.020)  0.580 (0.013)  0.588 (0.009)  0.593 (0.006)  0.597 (0.003)
2    0.475 (0.042)  0.545 (0.026)  0.570 (0.018)  0.579 (0.013)  0.589 (0.008)  0.594 (0.005)
3    0.400 (0.068)  0.510 (0.039)  0.531 (0.030)  0.554 (0.020)  0.573 (0.016)  0.583 (0.009)
4    0.286 (0.089)  0.373 (0.080)  0.429 (0.058)  0.452 (0.050)  0.480 (0.041)  0.503 (0.037)
5    0.112 (0.083)  0.183 (0.085)  0.210 (0.098)  0.224 (0.070)  0.232 (0.061)  0.246 (0.049)

VERTISHRINK
1    0.511 (0.035)  0.566 (0.021)  0.583 (0.014)  0.588 (0.008)  0.594 (0.006)  0.596 (0.004)
2    0.477 (0.045)  0.536 (0.030)  0.565 (0.018)  0.572 (0.015)  0.586 (0.010)  0.591 (0.006)
3    0.353 (0.067)  0.459 (0.045)  0.492 (0.034)  0.515 (0.030)  0.541 (0.022)  0.554 (0.015)
4    0.219 (0.087)  0.284 (0.072)  0.336 (0.051)  0.362 (0.049)  0.396 (0.039)  0.432 (0.034)
5    0.070 (0.055)  0.142 (0.068)  0.146 (0.061)  0.164 (0.056)  0.197 (0.046)  0.221 (0.041)
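To make the simulation protocol concrete, the sketch below generates one synthetic dataset consistent with the settings described above (σ^2 = 1, Σ = ρ^2 I_n, a fraction α of informative components with classwise-constant means) and evaluates LDA after a CONFESS-style screening. The data-generating details and the fixed number of retained components are our own simplifications (in the paper the number of retained components is chosen by the MAP rule (3.4)), and `screening_scores` refers to the sketch in Section 6.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def simulate_run(p=1000, alpha=0.05, n_per_class=10, L=5, rho=3.0, n_keep=50, seed=0):
    rng = np.random.default_rng(seed)
    n = n_per_class * L
    labels = np.repeat(np.arange(L), n_per_class)
    informative = rng.choice(p, size=int(alpha * p), replace=False)
    D = rng.standard_normal((n, p))                      # noise with sigma^2 = 1
    class_means = rho * rng.standard_normal((L, informative.size))
    D[:, informative] += class_means[labels]             # classwise-constant shifts
    # 40% of each class for training, the remaining 60% for validation.
    train = np.concatenate([np.where(labels == l)[0][:int(0.4 * n_per_class)] for l in range(L)])
    test = np.setdiff1d(np.arange(n), train)
    _, _, ratio = screening_scores(D[train], labels[train])   # CONFESS-style ANOVA ratios
    keep = np.argsort(ratio)[::-1][:n_keep]
    lda = LinearDiscriminantAnalysis().fit(D[np.ix_(train, keep)], labels[train])
    return 1.0 - lda.score(D[np.ix_(test, keep)], labels[test])   # misclassification rate
```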


8. Application to classification of communication signals of South American knife fishes

We applied the feature selection techniques discussed above to a dataset of communication signals recorded from South American knife fishes of the genus Gymnotus. These nocturnally active freshwater fishes generate pulsed electrostatic fields from electric organ discharges (EODs). In association with an array of dermal electroreceptors, these EODs permit electric communication, but also serve as the carrier signals for electrolocation (see, e.g. Bullock et al., 2005). The three-dimensional electrostatic EOD fields of Gymnotus can be summarized by two-dimensional head-to-tail waveforms recorded from underwater electrodes placed in front of and behind a fish. Signal recording and conditioning procedures are described in Crampton et al. (2008). EOD waveforms vary, among species, from approximately 2-4 ms in duration, and comprise 1-6 "phases" of alternating polarity. The signals of 473 individual fishes were recorded from 21 geographically co-occurring species at a site in the Amazon Basin. These groups contained from 5 to 69 individuals. Part of this dataset was previously used to demonstrate the classificatory power of signal features derived from discrete wavelet transforms (DWTs) (see Crampton et al., 2008). Fig. 1 shows representative electric signals of some of the specimens.

The objective is to classify fish species into groups according to their signal shapes, using the dataset described above as training samples. Note, however, that there is no expectation that these groups should all be mutually separable; especially among juveniles, overlap may be considerable. Therefore, while our primary metric is the misclassification rate, we also consider the efficiency of the representation, i.e. the dimensionality of the data after reduction. For the purpose of efficient representation of the waveforms, we apply the discrete wavelet transform with Symmlet 4 wavelets and then select features for subsequent classification using CONFESS, VARSEL and VERTISHRINK. For both of our models we assume that σ_i = σ, i = 1, ..., p, and estimate σ by the median of the variances of the y_i. For VARSEL, we also impose the restriction that the matrix Σ_y is diagonal, so that the diagonal entries of Σ_y are estimated using (5.3). Subsequently, classification is carried out using LDA.

In order to evaluate the performance of the techniques, we selected at random 40% of the data in each class as a training set. Here we insisted on the smallest possible training set; the smallest groups had (0.4)(5) = 2 representatives in the training set. In general practice it is, of course, recommended that the LDA be trained on as large a set as possible to avoid spurious classification rules which do not generalize to the whole set, but here we intentionally risk these errors in order to penalize the model selection techniques for the inclusion of noninformative vectors. Unavoidably, the LDA will perform some of the work of a model selection technique, but in this way we lessen its contribution.

The procedure was repeated 1000 times. CONFESS, VARSEL and VERTISHRINK achieve average misclassification rates of 0.277, 0.391 and 0.326, with standard deviations 0.017, 0.021 and 0.020, respectively; for each technique the errors were distributed roughly normally. Out of 515 vectors, the techniques retained 97, 162 and 24 vectors, respectively. There is a trade-off between the strictness and the inclusiveness of a model selection technique, reflected in the accuracy of the set it produces; while we do not claim that CONFESS achieves the optimal combination, in this case it clearly avoids excesses in either direction.

Fig. 1. Representative electric signals of some of the specimens of electric fish (horizontal axis: time in samples; vertical axis: normalized voltage).
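The processing pipeline described in this section can be sketched as follows; this is our own hedged illustration (the flattening of the wavelet coefficients, the function names and the number of retained features are assumptions, and in the paper that number is produced by the Bayesian selection rule rather than fixed).

```python
import numpy as np
import pywt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def dwt_features(waveforms, wavelet="sym4", level=4):
    """Flatten the multilevel Symmlet-4 DWT coefficients of each waveform into one vector."""
    feats = [np.concatenate(pywt.wavedec(w, wavelet, level=level)) for w in waveforms]
    return np.vstack(feats)

def classify_fish(waveforms, labels, train_idx, test_idx, n_keep=100):
    """waveforms: (n, T) array of head-to-tail recordings; returns predicted labels."""
    X = dwt_features(waveforms)
    _, _, ratio = screening_scores(X[train_idx], labels[train_idx])  # sketch from Section 6
    keep = np.argsort(ratio)[::-1][:n_keep]                          # retained components
    lda = LinearDiscriminantAnalysis().fit(X[np.ix_(train_idx, keep)], labels[train_idx])
    return lda.predict(X[np.ix_(test_idx, keep)])
```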

9. Discussion

In the present paper, we introduced two model selection algorithms, VARSEL and CONFESS. VARSEL can be viewed as a Bayesian version of the VERTISHRINK procedure studied by Chang and Vidakovic (2002). Both VARSEL and VERTISHRINK select features which are merely non-constant, not necessarily those which have any groupwise structure. The features selected by CONFESS will very likely also be selected by VARSEL or VERTISHRINK, but the latter two will also retain features not useful for classification, increasing the variance of any subsequent analysis. For this reason, CONFESS demonstrates performance superior to VARSEL and VERTISHRINK in both the simulation and the case study.

Additionally, if an implementation of VARSEL allows Σ_y to vary arbitrarily, estimation of the parameters requires consideration of an estimated (n-1) × (n-1) matrix, where n is the number of vectors to be classified; as n grows, this becomes more problematic and eventually intractable. If Σ_y is structured, e.g. blockwise diagonal, the computations may be factored and made more feasible, but we lose the only explicit characterization of between-class variation in VARSEL.

In CONFESS, however, Σ_y is only (L-1) × (L-1), where L is the number of classes. Not only does this imply that one may usually let Σ_y vary without restriction, but it also affords additional flexibility in the specification of other variables. For instance, one might not need to assert that σ_i = σ, i = 1, ..., p; i.e. there may be multiple classes of useful features with different distributions.

Nevertheless, we note that the main assumption of CONFESS, that features are normally distributed around classwise means, may not hold. In particular, when one suspects that the data are not simply linearly separable, the distributions of the features may admit no familiar characterization except that a vector is not constant; in this case, VARSEL or VERTISHRINK may give better results. The assumption of normality can be loosened by using normal mixtures, e.g. scale or Dirichlet mixtures. This approach could potentially allow one to investigate the applicability of CONFESS and VARSEL in the case of non-Gaussian data. This topic, however, is reserved for future research.

Acknowledgments

Research of the first two authors was supported in part by NSF Grant DMS-0652524. Research of the third author was supported in part by NSF Grant DEB-0614334.

Appendix A

Proof of Lemma 1. By construction as orthogonal projections, the matrices P_C and P_G are symmetric, idempotent, and identities on their respective subspaces S_C and S_G. Then P_G P_C = P_C since S_C ⊆ S_G. Also, P_1 e = P_G e - P_C e = e - e = 0 because e ∈ S_C, S_G. Finally,

    P_1 P_0 = P_0 P_1 = (I_n - P_G)(P_G - P_C) = P_G - P_C - P_G P_G + P_G P_C = 0.

To complete the proof of the lemma, observe that P_1 u_i = u_i since u_i ∈ S_1. □

Proof of Proposition 1. Note that R^T R = I_n implies that the matrix H_N satisfies (4.1). Then, for any i, one has y_i = H_N w_i + H_N ε_i, where H_N ε_i ~ N(0, σ_i^2 H_N H_N^T) = N(0, σ_i^2 I_{n-1}). Similarly, y_{i0} = μ_i + n^{-1/2} e^T ε_i and n^{-1/2} e^T ε_i ~ N(0, σ_i^2). The validity of formulae (3.1) can be checked by direct calculation. Also, the covariance between y_{i0} and y_i is

    Cov(y_{i0}, y_i) = E(y_{i0} y_i) = n^{-1/2} H_N E(d_i d_i^T) e = 0,

and since the vector (y_{i0}, y_i) is normally distributed, y_{i0} and y_i are independent. □

Proof of Proposition 2. The proof is very similar to the proof of Proposition 1. □

Proof of Proposition 3. Note that

    d̄_i^{(l)} = n_l^{-1} d_i^T g_l,    d̄_i = n^{-1} d_i^T e.

Then, in Model 1, one has (I_{n-1} + Σ_y^{-1})^{-1} = (1 + ρ^{-2})^{-1} I_{n-1} and, by Proposition 1,

    D_i = σ^{-2} (1 + ρ^{-2})^{-1} y_i^T y_i = σ^{-2} (1 + ρ^{-2})^{-1} d_i^T P_N d_i.

Note that d_i^T P_N d_i = d_i^T d_i - n^{-1} (d_i^T e)^2, so that (6.2) is valid. □

Proof of Proposition 4. Note that in Model 2 one has (I_{L-1} + Σ_y^{-1})^{-1} = (1 + ρ^{-2})^{-1} I_{L-1} and, by Proposition 2, D_i = σ^{-2} (1 + ρ^{-2})^{-1} y_i^T y_i = σ^{-2} (1 + ρ^{-2})^{-1} d_i^T P_1 d_i. Recall that the matrix P_1 = P_G - P_C is idempotent and symmetric, so that

    d_i^T P_1 d_i = ‖(P_G - P_C) d_i‖^2.

Now, observe that P_C and P_G project d_i onto the spaces of constant and of classwise constant vectors, respectively, i.e. P_C d_i = d̄_i e and

    P_G d_i = ( d̄_i^{(1)}, ..., d̄_i^{(1)}, d̄_i^{(2)}, ..., d̄_i^{(2)}, ..., d̄_i^{(L)}, ..., d̄_i^{(L)} )^T,    (A.1)

where each class mean d̄_i^{(l)} is repeated n_l times. Therefore, (6.3) holds. □


Proof of Proposition 5. By Proposition 4,

    y_i^T (I + Σ_y^{-1})^{-1} y_i ∝ SSB(d_i).

Hence, to prove the proposition it is sufficient to show that σ̂_i^2 ∝ SSW(d_i). Note that, by (5.4), σ̂_i^2 ∝ z_i^T z_i = d_i^T H_0^T H_0 d_i = d_i^T d_i - d_i^T P_G d_i, since H_0^T H_0 = P_0 = I_n - P_G by (2.7) and (4.2). Now, to complete the proof, use (A.1). □

References

Abramovich, F., Angelini, C., 2006. Bayesian maximum a posteriori multiple testing procedure. Sankhya 68, 436–460.
Bullock, T.H., Hopkins, C.D., Popper, A.N., Fay, R.R., 2005. Electroreception. Springer, New York.
Chang, W., Vidakovic, B., 2002. Wavelet estimation of a base-line signal from repeated noisy measurements by vertical block shrinkage. Computational Statistics and Data Analysis 40, 317–328.
Crampton, W.G.R., Davis, J.K., Lovejoy, N.R., Pensky, M., 2008. Multivariate classification of animal communication signals: a simulation-based comparison of alternative signal processing procedures, using electric fishes. Journal of Physiology—Paris 102, 304–321.
Donoho, D.L., Johnstone, I.M., 1994. Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425–455.
Fan, J., Fan, Y., 2008. High-dimensional classification using feature annealed independence rules. Annals of Statistics 36, 2605–2637.
George, E.I., Foster, D.P., 2000. Calibration and empirical Bayes variable selection. Biometrika 87, 731–747.
Johnson, K.J., Synovec, R.E., 2002. Pattern recognition of jet fuels: comprehensive GC×GC with ANOVA-based feature selection and principal component analysis. Chemometrics and Intelligent Laboratory Systems 60, 225–237.
Michel, V., Damon, C., Thirion, B., 2008. Mutual information-based feature selection enhances fMRI brain activity classification. In: Biomedical Imaging: From Nano to Macro, ISBI 2008, May 2008, pp. 592–595.
Rao, C.R., Rao, M.B., 1998. Matrix Algebra and its Applications to Statistics and Econometrics. World Scientific, Singapore.
Tai, Y.C., Speed, T.P., 2006. A multivariate empirical Bayes statistic for replicated microarray time course data. Annals of Statistics 34, 2387–2412.