Dipartimento di Ingegneria Biofisica ed Elettronica, Università di Genova
Prof. Sebastiano B. Serpico
4. Feature Reduction
Complexity of a Classifier
• As the number n of features increases, classifier design presents different issues connected to the dimensionality of the problem ("curse of dimensionality"):
– Computational complexity;
– Hughes phenomenon.
• Computational complexity
– As n increases, the computational complexity of a classifier increases. For some classification techniques this increase is linear with n; for others it is of a higher order (e.g., quadratic).
– The increase in complexity involves an increase in computation time and a larger memory occupation.
Hughes Phenomenon
• Intuitive reasoning
– as n increases, the amount of information available to the classifier should increase, and consequently so should the classification accuracy, but...
• Experimental observation
– ... on the contrary, for a fixed number N of training samples, the probability of a correct decision of a classifier increases for 1 ≤ n ≤ n* up to a maximum and decreases for n > n* (Hughes phenomenon).
• Interpretation
– As n increases, the number of parameters K_n of the classifier becomes higher and higher.
– As the ratio K_n/N increases, the number of available training samples becomes "too small" to obtain a satisfactory estimate of such parameters.
Feature Reduction
• A solution to these dimensionality issues is to reduce the number n of the features used in the classification process (feature reduction or parameter reduction).
• Disadvantage: reducing the dimension of the feature space involves a loss of information.
• Two main strategies exist to achieve feature reduction:
– feature selection: within the set of the n available features, a subset of m features is identified by adopting an optimization criterion, chosen to minimize the loss of information or maximize the classification accuracy;
– feature extraction: a transformation (often linear) of the original (n-dimensional) feature space into a space of smaller dimension m is applied in such a way as to minimize the information loss or maximize the classification accuracy.
Feature Selection
• Problem setting:
– Given a set X = {x1, x2, …, xn} of n features, identify the subset S ⊂ X, composed of m features (m < n), that maximizes the functional J(·):

$$S^* = \arg\max_{S \subset X} J(S)$$

• An algorithm for feature selection is then defined on the basis of two distinct objects:
– the functional J(·). It has to be defined such that J(S) measures the "goodness" of the feature subset S in the classification process;
– the algorithm for the search of the subset S*. The subsets of X number 2^n, so an exhaustive search is computationally not feasible, except for small values of n. Therefore, sub-optimal maximization strategies are adopted to detect "good" solutions, even if they do not correspond to global optima.
Bhattacharyya Bounds
• A choice of the functional J(·) that is significant from the classification point of view can be based on the criterion of the minimum of the error probability Pe.
– In the presence of two classes ω1 and ω2 only, the Bhattacharyya distance B and the Bhattacharyya coefficient ρ provide an upper bound on Pe:

$$P_e \le \sqrt{P_1 P_2}\;\rho = \sqrt{P_1 P_2}\,\exp(-B)$$

$$\text{where } B = -\ln\rho \ \text{ and } \ \rho = \int_{\mathbb{R}^n} \sqrt{p(\mathbf{x}|\omega_1)\,p(\mathbf{x}|\omega_2)}\; d\mathbf{x}$$

– Moreover, it is possible to demonstrate that:

$$\frac{1}{2}\left(1 - \sqrt{1 - 4P_1P_2\,\rho^2}\right) \;\le\; P_e \;\le\; \sqrt{P_1 P_2}\;\rho$$
Bhattacharyya Distance and Coefficient
• An approach to feature selection consists in the maximization of the Bhattacharyya distance B or (equivalently) in the minimization of the Bhattacharyya coefficient ρ.
– In particular, a distance B(S) (or a coefficient ρ(S)) can be associated with each subset S of m features; in fact, denoting a vector of the feature subset S by xS, one can define:

$$B(S) = -\ln\rho(S) \quad\text{and}\quad \rho(S) = \int_{\mathbb{R}^m} \sqrt{p(\mathbf{x}_S|\omega_1)\,p(\mathbf{x}_S|\omega_2)}\; d\mathbf{x}_S$$

• Properties:
– 0 ≤ ρ(S) ≤ 1 and then B(S) ≥ 0;
– if p(xS|ω1) and p(xS|ω2) are different from zero only in separated regions, then ρ(S) = 0 and B(S) = +∞;
– if p(xS|ω1) = p(xS|ω2) for any xS, then ρ(S) is the integral of a pdf over the entire space ℝ^m; therefore ρ(S) = 1 and B(S) = 0.
Computation of the Bhattacharyya Coefficient and Distance
• ρ(S) is a multiple integral in an m-dimensional space, so its analytical computation starting from the conditional pdfs is complex. Two particular cases in which the computation is simple exist:
– if the features in the subset S are independent when conditioned to each class, i.e.:

$$p(\mathbf{x}_S|\omega_i) = \prod_{x_r \in S} p(x_r|\omega_i), \qquad i = 1,2,$$

then the following additive property is valid:

$$\rho(S) = \prod_{x_r \in S} \rho(\{x_r\}) \quad\text{and}\quad B(S) = \sum_{x_r \in S} B(\{x_r\});$$

– if $p(\mathbf{x}_S|\omega_i) = \mathcal{N}(\mathbf{m}_i^S, \Sigma_i^S)$ (i = 1, 2), we obtain:

$$B(S) = \frac{1}{8}\,(\mathbf{m}_2^S-\mathbf{m}_1^S)^t \left(\frac{\Sigma_1^S+\Sigma_2^S}{2}\right)^{-1} (\mathbf{m}_2^S-\mathbf{m}_1^S) \;+\; \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1^S+\Sigma_2^S}{2}\right|}{\sqrt{|\Sigma_1^S|\,|\Sigma_2^S|}}$$
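As a numerical illustration of the Gaussian closed form above, the following Python sketch (NumPy assumed; the function name and test values are hypothetical) evaluates B(S) and the corresponding coefficient ρ(S) = exp(−B(S)):

```python
import numpy as np

def bhattacharyya_gaussian(m1, m2, S1, S2):
    """Closed-form Bhattacharyya distance between N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    S1, S2 = np.atleast_2d(S1), np.atleast_2d(S2)
    Sm = (S1 + S2) / 2.0                       # average covariance matrix
    d = m2 - m1
    term_means = d @ np.linalg.solve(Sm, d) / 8.0
    term_covs = 0.5 * np.log(np.linalg.det(Sm)
                             / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term_means + term_covs

# two hypothetical 2-D Gaussian classes
B = bhattacharyya_gaussian([0, 0], [2, 0], np.eye(2), 2 * np.eye(2))
rho = np.exp(-B)   # Bhattacharyya coefficient of the pair
```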
Other Inter-Class Distances
• In addition to the Bhattacharyya distance, different measures of inter-class distance have been introduced in the literature.
– For example, the Divergence measures the separation between two classes as a function of the likelihood ratio between the respective conditional pdfs.
– The Bhattacharyya distance and the Divergence are not upper bounded. This makes them less appropriate as measures of inter-class separation. In fact, focusing on the Gaussian case for simplicity, when two classes are already well separated, an increment of the distance between the conditional means m1^S and m2^S generates a "large" increment of B(S) but an irrelevant reduction of Pe.
– Other measures of inter-class distance have therefore been proposed (not treated in depth here) that, being upper bounded, do not present such a problem. Among them, we recall the Jeffries-Matusita distance and the Modified Divergence [Richards 1999, Swain 1978].
Multiclass Extension
• Extension to the case of M classes ω1, ω2, …, ωM.
– If ρij(S) and Bij(S) are the Bhattacharyya coefficient and distance between two classes ωi and ωj computed over a feature subset S, and if Pi = P(ωi) is the a priori probability of the class ωi, the following average Bhattacharyya coefficient and average Bhattacharyya distance are defined:

$$\rho_{ave}(S) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} P_i P_j\,\rho_{ij}(S), \qquad B_{ave}(S) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} P_i P_j\,B_{ij}(S)$$

• Remarks
– In the case M = 2, the maximization of B(S) was equivalent to the minimization of ρ(S) (because B(S) = −ln ρ(S)). In the multiclass case the maximization of Bave(S) is no longer equivalent to the minimization of ρave(S), because the relation between Bave(S) and ρave(S) is no longer monotonic.
– Under the hypothesis of class-conditional feature independence, we have (see the sketch below):

$$B_{ave}(S) = \sum_{x_r \in S} B_{ave}(\{x_r\})$$

Attention! This additive property is not valid for ρave(S).
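Under the same Gaussian assumption, Bave(S) can be accumulated over all class pairs; a minimal sketch reusing the hypothetical bhattacharyya_gaussian function above (per-class means, covariances, and priors are assumed to be given):

```python
from itertools import combinations

def b_average(means, covs, priors):
    """Average Bhattacharyya distance B_ave(S) over all pairs (i, j), i < j."""
    B_ave = 0.0
    for i, j in combinations(range(len(priors)), 2):
        B_ave += priors[i] * priors[j] * bhattacharyya_gaussian(
            means[i], means[j], covs[i], covs[j])
    return B_ave
```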
Maximization of the Functional
• In a feature selection problem, the inter-class separation measures introduced above are used in the role of the functional J(·) to be maximized.
• Preliminary observations:
– The Bhattacharyya distance has to be maximized, while the Bhattacharyya coefficient has to be minimized. Therefore, in the following, J(S) may correspond to Bave(S) or to −ρave(S).
– An exhaustive search over all possible subsets of X is, in general, computationally not affordable.
– It is feasible if the features are independent when conditioned to each class and if the adopted functional is Bave. In such a case, in fact, once the values of the functional associated with the single features have been computed, the additive property implies that the optimum subset S* of m features is simply composed of the m features that individually present the m highest values of Bave({xr}).
Sequential Forward Selection
• In general, the search for a subset of m features is conducted by means of a sub-optimal algorithm. Among such algorithms we consider (for its simplicity) the sequential forward selection (SFS), which is based on the following steps (see the sketch after this list):
– initialize S* = ∅;
– compute the value of the functional J(S* ∪ {xi}) for every feature xi ∉ S*, and choose the feature x* ∉ S* that corresponds to the maximum value of J(S* ∪ {xi});
– update S* by setting S* = S* ∪ {x*};
– continue by iteratively adding one feature at a time until S* reaches the desired cardinality m or until the value of the functional stabilizes (reaches saturation).
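A minimal Python sketch of these steps; the functional J (e.g., Bave evaluated on a candidate subset) is assumed to be supplied by the caller, and all names are hypothetical:

```python
def sfs(features, J, m):
    """Sequential forward selection: greedily add, one at a time,
    the feature that maximizes the functional J, until |S*| = m."""
    selected, remaining = [], list(features)
    while len(selected) < m and remaining:
        # choose x* maximizing J(S* U {x}) over the remaining features
        best = max(remaining, key=lambda x: J(selected + [x]))
        selected.append(best)
        remaining.remove(best)
    return selected
```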
Remarks on SFS
• SFS identifies the optimum subset that can be obtained by iteratively adding a single feature at a time.
– At the first step, the single feature that corresponds to the maximum value of the functional is chosen. At the second step, the feature that, coupled with the previous one, provides the maximum value of the functional is added. And so on...
– The method is sub-optimal. For example, the optimal couple of features does not always include the single optimal feature.
• Advantage
– SFS is not computationally heavy, even if X contains hundreds of features.
• Disadvantage
– A feature that has been included in the selected subset S* at a given iteration cannot be removed during the following iterations; that is, SFS does not allow backtracking.
Sequential Backward Selection
• Sequential backward selection (SBS) proceeds in a dual way with respect to SFS: it initializes S* = X and eliminates a single feature at a time from S*, so as to maximize the functional J(S) step by step.
• Disadvantages
– Like SFS, SBS does not allow backtracking: a feature eliminated from S* at a given iteration will never be recovered in the following steps;
– Usually SBS is computationally disadvantageous with respect to SFS: while SFS starts from an empty subset and adds one feature at a time, SBS starts from the original feature space. Therefore, SBS computes values of the functional in spaces with much higher dimensions than SFS. However, it is advantageous if m is close to n.
• In the literature, other more complex methods have been proposed (not covered here) to search for suboptimal subsets, which also allow backtracking [Serpico et al., 2001].
Operational Aspects of Feature Selection
• The computation of the inter-class distance measures used in feature selection requires the knowledge of the class-conditional pdfs and of the class prior probabilities.
– Usually, such pdfs are not known a priori but must be estimated from a training set, by means of parametric or non-parametric methods.
– Globally, a classification system that involves a feature selection step can be summarized by the following flowchart.

[Flowchart: training samples for each class {ωi} → estimation of the class-conditional pdfs and class prior probabilities {p(x|ωi), Pi} → feature selection (output S*) → training of the classifier → application of the classifier to the data set → classification of the data set.]
Example
• Hyperspectral data set with 202 features and 9 classes.

[Plot: average Bhattacharyya distance Bave as a function of the number m of selected features.]

[Figure: RGB composition of three of the 202 bands acquired by the sensor.]

[Figure: map of the ground truth that highlights the training pixels.]

[Plot: estimated probability of correct classification (overall accuracy, OA) for a MAP classifier under the hypothesis of Gaussian classes, as a function of m; Pc,max = 88.6% for m = 40.]
Feature Extraction
• Problem definition:
– Given a set X = {x1, x2, …, xn} of n features, we want to identify a linear transformation y = Tx that provides a transformed set of m features Y = {y1, y2, …, ym} (with m < n), chosen so as to minimize the loss of information or maximize the classification accuracy.
Extraction Based on Inter-Class Distances
• Considering again the Bhattacharyya distance, in the case of two Gaussian classes, we look for the orthonormal feature transformation that maximizes the distance in the transformed space.
– Let $\mathbf{m}_i^Y = E\{\mathbf{y}|\omega_i\} = T\,\mathbf{m}_i$ and $\Sigma_i^Y = \mathrm{Cov}\{\mathbf{y}|\omega_i\} = T\,\Sigma_i\,T^t$ (for i = 1, 2); B in the transformed space Y is given by:

$$B(Y) = \underbrace{\frac{1}{8}\operatorname{tr}\!\left[\left(\frac{\Sigma_1^Y+\Sigma_2^Y}{2}\right)^{-1}(\mathbf{m}_2^Y-\mathbf{m}_1^Y)(\mathbf{m}_2^Y-\mathbf{m}_1^Y)^t\right]}_{B_m(Y)} \;+\; \underbrace{\frac{1}{2}\ln\frac{\left|\frac{\Sigma_1^Y+\Sigma_2^Y}{2}\right|}{\sqrt{|\Sigma_1^Y|\,|\Sigma_2^Y|}}}_{B_\Sigma(Y)}$$

• In the expression of B, two distinct contributions Bm(Y) and BΣ(Y) appear, respectively linked to the conditional means and to the conditional covariance matrices.
Extraction Based on Inter-Class Distances
• In principle, we would search for the orthogonal matrix T that maximizes B(Y). However:
– The general problem of the maximization of B(Y) with respect to T has no closed-form solution.
– The problems of separately maximizing Bm(Y) or BΣ(Y) have closed-form solutions (eigenproblems). Details can be found in [Fukunaga, 1990].
– Therefore, if one of the two contributions is largely dominant over the other (i.e., Bm(Y) ≫ BΣ(Y) or Bm(Y) ≪ BΣ(Y)), the dominant contribution alone can be maximized in closed form.
Linear discriminant analysis
• A popular method for feature extraction is linear discriminant analysis (LDA, aka discriminant analysis feature extraction, DAFE), which maximizes a measure of separation and compactness of the classes directly defined on the training set.
– Although explicit parametric assumptions are not stated, DAFE is usually considered parametric, because it "works poorly," for example, with multimodal classes, and it characterizes the classes only through first- and second-order moments.
– Nonparametric extensions of this method have, however, been introduced recently.
• The method can be applied to both binary and multiclass problems.
– Focusing first on the case of two classes, ω1 and ω2, linear discriminant analysis provides an optimum scalar projection, named the Fisher transform.
DAFE: Fisher transform
• In general, even though the classes are well separated in the original n-dimensional space, they may not be so in a transformed one-dimensional space, because the projection can overlay samples drawn from different classes.
• The problem is to find the orientation of the projection line that provides the best separation between the two classes.
– Given a set {x1, x2, …, xN} of N pre-classified samples, let Di be the subset of the samples assigned to ωi (i = 1, 2) and let Ni be the cardinality of Di (obviously N = N1 + N2).
– A transformation y = w^t x projects the sample xk to yk = w^t xk. Let Ei = {y = w^t x : x ∈ Di}.
– We search for the transformation y = w^t x that maximizes the inter-class separation and minimizes the intra-class dispersion, conveniently quantified.
Inter-class separation and intra-class dispersion
• First, a functional that measures inter-class separation and dispersion inside each class is necessary.
– As a measure of inter-class separation, the difference between the centroids of the samples in the transformed space is used:

$$\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in E_i} y = \mathbf{w}^t\boldsymbol{\mu}_i, \qquad i = 1,2$$

– As a measure of class dispersion around the centroids, the scatter values in the transformed space are adopted, i.e.:

$$S_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^t, \qquad s_i^2 = \sum_{y\in E_i}(y-\tilde{\mu}_i)^2 = \mathbf{w}^t S_i\,\mathbf{w}, \qquad i = 1,2,$$

where Si is called the scatter matrix of the class ωi (i = 1, 2).
The Fisher Functional
• The goal of the Fisher transform is to maximize the distance between the centroids of the classes and to minimize the scatters in the one-dimensional transformed space.
– For this purpose, the following Fisher functional is introduced:

$$J(\mathbf{w}) = \frac{(\tilde{\mu}_1-\tilde{\mu}_2)^2}{s_1^2+s_2^2}$$

– Let us explicitly write the functional as a function of w:

$$(\tilde{\mu}_1-\tilde{\mu}_2)^2 = \mathbf{w}^t(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^t\,\mathbf{w} = \mathbf{w}^t S_b\,\mathbf{w}, \qquad s_1^2+s_2^2 = \mathbf{w}^t(S_1+S_2)\,\mathbf{w} = \mathbf{w}^t S_w\,\mathbf{w}$$

$$J(\mathbf{w}) = \frac{\mathbf{w}^t S_b\,\mathbf{w}}{\mathbf{w}^t S_w\,\mathbf{w}},$$

where Sb = (μ1 − μ2)(μ1 − μ2)^t is named the between-class scatter matrix and Sw = S1 + S2 is named the within-class scatter matrix.
Optimality condition for the Fisher functional
• Optimality condition
– Through the usual zero-gradient condition, one may prove that the vector w* that maximizes the Fisher functional is an eigenvector of the product matrix Sw⁻¹Sb:

$$\left(S_w^{-1}S_b - \lambda I\right)\mathbf{w}^* = \mathbf{0}, \quad\text{i.e.,}\quad \left(S_b - \lambda S_w\right)\mathbf{w}^* = \mathbf{0},$$

where λ is the corresponding eigenvalue.
• Closed-form solution
– Therefore, w* satisfies the condition:

$$S_w^{-1}S_b\,\mathbf{w}^* = \lambda\,\mathbf{w}^*, \quad\text{i.e.,}\quad S_w^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^t\,\mathbf{w}^* = \lambda\,\mathbf{w}^*$$

– (μ1 − μ2)^t w* and λ are scalars, so w* is parallel to Sw⁻¹(μ1 − μ2). Since scale factors are irrelevant in linear projections, we obtain the following closed-form solution (with no need for explicitly computing eigenvectors):

$$\mathbf{w}^* = S_w^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)$$

– Typically, the vector w* is also normalized.
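A minimal sketch of this closed-form solution (NumPy assumed; the sample matrices X1 and X2, one row per training sample, and all names are hypothetical):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher transform w* = Sw^{-1} (mu1 - mu2), normalized.
    X1, X2: arrays of shape (N1, n) and (N2, n) with the two classes."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)            # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)            # scatter matrix of class 2
    w = np.linalg.solve(S1 + S2, mu1 - mu2)   # solves Sw w = (mu1 - mu2)
    return w / np.linalg.norm(w)              # normalization, as noted above

# scalar projection of a sample x: y = w @ x
```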
DAFE: multiclass Fisher transform
• We extend the discriminant analysis from the binary case to the case of M classes ω1, ω2, …, ωM and of an m × n transformation matrix.
– Let us consider a set {x1, x2, …, xN} of N preclassified samples, denote by Di the subset of the samples assigned to ωi (i = 1, 2, …, M) and by Ni the cardinality of Di (N = N1 + N2 + … + NM).
– The transformation y = Tx maps xk to yk = Txk. Given Ei = {Tx : x ∈ Di}, let us define:

$$\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad \boldsymbol{\mu}_i^Y = \frac{1}{N_i}\sum_{\mathbf{y}\in E_i}\mathbf{y} = T\boldsymbol{\mu}_i, \qquad i = 1,2,\ldots,M$$
(centroids of ωi in the original and transformed spaces);

$$S_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^t, \qquad S_i^Y = \sum_{\mathbf{y}\in E_i}(\mathbf{y}-\boldsymbol{\mu}_i^Y)(\mathbf{y}-\boldsymbol{\mu}_i^Y)^t = T\,S_i\,T^t, \qquad i = 1,2,\ldots,M$$
(scatter matrices of ωi in the original and transformed spaces).
DAFE: multiclass Fisher functional (1)
• Let us extend the Fisher functional to the multiclass case.
– In the multiclass case, we quantify inter-class separation through the mean differences between the centroids of the classes and the centroid of the entire training set in the transformed space:

$$\boldsymbol{\mu}^Y = \frac{1}{N}\sum_{k=1}^{N}\mathbf{y}_k = T\boldsymbol{\mu}, \quad\text{where } \boldsymbol{\mu} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k = \frac{1}{N}\sum_{i=1}^{M}N_i\,\boldsymbol{\mu}_i$$

– We measure the dispersions inside the single classes by means of the scatter matrices in the transformed space.
– Then, the Fisher functional is generalized as follows:

$$J(T) = \frac{\displaystyle\sum_{i=1}^{M}N_i\,(\boldsymbol{\mu}_i^Y-\boldsymbol{\mu}^Y)^t(\boldsymbol{\mu}_i^Y-\boldsymbol{\mu}^Y)}{\displaystyle\sum_{i=1}^{M}\operatorname{tr}\!\left(S_i^Y\right)}$$
DAFE: multiclass Fisher functional (2)
• Let us explicitly write the Fisher functional as a function of the unknown transformation matrix T.
– Let us express the numerator and the denominator as functions of T and consequently introduce a within-class scatter matrix Sw and a between-class scatter matrix Sb:

$$\sum_{i=1}^{M}\operatorname{tr}\!\left(S_i^Y\right) = \sum_{i=1}^{M}\operatorname{tr}\!\left(T\,S_i\,T^t\right) = \operatorname{tr}\!\left(T\,S_w\,T^t\right), \quad\text{where } S_w = \sum_{i=1}^{M} S_i;$$

$$\sum_{i=1}^{M}N_i\,(\boldsymbol{\mu}_i^Y-\boldsymbol{\mu}^Y)^t(\boldsymbol{\mu}_i^Y-\boldsymbol{\mu}^Y) = \operatorname{tr}\!\left(T\,S_b\,T^t\right), \quad\text{where } S_b = \sum_{i=1}^{M}N_i\,(\boldsymbol{\mu}_i-\boldsymbol{\mu})(\boldsymbol{\mu}_i-\boldsymbol{\mu})^t;$$

$$J(T) = \frac{\operatorname{tr}\!\left(T\,S_b\,T^t\right)}{\operatorname{tr}\!\left(T\,S_w\,T^t\right)}$$
Optimality condition for the multiclass case
• Optimality condition
– Again through a zero-gradient condition, one may prove that the row vectors e1, e2, …, em of the matrix T* that maximizes the Fisher functional are eigenvectors of Sw⁻¹Sb:

$$\left(S_w^{-1}S_b - \lambda_i I\right)\mathbf{e}_i = \mathbf{0}, \quad\text{i.e.,}\quad \left(S_b - \lambda_i S_w\right)\mathbf{e}_i = \mathbf{0}, \qquad i = 1,2,\ldots,m,$$

where λi is the eigenvalue corresponding to ei and is nonzero.
• Remarks
– The M matrices (μi − μ)(μi − μ)^t, i = 1, 2, …, M, have unit rank. Because of the linear relationship among the overall centroid μ and the class centroids μi, i = 1, 2, …, M, they are also linearly dependent.
– Thus, rank(Sb) ≤ M − 1 and, then, rank(Sw⁻¹Sb) ≤ rank(Sb) ≤ M − 1.
– Therefore, at most (M − 1) eigenvalues of Sw⁻¹Sb are nonzero, i.e., the eigenvector equation provides at most (M − 1) solution vectors.
DAFE: comments
• DAFE allows up to (M − 1) transformed features to be linearly extracted (remember that M is the number of classes).
• Operational issues
– The eigenvalues of Sw⁻¹Sb can be computed as the roots of the characteristic polynomial, i.e.:

$$\left|S_w^{-1}S_b - \lambda I\right| = 0 \quad\text{or, equivalently:}\quad \left|S_b - \lambda S_w\right| = 0$$

– The second formulation is more convenient because it does not require any matrix inversion.
– The characteristic equation provides at most (M − 1) nonzero roots λ1, λ2, …, λM−1 and at least (n − M + 1) zero solutions.
– An eigenvector ei is computed from each resulting nonzero eigenvalue λi.
– The optimal transformation matrix T* is obtained through a row juxtaposition of the resulting eigenvectors.
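A minimal multiclass sketch (NumPy/SciPy assumed; all names hypothetical). Consistently with the second formulation above, it solves the generalized eigenproblem Sb e = λ Sw e directly, with no matrix inversion, and stacks the leading eigenvectors as rows of T*:

```python
import numpy as np
from scipy.linalg import eigh

def dafe(X, labels, m=None):
    """DAFE/LDA: rows of T* are eigenvectors of Sw^{-1} Sb associated
    with the largest eigenvalues (at most M - 1 of them are nonzero).
    X: (N, n) training samples; labels: (N,) class indices."""
    classes = np.unique(labels)
    n = X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
    for c in classes:
        Xi = X[labels == c]
        mui = Xi.mean(axis=0)
        Sw += (Xi - mui).T @ (Xi - mui)          # within-class scatter
        d = (mui - mu)[:, None]
        Sb += Xi.shape[0] * (d @ d.T)            # between-class scatter
    vals, vecs = eigh(Sb, Sw)      # Sb e = lambda Sw e (Sw assumed positive definite)
    order = np.argsort(vals)[::-1] # decreasing eigenvalues
    m = m if m is not None else len(classes) - 1
    return vecs[:, order[:m]].T    # row juxtaposition of the eigenvectors
```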
Principal component analysis
• Principal component analysis (PCA, or Karhunen-Loève transform, KL) is an unsupervised algorithm for feature extraction. In particular, PCA reduces the dimension of the feature space on the basis of a mean square error criterion.
• Problem setting
– Let a data set {x1, x2, …, xN} composed of N samples be given.
– A coordinate system in the n-D feature space is determined by an orthonormal basis {e1, e2, …, en} and by an origin c.
– In such a coordinate system each sample is expressed as:

$$\mathbf{x}_k = \mathbf{c} + \sum_{i=1}^{n} y_{ik}\,\mathbf{e}_i, \qquad k = 1,2,\ldots,N$$

– To reduce the dimension of the feature space, one could keep only m components:

$$\mathbf{x}_k \approx \mathbf{c} + \sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i, \qquad k = 1,2,\ldots,N;$$

however, it is not obvious that the optimal m components yik are the projections of (xk − c) along the ei.
Geometric interpretation
• Two-dimensional example
– Approximation of the samples in a two-dimensional feature space (plane) as the sum of a constant vector c and of the component along one unit vector e1.

[Figure: samples in the (x1, x2) plane, with origin O, approximated as c + y1·e1 along the direction e1.]
PCA: mean square error
• If the components of xk along (n − m) axes are discarded, an error is obviously introduced. PCA selects the coordinate system that minimizes the mean square error.
– The adopted functional is:

$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left\|\mathbf{x}_k - \hat{\mathbf{x}}_k\right\|^2 = \frac{1}{N}\sum_{k=1}^{N}\left\|\mathbf{x}_k - \mathbf{c} - \sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i\right\|^2$$

– This functional has to be minimized with respect to all the related variables, i.e., the origin c, the vectors ei, and the components yik, under the following orthonormality constraint:

$$\mathbf{e}_i^t\,\mathbf{e}_j = \delta_{ij}, \qquad i,j = 1,2,\ldots,m$$

– Plugging this constraint into the expression of the functional yields:

$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left[\left\|\mathbf{x}_k-\mathbf{c}\right\|^2 - 2\sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c}) + \sum_{i=1}^{m} y_{ik}^2\right]$$
PCA: optimal components of the samples
• Let us compute, first, the optimum components of the samples along the first m vectors of the basis {e1, e2, …, en} (unconstrained minimization).
– The stationarity of the functional with respect to each component yik yields:

$$\frac{\partial \varepsilon^2}{\partial y_{ik}} = 0 \;\Rightarrow\; y_{ik} = \mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c}) = \mathbf{e}_i^t\,\mathbf{x}_k - b_i, \qquad k = 1,2,\ldots,N,$$

where bi = ei^t c is the component of c along the ith unit vector ei of the unknown orthonormal basis (i = 1, 2, …, m).
– Plugging these optimal values into ε² allows obtaining:

$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left\|\mathbf{x}_k-\mathbf{c}\right\|^2 - \frac{1}{N}\sum_{k=1}^{N}\sum_{i=1}^{m}\left[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})\right]^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})\right]^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left(\mathbf{e}_i^t\,\mathbf{x}_k - b_i\right)^2$$

(the second equality follows from $\|\mathbf{x}_k-\mathbf{c}\|^2 = \sum_{i=1}^{n}[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})]^2$, since the basis is orthonormal).
PCA: optimal origin
• ε² now depends on the origin c only through the components bm+1, bm+2, …, bn of c along em+1, em+2, …, en.
– The zero-gradient condition with respect to bi (i = m + 1, m + 2, …, n) yields:

$$\frac{\partial \varepsilon^2}{\partial b_i} = -\frac{2}{N}\sum_{k=1}^{N}\left(\mathbf{e}_i^t\,\mathbf{x}_k - b_i\right) = 0 \;\Rightarrow\; b_i = \frac{1}{N}\sum_{k=1}^{N}\mathbf{e}_i^t\,\mathbf{x}_k = \mathbf{e}_i^t\,\boldsymbol{\mu}, \quad\text{where } \boldsymbol{\mu} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k$$
(centroid of the data set)

– Consequently:

$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k-\boldsymbol{\mu})\right]^2 = \sum_{i=m+1}^{n}\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i, \quad\text{where } \Sigma = \frac{1}{N}\sum_{k=1}^{N}(\mathbf{x}_k-\boldsymbol{\mu})(\mathbf{x}_k-\boldsymbol{\mu})^t$$
(sample covariance of the data set)
PCA: optimal orthonormal basis
• The vectors ei (i = 1, 2, …, n) are supposed to be orthonormal, so their optimization is a constrained problem.
– Optimum basis vectors ei (through Lagrange multipliers):

$$\frac{\partial}{\partial \mathbf{e}_i}\left[\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i - \lambda_i\left(\mathbf{e}_i^t\,\mathbf{e}_i - 1\right)\right] = \mathbf{0} \;\Rightarrow\; \left(\Sigma - \lambda_i I\right)\mathbf{e}_i = \mathbf{0}$$

– The sample covariance Σ is symmetric and positive semidefinite. Therefore, it has n real nonnegative eigenvalues λ1, λ2, …, λn with corresponding orthonormal eigenvectors e1, e2, …, en.
– To establish which m eigenvectors should be preserved (and which (n − m) should be discarded), let us plug the obtained optimal values into the expression of the functional. This yields the following minimum mean square error:

$$\varepsilon^{2*} = \sum_{i=m+1}^{n}\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i = \sum_{i=m+1}^{n}\lambda_i$$
PCA: feature reduction
• Therefore, the minimum value of ε²* is obtained if λm+1, λm+2, …, λn are the smallest eigenvalues, i.e., if the preserved unit vectors e1, e2, …, em correspond to the m largest eigenvalues λ1, λ2, …, λm.
• Expression of the PCA transformation
– If the n eigenvalues of Σ are ordered in decreasing order (i.e., λ1 ≥ λ2 ≥ … ≥ λn), the PCA transformation projects the samples (centered with respect to the centroid) along the axes e1, e2, …, em corresponding to the first m eigenvalues:

$$y_{ik} = \mathbf{e}_i^t(\mathbf{x}_k-\boldsymbol{\mu}), \qquad \mathbf{y}_k = T\,(\mathbf{x}_k-\boldsymbol{\mu}), \quad\text{with } T = \begin{bmatrix}\mathbf{e}_1^t\\ \mathbf{e}_2^t\\ \vdots\\ \mathbf{e}_m^t\end{bmatrix}$$
PCA: remarks
• Operatively, PCA is applied as follows (see the sketch below):
– Compute the centroid μ and the sample covariance Σ of the whole data set.
– Compute the eigenvalues and the eigenvectors of Σ.
– Order the eigenvalues in decreasing order.
– Compute the matrix T through the row juxtaposition of the eigenvectors corresponding to the first m eigenvalues.
• Remarks
– Therefore, the PCA transformation is y = T(x − μ).
– According to the expression of the minimum mean square error, the information loss due to feature reduction through PCA is often quantified through the following efficiency factor:

$$\eta_m = \frac{\sum_{i=1}^{m}\lambda_i}{\sum_{i=1}^{n}\lambda_i}$$

[Plot: efficiency factor ηm as a function of m (from 80% to 100%), with a selected value m* marked.]
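A minimal sketch of these steps (NumPy assumed; function and variable names hypothetical), which also returns the efficiency factor ηm:

```python
import numpy as np

def pca(X, m):
    """PCA: rows of T are the eigenvectors of the sample covariance
    associated with the m largest eigenvalues. X: (N, n) data matrix."""
    mu = X.mean(axis=0)                       # centroid of the data set
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / X.shape[0]          # sample covariance
    vals, vecs = np.linalg.eigh(Sigma)        # eigenpairs (ascending order)
    order = np.argsort(vals)[::-1]            # re-order decreasingly
    T = vecs[:, order[:m]].T                  # row juxtaposition of eigenvectors
    eta = vals[order[:m]].sum() / vals.sum()  # efficiency factor
    return T, mu, eta

# transformed samples: Y = (X - mu) @ T.T, i.e., y = T (x - mu)
```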
PCA: interpretation of the principal components
• The eigenvalue λi represents the sample variance along the axis ei (i = 1, 2, …, n).
– The components along the axes e1, e2, …, en are named principal components. Therefore, one may say that "PCA preserves the first m principal components."
– Geometrically, e1 is the direction along which the samples exhibit the maximum dispersion, and en is the direction along which the sample dispersion is lowest.
– Since the transformed features associated with maximum dispersion are chosen, PCA implicitly assumes that information is conveyed by the variance of the data (see the figure in slide 31).
PCA: remarks on the principal components
• Choosing features related to maximum dispersion does not imply choosing features that discriminate the classes well.
– In this 2D example, separation between the classes is poor with only the first PCA component y1, while considering both y1 and y2 yields better separation:

[Figure: two class clusters ω1 and ω2 in the (x1, x2) plane, with origin O and principal axes e1 and e2.]

– Indeed, PCA does not use information about the class membership of the samples. If a training set is available, it is convenient to use a supervised feature extraction method (e.g., LDA or more sophisticated approaches).
Example (1)
• Apply PCA to the following samples: (0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1).
– Centroid and sample covariance (N = 8):

$$\boldsymbol{\mu} = \frac{1}{2}\begin{bmatrix}1\\1\\1\end{bmatrix}, \qquad \Sigma = \frac{1}{4}\,I_3$$

– All three eigenvalues are equal (λ1 = λ2 = λ3 = 1/4), so any orthonormal basis is an eigenbasis of Σ; one admissible choice is:

$$\mathbf{e}_1 = \frac{1}{\sqrt{3}}\begin{bmatrix}1\\1\\1\end{bmatrix}, \qquad \mathbf{e}_2 = \frac{1}{\sqrt{6}}\begin{bmatrix}-2\\1\\1\end{bmatrix}, \qquad \mathbf{e}_3 = \frac{1}{\sqrt{2}}\begin{bmatrix}0\\1\\-1\end{bmatrix}$$

– Transformation matrix for the extraction of two features:

$$T = \begin{bmatrix}1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3}\\[2pt] -2/\sqrt{6} & 1/\sqrt{6} & 1/\sqrt{6}\end{bmatrix}$$
Example (2)
• Compute the transformed samples:
– Subtraction of the centroid from the samples:

$$\begin{bmatrix}-1/2\\-1/2\\-1/2\end{bmatrix}\!, \begin{bmatrix}1/2\\-1/2\\-1/2\end{bmatrix}\!, \begin{bmatrix}1/2\\-1/2\\1/2\end{bmatrix}\!, \begin{bmatrix}1/2\\1/2\\-1/2\end{bmatrix}\!, \begin{bmatrix}-1/2\\-1/2\\1/2\end{bmatrix}\!, \begin{bmatrix}-1/2\\1/2\\-1/2\end{bmatrix}\!, \begin{bmatrix}-1/2\\1/2\\1/2\end{bmatrix}\!, \begin{bmatrix}1/2\\1/2\\1/2\end{bmatrix}$$

– Transformed samples yk = T(xk − μ):

$$\mathbf{y}_1 = \begin{bmatrix}-\frac{3}{2\sqrt{3}}\\[2pt] 0\end{bmatrix},\ \mathbf{y}_2 = \begin{bmatrix}-\frac{1}{2\sqrt{3}}\\[2pt] -\frac{2}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_3 = \begin{bmatrix}\frac{1}{2\sqrt{3}}\\[2pt] -\frac{1}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_4 = \begin{bmatrix}\frac{1}{2\sqrt{3}}\\[2pt] -\frac{1}{\sqrt{6}}\end{bmatrix},$$

$$\mathbf{y}_5 = \begin{bmatrix}-\frac{1}{2\sqrt{3}}\\[2pt] \frac{1}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_6 = \begin{bmatrix}-\frac{1}{2\sqrt{3}}\\[2pt] \frac{1}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_7 = \begin{bmatrix}\frac{1}{2\sqrt{3}}\\[2pt] \frac{2}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_8 = \begin{bmatrix}\frac{3}{2\sqrt{3}}\\[2pt] 0\end{bmatrix}$$
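The worked example can be checked numerically with the pca sketch from slide 37; note that, since all three eigenvalues coincide here, a numerical routine may return a different, equally valid orthonormal basis (so T may differ from the one above by a rotation):

```python
import numpy as np

X = np.array([[0, 0, 0], [1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 1, 1]], dtype=float)
T, mu, eta = pca(X, m=2)   # pca() as sketched in slide 37
Y = (X - mu) @ T.T         # transformed samples y_k = T (x_k - mu)
print(mu)                  # [0.5 0.5 0.5]
print(eta)                 # 2/3, since all the eigenvalues equal 1/4
```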
Selection vs. extraction
• Advantage of extraction methods
– An extraction method projects the feature space onto a subspace such that the maximum information is preserved, and it is consequently more flexible (indeed, selection is a particular case of extraction).
• Advantages of selection methods
– The features provided by a selection method are a subset of the original ones. Therefore, they maintain their physical meanings. This is relevant when information about the interpretation of the features is used in the classification process (e.g., knowledge-based methods).
– On the contrary, an extraction method generates "virtual" features, which are defined as linear combinations of the "measured" original features and usually have well-defined mathematical meanings but no physical meanings.
– Through selection, the discarded features are no longer needed. With extraction, one usually needs all the original features (e.g., to compute the linear combinations).
Bibliography
• R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd edition, Wiley, New York, 2001.
• K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, New York, 1990.
• G. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.
• L. O. Jimenez, D. A. Landgrebe, "Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data," IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 28, no. 1, pp. 39-54, 1998.
• H. C. Andrews, Introduction to Mathematical Techniques in Pattern Recognition, Wiley International, New York, 1972.
• P. H. Swain, S. M. Davis, Remote Sensing: The Quantitative Approach, McGraw-Hill, New York, 1978.
• J. A. Richards, X. Jia, Remote Sensing Digital Image Analysis, Springer-Verlag, Berlin, 1999.
• S. B. Serpico, L. Bruzzone, "A new search algorithm for feature selection in hyperspectral remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 39, pp. 1360-1367, 2001.
• L. O. Jimenez, D. A. Landgrebe, "Hyperspectral data analysis and feature reduction via projection pursuit," IEEE Transactions on Geoscience and Remote Sensing, vol. 37, pp. 2653-2667, 1999.