Dipartimento di Ingegneria Biofisica ed Elettronica, Università di Genova
Prof. Sebastiano B. Serpico
4. Feature Reduction
Complexity of a Classifier
• As the number n of features increases, classifier design presents different issues connected to the dimensionality of the problem ("curse of dimensionality"):
– Computational complexity;
– Hughes phenomenon.
• Computational complexity
– As n increases, the computational complexity of a classifier increases. For some classification techniques this increase is linear with n; for others it is of a higher order (e.g., quadratic).
– The increase in complexity involves an increase in computation time and a larger memory occupation.
Hughes Phenomenon
• Intuitive reasoning
– as n increases, the amount of information available to the classifier should increase, and consequently so should the classification accuracy, but...
• Experimental observation
– ... on the contrary, for a fixed number N of training samples, the probability of a correct decision of a classifier increases for 1 ≤ n ≤ n* up to a maximum and decreases for n > n* (Hughes phenomenon).
• Interpretation
– As n increases, the number of parameters K_n of the classifier becomes higher and higher.
– As the ratio K_n/N increases, the number of available training samples becomes "too small" to obtain a satisfactory estimate of such parameters.
Feature Reduction
• A solution to these dimensionality issues is to reduce the number n of the features used in the classification process (feature reduction or parameter reduction).
• Disadvantage: reducing the dimension of the feature space involves a loss of information.
• Two main strategies exist to achieve feature reduction:
– feature selection: within the set of the n available features, a subset of m features is identified by adopting an optimization criterion, chosen to minimize the loss of information or maximize the classification accuracy;
– feature extraction: a transformation (often linear) of the original (n-dimensional) feature space into a space of smaller dimension m is applied in such a way as to minimize the information loss or maximize the classification accuracy.
Feature Selection
• Problem setting:
– Given a set X = {x1, x2, …, xn} of n features, identify the subset S ⊂ X, composed of m features (m < n), that maximizes the functional J(·):

$$S^* = \arg\max_{S \subset X} J(S)$$

• An algorithm for feature selection is then defined on the basis of two distinct objects:
– the functional J(·). It has to be defined such that J(S) measures the "goodness" of the feature subset S in the classification process;
– the algorithm for the search of the subset S*. The subsets of X number 2^n, so an exhaustive search is computationally not feasible, except for small values of n. Therefore, sub-optimal maximization strategies are adopted to detect "good" solutions, even if they do not correspond to global optima.
Bhattacharyya Bounds
• A choice of the functional J(·) that is significant from the classification point of view can be based on the criterion of the minimum of the error probability Pe.
– In the presence of two classes ω1 and ω2 only, the Bhattacharyya distance B and the Bhattacharyya coefficient ρ provide an upper bound on Pe:

$$P_e \le \sqrt{P_1 P_2}\;\rho = \sqrt{P_1 P_2}\,\exp(-B)$$

$$\text{where } B = -\ln\rho \ \text{ and } \ \rho = \int_{\mathbb{R}^n} \sqrt{p(\mathbf{x}|\omega_1)\,p(\mathbf{x}|\omega_2)}\; d\mathbf{x}$$

– Moreover, it is possible to demonstrate that:

$$\frac{1}{2}\left(1 - \sqrt{1 - 4P_1P_2\,\rho^2}\right) \;\le\; P_e \;\le\; \sqrt{P_1 P_2}\;\rho$$
Bhattacharyya Distance and Coefficient
• An approach to feature selection consists in the maximization of the Bhattacharyya distance B or (equivalently) in the minimization of the Bhattacharyya coefficient ρ.
– In particular, a distance B(S) (or a coefficient ρ(S)) can be associated with each subset S of m features; in fact, denoting a vector of the feature subset S by xS, one can define:

$$B(S) = -\ln\rho(S) \quad\text{and}\quad \rho(S) = \int_{\mathbb{R}^m} \sqrt{p(\mathbf{x}_S|\omega_1)\,p(\mathbf{x}_S|\omega_2)}\; d\mathbf{x}_S$$

• Properties:
– 0 ≤ ρ(S) ≤ 1 and then B(S) ≥ 0;
– if p(xS|ω1) and p(xS|ω2) are different from zero only in separated regions, then ρ(S) = 0 and B(S) = +∞;
– if p(xS|ω1) = p(xS|ω2) for any xS, then ρ(S) is the integral of a pdf over the entire space ℝ^m; therefore ρ(S) = 1 and B(S) = 0.
Computation of the Bhattacharyya Coefficient and Distance
• ρ(S) is a multiple integral in an m-dimensional space, so its analytical computation starting from the conditional pdfs is complex. Two particular cases in which the computation is simple exist:
– if the features in the subset S are independent when conditioned to each class, i.e.:

$$p(\mathbf{x}_S|\omega_i) = \prod_{x_r \in S} p(x_r|\omega_i), \qquad i = 1,2,$$

then the following additive property is valid:

$$\rho(S) = \prod_{x_r \in S} \rho(\{x_r\}) \quad\text{and}\quad B(S) = \sum_{x_r \in S} B(\{x_r\});$$

– if $p(\mathbf{x}_S|\omega_i) = \mathcal{N}(\mathbf{m}_i^S, \Sigma_i^S)$ (i = 1, 2), we obtain:

$$B(S) = \frac{1}{8}\,(\mathbf{m}_2^S-\mathbf{m}_1^S)^t \left(\frac{\Sigma_1^S+\Sigma_2^S}{2}\right)^{-1} (\mathbf{m}_2^S-\mathbf{m}_1^S) \;+\; \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1^S+\Sigma_2^S}{2}\right|}{\sqrt{|\Sigma_1^S|\,|\Sigma_2^S|}}$$
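As a numerical illustration of the Gaussian closed form above, the following Python sketch (NumPy assumed; the function name and test values are hypothetical) evaluates B(S) and the corresponding coefficient ρ(S) = exp(−B(S)):

```python
import numpy as np

def bhattacharyya_gaussian(m1, m2, S1, S2):
    """Closed-form Bhattacharyya distance between N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    S1, S2 = np.atleast_2d(S1), np.atleast_2d(S2)
    Sm = (S1 + S2) / 2.0                       # average covariance matrix
    d = m2 - m1
    term_means = d @ np.linalg.solve(Sm, d) / 8.0
    term_covs = 0.5 * np.log(np.linalg.det(Sm)
                             / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term_means + term_covs

# two hypothetical 2-D Gaussian classes
B = bhattacharyya_gaussian([0, 0], [2, 0], np.eye(2), 2 * np.eye(2))
rho = np.exp(-B)   # Bhattacharyya coefficient of the pair
```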
Other Inter-Class Distances
• In addition to the Bhattacharyya distance, different measures of inter-class distance have been introduced in the literature.
– For example, the Divergence measures the separation between two classes as a function of the likelihood ratio between the respective conditional pdfs.
– The Bhattacharyya distance and the Divergence are not upper bounded. This makes them less appropriate as measures of inter-class separation. In fact, focusing on the Gaussian case for simplicity, when two classes are already well separated, an increment of the distance between the conditional means m1^S and m2^S generates a "large" increment of B(S) but an irrelevant reduction of Pe.
– Other measures of inter-class distance have therefore been proposed (not treated in depth here) that, being upper bounded, do not present such a problem. Among them, we recall the Jeffries-Matusita distance and the Modified Divergence [Richards 1999, Swain 1978].
Multiclass Extension
• Extension to the case of M classes ω1, ω2, …, ωM.
– If ρij(S) and Bij(S) are the Bhattacharyya coefficient and distance between two classes ωi and ωj computed over a feature subset S, and if Pi = P(ωi) is the a priori probability of the class ωi, the following average Bhattacharyya coefficient and average Bhattacharyya distance are defined:

$$\rho_{ave}(S) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} P_i P_j\,\rho_{ij}(S), \qquad B_{ave}(S) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} P_i P_j\,B_{ij}(S)$$

• Remarks
– In the case M = 2, the maximization of B(S) was equivalent to the minimization of ρ(S) (because B(S) = −ln ρ(S)). In the multiclass case the maximization of Bave(S) is no longer equivalent to the minimization of ρave(S), because the relation between Bave(S) and ρave(S) is no longer monotonic.
– Under the hypothesis of class-conditional feature independence, we have (see the sketch below):

$$B_{ave}(S) = \sum_{x_r \in S} B_{ave}(\{x_r\})$$

Attention! This additive property is not valid for ρave(S).
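Under the same Gaussian assumption, Bave(S) can be accumulated over all class pairs; a minimal sketch reusing the hypothetical bhattacharyya_gaussian function above (per-class means, covariances, and priors are assumed to be given):

```python
from itertools import combinations

def b_average(means, covs, priors):
    """Average Bhattacharyya distance B_ave(S) over all pairs (i, j), i < j."""
    B_ave = 0.0
    for i, j in combinations(range(len(priors)), 2):
        B_ave += priors[i] * priors[j] * bhattacharyya_gaussian(
            means[i], means[j], covs[i], covs[j])
    return B_ave
```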
Maximization of the Functional
• In a feature selection problem, the inter-class separation measures introduced above are used in the role of the functional J(·) to be maximized.
• Preliminary observations:
– The Bhattacharyya distance has to be maximized, while the Bhattacharyya coefficient has to be minimized. Therefore, in the following, J(S) may correspond to Bave(S) or to −ρave(S).
– An exhaustive search over all possible subsets of X is, in general, computationally not affordable.
– It is feasible if the features are independent when conditioned to each class and if the adopted functional is Bave. In such a case, in fact, once the values of the functional associated with the single features have been computed, the additive property implies that the optimum subset S* of m features is simply composed of the m features that individually present the m highest values of Bave({xr}).
Sequential Forward Selection
• In general, the search for a subset of m features is conducted by means of a sub-optimal algorithm. Among such algorithms we consider (for its simplicity) the sequential forward selection (SFS), which is based on the following steps (see the sketch after this list):
– initialize S* = ∅;
– compute the value of the functional J(S* ∪ {xi}) for every feature xi ∉ S*, and choose the feature x* ∉ S* that corresponds to the maximum value of J(S* ∪ {xi});
– update S* by setting S* = S* ∪ {x*};
– continue by iteratively adding one feature at a time until S* reaches the desired cardinality m or until the value of the functional stabilizes (reaches saturation).
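A minimal Python sketch of these steps; the functional J (e.g., Bave evaluated on a candidate subset) is assumed to be supplied by the caller, and all names are hypothetical:

```python
def sfs(features, J, m):
    """Sequential forward selection: greedily add, one at a time,
    the feature that maximizes the functional J, until |S*| = m."""
    selected, remaining = [], list(features)
    while len(selected) < m and remaining:
        # choose x* maximizing J(S* U {x}) over the remaining features
        best = max(remaining, key=lambda x: J(selected + [x]))
        selected.append(best)
        remaining.remove(best)
    return selected
```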
Remarks on SFS
• SFS identifies the optimum subset that can be obtained by iteratively adding a single feature at a time.
– At the first step, the single feature that corresponds to the maximum value of the functional is chosen. At the second step, the feature that, coupled with the previous one, provides the maximum value of the functional is added. And so on...
– The method is sub-optimal. For example, the optimal couple of features does not always include the single optimal feature.
• Advantage
– SFS is not computationally heavy, even if X contains hundreds of features.
• Disadvantage
– A feature that has been included in the selected subset S* at a given iteration cannot be removed during the following iterations; that is, SFS does not allow backtracking.
Sequential Backward Selection
• Sequential backward selection (SBS) proceeds in a dual way with respect to SFS: it initializes S* = X and eliminates a single feature at a time from S*, so as to maximize the functional J(S) step by step.
• Disadvantages
– Like SFS, SBS does not allow backtracking: a feature eliminated from S* at a given iteration will never be recovered in the following steps;
– Usually SBS is computationally disadvantageous with respect to SFS: while SFS starts from an empty subset and adds one feature at a time, SBS starts from the original feature space. Therefore, SBS computes values of the functional in spaces with much higher dimensions than SFS. However, it is advantageous if m is close to n.
• In the literature, other more complex methods have been proposed (not covered here) to search for suboptimal subsets, which also allow backtracking [Serpico et al., 2001].
Operational Aspects of Feature Selection
• The computation of the inter-class distance measures used in feature selection requires the knowledge of the class-conditional pdfs and of the class prior probabilities.
– Usually, such pdfs are not known a priori but must be estimated from a training set, by means of parametric or non-parametric methods.
– Globally, a classification system that involves a feature selection step can be summarized by the following flowchart.

[Flowchart: training samples for each class {ωi} → estimation of the class-conditional pdfs and class prior probabilities {p(x|ωi), Pi} → feature selection (output S*) → training of the classifier → application of the classifier to the data set → classification of the data set.]
Example
• Hyperspectral data set with 202 features and 9 classes.

[Plot: average Bhattacharyya distance Bave as a function of the number m of selected features.]

[Figure: RGB composition of three of the 202 bands acquired by the sensor.]

[Figure: map of the ground truth that highlights the training pixels.]

[Plot: estimated probability of correct classification (overall accuracy, OA) for a MAP classifier under the hypothesis of Gaussian classes, as a function of m; Pc,max = 88.6% for m = 40.]
Feature Extraction
• Problem definition:
– Given a set X = {x1, x2, …, xn} of n features, we want to identify a linear transformation y = Tx that provides a transformed set of m features Y = {y1, y2, …, ym} (with m < n), chosen so as to minimize the loss of information or maximize the classification accuracy.
Extraction Based on Inter-Class Distances
• Considering again the Bhattacharyya distance, in the case of two Gaussian classes, we look for the orthonormal feature transformation that maximizes the distance in the transformed space.
– Let $\mathbf{m}_i^Y = E\{\mathbf{y}|\omega_i\} = T\,\mathbf{m}_i$ and $\Sigma_i^Y = \mathrm{Cov}\{\mathbf{y}|\omega_i\} = T\,\Sigma_i\,T^t$ (for i = 1, 2); B in the transformed space Y is given by:

$$B(Y) = \underbrace{\frac{1}{8}\operatorname{tr}\!\left[\left(\frac{\Sigma_1^Y+\Sigma_2^Y}{2}\right)^{-1}(\mathbf{m}_2^Y-\mathbf{m}_1^Y)(\mathbf{m}_2^Y-\mathbf{m}_1^Y)^t\right]}_{B_m(Y)} \;+\; \underbrace{\frac{1}{2}\ln\frac{\left|\frac{\Sigma_1^Y+\Sigma_2^Y}{2}\right|}{\sqrt{|\Sigma_1^Y|\,|\Sigma_2^Y|}}}_{B_\Sigma(Y)}$$

• In the expression of B, two distinct contributions Bm(Y) and BΣ(Y) appear, respectively linked to the conditional means and to the conditional covariance matrices.
Extraction Based on Inter-Class Distances
• In principle, we would search for the orthogonal matrix T that maximizes B(Y). However:
– The general problem of the maximization of B(Y) with respect to T has no closed-form solution.
– The problems of separately maximizing Bm(Y) or BΣ(Y) have closed-form solutions (eigenproblems). Details can be found in [Fukunaga, 1990].
– Therefore, if one of the two contributions is largely dominant over the other (i.e., Bm(Y) ≫ BΣ(Y) or Bm(Y) ≪ BΣ(Y)), the dominant contribution alone can be maximized in closed form.
Linear discriminant analysis
• A popular method for feature extraction is linear discriminant analysis (LDA, aka discriminant analysis feature extraction, DAFE), which maximizes a measure of separation and compactness of the classes directly defined on the training set.
– Although explicit parametric assumptions are not stated, DAFE is usually considered parametric, because it "works poorly," for example, with multimodal classes, and it characterizes the classes only through first- and second-order moments.
– Nonparametric extensions of this method have, however, been introduced recently.
• The method can be applied to both binary and multiclass problems.
– Focusing first on the case of two classes, ω1 and ω2, linear discriminant analysis provides an optimum scalar projection, named the Fisher transform.
DAFE: Fisher transform
• In general, even though the classes are well separated in the original n-dimensional space, they may not be so in a transformed one-dimensional space, because the projection can overlay samples drawn from different classes.
• The problem is to find the orientation of the projection line that provides the best separation between the two classes.
– Given a set {x1, x2, …, xN} of N pre-classified samples, let Di be the subset of the samples assigned to ωi (i = 1, 2) and let Ni be the cardinality of Di (obviously N = N1 + N2).
– A transformation y = w^t x projects the sample xk to yk = w^t xk. Let Ei = {y = w^t x : x ∈ Di}.
– We search for the transformation y = w^t x that maximizes the inter-class separation and minimizes the intra-class dispersion, conveniently quantified.
Inter-class separation and intra-class dispersion
• First, a functional that measures inter-class separation and dispersion inside each class is necessary.
– As a measure of inter-class separation, the difference between the centroids of the samples in the transformed space is used:

$$\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in E_i} y = \mathbf{w}^t\boldsymbol{\mu}_i, \qquad i = 1,2$$

– As a measure of class dispersion around the centroids, the scatter values in the transformed space are adopted, i.e.:

$$S_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^t, \qquad s_i^2 = \sum_{y\in E_i}(y-\tilde{\mu}_i)^2 = \mathbf{w}^t S_i\,\mathbf{w}, \qquad i = 1,2,$$

where Si is called the scatter matrix of the class ωi (i = 1, 2).
The Fisher Functional
• The goal of the Fisher transform is to maximize the distance between the centroids of the classes and to minimize the scatters in the one-dimensional transformed space.
– For this purpose, the following Fisher functional is introduced:

$$J(\mathbf{w}) = \frac{(\tilde{\mu}_1-\tilde{\mu}_2)^2}{s_1^2+s_2^2}$$

– Let us explicitly write the functional as a function of w:

$$(\tilde{\mu}_1-\tilde{\mu}_2)^2 = \mathbf{w}^t(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^t\,\mathbf{w} = \mathbf{w}^t S_b\,\mathbf{w}, \qquad s_1^2+s_2^2 = \mathbf{w}^t(S_1+S_2)\,\mathbf{w} = \mathbf{w}^t S_w\,\mathbf{w}$$

$$J(\mathbf{w}) = \frac{\mathbf{w}^t S_b\,\mathbf{w}}{\mathbf{w}^t S_w\,\mathbf{w}},$$

where Sb = (μ1 − μ2)(μ1 − μ2)^t is named the between-class scatter matrix and Sw = S1 + S2 is named the within-class scatter matrix.
Optimality condition for the Fisher functional
• Optimality condition
– Through the usual zero-gradient condition, one may prove that the vector w* that maximizes the Fisher functional is an eigenvector of the product matrix Sw⁻¹Sb:

$$\left(S_w^{-1}S_b - \lambda I\right)\mathbf{w}^* = \mathbf{0}, \quad\text{i.e.,}\quad \left(S_b - \lambda S_w\right)\mathbf{w}^* = \mathbf{0},$$

where λ is the corresponding eigenvalue.
• Closed-form solution
– Therefore, w* satisfies the condition:

$$S_w^{-1}S_b\,\mathbf{w}^* = \lambda\,\mathbf{w}^*, \quad\text{i.e.,}\quad S_w^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)^t\,\mathbf{w}^* = \lambda\,\mathbf{w}^*$$

– (μ1 − μ2)^t w* and λ are scalars, so w* is parallel to Sw⁻¹(μ1 − μ2). Since scale factors are irrelevant in linear projections, we obtain the following closed-form solution (with no need for explicitly computing eigenvectors):

$$\mathbf{w}^* = S_w^{-1}(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)$$

– Typically, the vector w* is also normalized.
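A minimal sketch of this closed-form solution (NumPy assumed; the sample matrices X1 and X2, one row per training sample, and all names are hypothetical):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher transform w* = Sw^{-1} (mu1 - mu2), normalized.
    X1, X2: arrays of shape (N1, n) and (N2, n) with the two classes."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)            # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)            # scatter matrix of class 2
    w = np.linalg.solve(S1 + S2, mu1 - mu2)   # solves Sw w = (mu1 - mu2)
    return w / np.linalg.norm(w)              # normalization, as noted above

# scalar projection of a sample x: y = w @ x
```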
DAFE: multiclass Fisher transform
• We extend the discriminant analysis from the binary case to the case of M classes ω1, ω2, …, ωM and of an m × n transformation matrix.
– Let us consider a set {x1, x2, …, xN} of N preclassified samples, denote by Di the subset of the samples assigned to ωi (i = 1, 2, …, M) and by Ni the cardinality of Di (N = N1 + N2 + … + NM).
– The transformation y = Tx maps xk to yk = Txk. Given Ei = {Tx : x ∈ Di}, let us define:

$$\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad \boldsymbol{\mu}_i^Y = \frac{1}{N_i}\sum_{\mathbf{y}\in E_i}\mathbf{y} = T\boldsymbol{\mu}_i, \qquad i = 1,2,\ldots,M$$
(centroids of ωi in the original and transformed spaces);

$$S_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^t, \qquad S_i^Y = \sum_{\mathbf{y}\in E_i}(\mathbf{y}-\boldsymbol{\mu}_i^Y)(\mathbf{y}-\boldsymbol{\mu}_i^Y)^t = T\,S_i\,T^t, \qquad i = 1,2,\ldots,M$$
(scatter matrices of ωi in the original and transformed spaces).
DAFE: multiclass Fisher functional (1)
• Let us extend the Fisher functional to the multiclass case.
– In the multiclass case, we quantify inter-class separation through the mean differences between the centroids of the classes and the centroid of the entire training set in the transformed space:

$$\boldsymbol{\mu}^Y = \frac{1}{N}\sum_{k=1}^{N}\mathbf{y}_k = T\boldsymbol{\mu}, \quad\text{where } \boldsymbol{\mu} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k = \frac{1}{N}\sum_{i=1}^{M}N_i\,\boldsymbol{\mu}_i$$

– We measure the dispersions inside the single classes by means of the scatter matrices in the transformed space.
– Then, the Fisher functional is generalized as follows:

$$J(T) = \frac{\displaystyle\sum_{i=1}^{M}N_i\,(\boldsymbol{\mu}_i^Y-\boldsymbol{\mu}^Y)^t(\boldsymbol{\mu}_i^Y-\boldsymbol{\mu}^Y)}{\displaystyle\sum_{i=1}^{M}\operatorname{tr}\!\left(S_i^Y\right)}$$
DAFE: multiclass Fisher functional (2)
• Let us explicitly write the Fisher functional as a function of the unknown transformation matrix T.
– Let us express the numerator and the denominator as functions of T and consequently introduce a within-class scatter matrix Sw and a between-class scatter matrix Sb:

$$\sum_{i=1}^{M}\operatorname{tr}\!\left(S_i^Y\right) = \sum_{i=1}^{M}\operatorname{tr}\!\left(T\,S_i\,T^t\right) = \operatorname{tr}\!\left(T\,S_w\,T^t\right), \quad\text{where } S_w = \sum_{i=1}^{M} S_i;$$

$$\sum_{i=1}^{M}N_i\,(\boldsymbol{\mu}_i^Y-\boldsymbol{\mu}^Y)^t(\boldsymbol{\mu}_i^Y-\boldsymbol{\mu}^Y) = \operatorname{tr}\!\left(T\,S_b\,T^t\right), \quad\text{where } S_b = \sum_{i=1}^{M}N_i\,(\boldsymbol{\mu}_i-\boldsymbol{\mu})(\boldsymbol{\mu}_i-\boldsymbol{\mu})^t;$$

$$J(T) = \frac{\operatorname{tr}\!\left(T\,S_b\,T^t\right)}{\operatorname{tr}\!\left(T\,S_w\,T^t\right)}$$
Optimality condition for the multiclass case
• Optimality condition
– Again through a zero-gradient condition, one may prove that the row vectors e1, e2, …, em of the matrix T* that maximizes the Fisher functional are eigenvectors of Sw⁻¹Sb:

$$\left(S_w^{-1}S_b - \lambda_i I\right)\mathbf{e}_i = \mathbf{0}, \quad\text{i.e.,}\quad \left(S_b - \lambda_i S_w\right)\mathbf{e}_i = \mathbf{0}, \qquad i = 1,2,\ldots,m,$$

where λi is the eigenvalue corresponding to ei and is nonzero.
• Remarks
– The M matrices (μi − μ)(μi − μ)^t, i = 1, 2, …, M, have unit rank. Because of the linear relationship among the overall centroid μ and the class centroids μi, i = 1, 2, …, M, they are also linearly dependent.
– Thus, rank(Sb) ≤ M − 1 and, then, rank(Sw⁻¹Sb) ≤ rank(Sb) ≤ M − 1.
– Therefore, at most (M − 1) eigenvalues of Sw⁻¹Sb are nonzero, i.e., the eigenvector equation provides at most (M − 1) solution vectors.
DAFE: comments
• DAFE allows up to (M − 1) transformed features to be linearly extracted (remember that M is the number of classes).
• Operational issues
– The eigenvalues of Sw⁻¹Sb can be computed as the roots of the characteristic polynomial, i.e.:

$$\left|S_w^{-1}S_b - \lambda I\right| = 0 \quad\text{or, equivalently:}\quad \left|S_b - \lambda S_w\right| = 0$$

– The second formulation is more convenient because it does not require any matrix inversion.
– The characteristic equation provides at most (M − 1) nonzero roots λ1, λ2, …, λM−1 and at least (n − M + 1) zero solutions.
– An eigenvector ei is computed from each resulting nonzero eigenvalue λi.
– The optimal transformation matrix T* is obtained through a row juxtaposition of the resulting eigenvectors.
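A minimal multiclass sketch (NumPy/SciPy assumed; all names hypothetical). Consistently with the second formulation above, it solves the generalized eigenproblem Sb e = λ Sw e directly, with no matrix inversion, and stacks the leading eigenvectors as rows of T*:

```python
import numpy as np
from scipy.linalg import eigh

def dafe(X, labels, m=None):
    """DAFE/LDA: rows of T* are eigenvectors of Sw^{-1} Sb associated
    with the largest eigenvalues (at most M - 1 of them are nonzero).
    X: (N, n) training samples; labels: (N,) class indices."""
    classes = np.unique(labels)
    n = X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
    for c in classes:
        Xi = X[labels == c]
        mui = Xi.mean(axis=0)
        Sw += (Xi - mui).T @ (Xi - mui)          # within-class scatter
        d = (mui - mu)[:, None]
        Sb += Xi.shape[0] * (d @ d.T)            # between-class scatter
    vals, vecs = eigh(Sb, Sw)      # Sb e = lambda Sw e (Sw assumed positive definite)
    order = np.argsort(vals)[::-1] # decreasing eigenvalues
    m = m if m is not None else len(classes) - 1
    return vecs[:, order[:m]].T    # row juxtaposition of the eigenvectors
```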
Principal component analysis
• Principal component analysis (PCA, or Karhunen-Loève transform, KL) is an unsupervised algorithm for feature extraction. In particular, PCA reduces the dimension of the feature space on the basis of a mean square error criterion.
• Problem setting
– Let a data set {x1, x2, …, xN} composed of N samples be given.
– A coordinate system in the n-D feature space is determined by an orthonormal basis {e1, e2, …, en} and by an origin c.
– In such a coordinate system each sample is expressed as:

$$\mathbf{x}_k = \mathbf{c} + \sum_{i=1}^{n} y_{ik}\,\mathbf{e}_i, \qquad k = 1,2,\ldots,N$$

– To reduce the dimension of the feature space, one could keep only m components:

$$\mathbf{x}_k \approx \mathbf{c} + \sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i, \qquad k = 1,2,\ldots,N;$$

however, it is not obvious that the optimal m components yik are the projections of (xk − c) along the ei.
Geometric interpretation
• Two-dimensional example
– Approximation of the samples in a two-dimensional feature space (plane) as the sum of a constant vector c and of the component along one unit vector e1.

[Figure: samples in the (x1, x2) plane, with origin O, approximated as c + y1·e1 along the direction e1.]
PCA: mean square error
• If the components of xk along (n − m) axes are discarded, an error is obviously introduced. PCA selects the coordinate system that minimizes the mean square error.
– The adopted functional is:

$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left\|\mathbf{x}_k - \hat{\mathbf{x}}_k\right\|^2 = \frac{1}{N}\sum_{k=1}^{N}\left\|\mathbf{x}_k - \mathbf{c} - \sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i\right\|^2$$

– This functional has to be minimized with respect to all the related variables, i.e., the origin c, the vectors ei, and the components yik, under the following orthonormality constraint:

$$\mathbf{e}_i^t\,\mathbf{e}_j = \delta_{ij}, \qquad i,j = 1,2,\ldots,m$$

– Plugging this constraint into the expression of the functional yields:

$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left[\left\|\mathbf{x}_k-\mathbf{c}\right\|^2 - 2\sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c}) + \sum_{i=1}^{m} y_{ik}^2\right]$$
PCA: optimal components of the samples
• Let us compute, first, the optimum components of the samples along the first m vectors of the basis {e1, e2, …, en} (unconstrained minimization).
– The stationarity of the functional with respect to each component yik yields:

$$\frac{\partial \varepsilon^2}{\partial y_{ik}} = 0 \;\Rightarrow\; y_{ik} = \mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c}) = \mathbf{e}_i^t\,\mathbf{x}_k - b_i, \qquad k = 1,2,\ldots,N,$$

where bi = ei^t c is the component of c along the ith unit vector ei of the unknown orthonormal basis (i = 1, 2, …, m).
– Plugging these optimal values into ε² allows obtaining:

$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\left\|\mathbf{x}_k-\mathbf{c}\right\|^2 - \frac{1}{N}\sum_{k=1}^{N}\sum_{i=1}^{m}\left[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})\right]^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})\right]^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left(\mathbf{e}_i^t\,\mathbf{x}_k - b_i\right)^2$$

(the second equality follows from $\|\mathbf{x}_k-\mathbf{c}\|^2 = \sum_{i=1}^{n}[\mathbf{e}_i^t(\mathbf{x}_k-\mathbf{c})]^2$, since the basis is orthonormal).
PCA: optimal origin
• ε² now depends on the origin c only through the components bm+1, bm+2, …, bn of c along em+1, em+2, …, en.
– The zero-gradient condition with respect to bi (i = m + 1, m + 2, …, n) yields:

$$\frac{\partial \varepsilon^2}{\partial b_i} = -\frac{2}{N}\sum_{k=1}^{N}\left(\mathbf{e}_i^t\,\mathbf{x}_k - b_i\right) = 0 \;\Rightarrow\; b_i = \frac{1}{N}\sum_{k=1}^{N}\mathbf{e}_i^t\,\mathbf{x}_k = \mathbf{e}_i^t\,\boldsymbol{\mu}, \quad\text{where } \boldsymbol{\mu} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k$$
(centroid of the data set)

– Consequently:

$$\varepsilon^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k-\boldsymbol{\mu})\right]^2 = \sum_{i=m+1}^{n}\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i, \quad\text{where } \Sigma = \frac{1}{N}\sum_{k=1}^{N}(\mathbf{x}_k-\boldsymbol{\mu})(\mathbf{x}_k-\boldsymbol{\mu})^t$$
(sample covariance of the data set)
PCA: optimal orthonormal basis
• The vectors ei (i = 1, 2, …, n) are supposed to be orthonormal, so their optimization is a constrained problem.
– Optimum basis vectors ei (through Lagrange multipliers):

$$\frac{\partial}{\partial \mathbf{e}_i}\left[\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i - \lambda_i\left(\mathbf{e}_i^t\,\mathbf{e}_i - 1\right)\right] = \mathbf{0} \;\Rightarrow\; \left(\Sigma - \lambda_i I\right)\mathbf{e}_i = \mathbf{0}$$

– The sample covariance Σ is symmetric and positive semidefinite. Therefore, it has n real nonnegative eigenvalues λ1, λ2, …, λn with corresponding orthonormal eigenvectors e1, e2, …, en.
– To establish which m eigenvectors should be preserved (and which (n − m) should be discarded), let us plug the obtained optimal values into the expression of the functional. This yields the following minimum mean square error:

$$\varepsilon^{2*} = \sum_{i=m+1}^{n}\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i = \sum_{i=m+1}^{n}\lambda_i$$
PCA: feature reduction
• Therefore, the minimum value of ε²* is obtained if λm+1, λm+2, …, λn are the smallest eigenvalues, i.e., if the preserved unit vectors e1, e2, …, em correspond to the m largest eigenvalues λ1, λ2, …, λm.
• Expression of the PCA transformation
– If the n eigenvalues of Σ are ordered in decreasing order (i.e., λ1 ≥ λ2 ≥ … ≥ λn), the PCA transformation projects the samples (centered with respect to the centroid) along the axes e1, e2, …, em corresponding to the first m eigenvalues:

$$y_{ik} = \mathbf{e}_i^t(\mathbf{x}_k-\boldsymbol{\mu}), \qquad \mathbf{y}_k = T\,(\mathbf{x}_k-\boldsymbol{\mu}), \quad\text{with } T = \begin{bmatrix}\mathbf{e}_1^t\\ \mathbf{e}_2^t\\ \vdots\\ \mathbf{e}_m^t\end{bmatrix}$$
PCA: remarks
• Operatively, PCA is applied as follows (see the sketch below):
– Compute the centroid μ and the sample covariance Σ of the whole data set.
– Compute the eigenvalues and the eigenvectors of Σ.
– Order the eigenvalues in decreasing order.
– Compute the matrix T through the row juxtaposition of the eigenvectors corresponding to the first m eigenvalues.
• Remarks
– Therefore, the PCA transformation is y = T(x − μ).
– According to the expression of the minimum mean square error, the information loss due to feature reduction through PCA is often quantified through the following efficiency factor:

$$\eta_m = \frac{\sum_{i=1}^{m}\lambda_i}{\sum_{i=1}^{n}\lambda_i}$$

[Plot: efficiency factor ηm as a function of m (from 80% to 100%), with a selected value m* marked.]
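A minimal sketch of these steps (NumPy assumed; function and variable names hypothetical), which also returns the efficiency factor ηm:

```python
import numpy as np

def pca(X, m):
    """PCA: rows of T are the eigenvectors of the sample covariance
    associated with the m largest eigenvalues. X: (N, n) data matrix."""
    mu = X.mean(axis=0)                       # centroid of the data set
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / X.shape[0]          # sample covariance
    vals, vecs = np.linalg.eigh(Sigma)        # eigenpairs (ascending order)
    order = np.argsort(vals)[::-1]            # re-order decreasingly
    T = vecs[:, order[:m]].T                  # row juxtaposition of eigenvectors
    eta = vals[order[:m]].sum() / vals.sum()  # efficiency factor
    return T, mu, eta

# transformed samples: Y = (X - mu) @ T.T, i.e., y = T (x - mu)
```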
PCA: interpretation of the principal components
• The eigenvalue λi represents the sample variance along the axis ei (i = 1, 2, …, n).
– The components along the axes e1, e2, …, en are named principal components. Therefore, one may say that "PCA preserves the first m principal components."
– Geometrically, e1 is the direction along which the samples exhibit the maximum dispersion, and en is the direction along which the sample dispersion is lowest.
– Since the transformed features associated with maximum dispersion are chosen, PCA implicitly assumes that information is conveyed by the variance of the data (see the figure in slide 31).
PCA: remarks on the principal components
• Choosing features related to maximum dispersion does not imply choosing features that discriminate the classes well.
– In this 2D example, separation between the classes is poor with only the first PCA component y1, while considering both y1 and y2 yields better separation:

[Figure: two class clusters ω1 and ω2 in the (x1, x2) plane, with origin O and principal axes e1 and e2.]

– Indeed, PCA does not use information about the class membership of the samples. If a training set is available, it is convenient to use a supervised feature extraction method (e.g., LDA or more sophisticated approaches).
Example (1)
• Apply PCA to the following samples: (0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1).
– Centroid and sample covariance (N = 8):

$$\boldsymbol{\mu} = \frac{1}{2}\begin{bmatrix}1\\1\\1\end{bmatrix}, \qquad \Sigma = \frac{1}{4}\,I_3$$

– All three eigenvalues are equal (λ1 = λ2 = λ3 = 1/4), so any orthonormal basis is an eigenbasis of Σ; one admissible choice is:

$$\mathbf{e}_1 = \frac{1}{\sqrt{3}}\begin{bmatrix}1\\1\\1\end{bmatrix}, \qquad \mathbf{e}_2 = \frac{1}{\sqrt{6}}\begin{bmatrix}-2\\1\\1\end{bmatrix}, \qquad \mathbf{e}_3 = \frac{1}{\sqrt{2}}\begin{bmatrix}0\\1\\-1\end{bmatrix}$$

– Transformation matrix for the extraction of two features:

$$T = \begin{bmatrix}1/\sqrt{3} & 1/\sqrt{3} & 1/\sqrt{3}\\[2pt] -2/\sqrt{6} & 1/\sqrt{6} & 1/\sqrt{6}\end{bmatrix}$$
Example (2)
• Compute the transformed samples:
– Subtraction of the centroid from the samples:

$$\begin{bmatrix}-1/2\\-1/2\\-1/2\end{bmatrix}\!, \begin{bmatrix}1/2\\-1/2\\-1/2\end{bmatrix}\!, \begin{bmatrix}1/2\\-1/2\\1/2\end{bmatrix}\!, \begin{bmatrix}1/2\\1/2\\-1/2\end{bmatrix}\!, \begin{bmatrix}-1/2\\-1/2\\1/2\end{bmatrix}\!, \begin{bmatrix}-1/2\\1/2\\-1/2\end{bmatrix}\!, \begin{bmatrix}-1/2\\1/2\\1/2\end{bmatrix}\!, \begin{bmatrix}1/2\\1/2\\1/2\end{bmatrix}$$

– Transformed samples yk = T(xk − μ):

$$\mathbf{y}_1 = \begin{bmatrix}-\frac{3}{2\sqrt{3}}\\[2pt] 0\end{bmatrix},\ \mathbf{y}_2 = \begin{bmatrix}-\frac{1}{2\sqrt{3}}\\[2pt] -\frac{2}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_3 = \begin{bmatrix}\frac{1}{2\sqrt{3}}\\[2pt] -\frac{1}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_4 = \begin{bmatrix}\frac{1}{2\sqrt{3}}\\[2pt] -\frac{1}{\sqrt{6}}\end{bmatrix},$$

$$\mathbf{y}_5 = \begin{bmatrix}-\frac{1}{2\sqrt{3}}\\[2pt] \frac{1}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_6 = \begin{bmatrix}-\frac{1}{2\sqrt{3}}\\[2pt] \frac{1}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_7 = \begin{bmatrix}\frac{1}{2\sqrt{3}}\\[2pt] \frac{2}{\sqrt{6}}\end{bmatrix},\ \mathbf{y}_8 = \begin{bmatrix}\frac{3}{2\sqrt{3}}\\[2pt] 0\end{bmatrix}$$
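The worked example can be checked numerically with the pca sketch from slide 37; note that, since all three eigenvalues coincide here, a numerical routine may return a different, equally valid orthonormal basis (so T may differ from the one above by a rotation):

```python
import numpy as np

X = np.array([[0, 0, 0], [1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 1, 1]], dtype=float)
T, mu, eta = pca(X, m=2)   # pca() as sketched in slide 37
Y = (X - mu) @ T.T         # transformed samples y_k = T (x_k - mu)
print(mu)                  # [0.5 0.5 0.5]
print(eta)                 # 2/3, since all the eigenvalues equal 1/4
```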
Selection vs. extraction
• Advantage of extraction methods
– An extraction method projects the feature space onto a subspace such that the maximum information is preserved, and it is consequently more flexible (indeed, selection is a particular case of extraction).
• Advantages of selection methods
– The features provided by a selection method are a subset of the original ones. Therefore, they maintain their physical meanings. This is relevant when information about the interpretation of the features is used in the classification process (e.g., knowledge-based methods).
– On the contrary, an extraction method generates "virtual" features, which are defined as linear combinations of the "measured" original features and usually have well-defined mathematical meanings but no physical meanings.
– Through selection, the discarded features are no longer needed. With extraction, one usually needs all the original features (e.g., to compute the linear combinations).
Bibliography
• R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd edition, Wiley, New York, 2001.
• K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, New York, 1990.
• G. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.
• L. O. Jimenez, D. A. Landgrebe, "Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data," IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 28, no. 1, pp. 39-54, 1998.
• H. C. Andrews, Introduction to Mathematical Techniques in Pattern Recognition, Wiley International, New York, 1972.
• P. H. Swain, S. M. Davis, Remote Sensing: The Quantitative Approach, McGraw-Hill, New York, 1978.
• J. A. Richards, X. Jia, Remote Sensing Digital Image Analysis, Springer-Verlag, Berlin, 1999.
• S. B. Serpico, L. Bruzzone, "A new search algorithm for feature selection in hyperspectral remote sensing images," IEEE Transactions on Geoscience and Remote Sensing, vol. 39, pp. 1360-1367, 2001.
• L. O. Jimenez, D. A. Landgrebe, "Hyperspectral data analysis and feature reduction via projection pursuit," IEEE Transactions on Geoscience and Remote Sensing, vol. 37, pp. 2653-2667, 1999.