Nonparametric Weighted Feature Extraction for Classification
Bor-Chen Kuo
Department of Mathematics Education
National Taichung Teachers College, Taichung, Taiwan 403
Tel: 886-4-22263181 ext 223
Email: [email protected]
David A. Landgrebe
School of Electrical and Computer Engineering
Purdue University, West Lafayette, Indiana 47907-1285
Tel: 765-494-3486
Email: [email protected]
Copyright © 2004 IEEE. Reprinted from IEEE Transactions on Geoscience and Remote Sensing, Volume 42, No. 5, pp. 1096-1105, May 2004.
This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by sending a blank email message to [email protected].
By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
Nonparametric Weighted Feature Extraction for Classification1
Bor-Chen Kuo, Member, IEEE and David A. Landgrebe, Life Fellow, IEEE
Abstract
In this paper, a new nonparametric feature extraction method is proposed for high
dimensional multiclass pattern recognition problems. It is based on a nonparametric extension of
scatter matrices. There are at least two advantages to using the proposed nonparametric scatter
matrices. First, they are generally of full rank. This provides the ability to specify the number of
extracted features desired and to reduce the effect of the singularity problem. This is in contrast
to parametric discriminant analysis, which usually can extract only L–1 (number of classes
minus one) features. In a real situation, this may not be enough. Second, the nonparametric
nature of scatter matrices reduces the effects of outliers and works well even for non-normal data
sets. The new method provides greater weight to samples near the expected decision boundary.
This tends to provide for increased classification accuracy.
Index Terms—Dimensionality reduction, discriminant analysis, nonparametric feature
extraction.
1. Introduction
Among the ways to approach high dimensional data classification, a useful processing
model that has evolved in the last several years [1,2] is shown schematically in Figure 1. Given
the availability of data (box 1), the process begins by the analyst specifying what classes are
desired, usually by labeling training samples for each class (box 2). New elements that have
proven important in the case of high dimensional data are those indicated by boxes in the
diagram marked 3 and 4. These are the focus of this work.
Figure 1. A schematic diagram for a hyperspectral data analysis procedure. [Blocks: (1) Hyperspectral Data; (2) Label Training Samples; (3) Determine Quantitative Class Descriptions; (4) Class Conditional Feature Extraction; (5) Feature Selection; (6) Classifier.]
The reason for the importance of elements 3 and 4 in this context is as follows.
Classification techniques in pattern recognition typically assume that there are enough training
samples available to obtain reasonably accurate class descriptions in quantitative form.
Unfortunately, the number of training samples required to train a classifier for high-dimensional
data is much greater than that required for conventional data, and gathering these training
samples can be difficult and expensive. Therefore, the assumption that enough training samples
are available to accurately estimate the class quantitative description is frequently not satisfied
for high-dimensional data. Small training sets usually result in the Hughes phenomenon [3] and
singularity problems. There are several ways to overcome these problems. In [4], these
techniques are categorized into three groups:
1 The work described in this paper was sponsored in part by the National Imagery and Mapping Agency under grant NMA 201-01-C-0023.
A. Dimensionality reduction by feature extraction or feature selection.
B. Regularization of the class sample covariance matrices (e.g. [5], [6], [7], [8], [9]).
C. Structurization of a true covariance matrix described by a small number of parameters [4].
Group C is useful when the property and structure of the true covariance are known; otherwise, methods in Groups A and B are suggested. Generally, methods in Group B, or Group B followed by Group A, are useful when class training sample sizes are small, especially when the total number of training samples is less than the dimensionality of the data.
When the total number of training samples is greater than the dimensionality, feature extraction
methods may be a better choice. This paper will focus on the situation in which general feature
extraction methods can be used and develop a new nonparametric feature extraction algorithm
that is suitable for both simple and complex distributed data.
2. Background: Relevant Existing Feature Extraction Methods
2.1 Parametric Feature Extraction
Discriminant Analysis Feature Extraction (DAFE) is often used for dimension reduction in
classification problems. It is also called the parametric feature extraction method in [10], since
DAFE uses the mean vector and covariance matrix of each class. Usually within-class, between-
class, and mixture scatter matrices are used to formulate the criteria of class separability. A
within-class scatter matrix for L classes is expressed by [10]:
$S_w^{DA} = \sum_{i=1}^{L} P_i \Sigma_i = \sum_{i=1}^{L} P_i S_{w_i}^{DA}$    (1)
where $P_i$ denotes the prior probability of class i, $m_i$ is the class mean, and $\Sigma_i$ is the class covariance matrix. A between-class scatter matrix is expressed as
$S_b^{DA} = \sum_{i=1}^{L} P_i (m_i - m_0)(m_i - m_0)^T = \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} P_i P_j (m_i - m_j)(m_i - m_j)^T$    (2)
where $m_0$ represents the expected vector of the mixture distribution and is given by
$m_0 = \sum_{i=1}^{L} P_i m_i$    (3)
The optimal features are determined by optimizing the Fisher criterion given by
$J_{DAFE} = \mathrm{tr}\left[ \left( S_w^{DA} \right)^{-1} S_b^{DA} \right]$    (4)
In [11], DAFE is shown to be equivalent to finding the ML estimators of a Gaussian model, assuming that all class discrimination information resides in the transformed subspace and that the within-class covariances are equal for all classes. The advantage of DAFE is that it is distribution-free, but it has three major disadvantages. The first is that it works well only if the class distributions are normal-like [10]; when the class distributions are non-normal or multi-modal mixtures, the performance of DAFE is not satisfactory. The second is that the rank of the between-class scatter matrix is at most the number of classes (L) minus one, so, assuming sufficient observations and a within-class scatter matrix of rank v, only min(L-1, v) features can be extracted. From [10, Ch. 10], we know that unless a posterior probability function is specified, L-1 features are suboptimal in a Bayes sense, although they are optimal based on the chosen criterion. In real situations, the data distributions are often complicated and not normal-like; therefore, using only L-1 features is not sufficient for much real data. The third limitation is that if the within-class covariance matrix is singular, which often occurs in high dimensional problems, DAFE will perform poorly in classification.
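The mechanics of DAFE can be made concrete with a short sketch. The following NumPy code, a minimal illustration rather than the implementation used in this paper, builds the scatter matrices of Eqs. (1)-(3) with priors estimated by class proportions and solves the generalized eigenproblem behind Eq. (4); the function name dafe is a hypothetical helper, and the code assumes $S_w^{DA}$ is nonsingular, which is exactly the assumption that fails in the third limitation above.

    import numpy as np
    from scipy.linalg import eigh

    def dafe(X, y):
        """Sketch of DAFE, Eqs. (1)-(4): returns the top L-1 discriminant
        directions. Priors are estimated by class proportions; assumes the
        within-class scatter Sw is nonsingular."""
        classes = np.unique(y)
        n, p = X.shape
        m0 = X.mean(axis=0)                          # mixture mean, Eq. (3)
        Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
        for c in classes:
            Xi = X[y == c]
            Pi = len(Xi) / n
            mi = Xi.mean(axis=0)
            Sw += Pi * np.cov(Xi, rowvar=False)      # within-class scatter, Eq. (1)
            d = mi - m0
            Sb += Pi * np.outer(d, d)                # between-class scatter, Eq. (2)
        # maximizing tr[(Sw^-1)Sb], Eq. (4): generalized eigenproblem Sb v = lambda Sw v
        evals, evecs = eigh(Sb, Sw)
        return evecs[:, np.argsort(evals)[::-1][:len(classes) - 1]]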
Foley-Sammon feature extraction and its extensions [13][14][15][19] can help to extract more than L-1 orthogonal features from the n-dimensional space based on the following:
$r_i = \arg\max_{r} \dfrac{r^T S_b^{DA} r}{r^T S_w^{DA} r}, \quad i = 1, 2, \ldots, n-1,$ subject to $r_i^T S_w^{DA} r_j = 0, \; i \neq j$
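The $S_w$-orthogonality constraint can be handled by restricting each new maximization to the subspace that is $S_w$-orthogonal to the vectors already found. The sketch below is one simplified way to do this, not the exact algorithm of [13]; the function name foley_sammon and the use of null_space are implementation choices of this illustration, assuming $S_w$ is positive definite.

    import numpy as np
    from scipy.linalg import eigh, null_space

    def foley_sammon(Sb, Sw, n_vectors):
        """Sketch: each new vector maximizes r^T Sb r / r^T Sw r subject to
        r_i^T Sw r_j = 0 for j < i, by solving the reduced generalized
        eigenproblem on the Sw-orthogonal complement of previous vectors."""
        p = Sb.shape[0]
        R = []
        for _ in range(n_vectors):
            if R:
                # basis of the subspace Sw-orthogonal to the vectors found so far
                B = null_space(np.array(R) @ Sw)
            else:
                B = np.eye(p)
            evals, evecs = eigh(B.T @ Sb @ B, B.T @ Sw @ B)
            r = B @ evecs[:, -1]          # top generalized eigenvector of the reduced problem
            R.append(r / np.linalg.norm(r))
        return np.column_stack(R)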
This third limitation can be relieved by using regularized covariance estimators in the
estimating procedure of the within-class scatter matrix [16] or by adding Singular Value
Perturbation to the within-class scatter matrix to solve the generalized eigenvalue problem [17].
Approximated pairwise accuracy criterion Linear Dimension Reduction (aPAC-LDR) [18] can be seen as DAFE with the contributions of individual class pairs weighted according to the Euclidean distance between the respective class means. The major difference between DAFE and aPAC-LDR is that the Fisher criterion is redefined as
$J_{LDR} = \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} P_i P_j\, \omega(\Delta_{ij})\, \mathrm{tr}\left[ \left( S_w^{DA} \right)^{-1} S_{ij}^{LDR} \right],$    (5)
where
$S_{ij}^{LDR} = (m_i - m_j)(m_i - m_j)^T, \quad \omega(\Delta_{ij}) = \dfrac{1}{2 \Delta_{ij}^{2}}\, \mathrm{erf}\!\left( \dfrac{\Delta_{ij}}{2\sqrt{2}} \right),$
and
$\Delta_{ij} = \sqrt{ (m_i - m_j)^T \left( S_w^{DA} \right)^{-1} (m_i - m_j) }$    (6)
The above weighted Fisher criterion is the same as (4) with the between-class scatter matrix redefined as
$S_b^{LDR} = \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} P_i P_j\, \omega(\Delta_{ij})\, (m_i - m_j)(m_i - m_j)^T$    (7)
Hence the optimization problem is the same as DAFE.
One simulated and one real data experiment are reported in [18]. They show that the advantages of this method are:
1. It can be designed to confine the influence of outlier classes on the final LDR transformation.
2. aPAC-LDR needs fewer features to reach the optimal accuracy of DAFE, while the best accuracy of aPAC-LDR is almost the same as that of DAFE.
aPAC-LDR is the same as DAFE in that it still uses the mean vectors and covariances to formulate the scatter matrices; hence it still suffers from the three major disadvantages of DAFE.
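As an illustration, the aPAC-LDR between-class scatter of Eq. (7) can be assembled directly from the class means and priors. The helper below is a hypothetical sketch, not the authors' code; the optimization then proceeds exactly as in DAFE, with $S_b^{LDR}$ in place of $S_b^{DA}$.

    import numpy as np
    from scipy.special import erf

    def apac_between_scatter(means, priors, Sw_inv):
        """Sketch of S_b^LDR, Eq. (7): pairwise terms weighted by
        omega(Delta_ij) of Eqs. (5)-(6). means: list of class mean vectors;
        Sw_inv: inverse of the parametric within-class scatter."""
        L, p = len(means), means[0].shape[0]
        Sb = np.zeros((p, p))
        for i in range(L - 1):
            for j in range(i + 1, L):
                d = means[i] - means[j]
                delta = np.sqrt(d @ Sw_inv @ d)                     # Mahalanobis distance, Eq. (6)
                w = erf(delta / (2 * np.sqrt(2))) / (2 * delta**2)  # omega(Delta_ij)
                Sb += priors[i] * priors[j] * w * np.outer(d, d)
        return Sb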
2.2 Nonparametric Discriminant Analysis
Nonparametric Discriminant Analysis (NDA) [10][20] was proposed to solve the problems
of DAFE. In NDA, the between-class scatter matrix is redefined as a new nonparametric
between-class scatter matrix (for the two-class problem), denoted $S_b^{NDA}$:
$S_b^{NDA} = P_1 E\left[ \left( X^{(1)} - M_2(X^{(1)}) \right) \left( X^{(1)} - M_2(X^{(1)}) \right)^T \middle|\ \omega_1 \right] + P_2 E\left[ \left( X^{(2)} - M_1(X^{(2)}) \right) \left( X^{(2)} - M_1(X^{(2)}) \right)^T \middle|\ \omega_2 \right]$
where $X^{(i)}$ denotes the random variable describing the distribution of class i, and $x_l^{(i)}$ denotes the l-th outcome of this random variable. $M_k(x_l^{(i)}) = \frac{1}{k} \sum_{j=1}^{k} x_{jNN}^{(i)}$ is called the local kNN mean, where $x_{jNN}^{(i)}$ is the j-th nearest neighbor (NN) from class i ($\omega_i$) to the sample $x_l^{(i)}$. If $k = N_i$, the training sample size of class i, [10] shows that the features extracted by maximizing $\mathrm{tr}\left[ \left( S_w^{NDA} \right)^{-1} S_b^{NDA} \right]$ must be the same as the ones from $\mathrm{tr}\left[ \left( S_w^{DA} \right)^{-1} S_b^{DA} \right]$. Thus, the parametric feature extraction obtained by maximizing $\mathrm{tr}\left[ \left( S_w^{DA} \right)^{-1} S_b^{DA} \right]$ is a special case of feature extraction with the more general nonparametric criterion $\mathrm{tr}\left[ \left( S_w^{NDA} \right)^{-1} S_b^{NDA} \right]$, where $S_w^{NDA}$ is defined in (12).
Figure 2. The relationship between sample points and their local means (*s are neighbors of $x_l^{(i)}$, +s are neighbors of $x_t^{(i)}$, and ⊗s represent local means).
Further understanding of $S_b^{NDA}$ is obtained by examining the vector $x_l^{(i)} - M_j(x_l^{(i)})$. Figure 2 shows the importance of using boundary points and local means. Pointing to the local mean of the other class, each vector indicates the direction to the other class locally. If we select these vectors only from the samples located near the classification boundary (e.g. $x_l^{(i)} - M_j(x_l^{(i)})$), the scatter matrix of these vectors should specify the subspace in which the boundary region is embedded. Vectors of samples that are far away from the boundary (e.g. $x_t^{(i)} - M_j(x_t^{(i)})$) tend to have large magnitudes. These large magnitudes can exert a considerable influence on the scatter matrix and distort the information of the boundary structure.
Therefore, some method of de-emphasizing samples far from the boundary seems appropriate.
To accomplish this, [10] uses a weighting function for each $x_l^{(i)} - M_j(x_l^{(i)})$. The value of the weighting function, denoted $w_l^{(i,j)}$, for $x_l^{(i)}$ is defined as
$w_l^{(i,j)} = \dfrac{\min\left\{ d^{\alpha}\!\left( x_l^{(i)}, x_{kNN}^{(i)} \right),\ d^{\alpha}\!\left( x_l^{(i)}, x_{kNN}^{(j)} \right) \right\}}{ d^{\alpha}\!\left( x_l^{(i)}, x_{kNN}^{(i)} \right) + d^{\alpha}\!\left( x_l^{(i)}, x_{kNN}^{(j)} \right) },$    (10)
where $\alpha$ is a control parameter between zero and infinity, and $d(x_l^{(i)}, x_{kNN}^{(j)})$ is the Euclidean distance from $x_l^{(i)}$ to its k-th nearest neighbor in class j.
Based on [18] and [26], the final discrete forms of the within- and between-class scatter matrices for the multiclass problem are expressed by
$S_b^{NDA} = \sum_{i=1}^{L} P_i \sum_{j=1,\, j \neq i}^{L} \sum_{l=1}^{N_i} \dfrac{w_l^{(i,j)}}{N_i} \left( x_l^{(i)} - M_j(x_l^{(i)}) \right) \left( x_l^{(i)} - M_j(x_l^{(i)}) \right)^T = \sum_{i=1}^{L} P_i S_{b_i}^{NDA}$    (11)
$S_w^{NDA} = \sum_{i=1}^{L} P_i \sum_{l=1}^{N_i} \dfrac{w_l^{(i,j)}}{N_i} \left( x_l^{(i)} - M_i(x_l^{(i)}) \right) \left( x_l^{(i)} - M_i(x_l^{(i)}) \right)^T = \sum_{i=1}^{L} P_i S_{w_i}^{NDA}$    (12)
Although the nonparametric version of the within-class scatter matrix was proposed in [10] and [20], the authors still suggested using the parametric $S_w^{DA}$ in NDA, optimizing $J_{NDA} = \mathrm{tr}\left[ \left( S_w^{DA} \right)^{-1} S_b^{NDA} \right]$.
The disadvantages of NDA are:
1. Parameters k and α are usually decided by rules of thumb, so a better result usually comes only after several trials.
2. The within-class scatter matrix in NDA still has a parametric form. When the training set size is small, NDA will have the singularity problem.
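To make the construction concrete, the following sketch computes $S_b^{NDA}$ of Eq. (11) with the local kNN means and the boundary-emphasizing weights of Eq. (10). Only the between-class matrix is sketched since, as noted above, [10] suggested retaining the parametric $S_w^{DA}$; the function names are hypothetical helpers, and the brute-force loops and the assumption $N_i > k$ are simplifications of this illustration.

    import numpy as np

    def local_knn_mean(x, Xc, k):
        """Local kNN mean M(x): mean of the k nearest neighbors of x among
        the samples Xc of one class."""
        d = np.linalg.norm(Xc - x, axis=1)
        return Xc[np.argsort(d)[:k]].mean(axis=0)

    def nda_between_scatter(Xs, priors, k=2, alpha=1.0):
        """Sketch of S_b^NDA, Eq. (11), with the weights of Eq. (10).
        Xs: list of (Ni, p) arrays, one per class; assumes Ni > k."""
        L, p = len(Xs), Xs[0].shape[1]
        Sb = np.zeros((p, p))
        for i in range(L):
            Ni = len(Xs[i])
            for j in range(L):
                if j == i:
                    continue
                for x in Xs[i]:
                    # distance to the k-th NN in class i (index k skips x itself)
                    di = np.sort(np.linalg.norm(Xs[i] - x, axis=1))[k]
                    # distance to the k-th NN in class j
                    dj = np.sort(np.linalg.norm(Xs[j] - x, axis=1))[k - 1]
                    w = min(di**alpha, dj**alpha) / (di**alpha + dj**alpha)  # Eq. (10)
                    v = x - local_knn_mean(x, Xs[j], k)
                    Sb += priors[i] * (w / Ni) * np.outer(v, v)
        return Sb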
2.3 Discriminant Analysis Using Malina’s Criterion
In [21] and [22], the criterion functions of DAFE and NDA were modified based on Malina's criterion [23],[24],[25]; DAM is used here to denote the result. For a two-class classification problem, the following criterion was proposed:
$J_{DAM}(r) = \dfrac{(1-\beta)\, r^T S_b r + \beta\, r^T S_w^{(i-j)} r}{r^T S_w r}$    (13)
where r is the feature vector to be extracted; $S_b$, $S_w$, $S_{b_i}$, and $S_{w_i}$ may be the parametric (DAFE) or nonparametric (NDA) versions; $\beta$ denotes a user-supplied parameter; and $S_w^{(i-j)} = S_{w_i} - S_{w_j}$ or $S_{w_j} - S_{w_i}$.
In [26], this criterion was extended to a multiclass version. For data normalized by the common covariance, the criteria are
Parametric: $J_{DAM-P}(r, \beta) = (1-\beta)\, r^T S_b r + \beta \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} P_i P_j\, r^T S_w^{(i-j)} r$    (14)
Nonparametric: $J_{DAM-N}(r, \beta, \alpha, k) = (1-\beta)\, r^T S_b^{NDA} r + \beta \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} P_i P_j\, r^T S_w^{NDA(i-j)} r$    (15)
where $S_w^{NDA(i-j)} = S_{w_i}^{NDA} - S_{w_j}^{NDA}$ or $S_{w_j}^{NDA} - S_{w_i}^{NDA}$, and $0 \le \beta \le 1$.
For convenience, DAM-P (parametric) and DAM-N (nonparametric) are used to denote these criteria in this paper. Like NDA, DAM-NP denotes the situation in which the between-class scatter matrix is in nonparametric form and the within-class scatter matrix is in parametric form. [21], [22], and [26] suggested that for extracting the first feature the Euclidean distance should be used in the weighting function, and for extracting the second feature the projected distance
$d(x_i, x_j) = \left\| r_1^T x_i - r_1^T x_j \right\|$    (16)
should be applied.
In [21], the results of simulated and real data experiments showed that for two-class classification problems, using just one or two features, the performance of DAM-N is better than those of NDA, DAFE, and DAM-P.
DAM has a few advantages. First, it is a generalized version of NDA, so it inherits the advantages of NDA. Second, it performs better when the difference between class variances is large. The disadvantage is that if the number of classes is L, there are $2^{L(L-1)/2}$ different choices of the $S_w^{(i-j)}$, and for one setting of $(r, \beta, \alpha, k)$ it is necessary to perform $2^{L(L-1)/2}$ eigenvalue decompositions [26]. For example, in a 7-class problem, finding the optimal eigenvector for one setting $(r, \beta, \alpha, k)$ requires computing eigenvectors $2^{21} = 2{,}097{,}152$ times. To address this problem, a binary-tree multiclass mapping technique was proposed in [26]. The method is user-friendly for 2D mappings and interactive classifier design.
The main idea of the methods in [21], [22], and [26] is to compute discriminant vectors by successive optimization of the discriminant criterion for specific values of the control parameters, searched for by a trial-and-error procedure. In [21], [22], and [26], the distance (16) was used for extracting the second discriminant vector. In [27], another extraction method, called "removal of classification structure" (REM), was proposed, and experimental results show that it is better than the successive extraction methods proposed in [26]. In this study, this method (REM) and the traditional simultaneous orthogonal extraction method (ORTH) are used for finding successive feature vectors.
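For a fixed β on covariance-normalized data, the DAM-P criterion (14) is a quadratic form in r once a sign is fixed for every pairwise term, so its maximizer over unit vectors is a leading eigenvector. The sketch below, a hypothetical helper, evaluates one such case; it illustrates the cost discussed above (one decomposition per sign map), not the binary-tree technique of [26].

    import numpy as np

    def dam_p_direction(Sb, Sw_list, priors, beta, signs):
        """Sketch of one DAM-P case, Eq. (14), for data already normalized by
        the common covariance. signs[(i, j)] in {+1, -1} fixes the choice
        Sw_i - Sw_j versus Sw_j - Sw_i for each pair; enumerating all sign
        maps gives the 2^(L(L-1)/2) eigenvalue decompositions noted above."""
        L = len(Sw_list)
        M = (1 - beta) * Sb
        for i in range(L - 1):
            for j in range(i + 1, L):
                M += beta * priors[i] * priors[j] * signs[(i, j)] * (Sw_list[i] - Sw_list[j])
        evals, evecs = np.linalg.eigh(M)
        return evecs[:, -1]   # eigenvector with the largest criterion value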
3. Nonparametric Weighted Feature Extraction
In this section, a new feature extraction method called nonparametric weighted feature
extraction (NWFE) is proposed. From NDA, we know that the “local information” is important
and useful for improving DAFE. The main ideas of NWFE are putting different weights on every
sample to compute the “weighted means” and defining new nonparametric between-class and
within-class scatter matrices to obtain more than L–1 features. In NWFE, the nonparametric
between-class scatter matrix for L classes is defined as
$S_b^{NW} = \sum_{i=1}^{L} P_i \sum_{j=1,\, j \neq i}^{L} \sum_{l=1}^{N_i} \dfrac{\lambda_l^{(i,j)}}{N_i} \left( x_l^{(i)} - M_j(x_l^{(i)}) \right) \left( x_l^{(i)} - M_j(x_l^{(i)}) \right)^T$    (17)
where $x_l^{(i)}$ refers to the l-th sample from class i, $N_i$ is the training sample size of class i, and $P_i$ denotes the prior probability of class i.
Basically, Equation (17) is similar to Equation (11). The differences are in the definitions of
weights and weighted means. The scatter-matrix weight $\lambda_l^{(i,j)}$ is a function of $x_l^{(i)}$ and $M_j(x_l^{(i)})$, and is defined as
$\lambda_l^{(i,j)} = \dfrac{ \mathrm{dist}\left( x_l^{(i)}, M_j(x_l^{(i)}) \right)^{-1} }{ \sum_{t=1}^{N_i} \mathrm{dist}\left( x_t^{(i)}, M_j(x_t^{(i)}) \right)^{-1} },$    (18)
where $\mathrm{dist}(a, b)$ denotes the Euclidean distance from a to b.
If the distance between $x_l^{(i)}$ and $M_j(x_l^{(i)})$ is small, then its weight $\lambda_l^{(i,j)}$ will be close to 1; otherwise, $\lambda_l^{(i,j)}$ will be close to 0. The sum of the $\lambda_l^{(i,j)}$ for class i is 1.
$M_j(x_l^{(i)})$ denotes the weighted mean of $x_l^{(i)}$ in class j and is defined as
$M_j(x_l^{(i)}) = \sum_{k=1}^{N_j} w_{lk}^{(i,j)}\, x_k^{(j)},$    (19)
where
$w_{lk}^{(i,j)} = \dfrac{ \mathrm{dist}\left( x_l^{(i)}, x_k^{(j)} \right)^{-1} }{ \sum_{t=1}^{N_j} \mathrm{dist}\left( x_l^{(i)}, x_t^{(j)} \right)^{-1} }.$    (20)
The weight $w_{lk}^{(i,j)}$ for computing weighted means is a function of $x_l^{(i)}$ and $x_k^{(j)}$. If the distance between $x_l^{(i)}$ and $x_k^{(j)}$ is small, then its weight $w_{lk}^{(i,j)}$ will be close to 1; otherwise, $w_{lk}^{(i,j)}$ will be close to 0. The sum of the $w_{lk}^{(i,j)}$ for $M_j(x_l^{(i)})$ is 1.
The nonparametric within-class scatter matrix is defined as
$S_w^{NW} = \sum_{i=1}^{L} P_i \sum_{l=1}^{N_i} \dfrac{\lambda_l^{(i,i)}}{N_i} \left( x_l^{(i)} - M_i(x_l^{(i)}) \right) \left( x_l^{(i)} - M_i(x_l^{(i)}) \right)^T$    (21)
In NDA, nearest neighbors are used to estimate the local mean, and a weighting function is used to emphasize the importance of boundary points and the related between-class vectors. Using only kNN points to estimate the local mean may lose information, and not all kNN points carry the same information about the class boundary. Based on this observation, NWFE proposes the "weighted mean" (Eq. (19)) and uses weighted between- and within-class vectors to improve NDA.
The extracted f features are the f eigenvectors with the f largest eigenvalues of the matrix $\left( S_w^{NW} \right)^{-1} S_b^{NW}$.
To reduce the effect of the cross products of within-class distances and to prevent singularity, regularization techniques [5], [29] can be applied to the within-class scatter matrix. In this study, the within-class scatter matrix is regularized by
$S_w^{NW} = 0.5\, S_w^{NW} + 0.5\, \mathrm{diag}\left( S_w^{NW} \right),$
where diag(A) denotes the diagonal part of matrix A.
Finally, the NWFE algorithm is:
1. Compute the distances between each pair of sample points and form the distance matrix.
2. Compute $w_{lk}^{(i,j)}$ using the distance matrix.
3. Use $w_{lk}^{(i,j)}$ to compute the weighted means $M_j(x_l^{(i)})$.
4. Compute the scatter-matrix weights $\lambda_l^{(i,j)}$.
5. Compute $S_b^{NW}$ and the regularized $S_w^{NW}$.
6. Extract features by using the ORTH or REM methods (a sketch of these steps is given below).
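The following is a minimal NumPy sketch of steps 1-6 above (Eqs. (17)-(21) plus the regularizer); the function name nwfe is a hypothetical helper. Two details are assumptions of this sketch rather than specifications from the paper: a sample is excluded from its own weighted mean by setting its self-distance to infinity, and a small eps guards against zero distances between duplicate samples. Feature extraction is done here with a single generalized eigendecomposition, i.e., the simultaneous (ORTH-style) extraction.

    import numpy as np
    from scipy.linalg import eigh

    def nwfe(Xs, priors, n_features, eps=1e-12):
        """Sketch of NWFE steps 1-6, Eqs. (17)-(21).
        Xs: list of (Ni, p) training arrays, one per class."""
        L, p = len(Xs), Xs[0].shape[1]
        Sb, Sw = np.zeros((p, p)), np.zeros((p, p))
        for i in range(L):
            Ni = len(Xs[i])
            for j in range(L):
                # steps 1-2: pairwise distances and weights w_lk^(i,j), Eq. (20)
                D = np.linalg.norm(Xs[i][:, None, :] - Xs[j][None, :, :], axis=2)
                if j == i:
                    np.fill_diagonal(D, np.inf)   # assumption: exclude a sample from its own weighted mean
                W = 1.0 / (D + eps)
                W /= W.sum(axis=1, keepdims=True)
                # step 3: weighted means M_j(x_l^(i)), Eq. (19)
                M = W @ Xs[j]
                # step 4: scatter-matrix weights lambda_l^(i,j), Eq. (18)
                lam = 1.0 / (np.linalg.norm(Xs[i] - M, axis=1) + eps)
                lam /= lam.sum()
                # step 5: accumulate Eq. (21) when j == i, Eq. (17) otherwise
                V = Xs[i] - M
                S = (V * (lam / Ni)[:, None]).T @ V
                if j == i:
                    Sw += priors[i] * S
                else:
                    Sb += priors[i] * S
        Sw = 0.5 * Sw + 0.5 * np.diag(np.diag(Sw))   # step 5: regularization
        # step 6: simultaneous (ORTH-style) extraction via one generalized eigenproblem
        evals, evecs = eigh(Sb, Sw)
        return evecs[:, np.argsort(evals)[::-1][:n_features]]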
4. Experiment Design
In this paper, only real data experiment results are displayed; related simulated data
experiment results can be found in [16].
The design of Experiment 1 is to compare the multiclass classification performances of
using DAFE, aPAC-LDR, NDA and NWFE (with ORTH method) features applied to Gaussian,
2NN, and Parzen classifiers. The design of Experiment 2 is to compare the 2-class classification
performances of using DAFE, NDA, DAM-P, DAM-NP, DAM-N and NWFE features applied to
Gaussian, 2NN, and Parzen classifiers. In Experiment 1, only the simultaneous orthogonal
feature extraction method (ORTH, [27]) is used. In Experiment 2, ORTH and the successive
extracting method (REM, [27]) are used. Euclidean distance and 2NN are used in NDA, DAM-P, DAM-NP, DAM-N, and the kNN classifier. The grid method is used for successively finding the optimal β1 (first feature) and β2 (second feature) in the DAM cases. All classifiers are from [28].
Four different real data sets are used in the experiments: Cuprite, a site of geologic interest in western Nevada; Jasper Ridge, a site of ecological interest in California; Indian Pine, a mixed forest/agricultural site in Indiana; and the Washington, DC Mall, an urban site. The first three data sets were gathered by a sensor known as AVIRIS, mounted in an aircraft flown at 65,000 ft altitude and operated by the NASA/Jet Propulsion Lab. It produces pixels in 220 spectral bands measuring approximately 20 m across on the ground. The fourth data set was flown with a sensor system in a lower-altitude aircraft, again producing data in 220 bands but at a spatial resolution of approximately 5 m. Some water absorption channels are discarded, so only 191 bands are used in the experiments. There are 8, 6, 6, and 7 classes in the Cuprite, Jasper Ridge, Indian Pine, and DC Mall data sets, respectively. There are 40 training samples, distinct from the testing samples, in each class of the Cuprite, Jasper Ridge, Indian Pine, and DC Mall experiments. The data sets in Experiment 2 contain only the
first two classes of those four real data sets. In each experiment, 10 training and testing sample sets are selected randomly to establish the classification process and evaluate its performance.
5. Experiment Results
5.1 Results of Experiment 1
The results of Experiment 1 are displayed in Figures 3(a) to 3(c) (Cuprite), 4(a) to 4(c) (Jasper Ridge), 5(a) to 5(c) (Indian Pine), and 6(a) to 6(c) (DC Mall), respectively. In each case, L indicates the number of classes, Ni the number of training samples in each class, and p the number of features (spectral bands) used. These figures show that
1. For these three classifiers, NWFE performs better than the other methods.
2. The Gaussian and 2NN classifiers perform better than the Parzen classifier.
3. Figure 5(a) shows that if only 5 (i.e., L-1) features are used, the accuracy (expressed as a percentage of the test samples correctly classified) of DAFE is 57% and that of NWFE is 86%; if 7 NWFE features are used, the accuracy increases to 91%. This shows that using only L-1 features is not enough in this real situation. DAFE cannot extract more due to the rank restriction of its between-class scatter matrix; NWFE has no such restriction.
4. Comparing Figures 7(b) and 7(c), one sees that the performance of NWFE is better than that of DAFE in almost all classes.
Figure 3(a). Mean of accuracies using 1-15 features (Cuprite, L=8, Ni=40, p=191, Gaussian classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 3(b). Mean of accuracies using 1-15 features (Cuprite, L=8, Ni=40, p=191, 2NN classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 3(c). Mean of accuracies using 1-15 features (Cuprite, L=8, Ni=40, p=191, Parzen classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 4(a). Mean of accuracies using 1-15 features (Jasper Ridge, L=6, Ni=40, p=191, Gaussian classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 4(b). Mean of accuracies using 1-15 features (Jasper Ridge, L=6, Ni=40, p=191, 2NN classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 4(c). Mean of accuracies using 1-15 features (Jasper Ridge, L=6, Ni=40, p=191, Parzen classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 5(a). Mean of accuracies using 1-15 features (Indian Pine, L=6, Ni=40, p=191, Gaussian classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 5(b). Mean of accuracies using 1-15 features (Indian Pine, L=6, Ni=40, p=191, 2NN classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 5(c). Mean of accuracies using 1-15 features (Indian Pine, L=6, Ni=40, p=191, Parzen classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 6(a). Mean of accuracies using 1-15 features (DC Mall, L=7, Ni=40, p=191, Gaussian classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 6(b). Mean of accuracies using 1-15 features (DC Mall, L=7, Ni=40, p=191, 2NN classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 6(c). Mean of accuracies using 1-15 features (DC Mall, L=7, Ni=40, p=191, Parzen classifier; curves: NWFE, DAFE, aPAC, NDA_2NN).
Figure 7(a) A simulated color IR image of a portion of the DC data set.
Figure 7(b). The thematic map resulting from the classification of the area of Figure 7(a) using
DAFE features and Gaussian Classifier.
Figure 7(c). The thematic map resulting from the classification of the area of Figure 7(a) using
NWFE features and Gaussian Classifier.
5.2 Results of Experiment 2
The results of Experiment 2 are displayed in Tables 1, 2, and 3; the cases in which REM performs better than ORTH can be seen by comparing the paired REM and ORTH rows. They show that
1. REM is useful when the DAM criteria are applied and the data distribution is non-normal (e.g., Indian Pine). For NWFE and NDA, ORTH is still the better choice.
2. For NWFE/ORTH, the three classifiers have similar best results. For NDA, the 2NN classifier gives the better performance. For the DAM criteria, the Gaussian classifier obtains the better performance.
3. Overall, using the NWFE criterion with the ORTH extraction method is a good and robust choice.
Table 1. Performances of the Gaussian classifier using different criteria and extraction methods (entries are mean accuracies using 1 and 2 extracted features).

                         Cuprite          Jasper Ridge     Indian Pine      DC Mall
Criterion   Extraction   1       2        1       2        1       2        1       2
DAFE        ORTH         0.9307  N/A      0.9994  N/A      0.7992  N/A      0.8044  N/A
NWFE        ORTH         0.9985  0.9982   0.9992  0.9993   0.9422  0.9391   0.9645  0.9634
NWFE        REM          0.9985  0.9981   0.9992  0.9993   0.9422  0.9401   0.9645  0.9634
NDA         ORTH         0.6476  0.9306   0.8299  0.9991   0.5878  0.7774   0.6200  0.7465
NDA         REM          0.6476  0.6562   0.8299  0.9192   0.5878  0.6360   0.6200  0.6281
DAM-P       ORTH         0.9583  0.9583   0.8415  0.8415   0.5393  0.5393   0.8559  0.8559
DAM-P       REM          0.9583  0.8473   0.8415  0.9999   0.5393  0.9292   0.8559  0.8234
DAM-NP      ORTH         0.8192  0.7010   0.9992  0.9994   0.8707  0.8685   0.6043  0.6924
DAM-NP      REM          0.8192  0.7964   0.9992  1.0000   0.8707  0.9242   0.6043  0.6859
DAM-N       ORTH         0.8192  0.7070   0.9992  0.9994   0.8707  0.9042   0.6048  0.7078
DAM-N       REM          0.8192  0.8111   0.9992  1.0000   0.8707  0.9264   0.6048  0.6509
Table 2. Performances of the 2NN classifier using different criteria and extraction methods (entries are mean accuracies using 1 and 2 extracted features).

                         Cuprite          Jasper Ridge     Indian Pine      DC Mall
Criterion   Extraction   1       2        1       2        1       2        1       2
DAFE        ORTH         0.9348  N/A      0.9994  N/A      0.7985  N/A      0.7964  N/A
NWFE        ORTH         0.9992  0.9707   0.9992  0.9993   0.9265  0.9407   0.9598  0.9594
NWFE        REM          0.9992  0.9466   0.9992  0.9990   0.9265  0.9352   0.9598  0.9571
NDA         ORTH         0.6515  0.9453   0.8173  0.9993   0.5899  0.7885   0.6807  0.7743
NDA         REM          0.6515  0.6306   0.8173  0.9441   0.5899  0.9027   0.6807  0.6838
DAM-P       ORTH         0.6278  0.9583   0.7033  0.9553   0.8671  0.5393   0.8700  0.8561
DAM-P       REM          0.6278  0.7775   0.7033  0.9997   0.8671  0.9202   0.8700  0.8126
DAM-NP      ORTH         0.6732  0.7010   0.9992  0.9971   0.8696  0.8764   0.5535  0.6924
DAM-NP      REM          0.6732  0.7707   0.9992  1.0000   0.8696  0.9027   0.5535  0.5792
DAM-N       ORTH         0.6732  0.7070   0.9992  0.9992   0.8696  0.8801   0.5562  0.7078
DAM-N       REM          0.6732  0.7707   0.9992  1.0000   0.8696  0.9053   0.5562  0.5792
Table 3. Performances of the Parzen classifier using different criteria and extraction methods (entries are mean accuracies using 1 and 2 extracted features).

                         Cuprite          Jasper Ridge     Indian Pine      DC Mall
Criterion   Extraction   1       2        1       2        1       2        1       2
DAFE        ORTH         0.9695  N/A      0.9182  N/A      0.6364  N/A      0.8657  N/A
NWFE        ORTH         0.9964  0.9718   0.9949  0.9993   0.9374  0.9455   0.9625  0.9612
NWFE        REM          0.9964  0.9181   0.9949  0.9990   0.9374  0.9383   0.9625  0.9612
NDA         ORTH         0.7127  0.9474   0.7995  0.9992   0.5685  0.7358   0.7526  0.8542
NDA         REM          0.7127  0.5943   0.7995  0.9492   0.5685  0.7436   0.7526  0.5979
DAM-P       ORTH         0.8234  0.9583   0.9956  0.8415   0.8347  0.5393   0.7823  0.8557
DAM-P       REM          0.8234  0.8267   0.9956  1.0000   0.8347  0.9179   0.7823  0.8541
DAM-NP      ORTH         0.8053  0.7010   0.9894  0.9971   0.8836  0.8764   0.5851  0.6924
DAM-NP      REM          0.8053  0.8413   0.9894  1.0000   0.8836  0.9246   0.5851  0.7592
DAM-N       ORTH         0.8053  0.7070   0.9992  0.9992   0.8836  0.9067   0.5994  0.7078
DAM-N       REM          0.8053  0.8296   0.9992  1.0000   0.8836  0.9348   0.5994  0.6983
6. Concluding Comments
The volume available in high dimensional feature spaces is very large, making possible the discrimination between classes with only very subtle differences. On the other hand, this large volume makes it increasingly challenging to define the desired classes adequately and precisely in terms of the feature space variables. The problem of class-statistics estimation error resulting from training sets of finite size grows rapidly with dimensionality, making it desirable to use no larger a feature space dimensionality than necessary for the problem at hand; hence the importance of an effective, case-specific feature extraction procedure.
The NWFE algorithm presented here is intended to take advantage of the desirable characteristics of DAFE and NDA, while avoiding their shortcomings. DAFE is fast and easy to apply, but its limitation of L-1 features, its reduced performance particularly when the difference in the class mean values is small, and the fact that it is based on the statistical description of the entire training set, which makes it sensitive to outliers, limit its performance in many cases. NDA does not have these limitations; it focuses attention on training samples near the required decision boundary. However, NDA does not perform well on data with unequal covariances or complex distributions.
NWFE does not have any of these limitations. It appears to have improved performance in
a broad set of circumstances, making possible substantially better classification accuracy in the
data sets tested, which included sets of agricultural, geological, ecological and urban
significance. This improved performance is perhaps due to the fact that, like NDA, attention is
focused upon training samples that are near to the eventual decision boundary, rather than
equally weighted on all training pixels as with DAFE. It also appears to provide feature sets
which are relatively insensitive to the precise choice of feature set size, since the accuracy versus
dimensionality curves are relatively flat beyond the initial knee of the curve. This characteristic
would appear to be significant for the circumstance when this technology begins to be used by
general remote sensing practitioners who are not otherwise highly versed in signal processing
principles and thus might not realize how to choose the right dimensionality to use.
Weighted between- and within-class scatter matrices and regularization are the most important parts of NWFE; applying only one of them does not yield a satisfactory result.
An implementation of NWFE is available for testing in MultiSpec. MultiSpec is a personal
computer multispectral data analysis software package that may be downloaded free from
http://dynamo.ecn.purdue.edu/~biehl/MultiSpec/.
References
[1] D. A. Landgrebe, "Information Extraction Principles and Methods for Multispectral and
Hyperspectral Image Data," Chapter 1 of Information Processing for Remote Sensing,
edited by C. H. Chen, published by the World Scientific Publishing Co., Inc., 1060 Main
Street, River Edge, NJ 07661, USA 1999.
[2] David Landgrebe, Signal Theory Methods In Multispectral Remote Sensing, 508 pages
plus a CD containing exercises and data. John Wiley & Sons, January 2003, ISBN 0-471-
42028-X.
[3] G. F. Hughes, "On the mean accuracy of statistical pattern recognition," IEEE Trans. Information Theory, vol. IT-14, no. 1, 1968, pp. 55-63.
[4] S. Raudys and A. Saudargiene, “Structures of the Covariance Matrices in Classifier Design”,
Advances in Pattern Recognition, A. Amin, D. Dori, P. Pudil, and H. Freeman, ed., Berlin
Heidelberg: Springer-Verlag, 1998, pp. 583-592.
[5] J.H. Friedman, “Regularized Discriminant Analysis,” Journal of the American Statistical
Association, vol. 84, 1989, pp. 165-175.
[6] W. Rayens and T. Greene, "Covariance pooling and stabilization for classification," Computational Statistics and Data Analysis, vol. 11, 1991, pp. 17-42.
[7] J. P. Hoffbeck and D. A. Landgrebe, "Covariance matrix estimation and classification with limited training data," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 18, no. 7, 1996, pp. 763-767.
[8] S. Tadjudin and D.A. Landgrebe, Classification of High Dimensional Data with Limited
Training Samples, PhD thesis Purdue University, West Lafayette, IN., ECE Technical Report
TR-EE 98-8, April, 1998, pp. 35-82.
[9] W. J. Krzanowski, P. Jonathan, W. V. McCarthy, and M. R. Thomas, "Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data," Applied Statistics, vol. 44, 1995, pp. 101-115.
[10] K. Fukunaga, Introduction to Statistical Pattern Recognition, San Diego: Academic Press
Inc., 1990.
[11] C. B. Moler and G.W. Stewart, "An Algorithm for Generalized Matrix Eigenvalue
Problems", SIAM J. Numer. Anal., vol. 10, no. 2, April 1973.
[12] A. Campbell, “Canonical Variate Analysis—A General Model Formulation,” Australian J.
Statistics, vol. 26, pp.86-96, 1984.
[13] D.H. Foley and J.W. Sammon, "An optimal set of discriminant vectors", IEEE Trans.
Comput., vol.C-24, pp.281-289, 1975.
[14] T. Okada and S. Tomita, "An optimal orthonormal system for discriminant analysis",
Pattern Recognition, 18, pp.139-144, 1985.
[15] J. Duchene and S. Leclercq, "An optimal transformation for discriminant analysis and principal component analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 10, no. 6, 1988, pp. 978-983.
[16] B-C. Kuo and D. A. Landgrebe, Improved Statistics Estimation And Feature Extraction For
Hyperspectral Data Classification, PhD Thesis and School of Electrical & Computer
Engineering Technical Report. TR-ECE 01-6, December 2001. (Available for download
from http://dynamo.ecn.purdue.edu/~landgreb/publications.html)
[17] Z-Q. Hong and J-Y. Yang, "Optimal discriminant plane for a small number of samples," Pattern Recognition, vol. 24, no. 4, 1991, pp. 317-324.
[18] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, "Multiclass Linear Dimension Reduction by Weighted Pairwise Fisher Criteria," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, 2001, pp. 762-766.
[19] Y. Hamamoto, Y. Matsuura, T. Kanaoka, and S. Tomita, "A Note on the Orthonormal Discriminant Vector Method for Feature Extraction," Pattern Recognition, vol. 24, 1991, pp. 681-684.
[20] K. Fukunaga and M. Mantock, "Nonparametric Discriminant Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 5, 1983, pp. 671-678.
[21] M. Aladjem, "Parametric and nonparametric linear mappings of multidimensional data",
Pattern Recognition, vol.24, no 6, 1991, pp. 543-553.
[22] M. Aladjem, "PNM: A program for parametric and nonparametric mapping of
multidimensional data", Computers in Biology and Medicine, vol.21, 1991, pp. 321-343.
[23] W. Malina, "Some extended Fisher criterion for feature selection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 3, 1981, pp. 611-614.
[24] W. Malina, "Some multiclass Fisher feature selection algorithms and their comparison with
Karhunen-Loeve algorithms," Pattern Recognition Letters, vol. 6, 1987, pp. 279-285.
[25] W. Malina, "Two-parameter Fisher criterion," IEEE Transactions on Systems, Man, and
Cybernetics – Part B: Cybernetics, vol. 31, no. 4, 2001, pp. 629-636.
[26] M. Aladjem, " Multiclass discriminant mappings", Signal Processing, vol. 35, 1994, pp.1-
18.
[27] M. Aladjem, "Linear Discriminant analysis for two classes via removal of classification
structure", IEEE Transaction on Pattern Analysis and Machine Intelligence, vol.19, no 2,
1997, pp.187-191.
[28] R. P. W. Duin, "PRTools, a Matlab Toolbox for Pattern Recognition," August 2002. (Available for download from http://www.ph.tn.tudelft.nl/prtools/)
[29] B-C. Kuo, D. A. Landgrebe, L-W. Ko, and C-H. Pai, "Regularized Feature Extractions for Hyperspectral Data Classification," International Geoscience and Remote Sensing Symposium, Toulouse, France, July 21-25, 2003.
Bor-Chen Kuo
Bor-Chen Kuo (S'01-M'02) received the B.S. and M.S. degrees from National Taichung Teachers College, Taiwan, R.O.C., in 1993 and 1996, and the Ph.D. degree from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, in 2001. He is currently an associate professor in the Department of Mathematics Education and the Graduate Institute of Educational Measurement and Statistics at National Taichung Teachers College, Taiwan, R.O.C. His research interests are pattern recognition, remote sensing, image processing, and nonparametric functional estimation.
David A. Landgrebe
Dr. Landgrebe holds the BSEE, MSEE, and PhD degrees from Purdue University. He is Professor (Emeritus) of Electrical and Computer Engineering at Purdue University. His area of specialty in research is communication science and signal processing, especially as applied to Earth observational remote sensing.
He was President of the IEEE Geoscience and Remote Sensing Society for 1986 and 1987 and a member of its Administrative Committee from 1979 to 1990. He received that Society's Outstanding Service Award in 1988. He is a co-author of the textbook Remote Sensing: The Quantitative Approach (1978) and a contributor to the ASP Manual of Remote Sensing (1st edition, 1974) and the books Remote Sensing of Environment (1976) and Information Processing for Remote Sensing (1999). He is the author of the textbook Signal Theory Methods in Multispectral Remote Sensing (2003). He has been a member of the editorial board of the journal Remote Sensing of Environment since its inception in 1970.
Dr. Landgrebe is a Life Fellow of the Institute of Electrical and Electronics Engineers, a Fellow of the American Society of Photogrammetry and Remote Sensing, a Fellow of the American Association for the Advancement of Science, and a member of the Society of Photo-Optical Instrumentation Engineers and the American Society for Engineering Education, as well as the Eta Kappa Nu, Tau Beta Pi, and Sigma Xi honor societies. He received the NASA Exceptional Scientific Achievement Medal in 1973 for his work in the field of machine analysis methods for remotely sensed Earth observational data. In 1976, on behalf of Purdue's Laboratory for Applications of Remote Sensing, which he directed, he accepted the William T. Pecora Award, presented by NASA and the U.S. Department of the Interior. He was the 1990 individual recipient of the William T. Pecora Award for contributions to the field of remote sensing. He was the 1992 recipient of the IEEE Geoscience and Remote Sensing Society's Distinguished Achievement Award and the 2003 recipient of the IEEE Geoscience and Remote Sensing Society's Education Award.