
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 55, NO. 11, NOVEMBER 2017

Discriminant Analysis of Hyperspectral Imagery Using Fast Kernel Sparse and Low-Rank Graph

Lei Pan, Heng-Chao Li, Senior Member, IEEE, Wei Li, Senior Member, IEEE, Xiang-Dong Chen, Guang-Ning Wu, Senior Member, IEEE, and Qian Du, Senior Member, IEEE

Abstract—Due to the high-dimensional characteristic of hyperspectral images, dimensionality reduction (DR) is an important preprocessing step for classification. Recently, sparse and low-rank graph-based discriminant analysis (SLGDA) has been developed for DR of hyperspectral images, for which the properties of sparsity and low-rankness are simultaneously exploited to capture both local and global structures. However, SLGDA may not achieve satisfactory results when handling complex data with a nonlinear nature. To address this problem, this paper presents two kernel extensions of SLGDA. In the first proposed classical kernel SLGDA (cKSLGDA), the kernel trick is exploited to implicitly map the original data into a high-dimensional space. From a totally different perspective, we further propose a Nyström-based kernel SLGDA (nKSLGDA) that constructs a virtual kernel space by the Nyström method, in which virtual samples can be explicitly obtained from the original data. Both cKSLGDA and nKSLGDA can achieve more informative graphs than SLGDA, and offer superiority over other state-of-the-art DR methods. More importantly, nKSLGDA can outperform cKSLGDA with much lower computational cost.

Index Terms—Dimensionality reduction (DR), graph embedding (GE), hyperspectral image, kernel methods, sparse and low-rank graph.

I. INTRODUCTION

HYPERSPECTRAL images are acquired by advanced remote sensors that are capable of capturing hundreds of spectral bands [1]. In hyperspectral images, each pixel typically represents a high-dimensional vector whose entries correspond to various spectral band responses. Thus, a hyperspectral image usually contains rich spectral information such that it has been extensively used in many applications, e.g., agricultural monitoring [2], [3], target detection [4], [5], and land-cover classification [6], [7]. Due to the curse of dimensionality and the high computational complexity, dimensionality reduction (DR) that preserves valuable intrinsic information in a low-dimensional subspace has been substantially investigated for hyperspectral data analysis.

Manuscript received December 28, 2016; revised May 19, 2017; accepted June 10, 2017. Date of publication July 21, 2017; date of current version October 25, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant 61371165 and Grant 91638201, and in part by the Frontier Intersection Basic Research Project for the Central Universities under Grant A0920502051714-5. (Corresponding author: Heng-Chao Li.)

L. Pan, H.-C. Li, and X.-D. Chen are with the School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China (e-mail: [email protected]).

W. Li is with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China.

G.-N. Wu is with the School of Electrical Engineering, Southwest Jiaotong University, Chengdu 610031, China.

Q. Du is with the Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS 39762 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TGRS.2017.2720584

Generally, DR includes two strategies: band selection and projection transformation. The former reduces the number of original bands and extracts sufficient information with only a small subset of the original bands [8]–[10], while the latter aims to project the original data to a low-dimensional subspace that maintains crucial information by using certain criteria. As far as projection transformation is concerned, it can be further categorized into unsupervised and supervised methods. Principal component analysis (PCA) [11] is a widely used unsupervised technique for DR, which finds a linear transformation by maximizing the variance in the projected subspace. To improve the discriminative ability of projected samples, many supervised methods have been proposed. For instance, linear discriminant analysis (LDA) [12] is proposed to maximize the trace ratio of between-class and within-class scatter matrices. Nevertheless, LDA is prone to fail when the observed data has a multimodal distribution [13]. To overcome this drawback, local Fisher's discriminant analysis is developed for DR in [14]. However, the aforementioned methods only consider data statistics while neglecting the intrinsic manifold structure in hyperspectral images.

Manifold learning methods have been successfully applied in DR of hyperspectral data, such as isometric feature mapping [15], locally linear embedding (LLE) [16], local tangent space alignment [17], and Laplacian eigenmaps (LE) [18]. Apart from these nonlinear DR techniques, locality preserving projection (LPP) [19] and neighborhood preserving embedding [20], as the linear approximations to LE and LLE, respectively, have been proposed for DR. They can preserve the local structure of the data in a low-dimensional subspace and are also computationally efficient.

As for these aforementioned manifold learning methods, graph theory is a popular technique. It has been successfully employed in the process of DR for high-dimensional data. In [21], a general graph embedding (GE) framework is proposed to unify existing DR algorithms. In this framework, an undirected weighted graph is constructed, wherein the similarity between two vertices corresponding to two samples is measured by the graph weight matrix that can characterize geometric information of the data. Traditional adjacency graphs include the k-nearest-neighbor graph and the ε-radius ball. Recently, sparse representation [22], [23] has attracted much attention because of its benefits of data-adaptive neighborhoods and noise robustness. In [24], a sparse GE (SGE) based DR algorithm is first proposed to explore the sparse structure by solving an L1-norm optimization problem. Since SGE does not sufficiently exploit class-label information, sparse graph-based discriminant analysis (SGDA) [25] is proposed to construct a block-structured similarity matrix with class-specific labeled samples, which reinforces the discriminative power of the graph. In [26], weighted SGDA is proposed to integrate both the locality and sparsity structure of the data. Furthermore, collaborative graph-based discriminant analysis (CGDA) [27] has been developed by replacing the L1-norm minimization with an L2-norm minimization, resulting in lower computational cost. Laplacian regularized CGDA (LapCGDA) [28] is proposed to exploit the intrinsic geometric information by incorporating a Laplacian graph into CGDA. In addition, simultaneous SGE is introduced to exploit the spatial information in the SGE framework [29].

In order to maintain global structure, in [30], low-rank representation is proposed to preserve the relationship of samples belonging to the same subspace. By combining sparsity and low-rankness together, a nonnegative low-rank and sparse graph is built in [31]. Subsequently, sparse and low-rank graph-based discriminant analysis (SLGDA) for hyperspectral image classification is developed in [32]. In SLGDA, both sparse and low-rank constraints are taken into account to simultaneously capture the local and global structures of a hyperspectral image.

It is well known that spectral characteristics of hyperspectral data are usually highly correlated, which means that classes may not be linearly separable. This fact indicates that the aforementioned DR methods, such as SLGDA, may be inadequate when the data distribution is complex. Fortunately, kernel methods [33]–[36] can project the original data into a higher-dimensional kernel-induced feature space where class separability can be improved. Actually, CGDA and LapCGDA have been successfully extended to kernel versions, i.e., KCGDA and KLapCGDA [28], respectively, demonstrating that the kernel strategy is helpful in handling complicated hyperspectral data.

Motivated by these previous works, two kernel extensions of SLGDA, i.e., classical kernel SLGDA (cKSLGDA) based on the classical kernel trick and Nyström-based kernel SLGDA (nKSLGDA) based on the Nyström theory, are proposed in this paper. By projecting the samples into a high-dimensional feature space and kernelizing SLGDA with implicit (i.e., cKSLGDA) or explicit (i.e., nKSLGDA) kernel mappings, an informative graph is constructed in the sense that it has high discriminative power, low sparsity, and adaptive neighborhoods. Moreover, kernel LPP is performed to obtain the kernel projection matrix that reduces the dimensionality of the samples. It is noteworthy that the proposed methods focus on spectral feature extraction in the kernel space, which differs from methods that consider spectral–spatial information in the original space [29], [37]. Compared with the existing DR methods, the main contributions of nKSLGDA lie in the following two aspects.

1) This is the first time that the Nyström theory is successfully applied to kernel operations for hyperspectral DR and classification. The proposed method shows a great advantage in classification performance.

2) By exploiting the virtual kernel technique (i.e., Nyström approximation), the newly proposed nKSLGDA has much lower computational complexity, which is also our straightforward motivation in this paper. As such, nKSLGDA can be almost as fast as SLGDA.

The remainder of this paper is organized as follows. Section II overviews some state-of-the-art DR approaches. Sections III and IV describe the frameworks of the proposed cKSLGDA and nKSLGDA algorithms, respectively. In Section V, experimental results compared with some related techniques are presented. Finally, concluding remarks are made in Section VI.

II. RELATED WORK

Given a hyperspectral image with $d$ bands and $N$ samples from $c$ classes, i.e., $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{d \times N}$, the training label set is denoted as $L = [l_1, l_2, \ldots, l_N]$, where $l_i \in \{1, 2, \ldots, c\}$ is the label corresponding to $\mathbf{x}_i$. The testing set with $M$ samples is defined as $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_M] \in \mathbb{R}^{d \times M}$.

A. Sparse Graph Embedding

In [21], the GE method is proposed to unify existing DR techniques into a common framework. Specifically, an intrinsic graph $G = \{\mathbf{X}, \mathbf{W}\}$ and a penalty graph $G_p = \{\mathbf{X}, \mathbf{W}_p\}$ are constructed to describe the relationships that emphasize and suppress similarity, respectively. Besides the k-nearest-neighbor and ε-radius ball graphs, which are two classical ways to construct the graph, a robust and adaptive L1-graph is proposed to exploit reconstruction coefficients rather than pairwise Euclidean distances [24]. The representation coefficients are first obtained by solving an L1-norm optimization problem
$$\arg\min_{\boldsymbol{\alpha}_i} \|\boldsymbol{\alpha}_i\|_1 \quad \text{s.t. } \|\mathbf{X}_i\boldsymbol{\alpha}_i - \mathbf{x}_i\|_2^2 \le \varepsilon \tag{1}$$

where $\mathbf{X}_i = [\mathbf{x}_1, \ldots, \mathbf{x}_{i-1}, \mathbf{x}_{i+1}, \ldots, \mathbf{x}_N] \in \mathbb{R}^{d\times(N-1)}$ and $\boldsymbol{\alpha}_i = [\alpha_{i1}, \ldots, \alpha_{i(N-1)}]^T \in \mathbb{R}^{(N-1)\times 1}$. Then, the similarity matrix $\mathbf{W}$ can be defined as
$$W_{ij} = \begin{cases} 0, & i = j\\ \alpha_{ij}, & i > j\\ \alpha_{i(j-1)}, & i < j. \end{cases} \tag{2}$$
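As a concrete illustration of the graph construction in (1) and (2), the following minimal Python sketch builds the L1-graph, replacing the hard constraint of (1) with its Lagrangian (lasso) form and solving it with scikit-learn; the paper does not prescribe a solver, so this substitution and the regularization value `lam` are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso  # lasso form approximates problem (1)

def l1_graph(X, lam=1e-3):
    """Build the sparse similarity matrix W of (2) from data X (d x N)."""
    d, N = X.shape
    W = np.zeros((N, N))
    for i in range(N):
        Xi = np.delete(X, i, axis=1)   # dictionary X_i without x_i
        coef = Lasso(alpha=lam, fit_intercept=False,
                     max_iter=5000).fit(Xi, X[:, i]).coef_
        W[i, :i] = coef[:i]            # alpha_{ij} for j < i
        W[i, i+1:] = coef[i:]          # alpha_{i(j-1)} for j > i
    return W
```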

The objective of SGE is to find a low-dimensional space where the desired geometric information or similarity between vertices in the graph is preserved. Suppose there is a $d \times B$ ($B \ll d$) projection matrix $\mathbf{P}$ that can transform the data $\mathbf{X}$ from dimensionality $d$ to $B$, resulting in a low-dimensional representation $\mathbf{Y} = \mathbf{P}^T\mathbf{X}$. The objective function can be formulated as
$$\mathbf{P}^* = \arg\min_{\mathbf{P}^T\mathbf{X}\mathbf{L}_p\mathbf{X}^T\mathbf{P}=\mathbf{I}} \sum_{i\neq j}\|\mathbf{P}^T\mathbf{x}_i - \mathbf{P}^T\mathbf{x}_j\|_2^2\, W_{ij} = \arg\min_{\mathbf{P}^T\mathbf{X}\mathbf{L}_p\mathbf{X}^T\mathbf{P}=\mathbf{I}} \operatorname{tr}(\mathbf{P}^T\mathbf{X}\mathbf{L}_s\mathbf{X}^T\mathbf{P}) \tag{3}$$
where $\mathbf{L}_s = \mathbf{D} - \mathbf{W}$ is the Laplacian matrix of the intrinsic graph $G$, $\mathbf{D}$ is a diagonal matrix with the $i$th diagonal entry $D_{ii} = \sum_{j=1}^{N} W_{ij}$, and $\mathbf{L}_p$ may be the Laplacian matrix of the penalty graph or a simple scale-normalization constraint [21].

Equation (3) can be further formulated as
$$\mathbf{P}^* = \arg\min_{\mathbf{P}} \frac{|\mathbf{P}^T\mathbf{X}\mathbf{L}_s\mathbf{X}^T\mathbf{P}|}{|\mathbf{P}^T\mathbf{X}\mathbf{L}_p\mathbf{X}^T\mathbf{P}|} \tag{4}$$
which can be solved as a generalized eigendecomposition problem
$$\mathbf{X}\mathbf{L}_s\mathbf{X}^T\mathbf{p}_b = \lambda_b\,\mathbf{X}\mathbf{L}_p\mathbf{X}^T\mathbf{p}_b. \tag{5}$$
The $b$th projection vector $\mathbf{p}_b$ is the eigenvector corresponding to the $b$th smallest nonzero eigenvalue. The projection matrix can be formed as $\mathbf{P} = [\mathbf{p}_1, \ldots, \mathbf{p}_B]$.
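The generalized eigenproblem (5) maps directly onto a standard solver; below is a hedged Python sketch using SciPy, where the small ridge added to the right-hand matrix (to keep it positive definite) and the near-zero eigenvalue cutoff are practical assumptions rather than part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def sge_projection(X, Ls, Lp, B):
    """Solve X Ls X^T p = lambda X Lp X^T p of (5); return the eigenvectors
    of the B smallest nonzero eigenvalues as the columns of P."""
    A = X @ Ls @ X.T
    C = X @ Lp @ X.T + 1e-8 * np.eye(X.shape[0])  # ridge (assumption)
    vals, vecs = eigh(A, C)                        # ascending eigenvalues
    keep = [i for i in np.argsort(vals) if vals[i] > 1e-10][:B]
    return vecs[:, keep]
```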

B. Graph-Based Discriminant Analysis

SGE adopts all available training samples to determine the similarity matrix $\mathbf{W}$, which may result in an improper graph structure. In [25], block SGDA (BSGDA) fully makes use of class-label information. Specifically, the method constructs an affinity matrix using within-class samples only, which can reflect correct membership among samples. Consequently, the similarity matrix is designed with a block-diagonal structure

$$\mathbf{W} = \begin{pmatrix} \mathbf{W}_1 & & \mathbf{0}\\ & \ddots & \\ \mathbf{0} & & \mathbf{W}_c \end{pmatrix} \tag{6}$$
where $\mathbf{W}_l$ denotes the weight matrix obtained using the samples from the $l$th class only.

Though BSGDA can successfully reveal the local structure of the data, in the sense that sparse representation represents each sample individually, it fails to capture global information. To handle this problem, SLGDA [32] has been developed to preserve the local neighborhood structure and the global geometrical structure simultaneously by combining the sparse and low-rank constraints.

To enhance discriminative ability, class-label information is also employed to generate a block-diagonal similarity matrix, as presented in (6). Thus, the objective function of SLGDA is
$$\arg\min_{\mathbf{W}_l} \frac{1}{2}\|\mathbf{X}_l - \mathbf{X}_l\mathbf{W}_l\|_F^2 + \beta\|\mathbf{W}_l\|_* + \lambda\|\mathbf{W}_l\|_1 \quad \text{s.t. } \operatorname{diag}(\mathbf{W}_l) = \mathbf{0} \tag{7}$$
where $\mathbf{X}_l$ represents samples from the $l$th class. After obtaining the graph weight matrix $\mathbf{W}$, the projection operator can be solved similarly as formulated in (5).

Fig. 1. Illustration of data structure in original space and kernel space.
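Since (7) is solved class by class and the per-class solutions are stacked into the block-diagonal $\mathbf{W}$ of (6), the assembly itself is mechanical; a minimal sketch, assuming integer class labels and any per-class graph solver:

```python
import numpy as np

def block_diagonal_graph(X, labels, solve_graph):
    """Assemble the block-diagonal W of (6): solve the per-class
    problem (7) independently and place each W_l on the diagonal.
    `solve_graph` is any per-class solver (e.g., an ADMM routine)."""
    N = X.shape[1]
    W = np.zeros((N, N))
    for l in np.unique(labels):
        idx = np.flatnonzero(labels == l)       # samples of class l
        W[np.ix_(idx, idx)] = solve_graph(X[:, idx])
    return W
```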

III. PROPOSED cKSLGDA METHOD

In SLGDA, the sparsity and low-rankness constraints are combined to simultaneously capture the local and global structures of hyperspectral data such that the resulting graph can significantly reinforce the discriminative power. However, a hyperspectral image usually exhibits nonlinear structure. A nonlinear mapping of the data to a higher-dimensional space (i.e., a kernel space) could effectively improve the discriminability. Fig. 1 shows the data structure in the original space and the kernel space, illustrating that data nonlinearly transformed to a high-dimensional space are more separable.

A. Kernel Trick

The kernel trick is a very well-known technique in machine learning and pattern recognition. It has been successfully applied in different kernel methods, such as the support vector machine (SVM) [38], kernel PCA, and kernel LDA [39]. The use of the kernel trick makes the extension of a linear algorithm to its nonlinear counterpart much easier.

Only a kernel satisfying Mercer's condition is valid, which generates a reproducing kernel Hilbert space. Let $\mathbf{x}_i \in \mathcal{X}$ be a sample in the input space; for a nonlinear mapping function $\Phi$, $\mathbf{x}_i \in \mathbb{R}^d \rightarrow \Phi(\mathbf{x}_i) \in \mathbb{R}^D$ ($D \gg d$, and it might be infinite). A Mercer kernel $k(\cdot,\cdot)$, which computes the inner product between two samples in the kernel space, can be expressed as
$$k(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)\rangle = \Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}_j) \tag{8}$$
where $\Phi(\cdot)$ is an implicit nonlinear function. Based on the kernel trick, it is not necessary to know the specific form of $\Phi(\cdot)$. Some commonly used kernels include the linear kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j$, the $\tau$-degree polynomial kernel $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T\mathbf{x}_j + 1)^\tau$, and the Gaussian radial basis function (RBF) kernel $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\big(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 / (2\sigma^2)\big)$, where $\sigma$ is the parameter of the RBF kernel.
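For reference, the RBF Gram matrix used throughout the kernel methods below can be computed in a fully vectorized way; a short sketch with samples stored as columns (the layout used in this paper):

```python
import numpy as np

def rbf_gram(X, Y=None, sigma=0.5):
    """K_ij = exp(-||x_i - y_j||_2^2 / (2 sigma^2)) per (8); X is d x N,
    Y is d x M (Y defaults to X for the training Gram matrix)."""
    Y = X if Y is None else Y
    sq = (np.sum(X**2, axis=0)[:, None]
          + np.sum(Y**2, axis=0)[None, :] - 2.0 * X.T @ Y)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))
```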

B. cKSLGDA

Following the idea of the kernel trick, we extend SLGDA to cKSLGDA in a classical way. Suppose that there exists a nonlinear mapping function $\Phi: \mathbb{R}^d \rightarrow \mathbb{R}^D$ associated with the kernel function $k(\cdot,\cdot)$; it projects training samples into a higher-dimensional feature space: $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \rightarrow \Phi(\mathbf{X}) = [\Phi(\mathbf{x}_1), \Phi(\mathbf{x}_2), \ldots, \Phi(\mathbf{x}_N)]$. As such, the objective function of cKSLGDA can be described as
$$\begin{aligned} &\arg\min_{\mathbf{W}} \frac{1}{2}\|\Phi(\mathbf{X}) - \Phi(\mathbf{X})\mathbf{W}\|_F^2 + \beta\|\mathbf{W}\|_* + \lambda\|\mathbf{W}\|_1\\ \Rightarrow\ &\arg\min_{\mathbf{W}} \frac{1}{2}\operatorname{tr}\big((\mathbf{I}-\mathbf{W})^T\mathbf{K}(\mathbf{I}-\mathbf{W})\big) + \beta\|\mathbf{W}\|_* + \lambda\|\mathbf{W}\|_1 \quad \text{s.t. } \operatorname{diag}(\mathbf{W}) = \mathbf{0} \end{aligned} \tag{9}$$

where $\mathbf{K} = \Phi(\mathbf{X})^T\Phi(\mathbf{X})$ with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, and $\operatorname{tr}(\cdot)$ denotes the trace operator. Problem (9) can be efficiently solved by the alternating direction method of multipliers (ADMM) [40], in which the underlying variables are updated alternately. Two auxiliary variables $\mathbf{Z}$ and $\mathbf{J}$ are first introduced to make the objective function separable
$$\arg\min_{\mathbf{Z},\mathbf{J},\mathbf{W}} \frac{1}{2}\operatorname{tr}\big((\mathbf{I}-\mathbf{W})^T\mathbf{K}(\mathbf{I}-\mathbf{W})\big) + \beta\|\mathbf{Z}\|_* + \lambda\|\mathbf{J}\|_1 \quad \text{s.t. } \mathbf{W} = \mathbf{Z},\ \mathbf{W} = \mathbf{J} - \operatorname{diag}(\mathbf{J}). \tag{10}$$

Then, the augmented Lagrangian function of problem (10) becomes
$$\begin{aligned} \mathcal{L}(\mathbf{Z},\mathbf{J},\mathbf{W},\mathbf{D}_1,\mathbf{D}_2) =\ & \frac{1}{2}\operatorname{tr}\big((\mathbf{I}-\mathbf{W})^T\mathbf{K}(\mathbf{I}-\mathbf{W})\big) + \beta\|\mathbf{Z}\|_* + \lambda\|\mathbf{J}\|_1\\ &+ \langle\mathbf{D}_1, \mathbf{W}-\mathbf{Z}\rangle + \langle\mathbf{D}_2, \mathbf{W}-\mathbf{J}+\operatorname{diag}(\mathbf{J})\rangle\\ &+ \frac{\mu}{2}\big(\|\mathbf{W}-\mathbf{Z}\|_F^2 + \|\mathbf{W}-\mathbf{J}+\operatorname{diag}(\mathbf{J})\|_F^2\big) \end{aligned} \tag{11}$$
where $\mathbf{D}_1$ and $\mathbf{D}_2$ are Lagrangian multipliers, and $\mu$ is a penalty parameter.

The ADMM method alternately updates one of these three variables with the other variables fixed by minimizing $\mathcal{L}(\mathbf{Z},\mathbf{J},\mathbf{W})$. First, the updating schemes of $\mathbf{Z}$ and $\mathbf{J}$ are formulated as
$$\begin{aligned} \mathbf{Z}_{t+1} &= \arg\min_{\mathbf{Z}} \beta\|\mathbf{Z}\|_* + \langle\mathbf{D}_{1,t}, \mathbf{W}_t-\mathbf{Z}\rangle + \frac{\mu_t}{2}\|\mathbf{W}_t-\mathbf{Z}\|_F^2\\ &= \arg\min_{\mathbf{Z}} \frac{\beta}{\mu_t}\|\mathbf{Z}\|_* + \frac{1}{2}\Big\|\mathbf{Z} - \Big(\mathbf{W}_t + \frac{\mathbf{D}_{1,t}}{\mu_t}\Big)\Big\|_F^2 = \mathcal{D}_{\beta/\mu_t}\Big(\mathbf{W}_t + \frac{\mathbf{D}_{1,t}}{\mu_t}\Big)\\ \mathbf{J}_{t+1} &= \arg\min_{\mathbf{J}} \lambda\|\mathbf{J}\|_1 + \langle\mathbf{D}_{2,t}, \mathbf{W}_t-\mathbf{J}\rangle + \frac{\mu_t}{2}\|\mathbf{W}_t-\mathbf{J}\|_F^2\\ &= \arg\min_{\mathbf{J}} \frac{\lambda}{\mu_t}\|\mathbf{J}\|_1 + \frac{1}{2}\Big\|\mathbf{J} - \Big(\mathbf{W}_t + \frac{\mathbf{D}_{2,t}}{\mu_t}\Big)\Big\|_F^2 = \mathcal{S}_{\lambda/\mu_t}\Big(\mathbf{W}_t + \frac{\mathbf{D}_{2,t}}{\mu_t}\Big)\\ \mathbf{J}_{t+1} &\leftarrow \mathbf{J}_{t+1} - \operatorname{diag}(\mathbf{J}_{t+1}) \end{aligned} \tag{12}$$
where $\mathcal{D}_\tau(\mathbf{T}) = \mathbf{U}\mathcal{S}_\tau(\boldsymbol{\Sigma})\mathbf{V}^T$ is the singular value thresholding operator, in which $\mathcal{S}_\tau(x) = \operatorname{sgn}(x)\max(|x|-\tau, 0)$ is the soft thresholding operator [41]. Then, after fixing $\mathbf{Z}$ and $\mathbf{J}$, $\mathbf{W}$ can be updated as
$$\begin{aligned} \mathbf{W}_{t+1} &= \arg\min_{\mathbf{W}} \frac{1}{2}\operatorname{tr}\big((\mathbf{I}-\mathbf{W})^T\mathbf{K}(\mathbf{I}-\mathbf{W})\big) + \langle\mathbf{D}_{1,t}, \mathbf{W}-\mathbf{Z}_{t+1}\rangle + \langle\mathbf{D}_{2,t}, \mathbf{W}-\mathbf{J}_{t+1}\rangle\\ &\quad + \frac{\mu_t}{2}\big(\|\mathbf{W}-\mathbf{Z}_{t+1}\|_F^2 + \|\mathbf{W}-\mathbf{J}_{t+1}\|_F^2\big)\\ &= (\mathbf{K} + 2\mu_t\mathbf{I})^{-1}\big(\mathbf{K} + \mu_t\mathbf{Z}_{t+1} + \mu_t\mathbf{J}_{t+1} - (\mathbf{D}_{1,t} + \mathbf{D}_{2,t})\big). \end{aligned} \tag{13}$$

The complete algorithm is outlined in Algorithm 1.

Algorithm 1: ADMM for Solving cKSLGDA
Input: Training samples $\mathbf{X}$, parameters $\beta$ and $\lambda$, kernel function $k$.
Initialize: $\mathbf{Z}_0 = \mathbf{J}_0 = \mathbf{W}_0 = \mathbf{0}$, $\mathbf{D}_{1,0} = \mathbf{D}_{2,0} = \mathbf{0}$, $\mu_0 = 0.1$, $\mu_{\max} = 10^6$, $\rho_0 = 1.1$, $\varepsilon_1 = 10^{-4}$, $\varepsilon_2 = 10^{-3}$, $maxIter = 100$, $t = 0$.
Repeat:
1. Compute the kernel Gram matrix $\mathbf{K}$.
2. Compute $\mathbf{Z}_{t+1}$, $\mathbf{J}_{t+1}$, and $\mathbf{W}_{t+1}$ according to (12) and (13).
3. Update the Lagrangian multipliers: $\mathbf{D}_{1,t+1} = \mathbf{D}_{1,t} + \mu_t(\mathbf{W}_{t+1} - \mathbf{Z}_{t+1})$, $\mathbf{D}_{2,t+1} = \mathbf{D}_{2,t} + \mu_t(\mathbf{W}_{t+1} - \mathbf{J}_{t+1})$.
4. Update $\mu$: $\mu_{t+1} = \min(\rho\mu_t, \mu_{\max})$, where $\rho = \rho_0$ if $\mu_t \max(\|\mathbf{W}_{t+1}-\mathbf{W}_t\|_F, \|\mathbf{Z}_{t+1}-\mathbf{Z}_t\|_F, \|\mathbf{J}_{t+1}-\mathbf{J}_t\|_F)/\|\mathbf{X}\|_F < \varepsilon_2$, and $\rho = 1$ otherwise.
5. Check convergence conditions: $\|\mathbf{W}_{t+1}-\mathbf{Z}_{t+1}\|_\infty < \varepsilon_1$, $\|\mathbf{W}_{t+1}-\mathbf{J}_{t+1}\|_\infty < \varepsilon_1$.
6. $t \leftarrow t+1$.
Until the convergence conditions are satisfied or $t > maxIter$.
Output: An optimal solution $\mathbf{W}_t$.
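A condensed Python sketch of Algorithm 1 follows, implementing the updates (12) and (13); the adaptive ρ rule and the relative-change stopping test of the listing are simplified to a fixed ρ and the primal feasibility check, which is an assumption for brevity.

```python
import numpy as np

def svt(T, tau):
    """Singular value thresholding D_tau(T) = U S_tau(Sigma) V^T [41]."""
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(T, tau):
    """Elementwise soft thresholding S_tau(x) = sgn(x) max(|x| - tau, 0)."""
    return np.sign(T) * np.maximum(np.abs(T) - tau, 0.0)

def cksl_gda_graph(K, beta, lam, max_iter=100, mu=0.1,
                   mu_max=1e6, rho=1.1, eps1=1e-4):
    """ADMM iterations (12)-(13) for the cKSLGDA graph, given Gram K."""
    N = K.shape[0]
    W = np.zeros((N, N))
    D1 = np.zeros((N, N)); D2 = np.zeros((N, N))
    I = np.eye(N)
    for _ in range(max_iter):
        Z = svt(W + D1 / mu, beta / mu)            # Z-update of (12)
        J = soft(W + D2 / mu, lam / mu)            # J-update of (12)
        np.fill_diagonal(J, 0.0)                   # J <- J - diag(J)
        W = np.linalg.solve(K + 2.0 * mu * I,      # W-update of (13)
                            K + mu * Z + mu * J - (D1 + D2))
        D1 += mu * (W - Z)                         # multiplier updates
        D2 += mu * (W - J)
        mu = min(rho * mu, mu_max)
        if max(np.abs(W - Z).max(), np.abs(W - J).max()) < eps1:
            break
    return W
```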

C. Kernel LPP

Since each column of the projection matrix $\mathbf{P}$ should lie in the span of all training samples in the kernel space [39], we decompose $\mathbf{p}_i = \sum_{j=1}^{N} q_{ij}\Phi(\mathbf{x}_j) = \Phi(\mathbf{X})\mathbf{q}_i$, with $\mathbf{q}_i$ being a coefficient vector. By defining $\mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_B]$, the projection matrix $\mathbf{P}$ can be expressed as
$$\mathbf{P} = \Phi(\mathbf{X})[\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_B] = \Phi(\mathbf{X})\mathbf{Q}. \tag{14}$$

According to [21] and [25], the coefficient matrix $\mathbf{Q}$ can be obtained by solving the following generalized eigenvalue decomposition problem:
$$\mathbf{K}\mathbf{L}_s\mathbf{K}^T\mathbf{Q} = \boldsymbol{\Lambda}\mathbf{K}\mathbf{L}_p\mathbf{K}^T\mathbf{Q}. \tag{15}$$

Subsequently, DR for the samples in the kernel space is performed as
$$\tilde{\mathbf{X}} = \mathbf{P}^T\Phi(\mathbf{X}) = \mathbf{Q}^T\mathbf{K}, \qquad \tilde{\mathbf{Y}} = \mathbf{P}^T\Phi(\mathbf{Y}) = \mathbf{Q}^T\mathbf{K}_{XY} \tag{16}$$
where $(\mathbf{K}_{XY})_{ij} = k(\mathbf{x}_i, \mathbf{y}_j)$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, M$.
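A sketch of this kernel LPP step is given below; the paper leaves the choice of $\mathbf{L}_p$ open, so taking it as the degree matrix $\mathbf{D}$ (a common scale-normalization choice), symmetrizing $\mathbf{W}$ with absolute values, and adding a small ridge are all assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_lpp(K, W, B, K_XY):
    """Solve K Ls K Q = Lambda K Lp K Q of (15), then project the
    training and testing data as Q^T K and Q^T K_XY per (16)."""
    Ws = 0.5 * (np.abs(W) + np.abs(W).T)       # symmetrize (assumption)
    Dg = np.diag(Ws.sum(axis=1))
    Ls = Dg - Ws                               # intrinsic Laplacian
    A = K @ Ls @ K.T
    C = K @ Dg @ K.T + 1e-8 * np.eye(K.shape[0])  # Lp = D (assumption)
    vals, vecs = eigh(A, C)
    Q = vecs[:, np.argsort(vals)[:B]]          # B smallest eigenvalues
    return Q.T @ K, Q.T @ K_XY
```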


Fig. 2. Framework of the proposed nKSLGDA algorithm.

IV. PROPOSED nKSLGDA METHOD

The proposed cKSLGDA requires the calculation and eigendecomposition of the kernel Gram matrix $\mathbf{K}$, leading to complexity on the order of $O(N^3)$. For large-scale problems, this method becomes computationally intractable.

To address this problem, SLGDA can be successfully extended into the kernel space in an elegant way that uses explicit kernel mapping based on the Nyström method [42], [43] to avoid the calculation of the Gram matrix $\mathbf{K}$, yielding the proposed nKSLGDA algorithm. Fig. 2 illustrates its framework, which mainly includes explicit kernel mapping, graph construction, and virtual kernel LPP. The detailed process is described in Sections IV-A–IV-C.

A. Explicit Kernel Mapping

In [44], a nonlinear mapping $\Phi(\cdot)$ can be determined based on eigendecomposition of the Gram matrix, e.g., $\mathbf{K} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, where $\boldsymbol{\Lambda} \in \mathbb{R}^{r\times r}$ is a diagonal matrix whose entries are the top-$r$ eigenvalues of $\mathbf{K}$ in decreasing order, and $\mathbf{U} \in \mathbb{R}^{N\times r}$ is an orthogonal matrix containing the corresponding eigenvectors. Then, we have $\Phi(\mathbf{X}) = \boldsymbol{\Lambda}^{1/2}\mathbf{U}^T$ and $\mathbf{K} = \mathbf{S}^T\mathbf{S} = \Phi(\mathbf{X})^T\Phi(\mathbf{X})$, where $\mathbf{S} = [\mathbf{s}_1, \ldots, \mathbf{s}_N] \in \mathbb{R}^{r\times N}$ includes the virtual samples residing in the $r$-dimensional subkernel space. Notice that the subkernel space is defined as the virtual kernel space, and the value of $r$ is consistent with the original dimension. Consequently, the virtual samples can be denoted as
$$\mathbf{S} = \boldsymbol{\Lambda}^{1/2}\mathbf{U}^T. \tag{17}$$

The symmetric $\mathbf{K}$ can be decomposed into the form of
$$\mathbf{K} = \begin{pmatrix} \mathbf{K}_{nn} & \mathbf{K}_{(N-n)n}^T\\ \mathbf{K}_{(N-n)n} & \mathbf{K}_{(N-n)(N-n)} \end{pmatrix} \tag{18}$$
and, simultaneously, the submatrix $\mathbf{K}_{Nn}$ of $\mathbf{K}$ is defined as $\mathbf{K}_{Nn} = [\mathbf{K}_{nn}\ \ \mathbf{K}_{(N-n)n}]^T$, where $\mathbf{K}_{Nn} = k(\mathbf{X}, \mathbf{X}_S)$ and $\mathbf{K}_{nn} = k(\mathbf{X}_S, \mathbf{X}_S)$, with $\mathbf{X}_S \in \mathbb{R}^{d\times n}$ denoting a subset extracted from the input samples $\mathbf{X}$. Typically, the sampling schemes include uniform sampling without replacement [45], K-means clustering [46], and the coresets method [47].

An approximation of $\mathbf{K}$ is obtained with the help of $\mathbf{K}_{Nn}$ and $\mathbf{K}_{nn}$ according to the Nyström theory
$$\mathbf{K} \approx \mathbf{K}_{Nn}\mathbf{K}_{nn}^{\dagger}\mathbf{K}_{Nn}^T \tag{19}$$

Algorithm 2: Explicit Kernel Mapping for nKSLGDA
Input: Training samples $\mathbf{X} \in \mathbb{R}^{d\times N}$, testing samples $\mathbf{Y} \in \mathbb{R}^{d\times M}$, kernel function $k$, number of samples in the subset $n$, desired dimensionality of the virtual space $r$, sampling scheme $\omega$ (e.g., K-means).
1. Construct the reduced set: $\mathbf{X}_S = f(\mathbf{X}, \omega, n)$.
2. Compute the kernel matrix between the training samples and the reduced set: $\mathbf{K}_{Nn,X} = k(\mathbf{X}, \mathbf{X}_S)$.
3. Compute the kernel matrix of the reduced set: $\mathbf{K}_{nn} = k(\mathbf{X}_S, \mathbf{X}_S)$.
4. Compute $\mathbf{K}_{nn,r}$ by eigendecomposition: $\mathbf{K}_{nn,r} = \mathbf{V}_r\boldsymbol{\Lambda}_r\mathbf{V}_r^T$.
5. Compute the virtual training samples: $\mathbf{S}_X = (\boldsymbol{\Lambda}_r^{\dagger})^{1/2}\mathbf{V}_r^T\mathbf{K}_{Nn,X}^T$.
6. Compute the kernel matrix between the testing samples and the reduced set: $\mathbf{K}_{Mn,Y} = k(\mathbf{Y}, \mathbf{X}_S)$.
7. Compute the virtual testing samples: $\mathbf{S}_Y = (\boldsymbol{\Lambda}_r^{\dagger})^{1/2}\mathbf{V}_r^T\mathbf{K}_{Mn,Y}^T$.
Output: $\mathbf{S}_X$ and $\mathbf{S}_Y$.

where $(\cdot)^{\dagger}$ stands for the pseudoinverse. The pseudoinverse of $\mathbf{K}_{nn}$ can be computed as $\mathbf{K}_{nn}^{\dagger} = \mathbf{V}\boldsymbol{\Lambda}^{\dagger}\mathbf{V}^T$ via its eigendecomposition. In this case, combining with the virtual samples defined in (17), the kernel matrix $\mathbf{K}$ can be represented as
$$\mathbf{K} = \mathbf{S}^T\mathbf{S} = \mathbf{K}_{Nn}\mathbf{K}_{nn}^{\dagger}\mathbf{K}_{Nn}^T = \mathbf{K}_{Nn}\mathbf{V}\boldsymbol{\Lambda}^{\dagger}\mathbf{V}^T\mathbf{K}_{Nn}^T. \tag{20}$$

Finally, the desired virtual samples in the $r$-dimensional virtual kernel space are expressed as
$$\mathbf{S}_r = \big(\boldsymbol{\Lambda}_r^{\dagger}\big)^{1/2}\mathbf{V}_r^T\mathbf{K}_{Nn}^T \tag{21}$$
where $\boldsymbol{\Lambda}_r \in \mathbb{R}^{r\times r}$ is a diagonal matrix containing the first $r$ largest eigenvalues and $\mathbf{V}_r \in \mathbb{R}^{n\times r}$ contains the corresponding orthogonal eigenvectors.

Note that the eigendecomposition is implemented on the subkernel matrix $\mathbf{K}_{nn}$, so the complexity is on the order of $O(n^3)$. The computational burden can be reduced significantly when $n \ll N$. The explicit kernel mapping is summarized in Algorithm 2.
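Algorithm 2 translates almost line by line into code; the sketch below uses uniform sampling without replacement for the reduced set (one of the three schemes cited above, whereas the experiments later use K-means sampling) and reuses the rbf_gram helper sketched in Section III-A, both of which are assumptions rather than the paper's exact configuration.

```python
import numpy as np

def nystrom_virtual_samples(X, Y, n, r, sigma=0.5, seed=0):
    """Explicit kernel mapping (Algorithm 2): virtual samples per (21).
    X is d x N training data, Y is d x M testing data, n <= N, r <= n."""
    rng = np.random.default_rng(seed)
    XS = X[:, rng.choice(X.shape[1], n, replace=False)]  # reduced set
    Knn = rbf_gram(XS, XS, sigma)
    vals, vecs = np.linalg.eigh(Knn)                     # ascending order
    top = np.argsort(vals)[::-1][:r]                     # top-r eigenpairs
    Lr, Vr = vals[top], vecs[:, top]
    M = (1.0 / np.sqrt(np.maximum(Lr, 1e-12)))[:, None] * Vr.T  # (Lam_r^+)^{1/2} V_r^T
    SX = M @ rbf_gram(X, XS, sigma).T                    # virtual training samples
    SY = M @ rbf_gram(Y, XS, sigma).T                    # virtual testing samples
    return SX, SY
```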


Fig. 3. Parameter tuning of β and λ for the proposed cKSLGDA and nKSLGDA algorithms using three data sets. (a) and (b) Indian Pines data. (c) and (d) University of Pavia data. (e) and (f) Salinas data.

Fig. 4. Parameter tuning of σ in the RBF kernel using three experimental data sets. (a) cKSLGDA. (b) nKSLGDA.

B. nKSLGDA

Once the virtual training samples $\mathbf{S}_X$ are available, we proceed to graph construction. The underlying objective function of nKSLGDA can be formulated as
$$\arg\min_{\mathbf{W}_l} \frac{1}{2}\|\mathbf{S}_X^l - \mathbf{S}_X^l\mathbf{W}_l\|_F^2 + \beta\|\mathbf{W}_l\|_* + \lambda\|\mathbf{W}_l\|_1 \quad \text{s.t. } \operatorname{diag}(\mathbf{W}_l) = \mathbf{0} \tag{22}$$
where $\mathbf{S}_X^l$ is the set of virtual training samples corresponding to the $l$th class.

Similarly, the ADMM is utilized to solve (22). After introducing two variables $\mathbf{Z}$ and $\mathbf{J}$, (22) becomes separable
$$\arg\min_{\mathbf{Z},\mathbf{J},\mathbf{W}} \frac{1}{2}\|\mathbf{S}_X - \mathbf{S}_X\mathbf{W}\|_F^2 + \beta\|\mathbf{Z}\|_* + \lambda\|\mathbf{J}\|_1 \quad \text{s.t. } \mathbf{W} = \mathbf{Z},\ \mathbf{W} = \mathbf{J} - \operatorname{diag}(\mathbf{J}). \tag{23}$$

TABLE I. Class labels and the number of training and testing samples for Indian Pines.

TABLE II. Class labels and the number of training and testing samples for Pavia University.

TABLE III. Class labels and the number of training and testing samples for Salinas.


Fig. 5. SVM classification accuracy versus reduced dimensionality for all methods using three data sets. (a) Indian Pines data. (b) University of Pavia data. (c) Salinas data.

Fig. 6. Classification maps of different methods for the Indian Pines data set. (a) Pseudocolor image. (b) Ground truth. (c) Original (82.57%). (d) PCA (78.48%). (e) LDA (76.87%). (f) SGE (81.38%). (g) BSGDA (84.90%). (h) SLGDA (85.30%). (i) cKSLGDA (86.72%). (j) nKSLGDA (88.67%).

The augmented Lagrangian function of problem (23) becomes
$$\begin{aligned} \mathcal{L}(\mathbf{Z},\mathbf{J},\mathbf{W},\mathbf{D}_1,\mathbf{D}_2) =\ & \frac{1}{2}\|\mathbf{S}_X - \mathbf{S}_X\mathbf{W}\|_F^2 + \beta\|\mathbf{Z}\|_* + \lambda\|\mathbf{J}\|_1\\ &+ \langle\mathbf{D}_1, \mathbf{W}-\mathbf{Z}\rangle + \langle\mathbf{D}_2, \mathbf{W}-\mathbf{J}+\operatorname{diag}(\mathbf{J})\rangle\\ &+ \frac{\mu}{2}\big(\|\mathbf{W}-\mathbf{Z}\|_F^2 + \|\mathbf{W}-\mathbf{J}+\operatorname{diag}(\mathbf{J})\|_F^2\big). \end{aligned} \tag{24}$$

The variables $\mathbf{Z}$ and $\mathbf{J}$ are updated the same as in (12), while $\mathbf{W}$ is updated as
$$\begin{aligned} \mathbf{W}_{t+1} &= \arg\min_{\mathbf{W}} \frac{1}{2}\|\mathbf{S}_X - \mathbf{S}_X\mathbf{W}\|_F^2 + \langle\mathbf{D}_{1,t}, \mathbf{W}-\mathbf{Z}_{t+1}\rangle + \langle\mathbf{D}_{2,t}, \mathbf{W}-\mathbf{J}_{t+1}\rangle\\ &\quad + \frac{\mu_t}{2}\big(\|\mathbf{W}-\mathbf{Z}_{t+1}\|_F^2 + \|\mathbf{W}-\mathbf{J}_{t+1}\|_F^2\big)\\ &= \big(\mathbf{S}_X^T\mathbf{S}_X + 2\mu_t\mathbf{I}\big)^{-1}\big(\mathbf{S}_X^T\mathbf{S}_X + \mu_t\mathbf{Z}_{t+1} + \mu_t\mathbf{J}_{t+1} - (\mathbf{D}_{1,t} + \mathbf{D}_{2,t})\big). \end{aligned} \tag{25}$$

The process of solving $\mathbf{W}$ in nKSLGDA follows Algorithm 1.
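Since (25) is exactly (13) with $\mathbf{K}$ replaced by $\mathbf{S}_X^T\mathbf{S}_X$, the ADMM sketch given for cKSLGDA can be reused without modification; a one-line illustration (parameter values are placeholders):

```python
# Per-class nKSLGDA graph: reuse the cKSLGDA ADMM sketch with
# K replaced by S_X^{l T} S_X^l, as the derivation of (25) implies.
W_l = cksl_gda_graph(SX_l.T @ SX_l, beta=0.01, lam=0.001)
```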

C. Virtual Kernel LPP

With the weight matrix $\mathbf{W}$ in hand, the projection matrix $\mathbf{P}^{(k)}$ in the virtual kernel space can be obtained from the following objective function:
$$\begin{aligned} \mathbf{P}^{(k)*} &= \arg\min_{\mathbf{P}^{(k)T}\mathbf{S}_X\mathbf{S}_X^T\mathbf{P}^{(k)}=\mathbf{I}} \sum_{i\neq j}\|\mathbf{P}^{(k)T}\mathbf{s}_{X_i} - \mathbf{P}^{(k)T}\mathbf{s}_{X_j}\|_2^2\, W_{ij}\\ &= \arg\min_{\mathbf{P}^{(k)T}\mathbf{S}_X\mathbf{S}_X^T\mathbf{P}^{(k)}=\mathbf{I}} \operatorname{tr}\big(\mathbf{P}^{(k)T}\mathbf{S}_X\mathbf{L}^{(k)}\mathbf{S}_X^T\mathbf{P}^{(k)}\big) \end{aligned} \tag{26}$$
where $\mathbf{L}^{(k)}$ is the Laplacian matrix calculated according to the kernel graph weight matrix $\mathbf{W}$. The above function can be transformed into a generalized eigenvalue problem
$$\mathbf{S}_X\mathbf{L}^{(k)}\mathbf{S}_X^T\mathbf{P}^{(k)} = \boldsymbol{\Lambda}^{(k)}\mathbf{S}_X\mathbf{S}_X^T\mathbf{P}^{(k)}. \tag{27}$$
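The generalized eigenproblem (27) is solved the same way as (15), only in the low-dimensional virtual space; a minimal sketch, again with the absolute-value symmetrization and the small ridge as assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def virtual_kernel_lpp(SX, W, B):
    """Solve S L^(k) S^T P = Lambda S S^T P of (27); return the projection
    vectors of the B smallest eigenvalues as the columns of P^(k)."""
    Ws = 0.5 * (np.abs(W) + np.abs(W).T)            # symmetrize (assumption)
    Lk = np.diag(Ws.sum(axis=1)) - Ws               # Laplacian from W
    A = SX @ Lk @ SX.T
    C = SX @ SX.T + 1e-8 * np.eye(SX.shape[0])      # ridge (assumption)
    vals, vecs = eigh(A, C)
    return vecs[:, np.argsort(vals)[:B]]
```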


Fig. 7. Classification maps of different methods for the University of Pavia data set. (a) Pseudocolor image. (b) Ground truth. (c) Original (83.10%). (d) PCA (81.94%). (e) LDA (82.98%). (f) SGE (82.64%). (g) BSGDA (83.22%). (h) SLGDA (84.01%). (i) cKSLGDA (84.85%). (j) nKSLGDA (85.74%).

Algorithm 3: Proposed nKSLGDA Algorithm
Input: Training samples $\mathbf{X} \in \mathbb{R}^{d\times N}$, testing samples $\mathbf{Y} \in \mathbb{R}^{d\times M}$, parameters $\beta$ and $\lambda$, dimensionality of the virtual kernel space $r$ (here, $r = d$), desired reduced dimensionality $B$ ($B < r$).
1. Normalize the samples of $\mathbf{X}$ and $\mathbf{Y}$ to have unit L2-norm.
2. Compute the virtual samples $\mathbf{S}_X \in \mathbb{R}^{r\times N}$ and $\mathbf{S}_Y \in \mathbb{R}^{r\times M}$ using Algorithm 2.
3. Obtain the graph weight matrix $\mathbf{W}$ using Algorithm 1, with (13) replaced by (25), $\mathbf{X}$ by $\mathbf{S}_X$, and $\mathbf{K}$ by $\mathbf{S}_X^T\mathbf{S}_X$.
4. Compute the projections according to (27).
Output: A kernel projection matrix $\mathbf{P}^{(k)} \in \mathbb{R}^{r\times B}$.

After obtaining the optimal projection matrix $\mathbf{P}^{(k)}$, DR is performed on $\mathbf{S}_X$ and $\mathbf{S}_Y$. The overall process of the nKSLGDA algorithm is summarized in Algorithm 3. nKSLGDA also adopts ADMM to solve the objective function and computes the projection matrix based on LPP, similar to cKSLGDA.
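Chaining the sketches above gives a hedged end-to-end picture of Algorithm 3; the data arrays, class labels, and parameter values here are hypothetical placeholders, not the paper's settings:

```python
# X_train (d x N), Y_test (d x M), labels (length-N class vector): placeholders.
SX, SY = nystrom_virtual_samples(X_train, Y_test, n=300, r=X_train.shape[0])
W = block_diagonal_graph(SX, labels,
                         lambda S: cksl_gda_graph(S.T @ S, beta=0.01, lam=0.001))
Pk = virtual_kernel_lpp(SX, W, B=30)
X_red, Y_red = Pk.T @ SX, Pk.T @ SY   # reduced features fed to the SVM
```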

V. EXPERIMENTAL RESULTS

In this section, we perform experiments on three hyperspectral data sets and compare with traditional and state-of-the-art techniques, including unsupervised methods (i.e., PCA [11] and SGE [24]) and supervised methods (i.e., LDA [12], BSGDA [25], and SLGDA [32]), to validate the effectiveness of our proposed DR methods, i.e., cKSLGDA and nKSLGDA. To this end, the standard SVM classifier is applied to the resulting low-dimensional data for classification. For quantitative assessment, individual classification accuracy, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (κ) are calculated as objective metrics. The experiments are repeated ten times, and the corresponding averaged results are presented. All experiments are implemented in MATLAB on an Intel Core i5-4590 CPU personal computer with 8 GB of RAM.

A. Hyperspectral Data

The first data set was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over northwest Indiana's Indian Pines test site in June 1992. The AVIRIS sensor covers the wavelength range of 0.4–2.45 μm with 220 spectral bands. After removing 18 water-absorption bands, a total of 202 bands is used in the experiments. The image, with 145×145 pixels, represents a rural scenario having 16 different land-cover classes; the numbers of training and testing samples are shown in Table I.

Fig. 8. Classification maps of different methods for the Salinas data set. (a) Pseudocolor image. (b) Ground truth. (c) Original (88.56%). (d) PCA (88.57%). (e) LDA (89.21%). (f) SGE (89.04%). (g) BSGDA (89.73%). (h) SLGDA (90.34%). (i) cKSLGDA (90.96%). (j) nKSLGDA (92.14%).

TABLE IV. Tuning parameter values for cKSLGDA and nKSLGDA.


TABLE V. SVM classification accuracy (%) of different techniques for the Indian Pines set.

TABLE VI. SVM classification accuracy (%) of different techniques for the University of Pavia set.

The second data set is the University of Pavia, which was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor in Italy. The image has 103 bands after removing 12 noisy bands, with a spectral coverage from 0.43 to 0.86 μm, covering a region of 610×340 pixels. There are nine ground-truth classes, from which we randomly select training and testing samples as shown in Table II.

The third data set was also collected by the AVIRIS sensor over the Valley of Salinas, Central Coast of California, in 1998. The image comprises 512×217 pixels with a spatial resolution of 3.7 m, and preserves 204 bands after removing 20 water-absorption bands. Table III lists the 16 land-cover classes and the numbers of training and testing samples.

B. Parameter Tuning

There are three critical parameters in both cKSLGDA and nKSLGDA: the regularization parameters β and λ in the objective function, and the RBF kernel parameter σ in the kernel trick and the explicit kernel mapping. A fivefold cross-validation strategy is adopted to tune these parameters. Specifically, β and λ are selected from {0.00001, 0.0001, 0.001, 0.01, 0.1, 1}, and σ from {0.1, 0.2, 0.3, 0.4} and {0.1, 0.5, 1, 10}, respectively. Figs. 3 and 4 show the overall classification accuracy of the proposed cKSLGDA and nKSLGDA algorithms with respect to β, λ, and σ. The selected values of these three parameters for cKSLGDA and nKSLGDA are presented in Table IV.

C. Comparison of Classification Performance

In the experiments, the classification performance of SVM using all the spectral bands is used as a baseline, while the proposed cKSLGDA and nKSLGDA algorithms are compared with PCA, LDA, SGE, BSGDA, and SLGDA. Fig. 5 shows the OA for all the considered DR algorithms with different reduced dimensionality. In Fig. 5, it can be seen that all the results improve as the dimensionality increases and then become stable after a certain value. Specifically, a close look at Fig. 5 reveals that a reduced dimension of 20 is sufficient for nKSLGDA in the three data sets. Moreover, cKSLGDA reaches the stable state at a dimension of 30 in all three data sets. As for PCA, LDA, SGE, BSGDA, and SLGDA, the best performance appears when the reduced dimension is 20. To make a fair comparison, we choose 30 features for all the considered algorithms to obtain satisfactory results.

Fig. 9. Classification accuracy and standard deviation of different training sizes for three hyperspectral data sets. (a) Indian Pines data. (b) University of Pavia data. (c) Salinas data.

TABLE VII. SVM classification accuracy (%) of different techniques for the Salinas set.

Tables V–VII present the individual class accuracies, OA, AA, and κ coefficient for the three experimental data sets. Overall, the kernel techniques (cKSLGDA and nKSLGDA) offer better classification performance than SLGDA. Moreover, nKSLGDA achieves the best classification results in many classes, as well as the best OA, AA, and κ among all the DR methods, and also surpasses the baseline algorithm. Furthermore, as shown in Tables V–VII, nKSLGDA provides slightly higher performance than cKSLGDA. The possible reason is that nKSLGDA exploits K-means clustering in the explicit kernel mapping, leading to a more discriminative structure in the virtual samples and in the constructed intraclass graph. Specifically, it can be seen that nKSLGDA yields over 3% higher accuracy than SLGDA for the Indian Pines data in Table V, and 1.7% and 1.8% for the Pavia University and Salinas data with such small training sets in Tables VI and VII, while cKSLGDA offers a gain of more than 1.4% over SLGDA for the Indian Pines data. In addition, cKSLGDA and nKSLGDA dramatically improve the individual classification accuracy for highly correlated classes, such as class 2, class 3, class 10, and class 11 in the Indian Pines data. Based on these observations, it can be concluded that the proposed cKSLGDA and nKSLGDA are effective for DR of hyperspectral images.

For visual comparison, classification maps of the different algorithms for the three hyperspectral scenes are given in Figs. 6–8, where the pseudocolor images and ground-truth maps are also provided. In Figs. 6–8, the advantages of the proposed cKSLGDA and nKSLGDA algorithms over the other considered DR methods can be clearly perceived, especially for the regions of class 2 (Corn-no till), class 3 (Corn-min till), and class 10 (Soybeans-no till) in Fig. 6, class 2 (Meadows) and class 8 (Bricks) in Fig. 7, and class 8 (Grapes untrained) in Fig. 8. The classification maps are consistent with the quantitative results listed in Tables V–VII.

It is well known that available training samples may be insufficient to evaluate the effectiveness of classification models, as a result of the expensive cost of acquiring labeled samples in practical situations. Consequently, in this experiment, we investigate how the number of training samples affects the classification performance of each algorithm. Fig. 9 shows the variation of overall classification accuracy under different training sizes. For the Indian Pines data set, we randomly choose 6% to 14% of the labeled data in each class as the training set and the remaining samples for testing; in the case of the University of Pavia and Salinas data sets, the training size is varied over the ranges [40, 120] and [20, 100] with an interval of 20 samples, respectively. As shown in Fig. 9, the overall classification performance of all the considered algorithms monotonically increases as the number of training samples increases. The OA values of the proposed cKSLGDA and nKSLGDA algorithms are consistently superior to those of BSGDA and SLGDA, and nKSLGDA always performs the best. These advantages further demonstrate the effectiveness of cKSLGDA and nKSLGDA in DR.

TABLE VIII. Execution time (in seconds) of different methods on the three data sets.

In Table VIII, the computational time of the aforementioned DR methods is summarized. Obviously, PCA and LDA incur much lower computational cost than the other methods, and BSGDA sufficiently exploits label information to obtain the block-diagonal weight matrix, leading to computational cost lower than that of SGE. In addition, cKSLGDA is the most time-consuming, because it involves the calculation of two kernel matrices, K and K_XY, and, more importantly, it adopts all available samples to construct the graph. In contrast, due to the utilization of the Nyström-based approximate kernel scheme, the computational burden of the proposed nKSLGDA algorithm is much less than that of cKSLGDA and almost the same as that of SLGDA.

VI. CONCLUSION

In this paper, two kernel versions of SLGDA, i.e., cKSLGDA and nKSLGDA, have been proposed for DR and classification of hyperspectral images. In cKSLGDA, the classical kernel trick is exploited to implicitly map the original data into a high-dimensional kernel space, while in nKSLGDA the Nyström-based approximation is utilized to explicitly obtain samples in the virtual kernel space. By further combining the sparse and low-rank constraints, the informative graphs constructed by cKSLGDA and nKSLGDA can better preserve the intrinsic structure of the data. Therefore, the induced projection is able to offer more discriminative information for complex data. Due to the utilization of the Nyström method, the computational cost of nKSLGDA is significantly reduced, and such a fast kernel technique offers a novel approach for kernel hyperspectral image processing. Experimental results on three real hyperspectral images have demonstrated that the proposed cKSLGDA and nKSLGDA algorithms can provide better performance than the state-of-the-art DR methods, and that nKSLGDA outperforms cKSLGDA with much lower computational cost.

ACKNOWLEDGMENT

The authors would like to thank Prof. D. A. Landgrebe from Purdue University for providing the AVIRIS image of Indian Pines and Prof. P. Gamba from the University of Pavia for providing the ROSIS data set. The authors would also like to thank the Associate Editor who handled this paper and the anonymous reviewers for their outstanding comments and suggestions, which greatly helped in improving the quality and presentation of this paper, and A. Golts for her helpful suggestions.

REFERENCES

[1] X. Kang, S. Li, L. Fang, and J. A. Benediktsson, "Intrinsic image decomposition for feature extraction of hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 4, pp. 2241–2253, Apr. 2015.

[2] W. A. Dorigo, R. Zurita-Milla, A. J. W. de Wit, J. Brazile, R. Singh, and M. E. Schaepman, "A review on reflective remote sensing and data assimilation techniques for enhanced agroecosystem modeling," Int. J. Appl. Earth Observ. Geoinf., vol. 9, no. 2, pp. 165–193, May 2007.

[3] Q. Xie et al., "Leaf area index estimation using vegetation indices derived from airborne hyperspectral images in winter wheat," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 8, pp. 3586–3594, Aug. 2014.

[4] Y. Chen, N. M. Nasrabadi, and T. D. Tran, "Sparse representation for target detection in hyperspectral imagery," IEEE J. Sel. Topics Signal Process., vol. 5, no. 3, pp. 629–640, Jun. 2011.

[5] N. M. Nasrabadi, "Hyperspectral target detection: An overview of current and future challenges," IEEE Signal Process. Mag., vol. 31, no. 1, pp. 34–44, Jan. 2014.

[6] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, Aug. 2004.

[7] X. Kang, S. Li, and J. A. Benediktsson, "Spectral–spatial hyperspectral image classification with edge-preserving filtering," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2666–2677, May 2014.

[8] Q. Du and H. Yang, "Similarity-based unsupervised band selection for hyperspectral image analysis," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 4, pp. 564–568, Oct. 2008.

[9] H. Yang, Q. Du, H. Su, and Y. Sheng, "An efficient method for supervised hyperspectral band selection," IEEE Geosci. Remote Sens. Lett., vol. 8, no. 1, pp. 138–142, Jan. 2011.

[10] X. Cao, T. Xiong, and L. Jiao, "Supervised band selection using local spatial information for hyperspectral image," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 3, pp. 329–333, Mar. 2016.

[11] I. Jolliffe, Principal Component Analysis. New York, NY, USA: Springer-Verlag, 1986.

[12] T. V. Bandos, L. Bruzzone, and G. Camps-Valls, "Classification of hyperspectral images with regularized linear discriminant analysis," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 3, pp. 862–873, Mar. 2009.

[13] M. Sugiyama, "Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis," J. Mach. Learn. Res., vol. 8, pp. 1027–1061, May 2007.

[14] W. Li, S. Prasad, J. E. Fowler, and L. M. Bruce, "Locality-preserving dimensionality reduction and classification for hyperspectral image analysis," IEEE Trans. Geosci. Remote Sens., vol. 50, no. 4, pp. 1185–1198, Apr. 2012.

[15] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.

[16] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.

[17] Z. Zhang and H. Zha, "Principal manifolds and nonlinear dimensionality reduction via tangent space alignment," SIAM J. Sci. Comput., vol. 26, no. 1, pp. 313–338, Dec. 2004.

[18] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Comput., vol. 15, no. 6, pp. 1373–1396, Jun. 2003.

[19] X. He and P. Niyogi, "Locality preserving projections," in Proc. Adv. Neural Inf. Process. Syst., vol. 16, Vancouver, BC, Canada, 2003, pp. 234–241.

[20] X. He, D. Cai, S. Yan, and H.-J. Zhang, "Neighborhood preserving embedding," in Proc. 10th IEEE ICCV, Beijing, China, Oct. 2005, pp. 1208–1213.

[21] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.

[22] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[23] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, "Sparse representation for computer vision and pattern recognition," Proc. IEEE, vol. 98, no. 6, pp. 1031–1044, Jun. 2010.

[24] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. S. Huang, "Learning with ℓ1-graph for image analysis," IEEE Trans. Image Process., vol. 19, no. 4, pp. 858–866, Apr. 2010.

[25] N. H. Ly, Q. Du, and J. E. Fowler, "Sparse graph-based discriminant analysis for hyperspectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 7, pp. 3872–3884, Jul. 2014.

[26] W. He, H. Zhang, L. Zhang, W. Philips, and W. Liao, "Weighted sparse graph based dimensionality reduction for hyperspectral images," IEEE Geosci. Remote Sens. Lett., vol. 13, no. 5, pp. 686–690, May 2016.

[27] N. H. Ly, Q. Du, and J. E. Fowler, "Collaborative graph-based discriminant analysis for hyperspectral imagery," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2688–2696, Jun. 2014.

[28] W. Li and Q. Du, "Laplacian regularized collaborative graph for discriminant analysis of hyperspectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7066–7076, Dec. 2016.

[29] Z. Xue, P. Du, J. Li, and H. Su, "Simultaneous sparse graph embedding for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 11, pp. 6114–6133, Nov. 2015.

[30] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 171–184, Jan. 2013.

[31] L. Zhuang et al., "Constructing a nonnegative low-rank and sparse graph with data-adaptive features," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3717–3728, Nov. 2015.

[32] W. Li, J. Liu, and Q. Du, "Sparse and low-rank graph for discriminant analysis of hyperspectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 7, pp. 4094–4105, Jul. 2016.

[33] G. Camps-Valls, L. Gómez-Chova, J. Muñoz-Marí, J. L. Rojo-Álvarez, and M. Martínez-Ramón, "Kernel-based framework for multitemporal and multisource remote sensing data classification and change detection," IEEE Trans. Geosci. Remote Sens., vol. 46, no. 6, pp. 1822–1835, Jun. 2008.

[34] W. Li, S. Prasad, and J. E. Fowler, "Decision fusion in kernel-induced spaces for hyperspectral image classification," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 6, pp. 3399–3411, Jun. 2014.

[35] W. Li, Q. Du, and M. Xiong, "Kernel collaborative representation with Tikhonov regularization for hyperspectral image classification," IEEE Geosci. Remote Sens. Lett., vol. 12, no. 1, pp. 48–52, Jan. 2015.

[36] F. de Morsier, M. Borgeaud, V. Gass, J.-P. Thiran, and D. Tuia, "Kernel low-rank and sparse graph for unsupervised and semi-supervised classification of hyperspectral images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 6, pp. 3410–3420, Jun. 2016.

[37] X. Kang, S. Li, and J. A. Benediktsson, "Feature extraction of hyperspectral images with image fusion and recursive filtering," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 6, pp. 3742–3752, Jun. 2014.

[38] K. Zhang, L. Lan, Z. Wang, and F. Moerchen, "Scaling up kernel SVM on limited resources: A low-rank linearization approach," in Proc. Int. Conf. Artif. Intell. Statist., Washington, DC, USA, 2012, pp. 1425–1434.

[39] J. Yang, A. F. Frangi, J.-Y. Yang, D. Zhang, and Z. Jin, "KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 2, pp. 230–244, Feb. 2005.

[40] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, Jan. 2011.

[41] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optim., vol. 20, no. 4, pp. 1956–1982, Mar. 2010.

[42] K. Zhang and J. T. Kwok, "Clustered Nyström method for large scale manifold learning and dimension reduction," IEEE Trans. Neural Netw., vol. 21, no. 10, pp. 1576–1587, Oct. 2010.

[43] A. Golts and M. Elad, "Linearized kernel dictionary learning," IEEE J. Sel. Topics Signal Process., vol. 10, no. 4, pp. 726–739, Jun. 2016.

[44] A. Iosifidis and M. Gabbouj, "Nyström-based approximate kernel subspace learning," Pattern Recognit., vol. 57, pp. 190–197, Sep. 2016.

[45] C. K. I. Williams and M. Seeger, "Using the Nyström method to speed up kernel machines," in Proc. Annu. Conf. Neural Inf. Process. Syst., Whistler, BC, Canada, 2002, pp. 682–688.

[46] K. Zhang, I. W. Tsang, and J. T. Kwok, "Improved Nyström low-rank approximation and error analysis," in Proc. Int. Conf. Mach. Learn., Helsinki, Finland, 2008, pp. 1232–1239.

[47] D. Feldman, M. Feigin, and N. Sochen, "Learning big (image) data via coresets for dictionaries," J. Math. Imag. Vis., vol. 46, no. 3, pp. 276–291, Jul. 2013.

Lei Pan received the B.Sc. degree in communication engineering from the Shandong University of Science and Technology, Qingdao, China, in 2010, and the M.Sc. degree in communication and information systems from Southwest Jiaotong University, Chengdu, China, in 2013, where he is currently pursuing the Ph.D. degree in signal and information processing with the School of Information Science and Technology.

His research interests include remote sensing image analysis and processing.

Heng-Chao Li (S'06–M'08–SM'14) received the B.Sc. and M.Sc. degrees in information and communication engineering from Southwest Jiaotong University, Chengdu, China, in 2001 and 2004, respectively, and the Ph.D. degree in information and communication engineering from the Graduate University of the Chinese Academy of Sciences, Beijing, China, in 2008.

He is currently a Professor with the Sichuan Provincial Key Laboratory of Information Coding and Transmission, Southwest Jiaotong University. From 2013 to 2014, he was a Visiting Scholar with Prof. W. J. Emery at the University of Colorado, Boulder, CO, USA. His research interests include statistical analysis of synthetic aperture radar images, remote sensing image processing, and signal processing in communications.

Prof. Li was a recipient of several scholarships and awards, including the Special Grade of the Financial Support from the China Post-Doctoral Science Foundation in 2009 and the New Century Excellent Talents in University award from the Ministry of Education of China in 2011. In addition, he has been a reviewer for several international journals and conferences, such as the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, the IEEE TRANSACTIONS ON IMAGE PROCESSING, IET Radar, Sonar and Navigation, and the Canadian Journal of Remote Sensing. He is currently serving as an Associate Editor of the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING.

Wei Li (S'11–M'13–SM'16) received the B.Sc. degree in telecommunications engineering from Xidian University, Xi'an, China, in 2007, the M.Sc. degree in information science and technology from Sun Yat-sen University, Guangzhou, China, in 2009, and the Ph.D. degree in electrical and computer engineering from Mississippi State University, Starkville, MS, USA, in 2012.

He was a Postdoctoral Researcher with the University of California, Davis, CA, USA, for a year. He is currently with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China. His research interests include statistical pattern recognition, hyperspectral image analysis, and data compression.

Dr. Li was a recipient of the 2015 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society for his service to the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING (JSTARS). He is a Reviewer for the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, and the IEEE JSTARS.

Xiang-Dong Chen received the Ph.D. degree from the University of Electronic Science and Technology, Chengdu, China, in 1999.

He is currently a Professor with the School of Information Science and Technology, Southwest Jiaotong University, Sichuan, China. He has authored or co-authored more than 90 research papers. His research interests include sensor technology, signal processing, and piezoelectric devices.

Guang-Ning Wu (M'97–SM'07) received the B.Sc., M.Sc., and Ph.D. degrees in electrical engineering from Xi'an Jiaotong University, Xi'an, China, in 1991, 1994, and 1997, respectively.

He is currently a Professor with the School of Electrical Engineering, Southwest Jiaotong University, Chengdu, China. His research interests include high-speed railways, condition monitoring, fault diagnosis, and insulation life-span evaluation for electric power equipment.

Qian Du (S'98–M'00–SM'05) received the Ph.D. degree in electrical engineering from the University of Maryland–Baltimore County, Baltimore, MD, USA, in 2000.

She is currently a Bobby Shackouls Professor with the Department of Electrical and Computer Engineering, Mississippi State University, Starkville, MS, USA. Her research interests include hyperspectral remote sensing image analysis and applications, pattern classification, data compression, and neural networks.

Dr. Du is a fellow of SPIE–International Society for Optics and Photonics. She was a recipient of the 2010 Best Reviewer Award from the IEEE Geoscience and Remote Sensing Society (GRSS). She served as a Co-Chair of the Data Fusion Technical Committee of the IEEE GRSS from 2009 to 2013. She was the Chair of the Remote Sensing and Mapping Technical Committee of the International Association for Pattern Recognition from 2010 to 2014. She was the General Chair of the fourth IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, held in Shanghai, China, in 2012. She served as an Associate Editor of the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING (JSTARS), the Journal of Applied Remote Sensing, and the IEEE SIGNAL PROCESSING LETTERS. Since 2016, she has been the Editor-in-Chief of the IEEE JSTARS.