
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 55, NO. 11, NOVEMBER 2017

Multiple Kernel Learning for Hyperspectral Image Classification: A Review

Yanfeng Gu, Senior Member, IEEE, Jocelyn Chanussot, Fellow, IEEE, Xiuping Jia, Senior Member, IEEE, and Jón Atli Benediktsson, Fellow, IEEE

Abstract— With the rapid development of spectral imaging techniques, classification of hyperspectral images (HSIs) has attracted great attention in various applications such as land survey and resource monitoring in the field of remote sensing. A key challenge in HSI classification is how to explore effective approaches to fully use the spatial–spectral information provided by the data cube. Multiple kernel learning (MKL) has been successfully applied to HSI classification due to its capacity to handle heterogeneous fusion of both spectral and spatial features. This approach can generate an adaptive kernel as an optimally weighted sum of a few fixed kernels to model a nonlinear data structure. In this way, the difficulty of kernel selection and the limitation of a fixed kernel can be alleviated. Various MKL algorithms have been developed in recent years, such as the general MKL, the subspace MKL, the nonlinear MKL, the sparse MKL, and the ensemble MKL. The goal of this paper is to provide a systematic review of MKL methods that have been applied to HSI classification. We also analyze and evaluate different MKL algorithms and their respective characteristics in different cases of HSI classification. Finally, we discuss the future direction and trends of research in this area.

Index Terms— Classification, heterogeneous features, hyperspectral images (HSIs), multiple kernel learning (MKL), remote sensing.

I. INTRODUCTION

A. Hyperspectral Image Data and Classification

IT IS well known that hyperspectral remote sensing has become an important means of Earth observation and even space exploration. It extends the number of spectral bands from several or dozens to hundreds, so that each pixel in the scene contains a continuous spectrum that is used to identify the materials present in the pixel by their reflectance or emissivity [1], [2].

Manuscript received May 2, 2017; revised June 26, 2017; accepted July 17, 2017. Date of publication August 7, 2017; date of current version October 25, 2017. This work was supported in part by the Natural Science Foundation of China for Excellent Young Scholars under Grant 61522107 and in part by the Natural Science Foundation of China under Grant 61371180. (Corresponding author: Yanfeng Gu.)

Y. Gu is with the School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China (e-mail: [email protected]).

J. Chanussot is with the Grenoble Institute of Technology, 38402 Saint Martin d'Hères cedex, France (e-mail: [email protected]).

X. Jia is with the School of Engineering and Information Technology, University of New South Wales, Canberra, ACT 2600, Australia (e-mail: [email protected]).

J. A. Benediktsson is with the Department of Electrical and Computer Engineering, University of Iceland, IS-107 Reykjavik, Iceland (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TGRS.2017.2729882

Hyperspectral sensors record the collected information in a series of images, which provide the spatial distribution of the reflected solar radiation from the scene of observation [3]. These images are arranged into a 3-D hyperspectral data cube for subsequent analysis and processing. The hyperspectral images (HSIs), which contain a significant amount of detailed information on land covers and the environmental state, can be used for various thematic applications, such as ecological science [4]–[6], hydrological science [7], geological science [8], precision agriculture [9], [10], and military applications [11], [12]. The success of these applications, however, relies heavily on appropriate data processing approaches and techniques, including unmixing [13], [14], target detection [15]–[17], physical or chemical parameter retrieval [18]–[20], and classification [21], [22]. Among these approaches, supervised classification is fundamental. Supervised classification aims at assigning each pixel in a scene to one of the defined thematic classes [23]. HSI supervised classification is illustrated in Fig. 1.

B. Review of Methodology

A wide range of pixel-level processing techniques for the classification of HSIs has been developed, which can be divided into kernel methods and methods without kernelization. There are a large number of algorithms without kernelization for HSI classification. Among those methods, the k-nearest neighbors, the minimum distance classifier, the maximum likelihood, and the Bayesian estimation methods [24], [25] are conventional statistical approaches. More recently, machine-learning methods have been gradually introduced, e.g., neural networks, deep neural networks (DNNs), representation-based learning (RL), and ensemble learning methods. Among these methods, DNNs [26]–[30], derived from neural networks [31]–[37], have been successfully developed in computer vision [38]–[42] and have recently attracted more attention for HSI classification [43]–[45]. Several deep architecture models were exploited for HSI classification, such as the stacked autoencoder [44], the deep belief network [46], [47], convolutional neural networks [48], and some variants [49]–[53]. In terms of classification performance, ensemble learning [54]–[57] is a generic framework based on constructing an ensemble of individual classifiers. According to the way of generating base classifiers, the types of ensemble learning methods include resampling of the training set (such as bagging [58] and boosting [59]), manipulation of input variables (such as random subspaces [60] and rotation forests [61]), and the introduction of randomness (such as random forests [62]–[68]), which have been successfully applied to HSI classification [69]–[71].

Fig. 1. Illustration of HSI supervised classification.

Kernel methods have been successfully applied to HSI classification [72] while providing an elegant way to deal with nonlinear problems [73]. The main idea of kernel methods is to map the input data from the original space to a convenient feature space by a nonlinear mapping function. Inner products in the feature space can be computed by a kernel function without knowing the nonlinear mapping function explicitly. Then, the nonlinear problems in the input space can be processed by building linear algorithms in the feature space [74]. The kernel support vector machine (SVM) is the most popular approach applied to HSI classification among various kernel methods [21], [74]–[77]. SVM is based on the margin maximization principle, which does not require an estimation of the statistical distributions of classes. To address the limitation of the curse of dimensionality for HSI classification, some improved methods based on SVM have been proposed, such as a multiple classifier system based on adaptive boosting [78], a rotation-based SVM ensemble [79], particle swarm optimization SVM [80], and subspace-based SVM [81]. To enhance the ability of similarity measurement using the kernel trick, a region-kernel-based SVM was proposed [82]. Considering the tensor data structure of HSI, a multiclass support tensor machine was specifically developed for HSI classification [83]. However, the standard SVM classifier can only use the labeled samples to provide predicted classes for new samples. In order to consider the data structure during the classification process, some clustering algorithms have been used [84], such as the hierarchical semisupervised SVM [85] and the spatial–spectral Laplacian SVM [86].

There are some other families of kernel methods for HSI classification, such as Gaussian processes (GPs) and kernel-based representation. GPs provide a Bayesian nonparametric approach to the considered classification problem [87]–[89]. GPs assume that the probability of an input sample belonging to a class label is monotonically related to the value of some latent function at that sample. In GPs, the covariance kernel represents the prior assumption, which characterizes the correlation between samples in the training data. Kernel-based representation was derived from RL to solve nonlinear problems in HSI; it assumes that a test pixel can be linearly represented by training samples in the feature space. RL has already been applied to HSI classification [90]–[109], which includes sparse representation-based classification (SRC) [110], [111], collaborative representation-based classification (CRC) [112], and their extensions [92], [102], [103], [108]. For example, to exploit the spatial contexts of HSI, Chen et al. [90] proposed a joint SRC (JSRC) method under the assumption of a joint sparsity model [113]. These RL methods can be kernelized as kernel SRC [92], kernelized JSRC [114], kernel nonlocal joint CRC [102], and kernel CRC [106], [107].

Furthermore, multiple kernel learning (MKL) methods have been proposed for HSI classification, as a single selected kernel is rarely able to fit complex data structures. MKL methods aim at constructing a composite kernel by combining a set of predefined base kernels [115]. A framework of composite kernel machines was presented to enhance the classification of HSIs [116], which opened a wide field of subsequent developments for integrating spatial and spectral information [117], [118], such as the spatial–spectral composite kernel of superpixels [119], [120], the extreme learning machine with a spatial–spectral composite kernel [121], spatial–spectral composite kernel discriminant analysis [122], and the locality-preserving composite kernel [123]. In addition, MKL methods generally focus on determining the key kernels to be preserved and their significance in the optimal kernel combination. Some typical MKL methods have been gradually proposed for HSI classification, such as subspace MKL methods [124]–[127], SimpleMKL [128], class-specific sparse MKL (CS-SMKL) [129], and nonlinear MKL (NMKL) [130], [131].

This paper presents a survey of the existing papers related to MKL, with special emphasis on remote sensing image classification. The rest of this survey is organized as follows. First, the general MKL framework is discussed in Section II. Then, several typical MKL methods are introduced in Section III, which is divided into five parts: subspace MKL methods and the NMKL method for spatial–spectral joint classification of HSI, sparse MKL methods for feature interpretation in HSI classification, MK-boosting for ensemble learning, and heterogeneous feature fusion with MKL. In Section IV, several typical examples of MKL for HSI classification are demonstrated. Conclusions are drawn in Section V, followed by some remarks on future work in Section VI. For easy reference, Table I lists the notations of all the symbols used in this paper.

Fig. 2. Illustration of nonlinear kernel mapping.

TABLE I. Summary of the notations.

II. LEARNING FROM MULTIPLE KERNELS

Given a labeled training data set with $N$ samples, $X = \{\mathbf{x}_i \mid i = 1, 2, \ldots, N\}$, $\mathbf{x}_i \in \mathbb{R}^D$, and $Y = \{y_i \mid i = 1, 2, \ldots, N\}$, where $\mathbf{x}_i$ is a pixel vector of dimension $D$, $y_i$ is the class label, and $D$ is the number of hyperspectral bands. The classes in the original feature space are often linearly inseparable, as shown in Fig. 2. The kernel method then maps these classes to a higher dimensional feature space via a nonlinear mapping function $\Phi$. The mapped higher dimensional feature space is denoted as $\mathcal{Q}$, that is

$$\Phi : \mathbb{R}^D \to \mathcal{Q}, \quad X \to \Phi(X). \tag{1}$$
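In practice the mapping $\Phi$ is never evaluated explicitly: a kernel function returns the inner products in $\mathcal{Q}$ directly from the input vectors (the kernel trick). The following minimal sketch (our own illustration, not code from the paper; the array X and the scale sigma are placeholders) computes a Gaussian kernel matrix this way:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel: K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    Evaluates the inner products <Phi(x_i), Phi(x_j)> in the induced feature
    space Q without ever forming the mapping Phi explicitly (kernel trick).
    """
    # Squared Euclidean distances via ||a - b||^2 = a.a - 2 a.b + b.b
    sq = (X1 ** 2).sum(1)[:, None] - 2 * X1 @ X2.T + (X2 ** 2).sum(1)[None, :]
    return np.exp(-sq / (2 * sigma ** 2))

# Toy usage: 5 pixel vectors with D = 200 spectral bands
X = np.random.rand(5, 200)
K = rbf_kernel(X, X, sigma=0.5)   # 5 x 5 symmetric PSD kernel matrix
```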

A. General MKL

MKL provides a more flexible framework to mine information more effectively than using a single kernel. In MKL, a flexible combined kernel is generated by a linear or nonlinear combination of a series of base kernels and is used to replace the single kernel in a learning model to achieve a better ability to learn. Each base kernel may exploit the full set of features or a subset of features [128]. Fig. 3 provides an illustration comparing the multiple kernel trick and the single-kernel case. The dual problem of general linearly combined MKL is expressed as follows:

$$\min_{\eta} \max_{\alpha} \left\{ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \sum_{m=1}^{M} \eta_m K_m(\mathbf{x}_i, \mathbf{x}_j) \right\} \quad \text{s.t.}\ \eta_m \ge 0,\ \text{and}\ \sum_{m=1}^{M} \eta_m = 1 \tag{2}$$

where $M$ is the number of candidate base kernels for combination and $\eta_m$ is the weight of the $m$th base kernel.

All the weighting coefficients are nonnegative and sum to one in order to ensure that the combined kernel fulfills the positive semidefinite (PSD) condition and retains the normalization of the base kernels. The MKL problem is designed to optimize both the combining weights $\eta_m$ and the solutions to the original learning problem, i.e., the solutions of $\alpha_i$ and $\alpha_j$ for SVM in (2).

Learning from multiple kernels can provide a better similarity measuring ability, for example, with multiscale kernels, which are RBF kernels with multiple scale parameters $\sigma$ (i.e., bandwidths) [124]. Fig. 4 shows the multiscale kernel matrices. As the visual display of the kernel matrices in Fig. 4 shows, the kernelized similarity measure exhibits multiscale characteristics. A kernel with a small scale is sensitive to variations of similarities, but may result in a highly diagonal kernel matrix, which loses generalization capability. On the contrary, with a large scale, the kernel becomes insensitive to small variations of similarities. Therefore, by learning multiscale kernels, an optimal kernel with the best discriminative ability can be achieved.

Fig. 3. Comparison of the multiple kernel trick and the single-kernel method.

Fig. 4. Multiscale kernel matrices.
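To make the linear combination in (2) concrete, the sketch below (our own hedged illustration; X and y are assumed training pixels and labels, and rbf_kernel is the helper from the sketch above) builds 40 multiscale Gaussian base kernels, combines them with uniform weights (the untrained mean rule, RBMKL), and trains an SVM on the precomputed combined kernel. The MKL algorithms of Section III replace the uniform weights with learned ones:

```python
import numpy as np
from sklearn.svm import SVC

def combined_kernel(kernels, eta):
    """Linear MKL combination of (2): weighted sum of base kernel matrices."""
    return sum(w * K for w, K in zip(eta, kernels))

sigmas = np.arange(0.05, 2.05, 0.05)              # 40 scales in [0.05, 2]
kernels = [rbf_kernel(X, X, s) for s in sigmas]   # multiscale base kernels
eta = np.ones(len(kernels)) / len(kernels)        # uniform weights (mean rule)

K_train = combined_kernel(kernels, eta)
clf = SVC(kernel="precomputed", C=10.0).fit(K_train, y)
# At test time, apply the same weighted combination to K(test, train).
```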

For various real-world applications, there is plenty of heterogeneous data and features [132]. In terms of remote sensing, the features could be spectra, spatial distribution, digital elevation model or height, and temporal information, which need to be learned not with a single kernel but with multiple kernels, where each base kernel corresponds to one type of feature.

B. Strategies for MKL

The strategies for determining the kernel combination can be basically divided into three major categories [115], [133].

1) Criterion-Based Approaches: They use a criterion function to obtain the kernel or the kernel weights. For example, kernel alignment selects the kernel most similar to the ideal kernel (see the sketch after this list). Representative MKL (RMKL) obtains the kernel weights by performing principal component analysis (PCA) on the base kernels [124]. Sparse MKL acquires the kernel by robust sparse PCA [134]. Nonnegative matrix factorization (NMF) and kernel NMF (KNMF) MKL [125] find the kernel weights by NMF and KNMF. Rule-based MKL (RBMKL) generates the kernel via summation or multiplication of the base kernels. The spatial–spectral composite kernel assigns fixed values as the kernel weights [116], [119], [121]–[123].

2) Optimization Approaches: They obtain the base kernel weights and the decision function of classification simultaneously by solving an optimization problem. For instance, CS-SMKL [129], SimpleMKL [128], and discriminative MKL (DMKL) [127] are determined using this approach.

3) Ensemble Approaches: They use the idea of ensemble learning. New base kernels are added iteratively until the cost function reaches its minimum or the classification performance is optimal, as in MK-Boosting [135], which adopts boosting to determine the base kernels and the corresponding weights.
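As a minimal sketch of the criterion-based category (our own illustration for binary labels in {−1, +1}, not the exact procedure of any cited method), kernel alignment scores each base kernel against the ideal kernel $\mathbf{y}\mathbf{y}^T$, and the normalized scores can serve as combination weights:

```python
import numpy as np

def alignment(K1, K2):
    """Kernel alignment A(K1, K2) = <K1, K2>_F / (||K1||_F * ||K2||_F)."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

def alignment_weights(kernels, y):
    """Score base kernels against the ideal kernel y y^T and normalize the
    nonnegative scores into weights that sum to one, as required in (2)."""
    K_ideal = np.outer(y, y)                 # ideal kernel for labels in {-1, +1}
    scores = np.array([alignment(K, K_ideal) for K in kernels])
    scores = np.clip(scores, 0.0, None)      # keep weights nonnegative
    return scores / scores.sum()
```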

C. Basic Training for MKL

In terms of training manners for MKL, the existing algorithms can be partitioned into the following two categories.

1) One-Stage Methods: They solve both the classifier parameters and the base kernel weights by simultaneously optimizing a target function based on the risk function of the classifier. One-stage MKL algorithms can be further split into the two subcategories of direct and wrapper methods, according to the order in which the classifier parameters and base kernel weights are solved. The direct methods simultaneously solve the base kernel weights and the parameters [115]. The wrapper methods solve the two kinds of parameters separately and alternately at a given iteration. First, they optimize the base kernel weights by fixing the classifier parameters, and then they optimize the classifier parameters by fixing the base kernel weights [128], [129], [136].

2) Two-Stage Methods: They solve the base kernel weights independently of the classifier [124], [125], [127]. Usually, they solve the base kernel weights first and then take the base kernel weights as known conditions to solve the parameters of the classifier.

The computational time of one-stage and two-stage MKL depends on two factors: the number of considered kernels and the number of available training samples. The one-stage algorithms are usually faster than the two-stage algorithms when both the number and the size of the base kernels are small. The two-stage algorithms are generally faster than the one-stage algorithms when the number of base kernels is high or the number of training samples used for kernel construction is large.

Fig. 5. Illustration of subspace MKL methods. The squares and circles, respectively, denote training samples from two classes. The combination weights of subspace MKL methods can be obtained by projecting the base kernels onto a few projection directions.

III. MKL ALGORITHMS

A. Subspace MKL

Recently, some effective MKL algorithms, called subspace MKL, have been proposed for HSI classification; they use subspace methods to obtain the weights of the base kernels in the linear combination. These algorithms include RMKL [124], NMF-MKL, KNMF-MKL [125], and DMKL [127]. Given $M$ base kernel matrices $\{\mathbf{K}_m,\ m = 1, 2, \ldots, M,\ \mathbf{K}_m \in \mathbb{R}^{N \times N}\}$, which compose a 3-D data cube of size $N \times N \times M$, the 3-D data cube of kernel matrices is first converted into a 2-D matrix with the help of a vectorization operator, where each kernel matrix is separately converted into a column vector $\mathbf{k}_m = \mathrm{vec}(\mathbf{K}_m)$. After the vectorization, the new form of the base kernels is denoted as $\mathbf{Q} = [\mathbf{k}_1, \mathbf{k}_2, \ldots, \mathbf{k}_M]^T \in \mathbb{R}^{M \times N^2}$.

Subspace MKL algorithms build a loss function as follows:

$$L(\mathbf{K}, \eta) = \|\mathbf{Q} - \mathbf{D}\mathbf{K}\|_F^2 \tag{3}$$

where $\mathbf{D} \in \mathbb{R}^{M \times l}$ is the projection matrix whose columns $\{\eta_r\}_{r=1}^{l}$ are the bases of an $l$-dimensional linear subspace, $\mathbf{K} \in \mathbb{R}^{l \times N^2}$ is the matrix projected onto the linear subspace spanned by $\mathbf{D}$, and $\|\cdot\|_F$ is the Frobenius norm of a matrix. Adopting different optimization criteria to solve $\mathbf{D}$ and $\mathbf{K}$ yields the different subspace MKL methods.

A visual illustration of subspace MKL methods is shown in Fig. 5. Table II summarizes the three subspace MKL methods with their different ways of solving the combination weights. RMKL determines the optimal kernel combination weights by projecting onto the max-variance direction. In NMF-MKL and KNMF-MKL, NMF and KNMF are used to solve the problem of the weights and the optimal combined kernel, owing to the nonnegativity of both the matrix and the combination weights. Moreover, the core idea of DMKL is to learn an optimally combined kernel from predefined base kernels by maximizing separability in a reproducing kernel Hilbert space, which leads to the minimum within-class scatter and maximum between-class scatter.

TABLE II. Summary of subspace MKL methods.
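The following sketch shows an RMKL-style solution in the spirit of [124] (our own simplification: the centering, sign fix, and final normalization are illustrative choices): the base kernels are vectorized into $\mathbf{Q}$ as above, and the weights are read off the max-variance direction of their $M \times M$ covariance:

```python
import numpy as np

def rmkl_weights(kernels):
    """RMKL-style weights: leading principal direction of the vectorized
    base kernels (a sketch in the spirit of [124])."""
    Q = np.stack([K.ravel() for K in kernels])   # M x N^2, rows are vec(K_m)
    Qc = Q - Q.mean(axis=1, keepdims=True)       # center each vectorized kernel
    S = Qc @ Qc.T                                # M x M covariance of base kernels
    _, vecs = np.linalg.eigh(S)                  # eigenvectors, ascending order
    eta = np.abs(vecs[:, -1])                    # max-variance direction, sign-fixed
    return eta / eta.sum()                       # normalize weights to sum to one
```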

B. Nonlinear MKL

NMKL is motivated by the justifiable assumption that a nonlinear combination of different linear kernels can improve classification performance [115]. In [131], an NMKL is introduced to learn an optimally combined kernel from the predefined base kernels for HSI classification. The NMKL method can fully exploit the mutual discriminability of the base kernels corresponding to the spatial–spectral features. A corresponding improvement in classification performance can then be expected.

The framework of NMKL is shown in Fig. 6. First, $M$ spatial–spectral feature sets are extracted from the HSI data cube. Each feature set is associated with one base kernel, which is defined as $K_m(\mathbf{x}_i, \mathbf{x}_j) = \eta_m \langle \mathbf{x}_i, \mathbf{x}_j \rangle$, $m = 1, 2, \ldots, M$. Therefore, $\eta = [\eta_1, \eta_2, \ldots, \eta_M]$ is the vector of kernel weights associated with the base kernels, as shown in Fig. 6. Then, a nonlinearly combined kernel is computed from the original kernels: $M^2$ new kernel matrices are given by the Hadamard product of any two base kernels, and the final kernel matrix is the weighted sum of these new kernel matrices, namely

$$K_\eta(\mathbf{x}_i, \mathbf{x}_j) = \sum_{m=1}^{M} \sum_{h=1}^{M} \eta_m \eta_h K_m(\mathbf{x}_i, \mathbf{x}_j) \odot K_h(\mathbf{x}_i, \mathbf{x}_j). \tag{4}$$

Fig. 6. Illustration of the kernel construction in NMKL.

Applying $K_\eta(\mathbf{x}_i, \mathbf{x}_j)$ to SVM, the problem of learning the kernel $K_\eta$ can be formulated as the following min–max optimization problem:

$$\min_{\eta \in \Theta}\ \max_{\alpha \in \mathbb{R}^N}\ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K_\eta(\mathbf{x}_i, \mathbf{x}_j) \tag{5}$$

where $\Theta = \{\eta \mid \eta \ge 0 \wedge \|\eta - \eta_0\|_2 \le \Lambda\}$ is a positive, bounded, and convex set. A positive $\eta$ ensures that the combined kernel function is PSD, and the regularization of the boundary controls the norm of $\eta$. The definition includes an offset parameter $\eta_0$ for the weight $\eta$. Natural choices for $\eta_0$ are $\eta_0 = 0$ or $\eta_0 / \|\eta_0\| = 1$.

A projection-based gradient-descent algorithm can be used to solve this min–max optimization problem. At each iteration, $\alpha$ is obtained by solving a kernel ridge regression problem with the current kernel matrix, and $\eta$ is updated with the gradients calculated using $\alpha$ while considering the bound constraints on $\eta$ due to $\Theta$.
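A useful identity makes (4) cheap to evaluate: the double sum of weighted Hadamard products factors into the elementwise square of the linearly combined kernel. The sketch below exploits this (our own illustration; the projection-based update of $\eta$ is omitted):

```python
import numpy as np

def nmkl_kernel(kernels, eta):
    """Nonlinear combined kernel of (4). Since
    sum_m sum_h eta_m eta_h (K_m . K_h) = (sum_m eta_m K_m) . (sum_h eta_h K_h)
    elementwise, the M^2 Hadamard products reduce to one elementwise square."""
    K_lin = sum(w * K for w, K in zip(eta, kernels))
    return K_lin * K_lin   # elementwise (Hadamard) product
```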

C. Sparsity-Constrained MKL

Fig. 7. Illustration of sparse MKL.

1) Sparse MKL: There is redundancy among the multiple base kernels, especially among kernels with similar scales (shown in Fig. 7). In [134], a sparse MKL framework was proposed to achieve a good classification performance by using a linear combination of only a few kernels from the multiple base kernels. In sparse MKL, learning with multiple base kernels from hyperspectral data is carried out in two stages. The first stage is to learn an optimally sparse combined kernel from all base kernels, and the second stage is to perform the standard SVM optimization with the optimal kernel. In the first stage, a sparsity constraint is introduced to control the number of nonzero weights and improve the interpretability of the base kernels in classification. The learning model of the first stage can be written as the following optimization problem:

$$\max_{\eta}\ \eta^T \mathbf{S} \eta - \rho\, \mathrm{Card}(\eta) \quad \text{s.t.}\ \eta^T \eta = 1 \tag{6}$$

where $\mathbf{S} = \mathbf{Q}\mathbf{Q}^T$ is the similarity matrix of the vectorized base kernels (cf. Section III-A)

and where $\mathrm{Card}(\eta)$ is the cardinality of $\eta$, corresponding to the number of nonzero weights, and $\rho$ is a parameter that controls sparsity.

The maximization in (6) can be interpreted as a robust maximum eigenvalue problem and solved with a first-order algorithm applied to

$$\max\ \mathrm{Tr}(\mathbf{S}\mathbf{Z}) - \rho\, \mathbf{1}^T |\mathbf{Z}| \mathbf{1} \quad \text{s.t.}\ \mathrm{Tr}(\mathbf{Z}) = 1,\ \mathbf{Z} \succeq 0. \tag{7}$$
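As a hedged illustration of the sparse eigenvalue idea (not the first-order algorithm used in [134]), the sketch below applies truncated power iteration, a simple heuristic that keeps at most k nonzero kernel weights while maintaining $\eta^T\eta = 1$; S = QQᵀ follows the notation of Section III-A:

```python
import numpy as np

def sparse_eta(kernels, k, iters=100):
    """Truncated power iteration for the sparse eigenvalue problem in (6):
    at most k nonzero weights (a simple heuristic, not the method of [134])."""
    Q = np.stack([K.ravel() for K in kernels])
    S = Q @ Q.T                              # M x M similarity of base kernels
    M = S.shape[0]
    eta = np.ones(M) / np.sqrt(M)            # feasible start, eta^T eta = 1
    for _ in range(iters):
        v = S @ eta                          # power step
        v[np.argsort(np.abs(v))[:-k]] = 0.0  # hard-threshold to k nonzeros
        eta = v / np.linalg.norm(v)          # renormalize onto the sphere
    return np.abs(eta)
```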

2) Class-Specific MKL: A CS-SMKL framework has been proposed for spatial–spectral classification of HSIs, which can effectively utilize multiple features at multiple scales [129]. CS-SMKL classifies HSIs by simultaneously learning class-specific significant features and selecting class-specific weights.

The framework of CS-SMKL is illustrated in Fig. 8. First, feature extraction is performed on the original data set, and $M$ feature sets are obtained. Then, $M$ base kernels associated with the $M$ feature sets are constructed. At the kernel learning stage, a class-specific scheme based on the one-versus-one learning strategy is used to select the class-specific weights for the different feature sets and to remove the redundancy of those features when classifying any two categories. As shown in Fig. 8, when classifying one class pair (e.g., class 2 and class 5), we first find the position coordinates according to the labels of the training samples; then the associated class-specific kernels $\kappa_m$, $m = 1, 2, \ldots, M$, are extracted from the base kernels via the corresponding locations. After that, the optimal kernel is obtained by a linear combination of these class-specific kernels. The weights of the linear combination are constrained by the criteria $\sum_{m=1}^{M} \eta_m = 1$, $\eta_m \ge 0$. These criteria enforce sparsity at the group/feature level and automatically learn a compact feature set for classification purposes. The combined kernel is embedded into SVM to complete the final classification.

Fig. 8. Illustration of class-specific kernel learning (taking classes 2 and 5 as an example).
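A minimal sketch of the class-specific kernel extraction for one one-versus-one pair (our own illustration; the function name and the binary relabeling are assumptions):

```python
import numpy as np

def class_pair_kernels(kernels, y, c1, c2):
    """Extract the class-specific sub-kernels kappa_m for the pair (c1, c2)
    by indexing each base kernel with that pair's training samples."""
    idx = np.where((y == c1) | (y == c2))[0]
    subs = [K[np.ix_(idx, idx)] for K in kernels]
    y_pair = np.where(y[idx] == c1, 1, -1)   # binary labels for the pair
    return subs, y_pair

# CS-SMKL then learns its own sparse weights over `subs` for this pair and
# feeds the combined kernel to a binary SVM.
```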

In the CS-SMKL approach, an efficient optimization method has been adopted by using the equivalence between MKL and group lasso [137]. The MKL optimization problem is equivalent to the optimization problem

$$\min_{\eta \in \Delta}\ \min_{\{f_m \in \mathcal{H}_m\}_{m=1}^{M}} \left[ \frac{1}{2} \sum_{m=1}^{M} \eta_m \|f_m\|_{\mathcal{H}_m}^2 + \max_{\alpha \in [0, C]^N} \sum_{i=1}^{N} \alpha_i \left( 1 - \sum_{m=1}^{M} y_i \eta_m f_m(\mathbf{x}_i) \right) \right]. \tag{8}$$

The main differences among the three sparse MKL methods are summarized in Table III.

TABLE III. Summary of sparse MKL methods.

D. Ensemble MKL

The ensemble learning strategy can be applied to the MKL framework to select more effective training samples. As a main approach to ensemble learning, boosting was proposed in [138] and improved in [139]. The idea is to iteratively select training samples, sequentially paying more attention to easily misclassified samples when training the base classifiers. The idea of using boosting techniques to learn kernel-based classifiers was introduced in [140]. Recently, boosting has been integrated into MKL with extended morphological profile (EMP) features in [135] for HSI classification.

Let $T$ be the number of boosting trials. The base classifiers are constructed by SVM classifiers with the complete set of multiple features as input. The method screens samples by a probability distribution $W_t \subset W$, $t = 1, 2, \ldots, T$, which indicates the importance of the training samples for designing a classifier. Incorrectly classified samples have a much higher probability of being chosen as screened samples in the next iteration. In this way, MK-boosting provides a strategy to select more effective training samples for HSI classification. The SVM classifier is used as a weak classifier in this case. In each iteration, the base classifier $f_t$ is obtained from $M$ weak classifiers as

$$f_t = \arg\min_{f_t^m,\ m \in \{1, \ldots, M\}} \gamma_t^m = \arg\min_{f_t^m,\ m \in \{1, \ldots, M\}} \gamma(f_t^m) \tag{9}$$

where $\gamma$ measures the misclassification performance of the weak classifiers.

Fig. 9. Illustration of the sample screening process during boosting trials, taking two classes as a simple example. (a) Training sample set: the triangles and squares, respectively, denote training samples from two classes, and samples marked in red are “hard” samples, which are easily misclassified. (b) Sequentially screened samples: the screened samples (sample ratio = 0.2) marked in purple during boosting trials; the screened samples focus on the “hard” samples shown in (a).

In each iteration, the weights of the distribution are adjusted by increasing the values of incorrectly classified samples and decreasing the values of correctly classified samples, in order to make the classifier focus on the “hard” samples in the training set, as shown in Fig. 9.

Taking the morphological profile (MP) as an example, the architecture of this method is shown in Fig. 10. Each feature set is input to an SVM; the classifier with the best performance is then selected as the base classifier, and the resulting $T$ base classifiers are combined into the final classifier. Furthermore, the combination coefficients are determined by the classification accuracy of the base classifiers during the boosting trials.

Fig. 10. Architecture of the MK-boosting method.
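The following hedged sketch captures the loop described above: in each trial, a fraction of samples is screened according to $W_t$, one weak SVM is trained per feature set, the best one is kept per (9), and the distribution is reweighted toward misclassified samples. The AdaBoost-style coefficients and other details are our simplifications of [135]:

```python
import numpy as np
from sklearn.svm import SVC

def mk_boosting(feature_sets, y, T=10, sample_ratio=0.2):
    """Boosting over M feature-specific SVM weak learners (sketch of [135])."""
    N = len(y)
    W = np.ones(N) / N                         # sample distribution W_t
    learners, coefs = [], []
    for _ in range(T):
        idx = np.random.choice(N, int(sample_ratio * N), replace=False, p=W)
        best, best_err, X_best = None, np.inf, None
        for X in feature_sets:                 # one weak SVM per feature set
            f = SVC(kernel="rbf").fit(X[idx], y[idx])
            err = np.sum(W * (f.predict(X) != y))
            if err < best_err:                 # keep the best weak learner, Eq. (9)
                best, best_err, X_best = f, err, X
        beta = 0.5 * np.log((1 - best_err) / max(best_err, 1e-12))
        miss = best.predict(X_best) != y
        W = W * np.exp(np.where(miss, beta, -beta))   # up-weight "hard" samples
        W = W / W.sum()
        learners.append(best)
        coefs.append(beta)                     # accuracy-based coefficients
    return learners, coefs
```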

E. Heterogeneous Feature Fusion With MKL

This section introduces a heterogeneous feature fusion framework with MKL, as shown in Fig. 11. There are two levels of MKL, in column and in row, respectively. First, different kernel functions are used to measure the similarity of samples on each feature subset. This is the “column” MKL

$$K_{\mathrm{Col}}^{(m)}(\mathbf{x}_i^{(m)}, \mathbf{x}_j^{(m)}) = \sum_{s=1}^{S} h_s^{(m)} K_s^{(m)}(\mathbf{x}_i^{(m)}, \mathbf{x}_j^{(m)}).$$

In this way, the discriminative ability of each feature subset is exploited by different kernels and integrated to generate an optimally combined kernel for each feature subset. Then, the multiple combined kernels resulting from MKL on each feature subset are integrated using a linear combination. This is the “row” MKL

$$K_{\mathrm{Row}}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{m=1}^{M} d_m K_{\mathrm{Col}}^{(m)}(\mathbf{x}_i^{(m)}, \mathbf{x}_j^{(m)}).$$

As a result, the information contained in the different feature subsets is mined and integrated into the final classification kernel. In this framework, the weights of the base kernels can be determined by any MKL algorithm, such as RMKL, NMF-MKL, or DMKL. It is worth noting that sparse MKL can be carried out both at the level of each feature subset and between feature subsets, for base kernel and feature interpretation, respectively.

Fig. 11. Illustration of heterogeneous feature fusion with MKL.
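A minimal sketch of the two-level combination (our own illustration, reusing rbf_kernel from Section II; the per-subset "column" weights h and "row" weights d are assumed to be given, e.g., by RMKL):

```python
def hf_mkl_kernel(feature_subsets, sigmas, col_weights, row_weights):
    """Two-level HF-MKL kernel: a 'column' MKL over S scales within each
    feature subset, then a 'row' MKL across the M subsets."""
    K_cols = []
    for m, X in enumerate(feature_subsets):          # one subset per source
        K_col = sum(h * rbf_kernel(X, X, s)
                    for h, s in zip(col_weights[m], sigmas))
        K_cols.append(K_col)                         # combined kernel per subset
    return sum(d * K for d, K in zip(row_weights, K_cols))   # 'row' combination
```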

IV. MKL FOR HSI CLASSIFICATION

A. Hyperspectral Data Sets

Five data sets are used in this paper. Three of them are HSIs, which were used to validate classification. The fourth and fifth data sets each consist of two parts, i.e., multispectral imagery (MSI) and LiDAR, and are used for multisource classification. The first two HSIs are cropland scenes acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. The AVIRIS sensor acquires 224 bands of 10-nm width with center wavelengths from 400 to 2500 nm. The third HSI was acquired with the Reflective Optics System Imaging Spectrometer (ROSIS-03) optical sensor over an urban area [141]. The flight over the city of Pavia, Italy, was operated by the Deutsches Zentrum für Luft- und Raumfahrt (DLR, the German Aerospace Center) within the context of the HySens project, managed and sponsored by the European Union. The ROSIS-03 sensor provides 115 bands with a spectral coverage ranging from 430 to 860 nm. The spatial resolution is 1.3 m per pixel.

Fig. 12. Ground reference maps for the five data sets. (a) Indian Pines. (b) Salinas. (c) Pavia University. (d) Bayview Park. (e) Recology.

1) Indian Pines Data Set: This HSI was acquired over the agricultural Indian Pines test site in Northwestern Indiana. It has a spatial size of 145 × 145 pixels with a spatial resolution of 20 m per pixel. Twenty water absorption bands were removed, and a 200-band image was used for the experiments. The data set contains 10 366 labeled pixels and 16 ground reference classes, most of which are different types of crops. A false color image and the reference map are presented in Fig. 12(a).

2) Salinas Data Set: This HSI was acquired in Southern California [142]. It has a spatial size of 512 × 217 pixels with a spatial resolution of 3.7 m per pixel. Twenty water absorption bands were removed, and a 200-band image was used for the experiments. The ground reference map is composed of 54 129 pixels and 16 land-cover classes. Fig. 12(b) shows a false color image and information on the labeled classes.

TABLE IV. Information of the five data sets.

3) Pavia University Area: This HSI, with 610 × 340 pixels, was collected near the Engineering School, University of Pavia, Pavia, Italy. Twelve channels were removed due to noise [116], and the remaining 103 spectral channels were processed. There are 43 923 labeled samples in total and nine classes of interest. Fig. 12(c) presents a false color image of this data set.

4) Bayview Park: This data set is from the 2012 IEEE GRSS Data Fusion Contest and is one of the subregions of a whole scene around the downtown area of San Francisco, CA, USA. It contains multispectral images with eight bands acquired by WorldView-2 on October 9, 2011, and corresponding LiDAR data acquired in June 2010. It has a spatial size of 300 × 200 pixels with a spatial resolution of 1.8 m per pixel. There are 19 537 labeled pixels and 7 classes. A false color image and the ground reference map are shown in Fig. 12(d).

5) Recology: The source of this data set is the same as Bayview Park; it is another subregion of the whole scene. It has 200 × 250 pixels, with 11 811 labeled pixels and 11 classes. Fig. 12(e) shows a false color image and the ground reference map.

More details about these data sets are listed in Table IV.

B. Experimental Settings and Evaluation

To evaluate the performance of the various MKL methods on the classification task, the MKL methods and typical comparison methods are listed in Table V. The single-kernel method represents the best performance achievable by a standard SVM, which can be used as a baseline to evaluate whether an MKL method is effective or not. The number of training samples per class was varied (n = {1%, 2%, 3%} or n = {10, 20, 30}). The overall accuracy (OA [%]) and computation time were measured. Average results over ten realizations are reported. To guarantee generality, all the experiments were conducted on typical HSI data sets.

TABLE V. Experimental methods and settings.

In the first experiment, on spectral classification, all spectral bands are stacked into a feature vector as input features. The feature vector was input into Gaussian kernels with different scales. For all of the classifiers, the range of the Gaussian kernel scale was set to [0.05, 2], and uniform sampling with a fixed step size of 0.05 was used to select 40 scales within the given range.

TABLE VI. Summary of the experimental setup for Section IV.

TABLE VII. OA (%) of MKL methods under multiple-scale base kernel construction.

In the second experiment, on spatial–spectral classification, all the data sets were processed first by PCA and then by mathematical morphology. The eigenvalues were arranged in descending order, and the first p PCs accounting for 99% of the total variation in terms of eigenvalues were retained. The construction of the MP was therefore based on the PCs, and a stacked vector was built with the MP on each PC. Here, three kinds of SEs were used to obtain the MP features: diamond, square, and disk SEs. For each kind of SE, a step size of 1 was used, and ten closings and ten openings were computed for each PC. Each structure of MPs, with its ten closings and ten openings, and the original spectral features were respectively stacked as the input vector of each base kernel for the MKL algorithms. The base kernels were four Gaussian kernels with scale values {0.1, 1, 1.5, 2}, corresponding to the three kinds of MP structures and the original spectral features, respectively, namely, 20 base kernels for the MKL methods; the exception is NMKL, which uses three Gaussian kernels with scale values {1, 1.5, 2} for NMKL-Gaussian and four linear base kernel functions for NMKL-linear.
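As a hedged sketch of this MP construction (assuming scikit-learn and scikit-image; only the disk SE is shown, with diamond and square SEs built analogously via skimage.morphology.diamond and square):

```python
import numpy as np
from sklearn.decomposition import PCA
from skimage.morphology import opening, closing, disk

def morphological_profile(cube, n_levels=10, var_kept=0.99):
    """MP on the leading PCs of an HSI cube (H x W x D): n_levels openings
    and closings per PC with disk SEs of radius 1..n_levels (step size 1)."""
    H, W, D = cube.shape
    pca = PCA(n_components=var_kept)              # keep PCs explaining 99% variance
    pcs = pca.fit_transform(cube.reshape(-1, D)).reshape(H, W, -1)
    feats = []
    for b in range(pcs.shape[-1]):
        img = pcs[..., b]
        for r in range(1, n_levels + 1):
            feats.append(opening(img, disk(r)))   # ten openings per PC
            feats.append(closing(img, disk(r)))   # ten closings per PC
        feats.append(img)                         # keep the PC itself
    return np.stack(feats, axis=-1)               # H x W x (21 * n_PCs)
```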

Heterogeneous features were used in the third experiment, including spectral features, elevation features, i.e., the normalized digital surface model (nDSM) from the LiDAR data, and spatial features consisting of MPs. The MP features are extracted from the original multispectral bands and the nDSM using the diamond structuring element with sizes 3, 5, 7, 9, 11, 13, 15, 17, 19, and 21. The heterogeneous features are stacked as a single feature vector to be the input of the fusion methods.

The summary of the experimental setup is listed in Table VI.

C. Spectral Classification

The numerical classification results of the different MKL methods for the different data sets are given in Table VII. The performance of MKL methods is mainly determined by the way the base kernels are constructed and by how the base kernel weights are solved. The base kernel matrices resulting from the different ways of constructing base kernels contain all the information that will be used for the subsequent classification task. The weights of the base kernels learned by the different MKL methods represent how this information is combined, with the objective of strengthening information extraction and curbing useless information for classification.

Observing the results on the three data sets, some conclusions can be drawn. 1) In some cases, the classification accuracy of certain MKL methods is lower than that of the single-kernel method. This reveals that MKL methods need good learning algorithms to ensure performance. 2) On the three typical HSIs, the best classification accuracy is achieved by MKL methods. This shows that using multiple kernels instead of a single one can improve performance for HSI classification, and that the key is to choose a suitable learning algorithm. 3) In most cases, the subspace MKL methods are superior to the comparative MKL methods and the single-kernel method in terms of OA.

D. Spatial–Spectral Classification

The classification results of all the compared methods on the three data sets are shown in Table VIII, and the overall time of the training and test process on the Pavia University data set with 1% training samples is shown in Fig. 13. Several conclusions can be derived. First, as the number of training samples increases, accuracy increases. Second, the MK-boosting method has the best classification accuracy, at the cost of computation time. It is also important to note that there is not a large difference between the methods in terms of classification accuracy. This can be explained as follows: MPs mine information well for classification by way of MKL, so the differences among MKL algorithms mainly concern the complexity and sparsity of the solution. This conclusion is consistent with [115]. SimpleMKL showed the worst classification accuracy under the multiple-scale construction of the first experiment, but is comparable to the other methods in terms of classification accuracy in this experiment. The example of SimpleMKL illustrates that it is difficult for an MKL method to guarantee the best classification accuracy in all cases. Feature extraction and classification are both important steps in classification. If the information extraction via features is successful for classification, the classifier design can be easy in terms of complexity and sparsity, and vice versa. The subspace MKL algorithms, as two-stage methods, have a lower complexity than one-stage methods such as SimpleMKL and CS-SMKL.

TABLE VIII. OA (%) of MKL methods under MP base kernel construction.

Fig. 13. Overall time of the training and testing process for all the methods.

It can be noted that NMKL with linear kernels demonstrates slightly lower accuracy than the subspace MKL algorithms with Gaussian kernels. NMKL with Gaussian kernels obtains classification accuracy comparable to NMKL with linear kernels on the Pavia University and Salinas data sets, but lower accuracy on the Indian Pines data set. In general, using a linear combination of Gaussian kernels is more promising than a nonlinear combination of linear kernels. However, nonlinear combinations of Gaussian kernels need to be researched further. The feature combination and the scale of the Gaussian kernels have a large influence on the accuracy of NMKL with Gaussian kernels, and the NMKL method also demonstrates different performance trends on different data sets. In this experiment, several configurations were tried, and the results were relatively better than other approaches in some situations. More theoretical analysis needs to be done in this area.

It can be found that, among all the sparse methods, CS-SMKL demonstrated comparable classification accuracies on the Indian Pines and Salinas data sets. For the Pavia data set, as the number of training samples grows, the classification performance of CS-SMKL increases significantly and also reaches a comparable accuracy. In order to visualize the contribution of each feature type and the corresponding base kernels in these MKL methods, we plot the kernel weights of the base kernels for RMKL, DMKL, SimpleMKL, sparse MKL, and CS-SMKL in Fig. 14. For simplicity, only three one-against-one classifiers of the Pavia University data set (painted metal sheets versus bare soil, painted metal sheets versus bitumen, and painted metal sheets versus self-blocking bricks) are listed. RMKL, DMKL, SimpleMKL, and sparse MKL used the same kernel weights, shown in Fig. 14(a)–(d), for all the class pairs. From Fig. 14(e), it is easy to see that CS-SMKL selected different sparse base kernel sets for different class pairs and that the spectral features are important for these three class pairs. CS-SMKL selected only very few base kernels for classification purposes, while the kernel weight for the spectral features is very high. However, the corresponding kernel weights in RMKL and DMKL are much lower; sparse MKL did not select any kernel related to the spectral features; and SimpleMKL selects the first three kernels related to the spectral features, but the corresponding kernel weights are obviously lower than those related to the EMP features obtained by the square SE. This is an example showing that CS-SMKL provides more flexibility in selecting kernels (features) for improving classification.

Fig. 14. Weights η determined for each base kernel and the corresponding feature type. (a)–(d) Fixed sets of kernel weights selected by RMKL, DMKL, SimpleMKL, and sparse MKL, respectively. (e) Kernel weights selected for three different class pairs by CS-SMKL.

E. Classification With Heterogeneous Features

This section shows the performance of the fusion framework for heterogeneous features with MKL (denoted as HF-MKL) under realistic ill-posed situations and compares the results with other MKL methods. In the HF-MKL fusion framework, RMKL was adopted to determine the weights of the base kernels at both levels of MKL, in column and in row. Joint classification with the spectral, elevation, and spatial features was carried out, and the classification results for the two data sets are shown in Table IX. SK represents a natural and simple strategy for fusing heterogeneous features, and it can be used as a baseline to evaluate the effectiveness of the different fusion strategies for heterogeneous features. Against this baseline, CKL performs poorly; the performance of CKL is affected by the weights of the spectral, spatial, and elevation kernels.

TABLE IX. OA (%) of different MKL methods on two data sets.


All the MKL methods outperform the stacked-vector strategy. This reveals that features from different sources have different meanings and statistical significance; therefore, they may play different roles in classification. Consequently, the stacked-vector approach is not a good choice for joint classification. MKL, by contrast, is an effective fusion strategy for heterogeneous features, and the HF-MKL framework in particular is a good choice.

V. CONCLUSION

In general, MKL methods can improve classification performance in most cases compared with the single-kernel method. For spectral classification of HSI, subspace MKL methods using a trained, weighted combination on average outperform the untrained, unweighted sum, namely RBMKL (mean MKL), and have a significant advantage in accuracy and computational efficiency over the SimpleMKL method. The ensemble MKL method (MK-boosting) achieves higher classification accuracy, but at an additional cost in computation time. It is also important to note that there is not a large difference in classification accuracy among the different kinds of MKL methods. If effective spatial–spectral features can be extracted for HSI classification, the choice among MKL algorithms mainly concerns the complexity and sparsity of the solution. In general, a linear combination of Gaussian kernels promises better accuracy than a nonlinear combination of linear kernels. However, nonlinear combinations of Gaussian kernels require more research; this is still an open problem, affected by many factors such as the way features are combined and the scale of the Gaussian kernels.

Currently, with the improvement of HSI quality, we can extract more and more accurate features for the classification task. These features can be multiscale, multiattribute, multidimensional, and multicomponent. Since MKL provides a very effective means of learning, it is natural to consider utilizing these features within an MKL framework. By expanding the feature space with a number of information diversities, these multiple features provide excellent classification performance. However, they carry highly redundant information, and each kind contributes differently to the classification task. As a solution, sparse MKL methods have been developed. The sparse MKL framework makes it possible to embed a variety of characteristics in the classifier and to remove the redundancy of the multiple features effectively, learning a compact set of features and selecting the weights of the corresponding base kernels, which leads to remarkable discriminability. The experimental results on three different hyperspectral data sets, corresponding to different contexts (urban, agricultural) and different spectral and spatial resolutions, demonstrate that the sparse methods offer good performance.

Heterogeneous features from different sources have different meanings, dimension units, and statistical significance. Therefore, they may play different roles in classification and should be treated differently. MKL performs heterogeneous feature fusion in an implicit high-dimensional feature representation. Utilizing different heterogeneous features to construct different base kernels can distinguish those different roles and fuse the complementary information contained in the heterogeneous features. Consequently, MKL is a more reasonable choice than the stacked-vector approach, as our experimental results also demonstrate. Furthermore, the two-stage MKL framework is a good choice in terms of OA.

VI. FUTURE LINES OF RESEARCH

A. Deep Kernel and Multiple Kernel Learning

MKL is a low-dimensional network structure with only one hidden layer compared with DNNs. Motivated by deep learning, the deep kernel network is a new research direction. Some work has been done to apply MKL to deep learning, mainly in two ways: one is to fuse hierarchical features from a DNN; the other is to use the kernel trick to optimize the weights and speed up the learning rate, namely, the kernel deep convex network. Furthermore, additional work should focus on building a deep kernel network that can not only optimize features for HSI classification but also be trained by optimizing MKL learning problems. Some other factors should also be considered in the future, such as how to find a set of globally optimal parameters and how to theoretically guide adaptive regularization in the deep kernel network.

B. Superpixel MKL

MKL provides a very effective means of learning and can conveniently embed a variety of characteristics. Therefore, it is critical to apply MKL to effective features. Recently, the superpixel approach has been applied to HSI classification as an effective means of spatial feature extraction. Each superpixel is a local region whose size and shape can be adaptively adjusted according to local structures. The pixels in the same superpixel are assumed to have very similar spectral characteristics, which means that superpixels can provide more accurate spatial information. Utilizing the features explored by superpixels, the salt-and-pepper phenomenon appearing in classification results will be reduced. In consequence, superpixel MKL will lead to better classification performance.

C. MKL for Multimodal Classification

MKL provides a flexible framework for fusing different sources of information in a very natural way. Multimodal classification of remote sensing data is a typical problem of multisource information mining and utilization. The complementary and relevant information contained in multisource remote sensing data can be fused and exploited through the construction of base kernels and the optimization of their configuration in MKL. Consequently, MKL will contribute to the development of multimodal remote sensing, such as multitemporal classification, multisensor fusion and classification, and multiangular image fusion and classification.
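As a sketch of multimodal fusion under the MKL framework, the toy example below builds one base kernel per sensor and sums them with weights. The hypothetical HSI and LiDAR features and the fixed weights are assumptions; in practice, the weights would be optimized by an MKL solver.

```python
import numpy as np

def rbf(X, Y, sigma):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

X_hsi = np.random.randn(40, 80)     # spectral features from the HSI (toy)
X_lidar = np.random.randn(40, 4)    # elevation/intensity features (toy)

bases = [rbf(X_hsi, X_hsi, 4.0), rbf(X_lidar, X_lidar, 1.0)]
weights = np.array([0.6, 0.4])      # fixed here; learned by MKL in practice
K_multimodal = sum(w * K for w, K in zip(weights, bases))
```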


Yanfeng Gu (M'06–SM'16) received the Ph.D. degree in information and communication engineering from the Harbin Institute of Technology (HIT), Harbin, China, in 2005.

He was a Lecturer with the School of Electronics and Information Engineering, HIT, where he became an Associate Professor in 2006; meanwhile, he was enrolled in the first Outstanding Young Teacher Training Program of HIT. From 2011 to 2012, he was a Visiting Scholar with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA. He is currently a Professor with the Department of Information Engineering, HIT. He has authored more than 60 peer-reviewed papers and four book chapters, and he is the inventor or co-inventor of seven patents. His research interests include image processing in remote sensing, machine learning and pattern analysis, and multiscale geometric analysis.

Dr. Gu is a peer reviewer for several international journals, such as the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, the IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, and IET Electronics Letters.

Jocelyn Chanussot (M'04–SM'04–F'12) received the M.Sc. degree in electrical engineering from the Grenoble Institute of Technology (Grenoble INP), Grenoble, France, in 1995, and the Ph.D. degree from Savoie University, Annecy, France, in 1998.

In 1999, he joined the Geography Imagery Perception Laboratory, Délégation Générale de l'Armement (DGA—French National Defense Department), Arcueil, France. From 1999 to 2005, he was an Assistant Professor with Grenoble INP, where he was an Associate Professor from 2005 to 2007. He has been a Visiting Scholar at Stanford University, Stanford, CA, USA; KTH, Stockholm, Sweden; and NUS, Singapore. Since 2013, he has been an Adjunct Professor with the University of Iceland, Reykjavik, Iceland. From 2014 to 2015, he was a Visiting Professor at the University of California, Los Angeles, CA, USA. He is currently a Professor of signal and image processing with Grenoble INP, conducting his research at the Grenoble Images Speech Signals and Automatics Laboratory (GIPSA-Lab). His research interests include image analysis, multicomponent image processing, nonlinear filtering, and data fusion in remote sensing.

Dr. Chanussot was a member of the IEEE Geoscience and Remote Sensing Society (GRSS) Administrative Committee during 2009–2010, in charge of membership development. He has been a member of the Institut Universitaire de France during 2012–2017. He was a co-recipient of the NORSIG 2006 Best Student Paper Award, the IEEE GRSS 2011 and 2015 Symposium Best Paper Awards, the IEEE GRSS 2012 Transactions Prize Paper Award, and the IEEE GRSS 2013 Highest Impact Paper Award. He was the founding President of the IEEE Geoscience and Remote Sensing French Chapter during 2007–2010, which received the 2010 IEEE GRSS Chapter Excellence Award. He was the General Chair of the first IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing. During 2009–2011, he was the Chair of the GRSS Data Fusion Technical Committee, where he was the Co-Chair during 2005–2008. He was a member of the Machine Learning for Signal Processing Technical Committee of the IEEE Signal Processing Society during 2006–2008 and the Program Chair of the IEEE International Workshop on Machine Learning for Signal Processing in 2009. He was an Associate Editor of the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS during 2005–2007 and of Pattern Recognition during 2006–2008. Since 2007, he has been an Associate Editor of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING. Since 2011, he has been the Editor-in-Chief of the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING. He was a Guest Editor of the PROCEEDINGS OF THE IEEE in 2013 and the IEEE SIGNAL PROCESSING MAGAZINE in 2014.

Xiuping Jia (M'93–SM'03) received the B.Eng. degree from the Beijing University of Posts and Telecommunications, Beijing, China, in 1982, and the Ph.D. degree in electrical engineering from the University of New South Wales, Kensington, NSW, Australia, in 1996.

Since 1988, she has been with the School of Information Technology and Electrical Engineering, University of New South Wales, Canberra, ACT, Australia, where she is currently a Senior Lecturer. She is also a Guest Professor with Harbin Engineering University, Harbin, China, and an Adjunct Researcher with the China National Engineering Research Center for Information Technology in Agriculture, Beijing. She has co-authored the remote sensing textbook Remote Sensing Digital Image Analysis [Springer-Verlag, 3rd (1999) and 4th (2006) eds.]. Her research interests include remote sensing, image processing, and spatial data analysis.

Dr. Jia is an Editor of the Annals of GIS for the remote sensing topic, a Subject Editor of the Journal of Soils and Sediments, and an Associate Editor of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.



Jón Atli Benediktsson (M'90–SM'99–F'04) received the Cand.Sci. degree in electrical engineering from the University of Iceland, Reykjavik, Iceland, in 1984, and the M.S.E.E. and Ph.D. degrees from Purdue University, West Lafayette, IN, USA, in 1987 and 1990, respectively.

In 2015, he joined the University of Iceland as Rector, where he had been the Pro Rector of science and academic affairs and a Professor of electrical and computer engineering from 2009 to 2015. He is a Co-Founder of Oxymap. He has authored extensively in his research fields, and his research interests include remote sensing, biomedical analysis of signals, pattern recognition, image processing, and signal processing.

Prof. Benediktsson is a member of the Association of Chartered Engineers in Iceland (VFI), Societas Scientiarum Islandica, and Tau Beta Pi. He is a fellow of SPIE. He received the Stevan J. Kristof Award from Purdue University in 1991 as an outstanding graduate student in remote sensing, the Yearly Research Award from the Engineering Research Institute of the University of Iceland in 2006, and the Outstanding Service Award from the IEEE Geoscience and Remote Sensing Society (GRSS) in 2007. He was a recipient of the Icelandic Research Council's Outstanding Young Researcher Award in 1997 and the IEEE Third Millennium Medal in 2000. He was a co-recipient of the University of Iceland's Technology Innovation Award in 2004, the 2012 IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING Paper Award, the IEEE GRSS Highest Impact Paper Award and the IEEE/VFI Electrical Engineer of the Year Award in 2013, and the International Journal of Image and Data Fusion Best Paper Award in 2014. He is a member of the 2014 IEEE Fellow Committee. He was the 2011–2012 President of the IEEE GRSS and has been on the GRSS Administrative Committee since 2000. He was the Editor-in-Chief of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING (TGRS) from 2003 to 2008 and has been an Associate Editor of TGRS since 1999, the IEEE GEOSCIENCE AND REMOTE SENSING LETTERS since 2003, and IEEE ACCESS since 2013. He is on the Editorial Board of the PROCEEDINGS OF THE IEEE and the International Editorial Board of the International Journal of Image and Data Fusion. He was the Chairman of the Steering Committee of the IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING during 2007–2010.