

CLUSTER-BASED ENSEMBLE CLASSIFICATION FOR HYPERSPECTRAL REMOTE SENSING IMAGES

    Mingmin Chi, Qun Qian

School of Computer Science and Engineering, Fudan University, Shanghai, China

    Email: [email protected]

    Jon Atli Benediktsson

Dept. of Electrical and Computer Engineering, University of Iceland, Reykjavik, Iceland

    Email: [email protected]

    ABSTRACT

Hyperspectral remote sensing images play a very important role in the discrimination of spectrally similar land-cover classes. In order to obtain a reliable classifier, a larger amount of representative training samples is necessary compared to multispectral remote sensing data. In real applications, it is difficult to obtain a sufficient number of training samples for supervised learning. Besides, the training samples may not represent the real distribution of the whole feature space. To attack the quality problems of training samples, we propose a Cluster-based ENsemble Algorithm (CENA) for the classification of hyperspectral remote sensing images. A data set collected by the ROSIS sensor over the University of Pavia validates the effectiveness of the proposed approach.

Index Terms: Mixture of Gaussians (MoG), Support Cluster Machine (SCM), ensemble, hyperspectral remote sensing images.

    1. INTRODUCTION

Hyperspectral remote sensing images play a very important role in the discrimination among spectrally similar land-cover classes. In order to obtain a reliable classifier, a larger amount of representative training samples is necessary compared to multispectral remote sensing data. In real applications, it is difficult to obtain a sufficient number of training samples for supervised learning. Besides, the training samples may not represent the real distribution of the whole feature space. These issues result in quantity and quality problems of training samples for the robust learning of a classifier.

For the quantity problem, Semi-Supervised Learning (SSL) classification techniques (such as self-labeling approaches [1], low-density-separation SSL approaches [2], label-propagation SSL approaches [3], and so on) proposed in recent decades have been exploited on remote sensing data to overcome problems with small numbers of labeled samples. Nonetheless, the methods mentioned above do not address the problem of unrepresentative training samples.

In our work, we mainly deal with the quality problem in the classification of hyperspectral remote sensing images. In particular, we propose to represent both labeled and unlabeled data with a generative model (i.e., a Mixture of Gaussians, MoG), and then the estimated model is used for learning. This is motivated by the recently proposed classification approach, the Support Cluster Machine (SCM) [4]. The SCM was originally used to address large-scale supervised learning problems. The main idea of the SCM is that the labeled data are first modeled using a generative model. Then the kernel, the similarity measure between Gaussians, is defined by probability product kernels (PPKs) [5]. In other words, the obtained PPK kernel is used to train support vector machines where the learned models contain support CLUSTERs rather than support vectors (this is where the name SCM comes from).

In the SCM, the number of clusters is important in order to obtain good classification results. If the selected number of Gaussians does not fit the data well, the classification accuracy can decrease. Moreover, for small training sets, a mixture model estimated from only the labeled samples cannot represent the distribution of the whole data. For the latter problem, we use a model-based clustering algorithm to estimate the MoG with both labeled and unlabeled samples. For the former, we propose an ensemble technique that goes from coarse to fine numbers of clusters in order to generate different sets of MoGs. For each estimated MoG, the corresponding PPK kernel matrix is computed and then used as the input to a standard SVM for training. The final classification result is obtained by integrating the outputs of the individual SVMs, as in [6]. The accuracy and reliability of the proposed algorithm have been evaluated on ROSIS hyperspectral remote sensing data collected over the University of Pavia, Italy, in the presence of ill-posed classification problems. The results are promising when compared to the state of the art.

The rest of the paper is organized as follows. The next section describes the proposed Cluster-based ENsemble Algorithm (CENA). Section 3 illustrates the data used in the experiments, and reports and discusses the results provided by the different algorithms. Finally, Section 4 draws the conclusions of the paper.


2. CLUSTER-BASED ENSEMBLE ALGORITHM

    2.1. Mixture of Gaussian

To obtain information from unlabeled data, the corresponding structure information is used in our work. In detail, a large amount of unlabeled samples is used to better estimate the data distribution, e.g., using a mixture of Gaussians. Assuming that the data $X = \{x_i\}_{i=1}^{n}$, $x_i \in \mathbb{R}^D$, are drawn independently from the model, we can compute the log-likelihood function for the i.i.d. data by

$$
l(X) = \ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n} \ln\left\{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}
$$

where $\pi_k$ denotes the mixing coefficient, $\mu_k$ the mean vector, and $\Sigma_k$ the covariance matrix of the $k$-th component. In this paper, the EM algorithm is adopted to estimate the parameters.
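As a concrete illustration, the MoG can be estimated with an off-the-shelf EM implementation. The following sketch (not from the original paper) uses scikit-learn's GaussianMixture on the stacked labeled and unlabeled pixels; X_labeled and X_unlabeled are hypothetical input arrays.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stack labeled and unlabeled pixels: the mixture is estimated on all data,
# so the unlabeled samples help capture the full spectral distribution.
X_all = np.vstack([X_labeled, X_unlabeled])  # shape (n, D); hypothetical inputs

K = 10  # number of Gaussian components (varied later by the ensemble)
mog = GaussianMixture(n_components=K, covariance_type='diag', max_iter=200)
mog.fit(X_all)  # EM estimation of pi_k (weights_), mu_k (means_), Sigma_k (covariances_)
```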

Since estimating the mixture model does not take class labels into account, we then assign a deterministic label to each component. Due to the very small labeled set, some components might contain only unlabeled samples; such components are discarded. Components containing samples with different labels are divided until no component contains samples with conflicting label information, as sketched below.
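Continuing the sketch above, one plausible implementation of this labeling step assigns each labeled sample to its most probable component and labels each component from the labeled samples it attracts; the recursive splitting of mixed components described above is replaced here by a simple majority vote for brevity.

```python
def label_components(mog, X_labeled, y_labeled):
    """Assign a class label to each mixture component.

    y_labeled: non-negative integer class indices. Components attracting no
    labeled samples get label -1 and are discarded, as in the paper. The
    paper splits components whose labeled samples disagree; this sketch
    simply takes the majority vote instead.
    """
    comp = mog.predict(X_labeled)            # most probable component per sample
    labels = np.full(mog.n_components, -1)
    for k in range(mog.n_components):
        votes = y_labeled[comp == k]
        if votes.size > 0:
            labels[k] = np.bincount(votes).argmax()
    return labels
```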

    2.2. Support Cluster Machine (SCM)

    2.2.1. PPK with Gaussian mixture models

After this preprocessing, the data are represented by a generative Gaussian mixture model. The similarity between components can be calculated by the probability product kernel (PPK) [5]:

$$
\kappa(\mathcal{G}_k, \mathcal{G}_{k'}) = (\pi_k \pi_{k'})^{\rho} \int_{\mathbb{R}^D} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{\rho}\, \mathcal{N}(x \mid \mu_{k'}, \Sigma_{k'})^{\rho}\, dx \qquad (1)
$$
$$
= (\pi_k \pi_{k'})^{\rho}\, (2\pi)^{(1-2\rho)D/2}\, \rho^{-D/2}\, |\Sigma^{\dagger}|^{1/2}\, |\Sigma_k|^{-\rho/2}\, |\Sigma_{k'}|^{-\rho/2} \exp\!\left(-\tfrac{\rho}{2}\left(\mu_k^{\top}\Sigma_k^{-1}\mu_k + \mu_{k'}^{\top}\Sigma_{k'}^{-1}\mu_{k'} - \mu^{\dagger\top}\Sigma^{\dagger}\mu^{\dagger}\right)\right)
$$

where $\rho$ is a constant, $\Sigma^{\dagger} = (\Sigma_k^{-1} + \Sigma_{k'}^{-1})^{-1}$, and $\mu^{\dagger} = \Sigma_k^{-1}\mu_k + \Sigma_{k'}^{-1}\mu_{k'}$.

For ease of computation, we can assume that the features are statistically independent. Hence, a diagonal covariance matrix is obtained: $\Sigma_k = \mathrm{diag}(\sigma_{k1}^2, \ldots, \sigma_{kD}^2)$.
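For $\rho = 1$, the integral in (1) reduces to the closed form $\kappa(\mathcal{G}_k, \mathcal{G}_{k'}) = \pi_k \pi_{k'}\, \mathcal{N}(\mu_k \mid \mu_{k'}, \Sigma_k + \Sigma_{k'})$, which factorizes over dimensions under the diagonal assumption. A minimal sketch of the kernel matrix computation under these assumptions (continuing the snippets above):

```python
def ppk_diag(pi_k, mu_k, var_k, pi_l, mu_l, var_l):
    """PPK (rho = 1) between two weighted Gaussians with diagonal
    covariances: pi_k * pi_l * N(mu_k | mu_l, var_k + var_l)."""
    s = var_k + var_l                                   # summed per-dim variances
    log_val = (-0.5 * np.sum(np.log(2.0 * np.pi * s))
               - 0.5 * np.sum((mu_k - mu_l) ** 2 / s))
    return pi_k * pi_l * np.exp(log_val)

def ppk_matrix(mog):
    """Kernel matrix over all components of a fitted diagonal GMM."""
    w, mu, var = mog.weights_, mog.means_, mog.covariances_
    K = len(w)
    Kmat = np.empty((K, K))
    for i in range(K):
        for j in range(K):
            Kmat[i, j] = ppk_diag(w[i], mu[i], var[i], w[j], mu[j], var[j])
    return Kmat
```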

    2.2.2. Learning with distributions

After obtaining the kernel matrix $\mathbf{K} = (\kappa(\mathcal{G}_k, \mathcal{G}_{k'}))_{k,k'=1}^{K}$, we can use an SVM-like classifier for learning, i.e., the support cluster machine (SCM) [4]. The SCM maximizes the margin between the positive and the negative clusters, rather than between data vectors, as follows:

$$
\min_{w, b, \xi}\ \frac{1}{2} w^{\top} w + C \sum_{k=1}^{K} \pi_k \xi_k \qquad (2)
$$

    with the constraints

$$
y_k \big(w^{\top} \phi(\mathcal{G}_k) + b\big) \geq 1 - \xi_k, \quad k = 1, \ldots, K \qquad (3)
$$

where $\phi(\cdot)$ is a mapping function applied to a generative distribution (Gaussian in this paper), and the slack $\xi_k$ is multiplied by the weight $\pi_k$ (the prior of the $k$-th cluster in the MoG) so that a misclassified cluster with more samples is given a heavier penalty [4].

Incorporating the constraints (3) and the constraints $\xi_k \geq 0$, $k = 1, \ldots, K$, into the cost function (2) via the Lagrangian, the constrained optimization problem can be transformed into a dual problem following the same steps as for the SVM. The dual representation of the SCM is given as

$$
\max_{\alpha}\ \sum_{k=1}^{K} \alpha_k - \frac{1}{2} \sum_{k=1}^{K} \sum_{k'=1}^{K} y_k y_{k'} \alpha_k \alpha_{k'}\, \kappa(\mathcal{G}_k, \mathcal{G}_{k'}) \qquad (4)
$$
$$
\text{s.t.}\quad 0 \leq \alpha_k \leq \pi_k C,\ k = 1, \ldots, K, \qquad (5)
$$
$$
\sum_{k=1}^{K} \alpha_k y_k = 0.
$$

Like the SVM, the SCM has the same optimization formulation except that the Lagrange multipliers $\alpha_k$ are bounded by $C$ multiplied by the weight $\pi_k$, as shown in (5).
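Since the dual (4)-(5) differs from the standard SVM dual only in the box constraint, the SCM can be trained with any SVM solver that accepts a precomputed kernel and per-sample weights. In scikit-learn, for instance, sample_weight rescales $C$ per sample, which reproduces the bound $0 \leq \alpha_k \leq \pi_k C$; this is a sketch, not necessarily the original authors' solver.

```python
from sklearn.svm import SVC

# Kmat: PPK matrix over mixture components (see ppk_matrix above);
# y: +1/-1 label of each component; pi: component priors pi_k.
scm = SVC(C=100.0, kernel='precomputed')   # C fixed as in the experiments (Sec. 3.2)
scm.fit(Kmat, y, sample_weight=pi)         # box constraint becomes alpha_k <= pi_k * C
```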

    2.2.3. Prediction

We can treat a test sample $x$ as an extreme case of a Gaussian $\mathcal{G}_x$ whose covariance matrix vanishes, i.e., $\mathcal{G}_x = (\pi_x = 1, \mu_x = x, \Sigma_x = \sigma_x^2 I, \sigma_x \to 0)$.

Given two Gaussians, $\mathcal{G}_k$ and $\mathcal{G}_x$, the kernel function (1) can be used to compute the similarity between the distribution and the test vector $x$. For $\rho = 1$, Li et al. [4] proved that the limit of the kernel function (1) becomes the posterior probability of $\mathcal{G}_k$ given $x$. Therefore, we get the kernel for SCM prediction:

$$
\kappa(\mathcal{G}_k, x) = \pi_k\, p(x \mid \mu_k, \Sigma_k) \qquad (6)
$$

which is the (unnormalized) posterior probability of $\mathcal{G}_k$ given $x$. Then, a test sample $x$ can be predicted by using the following decision function:

$$
f(x) = \operatorname{sgn}\left(\sum_{k=1}^{K} \alpha_k y_k\, \kappa(\mathcal{G}_k, x) + b\right). \qquad (7)
$$
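Prediction thus only requires the cross-kernel (6) between each cluster and the test pixel, i.e., $\pi_k$ times the Gaussian density at $x$. A sketch consistent with the snippets above (diagonal covariances assumed):

```python
def scm_decision(scm, weights, means, variances, X_test):
    """Evaluate the SCM decision values for test pixels using the
    prediction kernel (6): kappa(G_k, x) = pi_k * N(x | mu_k, Sigma_k)."""
    n, K = X_test.shape[0], len(weights)
    K_test = np.zeros((n, K))
    for k in range(K):
        s = variances[k]  # per-dimension variances of component k
        log_pdf = (-0.5 * np.sum(np.log(2.0 * np.pi * s))
                   - 0.5 * np.sum((X_test - means[k]) ** 2 / s, axis=1))
        K_test[:, k] = weights[k] * np.exp(log_pdf)
    # With kernel='precomputed', decision_function expects the test-vs-train
    # kernel matrix, here of shape (n_test, K).
    return scm.decision_function(K_test)

# Predicted labels via the sign of the decision function, as in (7):
# y_pred = np.sign(scm_decision(scm, mog.weights_, mog.means_,
#                               mog.covariances_, X_test))
```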


2.3. Ensemble Strategy

In the SCM, the data are represented by a mixture model, whose number of components must usually be fixed in advance. In real applications, it is difficult to evaluate which number is best for the problem at hand. In this paper, we propose an ensemble technique to overcome this problem: the number of mixture components goes from coarse to fine to generate different sets of MoGs.

Accordingly, the input to the different SCMs is $\{\mathcal{G}_k^{g}, y_k^{g}\}$, $g = 1, \ldots, G$, where $G$ is the number of classifiers. The prediction function for classifier $g$ is

$$
f(x \mid g) = \operatorname{sgn}\left(\sum_{k=1}^{K_g} \alpha_k^{g} y_k^{g}\, \kappa(\mathcal{G}_k^{g}, x) + b^{g}\right). \qquad (8)
$$

Then, a winner-take-all combination strategy is used to make the final decision [6].
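As a sketch of this combination step, under the assumption that winner-take-all selects, for each pixel, the prediction of the base classifier with the largest absolute decision value (one plausible reading of [6]; scm_decision is the helper sketched earlier):

```python
def ensemble_predict(scms, mogs, X_test):
    """Winner-take-all combination over G base SCMs, each trained on a MoG
    with a different number of components."""
    # Decision values of each base classifier, shape (G, n_test).
    dec = np.array([scm_decision(scm, m.weights_, m.means_, m.covariances_,
                                 X_test)
                    for scm, m in zip(scms, mogs)])
    winner = np.abs(dec).argmax(axis=0)  # most confident base classifier
    return np.sign(dec[winner, np.arange(X_test.shape[0])])
```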

    3. EXPERIMENTAL RESULTS

    3.1. Dataset Description

Table 1: The distribution of the original training and test data sets of the ROSIS University of Pavia scene.

Class  Name          No. of Training Samples  No. of Test Samples
1      Asphalt        548                      6631
2      Meadows        539                     18563
3      Gravel         391                      2099
4      Trees          518                      3030
5      Metal sheets   265                      1345
6      Bare Soil      532                      5021
7      Bitumen        375                      1330
8      Bricks         514                      3682
9      Shadow         213                       897

The data used in the experiments were collected by the optical sensor ROSIS 03 (Reflective Optics System Imaging Spectrometer) over the campus of the University of Pavia, Italy. Originally, the ROSIS 03 sensor provides 115 bands covering 0.43 to 0.86 μm, and the image size is 610 × 340 pixels. Some channels were removed due to noise, so the data contain 103 features. The task is to discriminate among nine classes, i.e., Asphalt, Meadows, Gravel, Trees, Metal sheets (Metal), Bare soil (Soil), Bitumen, Bricks, and Shadow. Some training data were removed due to zero features; hence, the full data set contains 3895 training samples and 42598 test samples. The detailed distribution can be found in Table 1.

From Table 1, one can see that the numbers of original training samples for all classes are quite balanced; however, that is not the case for the test patterns. In particular, the number of test samples of class 2 (Meadows) is 18563, while those of the remaining classes vary from 897 to 6631. This means that the data distribution estimated even from all the training samples cannot represent the distribution over the whole region. Furthermore, the spectral characteristics of meadows and trees are very similar due to similar spectral reflectance. Therefore, in this paper, we mainly focus on this spectrally similar and unbalanced classification problem.

In order to investigate the impact of the number and quality of labeled data on classifier performance, the original training data were subsampled to obtain ten splits made up of around 2% of the original labeled data (10 samples per class). To compare the classification results, we also carry out experiments with the SVM on both the subsampled and the complete training data sets.

    3.2. Experimental setup

For the SVM with a Gaussian kernel, the kernel parameter $\gamma$ and the penalty $C$ should be decided by model selection. In our work, we use a grid search with $C \in \{10^{-3}, \ldots, 10^{3}\}$ and $\gamma \in \{2^{-3}, \ldots, 2^{3}\}$, and 5-fold cross-validation is used to select the best model for prediction. In the SCM, the parameters $\sigma_{kd}$ can be computed from the data directly. Moreover, the variance differs across directions, which makes the model more flexible in capturing the structure of the data, such as cigar-shaped clusters. Hence, only one parameter, the penalization parameter $C$, must be decided in the SCM. In our experiments, the results are not sensitive to $C$, so we fix it to $C = 100$ in all the following experiments. The range of $K$ for the EM algorithm is set to $\{2, 3, \ldots, 19\}$ to construct 18 base classifiers. Note that, for the SVM, 49 models must be estimated for model selection, so the computational complexity of the SCM is of the same order of magnitude as that of the SVM.
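The SVM baseline's model selection could be implemented as follows (a sketch with scikit-learn's GridSearchCV; the grids mirror the ranges quoted above, and X_train, y_train denote a subsampled split):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [10.0 ** e for e in range(-3, 4)],      # 10^-3 ... 10^3
              'gamma': [2.0 ** e for e in range(-3, 4)]}    # 2^-3 ... 2^3
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)    # 7 x 7 = 49 models
grid.fit(X_train, y_train)  # X_train, y_train: hypothetical subsampled split
best_svm = grid.best_estimator_
```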

    3.3. Experimental Results

Table 2: Classification accuracies using the SVM and the proposed CENA with the ten subsamples of the training data set.

Split    SVM (%)  CENA (%)
1        84.18    90.35
2        82.42    88.46
3        83.80    86.54
4        83.74    86.34
5        79.09    88.81
6        74.75    86.93
7        83.59    90.37
8        83.69    89.96
9        84.62    90.74
10       76.88    86.16
Average  81.73    88.47


[Figure 1: three maps over the 610 × 340 scene, (a) Test Map, (b) Map by SVM, (c) Map by CENA; class legend: Others, Meadows, Trees.]

Figure 1: Comparison among the original test samples and the results provided by the SVM and CENA.

For ease of comparison, we also carried out experiments with the supervised SVM on the ten splits containing 20 labeled training samples (10 per class for the two-class problem). The results are shown in Table 2. The average classification accuracy is 81.73% over the 10 splits, but the individual results are not stable, varying from 74.75% to 84.62%.

The average classification accuracy of the proposed algorithm is 88.47%, much better than that of the SVM. In particular, the results of CENA on every split are significantly better than those of the SVM. Furthermore, the ensemble classification result per split approaches, or even exceeds, that obtained by the SVM trained on all the training samples (i.e., 89.11%). This confirms that the proposed CENA increases not only the classification accuracy but also the robustness of the classification results.

Fig. 1 compares the classification maps produced by the SVM and by CENA with the original test map for split 9. From Fig. 1, one can see that both Meadows and Trees are better classified by CENA. Because the structure of the unlabeled samples is exploited, the data distribution can be better estimated, so the unrepresentativeness problem of a small training set is alleviated. This also avoids the model selection that must be carried out for most supervised classification algorithms.

    4. DISCUSSION AND CONCLUSION

In this paper, a Cluster-based ENsemble Algorithm (CENA) has been proposed to classify hyperspectral remote sensing images. In particular, unlabeled samples together with a very small set of labeled samples are used to estimate generative models, i.e., Mixtures of Gaussians (MoGs). The number of components in a MoG is difficult to determine, so we vary the number of Gaussians from coarse to fine to sidestep this problem. Each MoG is then used to define a base classifier, a support cluster machine [4].

The different generative models lead to diversity in the classification results of the base classifiers. Finally, the results from the different base classifiers are combined to obtain better and more robust classification accuracy. Experiments were carried out on real hyperspectral data collected by the ROSIS 03 sensor over the University of Pavia, Italy. The proposed CENA obtained both better classification accuracies and more robust results compared to the state of the art.

In this work, we have mainly considered a binary classification problem with unrepresentative training samples. In future research, we will also address multi-class classification problems. Besides, the estimation of the mixture model with constraints provided by the labeled samples deserves further investigation for the base classifier. Finally, components of the mixture model without label information will be further taken into account for learning.

    5. ACKNOWLEDGEMENTS

This work was supported in part by the Natural Science Foundation of China under contract 60705008, and by the Research Fund of the University of Iceland. Thanks to Dr. Paolo Gamba of the University of Pavia, Italy, for providing the data set.

    6. REFERENCES

[1] B. M. Shahshahani and D. A. Landgrebe, "The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon," IEEE Trans. Geosci. Remote Sensing, vol. 32, no. 5, pp. 1087-1095, Sept. 1994.

[2] M. Chi and L. Bruzzone, "Semi-supervised classification of hyperspectral images by SVMs optimized in the primal," IEEE Trans. Geosci. Remote Sensing, vol. 45, no. 6, pp. 1870-1880, 2007.

[3] T. Bandos, D. Zhou, and G. Camps-Valls, "Semi-supervised hyperspectral image classification with graphs," in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS'06), July 2006.

[4] B. Li, M. Chi, J. Fan, and X. Xue, "Support cluster machine," in Proceedings of the International Conference on Machine Learning, 2007.

[5] T. Jebara, R. Kondor, and A. Howard, "Probability product kernels," Journal of Machine Learning Research, vol. 5, pp. 819-844, 2004.

[6] G. Briem, J. Benediktsson, and J. Sveinsson, "Multiple classifiers in classification of multisource remote sensing data," IEEE Trans. Geosci. Remote Sensing, vol. 40, no. 10, pp. 2291-2299, Oct. 2002.
