
Active Learning Schemes for Reduced Dimensionality Hyperspectral Classification

Vikram Jayaram, Bryan Usevitch

Dept. of Electrical & Computer Engineering, The University of Texas at El Paso

500 W. University Ave, El Paso, Texas 79968-0523, {jayaram, usevitch}@ece.utep.edu ∗

Abstract

Statistical schemes have certain advantages which promote their use in various pattern recognition problems. In this paper, we study the application of two statistical learning criteria for material classification of Hyperspectral remote sensing data. In most cases, the Hyperspectral data is characterized using a Gaussian mixture model (GMM). The problem in using a statistical model such as the GMM is the estimation of the class conditional probability density functions based on the exemplars available from the training data for each class. We demonstrate the use of two training methods, dynamic component allocation (DCA) and the minimum message length (MML) criterion, that are employed to learn the mixture observations. The training schemes are then evaluated using a Bayesian classifier.

1 Introduction

Classification of Hyperspectral imagery (HSI) data is a challenging problem for two main reasons. First, with the limited spatial resolution of HSI sensors and/or the distance of the observed scene, the images invariably contain pixels composed of several materials. It is desirable to resolve the contributions of the constituents from the observed image without relying on high spatial resolution images. Remote sensing cameras have been designed to capture a wide spectral range, motivating the use of post-processing techniques to distinguish materials via their spectral signatures. Secondly, the available training data for most pattern recognition problems in HSI processing is severely inadequate. Under the framework of statistical classifiers, Hughes [8] was able to demonstrate the impact of this problem on a theoretical

∗This work was supported by NASA Earth System Science Fellowship NNX06AF68H.

basis. Concerning the second problem, feature extraction and optimal band selection are the methods most commonly used for finding useful features in high-dimensional data. On the other hand, reduced dimensionality algorithms suffer from a theoretical loss of performance. This performance loss occurs due to the reduction of data to features, and to the further approximation of the theoretical features by PDFs. Although a loss of performance is evident in the case of PDF-based classification methods, their evaluation in an HSI classification regime is the prime focus of this paper.

2 Mixture Modeling

To establish the relevance of finite mixture model PDFs for HSI, let us consider a random variable X. A finite mixture model decomposes a PDF f(x) into a sum of K class PDFs. In other words, the density function f(x) is semiparametric, since it may be decomposed into K components. Let f_k(x) denote the kth class PDF. The finite mixture model with K components expands as

f(x) = \sum_{k=1}^{K} a_k f_k(x),    (1)

where a_k denotes the proportion of the kth class. The proportion a_k may be interpreted as the prior probability of observing a sample from class k. Furthermore, the prior probabilities a_k for each distribution must be nonnegative and sum to one, that is,

a_k \geq 0 \quad \text{for} \quad k = 1, \cdots, K,    (2)

and

\sum_{k=1}^{K} a_k = 1.    (3)


On similar grounds, multidimensional data such as HSI can be modeled by a multidimensional Gaussian mixture (GM) [1]. A GM PDF for z ∈ R^P is given by

p(z) = \sum_{i=1}^{L} \alpha_i \, \mathcal{N}(z, \mu_i, \Sigma_i),

where

\mathcal{N}(z, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \exp\!\left\{ -\frac{1}{2} (z - \mu_i)' \Sigma_i^{-1} (z - \mu_i) \right\}.

Here L is the number of mixture components and P the number of spectral channels (bands). The GM parameters are denoted by λ = {αi, μi, Σi}. These parameters are estimated using maximum likelihood (ML) by means of the expectation-maximization (EM) algorithm.
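As a concrete illustration (not the training procedure proposed in this paper), the baseline ML/EM fit of λ = {αi, μi, Σi} can be sketched with an off-the-shelf EM implementation; the array name `pixels` and the fixed component count are assumed placeholders.

    # Minimal sketch: plain EM fit of a GMM to dimensionality-reduced HSI
    # pixels, assuming `pixels` is an (N, P) array of MNF coefficients.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_gmm(pixels: np.ndarray, n_components: int = 3, seed: int = 0):
        """Estimate lambda = {alpha_i, mu_i, Sigma_i} by ML via EM."""
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="full",
                              random_state=seed)
        gmm.fit(pixels)
        return gmm.weights_, gmm.means_, gmm.covariances_  # alpha, mu, Sigma

The DCA and MML schemes described in Sections 3 and 4 replace this fixed-order fit with procedures that adapt the number of components during training.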

Figure 1. The scene is a 1995 AVIRIS image of the Cuprite field in Nevada with the training regions overlaid.

Figure 1 shows the data sets used in our experiments, which belong to the 1995 Cuprite field scene in Nevada. The training regions in the HSI data are identified heuristically from mineral maps provided by Clark et al. [2]. The remote sensing data sets that we have used in our experiments come from an Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor image. AVIRIS is a unique optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral channels (bands) with wavelengths from 0.4-2.5 μm. AVIRIS is flown all across the US, Canada, and Europe.

Since HSI imagery is highly correlated in the spectral direction, using the minimum noise fraction (MNF) transform is natural for dimensionality reduction. This transform is also called the noise-adjusted principal component (NAPC) transformation. Like principal component analysis (PCA), this technique is also used to determine the inherent dimensionality of the imagery data. The transformation segregates noise in the data and reduces the computational requirements for subsequent processing [5]. Figure 2 shows the 2D "scatter" plot of the first two MNF components of the original Cuprite data.

Figure 2. 2D scatter plot of the data using the first two MNF bands (MNF Band 1 vs. MNF Band 2).
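For readers who want to reproduce a plot like Figure 2, the following is a rough, assumed sketch of an MNF-style (noise-adjusted PCA) reduction; the horizontal-difference noise estimate and the `cube` variable are illustrative choices, not the preprocessing pipeline actually used for the AVIRIS data.

    # Hedged sketch of a noise-adjusted PCA (MNF-like) projection for a
    # (rows, cols, bands) hyperspectral cube; keeps the top `n_components`.
    import numpy as np

    def mnf(cube: np.ndarray, n_components: int = 2) -> np.ndarray:
        rows, cols, bands = cube.shape
        X = cube.reshape(-1, bands).astype(float)
        X -= X.mean(axis=0)
        # crude noise estimate from differences of horizontally adjacent pixels
        noise = (cube[:, 1:, :] - cube[:, :-1, :]).reshape(-1, bands)
        Sigma_n = np.cov(noise, rowvar=False)   # noise covariance
        Sigma = np.cov(X, rowvar=False)         # data covariance
        # generalized eigenproblem Sigma v = lambda Sigma_n v (highest SNR first)
        evals, evecs = np.linalg.eig(np.linalg.solve(Sigma_n, Sigma))
        order = np.argsort(np.real(evals))[::-1]
        V = np.real(evecs[:, order[:n_components]])
        return X @ V                            # (rows*cols, n_components)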

3 Dynamic Component Allocation

In spite of the good mathematical tractability of the GMM, there are challenges in trying to train a GM with a local algorithm like EM. First of all, the true number of mixture components is usually unknown, and not knowing it is a major learning problem for a mixture classifier using EM [4]. The solution to this problem is a dynamic algorithm for Gaussian mixture density estimation that can effectively add and remove kernel components to adequately characterize the input data. This methodology also increases the chances of escaping the many local maxima of the likelihood function. The solution to component initialization is based on a greedy EM approach which begins the GM training with a single component [6]. Components, or modes, are then added in a sequential manner until the likelihood stops increasing or the incrementally computed mixture is almost as good as any mixture of that form. This incremental mixture density function uses a combination of global and local search each time a new kernel component is added to the mixture. We shall now describe in detail the following three operations: merging, splitting, and pruning of the GMM.

3.1 Merging of Modes

Merging is one of the processes in this proposed training scheme wherein a single mode is created from two nearly identical ones.


The closeness between the mixture modes is given by a metric d. For example, consider two PDFs p1(x) and p2(x). Let there be a collection of points near the central peak of p1(x), represented by xi ∈ X1, and another set of points near the central peak of p2(x), denoted by xi ∈ X2. In this case the closeness metric d is given by

d = \log \left\{ \frac{\prod_{x_i \in X_1} p_1(x_i) \, \prod_{x_i \in X_2} p_2(x_i)}{\prod_{x_i \in X_1} p_2(x_i) \, \prod_{x_i \in X_2} p_1(x_i)} \right\}    (4)

Notice that this metric is zero when p1(x) = p2(x) and greater than zero for p1(x) ≠ p2(x). A pre-determined threshold is set to determine if the modes are too close to each other. Since we assume that p1(x) and p2(x) are just two Gaussian modes, it is easy to know where some good points for X1 and X2 are. We choose the means (centers) and then go one standard deviation in each direction along all the principal axes. The principal axes are found by SVD decomposition of R (the Cholesky factor of the covariance matrix).

If the two modes are found to be too close, they are merged, forming a weighted combination of the two modes (weighted by α1 and α2). The mean for this newly merged mode is

\mu = \frac{\alpha_1 \mu_1 + \alpha_2 \mu_2}{\alpha_1 + \alpha_2}    (5)

Here μ1 and μ2 are the means of the components before merging and μ is the resultant mean after merging the two components. The proper way to form a weighted combination of the covariances is not simply a weighted sum of the covariances, which does not take into account the separation of the means. Therefore, one needs to implement a more intelligent technique. Consider the Cholesky decomposition of the covariance matrix Σ = R'R. It is possible to regard the rows of √P R as samples of P-dimensional vectors whose covariance is Σ, where P is the dimension: the sample covariance is given by (1/P)(√P)^2 R'R = Σ. Now, given the two modes to merge, we regard √P R1 and √P R2 as two populations to be joined. The sample covariance of the collection of rows is the desired covariance. But this assigns equal weight to the two populations. To weight them with their respective weights, we multiply them by √(α1/(α1+α2)) and √(α2/(α1+α2)). Before they can be joined, however, they must be shifted so that they are re-referenced to the new central mean.
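The merge itself admits a compact implementation. The sketch below uses an equivalent moment-matching formulation (weighted covariances plus mean-separation outer products) rather than the row-joining of the Cholesky factors described above; it is an illustration under that assumption, not the authors' exact code.

    # Moment-matching merge of two Gaussian modes weighted by their mixing
    # proportions; the outer-product terms account for the mean separation.
    import numpy as np

    def merge_modes(a1, mu1, S1, a2, mu2, S2):
        w1, w2 = a1 / (a1 + a2), a2 / (a1 + a2)
        mu = w1 * mu1 + w2 * mu2                      # Eq. (5)
        d1 = (mu1 - mu)[:, None]
        d2 = (mu2 - mu)[:, None]
        Sigma = w1 * (S1 + d1 @ d1.T) + w2 * (S2 + d2 @ d2.T)
        return a1 + a2, mu, Sigma                     # merged weight, mean, cov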

3.2 Splitting of Components

On the other hand, if the number of components is too low, then components are split in order to increase the total number of components. Vlassis et al. [6] define a method to monitor the weighted kurtosis of each mode, which directly determines the number of mixture components. This kurtosis measure is given by

K_i = \frac{\sum_{n=1}^{N} w_{n,i} \left( \frac{z_n - \mu_i}{\sqrt{\Sigma_i}} \right)^4}{\sum_{n=1}^{N} w_{n,i}} - 3,

where

w_{n,i} = \frac{\mathcal{N}(z_n, \mu_i, \Sigma_i)}{\sum_{n=1}^{N} \mathcal{N}(z_n, \mu_i, \Sigma_i)}.

Therefore, if |K_i| is too high for any component (mode) i, then that mode is split into two. This can be extended to higher dimensions by considering skew in addition to kurtosis, where each data sample z_n is projected onto the jth principal axis of \Sigma_i in turn. Let z^j_{n,i} \triangleq (z_n - \mu_i)' V_{ij}, where V_{ij} is the jth column of V, obtained from the SVD of \Sigma_i. Then, for each j,

K_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( z^j_{n,i} / s_i \right)^4}{\sum_{n=1}^{N} w_{n,i}} - 3,

\psi_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( z^j_{n,i} / s_i \right)^3}{\sum_{n=1}^{N} w_{n,i}},

m_{i,j} = |K_{i,j}| + |\psi_{i,j}|,

where

s_i^2 = \frac{\sum_{n=1}^{N} w_{n,i} \left( z^j_{n,i} \right)^2}{\sum_{n=1}^{N} w_{n,i}}.

Now, if m_{i,j} > τ for any j, mode i is split. The split is performed by creating new modes at μ = μ_i + v_{i,j} S_{i,j} and μ = μ_i − v_{i,j} S_{i,j}, where S_{i,j} is the jth singular value of Σ_i and v_{i,j} the corresponding principal axis. The same covariance Σ_i is used for each new mode. The decision to split also depends upon the mixing proportion α_i: splitting does not take place if α_i is too small. An optional threshold parameter allows control over splitting; a higher threshold makes splitting less likely.
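A possible realization of this split test is sketched below; the weight vector w, the thresholds tau and alpha_min, and the function name are assumptions introduced for illustration.

    # Per-axis kurtosis/skew split test for mode i, as described above.
    # Z: (N, P) data, w: (N,) weights w_{n,i}, (mu, Sigma, alpha): mode i.
    import numpy as np

    def should_split(Z, w, mu, Sigma, alpha, tau=1.0, alpha_min=1e-3):
        if alpha < alpha_min:                # mode too weak: never split
            return False, None
        U, S, _ = np.linalg.svd(Sigma)       # principal axes, singular values
        wsum = w.sum()
        for j in range(Sigma.shape[0]):
            zj = (Z - mu) @ U[:, j]                      # projection z^j_{n,i}
            s = np.sqrt((w * zj**2).sum() / wsum)        # s_i
            K = (w * (zj / s)**4).sum() / wsum - 3.0     # K_{i,j}
            psi = (w * (zj / s)**3).sum() / wsum         # psi_{i,j}
            if abs(K) + abs(psi) > tau:                  # m_{i,j} > tau
                # split along axis j: mu +/- v_{i,j} * S_{i,j}, same Sigma
                return True, (mu + U[:, j] * S[j], mu - U[:, j] * S[j])
        return False, None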

3.3 Pruning of Modes

When the number of components becomes too high, components are pruned out as their mixing weights αi fall. Pruning is the removal of weak modes from the overall mixture. A weak mode is identified by checking αi against a certain threshold. Once identified, the weak modes are removed and the remaining αi are re-normalized such that Σ_i αi = 1. It is equally important that the algorithm does not annihilate many moderately weak modes all at once. This is achieved by setting up two input threshold values.
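A minimal sketch of this pruning step follows; the threshold names and the `max_kill` guard (which limits how many modes can be removed in one pass) are assumptions standing in for the paper's two input thresholds.

    # Remove modes whose mixing weight falls below a threshold, then
    # re-normalize the remaining weights so that sum_i alpha_i = 1.
    import numpy as np

    def prune_modes(alphas, mus, Sigmas, prune_thresh=1e-3, max_kill=1):
        order = np.argsort(alphas)                       # weakest modes first
        kill = [i for i in order if alphas[i] < prune_thresh][:max_kill]
        keep = [i for i in range(len(alphas)) if i not in kill]
        new_alphas = np.asarray([alphas[i] for i in keep])
        new_alphas /= new_alphas.sum()
        return new_alphas, [mus[i] for i in keep], [Sigmas[i] for i in keep]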


Finally, once the number of modes settles out, the EM objective Q stops increasing and convergence is achieved. Hence, the DCA technique for GMM approximation of HSI observations demonstrates that the combination of covariance constraints, mode pruning, merging, and splitting can result in a good PDF approximation of the HSI mixture models.
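How the three operations combine with EM can be outlined as follows. This is only a schematic of the loop implied by the text, with the EM step, the merge/split/prune routines, and the objective Q passed in as callables; none of these names come from the paper itself.

    # Schematic DCA outer loop: alternate EM updates with the merge (3.1),
    # split (3.2) and prune (3.3) operations until Q stops increasing.
    import numpy as np

    def dca_train(Z, params, em_step, merge, split, prune, objective,
                  tol=1e-4, max_iter=200):
        q_old = -np.inf
        for _ in range(max_iter):
            params = em_step(Z, params)
            params = merge(params)           # Sec. 3.1
            params = split(Z, params)        # Sec. 3.2
            params = prune(params)           # Sec. 3.3
            q_new = objective(Z, params)
            if q_new - q_old < tol:          # Q has stopped increasing
                break
            q_old = q_new
        return params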

4 Minimum Message Length Criterion

Let us now consider the second mixture learning technique, based on the minimum message length (MML) criterion. This method is also known as the Figueiredo-Jain algorithm [13]. Using the MML criterion and applying it to mixture models leads to the following objective function:

\Lambda(\lambda, Z) = \frac{V}{2} \sum_{\{c:\, \alpha_c > 0\}} \log\!\left( \frac{N \alpha_c}{12} \right) + \frac{C_{nz}}{2} \log\frac{N}{12} + \frac{C_{nz}(V+1)}{2} - \log L(Z, \lambda),

where N is the number of training points, V is the number of free parameters specifying a component, C_nz is the number of components with non-zero weight in the mixture (α_c > 0), λ is the parameter list of the GMM, i.e. {α_1, μ_1, Σ_1, ..., α_C, μ_C, Σ_C}, and the last term, log L(Z, λ), is the log-likelihood of the training data given the distribution parameters λ.

The EM algorithm can be used to minimize the above equation with fixed C_nz. This leads to an M-step with the component weight updating formula

\alpha_c^{(i+1)} = \frac{\max\left\{ 0, \left( \sum_{n=1}^{N} w_{n,c} \right) - \frac{V}{2} \right\}}{\sum_{j=1}^{C} \max\left\{ 0, \left( \sum_{n=1}^{N} w_{n,j} \right) - \frac{V}{2} \right\}}.

This formula contains an explicit rule for annihilating components by setting their weights to zero. The other distribution parameters are updated similarly to the previous method. Figure 3 shows an intermediate step of mixture learning using the MML criterion.
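The weight update is straightforward to implement. The sketch below assumes W is the (N, C) matrix of responsibilities w_{n,c} produced by the E-step and V the number of free parameters per component (for a full-covariance Gaussian in P dimensions, V = P + P(P+1)/2); the names are illustrative.

    # MML (Figueiredo-Jain) M-step for the mixing weights: components whose
    # effective sample count drops below V/2 are annihilated (weight set to 0).
    import numpy as np

    def mml_weight_update(W: np.ndarray, V: int) -> np.ndarray:
        counts = W.sum(axis=0)                      # sum_n w_{n,c} for each c
        num = np.maximum(0.0, counts - V / 2.0)
        return num / num.sum()                      # new alpha_c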

5 Classification Results

Having described the two learning criteria, we now evaluate their performance using a simple Bayesian classifier. Figures 4 and 5 depict the classification of the three HSI mixture classes based on the two learning criteria. Classifications, outliers, and misclassifications are shown in these figures for the two cases. Table 1 shows the classification performance comparison of the DCA learning method vs. the MML method. The tabulation shows various classification and misclassification rates of the two learning methodologies based on the amount of training data utilized.
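For reference, the Bayesian decision rule used here amounts to assigning each pixel to the class whose GMM yields the largest posterior; a hedged sketch, with assumed container names such as `class_gmms` and `priors`, is:

    # Maximum a posteriori classification with per-class GMM densities.
    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_logpdf(x, weights, means, covs):
        return np.log(sum(a * multivariate_normal.pdf(x, m, S)
                          for a, m, S in zip(weights, means, covs)))

    def classify(x, class_gmms, priors):
        # class_gmms: label -> (weights, means, covs); priors: label -> pi_k
        scores = {k: np.log(priors[k]) + gmm_logpdf(x, *params)
                  for k, params in class_gmms.items()}
        return max(scores, key=scores.get)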

Figure 3. Intermediate learning step of the GMM before achieving convergence using the MML criterion (MNF Band 1 vs. MNF Band 2).

Figure 4. Classification using the MML learning criterion (MNF Band 1 vs. MNF Band 2; legend: outliers, Class 1, Class 2, Class 3, misclassification).

Figure 5. Classification using the DCA learning criterion (MNF Band 1 vs. MNF Band 2; legend: outliers, Class 1, Class 2, Class 3, misclassification).


Table 1. Classification performance comparisons between the DCA learning criterion and the MML learning criterion.

Training Data    Classification (%)      Misclassification (# of pixels)
Utilized (%)     MML        DCA          MML        DCA
55               68.7691    69.8148      1087       1054
60               73.5235    74.6569       957        970
65               72.6667    73.5882       647        606
70               74.54      75.1457      1230        866
75               75.1541    74.3254       857        752
80               75.6765    75.5392       404        468

6 Conclusions

In this paper, we evaluated the performance of two PDF mixture learning criteria for the reduced dimensionality material classification of Hyperspectral remote sensing data. The results show that the two methods have nearly identical classification performance. The outcome of this paper presents a possible integration of advanced data analysis and modeling tools for scientists, advancing the state of the practice in the utilization of satellite image data in various types of Earth System Science studies.

References

[1] D. G. Manolakis and G. Shaw, "Detection Algorithms for Hyperspectral Imaging Applications," IEEE Signal Processing Magazine, vol. 19, no. 1, January 2002, ISSN 1053-5888.

[2] R. N. Clark, A. J. Gallagher, and G. A. Swayze, "Material absorption band depth mapping of imaging spectrometer data using a complete band shape least-squares fit with library reference spectra," Proceedings of the Second Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Workshop, JPL Publication 90-54, pp. 176-186, 1990.

[3] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: M. Dekker, 1988.

[4] G. J. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.

[5] A. A. Green, M. Berman, P. Switzer and M. D. Craig, "A transformation for ordering multispectral data in terms of image quality with implications for noise removal," IEEE Transactions on Geoscience and Remote Sensing, vol. 26, no. 1, pp. 65-74, 1988.

[6] N. Vlassis and A. Likas, "A kurtosis-based dynamic approach to Gaussian mixture modeling," IEEE Transactions on Systems, Man and Cybernetics, vol. 29, pp. 393-399, 1999.

[7] N. Vlassis and A. Likas, "A greedy EM algorithm for Gaussian mixture learning," Neural Processing Letters, vol. 15, pp. 77-87, 2002.

[8] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.

[9] D. A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing. Wiley Inter-Science, 2003, ISBN 0-471-42028-X.

[10] B. S. Everitt and D. J. Hand, Finite Mixture Distributions. London: Chapman and Hall, 1981.

[11] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2001.

[12] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.

[13] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 3, pp. 381-396, March 2002.

[14] A. K. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall, 1988.

[15] B. H. Juang, "Maximum likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.

[16] L. Liporace, "Maximum likelihood estimation for multivariate observations of Markov sources," IEEE Transactions on Information Theory, vol. 28, no. 5, pp. 729-734, 1982.

[17] A. K. Jain, R. Duin and J. Mao, "Statistical pattern recognition: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-38, 2000.
