
Dynamic mixing kernels in Gaussian Mixture Classifier for Hyperspectral Classification

Vikram Jayaram & Bryan Usevitch

Dept. of Electrical & Computer Engineering

The University of Texas at El Paso, 500 W. University Ave., El Paso, TX 79968-0523

ABSTRACT

In this paper, new Gaussian mixture classifiers are designed to deal with the case of an unknown number of mixing kernels. Not knowing the true number of mixing components is a major learning problem for a mixture classifier using expectation-maximization (EM). To overcome this problem, the training algorithm uses a combination of covariance constraints and dynamic pruning, splitting, and merging of the kernels of the Gaussian mixture to automate the learning process correctly. This structural learning of Gaussian mixtures is employed to model and classify hyperspectral imagery (HSI) data. The results from the HSI experiments suggest that this new methodology is a viable alternative to traditional mixture-based modeling and classification using general EM.

Keywords: Hyperspectral imagery (HSI), Gaussian mixture model (GMM), EM, Classification, Kurtosis, PCA.

1. INTRODUCTION

The complexity involved in collecting, storing, analyzing, and processing voluminous, multi-dimensional remote sensing data for an array of “Earth System Science” applications is a well-known problem in the remote sensing community. Recent HSI technology has evolved from its earlier counterpart, multispectral imaging (MSI).1, 2 In HSI, images are acquired using hundreds of spectral channels, compared to far fewer channels in MSI. However, over the years there has been little significant development of processing algorithms for these ever-growing (in the spectral direction) electro-optical (EO) data sets. The need for reduced-dimensionality processing algorithms3 is even stronger due to the increased spectral dimensionality of remote sensing data. In most remote sensing EO imagery, each spatial pixel is treated as a column vector that contains spectral information from each channel. On several occasions mixture-model-based approaches have been justified for processing voluminous data. In this paper, we show an instance of training a Gaussian mixture classifier using dynamic kernel carpentry to model voluminous data such as HSI.

The Gaussian mixture model (GMM) is a standard technique for estimating unknown probability density functions (PDFs). Even though the merit of the GMM lies in closely approximating most naturally occurring random processes,4, 5 there is a learning difficulty when using EM to estimate its model parameters. To ensure proper learning by EM, dynamic allocation of Gaussian kernels is used to fit the HSI data. This model estimates an unknown PDF under the assumption that the unknown density can be expressed as a weighted finite sum of Gaussian kernels. These Gaussian kernels have different mixing weights and parameters: means and covariance matrices. Updating the mixture parameters is carried out by the EM algorithm while also monitoring the total kurtosis, which triggers kernel splitting (an increase in the number of kernels). Therefore, this technique ensures not only likelihood maximization but also kurtosis minimization. Kernel splitting halts when no further improvement in the minimization of kurtosis seems possible. Similarly, the other steps of this training methodology, pruning (destroying the weak kernels), merging of kernels, and determining whether the algorithm has converged, are carried out in a step-by-step fashion. The results of this

Further author information: (Send correspondence to Vikram Jayaram) V. Jayaram: E-mail: [email protected], Telephone: 1 915 747 5869

Mathematics of Data/Image Pattern Recognition, Compression, and Encryption with Applications XI, edited by Mark S. Schmalz, Gerhard X. Ritter, Junior Barrera, Jaakko T. Astola, Proc. of SPIE Vol. 7075, 70750L, (2008)

0277-786X/08/$18 · doi: 10.1117/12.798443



training are indicated by means of receiver operating characteristic (ROC) curves. This structural learning6 based training technique uses relatively few kernels to estimate the model parameters of the GMM and has a fast converging property.

2. GAUSSIAN MIXTURE MODELS AND EM ALGORITHM

Multidimensional data such as HSI can be modeled by a multidimensional Gaussian mixture (GM). Normally, the GM in the form of the PDF for z ∈ R^P is given by

p(z) = \sum_{i=1}^{L} \alpha_i \, \mathcal{N}(z, \mu_i, \Sigma_i),

where

\mathcal{N}(z, \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{P/2} |\Sigma_i|^{1/2}} \, e^{-\frac{1}{2}(z - \mu_i)' \Sigma_i^{-1} (z - \mu_i)}.

Here L is the number of mixture components and P is the number of spectral channels (bands). The GM parameters are denoted by λ = {α_i, µ_i, Σ_i}, where α_i, µ_i, and Σ_i are the mixing weight, mean, and covariance of the individual components of the mixture model. The parameters of the GM are estimated using maximum likelihood (ML) by means of the EM algorithm.7
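As a concrete illustration (not part of the original paper), a minimal sketch of evaluating this mixture density, assuming NumPy/SciPy and our own function and variable names:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(z, alphas, mus, sigmas):
    """Evaluate p(z) = sum_i alpha_i * N(z; mu_i, Sigma_i) for a single vector z."""
    return sum(a * multivariate_normal.pdf(z, mean=mu, cov=S)
               for a, mu, S in zip(alphas, mus, sigmas))
```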

Given sample vectors Z = {z_1, z_2, ..., z_T} and the parameter set λ, the likelihood of the GMM is given by

p(Z|\lambda) = \prod_{t=1}^{T} p(z_t|\lambda). (1)

ML estimation then finds a new parameter model λ̂ such that p(Z|λ̂) ≥ p(Z|λ). Due to the nonlinear behavior of the likelihood in λ given in (1), a straightforward maximization of the function is not viable. The maximization therefore proceeds iteratively using EM.7 In the EM algorithm, we use the auxiliary function Q given by

Q(\lambda, \hat{\lambda}) = \sum_{t=1}^{T} \sum_{i=1}^{L} p(i|z_t, \lambda) \log[\hat{\alpha}_i \mathcal{N}(z_t, \hat{\mu}_i, \hat{\Sigma}_i)], (2)

where p(i|z_t, λ) is the a posteriori probability of mixture component i, with i = 1, ..., L, and satisfies

p(i|z_t, \lambda) = \frac{\alpha_i \mathcal{N}(z_t, \mu_i, \Sigma_i)}{\sum_{k=1}^{L} \alpha_k \mathcal{N}(z_t, \mu_k, \Sigma_k)}. (3)

The EM algorithm is such that if Q(λ, λ̂) ≥ Q(λ, λ), then p(Z|λ̂) ≥ p(Z|λ).8 After setting the derivatives of the Q function with respect to λ̂ to zero, the re-estimation formulas are as follows:8

\hat{\alpha}_i = \frac{1}{T} \sum_{t=1}^{T} p(i|z_t, \lambda), (4)

\hat{\mu}_i = \frac{\sum_{t=1}^{T} p(i|z_t, \lambda) \, z_t}{\sum_{t=1}^{T} p(i|z_t, \lambda)}, (5)

\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} p(i|z_t, \lambda) (z_t - \hat{\mu}_i)(z_t - \hat{\mu}_i)'}{\sum_{t=1}^{T} p(i|z_t, \lambda)}. (6)

The algorithm for training the GMM is summarized as follows (a minimal code sketch of this loop appears after the list):

• Generate the a posteriori probabilities \sum_{t=1}^{T} p(i|z_t, \lambda) based on the proposed method (explained further below), satisfying (4).

• Compute the mixing weight, mean vector, and covariance matrix by means of (4), (5), and (6).


• Update the a posteriori probabilities \sum_{t=1}^{T} p(i|z_t, \lambda) according to (3), followed by computation of the Q function using (2).

• Stop if the increase in the value of the Q function at the current iteration relative to its value at the previous iteration is less than a chosen threshold; otherwise go to step 2.
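To make the loop above concrete, the following is a minimal sketch of one EM pass implementing (3)-(6); it is an illustration under our own assumptions (NumPy/SciPy, samples stored as rows of Z), not the authors' implementation. In practice this pass would be repeated until the increase in Q (or in the log-likelihood) falls below the chosen threshold, as in the last step above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_iteration(Z, alphas, mus, sigmas):
    """One EM pass: E-step responsibilities via Eq. (3), M-step updates via (4)-(6)."""
    T, L = Z.shape[0], len(alphas)
    # E-step: responsibilities p(i | z_t, lambda)
    resp = np.empty((T, L))
    for i in range(L):
        resp[:, i] = alphas[i] * multivariate_normal.pdf(Z, mean=mus[i], cov=sigmas[i])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, covariances
    Nk = resp.sum(axis=0)
    new_alphas = Nk / T                              # Eq. (4)
    new_mus = (resp.T @ Z) / Nk[:, None]             # Eq. (5)
    new_sigmas = []
    for i in range(L):
        d = Z - new_mus[i]
        new_sigmas.append((resp[:, i, None] * d).T @ d / Nk[i])   # Eq. (6)
    return new_alphas, new_mus, new_sigmas
```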

3. TRAINING THE MIXTURE MODEL

In spite of the robust design of the GMM, there are challenges in training a GM with a local algorithm like EM. First of all, the true number of mixture components is usually unknown. Not knowing the true number of mixing components is a major learning problem for a mixture classifier using EM.5, 9

The solution to this problem is a dynamic algorithm for Gaussian mixture density estimation that can dynamically add and remove kernel components to adequately model the input data. This methodology also increases the chances of escaping the many local maxima of the likelihood function.10 In a method proposed by N. Vlassis and A. Likas,11 called the greedy EM algorithm, GM training begins with a single-component mixture. Components (modes) are then added sequentially until the likelihood stops increasing or the incrementally computed mixture is almost as good as any mixture of that form. This incremental mixture density function uses a combination of global11 and local11 search techniques each time a new kernel component is added to the mixture.

If the number of mixture components becomes too high, components are pruned out depending upon the value of the mixing weight α_i. This procedure ensures removal of weak modes from the mixture. A weak mode is identified by checking α_i against a certain threshold; once identified, the weak modes are removed. The remaining α_i are then re-normalized such that \sum_i \alpha_i = 1.
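A minimal sketch of this pruning step; the threshold value and the array layout are our own assumptions, not taken from the paper:

```python
import numpy as np

def prune_weak_modes(alphas, mus, sigmas, weight_floor=1e-3):
    """Drop modes whose mixing weight falls below a threshold, then re-normalize."""
    alphas = np.asarray(alphas, dtype=float)
    keep = alphas > weight_floor
    alphas = alphas[keep]
    alphas /= alphas.sum()                      # enforce sum_i alpha_i = 1
    mus = [m for m, k in zip(mus, keep) if k]
    sigmas = [s for s, k in zip(sigmas, keep) if k]
    return alphas, mus, sigmas
```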

Merging of kernel components is another process in this training scheme, wherein a single mode is created from two nearly identical ones. The similarity between mixture modes is given by a metric d. For example, consider two PDFs p_1(x) and p_2(x). Let there be a collection of points near the central peak of p_1(x), represented by x_i ∈ X_1, and another set of points near the central peak of p_2(x), denoted by x_i ∈ X_2. The closeness metric d is then given by

d = \log \left\{ \frac{\prod_{x_i \in X_1} p_2(x_i) \, \prod_{x_i \in X_2} p_1(x_i)}{\prod_{x_i \in X_1} p_1(x_i) \, \prod_{x_i \in X_2} p_2(x_i)} \right\}. (7)

Notice that d = 0 when p_1(x) = p_2(x) and d < 0 for p_1(x) ≠ p_2(x). A pre-determined threshold is set to determine whether two modes are too close. If the metric for two modes falls below this threshold, they are merged to form a weighted sum of the two modes. The mean of the newly merged kernel component is computed as10

\mu = \frac{\alpha_1 \mu_1 + \alpha_2 \mu_2}{\alpha_1 + \alpha_2}.

A similar weighted analogy cannot be applied when merging covariances, as it does not take into account the separation of the means. Instead of computing Σ_i directly, we consider its Cholesky factors,7 which are multiplied by the respective weights \sqrt{\alpha_1/(\alpha_1+\alpha_2)} and \sqrt{\alpha_2/(\alpha_1+\alpha_2)} to obtain the merged covariance.11
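The sketch below illustrates the closeness test of Eq. (7) and one plausible reading of the weighted Cholesky-factor merge; the paper does not spell out the factor handling, so the QR-based, moment-matched combination shown here is our own assumption rather than the authors' exact procedure:

```python
import numpy as np

def closeness_metric(p1, p2, X1, X2):
    """Closeness metric d of Eq. (7); p1 and p2 are callables returning the
    two mode PDFs, X1/X2 are points near their respective central peaks."""
    num = np.sum(np.log([p2(x) for x in X1])) + np.sum(np.log([p1(x) for x in X2]))
    den = np.sum(np.log([p1(x) for x in X1])) + np.sum(np.log([p2(x) for x in X2]))
    return num - den          # log of the product ratio; 0 when p1 == p2

def merge_modes(a1, mu1, S1, a2, mu2, S2):
    """Merge two modes: weighted mean as in the text; covariance from stacked
    weighted Cholesky factors plus mean offsets (moment matching, assumed)."""
    a = a1 + a2
    w1, w2 = a1 / a, a2 / a
    mu = w1 * mu1 + w2 * mu2
    L1, L2 = np.linalg.cholesky(S1), np.linalg.cholesky(S2)
    B = np.vstack([np.sqrt(w1) * L1.T,
                   np.sqrt(w2) * L2.T,
                   np.sqrt(w1) * (mu1 - mu)[None, :],
                   np.sqrt(w2) * (mu2 - mu)[None, :]])
    R = np.linalg.qr(B, mode='r')   # re-triangularize the stacked factor
    return a, mu, R.T @ R           # merged weight, mean, covariance
```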

On the other hand, if the number of mixture components is insufficient, components are split in order to increase the total number of components. Vlassis et al.12 define a method that monitors the weighted kurtosis of each mode, which directly determines the number of mixture components. This kurtosis measure is given by

K_i = \frac{\sum_{n=1}^{N} w_{n,i} \left( \frac{z_n - \mu_i}{\sqrt{\Sigma_i}} \right)^4}{\sum_{n=1}^{N} w_{n,i}} - 3,

where

w_{n,i} = \frac{\mathcal{N}(z_n, \mu_i, \Sigma_i)}{\sum_{n=1}^{N} \mathcal{N}(z_n, \mu_i, \Sigma_i)}.


Figure 1. The scene on the left is a 1995 AVIRIS image of the Cuprite field in Nevada. The figure on the right is the 2D scatter plot of the first two components after PCA rotation (axes: PCA 1 vs. PCA 2).

If |K_i| is too high for any mode i, then the mode is split into two. This can be extended to higher dimensions by considering skew in addition to kurtosis, where each data sample z_n is projected onto the j-th principal axis of Σ_i in turn. Let z^j_{n,i} = (z_n − µ_i)' v_{ij}, where v_{ij} is the j-th column of V obtained from the SVD of Σ_i (this step also conditions the covariances, which is necessary to prevent the covariance matrices from becoming singular). Therefore, for each j,

K_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( \frac{z^j_{n,i}}{s_i} \right)^4}{\sum_{n=1}^{N} w_{n,i}} - 3,

\psi_{i,j} = \frac{\sum_{n=1}^{N} w_{n,i} \left( \frac{z^j_{n,i}}{s_i} \right)^3}{\sum_{n=1}^{N} w_{n,i}},

m_{i,j} = |K_{i,j}| + |\psi_{i,j}|,

where

s_i^2 = \frac{\sum_{n=1}^{N} w_{n,i} (z^j_{n,i})^2}{\sum_{n=1}^{N} w_{n,i}}.

Now, if m_{i,j} > τ (a threshold) for any j, mode i is split. The mode is split by creating mixture components at µ = µ_i + v_{i,j} S_{i,j} and µ = µ_i − v_{i,j} S_{i,j}, where S_{i,j} is the j-th singular value of Σ_i. The same covariance Σ_i is used for each new mode. As mentioned earlier, the decision to split also depends upon the mixing weight α_i: splitting does not take place if the value of α_i is too small. Finally, once the number of modes settles out, the likelihood stops increasing and convergence is achieved. This combination of covariance constraints, mode pruning, merging, and splitting can result in a good PDF approximation by the mixture models.
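A minimal sketch of this per-axis kurtosis/skew split test for a single mode; the threshold value, the halving of the mixing weight between the two new modes, and the function and variable names are our own assumptions:

```python
import numpy as np

def split_mode_if_needed(Z, w_i, alpha_i, mu_i, sigma_i, tau=1.0, min_weight=0.05):
    """Kurtosis/skew split test along each principal axis of Sigma_i; returns a
    list with one (unchanged) or two (split) (alpha, mu, Sigma) triples."""
    if alpha_i < min_weight:                         # too weak a mode to split
        return [(alpha_i, mu_i, sigma_i)]
    U, S, Vt = np.linalg.svd(sigma_i)
    wsum = w_i.sum()
    for j in range(len(S)):
        v = Vt[j]                                    # j-th principal axis
        zj = (Z - mu_i) @ v                          # projection z^j_{n,i}
        s = np.sqrt((w_i * zj**2).sum() / wsum)
        K = (w_i * (zj / s)**4).sum() / wsum - 3.0   # kurtosis K_{i,j}
        psi = (w_i * (zj / s)**3).sum() / wsum       # skew psi_{i,j}
        if abs(K) + abs(psi) > tau:                  # m_{i,j} > tau: split mode i
            return [(alpha_i / 2, mu_i + v * S[j], sigma_i.copy()),
                    (alpha_i / 2, mu_i - v * S[j], sigma_i.copy())]
    return [(alpha_i, mu_i, sigma_i)]
```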

4. EXPERIMENTS

To demonstrate the robustness of this learning scheme, we run the model experiments on high-dimensional HSI data. The remote sensing data sets used in our experiments come from imagery derived from the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. AVIRIS is a unique optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral channels (also called bands) with wavelengths from 0.4-2.5 µm. AVIRIS is flown across the US, Canada, and Europe. Figure 1 shows the data sets


Figure 2. 2D scatter plot of the data and the Gaussian mixture model after the convergence achieved by the EM algorithm (axes: PCA 1 vs. PCA 2).

used in our experiments, which belong to the 1995 Cuprite field scene in Nevada. Since HSI imagery is highly correlated in the spectral direction, principal component analysis (PCA) rotation is an obvious choice for decorrelation among the bands.13, 14 The 2D scatter plot of the first two principal components of the data is shown in Figure 1. The scatter plots used in this paper are similar to the marginalized PDF on any 2D plane. Marginalization is easily carried out for Gaussian mixtures.10 Let z = [z_1, z_2, z_3, z_4]. For example, to visualize on the (z_2, z_4) plane, we would need to compute

p(z_2, z_4) = \int_{z_1} \int_{z_3} p(z_1, z_2, z_3, z_4) \, dz_1 \, dz_3.
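Because every marginal of a Gaussian is again a Gaussian over the corresponding sub-vector and sub-matrix, the marginal of a GM keeps the same mixing weights. A minimal sketch of this (dimension indices and names are our own, not from the paper):

```python
import numpy as np

def marginalize_gmm(alphas, mus, sigmas, dims=(1, 3)):
    """Marginal GMM over the selected dimensions, e.g. (z2, z4) -> dims=(1, 3)."""
    idx = np.asarray(dims)
    mus_m = [mu[idx] for mu in mus]
    sigmas_m = [S[np.ix_(idx, idx)] for S in sigmas]
    return alphas, mus_m, sigmas_m   # mixing weights are unchanged
```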

This utility is very useful when visualizing high-dimensional PDFs. With the given HSI data, the next step is to train the mixture model. Training consists of five operations, as mentioned earlier: running the EM algorithm, pruning and merging the components, splitting the components if necessary, and finally determining whether the algorithm has converged based on the likelihood estimates of the parameters at the end of each iteration. Aspects that are critical for training are the covariance constraints, the minimum individual component weight used in pruning, the thresholds that determine whether two components should be merged or a component split, and the criterion for determining whether convergence has taken place. Figure 3 shows one-dimensional PDF plots of the two PCA components; notice the marginal PDFs compared to the histograms.

During training the log-likelihood would increase monotonically, were it not for the pruning, splitting, and merging operations. Figure 2 shows the Gaussian mixture approximation after convergence; approximately ten components were derived by EM to characterize the GMM. The GM model parameters obtained as a result of the structural learning are now used to build a classifier. Figure 4 shows a synthetically simulated second class of data added to the existing input data. We build a classifier using Gaussian mixtures by training a second parameter set on the newly added class, using the same learning scheme. Figure 5 shows the result after the model converges to the parameters of the second class. This is followed by computing the log-likelihood of the test data. The performance of the classifier is evaluated using the ROC curve shown in Figure 6. The response of the ROC curve clearly supports the robustness of the proposed classification-learning scheme.
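A minimal sketch of the two-class decision implied here: each test pixel is scored by the difference in log-likelihood under the two trained GMMs, and sweeping a threshold over that score yields the (Pfa, Pd) pairs of the ROC curve. The gmm_pdf helper is the hypothetical one sketched in Section 2, and the remaining names are likewise our own:

```python
import numpy as np

def log_likelihood_ratio(Z, model_a, model_b):
    """Per-pixel log p(z | class A) - log p(z | class B) for two trained GMMs,
    each given as a tuple (alphas, mus, sigmas)."""
    lla = np.array([np.log(gmm_pdf(z, *model_a)) for z in Z])
    llb = np.array([np.log(gmm_pdf(z, *model_b)) for z in Z])
    return lla - llb
```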


Figure 3. One-dimensional PDF plots of the two PCA components. The marginal PDFs are compared to the histograms.

Figure 4. Addition of the second (yellow) class to the original data (axes: FEATURE A vs. FEATURE B).


Figure 5. Trained GM approximation of the second class (axes: PCA 1 vs. PCA 2).

Figure 6. ROC curve (Pd vs. Pfa) for the two-class problem.


5. CONCLUSIONS AND FUTURE WORK

In this paper, we proposed the use of Gaussian mixture models that employ a structural learning scheme for modeling and classification of hyperspectral imagery. The traditional learning technique employing general EM has serious drawbacks: there is no generally accepted method for parameter initialization, the number of mixture components needed to adequately model the input data is unknown, and the model can become stuck in one of the many local maxima of the likelihood function. These drawbacks are well addressed by the proposed structural learning scheme. The ROC curve in our experiments is used as a general diagnostic tool to evaluate classification; clearly, the ROC depicted a high probability of detection at low false alarm rates. The GMM in conjunction with structural learning is well equipped to model and classify HSI data. These models provide sufficient flexibility to adjust to several distributions with lower variances, a trait of the GMM that is particularly appreciated in image classification applications, since it reduces misclassification. As future work, we intend to explore and equip current state-of-the-art statistical classifiers with better training schemes for practical HSI applications.

ACKNOWLEDGMENTS

This work was supported by a NASA Earth System Science Fellowship grant. The authors would also like to thank the Department of Geological Sciences at UTEP for providing access to the ENVI software.

REFERENCES

[1] Landgrebe, D. A., [Signal Theory Methods in Multispectral Remote Sensing], Wiley Inter-Science, Hoboken, NJ, second ed. (2003).

[2] Schowengerdt, R. A., [Remote Sensing Models & Methods for Image Processing], Academic Press, Burlington, MA, seventh ed. (1997).

[3] Keshava, N., “Distance metrics & band selection in hyperspectral processing with applications to material identification & spectral libraries,” IEEE Transactions on Geoscience and Remote Sensing 42, 1552–1565 (2004).

[4] Redner, R. and Walker, H., “Mixture densities, maximum likelihood and the EM algorithm,” SIAM Review 26, 195–239 (1984).

[5] Duda, R. O., Hart, P. E., and Stork, D. G., [Pattern Classification], John Wiley and Sons, New York, NY, seventh ed. (2001).

[6] Baggenstoss, P. M., “Structural learning for classification of high dimensional data,” Proc. of International Conference on Intelligent Systems and Semantics NIST, 124–129 (1997).

[7] Moon, T. and Stirling, W., [Mathematical Methods and Algorithms for Signal Processing], Prentice Hall, Upper Saddle River, NJ (2000).

[8] Rabiner, L., “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. of IEEE 77, 257–286 (1989).

[9] McLachlan, G. and Peel, D., [Finite Mixture Models], Wiley Series in Probability and Statistics, New York, NY, second ed. (2000).

[10] Hastie, T., Tibshirani, R., and Friedman, J., [The Elements of Statistical Learning], Springer-Verlag, New York, NY (1994).

[11] Vlassis, N. and Likas, A., “A greedy EM for Gaussian mixture learning,” Neural Processing Letters 15, 77–87 (2002).

[12] Vlassis, N. and Likas, A., “A kurtosis-based dynamic approach to Gaussian mixture modeling,” IEEE Transactions on Systems, Man and Cybernetics 29, 393–399 (1999).

[13] Jayaram, V., Usevitch, B., and Kosheleva, O., “Detection from Hyperspectral images compressed using rate distortion and optimization techniques under JPEG2000 part 2,” IEEE 11th DSP Workshop and 3rd SPE Workshop, CD-ROM, 195–239 (2004).

[14] Shaw, G. and Manolakis, D., “Signal processing for Hyperspectral image exploitation,” IEEE Signal Processing Magazine 19, 12–16 (2002).
