
A Novel Gaussianized Vector Representation for Natural Scene Categorization

Xi Zhou, Xiaodan Zhuang, Hao Tang, Mark Hasegawa-Johnson, and Thomas S. Huang
Beckman Institute, Department of Electrical and Computer Engineering

University of Illinois at Urbana-Champaign, USA
{xizhou2,xzhuang2,haotang2,jhasegaw}@uiuc.edu, [email protected]

Abstract

This paper presents a novel Gaussianized vector representation for scene images obtained by an unsupervised approach. First, each image is encoded as an orderless bag of local features, and a global Gaussian Mixture Model (GMM) learned from all images is used to randomly distribute each feature into one Gaussian component by a multinomial trial. The parameters of the multinomial distribution are defined by the posteriors of the feature on all the Gaussian components. Finally, the normalized means of the features distributed into each Gaussian component are concatenated to form a super-vector, which is a compact representation of each scene image. We prove that these super-vectors observe the standard normal distribution. Our experiments on scene categorization tasks using this vector representation show significantly improved performance compared with the bag-of-features representation.

1. Introduction

More and more research attention has been devoted to the automatic analysis and recognition of natural scenes (forest, street, office, etc.). Most previous approaches to this problem are based either on supervised segmentation as preprocessing or on the manual annotation of "intermediate" properties [1, 2, 3]. These properties might be considered as classes of texture information. In recent years, bag-of-features methods, which represent an image as an orderless collection of local features, have demonstrated good performance on whole-image categorization tasks [4, 5, 6, 7]. Furthermore, Lazebnik et al. proposed spatial pyramid matching for scene categorization in order to exploit spatial information beyond the bag-of-features image representation [8].

A common problem in scene categorization, and in computer vision in general, is to find correspondence, i.e., to match corresponding feature points between images. Many dimensionality reduction methods, including global linear transformations such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), and manifold learning methods such as Locally Linear Embedding (LLE) and Locality Preserving Projections (LPP), need well-corresponded feature points to seek a meaningful low-dimensional subspace. For classifiers based on certain distance metrics in the feature space, such as nearest neighbor, correspondence is also critical. The challenges of finding correspondence are at least two-fold. First, feature sets from different images are distorted by unknown transforms; this can be partially offset by adopting features relatively invariant to such transforms, such as the SIFT descriptor. Second, the order of the feature points is partially, if not completely, unknown.

Scene categorization presents extra challenges for correspondence compared to some other tasks. For example, in facial information recognition, such as face recognition and age or head pose estimation, it is possible, although challenging, to obtain correspondence by global alignment of face images [10, 11]. Therefore, it is natural to represent an aligned face image as a vector of features (ordered by their spatial location) for appearance-based methods. The features for scene categorization, however, are more complicated. Global alignment provides limited, if any, correspondence, because the inhomogeneous regions of scene images remain misaligned. This calls for a robust and geometrically invariant structural scene image representation.

On the other hand, many dimensionality reduction algorithms, such as PCA, implicitly assume that the features observe a Gaussian distribution. Moreover, many classifiers are based on the Euclidean distance. Therefore, it is desirable to have a vector-based image representation that observes the standard normal distribution.

In this paper, we present a novel method to transform scene images into correspondent and normalized feature vectors using an unsupervised approach. First, each image is encoded as an orderless bag of local features. The features from all the images are used to learn a global Gaussian Mixture Model (GMM). This GMM, with M Gaussian components, is used to randomly distribute each feature into one of the M classes by a multinomial trial. That is, we calculate the posterior probabilities of the feature on all the Gaussian components and use these posterior probabilities as the parameters of the multinomial trial. Finally, the normalized means of the features distributed into each class are concatenated to form a super-vector, which is a compact representation of each scene image. We prove that these feature vectors observe the standard normal distribution. Our experiments on scene categorization tasks show that, on the same data set, our feature representation achieves much better performance than the bag-of-features representation with probabilistic latent semantic analysis (pLSA) [9].

2. Correspondence and Gaussianization

An image can be represented by a sequence of N patches denoted by x = (x_1, x_2, ..., x_N), where x_k is the k-th patch of the image. In this paper, the basic idea of corresponding the scene images is to soft-cluster the image patches in an unsupervised manner. This is implemented through the use of a GMM. The GMM is one of the most widely used models for large-scale density estimation, as it can approximate any complicated distribution given enough Gaussian components, and it can be effectively estimated through the expectation-maximization (EM) algorithm. The distribution of the image patches is modeled by a GMM of the form

p_X(x) = \sum_{k=1}^{M} w_k \, \mathcal{N}(x; \mu_k, \Sigma_k),    (1)

where w_k, µ_k and Σ_k denote the weight, mean vector and covariance matrix of the k-th Gaussian component, respectively, and M denotes the total number of Gaussian components.

This mixture density is a weighted linear combination of M unimodal Gaussian densities, namely,

\mathcal{N}(x; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right).    (2)

The posterior probability that an observed patch x comes from the k-th mixture component is given by

\gamma_k(x) = \frac{w_k \, \mathcal{N}(x; \mu_k, \Sigma_k)}{\sum_{j=1}^{M} w_j \, \mathcal{N}(x; \mu_j, \Sigma_j)}.    (3)

Now we randomly distribute the observed patches into M classes according to their posterior probabilities. That is, for a specific patch x, let γ = [γ_1(x), γ_2(x), ..., γ_M(x)]^T be the parameters of a multinomial distribution Mult(γ), and introduce an M-dimensional binary random variable η ∼ Mult(γ). η has a 1-of-M representation in which a particular element η_k is equal to 1 and all other elements are equal to 0. η_k = 1 indicates that x is assigned to the k-th class. Obviously, p(η_k = 1 | x) = γ_k(x) and p(η_k = 1) = w_k.
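
As a concrete illustration (a minimal sketch, not the authors' implementation), the posteriors of Equation (3) and the 1-of-M multinomial assignment can be computed with an off-the-shelf GMM; the use of scikit-learn, diagonal covariances, and the toy array sizes below are our assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical pool of d-dimensional patch descriptors from all training images.
all_patches = rng.standard_normal((5000, 64))

M = 16  # kept small for this sketch; the paper uses 512 or 1024 components

# Global GMM estimated with the EM algorithm.
gmm = GaussianMixture(n_components=M, covariance_type="diag", random_state=0)
gmm.fit(all_patches)

# Posteriors gamma_k(x) of Equation (3) for the patches of one image.
image_patches = all_patches[:200]            # assumption: one image's patches
gamma = gmm.predict_proba(image_patches)     # shape (N, M); each row sums to 1

# 1-of-M assignment: eta ~ Mult(gamma) drawn independently for each patch.
assignments = np.array([rng.choice(M, p=g) for g in gamma])
```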

Then the probability density function of the random variable Y_k, which denotes the samples in the k-th class, is given by

p_{Y_k}(a) = p_{X \mid \eta}(a \mid \eta_k = 1)    (4)
           = \frac{p(\eta_k = 1 \mid a) \, p_X(a)}{p(\eta_k = 1)}    (5)
           = \frac{\gamma_k(a) \, p_X(a)}{w_k}    (6)
           = \mathcal{N}(a; \mu_k, \Sigma_k).    (7)

Now assume one scene image has N patches. We separate them into M classes according to the above strategy. Let the number of samples in the k-th class be denoted by n_k, k = 1, 2, ..., M. We have

Z_k = \frac{1}{n_k} \sum_{i=1}^{n_k} (Y_{ki} - \mu_k),    (8)

where Y_{ki} is the i-th sample in the k-th class. Since the Y_{ki}'s are i.i.d., Z_k ∼ N(0, Σ_k/n_k). We further form a super-vector

Z = \begin{bmatrix} \left( \frac{\Sigma_1}{n_1} \right)^{-\frac{1}{2}} Z_1 \\ \left( \frac{\Sigma_2}{n_2} \right)^{-\frac{1}{2}} Z_2 \\ \vdots \\ \left( \frac{\Sigma_M}{n_M} \right)^{-\frac{1}{2}} Z_M \end{bmatrix},    (9)

and it is straightforward to show that Z ∼ N(0, I). Z is a compact representation of a scene image which observes the standard normal distribution.
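
The following sketch (ours, continuing the hypothetical variables from the previous snippet and assuming diagonal covariances, so that the whitening in Equation (9) reduces to an element-wise scaling) shows how the super-vector Z could be assembled:

```python
import numpy as np

def supervector(image_patches, assignments, gmm):
    """Assemble the Gaussianized super-vector Z of Equations (8)-(9).

    Assumes diagonal covariances, i.e. gmm.covariances_ has shape (M, d).
    """
    M, d = gmm.means_.shape
    Z = np.zeros(M * d)
    for k in range(M):
        members = image_patches[assignments == k]
        n_k = len(members)
        if n_k == 0:
            continue  # an empty component contributes a zero block
        # Equation (8): normalized mean deviation of the patches in class k.
        Z_k = (members - gmm.means_[k]).sum(axis=0) / n_k
        # Equation (9): scale by (Sigma_k / n_k)^(-1/2) (diagonal case).
        Z[k * d:(k + 1) * d] = Z_k / np.sqrt(gmm.covariances_[k] / n_k)
    return Z

Z = supervector(image_patches, assignments, gmm)  # ~ N(0, I) under the model
```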

3. Patch representation

In this section, we briefly describe the feature extraction process.

Figure 1. Example images from the scene category database (suburb, coast, forest, highway, inside city, mountain, open country, street, tall building, office, bedroom, kitchen).

According to the study by Fei-Fei and Perona [7], the dense regular grid works better than other detectors for scene recognition. In this work, we therefore choose the dense regular grid as our detector. More precisely, we use an evenly sampled grid spaced at 4 x 4 pixels, on which patches of 30 x 30 pixels are extracted from the scene images.

Two different representations are adopted for describing a patch, namely raw features and SIFT descriptors. For the raw feature, we first resize a 30-by-30 patch to 6-by-6, remove the mean of the intensity values from the patch, normalize the intensities to unit variance, and finally apply the 2D discrete cosine transform to generate the feature vector. The SIFT descriptor is a 128-dimensional SIFT vector extracted from each 30-by-30 patch, whose dimensionality is reduced to 64 by PCA.
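
As an illustration, a minimal sketch of the raw-feature computation described above (our interpretation of the text; resizing via OpenCV and SciPy's DCT are tooling assumptions, not necessarily what the authors used):

```python
import numpy as np
import cv2
from scipy.fftpack import dct

def raw_feature(patch):
    """Raw feature for one 30x30 grayscale patch, as described in Section 3."""
    p = cv2.resize(patch.astype(np.float32), (6, 6))   # 30x30 -> 6x6
    p -= p.mean()                                      # remove mean intensity
    std = p.std()
    if std > 0:
        p /= std                                       # normalize to unit variance
    # 2D discrete cosine transform, flattened into a 36-dimensional vector.
    return dct(dct(p, axis=0, norm="ortho"), axis=1, norm="ortho").ravel()
```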

The raw features and SIFT descriptors are further transformed into super-vector representations by a GMM as described in Section 2, as sketched below.
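
A hypothetical end-to-end sketch of this pipeline for the raw features (ours; it reuses raw_feature and supervector_soft from the other snippets, and the toy image list is a placeholder):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dense_patches(image, grid=4, size=30):
    """Raw features on a dense grid spaced at `grid` pixels (see Section 3)."""
    h, w = image.shape
    return np.array([raw_feature(image[y:y + size, x:x + size])
                     for y in range(0, h - size + 1, grid)
                     for x in range(0, w - size + 1, grid)])

# Placeholder grayscale images; in practice these are the scene images.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, (120, 160)).astype(np.uint8) for _ in range(3)]
per_image = [dense_patches(img) for img in images]

# Fit the global GMM on the pooled features, then build one super-vector per image.
gmm_raw = GaussianMixture(n_components=16, covariance_type="diag", random_state=0)
gmm_raw.fit(np.vstack(per_image))
supervectors = np.array([supervector_soft(p, gmm_raw) for p in per_image])
```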

4. Experiments

In this section, we report our scene categorization experiment results on the scene category database [7]. The database contains 200 to 400 images per scene category, and the average image size is 300 x 250 pixels. The major sources of the images include the COREL collection, personal photographs, and Google image search. This database is one of the most comprehensive scene category databases used in the literature thus far [8]. Example images of the different scene categories are shown in Figure 1.

We perform all processing on grayscale images, even when color images are available. In the calculation, instead of explicitly carrying out the multinomial trial, we approximate the Z_k in Equation (9) as

Z_k \approx \frac{\sum_{i=1}^{N} \gamma_{ik} (x_i - \mu_k)}{\sum_{i=1}^{N} \gamma_{ik}},

where \gamma_{ik} = \gamma_k(x_i). All experiments are repeated ten times with 100 randomly selected images per class for training and the rest for testing. Our experiment settings are purposely made the same as those in [7] and [8] to guarantee a fair performance comparison. The average of the per-class recognition rates is recorded for each run. The final result is reported as the mean and standard deviation over the individual runs.
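
In code, this soft approximation could look like the sketch below (ours, reusing the hypothetical gmm from the earlier snippets; applying the whitening of Equation (9) with the soft count standing in for n_k is our assumption):

```python
import numpy as np

def supervector_soft(image_patches, gmm):
    """Soft-assignment approximation of the super-vector, no sampling needed."""
    gamma = gmm.predict_proba(image_patches)     # (N, M) posteriors gamma_ik
    M, d = gmm.means_.shape
    Z = np.zeros(M * d)
    for k in range(M):
        n_k = gamma[:, k].sum()                  # soft count for component k
        if n_k <= 0:
            continue
        # Z_k ~= sum_i gamma_ik (x_i - mu_k) / sum_i gamma_ik
        Z_k = (gamma[:, k:k + 1] * (image_patches - gmm.means_[k])).sum(axis=0) / n_k
        # Assumption: whiten as in Equation (9), with the soft count as n_k.
        Z[k * d:(k + 1) * d] = Z_k / np.sqrt(gmm.covariances_[k] / n_k)
    return Z
```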

We first use PCA to reduce the dimensionality of the transformed features Z to 1000 and then employ three different classifiers for multi-class classification, namely, k-nearest neighbor (KNN), nearest centroid (NC), and support vector machine (SVM). For the k-nearest neighbor classifier, we calculate the Euclidean distances from each image in the test set to all the images in the training set and assign to the test image the most frequent label among the k nearest training images, with k set to 10 in the experiments. For the nearest centroid classifier, we estimate the centroid of each class on the training set, calculate the Euclidean distances from each test image to all the centroids, and assign to the test image the label of the nearest centroid. For SVM, we use the LIBSVM software [12] to perform training and testing. The kernel used is

K(Z_a, Z_b) = \frac{Z_a^T Z_b}{\|Z_a\| \, \|Z_b\|}.    (10)
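
A brief sketch of training with this kernel (our illustration; it uses scikit-learn's SVC with a precomputed kernel matrix rather than the LIBSVM tools cited by the authors, and the array shapes are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

def cosine_kernel(A, B):
    """Normalized linear kernel of Equation (10): Za.T Zb / (||Za|| ||Zb||)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

# Hypothetical data: PCA-reduced super-vectors and class labels.
rng = np.random.default_rng(0)
Z_train, y_train = rng.standard_normal((130, 1000)), rng.integers(0, 13, 130)
Z_test = rng.standard_normal((20, 1000))

clf = SVC(kernel="precomputed")
clf.fit(cosine_kernel(Z_train, Z_train), y_train)
predictions = clf.predict(cosine_kernel(Z_test, Z_train))
```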

Table 1 shows the results of our scene categorization experiments. The global GMM for computing the super-vector Z for each scene image contains 512 or 1024 Gaussian components.

Table 1. Classification results on the scene category database [7].

                   Raw feature                           SIFT
Mixture Number     KNN          NC           SVM         KNN          NC           SVM
512                60.3 ± 0.8   66.5 ± 1.0   72.6 ± 0.9  73.6 ± 0.8   78.3 ± 0.7   83.6 ± 0.6
1024               61.5 ± 0.7   69.0 ± 0.9   74.4 ± 1.0  74.0 ± 0.6   78.7 ± 0.6   84.1 ± 0.5

Figure 2. Confusion table for the scene category database. The entry in the i-th row and j-th column is the percentage of images from class i that were misidentified as class j. (Categories: bedroom, suburb, kitchen, livingroom, coast, forest, highway, inside city, mountain, open country, street, tall building, office.)

Among the three classifiers, the SVM classifier achieves better performance than the other two. Notice that even with the raw features, the SVM classifier achieves an average recognition rate of 74.4%, which is much higher than the best recognition rate (65.2%) reported by Fei-Fei and Perona [7], obtained with SIFT and the bag-of-words method. With the SIFT descriptors and the SVM classifier, our system achieves a recognition rate of 84.1%, which outperforms the result of Lazebnik et al. [8], obtained with bags of features and spatial pyramid matching.

Figure 2 shows the confusion table for the thirteen scene categories. The highest recognition rate is 97%, achieved for the "open country" and "forest" categories.

5. Conclusion & Discussion

In this paper, we propose a Gaussianized vector representation for scene images, which represents a scene image as a super-vector observing the standard normal distribution. We apply various classification techniques to this representation and achieve significantly improved scene categorization performance compared with previous work. Our experiments show that this representation, even without considering spatial information, achieves much better performance than the bag-of-words approach of [7]; furthermore, it outperforms the bag-of-features representation with spatial pyramid matching [8].

References

[1] A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, vol. 12, pp. 97-136, 1980.

[2] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. Journal of Computer Vision, vol. 42, 2001.

[3] J. Vogel and B. Schiele. A semantic typicality measure for natural scene categorization. DAGM'04 Annual Pattern Recognition Symposium, 2004.

[4] C. Wallraven, B. Caputo, and A. Graf. Recognition with local features: the kernel recipe. ICCV, vol. 1, pp. 257-264, 2003.

[5] J. Willamowski, D. Arregui, G. Csurka, C. R. Dance, and L. Fan. Categorizing nine visual classes using local appearance descriptors. ICPR Workshop on Learning for Adaptable Visual Systems, 2004.

[6] K. Grauman and T. Darrell. Pyramid match kernels: Discriminative classification with sets of image features. ICCV, 2005.

[7] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. CVPR, 2005.

[8] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR, 2006.

[9] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, no. 41, pp. 177-196, 2001.

[10] W. Wang, S. Shan, W. Gao, B. Cao, and B. Yin. An improved active shape model for face alignment. Multimodal Interfaces, 2002.

[11] F. Jiao, S. Li, H. Y. Shum, and D. Schuurmans. Face alignment using statistical models and wavelet features. CVPR, 2003.

[12] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.