
1924 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 44, NO. 10, OCTOBER 2014

Visual Words Assignment Via Information-Theoretic Manifold Embedding

Yue Deng, Yipeng Li, Yanjun Qian, Xiangyang Ji, and Qionghai Dai, Senior Member, IEEE

Abstract—Codebook-based learning provides a flexible way to extract the contents of an image in a data-driven manner for visual recognition. One central task in such frameworks is codeword assignment, which allocates local image descriptors to the most similar codewords in the dictionary to generate a histogram for categorization. Nevertheless, existing assignment approaches, e.g., the nearest neighbors strategy (hard assignment) and Gaussian similarity (soft assignment), suffer from two problems: 1) a too strong Euclidean assumption and 2) neglect of the label information of the local descriptors. To address these two challenges, we propose a graph assignment method with maximal mutual information (GAMI) regularization. GAMI exploits the manifold structure to better reveal the relationships among a massive number of local features via a nonlinear graph metric. Meanwhile, the mutual information of descriptor-label pairs is optimized in the embedding space to enhance the discriminant property of the selected codewords. Under this objective, two optimization models, i.e., inexact-GAMI and exact-GAMI, are respectively proposed in this paper. The inexact model can be efficiently solved with a closed-form solution. The stricter exact-GAMI model nonparametrically estimates the entropy of descriptor-label pairs in the embedding space and thus leads to a relatively complicated but still tractable optimization. The effectiveness of the GAMI models is verified on both public and our own datasets.

Index Terms—Manifold embedding, mutual information, scene categorization, visual words assignment.

I. Introduction

IN computer vision, a long-standing but still challenging problem is how to accurately categorize different images based on their intrinsic contents [1]–[3]. Recently, there is a growing trend to solve the problem in a data-driven framework. In this framework, instead of a detailed explanation for each component in the image, we only care about the statistical features of the whole image obtained in the bag-of-feature (BoF) way. In the BoF framework, the whole image is summarized as a feature histogram that indicates the frequency of local descriptors appearing in the image. Once the histogram is obtained, image categorization turns into a well-defined machine learning task and can be effectively solved by strong discriminative machines, e.g., the SVM. Accordingly, the prominent task of image categorization is how to effectively generate visual histograms to represent different images.

Manuscript received March 17, 2013; revised July 31, 2013 and November 1, 2013; accepted January 3, 2014. Date of publication January 28, 2014; date of current version September 12, 2014. This work was supported in part by the National Basic Research Project under Grant 2010CB731800 and in part by the Project of NSFC under Grant 61120106003, Grant 61035002, and Grant 61325003. This paper was recommended by Associate Editor P. P. Angelov.

Y. Deng is with the Tsinghua National Laboratory for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China, and also with the School of Electronic Science and Engineering, Nanjing University, Nanjing 21000, China (e-mail: [email protected]).

Y. Li, Y. Qian, X. Ji, and Q. Dai are with the Tsinghua National Laboratory for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2014.2300192

For image representation, prevalent algorithms include topic models [4], [5], codeword similarity assignment [1], [6], and sparse coding [7]. Topic models use a probabilistic generative model to hierarchically interpret how the local features in the image are generated from different image categories. However, these models lack discriminative ability. Sparse coding methods exploit popular sparse optimization to establish a patch-level feature histogram and produce robust performance. Unfortunately, they suffer from heavy computational loads because ℓ1-type convex optimizations must be solved multiple times in the feature generation step. In order to achieve higher accuracy than the topic models and to relieve the heavy computational burden of sparse coding, in this paper we focus on the codeword-based framework to generate the visual histogram for image categorization.

The codeword-based framework [6] encodes each image into a histogram in two steps: codeword construction and codeword assignment. In the first step, local features, e.g., SIFT features, are extracted from multiple images. These thousands of local descriptors are then clustered into a number of centers, i.e., codewords. The set containing all these codewords is called the codebook (dictionary). With these codewords, in the second step, the local image features are assigned to the codewords to form an image-level histogram.

In the context of codeword assignment, although many variations have been proposed in the past decade, hard assignment [1] and soft assignment with Gaussian kernels [6] are still two of the most influential works. Hard assignment directly assigns each local feature to the codeword with the least distance. Correspondingly, soft assignment assigns each local feature to all the codewords in the dictionary according to the similarity evaluated by a Gaussian kernel. Although these two assignment methods show promising results in many practical applications, they still suffer from two problems that deserve consolidated investigation.
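To make the two baselines concrete, the following sketch contrasts them on a toy codebook (illustrative only, not the implementation of [1] or [6]; the codebook is built by plain k-means and the kernel width sigma is a hypothetical choice).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_words=200, seed=0):
    # Cluster training descriptors; the cluster centers serve as codewords.
    km = KMeans(n_clusters=n_words, n_init=4, random_state=seed).fit(descriptors)
    return km.cluster_centers_

def hard_assign_histogram(descriptors, codebook):
    # Each descriptor votes for its single nearest codeword (Euclidean distance).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(axis=1), minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def soft_assign_histogram(descriptors, codebook, sigma=100.0):
    # Each descriptor spreads its vote over all codewords via a Gaussian kernel.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)   # normalize the votes of each descriptor
    hist = w.sum(axis=0)
    return hist / hist.sum()
```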



The first drawback of existing assignment methods is that they are implemented in the Euclidean space. During the training procedure, thousands of local image descriptors are extracted from a number of training images in different categories, e.g., more than 320 000 descriptors from the 15 Scene dataset [4]. With such a huge data quantity, it is not reasonable to neglect the manifold structure of the data and the nonlinearity therein. However, as stated previously, typical assignment methods rely only on the Euclidean distance to make assignments. In this paper, we propose an assignment method that uses a manifold structure to model the distribution of the massive local features. Owing to the graph topology, many powerful nonlinear similarities, e.g., the geodesic distance [8] and the commute time [9] of a random walk, can naturally be used as metrics to evaluate feature similarity during assignment.

Besides, typical codeword assignment methods ignore the label information carried by each local descriptor. As the images come from various categories, local image features also carry label information [10]. This critical property is hardly addressed by previous approaches in either codebook generation or codeword assignment. In this paper, we attempt to enhance the discriminative information of descriptor-label pairs by regularizing their information-theoretic quantity.

To overcome the aforementioned shortcomings, in this paper, a graph assignment method with maximal mutual information (GAMI) is proposed for codeword assignment, exploring the manifold structure to capture the nonlinearity among the large number of local descriptors. Different from previous assignment methods, GAMI explicitly utilizes the label information by incorporating a mutual information term into the objective. Mutual information encourages the mutual dependency of the labels and the features and thus significantly improves the discriminative structure of the learned histogram.

To solve the GAMI optimization, the probability density functions (pdfs) of labels and samples in the embedding space should be explicitly defined, which requires some prior assumptions about the data distribution. Unfortunately, the data points in the embedding space are related to the optimization variables and cannot be obtained from a parametric formulation directly. In this paper, we estimate the feature and label pdfs in a nonparametric way using Parzen windows [11] and respectively propose two formulations, i.e., the inexact and exact models.

In the inexact model, we assume that the original graph structure is optimally preserved in the embedding space. Therefore, the pdf of labels in the embedding space can easily be obtained by k-nearest neighbors with the Epanechnikov kernel [12]. In that way, the inexact GAMI model is solved in closed form. Despite its elegant formulation, it is worth noting that the inexact GAMI model depends on the graph-topology-invariance assumption. To abandon this assumption, we also propose a stricter exact-GAMI model in which no additional assumption is made and the information quantity in the embedding space is directly estimated with a Gaussian kernel. In the exact-GAMI paradigm, a relatively complicated optimization model is obtained and we introduce the augmented Lagrangian multiplier method [13]–[15] to solve it in an iterative way.

In order to verify the performance of the proposed method, we test GAMI on four datasets. First, we publish our own multiview human body dataset, a dataset collected in a controlled laboratory setup, to investigate the advantages of GAMI in handling data with manifold distributions. Besides, the improvements of GAMI are also validated on the two benchmarks of 15 Scene [4] and Caltech-101 [2]. Finally, to emphasize the effectiveness of the GAMI model for complicated images, we publish our own dataset of aerial images. It will be verified that GAMI classifies the aerial images into different categories more accurately than other state-of-the-art methods.

In summary, the main contributions of this paper are as follows.

1) To the best of our knowledge, a codeword assignment algorithm designed from the perspective of manifold learning is proposed for the first time.

2) A unified information-theoretic manifold embedding optimization is designed and solved in this paper, which technically contributes by bringing information-theoretic quantities into manifold learning.

3) The proposed GAMI assignment model achieves state-of-the-art performance on benchmarks and requires less time for histogram generation than the powerful sparse coding model.

The remainder of this paper is structured as follows. Section II reviews related work on codeword assignment and information-theoretic learning. The algorithm for information-theoretic manifold assignment is introduced in Section III. The solutions and optimizations for the proposed model are discussed in Section IV. The experimental verifications are presented in Section V. We conclude this paper in Section VI.

II. Related Work

This review section is divided into two main parts. First, we introduce some prevalent methods for image and scene categorization under the BoF framework, with special attention to codeword-based methods. Furthermore, since the paper designs the assignment optimization from manifold and information-theoretic learning, we also discuss some related works on these two learning topics.

A. Codeword-Based Image Categorization

Topic models for image categorization are a natural extension from text retrieval to image analysis [16]. They rely on a probabilistic generative model to hierarchically interpret how the local features in the image are generated conditionally on different image categories. However, topic models such as latent Dirichlet allocation (LDA) [4] and probabilistic latent semantic analysis (PLSA) [5] are fully characterized by their generative property while lacking discriminative functionality. To overcome this shortcoming, the hybrid generative/discriminative framework was proposed and is now drawing more and more attention due to its simplicity and effectiveness [17], [19].


Fig. 1. Overview of GAMI-based codeword assignment and optimization. Blue arrow flow: training procedures. Purple arrow flow: testing and assignment procedures.

A typical hybrid generative/discriminative system uses the BoF method [20], [21] to draw local image descriptors densely from an image; these local features are then assigned into a statistical histogram, which is classified with discriminative learning machines. Therefore, one critical task in this framework is how to accurately generate the statistical histogram to represent the original image. In [17], Bosch et al. consider that one local descriptor is assigned to one single codeword in the visual dictionary according to the nearest neighbor (NN) strategy. This approach is called hard assignment in the Euclidean space, and an extension of that work argues that the performance can be further improved if soft assignment is adopted [22].

Different from hard assignment, soft assignment assumes that each local descriptor is related to all the codewords with different distances. Their similarity is evaluated with an ℓ2-norm-based distance paired with a Gaussian kernel [6], [22]. The work in [1] indicates that the performances of both hard and soft assignments can be further improved in the spatial pyramid matching framework. Different from typical codeword assignment methods, learning a histogram from sparse coding and its variations has drawn extensive attention in the area of image categorization in recent years. Motivated by the theoretical contributions in compressive sensing [23], sparse coding methods consider that one local descriptor is sparsely represented by a number of bases in a prelearned overcomplete dictionary [7], [24]. Sparse coding methods often achieve higher accuracy on benchmark datasets than codeword-based methods. However, the computational burden of sparse coding is much heavier than that of codeword similarity-based algorithms. During assignment, an ℓ1-type optimization must be solved for each local descriptor on the testing images, and the optimization cost grows cubically with the number of bases in the dictionary.1

1In the famous paper on compressive sensing [25], it is indicated that solving the ℓ1-type sparse optimization requires a computational cost of $O(n^3)$, where $n$ is the number of bases.

B. Manifold and Information-Theoretic Learning

Manifold learning is a powerful tool for data analysis and has been successfully applied to many areas including face recognition [26], [27], image segmentation [28], and video/image-set recognition [29]. The widespread use of manifolds in image analysis is mainly due to the following two reasons. First, when the data quantity is abundant, it is impossible to neglect the nonlinearity among the massive points [8] and it is reasonable to model them by a graph topology. The second reason for the prevalence of manifold learning is its natural connection with linear latent models [30]. For this reason, it is possible to train on the data via a graph and to extend the learning result to out-of-sample data in a transformed latent space. It will become clear later that the graph assignment algorithm proposed in this paper exhibits both of these advantages in a single model.

Information-theoretic learning (ITL) has its origin deeply rooted in probability theory, which provides a flexible view to describe the inherent structure of labels and features. Owing to this desirable property, ITL has been successfully applied to a number of machine learning tasks that require explicit definitions of feature-label coherence. Therefore, many information-theoretic learning approaches naturally contain a mechanism for feature selection [31]. Practical applications of ITL include feature selection [32], semisupervised learning [33], and clustering [34]. In this paper, we use an information-theoretic quantity to regularize the data distribution in the embedding space for codeword selection and assignment.

III. Visual Words Assignment on Manifold

In this part, we propose the graph assignment method with maximal mutual information regularization, i.e., the GAMI model. An overview of our assignment and learning algorithm is provided in Fig. 1. We explain these procedures and their optimization formulations in the remainder of this section.


A. Manifold Assignment

The local image feature set is defined as $S = \{(f_1, l_1), (f_2, l_2), \ldots, (f_n, l_n)\}$, where $f_i \in \mathbb{R}^p$ is a local image descriptor, e.g., dense SIFT, extracted from the original image, and $l_i \in \{1, 2, \ldots, C\}$ is the category label of the image from which $f_i$ is extracted. $C$ is the number of image categories. The top-left panel of Fig. 1 shows images from four categories in our aerial image dataset, in which round dots are the local descriptors and different colors represent different image categories.

Considering the huge quantity of local image descriptors extracted from all the training images, a graph is used to model the samples in $S$, which facilitates the evaluation of nonlinearity among training samples. On the graph spanned by the local descriptors, many nonlinear metrics can be applied to evaluate the nonlinear similarity between two samples. For manifold learning, two widely known graph metrics are the geodesic distance [8] and the commute time [9]. The geodesic distance is the shortest distance between two nodes on the graph. The commute time records the number of steps a random walk takes to travel between a pair of nodes and back. The calculations of geodesic distance and commute time have been well studied in previous works [8], [9], [35].

A toy example is used to illustrate the manifold structure for codeword assignment. Two Gaussian components are used to generate the data. The mean values of the two components are fixed at (0, 0) and (2, 0) and their variances are set to 0.5 and 2.5, respectively. Each node connects to its nearest three neighbors to construct the manifold structure. As shown in Fig. 2, the data are assigned to two codewords, which are represented by the yellow markers (i.e., square and triangle) in each subfigure. Then, the Euclidean distance, geodesic distance, and commute time are calculated, respectively. With each metric, the points are assigned to a codeword according to the least-distance strategy. Obviously, the manifold assignment methods based on geodesic distance and commute time generally outperform the result assigned by the Euclidean distance.
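This toy setup can be reproduced in a few lines. The sketch below (an illustrative re-creation under the settings just described, not the code used for Fig. 2) builds the 3-nearest-neighbor graph and compares Euclidean against geodesic (shortest-path) assignment; the commute time could be computed analogously from the graph Laplacian.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

rng = np.random.default_rng(0)
# Two Gaussian components: means (0, 0) and (2, 0), std 0.5 and 2.5.
a = rng.normal([0.0, 0.0], 0.5, size=(200, 2))
b = rng.normal([2.0, 0.0], 2.5, size=(200, 2))
x = np.vstack([a, b])

# Symmetric 3-NN graph with Euclidean edge lengths.
knn = kneighbors_graph(x, n_neighbors=3, mode='distance')
knn = knn.maximum(knn.T)

codewords = np.array([0, 200])          # indices of the two codeword nodes
# Geodesic distance = shortest path along the graph edges.
geo = shortest_path(knn, method='D', indices=codewords, directed=False)

eu = np.linalg.norm(x[:, None, :] - x[codewords][None, :, :], axis=-1)
assign_euclid = eu.argmin(axis=1)       # nearest codeword in the ambient space
assign_geodesic = geo.argmin(axis=0)    # nearest codeword along the manifold
print((assign_euclid != assign_geodesic).mean(), "of points change codeword")
```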

The above discussion verifies that nonlinear metrics on the graph are well suited for codeword assignment. Nevertheless, the graph generated by the aforementioned approaches only enables learning and evaluation for the in-sample local descriptors. For practical usage, it is necessary to extend the assignment ability to the out-of-sample data on the testing images. One trivial solution is to construct such a manifold for every testing image and then to calculate the graph metric from the feature nodes to the codewords. However, the computational costs are prohibitively high for real-time usage.

To extend manifold learning to out-of-sample data, in our previous work [35], we considered embedding the manifold into a subspace with orthogonal projections. The Euclidean distance in the subspace implicitly preserves the original nonlinear metric on the manifold. To make the paper self-contained, in the next section, we briefly introduce the graph metric preservation method in [35], which is inspired by [30].

Fig. 2. Toy assignment via different methods, i.e., Euclidean distance (ED), geodesic distance (GD), and commute time (CT). (a) Ground truth. (b) Assignment via ED. (c) Assignment via GD. (d) Assignment via CT.

We define $f_i \in \mathbb{R}^p$ as a node on the manifold and $m_{ij}$ as the nonlinear graph distance (e.g., geodesic distance) between $f_i$ and $f_j$. The graph metric preservation optimization aims to find a projection matrix $\Phi \in \mathbb{R}^{p \times q}$ ($q < p$) that preserves the original graph metric with the Euclidean distance:

$$\min_{\Phi} \ \sum_{ij} \frac{\|\Phi^T f_i - \Phi^T f_j\|_F^2}{m_{ij}} \qquad (1)$$

where $\|\cdot\|_F$ is the Frobenius norm. Equation (1) implicitly preserves the graph metric with the Euclidean distance in the subspace. The meaning of the above optimization is self-evident: the larger the graph distance $m_{ij}$ is, the larger the Euclidean distance in the numerator is allowed to be. Conversely, if $m_{ij}$ is small, the Euclidean distance should also be very small to minimize the whole objective. However, using (1) alone may encourage all the points to be mapped onto the same point, e.g., zero. Inspired by [30], an orthogonal constraint is added to avoid such a trivial solution. By adding the constraint and writing (1) in matrix form, we get

$$\min_{\Phi} \ \mathrm{tr}\big(\Phi^T F (D - W) F^T \Phi\big) \quad \text{s.t.} \quad \Phi^T F D F^T \Phi = I \qquad (2)$$

where $I$ is an identity matrix, $F = [f_1, \ldots, f_n] \in \mathbb{R}^{p \times n}$ is the feature matrix, $W = [W_{ij}]$ is the weight matrix obtained on the graph, which records the similarity between any two nodes $i$ and $j$, i.e., $W_{ij} = 1/m_{ij}$, and $D = \mathrm{diag}(\sum_i W_{ij})$. $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. We denote the projections in the embedding space by $Y$, with $y = \Phi^T f$. This formulation resembles the typical graph embedding algorithm [30] and can be solved in closed form.

B. Mutual Information Revisited

Since each local feature carries label information, it is worthwhile to utilize this prior to improve the classification performance. This label information is illustrated in the manifold space block, i.e., the top-right panel of Fig. 1, where nodes with different colors come from different categories.

To categorize the images, the prominent goal is to enhance the separability of the learned features. In general, class separability means that the image features extracted from different categories should exhibit large margins. Inspired by the work in [33] and [34], we know that mutual information can be used to address the goal of improving class separability. The mutual information [36] of two random variables is a quantity that measures the mutual dependence of the two variables; it measures how much knowing one of these variables reduces uncertainty about the other.

Informally, in machine learning, the mutual information $I(Y; L)$ can be interpreted as how much the uncertainty about the label $L$ is reduced if we know the feature $Y$. For discriminative learning, a large mutual information score is desired, so we should maximize the mutual information of descriptor-label pairs in the embedding space, i.e.,

$$\max \ I(Y = \Phi^T F;\, L). \qquad (3)$$

In the next two paragraphs, we give intuitive explanations of this mutual information maximization objective from two different perspectives.

First, the mutual information maximization strategy can be explained from the perspective of minimal information loss. The proposed graph assignment method projects the high-dimensional features into a low-dimensional space. Ideally, the mutual information on the original graph should be preserved in the embedding space. Unfortunately, reducing the dimensionality of the data may cause information loss. Instead of strict mutual information preservation, we adopt the minimal mutual information loss criterion [34], i.e., we minimize $I(F; L) - I(\Phi^T F; L)$. $I(F; L)$ is a fixed value because the original $F$ and $L$ are both known in the original manifold space. Therefore, minimizing the information loss is equivalent to maximizing the mutual information.

Besides, the objective in (3) can also be linked to the point-to-point channel capacity problem. In this view, we may consider the feature learning task as a quantization problem. In detail, we suppose there exists an ideal discriminative channel that automatically encodes the continuous feature $Y$ (input) into discrete codes $L$ (output). The codes $L$ exactly indicate the label information of the input feature; it is therefore reasonable to call this channel a discriminative channel. During transmission, we are concerned about how much information can be transmitted from the feature point to the label point. Ideally, this information should be maximized to guarantee that the codes sufficiently capture the information of the original input feature. In information theory, the channel capacity exactly represents this quantity. The channel capacity for two random variables (input $Y$ and output $L$) is defined as $\mathrm{Cap} = \max_{p(Y)} I(Y; L)$, where the maximum is taken over all possible input distributions $p(Y)$. Therefore, the supremum of this mutual information term, i.e., the channel capacity, is determined by the term $p(Y = \Phi^T F)$.

We have explained the discriminative strategy from two different perspectives. In the next part, this strategy is combined with the manifold embedding model for discriminative codeword generation and assignment.

C. Information-Theoretic Manifold Embedding

Starting from the basic definitions in Shannon's entropy [36], we know

$$I(Y; L) = H(L) - H(L \mid Y)$$

where $H(\cdot)$ denotes the entropy. In this paper, the training images are uniformly selected from different categories and we place a uniform distribution on the labels, i.e., $p(l) = 1/C$, where $C$ is the number of image categories. Accordingly, the entropy $H(L) = \sum_{i=1}^{C} C^{-1}\log(C) = \log C$ is a constant and can thus be dropped from the optimization.

By considering both the information loss and the graph similarity, the optimization for our GAMI model is given by

$$\min_{\Phi} \ \underbrace{\mathrm{tr}\big(\Phi^T F (D - W) F^T \Phi\big)}_{\text{graph assignment}} \;+\; \alpha \underbrace{H(L \mid Y = \Phi^T F)}_{\text{maximal mutual information}} \quad \text{s.t.} \quad \Phi^T F D F^T \Phi = I \qquad (4)$$

where $\alpha$ is a user-specified parameter that trades off the graph assignment and mutual information loss terms.

We now make some remarks on the second term in (4). Although $H(L \mid Y)$ is obtained from the minimal information loss criterion, we can also give it a direct physical explanation. $H(L \mid Y = \Phi^T F)$ is the conditional entropy of the labels given the features in the embedding space. When this entropy is minimized, the uncertainty about $l$ given $y$ is reduced to the minimum. In other words, minimal entropy implies maximal determination of the labels upon seeing the features. However, in (4), the entropy cannot be arbitrarily minimized; we can only find the best $\Phi$ that balances entropy minimization and graph similarity preservation. Please refer to Section V-B for experimental discussions of the effectiveness of information-theoretic learning for codeword selection and assignment.

The GAMI optimization is indicated by the blue block in Fig. 1. Following the blue arrow flow in the block, we can see that the optimal $\Phi$ depends on both the manifold structure and the projections in the embedding space. $\Phi$ projects data on the manifold into the embedding space, and meanwhile the data distribution in the embedding space also affects the optimal value of $\Phi$, which leads to a chicken-and-egg problem. The learning procedure for the GAMI optimization is provided in Section IV. Before that, we first illustrate how codeword assignment is performed with the GAMI model.

D. Codeword Selection and Graph Assignment

After learning the optimal $\Phi$ in the GAMI model, the training local descriptors are projected into the embedding subspace. In the embedding space, the clustering centers (e.g., by k-means) of the samples are selected as codewords. In Fig. 1, the golden stars in the embedding space panel represent the codewords.

In Fig. 1, the purple arrow flow indicates the detailed procedure of codeword assignment for out-of-sample features. For the local descriptors on a testing image, we first project them into the embedding space according to the learned $\Phi$ and then assign them to the codewords via Euclidean similarity. It is worth noting that both codeword selection and assignment are implemented in the embedding space. To generate the histogram, the soft assignment method with a Gaussian kernel is used due to its effectiveness. Note that the distance in the Gaussian kernel is the distance in the embedding space, which sufficiently represents the nonlinear relationships among the data on the manifold.
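A minimal sketch of this assignment stage is given below (illustrative only; Phi denotes the learned projection, codewords the k-means centers in the embedding space, and sigma a hypothetical kernel width).

```python
import numpy as np

def gami_histogram(test_descriptors, Phi, codewords, sigma=1.0):
    """Project out-of-sample descriptors with Phi, then soft-assign them
    to the codewords selected in the embedding space."""
    y = test_descriptors @ Phi                      # rows: projected descriptors
    d2 = ((y[:, None, :] - codewords[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))            # Gaussian similarity measured in
    w /= w.sum(axis=1, keepdims=True)               # the embedding (not ambient) space
    hist = w.sum(axis=0)
    return hist / hist.sum()
```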


IV. GAMI Optimization and Model Learning

The remaining problem is how to solve the GAMI optimization. To learn the model, we give an empirical estimation of the conditional entropy $H(L \mid Y)$ in (4), i.e.,

$$H(L \mid Y = \Phi^T F) = -\iint p(y, l)\,\log p(l \mid y)\, dy\, dl = -\sum_{l, f} p(\Phi^T f)\, p(l \mid \Phi^T f)\, \log p(l \mid \Phi^T f). \qquad (5)$$

To numerically calculate the entropy, we should estimate the probability densities $p(\Phi^T f)$ and $p(l \mid y = \Phi^T f)$, where the former is estimated using a Parzen window [11]

$$p(\Phi^T f_i) = \frac{1}{N\sigma}\sum_{j \neq i} K\!\left(\frac{\Phi^T f_i - \Phi^T f_j}{\sigma}\right). \qquad (6)$$

In (6), $K(\cdot)$ is the kernel density function and $\sigma$ is the parameter controlling the width of the Parzen window. In our formulation, different choices of $K(\cdot)$ and of the estimation of $p(l \mid \Phi^T f)$ lead to two different models, called inexact-GAMI and exact-GAMI.
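For illustration, such a leave-one-out Parzen estimate can be written in a few lines; the sketch below assumes the $1/(N\sigma)$ normalization used above, and the Gaussian kernel shown as an example is the choice adopted later in the exact model.

```python
import numpy as np

def parzen_density(Y, sigma, kernel):
    """Leave-one-out Parzen estimate of p(y_i) on the projected samples Y (n x q).
    `kernel` maps an array of scaled pairwise differences to kernel values."""
    n = len(Y)
    diff = (Y[:, None, :] - Y[None, :, :]) / sigma      # pairwise (y_i - y_j) / sigma
    k = kernel(diff)                                    # kernel response per pair, n x n
    np.fill_diagonal(k, 0.0)                            # exclude the j == i term
    return k.sum(axis=1) / (n * sigma)                  # 1/(N*sigma) * sum_{j != i}

# Example kernel: Gaussian (the choice used later in the exact model).
gauss = lambda d: np.exp(-(d ** 2).sum(-1) / 2.0) / np.sqrt(2.0 * np.pi)
```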

A. Inexact GAMI Model

In the inexact model, we follow the general idea used in [34], [37] to estimate the density $p(l \mid y_i = \Phi^T f_i)$. The general idea is to calculate the frequency of each label $l$ appearing in the neighboring area of $y_i$, i.e.,

$$p(l \mid y_i = \Phi^T f_i) = \frac{1}{k}\sum_{y_j \in \Omega_k(y_i)} \delta(l_j = l) \qquad (7)$$

where $\Omega_k(y_i)$ stands for the $k$ nearest neighbors (kNN) of $y_i$ in the embedding space and $\delta(l_j = l)$ is the Kronecker delta, which equals 1 if and only if (iff for short) $l_j = l$ and 0 otherwise. In this model, since the estimation of $p(l \mid y)$ relies on the kNN structure, we use the Epanechnikov kernel [12], which also relies on the kNN structure, as the kernel function in (6). It is defined as

$$K_E\!\left(\frac{\Phi^T f_i - \Phi^T f_j}{\sigma}\right) = \frac{3}{4}\left(1 - \frac{\|\Phi^T f_i - \Phi^T f_j\|_F^2}{\sigma^2}\right)\delta[y_j \in \Omega_k(y_i)]. \qquad (8)$$

Similarly, $\delta[y_j \in \Omega_k(y_i)]$ is one iff $y_j$ is among the kNN of $y_i$ and zero otherwise. Note that $y = \Phi^T f$ is the projection of $f$ in the embedding space.
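Both estimates are straightforward to compute once the neighbor sets are fixed. The sketch below is illustrative only (labels are assumed to be integers in {0, ..., C-1}, and the kNN structure is computed directly on the given points, as the inexact model assumes).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_label_frequency(Y, labels, n_classes, k=10):
    """Estimate p(l | y_i) as the label frequency among the k nearest neighbors, as in (7)."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(Y)
    _, idx = nbrs.kneighbors(Y)                 # idx[:, 0] is the query point itself
    idx = idx[:, 1:]
    p = np.zeros((len(Y), n_classes))
    for c in range(n_classes):
        p[:, c] = (labels[idx] == c).mean(axis=1)
    return p

def epanechnikov(Y, idx, sigma):
    """Truncated Epanechnikov response for each point and its kNN list, as in (8)."""
    d2 = ((Y[:, None, :] - Y[idx]) ** 2).sum(-1)
    return 0.75 * (1.0 - d2 / sigma ** 2)
```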

According to [38], the Epanechnikov kernel has many desirable numerical properties such as symmetry and normalization. Moreover, in our specific inexact GAMI model, the adoption of the Epanechnikov kernel enables us to find a closed-form solution to the whole optimization within the typical manifold embedding framework. By substituting (8) into (6) and (5), with some algebra, it is immediate that

$$H(L \mid Y = \Phi^T F) = -\frac{3}{4N\sigma}\sum_{ij}\left(1 - \frac{\|y_i - y_j\|_F^2}{\sigma^2}\right)\underbrace{\delta(y_j \in \Omega_k(y_i))\sum_{l} p(l \mid y_i)\log p(l \mid y_i)}_{g_{ij}}. \qquad (9)$$

However, the optimization is still intractable with such an estimation. This is because the estimation of the entropy in (9) involves the nonlinear operation of finding the kNNs in the embedding space. During the optimization, the projection matrix itself is unknown, so we cannot obtain the exact kNN structure of the data in the embedding space. Therefore, in this paper, we introduce an approximation to make the GAMI model efficiently solvable.

In our approach, we note that the first term in (4) is a graph preservation term, which preserves the topology of the original graph in the embedding subspace. Accordingly, we assume that after the graph projection, the neighboring topology of the original graph is kept the same in the embedding space. Of course, this assumption does not hold strictly for all pairs of nodes; according to [35], such a topology-invariance rule applies to most of the nodes in the embedding space. Based on this assumption, the kNN structure on the original manifold is used to approximate the neighboring structure in the embedding space. In this way, $g_{ij}$ is estimated in the original manifold space before the optimization, and in (9) it is regarded as a known constant.

Besides, for the sake of brevity, we drop the constants independent of $y$ in (9) and rescale some coefficients² to obtain the surrogate

$$H(L \mid Y) \propto -\sum_{ij}\|y_i - y_j\|_F^2\, g_{ij} \;=\; \tilde{H}(L \mid Y).$$

Therefore, by substituting the surrogate $\tilde{H}(L \mid Y)$ into (4) to replace $H(L \mid Y)$ and writing the optimization in matrix form, we get the inexact-GAMI optimization

$$\min_{\Phi} \ \mathrm{tr}\big(\Phi^T F (D - W) F^T \Phi\big) - \alpha\, \mathrm{tr}\big(\Phi^T F (D_G - G) F^T \Phi\big) \quad \text{s.t.} \quad \Phi^T F D F^T \Phi = I \qquad (10)$$

where $G = [g_{ij}]$ and $D_G = \mathrm{diag}(\sum_i g_{ij})$. Equation (10) is a standard optimization with a quadratic objective. If we define $L_W = D - W$ and $L_G = D_G - G$, we can write the Lagrangian of (10) in the form

$$J(\Phi, \Lambda) = \mathrm{tr}\big(\Phi^T F (L_W - \alpha L_G) F^T \Phi\big) + \big\langle \Lambda,\ I - \Phi^T F D F^T \Phi \big\rangle \qquad (11)$$

where $\langle A, B\rangle = \mathrm{tr}(AB^T) = \mathrm{tr}(A^T B)$ is the inner product and $\Lambda$ is the Lagrangian multiplier, which is a diagonal matrix. By setting $\partial J / \partial \Phi = 0$, we get

$$F(L_W - \alpha L_G)F^T \Phi = F D F^T \Phi \Lambda. \qquad (12)$$

2In our experiment, the parameter $\alpha$ before $H(L \mid Y)$ in (4) is determined by cross-validation. Therefore, we can rescale the entropy term since the scaling coefficient can be regarded as being merged into $\alpha$.


Therefore, the optimal $\Phi$ is given by this generalized eigenvalue decomposition. Since the objective in (10) is to be minimized, the eigenvectors corresponding to the smallest eigenvalues are used to form the projection matrix.
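In code, this closed-form step amounts to a single generalized eigendecomposition. The sketch below is illustrative only; it assumes symmetric W and G matrices and adds a small ridge term (our own choice, not part of the model) to keep $F D F^T$ well conditioned.

```python
import numpy as np
from scipy.linalg import eigh

def inexact_gami(F, W, G, alpha, q, ridge=1e-6):
    """Solve (10) via the generalized eigenproblem (12).
    F is p x n; W, G are n x n (assumed symmetric); returns the p x q projection."""
    D  = np.diag(W.sum(axis=0))
    DG = np.diag(G.sum(axis=0))
    A = F @ ((D - W) - alpha * (DG - G)) @ F.T      # left-hand side of (12)
    B = F @ D @ F.T + ridge * np.eye(F.shape[0])    # constraint matrix, regularized
    vals, vecs = eigh(A, B)                         # eigenvalues returned in ascending order
    return vecs[:, :q]                              # keep the q smallest-eigenvalue directions
```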

The prominent advantage of the inexact GAMI model is its simplicity, with a closed-form solution. However, this elegant formulation is a result of the graph-topology-invariance assumption. Readers may be curious whether we can directly solve the GAMI optimization without making such an assumption. In the next section, we address this challenge.

B. Exact GAMI Model

Different from (7), which requires local neighboring information, we can also directly estimate $p(l \mid y)$ from the Bayesian rule

$$p(l \mid y = \Phi^T f) = \frac{p(l)\, p(\Phi^T f \mid l)}{p(\Phi^T f)}$$

where $p(l) = \frac{1}{C}$ is assumed to be uniform and $C$ is the total number of image classes. $p(\Phi^T f)$ is estimated by (6) and $p(\Phi^T f \mid l)$ is obtained by

$$p(\Phi^T f_i \mid l) = \frac{1}{N\sigma}\sum_{j \neq i} K\!\left(\frac{\Phi^T f_i - \Phi^T f_j}{\sigma}\right)\delta(l_j = l). \qquad (13)$$

In the exact GAMI model, we abandon the neighboring assumption and adopt the widely used Gaussian kernel to estimate the pdfs in (6) and (13) in a continuous way, i.e.,

$$K_G\!\left(\frac{\Phi^T f_i - \Phi^T f_j}{\sigma}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\|\Phi^T f_i - \Phi^T f_j\|_F^2}{2\sigma^2}\right). \qquad (14)$$

Accordingly, the calculation of the entropy in (5) becomes

$$H(L \mid Y) = -\sum_{f, l} p(l)\, p(\Phi^T f \mid l)\log\frac{p(\Phi^T f \mid l)\, p(l)}{p(\Phi^T f)} \propto -\sum_{i, l} p(\Phi^T f_i \mid l)\log p(\Phi^T f_i \mid l) + \log C \sum_{i, l} p(\Phi^T f_i \mid l) + \sum_{i, l} p(\Phi^T f_i \mid l)\log p(\Phi^T f_i) = \tilde{H}(L \mid Y) \qquad (15)$$

where $p(l) = 1/C$ has been dropped and we obtain a surrogate. By substituting the surrogate in (15) into (4), we get the optimization of the exact-GAMI model. This model does not have a closed-form solution due to the continuous character of $\tilde{H}(L \mid Y)$. Fortunately, the model can still be solved in an iterative way. In this paper, we use the augmented Lagrangian multiplier method (ALM) [14] to convert the constrained GAMI model into an unconstrained optimization. We use the ALM instead of the typical Lagrangian multiplier because of its robust convergence, which has been verified in a number of previous works [13], [14]. The ALM formulation is given by

$$J_A(\Phi, \Lambda) = \mathrm{tr}\big(\Phi^T F L_W F^T \Phi\big) + \big\langle \Lambda,\ \Phi^T F D F^T \Phi - I \big\rangle + \frac{\mu}{2}\,\big\|\Phi^T F D F^T \Phi - I\big\|_F^2 + \alpha \tilde{H}(L \mid Y) \qquad (16)$$

Algorithm 1: Solving exact-GAMI via ALM.

Input: original label matrix $L$ and local descriptor matrix $F$; manifold structure matrices $D$ and $W$.
Initialization: $k = 1$, $\Phi_0 = \Phi_{\mathrm{ini}}$, $\Lambda_0 = 0$, $\mu_0$, $\beta_0$, $\rho_1 > 1$, $\rho_2 < 1$.
repeat
  // Variable updating
  $\Phi_k = \Phi_{k-1} - \beta_k\, \partial J_A(\Phi, \Lambda_{k-1}) / \partial \Phi \,\big|_{\Phi = \Phi_{k-1}}$;
  // Dual ascending
  $\Lambda_k = \Lambda_{k-1} + \mu_k\big(\Phi_k^T F D F^T \Phi_k - I\big)$;
  $\mu_k = \rho_1 \mu_{k-1}$;
  $\beta_k = \rho_2 \beta_{k-1}$;
  $k = k + 1$;
until convergence;
Output: $\Phi_k$.

where $\Lambda$ is the Lagrangian multiplier whose value is updated during the optimization via dual ascent. From (16), it is apparent that ALM adds a quadratic term to the typical Lagrangian. The added term forces the solution to satisfy the equality constraint as $\mu \rightarrow \infty$ and greatly speeds up the convergence of the whole objective [13]. The optimization of the above ALM formulation requires the explicit gradient of $J_A$ with respect to $\Phi$. We provide this gradient in Appendix A. The detailed iterative procedure for solving the exact-GAMI model is provided in Algorithm 1.

In the algorithm, we initialize $\Phi$ using the optimal result obtained from the graph embedding problem in (2). The variable updating rule in line 2 relies on the value of $\beta_k$. $\beta_k$ is initialized as 0.1 and its value is iteratively decreased as the optimization proceeds by a factor $\rho_2 = 0.95$. Similarly, $\mu_k$ is initialized as 1 and its value is iteratively increased by a factor $\rho_1 = 1.1$. The optimization is implemented with the optimization package in [39].
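A schematic version of this loop is sketched below (illustrative only; grad_JA stands for the gradient of (16) derived in Appendix A, which is not reproduced here, so it is passed in as a user-supplied function, and the step-size schedule follows the settings just described).

```python
import numpy as np

def exact_gami_alm(F, D, Phi0, grad_JA, mu0=1.0, beta0=0.1,
                   rho1=1.1, rho2=0.95, tol=1e-4, max_iter=500):
    """ALM loop of Algorithm 1. `grad_JA(Phi, Lam, mu)` must return dJ_A/dPhi."""
    Phi = Phi0.copy()
    Lam = np.zeros((Phi0.shape[1], Phi0.shape[1]))   # multiplier for the q x q constraint
    mu, beta = mu0, beta0
    M = F @ D @ F.T                                  # constraint matrix F D F^T
    for _ in range(max_iter):
        Phi_new = Phi - beta * grad_JA(Phi, Lam, mu)                      # primal gradient step
        Lam = Lam + mu * (Phi_new.T @ M @ Phi_new - np.eye(Phi_new.shape[1]))  # dual ascent
        mu, beta = rho1 * mu, rho2 * beta            # schedule the penalty and step size
        converged = np.linalg.norm(Phi_new - Phi) ** 2 <= tol * np.linalg.norm(Phi) ** 2
        Phi = Phi_new
        if converged:
            break
    return Phi
```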

V. Experiments

In this part, we experimentally verify the performance of the GAMI model. Before that, some basic experimental settings are first presented.

A. Experimental Setup

The experimental setup of this paper substantially follows Lazebnik et al. [1]. We randomly pick a number of images per class for training, and the rest are used for testing. In order to get reliable results, each experiment is repeated 10 times. To describe an image, dense SIFT features [40] are extracted on 16×16-pixel patches sampled every 8 pixels. For the GAMI method, the codebook is generated in the embedding space by the k-means algorithm.

The GAMI optimizations do not involve too many parameters. In inexact-GAMI, there is only one parameter, $\alpha$ in (4), to be fixed. In exact-GAMI, in addition to $\alpha$, we also need to fix the width of the Gaussian Parzen window, controlled by the parameter $\sigma$. As the projection matrix is not available before the GAMI optimization, we can hardly make any assumption about the data distribution in the embedding space. In such a case, fixing the parameters with a statistical prior is impractical; therefore, one possible way is to fix them empirically on the testing set. In general, across the different experiments, the optimal value of $\alpha$ ranges from 0.1 to 0.3 and the optimal value of $\sigma$ is always selected around 10. Another parameter that needs justification is the number of neighbors used when constructing the manifold. In the experiments, we empirically fix it as a constant, i.e., 10. This parameter setting applies to all the experiments discussed in the remaining sections. The robustness of the parameters will be discussed at the end of this section. For classification, we use an SVM with a histogram intersection kernel. The libSVM toolbox [41] is adopted with the one-versus-one approach for multiclass classification.
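For reference, the classification stage can be reproduced with a precomputed kernel, e.g., as in the following sketch (illustrative; it uses scikit-learn's SVC in place of the libSVM toolbox, and SVC internally applies a one-versus-one scheme for multiclass problems).

```python
import numpy as np
from sklearn.svm import SVC

def histogram_intersection(A, B):
    # K(h, h') = sum_k min(h_k, h'_k), computed for every pair of rows of A and B.
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

def train_and_predict(train_hists, train_labels, test_hists):
    clf = SVC(kernel='precomputed')
    clf.fit(histogram_intersection(train_hists, train_hists), train_labels)
    return clf.predict(histogram_intersection(test_hists, train_hists))
```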

The experiments are conducted on four datasets including two benchmarks and two datasets of our own. The two benchmarks are the widely used 15 Scene [4] and Caltech-101 [2] datasets. Besides, we publish our dataset on multiview human body (MHB) [42] to verify the manifold properties of the GAMI model. Finally, an aerial image dataset is collected and published in this paper to justify GAMI's performance beyond natural images. Our own datasets are introduced where they first appear.

B. Verifications on Information-Theoretic Approaches

The technical contributions of the proposed GAMI model are mainly two: 1) using a graph structure to model massive local descriptors and 2) introducing an information-theoretic quantity to improve the discrimination of descriptors in the embedding space. The advantages of the first point have been verified by the toy example in Fig. 2. So in this part, before going forward to image classification, we conduct another toy experiment to verify the second point. In a nutshell, we verify the advantages of the information-theoretic approach adopted in our algorithm for codeword selection and assignment.

As stated previously, the mutual information term in the GAMI model mainly contributes to improving the discriminative ability of local descriptors. Loosely speaking, for the specific codeword selection problem, we expect that once we see a codeword, we know with maximal certainty from which category it is generated. This property helps a lot for discriminative learning. However, this nice character is not easily satisfied in our problem, for an obvious reason: one specific codeword may have multiple labels. For example, in the natural scene categorization task, the codeword sky may be generated in multiple categories, e.g., city, seaside, etc. Therefore, we instead hope that, around a specific codeword, the rate of the majority label is maximized. The majority label is the label shared by most descriptors around a particular codeword.

Fig. 3. Average majority label rate with different projection dimensions.

To verify this point for our GAMI model, a toy example is conducted on the first three categories of the 15 Scene dataset. In each category, the SIFT points are clustered into 100 points by k-means. The SIFT features from these different image categories are embedded into a latent space by three methods, i.e., manifold embedding (2), inexact-GAMI, and exact-GAMI. We vary the projection dimension of $\Phi$ to observe the average majority label distribution in the embedding space.

The nonlinear metric used in this toy example is the commute time due to its computational efficiency. Ten codewords are generated in the embedding space by k-means clustering, so on average 30 features are expected to be assigned to the same codeword. To determine their assignments, we use the nearest neighbor strategy to assign each feature to its nearest codeword. The average majority label rates calculated from these 10 codewords are reported in Fig. 3.
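The average majority label rate plotted in Fig. 3 can be computed as follows (a sketch of the evaluation protocol as described above; Y denotes the embedded features and labels their integer category labels).

```python
import numpy as np
from sklearn.cluster import KMeans

def average_majority_label_rate(Y, labels, n_codewords=10, seed=0):
    """Cluster embedded features into codewords, hard-assign each feature to its
    nearest codeword, and average the fraction taken by the majority label."""
    km = KMeans(n_clusters=n_codewords, n_init=4, random_state=seed).fit(Y)
    assign = km.predict(Y)                      # nearest-codeword assignment
    rates = []
    for c in range(n_codewords):
        member_labels = labels[assign == c]     # labels of features assigned to codeword c
        if len(member_labels) == 0:
            continue
        counts = np.bincount(member_labels)
        rates.append(counts.max() / counts.sum())   # share of the majority label
    return float(np.mean(rates))
```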

From the figure, the exact-GAMI model achieves the best performance. It achieves more than a 50% average majority rate, which means that around a codeword selected by the exact-GAMI model, on average 50% of the features come from the same category. It is worth noting that the baseline average majority label rate is 33% because three image categories are used in the toy example. From the trend of the curve, it is obvious that inexact-GAMI achieves a lower score than exact-GAMI but performs much better than the graph embedding method without information-theoretic regularization.

From Fig. 3, it is also interesting to note that increasing the dimension of the projection matrix does not further improve the discrimination: when the projection dimension is larger than 40, the curve in Fig. 3 shows a converged trend. Increasing the projection dimension would, of course, increase the computational burden of training. Therefore, in our work, we project the original SIFT features from $\mathbb{R}^{128}$ to $\mathbb{R}^{50}$ for a balance between effectiveness and efficiency. Unless otherwise noted, the experiments in the remaining parts are all implemented with this setting.

C. Comparisons of Graph Assignment and Euclidean Assignment

In this part, we experimentally verify that manifold assignment is superior to typical Euclidean assignment, especially for data that exhibit manifold distributions. To validate this point, we publish our own dataset on MHB.3 We have set up a multicamera 3-D studio to capture multiview images of human actors. Compared with natural image datasets, our dataset better highlights the manifold property of the data. This multiview dataset consists of images from 20 cameras which are evenly placed on a ring.

3Anyone interested in using our dataset, please contact the authors via e-mail: [email protected].


Fig. 4. Images in the multiview human body dataset.

Our dataset contains eight actors/actresses. In Fig. 4, we show one boy and one girl as examples. For categorization, a rectangle containing the whole human body is extracted from each image.

With respect to manifold learning, two basic concerns are how to construct the graph and which nonlinear metric is used to represent the data structure. In our test, we construct three kinds of graphs, i.e., a kNN graph, a kNN graph with a Gaussian kernel (G-kNN), and a sparse graph. In the kNN and G-kNN graphs, each node is connected with its nearest five neighbors. For the sparse graph, each node is sparsely represented by the others and the major coefficients are used as the weights for edge connections [35].
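The first two graph types can be built directly from pairwise distances; a brief sketch is given below (illustrative, using the five-neighbor setting mentioned above and a hypothetical Gaussian width sigma).

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_weights(X, k=5):
    # Binary kNN adjacency, symmetrized: W_ij = 1 if j is among the k neighbors of i (or vice versa).
    A = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    return A.maximum(A.T).toarray()

def gknn_weights(X, k=5, sigma=1.0):
    # kNN graph whose edges carry Gaussian weights exp(-d_ij^2 / (2 sigma^2)).
    D = kneighbors_graph(X, n_neighbors=k, mode='distance')
    D = D.maximum(D.T).toarray()
    return np.where(D > 0, np.exp(-D ** 2 / (2.0 * sigma ** 2)), 0.0)
```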

For manifold embedding, a number of previous works, e.g., Laplacian eigenmaps (LE) [43], Isomap [8], commute time embedding (CTE) [9], and locally linear embedding (LLE) [44], have been proposed in the last decade. However, these methods can only handle in-sample data and cannot treat out-of-sample data. In the codeword assignment task, we also need to project the descriptors of testing images into the latent space, so the aforementioned manifold embedding methods cannot be directly applied to the assignment task.

Fortunately, the graph embedding method [30] presented in (2) provides an elegant solution to this dilemma. Within the graph embedding framework, the Laplacian eigenmap is converted into a learnable model called locality preserving projections (LPP) [45]. Meanwhile, if we take the geodesic distance in Isomap or the commute time in CTE into the graph embedding paradigm, it is possible to obtain the formulations of geodesic projection (denoted GEO in Table I) and commute time projection (denoted CT in Table I). For the LLE method, it is much easier to generalize its results to out-of-sample data: an out-of-sample point can be linearly represented by the training samples on the manifold, and the linear coefficients are then applied to reconstruct its projection in the latent space.

For assignment, the number of codewords in the dictionary is fixed at 200. To highlight only the effectiveness of the manifold assignment algorithms themselves, we do not adopt spatial pyramid approaches to improve the classification accuracy; results with spatial pyramid matching are reported in Section V-E. In our experiment, we randomly choose five images in each category as training samples, and the other fifteen images are used for testing. The experiments are run 10 times on different graph structures with different graph metrics. The accuracies of the three graph assignment methods, i.e., graph embedding, inexact-GAMI, and exact-GAMI, are tabulated in Table I. It is worth noting that LLE does not rely on any particular graph topology in Table I; therefore, we do not report its accuracy in the table. For this experiment on the MHB dataset, LLE achieves an accuracy of 81.6 ± 0.7.

We also report the Euclidean assignment results as a baseline. In this experiment, the hard and soft assignment methods achieve classification accuracies of 81.0 ± 0.6% and 82.1 ± 0.5%, respectively. In contrast, using the geodesic distance on a G-kNN manifold, the exact GAMI model achieves as high as 85.4 ± 0.4%, which is much higher than the Euclidean assignment methods.

It is also apparent from the table that the sparse graph performs worse than the kNN and G-kNN graphs. This may be because local image features do not exhibit intrinsic sparse structures; that is, one local descriptor can hardly be sparsely represented by the others.4 For the two graph metrics, there is no significant evidence to support one over the other because their performances vary across tests.

However, taking the computational complexity into consideration, the commute time shows its advantages. The calculation of shortest paths between all pairs of nodes has polynomial-time complexity, whereas the calculation of commute time only requires solving a general eigenvalue decomposition [9]. For efficiency, we adopt the commute time as the graph metric in the following experiments. Meanwhile, we use the kNN graph to construct the manifold in the remaining experiments due to its robust performance and simplicity.
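For completeness, the commute time can indeed be obtained from a single (pseudo)inversion or eigendecomposition of the graph Laplacian; the sketch below is illustrative and assumes a connected graph.

```python
import numpy as np

def commute_times(W):
    """Commute times between all node pairs from the weight matrix W (n x n).
    CT(i, j) = vol(G) * (L+_ii + L+_jj - 2 L+_ij), with L+ the pseudoinverse of L = D - W."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    Lp = np.linalg.pinv(L)                 # one pseudoinverse / eigendecomposition of the Laplacian
    diag = np.diag(Lp)
    return d.sum() * (diag[:, None] + diag[None, :] - 2.0 * Lp)
```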

D. Comparisons of Inexact and Exact GAMI Models

In this part, we provide some discussions on the inexact and exact GAMI models. The comparisons are reported in terms of both training costs and performance. We publish our aerial dataset to verify the effectiveness of these two models in handling aerial scenes. In our dataset, some of the images are collected from the Internet and others are cropped from Google Earth. The dataset contains six categories which describe the most representative geomorphic information in earth observation. The six categories and some sample images are provided in Fig. 5. In our dataset, each category contains more than 50 images and the aerial images are collected from all over the world. For example, in the category city, the images are from New York, Los Angeles, Paris, Beijing, Shanghai, Tokyo, etc. The images are normalized to a resolution of 800 × 400.

During training, to reduce the complexity, the SIFT features in each category are clustered into 1000 centers and we accumulate no more than 6000 points in total for training. The inexact model, which can be solved in closed form, needs 67 seconds to reach convergence, while the exact GAMI model finishes training in approximately 349 seconds.

4Sparse coding methods can sparsely represent a SIFT feature because the bases in the dictionary are learned via a sparse optimization. However, in sparse graph construction, we cannot prelearn such an overcomplete dictionary; one node on the sparse graph is represented by the other nodes directly.


TABLE I

Manifold Structure Verification: Classification Accuracy on MHB Dataset (%)

Fig. 5. Images in different categories of aerial scene datasets.

Fig. 6. Average categorization accuracy of the inexact- and exact-GAMI methods for different categories in the aerial image set.

training in approximately 349 seconds. The exact GAMI is regarded as having converged when $\|\Phi^{k+1}-\Phi^{k}\|_F^2 \,/\, \|\Phi^{k}\|_F^2 < 10^{-4}$. The computer used to run the experiments is equipped with a 2.3 GHz CPU and 4 GB of RAM. The experiments and time costs reported in the remaining parts are all obtained on this computer.
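A minimal sketch of this stopping rule, i.e., terminating the iterations once the relative squared Frobenius change of the embedding matrix falls below 10^-4 (the update step itself is only a placeholder):

    import numpy as np

    def has_converged(Phi_old, Phi_new, tol=1e-4):
        # Relative squared Frobenius change between successive iterates of Phi.
        num = np.linalg.norm(Phi_new - Phi_old, "fro") ** 2
        den = np.linalg.norm(Phi_old, "fro") ** 2
        return num / den < tol

    # Typical usage inside the training loop (gami_update is a hypothetical placeholder):
    # while not has_converged(Phi, Phi_new):
    #     Phi, Phi_new = Phi_new, gami_update(Phi_new)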

From the performance comparisons in Fig. 6, exact GAMI generally outperforms the inexact model on almost all categories. We also report the results of the hard assignment method as the baseline. It is also apparent from Fig. 6 that the GAMI model achieves clearly superior performance on the difficult categories, i.e., jungle, sea, and desert, which have little texture. The improvements may be ascribed to the fact that the exact model better regularizes the data structures to select more discriminative codewords in the embedding space, as discussed in Section V-B.

However, as discussed above, the exact GAMI requires more time to train than the inexact one. Fortunately, these costs are only paid during training; the assignment costs of the inexact and exact GAMI models are the same.

E. Results on Benchmark Datasets

We evaluate our GAMI algorithm on the 15 Scene [4], Caltech-101 [2], PASCAL VOC [46], MHB, and aerial image datasets. For the two benchmarks, we follow the experimental setup of most previous works, selecting 100 images per category as training samples for the 15 Scene dataset and 15 images per category for Caltech-101. For the PASCAL dataset, we use the training and testing sets provided in [46] to conduct the experiment. For our own datasets, five images in each category of the MHB and aerial datasets are used for training and the rest are used for testing. Each experiment is repeated 10 times and the average classification results are recorded. We follow [1] in selecting 200 codewords and use a two-level spatial pyramid to improve the classification accuracy. The image classification task on the PASCAL dataset is more challenging, so we select 1000 codewords for that experiment and use a two-level pyramid to further improve the performance.
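The two-level spatial pyramid follows the standard construction of [1]: the codeword histogram is computed over the whole image and over each cell of a finer partition, and the per-cell histograms are concatenated. A generic sketch is given below; the keypoint coordinates, assignment indices, and normalization are illustrative assumptions rather than the paper's exact pipeline.

    import numpy as np

    def spatial_pyramid_histogram(keypoints_xy, assignments, image_size, K, levels=2):
        # keypoints_xy: (n, 2) pixel coordinates of the local descriptors.
        # assignments:  (n,) index of the codeword each descriptor was assigned to.
        # image_size:   (width, height); K: codebook size; level l has 2^l x 2^l cells.
        w, h = image_size
        hists = []
        for level in range(levels):
            cells = 2 ** level
            cx = np.minimum((keypoints_xy[:, 0] * cells / w).astype(int), cells - 1)
            cy = np.minimum((keypoints_xy[:, 1] * cells / h).astype(int), cells - 1)
            for i in range(cells):
                for j in range(cells):
                    mask = (cx == i) & (cy == j)
                    hists.append(np.bincount(assignments[mask], minlength=K))
        hist = np.concatenate(hists).astype(float)
        return hist / max(hist.sum(), 1.0)   # e.g., levels=2, K=200 gives a 1000-D histogram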

For comparison, we pit our graph assignment against other state-of-the-art methods. We use the publicly available code to produce the results of the other algorithms. The evaluation results are recorded in Table II. The accuracies reproduced by us are consistent with those reported in [1], [6], and [7], where these methods were originally proposed. One thing worth noting is that our result for the sparse coding (SC) method on the Caltech-101 dataset is a bit lower than the accuracy in [7], where the authors reported an accuracy as high as 67%. This difference may be caused by the use of different training and validation samples.


TABLE II. Image Categorization Accuracy on Benchmark Datasets With Spatial Pyramid Method (%)

TABLE III. p-Values of the t-Test to Verify Statistical Significance

Besides, we test the statistical significance of the results with a t-test [47], [48]. The general hypothesis verified in the test is that the accuracy of the GAMI model is superior to that of the others. To quantitatively evaluate this significance, we apply the t-test [47], [49] to the results obtained from the different datasets. The p-values of the t-test are reported in Table III; a small p-value implies strong statistical significance. In the table, we only report the p-values of GAMI against SC, LLC, and inexact-GAMI. For the other comparison methods reported in Table II, the p-values are less than 0.01 on all the datasets, so we omit them for brevity. On the Caltech-101 dataset, GAMI performs worse than SC and LLC, and the hypothesis that GAMI is superior to SC and LLC is rejected; therefore, on these two tests, the p-values are close to one. From the other results, it is apparent that GAMI achieves much better performance than the other assignment methods, with prominent statistical significance.
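Such a one-sided test can be reproduced from the per-run accuracies of the 10 repeated experiments along the following lines. The accuracy arrays below are placeholders, not the values reported in the paper, and the one-sided alternative argument requires SciPy 1.6 or later (otherwise, halve the two-sided p-value).

    import numpy as np
    from scipy import stats

    # Hypothetical per-run accuracies (%) over the 10 repeated experiments.
    acc_gami = np.array([85.1, 85.6, 85.0, 85.4, 85.8, 85.2, 85.5, 85.3, 85.7, 85.4])
    acc_sc   = np.array([83.9, 84.2, 83.7, 84.0, 84.4, 83.8, 84.1, 84.0, 84.3, 83.9])

    # One-sided paired t-test: H1 = "GAMI accuracy is higher than SC accuracy".
    t_stat, p_value = stats.ttest_rel(acc_gami, acc_sc, alternative="greater")
    print(f"t = {t_stat:.3f}, one-sided p = {p_value:.4f}")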

In Table II, the last rows show the results of the graph assignment methods, including MA+GEO, MA+CT, LPP, LLE, inexact-GAMI, and exact-GAMI. For Euclidean assignment, we report the results of hard assignment [1] and soft assignment [6]. Besides, the results of the sparse coding method [7], locality-constrained linear coding (LLC) [24], and information-loss coding (ILC) [34] are also reported.

On average, our graph assignment method improves on the typical Euclidean assignment methods by a large margin. Even the GA method without the information-theoretic regularization significantly improves the results of Euclidean assignment. This finding verifies that taking the nonlinearity into consideration yields a nontrivial improvement in assignment performance. Of the two information-theoretic graph embedding methods, exact-GAMI generally outperforms inexact-GAMI.

Besides, the GAMI models produce accuracy similar to that of the sparse-coding-based methods, and exact-GAMI achieves the highest accuracy on three datasets. On the Caltech-101 dataset,

TABLE IV. Average Assignment Costs (in Seconds) for One Image in 15 Scene Dataset

the sparse-coding-based method achieves the best result, although it is obtained with a three-level spatial pyramid and 1024 codewords. The results of our method on Caltech-101 are produced with only 200 codewords on a two-level spatial pyramid. There is no doubt that increasing the length of the histogram adds time costs to both the assignment and the SVM classification.

SC-based algorithms depend on a large dictionary with abundant bases to conduct sparse representation: the larger the dictionary, the sparser the structure SC exhibits [50]. Therefore, the large number of codewords (bases) in SC-based methods cannot be avoided, as it is determined by the mathematical nature of sparse coding. However, our graph assignment methods and the typical Euclidean assignments learn the feature histogram in a different way, which allows the construction of a small dictionary with fewer codewords. Increasing the number of codewords and pyramid levels in our algorithm would very likely also improve the performance. However, we do not do so because the prominent goal of this paper is not to tune a method for the highest performance on some specific dataset. Our contribution is to produce accuracy comparable to sparse coding but with relatively low computational costs; the computational cost should be similar to that of typical Euclidean assignment.

To compare the computational complexity, the average time costs of the different methods to assign one image in the 15 Scene dataset are reported in Table IV. The results clearly verify that the graph assignment method exhibits a comparably low assignment cost to Euclidean assignment and is much more efficient than sparse coding.

F. Robustness Verifications

In this part, we verify the robustness of the GAMI model with respect to different parameter settings. The two critical parameters validated here are α in (4) and the number of neighbors k in the graph construction. In the experiment, we first test different parameters on the training set for the in-sample test. Then, the same parameters are applied to the testing samples and the results are denoted as the out-of-sample test. The robustness verifications are conducted on the Caltech-101 and VOC datasets because they contain thousands of images and are benchmarks for general evaluations.

From the results in Fig. 7, it is apparent that the results of the out-of-sample test are very consistent with those of the in-sample test. The parameters achieving higher accuracy on the training set also perform better on the testing set. By observing the curves in the figure, we empirically find that α (resp. k) may be selected in the range from 0.1 to 0.3 (resp. 10 to 20).

Besides, it is interesting to note that either a too large or a too small parameter may lead to poor performance. Such experimental


Fig. 7. Robustness verifications with different parameters. In each subfigure, the blue lines denote the in-sample test and the dotted red lines record the out-of-sample test. The ordinate of each subfigure represents the classification accuracy (%). (a) Variations of α. (b) Variations of k.

findings are reasonable. First, the parameter α controls the balance between graph assignment and discriminative learning. If α is too small, the performance is similar to that of general graph embedding without using the label information. On the contrary, a large α places too much emphasis on the label distribution while neglecting the original structure of the manifold. In Fig. 7(b), the results are stable when k is selected between 10 and 20. A too large or too small k may disturb the graph structure and cause low classification accuracy in both the in-sample and out-of-sample tests.
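The in-sample/out-of-sample protocol above amounts to a simple grid sweep over the two parameters. A schematic version is sketched below, where train_and_evaluate is a hypothetical placeholder standing in for the full GAMI training and classification pipeline, and the candidate values are illustrative.

    import itertools

    def robustness_sweep(train_and_evaluate,
                         alphas=(0.05, 0.1, 0.2, 0.3, 0.5),
                         ks=(5, 10, 15, 20, 30)):
        # train_and_evaluate(alpha, k) should return a pair of accuracies:
        # (in_sample_accuracy, out_of_sample_accuracy) for one parameter setting.
        results = {}
        for alpha, k in itertools.product(alphas, ks):
            results[(alpha, k)] = train_and_evaluate(alpha, k)
        return results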

VI. Conclusion

This paper presents a codeword assignment method to generate statistical histograms for image categorization. The technical contribution of this paper is mainly the information-theoretic manifold embedding model for codeword assignment. We propose to use a graph structure to reveal the nonlinearity among the massive number of image features and to regularize the discrimination of data-label pairs in the embedding space by maximizing the mutual information. For practical usage, our model is very efficient to implement and greatly improves the classification accuracy of typical assignment algorithms. Besides, the bulk of this algorithm still falls into the typical similarity-based assignment framework; therefore, it avoids the heavy computational costs of other coding methods, e.g., sparse coding. Our GAMI model exhibits significant advantages in both effectiveness and efficiency.

Appendix

A. Derivatives for Exact-GAMI

For the sake of brevity, we rephrase $J_A$ in (16) as $J_A = J_1 + H(L|Y)$, where $J_1$ contains the first three terms in (16) and $H(L|Y)$ is the conditional entropy surrogate. The gradient of $J_1$ is easily obtained by matrix calculus:

$$\frac{\partial J_1}{\partial \Phi} = 2FL_WF^{\top}\Phi + 2FDF^{\top}\Phi\Lambda - 2\mu FDF^{\top}\Phi\left(\Phi^{\top}FDF^{\top}\Phi - I\right).$$
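A direct NumPy transcription of this gradient, assuming the feature matrix F, graph Laplacian L_W, degree matrix D, multiplier matrix Lambda, and penalty weight mu are available with the shapes indicated in the comments, might look as follows. This is only a sketch of the matrix calculus above, under the reconstructed notation, not the authors' implementation.

    import numpy as np

    def grad_J1(Phi, F, L_W, D, Lam, mu):
        # Assumed shapes: F is (d, n); L_W and D are (n, n); Phi is (d, m); Lam is (m, m).
        FDFt = F @ D @ F.T
        return (2.0 * F @ L_W @ F.T @ Phi
                + 2.0 * FDFt @ Phi @ Lam
                - 2.0 * mu * FDFt @ Phi @ (Phi.T @ FDFt @ Phi - np.eye(Phi.shape[1])))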

The derivative of $H(L|Y)$ seems complicated, but it is greatly simplified by the chain rule. For clarity, we define $p_i^l = p(\Phi^{\top}f_i \mid l)$ and $p_i = p(\Phi^{\top}f_i)$. According to the chain rule,

$$\frac{\partial H(L|Y)}{\partial \Phi} = -\sum_{i,l}\frac{\partial p_i^l}{\partial \Phi}\left(\log p_i^l + 1 - \log C - \log p_i\right) + \sum_{i,l}\frac{p_i^l}{p_i}\,\frac{\partial p_i}{\partial \Phi}.$$

The calculations of $\partial p_i^l/\partial \Phi$ and $\partial p_i/\partial \Phi$ are straightforward:
$$\frac{\partial p_i^l}{\partial \Phi} = -\frac{p_i^l}{\sigma^2}\sum_{j}(f_i - f_j)(f_i - f_j)^{\top}\Phi \quad\text{and}\quad \frac{\partial p_i}{\partial \Phi} = -\frac{p_i}{\sigma^2}\sum_{j}(f_i - f_j)(f_i - f_j)^{\top}\Phi.$$

It is worth noting that the derivatives of the entropy term rely on discrete estimations whose computational complexity is $O(n^2)$, where $n$ is the total number of nodes on the manifold. In this paper, for cases with thousands of training SIFT features, we first cluster the features in each category into $k$ clusters, where $k$ is much smaller than the number of original descriptors, and these cluster centers in each category are used to calculate the derivatives of the entropy term. The derivative of $J_1$ is computed with the original SIFT features without clustering because it does not involve any discrete estimation and can easily be obtained by matrix calculus.
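As an illustration of the nonparametric estimates underlying the entropy term, the sketch below evaluates Parzen-window estimates of $p(\Phi^{\top}f_i \mid l)$ and $p(\Phi^{\top}f_i)$ in the embedding space, using the per-category cluster centers described above in place of the raw descriptors. The Gaussian window width sigma, the equal-class-prior marginal, and all names are illustrative assumptions rather than the paper's exact estimator.

    import numpy as np

    def parzen_densities(Phi, f_i, centers_by_class, sigma=1.0):
        # Phi: (d, m) embedding matrix; f_i: (d,) query descriptor.
        # centers_by_class: list of (k_l, d) arrays of cluster centers, one per category,
        # used instead of all descriptors to keep the O(n^2) entropy cost manageable.
        y_i = Phi.T @ f_i                                  # embedded query point, shape (m,)
        p_cond = []
        for centers in centers_by_class:
            proj = centers @ Phi                           # embedded centers, shape (k_l, m)
            sq_dist = np.sum((proj - y_i) ** 2, axis=1)
            p_cond.append(np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2))))
        p_cond = np.array(p_cond)                          # p(Phi^T f_i | l) for each class l
        p_marg = p_cond.mean()                             # p(Phi^T f_i), assuming equal class priors
        return p_cond, p_marg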

References

[1] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 2, 2006, pp. 2169–2178.
[2] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” Comput. Vision Image Understand., vol. 106, no. 1, pp. 59–70, 2007.
[3] J. Yu, D. Liu, D. Tao, and H. S. Seah, “On combining multiple features for cartoon character retrieval and clip synthesis,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 5, pp. 1413–1427, Oct. 2012.
[4] L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., vol. 2, 2005, pp. 524–531.
[5] A. Bosch, A. Zisserman, and X. Munoz, “Scene classification via PLSA,” in Proc. ECCV, 2006, pp. 517–530.
[6] J. van Gemert, C. Veenman, A. Smeulders, and J. Geusebroek, “Visual word ambiguity,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, pp. 1271–1283, Jul. 2010.
[7] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Jun. 2009, pp. 1794–1801.


[8] J. Tenenbaum, V. De Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[9] H. Qiu and E. Hancock, “Clustering and embedding using commute times,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 11, pp. 1873–1890, Nov. 2007.
[10] Y. Deng, Y. Zhao, Y. Liu, and Q. Dai, “Differences help recognition: A probabilistic interpretation,” PLoS ONE, vol. 8, no. 6, p. e63385, 2013.
[11] N. Kwak and C. Choi, “Input feature selection by mutual information based on Parzen window,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1667–1671, Dec. 2002.
[12] V. Epanechnikov, “Nonparametric estimation of a multidimensional probability density,” Teoriya Veroyatnostei i ee Primeneniya, vol. 14, no. 1, pp. 156–161, 1969.
[13] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–123, 2010.
[14] Y. Deng, Y. Liu, Q. Dai, Z. Zhang, and Y. Wang, “Noisy depth maps fusion for multiview stereo via matrix completion,” IEEE J. Sel. Topics Signal Process., vol. 6, no. 5, pp. 566–582, Sep. 2012.
[15] Y. Deng, Q. Dai, R. Liu, Z. Zhang, and S. Hu, “Low-rank structure learning via nonconvex heuristic recovery,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 383–396, Mar. 2013.
[16] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, no. 1, pp. 993–1022, 2003.
[17] A. Bosch, A. Zisserman, and X. Munoz, “Scene classification using a hybrid generative/discriminative approach,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.
[18] O. Boiman, E. Shechtman, and M. Irani, “In defense of nearest-neighbor based image classification,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8.
[19] S. Banerji, A. Sinha, and C. Liu, “Scene image classification: Some novel descriptors,” in Proc. IEEE Int. Conf. Syst., Man, Cybern. (SMC), Oct. 2012, pp. 2294–2299.
[20] K. Huang, D. Tao, Y. Yuan, X. Li, and T. Tan, “Biologically inspired features for scene classification in video surveillance,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 307–313, Feb. 2011.
[21] H. Li, Y. Wei, L. Li, and C. P. Chen, “Hierarchical feature extraction with local neural response for image recognition,” IEEE Trans. Cybern., vol. 43, no. 2, pp. 412–424, Apr. 2013.
[22] J. van Gemert, J. Geusebroek, C. Veenman, and A. Smeulders, “Kernel codebooks for scene categorization,” in Proc. ECCV, 2008, pp. 696–709.
[23] R. Baraniuk, “Compressive sensing [lecture notes],” IEEE Signal Process. Mag., vol. 24, no. 4, pp. 118–121, Jul. 2007.
[24] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Jun. 2010, pp. 3360–3367.
[25] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1998.
[26] Y. Deng, Q. Dai, and Z. Zhang, “Graph Laplace for occluded face completion and recognition,” IEEE Trans. Image Process., vol. 20, no. 8, pp. 2329–2338, Aug. 2011.
[27] R. Wang and X. Chen, “Manifold discriminant analysis,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Jun. 2009, pp. 429–436.
[28] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[29] R. Wang, S. Shan, X. Chen, and W. Gao, “Manifold-manifold distance with application to face recognition based on image set,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8.
[30] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.
[31] J.-B. Yang and C.-J. Ong, “An effective feature selection method via mutual information estimation,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 6, pp. 1550–1559, Dec. 2012.
[32] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[33] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon, “Information-theoretic metric learning,” in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 209–216.
[34] S. Lazebnik and M. Raginsky, “Supervised learning of quantizer codebooks by information loss minimization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 7, pp. 1294–1309, Jul. 2009.
[35] Y. Deng, Q. Dai, R. Wang, and Z. Zhang, “Commute time guided transformation for feature extraction,” Comput. Vision Image Understand., vol. 116, no. 4, pp. 473–483, 2012.
[36] T. Cover and J. Thomas, Elements of Information Theory, 2nd ed. New York, NY, USA: Wiley, 2006, pp. 153–158.
[37] Y. Deng, Y. Qian, Y. Li, Q. Dai, and G. Er, “Visual words assignment on a graph via minimal mutual information loss,” in Proc. Brit. Mach. Vision Conf., 2012, pp. 91.1–91.10.
[38] D. Fadda, E. Slezak, and A. Bijaoui, “Density estimation with non-parametric methods,” arXiv preprint astro-ph/9704096, 1997.
[39] M. Schmidt [Online]. Available: http://www.di.ens.fr/mschmidt/Software/minFunc.html
[40] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. (CVPR), vol. 1, 2005, pp. 886–893.
[41] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, 2011.
[42] G. Ye, Y. Liu, Y. Deng, N. Hasler, X. Ji, Q. Dai, et al., “Free-viewpoint video of human actors using multiple handheld Kinects,” IEEE Trans. Cybern., vol. 43, no. 5, pp. 1370–1382, Oct. 2013.
[43] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Comput., vol. 15, no. 6, pp. 1373–1396, 2003.
[44] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[45] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, “Face recognition using Laplacianfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 328–340, Mar. 2005.
[46] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. (2007). The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[47] G. Casella and R. L. Berger, Statistical Inference, vol. 70. Belmont, CA, USA: Duxbury Press, 1990.
[48] Q. Zhang and Q. Wang, “Local least absolute relative error estimating approach for partially linear multiplicative model,” Statistica Sinica, vol. 23, no. 3, pp. 1091–1116, 2013.
[49] Q. Zhang, D. Li, and H. Wang, “A note on tail dependence regression,” J. Multivariate Anal., vol. 120, no. 12, pp. 163–172, 2013.
[50] H. Lee, A. Battle, R. Raina, and A. Ng, “Efficient sparse coding algorithms,” Adv. Neural Inf. Process. Syst., vol. 19, p. 801, 2007.

Yue Deng received the B.E. degree (Hons.) in automatic control from Southeast University, Nanjing, China, in 2008, and the Ph.D. degree (Hons.) in control science and engineering from the Department of Automation, Tsinghua University, Beijing, China, in 2013.

He was a Visiting Scholar with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, from 2010 to 2011. He is currently an Associate Professor with the School of Electronic Science and Engineering, Nanjing University, Jiangsu, China. His current research interests include machine learning, computer vision, and computational biology.

Yipeng Li received the B.S. and M.S. degrees in electronic engineering from the Harbin Institute of Technology, Harbin, China, and the Ph.D. degree in electronic engineering from Tsinghua University, Beijing, China, in 2003, 2005, and 2011, respectively.

He is currently an Assistant Researcher with the Department of Automation, Tsinghua University. His current research interests include UAV vision-based autonomous navigation, 3-D reconstruction of natural environments, complex systems theory, and Internet applications analysis.


Yanjun Qian received the B.S. and M.S. degrees from the Department of Automation, Tsinghua University, Beijing, China. He is currently pursuing the Ph.D. degree with the Department of Industrial and System Engineering, Texas A&M University, College Station, TX, USA.

His current research interests include machine learning and statistics, with applications to image processing and manufacturing methodology.

Xiangyang Ji received the B.S. degree in materials science and the M.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.

He joined Tsinghua University, Beijing, China, in 2008, where he is currently an Associate Professor with the Department of Automation, School of Information Science and Technology. His current research interests include signal processing, image/video processing, image/video compression and communication, 3-D representation, and reconstruction.

Qionghai Dai (SM’05) received the B.S. degree in mathematics from Shanxi Normal University, Xi’an, China, in 1987, and the M.E. and Ph.D. degrees in computer science and automation from Northeastern University, Shenyang, China, in 1994 and 1996, respectively.

Since 1997, he has been a Faculty Member with Tsinghua University, Beijing, China, where he is currently a Cheung Kong Professor and the Director of the Broadband Networks and Digital Media Laboratory. His current research interests include signal processing, computer vision, and graphics.