Context-Aware Discriminative Vocabulary Learning for Mobile Landmark Recognition


Tao Chen and Kim-Hui Yap

Abstract—This paper proposes a discriminative vocabulary learning method for landmark recognition based on the context information acquired from mobile devices. The vocabulary learning generates a set of discriminative codewords for image representation, which is important for landmark recognition. Many state-of-the-art methods use content analysis alone for vocabulary learning, which underutilizes the context information provided by mobile devices, such as location from the GPS positioner and direction from the digital compass. Although some works have started to consider the images' location information for vocabulary learning, location alone is insufficient since GPS data has significant errors in dense built-up areas. The context analysis techniques that use GPS to shortlist the geographically nearby landmark candidates for subsequent image matching are at times inadequate. In view of this, the paper proposes to employ both direction and location information to learn a discriminative compact vocabulary (DCV) for mobile landmark recognition. Direction information is first used to supervise image feature clustering to construct direction-dependent scalable vocabulary trees (DSVTs). Location information is then incorporated into the proposed DCV learning algorithm to select the discriminative codewords of the DSVT to form the DCV. An ImageRank technique and an iterative codeword selection algorithm are developed for DCV learning. Experimental results using the NTU50Landmark database show that the proposed approach achieves a 4% improvement over the current method in mobile landmark recognition.

Index Terms—Direction-dependent scalable vocabulary trees (DSVT), discriminative compact vocabulary (DCV), location and direction, mobile landmark recognition.

I. Introduction

THE DEVELOPMENTS of cell phone cameras in recent years [1] have prompted many researchers to devote more effort to mobile device-based landmark and location recognition, which allows users to capture a picture of a landmark using a mobile phone camera and determine its related information. As a result, various mobile-based visual search [2], [3] and landmark recognition [4]–[11] works have appeared.

Unlike non-mobile environments, mobile landmark recognition needs to take the unique features of mobile devices into consideration, including: 1) limited computing power and battery capacity, and 2) the mobile user's fast response-time requirement.

Manuscript received August 1, 2012; revised December 17, 2012; accepted February 20, 2013. Date of publication March 27, 2013; date of current version August 30, 2013. This paper was recommended by Associate Editor Q. Tian.

The authors are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: chen0523@e.ntu.edu.sg; ekhyap@ntu.edu.sg).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2013.2254978

In order to address these issues, a client-server structure is adopted in many mobile-based recognition works [2]–[4], [6]–[8], [10], [11], [34]. Using this architecture, the computational load is shifted from the mobile client to high-performance servers, and the powerful computing capability of the servers can speed up the recognition process. Among these works, some capture an image at the client end (mobile device) and send the query image to the server. The server extracts features from the query image, and then matches these features against the offline-constructed database for recognition. However, one important concern with the above process is that the size of the query image may be large. This makes the image transmission time over the wireless network long, and prone to interruption when the network condition is poor.

Considering this, some works [2], [8], [14], [34] have endeavored to extract the image features directly on the mobile device and then transform these features into a compact descriptor for transmission. By sending the compact descriptor instead of the original image, the transmission time can be greatly reduced. For example, in [14], local features, such as speeded up robust features (SURF) [16] or the scale-invariant feature transform (SIFT) [17], are first extracted from the query image, followed by a scalable vocabulary tree (SVT) [18] to quantize these features into a bag-of-words (BoW) histogram, which is then encoded and transmitted to the remote server for recognition. In [2], a compressed histogram of gradients [15], [34] is adopted to replace SIFT or SURF as the image descriptor for transmission. An effective vocabulary learning method for mobile landmark recognition has been proposed in [8] recently. The authors consider the GPS information and propose a location discriminative vocabulary encoding approach, which consists of two major steps. The first step is to learn a compact location discriminative landmark descriptor in the offline stage. Spectral clustering is used to segment a city map into distinct geographical regions. A ranking-sensitive vocabulary boosting method is proposed to learn location discriminative vocabulary coding (LDVC) in each region, which embeds location cues to learn a compact descriptor and minimizes the retrieval ranking loss. The second step is to perform location-aware online vocabulary adaptation. A single vocabulary is stored in the mobile device, which is efficiently adapted to a region-based LDVC once the mobile device enters a specific region. The quantitative evaluation shows that transmitting the compact descriptor [8] instead of the original BoW [3], [4] achieves a higher recognition rate and faster speed.



Fig. 1. Proposed mobile landmark recognition architecture.

In summary, it is found that learning a discriminative vocabulary to encode the query image is effective for mobile landmark recognition.

In fact, various vocabulary (codebook) learning works have appeared recently [11], [19]–[24]. A category-related codeword discrimination learning method based on the word occurrence ratio of positive to negative categories is proposed in [11] and [24]. In [19], a fast hierarchical merging algorithm is proposed to merge a large number of words into a small number of discriminative words, which can reduce the memory cost. In [20], a discriminative codebook learning algorithm is developed that iteratively feeds back the SVM classifier outputs to the codeword generation, until the final codebook achieves the strongest discriminative power. A multiresolution codeword selection approach is proposed in [21], which tries to select the discriminative words from the multiresolution local patches using a boosting process. In [22], the discriminative capability of each codeword for each category is learned through an information gain-based method. In [23], a multioutput linear function is proposed to model the relationship between the codeword-selected data matrix and the indicator matrix. The most discriminative words are defined as those leading to minimal fitting error.

The common purpose of these works [19]–[24] is to construct a compact vocabulary while maximally preserving the separability of the object classes. This can boost the computational efficiency and reduce memory cost. However, one shortcoming of these works is that their vocabulary learning methods are based on content analysis alone, and do not consider the context information (e.g., the GPS data) provided by the mobile device. Even though some works, such as [8], have started to integrate context with content analysis for vocabulary learning, the context information is still limited to GPS alone, which can have significant errors (up to 100 m) in dense built-up areas, and this causes the context analysis using GPS alone to be unreliable at times. Furthermore, the GPS and content integration technique is preliminary, simply using the location information to shortlist the geographically nearby landmark candidates for subsequent image matching.

In view of this, this paper proposes to exploit both location and direction information to learn a discriminative compact vocabulary (DCV) for mobile landmark recognition. It is noted that although direction information is also used for landmark recognition in [11], it mainly forms a field-of-view to shortlist landmark candidates and reduce the search range. During the offline learning phase, the direction information is not incorporated. In contrast, in this paper, the direction information is not only used in the online image testing phase, but also fully utilized in the offline DCV learning phase. To the best of our knowledge, this is the first work to employ direction information for DCV learning. This is possible since many smartphones are now equipped with a digital compass, e.g., BlackBerry 9220, Nokia Lumia 610, iPhone 4, etc. This renders the direction information readily available. Two contributions are made in this paper. First, direction-dependent scalable vocabulary trees (DSVTs) are constructed by using the direction information to supervise the image feature clustering process. The leaf nodes (codewords) of the DSVT are treated as the preliminary DCV candidates. Second, a context-aware DCV learning algorithm is developed to select the discriminative codewords of the DSVT to form the DCV. An ImageRank technique based on latent semantic analysis (LSA) is developed to estimate the importance of each image. An iterative codeword selection algorithm that incorporates the image weights is then proposed to generate the DCV, which is used to encode the query image for landmark recognition.

The rest of the paper is organized as follows. A general overview of the proposed DCV-based mobile landmark recognition system is presented in Section II. The context-aware DCV learning approach is discussed in detail in Section III, which includes the DSVT construction, the developed ImageRank technique, and the proposed iterative codeword selection algorithm. A quantitative evaluation that includes the experimental results and discussions is given in Section IV. Section V concludes the paper.

II. Overview of the Proposed DCV-Based Mobile Landmark Recognition

Due to the unique characteristics of mobile devices, a client-server architecture is adopted in this paper. By incorporating the learned DCV into the client-server structure, an effective mobile landmark recognition prototype is proposed in Fig. 1. At the client end, a mobile user first captures a landmark image that is tagged with both location and direction data. Local features (e.g., SIFT) are then extracted from the image and quantized into a BoW histogram using the DSVT that is stored in the mobile device beforehand. This high-dimensional BoW histogram (the number of leaf nodes in the DSVT usually ranges from several thousand to hundreds of thousands) is compressed into a compact vector by retaining only the codewords that belong to the DCV. The compact vector is finally encoded using arithmetic or other coding techniques and transmitted to the server over the wireless network (e.g., WiFi, 3G, etc.). Compared with the original BoW histogram, the compact vector has a smaller memory size and a lower bit rate, and can be transmitted over a narrow-bandwidth network.


Fig. 2. Overview of the proposed context-aware DCV learning process.

This is important for mobile landmark recognition, which is often hindered by poor wireless network connections. At the backend server, the received packet is first decoded into the compact vector. The database images, which are indexed with an inverted file, are then scored according to their histogram similarity with the query image. The inverted file system is constructed similarly to [2], which can increase the search speed significantly.
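To make the client-side flow above concrete, the following is a minimal sketch of the quantize-and-compress step. It is an illustration only, not the authors' implementation: a flat array of codewords stands in for the DSVT leaves, and all function and variable names are hypothetical.

```python
import numpy as np

def quantize_to_bow(descriptors, codewords):
    """Hard-assign each local descriptor to its nearest codeword and
    accumulate an M-dimensional bag-of-words histogram."""
    dists = np.linalg.norm(descriptors[:, None, :] - codewords[None, :, :], axis=2)
    bow = np.bincount(dists.argmin(axis=1), minlength=len(codewords)).astype(float)
    return bow / (bow.sum() + 1e-12)

def client_query(descriptors, codewords, B, gps, direction_deg):
    """Quantize the query's local features, compress the histogram with the
    K x M selection matrix B (one nonzero per row), and bundle the context tags."""
    bow = quantize_to_bow(descriptors, codewords)
    compact = B @ bow            # K-dimensional compact vector, K << M
    return {"vector": compact, "gps": gps, "direction": direction_deg}
```

In practice the histogram would be obtained by traversing the vocabulary tree rather than by brute-force nearest-neighbor search, and the compact vector would be entropy coded before transmission.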

III. Context-Aware DCV Learning

In this section, we give a detailed description of the context-aware DCV learning, which plays a key role in the system. As shown in Fig. 2, the proposed DCV learning approach mainly consists of three steps.

1) DSVT construction, which first extracts local features and context data from the training images, and then utilizes the direction information to supervise the clustering of local features into a DSVT. The leaf nodes (codewords) of the DSVT are treated as the DCV candidates and used to quantize each image into a BoW histogram.

2) ImageRank analysis, which ranks the images' importance according to their discriminative capability. The reason for this step is the observation that discriminative images usually have a higher potential to contain the desired important codewords than other images, and thus should be given more emphasis in learning the DCV. In the proposed ImageRank technique, a term-document matrix is constructed using the BoW histograms of the training images. Singular value decomposition (SVD) is then performed on these histograms for latent semantic analysis, which can greatly speed up the process.

3) Codeword selection, which first develops a vocabulary discrimination metric based on image intra-class and inter-class similarity, and then proposes an iterative codeword selection algorithm based on this metric to form the DCV. The algorithm iteratively selects the most discriminative codeword from the original codebook into the DCV, until the termination criterion or a predefined DCV size is met.

A. Direction-Dependent Scalable Vocabulary Tree

Recently, the SVT model [18] has been used in different landmark recognition works such as [2], [8], and [14], due to its effectiveness in handling scalable visual search. However, one shortcoming of these works is that they do not consider any context information when constructing the SVT. Therefore, in this paper, direction information is used to supervise the image feature clustering to construct the DSVT. The rationale is that the consistency of the codewords for a landmark depends on the capturing angles of the landmark, and direction data from smartphones is generally accurate and reliable. The DSVT is constructed as follows.

First, as illustrated in Fig. 3(a), the training images are divided into various clusters according to their associated tagged direction data, e.g., images with directions in 0–30 degrees form a cluster, 30–60 degrees form the next cluster, and so on. The reason for adopting 30 degrees as the cluster range is that the image descriptors used in this paper, e.g., SIFT [17], are relatively robust to a perspective change of 30 degrees. Then, the hierarchical k-means algorithm is performed on the image features in each direction cluster to construct an SVT as in Fig. 3(b), after which a set of leaf nodes are generated and used as the DCV candidates. For an H-depth, B-branch SVT, this produces B^H leaf nodes (codewords) [8].
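The construction just described can be sketched as follows. This is a simplified illustration under assumptions not stated in the paper: scikit-learn's KMeans stands in for the authors' hierarchical k-means, recursion stops early when a node has too few features, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def direction_cluster(direction_deg, bin_width=30):
    """Assign an image to a direction cluster: 0-30, 30-60, ... degrees."""
    return int(direction_deg % 360) // bin_width

def hierarchical_kmeans(features, branch=10, depth=4):
    """Recursively cluster local features into a B-branch, H-depth tree and
    return the leaf centroids (up to branch**depth codeword candidates)."""
    if depth == 0 or len(features) < branch:
        return [features.mean(axis=0)]
    km = KMeans(n_clusters=branch, n_init=4).fit(features)
    leaves = []
    for b in range(branch):
        child = features[km.labels_ == b]
        if len(child):
            leaves.extend(hierarchical_kmeans(child, branch, depth - 1))
    return leaves

def build_dsvts(per_image_features, directions, branch=10, depth=4):
    """Group the training images by tagged direction, then build one vocabulary
    tree per direction cluster from the pooled local features."""
    grouped = {}
    for feats, d in zip(per_image_features, directions):
        grouped.setdefault(direction_cluster(d), []).append(feats)
    return {c: np.vstack(hierarchical_kmeans(np.vstack(fs), branch, depth))
            for c, fs in grouped.items()}
```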

There are two benefits of using direction to supervise image clustering for DSVT construction.

1) DSVT-generated codewords have more discriminative capability in differentiating the landmark images in the same cluster. The rationale is that the same landmark's images in a cluster are often visually close to each other due to similar viewpoints. As a result, the codewords generated from these images are also visually consistent and have better discriminative capability.

2) The search space in the database is greatly reduced when performing similarity matching for the query image. Unlike previous works, such as [6] and [8], that use GPS alone for image filtering, the direction and location information are combined to select the landmark candidates in this paper, which can provide better results than using location alone. The database images in the direction cluster nearest to the query are first determined, followed by GPS selection of the geographically nearby landmarks for subsequent image matching.

However, these DSVT-derived codewords cannot be directly used as the final vocabulary to encode the query image, since the total number of codewords in each direction cluster may be large, and this may cause the dimension of the BoW histogram to be high. This is unsuitable for mobile recognition since the mobile network usually has limited bandwidth. Furthermore, some words that are commonly observed in many landmarks may have weak discriminative capability in the direction cluster. Considering this, we design a vocabulary learning method to select the discriminative codewords to form the DCV.


Fig. 3. (a) Training images that are divided into various direction clusters. (b) DSVT in each cluster, with the leaf nodes as the DCV candidates.


B. ImageRank Analysis

As mentioned earlier, ImageRank analysis is performed to emphasize the discriminative images, since they have greater potential to contain words that belong to the DCV, and vice versa. The flowchart of the proposed ImageRank algorithm is shown in Fig. 2. A term-document (in our case, word-image) matrix is first constructed using the BoW histograms of the training images. LSA is then performed on the matrix for image discriminative analysis. The reasons for adopting LSA are the following.

1) The original BoW histogram is high dimensional, and this causes image analysis at the BoW level to have high computational cost. By transforming the BoW histogram to a latent topic representation through selecting the topics with large singular values, the dimension of the histogram can be greatly reduced. The singular values in fact reflect the topics' importance in discriminating various landmark images.

2) Landmark images usually consist of basic semantic units, e.g., windows, glass, walls, doors, etc. Therefore, it is reasonable to assume that there exist some topics among these images.

The latent topic context underlying the word-image relationship can be explored by means of a semantic model. LSA, originally proposed for text indexing and retrieval and proven to be powerful for discovering the implicit higher-order context in the association of terms with documents [25], is sufficient for our task. According to LSA, a word-image matrix $\mathbf{W}_{M\times N}=[\mathbf{w}_1 \cdots \mathbf{w}_N]$, where $M$ and $N$ denote the numbers of codewords and images, respectively, and each column $\mathbf{w}_n$ denotes the $M$-dimensional BoW histogram of the $n$th image normalized to unit length, can be decomposed into the product of three matrices by SVD

$$
\mathbf{W}_{M\times N}=\mathbf{U}_{M\times M}\boldsymbol{\Sigma}_{M\times N}\mathbf{V}^{T}_{N\times N}
\;\Rightarrow\;
[\mathbf{w}_1 \cdots \mathbf{w}_N]=[\mathbf{u}_1 \cdots \mathbf{u}_M]
\begin{bmatrix}
\sigma_{1,1} & \cdots & 0\\
\vdots & \ddots & \vdots\\
0 & \cdots & \sigma_{N,N}\\
\vdots & & \vdots\\
0 & \cdots & 0
\end{bmatrix}
\begin{bmatrix}\mathbf{v}_1^{T}\\ \vdots\\ \mathbf{v}_N^{T}\end{bmatrix}
\qquad (1)
$$

where $\mathbf{U}_{M\times M}$ can be understood as the orthogonal word-topic matrix of size $M\times M$, with each column $\mathbf{u}_m$ an $M$-dimensional vector, and $\mathbf{V}_{N\times N}$ can be understood as the orthogonal image-topic matrix of size $N\times N$, with each column $\mathbf{v}_n$ an $N$-dimensional vector. $\boldsymbol{\Sigma}$ is an $M\times N$ rectangular diagonal matrix with all diagonal elements $\sigma$ positive and in decreasing order, corresponding to the latent topics. In order to maintain the real data structure of $\mathbf{W}$ while ignoring the unimportant details, only the top $Z$ ($Z<M$, $Z<N$) largest diagonal elements of $\boldsymbol{\Sigma}$ (corresponding to the most important topics) are kept, while the remaining smaller ones (corresponding to unimportant topics) are set to zero. This is equivalent to deleting the zero rows and columns of $\boldsymbol{\Sigma}$ to obtain a compact matrix $\bar{\boldsymbol{\Sigma}}$ and deleting the corresponding columns of $\mathbf{U}$ and $\mathbf{V}$ to yield $\bar{\mathbf{U}}$ and $\bar{\mathbf{V}}$, respectively, which is denoted as

$$
\bar{\mathbf{W}}_{M\times N}=[\bar{\mathbf{w}}_1 \cdots \bar{\mathbf{w}}_N]
=[\mathbf{u}_1 \cdots \mathbf{u}_Z]
\begin{bmatrix}
\sigma_{1,1} & \cdots & 0\\
\vdots & \ddots & \vdots\\
0 & \cdots & \sigma_{Z,Z}
\end{bmatrix}
\begin{bmatrix}\mathbf{v}_1^{T}\\ \vdots\\ \mathbf{v}_Z^{T}\end{bmatrix}
=\bar{\mathbf{U}}_{M\times Z}\,\bar{\boldsymbol{\Sigma}}_{Z\times Z}\,\bar{\mathbf{V}}^{T}_{N\times Z}.
\qquad (2)
$$

The dimensionality reduction operation for a new image is then achieved by performing the following mathematical transforms on (2)

$$
\bar{\mathbf{W}}^{T}=(\bar{\mathbf{U}}\bar{\boldsymbol{\Sigma}}\bar{\mathbf{V}}^{T})^{T}
\;\Rightarrow\;
\bar{\mathbf{W}}^{T}=\bar{\mathbf{V}}\bar{\boldsymbol{\Sigma}}\bar{\mathbf{U}}^{T}
\;\Rightarrow\;
\bar{\mathbf{W}}^{T}\bar{\mathbf{U}}\bar{\boldsymbol{\Sigma}}^{-1}=\bar{\mathbf{V}}\bar{\boldsymbol{\Sigma}}\bar{\mathbf{U}}^{T}\bar{\mathbf{U}}\bar{\boldsymbol{\Sigma}}^{-1}
\;\Rightarrow\;
\bar{\mathbf{V}}=\bar{\mathbf{W}}^{T}\bar{\mathbf{U}}\bar{\boldsymbol{\Sigma}}^{-1}.
\qquad (3)
$$

Based on (3), given a query image's original BoW histogram $\mathbf{w}$, its latent topic representation $\mathbf{v}$ is formulated as

$$
\mathbf{v}=\mathbf{w}^{T}\bar{\mathbf{U}}\bar{\boldsymbol{\Sigma}}^{-1}.
\qquad (4)
$$

In order to determine Z, we follow the method in [26] and find that the best results are obtained when Z equals the top 30% of the total diagonal elements in Σ, which means that the dimension of the new image's latent topic representation is reduced to only 30% of the original BoW histogram. The computational cost can thus be greatly reduced. Next, to estimate the discriminative capability of each image for its category in a specific direction cluster, we take two factors into consideration. The first factor is the images' pairwise cosine similarity. The images having more pairwise similarity with other images of the same landmark are generally considered more representative.


For a pair of images $i$ and $j$, their cosine similarity can be calculated using the latent topic representation, and is denoted as $S(i,j)$

$$
S(i,j)=\|\mathbf{v}_i,\mathbf{v}_j\|_{\cos}=\frac{\mathbf{v}_i^{T}\mathbf{v}_j}{\|\mathbf{v}_i\|\cdot\|\mathbf{v}_j\|}
\qquad (5)
$$

where $\|\mathbf{v}_i\|$ is the 2-norm of $\mathbf{v}_i$.

The second factor is the images' within-class to between-class similarity ratio. An image that has more similarity with its own category's images and less similarity with other categories' images is considered more representative of its category. The ratio $S(i,i)$ of an image $i$ is formulated as

$$
S(i,i)=\frac{\sum_{j=1,\,j\neq i}^{N_t}\|\mathbf{v}_i,\mathbf{v}_j\|_{\cos}}{\sum_{j=1}^{N_o}\|\mathbf{v}_i,\mathbf{v}_j\|_{\cos}\,I_R(\mathbf{v}_i,\mathbf{v}_j)}
\qquad (6)
$$

where the numerator and denominator represent the image's within-class and between-class similarity, respectively, $N_t$ and $N_o$ denote the numbers of images in the target and negative landmark categories in the direction cluster, respectively, and $I_R(\cdot)$ is an indicator function used to select the geographically nearby images, within a radius of $R$ from the target landmark, for similarity analysis. $R$ is set to 200 m based on our field study. $I_R(\cdot)$ is defined as

$$
I_R(\mathbf{v}_i,\mathbf{v}_j)=
\begin{cases}
1, & \text{if } \|\mathbf{v}_i,\mathbf{v}_j\|_{\text{Geo}}<R\\
0, & \text{otherwise}
\end{cases}
\qquad (7)
$$

where $\|\mathbf{v}_i,\mathbf{v}_j\|_{\text{Geo}}$ is the geographical distance between the negative image $j$ and the target landmark to which image $i$ belongs, calculated in the same way as in [27].
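The geographical distance in (7) is computed as in [27]; the sketch below uses the haversine formula as a common, essentially equivalent stand-in, with R = 200 m as stated above.

```python
import math

def geo_distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2.0 * 6371000.0 * math.asin(math.sqrt(a))

def indicator_R(target_latlon, image_latlon, R=200.0):
    """I_R(.) of Eq. (7): 1 if the negative image lies within R metres of the
    target landmark, 0 otherwise."""
    return 1.0 if geo_distance_m(*target_latlon, *image_latlon) < R else 0.0
```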

Finally, inspired by the VisualRank idea in [28], an ImageRank technique is developed here. The rationale is that the ImageRank method can assign different weights to different images according to their discriminative capability. The weights reflect the importance of the different images within their category. The method works as follows. A matrix is first built, where each diagonal element records the within-class to between-class similarity ratio of an image, and each nondiagonal element records the similarity between a pair of images. Iterations are then carried out by multiplying this matrix with an initialized weight vector of these images. During each iteration, the images with a higher within-class to between-class similarity ratio and more similarity with other images are emphasized, and vice versa. The iterations stop when the weights become stable. Based on this, a matrix S is first constructed for the ImageRank analysis, where the diagonal elements S(i, i) are calculated as in (6) and normalized by their diagonal summation to unit length, and the nondiagonal elements S(i, j) are calculated as in (5) and normalized by their columnwise summation to unit length. The image weight vector r is then iteratively estimated as

$$
\mathbf{r}=d\,\mathbf{S}\mathbf{r}+(1-d)\,\boldsymbol{\rho}
\qquad (8)
$$

where $\boldsymbol{\rho}$ is an $N_t \times 1$ distracting vector for random-walk behavior [28], with each element equal to $1/N_t$, and $d$ is a constant damping factor.

Algorithm 1 ImageRank algorithm

Input: A set of BoW histograms $\mathbf{w}_n$ for the training images of each landmark category in the direction cluster; SVD-decomposed matrices $\bar{\mathbf{U}}$, $\bar{\boldsymbol{\Sigma}}$.
Output: Learned image weight vector $\mathbf{r}$ for each category.
1. Transform the original BoW histograms to the latent topic-based representation according to (4).
2. Construct the matrix $\mathbf{S}$ based on (5) and (6).
3. Learn the image weight vector $\mathbf{r}$ iteratively:
   While $\|\mathbf{r}^{(i)}-\mathbf{r}^{(i-1)}\| > \tau$ and $i < i_{\max}$ do
     Update the image weight $\mathbf{r}$ using (8);
     Increment the iteration number $i$;
   End
4. Output the final learned image weights $\mathbf{r}$.

Fig. 4. Ranked images for a direction cluster using the ImageRank method.

$\mathbf{r}$ is an $N_t \times 1$ weight vector for the $N_t$ images in the target category. Usually, $d = 0.8$ is chosen as in [26]. The iteration of (8) is considered to have converged when the change in $\mathbf{r}$ is small enough or a maximum iteration number is reached. Algorithm 1 summarizes the ImageRank technique.
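Putting (5)-(8) together, the following sketch illustrates one possible reading of Algorithm 1. The inputs (latent vectors of target and negative images, the geographic indicator values) and the exact normalization of S are assumptions on details the paper leaves open; it is not the authors' code.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def image_rank(V_target, V_negative, neg_in_range, d=0.8, tau=1e-6, max_iter=100):
    """V_target: latent vectors of the Nt target-category images in a direction
    cluster; V_negative: latent vectors of the negative-category images;
    neg_in_range[j]: I_R value of negative image j for this landmark (Eq. (7))."""
    Nt = len(V_target)
    S = np.zeros((Nt, Nt))
    # Off-diagonal entries: pairwise cosine similarity within the category, Eq. (5).
    for i in range(Nt):
        for j in range(Nt):
            if i != j:
                S[i, j] = cosine(V_target[i], V_target[j])
    # Diagonal entries: within-class / between-class similarity ratio, Eq. (6).
    for i in range(Nt):
        within = sum(cosine(V_target[i], V_target[j]) for j in range(Nt) if j != i)
        between = sum(r * cosine(V_target[i], v) for r, v in zip(neg_in_range, V_negative))
        S[i, i] = within / (between + 1e-12)
    # Normalization: off-diagonal columns to unit sum, diagonal by its own sum.
    diag = np.diag(S).copy()
    np.fill_diagonal(S, 0.0)
    S = S / (S.sum(axis=0, keepdims=True) + 1e-12)
    np.fill_diagonal(S, diag / (diag.sum() + 1e-12))
    # Damped iteration of Eq. (8) until the weights stabilize.
    r = np.full(Nt, 1.0 / Nt)
    rho = np.full(Nt, 1.0 / Nt)
    for _ in range(max_iter):
        r_new = d * (S @ r) + (1 - d) * rho
        if np.linalg.norm(r_new - r) < tau:
            return r_new
        r = r_new
    return r
```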

Fig. 4 illustrates several landmark images in a direction cluster, ranked in decreasing weight order from left to right. It can be seen that the leftmost images present more discriminative information about the captured landmarks, while the rightmost images present less discriminative information due to clutter, occlusion, etc. This shows the effectiveness of the proposed ImageRank method.

C. DCV Learning

In this section, we first describe the DCV learning principle in the BoW space, and then extend it to the latent semantic space. The DCV learning is in fact a codeword selection process, which selects the most discriminative codewords from the original codebook into the newly generated DCV. Based on this, we first create a binary matrix B of size K×M to record whether a word in the original codebook is selected, where column m = 1, 2, ..., M (M is the original codebook size) indexes the words in the original codebook, row k = 1, 2, ..., K (K is the DCV size, K < M) indexes the selected words in the generated DCV, and entry B(k, m) denotes whether the mth word in the original codebook is selected (1 indicates yes, 0 indicates no) and mapped to the kth word in the new DCV. Using this matrix, an original high-dimensional BoW histogram w generated by the DSVT can be compressed into a low-dimensional compact vector c


for transmission, which is calculated by

$$
\mathbf{c}_{K\times 1}=\mathbf{B}_{K\times M}\,\mathbf{w}_{M\times 1}
\qquad (9)
$$

where each row of the matrix B must contain exactly one nonzero element, selecting one item of w into c.
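Since each row of B has a single nonzero entry, (9) simply picks out the K selected bins of the BoW histogram. A small sketch follows; the selected indices are illustrative, and in the paper they come from the learning algorithm described below.

```python
import numpy as np

def selection_matrix(selected_indices, M):
    """Build the K x M binary matrix B of Eq. (9): row k holds a single 1 in the
    column of the k-th selected codeword."""
    K = len(selected_indices)
    B = np.zeros((K, M))
    B[np.arange(K), selected_indices] = 1.0
    return B

# Usage: compress an M-dimensional BoW histogram to the K DCV bins.
M = 10000
w = np.random.rand(M)
B = selection_matrix([3, 17, 256, 999], M)
c = B @ w                     # identical to w[[3, 17, 256, 999]]
```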

From the above, it can be seen that DCV learning can be transformed into finding an optimal word selection matrix B that determines which codewords in the original codebook are selected into the DCV. In order to learn the optimal B, a vocabulary discrimination metric based on intra-class and inter-class image similarity is developed. An iterative codeword selection algorithm is then proposed to find the optimal B that maximizes the developed vocabulary discrimination metric.

1) Vocabulary Discrimination Metric: The developed vocabulary discrimination metric consists of two parts: image intra-class similarity, which refers to how close and similar the compressed histograms of the images within the same category in a direction cluster are, as indicated in the numerator of (10), and image inter-class similarity, which is defined as how different the compressed histograms of the images between various landmark categories in a direction cluster are, as indicated in the denominator of (10). With respect to the word selection matrix B, the proposed metric is defined as

$$
D(\mathbf{B})=\frac{\displaystyle\sum_{l=1}^{L}\sum_{m=1}^{N_l} r(m)\sum_{n=1,\,n\neq m}^{N_l}\|\mathbf{B}\mathbf{w}_m,\mathbf{B}\mathbf{w}_n\|_{\cos}\sin\theta}{\displaystyle\sum_{l=1}^{L}\sum_{m=1}^{N_l} r(m)\sum_{n=1}^{N_m}\|\mathbf{B}\mathbf{w}_m,\mathbf{B}\mathbf{w}_n\|_{\cos}\,I_R(\mathbf{w}_m,\mathbf{w}_n)}
\qquad (10)
$$

where D is the vocabulary discrimination value using B, L is the number of landmark categories in the direction cluster, N_l is the number of images in landmark category l, and N_m is the number of negative landmark images for image m, determined according to the indicator function I_R(·) defined in (7). r(m) is the image weight calculated in (8), which is used to emphasize the more discriminative images and weaken the less discriminative ones. ||Bw_m, Bw_n||_cos is the cosine similarity between the compressed histograms of the mth and nth images using B. θ denotes the angle between the capturing directions of the two images, and is always smaller than 30 degrees since the range of a direction cluster is 30 degrees. sin(θ) serves as a weighting factor that emphasizes the words spreading across a large perspective view around the landmark. The reason is that if a word can be found across a large view angle around the landmark, then the word has good representative capability for this landmark and is emphasized. On the contrary, if a word can be found only within a narrow view angle around the landmark, then this word has lower representative capability, as reflected by a small sin(θ) value.
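A direct, unoptimized sketch of (10) is given below. The data containers (per-image labels, ImageRank weights, pairwise direction differences, and geographic indicator flags) are hypothetical structures, and the paper evaluates the metric per direction cluster.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def discrimination_value(B, bows, labels, weights, angle_deg, in_range):
    """Direct evaluation of Eq. (10) within one direction cluster.
    bows[m]: M-dim BoW histogram of image m; labels[m]: its landmark category;
    weights[m]: its ImageRank weight r(m); angle_deg[m][n]: capture-direction
    difference between images m and n; in_range[m][n]: I_R value for a negative
    pair (m, n)."""
    compressed = [B @ w for w in bows]
    num, den = 0.0, 0.0
    for m in range(len(bows)):
        for n in range(len(bows)):
            if n == m:
                continue
            sim = cos_sim(compressed[m], compressed[n])
            if labels[n] == labels[m]:                     # intra-class (numerator)
                num += weights[m] * sim * np.sin(np.radians(angle_deg[m][n]))
            elif in_range[m][n]:                            # inter-class (denominator)
                den += weights[m] * sim
    return num / (den + 1e-12)
```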

From (10), it can be seen that the proposed vocabulary discrimination metric is a ratio between intra-class and inter-class image similarity, and is determined by the word selection matrix B. Based on this, the optimal B can be defined as the one leading to maximum intra-class image similarity and minimum inter-class image similarity, that is, maximizing the discrimination value D. This is formulated as

$$
\mathbf{B}_{\text{optimal}}=\arg\max_{\mathbf{B}} D(\mathbf{B})
=\arg\max_{\mathbf{B}}\left(
\frac{\displaystyle\sum_{l=1}^{L}\sum_{m=1}^{N_l} r(m)\left(\sum_{n=1,\,n\neq m}^{N_l}\|\mathbf{B}\mathbf{w}_m,\mathbf{B}\mathbf{w}_n\|_{\cos}\sin\theta\right)}
{\displaystyle\sum_{l=1}^{L}\sum_{m=1}^{N_l} r(m)\left(\sum_{n=1}^{N_m}\|\mathbf{B}\mathbf{w}_m,\mathbf{B}\mathbf{w}_n\|_{\cos}\,I_R(\mathbf{w}_m,\mathbf{w}_n)\right)}
\right)
\qquad (11)
$$


where B_optimal is the learned optimal word selection matrix. In order to speed up the learning process of B, the LSA used by the ImageRank technique in Section III-B is adopted here to transform the original objective (11) from the BoW space to the latent semantic space, as follows:

$$
\mathbf{B}_{\text{optimal}}=\arg\max_{\mathbf{B}} D(\mathbf{B})
=\arg\max_{\mathbf{B}}\left(
\frac{\displaystyle\sum_{l=1}^{L}\sum_{m=1}^{N_l} r(m)\left(\sum_{n=1,\,n\neq m}^{N_l}\left\|(\mathbf{B}\mathbf{w}_m)^{T}(\mathbf{B}\bar{\mathbf{U}})\bar{\boldsymbol{\Sigma}}^{-1},\,(\mathbf{B}\mathbf{w}_n)^{T}(\mathbf{B}\bar{\mathbf{U}})\bar{\boldsymbol{\Sigma}}^{-1}\right\|_{\cos}\sin\theta\right)}
{\displaystyle\sum_{l=1}^{L}\sum_{m=1}^{N_l} r(m)\left(\sum_{n=1}^{N_m}\left\|(\mathbf{B}\mathbf{w}_m)^{T}(\mathbf{B}\bar{\mathbf{U}})\bar{\boldsymbol{\Sigma}}^{-1},\,(\mathbf{B}\mathbf{w}_n)^{T}(\mathbf{B}\bar{\mathbf{U}})\bar{\boldsymbol{\Sigma}}^{-1}\right\|_{\cos} I_R(\mathbf{w}_m,\mathbf{w}_n)\right)}
\right).
\qquad (12)
$$

2) Iterative Codeword Selection: In order to find the optimal B in (12), an iterative codeword selection algorithm is proposed. The algorithm iteratively selects the word that produces the largest discrimination value into the DCV. The following notation is used: i denotes the iteration count and, at the same time, the index of the selected word in the generated DCV, since only a single word is selected into the DCV during each iteration; j denotes the word index in the original codebook; B_{i-1} denotes the word selection matrix learned during the (i-1)th iteration; y_j = [0, ..., pos(j), ..., 0] of size M×1 indicates the jth visual word in the original codebook that will be selected into the DCV; and u_i = [0, ..., pos(i), ..., 0] of size K×1 indicates that the selected word is mapped to the ith word c_i in the generated DCV. Based on these definitions, we use u_i(y_j)^T to indicate that the jth word in the original codebook is selected and mapped to the ith word in the new DCV during the ith iteration. The word selection matrix B_{i-1} is updated to B_i by the following operation:

$$
\mathbf{B}_i=\mathbf{B}_{i-1}+\mathbf{u}_i\mathbf{y}_j^{T}.
\qquad (13)
$$

From (13), it is seen that the key step is to determine which word (indexed by j) in the original codebook is selected as c_i in the new DCV during the ith iteration. This is solved by

$$
c_i=\arg\max_{j} D(\mathbf{B}_i)=\arg\max_{j} D(\mathbf{B}_{i-1}+\mathbf{u}_i\mathbf{y}_j^{T}),\quad j=1,\ldots,M
\qquad (14)
$$

where D is evaluated in the latent semantic space as

$$
D(\mathbf{B})=
\frac{\displaystyle\sum_{l=1}^{L}\sum_{m=1}^{N_l} r(m)\left(\sum_{n=1,\,n\neq m}^{N_l}\left\|(\mathbf{B}\mathbf{w}_m)^{T}(\mathbf{B}\bar{\mathbf{U}})\bar{\boldsymbol{\Sigma}}^{-1},\,(\mathbf{B}\mathbf{w}_n)^{T}(\mathbf{B}\bar{\mathbf{U}})\bar{\boldsymbol{\Sigma}}^{-1}\right\|_{\cos}\sin\theta\right)}
{\displaystyle\sum_{l=1}^{L}\sum_{m=1}^{N_l} r(m)\left(\sum_{n=1}^{N_m}\left\|(\mathbf{B}\mathbf{w}_m)^{T}(\mathbf{B}\bar{\mathbf{U}})\bar{\boldsymbol{\Sigma}}^{-1},\,(\mathbf{B}\mathbf{w}_n)^{T}(\mathbf{B}\bar{\mathbf{U}})\bar{\boldsymbol{\Sigma}}^{-1}\right\|_{\cos} I_R(\mathbf{w}_m,\mathbf{w}_n)\right)}.
\qquad (15)
$$

After determining the selected word c_i using (14), B is updated using (13). The word selection process is terminated when the DCV size K is reached or the discrimination value D with respect to B starts to decrease. Algorithm 2 summarizes the proposed iterative word selection process.

Algorithm 2 Iterative word selection

Input: BoW histograms for all the landmark images; SVD-decomposed matrices $\bar{\mathbf{U}}$, $\bar{\boldsymbol{\Sigma}}$; DCV size K.
Output: Optimal word selection matrix $\mathbf{B}_{\text{optimal}}$.
For i = 1, 2, ..., K:
  1. Select the most discriminative codeword c_i using (14).
  2. Update the word selection matrix B_i using (13).
  3. Calculate the discrimination value D_i with respect to the current B_i using (15).
  4. If D_i − D_{i−1} < ε, where ε is a predefined threshold, then stop.
End
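The greedy loop of Algorithm 2 can be sketched as follows. `score_fn` is assumed to implement a discrimination value such as (10) or (15) (e.g., the sketch given after (10)); the exhaustive scan over all M candidate words per iteration is shown for clarity only and is far less efficient than the latent-space evaluation the paper relies on.

```python
import numpy as np

def learn_dcv(score_fn, M, K, eps=0.0):
    """Greedy reading of Algorithm 2. score_fn(B) returns a discrimination value
    D for a candidate selection matrix B; M is the original codebook size and
    K the target DCV size."""
    selected, prev_score = [], -np.inf
    B = np.zeros((0, M))
    for _ in range(K):
        best_j, best_score, best_B = None, -np.inf, None
        for j in range(M):                          # exhaustive scan, for clarity only
            if j in selected:
                continue
            cand = np.vstack([B, np.eye(M)[j]])     # B_{i-1} + u_i y_j^T
            score = score_fn(cand)
            if score > best_score:
                best_j, best_score, best_B = j, score, cand
        if best_j is None or best_score - prev_score < eps:
            break                                   # D stops improving
        selected.append(best_j)
        B, prev_score = best_B, best_score
    return B, selected
```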

D. Database Image Scoring

This section presents the image recognition process based on the integration of direction, location, and content analysis. At the client end, a landmark image is first captured by the mobile device as the query, and is tagged with both location and direction data.



Then the direction cluster closest to the query image is determined, and the DSVT for that cluster is used to quantize the image into a BoW histogram, as described in [18]. Finally, the learned word selection matrix B is used to compress the high-dimensional BoW histogram into a compact vector through (9), which is then encoded using arithmetic or other coding techniques and transmitted to the server.

At the server end, the DSVT addresses the scalability of the search through an inverted file system, similar to the file structure in [2]. The inverted file system indexes the training images to the codewords, and maintains two lists: 1) a sorted array of images indicating the database images that have visited a codeword, and 2) a corresponding array of counts indicating the number of visits. When performing a query for an image, only the landmarks that lie: 1) in the selected direction cluster, and 2) within a 200 m geographical distance from the query image (as discussed in Section III-B), are selected for similarity matching. These images can be quickly scored by traversing only the codewords in the DCV that are visited by the query descriptors. In the case when the query image's direction is near the boundary of two direction clusters (e.g., within ±5 degrees), we consider both of the nearest clusters. The highest-scored image is taken as the recognized landmark for the query.
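An illustrative sketch of this server-side filtering and scoring is given below. The inverted-index layout, the dot-product style score, and the handling of a shared index across the one or two candidate direction clusters are all simplifying assumptions, not the authors' implementation.

```python
def score_candidates(query_compact, query_gps, query_dir,
                     inverted_index, image_meta, geo_dist_m, R=200.0, margin=5.0):
    """inverted_index[cluster][k] maps DCV word k to (image_id, count) pairs for
    one direction cluster; image_meta[image_id] holds (gps, label)."""
    width = 30.0
    offset = query_dir % width
    clusters = {int(query_dir % 360) // int(width)}
    # Near a cluster boundary (within +/- margin degrees), consider both neighbours.
    if offset < margin:
        clusters.add(int((query_dir - width) % 360) // int(width))
    elif offset > width - margin:
        clusters.add(int((query_dir + width) % 360) // int(width))
    scores = {}
    for c in clusters:
        index = inverted_index.get(c, {})
        for k, q_count in enumerate(query_compact):
            if q_count == 0:
                continue                            # traverse only visited DCV words
            for image_id, count in index.get(k, []):
                gps, _label = image_meta[image_id]
                if geo_dist_m(*query_gps, *gps) < R:
                    scores[image_id] = scores.get(image_id, 0.0) + q_count * count
    return max(scores, key=scores.get) if scores else None
```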

IV. Quantitative Evaluation

A. Experimental Setup

It is noted that both the location and direction information are exploited for DCV learning in the proposed approach. Therefore, to evaluate the effect of these two types of context information, we have created an NTU50Landmark database that is tagged with both location and direction information for experimental evaluation. The database consists of 4156 images (50 categories × about 80 images/category) collected from 50 landmarks on the campus of Nanyang Technological University (NTU). The images in the database are captured using camera phones (e.g., HTC Dream, iPhone 4, etc.) with different built-in camera settings, under different capturing conditions. The selected landmarks are well-recognized buildings and places of interest in NTU. Sample images of these landmarks are given in Fig. 5(a). The landmark images can contain the entire landmark or just a portion of it. The geospatial distribution of the landmarks is given in Fig. 5(b); they cover an area of more than 2 km². From the figure, it can be seen that the landmarks spread across the whole campus of NTU, with certain areas having a higher concentration of landmarks. The recognition rate is used to evaluate the performance of the different methods, defined as the ratio of the number of correctly classified test images to the total number of test images. A test image is considered correctly classified if the landmark category label of the best-matched image agrees with the ground-truth category label.

B. Comparison of the Proposed Method With Other Approaches

To demonstrate the effectiveness of the proposed DCV learning method, 3622 images from the NTU50Landmark database are used for training and the remaining 534 images are used for testing. As in [33], dense SIFT descriptors extracted from multiscale patches (16×16, 32×32, and 64×64 overlapping patches) in an image are used as the feature representation for the proposed and other methods, followed by hierarchical k-means clustering to generate the DSVT. The mobile visual search method in [2] is used as the baseline for comparison.


Fig. 5. (a) Sample images and (b) geospatial distribution of the 50 landmark categories.

Several mobile landmark recognition methods are also implemented for comparison, including the information gain-based word learning in [22], the descriptive word selection in [24], the latent visual context word learning in [26], and the location discriminative vocabulary learning in [8]. For all the methods, the original BoW histogram generated using the original codebook is used as the input. The different vocabulary learning methods, such as [8], [22], [24], and [26], are then used to compress the original BoW into a compact vector. The compact vector is matched with the training images, which are indexed using the inverted file system in the SVT, for recognition. Fig. 6 gives the recognition rates of these methods for different DSVT depths with H ranging from 3 to 6, and B = 10. It is seen that the proposed method consistently outperforms the other methods for H ranging from 3 to 6, and can achieve a maximum recognition rate of around 95% when H is beyond 4. There are two reasons for the superiority of the proposed approach over the other methods. First, both direction and location information are considered to learn the DCV in this paper, while only location information is used for vocabulary learning in [8]. The direction data is reliable, and using its integration with GPS for DCV learning can provide better results for landmark recognition. Second, the proposed DCV learning considers the discriminative information of the images, whereas the conventional methods, such as [8] and [24], do not. It is also noted that when the DSVT depth goes beyond 4, the performance improvement of all the methods becomes less obvious. This is because a DSVT with H = 4 is large enough for the size of this landmark database.

Furthermore, a comparison using different context information for DCV learning is conducted, which includes the following.

1) Content-based DCV learning (using the ImageRank and iterative word selection algorithms), which does not include either the direction or the location information [e.g., a conventional SVT is used to replace the DSVT, and the location and direction analysis in (6) and (10) is removed].

Fig. 6. Comparison of the proposed DCV learning approach with other state-of-the-art methods.

Fig. 7. Comparison using different contexts for DCV learning.

2) Direction and content-based DCV learning, which uses the DSVT and content analysis as in 1).

3) Location (GPS) and content-based DCV learning, which uses a conventional SVT and content analysis as in 1) to generate the DCV. When testing an image, GPS data is first used to shortlist geographically nearby landmark candidates, and the proposed method is then used for ranking.


Fig. 8. Experimental results of key components in the proposed DCV learning.

4) The proposed overall approach for DCV learning.

The results are given in Fig. 7. It can be seen that the overall approach using direction, location, and content integration presents the best performance at each DSVT depth, followed by GPS and content analysis, direction and content analysis, and lastly content alone.

Finally, we also conduct experiments to evaluate the contribution of each key component in the proposed method, namely the ImageRank analysis and the iterative codeword selection. The DSVT without ImageRank and codeword selection is used as the baseline for comparison. For the baseline, all the image weights are set equal to 1, and all the codewords are used to encode the image. We then combine the DSVT with the following respective components to evaluate their individual performance gains: 1) DSVT + codeword selection, and 2) DSVT + ImageRank + codeword selection (overall approach). The results are shown in Fig. 8. It can be seen that each component (ImageRank and codeword selection) provides a performance improvement over the baseline DSVT. It is observed that codeword selection alone offers a performance gain of 3%–4%, while ImageRank + codeword selection offers a performance gain of 5%–7% over the baseline DSVT. Therefore, it is clear that these components of the DCV learning contribute toward the performance improvement of the proposed method.

C. Evaluation of the DCV Compression Using the Proposed Method

In this section, the proposed DCV learning approach and the location discriminative vocabulary (LDV) method in [8] are compared to investigate their performance for vocabulary compression, which is meaningful for mobile landmark recognition since a smaller DCV size indicates lower transmission cost and faster recognition speed. The LDV method in [8] is chosen for comparison due to its better performance compared with the other state-of-the-art methods, as shown in Fig. 6. The DSVT depth is chosen as 4, which produces 10^4 codewords in each direction cluster. Since the DCV is generated by selecting a certain number of discriminative words, we show the experimental results of the DCV with various numbers of words in Fig. 9. It is seen that the proposed method based on the integration of direction, location, and content analysis achieves the best recognition performance for various DCV sizes. It is also noted that when using both direction and location information for DCV learning, the DCV size can be compressed to 50% of the original number of words, with the best performance reaching around 95% accuracy.

Fig. 9. Comparison of the proposed DCV learning method with other methods.

This shows the effectiveness of the proposed method for vocabulary compression. Further, the proposed method using content analysis alone consistently outperforms the method in [8] for all compressed DCV sizes. This shows the effectiveness of the proposed ImageRank and iterative word selection algorithms. Finally, when the DCV size goes beyond a certain threshold, the performance of all the methods tends to decrease gradually. This is due to the fact that some words have poor discriminative capability, such as the background words. If these words are selected into the DCV for recognition, they have a negative impact on the recognition result. Furthermore, when the DCV size is equal to 10^4, which means that no word selection process is involved, the recognition rates of all these methods become the same since they use the same vocabulary.

Furthermore, the effect of BoW histogram compression using the learned DCV is shown in Fig. 10. From the figure, the following can be seen.

1) The compressed BoW histogram (vector) using the DCV accounts for only a small proportion of the original BoW histogram, due to the small size of the compressed DCV.

2) Some words in the BoW histogram may have zero counts. When these zero elements are removed, the histogram can be further compressed.

3) When the DSVT depth becomes greater, the compression effect of the proposed method becomes more obvious. This is because when the DSVT depth becomes larger, more words that have poor discriminative capability are removed by the proposed DCV learning.

Finally, we also examine the difference in recognition performance when using (or not using) LSA. We conducted an experiment that removes the LSA component in (12) and (15), and uses the original BoW histogram as in (11) for inferring the matrix B. Under the setting of H = 4 and B = 10, the recognition rate without LSA drops by around 2% compared with the recognition rate when using LSA. This shows that LSA not only improves the learning efficiency, but also provides a 2% performance improvement in recognition rate.


Fig. 10. BoW histogram compression effect using the proposed method. The vertical axis denotes the dimension ratio between the compressed vector and the original BoW histogram.

TABLE I. Average Recognition Time (s) After Histogram Compression Using the Proposed Method

D. Evaluation of Recognition Time

Table I shows the average recognition time of the proposed approach. Here the recognition time refers to the image matching time using the proposed content and context analysis on the backend server. From the table, it is seen that the DCV method takes less recognition time than the original BoW histogram. Specifically, when the DSVT depth is H = 4, an average recognition time of 0.81 s is achieved, which satisfies mobile users' fast response-time requirement. Furthermore, it is noted that when the DSVT depth H increases from three to six, the average recognition time of the proposed method increases only slightly, from 0.72 s to 0.99 s. This shows that the proposed method achieves good computational efficiency.

E. Evaluation of the Proposed Mobile Landmark Recognition Approach on the San Francisco Dataset

In order to demonstrate the effectiveness of the proposed method on a large-scale dataset, we have used the San Francisco Landmark dataset from [4] for experiments. The dataset contains a database of 1.7 million images of buildings in San Francisco [1.06M perspective central images (PCIs) and 638K perspective frontal images (PFIs)] with ground-truth labels, geotags, and calibration data, as well as a query set of 803 cell phone images taken with a variety of different camera phones by various people over several months. We select the 1.06M PCIs for the experiments, and exclude the PFIs because the PFIs require the query image to be rectified using vanishing points for matching, which is beyond the scope of this paper. We learn a new codebook using this dataset, and use all the 1.06M PCI database images to build the SVT, which forms the basis for the subsequent DCV learning/training. The LDV method in [8] is selected for comparison, as discussed before. As the San Francisco dataset contains only GPS data, we use two methods in this experiment: one uses content analysis alone, as discussed in Section IV-B, and the other uses content and GPS integration analysis.

Fig. 11. Experimental results of different methods on the San Francisco landmark dataset.

TABLE II. Average Recognition Time (s) Using Different Methods

For content and GPS integration, in the training phase, the conventional SVT is constructed to obtain the preliminary codewords, and the ImageRank and codeword selection algorithm [the sin θ factor in (10) is removed in this case] is used to generate the DCV. In the testing phase, GPS data is first used to shortlist geographically nearby landmark candidates, and the proposed method in Section III-D (using an SVT to replace the DSVT) is then used for refinement.

Fig. 11 gives the recognition rates of these methods for different SVT depths H ranging from three to six, and B = 10. From the figure it can be seen that when the SVT depth reaches six, which means a million leaf codewords are generated, the proposed content and GPS integration analysis achieves the highest recognition rate among all the methods (around 89%). This shows that the proposed method can obtain good performance on a different, large-scale benchmark dataset.

The recognition time of the proposed approach on the San Francisco landmark database is also investigated, as shown in Table II. The SVT depth is set to six. From the table, it is seen that the proposed content and GPS integration method has the best recognition time. This shows the good scalability of the proposed SVT-based method for mobile landmark recognition.

V. Conclusion

This paper presented a new DCV learning approach for mobile landmark recognition that exploits the context information, especially the direction information, for vocabulary learning. To the best of our knowledge, this is the first work that provides a concrete integration of the direction information in mobile landmark recognition. The key contributions of this paper include: 1) the incorporation of direction information into the SVT construction to generate the DSVT for image quantization, and 2) the proposal of an effective ImageRank technique and iterative codeword selection algorithm for DCV learning.


Experimental results on the NTU50Landmark and San Francisco databases showed that the proposed method achieves both a high recognition rate and fast recognition time in mobile landmark recognition.

References

[1] [Online]. Available: http://www.ehow.com/camera-phone/
[2] B. Girod, V. Chandrasekhar, and D. M. Chen, "Mobile visual search," IEEE Signal Process. Mag., vol. 28, no. 4, pp. 61–76, Jul. 2011.
[3] S. S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, J. P. Singhy, and B. Girod, "Location coding for mobile image retrieval," in Proc. Int. Mobile Multimedia Commun. Conf., 2010, article 8.
[4] D. M. Chen, G. Baatz, K. Koeser, S. S. Tsai, R. Vedantham, K. Roimela, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk, "City-scale landmark identification on mobile devices," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Jun. 2011, pp. 737–744.
[5] R.-R. Ji, X. Xie, H. Yao, and W.-Y. Ma, "Mining city landmarks from blogs by graph modeling," in Proc. ACM Int. Conf. Multimedia, 2009, pp. 105–114.
[6] W. Zhang and J. Kosecka, "Image based localization in urban environments," in Proc. 3rd Int. Symp. 3-D Data Process. Visualization Transmission, 2006, pp. 33–40.
[7] J.-A. Lee, K.-C. Yow, and A. Sluzek, "Image-based information guide on mobile devices," in Proc. 4th Int. Symp. Advances Visual Comput., 2008, pp. 346–355.
[8] R. Ji, L.-Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, and W. Gao, "Location discriminative vocabulary coding for mobile landmark search," Int. J. Comput. Vision, vol. 96, pp. 1–25, Jul. 2011.
[9] J. Hays and A. A. Efros, "IM2GPS: Estimating geographic information from a single image," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Jun. 2008, pp. 1–8.
[10] A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof, "From structure-from-motion point clouds to fast location recognition," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Jun. 2009, pp. 2599–2606.
[11] T. Chen, K.-H. Yap, and L.-P. Chau, "Integrated content and context analysis for mobile landmark recognition," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 10, pp. 1476–1486, Oct. 2011.
[12] E. Kalogerakis, O. Vesselova, J. Hays, A. A. Efros, and A. Hertzmann, "Image sequence geolocation with human travel priors," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Sep.–Oct. 2009, pp. 1–8.
[13] D. Liu, M. R. Scott, R. Ji, W. Jiang, H. Yao, and X. Xie, "Location sensitive indexing for image-based advertising," in Proc. ACM Int. Conf. Multimedia, 2009, pp. 793–796.
[14] D. Chen, S. Tsai, and V. Chandrasekhar, "Tree histogram coding for mobile image matching," in Proc. Data Compression Conf., 2009, pp. 143–152.
[15] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod, "CHoG: Compressed histogram of gradients, a low bit-rate feature descriptor," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Jun. 2009, pp. 2504–2511.
[16] H. Bay, T. Tuytelaars, and L. V. Gool, "SURF: Speeded up robust features," in Proc. 9th Eur. Conf. Comput. Vision, vol. 3951, part 1, 2006, pp. 404–417.
[17] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.
[18] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Jun. 2006, pp. 2161–2168.
[19] L. Wang, L. P. Zhou, and C. H. Shen, "A fast algorithm for creating a compact and discriminative visual codebook," in Proc. Eur. Conf. Comput. Vision, vol. 5305, 2008, pp. 719–732.
[20] L. Yang, R. Jin, R. Sukthankar, and F. Jurie, "Unifying discriminative visual codebook generation with classifier training for object category recognition," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Jun. 2008, pp. 1–8.
[21] L. Wang, "Toward a discriminative codebook: Codeword selection across multi-resolution," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Jun. 2007, pp. 1–8.
[22] G. Schindler, M. Brown, and R. Szeliski, "City-scale location recognition," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Jun. 2007, pp. 1–7.
[23] L. Zhang, C. Chen, J. Bu, Z. Chen, S. Tan, and X. He, "Discriminative codeword selection for image representation," in Proc. ACM Int. Conf. Multimedia, 2010, pp. 173–182.
[24] S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li, "Descriptive visual words and visual phrases for image applications," in Proc. ACM Int. Conf. Multimedia, 2009, pp. 75–84.
[25] S. Deerwester, S. T. Dumais, and R. Harshman, "Indexing by latent semantic analysis," J. Amer. Soc. Inform. Sci., vol. 41, no. 6, pp. 391–407, 1990.
[26] W. Zhou, Q. Tian, Y. Lu, L. Yang, and H. Li, "Latent visual context learning for web image applications," Pattern Recognit., vol. 44, nos. 10–11, pp. 2263–2273, Oct.–Nov. 2010.
[27] [Online]. Available: http://www.meridianworlddata.com/Distance-Calculation.asp
[28] Y. Jing and S. Baluja, "VisualRank: Applying PageRank to large-scale image search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1877–1890, Nov. 2008.
[29] J. P. Pluim, J. B. Maintz, and M. A. Viergever, "Mutual-information-based registration of medical images: A survey," IEEE Trans. Med. Imaging, vol. 22, no. 8, pp. 986–1004, Aug. 2003.
[30] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Lost in quantization: Improving particular object retrieval in large scale image databases," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., Jun. 2008, pp. 1–8.
[31] Y.-T. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, T.-S. Chua, and H. Neven, "Tour the world: Building a web-scale landmark recognition engine," in Proc. IEEE Int. Conf. Comput. Vision Pattern Recognit., Jun. 2009, pp. 1085–1092.
[32] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. IEEE Int. Conf. Comput. Vision, Jun. 2007, pp. 1–8.
[33] K.-H. Yap, T. Chen, Z. Li, and K. Wu, "A comparative study of mobile-based landmark recognition techniques," IEEE Intell. Syst., vol. 25, no. 1, pp. 48–57, Jan.–Feb. 2010.
[34] V. Chandrasekhar, G. Takacs, D. M. Chen, S. S. Tsai, Y. Reznik, R. Grzeszczuk, and B. Girod, "Compressed histogram of gradients: A low bit-rate descriptor," Int. J. Comput. Vision, vol. 94, no. 5, pp. 384–399, 2011.

Tao Chen received the B.Eng. degree from Shandong University, Shandong, China, in 2006, and the M.S. degree in electrical engineering from Zhejiang University, Hangzhou, China, in 2008. He is currently pursuing the Ph.D. degree at Nanyang Technological University, Singapore.

His current research interests include multimedia content analysis and mobile media retrieval.

Kim-Hui Yap (SM'09) received the B.Eng. and Ph.D. degrees in electrical engineering from the University of Sydney, Sydney, Australia, in 1998 and 2002, respectively.

Since 2002, he has been a Faculty Member with Nanyang Technological University, Singapore, where he is currently an Associate Professor. He has served as the Group Leader in content-based analysis at the Center for Signal Processing, Nanyang Technological University, Singapore. He has numerous publications in various international journals, book chapters, and conference proceedings.

He has authored a book entitled Adaptive Image Processing: A Computational Intelligence Perspective (2nd ed., CRC Press, 2009), and edited a book entitled Intelligent Multimedia Processing With Soft Computing (Springer-Verlag, 2005). His current research interests include image/video processing, media content analysis, computer vision, and computational intelligence.

Dr. Yap has served as an Associate Editor for the IEEE Computational Intelligence Magazine and the Journal of Signal Processing Systems.

He has also served as an Editorial Board Member for the Open Electrical and Electronic Engineering Journal, and as a Guest Editor for the IEICE Transactions on Fundamentals. He has served as the Treasurer of the IEEE Singapore Signal Processing Chapter and a Committee Member of the IEEE Singapore Computational Intelligence Chapter. He has also served as the Finance Chair of the 2010 IEEE International Conference on Multimedia and Expo and the Workshop Co-Chair of the 2009 MDM International Workshop on Mobile Media Retrieval, among others.
