

Integrated Content and Context Analysis for Mobile Landmark Recognition

Tao Chen, Student Member, IEEE, Kim-Hui Yap, Senior Member, IEEE, and Lap-Pui Chau, Senior Member, IEEE

Abstract—This paper proposes a new approach for mobile landmark recognition based on integrated content and context analysis. Conventional scene/landmark recognition methods focus mainly on nonmobile desktop/PC platforms, where content analysis alone is used to perform landmark recognition. These nonmobile systems, however, do not take the unique features of mobile devices into consideration, e.g., limited computational power and the fast response time required by mobile users. On the other hand, most existing context-aware mobile landmark recognition methods rely mainly on global positioning system location information for context analysis. In view of this, this paper proposes an effective method that employs an integration of content and context analysis to perform landmark recognition using mobile devices. A new bags-of-words (BoW) framework is developed to perform content analysis. It is then integrated with context analysis involving the fusion of location and direction information to perform mobile landmark recognition. Experimental results based on the NTU50Landmark database show that the proposed method can achieve good recognition performance in mobile landmark recognition.

Index Terms—Bags-of-words model, content and context integration, mobile landmark recognition.

I. Introduction

IN RECENT years, the number of mobile/cellular phones has grown tremendously. The number of global mobile phone subscribers reached 4.6 billion by the end of 2009, penetrating approximately two-thirds of the world's population [1]. Today, more than half of the shipped mobile phones are equipped with cameras. The improvements in imaging capabilities and wireless network infrastructure have given rise to many exciting mobile applications. Among these is mobile landmark recognition, where a user can capture a landmark image using his/her phone and find out information about the captured landmark. Mobile landmark recognition is particularly useful in applications such as augmented city guides, location-based mobile information hubs, and museum tour guides.

Various scene/landmark recognition systems have been developed in recent years [2]–[8].

Manuscript received December 3, 2010; revised April 28, 2011; accepted June 17, 2011. Date of publication July 14, 2011; date of current version October 5, 2011. This work was supported by the Agency for Science, Technology and Research (A*STAR), Singapore, under SERC Grant 062 130 0055. This paper was recommended by Associate Editor K. Pulli.

The authors are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2011.2161413

A worldwide evidence grid associated with different places is proposed in [2] to locate the position of the captured image. A scene matching algorithm based on color and wavelet features is developed in [3]. It uses an image dataset from the Internet to estimate the probability distribution over geographic locations for the query image. In [4]–[7], several statistical algorithms have been proposed to model the distribution for each scene, e.g., the visual language model in [4], the contextual bags-of-words (BoW) model in [5], probabilistic latent semantic analysis in [6], and Bayesian analysis in [7]. A multiclass support vector machine based on BoW has been used to learn the model for each landmark category in [8]. Nonvisual information such as textual tags and temporal constraints is also utilized to improve the performance.

However, it is observed that most of these works are centered on nonmobile PC/desktop environments. They rely mainly on content analysis for recognition, and hence are computationally intensive, as the query image needs to be matched against all the landmark categories in the database. Further, they do not take into account the unique properties of mobile landmark recognition, namely: 1) the limited processing power of mobile devices; 2) the fast response time required by mobile users; and 3) the availability of context information such as location and direction information in the latest mobile devices.

As mobile phones with imaging capability are becoming more prevalent, a few works that focus on mobile scene/landmark recognition [9]–[12] have been developed recently. A content-based image recognition approach using mobile devices has been developed in [9]. It uses scale invariant feature transform (SIFT)-based content analysis for recognition. In [10] and [11], content analysis integrated with location information is used to recognize outdoor and urban scenes. Informative SIFT [10] and discriminative patch-based [11] content analysis are adopted for similarity matching. A mobile tourist information prototype has been developed in [12]. It uses global positioning system (GPS) information and a 3-D model to match the query image. In addition, some context-aware works focusing on large-scale nonmobile landmark recognition have appeared [13]–[19]. A world-scale landmark database has been constructed from GPS-tagged photos and online tour web pages in [13]. Unsupervised graph clustering and an image matching algorithm have been used to model each landmark category. Recognition is achieved by matching the query image against the model using the nearest neighbor principle.


Fig. 1. Overview of the proposed framework.

An effective vocabulary tree has been proposed in [14]. It uses the most informative features of each landmark to select the vocabulary, making it more suitable for large-scale datasets. In [15]–[19], [37], and [38], GPS is first used to tag the large-scale database images for pruning purposes; visual analysis techniques or random sample consensus (RANSAC)-based geometric algorithms are then used to localize the query image. However, the one-to-one matching RANSAC algorithm is computationally intensive.

In addition to these recognition-based works, there are also some related works on image retrieval or annotation [20]–[22]. In [20], an image retrieval method is developed by using the BoW algorithm as a filtering step, followed by a RANSAC algorithm [23] for spatial re-ranking. An unsupervised mining approach for community photo collections is presented in [21]. Based on geospatial information and visual content analysis, a database of mined objects and events that are labeled with a Wikipedia link is constructed. Similarly, a database is collected in [22] from photo sharing websites worldwide. Context information, such as textual tags and geographical location, and content analysis (BoW, tf-idf, etc.) are utilized to develop an object-level auto-annotation system for the landmark images.

Existing mobile-based landmark recognition techniques mainly adopt context analysis or content analysis alone for recognition. Even though there are some works [15]–[19] that integrate context and content analysis for recognition, their integration techniques are still preliminary and the context information used is limited to GPS information only.

In view of this, this paper proposes a new content and context integration technique for mobile landmark recognition. A BoW framework is proposed to perform content analysis. It involves the development of a new spatial pyramid decomposition and cell saliency estimation approach to achieve an effective representation of landmark images. The technique also incorporates a new spatial pyramid kernel that employs codeword discrimination for classification. During the context analysis, different strategies involving the formation of viewcones, which incorporate the location information (obtained through the built-in GPS of the mobile device) and direction information (obtained through the built-in digital compass of the mobile device), are developed. A new integration of content and context analysis is proposed, which greatly enhances the recognition performance of the system, as demonstrated through various experimental results in later sections.

The rest of this paper is organized as follows. A general overview of the proposed mobile landmark recognition system is presented in Section II. The content analysis and its integration with context analysis are discussed in Sections III and IV, respectively. Experimental results and discussions are given in Section V. Section VI concludes this paper with a summary of our findings.

II. System Overview of Integrated Content and Context Analysis

As opposed to nonmobile environments, mobile landmark recognition needs to take the unique characteristics of mobile devices into consideration. These include the low processing power of mobile devices and the user's fast response-time requirement. In order to address these issues, a client-server structure is adopted in this paper. This enables the computational load to be shifted from the mobile client to the server.

The overview of the proposed mobile landmark recognition system is given in Fig. 1. It consists of two stages: offline training and online testing. During offline training, the local features of the training images are extracted and represented using an efficient BoW framework. A new spatial pyramid decomposition technique is developed to partition the image into multiple cells. A saliency estimation approach based on visual features and codeword discrimination is then used to evaluate the discriminative power of these cells. Finally, a multiclass support vector machine (SVM) employing the new spatial pyramid kernel (SPK) is used to train the landmark classifiers. During online testing, context analysis (fusion of location and direction information) is first performed. It is then integrated with the proposed content analysis and forwarded to the trained SVM classifiers for landmark recognition.
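To make the client-server split concrete, the following is a minimal sketch of the mobile-client request: the handset only captures the photo and attaches its GPS fix and compass heading, while all BoW/SVM computation stays on the server. The endpoint URL, field names, and JSON response shown here are illustrative assumptions, not part of the described prototype.

```python
# Hypothetical mobile-client request for the client-server structure described above.
# The server URL and field names are placeholders; only the idea (image + GPS fix +
# compass heading sent together) follows the paper.
import requests

def query_landmark(image_path, lat, lon, heading_deg,
                   server_url="http://recognition-server.example/query"):
    """Send a query image plus GPS/compass context to the recognition server."""
    context = {"lat": lat,               # GPS latitude in degrees
               "lon": lon,               # GPS longitude in degrees
               "heading": heading_deg}   # digital-compass bearing in degrees
    with open(image_path, "rb") as f:
        resp = requests.post(server_url,
                             files={"image": f},   # e.g., a 240x320 JPEG
                             data=context,
                             timeout=30)
    resp.raise_for_status()
    return resp.json()                   # e.g., {"landmark": "...", "score": 0.97}
```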

III. Content Analysis

The BoW model has been used in many recent state-of-the-art scene/landmark recognition systems [6], [13], [15], [25], [26] due to its good recognition performance and fast processing speed.


Fig. 2. New spatial pyramid decomposition strategy. (a) Level 0. (b) Level 1. (c) Level 2.

However, early BoW frameworks do not take the spatial layout information of the landmark images into consideration. This underutilizes the available information, as landmarks are usually located near the center of the image, and backgrounds such as sky and grass are located near the top and bottom of the image, respectively. In view of this, the spatial pyramid matching approach is proposed in [26] and [27]. These methods partition an image into multiresolution pyramid cells and then represent each cell in the pyramid using BoW. The extracted BoW histograms from all the cells are then concatenated to form the final feature vector. However, these spatial pyramid methods only assign a level-dependent weight to the codewords at different pyramid levels, while considering all codewords at the same level to be equally important. In other words, they do not consider the different descriptive capabilities of visual words in representing different landmark categories. Further, they also do not consider the different importance of cells at the same pyramid level when fusing the BoW features. This is inconsistent with the observation that image regions/patches with more visual saliency should be given more emphasis when considering their visual representation.

In view of these, this paper proposes a new BoW framework. The framework includes the following key contributions:

1) it proposes an effective spatial pyramid decomposition approach that can represent landmark images more effectively, in terms of higher recognition accuracy using this representation;

2) it develops a pyramid cell saliency estimation approach that determines the importance of different cells at the same pyramid level, and integrates this information into the BoW model; the saliency information is estimated based on two factors: cell discriminative information and codeword discrimination;

3) it proposes a new spatial pyramid kernel for training the SVM classifiers, which employs codeword discrimination information among different landmark categories.

A. Spatial Pyramid Decomposition

Current spatial pyramid techniques [26], [27] adopt a three-level decomposition with a 1×1, 2×2, and 4×4 cell partitioning strategy. Three levels are used because the work in [26] shows that a larger number of pyramid levels causes the computational cost to increase while rendering little improvement in performance. Even though the 1×1 and 4×4 cell structures are able to capture the overall picture and the detailed spatial layout of the image reasonably well, the 2×2 partitioning at level 1 is unable to represent the landmark image effectively. The reason is that landmarks are usually located near the center of the image, and backgrounds such as sky and lawn are usually located near the top and bottom of the image, respectively. The 2×2 cell structure tends to partition the central landmark into four quadrants, thereby diminishing the effectiveness of the individual cell histograms. Therefore, this section introduces an effective yet simple pyramid decomposition strategy, as illustrated in Fig. 2.

From the figure, it can be seen that the new decomposition approach employs a 1×1, 1×3, 3×1, and 4×4 cell decomposition strategy. As opposed to the conventional level-1 2×2 cell structure, the proposed decomposition involves a combination of 1×3 and 3×1 cell structures to better partition the image into the central region-of-interest and the background areas. This partitioning strategy is consistent with the Rule of Thirds in photography, which states that an image should be imagined as divided by two equally spaced horizontal lines and two equally spaced vertical lines, and that important compositional elements should be placed along these lines or their intersections [28]. Experimental results in a later section will demonstrate the effectiveness of this strategy in performing landmark recognition.
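As a concrete illustration, a minimal sketch of this decomposition is given below: level 0 is the whole image, level 1 combines three horizontal bands with three vertical strips (the 1×3 and 3×1 structures), and level 2 is a 4×4 grid, giving 1 + 6 + 16 = 23 cells in total. The helper functions and the keypoint-to-cell assignment are assumed glue code, not the authors' implementation.

```python
# Sketch of the 1x1 / (1x3 + 3x1) / 4x4 spatial pyramid decomposition described above.
import numpy as np

def pyramid_cells(width, height):
    """Return (level, x0, y0, x1, y1) rectangles for all 23 pyramid cells."""
    cells = [(0, 0.0, 0.0, float(width), float(height))]                    # level 0: whole image
    xs = np.linspace(0.0, width, 4)
    ys = np.linspace(0.0, height, 4)
    cells += [(1, 0.0, ys[i], float(width), ys[i + 1]) for i in range(3)]   # 3 horizontal bands
    cells += [(1, xs[i], 0.0, xs[i + 1], float(height)) for i in range(3)]  # 3 vertical strips
    xs4 = np.linspace(0.0, width, 5)
    ys4 = np.linspace(0.0, height, 5)
    cells += [(2, xs4[i], ys4[j], xs4[i + 1], ys4[j + 1])                   # level 2: 4x4 grid
              for j in range(4) for i in range(4)]
    return cells

def keypoints_per_cell(cells, keypoints_xy):
    """Map each cell to the indices of the SIFT keypoints falling inside it."""
    assignment = []
    for _, x0, y0, x1, y1 in cells:
        assignment.append([k for k, (x, y) in enumerate(keypoints_xy)
                           if x0 <= x < x1 and y0 <= y < y1])
    return assignment
```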

B. Pyramid Cell Saliency Estimation

The decomposition approach in the previous section partitions the landmark image into multiscale cells that contain different amounts of saliency information. Cells that contain more saliency are generally more useful for classification, since stronger saliency indicates that more discriminating or relevant information is contained in the cell. Therefore, a pyramid cell saliency estimation approach is proposed in this section to estimate the saliency of different pyramid cells, so that this information can be incorporated into the training of classifiers. The approach involves the development of two techniques based on cell discriminative information and codeword discrimination among different landmark categories. To simplify the discussion and notation in the following sections, we denote a typical jth cell at the lth pyramid level in an image I as c, and use l = 0, 1, 2 to denote the first, second, and third pyramid levels.

1) Cell Discriminative Information: The discriminative information of cell c depends on two factors: a) the visual features, which are an indicator of the information contained in the cell, and b) the position of the cell, as landmarks are more likely to be located near the center of the image. As the visual features can be reflected by the number of SIFT keypoints in a cell, more SIFT keypoints imply that the particular cell should be given more emphasis when considering its representation. Taking this into consideration, the saliency z of cell c due to the number of SIFT keypoints can be estimated using

$$z = \frac{1}{1 + \exp(-15\,t/T)} \qquad (1)$$

where t is the number of SIFT keypoints inside cell c and T is the total number of SIFT keypoints in the image. The sigmoid function is used because a large t value causes z to be large, indicating that a cell with more SIFT keypoints is visually more informative in representing the image. The constant −15 is determined empirically through experiments.

It is observed that landmarks are typically located near the central region of images. Therefore, cells that are closer to the image center should be assigned a larger weight. The saliency of cell c due to its location can be estimated using a Gaussian function as follows:

$$s = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}\,\|g - p\|\right) \qquad (2)$$

where g and p represent the positions of the cell center and the image center, respectively. The constant 1/2 is determined empirically through experiments. Finally, the visual importance of cell c is estimated by combining the saliency due to visual features and cell location to give

$$v = z \cdot s. \qquad (3)$$

It is worth noting that the assumption that more SIFT features in a cell imply higher importance is heuristic at times, since some high-frequency clutter may also produce a large number of SIFT keypoints. Therefore, a codeword discrimination learning method is proposed in the following section to complement the cell discriminative information.
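A minimal sketch of the cell visual-saliency terms in (1)–(3) follows, using the constants stated above (−15 in the sigmoid and 1/2 in the Gaussian). The text does not specify how the cell-to-image-center distance ||g − p|| is normalized, so the coordinate scale is left as an assumption of the caller.

```python
# Sketch of Eqs. (1)-(3): keypoint-count saliency z, location saliency s, and v = z * s.
import numpy as np

def keypoint_saliency(t, T):
    """Eq. (1): sigmoid of the cell's share of the image's SIFT keypoints."""
    return 1.0 / (1.0 + np.exp(-15.0 * t / T))

def location_saliency(cell_center, image_center):
    """Eq. (2): Gaussian weight favoring cells near the image center;
    ||g - p|| is assumed to be in suitably normalized coordinates."""
    d = np.linalg.norm(np.asarray(cell_center, float) - np.asarray(image_center, float))
    return np.exp(-0.5 * d) / np.sqrt(2.0 * np.pi)

def visual_importance(t, T, cell_center, image_center):
    """Eq. (3): v = z * s."""
    return keypoint_saliency(t, T) * location_saliency(cell_center, image_center)
```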

2) Codeword Discrimination: The cell discriminative information estimation in the previous section evaluates the cell's importance based on only a single image. It does not utilize the discriminative information available from other images. Therefore, this section proposes a cell saliency estimation approach based on codeword discrimination. The method first determines the discriminating power of different codewords for different landmark categories. This information is then used to estimate and adjust the importance of each pyramid cell. A similar approach that tries to estimate the codewords' importance for different landmarks is proposed in [29]. However, it mainly tries to select the descriptive codewords for each category and combine them into a universal codebook, while our method focuses on estimating the codeword discrimination power and integrating it with the cell's BoW histogram to evaluate the cell's importance. The proposed codeword discrimination and cell saliency estimation process is summarized in Algorithm 1.

Algorithm 1 Codeword discrimination and cell saliency estimation method

1) For the training images from the N landmarks, perform the spatial pyramid decomposition in Section III-A to obtain multiple cells c_j^l, j = 1, 2, ..., J, where J = 1, 6, and 16 for l = 0, 1, and 2, respectively.
2) Extract dense SIFT descriptors from each image as in [7], and perform codebook generation by applying k-means clustering to cluster the dense descriptors into M codewords.
3) Calculate the M × N occurrence matrix S(m, n), whose rows correspond to codewords 1, ..., M and whose columns correspond to categories 1, ..., N:

$$S = \begin{bmatrix} S(1,1) & \cdots & S(1,n) & \cdots & S(1,N) \\ \vdots & & \vdots & & \vdots \\ S(m,1) & \cdots & S(m,n) & \cdots & S(m,N) \\ \vdots & & \vdots & & \vdots \\ S(M,1) & \cdots & S(M,n) & \cdots & S(M,N) \end{bmatrix}$$

where S(m, n) is the number of all training patches from landmark category n whose SIFT descriptors are clustered as codeword m.
4) Normalize the occurrence matrix S(m, n) column-wise to obtain P(m, n) as follows:

$$P(m, n) = \frac{S(m, n)}{\sum_{m=1}^{M} S(m, n)}, \quad m = 1, ..., M,\ n = 1, ..., N. \qquad (4)$$

5) Normalize the matrix P(m, n) row-wise to obtain the codeword discrimination matrix D(m, n) as follows:

$$D(m, n) = \frac{P(m, n)}{\sum_{\bar{n}=1, \bar{n} \neq n}^{N} P(m, \bar{n})}, \quad m = 1, ..., M,\ n = 1, ..., N. \qquad (5)$$

6) For each landmark category n, n = 1, ..., N, compute the saliency score of cell c in image X using the codeword discrimination matrix D(m, n) as follows:

$$\rho = \sum_{m=1}^{M} H_X^m(c)\, D(m, n) \qquad (6)$$

where ρ is the calculated saliency score of cell c and H_X^m(c) is the number of occurrences of codeword m in cell c. In the following, we abbreviate H_X^m(c) as H_X^m and denote the cell's BoW histogram as H_X to simplify the notation.

The proposed codeword discrimination approach is explained as follows. First, a codebook consisting of M codewords is constructed from the N categories of landmarks using k-means clustering [31]. The codebook contains a list of codewords whose discrimination values are to be learned. An occurrence matrix S(m, n) is formed to capture the relationships between the codewords and landmark categories, where the matrix entry S(m, n) represents the number of all training patches from landmark category n whose SIFT descriptors are clustered as codeword m. The intuitive interpretation of the occurrence matrix S(m, n) is as follows. For the mth codeword, where m = 1, ..., M, if

$$S(m, n) > S(m, \bar{n}), \quad \forall \bar{n} \neq n \qquad (7)$$

namely, more SIFT descriptors (dense patches) from category n belong to the mth codeword than SIFT descriptors (patches) from other categories, then we consider codeword m to be more effective in representing category n than the other categories. In practice, however, as different landmark categories may have different numbers of training images for codebook construction, we have to normalize the matrix S(m, n) column-wise to obtain P(m, n). Next, the algorithm considers P(m, n) and P(m, n̄) for the target category n and all negative categories n̄, respectively. It then estimates the discrimination power D(m, n) of each codeword m for each category n using (5). Intuitively, a larger D(m, n) indicates that codeword m has a larger discriminating power in representing category n than in representing the other categories n̄.
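A minimal sketch of steps 3)–6) of Algorithm 1 is given below, assuming the dense SIFT descriptors of the training patches have already been quantized against the M-word codebook. The input format (a list of per-category codeword-index arrays) and the small epsilon guards are assumptions for illustration.

```python
# Sketch of the occurrence matrix S, its normalizations P and D (Eqs. 4-5),
# and the cell saliency score rho (Eq. 6).
import numpy as np

def discrimination_matrix(codeword_ids_per_category, M):
    """codeword_ids_per_category: list of N arrays, each holding the codeword index
    of every training patch of that category. Returns the M x N matrix D."""
    N = len(codeword_ids_per_category)
    S = np.zeros((M, N))
    for n, ids in enumerate(codeword_ids_per_category):
        for m in ids:
            S[m, n] += 1.0                                     # occurrence counts S(m, n)
    P = S / np.maximum(S.sum(axis=0, keepdims=True), 1e-12)    # Eq. (4): column-wise
    D = np.zeros_like(P)
    for n in range(N):
        negatives = P.sum(axis=1) - P[:, n]                    # sum of P(m, n') over n' != n
        D[:, n] = P[:, n] / np.maximum(negatives, 1e-12)       # Eq. (5): row-wise
    return D

def cell_saliency(H_cell, D, n):
    """Eq. (6): correlation of the cell's BoW histogram with the n-th column of D."""
    return float(np.dot(np.asarray(H_cell, float), D[:, n]))
```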

After learning the codeword discrimination matrix D(m, n) for codewords m = 1, ..., M and landmark categories n = 1, ..., N, we use the cross-correlation in (6) between the cell's BoW histogram H_X and the nth column of D(m, n) to estimate the cell's importance in representing landmark category n. In short, if H_X is highly correlated with the nth column of D(m, n), this implies that the particular cell's histogram contains more discriminating power for landmark category n, and hence the cell should be given more weight in representing that category. Based on the previous discussion on cell visual saliency and codeword discrimination, the final saliency index γ_l for cell c to represent landmark category n can be obtained as follows:

$$\gamma_l = w_l \left[(1 - \lambda)\,v + \lambda\,\rho\right] \qquad (8)$$

where λ is a regularization parameter that controls the relative contribution of v and ρ, and w_l is a level-dependent weight used to emphasize the cells with finer resolution. The idea that cells with finer resolution should be given more emphasis is consistent with the studies in [26] and [27]. The weights for the three levels, w_l (l = 0, 1, 2), are set to 0.25, 0.25, and 0.5, respectively. The histogram H_X of cell c is then weighted by its saliency index γ_l to obtain a weighted histogram H̃_X as follows:

$$\tilde{H}_X = \gamma_l H_X. \qquad (9)$$

Based on the decomposition approach in Section III-A, we obtain multiple weighted histograms H̃_X(c_j^l), j = 1, 2, ..., J; l = 0, 1, 2. These weighted cell histograms are then concatenated to form a long vector to represent image X from the nth landmark category.
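For completeness, a short sketch of (8)–(9) is shown below, using λ = 0.7 and w_l = (0.25, 0.25, 0.5) as reported in the text and experiments; the surrounding bookkeeping is assumed.

```python
# Sketch of Eqs. (8)-(9): saliency index gamma_l and the weighted, concatenated histograms.
import numpy as np

LEVEL_WEIGHTS = (0.25, 0.25, 0.5)   # w_l for levels 0, 1, 2

def weighted_cell_histogram(H_cell, v, rho, level, lam=0.7):
    gamma = LEVEL_WEIGHTS[level] * ((1.0 - lam) * v + lam * rho)    # Eq. (8)
    return gamma * np.asarray(H_cell, dtype=float)                  # Eq. (9)

def image_feature(cell_histograms, saliencies, levels, lam=0.7):
    """Concatenate the weighted histograms of all 23 pyramid cells into one vector.
    saliencies: list of (v, rho) pairs, one per cell, in the same order as the cells."""
    parts = [weighted_cell_histogram(H, v, rho, l, lam)
             for H, (v, rho), l in zip(cell_histograms, saliencies, levels)]
    return np.concatenate(parts)
```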

C. Codeword Discriminative Spatial Pyramid Kernel

The pyramid kernel in [26] mainly assigns a level-dependent weight to the codewords when calculating the distance between two cells' histograms. Suppose X and Y are two images from landmark categories n and n̄, and the weighted histograms of cell c in the two images are H_X and H_Y. Then the kernel K_l that is used to evaluate the two cells' similarity can be expressed as follows:

$$K_l(X, Y) = \sum_{m=1}^{M} k_l(m) \qquad (10)$$

where

$$k_l(m) = f\!\left(H_X^m, H_Y^m\right), \quad m = 1, 2, ..., M \qquad (11)$$

and f is an element-by-element histogram similarity function.

From (10), it can be seen that the pyramid kernel K_l is just a simple summation of k_l(m). It treats k_l(m) from m = 1 to M equally. However, as discussed in Section III-B, different codewords in fact have different descriptive capabilities when representing different landmark categories. The histogram distance k_l(m) calculated using more discriminative codewords should be given more emphasis. Based on this principle, the kernel function K_l for two cells is updated as follows:

$$K_l(X, Y) = \sum_{m=1}^{M} D(m, n)\, k_l(m). \qquad (12)$$

From (12), it can be observed that the new kernel between two cells becomes a weighted summation of k_l(m) in accordance with the codeword discrimination D(m, n). Finally, when training the classifier for the nth landmark, the new spatial pyramid kernel between two images (X, Y) is given as follows:

$$K(X, Y) = \sum_{l=0}^{L} \sum_{j=1}^{J} w_l\, K_j^l(X, Y) \qquad (13)$$

$$K_j^l(X, Y) = \sum_{m=1}^{M} D(m, n)\, k_j^l(m) \qquad (14)$$

$$k_j^l(m) = \frac{\left(\tilde{H}_X^m - \tilde{H}_Y^m\right)^2}{\tilde{H}_X^m + \tilde{H}_Y^m}, \quad m = 1, 2, ..., M \qquad (15)$$

where L is the number of pyramid levels, J is the number of cells at level l, M is the codebook size, and H̃_X^m is the mth bin of the weighted BoW histogram of cell c_j^l in image X. We use the χ² function in (15) to evaluate the similarity between two histograms. The new spatial pyramid kernel described in (13)–(15) is then used to train the SVM classifiers to perform mobile landmark recognition.
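A minimal sketch of the codeword-discriminative kernel in (13)–(15) is given below. Note that (15) as written is a χ²-type distance; how it is converted into a similarity before SVM training (e.g., via a negative exponential) is an implementation detail not specified in the text, so the sketch simply reproduces the weighted sums.

```python
# Sketch of Eqs. (13)-(15): D(m, n)-weighted chi-square terms summed over pyramid cells.
import numpy as np

LEVEL_WEIGHTS = (0.25, 0.25, 0.5)   # w_l for levels 0, 1, 2

def cell_term(Hx, Hy, d_n, eps=1e-12):
    """Eqs. (14)-(15) for one pair of weighted cell histograms; d_n is column n of D."""
    Hx = np.asarray(Hx, float)
    Hy = np.asarray(Hy, float)
    chi2 = (Hx - Hy) ** 2 / (Hx + Hy + eps)          # Eq. (15), one value per codeword m
    return float(np.dot(d_n, chi2))                   # Eq. (14): weight by D(m, n)

def pyramid_kernel(cells_x, cells_y, levels, d_n):
    """Eq. (13): level-weighted sum over all corresponding cells of images X and Y."""
    return sum(LEVEL_WEIGHTS[l] * cell_term(hx, hy, d_n)
               for hx, hy, l in zip(cells_x, cells_y, levels))
```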

IV. Integration of Context and Content Analysis

Conventional context analysis usually employs location information for mobile landmark recognition [10], [13]. Location-based analysis infers the landmark based on the distance from the user's location to the landmarks. However, location data such as GPS readings may contain significant errors, especially in dense built-up areas. This reduces the recognition performance of mobile landmark recognition if location information alone is used. In view of this, this paper proposes to complement the location information with direction information. The specification of the mobile device's digital compass indicates that it has an error range within ±2 degrees, which is consistent with our observations in the experiments. Today, increasingly many mobile devices are equipped with a gyroscope or digital compass, such as the Google Android HTC Dream and the iPhone. These mobile devices can provide directional information to enhance the performance of mobile landmark recognition.

The context analysis developed in this paper involves the formation of visual viewcones based on the location and direction information. Using the user's position and the camera's pointing direction, a viewcone can be drawn that corresponds to the camera phone's field of view, as illustrated in Fig. 3.

Fig. 3. Visual viewcone in context analysis.

Three suggested context analysis strategies are given as follows (a minimal sketch of these strategies follows the list).

1) Nearest-to-center strategy: select the landmark category nearest to the center of the viewcone as the recognized landmark. In Fig. 3, landmark 2 is chosen based on this strategy.

2) Wide viewcone nearest-to-user strategy: select the landmark nearest to the user's location from a wide viewcone (±25 degrees). In Fig. 3, landmark 1 is chosen based on this strategy.

3) Narrow viewcone nearest-to-user strategy: select the landmark nearest to the user's location from a narrow viewcone (±10 degrees). In Fig. 3, landmark 3 is chosen based on this strategy.
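The following sketch implements the three strategies above for a small table of landmark coordinates. Bearings and distances are computed with a simple equirectangular approximation; the ±25° and ±10° half-angles follow the text, while the geometry helpers are assumptions.

```python
# Sketch of the three context-only viewcone strategies.
import math

EARTH_RADIUS_M = 6371000.0

def bearing_and_distance(user, landmark):
    """Approximate bearing (deg, clockwise from north) and distance (m) between
    two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, user)
    lat2, lon2 = map(math.radians, landmark)
    dx = (lon2 - lon1) * math.cos(0.5 * (lat1 + lat2)) * EARTH_RADIUS_M   # east offset
    dy = (lat2 - lat1) * EARTH_RADIUS_M                                   # north offset
    return math.degrees(math.atan2(dx, dy)) % 360.0, math.hypot(dx, dy)

def angle_off(bearing, heading):
    """Smallest absolute angle between a landmark bearing and the compass heading."""
    return abs((bearing - heading + 180.0) % 360.0 - 180.0)

def context_only_pick(user, heading, landmarks, strategy="narrow"):
    """landmarks: dict name -> (lat, lon). Returns the recognized landmark or None."""
    scored = [(name, *bearing_and_distance(user, pos)) for name, pos in landmarks.items()]
    if strategy == "center":                             # 1) nearest to the viewcone center
        return min(scored, key=lambda s: angle_off(s[1], heading))[0]
    half_angle = 25.0 if strategy == "wide" else 10.0    # 2) wide / 3) narrow viewcone
    inside = [s for s in scored if angle_off(s[1], heading) <= half_angle]
    return min(inside, key=lambda s: s[2])[0] if inside else None
```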

This paper proposes an integration of context and content analysis to perform mobile landmark recognition. The integration approach is summarized in Algorithm 2.

Algorithm 2 Proposed content-context integration for mobile landmark recognition

1) A filtering step is first adopted to shortlist a number of candidate landmarks L_i, i = 1, 2, ..., K, using the proposed context-based viewcone strategy.
2) For each candidate landmark L_i, a context-based soft likelihood score p_1(i) is estimated as follows:

$$p_1(i) = \cos(\theta_i) \cdot \frac{1}{\sqrt{2\pi\sigma}} \exp\!\left(-\frac{1}{2\sigma}\left(x_i^2 + y_i^2\right)\right) \qquad (16)$$

where x_i and y_i are the latitudinal and longitudinal distances between the user and the candidate landmark L_i [32], σ is a scaling parameter, and θ_i is the angle between the candidate landmark and the center direction of the user's viewcone.
3) For each candidate landmark L_i, a content-based soft likelihood score p_2(i) is also estimated using the proposed content analysis method as follows:

$$p_2(i) = \frac{\exp(s(i))}{\sum_{j=1}^{K} \exp(s(j))} \qquad (17)$$

where s(i) is the output of the ith SVM landmark classifier for the query image. The softmax function is used to normalize the obtained value to [0, 1].
4) The final likelihood score is then calculated to determine the recognized landmark I as follows:

$$I = \arg\max_i \{p_1(i) \cdot p_2(i)\} = \arg\max_i \left\{\cos(\theta_i) \cdot \frac{1}{\sqrt{2\pi\sigma}} \exp\!\left(-\frac{1}{2\sigma}\left(x_i^2 + y_i^2\right)\right) \cdot \frac{\exp(s(i))}{\sum_{j=1}^{K} \exp(s(j))}\right\}. \qquad (18)$$

The proposed integration technique is explained as follows. First, the context viewcone analysis is used to select a number of candidate landmarks. This reduces the subsequent content analysis time in order to meet the real-time requirement of mobile users. Next, the context-based soft likelihoods are estimated. They consider the following two factors: 1) the distance between the user and the candidate landmark, and 2) the angle between the candidate landmark and the center direction of the user's viewcone. Intuitively, a landmark that is closer to the user and nearer to the center of the viewcone has a higher likelihood. Gaussian and cosine functions are used to incorporate the distance and orientation information, respectively. Next, the captured image is passed through the SVM classifier of each candidate landmark to calculate the content-based scores. Finally, the context-based and content-based scores are combined to estimate the final likelihood for each candidate landmark. The landmark with the largest score is selected as the recognized landmark.
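A minimal sketch of the fusion in Algorithm 2 (Eqs. (16)–(18)) is given below, taking the shortlisted candidates, their latitudinal/longitudinal offsets and viewcone angles, and the corresponding SVM outputs s(i) as inputs. The scaling parameter σ is an assumed value, and the softmax is shifted by its maximum purely for numerical stability.

```python
# Sketch of the content-context fusion in Eqs. (16)-(18).
import numpy as np

def context_likelihood(dx, dy, theta_deg, sigma=50.0):
    """Eq. (16): Gaussian-distance term times cosine of the off-center angle."""
    return (np.cos(np.radians(theta_deg)) *
            np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma)) /
            np.sqrt(2.0 * np.pi * sigma))

def content_likelihood(svm_scores):
    """Eq. (17): softmax over the K candidate classifiers' outputs s(i)."""
    scores = np.asarray(svm_scores, float)
    e = np.exp(scores - scores.max())          # shift for numerical stability
    return e / e.sum()

def recognize(candidates, offsets, thetas, svm_scores, sigma=50.0):
    """Eq. (18): return the candidate maximizing p1(i) * p2(i)."""
    p1 = np.array([context_likelihood(dx, dy, th, sigma)
                   for (dx, dy), th in zip(offsets, thetas)])
    p2 = content_likelihood(svm_scores)
    return candidates[int(np.argmax(p1 * p2))]
```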

V. Experimental Evaluation

A. Experimental Setup

To evaluate the performance of the proposed method, a landmark database called NTU50Landmark is constructed [33]. The database covers a large variety/complexity of real-world conditions that can be used in the training of the landmark classifiers. The NTU50Landmark database consists of 3622 training images and 534 testing images for 50 landmark categories from the campus of Nanyang Technological University (NTU), Singapore. The selected landmarks are well-recognized buildings, structures, and places-of-interest in NTU, such as the Chinese Heritage Center, Yunnan Monument and Pavilion, Auditorium Sculpture, and others. They are chosen based on feedback from various students and visitors that these are places-of-interest in NTU. The definition of a landmark is taken as a building, structure, or place-of-interest that is unique or distinctive. The landmark images are captured using camera phones equipped with a GPS receiver and a digital compass. All the test images contain the location and direction information of the camera phone when the images are taken. Different image acquisition conditions are used when capturing the images. These include different camera settings (contrast, resolution), different capturing times (morning, noon, afternoon), illumination and weather changes (sunny, cloudy), changes in viewpoint (side, front, left, right, full, part), changes in scale (far, near), and occlusion. Sample images of these landmarks are given in Fig. 4(a). The geospatial distribution of the landmarks is given in Fig. 4(b); they cover an area of more than 2 km². From the figure, it can be seen that the landmarks spread across the whole campus of NTU, with certain areas having a higher concentration of landmarks. The numbers of training and testing images for each landmark are mostly uniform; there are on average 70 training images and 10 testing images per landmark. The following sections will demonstrate the performance of the methods under three different conditions: 1) content analysis alone; 2) context analysis alone; and 3) integration of content and context analysis. The performance metric is recognition accuracy, defined as the ratio of the number of correctly classified landmarks to the total number of testing images.

B. Performance Evaluation of Proposed Content Analysis

The content analysis technique in Section III is evaluated on the NTU50Landmark database.


Fig. 4. Sample images and geospatial distribution of the 50 landmark categories. (a) Sample images of the 50 landmark categories. (b) Geospatial distribution of the 50 landmarks.

To demonstrate the efficiency of the different components in Section III, the arrangements for the different modules are outlined as follows.

1) Improvement 1: new spatial pyramid decomposition (NSPD) using BoW.

2) Improvement 2: Improvement 1 + new pyramid cell saliency estimation (PCSE).

3) Improvement 3: Improvement 2 + new spatial pyramid kernel (NSPK).

From the NTU50Landmark database, 3622 images are used for training and the remaining 534 images are used for testing. The regularization parameter λ = 0.7 is determined empirically through experiments.

To compare the performance of the proposed technique, we use the state-of-the-art BoW method as the baseline for comparison [25]. Dense SIFT descriptors are first extracted from multiscale patches (16×16, 32×32, and 64×64 overlapping patches) in an image and quantized to form a BoW histogram based on the codebook. A one-versus-all SVM with a radial basis function kernel is then used to train the classifiers. It takes about 4 h to train the SVM classifiers for the 50 landmarks using MATLAB. Spatial pyramid matching with BoW (SPK-BoW) in [26] is also implemented for comparison. Fig. 5(a) gives the average recognition rates of the methods for codebook sizes ranging from 150 to 400. The error bars are given in terms of the 95% confidence interval. From the figure, it is observed that the proposed methods (Improvements 1–3) outperform the benchmark BoW and SPK-BoW methods for codebook sizes ranging from 150 to 400. The overall content analysis (Improvement 3) achieves a recognition rate of around 93% for a codebook size of 400. It can be seen that each component in the proposed content analysis (spatial pyramid decomposition, pyramid cell saliency estimation, and spatial pyramid kernel) offers a 2–3% improvement in the recognition rate, rendering an overall improvement of around 6–9% over the SPK-BoW in [26] and 8–11% over the baseline BoW [25]. The improvements are due to the following reasons.

1) It estimates the saliency of different pyramid cells so that this information can be incorporated into the training of classifiers. This is consistent with the observation that image regions/patches with more visual saliency should be given more emphasis when considering their visual representation.

2) The proposed spatial pyramid representation can represent landmark images more effectively by involving a combination of 1×3 and 3×1 cell structures as opposed to the conventional decomposition. This can better partition the image into the central region-of-interest and the background areas.

Fig. 5(b) illustrates some visual examples where the proposed weighted method outperforms the nonweighted approaches. Three selected images and their estimated cell saliency are given in the figure. Brighter areas imply that the cells contain more discriminating information. From the figure, it can be seen that the proposed method provides more emphasis to those cells corresponding to landmarks. Therefore, the method is able to offer an improvement over the nonweighted approaches, where all cells are given equal weights.

Further, we also compare the performance of the proposed method with another state-of-the-art BoW method called contextual BoW with spatial layout (CBoWSL) [5] based on several local descriptors. It is worth noting that [5] reported that the CBoWSL method outperforms other current methods, such as visual language modeling [4] and spatial pyramid matching with intermediate concepts [35], on different datasets. It should be noted that the proposed method uses dense SIFT as the local descriptor in its original setting. Nevertheless, we have also used different local descriptors, including sparse SIFT [24], dense SIFT [7], speeded up robust features (SURF), and upright SURF (U-SURF) [36], in the proposed method and the CBoWSL method for comparison. The results are given in Table I.


Fig. 5. (a) Recognition accuracy of different content analysis methods. (b) Illustrative examples of the proposed cell saliency estimation. SPK-BoW: spatial pyramid kernel in bags-of-words method [26]; Improvement 1: NSPD with BoW; Improvement 2: Improvement 1 + PCSE; Improvement 3: Improvement 2 + NSPK.

TABLE I
Performance Comparison of Different BoW Methods Using Different Local Descriptors

Method            Sparse SIFT (%)   SURF (%)   U-SURF (%)   Dense SIFT (%)
Proposed method   86.5              89.3       91.6         93.1
CBoWSL            83.5              86.9       89.1         91.0

From the table, it can be observed that the proposed method using dense SIFT descriptors obtains the highest recognition rate (93.1%) among the combinations of BoW methods and local feature descriptors. Further, it can be seen that the proposed BoW method consistently outperforms CBoWSL for the different local descriptors. This shows the effectiveness of the proposed method for mobile landmark recognition. When we compare the local descriptors, it is observed that the dense SIFT descriptor offers the best performance, followed by U-SURF, SURF, and sparse SIFT. The reason why dense SIFT outperforms the other local descriptors is that the dense SIFT descriptors are obtained from overlapping regular patches in the image without rotation. As the landmark images are usually acquired in an upright position with little rotation, the dense SIFT descriptors turn out to be more effective for landmark recognition than sparse SIFT and SURF, which consider orientation information.

C. Performance Evaluation of Context Analysis

This section shows the experimental evaluation of the three proposed context analysis techniques highlighted in Section IV. We have also implemented context analysis using GPS alone for comparison; it selects the landmark nearest to the user as the recognized landmark. In order to demonstrate the performance of the techniques in areas with different GPS errors, three sets of experiments are conducted: 1) a dataset with small GPS errors (typically less than 10 m), which consists of 120 test images captured mainly in open-sky areas; 2) a dataset with medium-large GPS errors (typically more than 20 m), which consists of 110 test images captured mainly in dense built-up areas; and 3) the complete dataset, inclusive of 1) and 2), which consists of 534 test images captured in all environments.

TABLE II
Recognition Accuracy of Context Analysis

(a) Image Dataset with Small GPS Errors
Context Analysis                          Recognition Rate (%)
Nearest-to-center strategy                91.7
Wide viewcone nearest-to-user strategy    90.0
Narrow viewcone nearest-to-user strategy  92.5
GPS alone                                 86.7

(b) Image Dataset with Medium-Large GPS Errors
Context Analysis                          Recognition Rate (%)
Nearest-to-center strategy                66.4
Wide viewcone nearest-to-user strategy    67.3
Narrow viewcone nearest-to-user strategy  70.9
GPS alone                                 54.5

(c) Complete Image Dataset
Context Analysis                          Recognition Rate (%)
Nearest-to-center strategy                73.0
Wide viewcone nearest-to-user strategy    74.0
Narrow viewcone nearest-to-user strategy  77.2
GPS alone                                 69.5

The results are given in Table II, from which it can be seen that the three proposed viewcone approaches consistently outperform the GPS-alone method. This shows that the viewcone approaches based on the fusion of GPS and direction are more effective than using GPS alone. It can also be seen that the image dataset with small GPS errors obtains a higher recognition rate than that with medium-large GPS errors, which shows that medium-large GPS errors can reduce the recognition accuracy significantly. Further, it can be observed that the narrow viewcone nearest-to-user strategy produces the best recognition accuracy. This is because the strategy exploits the observation that mobile users tend to capture a landmark that is close to the user and near the center of the viewcone. The recognition times of all three techniques are comparable, ranging from 0.002 to 0.004 s. Overall, context analysis is much faster than content analysis. However, it cannot give satisfactory recognition accuracy. This is due to: 1) the GPS errors and the real world's complex capturing conditions, and 2) the ambiguity associated with multiple candidate landmarks in the viewcone. Therefore, context analysis needs to be integrated with content analysis to achieve good recognition accuracy.

D. Performance Comparison of Content Analysis, Context Analysis, and Integration of Content–Context Analysis

Based on our previous discussion in Sections V-B and V-C, it is clear that using context or content analysis alone is not able to achieve satisfactory results, as context analysis alone is susceptible to GPS errors, whereas content analysis alone is computationally intensive, since it needs to compare the query image with all landmark categories in the database. In view of this, this section demonstrates the performance comparison of content analysis, context analysis, and the integration of content–context analysis. The content analysis used is the overall Improvement 3 from Section V-B, and the context analysis used is the narrow viewcone nearest-to-user strategy from Section V-C. The content–context integration in Section IV is run. Up to the top six landmark candidates are selected; the actual candidate number depends on how many landmarks are located in the viewcone. These candidates are then used for subsequent visual recognition. The results are given in Table III, from which it can be seen that the integration technique achieves a better recognition rate than the context analysis alone and content analysis alone methods for all three datasets. In particular, for the whole dataset, the integration technique achieves an improvement of 20.4% and 6.6% over the context and content analysis, respectively. This shows that the integration method is able to complement the advantages of content analysis and context analysis to improve the recognition rate of the system.

TABLE III
Performance Comparison for Content Analysis, Context Analysis, and Their Integration

Method                                        Dataset                                     Recognition Rate (%)   Average Recognition Time (s)
Content analysis alone                        Small GPS errors (open-sky areas)           94.2                   9.8
                                              Medium-large GPS errors (built-up areas)    92.7                   10.1
                                              Whole dataset                               91.0                   10.2
Context analysis alone                        Small GPS errors (open-sky areas)           92.5                   0.0034
                                              Medium-large GPS errors (built-up areas)    70.9                   0.0035
                                              Whole dataset                               77.2                   0.0039
Proposed integration of context and content   Small GPS errors (open-sky areas)           98.3                   5.2
analysis                                      Medium-large GPS errors (built-up areas)    96.4                   5.0
                                              Whole dataset                               97.6                   5.1

In terms of computational time, it is noted that context analysis alone requires the least recognition time, albeit with a much lower recognition accuracy. The proposed integration technique, on the other hand, offers the best compromise, as it reduces the computational time of the content analysis method by approximately 50% while achieving a much higher recognition rate than the context analysis alone method. The reason the integration method is much faster than the content analysis method is that the context analysis reduces the search scope of landmark categories from a large number (50 categories in our experiments) to only a few (here, up to six) candidates in the viewcone. In short, the proposed content–context integration approach exploits the respective advantages of content and context analysis, i.e., the high recognition accuracy of the proposed content analysis and the fast processing of the context analysis, to achieve excellent performance. This integration produces a good recognition rate of 97.6% while maintaining a reasonably low computational time of around 5 s.

Next, we consider the image query time, defined as the total processing time from the moment the user submits an image query on the mobile device to the moment he/she receives the recognition result. Based on our experiments, the transmission time through the wireless local area network is 0.6 s for an image of dimension 240 × 320 pixels. The time spent on data wrapping/formatting and function calls is 1.7 s. The recognition time on the server, which includes the content analysis, context analysis, and classification time for mobile landmark recognition, is 5.1 s on average. Therefore, the final image query time based on content and context integration is on average 0.6 + 1.7 + 5.1 = 7.4 s. It is worth noting that the proposed back-end algorithm is currently implemented mainly in MATLAB. The total recognition time can be further reduced if it is implemented in a compiled language such as C/Java with code optimization.


E. Experiments on the Oxford Buildings Dataset

In order to demonstrate the effectiveness of the proposed method on other datasets, we have used a second dataset to validate the proposed algorithms. We downloaded the Oxford Buildings dataset from [34] and used it for the experiments. The dataset consists of 5062 images collected from Flickr for Oxford landmarks. It has been manually annotated for 11 different landmarks. Each landmark category contains five queries, giving rise to a total of 55 queries over which the landmark recognition system can be evaluated. It is noted that, as the Oxford dataset does not contain location and direction information, we can only use it to evaluate the effectiveness of the proposed content analysis method. The proposed spatial pyramid matching approach is applied and compared with two other benchmark methods. The codebook size is set to 300. The results are given in Table IV.

TABLE IV
Experimental Results on the Oxford Dataset

Method         Csurka's Method [25]   Lazebnik's Method [26]   Proposed Method
Accuracy (%)   54.5                   72.7                     81.8

From the table, it can be seen that the proposed method achieves a good recognition accuracy of 81.8%, which is 9.1% higher than Lazebnik's method in [26]. This shows that the proposed method can obtain good performance not only on the constructed NTU landmark database but also on a different benchmark dataset. Further, the reason that the recognition accuracy on the Oxford database (81.8%) is lower than that on the NTU50Landmark database (93%) is mainly due to the difference in the training dataset collection process. The NTU database is collected with training in mind, e.g., different perspective views, illuminations, and scales of the landmark images in the training set, whereas the Oxford training images are collected mainly from the Flickr website, which does not cover comprehensive image capturing conditions. Nevertheless, the results show that the proposed method is able to achieve satisfactory performance on the Oxford Buildings database.

VI. Conclusion

This paper presented an information fusion approach for mobile landmark recognition. The key contributions of this paper include: 1) the development of a number of new techniques in content analysis, context analysis, and the integration of content and context analysis; 2) the construction of a new landmark database called NTU50Landmark; 3) the design and implementation of a system prototype for mobile landmark recognition; and 4) the experimental evaluation of the proposed systems and methods under various conditions and environments. Experimental results based on the two landmark databases show that the proposed methods can achieve good recognition performance in mobile landmark recognition.

References

[1] PRLog. Free Press Release Distribution Service [Online]. Available: http://www.prlog.org/10366989-global-mobile-phone-subscribers-to-reach-46-billion-by-end-of-2009.html

[2] B. Yamauchi and P. Langley, "Place recognition in dynamic environments," J. Robot. Syst., vol. 14, no. 2, pp. 107–120, 1997.

[3] J. Hays and A. A. Efros, "IM2GPS: Estimating geographic information from a single image," in Proc. IEEE Int. Conf. Comput. Vis. Patt. Recog., Jun. 2008, pp. 1–8.

[4] L. Wu, Y. Hu, M. Li, N. Yu, and X.-S. Hua, "Scale-invariant visual language modeling for object categorization," IEEE Trans. Multimedia, vol. 11, no. 2, pp. 286–294, Feb. 2009.

[5] T. Li, T. Mei, I.-S. Kweon, and X.-S. Hua, "Contextual bag-of-words for visual categorization," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 4, pp. 381–392, Apr. 2011.

[6] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Trans. Patt. Anal. Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.

[7] F. F. Li and P. Perona, “A Bayesian hierarchical model for learningnatural scene categories,” in Proc. IEEE Int. Conf. Comput. Vis. Patt.Recog., vol. 2. Jun. 2005, pp. 524–531.

[8] Y. P. Li, D. J. Crandall, and D. P. Huttenlocher, “Landmark classificationin large-scale image collections,” in Proc. IEEE Int. Conf. Comput. Vis.,Sep.–Oct. 2009, pp. 1957–1964.

[9] T. Chen, K.-H. Yap, and L.-P. Chau, “From universal bag-of-words toadaptive bag-of-phrase for scene recognition,” to be presented at theIEEE International Conference on Image Processing, Brussels, Belgium,Sep. 2011.

[10] G. Fritz, C. Seifert, and L. Paletta, “A mobile vision system for urbandetection with informative local descriptors,” in Proc. IEEE Int. Conf.Comput. Vis. Syst., Jan. 2006, pp. 30–35.

[11] T. Chen, K.-H. Yap, and L.-P. Chau, “A discriminative learningapproach for mobile landmark recognition,” to be presented at theIEEE International Conference on Image Processing, Brussels, Belgium,Sep. 2011.

[12] W. Mai, G. Dodds, and C. Tweed, “A PDA-based system for recogniz-ing buildings from user-supplied images,” in Proc. Mobile UbiquitousInform. Access: Mobile HCI Int. Workshop, LNCS 2954. Sep. 2003, pp.143–157.

[13] Y.-T. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, T.-S. Chua, and H. Neven, "Tour the world: Building a web-scale landmark recognition engine," in Proc. IEEE Int. Conf. Comput. Vis. Patt. Recog., Aug. 2009, pp. 1085–1092.

[14] G. Schindler, M. Brown, and R. Szeliski, "City-scale location recognition," in Proc. IEEE Int. Conf. Comput. Vis. Patt. Recog., Jun. 2007, pp. 1–7.

[15] T. Quack, H. Bay, and L. V. Gool, "Object recognition for the internet of things," in Proc. Int. Conf. Internet Things, 2008, pp. 230–246.

[16] D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg, "Mapping the world's photos," in Proc. 18th Int. Conf. World Wide Web, 2009, pp. 761–770.

[17] S. Gammeter, T. Quack, D. Tingdahl, and L. V. Gool, "Size does matter: Improving object recognition and 3-D reconstruction with cross-media analysis of image," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 734–747.

[18] J. Knopp, J. Sivic, and T. Pajdla, "Avoiding confusing features in place recognition," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 748–761.

[19] A. R. Zamir and M. Shah, "Accurate image localization based on Google maps street view," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 255–268.

[20] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. IEEE Int. Conf. Comput. Vis., Jun. 2007, pp. 1–8.

[21] T. Quack, B. Leibe, and L. V. Gool, "World-scale mining of objects and events from community photo collections," in Proc. Int. Conf. Content-Based Image Video Retrieval, 2008, pp. 47–56.

[22] S. Gammeter, L. Bossard, T. Quack, and L. V. Gool, "I know what you did last summer: Object-level auto-annotation of holiday snaps," in Proc. IEEE Int. Conf. Comput. Vis., Sep.–Oct. 2009, pp. 614–621.

[23] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. New York: Cambridge Univ. Press, 2003.

[24] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.

[25] G. Csurka, C. Dance, L. X. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Proc. Int. Workshop Statist. Learn. Comput. Vis., 2004, pp. 1–22.

[26] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Patt. Recog., vol. 2, Oct. 2006, pp. 2169–2178.

[27] K. Grauman and T. Darrell, "Pyramid match kernels: Discriminative classification with sets of image features," in Proc. Int. Conf. Comput. Vis., vol. 2, 2005, pp. 1458–1465.

[28] B. F. Peterson, Learning to See Creatively: Design, Color and Composition in Photography. New York: Amphoto Press, 2003.

[29] S. L. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li, "Descriptive visual words and visual phrases for image applications," in Proc. ACM Int. Conf. Multimedia, 2009, pp. 75–84.

[30] Computational Vision at Caltech. Caltech 101 Dataset [Online]. Available: http://www.vision.caltech.edu/Image_Datasets/Caltech101

[31] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: Analysis and implementation," IEEE Trans. Patt. Anal. Mach. Intell., vol. 24, no. 7, pp. 881–892, Jul. 2002.

[32] Meridian World Data. Distance Calculation: How to Calculate the Distance Between Two Points on the Earth [Online]. Available: http://www.meridianworlddata.com/Distance-Calculation.asp

[33] K.-H. Yap, T. Chen, Z. Li, and K. Wu, "A comparative study of mobile-based landmark recognition techniques," IEEE Intell. Syst., vol. 25, no. 1, pp. 48–57, Jan.–Feb. 2010.

[34] J. Philbin and A. Zisserman. (2007, Nov.). Visual Geometry Group: The Oxford Buildings Dataset [Online]. Available: http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/index.html

[35] J. Liu and M. Shah, "Scene modeling using co-clustering," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1–7.

[36] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, "SURF: Speeded up robust features," Comput. Vision Image Understand., vol. 110, no. 3, pp. 346–359, 2008.

[37] D. Chen, G. Baatz, K. Koeser, S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk, "City-scale landmark identification on mobile devices," in Proc. IEEE Int. Conf. Comput. Vis. Patt. Recog., Jun. 2011, pp. 737–744.

[38] G. Baatz, K. Koser, D. Chen, R. Grzeszczuk, and M. Pollefeys, "Leveraging 3-D city models for rotation invariant place-of-interest recognition," Int. J. Comput. Vis., vol. 94, no. 5, pp. 1–20, May 2011.

Tao Chen (S’10) received the B.E. degree from Shandong University, Shandong, China, in 2006, and the M.S. degree in electrical engineering from Zhejiang University, Zhejiang, China, in 2008. Currently, he is pursuing the Ph.D. degree at Nanyang Technological University, Singapore.

His current research interests include multimedia content analysis and mobile media retrieval.

Kim-Hui Yap (SM’09) received the B.E. and Ph.D. degrees in electrical engineering from the University of Sydney, Sydney, Australia, in 1998 and 2002, respectively.

Since 2002, he has been a Faculty Member with Nanyang Technological University, Singapore, where he is currently an Associate Professor. He was the Group Leader in content-based analysis with the Center for Signal Processing, Nanyang Technological University. He has published in various international journals, book chapters, and conference proceedings. He has authored the book, Adaptive Image Processing: A Computational Intelligence Perspective (2nd ed. Boca Raton, FL: CRC Press, 2009) and edited the book, Intelligent Multimedia Processing with Soft Computing (New York: Springer-Verlag, 2005). His current research interests include image/video processing, media content analysis, computer vision, and computational intelligence.

Dr. Yap has served as an Associate Editor for the IEEE Computational Intelligence Magazine and the Journal of Signal Processing Systems. He has also served as an Editorial Board Member for the Open Electrical and Electronic Engineering Journal, and as a Guest Editor for the IEICE Transactions on Fundamentals. He has served as the Treasurer of the IEEE Singapore Signal Processing Chapter and a Committee Member of the IEEE Singapore Computational Intelligence Chapter. He has also served as the Finance Chair of the 2010 IEEE International Conference on Multimedia and Expo, the Workshop Co-Chair of the 2009 MDM International Workshop on Mobile Media Retrieval, and others.

Lap-Pui Chau (SM’03) received the B.E. degree, with first class honors, in electronic engineering from Oxford Brookes University, Oxford, U.K., and the Ph.D. degree in electronic engineering from Hong Kong Polytechnic University, Kowloon, Hong Kong, in 1992 and 1997, respectively.

In June 1996, he joined Tritech Microelectronics, Singapore, as a Senior Engineer. Since March 1997, he has been with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as a Research Fellow, and then as an Assistant Professor. Currently, he is an Associate Professor with the same university. His current research interests include streaming media, multimedia coding and compression, and very large-scale integration for signal processing.

Dr. Chau is the Chair of the Technical Committee on Circuits and Systems for Communications, and a Technical Committee Member on Multimedia Systems and Applications and on Visual Signal Processing and Communications of the IEEE Circuits and Systems Society. He was the Chairman of the IEEE Singapore Circuits and Systems Chapter from 2009 to 2010. He was a member of the Singapore Digital Television Technical Committee from 1998 to 1999. He was involved in the organizing committees of international conferences, including the IEEE International Conference on Image Processing (ICIP 2010, ICIP 2004) and the IEEE International Conference on Multimedia and Expo (ICME 2010). He was a Technical Program Co-Chair of the 2010 International Symposium on Intelligent Signal Processing and Communications Systems (ISPACS 2010). He has also regularly served as a Track Chair in the technical program committees of many international conferences. He was an Associate Editor of the IEEE Transactions on Multimedia. He is currently an Associate Editor of the IEEE Transactions on Circuits and Systems for Video Technology, the IEEE Transactions on Broadcasting, and the IEEE Signal Processing Letters. He is an IEEE Distinguished Lecturer for 2009 to 2013, and is a Steering Committee Member of the IEEE Transactions on Mobile Computing for 2011 to 2012.