
IEEE SIGNAL PROCESSING LETTERS, VOL. 19, NO. 8, AUGUST 2012 459

Content and Context Boosting for Mobile Landmark Recognition

Zhen Li and Kim-Hui Yap

Abstract—Existing mobile landmark recognition techniques mainly use GPS location information to obtain the candidate images nearby the mobile device, followed by content analysis within the shortlist. This is insufficient since i) GPS often has large errors in dense built-up areas, and ii) direction is underutilized to further improve recognition. In this letter, visual content and two types of mobile context (location and direction) are integrated by the proposed boosting algorithm. Experimental results show that the proposed method outperforms the state-of-the-art methods by about 6%, 11%, and 15% on NTU Landmark-50, PKU Landmark-198, and the large-scale San Francisco landmark dataset, respectively.

Index Terms—Content and context integration, mobile landmark recognition, multi-class adaboost.

I. INTRODUCTION

RECENT years have witnessed phenomenal growth in the usage of mobile devices. The built-in camera and network connectivity of current mobile phones make it increasingly appealing for users to snap pictures of landmarks and obtain relevant information about the captured landmarks. This is particularly useful in applications such as mobile city guides and mobile shopping. Landmark recognition (e.g., [1], [2]) is closely related to the landmark classification [3] and landmark search [4] problems, and can be broadly categorized into non-mobile and mobile landmark recognition systems.

In recent years, a number of non-mobile landmark vision systems have been developed [3], [5], which focus on learning a statistical model for mapping image content features to classification labels. In [5], a vocabulary tree (VT) is generated by hierarchically clustering local descriptors as follows: 1) an initial k-means clustering is performed on the local descriptors of the training data to obtain the first-level k clusters; 2) the same clustering process is applied recursively to each cluster at the current level. This leads to a hierarchical k-means clustering structure, known as a VT in this context. Based on the VT, high-dimensional histograms of image local descriptors can be generated, which enables efficient image recognition. In [3], non-visual information such as textual tags and temporal constraints is utilized to improve the performance of content analysis. However, these landmark

Manuscript received April 21, 2012; revised May 28, 2012; accepted May 30, 2012. Date of publication June 06, 2012; date of current version June 14, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Lisimachos Paul Kondi.
The authors are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LSP.2012.2203120

recognition methods underutilize the unique features of mobile devices, such as location, direction, and time information. This context information can be easily acquired from mobile devices, and it can further enhance the performance of content analysis without introducing much additional computational cost. In the mobile landmark recognition scenario, utilizing either content or context analysis alone is suboptimal. GPS location is able to help discriminate visually similar landmark categories that are located far away from each other. However, location should not be used alone due to the GPS error. In view of this, the context information must be used in conjunction with visual content analysis in mobile landmark recognition.

Recent work in [1], [4], [6]–[8] demonstrates that mobile context can be utilized to improve mobile landmark recognition performance. In these systems, the candidate images are obtained from the landmark image database by using GPS information. This is achieved by selecting those landmark images that are located close to the GPS location of the mobile device when the query image is taken. Visual content analysis is then performed on the GPS-based shortlist for the final recognition. This approach makes it easy to differentiate visually similar images of different landmark categories that are far away from each other. However, the approach of performing content analysis within the GPS-based shortlist still has some drawbacks that make it nonideal in real applications: (1) it is difficult to determine the optimal search radius, since the GPS error of the captured image may vary significantly from open areas (about 30 meters) to dense built-up areas (up to 150 meters); if the GPS error of the captured image is too large, the correct landmark category may be filtered out, resulting in wrong recognition; and (2) direction information, which can be easily acquired from the digital compass on mobile devices, is not utilized to further improve the recognition performance.

In order to circumvent the approach of performing content analysis within a GPS-based shortlist, as in most existing mobile recognition systems, and to have the flexibility of incorporating different types of context, we propose to integrate various content and context by Adaboost. As compared to the well-known approach of using a vocabulary tree (VT) built upon SIFT descriptors [5], we improve content-based recognition performance by incorporating recognition results from various context-based VTs built upon context descriptors. Context information adopted in this work includes location information (obtained through built-in GPS or WiFi localization) and direction information (obtained from the digital compass). In particular, we develop a new real-valued multi-class adaboost algorithm to integrate the recognition results from content-based VTs and context-based VTs. The proposed framework is shown in Fig. 1, where the dashed arrows indicate the optional usage of context information when it is available. The context-based VTs are generated by

1070-9908/$31.00 © 2012 IEEE


Fig. 1. Proposed framework for mobile landmark recognition.

hierarchically clustering the context descriptors in the training process. The choice of the context descriptor depends on whether direction information is available. When only GPS information is available, the context descriptors for generating the context-based VTs are the 2-D location vectors. When both GPS and direction information are available, the context descriptors are the 3-D composite GPS-direction descriptors, obtained by concatenating the Gaussian-normalized 2-D location vectors and the 1-D direction. The proposed framework achieves superior recognition performance compared with state-of-the-art mobile landmark recognition methods on various landmark datasets.
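The composite context descriptor and the context-based VT construction described above can be sketched as follows. This is a minimal illustration only: it assumes z-score ("Gaussian") normalization of the location coordinates over the training set and a simple scaling of the compass direction to [0, 1); the exact direction scaling, branch factor, and tree depth here are assumptions, not values fixed by the letter.

```python
import numpy as np

def composite_descriptors(latlon, heading_deg):
    """3-D GPS-direction descriptors: z-score-normalize the 2-D locations
    over the training set, then append the compass direction (scaled to
    [0, 1), an assumption) as a third dimension."""
    latlon = np.asarray(latlon, dtype=float)               # shape (N, 2)
    z = (latlon - latlon.mean(axis=0)) / latlon.std(axis=0)
    h = np.asarray(heading_deg, dtype=float)[:, None] / 360.0
    return np.hstack([z, h])                               # shape (N, 3)

def kmeans(X, k, iters=15, seed=0):
    """Plain k-means; returns cluster centers and point assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

def build_vt(X, branch=4, depth=2):
    """Context-based VT: recursive (hierarchical) k-means over descriptors."""
    if depth == 0 or len(X) < branch:
        return None
    centers, assign = kmeans(X, branch)
    return {"centers": centers,
            "children": [build_vt(X[assign == j], branch, depth - 1)
                         for j in range(branch)]}
```

The same `build_vt` routine applies to the SIFT-based content VTs, only with high-dimensional local descriptors and a deeper tree.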

II. PROPOSED CONTENT & CONTEXT ADABOOST ALGORITHM

In this work, we propose to integrate mobile content and context by Adaboost, which enhances recognition by making full use of both content and context information. A number of VTs can be built upon various content and context descriptors such as SIFT, GPS and direction. Based on each VT, we can construct a classifier by VT histogram matching for its corresponding type of descriptors. Then Adaboost algorithms can be used to construct a stronger classifier through the combination of the weak classifiers, each based on a single VT. In particular, we develop a new multi-class adaboost algorithm based on the additive logistic regression model and the exponential loss model, which are the two basic models for the Adaboost algorithms as shown in [9]. The proposed algorithm is real-valued in the sense that real-valued basis functions and real-valued confidence-rated predictions, rather than discrete classification labels, are used in order to update the additive model more accurately. We refer to our algorithm as RMAE (Real-valued Multi-class Adaboost using Exponential loss function).

Suppose we have a set of training data with $N$ images $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i = \{x_{i,j}\}_{j=1}^{J}$, $x_{i,j} \in \mathbb{R}^{d_j \times n_{i,j}}$ represents the $j$-th type of descriptor set for the $i$-th image, $d_j$ is the dimensionality of the $j$-th type of descriptor, $n_{i,j}$ is the number of $j$-th type descriptors for the $i$-th image, and $J$ is the total number of available types of content & context descriptors. $y_i = (y_{i,1}, \ldots, y_{i,C})^T$ is the class label defined as $y_{i,c} = \mathbb{I}(c = c_i) - \frac{1}{C-1}\,\mathbb{I}(c \neq c_i)$, where $C$ is the total number of landmark categories, $c_i$ is the ground-truth label for the $i$-th image, $\mathbb{I}(\cdot)$ is the indicator function, and $y_i$ satisfies the symmetric constraint that $\sum_{c=1}^{C} y_{i,c} = 0$. Specifically, for the $i$-th image, we have the training data $\{(x_{i,j}, y_i)\}_{j=1}^{J}$, where $x_{i,j}$ can be one type of the available content and context descriptors such as SIFT, GPS and composite GPS-direction descriptors. Note that in the subsequent derivations, for simplicity we use $x_i$ to denote $x_{i,j}$, the set of the $j$-th type descriptors for the $i$-th image.
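The symmetric label coding above, which is the same coding used in multi-class adaboost [14], can be written directly:

```python
import numpy as np

def label_vectors(labels, C):
    """Symmetric multi-class label coding: y_c = 1 for the ground-truth
    class and -1/(C-1) otherwise, so that sum_c y_c = 0 for every image."""
    Y = np.full((len(labels), C), -1.0 / (C - 1))
    Y[np.arange(len(labels)), np.asarray(labels)] = 1.0
    return Y
```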

Algorithm 1: Real-valued Multi-class Adaboost using Exponential loss function (RMAE)

1) Input: Descriptors of database images and their labels $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ can be the SIFT, GPS or GPS-direction descriptors of the corresponding image.
2) Initialization: Generate $M_1$ content-based and $M_2$ context-based vocabulary trees from random subsets of the content and context descriptors, respectively; let $M = M_1 + M_2$; calculate the VT histograms of each image based on each vocabulary tree; set all the training sample weights $w_i = 1/N$, $i = 1, \ldots, N$.
3) Repeat for $m = 1$ to $M$:
   • normalize the sample weights: $w_i \leftarrow w_i / \sum_{i'=1}^{N} w_{i'}$;
   • choose the VT that corresponds to the optimal $g_m$ using (9), (7) and (8); remove the chosen VT from the list of VTs;
   • calculate the basis function $g_m$ of the additive model by (6);
   • calculate the basis function coefficient $\beta_m$ of the additive model by solving (10);
   • update the sample weights: $w_i \leftarrow w_i \exp\left(-\frac{\beta_m}{C}\, y_i^T g_m(x_{i,j_m})\right)$, where $j_m$ is the descriptor type corresponding to $g_m$.
4) Output: the final relevance vector for the $i$-th image over the $C$ landmark categories is $F(x_i) = \sum_{m=1}^{M} \beta_m g_m(x_{i,j_m})$; the label of the final recognized category is the index of the largest entry of $F(x_i)$: $\hat{c}_i = \arg\max_{c} F_c(x_i)$.
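The per-round steps of the boosting loop can be sketched as follows. This is a minimal illustration, not the full system: it assumes the weak-learner outputs have already been precomputed as N-by-C arrays (one per candidate VT), and it takes the coefficient `beta` as an input rather than solving (10) for it.

```python
import numpy as np

def rmae_round(W, Y, G_pool, beta, C):
    """One boosting round (sketch): normalize the sample weights, pick
    the weak learner maximizing sum_i w_i * y_i^T g(x_i), and update the
    weights with the exponential factor.
    W: (N,) sample weights; Y: (N, C) symmetric label vectors;
    G_pool: list of (N, C) weak-learner outputs on the training images;
    beta: coefficient for the chosen learner (assumed given here)."""
    W = W / W.sum()
    margins = [np.sum(Y * G, axis=1) for G in G_pool]   # y_i^T g(x_i)
    m = int(np.argmax([(W * mg).sum() for mg in margins]))
    W_new = W * np.exp(-beta / C * margins[m])
    return m, W_new
```

In a full run, the chosen entry would be removed from `G_pool` after each round, mirroring the removal of the chosen VT in step 3.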

Based on the exponential loss model, we wish to find the adaboost classifier $F(x)$ such that

$$F^*(x) = \arg\min_{F(x)} \sum_{i=1}^{N} \exp\left(-\frac{1}{C}\, y_i^T F(x_i)\right) \quad (1)$$

subject to

$$F_1(x) + F_2(x) + \cdots + F_C(x) = 0 \quad (2)$$

where $y_i = (y_{i,1}, \ldots, y_{i,C})^T$ is the label vector and $F(x_i) = (F_1(x_i), \ldots, F_C(x_i))^T$ is the relevance vector for the $i$-th image over the $C$ landmark categories. We consider $F(x)$ to have the following forward stagewise additive modeling [9] form:

$$F(x) = \sum_{m=1}^{M} \beta_m g_m(x) \quad (3)$$

where $\beta_m$ are coefficients and $g_m(x)$ are basis functions satisfying the symmetric constraint $g_{m,1}(x) + \cdots + g_{m,C}(x) = 0$. Forward stagewise additive modeling approximates the solution to (1) by sequentially adding new basis functions without adjusting the current basis functions or coefficients. Based on the additive model

$$F_m(x) = F_{m-1}(x) + \beta_m g_m(x) \quad (4)$$

the optimization follows as

$$(\beta_m, g_m) = \arg\min_{\beta, g} \sum_{i=1}^{N} w_i \exp\left(-\frac{\beta}{C}\, y_i^T g(x_i)\right) \quad (5)$$


where $w_i = \exp\left(-\frac{1}{C}\, y_i^T F_{m-1}(x_i)\right)$ is the current unnormalized sample weight. Notice that every basis function $g(x)$ has a one-to-one correspondence with a weak classifier $h(x)$ in the following way:

$$g_c(x_i) = \frac{h_c(x_i)}{Z_i} - \frac{1}{C}, \quad c = 1, \ldots, C \quad (6)$$

where $Z_i = \sum_{c=1}^{C} h_c(x_i)$ is the normalization factor and the weak classifier is defined using the Sigmoid function as

$$h_c(x_i) = \frac{1}{1 + \exp(D_c(x_i))} \quad (7)$$

where $D_c(x_i)$ is defined as follows:

$$D_c(x_i) = \frac{1}{|S_c|} \sum_{x \in S_c} d\big(H(x_i), H(x)\big) \quad (8)$$

where $x$ denotes the descriptors of a database image, $S_c$ is the set of the database image descriptors in the $c$-th category, $H(\cdot)$ is the histogram of an image based on its corresponding type of VT, $d(\cdot, \cdot)$ is the histogram distance, and $|\cdot|$ denotes the cardinality of a set. For the SIFT descriptors, we calculate the histogram by traversing the typically thousands of SIFT descriptors through the SIFT-based VT; for the GPS and GPS-direction descriptors, since the context descriptor is only one vector, we adopt the soft binning as in [5] to generate the histogram based on the context-based VT. When $D_c(x_i)$ is very small, which means the $i$-th image is very close to the database images in the $c$-th category, $h_c(x_i)$ and $g_c(x_i)$ will be large.

Lemma 1: The solution for $g_m$ is obtained according to

$$g_m = \arg\max_{g} \sum_{i=1}^{N} w_i\, y_i^T g(x_i) \quad (9)$$

and solving the equation

$$\sum_{i=1}^{N} w_i\, y_i^T g_m(x_i) \exp\left(-\frac{\beta}{C}\, y_i^T g_m(x_i)\right) = 0 \quad (10)$$

gives the solution for $\beta_m$ in (5).

Proof: Let

$$Q(\beta, g) = \sum_{i=1}^{N} w_i \exp\left(-\frac{\beta}{C}\, y_i^T g(x_i)\right) \quad (11)$$

Then, the best candidate for $g$ is selected by evaluating the criterion:

$$g_m = \arg\min_{g} Q(\beta, g) \quad (12)$$

Since normally $\left|\frac{\beta}{C}\, y_i^T g(x_i)\right| \ll 1$, we can express (11) by a first-order Taylor expansion as

$$Q(\beta, g) \approx \sum_{i=1}^{N} w_i \left(1 - \frac{\beta}{C}\, y_i^T g(x_i)\right) \quad (13)$$

For any fixed value of $\beta > 0$, the choice of $g$ only depends on $\sum_{i=1}^{N} w_i\, y_i^T g(x_i)$, so we get that (9) holds. To find the solution for $\beta_m$, setting $\partial Q(\beta, g_m)/\partial \beta = 0$ gives

$$-\frac{1}{C} \sum_{i=1}^{N} w_i\, y_i^T g_m(x_i) \exp\left(-\frac{\beta}{C}\, y_i^T g_m(x_i)\right) = 0 \quad (14)$$

which is equivalent to (10). To solve for $\beta_m$, although (11) is not convex in general, we can still use a conjugate gradient search [10] to obtain a locally optimal solution. Finally, the proposed RMAE algorithm is summarized in Algorithm 1.
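For a fixed $g_m$, condition (10) is a scalar equation in $\beta$. As an illustrative alternative to the conjugate gradient search of [10], a few Newton steps on the one-dimensional loss suffice; a sketch, where `margins` holds the values $y_i^T g_m(x_i)$:

```python
import numpy as np

def solve_beta(w, margins, C, iters=50, tol=1e-10):
    """Newton iteration on Q(beta) = sum_i w_i * exp(-(beta/C) * margins_i);
    the stationarity condition Q'(beta) = 0 is exactly condition (10)."""
    beta = 0.0
    for _ in range(iters):
        e = w * np.exp(-beta / C * margins)
        grad = -(margins / C * e).sum()           # Q'(beta)
        hess = ((margins / C) ** 2 * e).sum()     # Q''(beta) > 0
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta
```

A finite root exists when the margins have mixed signs, i.e., when the chosen weak learner is neither perfect nor uniformly wrong on the weighted sample.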

III. EXPERIMENTAL RESULTS

A. Experiment Settings

To evaluate mobile landmark recognition performance, we implemented various methods for mobile image recognition on three landmark datasets of different scales: the NTU-50 dataset [1], the PKU-198 dataset [11], and the Nokia San Francisco dataset [8]. The NTU-50 dataset consists of 3,622 training images and 534 testing images for 50 main landmarks on the Nanyang Technological University campus, including location and direction as context information. The PKU Landmark-198 dataset from the MPEG CDVS requirement subgroup contains 13,179 geotagged photos from the Peking University campus; we randomly selected 3,290 images for training and the rest for testing. The San Francisco Landmark dataset contains 1.7 million geotagged landmark images in San Francisco, which is rather challenging since it has more than 6,000 categories and its 803 query images were taken with camera phones, differing from the acquisition of the database images.

For effective image descriptor extraction, we compute dense-patch SIFT descriptors sampled on overlapping 16×16 pixel patches spaced 8 pixels apart, as in [12], for all the compared algorithms on all the datasets. All images are resized to 320×240 resolution. For matching VT histograms between query images and database images, we adopt the intersection kernel [13] for its high efficiency. The proposed boosting algorithm (RMAE) is compared with state-of-the-art works including: the Scalable Vocabulary Tree (SVT) based on content analysis (Nister's method) [5], the SVT within the GPS-based shortlist (Girod's method) [4], and state-of-the-art multi-class adaboost algorithms (SAMME and SAMME.R) [14] applied to the proposed content & context boosting framework.
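The intersection-kernel matching of VT histograms, and the way it can feed the sigmoid weak classifier of (7) and (8), can be sketched together as follows. Using one minus the intersection similarity as the histogram distance is an assumption made here for illustration, not a choice fixed by the letter.

```python
import numpy as np

def intersection(h1, h2):
    """Histogram intersection similarity [13] of two L1-normalized histograms."""
    return np.minimum(h1 / h1.sum(), h2 / h2.sum()).sum()

def weak_classifier(query_hist, class_hists):
    """Sigmoid weak classifier in the spirit of (7)-(8): D_c is the mean
    histogram distance (here 1 - intersection, an assumption) from the
    query to the c-th class, and h_c = 1 / (1 + exp(D_c)) is largest when
    the query is closest to class c. class_hists: list (one per class) of
    lists of database histograms."""
    D = np.array([np.mean([1.0 - intersection(query_hist, h) for h in hists])
                  for hists in class_hists])
    return 1.0 / (1.0 + np.exp(D))
```

Normalizing the resulting scores and subtracting 1/C, as in (6), then yields a basis function satisfying the symmetric constraint.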

B. Performance Evaluation

In this section, several parameters for Nister's method and Girod's method are first determined for subsequent comparison, including the layout of the SVTs and the local search radius used to obtain the GPS-based shortlist in Girod's method. By comparing the recognition rates of Nister's method [5] and Girod's method [4], we observed that, for depths larger than 5, the recognition rate improves slowly on both the NTU-50 dataset and the PKU-198 dataset. Thus we set the depth of the SIFT-based SVT to 5 as a tradeoff between the recognition rate and the computational cost for all datasets. For context-based SVTs, since it is not necessary to partition the low-dimensional


Fig. 2. Performance comparison in terms of the number of adopted SVTs: (a) NTU-50 dataset, (b) PKU-198 dataset, (c) San Francisco dataset.

TABLE I
EFFECT OF RADIUS USING GIROD'S METHOD ON NTU-50 DATASET

context descriptors into very fine bins, the SVT depth is set to 4. In addition, the branch number of all types of SVTs is set as in [5] for all the datasets. Furthermore, the effect of varying the search radius on landmark recognition using Girod's method is given in Table I; reducing the radius reduces the average number of candidate images in the GPS-based shortlist and thus the computational cost of content analysis. We notice that the recognition rate achieves its peak at about 200 m; however, the improvement at the peak is still limited compared to the baseline SVT approach. With the above parameters optimized for Nister's method and Girod's method, we further demonstrate the performance improvement of the proposed RMAE method. In this work, we construct 6 SIFT SVTs and 4 GPS SVTs for all datasets. Note that direction information is only available in the NTU-50 dataset, for which we construct 4 additional SVTs based on the composite GPS-direction descriptors; when direction is used for recognition on the NTU-50 dataset, we use the GPS-direction SVTs instead of the GPS SVTs. Using the proposed RMAE algorithm to integrate SIFT and GPS information, the recognition rates versus the number of adopted SVTs (i.e., the number of weak classifiers) are shown in Fig. 2. By comparing with Table I, we can notice that, by boosting the recognition results from the individual SIFT SVTs and GPS SVTs, the recognition performance is significantly improved over Girod's method. We can also observe from Fig. 2 that the proposed RMAE algorithm achieves consistent improvements over the SAMME and SAMME.R algorithms. The recognition results of Nister's method, Girod's method and the proposed RMAE method in different context scenarios are shown in Table II, where the adopted context sources are indicated in brackets. It is evident that the proposed RMAE method outperforms Nister's method and Girod's method by combining the SIFT SVTs and GPS SVTs. For the NTU-50 dataset, by utilizing the composite GPS-direction information, the performance of the proposed RMAE improves by about 1.9% compared to incorporating GPS information alone. The whole adaboost procedure takes less than 3 seconds on all the datasets. The simulation environment is: Windows XP, MATLAB 2008a, P4 2.66 GHz CPU, and 4 GB RAM.

TABLE II
PERFORMANCE COMPARISON IN DIFFERENT CONTEXT SCENARIOS

IV. CONCLUSION

We proposed a new mobile landmark recognition framework that integrates various mobile content and context information by Adaboost. As compared to content analysis within a GPS-based shortlist, the proposed boosting algorithm can significantly improve recognition performance. In addition, the proposed Adaboost framework has the flexibility to incorporate any type of content and context information available on the mobile device. Future work may include integrating other context information (e.g., time) to further improve recognition performance.

REFERENCES

[1] K. Yap, T. Chen, Z. Li, and K. Wu, "A comparative study of mobile-based landmark recognition techniques," IEEE Intell. Syst., vol. 25, no. 1, pp. 48–57, 2010.
[2] T. Chen et al., "A multi-scale learning approach for landmark recognition using mobile devices," in Int. Conf. Information, Communications and Signal Processing, 2009, pp. 1–4.
[3] Y. Li, D. Crandall, and D. Huttenlocher, "Landmark classification in large-scale image collections," in IEEE 12th Int. Conf. Computer Vision, 2009, pp. 1957–1964.
[4] B. Girod et al., "Mobile visual search," IEEE Signal Process. Mag., vol. 28, no. 4, pp. 61–76, 2011.
[5] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition, 2006, vol. 2, pp. 2161–2168.
[6] G. Fritz, C. Seifert, and L. Paletta, "A mobile vision system for urban detection with informative local descriptors," in IEEE Int. Conf. Computer Vision Systems, 2006, pp. 30–30.
[7] J. Lim, Y. Li, Y. You, and J. Chevallet, "Scene recognition with camera phones for tourist information access," in IEEE Int. Conf. Multimedia and Expo, 2007, pp. 100–103.
[8] D. Chen et al., "City-scale landmark identification on mobile devices," in IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 737–744.
[9] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting," Ann. Statist., vol. 28, no. 2, pp. 337–407, 2000.
[10] M. Avriel, Nonlinear Programming: Analysis and Methods. New York: Dover, 2003.
[11] R. Ji et al., "PKUBench: A context rich mobile visual search benchmark," in 18th IEEE Int. Conf. Image Processing, 2011, pp. 2545–2548.
[12] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local features and kernels for classification of texture and object categories: A comprehensive study," in IEEE Conf. Computer Vision and Pattern Recognition Workshop, 2006, pp. 13–13.
[13] M. Swain and D. Ballard, "Color indexing," Int. J. Comput. Vis., vol. 7, no. 1, pp. 11–32, 1991.
[14] J. Zhu, S. Rosset, H. Zou, and T. Hastie, "Multi-class adaboost," Ann Arbor, vol. 1001, no. 48109, p. 1612, 2006.