
Machine Vision and Applications (2011) 22:521–534
DOI 10.1007/s00138-010-0285-9

ORIGINAL PAPER

Hyperlinking reality via camera phones

Dušan Omerčević · Aleš Leonardis

Received: 26 December 2008 / Revised: 20 March 2010 / Accepted: 28 June 2010 / Published online: 18 July 2010
© Springer-Verlag 2010

Abstract A novel user interface concept for camera phones, called “Hyperlinking Reality via Camera Phones”, that we present in this article, provides a solution to one of the main challenges facing mobile user interfaces, that is, the problem of selection and visualization of actions that are relevant to the user in her current context. Instead of typing keywords on a small and inconvenient keypad of a mobile device, a user of our system just snaps a photo of her surroundings and objects in the image become hyperlinks to information. Our method commences by matching a query image to reference panoramas depicting the same scene that were collected and annotated with information beforehand. Once the query image is related to the reference panoramas, we transfer the relevant information from the reference panoramas to the query image. By visualizing the information on the query image and displaying it on the camera phone’s (multi-)touch screen, the query image augmented with hyperlinks allows the user intuitive access to information.

Keywords Image matching using local invariant features · Wide baseline stereo matching · Augmented reality · Image-based localization

This research has been supported in part by: Research program Computer Vision P2-0214 (RS), EU FP6-004250-IP project CoSy, EU MRTN-CT-2004-005439 project VISIONTRAIN, and EU FP6-511051 project MOBVIS.

D. Omerčević (B) · A. Leonardis
Faculty of Computer and Information Science, University of Ljubljana, Tržaška cesta 25, Ljubljana, Slovenia
e-mail: [email protected]
URL: http://vicos.fri.uni-lj.si/

A. Leonardis
e-mail: [email protected]

1 Introduction

In 2008 there were more than three billion mobile phones in use throughout the world [1]. Even though modern mobile phones possess substantial processing and data communication capabilities, they are still predominantly used only for voice communication between people. One of the major obstacles to using mobile phones to also access information available on the Internet and other data networks is the inadequacy of their user interfaces. Besides small displays, the major problem of mobile user interfaces is input devices. Traditional input devices of desktop computers such as keyboards and mice are not suitable for mobile devices due to their space requirements, while some other techniques such as voice control are not commonly accepted by users [2].

Today, the most popular user interface concept for accessing information on mobile devices is navigation among a limited number of actions that the user can select using a keypad, a joystick, or more recently, a (multi-)touch screen [2]. The main challenge of this concept is deciding what actions to present to the user and how to visualize them. Due to the abundance of information available, differing user requirements, and limited information about the user’s context, selecting the actions that are relevant to the user in her current context is a difficult problem and a critical success factor of mobile user interfaces.

An interesting alternative to traditional mobile user interfaces was investigated in Nokia Research’s MARA (Mobile Augmented Reality Applications) project [3]. In this project a mobile phone equipped with accelerometers, a compass, and a GPS was used as a window to an augmented reality environment in which users could access information by pointing the phone’s camera in the direction of an interesting object. Additional information about the object was accessible as a hyperlink overlaid over the video stream taken by the phone’s camera, and hence the concept got the name “hyperlinking reality via phones” [4]. If we are able to attribute actions to objects that surround the user, then, suddenly, all the objects in the environment become action triggers that a user can activate by the physical action of pointing the camera phone towards the object. Moreover, a photo of the environment on the camera phone’s (multi-)touch screen becomes a natural interaction device allowing intuitive access to information. Therefore, the concept of “hyperlinking reality via phones” solves the main problem of the “action navigation” mobile user interface concept, that is, the selection and visualization of relevant actions. With the proliferation of mobile devices equipped with a GPS and a compass, the MARA project’s approach to mobile augmented reality has become mainstream and widely accepted by users [5]. Recent examples of such applications are Layar and Wikitude.

The major drawback of the MARA project’s approach is its dependence on accurate information about the absolute location and orientation of the camera phone. For example, the median accuracy of location information provided by GPS is only 10 m [6], which is insufficient for accurate superimposition of hyperlinks in a video stream, resulting in poor user experience. In this article we propose an alternative to the MARA project’s approach. Instead of relying on knowledge about the absolute location and orientation of the camera phone, we employ computer vision techniques to identify the object(s) that a user is pointing at, thus implementing the same user interface concept as the MARA project but using a different technology.

In the next section we give a short description of related work on mobile applications of computer vision and we refer to the existing computer vision techniques based on local invariant features. In Sect. 3 we provide a detailed description of our system. The “Hyperlinking Reality via Camera Phones” mobile user interface concept requires a data set of reference panoramas that are collected and annotated with information beforehand. Therefore, in Sect. 4, we describe the process of reference data set acquisition. As part of a user acceptance study [7], we have evaluated the performance of our system in a real-world scenario. We present the results of the evaluation in Sect. 5. Finally, before concluding in Sect. 7, we discuss in Sect. 6 the benefits that real-time interactivity would bring to our system.

2 Related work

In recent years there have been very few mobile phones on the market without a built-in camera. It is estimated that the total number of these so-called camera phones in the world surpassed one billion in 2008 [8]. Therefore, it is natural to consider using the camera as an input device for querying information on mobile phones [9–13]. Among these systems, the most similar to ours is the system of [12,13]. Contrary to our system, most of the processing in the system of Takacs et al. is done on the mobile phone itself, by continuously pushing a subset of features from the server to the mobile phone. At the current bandwidth of wireless networks, processing on the mobile phone provides a faster system response time at the expense of the quality of results. While our system is not interactive in real-time, the system of Takacs et al. is not registered in 3D and as such is also not a proper augmented reality system (see Sect. 6).

As stated by Henrysson [14, page 2] in his excellent PhD thesis, the tight coupling of camera and CPU gives mobile phones unique input capabilities where real-time computer vision is used to enable new interaction metaphors and link physical and virtual worlds. In his thesis, Henrysson presents the state of the art in camera phone augmented reality and describes several augmented reality applications running on a camera phone, the most famous of them being AR Tennis. As stated before, using non-vision sensors for augmented reality has some serious drawbacks due to the limited accuracy of such sensors. Therefore, Henrysson used marker-based computer vision algorithms for tracking and adapted them for camera phones, thus enabling real-time interaction with the system. Marker-based tracking requires placement of special recognizable codes in the environment for easy detection using camera phones. In outdoor environments, marker-based tracking is not an option due to the difficulties in preparing the environment and wide area use [14].

In our work, we are not limited by the augmented reality requirement of real-time interaction, though we would not object to such a feature in our system (see Sect. 6 for a discussion of the benefits of real-time interaction for the user). Instead, we have used more powerful, but also more computationally demanding, computer vision techniques based on local invariant features that were pioneered by Schmid and Mohr [15] and Lowe [16], and that are used for solving many computer vision problems, including image retrieval [15,17,18], object recognition [16], wide baseline matching [19–21], building panoramas [22], image-based localization [23], and video data mining [24].

3 Method

As stated in Sect. 1, the goal of our work is to enable a novel user interface for mobile devices. Instead of typing keywords on a small and inconvenient keypad of a mobile device, a user just snaps a photo of her surroundings and objects in the image become hyperlinks to information (see Fig. 1 for an example). We will call the photo snapped by the user a query image.


Fig. 1 To access information about objects in her surroundings, a user just snaps a photo (top left image) and objects in the image become hyperlinks to information (bottom left image) that a user can access by simply tapping an icon. The user’s photo is first sent to an application server over a UMTS network and then forwarded to a server which, by employing computer vision techniques, identifies relevant hyperlinks and calculates the user’s position. The result is sent back to the application server in XML form, where the hyperlinks are depicted as icons on the photo (image a) while the calculated position is visualized on a map (indicated by the icon of a man in image b), and finally displayed on the camera phone. The camera phone used in our experiments is the HTC Touch Cruise. See Sect. 5 for more information about the system setup

Our method commences by matching the query image to the reference panoramas depicting the same scene that were collected and annotated with information beforehand (see Sect. 4 for a detailed description of the reference panorama data set). Once the query image is related to the reference panoramas, we transfer the relevant information from the reference panoramas to the query image. By visualizing the information on the query image and displaying it on the camera phone’s (multi-)touch screen, the query image augmented with hyperlinks allows the user intuitive access to information.

The approach we have chosen in order to bring the “hyperlinking reality” functionality to camera phones is image matching using local invariant features. In image matching, a query image (or a part of it) is matched against reference images (or their parts) in order to relate the query image with a subset of reference images depicting the same scene or objects. The establishment of relations between the query and the reference images enables transfer of information from the reference images to the query image. For example, if we know the position and camera orientation of the reference image and if we have estimated the geometry relating the query and the reference image, we can estimate the position and camera orientation of the query image up to a scale ambiguity [25].

The dominant framework of using local invariant features for image matching is the approach of [15]. Their approach starts with (1.a) a detection and (1.b) a description of a set of local features in an image, followed by (2.) a matching of similar local structures in two images and (3.) rejection of incorrect relations between the images. The main advantage of the approach of Schmid and Mohr, and of using local invariant features in general, is that it can handle substantial viewpoint and photometric changes and that it tolerates substantial clutter and occlusions [26].

Consistent with the framework of Schmid and Mohr, the first step of our method is the detection and description of local invariant features in the query image snapped by the user. In the second step, the detected features are matched against the features detected beforehand in the reference panorama images in order to identify reference panoramas depicting the same scene. In the third step, we try to estimate geometric relations between the query image and the subset of reference panoramas depicting the same scene. In the fourth step, the hyperlinks defined on the reference panoramas by users or administrators are transferred onto the query image. In the last step, the hyperlinks are annotated on the query image and displayed on the camera phone’s (multi-)touch screen. An overview of our method together with the main computer vision techniques used and a visual example is shown in Fig. 2.

The next subsections give details about our method.

3.1 Local invariant features

We have chosen Maximally Stable Extremal Regions (MSER) [20] as the features used in our framework. These features represent connected image regions of similar light intensity with well-defined borders.


Fig. 2 Overview of the method and the main computer vision techniques used. The query image is in color, while the reference panorama is monochrome. In the bottom right image, annotated objects and detected features (shown as plus signs) are depicted on the reference panorama. The red plus signs indicate features that are included in a description of some object. The red polygon indicates the extent of the object within the reference panorama and the numbers next to the polygon are the object and hyperlink identifier, respectively. a MSER features [20] described by the SIFT descriptor [16], b high-dimensional feature matching [27], c estimation of geometric relations [25], d transfer of hyperlinks (color figure online)

Due to the abundance of such regions in man-made environments (e.g., letters, signs, banners), they are the most suitable type of features for applications targeted at urban settings. MSER features have also performed best in several performance evaluations of local invariant features [28,29]. The visual content of each detected MSER feature is characterized by computing a SIFT descriptor [16] for the respective elliptical image region. There is a consensus in the computer vision community [26,30] that the SIFT descriptor [16] is best at providing invariance to illumination and viewpoint changes, and tolerance to slight misregistrations between the feature and the respective visual structure, while being distinctive enough to determine likely matches in a large database of features.
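As an illustration of this feature choice, the snippet below detects MSER regions and describes them with SIFT using OpenCV. This is a generic sketch, not the authors’ pipeline: OpenCV’s MSER detector returns circular keypoints fitted to each region rather than the elliptical regions described above, and the file name is a hypothetical placeholder.

```python
import cv2

img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file name

mser = cv2.MSER_create()       # Maximally Stable Extremal Regions detector
keypoints = mser.detect(img)   # MSER is a Feature2D subclass: detect() returns
                               # circular keypoints fitted to the stable regions

sift = cv2.SIFT_create()
keypoints, descriptors = sift.compute(img, keypoints)  # 128-D SIFT descriptors
print(len(keypoints), descriptors.shape)
```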

3.2 Feature matching

The first step of our method has provided us with a large number of features detected in the reference panoramas and a much smaller, but still substantial, number of features detected in the query image. The goal of feature matching is to relate each query feature with none, one, or several features detected in the reference panoramas. Each match translates into a vote for a particular reference panorama, and might become a tentative correspondence between image regions in the query image and a reference panorama, respectively. For the voting to be successful, a sufficient number of votes should go to the matching reference images, while only a smaller number of votes may go to unrelated reference images. Similarly, the set of tentative correspondences should include a sufficient number and percentage of true correspondences for the successful estimation of the relation between the query image and the reference panorama using the hypothesize-and-test approach [31].

The elementary methods for matching local invariant features, such as threshold-based matching and nearest neighbor(s) matching, treat all features equally, while more sophisticated methods [16,19] take into account that some local invariant features are more distinctive than others. Another approach to matching is inspired by text retrieval methods and uses entropy weighting to explicitly account for the variable distinctiveness of local invariant features [17,24,32]. None of the traditional methods met our criteria: a good feature matching method should provide a sufficient number of votes for the matching reference images, and it should provide a set of tentative correspondences with a sufficient number and percentage of true correspondences. That is why we have chosen to use a novel high-dimensional feature matching method based on the concept of meaningful nearest neighbors [27].

Given a query feature, the concept of meaningful nearest neighbors divides the set of features detected in the reference panoramas into two disjoint sets: the set of meaningful nearest neighbors of the query feature, and the set of irrelevant match candidates. While the distribution of similarities between the query feature and its meaningful nearest neighbors is (in general) unknown, the authors of [27] have experimentally observed that the distribution of similarities between the query feature and the irrelevant match candidates can be well modeled by an exponential distribution, at least in the local neighborhood of the query feature.

The feature matching method based on the concept of meaningful nearest neighbors first estimates the rate parameter of the exponential distribution that models the distribution of similarities between the query feature and the irrelevant match candidates. The rate parameter is then used to calculate the likelihood that a tentative nearest neighbor is an outlier to the exponential distribution of the irrelevant match candidates and, therefore, a meaningful match. In addition, we assign to each meaningful nearest neighbor a weight which takes into account the similarity of the query feature to the neighbor and the likelihood of the neighbor being a true match.
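The following sketch illustrates the idea under loosely stated assumptions; the actual estimator and weighting scheme of [27] differ in detail, and the choice of which distances form the “irrelevant” tail is a crude stand-in here.

```python
import numpy as np

def meaningful_neighbors(dists, alpha=1e-3):
    # dists: distances from one query descriptor to its k approximate
    # nearest reference descriptors, sorted ascending.
    d = np.asarray(dists, dtype=float)
    tail = d[len(d) // 2:]                 # assume the far half is irrelevant
    rate = 1.0 / np.mean(tail - tail[0])   # crude exponential rate estimate
    # Probability that an irrelevant candidate would lie at least this close;
    # a very small value marks the neighbor as a meaningful outlier.
    pvals = np.clip(1.0 - np.exp(-rate * (d - tail[0])), 0.0, 1.0)
    meaningful = pvals < alpha
    weights = np.where(meaningful, 1.0 - pvals, 0.0)  # closer => larger weight
    return meaningful, weights
```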

The concept of meaningful nearest neighbors requires identification of the nearest neighbors of a given feature. Using exhaustive search to find nearest neighbors would be too slow (i.e., computationally demanding) for the “Hyperlinking Reality via Camera Phones” application to be viable. As stated in [16,33,34], no algorithms are known that can identify exact nearest neighbors of points in high-dimensional spaces that are any more efficient than exhaustive search. Recently, several approximate methods were introduced [27,35,36] that provide substantial speed-up while retaining adequate quality of the approximation. The evaluation framework of [35] has shown that the method based on sparse coding with an overcomplete basis set of [27] is the most suitable for our application, being 25% faster than the method based on the hierarchical k-means tree of [35], more than three times faster than the multiple randomized kd-trees of [36], and more than 100 times faster than exhaustive search. The evaluation was done using the LUIS-34 data set [27] with 3.8 M SIFT descriptor vectors. Due to the requirements of the concept of meaningful nearest neighbors, we have set the required quality of approximation at 95% for the first nearest neighbor and at 80% for the first 100 nearest neighbors.

Given an overcomplete basis set that is learned from descriptor vectors of features detected in the reference panoramas, the approximate nearest neighbor search method of [27] first sparsely projects 128-dimensional feature descriptor vectors onto the overcomplete basis set and then compares these sparse representations instead of the original descriptor vectors. The sparse representation is even higher in dimensionality than the original representation (several thousand dimensions), but is very sparse, i.e., only a few dozen vector elements are non-zero. Therefore, this method effectively translates the problem of the nearest neighbor search from the problem of efficient matrix-vector multiplication (i.e., exhaustive search) to the problem of efficient sparse matrix-sparse vector multiplication. With typical values of sparsity $s$ of $O(\log N)$ and an optimal cardinality of the overcomplete basis set of $s\sqrt{N/D}$, the computational complexity of sparse coding with an overcomplete basis set is only $O(\log N \sqrt{DN} + DL)$ instead of $O(DN)$ for exhaustive search, where $N$ is the number of data points, $D$ is the dimensionality of the vector space, and $L$ is the number of data points that are exhaustively compared with the query data point in the refining step.
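To make the mechanics concrete, a toy version of this scheme might look as follows. The random basis and all sizes are illustrative stand-ins (in [27] the basis is learned from reference features), and only candidate selection, not the exhaustive refining step, is shown.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
D, B, s = 128, 4096, 32               # descriptor dim, basis size, sparsity
basis = rng.standard_normal((B, D))   # illustrative random overcomplete basis

def sparse_code(x):
    proj = basis @ x                       # project onto the overcomplete basis
    keep = np.argsort(np.abs(proj))[-s:]   # keep the s strongest responses
    return sp.csr_matrix((proj[keep], (np.zeros(s, dtype=int), keep)),
                         shape=(1, B))

db = rng.standard_normal((1000, D))        # stand-in reference descriptors
db_codes = sp.vstack([sparse_code(x) for x in db])

q = sparse_code(rng.standard_normal(D))
scores = (db_codes @ q.T).toarray().ravel()  # sparse-times-sparse similarities
candidates = np.argsort(scores)[-100:]       # candidates to refine exhaustively
```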

3.3 Estimation of geometric relations

The meaningful nearest neighbors of the query features are first used as weighted votes for the respective reference panoramas. The reference panoramas with the most votes are considered as potentially matching, i.e., they depict the same scene as the query image. For each potentially matching reference panorama, we try to estimate its geometric relation to the query image using the meaningful nearest neighbors and the epipolar geometry constraint. The reference panoramas for which estimation of geometric relations is unsuccessful are rejected.

We know the exact position and camera orientation of the matching reference panoramas, and hyperlinks to interesting information have been annotated on them (see Sect. 4 for details about the data acquisition and annotation process). The information about the position and the camera orientation is used to triangulate the position and camera orientation of the query image [6], while the hyperlinks annotated on the matching reference panoramas are transferred to the query image.

3.3.1 Epipolar geometry estimation

The epipolar constraint states that if we know the epipolar geometry and the position of x in the first view, then we can limit the search for the position of x′ in the second view to a line l′. The epipolar constraint can be expressed in the form of a fundamental matrix F or an essential matrix E that can be estimated if a sufficient number of correspondences is known [25]. We will refer to a set of correspondences with just the sufficient number of correspondences as a minimum set. The estimation of the fundamental matrix requires a minimum set of seven correspondences, while the estimation of the essential matrix requires a minimum set of only five correspondences (using the so-called five-point algorithm [37]). For reasons that will become apparent in the next section, we prefer the algorithm with the smallest cardinality of the minimum set. Therefore, in our method we use an implementation of the five-point algorithm by [38]. In addition, the five-point algorithm works even if all the correspondences in the minimum set lie on a plane (e.g., a facade of a building). Such a configuration is very common in urban environments and it is not tolerated by other algorithms for epipolar geometry estimation.

The five-point algorithm requires that the internal parameters of the camera are known. In the case of camera phones, the most important internal parameter is the focal length. Due to the small form factor, most of the camera phones on the market today have a fixed focal length (i.e., no zoom) that can be acquired from the manufacturers, while camera phones with zoom store information about the focal length in the EXIF header of the image. We can assume standard values for the other internal parameters of the camera [25].
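For illustration, the snippet below assembles a camera matrix from an assumed focal length and standard internal parameters, and feeds tentative correspondences to OpenCV’s five-point RANSAC essential matrix solver. All numeric values and the point arrays are placeholders, and this is not the implementation of [38] used in the paper.

```python
import cv2
import numpy as np

f_px = 2800.0                        # assumed focal length in pixels
w, h = 2048, 1536                    # assumed image size
K = np.array([[f_px, 0.0, w / 2],    # principal point at the image center,
              [0.0, f_px, h / 2],    # zero skew, square pixels
              [0.0, 0.0, 1.0]])

# pts_query, pts_ref: Nx2 arrays of tentatively corresponding image points
pts_query = np.random.rand(50, 2) * (w, h)     # placeholder data
pts_ref = pts_query + np.random.randn(50, 2)   # placeholder data

E, inlier_mask = cv2.findEssentialMat(pts_query, pts_ref, K,
                                      method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
```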

3.3.2 In search of true correspondences

Existing matching algorithms cannot guarantee that all correspondences are true correspondences, i.e., that they are projections of the same structure in the 3D world, so we resort to a hypothesize-and-test approach [31] in order to find the hypothesis that is most consistent with the set of tentative correspondences. The first and still dominant model that follows the hypothesize-and-test approach is the RANdom SAmple Consensus (RANSAC) paradigm of [31]. In RANSAC, sampling of the space of possible hypotheses is done by repeated random sampling of the correspondences set, where each random sample is used for calculation of a hypothesis and the criterion for selection of the best hypothesis is the number of inliers, i.e., correspondences that are consistent with the hypothesis. All correspondences in the random sample must be true correspondences in order for the hypothesis calculated from the sample to be correct, and therefore it is preferable that the random sample is a minimum set, i.e., a set of correspondences with just the sufficient number of correspondences for calculation of the hypothesis.

RANSAC assesses a hypothesis by simply counting the number of inliers. A more principled way would be to calculate the posterior probability p(M_h|C) of a hypothesis M_h given a correspondences set C with n elements. The posterior probability p(M_h|C) cannot be measured directly, therefore Torr and Zisserman introduced MLESAC [39], in which hypotheses are scored using the likelihood p(C|M_h). Assuming a uniform prior p(M_h) and constant p(C), the likelihood p(C|M_h) is directly proportional to p(M_h|C).

Calculation of p(C|M_h) is problem dependent. MLESAC was developed for estimation of the fundamental matrix from feature correspondences and it is therefore applicable also to our problem of estimation of the essential matrix. In MLESAC, an assumption is made that the likelihood p(C|M_h) depends on the probability of the residual error r_hi of each correspondence given the hypothesis M_h and that the probability of each residual is independent, where the residual error is defined as the distance between the observed and the anticipated position of the feature in the image. According to MLESAC, the conditional probability of observing a residual r_hi given the hypothesis M_h can be expressed as a mixture of a Gaussian and a uniform model for the case of true and false correspondences, respectively, with a probability p(v_i) expressing the prior probability that the i-th correspondence is a true correspondence.

In MLESAC, the prior probability of a true correspondence p(v_i) is assumed uniform within the correspondences set (i.e., p(v_i) = p(v)) and is, in principle, assumed to be independent of the hypothesis. Torr and Zisserman also do not assume any prior knowledge of p(v) and they suggest estimating p(v) for each hypothesis separately.
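A compact sketch of MLESAC’s hypothesis score, with illustrative parameter values, might read:

```python
import numpy as np

def mlesac_score(residuals, sigma=1.0, v=0.5, search_width=100.0):
    # Mixture likelihood of MLESAC [39]: each residual is explained either
    # by a Gaussian (true correspondence, std sigma) or by a uniform density
    # over the search region (false correspondence), mixed with prior inlier
    # probability v. The best hypothesis maximizes the summed log-likelihood.
    r = np.asarray(residuals, dtype=float)
    gauss = np.exp(-r**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    uniform = 1.0 / search_width
    return np.sum(np.log(v * gauss + (1 - v) * uniform))
```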

3.3.3 Deriving the prior probability of a true correspondence

In [40], the authors argue that MLESAC’s probabilistic approach to random sampling and consensus could be improved if an estimate of the prior probability of a true correspondence p(v_i) was available. In the feature matching step of our method, we have first identified and then weighted the meaningful nearest neighbors. We have experimentally observed that the weight attributed to a meaningful nearest neighbor is well correlated with the probability that the query feature and the meaningful nearest neighbor form a true correspondence. Therefore, in our method, we use the weight attributed to the meaningful nearest neighbors to calculate the prior probability of a true correspondence p(v_i).


Due to the presence of repetitive visual structures, a reference feature can be matched to more than one query feature. In addition, to reduce the chance of missing a true correspondence, we allow matching of a query feature with more than one reference feature from a single image (i.e., soft matching). To express that, due to the uniqueness constraint [41], at most one correspondence per feature can be correct, the prior probability of the k-th correspondence of a query feature i with n_i potentially matching reference features with scores s_ik and validities v_ik (k = 1, ..., n_i) is calculated as in [40]

$$p(v_{ik} \mid s_{i1}, \ldots, s_{in_i}) = \frac{p(v_{ik} \mid s_{ik}, n_i) \prod_{j \neq k}^{n_i} p(\bar{v}_{ij} \mid s_{ij}, n_i)}{\sum_{l}^{n_i} \left[ p(v_{il} \mid s_{il}, n_i) \prod_{j \neq l}^{n_i} p(\bar{v}_{ij} \mid s_{ij}, n_i) \right] + \prod_{j}^{n_i} p(\bar{v}_{ij} \mid s_{ij}, n_i)} \tag{1}$$

where $p(\bar{v}_{ij} \mid s_{ij}, n_i) = 1 - p(v_{ij} \mid s_{ij}, n_i)$ is the probability that the j-th correspondence is incorrect. In this equation, the numerator gives the probability that the k-th correspondence of the query feature i is correct and all other correspondences of this query feature are incorrect. The denominator normalizes the numerator by the sum of the probabilities of all correspondences for the given query feature, plus the probability that none of them is correct.
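A direct transcription of Eq. (1) in Python, assuming the isolated match probabilities p(v_ik|s_ik, n_i) are already known, could look as follows:

```python
import numpy as np

def prior_true_correspondence(p):
    # p[k] = p(v_ik | s_ik, n_i): isolated probability that the k-th soft
    # match of query feature i is correct. Returns Eq. (1) for every k.
    p = np.asarray(p, dtype=float)
    q = 1.0 - p                        # probability each match is incorrect
    num = np.array([p[k] * np.prod(np.delete(q, k)) for k in range(len(p))])
    return num / (num.sum() + np.prod(q))   # "+ prod(q)": none is correct

# Example: two matches of a repetitive structure with isolated probabilities
# 0.6 and 0.5; the uniqueness constraint deflates both posteriors.
print(prior_true_correspondence([0.6, 0.5]))   # -> [0.4286, 0.2857]
```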

3.3.4 Sampling the space of possible hypotheses

The original RANSAC strategy of drawing minimum sets from the correspondences set uniformly at random is feasible only when the correspondences set includes more than 50% true correspondences [42]. If there are less than 50% true correspondences, the time needed (i.e., the number of hypotheses that must be tested) before a minimum set containing only true correspondences is drawn with sufficient probability (e.g., at least 95%) becomes too long and therefore unacceptable to the user.
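The standard RANSAC sample-count formula makes this concrete; the small calculation below (not from the paper) shows how quickly the required number of five-point minimum sets explodes as the fraction of true correspondences drops:

```python
import math

def ransac_trials(inlier_ratio, sample_size, confidence=0.95):
    # Number of random minimum sets needed so that, with the given
    # confidence, at least one set contains only true correspondences.
    return math.ceil(math.log(1.0 - confidence) /
                     math.log(1.0 - inlier_ratio ** sample_size))

print(ransac_trials(0.50, 5))   # 95 trials: feasible
print(ransac_trials(0.10, 5))   # ~300,000 trials: far too slow for a user
```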

In the Guided-MLESAC of [40], instead of filling the minimum set uniformly at random, the elements of the minimum set are chosen by a Monte-Carlo method according to p(v_i). While this enables identification of the true hypothesis even when the correspondences set includes only 30% true correspondences, it is still not powerful enough for our target application. Due to the presence of repetitive structures, the correspondences sets that we have observed in our application usually include only between 10 and 15% true correspondences. Historically, the random sampling strategy was chosen primarily because there was no reliable measure of the validity of a correspondence, and secondarily, to avoid interdependencies between correspondences included in the minimum set (e.g., all features lying on a single plane or in a small part of the image). In our method, the weight attributed to a meaningful nearest neighbor can be used as a measure of correspondence validity. Therefore, instead of a random sampling strategy, we employ a deterministic sampling strategy that constructs minimum sets by permuting the top l tentative correspondences with the highest weight. The number of hypotheses thus generated is $\binom{l}{l-r}$, where r is the cardinality of the minimum set (a minimal sketch of this enumeration follows the next paragraph).

While we have indeed observed interdependencies between correspondences included in the minimum set that would not be present in the case of a random sampling strategy, the interdependencies did not noticeably diminish the success of geometry estimation. In our opinion, any sampling strategy that includes information about the probability of a correspondence being a true correspondence, based on the visual similarity of the respective image regions, will sample some parts of the scene with higher probability than others due to the nature of camera motion. For example, if the camera moves in the direction of a facade A and parallel to a facade B, then the images of the features detected on facade A will undergo only a similarity transformation (i.e., only position and scale will change), while the images of the features detected on facade B will undergo a full perspective transformation that is much harder to model.
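A minimal sketch of this deterministic enumeration, with an illustrative value of l, might look as follows:

```python
from itertools import combinations

def deterministic_minimum_sets(correspondences, weights, l=12, r=5):
    # Rank tentative correspondences by their meaningful-nearest-neighbor
    # weight and enumerate every r-element subset of the top l, i.e.,
    # binomial(l, r) hypotheses in a fixed, deterministic order.
    ranked = sorted(zip(weights, range(len(correspondences))), reverse=True)
    top = [correspondences[i] for _, i in ranked[:l]]
    yield from combinations(top, r)  # feed each subset to the 5-point solver
```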

3.3.5 Structure estimation

The epipolar geometry constraint does not differentiate between physically possible configurations and configurations in which some features are behind the camera, thus violating the so-called cheirality constraint [25]. Tentative correspondences that are consistent with the hypothesized epipolar geometry but violate the cheirality constraint may impede the success of geometry estimation by giving support to incorrect hypotheses. In order to remove the influence of such correspondences, we first decompose the essential matrix into the four possible combinations of rotation and translation of the first camera relative to the second camera using the method described in [43]. For each of the four combinations, we estimate the position of the features in the 3D world by intersecting rays back-projected from the two cameras’ centers through the images of the features in both images, respectively, using the method described in [43]. The combination of camera rotation and translation with the most features in a physically viable configuration (i.e., in front of the camera) is chosen, and the correspondences that violate the cheirality constraint are removed from the set of true correspondences for the hypothesis being tested.
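For illustration, OpenCV bundles exactly this disambiguation. Continuing the earlier essential-matrix sketch (variables E, pts_query, pts_ref, K, inlier_mask), recoverPose decomposes E, triangulates the inlier correspondences, and keeps the rotation/translation pair that places the most points in front of both cameras:

```python
import cv2

# recoverPose tries the four R/t decompositions of E, triangulates the
# correspondences flagged by inlier_mask, and returns the combination with
# the most points in front of both cameras; pose_mask additionally zeroes
# out correspondences that fail the cheirality check.
n_in_front, R, t, pose_mask = cv2.recoverPose(E, pts_query, pts_ref, K,
                                              mask=inlier_mask)
```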

3.3.6 Detection of pure rotation

If the distance between the first and the second camera position is small compared to the distance to the scene, the calculation of the translational component of the camera motion is unreliable. Typically this happens when the user has shot the query image from (almost) the same position as the reference panorama. In order to detect such a situation, we check whether the rotation matrix R alone is sufficient to explain the relation between the positions of features in the first and second image, respectively. We estimate the rotation matrix R with the same hypothesize-and-test approach as used for the estimation of the epipolar geometry, but this time we use the Singular Value Decomposition (SVD) to compute the rotation matrix R from a minimum set of just three correspondences. The correspondences set from which the minimum set is drawn includes only correspondences that are consistent with the estimated epipolar geometry. If the rotation alone is sufficient to explain a large majority (75%) of the transformations of feature positions between the first and second image, we assume that the two images were shot from the same position.
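A sketch of the rotation estimation and the pure-rotation test, using the standard SVD-based (Kabsch) solution on bearing vectors, might read as follows; the tolerance values are illustrative, not the paper’s.

```python
import numpy as np

def rotation_from_bearings(b1, b2):
    # Least-squares rotation R with b2 ~ R @ b1 (Kabsch algorithm).
    # b1, b2: Nx3 unit bearing vectors (image points back-projected
    # with K^-1 and normalized), N >= 3.
    H = b1.T @ b2
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def fraction_explained_by_rotation(b1, b2, R, cos_tol=0.9999):
    # Fraction of bearings mapped onto their match by rotation alone;
    # a value above 0.75 is read as "shot from the same position".
    cosines = np.sum(b2 * (b1 @ R.T), axis=1)
    return np.mean(cosines > cos_tol)
```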

3.4 Transfer of hyperlinks

The previous step of our method has provided us with the geometry that relates the query image and the matching reference panoramas. In addition, the set of tentative correspondences was purged to include only true correspondences, i.e., correspondences that are consistent with the estimated geometric relations. This enables us to transfer information from the reference panoramas to the query image.

Among the information that we have annotated on the reference panoramas beforehand (see Sect. 4 for details) are hyperlinks to interesting information about buildings, logos, banners, monuments, and other objects depicted on the reference panoramas that we expect to also be present in the query image shot by the user. A hyperlink is defined by a polygon that indicates the position of an object of interest within the reference panorama. The features that are detected within the polygon therefore describe the visual appearance of the object and can serve as an indication of the object’s presence.

The transfer of a hyperlink commences by identifying true correspondences that include features of the object. If there is an insufficient number of such correspondences (fewer than four), we assume that the object is not present in the query image. Otherwise, we verify that the configuration of the positions of features in the reference panorama is compatible with the positions of the features in the query image. For that, we assume that the objects in question are planar (or close to planar), so that the compatibility of configurations can be verified using the homography constraint, as expressed by

x′ = Hx. (2)

We estimate the homography matrix H with the same hypothesize-and-test approach as used for the estimation of the epipolar geometry, but this time we use an algorithm that estimates the homography from a minimum set of just three correspondences and the known epipolar geometry [25,44]. In addition, we reject hypotheses for which the homography matrix H is (almost) singular [45] and hypotheses that do not preserve the polygon orientation. An (almost) singular homography matrix indicates projection of a plane in the first image to a line in the second image (local invariant features cannot be detected on a line), while a change of the polygon orientation indicates that either the hypothesis is wrong or the plane is transparent.
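For illustration, the snippet below substitutes OpenCV’s generic four-point RANSAC homography for the three-point solver of [25,44] and applies the two rejection tests just described; the point arrays and thresholds are hypothetical placeholders.

```python
import cv2
import numpy as np

# Placeholder correspondences inside the object's polygon.
obj_pts_ref = (np.random.rand(12, 2) * 512).astype(np.float32)
obj_pts_query = obj_pts_ref + 5.0   # placeholder data

H, mask = cv2.findHomography(obj_pts_ref, obj_pts_query, cv2.RANSAC, 3.0)

def homography_acceptable(H, polygon_ref):
    if H is None:
        return False
    H = H / H[2, 2]                    # fix the scale ambiguity first
    if abs(np.linalg.det(H)) < 1e-6:   # near-singular: plane maps to a line
        return False                   # (threshold illustrative)
    proj = cv2.perspectiveTransform(
        polygon_ref.reshape(-1, 1, 2).astype(np.float32), H)
    # Signed area flips sign when the polygon orientation is not preserved
    # (assuming the reference polygon is stored counter-clockwise).
    return cv2.contourArea(proj.reshape(-1, 2), oriented=True) > 0
```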

The transfer of hyperlinks is done for each matching reference panorama separately. If more than one hyperlink is detected for a particular object, we choose the hyperlink with the highest credibility and reject the others.

3.5 Visualization of results

The estimated homography matrix H is used to project the polygon that defines the hyperlink from the reference panorama to the query image (see Fig. 2 for an example). If the projected polygon is too big (e.g., in the case of a building) to be fully visible in the query image, it is cut accordingly, so that it does not extend past the borders of the query image.

Finally, we put an icon that indicates a hyperlink to the user in the center of the projected polygon. Once the user taps the icon on her camera phone’s (multi-)touch screen, an information pane is displayed with some interesting story about the object, as depicted in Fig. 1.
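A sketch of this visualization step, using per-vertex clamping as a crude stand-in for proper polygon clipping, could be:

```python
import cv2
import numpy as np

def place_hyperlink_icon(H, polygon_ref, img_w, img_h):
    # Project the hyperlink polygon into the query image with H, clamp each
    # vertex to the image borders, and anchor the icon at the centroid of
    # the clamped polygon.
    proj = cv2.perspectiveTransform(
        polygon_ref.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 2)
    clipped = np.column_stack([np.clip(proj[:, 0], 0, img_w - 1),
                               np.clip(proj[:, 1], 0, img_h - 1)])
    m = cv2.moments(clipped.astype(np.float32))
    icon_xy = (int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"]))
    return icon_xy, clipped
```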

In addition, since we know the geometry relating the query image and the nearby reference panoramas, and we know the exact location and orientation from where the reference panoramas were shot, we can triangulate the user’s position and orientation with an accuracy comparable to GPS [6] (see Fig. 1 for an example and Sect. 5 as well as [6] for an evaluation of the accuracy of image-based localization). Such an image-based localization system can augment GPS or, in some circumstances where GPS performs poorly, even replace it.

4 Reference data set

The novel user interface concept presented in Sect. 3 requires a data set of reference panoramas that were collected beforehand and are annotated with information that is of interest to the user. In this section we present the process and the tools that we have used to collect such a data set.

We have acquired 1284 reference images shot from 107 accurately measured camera positions using an Olympus E-1 digital camera and a 5-megapixel image resolution. Twelve images shot at a single position together make a panorama view covering 360° × 56° (see Fig. 3 for a visualization of camera positions on a map and an example of images that together make a panorama).¹ The data set was acquired in the historical city center of Graz, Austria, in October 2007.

¹ The data set is available online at http://vicos.fri.uni-lj.si/GUIS107/.


Fig. 3 The positions where the reference panoramas were shot and an example of images that together make a panorama

Reference images were collected in such a way that a large-scale acquisition would be feasible.

The cameras were positioned directly above distinctive objects on the ground (i.e., shafts) with known positions provided by the municipality from their geographic information system, with a positioning error of less than 0.5 m. Due to the abundance of distinctive objects on the ground, this only slightly influenced the selection of camera positions. Reference images were shot with a camera mounted on a tripod. The height of the camera from the ground was 1.5 m.

While there exist several simple and accurate techniques for acquisition of the geographical location of the camera (e.g., satellite positioning, classical surveying techniques, geographic information systems), acquisition of the absolute camera orientation is more challenging. Therefore, we acquired the absolute camera orientation by registering the images themselves using an automatic procedure followed by manual verification. The procedure for registration (estimation of absolute orientation) of reference images proceeded in three steps. In the first step, reference images shot at the same position were stitched into a panorama using the method of [22] (see Fig. 4 for an example). Features were detected and described in the original images and then put into a common coordinate system. The final registration was further optimized using bundle adjustment [46]. The result of panorama stitching was manually verified in order to assure that all panoramas were correctly stitched. In the second step, reference panoramas were matched against each other in order to estimate the epipolar geometry relating panoramas depicting the same scene. The success of epipolar geometry estimation was manually verified by checking whether the image region around the epipoles depicts the same scene in both panoramas. Finally, the absolute orientation of the panoramas was calculated either by aligning epipoles and known camera positions, or by propagating the absolute orientation from one panorama to another using the known epipolar constraints.

In order to streamline the process of annotating the reference panoramas with hyperlinks to interesting information about buildings, logos, banners, monuments, and other objects, we have developed a hyperlink annotation tool. The application enables the administrator (i) to annotate the image region that represents a building or some other object, (ii) to add new hyperlinks, and (iii) to attribute a hyperlink to a particular object.

Fig. 4 An example of stitched reference panoramas with annotated hyperlinks (red polygons) (color figure online)


Besides having an ID, each hyperlink has a name, an icon, a short description, and a URL for further information about the object.

Using the hyperlink annotation tool, we have annotated several hundred historic sights, shops, restaurants, cafés, ice cream parlors, bookshops, pharmacies, buildings, logos, banners, monuments, and other objects of interest to the user (see Fig. 4 for annotation samples).

5 Results

The performance of the system presented in this article was evaluated within the scope of a user acceptance study [7] conducted on a sample of 16 users in a real-world scenario. The study took place in the city of Graz, Austria, in June 2008. Each user was first given a brief explanation of the system, the setting, and what we expected her to do. After that, the user was free to test the system by herself, independently, within a predefined area. We expected the users to walk around and be curious about prominent objects in their surroundings, mostly buildings, but also shops, restaurants, cafés, ice cream parlors, bookshops, pharmacies, logos, banners, monuments, etc. The users shot 73 query images altogether (see Fig. 5 for examples of query images).²

In order to evaluate the performance of our system, we measured the geographic position of each query image and counted the number of hyperlinks a user would expect to find on a particular query image. The geographic position of the query images was measured using a high-resolution aerial image with a ground sampling distance of 10 cm. Due to the lack of distinctive markings on the ground, the accuracy of this method is only about 5 m, but this is still better than the accuracy of the GPS receiver [6] that is built into the camera phone used in the study. We counted the number of hyperlinks a user would expect to find on a particular query image based on the number of prominent objects in the image. Because the definition of a prominent object is rather vague and user-dependent, the number of hyperlinks counted in such a way provides only an approximate measure of the success of our implementation of the “Hyperlinking Reality via Camera Phones” concept. Each of the 73 query images included at least one hyperlink, but none more than six. The average number of hyperlinks was 2.4 with a standard deviation of 1.2.

Our system automatically annotated 51 out of 73 query images with at least one hyperlink (see Fig. 5 for examples of query images with annotated hyperlinks). The maximum number of hyperlinks per query image was six, and the number of hyperlinks per query image was never larger than the number of manually counted hyperlinks.

² The query images are also available online at http://vicos.fri.uni-lj.si/GUIS107/.

Table 1 Accuracy of image-based localization

∼10 m   ∼20 m   ∼30 m   ∼50 m   ∼100 m or more   Not positioned
38      9       5       0       7                14

Out of 73 query images, 52 images were positioned with accuracy comparable to GPS, while the remaining 21 images were positioned incorrectly or not at all

The average number of automatically annotated hyperlinks was 1.8 with a standard deviation of 1.1. We manually checked all the automatically annotated hyperlinks and did not find any incorrect annotations. There were 22 query images without any hyperlink annotated. Such an outcome would clearly be a disappointment to a user expecting access to information about objects in her surroundings. On close inspection, we found that the reason behind 14 of the 22 failures was unsuccessful localization (see below). Namely, the transfer of relevant information (i.e., hyperlinks) from the reference panoramas to the query image is the last step of our method and therefore it fails whenever the preceding steps fail.

The success of image-based localization was evaluated by comparing the calculated and measured geographic positions from where a query image was shot. The results, presented in Table 1, show that out of 73 query images, 52 images were positioned with accuracy comparable to GPS [6], with a median localization accuracy of 13.5 m, while the remaining 21 images were positioned incorrectly or not at all. These results further support the claim of [6] that image-based localization can augment GPS or, in some circumstances where GPS performs poorly, even replace it.

The user acceptance study [7] showed that users reacted positively to the application and were highly motivated to take advantage of the intuitive interface, with some important remarks regarding technical features (response time, reliability), information visualization, and future applications of the technology.

5.1 System setup

The system was implemented as a three-tier architecture (see Fig. 1). The client ran on an HTC Touch Cruise camera phone and was responsible for capturing an image, sending it to the server, and displaying the results to the user using the web browser. The task of the application server was to receive the image, write the focal length into its EXIF header, send it for processing to the computer vision server, and visualize the results in the form of a web page. The focal length of the HTC was fixed and provided by the phone’s manufacturer.


Fig. 5 Query images with annotated hyperlinks. For query images (a), (b), (d), (e), (g), and (h) our system has automatically annotated all the hyperlinks that a user would anticipate. In query image (c) only the building itself was hyperlinked, while the “Tschibo” logo and the monument of St. Mary were not hyperlinked. Please note that this image was shot against the sun, a practice not uncommon among our test users. Our method has failed to annotate query images (f), (i), and (j) with hyperlinks. Images (f) and (i) failed due to the dominance of repetitive structures, while the monument in image (j) has an insufficient number of features detected on it and is also not planar enough for our method to succeed

The computer vision server took as input an image with the focal length written in its EXIF header and returned the result in XML form.

The HTC Touch Cruise is a UMTS mobile phone with a Microsoft Windows Mobile operating system and a touch screen. In addition, it has a 3-megapixel camera of good quality with autofocus and a built-in GPS receiver. Because the phone supports high-speed downlink packet access (HSDPA), the transmission of results from the server to the mobile client took less than one second. Unfortunately, this phone does not support high-speed uplink packet access (HSUPA) and therefore the transmission of a 3-megapixel image with an average size of half a megabyte is the most time-consuming step in our system, taking approximately 25 s. With a broader adoption of HSUPA, the transmission of the query image to the server should take only a couple of seconds.

The application and computer vision servers ran on a machine with two quad-core Intel Xeon processors and 8 GB of memory. On average, a query image was processed in 19.6 s with a standard deviation of 8.6 s. Because all components of our recognition pipeline are highly parallelizable, the processing time could be substantially reduced with a larger number of processing cores.


6 Discussions

The system presented in this article is not a proper augmented reality system yet. A proper augmented reality system [14,47] should (1) combine real and virtual, (2) be interactive in real-time, and (3) be registered in 3D. While our system does combine real and virtual, and is registered in 3D, it is not interactive in real-time. In our system, processing of a typical image is done on a server computer and requires 19 s on average using two 4-core Intel processors, while proper interactivity would require at least five frames per second and processing done on the camera phone itself in order to avoid lags due to data communication latency.

Users of our system would benefit greatly if it were interactive in real-time. One of the concerns of the users participating in the user acceptance study [7] was that they had no feedback to inform them whether or not hyperlinks exist for a given object. If our system were capable of real-time interactivity, users could look around with the camera phone acting as a looking glass indicating the hyperlinks present in their environment. By pressing a button, the user could first freeze the current view and then explore the available hyperlinks. As was shown in [14, page 53], users prefer to freeze the current view in order to assume a more natural pose for interaction with the system. In addition, a description of a hyperlink could be shown if the user keeps the icon indicating the hyperlink in the center of the view for an extended time.

A natural way of speeding up our system and bringing it closer to interactive frame rates would be to couple it with tracking using a video stream (e.g., [48]), accelerometers, a compass, and GPS, while our system would provide for initialization, drift prevention using periodic re-initialization, and fine-grain registration. A major obstacle that should be addressed before coupling our system with tracking is that the typical video frame image quality and resolution of today’s camera phones is insufficient for good performance of our system.

Finally, we would like to note that the major advantage of our approach compared to other approaches to augmented reality is the simplicity of content creation, since there is no need for the content creator to have detailed knowledge of a 3D environment [14, page 59]. In our system, a hyperlink can be attached to an object by simply selecting the features detected on the object.

7 Conclusion

We have presented a novel user interface concept for camera phones based on state-of-the-art computer vision technology, called “Hyperlinking Reality via Camera Phones”.³

The presented concept provides a solution to one of the main challenges facing mobile user interfaces, that is, the problem of selection and visualization of actions that are relevant to the user in her current context. Instead of typing keywords on a small and inconvenient keypad of a mobile device, a user of our system just snaps a photo of her surroundings and objects in the image become hyperlinks to information.

Our method commences by matching the query image to the reference panoramas depicting the same scene that were collected and annotated with information beforehand. Once the query image is related to the reference panoramas, we transfer the relevant information from the reference panoramas to the query image. By visualizing the information on the query image and displaying it on the camera phone’s (multi-)touch screen, the query image augmented with hyperlinks allows the user intuitive access to information. In addition, we provide the user with information about her position and orientation, thus augmenting GPS.

The presented mobile user interface concept requires a data set of reference panoramas that are collected and annotated with information beforehand. The Graz Urban Image data set consists of 107 reference panoramas shot from accurately measured positions, while the camera orientations were acquired automatically using computer vision techniques followed by manual verification. On each reference panorama, a few dozen buildings, logos, banners, monuments, and other objects of interest to the user were annotated using the hyperlink annotation tool. In the near future, we envision collecting data sets of reference panoramas for many more cities in an industrial fashion, by augmenting the current process of mobile mapping imagery acquisition of digital cartography vendors, such as TeleAtlas. Alternatively, we could also use street-level imagery, such as that provided by Google [49] or Microsoft.

The performance of the system presented in this article was evaluated using 73 query images acquired within the scope of a study that validated the acceptance of the system by users [7]. The results of the evaluation show that our method successfully annotated 70% of the query images with at least one hyperlink, with two thirds of the failures being caused by unsuccessful localization. The image-based localization positioned 71% of the query images with accuracy comparable to GPS, i.e., with a median localization accuracy of 13.5 m, further supporting the claim of [6] that image-based localization can augment GPS or, in some circumstances where GPS performs poorly, even replace it.

The system presented in this article is not a proper augmented reality system yet, since it is not interactive in real time. In our system, a typical image is processed on a server computer, which takes 19 s on average using two 4-core Intel processors. But we envision this system evolving so that it becomes the looking glass through which users can access the stories behind the buildings, objects, and people around them.

³ A video presentation of the "Hyperlinking Reality via Camera Phones" concept is available at http://vicos.fri.uni-lj.si/HypR/.

Acknowledgments We would like to thank Jan-Olof Eklundh for comments that have greatly improved this manuscript, Roland Perko and Barry Ridge for many fruitful discussions, and Marko Mahnic and Patrick Luley for help with the implementation.

References

1. Kendall, P.: Worldwide cellular user forecasts, 2008–2013. Tech. rep., Strategy Analytics Inc. (2008)
2. Lindholm, C., Keinonen, T., Kiljander, H.: Mobile Usability: How Nokia Changed the Face of the Mobile Phone. McGraw-Hill, New York (2003)
3. Kähäri, M., Murphy, D.J.: MARA—Sensor based augmented reality system for mobile imaging. http://research.nokia.com/research/projects/mara/ (2006). Nokia Research Center
4. Greene, K.: Hyperlinking reality via phones. MIT Technology Review (11–12) (2006)
5. Chen, B.X.: If you're not seeing data, you're not seeing. http://www.wired.com/gadgetlab/2009/08/augmented-reality/
6. Steinhoff, U., Omercevic, D., Perko, R., Schiele, B., Leonardis, A.: How computer vision can help in outdoor positioning. In: European Conference on Ambient Intelligence, vol. 4794, pp. 124–141. Springer LNCS, Berlin (2007)
7. Höller, N., Geven, A., Tscheligi, M., Paletta, L., Amlacher, K., Luley, P., Omercevic, D.: Exploring the urban environment with a camera phone: lessons from a user study. In: Proceedings of the 11th International Conference on Human–Computer Interaction with Mobile Devices and Services (MobileHCI) (2009)
8. Mawston, N.: Enabling technologies: CMOS beats CCD in half-billion global camera phone market. Tech. rep., Strategy Analytics Inc. (2007)
9. Reynolds, F.: Camera phones: a snapshot of research and applications. Pervasive Comput. IEEE 7(2), 16–19 (2008)
10. Yeh, T., Tollmar, K., Darrell, T.: Searching the web with mobile images for location recognition. In: Proc. IEEE Computer Vision and Pattern Recognition (CVPR), pp. 76–81 (2004)
11. Wang, J., Zhai, S., Canny, J.: Camera phone based motion sensing: interaction techniques, applications and performance study. In: UIST '06: Proceedings of the 19th Annual ACM Symposium on User Interface Software and Technology, pp. 101–110 (2006)
12. Cuellar, G., Eckles, D., Spasojevic, M.: Photos for information: a field study of cameraphone computer vision interactions in tourism. In: CHI '08 Extended Abstracts on Human Factors in Computing Systems, pp. 3243–3248 (2008)
13. Takacs, G., Chandrasekhar, V., Gelfand, N., Xiong, Y., Chen, W.C., Bismpigiannis, T., Grzeszczuk, R., Pulli, K., Girod, B.: Outdoors augmented reality on mobile phone using loxel-based visual feature organization. In: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval (MIR '08), pp. 427–434 (2008)
14. Henrysson, A.: Bringing augmented reality to mobile phones. Ph.D. thesis, Linköping University (2007)
15. Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE PAMI 19(5), 530–535 (1997)
16. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
17. Nistér, D., Stewénius, H.: Scalable recognition with a vocabulary tree. In: CVPR, vol. 2, pp. 2161–2168 (2006)
18. Chum, O., Philbin, J., Sivic, J., Isard, M., Zisserman, A.: Total recall: automatic query expansion with a generative feature model for object retrieval. In: ICCV (2007)
19. Baumberg, A.: Reliable feature matching across widely separated views. In: CVPR, vol. 1, pp. 774–781 (2000)
20. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. Image Vis. Comput. 22(10), 761–767 (2004)
21. Tuytelaars, T., Van Gool, L.: Matching widely separated views based on affine invariant regions. IJCV 59(1), 61–85 (2004)
22. Brown, M., Lowe, D.G.: Automatic panoramic image stitching using invariant features. IJCV 74(1), 59–73 (2007)
23. Zhang, W., Košecká, J.: Image based localization in urban environments. In: International Symposium on 3D Data Processing, Visualization and Transmission, pp. 33–40 (2006)
24. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: ICCV, vol. 2, pp. 1470–1477 (2003)
25. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge. ISBN: 0521540518 (2004)
26. Tuytelaars, T.: A survey on local invariant features. Tutorial at ECCV 2006 (2006)
27. Omercevic, D., Drbohlav, O., Leonardis, A.: High-dimensional feature matching: employing the concept of meaningful nearest neighbors. In: ICCV (2007)
28. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. IJCV 65(1–2), 43–72 (2005)
29. Moreels, P., Perona, P.: Evaluation of features detectors and descriptors based on 3D objects. In: ICCV, vol. 1, pp. 800–807 (2005)
30. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE PAMI 27(10), 1615–1630 (2005)
31. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
32. Grauman, K., Darrell, T.: Approximate correspondences in high dimensions. In: NIPS 19, pp. 505–512 (2007)
33. Böhm, C., Berchtold, S., Keim, D.A.: Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases. ACM Comput. Surv. (CSUR) 33(3), 322–373 (2001)
34. Indyk, P.: Nearest neighbors in high-dimensional spaces. In: Goodman, J.E., O'Rourke, J. (eds.) Handbook of Discrete and Computational Geometry, 2nd edn., Chap. 39. CRC Press, Boca Raton (2004)
35. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP) (2009)
36. Silpa-Anan, C., Hartley, R.: Optimised KD-trees for fast image descriptor matching. In: CVPR (2008)
37. Nistér, D.: An efficient solution to the five-point relative pose problem. IEEE PAMI 26(6), 756–777 (2004)
38. Stewénius, H., Engels, C., Nistér, D.: Recent developments on direct relative orientation. ISPRS J. Photogramm. Remote Sens. 60, 284–294 (2006)
39. Torr, P.H.S., Zisserman, A.: MLESAC: a new robust estimator with application to estimating image geometry. Comput. Vis. Image Underst. 78(1), 138–156 (2000)
40. Tordoff, B.J., Murray, D.W.: Guided-MLESAC: faster image transform estimation by using matching priors. IEEE PAMI 27(10), 1523–1535 (2005)


41. Strecha, C., Tuytelaars, T., Van Gool, L.: Dense matching of multiple wide-baseline views. In: ICCV (2003)
42. Torr, P., Murray, D.: Outlier detection and motion segmentation. In: Proceedings of SPIE (1993)
43. Ma, Y., Soatto, S., Kosecka, J., Sastry, S.S.: An Invitation to 3-D Vision: From Images to Geometric Models. Springer, Berlin (2003)
44. Su, J., Chung, R., Jin, L.: Homography-based partitioning of curved surface for stereo correspondence establishment. Pattern Recognit. Lett. 28(12), 1459–1471 (2007)
45. Vincent, E., Laganiere, R.: Detecting planar homographies in an image pair. In: Proceedings of the 2nd International Symposium on Image and Signal Processing and Analysis (ISPA 2001), pp. 182–187 (2001)
46. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment—a modern synthesis. In: ICCV '99: Proceedings of the International Workshop on Vision Algorithms, pp. 298–372 (2000)
47. Azuma, R.T.: A survey of augmented reality. Presence Teleoper. Virtual Environ. 6(4), 355–385 (1997)
48. Davison, A.J.: Real-time simultaneous localisation and mapping with a single camera. In: ICCV (2003)
49. Vincent, L.: Taking online maps down to street level. Computer 40(12) (2007)

Author Biographies

Dušan Omercevic received his B.Sc. degree in computer science from the University of Ljubljana, Slovenia, in 2005, and now pursues a Ph.D. in computer vision. He is currently a researcher at Zemanta—Your Blogging Assistant, where he is responsible for the development of a real-time content recommendation service. Before joining Zemanta, he participated in the EU FP6 project MOBVIS and led several large software development projects. His research interests include smart mobile vision services, high-dimensional matching, and feature engineering.

Aleš Leonardis is a full professor and the head of the Visual Cognitive Systems Laboratory at the Faculty of Computer and Information Science, University of Ljubljana. He is also an adjunct professor at the Faculty of Computer Science, Graz University of Technology. From 1988 to 1991, he was a visiting researcher in the General Robotics and Active Sensory Perception Laboratory at the University of Pennsylvania. From 1995 to 1997, he was a postdoctoral associate at PRIP, Vienna University of Technology. He was also a visiting researcher and a visiting professor at the Swiss Federal Institute of Technology (ETH) in Zurich and at the Technische Fakultät der Friedrich-Alexander-Universität in Erlangen, respectively. His research interests include robust and adaptive methods for computer vision, object and scene recognition and categorization, statistical visual learning, 3D object modeling, and biologically motivated vision. He is an author or coauthor of more than 160 papers published in journals and conferences, and he coauthored the book Segmentation and Recovery of Superquadrics (Kluwer, 2000). He is an editorial board member of Pattern Recognition, an editor of the Springer book series Computational Imaging and Vision, and an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence. He has served on the program committees of major computer vision and pattern recognition conferences, and was a program co-chair of the European Conference on Computer Vision, ECCV 2006. He has received several awards. In 2002, he coauthored the paper Multiple Eigenspaces, which won the 29th Annual Pattern Recognition Society award. In 2004, he was awarded a prestigious national award for scientific achievements. He is a fellow of the IAPR and a member of the IEEE and the IEEE Computer Society.
