1 Large-scale Image Geo-Localization Using Dominant … · and Mubarak Shah, Fellow, IEEE Abstract ... and obtain a robust ﬁnal solution to the problem ... M. Shah is with the Center

1

Large-scale Image Geo-LocalizationUsing Dominant Sets

Eyasu Zemene*, Student Member, IEEE, Yonatan Tariku*, Student Member, IEEE,Haroon Idrees, member, IEEE, Andrea Prati, Senior member, IEEE, Marcello Pelillo, Fellow, IEEE,

and Mubarak Shah, Fellow, IEEE

Abstract—This paper presents a new approach for the challenging problem of geo-localization using image matching in a structureddatabase of city-wide reference images with known GPS coordinates. We cast the geo-localization as a clustering problem of localimage features. Akin to existing approaches to the problem, our framework builds on low-level features which allow local matchingbetween images. For each local feature in the query image, we find its approximate nearest neighbors in the reference set. Next, wecluster the features from reference images using Dominant Set clustering, which affords several advantages over existing approaches.First, it permits variable number of nodes in the cluster, which we use to dynamically select the number of nearest neighbors for eachquery feature based on its discrimination value. Second, this approach is several orders of magnitude faster than existing approaches.Thus, we obtain multiple clusters (different local maximizers) and obtain a robust final solution to the problem using multiple weaksolutions through constrained Dominant Set clustering on global image features, where we enforce the constraint that the query imagemust be included in the cluster. This second level of clustering also bypasses heuristic approaches to voting and selecting the referenceimage that matches to the query. We evaluate the proposed framework on an existing dataset of 102k street view images as well as anew larger dataset of 300k images, and show that it outperforms the state-of-the-art by 20% and 7%, respectively, on the two datasets.

Index Terms—Geo-localization, Dominant Set Clustering, Multiple Nearest Neighbor Feature Matching, Constrained Dominant Set

F

1 INTRODUCTION

IMAGE geo-localization, the problem of determining thelocation of an image using just the visual information,

is remarkably difficult. Nonetheless, images often containuseful visual and contextual informative cues which allowus to determine the location of an image with variableconfidence. The foremost of these cues are landmarks, ar-chitectural details, building textures and colors, in additionto road markings and surrounding vegetation.

Recently, the geo-localization through image-matchingapproach was proposed in [1], [2]. In [1], the authors findthe first nearest neighbor (NN) for each local feature inthe query image, prune outliers and use a heuristic vot-ing scheme for selecting the matched reference image. Thefollow-up work [2] relaxes the restriction of using only thefirst NN and proposed Generalized Minimum Clique Prob-lem (GMCP) formulation for solving this problem. How-ever, GMCP formulation can only handle a fixed numberof nearest neighbors for each query feature. The authorsused 5 NN, and found that increasing the number of NNdrops the performance. Additionally, the GMCP formula-

• * The first and second authors have equal contribution.• E. Zemene and M. Pelillo are with the department of Computer Sci-

ence, Ca’ Foscari University of Venice, Italy. E-mail: {eyasu.zemene,pelillo}@unive.it

• Y. Tariku Tesfaye is with the department of Design and Planning inComplex Environments of the University IUAV of Venice, Italy. E-mail:[email protected]

• A. Prati is with the department of Department of Engineering and Archi-tecture of the University of Parma, Italy. E-mail: [email protected]

• M. Shah is with the Center for Research in Computer Vision (CRCV),University of Central Florida, USA. E-mail: {haroon,shah}@eecs.ucf.edu

tion selects exactly one NN per query feature. This makesthe optimization sensitive to outliers, since it is possible thatnone of the 5 NN is correct. Once the best NN is selectedfor each query feature, a very simple voting scheme isused to select the best match. Effectively, each query featurevotes for a single reference image, from which the NN wasselected for that particular query feature. This often resultsin identical number of votes for several images from thereference set. Then, both [1], [2] proceed with randomlyselecting one reference image as the correct match to inferGPS location of the query image. Furthermore, the GMCPis a binary-variable NP-hard problem, and due to the highcomputational cost, only a single local minima solution iscomputed in [2].

In this paper, we propose an approach to image geo-localization by robustly finding a matching reference imageto a given query image. This is done by finding correspon-dences between local features of the query and referenceimages. We first introduce automatic NN selection into ourframework, by exploiting the discriminative power of eachNN feature and employing different number of NN foreach query feature. That is, if the distance between queryand reference NNs is similar, then we use several NNssince they are ambiguous, and the optimization is affordedwith more choices to select the correct match. On the otherhand, if a query feature has very few low-distance referenceNNs, then we use fewer NNs to save the computation cost.Thus, for some cases we use fewer NNs, while for otherswe use more requiring on the average approximately thesame amount of computation power, but improving theperformance, nonetheless. This also bypasses the manualtuning of the number of NNs to be considered, which can

arX

iv:1

702.

0123

8v3

[cs

.CV

] 1

4 Se

p 20

17

2

vary between datasets and is not straightforward.Our approach to image geo-localization is based on

Dominant Set clustering (DSC) - a well-known generaliza-tion of maximal clique problem to edge-weighted graphs-where the goal is to extract the most compact and coher-ent set. It’s intriguing connections to evolutionary gametheory allow us to use efficient game dynamics, such asreplicator dynamics and infection-immunization dynamics(InImDyn). InImDyn has been shown to have a lineartime/space complexity for solving standard quadratic pro-grams (StQPs), programs which deal with finding the ex-trema of a quadratic polynomial over the standard simplex[3], [4]. The proposed approach is on average 200 timesfaster and yields an improvement of 20% in the accuracy ofgeo-localization compared to [1], [2]. This is made possible,in addition to the dynamics, through the flexibility inherentin DSC, which unlike the GMCP formulation avoids anyhard constraints on memberships. This naturally handlesoutliers, since their membership score is lower comparedto inliers present in the cluster. Furthermore, our solutionuses a linear relaxation to the binary variables, which in theabsence of hard constraints is solved through an iterativealgorithm resulting in massive speed up.

Since the dynamics and linear relaxation of binary vari-ables allow our method to be extremely fast, we run itmultiple times to obtain several local maxima as solutions.Next, we use a query-based variation of DSC to combinethose solutions to obtain a final robust solution. The query-based DSC uses the soft-constraint that the query, or a groupof queries, must always become part of the cluster, thusensuring their membership in the solution. We use a fusionof several global features to compute the cost between queryand reference images selected from the previous step. Themembers of the cluster from the reference set are used tofind the geo-location of the query image. Note that, the GPSlocation of matching reference image is also used as a cost inaddition to visual features to ensure both visual similarityand geographical proximity.

GPS tagged reference image databases collected fromuser uploaded images on Flickr have been typically usedfor the geo-localization task. The query images in our ex-periments have been collected from Flickr, however, thereference images were collected from Google Street View.The data collected through Flickr and Google Street Viewdiffer in several important aspects: the images downloadedfrom Flickr are often redundant and repetitive, where im-ages of a particular building, landmark or street are cap-tured multiple times by different users. Typically, popularor tourist spots have relatively more images in testing andreference sets compared to less interesting parts of the urbanenvironment. An important constraint during evaluation isthat the distribution of testing images should be similar tothat of reference images. On the contrary, Google Street Viewreference data used in this paper contains only a singlesample of each location of the city. However, Street Viewdoes provide spherical 360◦ panoramic views, , approxi-mately 12 meters apart, of most streets and roads. Thus, theimages are uniformly distributed over different locations,independent of their popularity. The comprehensiveness ofthe data ensures that a correct match exists; nonetheless, atthe same time, the sparsity or uniform distribution of the

data makes geo-localization difficult, since every location iscaptured in only few of the reference images. The difficultyis compounded by the distorted, low-quality nature of theimages as well.

The main contributions of this paper are summarized asfollows:

• We present a robust and computationally efficientapproach for the problem of large-scale image geo-localization by locating images in a structureddatabase of city-wide reference images with knownGPS coordinates.

• We formulate geo-localization problem in terms of amore generalized form of dominant sets frameworkwhich incorporates weights from the nodes in addi-tion to edges.

• We take a two-step approach to solve the problem.The first step uses local features to find putative setof reference images (and is therefore faster), whereasthe second step uses global features and a con-strained variation of dominant sets to refine resultsfrom the first step, thereby, significantly boosting thegeo-localization performance.

• We have collected new and more challenging highresolution reference dataset (WorldCities dataset) of300K Google street view images.

The rest of the paper is structured as follows. We presentliterature relevant to our problem in Sec. 2, followed bytechnical details of the proposed approach in Sec. 3, whileconstrained dominant set based post processing step isdiscussed in Sec. 4. This is followed by dataset descriptionin section 5.1. Finally, we provide results of our extensiveevaluation in Sec. 5 and conclude in Sec. 6.

2 RELATED WORK

The computer vision literature on the problem of geo-localization can be divided into three categories dependingon the scale of the datasets used: landmarks or buildings[5], [6], [7], [8], city-scale including streetview data [9], andworldwide [10], [11], [12]. Landmark recognition is typicallyformulated as an image retrieval problem [5], [7], [8], [13],[14]. For geo-localization of landmarks and buildings, Cran-dall et al. [15] perform structural analysis in the form ofspatial distribution of millions of geo-tagged photos. Thisis used in conjunction with visual and meta data fromimages to geo-locate them. The datasets for this categorycontain many images near prominent landmarks or images.Therefore, in many works [5], [7], similar looking imagesbelonging to same landmarks are often grouped before geo-localization is undertaken.

For citywide geo-localization of query images, Zamirand Shah [1] performed matching using SIFT features,where each feature votes for a reference image. The votemap is then smoothed geo-spatially and the peak in thevote map is selected as the location of the query image.They also compute ’confidence of localization’ using theKurtosis measure as it quantifies the peakiness of vote mapdistribution. The extension of this work in [2] formulatesthe geo-localization as a clique-finding problem where the

3

authors relax the constraint of using only one nearest neigh-bor per query feature. The best match for each query featureis then solved using Generalized Minimum Clique Graphs,so that a simultaneous solution is obtained for all queryfeatures in contrast to their previous work [1]. In similarvein, Schindler et al. [16] used a dataset of 30,000 imagescorresponding to 20 kilometers of street-side data capturedthrough a vehicle using vocabulary tree. Sattler et al. [17]investigated ways to explicitly handle geometric burstsby analyzing the geometric relations between the differentdatabase images retrieved by a query. Arandjelovic et al.[18] developed a convolutional neural network architecturefor place recognition that aggregates mid-level (conv5) con-volutional features extracted from the entire image into acompact single vector representation amenable to efficientindexing. Torii et al. [19] exploited repetitive structure forvisual place recognition, by robustly detecting repeatedimage structures and a simple modification of weights inthe bag-of-visual-word model. Zeisl et al. [20] proposed avoting-based pose estimation strategy that exhibits linearcomplexity in the number of matches and thus facilitates toconsider much more matches.

For geo-localization at the global scale, Hays and Efros[10] were the first to extract coarse geographical locationof query images using Flickr collected across the world.Recently, Weyand et al. [12] pose the problem of geo-locatingimages in terms of classification by subdividing the surfaceof the earth into thousands of multi-scale geographic cells,and train a deep network using millions of geo-tagged im-ages. In the regions where the coverage of photos is dense,structure-from-motion reconstruction is used for matchingquery images [21], [22], [23], [24]. Since the difficulty ofthe problem increases as we move from landmarks tocity-scale and finally to worldwide, the performance alsodrops. There are many interesting variations to the geo-localization problem as well. Sequential information suchas chronological order of photos was used by [25] to geo-locate photos. Similarly, there are methods to find trajectoryof a moving camera by geo-locating video frames usingBayesian Smoothing [26] or geometric constraints [27]. Chenand Grauman [28] present Hidden Markov Model approachto match sets of images with sets in the database for locationestimation. Lin et al. [29] use aerial imagery in conjunctionwith ground images for geo-localization. Others [30], [31]approach the problem by matching ground images againsta database of aerial images. Jacob et al. [32] geo-localizea webcam by correlating its video-stream with satelliteweather maps over the same time period. Skyline2GPS[33] uses street view data and segments the skyline in animage captured by an upward-facing camera by matching itagainst a 3D model of the city.

Feature discriminativity has been explored by [34], whouse local density of descriptor space as a measure of descrip-tor distinctiveness, i.e. descriptors which are in a denselypopulated region of the descriptor space are deemed tobe less distinctive. Similarly, Bergamo et al. [35] leverageStructure from Motion to learn discriminative codebooksfor recognition of landmarks. In contrast, Cao and Snavely[36] build a graph over the image database, and learn localdiscriminative models over the graph which are used forranking database images according to the query. Similarly,

Gronat et al. [37] train discriminative classifier for eachlandmark and calibrate them afterwards using statisticalsignificance measures. Instead of exploiting discriminativity,some works use similarity of features to detect repetitivestructures to find locations of images. For instance, Torii etal. [38] consider a similar idea and find repetitive patternsamong features to place recognition. Similarly, Hao et al. [39]incorporate geometry between low-level features, termed’visual phrases’, to improve the performance on landmarkrecognition.

Our work is situated in the middle category, where givena database of images from a city or a group of cities, weaim to find the location where a test image was taken from.Unlike landmark recognition methods, the query image mayor may not contain landmarks or prominent buildings. Sim-ilarly, in contrast to methods employing reference imagesfrom around the globe, the street view data exclusivelycontains man-made structures and rarely natural scenes likemountains, waterfalls or beaches.

3 IMAGE MATCHING BASED GEO-LOCALIZATION

Fig. 1 depicts the overview of the proposed approach.Given a set of reference images, e.g., taken from GoogleStreet View, we extract local features (hereinafter referredas reference features) using SIFT from each reference image.We then organize them in a k-means tree [40].

First, for each local feature extracted from the queryimage (hereinafter referred as query feature), we dynamicallycollect nearest neighbors based on how distinctive the twonearest neighbors are relative to their corresponding queryfeature. Then, we remove query features, along with theircorresponding reference features, if the ratio of the distancebetween the first and the last nearest neighbor is too large(Sec. 3.1). If so, it means that the query feature is not veryinformative for geo-localization, and is not worth keepingfor further processing. In the next step, we formalize theproblem of finding matching reference features to queryfeatures as a DSC (Dominant Set Clustering) problem, thatis, selecting reference features which form a coherent andmost compact set (Sec. 3.2). Finally, we employ constraineddominant-set-based post-processing step to choose the bestmatching reference image and use the location of thestrongest match as an estimation of the location of the queryimage (Sec. 4).

3.1 Dynamic Nearest Neighbor Selection and QueryFeature Pruning

For each of N query features detected in the query image,we collect their corresponding nearest neighbors (NN). Letvim be the mth nearest neighbor of ith query feature qi, andm ∈ N : 1 ≤ m ≤| NN i | and i ∈ N : 1 ≤ i ≤ N , where| · | represents the set cardinality and NN i is the set ofNNs of the ith query feature. In this work, we propose adynamic NNs selection technique based on how distinctivetwo consecutive elements are in a ranked list of neighborsfor a given query feature, and employ different number ofnearest neighbors for each query feature.

As shown in Algorithm (1), we add the (m + 1)th NNof the ith query feature, vim+1 , if the ratio of the two

4

QueryImage

Bestmatchingreferenceimage

40.439-80.004

Approximatethelocationbasedonthelocationof

thebestmatchingreference

Localfeature(SIFT)extraction

DynamicNNselectionforeachqueryfeature

NNPruning DSCbasedfeaturematching

DS1 DS2

DS3

𝓠

DS3

DS2

DS1

Post-ProcessingusingconstrainedDominantsets

Fig. 1: Overview of the proposed method.

Algorithm 1 : Dynamic Nearest Neighbor Selection for ith

query feature (qi)

Input: the ith query feature (qi) and all its nearest neighborsextracted from K-means tree {vi1, vi2.......vi|NNi|}Output: Selected Nearest Neighbors for the ith query fea-ture (Vi)

1: procedure DYNAMIC NN SELECTION()2: Initialize Vi = {vi1} and m=13: while m < |NN i| − 1 do4: if ‖ξ(qi)−ξ(vim)‖

‖ξ(qi)−ξ(vim+1)‖> θ then

5: Vi = Vi ∪ vim+1 . If so, add vim+1 to oursolution

6: m = m+ 1 . Go to the next neighbor7: else8: Break . If not, stop adding and exit9: end if

10: end while11: end procedure

consecutive NN is greater than θ, otherwise we stop. Inother words, in the case of a query feature which is notvery discriminative, i.e., most of its NNs are very similarto each other, the algorithm continues adding NNs untila distinctive one is found. In this way, less discriminativequery features will use more NNs to account for theirambiguity, whereas more discriminative query features willbe compared with fewer NNs.Query Feature Pruning. For the geo-localization task, mostof the query features that are detected from moving objects(such as cars) or the ground plane, do not convey anyuseful information. If such features are coarsely identifiedand removed prior to performing the feature matching,that will reduce the clutter and computation cost for theremaining features. Towards this end, we use the follow-

ing pruning constraint which takes into consideration dis-tinctiveness of the first and the last NN. In particular, if∥∥ξ (qi)− ξ (vi1)∥∥ / ∥∥∥ξ (qi)− ξ (vi|NNi|

)∥∥∥ > β, where ξ(·)represents an operator which returns the local descriptorof the argument node, then qi is removed, otherwise it isretained. That is, if the first NN is similar to the last NN (lessthan β), then the corresponding query feature along with itsNNs are pruned since it is expected to be uninformative.

We empirically set both thresholds, θ and β, in Algo-rithm (1) and pruning step, respectively, to 0.7 and keepthem fixed for all tests.

3.2 Multiple Feature Matching Using Dominant Sets

3.2.1 The Dominant Set Framework

The dominant set framework is a pairwise clustering ap-proach [41], based on the notion of a dominant set, whichcan be seen as an edge-weighted generalization of a clique.The approach is a fast and efficient framework for pairwiseclustering and has been used to solve multiple problems,such as data association in tracking [42] as well as groupdetection [43].

In an attempt to formally capture this notion, we presentsome notations and definitions. The data to be clusteredis defined as a graph G = (V,E, ζ,$), where V,E, ζand $ denote the set of nodes (of cardinality n), edges,node weights and edge weights, respectively. For a non-empty subset S ⊆ V , l ∈ S, and k /∈ S, where l andk represent nodes in a graph G, we define φS(l, k) =B(l, k)− 1

|S|∑p∈S B(l, p), where B is the corresponding n×n

affinity matrix of graph G. This quantity measures therelative similarity between nodes k and l, with respect tothe average similarity between node l and its neighbors inS. Note that φS(l, k) can be either positive or negative. Next,to each vertex i ∈ S we assign a weight defined recursively

5

1

2 3

1

2 3

4 1

2 3

4

5

W{1,2,3,4} (4)>0

W{1,2,3,4,5} (5)<0

W{1,2,3,4} (4)>0

20

21

2220

21

22

30

41

3530

3541

21

111

1

(a) (b) (c)

Fig. 2: Dominant set example: (a) shows a compact set(dominant set), (b) node 4 is added which is highly similarto the set {1,2,3} forming a new compact set. (c) Node 5 isadded to the set which has very low similarity with the restof the nodes and this is reflected in the value W{1,2,3,4,5}(5).

as follows:

WS(l) =

{1, if |S| = 1,∑k∈S\{l} φS\{l}(k, l)WS\{l}(k), otherwise .

(1)where S \{l}means set S without the element l. Intuitively,WS(l) gives us a measure of the overall similarity betweenvertex l and the vertices of S\{l}, with respect to the overallsimilarity among the vertices in S \{l}. Therefore, a positiveWS(l) indicates that adding l into its neighbors in S willincrease the internal coherence of the set, whereas in thepresence of a negative value we expect the overall coherenceto be decreased. The total weight of S is computed asW (S) =

∑l∈SWS(l).

A non-empty subset of vertices S ⊆ V such thatW (T ) >0 for any non-empty T ⊆ S, is said to be a dominant set if:

WS(l) > 0,∀l ∈ S,WS∪{l}(l) < 0,∀l /∈ S, (2)

These conditions correspond to the two main propertiesof a cluster: the first regards internal homogeneity, whereasthe second regards external inhomogeneity.

Example: Let us consider a graph with nodes {1, 2, 3},which forms a coherent group (dominant set) with edgeweights 20, 21 and 22 as shown in Fig. 2(a). Now, let ustry to add a node {4} to the graph which is highly similarto the set {1,2,3} with edge weights of 30, 35 and 41 (seeFig. 2(b)). Here, we can see that adding node {4} to theset increases the overall similarity of the new set {1,2,3,4},that can be seen from the fact that the weight associatedto the node {4} with respect to the set {1,2,3,4} is positive,(W{1,2,3,4}(4) > 0). On the contrary, when adding node {5}which is less similar to the set {1,2,3,4} (edge weight of 1 -Fig. 2(c)) the overall similarity of the new set {1,2,3,4,5} de-creases, since we are adding to the set something less similarwith respect to the internal similarity. This is reflected by thefact that the weight associated to node {5} with respect tothe set {1,2,3,4,5} is less than zero (W{1,2,3,4,5}(5) < 0).

From the definition of a dominant set in (2) the set{1,2,3,4} (Fig. 2 (b)) forms a dominant set, as it satisfiesboth criteria (internal coherence and external incoherence).While the weight associated to the node out side of the set(dominant set) is less than zero, W{1,2,3,4,5}(5) < 0.

The main result presented in [41] provides a one-to-onerelation between dominant sets and strict local maximizersof the following standard quadratic optimization problem:

maximize f(x) = x>Bx,subject to x ∈ ∆,

(3)

where

∆ = {x ∈ Rn :∑i

xi = 1, and xi ≥ 0 for all i = 1 . . . n},

which is called standard simplex of Rn. Specifically, in [41] itis shown that if S is a dominant subset of vertices, then itsweighted characteristic vector, which lies in M,

xl =

{WS(l)W (S) , if l ∈ S,0, otherwise

is a strict local solution of (3). Conversely, under mild con-ditions, if x is a strict local solution of (3), then its supportS = σ(x) is a dominant set. Here, the support of a vectorx ∈ ∆ is the set of indices corresponding to its positivecomponents, that is σ(x) = {l ∈ V : xl > 0}. By virtue ofthis result, a dominant set can be found by localizing a localsolution of (3) and then picking up its support.

3.2.2 Similarity Function and Dynamics for Multiple FeatureMatching

In our framework, the set of nodes, V , represents all NNsfor each query feature which survives the pruning step. Theedge set is defined as E = {(vim, vjn) | i 6= j}, whichsignifies that all the nodes in G are connected as long astheir corresponding query features are not the same. Theedge weight, $ : E −→ R+ is defined as $(vim, v

jn) =

exp(−‖ψ(vim) − ψ(vjn)‖2/2γ2), where ψ(·) represents anoperator which returns the global descriptor of the parentimage of the argument node and γ is empirically set to 27 .The edge weight,$(vim, v

jn), represents a similarity between

nodes vim and vjn in terms of the global features of theirparent images. The node score, ζ : V −→ R+, is definedas ζ(vim) = exp(−‖ξ(qi) − ξ(vim)‖2/2γ2). The node scoreshows how similar the node vim is with its correspondingquery feature in terms of its local features.

Matching the query features to the reference featuresrequires identifying the correct NNs from the graph Gwhich maximize the weight, that is, selecting a node (NN)which forms a coherent (highly compact) set in terms ofboth global and local feature similarities.

Affinity matrix A represents the global similarity amongreference images, which is built using GPS locations as aglobal feature and a node score b which shows how similarthe reference image is with its corresponding query featurein terms of their local features. We formulate the followingoptimization problem, a more general form of the dominantset formulation:

maximize f(x) = x>Ax + b>x,subject to x ∈ ∆.

(4)

The affinity A and the score b are computed as follows:

A(vim, vjn) =

{$(vim, v

jn), for i 6= j,

0, otherwise,(5)

b(vim) = ζ(vim). (6)

6

General quadratic optimization problems, like (4), areknown to be NP-hard [44]. However, in relaxed form, stan-dard quadratic optimization problems can be solved usingmany algorithms which make full systematic use of dataconstellations. Off-the-shelf procedures find a local solutionof (4), by following the paths of feasible points provided bygame dynamics based on evolutionary game theory.

Interestingly, the general quadratic optimization prob-lem can be rewritten in the form of standard quadraticproblem. A principled way to do that is to follow the resultpresented in [45], which shows that maximizing the generalquadratic problem over the simplex can be homogenizedas follows. Maximizing x>Ax + b>x, subject to x ∈ ∆ isequivalent to maximizing x>Bx, subject to x ∈ ∆, where

B = A + eb> + be> and e =n∑i=1

ei = [1, 1, ....1], where

ei denotes the ith standard basis vector in Rn. This can beeasily proved by noting that the problem is solved in thesimplex.

x>Ax + 2 ∗ b>x = x>Ax + b>x + x>b,= x>Ax + x>eb>x + x>be>x,= x>(A + eb> + be>)x,= x>Bx.

As can be inferred from the formulation of our multipleNN feature matching problem, DSC can be essentially usedfor solving the optimization problem. Therefore, by solvingDSC for the graph G, the optimal solution that has themost agreement in terms of local and global features willbe found. Next, we introduce some useful dynamics fromevolutionary game theory which allow us to efficiently andeffectively match reference features to the query features.

The standard approach for finding the local maxima ofproblem (3), as used in [41], is to use replicator dynamics- a well-known family of algorithms from evolutionarygame theory inspired by Darwinian selection processes.Variety of StQP applications [46], [47], [41], [48], [49] haveproven the effectiveness of replicator dynamics. However,its computational complexity, which is O(N2) for problemsinvolving N variables, prevents it from being used in large-scale applications. Its exponential variant, though it reducesthe number of iterations needed for the algorithm to find asolution, suffers from a per-step quadratic complexity.

In this work, we adopt a new class of evolutionary gamedynamics called infection-immunization dynamics (InIm-Dyn), which have been shown to have a linear time/spacecomplexity for solving standard quadratic programs. In [4],it has been shown that InImDyn is orders of magnitudefaster but as accurate as the replicator dynamics.

The dynamics, inspired by infection and immunizationprocesses summarized in Algorithm (2), finds the optimalsolution by iteratively refining an initial distribution x ∈ ∆.The process allows for invasion of an infective distributiony ∈ ∆ that satisfies the inequality (y − x)>Bx > 0, andcombines linearly x and y (line 7 of Algorithm (2)), therebyengendering a new population z which is immune to yand guarantees a maximum increase in the expected payoff.A selective function, S(x), returns an infective strategy fordistribution x if it exists, or x otherwise (line 2 of Algorithm(2)). Selecting a strategy y which is infective for the current

Algorithm 2 FindEquilibrium(B,x,τ )Input: n× n payoff matrix B, initial distribution x ∈ ∆ andtolerance τ .Output: Fixed point x

1: while ε(x) > τ do2: y← S(x)3: δ ← 14: if (y − x)>B(y − x) < 0 then5: δ ← min

{(x−y)>Bx

(y−x)>B(y−x) , 1}

6: end if7: x← δ(y − x) + x8: end while9: return x

population x, the extent of the infection, δy(x), is thencomputed in lines 3 to 6 of Algorithm (2).

By reiterating this process of infection and immunizationthe dynamics drives the population to a state that cannot beinfected by any other strategy. If this is the case then x is anequilibrium or fixed point under the dynamics. The refine-ment loop of Algorithm (2) controls the number of iterationsallowing them to continue until x is with in the range of thetolerance τ and we emperically set τ to 10−7. The range ε(x)

is computed as ε(x) =∑i∈J

min{xi, (Bx)i − x>Bx

}2.

As it can be inferred from the above formulation of dom-inant sets, finding the strict local solution of (3) coincideswith finding the best matching reference features to thequery features. It is important to note that, since our solutionis a local solution and we do not know which local solutionincludes the best matching reference image, we determineseveral (typically three) locally optimal solutions. Unlikemost of the previous approaches, which perform a simplevoting scheme to select the best matching reference image,we introduce a post processing step utilizing a variantof dominant set called constrained dominant set, which isdiscussed briefly in the next section.

4 POST PROCESSING USING CONSTRAINED DOMI-NANT SETS

Up to now, we devised a method to collect matching ref-erence features corresponding to our query features. Thenext task is to select one reference image, based on featurematching between query and reference features, which bestmatches the query image. To do so, most of the previousmethods follow a simple voting scheme, that is, the matchedreference image with the highest vote is considered as thebest match.

This approach has two important shortcomings. First,if there are equal votes for two or more reference images(which happens quite often), it will randomly select one,which makes it prone to outliers. Second, a simple votingscheme does not consider the similarity between the queryimage and the candidate reference images at the globallevel, but simply accounts for local matching of features.Therefore, we deal with this issue by proposing a post pro-cessing step, which considers the comparison of the globalfeatures of the images and employs constrained dominant set,

7

a framework that generalizes the dominant sets formulationand its variant [41], [50].

4.1 Constrained Dominant Sets Framework

In our post processing step, the user-selected query andthe matched reference images are related using their globalfeatures and a unique local solution is then searched whichcontains the union of all the dominant sets containing thequery. As customary, the resulting solution will be globallyconsistent both with the reference images and the queryimage and due to the notion of centrality, each element inthe resulting solution will have a membership score, whichdepicts how similar a given reference image is with the restof the images in the cluster. So, by virtue of this property ofconstrained dominant sets, we will select the image withthe highest membership score as the final best matchingreference image and approximate the location of the querywith the GPS location of the best matched reference image.

In this section, we review the basic definitions andproperties of constrained dominant sets, as introduced in[51]. Given a user specified query, Q ⊆ V , we definethe graph G = (V , E, w), where the edges are defined asE = {(i, j)|i 6= j, {i, j} ∈ DSn ∨ (i ∈ Q ∨ j ∈ Q)}, i.e.,all the nodes are connected as long as they do not belongto different local maximizers, DSn, which represents thenth extracted dominant set. The set of nodes V representsall matched reference images (local maximizers) and queryimage, Q. The edge weight w : E → R+ is defined as:

w(i, j) =

ρ(i, j), for i 6= j, i ∈ Q ∨ j ∈ Q,Bn(i, j), for i 6= j, {i, j} ∈ DSn,0, otherwise

where ρ(i, j) is an operator which returns the global sim-ilarity of two given images i and j, that is, ρ(i, j) =exp(−‖ψ(i) − ψ(j)‖2/2γ2), Bn represents a sub-matrix ofB, which contains only members of DSn, normalized by itsmaximum value and finally Bn(i, j) returns the normalizedaffinity between the ith and jth members of DSn. Thegraph G can be represented by an n × n affinity matrixB = (w(i, j)), where n is the number of nodes in the graph.

Given a parameter α > 0, let us define the followingparameterized variant of program (3):

maximize fαQ(x) = x>(B− αIQ)x,subject to x ∈ ∆,

(7)

where IQ is the n × n diagonal matrix whose diagonalelements are set to 1 in correspondence to the verticescontained in V \ Q (a set V without the element Q) andto zero otherwise.

Let Q ⊆ V , with Q 6= ∅ and let α > λmax(BV \Q),where λmax(BV \Q) is the largest eigenvalue of the principalsubmatrix of B indexed by the elements of V \ Q. If x is alocal maximizer of fαQ in ∆, then σ(x) ∩Q 6= ∅. A completeproof can be found in [51].

The above result provides us with a simple technique todetermine dominant-set clusters containing user-specified

query vertices. Indeed, if Q is a vertex selected by the user,by setting

α > λmax(BV \Q), (8)

we are guaranteed that all local solutions of (7) will have asupport that necessarily contains elements of Q.

The performance of our post processing may vary dra-matically among queries and we do not know in advancewhich global feature plays a major role. Figs. 3 and 4 showillustrative cases of different global features. In the case ofFig. 3, HSV color histograms and GIST provide a wrongmatch (dark red node for HSV and dark green node forGIST), while both CNN-based global features (CNN6 andCNN7) matched it to the right reference image (yellownode). The second example, Fig. 4, shows us that only CNN6feature localized it to the right location while the others fail.Recently, fusing different retrieval methods has been shownto enhance the overall retrieval performance [52], [53].

Motivated by [52] we dynamically assign a weight, basedon the effectiveness of a feature, to all global features basedon the area under normalized score between the query andthe matched reference images. The area under the curveis inversely proportional to the effectiveness of a feature.More specifically, let us suppose to have G global featuresand the distance between the query and the jth matchedreference image (Nj), based on the ith global feature (Gi), iscomputed as: f ji = ψi(Q)− ψi(Nj), where ψi(·) representsan operator which returns the ith global descriptor of theargument node. Let the area under the normalized scoreof fi be Ai. The weight assigned for feature Gi is then

computed as wi = 1Ai/|G|∑j=1

1Aj

.

Figs. 3 and 4 show illustrative cases of some of theadvantages of having the post processing step. Both casesshow the disadvantage of localization following heuristicapproaches, as in [1], [2], to voting and selecting the ref-erence image that matches to the query. In each case, thematched reference image with the highest number of votes(shown under each image) is the first node of the firstextracted dominant set, but represents a wrong match. Bothcases (Figs. 3 and 4) also demonstrate that the KNN-basedmatching may lead to a wrong localization. For example,by choosing HSV histogram as a global feature, the KNNapproach chooses as best match the dark red node in Fig.3 and the yellow node in Fig. 4 (both with min-max valueto 1.00). Moreover, it is also evident that choosing the bestmatch using the first extracted local solution (i.e., the lightgreen node in Fig. 3 and blue node in Fig. 4), as done in [2],may lead to a wrong localization, since one cannot know inadvance which local solution plays a major role. In fact, inthe case of Fig. 3, the third extracted dominant set containsthe right matched reference image (yellow node), whereasin the case of Fig. 4 the best match is contained in the firstlocal solution (the light green node).

The similarity between query Q and the correspondingmatched reference images is computed using their globalfeatures such as HSV histogram, GIST [54] and CNN (CNN6and CNN7 are Convolutional Neural Network features ex-tracted from ReLU6 and FC7 layers of pre-trained network,respectively [55]). For the different advantages and disad-

8

6

1

3

1

4

3

0.00

0.52

0.52

0.60

0.32

0.54

0.60

0.13

1.00

0.34

0.27

0.00

0.03

0.00

0.00

0.14

0.81

0.40

0.30

1.00

0.98

1.00

1.00

0.19

Q

Fig. 3: Exemplar output of the dominant set framework:Left: query, Right: each row shows corresponding referenceimages from the first, second and third local solutions (dom-inant sets), respectively, from top to bottom. The numberunder each image shows the frequency of the matchedreference image, while those on the right side of each imageshow the min-max normalized scores of HSV, CNN6, CNN7and GIST global features, respectively. The filled colorscircles on the upper right corner of the images are used asreference IDs of the images.

vantages of the global features, we refer interested readersto [2].

Fig. 3 shows the top three extracted dominant sets withtheir corresponding frequency of the matched referenceimages (at the bottom of each image). Let Fi be the number(cardinality) of local features, which belongs to ith referenceimage from the extracted sets and the total number ofmatched reference images be N . We build an affinity matrix

B of size S =N∑i=1Fi + 1 (e.g., for the example in Fig. 3, the

size S is 19 ). Fig. 5 shows the reduced graph for the matchedreference images shown in Fig. 3. Fig. 5 upper left, showsthe part of the graph for the post processing. It shows therelation that the query has with matched reference images.The bottom left part of the figure shows how one can get thefull graph from the reduced graph. For the example in Fig.5, V = {Q, 1, 2, 2, 2, 3, 4, 4.......6}.

The advantages of using constrained dominant sets arenumerous. First, it provides a unique local (and henceglobal) solution whose support coincides with the unionof all dominant sets of G, which contains the query. Suchsolution contains all the local solutions which have strongrelation with the user-selected query. As it can be observedin Fig. 5 (bottom right), the Constrained Dominant Setwhich contains the query Q, CDS(Q), is the union of alloverlapping dominant sets (the query, one green, one darkred and three yellow nodes) containing the query as one ofthe members. If we assume to have no cyan link between the

8

2

52

5

55

22

0.350.560.780.54

0.321.000.600.54

0.490.570.850.65

0.000.000.000.00

0.670.731.000.70

0.390.700.740.54

0.410.330.640.49

1.000.420.451.00

0.440.470.650.55

Q

Fig. 4: Exemplar output of the dominant set framework:Left: query, Right: each row shows corresponding referenceimages of the first, second and third local solutions (dom-inant sets), respectively, from top to bottom. The numberunder each image shows the mode of the matched referenceimage, while those on the right of each image show themin-max normalized scores of HSV, CNN6, CNN7 and GISTglobal features, respectively.

Q Q

4

3 2

1

6 5 Q 1 2 3 4 5 6

Q

1 0.81 0 0.76 0 0 0 0

2 0.98 0.76 0 0 0 0 0

3 1.00 0 0 0 0.80 0 0

4 0.03 0 0 0.80 0 0 0

5 0.32 0 0 0 0 0 0.84

6 0 0 0 0 0 0.84 0

0 0.81 0.98 1.00 0.03 0.32 0

Q

1 2

2

2

2

2

2

Q

1

0.81

1

1

1

0.98

0.98

0.98

0.76

0.76

0.76 2

2

2

1

Q

Output of constrained dominant sets and

their membership scores:

Q -- 0.225

1 -- 0.101

2 -- 0.190

2 -- 0.190 Node 2 is chosen as final

2 -- 0.190 match

3 -- 0.104

KNN(Q) = 1 (node 1)

Fig. 5: Exemplar graph for post processing. Top left: reducedgraph for Fig. 3 which contains unique matched referenceimages. Bottom left: Part of the full graph which containsthe gray circled nodes of the reduced graph and the query.Top right: corresponding affinity of the reduced graph. Bot-tom right: The outputs of nearest neighbor approach, con-sider only the node’s pairwise similarity, (KNN(Q)=node 3which is the dark red node) and constrained dominant setsapproach (CDS(Q) = node 2 which is the yellow node).

green and yellow nodes, as long as there is a strong relationbetween the green node and the query, CDS(Q) will notbe affected. In addition, due to the noise, the strong affinitybetween the query and the green node may be reduced,while still keeping the strong relation with the cyan linkwhich, as a result, will preserve the final result. Second,

9

in addition to fusing all the local solutions leveraging thenotion of centrality, one of the interesting properties ofdominant set framework is that it assigns to each imagea score corresponding to how similar it is to the rest of theimages in the solution. Therefore, not only it helps selectingthe best local solution, but also choosing the best finalmatch from the chosen local solution. Third, an interestingproperty of constrained dominant sets approach is that it notonly considers the pairwise similarity of the query and thereference images, but also the similarity among the referenceimages. This property helps the algorithm avoid assignmentof wrong high pairwise affinities. As an example, with ref-erence to Fig. 5, if we consider the nodes pairwise affinities,the best solution will be the dark red node (score 1.00).However, using constrained dominant sets and consideringthe relation among the neighbors, the solution bounded bythe red dotted rectangle can be found, and by choosingthe node with the highest membership score, the final bestmatch is the yellow node which is more similar to the queryimage than the reference image depicted by the dark rednode (see Fig. 3).

5 EXPERIMENTAL RESULTS

5.1 Dataset Description

We evaluate the proposed algorithm using publicly avail-able reference data sets of over 102k Google street view im-ages [2] and a new dataset, WorldCities, of high resolution300k Google street view images collected for this work. The360 degrees view of each place mark is broken down intoone top and four side view images. The WorldCities datasetis publicly available. 1

The 102k Google street view images dataset covers 204Km of urban streets and the place marks are approximately12 m apart. It covers downtown and the neighboring areasof Orlando, FL; Pittsburgh, PA and partially Manhattan, NY.The WorldCities dataset is a new high resolution referencedataset of 300k street view images that covers 14 differentcities from different parts of the world: Europe (Amsterdam,Frankfurt, Rome, Milan and Paris), Australia (Sydney andMelbourne), USA (Vegas, Los Angeles, Phoenix, Houston,San Diego, Dallas, Chicago). Existence of similarity in build-ings around the world, which can be in terms of their walldesigns, edges, shapes, color etc, makes the dataset morechallenging than the other. Fig. 6 (top four rows) showssample reference images taken from different place marks.

For the test set, we use 644 and 500 GPS-tagged useruploaded images downloaded from Picasa, Flickr andPanoramio for the 102k Google street view images andWorldCities datasets, respectively. Fig. 6 (last two rows)shows sample test images. Throughout our experiment, weuse the all the reference images from around the world tofind the best match with the query image, not just with theground truth city only.

1. http://www.cs.ucf.edu/∼haroon/UCF-Google-Streetview-II-Data/UCF-Google-Streetview-II-Data.zip

5.2 Quantitative Comparison With Other Methods

5.2.1 Performance on the 102k Google street view imagesDatasetThe proposed approach has been then compared with theresults obtained by state-of-the-art methods. In Fig. 7, thehorizontal axes shows the error threshold in meters and thevertical axes shows the percentage of the test set localizedwithin a particular error threshold. Since the scope of thiswork is an accurate image localization at a city-scale level,test set images localized above 300 meter are considered afailure.

The black (-*-) curve shows localization result of theapproach proposed in [16] which uses vocabulary tree tolocalize images. The red (-o-) curve depicts the results of[1] where they only consider the first NN for each queryfeature as best matches which makes the approach verysensitive to the query features they select. Moreover, theirapproach suffers from lacking global feature information.The green (-o-) curve illustrates the localization resultsof [2] which uses generalized maximum clique problem(GMCP) to solve feature matching problem and followsvoting scheme to select the best matching reference image.The black (-o- and −♦−) curves show localization resultsof MAC and RMAC, (regional) maximum activation ofconvolutions ([56], [57]). These approaches build compactfeature vectors that encode several image regions withoutthe need to feed multiple inputs to the network. The cyan(-o-) curve represents localization result of NetVLAD [18]which aggregates mid-level (conv5) convolutional featuresextracted from the entire image into a compact single vectorrepresentation amenable to effici,ent indexing. The cyan(−♦−) curve depicts localization result of NetVLAD butfinetuned on our dataset. The blue (−♦−) curve showlocalizaton result of approach proposed in [17] which ex-ploits geometric relations between different database imagesretrieved by a query to handle geometric burstness. Theblue (-o-) curve shows results from our baseline approach,that is, we use voting scheme to select best match referenceimage and estimate the location of the query image. We areable to make a 10% improvement w.r.t the other methodswith only our baseline approach (without post processing).The magenta (-o-) curve illustrates geo-localization results ofour proposed approach using dominant set clustering basedfeature matching and constrained dominant set clusteringbased post processing. As it can be seen, our approachshows about 20% improvement over the state-of-the-arttechniques.

Computational Time. Fig. 8, on the vertical axis, showsthe ratio between GMCP (numerator) and our approach(denominator) in terms of CPU time taken for each queryimages to be localized. As it is evident from the plot, thisratio can range from 200 (our approach 200x faster thanGMCP) to a maximum of even 750x faster.

5.2.2 Performance on the WorldCities DatasetWe have also compared the performance of different algo-rithms on the new dataset of 300k Google street view imagescreated by us. Similarly to the previous tests, Fig. 9 reportsthe percentage of the test set localized within a particularerror threshold. Since the new dataset is relatively more

http://www.cs.ucf.edu/~haroon/UCF-Google-Streetview-II-Data/UCF-Google-Streetview-II-Data.zip

http://www.cs.ucf.edu/~haroon/UCF-Google-Streetview-II-Data/UCF-Google-Streetview-II-Data.zip

10

Fig. 6: The top four rows are sample street view images from eight different places of WorldCities dataset. The bottom tworows are sample user uploaded images from the test set.

60 100 140 180 220 260 300

Error Threshold(m)

10

20

30

40

50

60

70

80

% o

f tes

t set

loca

lized

with

in e

rror

thre

shol

d

DSC with Post-processingDSC w/o post-processingGMCP(2014)Fine-tuned NetVLAD (2016)Zamir and Shah (2010)Sattler et al.(2016)NetVLAD(2016)Schindler et al.(2007)RMAC (2016)MAC (2016)

Fig. 7: Comparison of our baseline (without post processing)and final method, on overall geo-localization results, withstate-of-the-art approaches on the first dataset (102K Googlestreet view images).

challenging, the overall performance achieved by all themethods is lower compared to 102k image dataset.

From bottom to top of the graph in Fig. 9 we presentthe results of [56], [57] black (−♦− and -o-), [17] blue(−♦−), [1] red (-o-), [18] cyan (-o-), fine tuned [18] cyan(−♦−), [2] green (-o-), our baseline approach without postprocessing blue (-o-) and our final approach with postprocessing magenta (-o-) . The improvements obtained withour method are lower than in the other dataset, but stillnoticeable (around 2% for the baseline and 7% for the finalapproach).

Some qualitative results for Pittsburgh, PA are presentedin Fig. 10.

Queries0 100 200 300 400 500 600

CP

U ti

me

ratio

100

200

300

400

500

600

700

800

Fig. 8: The ratio of CPU time taken between GMCP basedgeo-localization [2] and our approach, computed as CPUtime for GMCP/CPU time for DSC.

5.3 Analysis5.3.1 Outlier HandlingIn order to show that our dominant set-based feature match-ing technique is robust in handling outliers, we conduct anexperiment by fixing the number of NNs (disabling the dy-namic selection of NNs) to different numbers. It is obviousthat the higher the number of NNs are considered for eachquery feature, the higher will be the number of outlier NNsin the input graph, besides the increased computational costand an elevated chance of query features whose NNs do notcontain any inliers surviving the pruning stage.

Fig. 11 shows the results of geo-localization obtainedby using GMCP and dominant set based feature matchingon 102K Google street view images [2]. The graph shows

11

60 100 140 180 220 260 300

Error Threshold(m)

10

15

20

25

30

35

40

45

50

55

60

% o

f tes

t set

loca

lized

with

in e

rror

thre

shol

dDSC W post-processingDSC W/o post-processingGMCP (2014)Finetuned NetVLAD (2016)Zamir and Shah (2010)Sattler et al.(2016)NetVLAD (2016)RMAC (2016)MAC(2016)

Fig. 9: Comparison of overall geo-localization results usingDSC with and without post processing and state-of-the-artapproaches on the WorldCities dataset.

the percentage of the test set localized within the distanceof 30 meters as a function of number of NNs. The bluecurve shows the results using dominant sets: it is evidentthat when the number of NNs increases, the performanceimproves despite the fact that more outliers are introducedin the input graph. This is mainly because our frameworktakes advantage of the few inliers that are added along withmany outliers. The red curve shows the results of GMCPbased localization and as the number of NNs increase theresults begin to drop. This is mainly due to the fact that theirapproach imposes hard constraint that at least one matchingreference feature should be selected for each query featurewhether or not the matching feature is correct.

5.3.2 Effectiveness of the Proposed Post ProcessingIn order to show the effectiveness of the post processingstep, we perform an experiment comparing our constraineddominant set based post processing with a simple votingscheme to select the best matching reference image. The GPSinformation of the best matching reference image is usedto estimate the location of the query image. In Fig. 13, thevertical axis shows the percentage of the test set localizedwithin a particular error threshold shown in the horizontalaxis (in meters). The blue and magenta curves depict geo-localization results of our approach using a simple votingscheme and constrained dominant sets, respectively. Thegreen curve shows the results from GMCP based geo-localization. As it is clear from the results, our approachwith post processing exhibits superior performance com-pared to both GMCP and our baseline approach.

Since our post-processing algorithm can be easilyplugged in to an existing retrieval methods, we performanother experiment to determine how much improvementwe can achieve by our post processing. We use [19], [56], [57]methods to obtain candidate reference images and employas an edge weight the similarity score generated by thecorresponding approaches. Table 1 reports, for each dataset,the first row shows rank-1 result obtained from the existingalgorithms while the second row (w post) shows rank-1

result obtained after adding the proposed post-processingstep on top of the retrieved images. For each query, weuse the first 20 retrieved reference images. As the resultsdemonstrate, Table 1, we are able to make up to 7% and4% improvement on 102k Google street view images andWorldCities datasets, respectively. We ought to note that,the total additional time required to perform the above postprocessing, for each approach, is less than 0.003 seconds onaverage.

NetVLAD NetVLAD* RMAC MAC

Dts 1 Rank1 49.2 56.00 35.16 29.00w post 51.60 58.05 40.18 36.30

Dts 2 Rank1 50.00 50.61 29.96 22.87w post 53.04 52.23 33.16 26.31

TABLE 1: Results of the experiment, done on the 102kGoogle street view images (Dts1) and WorldCities (Dts2)datasets, to see the impact of the post-processing step whenthe candidates of reference images are obtained by otherimage retrieval algorithms

The NetVLAD results are obtained from the featuresgenerated using the best trained model downloaded fromthe authors project page [19]. It’s fine-tuned version(NetVLAD*) is obtained from the model we fine-tunedusing images within 24m range as a positive set and imageswith GPS locations greater than 300m as a negative set.

The MAC and RMAC results are obtained using MACand RMAC representations extracted from fine-tuned VGGnetworks downloaded from the authors webpage [56], [57].

5.3.3 Assessment of Global Features Used in Post Pro-cessing StepThe input graph for our post processing step utilizes theglobal similarity between the query and the matched refer-ence images. Wide variety of global features can be used forthe proposed technique. In our experiments, the similaritybetween query and the corresponding matched referenceimages is computed between their global features, usingHSV, GIST, CNN6, CNN7 and fine-tuned NetVLAD. Theperformance of the proposed post processing techniquehighly depends on the discriminative ability of the globalfeatures used to built the input graph.

Depending on how informative the feature is, we dy-namically assign a weight for each global feature based onthe area under the normalized score between the query andthe matched reference images. To show the effectivenessof this approach, we perform an experiment to find thelocation of our test set images using both individual andcombined global features. Fig. 12 shows the results attainedby using fine-tuned NetVLAD, CNN7, CNN6, GIST, HSVand by combining them together. The combination of allthe global features outperforms the individual feature per-formance, demonstrating the benefits of fusing the globalfeatures based on their discriminative abilities for eachquery.

6 CONCLUSION AND FUTURE WORK

In this paper, we proposed a novel framework for city-scaleimage geo-localization. Specifically, we introduced domi-nant set clustering-based multiple NN feature matching

12

Fig. 10: Sample qualitative results taken from Pittsburgh area. The green ones are the ground truth while yellow locationsindicate our localization results.

Number of Nearest Neighbors1 3 5 8 11 15

% o

f tes

t set

loca

lized

with

in 3

0 m

eter

18

20

22

24

26

28

30

32

34

GMCPDominant Sets W/o Post processing and Dynamic NN selection

Fig. 11: Geo-localization results using different number ofNN

approach. Both global and local features are used in ourmatching step in order to improve the matching accuracy. Inthe experiments, carried out on two large city-scale datasets,we demonstrated the effectiveness of post processing em-ploying the novel constrained dominant set over a simplevoting scheme. Furthermore, we showed that our proposedapproach is 200 times, on average, faster than GMCP-basedapproach [2]. Finally, the newly-created dataset (WorldCi-ties) containing more than 300k Google Street View imagesused in our experiments will be made available to the publicfor research purposes.

As a natural future direction of research, we can extendthe results of this work for estimating the geo-spatial tra-jectory of a video in a city-scale urban environment from amoving camera with unknown intrinsic camera parameters.

60 100 140 180 220 260 300

Error Threshold(m)

35

40

45

50

55

60

65

70

75

80

% o

f tes

t set

loca

lized

with

in e

rror

thre

shol

d

CombinedHSVGISTCNN6CNN7fine-tuned NetVLAD

Fig. 12: Comparison of geo-localization results using differ-ent global features for our post processing step.

REFERENCES

[1] A. R. Zamir and M. Shah, “Accurate image localization basedon google maps street view,” in European Conference on ComputerVision. Springer, 2010, pp. 255–268.

[2] ——, “Image geo-localization based on multiple nearest neighborfeature matching using generalized graphs,” IEEE Trans. PatternAnal. Mach. Intell., vol. 36, no. 8, pp. 1546–1558, 2014.

[3] S. Rota Bulo and I. M. Bomze, “Infection and immunization: Anew class of evolutionary game dynamics,” Games and EconomicBehavior, vol. 71, no. 1, pp. 193–211, 2011.

[4] S. R. Bulo, M. Pelillo, and I. M. Bomze, “Graph-based quadraticoptimization: A fast evolutionary approach,” Computer Vision andImage Understanding, vol. 115, no. 7, pp. 984–995, 2011.

[5] Y. Avrithis, Y. Kalantidis, G. Tolias, and E. Spyrou, “Retrievinglandmark and non-landmark images from community photo col-

13

60 100 140 180 220 260 300

Error Threshold(m)

40

45

50

55

60

65

70

75

80

% o

f tes

t set

loca

lized

with

in e

rror

thre

shol

dGMCPDSC w/o post-processingDSC with Post-processing

Fig. 13: The effectiveness of constrained dominant set basedpost processing step over simple voting scheme.

lections,” in Proceedings of the 18th ACM international conference onMultimedia. ACM, 2010, pp. 153–162.

[6] D. M. Chen, G. Baatz, K. Koser, S. S. Tsai, R. Vedantham,T. Pylvanainen, K. Roimela, X. Chen, J. Bach, M. Pollefeys et al.,“City-scale landmark identification on mobile devices,” in Com-puter Vision and Pattern Recognition (CVPR), 2011 IEEE Conferenceon. IEEE, 2011, pp. 737–744.

[7] T. Quack, B. Leibe, and L. Van Gool, “World-scale mining of objectsand events from community photo collections,” in Proceedings ofthe 2008 international conference on Content-based image and videoretrieval. ACM, 2008, pp. 47–56.

[8] Y.-T. Zheng, M. Zhao, Y. Song, H. Adam, U. Buddemeier, A. Bis-sacco, F. Brucher, T.-S. Chua, and H. Neven, “Tour the world:building a web-scale landmark recognition engine,” in Computervision and pattern recognition, 2009. CVPR 2009. IEEE conference on.IEEE, 2009, pp. 1085–1092.

[9] H. Jin Kim, E. Dunn, and J.-M. Frahm, “Predicting good featuresfor image geo-localization using per-bundle vlad,” in Proceedingsof the IEEE International Conference on Computer Vision, 2015, pp.1170–1178.

[10] J. Hays and A. A. Efros, “Im2gps: estimating geographic in-formation from a single image,” in Computer Vision and PatternRecognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp.1–8.

[11] ——, “Large-scale image geolocalization,” in Multimodal LocationEstimation of Videos and Images. Springer, 2015, pp. 41–62.

[12] T. Weyand, I. Kostrikov, and J. Philbin, “Planet-photo ge-olocation with convolutional neural networks,” arXiv preprintarXiv:1602.05314, 2016.

[13] S. Gammeter, L. Bossard, T. Quack, and L. Van Gool, “I knowwhat you did last summer: object-level auto-annotation of holidaysnaps,” in 2009 IEEE 12th International Conference on ComputerVision. IEEE, 2009, pp. 614–621.

[14] E. Johns and G.-Z. Yang, “From images to scenes: Compressing animage cluster into a single scene model for place recognition,” in2011 International Conference on Computer Vision. IEEE, 2011, pp.874–881.

[15] D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg,“Mapping the world’s photos,” in Proceedings of the 18th interna-tional conference on World wide web. ACM, 2009, pp. 761–770.

[16] G. Schindler, M. A. Brown, and R. Szeliski, “City-scale locationrecognition,” in IEEE Computer Society Conference on ComputerVision and Pattern Recognition (CVPR), Minneapolis, Minnesota, 2007.

[17] T. Sattler, M. Havlena, K. Schindler, and M. Pollefeys, “Large-scalelocation recognition and the geometric burstiness problem,” in2016 IEEE Conference on Computer Vision and Pattern Recognition,CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 1582–1590.

[18] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic,“Netvlad: CNN architecture for weakly supervised place recog-

nition,” in 2016 IEEE Conference on Computer Vision and PatternRecognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016,pp. 5297–5307.

[19] A. Torii, J. Sivic, M. Okutomi, and T. Pajdla, “Visual place recog-nition with repetitive structures,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 37, no. 11, pp. 2346–2359, 2015.

[20] B. Zeisl, T. Sattler, and M. Pollefeys, “Camera pose voting forlarge-scale image-based localization,” in 2015 IEEE InternationalConference on Computer Vision, ICCV 2015, Santiago, Chile, December7-13, 2015, 2015, pp. 2704–2712.

[21] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski,“Building rome in a day,” in 2009 IEEE 12th international conferenceon computer vision. IEEE, 2009, pp. 72–79.

[22] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua, “Worldwide poseestimation using 3d point clouds,” in European Conference on Com-puter Vision. Springer, 2012, pp. 15–29.

[23] T. Sattler, B. Leibe, and L. Kobbelt, “Fast image-based localizationusing direct 2d-to-3d matching,” in 2011 International Conference onComputer Vision. IEEE, 2011, pp. 667–674.

[24] ——, “Improving image-based localization by active corre-spondence search,” in European Conference on Computer Vision.Springer, 2012, pp. 752–765.

[25] E. Kalogerakis, O. Vesselova, J. Hays, A. A. Efros, and A. Hertz-mann, “Image sequence geolocation with human travel priors,” in2009 IEEE 12th International Conference on Computer Vision. IEEE,2009, pp. 253–260.

[26] G. Vaca-Castano, A. R. Zamir, and M. Shah, “City scale geo-spatialtrajectory estimation of a moving camera,” in Computer Vision andPattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012,pp. 1186–1193.

[27] A. Hakeem, R. Vezzani, M. Shah, and R. Cucchiara, “Estimatinggeospatial trajectory of a moving camera,” in 18th InternationalConference on Pattern Recognition (ICPR’06), vol. 2. IEEE, 2006, pp.82–87.

[28] C.-Y. Chen and K. Grauman, “Clues from the beaten path: Loca-tion estimation with bursty sequences of tourist photos,” in Com-puter Vision and Pattern Recognition (CVPR), 2011 IEEE Conferenceon. IEEE, 2011, pp. 1569–1576.

[29] T.-Y. Lin, S. Belongie, and J. Hays, “Cross-view image geolocaliza-tion,” in Proceedings of the IEEE Conference on Computer Vision andPattern Recognition, 2013, pp. 891–898.

[30] T.-Y. Lin, Y. Cui, S. Belongie, and J. Hays, “Learning deep rep-resentations for ground-to-aerial geolocalization,” in 2015 IEEEConference on Computer Vision and Pattern Recognition (CVPR).IEEE, 2015, pp. 5007–5015.

[31] S. Workman, R. Souvenir, and N. Jacobs, “Wide-area image geolo-calization with aerial reference imagery,” in Proceedings of the IEEEInternational Conference on Computer Vision, 2015, pp. 3961–3969.

[32] N. Jacobs, S. Satkin, N. Roman, R. Speyer, and R. Pless, “Geolocat-ing static cameras,” in 2007 IEEE 11th International Conference onComputer Vision. IEEE, 2007, pp. 1–6.

[33] S. Ramalingam, S. Bouaziz, P. Sturm, and M. Brand, “Skyline2gps:Localization in urban canyons using omni-skylines,” in IntelligentRobots and Systems (IROS), 2010 IEEE/RSJ International Conferenceon. IEEE, 2010, pp. 3816–3823.

[34] R. Arandjelovic and A. Zisserman, “Dislocation: Scalable descrip-tor distinctiveness for location recognition,” in Asian Conference onComputer Vision. Springer, 2014, pp. 188–204.

[35] A. Bergamo, S. N. Sinha, and L. Torresani, “Leveraging struc-ture from motion to learn discriminative codebooks for scalablelandmark classification,” in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition, 2013, pp. 763–770.

[36] S. Cao and N. Snavely, “Graph-based discriminative learningfor location recognition,” in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition, 2013, pp. 700–707.

[37] P. Gronat, G. Obozinski, J. Sivic, and T. Pajdla, “Learning andcalibrating per-location classifiers for visual place recognition,” inProceedings of the IEEE Conference on Computer Vision and PatternRecognition, 2013, pp. 907–914.

[38] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi, “Visual place recog-nition with repetitive structures,” in Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition, 2013, pp. 883–890.

[39] Q. Hao, R. Cai, Z. Li, L. Zhang, Y. Pang, and F. Wu, “3d visualphrases for landmark recognition,” in Computer Vision and PatternRecognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3594–3601.

14

[40] M. Muja and D. G. Lowe, “Fast approximate nearest neighborswith automatic algorithm configuration,” in VISAPP 2009 - Pro-ceedings of the Fourth International Conference on Computer VisionTheory and Applications, Lisboa, Portugal, February 5-8, 2009 - Volume1, 2009, pp. 331–340.

[41] M. Pavan and M. Pelillo, “Dominant sets and pairwise clustering,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 167–172,2007.

[42] Y. T. Tesfaye, E. Zemene, M. Pelillo, and A. Prati, “Multi-objecttracking using dominant sets,” IET computer vision, vol. 10, pp.289–298, 2016.

[43] S. Vascon, E. Z. Mequanint, M. Cristani, H. Hung, M. Pelillo,and V. Murino, “Detecting conversational groups in images andsequences: A robust game-theoretic approach,” Computer Visionand Image Understanding, vol. 143, pp. 11–24, 2016.

[44] R. Horst, P. Pardalos, and N. Van Thoai, Introduction to GlobalOptimization, ser. Nonconvex Optimization and Its Applications.Kluwer Academic Publishers, Dordrecht/Boston/London, 2000.

[45] I. M. Bomze, “On standard quadratic optimization problems,” J.Global Optimization, vol. 13, no. 4, pp. 369–387, 1998.

[46] ——, “Evolution towards the maximum clique,” J. Global Optimiza-tion, vol. 10, no. 2, pp. 143–164, 1997.

[47] K. Muller, J. Neumann, M. Grigutsch, D. Y. von Cramon, andG. Lohmann, “Detecting groups of coherent voxels in functionalmri data using spectral analysis and replicator dynamics,” Journalof Magnetic Resonance Imaging, vol. 26, no. 6, pp. 1642–1650, 2007.

[48] M. Pelillo, “Replicator dynamics in combinatorial optimization,”in Encyclopedia of Optimization, Second Edition, 2009, pp. 3279–3291.

[49] S. Todorovic and N. Ahuja, “Region-based hierarchical imagematching,” International Journal of Computer Vision, vol. 78, no. 1,pp. 47–66, 2008.

[50] M. Pavan and M. Pelillo, “Dominant sets and hierarchical cluster-ing,” in ICCV, 2003, pp. 362–369.

[51] E. Zemene and M. Pelillo, “Interactive image segmentation usingconstrained dominant sets,” in ECCV, 2016, pp. 278–294.

[52] L. Zheng, S. Wang, L. Tian, F. He, Z. Liu, and Q. Tian,“Query-adaptive late fusion for image search and person re-identification,” in CVPR, 2015, pp. 1741–1750.

[53] S. Zhang, M. Yang, T. Cour, K. Yu, and D. N. Metaxas, “Queryspecific rank fusion for image retrieval,” IEEE Trans. Pattern Anal.Mach. Intell., vol. 37, no. 4, pp. 803–815, 2015.

[54] A. Oliva and A. Torralba, “Modeling the shape of the scene:A holistic representation of the spatial envelope,” InternationalJournal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.

[55] K. Simonyan and A. Zisserman, “Very deep convolutionalnetworks for large-scale image recognition,” arXiv preprintarXiv:1409.1556, 2014.

[56] G. Tolias, R. Sicre, and H. Jegou, “Particular object retrieval withintegral max-pooling of CNN activations,” 2016.

[57] F. Radenovic, G. Tolias, and O. Chum, “CNN image retrievallearns from bow: Unsupervised fine-tuning with hard examples,”in ECCV, 2016, pp. 3–20.

Eyasu Zemene received the BSc degree inElectrical Engineering from Jimma University in2007, he then worked in Ethio Telecom for 4years till he joined CaFoscari University (Octo-ber 2011) where he got his MSc in ComputerScience in June 2013. September 2013, he wona 1 year research fellow to work on AdversarialLearning at Pattern Recognition and Applicationlab of University of Cagliari. Since September2014 he is a PhD student of CaFoscari Univer-sity under the supervision of prof. Pelillo. Work-

ing towards his Ph.D. he is trying to solve different computer visionand pattern recognition problems using theories and mathematical toolsinherited from graph theory, optimization theory and game theory. Cur-rently, Eyasu, as part of his PhD, is working as a research assistant atCenter for Research in Computer Vision at University of Central Floridaunder the supervision of Dr. Mubarak Shah. His research interests are inthe areas of Computer Vision, Pattern Recognition, Machine Learning,Graph theory and Game theory.

Yonatan Tariku received his BSc degree incomputer science from Arba Minch University in2007. He has worked 5 years at Ethio-telecomas senior programmer and later joined CaFos-cari University of Venice and received his MScdegree (with Honor) in computer science in2014. He is currently a PhD student at IUAV uni-versity of Venice starting from 2014. He is now aresearch assistance, towards his PhD, at Centerfor Research in Computer Vision at Universityof Central Florida. His research interests include

multi-target tracking, segmentation, image and video geo-localization,game theoretic model and graph theory.

Haroon Idrees is a Postdoctoral Associate atthe Center for Research in Computer Vision atUniversity of Central Florida. He received theBSc (Hons) degree in Computer Engineeringfrom the Lahore University of Management Sci-ences, Pakistan in 2007, and the PhD degree inComputer Science from the University of CentralFlorida in 2014. He has published several pa-pers in conferences and journals such as CVPR,ECCV, Journal of Image and Vision Computing,and IEEE Transactions on Pattern Analysis and

Machine Intelligence. His research interests include crowd analysis,object detection and tracking, wide area analysis, multi-camera andairborne surveillance, and multimedia content analysis.

Andrea Prati Andrea Prati graduated in Com-puter Engineering at the University of Modenaand Reggio Emilia in 1998. He got his PhD inInformation Engineering in 2002 from the sameUniversity. After some post-doc position at Uni-versity of Modena and Reggio Emilia, he wasappointed as Assistant Professor at the Fac-ulty of Engineering of Reggio Emilia (Universityof Modena and Reggio Emilia) from 2005 to2011, and then as Associate Professor at theDepartment of Design and Planning in Complex

Environments of the University IUAV of Venice, Italy. In 2013 he hasbeen promoted to full professorship, waiting for official hiring in the newposition. In December 2015 he moved to the Department of Engineeringand Architecture of the University of Parma. Author of 7 book chapters,31 papers in international referred journals (including 9 papers publishedin IEEE Transactions) and more than 100 papers in proceedings ofinternational conferences and workshops. Andrea Prati is Senior Mem-ber of IEEE, Fellow of IAPR (”For contributions to low- and high-levelalgorithms for video surveillance”), and member of GIRPR.

Marcello Pelillo is Full Professor of ComputerScience at CaFoscari University in Venice, Italy,where he directs the European Centre for Liv-ing Technology (ECLT) and leads the ComputerVision and Pattern Recognition group. He heldvisiting research positions at Yale University,McGill University,the University of Vienna, YorkUniversity (UK), the University College London,and the National ICT Australia (NICTA). He haspublished more than 200 technical papers inrefereed journals, handbooks, and conference

proceedings in the areas of pattern recognition, computer vision andmachine learning. He is General Chair for ICCV 2017 and has servedas Program Chair for several conferences and workshops (EMMCVPR,SIMBAD, S+SSPR, etc.). He serves (has served) on the Editorial Boardsof the journals IEEE Transactions on Pattern Analysis and MachineIntelligence (PAMI), Pattern Recognition, IET Computer Vision, Frontiersin Computer Image Analysis, Brain Informatics, and serves on theAdvisory Board of the International Journal of Machine Learning andCybernetics. Prof. Pelillo is a Fellow of the IEEE and a Fellow of theIAPR.

15

Mubarak Shah, the Trustee chair professor ofcomputer science, is the founding director ofthe Center for Research in Computer Visionat the University of Central Florida (UCF). Heis an editor of an international book series onvideo computing, editor-in-chief of Machine Vi-sion and Applications journal, and an associateeditor of ACM Computing Surveys journal. Hewas the program cochair of the IEEE Conferenceon Computer Vision and Pattern Recognition(CVPR) in 2008, an associate editor of the IEEE

Transactions on Pattern Analysis and Machine Intelligence, and a guesteditor of the special issue of the International Journal of ComputerVision on Video Computing. His research interests include video surveil-lance, visual tracking, human activity recognition, visual analysis ofcrowded scenes, video registration, UAV video analysis, and so on. Heis an ACM distinguished speaker. He was an IEEE distinguished visitorspeaker for 1997-2000 and received the IEEE Outstanding EngineeringEducator Award in 1997. In 2006, he was awarded a Pegasus ProfessorAward, the highest award at UCF. He received the Harris CorporationsEngineering Achievement Award in 1999, TOKTEN awards from UNDPin 1995, 1997, and 2000, Teaching Incentive Program Award in 1995and 2003, Research Incentive Award in 2003 and 2009, MillionairesClub Awards in 2005 and 2006, University Distinguished ResearcherAward in 2007, Honorable mention for the ICCV 2005 Where Am I?Challenge Problem, and was nominated for the Best Paper Award at theACM Multimedia Conference in 2005. He is a fellow of the IEEE, AAAS,IAPR, and SPIE.

Documents

1 Large-scale Image Geo-Localization Using Dominant … · and Mubarak Shah, Fellow, IEEE Abstract ... and obtain a robust ﬁnal solution to the problem ... M. Shah is with the Center