
Contextual Synonym Dictionary for Visual Object Retrieval∗


Wenbin Tang†‡§, Rui Cai†, Zhiwei Li†, and Lei Zhang†

†Microsoft Research Asia
‡Dept. of Computer Science and Technology, Tsinghua University
§Tsinghua National Laboratory for Information Science and Technology

[email protected]   {ruicai, zli, leizhang}@microsoft.com

ABSTRACT

In this paper, we study the problem of visual object retrieval by introducing a dictionary of contextual synonyms to narrow down the semantic gap in visual word quantization. The basic idea is to expand a visual word in the query image with its synonyms to boost the retrieval recall. Unlike the existing work such as soft-quantization, which only focuses on the Euclidean (l2) distance in descriptor space, we utilize the visual words which are more likely to describe visual objects with the same semantic meaning by identifying the words with similar contextual distributions (i.e. contextual synonyms). We describe the contextual distribution of a visual word using the statistics of both co-occurrence and spatial information averaged over all the image patches having this visual word, and propose an efficient system implementation to construct the contextual synonym dictionary for a large visual vocabulary. The whole construction process is unsupervised and the synonym dictionary can be naturally integrated into a standard bag-of-feature image retrieval system. Experimental results on several benchmark datasets are quite promising. The contextual synonym dictionary-based expansion consistently outperforms the l2 distance-based soft-quantization, and advances the state-of-the-art performance remarkably.

Categories and Subject Descriptors

I.2.10 [Vision and Scene Understanding]: Vision

General Terms

Algorithms, Experimentation, Performance

Keywords

Bag-of-word, object retrieval, visual synonym dictionary, query expansion

∗Area chair: Tat-Seng Chua
†This work was performed at Microsoft Research Asia.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MM'11, November 28–December 1, 2011, Scottsdale, Arizona, USA.
Copyright 2011 ACM 978-1-4503-0616-4/11/11 ...$10.00.

1. INTRODUCTION

To retrieve visual objects from a large-scale image database, most of the sophisticated methods in this area are based on the well-known bag-of-feature framework [14, 18], which typically works as follows. First, for each database image, local region descriptors such as the Scale Invariant Feature Transform (SIFT) [10, 20] are extracted. And then the high-dimensional local descriptors are quantized into discrete visual words by utilizing a visual vocabulary. The most common method to construct a visual vocabulary is to perform clustering (e.g. K-means) on the descriptors extracted from a set of training images; and each cluster is treated as a visual word described by its center. The quantization step is essentially to assign a local descriptor to its nearest visual word in the vocabulary in terms of Euclidean (l2) distance, using various approximate search algorithms like KD-tree [10, 15], Locality Sensitive Hashing [3] and Compact Projection [13]. After the quantization, each image is represented by the frequency histogram of a bag of visual words. Taking visual words as keys, images in the database are indexed by inverted files for quick access and search.
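As an illustration of this pipeline (not the authors' implementation), the quantization and histogramming steps could be sketched as follows, assuming a precomputed K-means vocabulary; function and variable names are illustrative:

import numpy as np

def quantize(descriptors, vocabulary):
    # Assign each local descriptor to its nearest visual word in l2 distance
    # (hard quantization). Real systems replace this brute-force search with
    # approximate methods such as KD-trees or LSH.
    # descriptors: (N, 128) SIFT descriptors of one image; vocabulary: (M, 128) cluster centers.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)                      # N visual-word ids

def bow_histogram(word_ids, vocab_size):
    # Represent the image as the frequency histogram of its visual words.
    return np.bincount(word_ids, minlength=vocab_size)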

Visual descriptor quantization is a key step to develop a scalable and fast solution for image retrieval on a large-scale database. However, it also brings two serious and unavoidable problems: mismatch and semantic gap.

Mismatch is due to the polysemy phenomenon, which means one visual word is too coarse to distinguish descriptors extracted from semantically different objects. This phenomenon is particularly prominent when the visual vocabulary is small. Fig. 1 shows some visual words from a vocabulary with 500K visual words constructed based on the Oxford Buildings Dataset [15]. The visual words in the first row are the top-5 nearest neighbors to the query word Q. Obviously these words describe objects with different semantic meanings (e.g. pillar, fence, wall), but have very close l2 distances to Q. In a smaller visual vocabulary, they are very likely to be grouped into one word and lead to poor precision in retrieval. To address this problem, the simplest way to increase the discriminative capability of visual words is enlarging the vocabulary [15]. Other remarkable approaches working on this problem include bundling features [25], spatial-bag-of-features [1], hamming embedding and weak geometry consistency [5, 6], utilizing contextual dissimilarity measure [7], high-order spatial features [32], and compositional features [26], etc. In summary, by introducing additional spatial and contextual constraints into visual word matching, these approaches have made significant improvements on retrieval precision.

Figure 1: An example to illustrate the “semantic gap” between descriptors and semantic objects. Each visual word is represented by a panel containing 9 patches which are the closest samples to the corresponding cluster center. The Harris scale of each patch is marked with a red circle. Given the query word Q (radius ≈ 150.0), its top-5 nearest neighbor words under l2 distance (l2 from 159.9 to 183.6) are shown in the first row. The second row lists another 5 visual words (ranked 361, 57284, 60561, 209948 and 481363 by l2 distance, and actually identified by the algorithm in Section 2) which are more semantically related to the query word but are far away in the descriptor space. This example was extracted from a 500K vocabulary constructed based on the Oxford Building Dataset [15].

By contrast, semantic gap leads to the synonymy phenomenon, which means several different visual words actually describe semantically identical visual objects. This phenomenon is especially obvious when adopting a large visual vocabulary, and usually results in poor recall performance in visual object retrieval [15]. In Fig. 1, there are another 5 visual words in the second row which are more semantically related to the query word Q but are far away from Q in the descriptor space. One of the reasons behind this phenomenon is the poor robustness of local descriptor algorithms. In fact, even those state-of-the-art local descriptors like SIFT are still sensitive to small disturbances (e.g. changes in viewpoint, scale, blur, and capturing condition) and output unstable and noisy feature values [19]. To increase the recall rate, a few research efforts have been devoted to reducing the troubles caused by synonymous words. One of the most influential techniques is the query expansion strategy proposed in [2], which completes the query set using visual words taken from matching objects in the first-round search results. Another method worthy of notice is soft-quantization (or soft-assignment), which assigns a descriptor to a weighted combination of nearest visual words [16, 21]. Statistical learning methods were also adopted to bridge the semantic gap, e.g., learning a transformation from the original descriptor space (e.g. SIFT) to a new Euclidean space in which l2 distance is more correlated to semantic divergence [17, 23]. A more direct approach was proposed in [12] to estimate the semantic coherence of two visual words via counting the frequency with which the two words match each other in a training set. A similar technique is also applied in [4], where spatial verifications are employed between the query and top-ranked images in the first-round result to detect matching points with different visual words; the detected word pairs are then viewed as synonyms and used to update the initial query histogram. Recently, there is another trend to bridge the semantic gap in visual retrieval. The basic idea is to incorporate textual information to promote more meaningful visual features [8] or provide a more accurate combinational similarity measure [24, 11]. For example, in [8], auxiliary visual words are identified by propagating the similarity across both visual and textual graphs. Incorporating textual information in visual retrieval is a promising research area as surrounding text is indeed available in many application scenarios (e.g. near-dup key-frame retrieval in video search [24]). However, this is beyond the scope of this paper, in which we still focus on leveraging pure visual information.

In this paper, we target at providing a simple yet effective solution to narrow down the semantic gap in visual object retrieval. In contrast, most aforementioned approaches are practically expensive. For example, query expansion needs to do multiple search processes; and its performance highly depends on the quality of the initial recall [2]. Learning-based methods [12, 17, 23] need to collect a set of matching point pairs as training data. To construct a comprehensive training set, it usually needs a large image database and has to run complex geometric verification algorithms like RANSAC on candidate image pairs to identify possible matching points. Even so, the obtained training samples (pairs of synonymous words) are still very sparse when the vocabulary size is large. Soft-quantization is cheap and effective, but in essence it can only handle the over-splitting cases in quantization. We argue that the existing l2 distance-based soft-quantization is incapable of dealing with semantic gap. This is because visual words that are close in l2 distance do not always have similar semantic meanings, as shown in Fig. 1.

However, the “expansion” idea behind soft-quantization and query expansion is very praiseworthy. It reminds us of the effective synonym expansion strategy in text retrieval [22]. For visual object retrieval, the main challenge here is how to identify synonyms from a visual vocabulary. In this paper, we once again borrow the experience from the text domain. In text analysis, synonyms can be identified by finding words with similar contextual distributions [28], where the textual context refers to the surrounding text of a word. Analogically, we can define the visual context as a set of local points which are in a certain spatial region of a visual word. Intuitively, visual words representing the same semantic meaning tend to have similar visual contextual distributions. An evidence to support this assumption is the “visual phrase” reported in the recent literature of object recognition [27, 34, 26, 30, 29, 33, 32]. A visual phrase is a set of frequently co-occurring visual words, and has been observed to exist in visual objects from the same semantic category (e.g. cars). Besides co-occurrence, we also incorporate local spatial information into the visual contextual distribution as well as the corresponding contextual similarity measure¹. Next, we define the “contextual synonyms” of a visual word as its nearest neighbors in terms of the contextual similarity measure. The ensemble of the contextual synonyms of all the visual words is a contextual synonym dictionary, which can be computed offline and stored in memory for fast access. For retrieval, each visual word extracted on the query image is expanded to a weighted combination of its visual synonyms, and the expanded histogram is taken as the new query.

¹Although spatial information has been widely utilized to increase the distinguishability of visual words, the goal here is to identify synonymous words. Moreover, most spatial information in previous work just takes into account the context of an individual image patch; while in this paper, the spatial information of a visual word is the contextual statistics averaged over all the image patches assigned to that word.

Figure 2: An illustration of the process to estimate the contextual distribution of a visual word ω_m. (a) For each instance p_i of the visual word, its visual context is defined as a spatial circular region R_{p_i} with the radius r_c × scale_{p_i}; (b) A point in the visual context is weighted according to its spatial distance to p_i; (c) The dominant orientation of p_i is rotated to be vertical; R_{p_i} is partitioned into K sectors, and for each sector a sub-histogram is constructed; (d) The sub-histograms of all the instances of the visual word ω_m are aggregated to describe the contextual distribution of ω_m.

As a side-note, here we would like to make a distinction between the proposed contextual synonym dictionary and the visual phrases in existing work. Actually, their perspectives are different. To identify visual phrases, the typical process is to first discover combinations of visual words and then group these combinations according to some distance measures. Each group is taken as a visual phrase. The related work usually works on a small vocabulary, as the number of possible combinations increases exponentially with the vocabulary size [27, 34]. Moreover, they can leverage supervised information (e.g. image categories) to learn distance measures between visual phrases (e.g. the Mahalanobis distance in [29], boosting in [26], and the kernel-based learning in [32]). In retrieval, visual phrases are considered as an additional kind of feature to provide better image representations. By contrast, the contextual synonym dictionary in this paper is defined on single visual words. There are no explicit “phrases”; instead, some implicit contextual relationships among visual words are embedded. Our approach can work on a large vocabulary as the cost is just linear in the vocabulary size. For retrieval, the synonym dictionary is utilized to expand query terms and the image representation is still a histogram of visual words. Therefore, the contextual synonym dictionary can be naturally integrated into a standard bag-of-feature image retrieval system.

The rest of the paper is organized as follows. We introduce the visual context and contextual similarity measure in Section 2. Section 3 describes the system implementation to construct a contextual synonym dictionary, and presents how to apply the synonym dictionary to expand visual words in visual object retrieval. Experimental results are discussed in Section 4; and the conclusion is drawn in Section 5.

2. VISUAL CONTEXTUAL SYNONYMS

In this section, we introduce the concepts of visual context and contextual distribution, as well as how to measure the contextual similarity of two visual words according to their contextual distributions.

To make a clear presentation and facilitate the following discussions, we first define some notations used in this paper. In the bag-of-feature framework, there is a visual vocabulary Ω = {ω_1, . . . , ω_m, . . . , ω_M} in which ω_m is the m-th visual word and M = |Ω| is the vocabulary size. Each extracted local descriptor (SIFT) is assigned to its nearest visual word (i.e. quantization). As a result, an image I is represented as a bag of visual words {ω_{p_i}}, where ω_{p_i} is the visual word of the i-th interest point p_i on I. Correspondingly, we adopt (x_{p_i}, y_{p_i}) to represent the coordinate of p_i, and scale_{p_i} to denote its Harris scale output by the SIFT detector. Moreover, for a visual word ω_m, we define its instance set I_m as all the interest points quantized to ω_m in the database.

2.1 Visual Contextual Distribution

In a text document, the context of a word is its surrounding text, e.g., a sentence or a paragraph containing that word. For an interest point p_i, there are no such syntax boundaries, and we simply define its visual context as a spatial circular region R_{p_i} centered at p_i. As illustrated in Fig. 2 (a) and (b), the radius of R_{p_i} is r_c × scale_{p_i}, where r_c is a multiplier to control the contextual region size. Since the radius is in proportion to the Harris scale scale_{p_i}, R_{p_i} is robust to object scaling. The simplest way to characterize textual context is the histogram of term frequency; similarly, the most straightforward definition of the visual contextual distribution of p_i is the histogram of visual words:

    C_{tf}(p_i) = [\, tf_1(p_i), \ldots, tf_m(p_i), \ldots, tf_M(p_i) \,]    (1)

where tf_m is the term frequency of the visual word ω_m in the contextual region R_{p_i}.

Although C_tf can embed the co-occurrences among visual words, it is incapable of describing the spatial information, which has been proven to be important in many computer vision applications. To provide a more comprehensive description, in the following we introduce two strategies to add spatial characteristics to the visual context, namely supporting weight and sector-based context.

Supporting weight is designed to distinguish the contributions of two different local points to the contextual distribution of p_i, as shown in Fig. 2 (b). Intuitively, the one which is closer to the region center should be more important. Specifically, for an interest point p_j ∈ R_{p_i}, its relative distance to p_i is the Euclidean distance on the image plane normalized by the contextual region radius:

    dist_{p_j \to p_i} = \frac{\| (x_{p_j}, y_{p_j}) - (x_{p_i}, y_{p_i}) \|_2}{r_c \times scale_{p_i}}    (2)

And the contribution (supporting weight) of p_j to p_i's contextual distribution is defined as:

    sw_{p_j \to p_i} = e^{-dist_{p_j \to p_i}^2 / 2}    (3)

while in C_tf, each p_j is considered to contribute equally to the histogram. Then, the contextual distribution definition incorporating supporting weight is:

    C_{sw}(p_i) = [\, c_1(p_i), \ldots, c_m(p_i), \ldots, c_M(p_i) \,]    (4)

where

    c_m(p_i) = \sum_{p_j} sw_{p_j \to p_i}, \quad p_j \in R_{p_i} \;\&\; \omega_{p_j} = \omega_m.    (5)
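As an illustrative calculation of Eqs. (2)-(3) (not from the paper): a neighbor lying at half the contextual radius has relative distance dist = 0.5 and supporting weight

    sw = e^{-0.5^2/2} = e^{-0.125} \approx 0.88,

while a neighbor on the region boundary (dist = 1) contributes only e^{-0.5} ≈ 0.61, so closer points indeed dominate the histogram.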

Sector-based context is inspired by the definitions of preceding and following words in the text domain. In textual context modeling, the part before a word is treated differently from the part after that word. Analogously, we can divide the circular region R_{p_i} into several equal sectors, as shown in Fig. 2 (c). Actually, this thought is quite natural and similar ideas have been successfully applied for both object recognition [9] and retrieval [1]. To yield rotation invariance, we also rotate R_{p_i} to align its dominant orientation to be vertical. Then, as illustrated in Fig. 2 (c), the sectors are numbered 1, 2, . . . , K in clockwise order, starting from the one on the right side of the dominant orientation. After that, we compute the contextual distribution for each sector, denoted as C^k_{sw}, and concatenate all these sub-histograms to construct a new hyper-histogram C_sect to describe the contextual distribution of R_{p_i}:

    C_{sect}(p_i) = [\, C_{sw}^1(p_i), \ldots, C_{sw}^k(p_i), \ldots, C_{sw}^K(p_i) \,]    (6)

It should be noticed that the length of C_sect is K × M.

Figure 3: Definition of the circle distance between two sectors: d(i, j) = min(|i − j|, K − |i − j|).
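A minimal sketch of how C_sect(p_i) in Eqs. (2)-(6) could be computed for one interest point; the point and neighbor representations, parameter defaults, and function name are illustrative assumptions rather than the authors' code:

import numpy as np

def point_context(p, neighbors, vocab_size, K=16, rc=4.0):
    # p:         dict with keys 'x', 'y', 'scale', 'orientation' (radians).
    # neighbors: list of dicts with keys 'x', 'y', 'word' for nearby interest points.
    # Returns a K x M hyper-histogram, one sub-histogram per sector.
    C = np.zeros((K, vocab_size))
    radius = rc * p['scale']
    for q in neighbors:
        dx, dy = q['x'] - p['x'], q['y'] - p['y']
        dist = np.hypot(dx, dy) / radius            # relative distance, Eq. (2)
        if dist > 1.0:
            continue                                # outside the contextual region R_p
        sw = np.exp(-dist ** 2 / 2.0)               # supporting weight, Eq. (3)
        # Measure the angle relative to the dominant orientation so sector ids are
        # rotation invariant (the paper's exact sector-numbering convention is approximated).
        angle = (np.arctan2(dy, dx) - p['orientation']) % (2 * np.pi)
        sector = int(angle / (2 * np.pi / K)) % K
        C[sector, q['word']] += sw
    return C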

Contextual distribution of a visual word is the statistical aggregation of the contextual distributions of all its instances in the database, as shown in Fig. 2 (d). For a visual word ω_m, we average C_sect(p_i), ∀p_i ∈ I_m, to represent its statistical contextual distribution:

    C_m = [\, C_m^1, \ldots, C_m^k, \ldots, C_m^K \,], \quad C_m^k = \frac{1}{|I_m|} \sum_{p_i \in I_m} C_{sw}^k(p_i).    (7)

C^k_m is an M-dimensional vector whose n-th element C^k_m(n) is the weight of the visual word ω_n in the k-th sector of ω_m. In this way, C_m incorporates abundant semantic information from instances in various images, and is more stable than the context of an individual image patch.
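Continuing the sketch above (it reuses the hypothetical point_context from the previous block), the word-level distribution of Eq. (7) is simply the average of the per-instance hyper-histograms; the whole-histogram l2-normalization mentioned in Section 2.2 is included here for convenience:

import numpy as np

def word_context(instances, neighbor_lists, vocab_size, K=16, rc=4.0):
    # instances:         interest points quantized to this word (the set I_m).
    # neighbor_lists[i]: points falling inside the contextual region of instances[i].
    C_m = np.zeros((K, vocab_size))
    for p, neighbors in zip(instances, neighbor_lists):
        C_m += point_context(p, neighbors, vocab_size, K, rc)   # per-instance C_sect, sketched above
    C_m /= max(len(instances), 1)                                # average over |I_m| instances, Eq. (7)
    norm = np.linalg.norm(C_m)                                   # l2-normalize the whole K x M histogram
    return C_m / norm if norm > 0 else C_m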

2.2 Contextual Similarity Measure

To identify visual synonyms, the remaining problem is how to measure the contextual similarity between two visual words. First of all, the contextual distributions C_m, 1 ≤ m ≤ M, should be normalized to provide a comparable measure between different visual words. Considering that C_m is essentially a hyper-histogram consisting of K sub-histograms, the most straightforward way is to normalize each sub-histogram C^k_m respectively. However, in practice we found that the normalization of sub-histograms usually results in the loss of spatial density information. In other words, a “rich” sector containing a large number of visual word instances is treated equally to a “poor” sector in the following similarity measure. To keep the density information, we perform an l2-normalization on the whole histogram C_m as if it had not been partitioned into sectors.

Based on the l2-normalized C_m, the most natural similarity measure is the cosine similarity. However, the cosine similarity on C_m implies a constraint that different sectors (i.e. sub-histograms) are totally unrelated and will not be compared. This constraint is too strict, as sectors are only a very coarse partition of a contextual region. Moreover, the dominant direction estimated by the SIFT detector is not that accurate. Therefore, the cosine similarity tends to underestimate the similarity of two visual words that really have close contextual distributions. To tolerate the dislocation error, in this paper the contextual similarity of two visual words ω_m and ω_n is defined as:

    Sim(\omega_m, \omega_n) = \sum_{i=1}^{K} \sum_{j=1}^{K} \big( Sim_{sec}(C_m^i, C_n^j) \cdot \Phi(i, j) \big).    (8)

Here, Sim_sec is the weighted inner product of the two sub-histograms C^i_m and C^j_n:

    Sim_{sec}(C_m^i, C_n^j) = \sum_{v=1}^{M} idf_v^2 \cdot C_m^i(v) \cdot C_n^j(v),    (9)

where the weight idf_v is the inverse document frequency of the visual word ω_v in all the contextual distributions C_m, 1 ≤ m ≤ M, to punish those popular visual words. Φ is a pre-defined K × K matrix whose (i, j)-element approximates the dislocation probability between the i-th and j-th sectors. As shown in Fig. 3 and Fig. 4, Φ(i, j) is computed based on the exponential weighting of the circle distance between the corresponding sectors:

    \Phi(i, j) = e^{-d(i,j)^2 / (K/2)}, \quad d(i, j) = \min(|i - j|, K - |i - j|).    (10)

Figure 4: Visualization of the dislocation probability (%) when K = 16.

According to the similarity measure, the “contextual synonyms” of a visual word ω_m are defined as those visual words with the largest contextual similarities to ω_m.
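A minimal sketch of Eqs. (8)-(10), assuming the word-level contexts are stored as dense K × M arrays and that an idf vector over context words has been computed; names are illustrative:

import numpy as np

def dislocation_matrix(K=16):
    # Phi(i, j) of Eq. (10): exponential weighting of the circle distance between sectors.
    idx = np.arange(K)
    diff = np.abs(idx[:, None] - idx[None, :])
    d = np.minimum(diff, K - diff)
    return np.exp(-d ** 2 / (K / 2.0))

def contextual_similarity(C_m, C_n, idf, Phi):
    # Sim(w_m, w_n) of Eqs. (8)-(9) for two l2-normalized K x M contextual distributions.
    weighted = C_m * (idf ** 2)[None, :]   # idf_v^2 * C_m^i(v)
    sim_sec = weighted @ C_n.T             # K x K matrix of Sim_sec(C_m^i, C_n^j)
    return float((sim_sec * Phi).sum())    # sum over all sector pairs, weighted by Phi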

3. SYSTEM IMPLEMENTATIONS

In this section, we introduce the system implementation to construct the contextual synonym dictionary through identifying synonyms for every visual word in the vocabulary. Furthermore, we describe how to leverage the synonym dictionary in visual object retrieval.

3.1 Contextual Synonym Dictionary Construction

To construct the contextual synonym dictionary, the contextual distributions of all interest points in the training images are first extracted. Next, the contextual distribution of each visual word is obtained by aggregating the distributions of all its instances. For a visual word ω_q, the contextual synonyms are then identified by searching for the visual words with the largest contextual similarities.

Figure 5: Inverted files to index contextual distributions.

Input: Training image set: I_1, I_2, . . . , I_s;
Output: Synonym Dictionary;

foreach interest point p_i ∈ {I_1, I_2, . . . , I_s} do
    extract contextual distribution C_sect(p_i);
end
foreach visual word ω_m ∈ Ω do
    C_m = [C^1_m, . . . , C^k_m, . . . , C^K_m],  C^k_m = (1/|I_m|) Σ_{p_i∈I_m} C^k_sw(p_i);
    Sparsify C_m to satisfy ‖C_m‖ ≤ ‖C‖_max;
end
Build inverted files as illustrated in Fig. 5;
foreach visual word ω_q ∈ Ω do
    Initialize all score[ ] = 0;
    foreach visual word (ω_n, C^1_q(n), . . . , C^K_q(n)) ∈ C_q do
        foreach entry (ω_m, C^1_m(n), . . . , C^K_m(n)) in the inverted list of ω_n do
            score[m] += Σ_{i,j=1}^{K} idf^2_n · C^i_q(n) · C^j_m(n) · Φ(i, j);
        end
    end
    The visual words with the largest score are the synonyms of ω_q;
end

Algorithm 1: Synonym Dictionary Construction.

The brute-force way to find the most similar visual words is to linearly scan all the visual words and compare the contextual similarities one by one. However, the computational cost is extremely heavy when dealing with a large visual vocabulary. For example, to construct the synonym dictionary for a vocabulary with 10^5 visual words, it needs to compute 10^10 similarities. Therefore the linear scan solution is infeasible in practice. To accelerate the construction process, we utilize inverted files to index the contextual distributions of visual words, as shown in Fig. 5. The inverted files are built as follows. Every visual word ω_n has an entry list in which each entry represents another visual word ω_m whose contextual distribution contains ω_n. An entry records the id of ω_m, as well as the weights C^k_m(n), 1 ≤ k ≤ K, of ω_n in ω_m's contextual distribution. Supposing there are ‖C_q‖ unique non-zero elements in ω_q's contextual distribution, it just needs to measure visual words in the corresponding ‖C_q‖ entry lists to identify synonyms. ‖C_q‖ is called the contextual distribution size of ω_q; and ‖C‖_avg is the average of the contextual distribution sizes over all the visual words.
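A minimal sketch of this inverted-file trick (Fig. 5, Algorithm 1), assuming each sparsified context is kept as a dictionary mapping a context word n to its K-vector of weights; the data layout and names are illustrative assumptions:

from collections import defaultdict

def build_inverted_files(contexts):
    # contexts: word id m -> {context word id n: K-vector of weights C_m^k(n)}.
    # The inverted list of n holds every word m whose context contains n.
    inv = defaultdict(list)
    for m, ctx in contexts.items():
        for n, weights in ctx.items():
            inv[n].append((m, weights))
    return inv

def synonyms(q, contexts, inv, idf, Phi, top=10):
    # Score only the words that share at least one non-zero context dimension with w_q,
    # i.e. the words found in the ||C_q|| relevant inverted lists.
    scores = defaultdict(float)
    for n, wq in contexts[q].items():
        for m, wm in inv[n]:
            # score[m] += sum_{i,j} idf_n^2 * C_q^i(n) * C_m^j(n) * Phi(i, j)
            scores[m] += (idf[n] ** 2) * float(wq @ Phi @ wm)
    return sorted(scores, key=scores.get, reverse=True)[:top]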

Sparsification Optimization: With the assistance of inverted files, the time cost to identify the synonyms of a specific visual word ω_q is O(‖C_q‖ × ‖C‖_avg × K^2), which is considerably less expensive than linear scanning. However, when a large training set is applied, the contextual distributions of visual words may contain thousands of visual words and make the construction process costly. To save the costs, in practice we can make the contextual distribution more sparse by removing visual words with small weights in the histogram, to satisfy ‖C_m‖ ≤ ‖C‖_max, ∀m ∈ [1, M]. Finally, the construction process is summarized in Algorithm 1.

Figure 6: The framework of leveraging the contextual synonym dictionary in visual object retrieval. The visual words in the query image are expanded to weighted combinations of their visual synonyms, and the augmented histogram is taken as the query.

Complexity Analysis: Each entry in the inverted files requires 4 × (K + 1) bytes; and the total memory cost is around 4 × (K + 1) × M × ‖C‖_avg bytes without sparsification. The time cost to identify the synonyms of one visual word is O(‖C‖^2_avg × K^2); and the overall time complexity to construct the contextual synonym dictionary is O(‖C‖^2_avg × K^2 × M). For example, given the vocabulary size M = 500K, the sector number K = 16 and the contextual region scale r_c = 4, the average contextual distribution size is ‖C‖_avg ≈ 219. The construction process requires about 7 GB of memory and runs in around 35 minutes on a 16-core 2.67 GHz Intel Xeon machine. By employing sparsification, the contextual distribution size of a visual word does not exceed ‖C‖_max. Therefore, the space complexity is O(K × M × ‖C‖_max) and the time complexity is O(‖C‖^2_max × K^2 × M). It should be noted that, since ‖C‖_max is a constant, the computation cost is independent of the number of images in the training set. A comprehensive study of different settings of ‖C‖_max and the corresponding costs is given in Section 4.2.
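Plugging the reported numbers into the memory formula above (just a sanity check on the order of magnitude):

    4 \times (K + 1) \times M \times \|C\|_{avg} = 4 \times 17 \times 5 \times 10^5 \times 219 \approx 7.4 \times 10^9 \text{ bytes} \approx 7\,\text{GB},

which is consistent with the unsparsified memory figure reported in Table 4.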

3.2 Contextual Synonym-based Expansion

Now we explain how to leverage the contextual synonym dictionary in visual object retrieval. The framework is similar to that proposed for soft-quantization. The basic flow is that each visual word extracted on the query image is expanded to a list of knn words (either l2 nearest neighbors in soft-quantization or contextual synonyms in this paper). The query image is then described with a weighted combination of the expansion words. However, there are some differences that should be mentioned: (1) We only perform expansion on query images. Although expanding images in the database is doable, it would greatly increase the index size and retrieval time. (2) Each local descriptor is only assigned to one visual word in quantization. The expanded synonym words are obtained via looking up the contextual synonym dictionary. (3) The weight of each synonym is in proportion to its contextual similarity to the related query word. The work flow is illustrated in Fig. 6.

Table 1: Information of the datasets.
    Dataset       #Images    #Descriptors (SIFT)
    Oxford5K      5,063      24,910,180
    Paris6K       6,392      28,572,648
    Flickr100K    100,000    152,052,164

Table 2: The approaches in comparison.
    Approach                                     Parameters
    Hard-Quantization                            M
    l2-based Soft-Quantization (Query-side)      M, knn
    Contextual Synonym-based Expansion           M, knn, rc, K, ‖C‖max
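A minimal sketch of the query-side expansion described in Section 3.2; the exact weighting and normalization of the expanded histogram are assumptions, since the paper only states that a synonym's weight is proportional to its contextual similarity:

from collections import defaultdict

def expand_query(word_ids, synonym_dict, knn=3):
    # word_ids:     visual words of the query image (one per local descriptor, hard-quantized).
    # synonym_dict: word -> list of (synonym word, contextual similarity), precomputed offline.
    hist = defaultdict(float)
    for w in word_ids:
        hist[w] += 1.0                                  # the original query word
        for syn, sim in synonym_dict.get(w, [])[:knn]:
            hist[syn] += sim                            # synonym weight proportional to similarity
    return dict(hist)                                   # sparse expanded query histogram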

4. EXPERIMENTAL RESULTS

A series of experiments were conducted to evaluate the proposed contextual synonym-based expansion for visual object retrieval. First, we investigated the influences of various parameters in the proposed approach, and then compared the overall retrieval performance with some state-of-the-art methods on several benchmark datasets.

4.1 Experiment Setup

The experiments were based on three datasets, as shown in Table 1. The Oxford5K dataset was introduced in [15] and has become a standard benchmark for object retrieval. It has 55 query images with manually labeled ground truth. The Paris6K dataset was first introduced in [16]. It has more than 6000 images collected from Flickr by searching for particular Paris landmarks, and also has 55 query images with ground truth. The Flickr100K dataset in this paper has 100K images crawled from Flickr, and we ensured there is no overlap with the Oxford5K and Paris6K datasets. We adopted SIFT as the local descriptor in the experiments, and applied the software developed in [20] to extract SIFT descriptors for all the images in these datasets.

Two state-of-the-art methods were implemented as baselines for performance comparison, i.e., hard-quantization and l2-based soft-quantization, as shown in Table 2. For evaluation, the standard mean average precision (mAP) was utilized to characterize the retrieval performance. For each query image, the average precision (AP) was computed as the area under the precision-recall curve. The mAP is defined as the mean value of average precision over all queries. Table 2 also lists the parameters in these methods, which will be discussed in the following subsection.

To avoid over-fitting in the experiments, the query images were excluded from the contextual synonym dictionary construction process.
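For reference, a simplified sketch of the evaluation measure (it ignores the "junk" handling of the standard Oxford evaluation protocol):

import numpy as np

def average_precision(ranked_images, positives):
    # Approximates the area under the precision-recall curve for one query.
    hits, precisions = 0, []
    for rank, img in enumerate(ranked_images, start=1):
        if img in positives:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(positives) if positives else 0.0

def mean_average_precision(results, ground_truth):
    # results: query -> ranked image list; ground_truth: query -> set of positive images.
    return float(np.mean([average_precision(results[q], ground_truth[q]) for q in results]))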

4.2 Contextual Synonym-based Expansion

In this subsection, we investigate the influences of the parameters in the proposed approach. We also tried combining the l2-based soft-quantization with our synonym-based expansion. All the evaluations in this part were based on the Oxford5K dataset.

Figure 7: Performance comparisons of different approaches under different vocabulary sizes.

Figure 8: Performance comparisons of different approaches under different contextual region scales and different numbers of expansion words (vocabulary size M = 500K).

Vocabulary Size M. The three approaches in comparison were evaluated on five vocabulary sizes: 100K, 500K, 1M, 1.5M and 2M. For l2-based soft-quantization and synonym-based expansion, the number of expansion words was set to 3, under which the best performance was achieved in [16]. The mAP values of the different approaches under different vocabulary sizes are shown in Fig. 7. All three approaches achieved their best performance on the 500K vocabulary. This is slightly different from the observation in [15], in which the best performance was obtained on a vocabulary with 1M visual words. This may be because the adopted SIFT detectors were different. One piece of evidence to support this explanation is that in this paper even the baseline performance (with hard-quantization) is much higher than that reported in [15]. From Fig. 7, it is also noticeable that the performances of all the approaches drop when the vocabulary size becomes larger. This suggests that over-splitting becomes the major problem with a large vocabulary. This also explains why soft-quantization has the slowest decreasing speed, as it is designed to deal with the over-splitting problem in quantization. All the following experiments (unless explicitly specified) were based on the 500K vocabulary.

Number of Expansion Words knn. Both soft-quantization and the proposed synonym-based expansion need to determine the number of soft-assignment/expansion words. As shown in Fig. 8, knn was enumerated from 1 to 10 in the experiment. It is clear that the performance of soft-quantization drops when more expansion words are included. This makes sense as neighbors in l2 distance could be noise in terms of semantic meaning, as the example in Fig. 1 shows. This observation is also consistent with the results reported in [16], and proves that soft-quantization cannot help reduce the semantic gap in object retrieval. By contrast, the proposed synonym expansion works well when the expansion number increases. Although the performance is relatively stable when knn becomes large, the additional expansion words do not hurt the search results. This also suggests that the included contextual synonyms are semantically related to the query words.

Table 3: Performance comparison of the proposed approach under different sector numbers in contextual region partition.
    K      1      2      4      8      16     32
    mAP    0.748  0.751  0.752  0.753  0.755  0.754

Table 4: Complexity and performance analysis of different contextual distribution sizes.
    ‖C‖max                    ∞        100      20
    ‖C‖avg                    219.1    94.8     19.8
    mAP                       0.755    0.755    0.750
    Memory                    7.4GB    3.2GB    673MB
    Syn. Dict. Const. Time    35min    19min    14min

Contextual Region Scale rc. Fig. 8 also shows the performance of the proposed approach with different contextual region scales. We enumerated the multiplier rc from 2 to 5, and found that the performance becomes stable after rc = 4. Actually, the Harris scales of most SIFT descriptors are quite small. We need a relatively large rc to include enough surrounding points to obtain a sufficient statistic of the visual context. Therefore, rc was set to 4 in the following experiments.

Number of Contextual Region Sectors K. As introduced in Section 2.1, a contextual region is divided into several sectors to embed more spatial information. Different numbers of sectors were evaluated, as shown in Table 3. The performance was improved with more sectors in the partition. However, the improvement is not significant when K is large. We selected K = 16 in the experiments.

Contextual Distribution Size ‖C‖max. We have analyzed the complexity of constructing a contextual synonym dictionary in Section 3.1, and proposed to make the contextual distribution sparse with an upper-bound threshold ‖C‖max. Different ‖C‖max values were tested, and the corresponding mAP, memory cost, and running time are reported in Table 4. Of course this restriction sacrifices some of the characterization capability of the contextual distribution. However, the retrieval performance drops only slightly, in exchange for significantly reduced memory and running time. One explanation of the small performance decrease is that the sparsification also makes the contextual distribution less noisy; in other words, words with small weights in a context are very likely to be random noise.

Table 5: Retrieval performances via integrating l2-based soft-quantization and synonym expansion, respectively on two vocabularies of 500K and 2M.

                           Synonym Expansion
    l2 Soft-Quant.         1nn    2nn    3nn    5nn    10nn
    500K   1nn             0.708  0.734  0.741  0.748  0.755
           2nn             0.716  0.742  0.747  0.752  0.756
           3nn             0.718  0.744  0.749  0.753  0.755
    2M     1nn             0.680  0.698  0.706  0.717  0.730
           2nn             0.700  0.715  0.725  0.734  0.745
           3nn             0.708  0.725  0.734  0.741  0.751

Integrate Soft-Quantization and Synonym Expansion. It is an intuitive thought to integrate soft-quantization and the proposed synonym expansion, as they target over-splitting and the semantic gap respectively. We proposed a simple integration strategy, i.e., doing soft-assignment first and then expanding each obtained soft word with its contextual synonyms. The motivation here is that over-splitting is in the low-level feature space and should be handled before touching the high-level semantic problem. In the experiment we evaluated different combinations of expansion numbers for soft-quantization and contextual synonyms. The results are shown in Table 5, in which each column has the same number of synonyms but different numbers of words (1 ∼ 3) in soft-quantization. We did the evaluations on two vocabularies. On the 500K vocabulary, the improvement of the integration is not significant in comparison with only using contextual synonym expansion; however, the improvement on the 2M vocabulary is noticeable. This is because the over-splitting phenomenon is not that serious on the 500K vocabulary, and the semantic gap is the main issue; while for the 2M vocabulary, both over-splitting and the semantic gap become serious, and the two approaches complement each other. Of course we have only shallowly touched the integration problem and leave the exploration of more sophisticated solutions to future work.
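A minimal sketch of this integration strategy; how the soft-assignment weight and the contextual similarity are combined below is an assumption, since the paper does not specify the exact weighting:

def integrated_expansion(descriptor_neighbors, synonym_dict, soft_nn=3, syn_nn=3):
    # descriptor_neighbors: for one query descriptor, a list of (visual word, soft weight)
    #                       sorted by l2 distance (soft-quantization candidates).
    # synonym_dict:         word -> list of (synonym word, contextual similarity).
    contrib = {}
    for w, soft_w in descriptor_neighbors[:soft_nn]:       # soft-assign first ...
        contrib[w] = contrib.get(w, 0.0) + soft_w
        for syn, sim in synonym_dict.get(w, [])[:syn_nn]:  # ... then expand each soft word
            contrib[syn] = contrib.get(syn, 0.0) + soft_w * sim
    return contrib        # this descriptor's contribution to the expanded query histogram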

4.3 Comparison of Overall Performance

The comparison of the overall performance of different approaches on various datasets is shown in Table 6. To provide a comprehensive comparison, for each approach we adopted the RANSAC algorithm to do spatial verification for re-ranking the search results. Moreover, the query expansion strategy (denoted as QE in Table 6) proposed in [2] was also applied based on the RANSAC results to further improve the retrieval performance. The "query" in [2] represents an image rather than a visual word, and the corresponding "query expansion" is different from the expansion in this paper. From Table 6, it is clear that the proposed contextual synonym-based expansion always achieved the best performance.

There are several interesting observations from Table 6. First, on the Oxford5K dataset, the performance gains of the RANSAC re-ranking are 3.0%, 3.3% and 3.9% for hard-quantization, l2-soft quantization, and synonym-based expansion, respectively. The spatial verification and re-ranking was only performed on the top-ranked images, thus the performance highly depends on the initial recall. With the assistance of synonym expansion, the recall is improved, which makes the RANSAC more powerful. The additional improvements brought by the image-level query expansion (QE) are 4.9%, 3.9% and 1.7%, respectively. The performance gains in the syn-expand results are not as significant as those of the two baselines. One possible explanation is that visual synonyms already narrow down the semantic gap and leave little room for QE. It is noticeable that the mAP of syn-expand is even larger than that of l2-soft + RANSAC in all four experiments, which shows that the proposed approach is cheap and effective. Another observation is that the improvements brought by RANSAC on the Paris6K dataset are not as significant as those on Oxford5K. This may be because the queries in the Paris6K dataset contain more positive cases (on average, each query in Paris6K has 163 positive cases, but only 52 for queries in Oxford5K), while the RANSAC is only performed on a small part of the first-round retrieval results.

Figure 9: mAP of all the queries on the "Oxford5K+Flickr100K" dataset.

For the “Oxford5K+Flickr100K” dataset, we also present the mAP for all the 11 query buildings (five queries for each), as shown in Fig. 9. Our approach outperforms the baseline methods on most query buildings. However, it performs slightly worse than l2-soft on balliol, cornmarket and keble. This may be because of noise in the synonym vocabulary. We leave the noise-removal task as our future work.

The price paid by the proposed approach is a relatively slow search speed, as the expansion words increase the query length. Soft-quantization also suffers from this “long query” problem. With our implementation (knn = 10 and on the “Oxford5K+Flickr100K” dataset), the average response time of hard-quantization is about 0.14 seconds while that of our approach is around 0.93 seconds. Fortunately, the response time is still acceptable for most visual search scenarios. Moreover, we noticed that there have been some recent research efforts [31] on accelerating the search speed for long queries.

To provide a vivid impression of the performance, we show four illustrative examples in Fig. 10. The search was performed on the “Oxford5K+Flickr100K” dataset. For each example, the precision-recall curves of the three approaches in Table 2 were drawn. From the precision-recall curves, it is clear that the gain of the proposed approach is mainly from the improvement of recall, which is exactly our purpose.

Table 6: Comparison of the overall performances of different approaches on various datasets.

                           BOF                            BOF + RANSAC                   BOF + RANSAC + QE
                           hard-bof  l2-soft  syn-expand  hard-bof  l2-soft  syn-expand  hard-bof  l2-soft  syn-expand
    Oxford5K               0.708     0.718    0.755       0.738     0.751    0.794       0.787     0.790    0.811
    Paris6K                0.696     0.705    0.733       0.702     0.710    0.737       0.776     0.776    0.791
    Oxford5K+Flickr100K    0.672     0.688    0.736       0.707     0.722    0.773       0.758     0.767    0.797
    Paris6K+Flickr100K     0.673     0.688    0.722       0.681     0.696    0.730       0.766     0.770    0.785

Figure 10: An illustration of search results and precision-recall curves for 4 example query images, all_souls_1 (upper-left), bodleian_3 (upper-right), ashmolean_2 (bottom-left) and christ_church_5 (bottom-right), on the “Oxford5K+Flickr100K” dataset. The three rows of search results correspond respectively to hard-quantization, soft-quantization, and contextual synonym-based expansion. False alarms are marked with red boxes.

5. CONCLUSION

In this paper we introduced a contextual synonym dictionary to the bag-of-feature framework for large-scale visual search. The goal is to provide an unsupervised method to reduce the semantic gap in visual word quantization. Synonym words are considered to describe visual objects with the same semantic meanings, and are identified via measuring the similarities of their contextual distributions. The contextual distribution of a visual word is based on the statistics averaged over all the image patches having this word, and contains both co-occurrence and spatial information. The contextual synonym dictionary can be efficiently constructed, and can be stored in memory for visual word expansion in online visual search. In brief, the proposed method is simple and cheap, and has been proven effective by extensive experiments.

This paper describes our preliminary study on leveraging contextual synonyms to deal with the semantic gap; there is still much room for future improvement, e.g., exploring a better strategy to deal with over-splitting and the semantic gap simultaneously. Moreover, in future work we will investigate whether the contextual synonym dictionary can help compress a visual vocabulary. Another direction is to apply the synonym dictionary to applications other than visual search, e.g., discovering better visual phrases for visual object recognition.

6. REFERENCES

[1] Y. Cao, C. Wang, Z. Li, L. Zhang, and L. Zhang. Spatial-bag-of-features. In CVPR, pages 3352–3359, 2010.
[2] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.
[3] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, 2004.
[4] E. Gavves and C. G. M. Snoek. Landmark image retrieval using visual synonyms. In ACM Multimedia, pages 1123–1126, 2010.
[5] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, pages I: 304–317, 2008.
[6] H. Jegou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. IJCV, 87(3):316–336, 2010.
[7] H. Jegou, H. Harzallah, and C. Schmid. A contextual dissimilarity measure for accurate and efficient image search. In CVPR, 2007.
[8] Y. Kuo, H. Lin, W. Cheng, Y. Yang, and W. Hsu. Unsupervised auxiliary visual words discovery for large-scale image object retrieval. In CVPR, 2011.
[9] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
[10] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[11] H. Ma, J. Zhu, M. R. Lyu, and I. King. Bridging the semantic gap between image contents and tags. IEEE Transactions on Multimedia, 12(5):462–473, 2010.
[12] A. Mikulík, M. Perdoch, O. Chum, and J. Matas. Learning a fine vocabulary. In ECCV, pages 1–14, 2010.
[13] K. Min, L. Yang, J. Wright, L. Wu, X.-S. Hua, and Y. Ma. Compact projection: Simple and efficient near neighbor search with practical memory requirements. In CVPR, pages 3477–3484, 2010.
[14] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.
[15] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[16] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
[17] J. Philbin, M. Isard, J. Sivic, and A. Zisserman. Descriptor learning for efficient retrieval. In ECCV, pages 677–691, 2010.
[18] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, 2003.
[19] T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: A survey. Foundations and Trends in Computer Graphics and Vision, 3(3):177–280, 2007.
[20] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. PAMI, 32(9):1582–1596, 2010.
[21] J. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek. Visual word ambiguity. PAMI, 32(7):1271–1283, 2010.
[22] E. M. Voorhees. Query expansion using lexical-semantic relations. In SIGIR, pages 61–69, 1994.
[23] S. A. J. Winder and M. Brown. Learning local image descriptors. In CVPR, 2007.
[24] X. Wu, W. Zhao, and C.-W. Ngo. Near-duplicate keyframe retrieval with visual keywords and semantic context. In CIVR, pages 162–169, 2007.
[25] Z. Wu, Q. Ke, M. Isard, and J. Sun. Bundling features for large scale partial-duplicate web image search. In CVPR, pages 25–32, 2009.
[26] J. S. Yuan, J. B. Luo, and Y. Wu. Mining compositional features for boosting. In CVPR, 2008.
[27] J. S. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: from visual words to visual phrases. In CVPR, 2007.
[28] Z. S. Harris. Mathematical Structures of Language. Interscience Publishers, New York, 1968.
[29] S. Zhang, Q. Huang, G. Hua, S. Jiang, W. Gao, and Q. Tian. Building contextual visual vocabulary for large-scale image applications. In ACM Multimedia, pages 501–510, 2010.
[30] S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li. Descriptive visual words and visual phrases for image applications. In ACM Multimedia, pages 75–84, 2009.
[31] X. Zhang, Z. Li, L. Zhang, W.-Y. Ma, and H.-Y. Shum. Efficient indexing for large scale visual search. In ICCV, pages 1103–1110, 2009.
[32] Y. M. Zhang and T. H. Chen. Efficient kernels for identifying unbounded-order spatial features. In CVPR, pages 1762–1769, 2009.
[33] Y. M. Zhang, Z. Y. Jia, and T. H. Chen. Image retrieval with geometry preserving visual phrases. In CVPR, 2011.
[34] Y.-T. Zheng, M. Zhao, S.-Y. Neo, T.-S. Chua, and Q. Tian. Visual synset: Towards a higher-level visual representation. In CVPR, 2008.
