Query Expansion for Hash-based Image Object Retrieval Yin-Hsi Kuo1, Kuan-Ting Chen2, Chien-Hsing Chiang1, Winston H. Hsu1,2

1Dept. of Computer Science and Information Engineering, 2Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan

ABSTRACT An efficient indexing method is essential for content-based image retrieval with the exponential growth in large-scale videos and photos. Recently, hash-based methods (e.g., locality sensitive hashing – LSH) have been shown efficient for similarity search. We extend such hash-based methods for retrieving images represented by bags of (high-dimensional) feature points. Though promising, hash-based image object search suffers from low recall rates. To boost the hash-based search quality, we propose two novel expansion strategies – intra-expansion and inter-expansion. The former expands more target feature points similar to those in the query, and the latter mines those feature points that shall co-occur with the search targets but are not present in the query. We further exploit variations of the proposed methods. Experimenting on two consumer-photo benchmarks, we show that the proposed expansion methods are complementary to each other and can collaboratively contribute up to 76.3% (average) relative improvement over the original hash-based method.

Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms Algorithms, Experimentation, Performance

Keywords Locality sensitive hashing (LSH), Query expansion

1. INTRODUCTION The exponential growth of photos and videos, whether in media-sharing sites, business stock collections, or personal collections, has created the need for efficient content-based image retrieval (CBIR), which helps locate similar images in large-scale collections. In recent years, researchers have focused on the more challenging problem of image object retrieval [21][23]. Image object retrieval aims to retrieve images that contain the visual object shown in the object query image, for example, searching for images that contain "Torre Pendente Di Pisa" or the Starbucks logo (cf. Figure 7(c)(d)). Such techniques also motivate many promising applications such as exploring photo collections in 3D [25], photo-based question answering [31], video advertising by image matching [21], annotation by search [27], etc.

The traditional solutions for CBIR employ global low-level features like color and texture. By selecting proper feature representations and distance metrics, the similarities between query and database images can be calculated and a ranking list generated accordingly. Prominent CBIR systems that use these techniques include QBIC (Query By Image Content) [15] and VisualSEEK [24].

However, for image object retrieval, the query can be a full image or just part of a whole image, which we call an object-level image. As shown in Figure 2(a), the red rectangle represents an object-level query image. Similarly, the object may occupy only part of the target database images. Global color and texture features are of limited use under these conditions. To capture the local image information that is essential in object retrieval, Lowe proposed the scale-invariant feature transform (SIFT) [19]. The SIFT feature works by first detecting salient regions in an image and then describing each region with a 128-d feature vector. Its advantage is that both the spatial and appearance information of each local region are recorded, with built-in invariance to modest changes in object scale or camera viewpoint. As a result, an image can be viewed as a bag of feature points, and any object within the image is a subset of the points.

With the bag-of-feature-points representation, image retrieval is carried out by comparing all feature points in the query image to those of all images in the database. Since an image typically contains hundreds to thousands of feature points, an image database can easily have millions or even billions of points. Therefore, image retrieval based on the bag-of-feature-points representation is in fact a large-scale many-to-many matching problem in a high-dimensional space. Indexing the data so that similarity search can be performed effectively and efficiently is therefore crucial. A hash-based method known as locality sensitive hashing (LSH) [16] has been shown successful in performing similarity search on various types of data including text [26], audio [8][6], images [18] and videos [13]. LSH has the important characteristic that, given a distance metric, hash functions are designed so that similar data have a much higher probability of hashing into the same bucket than dissimilar data. As a result, when we evaluate a query, only a small portion of the data points, those that hash into the same bucket as the query point, need to be examined. Also, multiple hash functions and hash tables are employed to improve robustness against false negatives. LSH runs in sublinear time and scales well with data dimensionality [16]. See Section 3 for more details. The combination of the SIFT feature and LSH has been shown effective in image duplicate detection [18].

Figure 1(a) shows an illustration of searching by LSH over the bag-of-feature-points representation. The red rectangle is the object query image, which contains four query feature points (A, B, C, and D). Each query point is used to retrieve database points that collide (hash into the same bucket) with it in any of the hash tables. After examining all query points, the retrieved database points are analyzed to identify the database images associated with them and also the degree of relevance of these images to the query. For simplicity of illustration, relevance here is defined as the number of matching feature point pairs between query and database images. Although this retrieval model succeeds at retrieving target images with high precision, it suffers from low recall rates. Figure 2(c) shows a real retrieval case with basic LSH. While most of the top-ranked images are correct, some false positives are high up in the list, too.


(a) Basic LSH and query result (b) Intra-expansion result (c) Inter-expansion result

Figure 1: Query expansion for hash-based image retrieval. (a) An illustration for search by hashing over bag-of-feature-points (i.e., {A, B, C, D}) and ranking by the number of matched pairs (e.g., 4, 2, 2, 1, 1); the query image is on the left; since only the exact (in the same bucket) feature points are considered, the original LSH suffers from a low recall rate. We tackle the problem by proposing two expansion strategies for hashing. (b) Intra-expansion expands more target feature points (e.g., A', A'', B', C', etc.) similar to those in the query but mis-hashed to other buckets. (c) Inter-expansion mines those feature points (e.g., {E, F, G}) that shall co-occur with the search targets by exploiting the related hash buckets from the initial search result; more diverse (and related) results can be retrieved. Note that the two expansion strategies can be combined iteratively to boost the search results, and the expansions are realized efficiently by merely looking up the hash buckets.

The recall is low because even though LSH guarantees a high probability of hashing similar data into the same buckets, this probability is not 1. It is unavoidable that some similar features are hashed into neighboring buckets and LSH fails to retrieve them at query time. Also, there may be features that strongly characterize the query object but are not present in the query image or simply appear very different from the query features. This can be the result of changes in lighting conditions, camera angles, occlusion effects, etc. To tackle this problem, we introduce query expansion¹ techniques for hash-based image object retrieval. We propose two expansion strategies – intra-expansion and inter-expansion. Intra-expansion uses existing query features to obtain more similar features as matching targets. For example, in Figure 1(b), features A' and A'' are similar to feature A (the distance between A' and A is smaller than a threshold) but are not found by basic LSH. Intra-expansion discovers A' and A'' by inspecting the buckets neighboring A, and they can be used as target features once identified. Inter-expansion obtains new query features that are not present in the query image but are mined from the initial search results. To do this, we choose some likely correct images (e.g., top-ranked images) and issue their relevant features as new queries. For example, in Figure 1(c), features E, F and G are not present in the query image, but they appear alongside correctly matched target features in the initial search results. By inter-expansion, they are considered relevant to the query and issued as new queries. Their query results are factored into the final ranking list returned to the user. Intra- and inter-expansion expand two different types of new queries. Both can improve retrieval performance significantly, and they can be combined iteratively to obtain even better results. Note that such expansion methods are realized by searching over hash tables and buckets only and incur little extra time while obtaining new query features. Experimenting over photo search benchmarks, we show that the proposed methods can achieve up to 76.3% (average) relative improvement over the state-of-the-art hash-based image search method (i.e., [18]). The recall is greatly improved as well, as shown in Figure 2.

¹ In the information retrieval community, query expansion involves evaluating a user's input (what words were typed into the search query area, and sometimes other types of data) and expanding the search query to match additional documents [28].

The primary contributions of this paper include:

• The first proposal of intra- and inter-expansion for hash-based image object retrieval.

• Investigating effective and efficient algorithms for implementing the proposed expansion methods (Section 4).

• Conducting extensive experiments on large-scale benchmarks (Sections 5 and 6) and exemplifying the significant gains of the proposed methods.

2. RELATED WORK In most multimedia retrieval works, researchers face two important issues: the feature representation and the means to calculate similarity between data objects effectively and efficiently. Repeatedly solving the nearest neighbor (NN) problem between features is often required while addressing these issues. Researchers have argued and proved that by allowing a small error bound in solving NN, time efficiency can be greatly improved while the performance degradation remains acceptable [4]. NN with the addition of an error bound is known as the approximate nearest neighbor (ANN) problem. Two popular methods for solving ANN are KD-trees [4] and locality sensitive hashing (LSH) [16]. A KD-tree is a search tree that splits according to the data distribution along a single dimension at a time. Data points are stored at leaf nodes, and the internal nodes store splitting criteria that can be used to speed up the verification of pruning opportunities. However, when the number of data dimensions exceeds 20, KD-trees suffer from the curse of dimensionality and their pruning mechanism becomes ineffective [16].

On the other hand, LSH has gained popularity in recent years because of its ability to deal with data features of even higher dimensionality while keeping the running time satisfactory. The basic idea of LSH is to hash data points in a way that close points in feature space have a higher chance of collision than far-apart points. LSH was first proposed in [17], along with a rigorous mathematical description. [16] implements LSH in Hamming space and evaluates its performance with low-level multimedia features in high dimensions. Ke et al. [18] applied LSH in Hamming space to a variant of the SIFT feature and achieved very good performance in detecting near-duplicate images.


Figure 2: For query image (a), basic LSH returns result (c), which has high precision but low recall. With the proposed query expansion methods, more accurate and diverse results can be retrieved, as shown in (d), (e) and (f). The number below each image is its rank in the retrieval result. The number in parentheses represents its rank with basic LSH. (b) shows a comparison of PR curves for these results. It is clear that the proposed expansion methods can greatly improve search performance.

However, the ground truth images in their experimental dataset are obtained by performing various image transformations on the query images. In consumer photos, rather than duplicates, the ground truth images are photographs of real-world objects taken under much more diverse capturing conditions (as depicted in Figure 7), and we are interested in retrieving those designated objects in consumer photos. This results in far greater differences in feature points for the same object. A direct LSH approach to finding similar points is not sufficient to deal with the vast variability present within consumer photos. We introduce query expansion techniques to address this problem.

LSH was later extended to work with Euclidean distance by adopting a class of p-stable distribution functions [12]. E2LSH [1] is a publicly available software package that implements the algorithms in Euclidean space. One disadvantage of LSH is that, by using multiple hash tables, the space and time requirements become a burden. The authors of [20] propose multi-probe LSH, where multiple buckets in a hash table are inspected to retrieve more diversified results. In exchange, fewer hash tables and less running time are needed to achieve the same performance.

[13] employs LSH to project feature points into an auxiliary space and represent images as random histograms therein. Support vector machines (SVMs) are used to learn classifiers on object classes in this auxiliary space. The use of random histograms bypasses the many-to-many matching problem encountered when features are compared directly for similarity. However, the random histogram is a global summary of local features which does not preserve important spatial and appearance information, and spatial verification is not possible.

While appropriate for classification, it is not suitable for image object retrieval, where the local distribution of features is the key rather than the global composition.

Another popular approach in object retrieval is to use the bag-of-visual-words representation and adopt traditional retrieval techniques for textual words [23]. In this approach, a set of training feature points is clustered and the centroid of each resulting cluster is defined as a visual word. Once a set of visual words is obtained, every feature can be represented by its most similar visual word. An image is thus a bag of visual words, described by a frequency histogram of the visual words, which in turn is used in calculating similarity scores. Due to the nature of clustering algorithms, the quantization of feature points into visual words can be a noisy process [11]. To attenuate the problem, [33][34] both consider quantization ambiguity (by soft assignment of visual words) when calculating the global visual word histogram.

Object retrieval with the visual word model still suffers from low recall rates. The authors of [11] proposed multiple image resolution expansion, where correct entries in the initial retrieval result are analyzed to obtain latent images, i.e., estimated visual word histograms of the query object as if shot at a different resolution. The latent images are issued as new queries to retrieve more diversified results. This is similar to the inter-expansion proposed in our work, in that both expand by characteristics of verified correct images in the initial result. The difference is that instead of using estimated visual word histograms as new queries, we use feature points that are verified to be important for the query object.

Also, a visual word histogram is an aggregation of local features. If an image has multiple salient objects or a complex background with many interfering visual words, its histogram does not describe any object well. By working with the SIFT features directly, we avoid introducing extra noise from the quantization of SIFT features or the aggregation of visual words. Meanwhile, we propose novel expansion methods integrating both inter-expansion and intra-expansion strategies, which will be shown to be complementary to each other and to significantly improve hash-based image object retrieval (by up to 76.3%).

3. SYSTEM OVERVIEW AND LSH For image object retrieval, we adopt bag-of-feature-points image representation and employ the hash-based indexing method. To improve the search recall, we will extend the hash method (i.e., LSH) by two novel expansion strategies. We first provide the LSH overview in Section 3.1 and explain the matching for bags of feature points in Section 3.2. Two ranking criteria for image object retrieval are introduced in Section 3.3. Based on them, we detail the hash-based query expansion methods in Section 4.

3.1 LSH Overview LSH [16] has been shown effective and efficient for many multimedia (high-dimensional) retrieval applications [8][6][18][13]. The essence is the hash functions, which assure that similar data have a much higher probability of hashing into the same bucket than dissimilar data. There are plenty of hash functions that can achieve this goal, such as transformation into Hamming space [17][16], L1 distance [3], min-wise hashing [5], random projection [10], or the stable distribution method [12], etc. Applicable to the L2 distance metric, the stable distribution hash function [12] is widely adopted. In this work, without loss of generality, we build on the hash framework of [12] and its implementation [1].


Figure 3: An illustration for LSH with feature points (K=2, L=2, cf. Section 3.1); e.g., the triangle is hashed into bucket (2, 2) in hash table 1 and bucket (1, 2) in hash table 2. It also shows the need for intra-expansion. Assume feature points with the same shape (or color) are from the same image. If the star is a query feature, we can retrieve two colliding features across the two hash tables. The circle is actually missed because it resides in a nearby bucket, though it is still within (Euclidean) distance R. This example motivates intra-expansion by further checking neighboring buckets (cf. Section 4.1.2) or using co-occurrence across hash tables (cf. Section 4.1.1). For example, we can retrieve the circle in hash table 2 via the triangle that collides with the query in hash table 1.

(a) Images (b) Feature points (c) Hash tables (d) Ranking
Figure 4: Illustration of image search by hashing over bags of feature points. Each feature point of a database image is associated with (a) the image id it belongs to and (b) a unique feature id (i.e., A–H), and is then hashed into (c) multiple hash tables (cf. Section 3.1). To retrieve similar images for a query image with a single feature point, we simply hash the query feature point(s) into the buckets across hash tables (e.g., the grey buckets in (c)). An intuitive way to score target image similarity is simply counting the number of its feature points that reside in the same buckets as the query feature points, as shown in (d).

For LSH by the stable distribution method [12], the hash function $h(\vec{v})$ is defined as follows:

$h(\vec{v}) = \left\lfloor \frac{\vec{a} \cdot \vec{v} + b}{W} \right\rfloor$   (1)

where $\vec{v}$ represents a feature point, $\vec{a}$ is a vector sampled from a Gaussian distribution, $W$ is the window size, and $b$ is a random offset value ranging from 0 to $W$. $h(\vec{v})$ computes the inner product $\vec{a} \cdot \vec{v}$, projecting the input feature point onto $\vec{a}$; the projection is then windowed by $W$, as illustrated in Figure 3.
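To make Equation 1 concrete, here is a minimal sketch of the bucket-key computation in Python (illustrative only; the experiments in this paper use the E2LSH package [1]). The class and parameter names are our own; K = 10 and W = 400 mirror the baseline configuration reported in Section 5.3.

```python
import numpy as np

class StableHash:
    """Bucket-key generator for one hash table, per Equation 1 (a sketch)."""
    def __init__(self, dim, K=10, W=400.0, rng=None):
        rng = rng or np.random.default_rng(0)
        self.a = rng.normal(size=(K, dim))    # one Gaussian a-vector per hash function
        self.b = rng.uniform(0.0, W, size=K)  # random offsets in [0, W)
        self.W = W

    def key(self, v):
        # h(v) = floor((a . v + b) / W); the K values jointly form the bucket key.
        return tuple(np.floor((self.a @ v + self.b) / self.W).astype(int))
```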

Generally, K hash functions are considered to provide discriminability among high-dimensional data points. For a large K, two feature points are very likely close to each other (in the original feature space) if they are still hashed to the same bucket through all K hash functions. However, false negatives, those true neighbors mis-hashed to other buckets, are commonly observed; for example, see the feature points in bucket (2, 2) and bucket (3, 2) of hash table 1 in Figure 3; the two feature points are actually close to each other but hashed to different buckets. To remedy the problem, multiple hash tables (parameterized by L) are considered to improve the robustness against such false negatives; for example, the two points collide in the same bucket in hash table 2. Aggregating multiple tables, a query point can treat those feature points residing in the same buckets across hash tables as its (approximate) nearest neighbors; i.e., in Figure 3, the points in bucket (2, 2) of hash table 2 and bucket (2, 2) of hash table 1 are neighbors of the query. This is efficient since retrieving these buckets generally requires constant time.

Empirically, the effectiveness and efficiency of LSH are parameterized by the number of hash functions K and the number of hash tables L. A larger K will have fewer feature points colliding in the same bucket; similarly, increasing L will increase the candidate nearest neighbors since more buckets are included across the tables. We will evaluate the impacts of K and L for image object retrieval in Section 5.3.

3.2 Feature Points Matching in LSH Originally, hash-based methods support problems where either the query or a target in the database is represented by a single (global) feature. The retrieval is intuitive by one-to-many search in LSH, and the candidate targets are those that collide with the query in the buckets across hash tables.

With the bag-of-feature-points representation, image retrieval is essentially a many-to-many matching problem in a high-dimensional space. An image typically contains hundreds or thousands of feature points (see a few of them in Figure 6). The image database generally contains millions or even billions of feature points (i.e., 128-dimensional SIFT features). Feature points of the same image are associated with a unique image id. We then hash all feature points into the L hash tables.

When querying an image, we issue an LSH query for each feature point independently. Each query point is used to retrieve database points that collide (hash into the same bucket) with it in any of the hash tables. After examining all query points, the retrieved database points are analyzed to identify the database images associated with them. For each retrieved image, we then know which of its feature points have initial matches to those in the query image. The naive way to rank these database images is by the number of possible matches between the query and database images. However, such a ranking criterion is poor since feature point matching through bucket lookup alone is noisy. We will further inspect the candidate matches between the query and database images and filter out false positives (Section 3.3.1).
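The following sketch outlines this indexing and bucket-lookup flow, assuming the StableHash sketch above and plain Python dictionaries as illustrative stand-ins for the actual E2LSH structures:

```python
from collections import defaultdict

def build_index(db_points, hashers):
    # db_points: list of (image_id, feature_id, vector); hashers: one StableHash per table (L total)
    tables = [defaultdict(list) for _ in hashers]
    for img_id, feat_id, vec in db_points:
        for table, h in zip(tables, hashers):
            table[h.key(vec)].append((img_id, feat_id, vec))
    return tables

def query_image(query_vecs, tables, hashers):
    votes = defaultdict(int)  # naive ranking: count candidate matches per image
    for q in query_vecs:
        seen = set()
        for table, h in zip(tables, hashers):
            for img_id, feat_id, vec in table.get(h.key(q), []):
                if (img_id, feat_id) not in seen:  # count each colliding point once per query point
                    seen.add((img_id, feat_id))
                    votes[img_id] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])
```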

3.3 Filtering and Ranking Functions For each database image retrieved, we have a set of initial (noisy) matches to the corresponding feature points in the query image (cf. Figure 4). To improve the matching quality, we employ two filtering methods: (1) inspecting the matches between feature pairs by feature distance and (2) employing spatial consistency between (true) matching pairs. We also introduce two image ranking measures. Note that the filtering methods are conducted one by one for each retrieved image, with reference to the query.

3.3.1 Nearest neighbor filtering and similarities A naïve way to filter out the (noisy) matches between the query and a retrieved image is to simply reject those candidate points with (Euclidean) distance greater than a threshold. However, such distance-threshold filtering is not adaptive, and it is difficult to determine a proper threshold. It has been observed that a correct match needs to have its closest matching point significantly closer than the closest incorrect match, while false matches have a certain number of other close false matches, due to the high dimensionality of the feature space [19]. We therefore filter out those matching pairs whose ratio of the distances to the first and second nearest neighbors is larger than a threshold (i.e., 0.8).

For the matched feature pairs, we transform the feature distance into a similarity score. The ranking score $R_S$, the matching similarity of the query image Q to a database image I, is defined as:

$R_S(I,Q) = \sum_{v_i \in I,\, v_j \in Q} e^{-d(v_i, v_j)/\sigma}$   (2)

where $(v_i, v_j)$ are matched pairs between images I and Q, and $d(v_i, v_j)$ is the L2 distance of the match; here σ = 200 is empirically determined². Generally, a database image will have a higher similarity score if it has more matches to the query image.

² Experiments show that σ is not sensitive within a reasonable range. We take σ = 200 for the following experiments.
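A sketch of the two steps in this subsection, the nearest neighbor ratio filter and the similarity score of Equation 2, using the stated parameters (ratio threshold 0.8, σ = 200); candidate vectors are assumed to be gathered per retrieved image as in Section 3.2:

```python
import numpy as np

def ratio_filter(query_vecs, cand_vecs, ratio=0.8):
    """Keep pairs whose 1st-NN distance is below ratio * 2nd-NN distance."""
    matches = []
    for q in query_vecs:
        if len(cand_vecs) == 0:
            break
        d = np.linalg.norm(cand_vecs - q, axis=1)      # L2 distances to all candidates
        if len(d) == 1:
            matches.append((q, cand_vecs[0], d[0]))    # single neighbor: treated as matched (cf. Section 5.3)
            continue
        i1, i2 = np.argsort(d)[:2]
        if d[i1] < ratio * d[i2]:                      # Lowe's ratio test [19]
            matches.append((q, cand_vecs[i1], d[i1]))
    return matches

def similarity_R_S(matches, sigma=200.0):
    """Equation 2: R_S(I, Q) = sum over matched pairs of exp(-d(vi, vj) / sigma)."""
    return float(sum(np.exp(-d / sigma) for _, _, d in matches))
```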

3.3.2 Spatial verification and matching inliers Another important cue for image object retrieval is the spatial relationship between the matched image objects. Besides filtering by feature point similarities, we also investigate filtering by spatial verification – exploiting a geometry model between matching candidates to reject false positive matches. After spatial verification, we only keep those feature points that follow the spatial consistency (e.g., only 11 remain in Figure 6). The approach is closely related to the well-known RANdom SAmple Consensus (RANSAC) algorithm [14]. A basic assumption is that the data consist of "inliers" and "outliers." The former are consistent with the estimated (spatial) model and can be explained by some set of model parameters; the latter are items that do not fit the model. The intuition is that if any sample used for model estimation is an outlier, the estimated geometry model will not gain much support. With the RANSAC algorithm, we can estimate the possible geometry model between the query Q and the inspected image I and use the model to further determine the inliers and outliers among candidate feature point matches. Meanwhile, we can also estimate the matched region in the target image for further applications. See more details in [19]. A matched region is illustrated in Figure 6.

Hence, after spatial verification, another ranking score between the query image Q and the target image I is given by the number of inliers, defined as follows:

$R_L(I,Q) = \mathrm{inliers}(I,Q)$   (3)

For example, Figure 6 illustrates the 11 matches between the query object and a retrieved image after spatial verification; i.e., $R_L(I,Q) = 11$. Note that spatial verification is to date very time-consuming. In our experiments, we mainly adopt the prior nearest neighbor ratio filtering for the baseline. For efficiency, we only conduct spatial verification on the few top-ranked initial results, as in [23], where an efficient spatial verification method is also proposed.
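One common concrete realization of this verification step (a sketch under our own choices, not necessarily the authors' exact geometry model) is RANSAC homography estimation as provided by OpenCV; the inlier count from the returned mask yields R_L, and the estimated model can project the query box onto the matched region of Figure 6:

```python
import numpy as np
import cv2

def spatial_verification(src_pts, dst_pts, reproj_thresh=5.0):
    """Estimate a geometry model with RANSAC and return R_L = inlier count.

    src_pts / dst_pts: (N, 2) arrays of matched keypoint locations in the
    query and target images (N >= 4 is required for a homography).
    """
    if len(src_pts) < 4:
        return 0, None
    H, mask = cv2.findHomography(
        np.float32(src_pts).reshape(-1, 1, 2),
        np.float32(dst_pts).reshape(-1, 1, 2),
        cv2.RANSAC, reproj_thresh)
    inliers = int(mask.sum()) if mask is not None else 0
    return inliers, H   # H can project the query box to the matched region (cf. Figure 6)
```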

4. QUERY EXPANSION To boost hash-based image object retrieval, we propose two expansion strategies – intra-expansion and inter-expansion, which are explained in the following sections. We will show that combining both expansions can collaboratively boost the performance gains over conventional hash-based image object retrieval (Section 6.3).

4.1 Intra-expansion LSH-like methods suffer from low recall rates [23][18][11], as similar feature points are likely to be mis-hashed to other buckets. Traditionally, a large number of hash tables is utilized to ease this problem. However, extra hash tables consume huge memory and are infeasible for large-scale image object retrieval by bags of feature points. Instead, we propose novel intra-expansion methods aimed at expanding more target feature points similar to those in the query. To optimize the performance, we investigate variant methods for effective intra-expansion: (1) associating feature points by co-occurrence across hash tables, (2) probing buckets neighboring the initially hashed bucket, and (3) using meta-features to associate related feature points. The impacts of the proposed methods are experimented in Section 6.1. Note that such expansion methods merely look up hash buckets and require no extra hashing overhead beyond the initial image object query. The expansion increases the number of candidate feature point matches; as before, we can reject the false positives effectively through the nearest neighbor ratio introduced in Section 3.3.1.

4.1.1 Matched point in LSH (MP) In this method, we use matched feature points to locate possible buckets that might accommodate similar feature points mis-hashed to other buckets. Each query feature point is hashed into one bucket per table. The candidate feature points that collide with a query feature point can serve as seeds to associate possible buckets in other hash tables that the candidate matches might be hashed to. It is intuitive since, for a matched feature point 'A' that collided with the query in one hash table, the feature points that collide with 'A' in other hash tables might also be candidate matches. See Figure 3 for the illustration. The red star is a feature point of the query image. We can only find two feature points colliding in the same buckets as the query point in the two hash tables. Actually, there is still a blue circle point similar to the query (i.e., within a certain Euclidean distance). In MP, we can associate the missing feature point by inspecting where the purple triangle, matched in hash table 1, has been hashed to in hash table 2. As a result, we can then get the blue feature point, which might associate with a candidate database image.
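A sketch of MP over the illustrative index structures from Section 3.2 (all names are ours): points that collide with the query in one table seed bucket lookups in all tables.

```python
import numpy as np

def intra_expand_MP(query_vec, tables, hashers):
    """Matched-point (MP) expansion: points that collide with the query in
    one table seed bucket lookups in every table (cf. Figure 3)."""
    initial = []
    for table, h in zip(tables, hashers):
        initial.extend(table.get(h.key(query_vec), []))
    expanded = {(img, fid): vec for img, fid, vec in initial}
    for img, fid, vec in initial:                  # each matched point is a seed
        for table, h in zip(tables, hashers):      # inspect the seed's buckets everywhere
            for c_img, c_fid, c_vec in table.get(h.key(np.asarray(vec)), []):
                expanded.setdefault((c_img, c_fid), c_vec)
    return expanded                                # (image_id, feature_id) -> vector
```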

4.1.2 Multi-probe LSH (MPL) The prior method associates candidate buckets by feature point co-occurrence. It is less efficient since, with point-based association, the time complexity grows linearly with the number of feature points in the same bucket. Another perspective is to locate the possible buckets neighboring the initial bucket that the query point is hashed to, as motivated by the "multi-probe LSH" proposed in [20]³. It is intuitive since the neighboring buckets are geometrically close to the query feature point and very likely to accommodate mis-hashed feature candidates.

The probing sequence for neighboring buckets is not randomly chosen but considers the distance to the initially hashed bucket. See the example in Figure 3: feature point 'star' is hashed to bucket (2, 2) in hash table 1; its first bucket to probe in MPL will be bucket (2, 3) (above), since it is the closest to the query feature 'star'; the next is the bucket to the right, (3, 2), etc. See more details in [20]. Note that the number of probes into neighboring buckets is an important factor for performance and efficiency. We will conduct a sensitivity test in Section 6.4.1.
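Below is a simplified sketch of the probing order (an assumption-laden simplification of the query-directed probing in [20]: we only generate single-coordinate perturbations of the bucket key, ordered by how close the projection falls to the neighboring slot):

```python
import numpy as np

def probe_sequence(hasher, v, n_probes=100):
    """Yield bucket keys near v's initial bucket, closest perturbations first.

    hasher is assumed to expose a, b, W as in the StableHash sketch above.
    """
    proj = (hasher.a @ v + hasher.b) / hasher.W
    base = np.floor(proj).astype(int)
    frac = proj - base                       # position inside the slot, in [0, 1)
    perturbs = []
    for i, f in enumerate(frac):             # nearer and farther neighboring slot per function
        perturbs.append((min(f, 1.0 - f), i, -1 if f < 0.5 else +1))
        perturbs.append((max(f, 1.0 - f), i, +1 if f < 0.5 else -1))
    perturbs.sort()                          # probe geometrically closer slots first
    yield tuple(base)                        # the initially hashed bucket
    for _, i, step in perturbs[:n_probes]:
        key = base.copy()
        key[i] += step
        yield tuple(key)
```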

³ Note that in [20] multi-probe LSH is designed to reduce hash table size. Instead, in this work, we use it to locate more likely buckets that the candidate matches might reside in.


Figure 5: An illustration of the need for inter-expansion. Each rectangle represents an image with certain feature points. Those of the same shape are assumed matched. The query image can retrieve image A (4 matched pairs) by baseline LSH but cannot find image B (0 matches). However, we can still retrieve image B through image A by including the feature points in image A as the expanded feature points, i.e., inter-expansion. Note that intra-expansion is optional here.

Figure 6: Matches and regions in the target (database) image for inter-expansion. There are actually two regions potential for inter-expansion: region A (red) for the (projected) matched region and B for the whole image. Note that the feature point matches are usually noisy and random. The 11 matches are retained as inliers after spatial verification.


4.1.3 Meta-feature (MF) The prior intra-expansion methods, MP and MPL, both depend on cues provided by the hash structure. We investigate another intra-expansion method that associates feature points by discovering meta- (or representative) features in the original feature space (i.e., SIFT). A meta-feature matched to a query feature point is then used to bring in the set of feature points that the meta-feature represents.

A meta-feature uses one feature point to represent a set of similar feature points in the original feature space. A clustering step over all the feature points of the database images is required to locate the meta-features. We mark each cluster center as a meta-feature and record the feature points belonging to it (in the same cluster). The meta-features are then hashed into the hash tables. In the query phase, we retrieve all meta-features that collide with the query and then use the matched meta-features as seeds to include the candidate feature points associated with them.

Note that since clustering a large number of high-dimensional data points is very time-consuming, we adopt hierarchical k-means (HKM) [22] as the clustering method. It has been shown much more efficient than conventional clustering methods (e.g., flat K-means). Empirically, we take 10K cluster centers in HKM as meta-features.
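A sketch of building meta-features; for brevity it uses scikit-learn's MiniBatchKMeans as a stand-in for the hierarchical k-means (HKM) [22] actually used, with the 10K centers noted above:

```python
from sklearn.cluster import MiniBatchKMeans

def build_meta_features(db_vecs, n_meta=10_000, seed=0):
    """Cluster database features; centers become meta-features, and each
    meta-feature records the member points it represents."""
    km = MiniBatchKMeans(n_clusters=n_meta, random_state=seed, n_init=3).fit(db_vecs)
    members = [[] for _ in range(n_meta)]
    for idx, c in enumerate(km.labels_):
        members[c].append(idx)               # feature indices per meta-feature
    # km.cluster_centers_ are then hashed into the LSH tables; a query that
    # collides with a meta-feature brings in all of its member points.
    return km.cluster_centers_, members
```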

4.2 Inter-expansion Due to changes in lighting conditions, camera angles, occlusions, etc., there are feature points that strongly characterize the query object but are not present in the query image or appear very different from the query features. Such issues cannot be solved by the intra-expansion methods mentioned above. We propose to seek a solution from the initial query results via inter-expansion.

Figure 5 illustrates the intuition for inter-expansion. For example, the query image can robustly retrieve image 'A' through LSH methods (or enhanced by the intra-expansion of Section 4.1) with 4 matches, but fails to retrieve image 'B' (no matches). However, we can associate image 'B' by taking image 'A' as a new query image. Similar behaviors have been observed in pseudo-relevance feedback (PRF) for text-based retrieval [7][28] and video retrieval [29]. For optimizing inter-expansion in hash-based image object retrieval, we investigate several factors parameterizing the retrieval results: (1) whether any filtering process is required for determining the seeded images from the initial search result for expansion (more details in Section 3.3.1), (2) the proper region for expansion, the region of interest or the entire image (more details in Section 3.3.2 and Figure 6), and (3) effective fusion methods and similarity measures for fusing the multiple image ranking lists expanded from the initial search result (cf. Figure 5). Note that such inter-expansion methods are mostly realized by searching related feature points in the buckets; the most time-consuming part is spatial verification (cf. Section 3.3.2) for evaluating matching inliers.

4.2.1 Pseudo-relevance feedback (PRF) PRF is the most intuitive method for inter-expansion. It was initially introduced in [7], where the top-ranking documents are used to rerank the retrieved documents under the assumption that a significant fraction of the top-ranked documents will be relevant. This is in contrast to relevance feedback, where users explicitly provide feedback by labeling the top results as positive or negative.

For inter-expansion, we automatically expand the retrieved images by issuing new queries with the top-ranking images from the initial search result, since some characteristic feature points relevant to the search targets might exist not in the query image but only in the search results. Figure 5 demonstrates the need for and the process of PRF (or inter-expansion). It simply assumes that the top retrieved images are correct, which might be the case for text retrieval but not for photo or video search [29]; the latter generally contains noisier high-dimensional data and often mistakenly includes false positives as the seeded images for expansion. We will exploit further methods to verify the need for image filtering (i.e., spatial verification). Meanwhile, selecting the number of top ranks for image expansion is not intuitive and is mostly query-dependent. The fusion of multiple expanded lists is discussed in Section 4.2.4.
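A high-level sketch of PRF-style inter-expansion; `search` and `get_image_vecs` are hypothetical helpers standing in for the hash-based retrieval above, and the top-k cut-off is an assumed parameter:

```python
def inter_expand_PRF(query_vecs, search, get_image_vecs, top_k=5):
    """Pseudo-relevance feedback: re-issue top-ranked images as new queries.

    search(vecs) -> ranked [(image_id, score), ...] (e.g., query_image above);
    get_image_vecs(image_id) -> that image's feature points (hypothetical helper).
    """
    initial = search(query_vecs)
    ranked_lists = [initial]
    for img_id, _ in initial[:top_k]:      # blindly trust the top returns (PRF assumption)
        ranked_lists.append(search(get_image_vecs(img_id)))
    return ranked_lists                    # fused by the methods of Section 4.2.4
```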


Figure 7: Examples from two photo datasets for evaluating image object retrieval. Oxford landmarks, (a) "balliol" and (b) "ashmolean," in the Oxford Buildings dataset. (c) "Torre Pendente Di Pisa" and (d) "Starbucks" in the Flickr11K dataset, which includes multiple buildings and small logos and is more complicated and diverse. Rectangles highlight the objects.

4.2.2 Full image by spatial verification (SVFI) Rather than blindly taking the top images as seeds for inter-expansion as in PRF, we select those images from the initial search result whose matched inliers exceed a threshold δ, as shown in Figure 5. Naturally, images with more inliers to the query are more likely to be true targets. Here the number of inliers between the query image and the target image is determined by the spatial verification discussed in Section 3.3.2. See the example in Figure 6, where we have 11 inliers between the target and query images; region B, the entire target image, is used for further image expansion in SVFI. Note that the threshold δ for determining correct images will be experimented with in Section 6.4.2.

4.2.3 Matched region by spatial verification (SVMR) Similar to SVFI, we need to conduct spatial verification and filter out incorrect images from the initial search result by inlier threshold δ. However, the region for inter-expansion is the matched region corresponding to the query object, i.e., ‘Region A’ in Figure 6. As explained in Section 3.3.2, the region generally corresponds to the region of query interest and can be estimated in the spatial verification process.
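The difference between SVFI and SVMR thus reduces to which feature points of a verified seed image are re-issued; a sketch with hypothetical helpers and an assumed inlier threshold δ:

```python
import numpy as np

def select_seeds(initial_ranked, verify, get_image_vecs, delta=10):
    """Pick verified seed images and the feature region to re-issue.

    verify(image_id) -> (inlier_count, region_mask) from spatial verification
    (Section 3.3.2); region_mask flags points inside the projected match
    region ('Region A' in Figure 6). delta is an assumed threshold value.
    """
    seeds = []
    for img_id, _ in initial_ranked:
        inliers, region_mask = verify(img_id)
        if inliers >= delta:
            vecs = np.asarray(get_image_vecs(img_id))
            seeds.append((img_id, vecs))                  # SVFI: the whole image
            # SVMR variant: seeds.append((img_id, vecs[region_mask]))
    return seeds
```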

4.2.4 Fusion methods As shown in Figure 5, multiple images or image regions from the initial search result will be used for expansion. Generally, each image generates a ranking list through LSH, optionally further enhanced by the intra-expansion methods introduced in Section 4.1. To maximize the performance of fusing the multiple ranking lists from the seeded inter-expansion images, we consider several fusion methods and similarity measures, as suggested in [32]; a code sketch follows the list below.

• Average Score (AVG): in AVG, the expanded image in each ranking list (cf. Figure 5) is scored by the matching similarity R_S in Equation 2. For inter-expansion, the query image is now the seeded image from the initial search result (e.g., image 'A'). The final ranking score for each expanded image is the average of its scores across ranking lists. Note that since LSH-based retrieval only returns a subset of database images, we assume a zero score if the image is not within an expanded ranking list; for example, image 'B' exists only in ranking list 2 but not in ranking list 3, so it is assumed to score zero in ranking list 3. Note that, for AVG, spatial verification is not employed for the expanded ranking lists, which makes it efficient.

• Maximum Score (MAX): same as AVG, except that the final ranking score for each expanded image is the maximum of its scores across ranking lists.

• Average Inliers (AVG_IL): similar to AVG, except that the inlier score R_L in Equation 3 is adopted. Hence, this fusion method requires applying spatial verification over the expanded ranking lists and requires extra computation time.

• Maximum Inliers (MAX_IL): same as AVG_IL, except that the final ranking score for each expanded image is the maximum of its scores across ranking lists.

• Borda Score (Borda): instead of using the matching similarity score R_S or the inlier score R_L, the Borda score scores each expanded image based on its ranking position in each list. For an image at rank i among N in total, its Borda score is 1 − i/N. The final ranking score for each expanded image is the average of its Borda scores across the expanded ranking lists.
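A sketch of the five fusion rules above over ranking lists represented as score dictionaries (an illustrative simplification; the actual system's list handling may differ):

```python
def fuse(ranked_lists, method="AVG"):
    """Fuse expanded ranking lists (Section 4.2.4); each list is a dict
    {image_id: score}, where score is R_S (AVG/MAX) or R_L (*_IL).
    A missing image contributes a zero score, per the text."""
    n = len(ranked_lists)
    if method == "Borda":
        # Replace each score with 1 - i/N for the image at rank i of N.
        converted = []
        for rl in ranked_lists:
            order = sorted(rl, key=rl.get, reverse=True)
            N = len(order)
            converted.append({img: 1 - (i + 1) / N for i, img in enumerate(order)})
        ranked_lists = converted
    images = set().union(*ranked_lists)
    agg = max if method in ("MAX", "MAX_IL") else (lambda s: sum(s) / n)
    fused = {img: agg([rl.get(img, 0.0) for rl in ranked_lists]) for img in images}
    return sorted(fused.items(), key=lambda kv: -kv[1])
```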

4.3 Iteration for Expansion The expanded photos can then iteratively serve as the seeded photos for another round of inter-expansion (optionally enhanced with intra-expansion). The iteration stops when no more new photos are considered relevant (thresholded by the inlier number δ) in the expansion. For example, in Figure 5, from expanded ranking lists 2 and 3 we choose the retrieved images with matching inliers larger than the threshold δ as the new seeded images for the next inter- and intra-expansion. The expanded correct images are collected and, when the iteration stops, ranked by their corresponding ranking measures.
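A sketch of this iteration loop, reusing the hypothetical helpers from the previous sketches:

```python
def iterative_expansion(query_vecs, search, verify, get_image_vecs, delta=10):
    """Alternate expansion rounds until no new verified photos appear.

    verify(image_id) -> (inlier_count, region_mask) as in select_seeds; delta
    is the (assumed) inlier threshold from Section 4.3.
    """
    accepted = {}
    frontier = [query_vecs]
    while frontier:
        next_frontier = []
        for vecs in frontier:
            for img_id, score in search(vecs):     # optionally intra-expanded inside search
                if img_id in accepted:
                    continue
                inliers, _ = verify(img_id)
                if inliers >= delta:               # new relevant photo: expand it next round
                    accepted[img_id] = score
                    next_frontier.append(get_image_vecs(img_id))
        frontier = next_frontier
    return sorted(accepted.items(), key=lambda kv: -kv[1])
```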

5. EXPERIMENTAL SETUPS We evaluate the proposed methods on two photo retrieval benchmarks, Oxford Buildings [23] and Flickr11K, a subset of Flickr550 [30]. Some of the queries and their ground-truth images are exemplified in Figure 7.

5.1 Datasets Oxford Buildings dataset⁴: The Oxford Buildings dataset [23] consists of 5,062 images collected by issuing Oxford landmark names as search keywords on the Flickr website. The dataset includes 11 query categories with 5 queries each. We use the cropped image objects from the authors [23] for the 55 query photos, illustrated in Figure 7 (a) and (b) (query objects are in the red rectangles). The images are downsized to a quarter of their original size to match the image dimensions in our next dataset. The average image size is about 512x374 pixels. There are 7,162,122 feature points in total in the Oxford dataset, on average 1,415 per photo.

Flickr11K dataset: Flickr11K is a larger dataset consisting of 11,282 medium resolution (500x360) images. It is a subset provided by the authors of the Flickr550 dataset [30], downloaded from Flickr in the "European Travel" group. We modify the queries and ground truth defined by the authors to suit the objective of "content-based" photo search. The result is a total of 1,282 ground truth images in 7 query categories, such as "Torre Pendente Di Pisa" and "Starbucks" in Figure 7 (c) and (d). We then add 10K images randomly sampled from Flickr550 to form our Flickr11K dataset.

⁴ Note that we do not aim to optimize the retrieval performance for the benchmark, but to investigate relative improvements for the proposed expansion methods in LSH.


Table 1: Comparisons of intra-expansion methods, which all improve the baseline hash-based image retrieval in Oxford dataset. MPL (100 probes) is the most effective as probing more buckets neighboring the initially hashed bucket. MP and MF are limited due to loosely associating other candidate feature points by co-occurrences in other hash tables or sparse meta-features. ‘%’ stands for relative improvement from the hash baseline.

Oxford   Baseline   MP      MPL     MF
MAP      0.261      0.305   0.397   0.304
%        -          16.9    52.1    16.5

Table 2: Comparisons of variant inter-expansion and fusion methods on the Oxford dataset; most of them outperform the hash-based baseline (0.261 MAP). The best performer for inter-expansion only is SVMR with MAX_IL fusion; with intra-expansion added, it is SVMR with AVG fusion. See more explanations in Section 6.2. Note that we use MPL for intra-expansion.

Oxford    Inter-expansion            Intra + Inter expansion
          PRF     SVFI    SVMR      PRF     SVFI    SVMR
AVG       0.273   0.291   0.295     0.431   0.442   0.460
MAX       0.277   0.295   0.276     0.406   0.396   0.392
Borda     0.227   0.249   0.236     0.398   0.388   0.358
MAX_IL    -       0.320   0.322     -       0.430   0.425
AVG_IL    -       0.310   0.307     -       0.438   0.448


5.2 Performance Metrics To evaluate the retrieval performance, we use average precision (AP) as the major performance metric. Widely adopted in large-scale photo/video retrieval benchmarks such as TRECVID [2], Oxford Buildings [23], and Flickr550 [30], AP approximates the area under a non-interpolated precision-recall curve for a query. Since AP only shows the performance for a single image query, we compute mean average precision (MAP) to represent the system performance over all queries; that is, over 55 query images (across 11 categories) in the Oxford dataset and 56 query images (across 7 categories) in the Flickr11K dataset.

5.3 Baseline LSH Configurations The experiments for the LSH expansion are based on the E2LSH implementation [1]. For object-level image retrieval by bags of feature points, the default configurations need to be adjusted to optimize efficiency and effectiveness. Through a pilot experiment on a held-out dataset, we found that decreasing K or increasing L helps locate more matched feature point candidates, since a smaller K puts looser constraints for hashing more feature points into the same bucket and more hash tables (larger L) help accumulate more collided feature points. For the baseline LSH, we set K=10, L=10 and W=400, through the sensitivity test on the held-out set, balancing time efficiency and effectiveness⁵. Generally, each query feature point retrieves on average 0.01% of the total number of feature points as candidates. For the Oxford dataset, we have 7,162,122 feature points in LSH; on average, each table has around 1,192,240 buckets. We also set the threshold for the matched pair filtering ratio to 0.8, as suggested in [19]. Note that if a query feature point has only one nearest neighbor, we treat the two points as matched.

The hash-based baseline achieves 0.261 (MAP) on the Oxford dataset and 0.128 on Flickr11K (cf. the second row in Table 3). With the configurations mentioned above, the average image query time for the two datasets is 0.8 seconds and 1.2 seconds, respectively, on a regular Linux server with an Intel CPU. We also applied spatial verification on the hash-based baselines and found the improvement marginal, due to the very sparse matching between photo pairs, as commonly observed in the bag-of-visual-words paradigm as well [11]. We will show later that the introduction of inter- and intra-expansion can significantly boost the hash-based baseline by bringing in more target-related and context-related feature points.

⁵ Note that according to our preliminary experiments, more hash tables can slightly improve the retrieval performance, but at the cost of huge memory and slow response time. In the following, we retain a small number of hash tables.

6. RESULTS AND DISCUSSIONS 6.1 Intra-expansion on LSH We first experiment with variant methods for intra-expansion, which aim to help locate similar feature points mis-hashed to other buckets. According to Table 1, the expansion methods (i.e., MP, MPL, and MF, cf. Section 4.1) all improve object-level image retrieval over the hash-based baseline on the Oxford dataset. The expansion methods do include more related feature points helpful for bag-of-feature-points matching. Note that the candidate feature points are later filtered by the nearest neighbor ratio (cf. Section 3.3.1).

Among them, multi-probe LSH (MPL) outperforms MP and MF because MPL can still locate neighboring hash buckets, where candidate feature points might reside, even if there are no matched points in the initially hashed bucket; whereas MP associates candidate feature points by feature co-occurrence, and the association between feature points might be sparse. Similar deficiencies are observed in MF, which utilizes meta-features to associate candidate feature points; however, the total number of buckets is so huge that the meta-features might not be hashed into the same buckets as the initial query feature points.

6.2 Inter-expansion and Fusion Methods The goal of inter-expansion is to retrieve more target-related photos given the initial search results, since there might be certain context-related feature points not observed in the given query. We compare three inter-expansion methods (PRF, SVFI, and SVMR) combined with different fusion methods, which are necessary since generally multiple photos in the initial search result are seeded for expansion and their respective expanded lists need to be fused effectively, as illustrated in Figure 5. We also look into the impact of further considering intra-expansion.

For inter-expansion only (cf. Table 2) on the Oxford dataset, PRF is the worst and improves the hash baseline by at most 6.1% relatively (from 0.261 to 0.277 with MAX fusion). This is natural since PRF simply assumes that the top returns are correct and issues them for expansion directly. Such a limitation of PRF-like methods in video retrieval is also observed in [9][29]. If the initial seeded photos are further filtered by spatial verification (cf. Section 3.3.2), the expanded results mostly show salient gains (e.g., up to 23%) over the hash baseline (see SVFI in Table 2) across fusion methods.


Figure 8: Performance breakdown in the Oxford dataset, where we have 100-probe MPL for intra-expansion and SVMR with AVG fusion for inter-expansion. All expansion methods improve the hash-based baseline across query categories. Only a few query categories degrade slightly in inter-expansion because some incorrect images are seeded for expansion. Interestingly, "inter + intra" and "inter + intra + iteration" outperform all others.

Table 3: Performance gain boosts saliently when combining intra- and inter-expansion for both the Oxford and Flickr11K datasets. "%" stands for relative improvement in MAP.

                            Oxford            Flickr11K
                            MAP     %         MAP     %
Baseline                    0.261   -         0.128   -
Intra                       0.396   52.0      0.215   67.3
Inter                       0.295   13.1      0.135   5.2
Intra + Inter               0.460   76.3      0.225   75.0
Intra + Inter + Iteration   0.469   79.8      0.236   83.5

Apparently, spatial verification is helpful for filtering effective photos for expansion, since the search targets might be spatially correlated for these object or landmark photos.

Another factor is whether the photo area used for expansion (either the matched region only or the entire photo) matters to the expansion quality. For inter-expansion only, SVFI and SVMR do not differ much; the former uses the entire image for expansion and the latter uses the matched region only (cf. Figure 6). As more matched feature points are brought in by intra-expansion, matching the region of interest (i.e., SVMR) with proper fusion methods (i.e., AVG, AVG_IL) saliently outperforms inter-expansion by the whole seeded photo, which is generally contaminated by background noise not relevant to the target photos. Meanwhile, as shown in Table 2, SVMR with AVG fusion and intra-expansion has the largest gain (MAP 0.261 to 0.460, relatively 76.3%) among all configurations on the Oxford dataset, as it does on Flickr11K.

As for the fusion methods, AVG, AVG_IL, and MAX_IL are generally on par across inter-expansion configurations, as shown in Table 2. However, AVG_IL and MAX_IL are time-consuming because they evaluate matched inliers and require extra spatial verification over the expanded photo lists, whereas AVG simply averages the feature point matching similarities across the expanded lists and is very efficient. Borda fusion, which takes only the ranking order of the expanded photo lists, is ineffective since it ignores the matched inliers, spatial information, and feature point similarities.
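As a concrete illustration of this efficiency argument, the following Python sketch contrasts AVG with Borda fusion over expanded lists represented as {photo_id: similarity} dictionaries; this representation and the function names are our assumptions for illustration.

    from collections import defaultdict

    def avg_fusion(expanded_lists):
        # AVG: average the feature point matching similarity of each photo
        # across the expanded lists (photos absent from a list count as 0).
        scores = defaultdict(float)
        for lst in expanded_lists:
            for photo, sim in lst.items():
                scores[photo] += sim
        n = len(expanded_lists)
        return sorted(scores, key=lambda p: scores[p] / n, reverse=True)

    def borda_fusion(expanded_lists):
        # Borda: only rank order survives; similarity magnitudes, inliers,
        # and spatial information are discarded, hence its poor performance.
        points = defaultdict(int)
        for lst in expanded_lists:
            ranked = sorted(lst, key=lst.get, reverse=True)
            for rank, photo in enumerate(ranked):
                points[photo] += len(ranked) - rank
        return sorted(points, key=points.get, reverse=True)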

6.3 Combining Intra and Inter Expansions
We have shown that both intra- and inter-expansion improve the hash-based baseline for image object retrieval. More interestingly, performance boosts significantly when combining both expansion strategies: more reliable matching pairs are retrieved through intra-expansion, and more context-related photos are yielded through inter-expansion from the retrieved image lists. Besides the MAP comparisons, a sample query under different expansion methods is illustrated in Figure 2, where some results ranked low by the hash-based baseline are boosted to the top by the expansion methods, as also demonstrated by the precision-recall (PR) curves in Figure 2(b).

As shown in Table 3, combining intra-expansion (MPL with 100 probes) and inter-expansion (SVMR with AVG fusion) improves MAP by 76.3% relatively over the LSH baseline in the Oxford dataset. Another interesting observation is that the intra and inter expansions seem to work orthogonally, collaboratively compounding the performance gains. For example, in the Oxford dataset we have 52% relative improvement for intra and 13.1% for inter; if the two gains were independent, the combined gain would be around 72% (0.72 = (1 + 0.52) x (1 + 0.131) - 1), and in practice we obtain 76.3%. Similarly, for the Flickr11K dataset we have 67.3% for intra and 5.2% for inter; the ideal combined gain would be 76% (0.76 = (1 + 0.673) x (1 + 0.052) - 1), and empirically it is 75%.
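A two-line check of this independence arithmetic (the notation g for relative MAP gain is ours):

    def combined_gain(g_intra, g_inter):
        # If the two expansions acted independently, gains would compound.
        return (1 + g_intra) * (1 + g_inter) - 1

    print(combined_gain(0.52, 0.131))   # Oxford: ~0.72 vs. 0.763 measured
    print(combined_gain(0.673, 0.052))  # Flickr11K: ~0.76 vs. 0.75 measured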

Figure 8 shows the performance breakdown over the 11 query categories in the Oxford dataset for the major expansion methods. All expansion methods improve the hash-based baseline across query categories. Only a few query categories (e.g., "Balliol") degrade marginally under inter-expansion because some incorrect images from the hash baseline are seeded for expansion due to unreliable feature point matches. Interestingly, when we combine intra and inter expansions (either "inter + intra" or "inter + intra + iteration"), intra-expansion brings in more robust feature point matches and the combination significantly outperforms the hash baseline across queries.

Iteratively expanding the retrieval results by inter and intra expansions is slightly helpful (2% to 5% relative improvement), as shown in the last row of Table 3. However, it takes considerable time to conduct the spatial verification needed to select the seeded photos for expansion.

6.4 Parameter Sensitivity
6.4.1 Number of probes for MPL intra-expansion
In intra-expansion, the most effective method is MPL, whose essential parameter is the number of probes to the neighboring buckets (cf. Section 4.1.2). Experimenting on the Oxford dataset, we test different numbers of probes and find that the performance (in MAP) saturates between 100 and 350 probes, as shown in Figure 9. Generally, performance increases as more buckets are probed, which is natural since probing improves recall for mis-hashed feature points. However, probing more than 350 buckets degrades performance by including more noisy feature points. Note that the number of candidate feature points grows as more buckets are probed. Balancing efficiency and effectiveness, we choose 100 probes for MPL intra-expansion; even so, the total number of candidate feature points inspected per feature point query remains small, on average about 0.1% of the total.
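A minimal sketch of one multi-probe hash table, assuming p-stable projections h(v) = floor((a.v + b) / W) [12]; the probe sequence here perturbs one hash component at a time by ±1, a simplification of the query-directed sequence in [20], and all parameter names are illustrative.

    import numpy as np
    from collections import defaultdict

    class MultiProbeTable:
        def __init__(self, dim, n_funcs=8, W=4.0, seed=0):
            rng = np.random.default_rng(seed)
            self.A = rng.standard_normal((n_funcs, dim))  # p-stable projections
            self.b = rng.uniform(0, W, n_funcs)
            self.W = W
            self.table = defaultdict(list)

        def _key(self, v):
            return tuple(np.floor((self.A @ v + self.b) / self.W).astype(int))

        def index(self, point_id, v):
            self.table[self._key(v)].append(point_id)

        def query(self, v, num_probes=100):
            base = self._key(v)
            probes = [base]  # the home bucket first
            for i in range(len(base)):       # then the +/-1 neighbors of each
                for d in (-1, 1):            # component; a full implementation
                    k = list(base)           # enumerates multi-component
                    k[i] += d                # perturbations in query-directed
                    probes.append(tuple(k))  # order to reach, e.g., 100 probes
            candidates = []
            for key in probes[:num_probes]:
                candidates.extend(self.table.get(key, ()))
            return candidates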


Figure 9: Sensitivity test on the number of probes for MPL intra-expansion (cf. Section 4.1.2). We recommend 100 probes for MPL. Similar behaviors are observed in the other benchmark, Flickr11K.

Table 4: Sensitivity test on the number of inliers (δ) for determining a correct match, with 100-probe MPL for intra-expansion and SVMR with AVG fusion for inter-expansion. A lower threshold is preferred for inter-expansion only; once intra-expansion is introduced, more robust matches are yielded between image pairs and a higher threshold (20) is preferred.

Number of inliers (δ)   10      20      30
Inter                   0.305   0.295   0.288
Inter + Intra           0.457   0.460   0.447

6.4.2 Number of inliers (δ) for a correct image
Another important parameter is δ, the number of inliers required to deem a retrieved image a correct match in inter-expansion. Generally, a higher threshold is stricter and thus trades recall for precision. As the second row of Table 4 shows, for inter-expansion only the performance (in MAP) is best at δ=10. However, once intra-expansion (i.e., MPL) probes more candidate buckets, the number of correct matches between image pairs increases, and the best threshold when combining both expansions becomes δ=20 (see the third row of Table 4). This experiment also confirms the need for intra-expansion, which generally brings in more reliable matches between image pairs and eases the mis-hashing problem in traditional hash frameworks.
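A sketch of this verification test under stated assumptions: we count RANSAC inliers [14] of a transform estimated between matched query and candidate points, and accept the image if the count reaches δ. The OpenCV call is real; the homography model and the 3-pixel reprojection threshold are our choices for illustration.

    import numpy as np
    import cv2  # OpenCV, assumed available

    def is_correct_image(query_pts, cand_pts, delta=20):
        if len(query_pts) < 4:  # a homography needs >= 4 correspondences
            return False
        src = np.float32(query_pts).reshape(-1, 1, 2)
        dst = np.float32(cand_pts).reshape(-1, 1, 2)
        # RANSAC fits a transform and flags the consistent matches (inliers).
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        inliers = 0 if mask is None else int(mask.ravel().sum())
        return inliers >= delta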

7. CONCLUSIONS AND FUTURE WORK
In this work, we identify the low-recall problem in hash-based image object retrieval over database images represented by bags of feature points. We extend LSH and propose two novel expansion strategies, intra-expansion and inter-expansion. We investigate variant methods to maximize expansion effectiveness and show that the two strategies are complementary and collaboratively boost performance significantly (up to 76.3% relative improvement) on two consumer photo benchmarks. In the near future, we will pursue efficient implementations of the expansions and the spatial verification process. Since hashing for bags of feature points can be performed independently, we will also seek parallel solutions such as multi-core platforms to further speed up large-scale image object retrieval.

8. REFERENCES
[1] E2LSH: http://www.mit.edu/~andoni/LSH/

[2] NIST TRECVID. http://www-nlpir.nist.gov/projects/trecvid/.

[3] A. Andoni et al. Efficient algorithms for substring near neighbor problem. ACM-SIAM SODA, 2006.

[4] S. Arya et al. Approximate nearest neighbor queries in fixed dimensions. ACM-SIAM SODA, 1993.

[5] A. Broder et al. Min-wise independent permutations. JCSS, 1998.

[6] R. Cai et al. Scalable music recommendation by search. ACM Multimedia 2007.

[7] J. G. Carbonell et al. Translingual information retrieval: A comparative evaluation. IJCAI 1997.

[8] M. Casey et al. Fast recognition of remixed music audio. ICASSP 2007.

[9] T.-C. Chang et al. TRECVID 2004 search and feature extraction task by NUS PRIS. TRECVID Workshop 2004.

[10] M. Charikar. Similarity estimation techniques from rounding algorithms. STOC 2002.

[11] O. Chum et al. Total recall: Automatic query expansion with a generative feature model for object retrieval. ICCV 2007.

[12] M. Datar, et al. Locality-sensitive hashing scheme based on p-stable distributions. SoCG 2004.

[13] W. Dong et al. Efficiently matching sets of features with random histograms. ACM Multimedia 2008.

[14] M. A. Fischler et al. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. CACM 1981.

[15] M. Flickner et al. Query by image and video content: The QBIC system. IEEE Computer, 1995.

[16] A. Gionis et al. Similarity search in high dimensions via hashing. VLDB 1999.

[17] P. Indyk et al. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.

[18] Y. Ke et al. Efficient near-duplicate detection and sub-image retrieval. ACM Multimedia 2004.

[19] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV 2004.

[20] Q. Lv et al. Multi-probe LSH: efficient indexing for high-dimensional similarity search. VLDB 2007.

[21] W.-S. Liao et al. AdImage: video advertising by image matching and ad scheduling optimization. ACM SIGIR 2008.

[22] D. Nistér et al. Scalable Recognition with a Vocabulary Tree. CVPR 2006.

[23] J. Philbin et al. Object retrieval with large vocabularies and fast spatial matching. CVPR 2007.

[24] J.R. Smith et al. VisualSEEK: A Fully Automated Content-Based Image Query System. ACM Multimedia 1996.

[25] N. Snavely et al. Photo tourism: exploring photo collections in 3D. ACM TOG, 2006.

[26] B. Stein. Principles of hash-based text retrieval. ACM SIGIR 2007.

[27] X.-J. Wang et al. Annosearch: Image auto-annotation by search. CVPR 2006.

[28] J. Xu et al. Query expansion using local and global document analysis. ACM SIGIR 1996.

[29] R. Yan et al. Multimedia search with pseudo-relevance feedback. CIVR 2003.

[30] Y.-H. Yang et al. ContextSeer: context search and recommendation at query time for shared consumer photos. ACM Multimedia 2008.

[31] T. Yeh et al. Photo-based Question Answering. ACM Multimedia 2008.

[32] K. McDonald et al. A comparison of score, rank and probability-based fusion methods for video shot retrieval. CIVR 2005.

[33] Y. G. Jiang et al. Bag-of-visual-words expansion using visual relatedness for video indexing. SIGIR 2008.

[34] J. Philbin et al. Lost in quantization: Improving particular object retrieval in large scale image databases. CVPR 2008.
