
Lost in Quantization: Improving Particular Object Retrieval in Large Scale Image Databases

CVPR 2008

James Philbin, Ondřej Chum, Michael Isard, Josef Sivic, Andrew Zisserman

[7] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proc. ICCV, 2007.

Outline

• Introduction
• Methods in this paper
• Experiment & Result
• Conclusion


Introduction

• Goal
  – Specific object retrieval from an image database

• For large databases
  – Achieved by systems inspired by text retrieval (visual words)

Flow

1. Get features
  – SIFT

2. Cluster
  – Approximate k-means

3. Feature quantization
  – Visual words
  – Soft-assignment (on the query side)

4. Re-ranking
  – RANSAC spatial verification

5. Query expansion
  – Average query expansion

Outline

• Introduction
• Methods in this paper
• Experiment & Result
• Conclusion

Feature

• SIFT


Quantization (visual word)

• Point list: [(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)]
• Sorted list: [(2,3), (4,7), (5,4), (7,2), (8,1), (9,6)]
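Hard quantization assigns each descriptor to its single nearest cluster center. A minimal sketch in pure Python, reusing the toy 2-D points above (the real system quantizes 128-D SIFT descriptors with approximate k-means rather than brute-force search; `nearest_word` is a hypothetical helper):

```python
import math

def nearest_word(descriptor, centers):
    """Hard-assignment: index of the closest cluster center (visual word).

    descriptor: tuple of coordinates; centers: list of tuples.
    Brute-force toy version; the paper uses approximate nearest-neighbor
    search over a 1M-word vocabulary instead.
    """
    best, best_dist = 0, float("inf")
    for i, c in enumerate(centers):
        d = math.dist(descriptor, c)
        if d < best_dist:
            best, best_dist = i, d
    return best

# The point list from the slide, treated as toy 2-D "descriptors":
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print(nearest_word((6, 4), points))  # closest center is (5, 4), index 1
```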

Soft-assignment of visual words

• Matching two image features under bag-of-visual-words hard-assignment:
  – Yes if assigned to the same visual word
  – No otherwise

• Soft-assignment:
  – A weighted combination of visual words

Soft-assignment of visual words

(Figure) A–E represent cluster centers (visual words); points 1–4 are features

Soft-assignment of visual words

• Weight assigned to a visual word: w = exp(−d² / (2σ²))
  – d is the distance from the cluster center to the descriptor
• In practice, σ is chosen so that a substantial weight is only assigned to a few cells
• The essential parameters:
  – the spatial scale σ
  – r, the number of nearest neighbors considered

Soft-assignment of visual words

• Assigning these weights to the r nearest neighbors, the descriptor is represented by an r-vector, which is then L1-normalized
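The soft-assignment step above can be sketched as follows: Gaussian weights exp(−d²/(2σ²)) for the r nearest centers, L1-normalized into an r-vector. A toy 2-D sketch in pure Python (the σ and r values, and the helper name, are illustrative, not the paper's tuned settings):

```python
import math

def soft_assign(descriptor, centers, r=3, sigma=2.0):
    """Soft-assignment sketch: weight exp(-d^2 / (2*sigma^2)) for each of
    the r nearest cluster centers, then L1-normalize so the weights sum
    to one.  Returns (center_index, weight) pairs.
    """
    # Distances to all centers, keep the r nearest.
    nearest = sorted(
        (math.dist(descriptor, c), i) for i, c in enumerate(centers)
    )[:r]
    weights = [(i, math.exp(-d * d / (2 * sigma * sigma))) for d, i in nearest]
    total = sum(w for _, w in weights)
    return [(i, w / total) for i, w in weights]  # L1-normalized r-vector
```

With a small σ, almost all of the weight lands on the closest cell, recovering hard-assignment as a limiting case.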

TF–IDF weighting

• Standard index architecture

TF–IDF weighting

• tf
  – 'a' appears 3 times in a 100-word document
  – tf = 3 / 100 = 0.03

• idf
  – 1,000 documents contain 'a', out of 10,000,000 documents in total
  – idf = ln(10,000,000 / 1,000) ≈ 9.21

• tf–idf = 0.03 × 9.21 ≈ 0.28
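The worked example above is a one-liner in code (hypothetical helper, using the same ln-based idf as the slide):

```python
import math

def tf_idf(term_count, doc_len, docs_with_term, total_docs):
    """tf-idf as in the worked example:
    tf = term_count / doc_len, idf = ln(total_docs / docs_with_term)."""
    tf = term_count / doc_len
    idf = math.log(total_docs / docs_with_term)
    return tf * idf

# 'a' appears 3 times in a 100-word document;
# 1,000 of 10,000,000 documents contain 'a'.
print(round(tf_idf(3, 100, 1_000, 10_000_000), 2))  # 0.28
```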

TF–IDF weighting

• In this paper
  – For the term frequency (tf), we simply use the normalized weight value for each visual word
  – For the inverse document frequency (idf), counting an occurrence of a visual word as one, no matter how small its weight, gave the best results

Re-ranking

• RANSAC
  – Affine transform Θ: Y = AX + b

• Algorithm
  1. Randomly choose n points
  2. Use the n points to estimate Θ
  3. Apply Θ to the remaining N − n points
  4. Count the inliers
  – Repeat steps 1–4 K times
  – Pick the best Θ
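The RANSAC loop above can be sketched with a deliberately simplified 1-D model y = a·x + b (the paper fits a full 2-D affine transform Y = AX + b, but the loop structure is identical; the function name, iteration count, and threshold here are illustrative):

```python
import random

def ransac_line(points, k=100, threshold=1.0, seed=0):
    """Simplified RANSAC: fit y = a*x + b to (x, y) points with outliers.

    Repeat k times: sample a minimal set (2 points), fit the model,
    count inliers within `threshold`, and keep the model with the most
    inliers.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, -1
    for _ in range(k):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # degenerate sample, cannot fit a slope
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = sum(1 for x, y in points if abs(a * x + b - y) < threshold)
        if inliers > best_inliers:
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Mostly points on y = 2x + 1, plus two gross outliers:
pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -20)]
model, inliers = ransac_line(pts)  # recovers a ≈ 2, b ≈ 1 with 10 inliers
```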

Re-ranking

• In this paper
  – Not only counting the number of inlier correspondences, but also adding the scoring function: the cosine similarity cos(q, d) = (q · d) / (‖q‖ ‖d‖) between tf–idf vectors
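The cosine score mentioned above, between two tf–idf vectors, can be sketched as (dense toy vectors; the real system uses sparse vectors over a 1M-word vocabulary; `cosine` is a hypothetical helper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length tf-idf vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```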

Average query expansion

• Obtain the top (m < 50) spatially verified results of the original query
• Construct a new query using the average of these results:

  d_avg = ( d0 + d1 + … + dm ) / (m + 1)

  – where d0 is the normalized tf vector of the query region
  – di is the normalized tf vector of the i-th result

• Re-query once
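The averaging step can be sketched as (dense toy tf vectors; `average_query` is a hypothetical helper):

```python
def average_query(d0, results):
    """Average query expansion: the new query is the mean of the original
    tf vector d0 and the tf vectors of the top verified results,
    d_avg = (d0 + sum(d_i)) / (m + 1) with m = len(results)."""
    m = len(results)
    return [
        (q + sum(d[j] for d in results)) / (m + 1)
        for j, q in enumerate(d0)
    ]

d0 = [1.0, 0.0, 0.0]
results = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(average_query(d0, results))  # each component equals 1/3
```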

Outline

• Introduction
• Methods in this paper
• Experiment & Result
• Conclusion

Dataset

• Crawled from Flickr, high resolution (1024×768)

• Oxford buildings
  – 5,062 images
  – 11 landmarks used as queries

• Paris
  – Used for building the quantization (vocabulary)
  – 6,300 images

• Flickr1
  – 145 most popular tags
  – 99,782 images

Dataset

• Query– 55 queries: 5 queries for each of 11 landmarks

Baseline

• Follow the architecture of previous work [15]
• A visual vocabulary of 1M words is generated using approximate k-means

[15] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. CVPR, 2007


Evaluation

• Compute an Average Precision (AP) score for each of the 5 queries for a landmark
  – Area under the precision–recall curve
  – Precision = RPI / TNIR
  – Recall = RPI / TNPC
    • RPI = retrieved positive images
    • TNIR = total number of images retrieved
    • TNPC = total number of positives in the corpus

• Average these to obtain a Mean Average Precision (MAP)

(Figure: precision–recall curve)
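The AP computation can be sketched as below. This is a common formulation (mean of the precision values at each rank where a positive appears), assuming the ranked list covers every positive in the corpus; it is not the authors' exact code:

```python
def average_precision(ranked_relevance):
    """AP from a ranked list of 0/1 relevance flags.

    Averages the precision at each rank where a relevant item appears.
    Assumes every positive in the corpus occurs somewhere in the list.
    """
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Positives at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6
print(average_precision([1, 0, 1, 0]))
```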

Evaluation

• Datasets
  – Oxford only (D1): 5,062 images
  – Oxford (D1) + Flickr1 (D2): 104,844 images

• Vector quantizers
  – Built from Oxford or Paris

Result

[14] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. CVPR, 2006.

[18] T. Tuytelaars and C. Schmid. Vector quantizing feature space with a regular lattice. In Proc. ICCV, 2007.

[15] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. CVPR, 2007.

Parameter variation
Comparison with other methods

Result

Effect of vocabulary size

Spatial verification

Result

Query expansion

Scaling-up to 100K images

Result

Result

ashmolean_3 goes from 0.626 AP to 0.874 AP
christ_church_5 increases from 0.333 AP to 0.813 AP

Outline

• Introduction
• Methods in this paper
• Experiment & Result
• Conclusion

Conclusion

• A new method of visual word assignment was introduced:
  – descriptor-space soft-assignment

• It recovers descriptor information that is lost in the quantization step of previously published methods.
