JOSEF SIVIC AND ANDREW ZISSERMAN
PRESENTERS: ILGE AKKAYA & JEANNETTE CHANG
MARCH 1, 2011
Efficient Visual Search for Objects in Videos
Introduction
Text Query
Results: Documents
Image Query
Results: Frames
Generalize text retrieval methods to non-textual information
State-of-the-Art before this paper…
Text-based search for images (Google Images)
Object recognition
  Barnard et al. (2003): “Matching words and pictures”
  Sivic et al. (2005): “Discovering objects and their location in images”
  Sudderth et al. (2005): “Learning hierarchical models of scenes, objects, and parts”
Scene classification
  Fei-Fei and Perona (2005): “A Bayesian hierarchical model for learning natural scene categories”
  Quelhas et al. (2005): “Modeling scenes with local descriptors and latent aspects”
  Lazebnik et al. (2006): “Beyond bag of features: Spatial pyramid matching for recognizing natural scene categories”
Introduction (cont.)
Retrieve specific objects vs. categories of objects/scenes (“Camry” logo vs. cars)
Employ text retrieval techniques for visual search, with images as queries and results
Why Text Retrieval Approach?
  Matches are essentially precomputed, so there is no delay at run time
  Any object in the video can be retrieved without modifying the descriptors originally built for the video
Overview of the Talk
Visual Search Algorithm
  Offline Pre-Processing
  Real-Time Query
  A Few Implementation Details
Performance
  General Results
  Testing Individual Words
  Using External Images As Queries
A Few Challenges and Future Directions
Concluding Remarks
Demo of the Algorithm
Pre-Processing (Offline)
1. For each frame, detect affine covariant regions.
2. Track the regions through video and reject unstable regions
3. Build visual vocabulary
4. Remove stop-listed visual words
5. Compute tf-idf weighted document frequency vectors
6. Build inverted file-indexing structure
Typically ~1200 regions / frame (720x576)
Elliptical regions
Each region represented by a 128-dimensional SIFT vector
SIFT descriptors computed on the affine-normalized regions provide invariance against affine transformations
Detection of Affine Covariant Regions
Two types of affine covariant regions:
1. Shape-Adapted (SA): Mikolajczyk et al.
   Elliptical shape adaptation about a Harris interest point
   Often centered on corner-like features
2. Maximally-Stable (MS): Proposed by Matas et al.
   Intensity watershed image segmentation
   High-contrast blobs
Tracking regions through video and rejecting unstable regions
Any region that does not survive for 3+ frames is rejected
Such short-lived regions are unlikely to be interesting
Reduces the number of regions per frame by approx. 50% (to ~600/frame)
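The rejection rule can be sketched as a filter over tracker output. This is only an illustration: the `(track_id, frame_index)` observation format and the function name are hypothetical, and the sketch counts distinct frames per track rather than checking that they are consecutive.

```python
from collections import defaultdict

MIN_TRACK_LENGTH = 3  # from the slide: a region must survive 3+ frames


def stable_regions(observations, min_length=MIN_TRACK_LENGTH):
    """Keep only observations whose track spans at least `min_length` frames.

    `observations` is a list of (track_id, frame_index) pairs, a hypothetical
    output format for whatever region tracker is used.
    """
    frames_per_track = defaultdict(set)
    for track_id, frame_index in observations:
        frames_per_track[track_id].add(frame_index)
    return [
        (track_id, frame_index)
        for track_id, frame_index in observations
        if len(frames_per_track[track_id]) >= min_length
    ]


# Track "a" lasts three frames; "b" and "c" are too short and are dropped.
observations = [("a", 0), ("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 0)]
kept = stable_regions(observations)
```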
Visual Indexing Using Text-Retrieval Methods
TEXT | IMAGE
Represent words by their “stems”: ‘write’, ‘writing’, ‘written’ mapped to ‘write’ | Cluster similar regions into ‘visual words’
Stop-list common words ‘a/an/the’ | Stop-list common visual words
Rank search results according to how close the query words occur within the retrieved document | Use spatial information to check retrieval consistency
Visual Vocabulary
Purpose: Cluster regions from multiple frames into fewer groups called ‘visual words’
Each descriptor: 128-vector
K-means clustering
~300K descriptors mapped into 16K visual words (600 regions/frame x ~500 frames)
(6,000 clusters for SA regions and 10,000 clusters for MS regions)
K-Means Clustering
Purpose: Cluster N data points (SIFT descriptors) into K clusters (visual words)
K = desired number of cluster centers (mean points)
Step 1: Randomly guess K mean points
Step 2: Assign each data point to its nearest mean point (cluster center)
In this paper, Mahalanobis distance is used to determine the ‘nearest cluster center’:
$d(x_1, x_2) = \sqrt{(x_1 - x_2)^\top \Sigma^{-1} (x_1 - x_2)}$
where $\Sigma$ is the covariance matrix for all descriptors, $x_2$ is the length-128 mean vector, and the $x_1$'s are the descriptor vectors (i.e., data points)
Step 3: Recalculate cluster centers and distances; repeat until the assignments are stationary
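A minimal NumPy sketch of the three steps, assuming the Mahalanobis distance d(x1, x2) = sqrt((x1 - x2)^T Sigma^-1 (x1 - x2)) with one covariance matrix shared by all descriptors. The function names, the small ridge added to the covariance, and the random test data are assumptions of this sketch, not details from the paper. It uses the fact that, with Sigma = L L^T, Euclidean distance between whitened vectors L^-1 x equals Mahalanobis distance between the original vectors.

```python
import numpy as np


def mahalanobis(x1, x2, sigma):
    """d(x1, x2) = sqrt((x1 - x2)^T Sigma^-1 (x1 - x2))."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))


def nearest_center(Z, centers):
    """Index of the closest center for every row of Z (Euclidean)."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def mahalanobis_kmeans(X, k, n_iter=20, seed=0):
    """Lloyd's k-means under a Mahalanobis distance with shared covariance."""
    rng = np.random.default_rng(seed)
    # Covariance of all descriptors; the ridge keeps it invertible (assumption).
    sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    L = np.linalg.cholesky(sigma)
    Z = np.linalg.solve(L, X.T).T  # whitened descriptors
    # Step 1: pick k random data points as the initial means.
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest mean.
        labels = nearest_center(Z, centers)
        # Step 3: recompute each mean; keep the old one if a cluster is empty.
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = nearest_center(Z, centers)
    return labels, centers @ L.T  # centers mapped back to descriptor space


# Random stand-ins for SIFT descriptors (8-D here instead of 128-D).
X = np.random.default_rng(1).normal(size=(60, 8))
labels, centers = mahalanobis_kmeans(X, k=3)
```

Each returned label is the index of the "visual word" a descriptor is assigned to.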
Samples of normalized affine covariant regions
Examples of Clusters of Regions
Remove Stop-Listed Words
Analogy to text retrieval: ‘a’, ‘and’, ‘the’ … are not distinctive words
Common words cause mismatches
5-10% of the most common visual words are stopped
800-1600 / 16000 words are stopped
(Upper row) Matches before stop-listing
(Lower row) Matches after stop-listing
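Stop-listing can be sketched as follows, assuming each frame is already a list of visual word ids. The function names and the toy data are hypothetical. The `fraction` parameter corresponds to the 5-10% quoted on the slide.

```python
from collections import Counter


def build_stop_list(frames, fraction=0.05):
    """Return the most frequent `fraction` of visual words as a set."""
    counts = Counter(word for frame in frames for word in frame)
    n_stop = int(len(counts) * fraction)
    return {word for word, _ in counts.most_common(n_stop)}


def remove_stopped(frames, stop_list):
    """Drop stop-listed words from every frame."""
    return [[w for w in frame if w not in stop_list] for frame in frames]


# Word 1 occurs in every frame, so it is the one word stopped at 10% of 10 words.
frames = [[1, 2, 3], [1, 2, 4], [1, 5, 6], [1, 7, 8], [1, 9, 10]]
stop_list = build_stop_list(frames, fraction=0.1)
filtered = remove_stopped(frames, stop_list)
```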
tf-idf Weighting (term frequency-inverse document frequency weighting)
$t_i = \frac{n_{id}}{n_d} \log \frac{N}{N_i}$
$n_{id}$: number of occurrences of word (visual word) $i$ in document (frame) $d$
$n_d$: total number of words in document $d$
$N_i$: total number of documents containing term $i$
$N$: number of documents in the database
$t_i$: weighted word frequency
Each document (frame) is represented by:
$\mathbf{v}_d = (t_1, \ldots, t_i, \ldots, t_v)^\top$
where $v$ = number of visual words in the vocabulary
and $\mathbf{v}_d$ = the tf-idf vector of the particular frame $d$
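A short sketch of the weighting, assuming the standard form t_i = (n_id / n_d) * log(N / N_i) built from the symbols defined on the slide. Frames are represented as lists of integer visual word ids; the function name and toy frames are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(frames, vocab_size):
    """Return one tf-idf vector (list of floats) per frame."""
    n_docs = len(frames)                  # N
    doc_freq = Counter()                  # N_i: frames containing word i
    for frame in frames:
        doc_freq.update(set(frame))
    vectors = []
    for frame in frames:
        counts = Counter(frame)           # n_id
        n_d = len(frame)                  # total words in this frame
        vec = [0.0] * vocab_size
        for i, n_id in counts.items():
            vec[i] = (n_id / n_d) * math.log(n_docs / doc_freq[i])
        vectors.append(vec)
    return vectors


frames = [[0, 0, 1], [1, 2], [1, 2, 2, 2]]
vectors = tfidf_vectors(frames, vocab_size=3)
```

Word 1 appears in every frame, so its weight is zero everywhere: log(N / N_i) = log(1) = 0. This is the same intuition that motivates stop-listing.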
Inverted File Indexing
Visual Word Index | Found in Frames
1 | 1, 4, 5
2 | 1, 2, 10
… | …
N | …
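The inverted file maps each visual word to the frames that contain it, so a query only touches frames sharing at least one word with it. A minimal sketch with hypothetical function names (frame ids are 0-based here, unlike the 1-based numbers in the table):

```python
from collections import defaultdict


def build_inverted_file(frames):
    """Map each visual word id to the sorted list of frame ids containing it."""
    index = defaultdict(list)
    for frame_id, words in enumerate(frames):
        for word in sorted(set(words)):
            index[word].append(frame_id)
    return index


def candidate_frames(index, query_words):
    """Frames that share at least one visual word with the query."""
    hits = set()
    for word in set(query_words):
        hits.update(index.get(word, []))
    return sorted(hits)


frames = [[1, 2], [2], [3], [1], [1]]
index = build_inverted_file(frames)
candidates = candidate_frames(index, [2, 3])
```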
Real-Time Query
1. Determine the set of visual words found within the query region
2. Retrieve keyframes based on visual word frequencies (Ns = 500)
3. Re-rank retrieved keyframes using spatial consistency
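Step 1 can be sketched as a lookup of the regions whose centers fall inside the user-selected rectangle. The `(x, y, word)` tuple layout and the axis-aligned box are assumptions made for illustration only.

```python
def words_in_region(regions, box):
    """Visual words of regions whose centers lie inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [word for x, y, word in regions if x0 <= x <= x1 and y0 <= y <= y1]


# Hypothetical regions of the query frame: (x, y, visual_word_id).
regions = [(10, 10, 7), (20, 15, 3), (200, 50, 7), (25, 30, 12)]
query_words = words_in_region(regions, box=(0, 0, 50, 50))
```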
Retrieve keyframes based on visual word frequencies
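$\mathbf{v}_q$: the vector of visual word frequencies corresponding to the query region is computed
The normalized scalar product of $\mathbf{v}_q$ with each $\mathbf{v}_d$ is computed:
$\mathrm{sim}(\mathbf{v}_q, \mathbf{v}_d) = \frac{\mathbf{v}_q^\top \mathbf{v}_d}{\lVert \mathbf{v}_q \rVert_2 \, \lVert \mathbf{v}_d \rVert_2}$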
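A sketch of the ranking step, assuming the normalized scalar product is the cosine of the angle between the query vector and each frame vector. The function name and toy vectors are hypothetical; `n_s` plays the role of the Ns = 500 short list.

```python
import numpy as np


def rank_frames(v_q, frame_vectors, n_s=500):
    """Return (frame indices, scores) of the n_s frames most similar to v_q."""
    V = np.asarray(frame_vectors, dtype=float)
    q = np.asarray(v_q, dtype=float)
    norms = np.linalg.norm(V, axis=1) * np.linalg.norm(q)
    # Frames (or queries) with zero norm get a score of 0 instead of NaN.
    scores = np.divide(V @ q, norms, out=np.zeros(len(V)), where=norms > 0)
    order = np.argsort(-scores, kind="stable")[:n_s]
    return order, scores[order]


frame_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
order, top_scores = rank_frames([1.0, 0.0, 0.0], frame_vectors)
```

In practice the inverted file would restrict `frame_vectors` to frames that share a word with the query, so the product is never computed against the whole database.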
Spatial Consistency Voting
Analogy: Google text document retrieval
Matched covariant regions in the retrieved frames should have a similar spatial arrangement
Search area: 15 nearest spatial neighbors of each match
Each neighboring region that also matches in the retrieved frame votes for the frame
Spatial Consistency Voting
Matched pair of words (A,B)
Each region in the defined search area in both frames casts a vote for the match (A,B)
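The voting scheme can be sketched as below. This is a simplified reading of the slides, not the authors' implementation: the search area of a region is taken to be its k nearest regions by center distance, and a match (A,B) receives one vote from every other match (A',B') with A' in the search area of A and B' in the search area of B. All names and coordinates are made up.

```python
import numpy as np


def nearest_neighbours(points, k):
    """For each point, the set of indices of its k nearest other points."""
    pts = np.asarray(points, dtype=float)
    k = min(k, len(pts) - 1)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a region is not its own neighbour
    order = np.argsort(d, axis=1)[:, :k]
    return [set(row.tolist()) for row in order]


def spatial_votes(query_xy, frame_xy, matches, k=15):
    """Votes per match; matches is a list of (query_index, frame_index)."""
    q_nbrs = nearest_neighbours(query_xy, k)
    f_nbrs = nearest_neighbours(frame_xy, k)
    return [
        sum(1 for a2, b2 in matches if a2 in q_nbrs[a] and b2 in f_nbrs[b])
        for a, b in matches
    ]


# Four query regions close together; in the retrieved frame, region 3 lies far
# away among unmatched regions 4 and 5, so match (3, 3) gets no support.
query_xy = [(0, 0), (1, 0), (0, 1), (1, 1)]
frame_xy = [(5, 5), (6, 5), (5, 6), (50, 50), (51, 50), (50, 51)]
matches = [(0, 0), (1, 1), (2, 2), (3, 3)]
votes = spatial_votes(query_xy, frame_xy, matches, k=2)
frame_score = sum(votes)
```

Matches with zero votes would be rejected, and the vote total can be used to re-rank the frame.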
(Upper row) Matches after stop-listing
(Lower row) Remaining matches after spatial consistency voting
Figure panels (columns labeled Query Frame and Sample Retrieved Frame):
1: Query region
2: Close-up version of 1
3-4: Initial matches
5-6: Matches after stop-listing
7-8: Matches after spatial consistency matching
Implementation Details
Offline Processing:
  100-150K frames per typical feature-length film, refined to 4000-6000 keyframes
  Descriptors are computed for stable regions in each frame
  Each region is assigned to a visual word
  Visual words over all keyframes are assembled into an inverted file structure
Algorithm Implementation
Real-Time Process:
  User selects query region
  Visual words are identified within query region
  A short list of Ns = 500 keyframes is retrieved based on tf-idf vector similarity
  Similarity is recomputed considering spatial consistency voting
Example Visual Search
Retrieval Examples
Query Image
A Few Retrieved Matches
Retrieval Examples (cont.)
Query Image
A Few Retrieved Matches
Performance of the Algorithm
Tried 6 object queries
(1) Red Clock
(2) Black Clock
(3) “Frame’s” Sign
(4) Digital Clock
(5) “Phil” Sign
(6) Microphone
Performance of the Algorithm (cont.)
Evaluated on the level of shots rather than keyframes
Measured using precision-recall plots
Precision: a measure of fidelity or exactness (fraction of retrieved shots that are relevant)
Recall: a measure of completeness (fraction of relevant shots that are retrieved)
Performance of the Algorithm (cont.)
Ideally, precision = 1 for all recall values
Average Precision (AP): area under the precision-recall curve; ideally AP = 1
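A small sketch of how these quantities can be computed from a ranked list of shots. It assumes AP is taken as the mean of the precision values at the ranks where relevant shots appear, a common convention; the slides do not state which formula the paper uses. Function names and the toy ranking are hypothetical.

```python
def precision_recall_points(ranked_relevance, n_relevant):
    """(precision, recall) after each position of a ranked result list."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / rank, hits / n_relevant))
    return points


def average_precision(ranked_relevance, n_relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / n_relevant


# Toy ranking: relevant shots at ranks 1, 3 and 4 out of 3 relevant shots.
ranking = [True, False, True, True]
points = precision_recall_points(ranking, n_relevant=3)
ap = average_precision(ranking, n_relevant=3)
perfect_ap = average_precision([True, True, True], n_relevant=3)
```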
Examples of Missed Shots
Extreme viewing angles
Original query object Low-ranked shot
Examples of Missed Shots (cont.)
Significant changes in scale and motion blurring
Original query object Low-ranked shot
Qualitative Assessment of Performance
General trends
  Higher precision at low recall levels
  Bias towards lightly textured regions detectable by SA/MS detectors
Could address these challenges by adding more types of covariant regions
Other Difficulties
  Textureless regions (e.g., mug)
  Thin or wiry objects (e.g., bike)
  Highly deformable objects (e.g., clothing)
Quality of Individual Visual Words
Using a single visual word as query
  Tests the expressiveness of the visual vocabulary
Sample query
  Given an object of interest, select one of the visual words from that object
  Retrieve all frames that contain the visual word (no ranking)
  Retrieval considered correct if the frame contains the object of interest
Examples of Individual Visual Words
Top row: Scale-normalized close-ups of elliptical regions overlaid on query image
Bottom row: Corresponding normalized regions
Results of Individual Word Searches
Individual words are “noisy”
Intuitively, this is because words occur in multiple objects and do not cover all occurrences of the object
Unrealistic
  Require each word to occur on only one object (high precision)
  Growing number of objects would result in growing number of words
Realistic
  Visual words shared across objects, with objects represented by a combination of words
Quality of Individual Visual Words
Searching for Objects From Outside of the Movie
Used external query images from the internet
Manually labeled all occurrences of external queries in movies
Results
External Query Image | No. of Occurrences | Rankings of Retrieved Occurrences | AP (Average Precision)
Sony logo | 3 | 1st, 4th, 35th | 0.53
Hollywood sign | 1 | 1st | 1
Notre Dame | 1 | 1st | 1
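The Sony logo figure can be checked by hand, assuming AP is the mean of the precision values at the ranks where the three occurrences are retrieved (a common convention; the slides do not spell out the formula). Precision is 1/1 at rank 1, 2/4 at rank 4, and 3/35 at rank 35:

```latex
\mathrm{AP} = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{4} + \frac{3}{35}\right) \approx 0.529 \approx 0.53
```

Under the same assumption, a single occurrence ranked 1st gives AP = 1, as reported for the other two queries.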
Sample External Query Results
Potential Applications
Challenge I: Visual Vocabularies for Very Large Scale Retrieval
Current progress: a 150,000-frame feature movie is reduced to 6,000 keyframes and then processed
Ultimate goal: indexing billions of online images to build a visual search engine
Should the vocabulary increase in size as the image archive grows?
How discriminative should the words be?
Does a vocabulary built from one movie generalize to an outside database of images?
Learning a universal visual vocabulary still remains a challenge
(a), (c) External images downloaded from the Internet
(b) Correct retrieved frame from the movie ‘Pretty Woman’
(d) Correct retrieved frame from the movie ‘Charade’
Challenge II: Retrieval of 3D Objects
Thanks to SIFT features, the current algorithm detects objects successfully despite slight changes in viewpoint, illumination, and partial occlusion
However, 3D retrieval is fundamentally a bigger challenge
Proposed approach 1: Automatic association of images using temporal information
  Grouping front, side, and back views of a car in a video
  Possible on the query side and/or the database side
  Query-side matching: associated query frames are computed and used for 3D image search
Query-side matching of associated frames
Proposed approach 1 (cont.)
Grouping on the database side: a query on a single aspect is expected to retrieve the pregrouped frames associated with the 3D object
(Top row) Query image
(Bottom rows) Matching frames
Proposed approach 2: Building an explicit 3-D model for each 3-D object in the video
Focus is more on model building than detection
Only rigid objects considered
Challenge III: Verification using Spatial Structure
Spatial consistency was helpful, but could be improved
A few suggestions
  Caution with using measures for rigid geometry
  Reduce cost using a hierarchical approach
Two complementary methods
  Ferrari et al. (2004): matching deformable objects
  Rothganger et al. (2003): matching 3D objects
Verification Using Spatial Structure (cont.)
Method 1 (Ferrari)
  Based on spatial overlap of local regions
  Requires regions to match individually and the pattern of intersection between neighboring regions to be preserved
Performance
  Pro: Works well with deformations
  Con: Computationally expensive
Verification Using Spatial Structure (cont.)
Method 2 (Rothganger)
  Based on a 3-D object model
  Requires consistency of local appearance descriptors and geometric consistency
Performance
  Pro: Object can be matched in diverse (even novel) poses
  Con: 3-D model built offline, requires up to 20 images of the object taken from different viewpoints
Conclusion
Demonstrated a scalable object retrieval architecture which uses
  Visual vocabulary based on vector-quantized viewpoint invariant descriptors
  Efficient indexing techniques from text retrieval
A few notable differences between document and image bag-of-words retrieval
  Spatial information
  Numbers of “words” in query
  Matching requirements
Looking forward…
TinEye (May 2008)
  Image-based search engine
  Given a query image, searches for altered versions of that image (scaled or cropped)
  1.86 billion images indexed
Google Goggles (2009)
  Use phone to take photo, results from the internet
  Limited categories
Demo of Retrieval Algorithm
Live demonstration
Main References
D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
J. Sivic and A. Zisserman. Efficient visual search for objects in videos. Proc. IEEE, 96(4):548–566, 2008.
W. Qian “Video Google: A Text Retrieval Approach to Object Matching in Videos.” www.mriedel.ece.umn.edu/wiki/index.php/Weikang_Qian