Perceptual and Sensory Augmented Computing – Visual Object Recognition Tutorial
Visual Object Recognition
Bastian Leibe
Computer Vision Laboratory, ETH Zurich

Kristen Grauman
Department of Computer Sciences, University of Texas at Austin

Chicago, 14.07.2008
K. Grauman, B. Leibe
Outline
1. Detection with Global Appearance & Sliding Windows
2. Local Invariant Features: Detection & Description
3. Specific Object Recognition with Local Features
― Coffee Break ―
4. Visual Words: Indexing, Bags of Words Categorization
5. Matching Local Feature Sets
6. Part-Based Models for Categorization
7. Current Challenges and Research Directions
Global representations: limitations
• Success may rely on alignment -> sensitive to viewpoint
• All parts of the image or window impact the description -> sensitive to occlusion, clutter
Local representations
• Describe component regions or patches separately.
• Many options for detection & description…
Superpixels [Ren et al.]
Shape context [Belongie 02]
Maximally Stable Extremal Regions
[Matas 02]
Geometric Blur [Berg 05]
SIFT [Lowe 99]
Salient regions [Kadir 01]
Harris-Affine [Mikolajczyk 04]
Spin images [Johnson 99]
Recall: Invariant local features
Subset of local feature types designed to be invariant to
scale, translation, rotation, affine transformations, illumination
1) Detect interest points 2) Extract descriptors
Each descriptor is a d-dimensional vector: [x1, x2, …, xd], [y1, y2, …, yd], …
[Mikolajczyk01, Matas02, Tuytelaars04, Lowe99, Kadir01,… ]
Recognition with local feature sets
• Previously, we saw how to use local invariant features + a global spatial model to recognize specific objects, using a planar object assumption.
• Now, we’ll use local features for
Indexing-based recognition
Bags of words representations
Correspondence / matching kernels
Aachen Cathedral
Basic flow
Detect or sample features → list of positions, scales, orientations
Describe features → associated list of d-dimensional descriptors
Index each one into a pool of descriptors from previously seen images
Indexing local features
• Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)
Indexing local features
• When we see close points in feature space, we have similar descriptors, which indicates similar local content.
Figure credit: A. Zisserman
Indexing local features
• We saw in the previous section how to use voting and pose clustering to identify objects using local features
Figure credit: David Lowe
Indexing local features
• With potentially thousands of features per image, and hundreds to millions of images to search, how can we efficiently find those that are relevant to a new image?
– Low-dimensional descriptors: can use standard efficient data structures for nearest neighbor search
– High-dimensional descriptors: approximate nearest neighbor search methods are more practical
– Inverted file indexing schemes
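For the low-dimensional case, the "standard efficient data structures" are typically k-d trees. A minimal pure-Python sketch (the toy 2-D points stand in for real descriptors; production systems would use an optimized library implementation):

```python
# Minimal k-d tree for exact nearest-neighbor search over
# low-dimensional descriptors (illustrative sketch).

def build(points, depth=0):
    """Recursively build a node: (point, left, right, axis)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1),
            axis)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Depth-first search with pruning; returns the closest stored point."""
    if node is None:
        return best
    point, left, right, axis = node
    if best is None or sq_dist(query, point) < sq_dist(query, best):
        best = point
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff <= 0 else (right, left)
    best = nearest(near, query, best)
    # Descend the far side only if the splitting plane is closer
    # than the best match found so far.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best

pool = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
tree = build(pool)
```

For high-dimensional descriptors like 128-D SIFT, this exact search degrades toward brute force, which is why the approximate methods on the next slide matter.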
11K. Grauman, B. Leibe
Indexing local features: approximate nearest neighbor search
Best-Bin First (BBF), a variant of k-d trees that uses a priority queue to examine the most promising branches first [Beis & Lowe, CVPR 1997]
Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin, with high probability [Indyk & Motwani, 1998]
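The LSH idea can be sketched with random-hyperplane hashing: each hash bit records which side of a random hyperplane a descriptor falls on, so nearby descriptors collide in the same bucket with high probability. (Bit count, dimensionality, and vectors below are illustrative.)

```python
# Sketch of locality-sensitive hashing via random hyperplanes.
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals (Gaussian entries)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(vec, planes):
    """One hash bit per hyperplane: 1 iff the dot product is positive."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) > 0)
                 for plane in planes)

planes = make_hyperplanes(n_bits=8, dim=4, seed=42)
a = [1.0, 0.9, 1.1, 1.0]
b = [1.0, 1.0, 1.0, 1.0]   # close to a, so likely the same bucket
```

Descriptors hashing to the same key are candidate neighbors; only those buckets need to be searched exactly.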
Indexing local features: inverted file index
• For text documents, an efficient way to find all pages on which a word occurs is to use an index…
• We want to find all images in which a feature occurs.
• To use this idea, we’ll need to map our features to “visual words”.
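The inverted file idea is a small amount of code once features are mapped to word ids: store, for each word, the images containing it, and a query only touches the lists of words it actually uses. (Image names and word ids below are illustrative.)

```python
# Sketch of an inverted file index over visual words.
from collections import defaultdict

index = defaultdict(set)          # word id -> set of image ids

def add_image(image_id, words):
    """Register every visual word occurring in an image."""
    for w in words:
        index[w].add(image_id)

def candidate_images(query_words):
    """Images sharing at least one visual word with the query."""
    hits = set()
    for w in query_words:
        hits |= index.get(w, set())
    return hits

add_image("img1", [3, 17, 42])
add_image("img2", [17, 99])
```

Retrieval cost then scales with the lengths of the touched lists, not with the database size.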
Visual words: main idea
• Extract some local features from a number of images …
e.g., SIFT descriptor space: each point is 128-dimensional
Slide credit: D. Nister
Visual words: main idea
Slide credit: D. Nister
Visual words: main idea
Map high-dimensional descriptors to tokens/words by quantizing the feature space
• Quantize via clustering, let cluster centers be the prototype “words”
Descriptor space
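The quantization step above can be sketched with a few Lloyd iterations of k-means; the resulting centers are the prototype "words", and a new region's word id is its nearest center. Toy 2-D descriptors stand in for 128-D SIFT vectors; the data and cluster count are illustrative.

```python
# Sketch: learn visual words by k-means, assign by nearest center.

def quantize(p, centers):
    """Word id for a descriptor = index of its nearest center."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(p, centers[c])))

def kmeans(points, centers, iters=10):
    """Lloyd's algorithm: assign points to nearest center, then
    move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[quantize(p, centers)].append(p)
        centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                   else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 9.0), (9.0, 10.0)]
centers = kmeans(points, centers=[(1.0, 1.0), (9.0, 9.0)], iters=5)
```

`quantize` is reused at recognition time: each new image region gets the word id of the closest learned center.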
Visual words: main idea
Map high-dimensional descriptors to tokens/words by quantizing the feature space
• Determine which word to assign to each new image region by finding the closest cluster center.
Descriptor space
Visual words
• Example: each group of patches belongs to the same visual word
Figure from Sivic & Zisserman, ICCV 2003
Visual words
• First explored for texture and material representations
• Texton = cluster center of filter responses over collection of images
• Describe textures and materials based on distribution of prototypical texture elements.
Leung & Malik 1999; Varma & Zisserman 2002; Lazebnik, Schmid & Ponce 2003
Visual words
• More recently used for describing scenes and objects for the sake of indexing or classification.
Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others.
Inverted file index for images comprised of visual words
Word number → list of image numbers
Image credit: A. Zisserman
Bags of visual words
• Summarize entire image based on its distribution (histogram) of word occurrences.
• Analogous to bag of words representation commonly used for documents.
Image credit: Fei-Fei Li
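The summarization step is just a normalized word-occurrence histogram; a minimal sketch (vocabulary size and word ids are illustrative):

```python
# Sketch: an image as a normalized histogram of visual-word counts.

def bag_of_words(word_ids, vocab_size):
    """Fixed-length vector: fraction of features assigned to each word."""
    hist = [0] * vocab_size
    for w in word_ids:
        hist[w] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

Every image, whatever its number of features, maps to the same fixed dimension, which is exactly what the later learning methods need.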
Video Google System
1. Collect all words within query region
2. Inverted file index to find relevant frames
3. Compare word counts
4. Spatial verification
Sivic & Zisserman, ICCV 2003
• Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
Query region
Retrieved frames
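Step 3's word-count comparison uses tf-idf weighting, as in text retrieval. A sketch under illustrative data (document names and counts are made up; the weighting formula is the standard tf-idf, not necessarily Video Google's exact variant):

```python
# Sketch: tf-idf weighting of visual-word counts + cosine ranking.
import math

def tfidf(counts, doc_freq, n_docs):
    """Weight each word by term frequency times inverse document frequency."""
    total = sum(counts.values())
    return {w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {"A": {1: 2, 2: 2}, "B": {2: 1, 3: 1}, "C": {1: 1}}
df = {}
for counts in docs.values():
    for w in counts:
        df[w] = df.get(w, 0) + 1
weights = {name: tfidf(c, df, len(docs)) for name, c in docs.items()}
```

Frames are then ranked by cosine similarity of their weighted word vectors before spatial verification.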
Basic flow
Detect or sample features → list of positions, scales, orientations
Describe features → associated list of d-dimensional descriptors
Then either index each one into the pool of descriptors from previously seen images, or quantize to form a bag of words vector for the image
Visual vocabulary formation
Issues:
• Sampling strategy
• Clustering / quantization algorithm
• Unsupervised vs. supervised
• What corpus provides features (universal vocabulary?)
• Vocabulary size, number of words
Sampling strategies
Image credits: F-F. Li, E. Nowak, J. Sivic
Dense, uniformly
Sparse, at interest points
Randomly
Multiple interest operators
• To find specific, textured objects, sparse sampling from interest points is often more reliable.
• Multiple complementary interest operators offer more image coverage.
• For object categorization, dense sampling offers better coverage.
[See Nowak, Jurie & Triggs, ECCV 2006]
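Of the strategies above, dense uniform sampling is the simplest to state in code: take patch centers on a regular grid at several scales. A sketch (step size and scale set are illustrative choices, not values from the cited paper):

```python
# Sketch: dense, uniform sampling of patch centers over an image.

def dense_grid(width, height, step=8, scales=(1.0, 2.0)):
    """Return (x, y, scale) triples on a regular grid, one grid per scale."""
    return [(x, y, s)
            for s in scales
            for y in range(0, height, step)
            for x in range(0, width, step)]
```

Each returned triple would then be described (e.g., by a SIFT descriptor at that position and scale) and quantized like any detected interest point.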
Clustering / quantization methods
• k-means (typical choice), agglomerative clustering, mean-shift,…
• Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies
Vocabulary tree [Nister & Stewenius, CVPR 2006]
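Word assignment in a vocabulary tree is repeated nearest-child descent: at each level, compare the descriptor to the k child centers and recurse into the closest, reaching a leaf word in O(depth · k) distance computations instead of one per word. The tiny hand-built tree and word ids below are illustrative; real trees come from hierarchical k-means over training descriptors.

```python
# Sketch: word lookup by descending a (hand-built) vocabulary tree.

def descend(tree, desc):
    """tree node = (center, children) with children a list of subtrees,
    or (center, word_id) at a leaf; returns the leaf's word id."""
    center, payload = tree
    while isinstance(payload, list):
        # Recurse into the child whose center is nearest the descriptor.
        tree = min(payload,
                   key=lambda t: sum((a - b) ** 2
                                     for a, b in zip(desc, t[0])))
        center, payload = tree
    return payload

# Two coarse branches, each split into two leaf words (ids 0..3).
tree = ((0.0, 0.0), [
    ((0.0, 0.0), [((0.0, 0.0), 0), ((0.0, 2.0), 1)]),
    ((10.0, 10.0), [((10.0, 10.0), 2), ((10.0, 12.0), 3)]),
])
```

This logarithmic lookup is what lets the vocabulary grow to the ~1M words used in the experiments below.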
Example: Recognition with Vocabulary Tree
• Tree construction:
Slide credit: David Nister
[Nister & Stewenius, CVPR’06]
Vocabulary Tree
• Training: Filling the tree
Slide credit: David Nister
[Nister & Stewenius, CVPR’06]
Vocabulary Tree
• Recognition
Slide credit: David Nister
[Nister & Stewenius, CVPR’06]
RANSAC verification
Vocabulary Tree: Performance
• Evaluated on large databases: indexing with up to 1M images
• Online recognition for database of 50,000 CD covers: retrieval in ~1 s
• Find experimentally that large vocabularies can be beneficial for recognition
[Nister & Stewenius, CVPR’06]
Vocabulary formation
• Ensembles of trees provide additional robustness
Figure credit: F. Jurie
Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007; …
Supervised vocabulary formation
• Recent work considers how to leverage labeled images when constructing the vocabulary
Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006.
Supervised vocabulary formation
• Merge words that don’t aid in discriminability
Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005
Supervised vocabulary formation
• Consider vocabulary and classifier construction jointly.
Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008.
Learning and recognition with bag of words histograms
• Bag of words representation makes it possible to describe the unordered point set with a single vector (of fixed dimension across image examples).
• Provides an easy way to use the distribution of feature types with various learning algorithms requiring vector input.
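Because every image becomes a fixed-length vector, any vector-based learner applies directly. As a minimal illustration (not a method from the tutorial), a nearest-class-mean classifier over toy histograms, with all data made up:

```python
# Sketch: classify bag-of-words histograms by nearest class mean.

def class_means(examples):
    """examples: list of (histogram, label) -> {label: mean vector}."""
    sums, counts = {}, {}
    for hist, label in examples:
        acc = sums.setdefault(label, [0.0] * len(hist))
        for i, v in enumerate(hist):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc]
            for lab, acc in sums.items()}

def predict(hist, means):
    """Label whose mean histogram is closest in squared distance."""
    return min(means,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(hist, means[lab])))

train = [([1.0, 0.0, 0.0], "face"), ([0.8, 0.2, 0.0], "face"),
         ([0.0, 0.1, 0.9], "car"), ([0.0, 0.0, 1.0], "car")]
means = class_means(train)
```

In practice the same fixed-length vectors feed SVMs, topic models (next slides), and other standard learners.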
• …including unsupervised topic models designed for documents.
• Hierarchical Bayesian text models (pLSA and LDA)
– Hofmann 2001; Blei, Ng & Jordan 2004
– For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005
Figure credit: Fei-Fei Li
Probabilistic Latent Semantic Analysis (pLSA)
[Plate diagram: observed words w, latent topics z, N_d words per document, D documents; a discovered topic corresponds to e.g. “face”]
Sivic et al. ICCV 2005
[pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/]
Figure credit: Fei-Fei Li
Bags of words: pros and cons
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides vector representation for sets
+ has yielded good recognition results in practice
- basic model ignores geometry – must verify afterwards, or encode via features
- background and foreground mixed when bag covers whole image
- interest points or sampling: no guarantee to capture object-level parts
- optimal vocabulary formation remains unclear