Perceptual and Sensory Augmented Computing — Visual Object Recognition Tutorial
Visual Object Recognition
Bastian Leibe, Computer Vision Laboratory, ETH Zurich
Kristen Grauman, Department of Computer Sciences, University of Texas at Austin
Chicago, 14.07.2008


Page 1: Visual Object Recognition


Visual Object Recognition

Bastian Leibe
Computer Vision Laboratory, ETH Zurich

Kristen Grauman
Department of Computer Sciences, University of Texas at Austin

Chicago, 14.07.2008

Page 2: Visual Object Recognition


K. Grauman, B. Leibe

Outline

1. Detection with Global Appearance & Sliding Windows

2. Local Invariant Features: Detection & Description

3. Specific Object Recognition with Local Features

― Coffee Break ―

4. Visual Words: Indexing, Bags of Words Categorization

5. Matching Local Feature Sets

6. Part-Based Models for Categorization

7. Current Challenges and Research Directions

Page 3: Visual Object Recognition


Global representations: limitations

• Success may rely on alignment -> sensitive to viewpoint

• All parts of the image or window impact the description -> sensitive to occlusion, clutter


Page 4: Visual Object Recognition


Local representations
• Describe component regions or patches separately.
• Many options for detection & description…


Superpixels [Ren et al.]

Shape context [Belongie 02]

Maximally Stable Extremal Regions

[Matas 02]

Geometric Blur [Berg 05]

SIFT [Lowe 99]

Salient regions [Kadir 01]

Harris-Affine [Mikolajczyk 04]

Spin images [Johnson 99]

Page 5: Visual Object Recognition


Recall: Invariant local features

Subset of local feature types designed to be invariant to

Scale, translation, rotation, affine transformations, illumination

1) Detect interest points 2) Extract descriptors

Descriptor vectors: [x1, x2, …, xd], [y1, y2, …, yd]

[Mikolajczyk01, Matas02, Tuytelaars04, Lowe99, Kadir01,… ]


Page 6: Visual Object Recognition


Recognition with local feature sets

• Previously, we saw how to use local invariant features + a global spatial model to recognize specific objects, using a planar object assumption.

• Now, we’ll use local features for

Indexing-based recognition

Bags of words representations

Correspondence / matching kernels


Aachen Cathedral

Page 7: Visual Object Recognition


Basic flow


Detect or sample features

Describe features

→ List of positions, scales, orientations
→ Associated list of d-dimensional descriptors
→ Index each one into the pool of descriptors from previously seen images

Page 8: Visual Object Recognition


Indexing local features

• Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)


Page 9: Visual Object Recognition


Indexing local features

• Close points in feature space correspond to similar descriptors, and hence to similar local image content.

Figure credit: A. Zisserman

Page 10: Visual Object Recognition


Indexing local features

• We saw in the previous section how to use voting and pose clustering to identify objects using local features


Figure credit: David Lowe

Page 11: Visual Object Recognition


Indexing local features

• With potentially thousands of features per image, and hundreds to millions of images to search, how can we efficiently find the features relevant to a new image?

Low-dimensional descriptors : can use standard efficient data structures for nearest neighbor search

High-dimensional descriptors: approximate nearest neighbor search methods more practical

Inverted file indexing schemes
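For the low-dimensional case, nearest-neighbor search with a standard k-d tree can be sketched as follows — a toy example using SciPy's `cKDTree` on hypothetical random descriptors (the data and sizes are made up):

```python
# Toy sketch: exact nearest-neighbor search over low-dimensional descriptors
# with a k-d tree (hypothetical random data; SciPy's cKDTree).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
database = rng.random((1000, 8))   # 1000 descriptors, 8-D (low-dimensional)
query = database[7] + 1e-6         # a near-duplicate of descriptor 7

tree = cKDTree(database)           # build once, then query many times
dist, idx = tree.query(query, k=3) # 3 nearest database descriptors
```

Building the tree costs O(n log n) once; each query then touches only a fraction of the database, which is the point of using such a structure instead of a linear scan.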


Page 12: Visual Object Recognition


Indexing local features: approximate nearest neighbor search


Best-Bin First (BBF), a variant of k-d trees that uses priority queue to examine most promising branches first [Beis & Lowe, CVPR 1997]

Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin, with high probability [Indyk & Motwani, 1998]
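The LSH idea can be sketched with random hyperplanes — a toy cosine-similarity variant, where all data, sizes, and names are hypothetical:

```python
import numpy as np

# Toy locality-sensitive hashing sketch (random hyperplanes). Descriptors
# with the same sign pattern of projections land in the same bucket, so a
# query scans only one bucket instead of the whole database.
rng = np.random.default_rng(1)

def hash_key(x, planes):
    # bucket key = sign pattern of projections onto the random hyperplanes
    return tuple((planes @ x > 0).astype(int))

planes = rng.standard_normal((12, 128))    # 12 hash bits over 128-D descriptors
database = rng.standard_normal((500, 128))

buckets = {}
for i, d in enumerate(database):
    buckets.setdefault(hash_key(d, planes), []).append(i)

query = database[42]                       # identical descriptor: same bucket for sure
candidates = buckets.get(hash_key(query, planes), [])
```

In practice several independent hash tables are used so that genuinely similar (not identical) descriptors collide with high probability in at least one of them.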

Page 13: Visual Object Recognition


• For text documents, an efficient way to find all pages on which a word occurs is to use an index…

• We want to find all images in which a feature occurs.

• To use this idea, we’ll need to map our features to “visual words”.


Indexing local features: inverted file index
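The text-retrieval analogy can be sketched directly — toy image/word data, with all identifiers hypothetical:

```python
# Toy inverted file index: for each visual word id, store the set of
# images containing it (made-up images and word ids).
from collections import defaultdict

image_words = {
    "img0": [3, 17, 17, 42],
    "img1": [3, 8],
    "img2": [42, 99],
}

inverted = defaultdict(set)
for image, words in image_words.items():
    for w in words:
        inverted[w].add(image)

# Query: which images share at least one word with a query containing {42, 8}?
matches = inverted[42] | inverted[8]
```

Only the lists for the query's words are touched, so lookup cost depends on word frequency, not on database size.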

Page 14: Visual Object Recognition


Visual words: main idea

• Extract some local features from a number of images …


e.g., SIFT descriptor space: each point is 128-dimensional

Slide credit: D. Nister

Page 15: Visual Object Recognition


Visual words: main idea

Slide credit: D. Nister

Page 16: Visual Object Recognition


Visual words: main idea

Slide credit: D. Nister

Page 17: Visual Object Recognition


Visual words: main idea

Slide credit: D. Nister

Page 18: Visual Object Recognition


Slide credit: D. Nister

Page 19: Visual Object Recognition


Slide credit: D. Nister

Page 20: Visual Object Recognition


Visual words: main idea

Map high-dimensional descriptors to tokens/words by quantizing the feature space


• Quantize via clustering, let cluster centers be the prototype “words”

Descriptor space
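The quantization step can be sketched as follows — a toy k-means on synthetic 2-D "descriptors" (real descriptors such as SIFT are 128-D, and the deterministic seeding here is a simplification of the usual k-means++ style initialization):

```python
import numpy as np

# Toy vocabulary formation: cluster descriptors with k-means, keep the
# cluster centers as prototype "words", assign new regions to the nearest
# center. All data here is made up.
rng = np.random.default_rng(0)
descriptors = np.vstack([
    rng.normal(loc=c, scale=0.1, size=(50, 2))
    for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
])

def kmeans(X, k, iters=20):
    # simple deterministic seeding (real systems use k-means++ or similar)
    centers = X[:: len(X) // k][:k].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers

vocab = kmeans(descriptors, k=3)   # 3 prototype "visual words"

def word_id(desc, vocab):
    # nearest cluster center = the region's visual word
    return int(np.argmin(((vocab - desc) ** 2).sum(-1)))
```

Two descriptors extracted from similar-looking patches map to the same word id, which is exactly what lets the later indexing and histogram stages treat images like text documents.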

Page 21: Visual Object Recognition


Visual words: main idea

Map high-dimensional descriptors to tokens/words by quantizing the feature space


• Determine which word to assign to each new image region by finding the closest cluster center.

Descriptor space

Page 22: Visual Object Recognition


Visual words

• Example: each group of patches belongs to the same visual word


Figure from Sivic & Zisserman, ICCV 2003

Page 23: Visual Object Recognition


Visual words

• First explored for texture and material representations

• Texton = cluster center of filter responses over collection of images

• Describe textures and materials based on the distribution of prototypical texture elements.
Leung & Malik 1999; Varma & Zisserman 2002; Lazebnik, Schmid & Ponce 2003

Page 24: Visual Object Recognition


Visual words


• More recently used for describing scenes and objects for the sake of indexing or classification.

Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others.

Page 25: Visual Object Recognition


Inverted file index for images composed of visual words

Image credit: A. Zisserman

Word number → list of image numbers

Page 26: Visual Object Recognition


Bags of visual words

• Summarize entire image based on its distribution (histogram) of word occurrences.

• Analogous to bag of words representation commonly used for documents.

Image credit: Fei-Fei Li
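The histogram summary can be sketched as follows (the word ids and vocabulary size are made up):

```python
import numpy as np

# Toy bag of visual words: summarize an image as a normalized histogram of
# word occurrences. The result is a fixed-length vector regardless of how
# many features the image contained.
vocab_size = 5
image_word_ids = [0, 2, 2, 4, 2, 0]   # words assigned to this image's regions

hist = np.bincount(image_word_ids, minlength=vocab_size).astype(float)
hist /= hist.sum()                    # normalize for comparability across images
```

Normalizing makes histograms comparable between images with different numbers of detected features.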

Page 27: Visual Object Recognition


Video Google System

1. Collect all words within query region

2. Inverted file index to find relevant frames

3. Compare word counts
4. Spatial verification

Sivic & Zisserman, ICCV 2003

• Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html


Query region → Retrieved frames

Page 28: Visual Object Recognition


Basic flow


Detect or sample features

Describe features

→ List of positions, scales, orientations
→ Associated list of d-dimensional descriptors
→ Index each one into the pool of descriptors from previously seen images, or quantize to form a bag of words vector for the image

Page 29: Visual Object Recognition


Visual vocabulary formation

Issues:
• Sampling strategy
• Clustering / quantization algorithm
• Unsupervised vs. supervised
• What corpus provides features (universal vocabulary?)
• Vocabulary size, number of words


Page 30: Visual Object Recognition


Sampling strategies

Image credits: F.-F. Li, E. Nowak, J. Sivic

Dense, uniformly

Sparse, at interest points

Randomly

Multiple interest operators

• To find specific, textured objects, sparse sampling from interest points is often more reliable.

• Multiple complementary interest operators offer more image coverage.

• For object categorization, dense sampling offers better coverage.

[See Nowak, Jurie & Triggs, ECCV 2006]

Page 31: Visual Object Recognition


Clustering / quantization methods

• k-means (typical choice), agglomerative clustering, mean-shift,…

• Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies

Vocabulary tree [Nister & Stewenius, CVPR 2006]
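The hierarchical idea can be sketched as follows — a crude deterministic stand-in for the hierarchical k-means used in the vocabulary tree, with made-up data and sizes:

```python
import numpy as np

# Toy vocabulary-tree sketch: descriptors are split recursively; word lookup
# walks down the tree, comparing a descriptor against only `branch` centers
# per level rather than against every leaf word.
rng = np.random.default_rng(0)

def build_tree(X, branch=3, depth=2):
    if depth == 0 or len(X) < branch:
        return None
    # spread-out seed points stand in for per-node k-means centers
    centers = X[:: max(1, len(X) // branch)][:branch]
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    children = [build_tree(X[labels == j], branch, depth - 1) for j in range(branch)]
    return {"centers": centers, "children": children}

def leaf_path(tree, desc):
    # the sequence of branch choices identifies the visual word
    path = []
    while tree is not None:
        j = int(np.argmin(((tree["centers"] - desc) ** 2).sum(-1)))
        path.append(j)
        tree = tree["children"][j]
    return tuple(path)

descriptors = rng.standard_normal((200, 8))
tree = build_tree(descriptors, branch=3, depth=2)
word = leaf_path(tree, descriptors[0])
```

With branch factor b and depth L the tree supports on the order of b^L words, while assignment costs only b·L comparisons — this is what makes the very large vocabularies of Nister & Stewenius practical.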


Page 32: Visual Object Recognition



Example: Recognition with Vocabulary Tree

• Tree construction:

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 33: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 34: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 35: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 36: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 37: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 38: Visual Object Recognition



Vocabulary Tree

• Recognition

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

RANSAC verification

Page 39: Visual Object Recognition



Vocabulary Tree: Performance

• Evaluated on large databases: indexing with up to 1M images
• Online recognition for a database of 50,000 CD covers: retrieval in ~1 s
• Experimentally, large vocabularies can be beneficial for recognition

[Nister & Stewenius, CVPR’06]

Page 40: Visual Object Recognition


Vocabulary formation

• Ensembles of trees provide additional robustness

Figure credit: F. Jurie

Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007; …

Page 41: Visual Object Recognition


Supervised vocabulary formation

• Recent work considers how to leverage labeled images when constructing the vocabulary


Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006.

Page 42: Visual Object Recognition


Supervised vocabulary formation

• Merge words that don’t aid in discriminability

Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005

Page 43: Visual Object Recognition


Supervised vocabulary formation

• Consider vocabulary and classifier construction jointly.


Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008.

Page 44: Visual Object Recognition


Learning and recognition with bag of words histograms
• The bag of words representation makes it possible to describe an unordered point set with a single vector (of fixed dimension across image examples).
• Provides an easy way to use the distribution of feature types with the many learning algorithms that require vector input.
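As a sketch of this, toy bag-of-words histograms plugged into a simple vector-based learner — here a nearest-class-mean classifier; the categories, vocabulary size, and numbers are all invented:

```python
import numpy as np

# Toy sketch: because bag-of-words histograms are fixed-length vectors,
# any vector-based learner applies. Here: nearest-class-mean classification
# over made-up 4-word histograms for two hypothetical categories.
train = {
    "bike": np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.6, 0.2, 0.1, 0.1]]),
    "face": np.array([[0.1, 0.1, 0.2, 0.6],
                      [0.1, 0.2, 0.1, 0.6]]),
}
means = {label: h.mean(axis=0) for label, h in train.items()}

def classify(hist):
    # predict the class whose mean histogram is closest in Euclidean distance
    return min(means, key=lambda lbl: np.linalg.norm(hist - means[lbl]))
```

The same fixed-length vectors plug equally well into SVMs, nearest-neighbor classifiers, or the topic models discussed next.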


Page 45: Visual Object Recognition


• …including unsupervised topic models designed for documents.
• Hierarchical Bayesian text models (pLSA and LDA)
– Hofmann 2001; Blei, Ng & Jordan 2004
– For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005


Learning and recognition with bag of words histograms

Figure credit: Fei-Fei Li

Page 46: Visual Object Recognition


• …including unsupervised topic models designed for documents.


Learning and recognition with bag of words histograms

Probabilistic Latent Semantic Analysis (pLSA): a graphical model over documents d (D in total), latent topics z, and observed words w (N per document); a discovered topic can correspond to a category such as “face”.

Sivic et al. ICCV 2005
[pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/]

Figure credit: Fei-Fei Li

Page 47: Visual Object Recognition


Bags of words: pros and cons

+ flexible to geometry / deformations / viewpoint

+ compact summary of image content

+ provides vector representation for sets

+ has yielded good recognition results in practice

- basic model ignores geometry – must verify afterwards, or encode via features

- background and foreground mixed when bag covers whole image

- interest points or sampling: no guarantee to capture object-level parts

- optimal vocabulary formation remains unclear


Page 48: Visual Object Recognition



Outline

1. Detection with Global Appearance & Sliding Windows

2. Local Invariant Features: Detection & Description

3. Specific Object Recognition with Local Features

― Coffee Break ―

4. Visual Words: Indexing, Bags of Words Categorization

5. Matching Local Feature Sets

6. Part-Based Models for Categorization

7. Current Challenges and Research Directions