Perceptual and Sensory Augmented Computing — Visual Object Recognition Tutorial
Visual Object Recognition
Bastian Leibe, Computer Vision Laboratory, ETH Zurich
Kristen Grauman, Department of Computer Sciences, University of Texas at Austin
Chicago, 14.07.2008


Page 1: Visual Object Recognition


Visual Object Recognition

Bastian Leibe
Computer Vision Laboratory, ETH Zurich

Kristen Grauman
Department of Computer Sciences, University of Texas at Austin

Chicago, 14.07.2008

Page 2: Visual Object Recognition


K. Grauman, B. Leibe

Outline

1. Detection with Global Appearance & Sliding Windows

2. Local Invariant Features: Detection & Description

3. Specific Object Recognition with Local Features

― Coffee Break ―

4. Visual Words: Indexing, Bags of Words Categorization

5. Matching Local Feature Sets

6. Part-Based Models for Categorization

7. Current Challenges and Research Directions

Page 3: Visual Object Recognition


Global representations: limitations

• Success may rely on alignment -> sensitive to viewpoint

• All parts of the image or window impact the description -> sensitive to occlusion, clutter


Page 4: Visual Object Recognition


Local representations
• Describe component regions or patches separately.
• Many options for detection & description…


Superpixels [Ren et al.]

Shape context [Belongie 02]

Maximally Stable Extremal Regions

[Matas 02]

Geometric Blur [Berg 05]

SIFT [Lowe 99]

Salient regions [Kadir 01]

Harris-Affine [Mikolajczyk 04]

Spin images [Johnson 99]

Page 5: Visual Object Recognition


Recall: Invariant local features

Subset of local feature types designed to be invariant to

Scale, translation, rotation, affine transformations, illumination

1) Detect interest points 2) Extract descriptors

Descriptor vectors: [x1, x2, …, xd], [y1, y2, …, yd]

[Mikolajczyk01, Matas02, Tuytelaars04, Lowe99, Kadir01,… ]


Page 6: Visual Object Recognition


Recognition with local feature sets

• Previously, we saw how to use local invariant features + a global spatial model to recognize specific objects, using a planar object assumption.

• Now, we’ll use local features for

Indexing-based recognition

Bags of words representations

Correspondence / matching kernels


Aachen Cathedral

Page 7: Visual Object Recognition


Basic flow


Detect or sample features

Describe features

→ List of positions, scales, orientations
→ Associated list of d-dimensional descriptors
→ Index each one into the pool of descriptors from previously seen images

Page 8: Visual Object Recognition


Indexing local features

• Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)


Page 9: Visual Object Recognition


Indexing local features

• Close points in feature space correspond to similar descriptors, and hence to similar local image content.

Figure credit: A. Zisserman

Page 10: Visual Object Recognition


Indexing local features

• We saw in the previous section how to use voting and pose clustering to identify objects using local features


Figure credit: David Lowe

Page 11: Visual Object Recognition


Indexing local features

• With potentially thousands of features per image, and hundreds to millions of images to search, how can we efficiently find the features relevant to a new image?

Low-dimensional descriptors : can use standard efficient data structures for nearest neighbor search

High-dimensional descriptors: approximate nearest neighbor search methods more practical

Inverted file indexing schemes
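For the low-dimensional case, nearest-neighbor search with a standard k-d tree can be sketched as follows — a toy example using SciPy's `cKDTree` on hypothetical random descriptors (the data and sizes are made up):

```python
# Toy sketch: exact nearest-neighbor search over low-dimensional descriptors
# with a k-d tree (hypothetical random data; SciPy's cKDTree).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
database = rng.random((1000, 8))   # 1000 descriptors, 8-D (low-dimensional)
query = database[7] + 1e-6         # a near-duplicate of descriptor 7

tree = cKDTree(database)           # build once, then query many times
dist, idx = tree.query(query, k=3) # 3 nearest database descriptors
```

Building the tree costs O(n log n) once; each query then touches only a fraction of the database, which is the point of using such a structure instead of a linear scan.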


Page 12: Visual Object Recognition


Indexing local features: approximate nearest neighbor search


Best-Bin First (BBF), a variant of k-d trees that uses priority queue to examine most promising branches first [Beis & Lowe, CVPR 1997]

Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin, with high probability [Indyk & Motwani, 1998]
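The LSH idea can be sketched with random hyperplanes — a toy cosine-similarity variant, where all data, sizes, and names are hypothetical:

```python
import numpy as np

# Toy locality-sensitive hashing sketch (random hyperplanes). Descriptors
# with the same sign pattern of projections land in the same bucket, so a
# query scans only one bucket instead of the whole database.
rng = np.random.default_rng(1)

def hash_key(x, planes):
    # bucket key = sign pattern of projections onto the random hyperplanes
    return tuple((planes @ x > 0).astype(int))

planes = rng.standard_normal((12, 128))    # 12 hash bits over 128-D descriptors
database = rng.standard_normal((500, 128))

buckets = {}
for i, d in enumerate(database):
    buckets.setdefault(hash_key(d, planes), []).append(i)

query = database[42]                       # identical descriptor: same bucket for sure
candidates = buckets.get(hash_key(query, planes), [])
```

In practice several independent hash tables are used so that genuinely similar (not identical) descriptors collide with high probability in at least one of them.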

Page 13: Visual Object Recognition


• For text documents, an efficient way to find all pages on which a word occurs is to use an index…

• We want to find all images in which a feature occurs.

• To use this idea, we’ll need to map our features to “visual words”.


Indexing local features: inverted file index
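The text-retrieval analogy can be sketched directly — toy image/word data, with all identifiers hypothetical:

```python
# Toy inverted file index: for each visual word id, store the set of
# images containing it (made-up images and word ids).
from collections import defaultdict

image_words = {
    "img0": [3, 17, 17, 42],
    "img1": [3, 8],
    "img2": [42, 99],
}

inverted = defaultdict(set)
for image, words in image_words.items():
    for w in words:
        inverted[w].add(image)

# Query: which images share at least one word with a query containing {42, 8}?
matches = inverted[42] | inverted[8]
```

Only the lists for the query's words are touched, so lookup cost depends on word frequency, not on database size.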

Page 14: Visual Object Recognition


Visual words: main idea

• Extract some local features from a number of images …


e.g., SIFT descriptor space: each point is 128-dimensional

Slide credit: D. Nister

Page 15: Visual Object Recognition


Visual words: main idea

Slide credit: D. Nister

Page 16: Visual Object Recognition


Visual words: main idea

Slide credit: D. Nister

Page 17: Visual Object Recognition


Visual words: main idea

Slide credit: D. Nister

Page 18: Visual Object Recognition


Slide credit: D. Nister

Page 19: Visual Object Recognition


Slide credit: D. Nister

Page 20: Visual Object Recognition


Visual words: main idea

Map high-dimensional descriptors to tokens/words by quantizing the feature space


• Quantize via clustering, let cluster centers be the prototype “words”

Descriptor space
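The quantization step can be sketched as follows — a toy k-means on synthetic 2-D "descriptors" (real descriptors such as SIFT are 128-D, and the deterministic seeding here is a simplification of the usual k-means++ style initialization):

```python
import numpy as np

# Toy vocabulary formation: cluster descriptors with k-means, keep the
# cluster centers as prototype "words", assign new regions to the nearest
# center. All data here is made up.
rng = np.random.default_rng(0)
descriptors = np.vstack([
    rng.normal(loc=c, scale=0.1, size=(50, 2))
    for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
])

def kmeans(X, k, iters=20):
    # simple deterministic seeding (real systems use k-means++ or similar)
    centers = X[:: len(X) // k][:k].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers

vocab = kmeans(descriptors, k=3)   # 3 prototype "visual words"

def word_id(desc, vocab):
    # nearest cluster center = the region's visual word
    return int(np.argmin(((vocab - desc) ** 2).sum(-1)))
```

Two descriptors extracted from similar-looking patches map to the same word id, which is exactly what lets the later indexing and histogram stages treat images like text documents.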

Page 21: Visual Object Recognition


Visual words: main idea

Map high-dimensional descriptors to tokens/words by quantizing the feature space


• Determine which word to assign to each new image region by finding the closest cluster center.

Descriptor space

Page 22: Visual Object Recognition


Visual words

• Example: each group of patches belongs to the same visual word


Figure from Sivic & Zisserman, ICCV 2003

Page 23: Visual Object Recognition


Visual words

• First explored for texture and material representations

• Texton = cluster center of filter responses over collection of images

• Describe textures and materials based on the distribution of prototypical texture elements.
Leung & Malik 1999; Varma & Zisserman 2002; Lazebnik, Schmid & Ponce 2003

Page 24: Visual Object Recognition


Visual words


• More recently used for describing scenes and objects for the sake of indexing or classification.

Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others.

Page 25: Visual Object Recognition


Inverted file index for images composed of visual words

Image credit: A. Zisserman

Word number → list of image numbers

Page 26: Visual Object Recognition


Bags of visual words

• Summarize entire image based on its distribution (histogram) of word occurrences.

• Analogous to bag of words representation commonly used for documents.

Image credit: Fei-Fei Li
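The histogram summary can be sketched as follows (the word ids and vocabulary size are made up):

```python
import numpy as np

# Toy bag of visual words: summarize an image as a normalized histogram of
# word occurrences. The result is a fixed-length vector regardless of how
# many features the image contained.
vocab_size = 5
image_word_ids = [0, 2, 2, 4, 2, 0]   # words assigned to this image's regions

hist = np.bincount(image_word_ids, minlength=vocab_size).astype(float)
hist /= hist.sum()                    # normalize for comparability across images
```

Normalizing makes histograms comparable between images with different numbers of detected features.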

Page 27: Visual Object Recognition


Video Google System

1. Collect all words within query region

2. Inverted file index to find relevant frames

3. Compare word counts
4. Spatial verification

Sivic & Zisserman, ICCV 2003

• Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html


Query region → Retrieved frames

Page 28: Visual Object Recognition


Basic flow


Detect or sample features

Describe features

→ List of positions, scales, orientations
→ Associated list of d-dimensional descriptors
→ Index each one into the pool of descriptors from previously seen images, or quantize to form a bag of words vector for the image

Page 29: Visual Object Recognition


Visual vocabulary formation

Issues:
• Sampling strategy
• Clustering / quantization algorithm
• Unsupervised vs. supervised
• What corpus provides features (universal vocabulary?)
• Vocabulary size, number of words


Page 30: Visual Object Recognition


Sampling strategies

Image credits: F.-F. Li, E. Nowak, J. Sivic

Dense, uniformly

Sparse, at interest points

Randomly

Multiple interest operators

• To find specific, textured objects, sparse sampling from interest points is often more reliable.

• Multiple complementary interest operators offer more image coverage.

• For object categorization, dense sampling offers better coverage.

[See Nowak, Jurie & Triggs, ECCV 2006]

Page 31: Visual Object Recognition


Clustering / quantization methods

• k-means (typical choice), agglomerative clustering, mean-shift,…

• Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies

Vocabulary tree [Nister & Stewenius, CVPR 2006]
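The hierarchical idea can be sketched as follows — a crude deterministic stand-in for the hierarchical k-means used in the vocabulary tree, with made-up data and sizes:

```python
import numpy as np

# Toy vocabulary-tree sketch: descriptors are split recursively; word lookup
# walks down the tree, comparing a descriptor against only `branch` centers
# per level rather than against every leaf word.
rng = np.random.default_rng(0)

def build_tree(X, branch=3, depth=2):
    if depth == 0 or len(X) < branch:
        return None
    # spread-out seed points stand in for per-node k-means centers
    centers = X[:: max(1, len(X) // branch)][:branch]
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    children = [build_tree(X[labels == j], branch, depth - 1) for j in range(branch)]
    return {"centers": centers, "children": children}

def leaf_path(tree, desc):
    # the sequence of branch choices identifies the visual word
    path = []
    while tree is not None:
        j = int(np.argmin(((tree["centers"] - desc) ** 2).sum(-1)))
        path.append(j)
        tree = tree["children"][j]
    return tuple(path)

descriptors = rng.standard_normal((200, 8))
tree = build_tree(descriptors, branch=3, depth=2)
word = leaf_path(tree, descriptors[0])
```

With branch factor b and depth L the tree supports on the order of b^L words, while assignment costs only b·L comparisons — this is what makes the very large vocabularies of Nister & Stewenius practical.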


Page 32: Visual Object Recognition



Example: Recognition with Vocabulary Tree

• Tree construction:

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 33: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 34: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 35: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 36: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 37: Visual Object Recognition



Vocabulary Tree

• Training: Filling the tree

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

Page 38: Visual Object Recognition



Vocabulary Tree

• Recognition

Slide credit: David Nister

[Nister & Stewenius, CVPR’06]

RANSAC verification

Page 39: Visual Object Recognition



Vocabulary Tree: Performance

• Evaluated on large databases: indexing with up to 1M images
• Online recognition for a database of 50,000 CD covers: retrieval in ~1 s
• Experimentally, large vocabularies can be beneficial for recognition

[Nister & Stewenius, CVPR’06]

Page 40: Visual Object Recognition


Vocabulary formation

• Ensembles of trees provide additional robustness

Figure credit: F. Jurie

Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007; …

Page 41: Visual Object Recognition


Supervised vocabulary formation

• Recent work considers how to leverage labeled images when constructing the vocabulary


Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006.

Page 42: Visual Object Recognition


Supervised vocabulary formation

• Merge words that don’t aid in discriminability

Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005

Page 43: Visual Object Recognition


Supervised vocabulary formation

• Consider vocabulary and classifier construction jointly.


Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008.

Page 44: Visual Object Recognition


Learning and recognition with bag of words histograms
• The bag of words representation makes it possible to describe an unordered point set with a single vector (of fixed dimension across image examples).
• Provides an easy way to use the distribution of feature types with the many learning algorithms that require vector input.
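As a sketch of this, toy bag-of-words histograms plugged into a simple vector-based learner — here a nearest-class-mean classifier; the categories, vocabulary size, and numbers are all invented:

```python
import numpy as np

# Toy sketch: because bag-of-words histograms are fixed-length vectors,
# any vector-based learner applies. Here: nearest-class-mean classification
# over made-up 4-word histograms for two hypothetical categories.
train = {
    "bike": np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.6, 0.2, 0.1, 0.1]]),
    "face": np.array([[0.1, 0.1, 0.2, 0.6],
                      [0.1, 0.2, 0.1, 0.6]]),
}
means = {label: h.mean(axis=0) for label, h in train.items()}

def classify(hist):
    # predict the class whose mean histogram is closest in Euclidean distance
    return min(means, key=lambda lbl: np.linalg.norm(hist - means[lbl]))
```

The same fixed-length vectors plug equally well into SVMs, nearest-neighbor classifiers, or the topic models discussed next.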


Page 45: Visual Object Recognition


• …including unsupervised topic models designed for documents.
• Hierarchical Bayesian text models (pLSA and LDA)
– Hofmann 2001; Blei, Ng & Jordan 2004
– For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005


Learning and recognition with bag of words histograms

Figure credit: Fei-Fei Li

Page 46: Visual Object Recognition


• …including unsupervised topic models designed for documents.


Learning and recognition with bag of words histograms

Probabilistic Latent Semantic Analysis (pLSA): a graphical model over documents d (D in total), latent topics z, and observed words w (N per document); a discovered topic can correspond to a category such as “face”.

Sivic et al. ICCV 2005
[pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/]

Figure credit: Fei-Fei Li

Page 47: Visual Object Recognition


Bags of words: pros and cons

+ flexible to geometry / deformations / viewpoint

+ compact summary of image content

+ provides vector representation for sets

+ has yielded good recognition results in practice

- basic model ignores geometry – must verify afterwards, or encode via features

- background and foreground mixed when bag covers whole image

- interest points or sampling: no guarantee to capture object-level parts

- optimal vocabulary formation remains unclear


Page 48: Visual Object Recognition



Outline

1. Detection with Global Appearance & Sliding Windows

2. Local Invariant Features: Detection & Description

3. Specific Object Recognition with Local Features

― Coffee Break ―

4. Visual Words: Indexing, Bags of Words Categorization

5. Matching Local Feature Sets

6. Part-Based Models for Categorization

7. Current Challenges and Research Directions