AAAI08 tutorial: visual object recognition

Perceptual and Sensory Augmented Computing

Visual Object Recognition Tutorial

Visual Object Recognition


Bastian Leibe, Computer Vision Laboratory, ETH Zurich

Kristen Grauman, Department of Computer Sciences, University of Texas at Austin

Chicago, 14.07.2008


Identification vs. Categorization


Object Categorization

• How to recognize ANY car



• How to recognize ANY cow


What could be done with recognition algorithms?

There is a wide range of applications, including…


• Medical image analysis
• Navigation, driver safety
• Autonomous robots
• Situated search
• Content-based retrieval and analysis for images and videos


Object Categorization

• Task Description

  – “Given a small number of training images of a category, recognize a-priori unknown instances of that category and assign the correct category label.”

• Which categories are feasible visually?



  – Extensively studied in Cognitive Psychology, e.g. [Brown ’58]

[Figure: one object labeled at several levels of abstraction: “Fido”, German shepherd, dog, animal, living being]


Visual Object Categories

• Basic Level Categories in human categorization [Rosch 76, Lakoff 87]

  – The highest level at which category members have similar perceived shape

  – The highest level at which a single mental image reflects the entire category


  – The level at which human subjects are usually fastest at identifying category members

  – The first level named and understood by children

  – The highest level at which a person uses similar motor actions for interaction with category members


Visual Object Categories

• Basic-level categories in humans seem to be defined predominantly visually.

• There is evidence that humans (usually) start with basic-level categorization before doing identification.

⇒ Basic-level categorization is easier and faster for humans than object identification!

⇒ Most promising starting point for visual classification

[Figure: category hierarchy with abstract levels, basic level, and individual level; nodes include animal, quadruped, dog, cat, cow, German shepherd, Doberman, “Fido”]


Other Types of Categories

• Functional Categories

  – e.g. chairs = “something you can sit on”


Other Types of Categories

• Ad-hoc categories

  – e.g. “something you can find in an office environment”


Levels of Object Categorization

“cow”

“motorbike”

“car”



• Different levels of recognition

  – Which object class is in the image? ⇒ Object/image classification

  – Where is it in the image? ⇒ Detection/localization

  – Where exactly, which pixels? ⇒ Figure/ground segmentation


Challenges: robustness

• Illumination
• Object pose
• Clutter
• Viewpoint
• Intra-class appearance
• Occlusions

Challenges: robustness



• Detection in Crowded Scenes
  – Learn object variability
      – Changes in appearance, scale, and articulation
  – Compensate for clutter, overlap, and occlusion


Challenges: context and human experience


[Figures: context cues (image credit: D. Hoiem); dynamics (video credit: J. Davis)]


Challenges: scale, efficiency

• Thousands to millions of pixels in an image
• Estimated 30 gigapixels of image/video content generated per second
• About half of the cerebral cortex in primates is devoted to processing visual information [Felleman and van Essen 1991]
• 3,000-30,000 human recognizable object categories
• 30+ degrees of freedom in the pose of articulated objects (humans)
• Billions of images indexed by Google Image Search
• 18 billion+ prints produced from digital camera images in 2004
• 295.5 million camera phones sold in 2005



Challenges: learning with minimal supervision

[Figure: axis of supervision, labeled “Less” and “More”]

Rough evolution of focus in recognition research

[Figure: timeline with columns for the 1980s, the 1990s to early 2000s, and currently]

This tutorial

• Intended for broad AAAI audience

  – Assuming basic familiarity with machine learning, linear algebra, probability

  – Not assuming significant vision background

• Our goals


  – Describe main approaches to recognition

  – Highlight past successes and future challenges

  – Provide the pointers (to literature and tools) that would allow you to take advantage of existing techniques in your research

• Questions welcome



Outline

1. Detection with Global Appearance & Sliding Windows

2. Local Invariant Features: Detection & Description

3. Specific Object Recognition with Local Features

― Coffee Break ―




4. Visual Words: Indexing, Bags of Words Categorization

5. Matching Local Features

6. Part-Based Models for Categorization

7. Current Challenges and Research Directions



1. Detection with Global Appearance & Sliding Windows

Detection via classification: Main idea

Basic component: a binary classifier

[Figure: image windows fed to a car/non-car classifier, which answers “Yes, car.” or “No, not a car.”]





If the object may be in a cluttered scene, slide a window around looking for it.




Fleshing out this pipeline a bit more, we need to:

1. Obtain training data
2. Define features
3. Define classifier

[Figure labels: training examples, feature extraction]




• Consider all subwindows in an image
  – Sample at multiple scales and positions

• Make a decision per window:
  – “Does this contain object category X or not?”

• In this section, we’ll focus specifically on methods using a global representation (i.e., not part-based, not local features).
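The window enumeration described above can be sketched as follows. This is a minimal illustration, not the tutorial's code: the base window size, the scale set, the stride, and the `classify` callback are all assumptions chosen for the example, and images are plain nested lists to keep it dependency-free.

```python
from typing import Callable, Iterator, List, Tuple

Image = List[List[float]]            # row-major grayscale image
Window = Tuple[int, int, int, int]   # (row, col, height, width)


def sliding_windows(img_h: int, img_w: int, base: int = 24,
                    scales: Tuple[float, ...] = (1.0, 1.5, 2.0),
                    stride: int = 4) -> Iterator[Window]:
    """Yield square windows at multiple scales and positions."""
    for s in scales:
        size = int(round(base * s))
        step = max(1, int(round(stride * s)))  # coarser steps for larger windows
        for r in range(0, img_h - size + 1, step):
            for c in range(0, img_w - size + 1, step):
                yield (r, c, size, size)


def detect(img: Image, classify: Callable[[Image], bool],
           **window_args) -> List[Window]:
    """Run a binary classifier on every window and keep the positives."""
    hits = []
    for (r, c, h, w) in sliding_windows(len(img), len(img[0]), **window_args):
        crop = [row[c:c + w] for row in img[r:r + h]]
        if classify(crop):
            hits.append((r, c, h, w))
    return hits
```

In a real detector the crop would be converted to a feature vector before classification, and overlapping positive windows would usually be merged afterwards.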


Feature extraction: global appearance

• Simple holistic descriptions of image content
  – grayscale / color histogram
  – vector of pixel intensities



Eigenfaces: global appearance description

An early appearance-based approach to face recognition [Turk & Pentland, 1991]

• Generate low-dimensional representation of appearance with a linear subspace.
• From the training images, compute the mean and the eigenvectors of the covariance matrix.
• Project new images to “face space”.
• Recognition via nearest neighbors in face space.

[Figure: a face approximated as the mean plus a weighted sum of eigenvectors]
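A minimal sketch of the subspace idea, under simplifying assumptions: images are flattened to plain lists, only the single leading eigenvector is computed (by power iteration, a technique swapped in here for brevity), and the helper names are hypothetical. A full eigenfaces system would keep several eigenvectors.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(data: List[Vector]) -> Vector:
    n = len(data)
    return [sum(col) / n for col in zip(*data)]


def top_eigenvector(data: List[Vector], iters: int = 200) -> Vector:
    """Leading eigenvector of the data covariance via power iteration."""
    mu = mean_vector(data)
    centered = [[x - m for x, m in zip(row, mu)] for row in data]
    d = len(mu)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-symmetric start
    for _ in range(iters):
        # Apply the covariance without forming it: C v = X^T (X v) / n
        proj = [sum(a * b for a, b in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered)) / len(data)
             for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(x: Vector, mu: Vector, basis: List[Vector]) -> Vector:
    """Coordinates of x in the subspace spanned by `basis` ("face space")."""
    diff = [a - m for a, m in zip(x, mu)]
    return [sum(a * b for a, b in zip(diff, e)) for e in basis]


def nearest_neighbor(query: Vector, gallery: List[Vector]) -> int:
    """Index of the gallery point closest to `query` (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(query, g)) for g in gallery]
    return dists.index(min(dists))
```

Recognition then amounts to projecting every training image once, projecting the query, and calling `nearest_neighbor` on the projected coordinates.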


Feature extraction: global appearance

• Pixel-based representations sensitive to small shifts

• Color or grayscale-based appearance description can be sensitive to illumination and intra-class appearance variation


Cartoon example: an albino koala


Gradient-based representations

• Consider edges, contours, and (oriented) intensity gradients



Gradient-based representations: Matching edge templates

• Example: Chamfer matching

At each window position, compute average min distance between points on template (T) and input (I).

[Figure: template shape; input image; edges detected; distance transform; best match]

Gavrila & Philomin ICCV 1999
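Written out, the score at one window position is $D(T, I) = \frac{1}{|T|} \sum_{t \in T} d_I(t)$, where $d_I(t)$ is the distance from template point $t$ to the nearest edge pixel of the input. The sketch below assumes binary edge maps as nested lists and uses a brute-force distance transform for clarity; the function names are illustrative, and a practical system would use a fast distance transform instead.

```python
from typing import List, Optional, Tuple

Point = Tuple[int, int]  # (row, col) offset inside the template


def distance_transform(edges: List[List[int]]) -> List[List[float]]:
    """Euclidean distance from every pixel to the nearest edge pixel (brute force)."""
    pts = [(r, c) for r, row in enumerate(edges) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("edge map contains no edge pixels")
    return [[min(((r - pr) ** 2 + (c - pc) ** 2) ** 0.5 for pr, pc in pts)
             for c in range(len(edges[0]))]
            for r in range(len(edges))]


def chamfer_score(dt: List[List[float]], template: List[Point],
                  top: int, left: int) -> float:
    """Average distance-transform value under the template placed at (top, left)."""
    return sum(dt[top + r][left + c] for r, c in template) / len(template)


def best_match(edges: List[List[int]], template: List[Point],
               t_h: int, t_w: int) -> Optional[Tuple[float, int, int]]:
    """Slide the template over the image; return (score, top, left) of the minimum."""
    dt = distance_transform(edges)
    best = None
    for top in range(len(edges) - t_h + 1):
        for left in range(len(edges[0]) - t_w + 1):
            score = chamfer_score(dt, template, top, left)
            if best is None or score < best[0]:
                best = (score, top, left)
    return best
```

Because the distance transform is computed once per image, each window costs only one lookup per template point.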



Gradient-based representations: Matching edge templates

• Chamfer matching
  – Hierarchy of templates

Gavrila & Philomin ICCV 1999



Gradient-based representations

• Consider edges, contours, and (oriented) intensity gradients

• Summarize local distribution of gradients with histogram
  – Locally orderless: offers invariance to small shifts and rotations
  – Contrast-normalization: try to correct for variable illumination



Gradient-based representations: Histograms of oriented gradients (HoG)

Dalal & Triggs, CVPR 2005

Map each grid cell in the input window to a histogram counting the gradients per orientation.

Code available: http://pascal.inrialpes.fr/soft/olt/
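To make the per-cell histogram concrete, here is a simplified sketch. It is not the Dalal & Triggs implementation: it assumes 9 unsigned orientation bins, central-difference gradients, hard bin assignment, and a plain L2 normalization of one histogram, whereas the published descriptor interpolates votes and normalizes over overlapping blocks.

```python
import math
from typing import List


def cell_histogram(cell: List[List[float]], bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientation (0 to 180 degrees)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]   # centered differences
            gy = cell[r + 1][c] - cell[r - 1][c]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle / bin_width), bins - 1)] += mag
    return hist


def l2_normalize(v: List[float], eps: float = 1e-6) -> List[float]:
    """Contrast normalization: scale the histogram to (roughly) unit length."""
    norm = math.sqrt(sum(x * x for x in v) + eps * eps)
    return [x / norm for x in v]
```

The descriptor for a window is then the concatenation of the normalized histograms of all its grid cells.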



Gradient-based representations: SIFT descriptor

Lowe, ICCV 1999

Local patch descriptor (more on this later)

Code: http://vision.ucla.edu/~vedaldi/code/sift/sift.html
Binary: http://www.cs.ubc.ca/~lowe/keypoints/



Gradient-based representations: Biologically inspired features

• Convolve with Gabor filters at multiple orientations
• Pool nearby units (max)
• Intermediate layers compare input to prototype patches

Serre, Wolf, Poggio, CVPR 2005; Mutch & Lowe, CVPR 2006


Gradient-based representations: Rectangular features

• Compute differences between sums of pixels in rectangles
• Captures contrast in adjacent spatial regions
• Similar to Haar wavelets, efficient to compute

Viola & Jones, CVPR 2001


Gradient-based representations: Shape context descriptor

Count the number of points inside each bin, e.g.: Count = 4, Count = 10, ...

Log-polar binning: more precision for nearby points, more flexibility for farther points.

Local descriptor (more on this later)

Belongie, Malik & Puzicha, ICCV 2001


Classifier construction

• How to compute a decision for each subwindow?



Discriminative vs. generative models

Generative: separately model class-conditional and prior densities

$\Pr(\text{image}, \text{car})$ and $\Pr(\text{image}, \neg\text{car})$

Discriminative: directly model posterior

$\Pr(\text{car} \mid \text{image})$ and $\Pr(\neg\text{car} \mid \text{image})$

[Figure: both pairs of curves plotted against the image feature x (x = data). Plots from Antonio Torralba 2007]
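The two views are linked by Bayes' rule. As a reminder (standard probability, not taken from the slide), a generative model reaches the posterior indirectly from its class-conditional and prior terms, while a discriminative model estimates the left-hand side directly:

```latex
\Pr(\text{car} \mid \text{image})
  = \frac{\Pr(\text{image} \mid \text{car})\,\Pr(\text{car})}
         {\Pr(\text{image} \mid \text{car})\,\Pr(\text{car})
          + \Pr(\text{image} \mid \neg\text{car})\,\Pr(\neg\text{car})}
```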



Discriminative vs. generative models

• Generative:
  + possibly interpretable
  + can draw samples
  - models variability unimportant to classification task
  - often hard to build good model with few parameters

• Discriminative:
  + appealing when infeasible to model data itself
  + excel in practice
  - often can’t provide uncertainty in predictions
  - non-interpretable



Discriminative methods

• Nearest neighbor ($10^6$ examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; ...
• Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; …
• Support Vector Machines: Guyon, Vapnik; Heisele, Serre, Poggio 2001; …
• Boosting: Viola, Jones 2001; Torralba et al. 2004; Opelt et al. 2006; …
• Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …

Slide adapted from Antonio Torralba


Boosting

• Build a strong classifier by combining a number of “weak classifiers”, which need only be better than chance

• Sequential learning process: at each iteration, add a weak classifier

• Flexible to choice of weak learner
  – including fast simple classifiers that alone may be inaccurate

• We’ll look at Freund & Schapire’s AdaBoost algorithm
  – Easy to implement
  – Base learning algorithm for Viola-Jones face detector



AdaBoost: Intuition


Figure adapted from Freund and Schapire

Consider a 2-d feature space with positive and negative examples.

Each weak classifier splits the training examples with better than 50% accuracy.

Examples misclassified by a previous weak learner are given more emphasis at future rounds.



Final classifier is combination of the weak classifiers


AdaBoost Algorithm [Freund & Schapire 1995]

• Start with uniform weights on training examples {x1,…,xn}
• For each round: evaluate weighted error for each feature, pick best.
• Re-weight the examples:
  – Incorrectly classified -> more weight
  – Correctly classified -> less weight
• Final classifier is combination of the weak ones, weighted according to error they had.
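The loop above can be sketched with threshold "stumps" as weak classifiers. This is an illustrative discrete-AdaBoost variant under stated assumptions: labels are +1/-1, each weak classifier thresholds a single feature, and the weight update uses the common exponential form with alpha = 0.5 ln((1 - err) / err); the slide itself does not give these formulas.

```python
import math
from typing import List, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: List[float]) -> int:
    j, thr, pol = stump
    return pol if x[j] > thr else -pol


def best_stump(X: List[List[float]], y: List[int],
               w: List[float]) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (1, -1):
                s = (j, thr, pol)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(s, xi) != yi)
                if err < best_err:
                    best, best_err = s, err
    return best, best_err


def adaboost(X: List[List[float]], y: List[int],
             rounds: int = 10) -> List[Tuple[float, Stump]]:
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble: List[Tuple[float, Stump]], x: List[float]) -> int:
    """Weighted vote of the weak classifiers."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

On a one-dimensional "interval" problem such as labels `[1, 1, -1, -1, 1, 1]` at positions 0 to 5, no single stump is correct everywhere, but three rounds of boosting already combine into a classifier that fits all six points.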


Cascading classifiers for detection

For efficiency, apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative; e.g.,

• Filter for promising regions with an initial inexpensive classifier
• Build a chain of classifiers, choosing cheap ones with low false negative rates early in the chain

Fleuret & Geman, IJCV 2001; Rowley et al., PAMI 1998; Viola & Jones, CVPR 2001

Figure from Viola & Jones CVPR 2001
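The early-rejection logic of such a chain is short enough to sketch directly. The stage functions and thresholds here are placeholders supplied by the caller (an assumption of this example); in a trained cascade each stage would be a boosted classifier with its threshold tuned for a low false negative rate.

```python
from typing import Callable, List, Sequence

Stage = Callable[[Sequence[float]], float]  # maps a window to a confidence score


def cascade_classify(window: Sequence[float], stages: List[Stage],
                     thresholds: List[float]) -> bool:
    """Accept a window only if every stage scores at or above its threshold."""
    for stage, thr in zip(stages, thresholds):
        if stage(window) < thr:
            return False   # rejected early: later, costlier stages never run
    return True
```

Since the vast majority of windows in an image are background, most of them exit after the first one or two cheap stages, which is where the speed-up comes from.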


Example: Face detection

• Frontal faces are a good example of a class where global appearance models + a sliding window detection approach fit well:

  – Regular 2D structure
  – Center of face almost shaped like a “patch”/window

• Now we’ll take AdaBoost and see how the Viola-Jones face detector works



Feature extraction


“Rectangular” filters

• Feature output is difference between adjacent regions
• Efficiently computable with integral image: any sum can be computed in constant time
• Avoid scaling images: scale features directly for same cost

Integral image

• Value at (x,y) is sum of pixels above and to the left of (x,y)

Viola & Jones, CVPR 2001
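A small sketch of the integral image and a two-rectangle feature follows. One convention here is an assumption made for convenience: the table is padded with a leading row and column of zeros, so entry `[r][c]` holds the sum over rows before `r` and columns before `c`, which removes boundary special cases.

```python
from typing import List

Grid = List[List[float]]


def integral_image(img: Grid) -> Grid:
    """ii[r][c] = sum of img over rows < r and cols < c (zero-padded table)."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0.0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii: Grid, top: int, left: int, height: int, width: int) -> float:
    """Sum of pixels in a rectangle using four table lookups (constant time)."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii: Grid, top: int, left: int,
                     height: int, width: int) -> float:
    """Left half minus right half of a (height x 2*width) region."""
    return (rect_sum(ii, top, left, height, width)
            - rect_sum(ii, top, left + width, height, width))
```

Scaling a feature only changes the rectangle coordinates passed to `rect_sum`, so a larger feature costs the same four lookups per rectangle.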


Large library of filters

Considering all possible filter parameters: position, scale, and type:

180,000+ possible features associated with each 24 x 24 window

Use AdaBoost both to select the informative features and to form the classifier



AdaBoost for feature+classifier selection

• Want to select the single rectangle feature and threshold that best separates positive (faces) and negative (non-faces) training examples, in terms of weighted error.

[Figure: outputs of a possible rectangle feature on faces and non-faces]

Resulting weak classifier:

For next round, reweight the examples according to errors, choose another filter/threshold combo.



Viola-Jones Face Detector: Summary

[Figure: faces and non-faces → train cascade of classifiers with AdaBoost → selected features, thresholds, and weights → apply to each subwindow of a new image]

• Train with 5K positives, 350M negatives
• Real-time detector using 38 layer cascade
• 6061 features in final layer
• [Implementation available in OpenCV: http://www.intel.com/technology/computing/opencv/]


Viola-Jones Face Detector: Results


First two features selected


Profile Features

Detecting profile faces requires training a separate detector with profile examples.


Paul Viola, ICCV tutorial




Example application

Frontal faces detected and then tracked, character names inferred with alignment of script and subtitles.

Everingham, M., Sivic, J. and Zisserman, A. "Hello! My name is... Buffy" - Automatic naming of characters in TV video, BMVC 2006. http://www.robots.ox.ac.uk/~vgg/research/nface/index.html


Pedestrian detection

• Detecting upright, walking humans also possible using sliding window’s appearance/texture; e.g.,
  – SVM with Haar wavelets [Papageorgiou & Poggio, IJCV 2000]
  – Space-time rectangle features [Viola, Jones & Snow, ICCV 2003]
  – SVM with HoGs [Dalal & Triggs, CVPR 2005]


Highlights

• Sliding window detection and global appearance descriptors:

  – Simple detection protocol to implement
  – Good feature choices critical
  – Past successes for certain classes



Limitations

• High computational complexity
  – For example: 250,000 locations x 30 orientations x 4 scales = 30,000,000 evaluations!
  – If training binary detectors independently, means cost increases linearly with number of classes

• With so many windows, false positive rate better be low
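The arithmetic behind both bullets is easy to check. In the sketch below the location, orientation, and scale counts are the slide's own numbers; the per-window false positive rate and the class count are made-up illustrative values, not figures from the tutorial.

```python
def total_evaluations(locations: int, orientations: int, scales: int,
                      num_classes: int = 1) -> int:
    """Classifier evaluations per image; linear in the number of independent detectors."""
    return locations * orientations * scales * num_classes


def expected_false_positives(num_windows: int, fp_rate_per_window: float) -> float:
    """Expected number of false detections per image."""
    return num_windows * fp_rate_per_window


windows = total_evaluations(250_000, 30, 4)   # 30,000,000 windows
fp_per_image = expected_false_positives(windows, 1e-6)  # hypothetical rate
twenty_classes = total_evaluations(250_000, 30, 4, num_classes=20)
```

Even a one-in-a-million per-window error rate would leave on the order of thirty false detections per image at this window count, which is why the per-window false positive rate has to be extremely low.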



Limitations (continued)

• Not all objects are “box” shaped




• Non-rigid, deformable objects not captured well with representations assuming a fixed 2d structure; or must assume fixed viewpoint

• Objects with less-regular textures not captured well with holistic appearance-based descriptions




• If considering windows in isolation, context is lost

[Figure: the sliding window versus the detector’s view. Figure credit: Derek Hoiem]



• In practice, often entails large, cropped training set (expensive)

• Requiring good match to a global appearance description can lead to sensitivity to partial occlusions

Image credit: Adam, Rivlin, & Shimshoni


2. Local Invariant Features: Detection & Description

Motivation

• Global representations have major limitations

• Instead, describe and match only local regions

• Increased robustness to

  – Occlusions



  – Articulation

  – Intra-category variations





Approach

1. Find a set of distinctive keypoints

2. Define a region around each keypoint

3. Extract and normalize the region content

4. Compute a local descriptor from the normalized region

5. Match local descriptors

[Figure: regions A1, A2, A3 around keypoints; each region is normalized to N × N pixels; descriptors $f_A$ and $f_B$ (e.g. color) are compared with a similarity measure, and a match is declared when $d(f_A, f_B) < T$]
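Step 5 can be sketched as a nearest-neighbor search with a distance threshold. This is a minimal illustration that assumes descriptors are plain lists of floats and that Euclidean distance plays the role of the similarity measure d; the function names are hypothetical.

```python
import math
from typing import List, Tuple

Descriptor = List[float]


def euclidean(a: Descriptor, b: Descriptor) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_descriptors(desc_a: List[Descriptor], desc_b: List[Descriptor],
                      max_dist: float) -> List[Tuple[int, int, float]]:
    """For each descriptor in A, keep its nearest neighbor in B if d(fA, fB) < T."""
    matches = []
    if not desc_b:
        return matches
    for i, fa in enumerate(desc_a):
        dists = [euclidean(fa, fb) for fb in desc_b]
        j = min(range(len(dists)), key=dists.__getitem__)
        if dists[j] < max_dist:
            matches.append((i, j, dists[j]))
    return matches
```

The threshold T trades recall against false matches; the exhaustive search shown here is quadratic and would be replaced by an index structure for large descriptor sets.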



Requirements

• Region extraction needs to be repeatable and precise

  – Translation, rotation, scale changes

  – (Limited out-of-plane (≈ affine) transformations)

  – Lighting variations


• We need a sufficient number of regions to cover the object

• The regions should contain “interesting” structure




Many Existing Detectors Available

• Hessian & Harris [Beaudet ‘78], [Harris ‘88]

• Laplacian, DoG [Lindeberg ‘98], [Lowe 1999]

• Harris-/Hessian-Laplace [Mikolajczyk & Schmid ‘01]

• Harris-/Hessian-Affine [Mikolajczyk & Schmid ‘04]

• EBR and IBR [Tuytelaars & Van Gool ‘04]


• MSER [Matas ‘02]

• Salient Regions [Kadir & Brady ‘01]

• Others…




Keypoint Localization



• Goals:

  – Repeatable detection

  – Precise localization

  – Interesting content

⇒ Look for two-dimensional signal changes




Hessian Detector [Beaudet78]

• Hessian determinant

$$\mathrm{Hessian}(I) = \begin{bmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{bmatrix}$$

[Figure: image $I$ and its second-derivative responses $I_{xx}$, $I_{xy}$, $I_{yy}$]

Intuition: Search for strong derivatives in two orthogonal directions



$$\det(\mathrm{Hessian}(I)) = I_{xx} I_{yy} - I_{xy}^2$$

In Matlab: Ixx.*Iyy - (Ixy).^2
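For readers without Matlab, the same determinant response can be sketched in Python. The simplifications are assumptions of this example: second derivatives come from plain finite differences on a nested-list image, with no Gaussian pre-smoothing and no scale handling, and border pixels are left at zero.

```python
from typing import List

Grid = List[List[float]]


def hessian_determinant(img: Grid) -> Grid:
    """Per-pixel det(Hessian) = Ixx * Iyy - Ixy^2 using finite differences."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ixx = img[r][c + 1] - 2.0 * img[r][c] + img[r][c - 1]
            iyy = img[r + 1][c] - 2.0 * img[r][c] + img[r - 1][c]
            ixy = (img[r + 1][c + 1] - img[r + 1][c - 1]
                   - img[r - 1][c + 1] + img[r - 1][c - 1]) / 4.0
            out[r][c] = ixx * iyy - ixy ** 2
    return out
```

An isolated bright dot gives a strong positive response at its center, while a straight step edge gives zero because only one of the two second derivatives is non-zero there.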



Hessian Detector – Responses [Beaudet78]



Effect: Responses mainly on corners and strongly textured areas.



Harris Detector [Harris88]

• Second moment matrix (autocorrelation matrix)

$$\mu(\sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(\sigma_D) & I_x I_y(\sigma_D) \\ I_x I_y(\sigma_D) & I_y^2(\sigma_D) \end{bmatrix}$$

Intuition: Search for local neighborhoods where the image content has two main directions (eigenvectors).


Vis

ua

l O

bje

ct

Re

co

gn

itio

n T

uto

ria

l



I I

∗=

)()(

)()()(),(

2

2

DyDyx

DyxDx

IDIIII

IIIg

σσ

σσσσσµ


Vis

ua

l O

bje

ct

Re

co

gn

itio

n T

uto

ria

l


1. Image derivatives

gx(σD), gy(σD),

IxIy


Vis

ua

l O

bje

ct

Re

co

gn

itio

n T

uto

ria

l



∗=

)()(

)()()(),(

2

2

DyDyx

DyxDx

IDIIII

IIIg

σσ

σσσσσµ

I I


Vis

ua

l O

bje

ct

Re

co

gn

itio

n T

uto

ria

l



gx(σD), gy(σD),

IxIy

14

2. Square of

derivatives

Ix2 Iy

2 IxIy


Vis

ua

l O

bje

ct

Re

co

gn

itio

n T

uto

ria

l



I

∗=

)()(

)()()(),(

2

2

DyDyx

DyxDx

IDIIII

IIIg

σσ

σσσσσµ

1. Image

derivatives

Ix Iy


Vis

ua

l O

bje

ct

Re

co

gn

itio

n T

uto

ria

l


gx(σD), gy(σD),

2. Square of

derivatives

Iy

2. Square of

derivatives

3. Gaussian

filter g(σI)

Ix2 Iy

2 IxIy

g(Ix2) g(Iy

2) g(IxIy)

15


Vis

ua

l O

bje

ct

Re

co

gn

itio

n T

uto

ria

l



I

∗=

)()(

)()()(),(

2

2

DyDyx

DyxDx

IDIIII

IIIg

σσ

σσσσσµ

1. Image

derivatives

2. Square of

derivatives

Ix Iy

Ix2 Iy

2 IxIy


Vis

ua

l O

bje

ct

Re

co

gn

itio

n T

uto

ria

l

Iy

g(IxIy)

16

derivatives

3. Gaussian

filter g(σI)g(Ix

2) g(Iy2) g(IxIy)

222222)]()([)]([)()( yxyxyx IgIgIIgIgIg +−− α

=−= ))],([trace()],(det[ DIDIhar σσµασσµ

4. Cornerness function – both eigenvalues are strong

har5. Non-maxima suppression
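The steps of the Harris detector can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: central differences stand in for the Gaussian derivative filters, a 3×3 box sum stands in for the Gaussian window g(σI), α = 0.04 is a commonly used value rather than one taken from the slides, and non-maxima suppression is omitted. The names `harris_response` and `corner_img` are hypothetical.

```python
def harris_response(img, alpha=0.04):
    """Cornerness det(mu) - alpha * trace(mu)^2 for a 2D list `img`."""
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    # Step 1: image derivatives (central differences, an assumption).
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    resp = [[0.0] * w for _ in range(h)]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            sxx = syy = sxy = 0.0
            # Steps 2 and 3: products of derivatives, summed over a 3x3
            # window (box window used here instead of a Gaussian).
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            # Step 4: cornerness function.
            resp[y][x] = sxx * syy - sxy * sxy - alpha * (sxx + syy) ** 2
    return resp


# Bright quadrant whose corner sits at pixel (4, 4).
corner_img = [[1.0 if (x >= 4 and y >= 4) else 0.0 for x in range(9)]
              for y in range(9)]
response = harris_response(corner_img)
at_corner = response[4][4]   # two gradient directions in the window
at_edge = response[4][6]     # one gradient direction only
at_flat = response[2][2]     # no gradient at all
```

In this toy image the response is positive at the corner, negative along the straight edge, and zero in the flat region, which is the behavior the cornerness function is designed to produce.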



Harris Detector – Responses [Harris88]



Effect: A very precise corner detector.



Automatic Scale Selection


$$f(I_{i_1 \ldots i_m}(x, \sigma)) = f(I_{i_1 \ldots i_m}(x', \sigma'))$$

Same operator responses if the patch contains the same image up to scale factor

How to find corresponding patch sizes?


• Function responses for increasing scale (scale signature)

[Figure: scale signatures $f(I_{i_1 \ldots i_m}(x, \sigma))$ and $f(I_{i_1 \ldots i_m}(x', \sigma'))$ plotted over increasing scale for the two images]

What Is A Useful Signature Function?

• Laplacian-of-Gaussian = “blob” detector



Laplacian-of-Gaussian (LoG)

• Local maxima in scale space of Laplacian-of-Gaussian

$$L_{xx}(\sigma) + L_{yy}(\sigma)$$

[Figure: LoG responses at scales σ, σ², σ³, σ⁴, σ⁵]

⇒ List of (x, y, s)
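Automatic scale selection can be illustrated in one dimension. The sketch below is an assumption-laden toy, not the detector from the slides: it uses a 1D box signal as the "blob", a sampled second derivative of a Gaussian as the LoG, and multiplies the response by σ² so that responses at different scales are comparable. All function names are hypothetical.

```python
import math


def log_kernel_1d(sigma, radius):
    """Sampled second derivative of a 1D Gaussian (1D analogue of the LoG)."""
    kernel = []
    for k in range(-radius, radius + 1):
        g = math.exp(-k * k / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)
        kernel.append((k * k / sigma ** 4 - 1 / sigma ** 2) * g)
    return kernel


def scale_signature(signal, center, scales):
    """Scale-normalized |LoG| response at `center` for each scale."""
    signature = []
    for sigma in scales:
        radius = int(4 * sigma)
        kernel = log_kernel_1d(sigma, radius)
        resp = 0.0
        for k, weight in zip(range(-radius, radius + 1), kernel):
            idx = center + k
            if 0 <= idx < len(signal):
                resp += signal[idx] * weight
        signature.append(abs(sigma * sigma * resp))  # sigma^2 normalization
    return signature


def characteristic_scale(signal, center, scales):
    """Scale at which the signature attains its maximum."""
    signature = scale_signature(signal, center, scales)
    return scales[max(range(len(scales)), key=signature.__getitem__)]


def box_signal(half_width, length=401):
    center = length // 2
    return [1.0 if abs(i - center) <= half_width else 0.0 for i in range(length)]


scales = [float(s) for s in range(2, 31)]
small_scale = characteristic_scale(box_signal(8), 200, scales)
large_scale = characteristic_scale(box_signal(16), 200, scales)
```

Doubling the width of the box roughly doubles the selected scale, so the two patches can be matched at corresponding sizes without knowing the scale factor in advance.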



Results: Laplacian-of-Gaussian



Difference-of-Gaussian (DoG)

• Difference of Gaussians as approximation of the Laplacian-of-Gaussian

[Figure: Gaussian − Gaussian = DoG]
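Why a difference of Gaussians approximates the Laplacian-of-Gaussian can be checked numerically. The sketch below works in 1D and relies on the relation ∂g/∂σ = σ ∂²g/∂x², so that g(kσ) − g(σ) ≈ (k − 1) σ² g''. The factor k = 1.01 is chosen only to make the first-order approximation tight; it is not a value from the slides, and the function names are hypothetical.

```python
import math


def gaussian(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)


def dog(x, sigma, k):
    """Difference of two Gaussians with scales k*sigma and sigma."""
    return gaussian(x, k * sigma) - gaussian(x, sigma)


def scaled_log(x, sigma, k):
    """(k - 1) * sigma^2 * second derivative of the Gaussian at x."""
    second = (x * x / sigma ** 4 - 1 / sigma ** 2) * gaussian(x, sigma)
    return (k - 1) * sigma ** 2 * second


sigma, k = 2.0, 1.01
xs = [i * 0.5 for i in range(-16, 17)]
peak = max(abs(scaled_log(x, sigma, k)) for x in xs)
max_error = max(abs(dog(x, sigma, k) - scaled_log(x, sigma, k)) for x in xs)
relative_error = max_error / peak
```

For larger k the two curves still have the same shape but differ more in amplitude, which is usually acceptable when only the locations of extrema matter.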



DoG – Efficient Computation

• Computation in Gaussian scale pyramid

[Figure: Gaussian scale pyramid built from the original image with $\sigma = 2^{1/4}$ between levels; sampling with step $\sigma^4 = 2$]



Results: Lowe’s DoG



Harris-Laplace [Mikolajczyk ‘01]

1. Initialization: Multiscale Harris corner detection

[Figure: computing the Harris function at scales σ, σ², σ³, σ⁴ and detecting local maxima]

2. Scale selection based on Laplacian (same procedure with Hessian ⇒ Hessian-Laplace)

[Figure: Harris points and Harris-Laplace points]



Maximally Stable Extremal Regions [Matas ‘02]

• Based on Watershed segmentation algorithm

• Select regions that stay stable over a large parameter range



Example Results: MSER



You Can Try It At Home…

• For most local feature detectors, executables are available online:

• http://robots.ox.ac.uk/~vgg/research/affine

• http://www.cs.ubc.ca/~lowe/keypoints/

• http://www.vision.ee.ethz.ch/~surf



Orientation Normalization

• Compute orientation histogram

• Select dominant orientation

• Normalize: rotate to fixed orientation

[Lowe, SIFT, 1999]
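The three steps above can be sketched as follows. This is a simplified illustration: it assumes 36 bins of 10° and weights each gradient by its magnitude only, leaving out the Gaussian weighting and peak interpolation used in full SIFT implementations. Function names and the sample gradients are made up.

```python
import math


def dominant_orientation(gradients, num_bins=36):
    """Return the center angle (degrees) of the strongest histogram bin.

    `gradients` is a list of (gx, gy) pairs from the patch around a keypoint.
    """
    hist = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    for gx, gy in gradients:
        angle = math.degrees(math.atan2(gy, gx)) % 360.0
        hist[int(angle // bin_width) % num_bins] += math.hypot(gx, gy)
    best = max(range(num_bins), key=hist.__getitem__)
    return (best + 0.5) * bin_width


def normalize_angles(angles_deg, dominant):
    """Rotate angles so that the dominant orientation maps to 0 degrees."""
    return [(a - dominant) % 360.0 for a in angles_deg]


# Mostly 45-degree gradients plus two weaker distractors.
patch_gradients = [(1.0, 1.0)] * 5 + [(1.0, 0.0), (0.0, -2.0)]
dominant = dominant_orientation(patch_gradients)
rotated = normalize_angles([45.0, 135.0], dominant)
```

After normalization, descriptors computed from the patch are expressed relative to the dominant orientation, so an in-plane rotation of the image leaves them (approximately) unchanged.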


[Figure: orientation histogram over the range 0 to 2π]

T. Tuytelaars, B. Leibe

Local Descriptors

• The ideal descriptor should be
  – Repeatable
  – Distinctive
  – Compact
  – Efficient



• Most available descriptors focus on edge/gradient information
  – Capture texture information
  – Color still relatively seldom used (more suitable for homogeneous regions)




Local Descriptors: SIFT Descriptor



[Lowe, ICCV 1999]

Histogram of oriented gradients

• Captures important texture information

• Robust to small translations / affine deformations




Local Descriptors: SURF

• Fast approximation of SIFT idea
  – Efficient computation by 2D box filters & integral images ⇒ 6 times faster than SIFT
  – Equivalent quality for object identification
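The speed of box filters comes from the integral image, which lets any rectangular sum be read off with four lookups regardless of the box size. A minimal sketch (function names are hypothetical; this is the generic integral-image technique, not SURF's actual code):

```python
def integral_image(img):
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, in constant time."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
ii = integral_image(image)
lower_right = box_sum(ii, 1, 1, 3, 3)   # 5 + 6 + 8 + 9
whole = box_sum(ii, 0, 0, 3, 3)
```

Because the cost per box does not depend on its area, filters at large scales are as cheap as filters at small scales.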




[Bay, ECCV’06], [Cornelis, CVGPU’08]

• GPU implementation available
  – Feature extraction @ 100 Hz (detector + descriptor, 640×480 img)
  – http://www.vision.ee.ethz.ch/~surf



Local Descriptors: Shape Context

Count the number of points inside each bin, e.g.:

Count = 4
...
Count = 10

Log-polar binning: more precision for nearby points, more flexibility for farther points.

Belongie & Malik, ICCV 2001



Local Descriptors: Geometric Blur

Compute edges at four orientations

Extract a patch in each channel

Apply spatially varying blur and sub-sample

[Figure: example descriptor (idealized signal)]

Berg & Malik, CVPR 2001



So, What Local Features Should I Use?

• There have been extensive evaluations/comparisons
  – [Mikolajczyk et al., IJCV’05, PAMI’05]
  – All detectors/descriptors shown here work well
• Best choice often application dependent
  – MSER works well for buildings and printed things
  – Harris-/Hessian-Laplace/DoG work well for many natural categories
• More features are better
  – Combining several detectors often helps




Outline







Bastian Leibe &


Chicago, 14.07.2008

Kristen Grauman




Outline







Recognition with Local Features

• Image content is transformed into local features that are invariant to translation, rotation, and scale

• Goal: Verify if they belong to a consistent configuration




Local Features, e.g. SIFT

Slide credit: David Lowe



Finding Consistent Configurations

• Global spatial models
  – Generalized Hough Transform [Lowe99]
  – RANSAC [Obdrzalek02, Chum05, Nister06]
  – Basic assumption: object is planar
• Assumption is often justified in practice
  – Valid for many structures on buildings
  – Sufficient for small viewpoint variations on 3D objects



Hough Transform

• Origin: Detection of straight lines in clutter
  – Basic idea: each candidate point votes for all lines that it is consistent with.
  – Votes are accumulated in quantized array
  – Local maxima correspond to candidate lines
• Representation of a line
  – Usual form y = a x + b has a singularity around 90º.
  – Better parameterization: x cos(θ) + y sin(θ) = ρ

[Figure: a line in (x, y) image space and the corresponding point in (θ, ρ) parameter space]
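The voting scheme can be sketched with a dictionary as the quantized accumulator. The quantization below (1° steps in θ, ρ rounded to the nearest integer) and the sample points are assumptions made for illustration; `hough_lines` is a hypothetical name.

```python
import math
from collections import Counter


def hough_lines(points, num_angles=180):
    """Accumulate votes in (theta index, rounded rho) bins.

    Each point votes for every quantized line x cos(theta) + y sin(theta) = rho
    that passes through it.
    """
    acc = Counter()
    for x, y in points:
        for t in range(num_angles):
            theta = math.pi * t / num_angles
            rho = x * math.cos(theta) + y * math.sin(theta)
            acc[(t, round(rho))] += 1
    return acc


# 50 points on the vertical line x = 3, plus three clutter points.
points = [(3, y) for y in range(50)] + [(10, 5), (20, 17), (7, 40)]
accumulator = hough_lines(points)
(best_theta_deg, best_rho), votes = accumulator.most_common(1)[0]
```

The vertical line is found at θ = 0°, ρ = 3, a case the slope-intercept form y = a x + b cannot represent. With noisy points the votes spread over neighboring bins, which is the problem illustrated on the following slides.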



Hough Transform: Noisy Line

• Problem: Finding the true maximum

[Figure: tokens (left) and votes in (θ, ρ) space (right)]




Hough Transform: Noisy Input

• Problem: Lots of spurious maxima

[Figure: tokens (left) and votes in (θ, ρ) space (right)]



Generalized Hough Transform [Ballard81]

• Generalization for an arbitrary contour or shape
  – Choose reference point for the contour (e.g. center)
  – For each point on the contour remember where it is located w.r.t. the reference point
  – Remember radius r and angle φ relative to the contour tangent
  – Recognition: whenever you find a contour point, calculate the tangent angle and ‘vote’ for all possible reference points
  – Instead of reference point, can also vote for transformation
⇒ The same idea can be used with local features!

Slide credit: Bernt Schiele



Gen. Hough Transform with Local Features

• For every feature, store possible “occurrences”
  – Object identity
  – Pose
  – Relative position
• For new image, let the matched features vote for possible object positions
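A toy version of this voting scheme is sketched below. Everything in it is illustrative: the feature labels, the stored offsets, and the 10-pixel voting cells are invented, and pose (scale, orientation) is ignored so that each occurrence votes only for an object position.

```python
from collections import Counter

# Hypothetical training result: for each feature type, the stored occurrences
# as (object identity, dx, dy), where (dx, dy) is the offset from the feature
# to the object's reference point.
occurrences = {
    "wheel": [("car", 40, -10), ("car", -40, -10)],
    "window": [("car", 0, 20)],
}


def vote(matches, occurrences, cell=10):
    """Let each matched feature vote for object reference positions.

    `matches` is a list of (feature type, x, y) found in the new image.
    Votes are accumulated in cells of `cell` x `cell` pixels.
    """
    acc = Counter()
    for feature, x, y in matches:
        for obj, dx, dy in occurrences.get(feature, []):
            acc[(obj, (x + dx) // cell, (y + dy) // cell)] += 1
    return acc


matches = [("wheel", 60, 110), ("wheel", 140, 110),
           ("window", 100, 80), ("wheel", 300, 50)]
accumulator = vote(matches, occurrences)
best_hypothesis, support = accumulator.most_common(1)[0]
```

Three features agree on a car centered near (100, 100), while the stray wheel match produces only isolated votes. Consistent configurations show up as peaks in the accumulator.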



3D Object Recognition

• Gen. HT for Recognition
  – Typically only 3 feature matches needed for recognition
  – Extra matches provide robustness
  – Affine model can be used for planar objects

[Lowe99]


Slide credit: David Lowe

View Interpolation

• Training
  – Training views from similar viewpoints are clustered based on feature matches.
  – Matching features between adjacent views are linked.
• Recognition
  – Feature matches may be spread over several training viewpoints.
⇒ Use the known links to “transfer votes” to other viewpoints.

[Lowe01]



Recognition Using View Interpolation




[Lowe01]



Location Recognition

Training

[Lowe04]



Applications

• Sony Aibo (Evolution Robotics)
• SIFT usage
  – Recognize docking station
  – Communicate with visual cards
• Other uses
  – Place recognition
  – Loop closure in SLAM




RANSAC (RANdom SAmple Consensus) [Fischler81]

• Randomly choose a minimal subset of data points necessary to fit a model (a sample)
• Points within some distance threshold t of model are a consensus set. Size of consensus set is model’s support.
• Repeat for N samples; model with biggest support is most robust fit
  – Points within distance t of best model are inliers
  – Fit final model to all inliers
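The procedure can be sketched for the simplest case, fitting a 2D line. The data, the threshold t = 0.5, the 100 samples, and the fixed random seed are all assumptions chosen to keep the example small and reproducible; the function names are hypothetical.

```python
import math
import random


def line_through(p, q):
    """Normalized line a*x + b*y + c = 0 through two distinct points."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = math.hypot(a, b)
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)


def ransac_line(points, threshold, num_samples, rng):
    """Return the largest consensus set found over `num_samples` samples."""
    best = []
    for _ in range(num_samples):
        p, q = rng.sample(points, 2)          # minimal sample: 2 points
        a, b, c = line_through(p, q)
        consensus = [(x, y) for x, y in points
                     if abs(a * x + b * y + c) <= threshold]
        if len(consensus) > len(best):        # keep model with biggest support
            best = consensus
    return best


def least_squares_line(points):
    """Final fit y = slope * x + intercept over all inliers."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


line_points = [(x, 2 * x + 1) for x in range(20)]
outliers = [(1, 10), (3, -4), (5, 20), (7, 2), (9, 30), (11, 5), (13, 40), (15, 12)]
inliers = ransac_line(line_points + outliers, threshold=0.5,
                      num_samples=100, rng=random.Random(0))
slope, intercept = least_squares_line(inliers)
```

The final least-squares step uses every inlier rather than only the two sampled points, which matters once the inliers themselves are noisy.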




RANSAC: How many samples?

• How many samples are needed?
  – Suppose w is fraction of inliers (points from line).
  – n points needed to define hypothesis (2 for lines)
  – k samples chosen.
• Prob. that a single sample of n points is correct: $w^n$
• Prob. that all samples fail is: $(1 - w^n)^k$
⇒ Choose k high enough to keep this below desired failure rate.
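Solving $(1 - w^n)^k \le p_{fail}$ for k gives $k \ge \log(p_{fail}) / \log(1 - w^n)$. A small helper makes the numbers concrete; the function name and the example values are illustrative choices, not taken from the slides.

```python
import math


def ransac_num_samples(w, n, failure_rate):
    """Smallest k with (1 - w**n)**k <= failure_rate.

    w: fraction of inliers, n: points per sample,
    failure_rate: accepted probability that every sample contains an outlier.
    """
    p_good = w ** n
    if p_good >= 1.0:
        return 1  # no outliers: a single sample suffices
    return math.ceil(math.log(failure_rate) / math.log(1.0 - p_good))


k_line = ransac_num_samples(w=0.5, n=2, failure_rate=0.01)
k_four_point = ransac_num_samples(w=0.5, n=4, failure_rate=0.01)
```

With half the points being outliers, a line (n = 2) needs 17 samples for a 1% failure rate, while a model that needs 4 points per sample already requires 72. The count grows quickly with n and with the outlier fraction.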




After RANSAC

• RANSAC divides data into inliers and outliers and yields estimate computed from minimal set of inliers

• Improve this initial estimate with estimation over all inliers (e.g. with standard least-squares minimization)

• But this may change inliers, so alternate fitting with re-classification as inlier/outlier






Example: Finding Feature Matches

• Find best stereo match within a square search window (here 300 pixels²)
• Global transformation model: epipolar geometry

[Figure: feature matches before RANSAC and after RANSAC]

from Hartley & Zisserman




Comparison

Gen. Hough Transform

• Advantages
  – Very effective for recognizing arbitrary shapes or objects
  – Can handle high percentage of outliers (>95%)
  – Extracts groupings from clutter in linear time
• Disadvantages
  – Quantization issues
  – Only practical for small number of dimensions (up to 4)
• Improvements available
  – Probabilistic Extensions
  – Continuous Voting Space

RANSAC

• Advantages
  – General method suited to large range of problems
  – Easy to implement
  – Independent of number of dimensions
• Disadvantages
  – Only handles moderate number of outliers (<50%)
• Many variants available, e.g.
  – PROSAC: Progressive RANSAC [Chum05]
  – Preemptive RANSAC [Nister05]

[Leibe08]



Example Applications


Mobile tourist guide
• Self-localization
• Object/building recognition
• Photo/video augmentation

[Quack, Leibe, Van Gool, CIVR’08]



Web Demo: Movie Poster Recognition




http://www.kooaba.com/en/products_engine.html#

50’000 movie posters indexed

Query-by-image from mobile phone available in Switzerland



Application: Large-Scale Retrieval


[Philbin CVPR’07]

Query Results from 5k Flickr images (demo available for 100k set)



Application: Image Auto-Annotation

Left: Wikipedia image
Right: closest match from Flickr

[Figure panels: Moulin Rouge, Tour Montparnasse, Colosseum, Old Town Square (Prague), Viktualienmarkt, Maypole]

[Quack CIVR’08]



Outline







Bastian Leibe &


Chicago, 14.07.2008

Kristen Grauman




Outline










5. Matching Local Feature Sets





Global representations: limitations

• Success may rely on alignment -> sensitive to viewpoint

• All parts of the image or window impact the description -> sensitive to occlusion, clutter



Local representations

• Describe component regions or patches separately.

• Many options for detection & description…




Superpixels [Ren et al.]
Shape context [Belongie 02]
Maximally Stable Extremal Regions [Matas 02]
Geometric Blur [Berg 05]
SIFT [Lowe 99]
Salient regions [Kadir 01]
Harris-Affine [Mikolajczyk 04]
Spin images [Johnson 99]



Recall: Invariant local features

Subset of local feature types designed to be invariant to
  – Scale
  – Translation
  – Rotation
  – Affine transformations
  – Illumination

1) Detect interest points
2) Extract descriptors: (x1, x2, …, xd), (y1, y2, …, yd)

[Mikolajczyk01, Matas02, Tuytelaars04, Lowe99, Kadir01,… ]




Recognition with local feature sets

• Previously, we saw how to use local invariant features + a global spatial model to recognize specific objects, using a planar object assumption.

• Now, we’ll use local features for
  – Indexing-based recognition
  – Bags of words representations
  – Correspondence / matching kernels




Basic flow

Detect or sample features
List of positions, scales, orientations

Describe features
Associated list of d-dimensional descriptors

Index each one into pool of descriptors from previously seen images



Indexing local features

• Each patch / region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT)




• When we see close points in feature space, we have similar descriptors, which indicates similar local content.


Figure credit: A. Zisserman


• We saw in the previous section how to use voting and pose clustering to identify objects using local features




Figure credit: David Lowe




• With potentially thousands of features per image, and hundreds to millions of images to search, how to efficiently find those that are relevant to a new image?
  – Low-dimensional descriptors: can use standard efficient data structures for nearest neighbor search
  – High-dimensional descriptors: approximate nearest neighbor search methods more practical
  – Inverted file indexing schemes




Indexing local features: approximate nearest neighbor search

Best-Bin First (BBF), a variant of k-d trees that uses priority queue to examine most promising branches first [Beis & Lowe, CVPR 1997]




Locality-Sensitive Hashing (LSH), a randomized hashing technique using hash functions that map similar points to the same bin, with high probability [Indyk & Motwani, 1998]



Indexing local features: inverted file index

• For text documents, an efficient way to find all pages on which a word occurs is to use an index…
• We want to find all images in which a feature occurs.
• To use this idea, we’ll need to map our features to “visual words”.
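Once features have been mapped to integer visual word ids, the inverted file is just a map from word id to the set of images containing it. The sketch below uses a made-up three-image database and ranks images by the number of distinct query words they share; real systems typically add weighting such as tf-idf, which is left out here.

```python
from collections import Counter, defaultdict


def build_inverted_index(image_words):
    """Map each visual word id to the set of images in which it occurs."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index


def query(index, query_words):
    """Rank database images by the number of distinct query words they share."""
    votes = Counter()
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    return votes.most_common()


# Hypothetical database: image id -> visual word ids found in that image.
database = {
    "img_a": [3, 7, 7, 42],
    "img_b": [7, 8, 9],
    "img_c": [1, 2, 3],
}
index = build_inverted_index(database)
ranking = query(index, [7, 42, 3])
```

Only the posting lists of the words that occur in the query are touched, so images sharing no word with the query are never visited.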




Visual words: main idea

• Extract some local features from a number of images …




e.g., SIFT descriptor space: each point is 128-dimensional

Slide credit: D. Nister




Map high-dimensional descriptors to tokens/words by quantizing the feature space

• Quantize via clustering, let cluster centers be the prototype “words”
• Determine which word to assign to each new image region by finding the closest cluster center.

[Figure: descriptor space partitioned around cluster centers]
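Both steps, building the vocabulary by clustering and assigning a word to a new descriptor, can be sketched with a basic k-means loop. The example uses 2D points instead of 128-dimensional SIFT descriptors and fixed initial centers so that the result is reproducible; these choices and the function names are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest_center(descriptor, centers):
    """Index of the closest cluster center, i.e. the visual word id."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(descriptor, centers[i]))


def kmeans(descriptors, centers, iterations=10):
    """Plain Lloyd iterations starting from the given initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for d in descriptors:
            groups[nearest_center(d, centers)].append(d)
        centers = [
            tuple(sum(coords) / len(group) for coords in zip(*group))
            if group else centers[i]          # keep empty clusters in place
            for i, group in enumerate(groups)
        ]
    return centers


training = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocabulary = kmeans(training, centers=[(0, 0), (10, 10)])
word_near_origin = nearest_center((0.5, 0.2), vocabulary)
word_far = nearest_center((9.0, 9.0), vocabulary)
```

For large vocabularies the linear scan in `nearest_center` becomes the bottleneck, which is one motivation for hierarchical schemes.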



Visual words

• Example: each group of patches belongs to the same visual word




Figure from Sivic & Zisserman, ICCV 2003



Visual words

• First explored for texture and material representations

• Texton = cluster center of filter responses over collection of images


• Describe textures and materials based on distribution of prototypical texture elements.

Leung & Malik 1999; Varma & Zisserman, 2002; Lazebnik, Schmid & Ponce, 2003;



Visual words

• More recently used for describing scenes and objects for the sake of indexing or classification.




Sivic & Zisserman 2003; Csurka, Bray, Dance, & Fan 2004; many others.



Inverted file index for images comprised of visual words

[Figure: inverted file table mapping each word number to a list of image numbers]

Image credit: A. Zisserman

Bags of visual words

• Summarize entire image based on its distribution (histogram) of word occurrences.
• Analogous to bag of words representation commonly used for documents.

Image credit: Fei-Fei Li
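A bag-of-words vector is simply a normalized count of how often each visual word occurs in the image. The sketch below also compares two histograms with histogram intersection; that similarity measure is an assumption added here for illustration and is only one of several common choices.

```python
def bag_of_words(word_ids, vocabulary_size):
    """Normalized histogram of visual word occurrences for one image."""
    hist = [0] * vocabulary_size
    for word in word_ids:
        hist[word] += 1
    total = sum(hist)
    return [count / total for count in hist] if total else hist


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


image_1 = bag_of_words([0, 2, 2, 3], vocabulary_size=4)
image_2 = bag_of_words([1, 1, 2, 2], vocabulary_size=4)
similarity = histogram_intersection(image_1, image_2)
```

Every image ends up as a vector of the same length regardless of how many features were detected, and the positions of the features are discarded.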



Video Google System

1. Collect all words within query region

2. Inverted file index to find relevant frames

3. Compare word counts

4. Spatial verification

[Figure: query region and retrieved frames]

Sivic & Zisserman, ICCV 2003

• Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html



Basic flow

Detect or sample features
List of positions, scales, orientations

Describe features
Associated list of d-dimensional descriptors

Quantize to form bag of words vector for the image



Visual vocabulary formation

Issues:

• Sampling strategy

• Clustering / quantization algorithm

• Unsupervised vs. supervised

• What corpus provides features (universal vocabulary?)

• Vocabulary size, number of words




Sampling strategies

[Figure panels: Dense, uniformly · Sparse, at interest points · Randomly · Multiple interest operators]

Image credits: F-F. Li, E. Nowak, J. Sivic

• To find specific, textured objects, sparse sampling from interest points often more reliable.
• Multiple complementary interest operators offer more image coverage.
• For object categorization, dense sampling offers better coverage.

[See Nowak, Jurie & Triggs, ECCV 2006]



Clustering / quantization methods

• k-means (typical choice), agglomerative clustering, mean-shift,…

• Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies

  – Vocabulary tree [Nister & Stewenius, CVPR 2006]




Example: Recognition with Vocabulary Tree

• Tree construction:


Slide credit: David Nister

[Nister & Stewenius, CVPR’06]



Vocabulary Tree

• Training: Filling the tree



Vocabulary Tree

• Recognition

RANSAC verification



Vocabulary Tree: Performance

• Evaluated on large databases
  – Indexing with up to 1M images
• Online recognition for database of 50,000 CD covers
  – Retrieval in ~1s
• Find experimentally that large vocabularies can be beneficial for recognition




Vocabulary formation

• Ensembles of trees provide additional robustness


Figure credit: F. Jurie

Moosmann, Jurie, & Triggs 2006; Yeh, Lee, & Darrell 2007; Bosch, Zisserman, & Munoz 2007; …



Supervised vocabulary formation

• Recent work considers how to leverage labeled images when constructing the vocabulary


Perronnin, Dance, Csurka, & Bressan, Adapted Vocabularies for Generic Visual Categorization, ECCV 2006.

• Merge words that don’t aid in discriminability

Winn, Criminisi, & Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 2005

• Consider vocabulary and classifier construction jointly.

Yang, Jin, Sukthankar, & Jurie, Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008.

Learning and recognition with bag of words histograms

• Bag of words representation makes it possible to describe the unordered point set with a single vector (of fixed dimension across image examples)

• Provides easy way to use distribution of feature types with various learning algorithms requiring vector input.

• …including unsupervised topic models designed for documents.

• Hierarchical Bayesian text models (pLSA and LDA)

– Hofmann 2001; Blei, Ng & Jordan 2004

– For object and scene categorization: Sivic et al. 2005, Sudderth et al. 2005, Quelhas et al. 2005, Fei-Fei et al. 2005

[Figure: Probabilistic Latent Semantic Analysis (pLSA) graphical model with variables d, z, w and plates N, D; example topic “face”. Sivic et al. ICCV 2005]

[pLSA code available at: http://www.robots.ox.ac.uk/~vgg/software/]

Figure credit: Fei-Fei Li
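A minimal sketch of how the fixed-length bag of words vector described above could be computed, assuming a vocabulary of word centres already exists (for example from clustering). The function names `quantize` and `bag_of_words` and the toy 2-D descriptors are illustrative assumptions, not part of the tutorial.

```python
import math

def quantize(descriptor, vocabulary):
    """Return the index of the nearest visual word (Euclidean distance)."""
    return min(range(len(vocabulary)),
               key=lambda k: math.dist(descriptor, vocabulary[k]))

def bag_of_words(descriptors, vocabulary):
    """Count how many descriptors are assigned to each visual word."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[quantize(d, vocabulary)] += 1
    return hist

# Toy 2-D "descriptors"; higher-dimensional descriptors work the same way.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
image_a = [(0.1, -0.2), (0.9, 1.2), (1.1, 0.8), (4.7, 5.1)]
image_b = [(5.2, 4.9)]
hist_a = bag_of_words(image_a, vocabulary)
hist_b = bag_of_words(image_b, vocabulary)
```

Both images yield a vector of the same length even though they contain different numbers of features, which is what makes the representation usable with learners that need vector input.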

Bags of words: pros and cons

+ flexible to geometry / deformations / viewpoint

+ compact summary of image content

+ provides vector representation for sets

+ has yielded good recognition results in practice

- basic model ignores geometry – must verify afterwards, or encode via features

- background and foreground mixed when bag covers whole image

- interest points or sampling: no guarantee to capture object-level parts

- optimal vocabulary formation remains unclear


Basic flow

[Diagram: Describe features (list of positions, scales, orientations; associated list of d-dimensional descriptors) → … → Compute match with another image, or quantize to form bag of words vector for the image]

Local feature correspondences

• The matching between sets of local features helps to establish overall similarity between objects or shapes.

• Assigned matches also useful for localization


Shape context [Belongie & Malik 2001]; Low-distortion matching [Berg & Malik 2005]; Match kernel [Wallraven, Caputo & Graf 2003]

• Least cost match: minimize total cost between matched points

• Least cost partial match: match all of smaller set to some portion of larger set.

$$\min_{\pi : X \to Y} \sum_{x_i \in X} \lVert x_i - \pi(x_i) \rVert$$
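The least cost partial match can be illustrated with a brute-force sketch. It assumes Euclidean distance as the cost and a one-to-one assignment, and it enumerates every assignment, so it is only usable for tiny sets. The function name is hypothetical.

```python
import itertools
import math

def least_cost_partial_match(X, Y):
    """Match every point of the smaller set to a distinct point of the
    larger set so that the total Euclidean distance is minimal."""
    small, large = (X, Y) if len(X) <= len(Y) else (Y, X)
    best_cost, best_assign = math.inf, None
    # Each permutation picks one distinct target index per point of `small`.
    for targets in itertools.permutations(range(len(large)), len(small)):
        cost = sum(math.dist(small[i], large[j]) for i, j in enumerate(targets))
        if cost < best_cost:
            best_cost, best_assign = cost, targets
    return best_cost, best_assign

X = [(0.0, 0.0), (2.0, 0.0)]
Y = [(2.0, 1.0), (0.0, 1.0), (9.0, 9.0)]
cost, assign = least_cost_partial_match(X, Y)
```

Here `assign[i]` is the index in the larger set matched to point `i` of the smaller set; the outlier at (9, 9) is left unmatched.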

Pyramid match kernel (PMK)

• Optimal matching expensive relative to number of features per image (m).

• PMK is approximate partial match for efficient discriminative learning from sets of local features.


Optimal match: O(m^3)

Greedy match: O(m^2 log m)

Pyramid match: O(m)

[Grauman & Darrell, ICCV 2005]

Pyramid match kernel: pyramid extraction

Histogram pyramid: level i has bins of size $2^i$

Pyramid match kernel: counting matches

Histogram intersection



Pyramid match kernel: counting new matches

Histogram intersection

new matches at level i = (matches at this level) − (matches at previous level)

Difference in histogram intersections across levels counts number of new pairs matched

Pyramid match kernel

[Equation: kernel over the two histogram pyramids, summing over levels i the product of (measure of difficulty of a match at level i) and (number of newly matched pairs at level i)]

• For similarity, weights inversely proportional to bin size (or may be learned discriminatively)

• Normalize kernel values to avoid favoring large sets
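The pieces above (histogram pyramid, histogram intersection, counting new matches, weighting and normalization) can be put together in a short sketch. It assumes 1-D non-negative features, uniform bins of width 2^i at level i, and weights 1 / (bin size); the function names are illustrative and this is not the libpmk implementation.

```python
import math
from collections import Counter

def histogram(points, bin_size):
    """Histogram of 1-D points with uniform bins of width bin_size."""
    return Counter(int(p // bin_size) for p in points)

def intersection(h1, h2):
    """Histogram intersection: sum of bin-wise minima."""
    return sum(min(count, h2[b]) for b, count in h1.items())

def pyramid_match(X, Y, num_levels):
    """Sum over levels of weight_i * (number of new matches at level i)."""
    score, previous = 0.0, 0
    for i in range(num_levels):
        bin_size = 2 ** i
        matches = intersection(histogram(X, bin_size), histogram(Y, bin_size))
        score += (matches - previous) / bin_size  # weight = 1 / bin size
        previous = matches
    return score

def normalized_pyramid_match(X, Y, num_levels):
    """Divide by the self-similarities to avoid favoring large sets."""
    return pyramid_match(X, Y, num_levels) / math.sqrt(
        pyramid_match(X, X, num_levels) * pyramid_match(Y, Y, num_levels))

X = [0.5, 3.2, 6.1]
Y = [0.7, 2.4, 6.9, 7.5]
k = pyramid_match(X, Y, num_levels=4)
k_norm = normalized_pyramid_match(X, Y, num_levels=4)
```

In this toy case two pairs match at the finest level (weight 1) and one more pair at the next level (weight 1/2), so the unnormalized score is 2.5. Each level needs only one pass over the features, which is where the linear-time behaviour comes from.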

Example pyramid match


[Figure: example comparing the pyramid match with the optimal match]


• Forms a Mercer kernel -> allows classification with SVMs, use of other kernel methods

• Bounded error relative to optimal partial match

• Linear time -> efficient learning with large feature sets


[Plots on the ETH-80 data set: accuracy vs. mean number of features, and time (s) vs. mean number of features, comparing the pyramid match, O(m), with the match kernel [Wallraven et al.], O(m^2)]





• Use data-dependent pyramid partitions for high-d feature spaces


[Figure: uniform pyramid bins vs. vocabulary-guided pyramid bins]

Code for PMK: http://people.csail.mit.edu/jjl/libpmk/

Matching smoothness & local geometry

• Solving for linear assignment means (non-overlapping) features can be matched independently, ignoring relative geometry.

• One alternative: simply expand feature vectors to include spatial information before matching.


$$[\, f_1, \ldots, f_{128}, x_a, y_a \,]$$
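A small sketch of expanding feature vectors with spatial information before matching. The normalization of (x, y) by the image size and the `weight` parameter, which trades off appearance against position, are assumptions added for illustration.

```python
def augment_with_position(descriptors, positions, image_size, weight=1.0):
    """Append (x, y), normalized to [0, 1] and scaled by weight,
    to each appearance descriptor."""
    width, height = image_size
    return [list(f) + [weight * x / width, weight * y / height]
            for f, (x, y) in zip(descriptors, positions)]

# Toy 3-D descriptor standing in for a 128-D one.
augmented = augment_with_position([[0.1, 0.2, 0.3]], [(50, 100)], (200, 400))
```

Any of the matching schemes above can then be run on the augmented vectors, so that features only match well when they are also close in the image.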

Spatial pyramid match kernel

• First quantize descriptors into words, then do one pyramid match per word in image coordinate space.


Lazebnik, Schmid & Ponce, CVPR 2006


• Use correspondence to estimate parameterized transformation, regularize to enforce smoothness


Shape context matching [Belongie, Malik, & Puzicha 2001]

Code: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/sc_digits.html


• Let matching cost include term to penalize distortion between pairs of matched features.

[Figure: query and template feature pairs (i, j) and (i', j'), labeled R_ij and S_i'j']

Approximate for efficient solutions: Berg & Malik, CVPR 2005; Leordeanu & Hebert, ICCV 2005

Figure credit: Alex Berg


• Compare “semi-local” features: consider configurations or neighborhoods and co-occurrence relationships


Hyperfeatures [Agarwal & Triggs, ECCV 2006]

Correlograms of visual words [Savarese, Winn, & Criminisi, CVPR 2006]

Proximity distribution kernel [Ling & Soatto, ICCV 2007]

Feature neighborhoods [Sivic & Zisserman, CVPR 2004]

Tiled neighborhood [Quack, Ferrari, Leibe, Van Gool, ICCV 2007]


• Learn or provide explicit object-specific shape model [Next in the tutorial: part-based models]

[Figure: shape model over parts x1 to x6]

Summary

• Local features are a useful, flexible representation

– Invariance properties - typically built into the descriptor

– Distinctive, especially helpful for identifying specific textured objects

– Breaking image into regions/parts gives tolerance to occlusions and clutter

– Mapping to visual words forms discrete tokens from image regions

• Efficient methods available for

– Indexing patches or regions

– Comparing distributions of visual words

– Matching features


Recognition of Object Categories

• We no longer have exact correspondences…

• On a local level, we can still detect similar parts.

• Represent objects by their parts

⇒ Bag-of-features

• How can we improve on this?

– Encode structure

Slide credit: Rob Fergus

Part-Based Models

• Fischler & Elschlager 1973

• Model has two components

– parts (2D image fragments)

– structure (configuration of parts)

Different Connectivity Structures

[Figure: connectivity structures of part-based models, with recognition complexity where given]

• Fergus et al. ’03, Fei-Fei et al. ’03: O(N^6)

• Leibe et al. ’04, ’08, Crandall et al. ’05, Fergus et al. ’05: O(N^2)

• Crandall et al. ’05: O(N^3)

• Felzenszwalb & Huttenlocher ’05: O(N^2)

• Other structures: Bouchard & Triggs ’05; Carneiro & Lowe ’06; Csurka ’04; Vasconcelos ’00

from [Carneiro & Lowe, ECCV’06]

Spatial Models Considered Here

Fully connected shape model (parts x1 to x6)

– e.g. Constellation Model

– Parts fully connected

– Recognition complexity: O(N^P)

– Method: Exhaustive search

“Star” shape model (parts x1 to x6)

– e.g. ISM

– Parts mutually independent

– Recognition complexity: O(NP)

– Method: Gen. Hough Transform

Slide credit: Rob Fergus
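To get a feel for the gap between the two recognition complexities, the sketch below counts candidate evaluations under the assumption that N is the number of image features and P the number of model parts, ignoring constant factors. The function names are illustrative.

```python
def fully_connected_cost(num_features, num_parts):
    """Exhaustive search: every assignment of features to parts, N^P."""
    return num_features ** num_parts

def star_cost(num_features, num_parts):
    """Independent parts: each feature is considered once per part, N*P."""
    return num_features * num_parts

n, p = 100, 6
exhaustive = fully_connected_cost(n, p)
star = star_cost(n, p)
```

With 100 features and 6 parts the exhaustive search faces 10^12 joint assignments, against 600 feature-part evaluations for the star model.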

Constellation Model

• Joint model for appearance and shape


[Figure: object model with Gaussian shape pdf, Gaussian part appearance pdf, Gaussian relative scale pdf over log(scale), and probability of detection per part (0.8, 0.75, 0.9)]

[Figure: clutter model with uniform shape pdf, Gaussian appearance pdf, uniform relative scale pdf over log(scale), and Poisson pdf on # detections]

Constellation Model: Learning Procedure

• Goal: Find regions & their location, scale & appearance

• Initialize model parameters

• Use EM and iterate to convergence

– E-step: Compute assignments for which regions are foreground/background

– M-step: Update model parameters

• Trying to maximize likelihood – consistency in shape & appearance
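The E-step / M-step alternation can be sketched on a toy problem. This is not the constellation model itself: it assumes each region is summarized by a single scalar and that foreground and background are each modeled by one 1-D Gaussian. All names and data are illustrative.

```python
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_foreground_background(data, mean_fg, mean_bg, iterations=50):
    """Toy EM for a two-component 1-D Gaussian mixture."""
    var_fg = var_bg = 1.0
    prior_fg = 0.5
    resp = []
    for _ in range(iterations):
        # E-step: soft assignment of each value to the foreground component.
        resp = []
        for x in data:
            p_fg = prior_fg * gaussian(x, mean_fg, var_fg)
            p_bg = (1.0 - prior_fg) * gaussian(x, mean_bg, var_bg)
            resp.append(p_fg / (p_fg + p_bg))
        # M-step: re-estimate parameters from the soft assignments.
        n_fg = sum(resp)
        n_bg = len(data) - n_fg
        mean_fg = sum(r * x for r, x in zip(resp, data)) / n_fg
        mean_bg = sum((1 - r) * x for r, x in zip(resp, data)) / n_bg
        var_fg = max(sum(r * (x - mean_fg) ** 2 for r, x in zip(resp, data)) / n_fg, 1e-3)
        var_bg = max(sum((1 - r) * (x - mean_bg) ** 2 for r, x in zip(resp, data)) / n_bg, 1e-3)
        prior_fg = n_fg / len(data)
    return mean_fg, mean_bg, resp

data = [0.9, 1.0, 1.1, 1.2, 0.8, 5.0, 5.2, 4.8, 5.1, 4.9]
mean_fg, mean_bg, resp = em_foreground_background(data, mean_fg=0.0, mean_bg=6.0)
```

The variance floor of 1e-3 is a guard against degenerate components; the actual model additionally couples shape, appearance, and scale terms in the likelihood.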

Example: Motorbikes

Example: Motorbikes (2)

Example: Spotted Cats

Discussion: Constellation Model

• Advantages

– Works well for many different object categories

– Can adapt well to categories where

  – Shape is more important

  – Appearance is more important

– Everything is learned from training data

– Weakly-supervised training possible

• Disadvantages

– Model contains many parameters that need to be estimated

– Cost increases exponentially with increasing number of parameters

⇒ Fully connected model restricted to small number of parts.

Implicit Shape Model (ISM)

• Basic ideas

– Learn an appearance codebook

– Learn a star-topology structural model

  – Features are considered independent given obj. center

• Algorithm: probabilistic Gen. Hough Transform

– Exact correspondences → Prob. match to object part

– NN matching → Soft matching

– Feature location on obj. → Part location distribution

– Uniform votes → Probabilistic vote weighting

– Quantized Hough array → Continuous Hough space

[Figure: star-topology model over parts x1 to x6]

Codebook Representation

• Extraction of local object features

– Interest Points (e.g. Harris detector)

– Sparse representation of the object appearance

• Collect features from whole training set

• Example: [figure]

Gen. Hough Transform with Local Features

• For every feature, store possible “occurrences”


– Object identity

– Pose

– Relative position

• For new image, let the matched features vote for possible object positions
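The voting idea can be sketched as follows, assuming each feature has already been matched to a codebook entry (a string label here) and that only the relative position of the object center is stored per occurrence. The quantized accumulator, the cell size, and all names are illustrative assumptions.

```python
from collections import Counter

def train_occurrences(training_features):
    """Map each codebook entry to the offsets (feature -> object center)
    observed in training."""
    table = {}
    for word, (fx, fy), (cx, cy) in training_features:
        table.setdefault(word, []).append((cx - fx, cy - fy))
    return table

def vote(test_features, table, cell=10):
    """Each matched feature votes for every center its stored offsets predict."""
    votes = Counter()
    for word, (fx, fy) in test_features:
        for dx, dy in table.get(word, []):
            votes[((fx + dx) // cell, (fy + dy) // cell)] += 1
    return votes

# One training object centered at (50, 30).
training = [("wheel", (10, 50), (50, 30)),
            ("wheel", (90, 50), (50, 30)),
            ("head", (50, 5), (50, 30))]
table = train_occurrences(training)

# The same object shifted by (100, 100) in a new image.
test_image = [("wheel", (110, 150)), ("wheel", (190, 150)), ("head", (150, 105))]
votes = vote(test_image, table)
best_cell, best_count = votes.most_common(1)[0]
```

The ambiguous "wheel" features each cast one wrong vote, but only the true center at (150, 130) collects votes from all three features.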

Implicit Shape Model - Representation

[Diagram: training images (+ reference segmentation) → appearance codebook → spatial occurrence distributions over (x, y, s), + local figure-ground labels]

• Learn appearance codebook

– Extract local features at interest points

– Agglomerative clustering ⇒ codebook

• Learn spatial distributions

– Match codebook to training images

– Record matching positions on object

Implicit Shape Model - Recognition

[Diagram: interest points → matched codebook entries → probabilistic voting in a continuous 3D voting space (x, y, s)]

An image feature f, observed at location ℓ, is interpreted through its codebook matches C_i, which vote for object positions (o_n, x):

$$p(o_n, x \mid f, \ell) = \sum_i p(C_i \mid f)\, p(o_n, x \mid C_i, \ell)$$

[Leibe04, Leibe08]
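A sketch of how the vote weights in the sum above could be computed for one feature. It assumes both factors are uniform: p(C_i | f) is spread evenly over the matched codebook entries and p(o_n, x | C_i, ℓ) evenly over the stored occurrences of each entry. The function and variable names are hypothetical.

```python
def probabilistic_votes(feature_pos, matched_entries, occurrences):
    """Return (predicted_center, weight) votes cast by a single feature."""
    fx, fy = feature_pos
    p_entry = 1.0 / len(matched_entries)      # p(C_i | f), assumed uniform
    votes = []
    for entry in matched_entries:
        offsets = occurrences[entry]
        p_position = 1.0 / len(offsets)       # p(o_n, x | C_i, l), assumed uniform
        for dx, dy in offsets:
            votes.append(((fx + dx, fy + dy), p_entry * p_position))
    return votes

occurrences = {"a": [(40, -20), (-40, -20)], "b": [(0, 25)]}
votes = probabilistic_votes((100, 100), ["a", "b"], occurrences)
weights = [w for _, w in votes]
```

Each feature distributes a total weight of 1, so features with many ambiguous interpretations contribute weaker individual votes.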

[Diagram, continued: maxima in the 3D voting space are backprojected to give the backprojected hypotheses]

[Leibe04, Leibe08]

Example: Results on Cows


[Figure sequence: original image → interest points → matched patches → prob. votes → 1st hypothesis → 2nd hypothesis → 3rd hypothesis]

Scale Invariant Voting

• Scale-invariant feature selection

– Scale-invariant interest points

– Rescale extracted patches

– Match to constant-size codebook

• Generate scale votes

– Scale as 3rd dimension in voting space

– Search for maxima in 3D voting space

[Figure: search window in the (x, y, s) voting space]

Scale Voting: Efficient Computation

[Figure: scale votes → binned accum. array → candidate maxima → refinement (MSME), each shown in (x, y, s) space]

• Mean-Shift formulation for refinement

– Scale-adaptive balloon density estimator
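The refinement step can be sketched with a plain mean-shift iteration over weighted votes in (x, y, s). This is a simplification: it assumes a Gaussian kernel with one fixed, isotropic bandwidth, whereas the slide refers to a scale-adaptive balloon density estimator. Names and data are illustrative.

```python
import math

def mean_shift(start, votes, bandwidth, iterations=100, tol=1e-6):
    """Move `start` to a nearby mode of the weighted votes."""
    point = list(start)
    for _ in range(iterations):
        total = 0.0
        new_point = [0.0] * len(point)
        for position, weight in votes:
            k = weight * math.exp(
                -math.dist(point, position) ** 2 / (2 * bandwidth ** 2))
            total += k
            for d, value in enumerate(position):
                new_point[d] += k * value
        new_point = [v / total for v in new_point]
        if math.dist(new_point, point) < tol:
            return new_point
        point = new_point
    return point

# Three votes around (10, 10, 1.0) plus one distant outlier.
votes = [((10.0, 10.0, 1.0), 1.0), ((11.0, 9.0, 1.1), 1.0),
         ((9.0, 11.0, 0.9), 1.0), ((40.0, 40.0, 2.0), 1.0)]
mode = mean_shift((12.0, 12.0, 1.0), votes, bandwidth=3.0)
```

Starting from a candidate maximum found in the binned array, the iteration settles on the local cluster of votes and is barely affected by the distant outlier.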

Detection Results

• Qualitative Performance

– Recognizes different kinds of objects

– Robust to clutter, occlusion, noise, low contrast

Figure-Ground Segregation

• Problem extensively studied in Psychophysics

• Experiments with ambiguous figure-ground stimuli

• Results:

– Evidence that object recognition can and does operate before figure-ground organization

– Interpreted as Gestalt cue familiarity.

M.A. Peterson, “Object Recognition Processes Can and Do Operate Before Figure-Ground Organization”, Cur. Dir. in Psych. Sc., 3:105-111, 1994.

ISM – Top-Down Segmentation


[Diagram: matched codebook entries → probabilistic voting in a continuous 3D voting space (x, y, s) → backprojection of maxima → backprojected hypotheses → p(figure) probabilities → segmentation]

[Leibe04, Leibe08]

Segmentation: Probabilistic Formulation

• Influence of patch on object hypothesis (vote weight)

$$p(f, \ell \mid o_n, x) = \frac{\sum_i p(o_n, x \mid C_i, \ell)\, p(C_i \mid f)\, p(f, \ell)}{p(o_n, x)}$$

• Backprojection to features f and pixels p:

$$p(\mathbf{p} = \text{figure} \mid o_n, x) = \sum_{\mathbf{p} \in (f, \ell)} p(\mathbf{p} = \text{figure} \mid f, \ell, o_n, x)\, p(f, \ell \mid o_n, x)$$

In the sum, the first factor is the segmentation information and the second is the influence on the object hypothesis.

[Leibe04, Leibe08]

Segmentation

[Figure: original image, p(figure) map, p(ground) map, and resulting segmentation]

• Interpretation of p(figure) map

– per-pixel confidence in object hypothesis

– Use for hypothesis verification

[Leibe04, Leibe08]

Example Results: Motorbikes



Example Results: Cows

• Training

– 112 hand-segmented images

• Results on novel sequences:

Single-frame recognition - No temporal continuity used!

[Leibe04, Leibe08]

Example Results: Chairs

Dining room chairs


Office chairs

Inferring Other Information: Part Labels

[Figure: training examples, test image, and output part labels]

[Thomas07]

Inferring Other Information: Part Labels (2)


[Thomas07]

Inferring Other Information: Depth Maps


“Depth from a single image”

[Thomas07]

Application for Pedestrian Detection

• Estimating Articulation

• Rotation-Invariant Detection

[Leibe, Seemann, Schiele, CVPR’05]

[Mikolajczyk, Leibe, Schiele, CVPR’06]

[Figure: voting geometry labeled θ, φ, d and θ_q, φ, d_q]

Highlight of some research topics not covered in the main tutorial

Benchmark Data

• What degree of difficulty do current datasets have?



Example: Caltech-101

A dataset that has been about mastered…


Images from the Caltech-101: 101-way multi-class classification problem

Example: Caltech-256

Images from the Caltech-256: 256-way multi-class recognition problem

Example: Pascal Visual Object Classes Challenge


Pascal VOC 2007: Binary detection problems

http://pascallin.ecs.soton.ac.uk/challenges/VOC/

Example: LabelMe


http://labelme.csail.mit.edu/

Current challenges & ongoing research

• Multi-cue integration

• Finer level categorization

• View invariant recognition

• Unsupervised category discovery

• Learning from noisily labeled images

• Integration of segmentation and recognition

• Learning with text and images/video

• Use of video

• Context and scene layout

Multi-cue integration

• Single cues often not sufficient.

• Integrate multiple local and global cues.



Multi-Category Discrimination

• Distinguish similar categories.

• Need to look at specific details!



Multi-Aspect Recognition

• Detectors for different viewpoints ⇒ How can this be improved?

[Thomas et al., CVPR’06]

[Hoiem, Rother, Winn, CVPR’07]

[Rothganger et al., CVPR’03]

[Savarese & Fei-Fei, ICCV’07]

Unsupervised, semi-supervised category discovery

Topic models for images

• Probabilistic Latent Semantic Analysis (pLSA): example topic “face”

• Latent Dirichlet Allocation (LDA): example topic “beach” [graphical model with variables π, c, z, w and plates N, D]

Sivic et al. ICCV 2005, Fei-Fei et al. ICCV 2005

Figure credit: Fei-Fei Li

Unsupervised, semi-supervised category discovery

Clustering cluttered images

Learning from noisy keyword-based image search results


Grauman & Darrell, CVPR 2006

Fergus et al. ECCV 2004, ICCV 2005

Li & Fei-Fei, CVPR 2007

Learning with text and images/video


Berg, Berg, Edwards, & Forsyth, NIPS 2006

Barnard et al. JMLR 2003

Gupta et al. ECML 2008

Integrating segmentation + recognition


Kumar et al. CVPR 2005

Borenstein & Ullman, ECCV 2002

Kannan, Winn, & Rother, NIPS 2006

Tu, Chen, Yuille, Zhu, ICCV 2003

Role of context, understanding scene layout


Antonio Torralba, IJCV 2003

Role of context, understanding scene layout


[Figure panels: Image, World]

Hoiem, Efros, & Hebert, CVPR 2006

Integration with Scene Geometry

• Goal: Find the ground plane

– Restrict object location

– Assume Gaussian size prior

⇒ Significantly reduced search space

[Figure: dense stereo and Structure-from-Motion; search corridor in the (x, y, s) Hough volume]
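A sketch of how a ground plane and a Gaussian size prior can prune hypotheses. It assumes a level pinhole camera at known height above the ground, a known horizon row, image rows that increase downward, and objects standing on the ground. The prior's mean of 1.7 m and sigma of 0.1 m are illustrative values, not taken from the tutorial.

```python
import math

def world_height(y_top, y_bottom, horizon_row, camera_height):
    """Approximate real-world height of a box standing on the ground plane.
    Returns None if the box bottom is not below the horizon."""
    if y_bottom <= horizon_row:
        return None
    return camera_height * (y_bottom - y_top) / (y_bottom - horizon_row)

def size_prior(height, mean=1.7, sigma=0.1):
    """Unnormalized Gaussian prior on object height in metres."""
    return math.exp(-0.5 * ((height - mean) / sigma) ** 2)

camera_height, horizon_row = 1.5, 200
plausible = world_height(180, 380, horizon_row, camera_height)    # about 1.67 m
implausible = world_height(100, 260, horizon_row, camera_height)  # 4.0 m
floating = world_height(100, 150, horizon_row, camera_height)     # above horizon
```

Hypotheses whose implied height gets a negligible prior, or whose base lies above the horizon, never need to be evaluated, which is one way to read the reduced search space above.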

Extensions

• Combination with 3D Geometry


• Mobile Pedestrian Detection

[Leibe, Cornelis, Cornelis, Van Gool, CVPR’07]

[Ess, Leibe, Van Gool, ICCV’07]

Detections Using Ground Plane Constraints


left camera, 1175 frames

[Leibe et al. CVPR’07]

Extensions: Tracking-by-Detection


• Spacetime trajectory analysis

– Link up detections to form physically plausible ST trajectories

– Select set of ST trajectories that best explain the data

[Leibe et al. CVPR’07]
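The linking step can be sketched with a greedy nearest-neighbour rule, assuming detections are ground-plane positions (x, z) per frame and that "physically plausible" means moving at most `max_step` between consecutive frames. This only illustrates the linking; the selection of the set of trajectories that best explains the data is not shown. All names are hypothetical.

```python
import math

def link_detections(frames, max_step):
    """Greedily extend each trajectory with the nearest unused detection
    of the next frame, if it lies within max_step."""
    trajectories = []  # each trajectory is a list of (t, (x, z))
    for t, detections in enumerate(frames):
        unused = list(detections)
        for traj in trajectories:
            last_t, last_pos = traj[-1]
            if last_t != t - 1 or not unused:
                continue  # trajectory already ended, or nothing left to assign
            nearest = min(unused, key=lambda d: math.dist(d, last_pos))
            if math.dist(nearest, last_pos) <= max_step:
                traj.append((t, nearest))
                unused.remove(nearest)
        # Detections that extend no trajectory start new ones.
        trajectories.extend([[(t, d)] for d in unused])
    return trajectories

frames = [[(0.0, 0.0), (10.0, 0.0)],
          [(0.5, 0.2), (9.6, 0.1)],
          [(1.0, 0.4), (9.1, 0.3), (30.0, 30.0)]]
trajectories = link_detections(frames, max_step=1.0)
lengths = sorted(len(traj) for traj in trajectories)
```

The two walkers are followed across all three frames, while the isolated detection at (30, 30) starts a trajectory of its own.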

Dynamic Scene Analysis Results


[Leibe et al. CVPR’07]

Extensions (2)

• Combination with 3D Reconstruction

[Cornelis, Leibe, Cornelis, Van Gool, 3DPVT’06]

Textured 3D Model


Original 3D Reconstruction

• Run-times

– SfM + Bundle adjustment: 27-30 fps on CPU

– Dense reconstruction: 36 fps on GPU

[Cornelis, Cornelis, Van Gool, CVPR’06]

Improved 3D City Model

Enhancing your driving experience…


Original 3D Reconstruction

[Cornelis, Leibe, Cornelis, Van Gool, 3DPVT’06]

Putting It All Together…

[Figure: overall system diagram]

Mobile Pedestrian Tracking


[Ess, Leibe, Schindler, Van Gool, CVPR’08]

Mobile Tracking Through Crowds


[Ess, Leibe, Schindler, Van Gool, CVPR’08]

Extension: Recovering Articulations

• Idea: Only perform articulated tracking where it’s easy!

• Multi-person tracking

– Solves hard data association problem

• Articulated tracking

– Only on individual “tracklets” between occlusions

[Gammeter, Ess, Jaeggli, Schindler, Leibe, Van Gool, ECCV’08]

Articulated Multi-Person Tracking


• Multi-Person tracking

– Recovers trajectories and solves data association

– Estimates 3D walking direction and speed

– Detects occlusion events

Articulated Tracking under Egomotion



Summary

• Visual recognition is a challenging and very active research area.

• We’ve covered some basic models and representations that have been shown to be effective, and highlighted some ongoing issues.


• See tutorial website for slides, links, references. http://www.vision.ee.ethz.ch/~bleibe/teaching/tutorial-aaai08/

Thank you!