6.870: Object Recognition and Scene Understanding
MIT 6.870 student presentation
Template matching and histograms
Nicolas Pinto
Introduction
Hosts:
Antonio T... (who knows a lot about vision)
a guy... (who has big arms)
a frog... (who has big eyes) and thus should know a lot about vision...
3 papers
yey!!
Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
Abstract

Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)

An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.
1. Introduction
Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.
This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.
The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.
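To make the first filtering stage concrete, here is a minimal sketch of difference-of-Gaussian extremum detection over a single octave. This is an illustration of the idea, not Lowe's implementation: the function name, the sigma schedule, and the contrast threshold are our own choices, and a real detector would add multiple octaves, subpixel localization, and edge-response rejection.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoints(image, sigmas=(1.0, 1.6, 2.6, 4.1), thresh=0.03):
    """Find difference-of-Gaussian extrema in scale space (one octave, toy)."""
    img = image.astype(float) / 255.0
    # Stack of progressively blurred versions of the image.
    blurred = np.stack([gaussian_filter(img, s) for s in sigmas])
    # Difference-of-Gaussian: subtract adjacent blur levels.
    dog = blurred[1:] - blurred[:-1]
    # A key location is a max or min over its 3x3x3 scale-space neighbourhood,
    # with a small contrast threshold to discard weak responses.
    maxima = (dog == maximum_filter(dog, size=3)) & (dog > thresh)
    minima = (dog == minimum_filter(dog, size=3)) & (dog < -thresh)
    scale_idx, ys, xs = np.nonzero(maxima | minima)
    return list(zip(ys, xs, scale_idx))
```

On a synthetic image containing a single bright blob, the extrema cluster around the blob's edges and center, at the scale level closest to the blob's size.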
The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.
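The final verification step, the least-squares fit for the unknown model parameters, reduces to a small linear system. The sketch below (our own illustration, not the paper's code) solves for a 2D affine map A, t taking model keypoint locations to image locations; each of the 3+ matched key pairs contributes two equations in the six unknowns.

```python
import numpy as np

def fit_affine(model_pts, image_pts):
    """Least-squares 2D affine transform mapping model_pts -> image_pts.

    Each matched point pair (x, y) -> (x', y') gives two linear equations
    in the six unknowns (a11, a12, a21, a22, t1, t2):
        x' = a11*x + a12*y + t1
        y' = a21*x + a22*y + t2
    With >= 3 pairs the system is (over)determined and solved by lstsq.
    """
    model_pts = np.asarray(model_pts, float)
    image_pts = np.asarray(image_pts, float)
    n = len(model_pts)
    A = np.zeros((2 * n, 6))
    b = image_pts.reshape(-1)
    A[0::2, 0:2] = model_pts   # x'-rows depend on (a11, a12)
    A[0::2, 4] = 1.0           # ... and t1
    A[1::2, 2:4] = model_pts   # y'-rows depend on (a21, a22)
    A[1::2, 5] = 1.0           # ... and t2
    params, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
    M = params[:4].reshape(2, 2)
    t = params[4:]
    return M, t
```

A low residual between `image_pts` and the points predicted by the fitted (M, t) is what the paper treats as evidence that the candidate match is genuine.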
The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.
Lowe (1999)
Histograms of Oriented Gradients for Human Detection

Navneet Dalal and Bill Triggs
INRIA Rhone-Alps, 655 avenue de l'Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction

Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking "pedestrian detection" (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.

We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work

There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method

This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or
Dalal and Triggs (2005)
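The core of the HOG chain, gradient magnitudes voting into orientation bins per cell, followed by contrast normalization over blocks of cells, can be sketched in a few lines. This is a bare-bones toy under our own choices (8-pixel cells, 9 unsigned bins, hard binning, plain L2 block norm); the paper's detector adds trilinear interpolation, overlapping blocks, and clipped renormalization.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Per-cell histograms of oriented gradients (unsigned orientations)."""
    img = image.astype(float)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned, [0, 180)
    h, w = img.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    # Hard-assign each pixel's orientation to a bin; magnitude is the vote.
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            for b in range(bins):
                hist[i, j, b] = mag[sl][bin_idx[sl] == b].sum()
    return hist

def normalize_block(block, eps=1e-6):
    """L2 normalization of a concatenated block of cell histograms."""
    v = block.reshape(-1)
    return v / np.sqrt((v ** 2).sum() + eps)
```

On an image containing a single vertical intensity edge, all gradient energy lands in the horizontal-gradient orientation bin, which is the kind of pooled, locally normalized signature the detector feeds to the linear SVM.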
A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb, University of Chicago ([email protected])
David McAllester, Toyota Technological Institute at Chicago
Deva Ramanan, UC Irvine

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1–3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by "conceptually weaker" models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining "hard negative" examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a

(This material is based upon work supported by the National Science Foundation under Grant No. 0534820 and 0535174.)
Felzenszwalb et al. (2008)
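The model described above scores a detection as a coarse root template response plus, for each part, the best trade-off between that part's template response and a deformation cost for placing it away from its ideal anchor. The toy below illustrates only that scoring rule, with dense score maps standing in for HOG filter responses, a simple isotropic quadratic deformation cost, and names of our own choosing; it is not the authors' release.

```python
import numpy as np

def dpm_score(root_resp, part_resps, anchors, defo=0.1):
    """Score a detection with a root filter plus deformable parts.

    root_resp: response map of the root filter (the detection is scored at
    its [0, 0] entry); part_resps: list of response maps for the parts;
    anchors: ideal (dy, dx) placement of each part relative to the root.
    Each part is placed where (response - defo * squared displacement from
    its anchor) is maximized, and those best values are summed into the score.
    """
    score = root_resp[0, 0]
    placements = []
    for resp, (ay, ax) in zip(part_resps, anchors):
        ys, xs = np.indices(resp.shape)
        penalty = defo * ((ys - ay) ** 2 + (xs - ax) ** 2)
        gain = resp - penalty
        best = np.unravel_index(np.argmax(gain), gain.shape)
        score += gain[best]
        placements.append(best)
    return score, placements
```

A part with a strong response one pixel off its anchor is moved there, paying a small deformation penalty; this max-over-placements structure is what makes part positions latent variables during training.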
yey!!
Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
Abstract
Proc. of the International Conference on
Computer Vision, Corfu (Sept. 1999)
An object recognition system has been developed that uses a
new class of local image features. The features are invariant
to image scaling, translation, and rotation, and partially in-
variant to illumination changes and affine or 3D projection.
These features share similar properties with neurons in in-
ferior temporal cortex that are used for object recognition
in primate vision. Features are efficiently detected through
a staged filtering approach that identifies stable points in
scale space. Image keys are created that allow for local ge-
ometric deformations by representing blurred image gradi-
ents in multiple orientation planes and at multiple scales.
The keys are used as input to a nearest-neighbor indexing
method that identifies candidate object matches. Final veri-
fication of each match is achieved by finding a low-residual
least-squares solution for the unknown model parameters.
Experimental results show that robust object recognition
can be achieved in cluttered partially-occluded images with
a computation time of under 2 seconds.
1. Introduction
Object recognition in cluttered real-world scenes requires
local image features that are unaffected by nearby clutter or
partial occlusion. The features must be at least partially in-
variant to illumination, 3D projective transforms, and com-
mon object variations. On the other hand, the features must
also be sufficiently distinctive to identify specific objects
among many alternatives. The difficulty of the object recog-
nition problem is due in large part to the lack of success in
finding such image features. However, recent research on
the use of dense local features (e.g., Schmid & Mohr [19])
has shown that efficient recognition can often be achieved
by using local image descriptors sampled at a large number
of repeatable locations.
This paper presents a new method for image feature gen-
eration called the Scale Invariant Feature Transform (SIFT).
This approach transforms an image into a large collection
of local feature vectors, each of which is invariant to image
translation, scaling, and rotation, and partially invariant to
illumination changes and affine or 3D projection. Previous
approaches to local feature generation lacked invariance to
scale and were more sensitive to projective distortion and
illumination change. The SIFT features share a number of
properties in common with the responses of neurons in infe-
rior temporal (IT) cortex in primate vision. This paper also
describes improved approaches to indexing and model ver-
ification.
The scale-invariant features are efficiently identified by
using a staged filtering approach. The first stage identifies
key locations in scale space by looking for locations that
are maxima orminima of a difference-of-Gaussian function.
Each point is used to generate a feature vector that describes
the local image region sampled relative to its scale-space co-
ordinate frame. The features achieve partial invariance to
local variations, such as affine or 3D projections, by blur-
ring image gradient locations. This approach is based on a
model of the behavior of complex cells in the cerebral cor-
tex of mammalian vision. The resulting feature vectors are
called SIFT keys. In the current implementation, each im-
age generates on the order of 1000 SIFT keys, a process that
requires less than 1 second of computation time.
The SIFT keys derived from an image are used in a
nearest-neighbour approach to indexing to identify candi-
date object models. Collections of keys that agree on a po-
tential model pose are first identified through a Hough trans-
formhash table, and then througha least-squares fit to a final
estimate of model parameters. When at least 3 keys agree
on the model parameters with low residual, there is strong
evidence for the presence of the object. Since there may be
dozens of SIFT keys in the image of a typical object, it is
possible to have substantial levels of occlusion in the image
and yet retain high levels of reliability.
The current object models are represented as 2D loca-
tions of SIFT keys that can undergo affine projection. Suf-
ficient variation in feature location is allowed to recognize
perspective projection of planar shapes at up to a 60 degree
rotation away from the camera or to allow up to a 20 degree
rotation of a 3D object.
Lowe (1999)
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
Abstract
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation
We briefly discuss previous work on human detection in§2, give an overview of our method §3, describe our datasets in §4 and give a detailed description and experimentalevaluation of each stage of the process in §5–6. The mainconclusions are summarized in §7.
2 Previous WorkThere is an extensive literature on object detection, but
Dalal and Triggs (2005)
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb
University of Chicago
David McAllester
Toyota Technological Institute at Chicago
Deva Ramanan
UC Irvine
Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL
Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution
Felzenszwalb et al. (2008)
Scale-Invariant Feature Transform (SIFT)
adapted from Kucuktunc
Scale-Invariant Feature Transform (SIFT)
adapted from Brown, ICCV 2003
SIFT local features are
invariant...
adapted from David Lee
like me they are robust...
... to changes in illumination, noise, viewpoint, occlusion, etc.
I am sure you want to know how to build them
1. find interest points or “keypoints”
2. find their dominant orientation
3. compute their descriptor
4. match them on other images
1. find interest points or “keypoints”
keypoints are taken as maxima/minima
of a DoG pyramid
in this setting, extrema are invariant to scale...
a DoG (Difference of Gaussians) pyramid is simple to compute...
even he can do it!
before after
adapted from Pallus and Fleishman
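the construction can be sketched in a few lines of plain Python. This is a toy 1-D version (real SIFT blurs 2-D images over several octaves); all function names here are illustrative, not Lowe's implementation:

```python
import math

def gaussian_kernel(sigma, radius=None):
    # Discrete 1-D Gaussian, normalized to sum to 1.
    if radius is None:
        radius = int(3 * sigma)
    ks = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    s = sum(ks)
    return [k / s for k in ks]

def convolve(signal, kernel):
    # Same-size filtering with edge clamping.
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = min(max(i + j - r, 0), len(signal) - 1)
            acc += k * signal[idx]
        out.append(acc)
    return out

def difference_of_gaussians(signal, sigma, k=1.6):
    # DoG level = blur at k*sigma minus blur at sigma.
    g_narrow = convolve(signal, gaussian_kernel(sigma))
    g_wide = convolve(signal, gaussian_kernel(k * sigma))
    return [w - n for n, w in zip(g_narrow, g_wide)]

# A step edge: the DoG response is largest near the discontinuity.
signal = [0.0] * 10 + [1.0] * 10
dog = difference_of_gaussians(signal, 1.0)
```

Stacking such levels at successive sigmas (and downsampled octaves) gives the pyramid.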
then we just have to find neighborhood extrema in this 3D DoG space
if a pixel is an extremum in its neighboring region, it becomes a candidate keypoint
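the 26-neighbor test can be sketched like this, assuming `dog_levels` is a small list of same-sized 2-D arrays (lists of lists); the helper names are made up:

```python
def is_extremum(dog_levels, s, y, x):
    # Compare a pixel to its 26 neighbors in the 3x3x3 scale-space cube.
    v = dog_levels[s][y][x]
    neighbors = []
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == dy == dx == 0:
                    continue
                neighbors.append(dog_levels[s + ds][y + dy][x + dx])
    return v > max(neighbors) or v < min(neighbors)

def find_keypoints(dog_levels):
    # Scan interior pixels of the interior scales only,
    # so every candidate has a full 26-neighborhood.
    kps = []
    for s in range(1, len(dog_levels) - 1):
        h, w = len(dog_levels[s]), len(dog_levels[s][0])
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                if is_extremum(dog_levels, s, y, x):
                    kps.append((s, y, x))
    return kps
```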
too many keypoints?
1. remove low contrast
2. remove edges
adapted from wikipedia
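both rejection tests can be sketched as follows: a simplified version of Lowe's contrast threshold and his Hessian-based edge check (an edge has one large principal curvature and one small one, so the ratio tr²/det of the 2x2 Hessian blows up); the thresholds here are illustrative:

```python
def passes_filters(dog, y, x, contrast_thresh=0.03, edge_ratio=10.0):
    # 1. Reject low-contrast points.
    v = dog[y][x]
    if abs(v) < contrast_thresh:
        return False
    # 2. Reject edge-like points: finite-difference 2x2 Hessian,
    #    then test tr^2/det against (r+1)^2/r as in Lowe's paper.
    dxx = dog[y][x + 1] - 2 * v + dog[y][x - 1]
    dyy = dog[y + 1][x] - 2 * v + dog[y - 1][x]
    dxy = (dog[y + 1][x + 1] - dog[y + 1][x - 1]
           - dog[y - 1][x + 1] + dog[y - 1][x - 1]) / 4.0
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False
    return tr * tr / det < (edge_ratio + 1) ** 2 / edge_ratio
```

An isolated blob passes; a straight ridge (curvature in one direction only) is rejected.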
2. find their dominant orientation
each selected keypoint is assigned to one or more “dominant” orientations...
... this step is important to achieve rotation invariance
How? Using the DoG pyramid to achieve scale invariance:
a. compute image gradient magnitude and orientation
b. build an orientation histogram
c. keypoint’s orientation(s) = peak(s)
a. compute image gradient magnitude and orientation
b. build an orientation histogram
adapted from Ofir Pele
c. keypoint’s orientation(s) = peak(s)
* the peak ;-)
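steps a–c can be sketched like this. It is a toy version: no Gaussian weighting of the votes and no parabolic peak interpolation, and the function name is invented. Like Lowe, it keeps every bin within 80% of the tallest peak, which is how a keypoint can get more than one orientation:

```python
import math

def dominant_orientations(img, y, x, radius=3, nbins=36):
    # a. centered gradients around (y, x); b. magnitude-weighted
    # 36-bin orientation histogram; c. return peak orientation(s) in degrees.
    hist = [0.0] * nbins
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < len(img) - 1 and 0 < xx < len(img[0]) - 1):
                continue
            gx = img[yy][xx + 1] - img[yy][xx - 1]
            gy = img[yy + 1][xx] - img[yy - 1][xx]
            mag = math.hypot(gx, gy)
            ang = math.atan2(gy, gx) % (2 * math.pi)
            hist[int(ang / (2 * math.pi) * nbins) % nbins] += mag
    peak = max(hist)
    return [b * 360 // nbins for b, h in enumerate(hist)
            if peak > 0 and h >= 0.8 * peak]
```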
3. compute their descriptor
SIFT descriptor = a set of orientation histograms
4x4 array x 8 bins = 128 dimensions (normalized)
16x16 neighborhood of pixel gradients
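the 16x16 → 4x4x8 layout can be sketched as below. This is a toy version: the real descriptor also rotates the window to the dominant orientation, applies Gaussian weighting and trilinear interpolation, and clips values at 0.2 before renormalizing; all of that is omitted, and the keypoint is assumed to sit at least 9 pixels from the border:

```python
import math

def sift_descriptor(img, y, x):
    # 16x16 neighborhood -> 4x4 cells, 8 orientation bins each
    # = 128 dimensions, magnitude-weighted, then L2-normalized.
    desc = [0.0] * 128
    for dy in range(-8, 8):
        for dx in range(-8, 8):
            yy, xx = y + dy, x + dx
            gx = img[yy][xx + 1] - img[yy][xx - 1]
            gy = img[yy + 1][xx] - img[yy - 1][xx]
            mag = math.hypot(gx, gy)
            ang = math.atan2(gy, gx) % (2 * math.pi)
            cell = ((dy + 8) // 4) * 4 + (dx + 8) // 4  # which of the 4x4 cells
            bin_ = int(ang / (2 * math.pi) * 8) % 8     # which of the 8 bins
            desc[cell * 8 + bin_] += mag
    norm = math.sqrt(sum(v * v for v in desc)) or 1.0
    return [v / norm for v in desc]
```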
4. match them on other images
How to match?
nearest neighbor
Hough transform voting
least-squares fit
etc.
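nearest-neighbor matching with Lowe's distance-ratio test can be sketched as follows (a simplified version on plain descriptor lists; the paper's idea is that a correct match should be much closer than the second-best candidate):

```python
def match(desc1, desc2, ratio=0.8):
    # For each descriptor in desc1, find its nearest and second-nearest
    # neighbor in desc2; keep the match only if the nearest is clearly
    # closer than the runner-up (squared distances, so ratio is squared).
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    matches = []
    for i, a in enumerate(desc1):
        dists = sorted((d2(a, b), j) for j, b in enumerate(desc2))
        if len(dists) > 1 and dists[0][0] < (ratio ** 2) * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches
```

Ambiguous descriptors (two near-equal candidates) produce no match at all, which is usually better than a wrong one.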
SIFT is great!
• invariant to affine transformations
• easy to understand
• fast to compute
Extension example: Spatial Pyramid Matching using SIFT
Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories
Svetlana Lazebnik, Beckman Institute, University of Illinois
Cordelia Schmid, INRIA Rhone-Alpes, Montbonnot, France
Jean Ponce, Ecole Normale Superieure, Paris, France
CVPR 2006
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
Abstract
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
1 Introduction
Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.
We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.
2 Previous Work
There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.
3 Overview of the Method
This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or
Dalal and Triggs (2005)
first of all, let me put this paper in context
histograms of local image measurements have been quite successful
Swain & Ballard 1991 - Color Histograms
Schiele & Crowley 1996 - Receptive Fields Histograms
Lowe 1999 - SIFT
Schneiderman & Kanade 2000 - Localized Histograms of Wavelets
Leung & Malik 2001 - Texton Histograms
Belongie et al. 2002 - Shape Context
Dalal & Triggs 2005 - Dense Orientation Histograms
...
tons of “feature sets” have been proposed
features
Gavrila & Philomen 1999 - Edge Templates + Nearest Neighbor
Papageorgiou & Poggio 2000, Mohan et al. 2001, DePoortere et al. 2002 - Haar Wavelets + SVM
Viola & Jones 2001 - Rectangular Differential Features + AdaBoost
Mikolajczyk et al. 2004 - Parts Based Histograms + AdaBoost
Ke & Sukthankar 2004 - PCA-SIFT
...
Histograms of Oriented Gradients for Human DetectionNavneet Dalal and Bill Triggs
INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
AbstractWe study the question of feature sets for robust visual ob-
ject recognition, adopting linear SVM based human detec-tion as a test case. After reviewing existing edge and gra-dient based descriptors, we show experimentally that gridsof Histograms of Oriented Gradient (HOG) descriptors sig-nificantly outperform existing feature sets for human detec-tion. We study the influence of each stage of the computationon performance, concluding that fine-scale gradients, fineorientation binning, relatively coarse spatial binning, andhigh-quality local contrast normalization in overlapping de-scriptor blocks are all important for good results. The newapproach gives near-perfect separation on the original MITpedestrian database, so we introduce a more challengingdataset containing over 1800 annotated human images witha large range of pose variations and backgrounds.
1 IntroductionDetecting humans in images is a challenging task owing
to their variable appearance and the wide range of poses thatthey can adopt. The first need is a robust feature set thatallows the human form to be discriminated cleanly, even incluttered backgrounds under difficult illumination. We studythe issue of feature sets for human detection, showing that lo-cally normalized Histogram of Oriented Gradient (HOG) de-scriptors provide excellent performance relative to other ex-isting feature sets including wavelets [17,22]. The proposeddescriptors are reminiscent of edge orientation histograms[4,5], SIFT descriptors [12] and shape contexts [1], but theyare computed on a dense grid of uniformly spaced cells andthey use overlapping local contrast normalizations for im-proved performance. We make a detailed study of the effectsof various implementation choices on detector performance,taking “pedestrian detection” (the detection of mostly visiblepeople in more or less upright poses) as a test case. For sim-plicity and speed, we use linear SVM as a baseline classifierthroughout the study. The new detectors give essentially per-fect results on theMIT pedestrian test set [18,17], so we havecreated a more challenging set containing over 1800 pedes-trian images with a large range of poses and backgrounds.Ongoing work suggests that our feature set performs equallywell for other shape-based object classes.
We briefly discuss previous work on human detection in§2, give an overview of our method §3, describe our datasets in §4 and give a detailed description and experimentalevaluation of each stage of the process in §5–6. The mainconclusions are summarized in §7.
2 Previous WorkThere is an extensive literature on object detection, but
here we mention just a few relevant papers on human detec-tion [18,17,22,16,20]. See [6] for a survey. Papageorgiou etal [18] describe a pedestrian detector based on a polynomialSVM using rectified Haar wavelets as input descriptors, witha parts (subwindow) based variant in [17]. Depoortere et algive an optimized version of this [2]. Gavrila & Philomen[8] take a more direct approach, extracting edge images andmatching them to a set of learned exemplars using chamferdistance. This has been used in a practical real-time pedes-trian detection system [7]. Viola et al [22] build an efficientmoving person detector, using AdaBoost to train a chain ofprogressively more complex region rejection rules based onHaar-like wavelets and space-time differences. Ronfard etal [19] build an articulated body detector by incorporatingSVM based limb classifiers over 1st and 2nd order Gaussianfilters in a dynamic programming framework similar to thoseof Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth[9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholdedgradient magni-tudes to build a parts based method containing detectors forfaces, heads, and front and side profiles of upper and lowerbody parts. In contrast, our detector uses a simpler archi-tecture with a single detection window, but appears to givesignificantly higher performance on pedestrian images.
3 Overview of the Method
This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or
localizing humans in images is a challenging task...
difficult!
Wide variety of articulated poses
Variable appearance/clothing
Complex backgrounds
Unconstrained illuminations
Occlusions
Different scales
...
Approach
•robust feature set (HOG)
•simple classifier (linear SVM)
•fast detection (sliding window)
adapted from Bill Triggs
• Gamma normalization
• Space: RGB, LAB or Gray
• Method: SQRT or LOG
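The gamma step can be sketched like this — a minimal illustration (my code, not the authors’) of SQRT vs LOG compression on a grayscale image in [0, 1]:

```python
import numpy as np

def gamma_normalize(img, method="sqrt"):
    """Compress the dynamic range of an image with values in [0, 1].

    method="sqrt" takes the square root of each pixel;
    method="log"  takes log(1 + x).
    Both reduce the influence of strong illumination changes
    before gradients are computed.
    """
    img = np.asarray(img, dtype=np.float64)
    if method == "sqrt":
        return np.sqrt(img)
    elif method == "log":
        return np.log1p(img)
    raise ValueError("unknown method: %s" % method)

# toy 2x2 grayscale patch
patch = np.array([[0.0, 0.25], [0.25, 1.0]])
out = gamma_normalize(patch, "sqrt")
```

Dark values are stretched and bright values compressed, which is the point of the step.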
• Filtering with simple masks
uncentered
centered
cubic-corrected
diagonal
Sobel
* centered performs the best
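The winning “centered” mask is just the 1-D filter [-1, 0, 1] applied in x and in y; a small sketch (mine, with replicated borders as one plausible edge handling):

```python
import numpy as np

def centered_gradients(img):
    """Per-pixel gradient magnitude and orientation using the
    centered [-1, 0, 1] mask in both directions.
    Border pixels use replicated edges."""
    img = np.asarray(img, dtype=np.float64)
    padded = np.pad(img, 1, mode="edge")
    gx = padded[1:-1, 2:] - padded[1:-1, :-2]   # [-1, 0, 1] along x
    gy = padded[2:, 1:-1] - padded[:-2, 1:-1]   # [-1, 0, 1] along y
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                    # radians in (-pi, pi]
    return mag, ang
```

On a horizontal intensity ramp every interior pixel gets gradient magnitude 2 (the difference of its two horizontal neighbors) and orientation 0.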
remember SIFT?
...after filtering, each “pixel” represents an oriented gradient...
...pixels are regrouped in “cells”, they cast a weighted vote for an orientation histogram...
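The voting step can be sketched as follows — a hypothetical helper (names are mine) that bins one cell’s gradients into a 9-bin unsigned-orientation histogram, each pixel voting with its magnitude. The real implementation also interpolates votes bilinearly between neighboring bins, which is omitted here:

```python
import numpy as np

def cell_histogram(mag, ang, n_bins=9):
    """Orientation histogram for one cell.

    mag : gradient magnitudes of the cell's pixels
    ang : orientations in radians, folded to [0, pi) so that
          opposite directions share a bin ("unsigned" gradients)
    Each pixel casts a vote equal to its magnitude.
    """
    ang = np.mod(ang, np.pi)                      # unsigned orientation
    bins = (ang / np.pi * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), mag.ravel())    # accumulate the votes
    return hist
```

A cell full of purely horizontal gradients puts all of its mass into bin 0.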
HOG (Histogram of Oriented Gradients)
a window can be represented like that
then, cells are locally normalized using overlapping “blocks”
they used two types of blocks
• rectangular
• similar to SIFT (but dense)
• circular
• similar to Shape Context
and four different types of block normalization
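The four schemes compared in the paper are L2-norm, L2-Hys, L1-norm and L1-sqrt. A rough sketch (mine), assuming v is the concatenated cell histograms of one block:

```python
import numpy as np

def normalize_block(v, scheme="l2hys", eps=1e-5, clip=0.2):
    """Normalize one block's descriptor vector v."""
    v = np.asarray(v, dtype=np.float64)
    if scheme == "l2":
        return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
    if scheme == "l2hys":                  # L2-norm, clip, renormalize
        v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)
        v = np.minimum(v, clip)
        return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
    if scheme == "l1":
        return v / (np.sum(np.abs(v)) + eps)
    if scheme == "l1sqrt":                 # L1-norm followed by sqrt
        return np.sqrt(v / (np.sum(np.abs(v)) + eps))
    raise ValueError(scheme)
```

The clipping in L2-Hys caps the influence of any single dominant gradient, which the paper found slightly more robust than plain L2.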
like SIFT, they gain invariance...
...to illumination changes, small deformations, etc.
finally, a sliding window is
classified by a simple linear SVM
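Detection then amounts to scoring every window position with the learned linear weights, score = w·x + b. A toy sketch (window size, w and b are placeholders, not values from the paper):

```python
import numpy as np

def sliding_window_scores(features, win_h, win_w, w, b, step=1):
    """Score each (win_h x win_w) window of a 2-D feature grid
    with a linear classifier: score = <w, window> + b.
    Returns {(y, x): score} for every window position."""
    H, W = features.shape
    scores = {}
    for y in range(0, H - win_h + 1, step):
        for x in range(0, W - win_w + 1, step):
            window = features[y:y + win_h, x:x + win_w].ravel()
            scores[(y, x)] = float(w @ window + b)
    return scores
```

In the real detector the "grid" is the HOG descriptor map rather than raw pixels, and the scan is repeated over an image pyramid to handle scale.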
during the learning phase, the algorithm “looked” for hard examples
Training
adapted from Martial Hebert
average gradients
positive weights
negative weights
Example
adapted from Bill Triggs
Example
adapted from Martial Hebert
Further Development
• Detection on Pascal VOC (2006)
• Human Detection in Movies (ECCV 2006)
• US Patent by MERL (2006)
• Stereo Vision HoG (ICVES 2008)
Extension example:
Pyramid HoG++
A simple demo...
VIDEO HERE
so, it doesn’t work ?!?
no no, it works...
...it just doesn’t work well...
Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
Abstract
An object recognition system has been developed that uses a
new class of local image features. The features are invariant to
image translation, scaling, and rotation, and partially invariant to
illumination changes and affine or 3D projection. Previous
approaches to local feature generation lacked invariance to
scale and were more sensitive to projective distortion and
illumination change. The SIFT features share a number of
properties in common with the responses of neurons in inferior
temporal (IT) cortex in primate vision.
Lowe (1999)
Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs
INRIA Rhône-Alpes, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr
Abstract
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation
Dalal and Triggs (2005)
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb, University of Chicago, [email protected]
David McAllesterToyota Technological Institute at Chicago
Deva RamananUC Irvine
Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.
1. Introduction
We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.
Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty
This material is based upon work supported by the National Science Foundation under Grant No. 0534820 and 0535174.
Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
object categories. Figure 1 shows an example detection obtained with our person model.
The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1–3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by “conceptually weaker” models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.
Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.
Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining “hard negative” examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a
Felzenszwalb et al. (2008)
This paper describes one of the best algorithms in object detection...
They used the following methods:
HOG Features
Introduced by Dalal & Triggs (2005)
Deformable Part Model
Introduced by Fischler & Elschlager (1973)
Latent SVM
Introduced by the authors
HOG Features
Model Overview
// detection, root filter, part filters, deformation models
HOG Features
// 8x8 pixel blocks window
// features computed at different resolutions (pyramid)
HOG Pyramid
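The pyramid idea can be sketched as repeatedly downscaling the image and extracting features at each level; this toy version (mine) just downsamples by nearest-neighbor, where a real system would run the HOG extractor on every level:

```python
import numpy as np

def build_pyramid(img, scale=1.2, min_size=16):
    """Return successively downscaled copies of img, finest first.

    Each level would be fed to the HOG feature extractor so that a
    fixed-size filter matches objects at different image scales."""
    levels = []
    h, w = img.shape
    s = 1.0
    while h / s >= min_size and w / s >= min_size:
        ys = (np.arange(int(h / s)) * s).astype(int)   # nearest-neighbor
        xs = (np.arange(int(w / s)) * s).astype(int)   # row/col sampling
        levels.append(img[np.ix_(ys, xs)])
        s *= scale
    return levels
```

In the Felzenszwalb et al. model the root filter is applied at a coarse level and the part filters at twice that resolution, which is why the features live in a pyramid in the first place.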
Deformable Part Model
// each part is a local property
// springs capture spatial relationships
// here, the springs can be “negative”
Deformable Part Model
detection score = sum of filter responses - deformation cost
// root filter + part filters + deformable model
Deformable Part Model
// filters and feature vector (at position p in the pyramid H)
// position relative to the root location
// coefficients of a quadratic function on the placement
// score of a placement
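These annotations belong to the placement score; as given in Felzenszwalb et al. (2008), the score of a placement $z = (p_0, \dots, p_n)$ of the root filter $F_0$ and part filters $F_1, \dots, F_n$ is:

```latex
\mathrm{score}(p_0,\dots,p_n)
  = \sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
  + \sum_{i=1}^{n} \Bigl( a_i \cdot (\tilde{x}_i, \tilde{y}_i)
                        + b_i \cdot (\tilde{x}_i^{2}, \tilde{y}_i^{2}) \Bigr)
```

where $\phi(H, p_i)$ is the feature vector at position $p_i$ in the pyramid $H$, $(\tilde{x}_i, \tilde{y}_i)$ is the position of part $i$ relative to the root location, and $a_i, b_i$ are the coefficients of the quadratic deformation cost: the first sum is the "filter responses", the second the (negated) "deformation cost" from the previous slide.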
Latent SVM
// score of example x: f_β(x) = max_z β·Φ(x, z)
// β: filters and deformation parameters
// Φ(x, z): features and part displacements
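Training a latent SVM alternates between choosing the best latent placement for each example and updating the linear parameters β. A toy coordinate-descent sketch (mine, over abstract feature vectors; regularization of β is omitted), not the authors’ implementation:

```python
import numpy as np

def best_latent(beta, phis):
    """f_beta(x) = max_z beta . Phi(x, z): among the candidate
    feature vectors for x (one per latent choice z, e.g. per part
    placement), pick the highest scoring one."""
    scores = [beta @ p for p in phis]
    return phis[int(np.argmax(scores))]

def latent_svm_step(beta, positives, negatives, lr=0.1, C=1.0):
    """One pass of the alternation:
    1) fix beta, select the best Phi(x, z) for each example;
    2) fix those selections, take hinge-loss subgradient steps."""
    examples = [(p, +1) for p in positives] + [(n, -1) for n in negatives]
    for phis, y in examples:
        phi = best_latent(beta, phis)
        if y * (beta @ phi) < 1:             # margin violated
            beta = beta + lr * C * y * phi   # subgradient step
    return beta
```

The key property from the paper is that once the latent choices for the positives are fixed, the remaining problem is a convex SVM objective ("semi-convexity").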
Bonus
// Data Mining Hard Negatives
// Model Initialization
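The hard-negatives bullet above refers to bootstrapping: train, scan negative data for false positives, add them to the training set, retrain. A schematic loop (train_fn, score_fn and the threshold are placeholders, not the paper’s margin-sensitive criterion):

```python
def mine_hard_negatives(train_fn, score_fn, positives, negatives,
                        neg_pool, threshold=0.0, rounds=2):
    """Bootstrapping loop, schematically: any example from the
    negative pool that the current model scores above `threshold`
    is a false positive ("hard negative") and is added to the
    training set before retraining."""
    model = train_fn(positives, negatives)
    for _ in range(rounds):
        hard = [x for x in neg_pool if score_fn(model, x) > threshold]
        if not hard:
            break                       # no false positives left
        negatives = negatives + hard
        model = train_fn(positives, negatives)
    return model, negatives
```

Both Dalal & Triggs and Felzenszwalb et al. rely on a loop of this shape, since the negative windows in full images vastly outnumber what fits in memory at once.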
Results
Pascal VOC 2006
Results
Models learned
Experiments
~ Dalal’s model
~ Dalal’s + LSVM
Examples
errors
A simple demo...
Conclusions
so, it doesn’t work ?!?
no no, it works...
...it just doesn’t work well...
...or there is a problem with the seat-computer interface...
Conclusion