Viola Jones Simplified
By Eric Gregori
Introduction
Viola-Jones refers to a paper written by Paul Viola and
Michael Jones describing a method of fast, machine-vision-based
object detection. This method revolutionized the field of face
detection. Using this method, face detection could be
implemented in embedded devices and still detect faces within a
practical amount of time.
In the paper they describe an algorithm that uses a
modified version of the AdaBoost machine learning algorithm to
train a cascade of weak classifiers. Haar features (along with a
unique concept, the integral image) are used as the weak
classifiers. The weak classifiers are combined using the
AdaBoost algorithm to create a strong classifier. The strong
classifiers are in turn combined to create a cascade. The
cascade provides the mechanism to achieve high classification
rates at a low CPU-cycle cost.
“This paper describes a machine learning approach for visual
object detection which is capable of processing images
extremely rapidly and achieving high detection rates. This
work is distinguished by three key contributions. The first
is the introduction of a new image representation called the
“Integral Image” which allows the features used by our detector
to be computed very quickly. The second is a learning
algorithm, based on AdaBoost, which selects a small number
of critical visual features from a larger set and yields
extremely efficient classifiers[6]. The third contribution is
a method for combining increasingly more complex classifiers
in a “cascade” which allows background regions of the
image to be quickly discarded while spending more computation
on promising object-like regions.”
Excerpt from Rapid Object Detection using a Boosted Cascade of Simple Features
Haar Features
Haar features are one of the mechanisms used by the Viola-
Jones algorithm. Images are made up of many pixels. A 250x250
image contains 62,500 pixels. Processing images on a pixel-by-
pixel basis is a very CPU-cycle-intensive process. In addition,
an individual pixel contains no information about the pixels
around it. Pixel data is absolute, as opposed to relative. A
side effect of the absolute nature of pixel data is sensitivity
to lighting. Since pixel data is directly affected by the
lighting of the image, large variances can occur in the pixel
data due only to changes in lighting.
Haar features solve both problems related to pixel data
(the CPU cycles required and the relativity of the data). Haar
features do not encode individual pixel information; they encode
relative pixel information. Haar features provide relative
information on multiple pixels.
Haar features are based on Haar wavelets as proposed by:
S. Mallat. A theory for multiresolution signal decomposition:
The wavelet representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 11(7):674-693, July 1989.
Haar features were originally used in the paper:
A General Framework for Object Detection. Constantine P.
Papageorgiou, Michael Oren, Tomaso Poggio. Center for Biological
and Computational Learning, Artificial Intelligence Laboratory,
MIT, Cambridge, MA 02139. {cpapa, oren, tp}@ai.mit.edu
A Haar feature is used to encode both the relative data between
pixels, and the position of that data. A Haar feature consists
of multiple adjacent areas that are subtracted from each other.
Viola-Jones suggested Haar features containing 2, 3, and 4
areas.
Haar Features ( +/- Polarities )
Value = Pixels under Green – pixels under red
The value of a Haar feature is calculated by taking the sum of
the pixels under the green square and subtracting the sum of the
pixels under the red square.
value = sum(pixels under green) - sum(pixels under red)
By encoding the difference between two adjoining areas in an
image, the Haar feature is effectively detecting edges. The
further the value is from zero, the harder or more distinct the
edge. A value of zero indicates the two areas are equal; the
pixels under the areas have equal average intensities (the lack
of an edge). It should be noted that although this process can
be done on color images, for the Viola-Jones algorithm it is
done on grayscale images. In most cases individual pixel values
range from 0 to 255, with 0 being black and 255 being white.
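As a concrete illustration, the two-area value computation can be sketched in a few lines of C++. The function and variable names here are illustrative, not from the paper; each area is simply given as the list of grayscale pixel values it covers.

```cpp
#include <vector>
#include <numeric>

// Two-area Haar feature value: the sum of the pixels under the green
// area minus the sum of the pixels under the red area. Pixel values
// are grayscale intensities in the range 0..255. The area geometry
// (position/size) is omitted for clarity.
int haar2_value(const std::vector<int>& green_pixels,
                const std::vector<int>& red_pixels)
{
    int green_sum = std::accumulate(green_pixels.begin(), green_pixels.end(), 0);
    int red_sum   = std::accumulate(red_pixels.begin(),  red_pixels.end(),  0);
    return green_sum - red_sum;   // far from 0 => hard edge, 0 => no edge
}
```

With one pixel per area this reproduces the rows of the tables below, e.g. a green average of 125 against a red average of 250 yields -125.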
Average of pixels
under Green   under Red   Haar Value
125           250         -125
125           225         -100
125           200          -75
125           175          -50
125           150          -25
125           125            0
125           100           25
125            75           50
125            50           75
125            25          100
125             0          125

Average of pixels
under Green   under Red   Haar Value
250           250            0
250           225           25
250           200           50
250           175           75
250           150          100
250           125          125
250           100          150
250            75          175
250            50          200
250            25          225
250             0          250
The values calculated using Haar features require one additional
step before being used for object detection. The values must be
converted to true or false results. This is done using
thresholding.
Thresholding
Thresholding is the process of converting an analog value
into a true or false. In this case, the analog value is the
output of the Haar feature: value = sum(pixels under green) -
sum(pixels under red). To convert the analog value into a
true/false statement, the analog value is compared to a
threshold. If the value is >= the threshold, the statement is
true; if not, it is false.
hj(x) = 1 if pj*fj(x) >= pj*thetaj, otherwise 0
Where: hj(x) - Weak classifier (basically 1 Haar feature)
pj - Parity
fj(x) - Haar feature
thetaj - Threshold
As the equation above illustrates, the output of the weak
classifier is either true or false. The threshold determines
when the function transitions the output state. The parity
determines the direction of the inequality sign. This is
demonstrated later. The threshold and parity must be set
correctly to get the full benefit of the feature.
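The weak-classifier decision can be sketched as follows. Names are illustrative; the parity is represented as +1 or -1, and multiplying both sides by it flips the direction of the inequality.

```cpp
// Weak classifier decision: compare the Haar feature value against a
// threshold; the parity (+1 or -1) flips the inequality's direction.
// Returns 1 (true, pattern present) or 0 (false).
int weak_classify(int feature_value, int parity, int threshold)
{
    // parity = +1 : true when value >= threshold
    // parity = -1 : true when value <= threshold
    return (parity * feature_value >= parity * threshold) ? 1 : 0;
}
```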
Setting the threshold and parity is not clearly defined in
the Viola-Jones paper: "For each feature, the weak learner
determines the optimal threshold classification function, such
that the minimum number of examples are misclassified" (Viola-
Jones). Many theories have been proposed to calculate the
threshold: minimum, average, standard deviation, and average
deviation. The following diagram illustrates the results of
those theories. The data is based on one feature type, in a
single known position. The feature position and type were based
on data from the Viola-Jones paper.
Example
As the above text and illustrations show, Viola-Jones found that
a 3-area Haar feature across the bridge of the nose provided a
better than average probability of detecting a face.
The eyes are darker than the bridge of the nose.
value = 2*sum(pixels under green) - sum(pixels under red)
The value increases if the eye area gets darker, or the bridge
of the nose gets brighter. The feature shown above was placed
over the eyes of 313 face images. The values were calculated for
each image and graphed.
Figure 1 - Haar Feature sums over 313 face and 313 non-face images
Each point in the blue curve above represents a Haar
feature value (like the one above) when placed over the eyes
and nose of 313 different faces. The green line represents the
average of those values. The average value is positive, and
well above zero. This indicates that the Haar feature is over a
portion of the image that matches the feature's characteristics
(in this case, light in the middle and dark on the sides).
The red line indicates the exact same feature being placed
in the upper left corner of the image, as shown above. This
represents a non-face image, or random noise. A feature is
weighted on how well it distinguishes between random noise (non-
faces) and a portion of the face (eyes/nose in this case).
The center of the 3-area Haar feature is multiplied by 2 to
balance the 2 negative sides. So the formula for the 3-area Haar
is slightly different than the other Haar types: value =
2*sum(pixels under green) - sum(pixels under red). A 3-area Haar
feature over a random noise image results in a value of 0. As
you can see from the graph above, the average over 313 images is
about 0.
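A minimal sketch of the 3-area value computation, with the center weighted by 2 (names are illustrative; both side areas are passed in combined, and all areas are assumed equal in size):

```cpp
#include <vector>
#include <numeric>

// Three-area Haar feature value (e.g. eyes / bridge of the nose):
// the center area is weighted by 2 to balance the two negative side
// areas, so a flat, noise-like region sums to roughly 0.
int haar3_value(const std::vector<int>& center_pixels,
                const std::vector<int>& side_pixels)  // both sides combined
{
    int center_sum = std::accumulate(center_pixels.begin(), center_pixels.end(), 0);
    int side_sum   = std::accumulate(side_pixels.begin(),  side_pixels.end(),  0);
    return 2 * center_sum - side_sum;
}
```

A uniform patch gives 0, while a bright bridge between two dark eye regions gives a large positive value.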
[Figure 1 chart: Haar feature values (y-axis -60000 to 50000) over
313 images; series: Face, non-Face, Face Avg, Non-Face Avg.]
Parity
The parity variable is used to adjust the value so that it
is above 0. An inverted feature, value = sum(pixels under red) -
2*sum(pixels under green), would produce a value of the same
magnitude with a different sign. The parity is used to convert
the value of an inverted feature into a positive integer,
bringing it above the zero line.
The implementation described in this paper used a slightly
different approach. Instead of using a parity variable, 2 sets
of Haar features were used: inverted and non-inverted. During
the threshold learning process, Haar features were discarded if
their average value over the 313 images was negative. In
summary, both an inverted and a non-inverted Haar feature were
placed in the same location on the image. The Haar feature with
a negative average value over all images was discarded. This was
simpler from an implementation point of view.
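The discard rule described above can be sketched as follows (names are illustrative): for each location, the feature whose average response over all training images is negative is dropped, and its opposite-polarity twin is kept.

```cpp
#include <vector>
#include <numeric>

// Keep a Haar feature only if its average value over all training
// images is positive; its opposite-polarity twin (same magnitudes,
// opposite sign) is discarded. This avoids needing a parity variable.
bool keep_feature(const std::vector<int>& values_over_images)
{
    long long sum = std::accumulate(values_over_images.begin(),
                                    values_over_images.end(), 0LL);
    return sum > 0;   // positive average => keep this polarity
}
```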
Haar feature thresholding
Figure 2 - Various statistical values from data in figure 1
The graph above shows various methods of calculating a
threshold. All methods calculate a threshold based on measuring
the Haar feature values at the same image position in all 313
images.
The goal is to calculate a single threshold for all 313 images
that maximizes the number of faces detected and minimizes the
number of non-face detections (false positives).
[Figure 2 chart: Haar feature values (y-axis -60000 to 50000) over
313 images; face series (24|24): Favg, Fdev, Fstd; non-face series
(0|0): Nfavg, Nfdev, Nstd.]
Mean Threshold
The mean threshold is the average of the values for all
images containing a face. This is done by summing the Haar
feature values obtained in the same position over the face in
all 313 images, and dividing by the number of images. This is
represented by the green line (Favg) in the graph above.
Threshold method #1 = (1/N) * sum(vi)
Where: N = number of images
i = a single image
vi = the value of a single Haar feature in the same position
on all images.
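Threshold method #1 can be sketched as follows (names are illustrative): average the feature's value, measured at the same position, across all face images.

```cpp
#include <vector>
#include <numeric>

// Mean threshold: the average of the Haar feature values measured at
// the same position over all N face images (integer division, as the
// feature values themselves are integer pixel sums).
int mean_threshold(const std::vector<int>& face_values)
{
    long long sum = std::accumulate(face_values.begin(), face_values.end(), 0LL);
    return static_cast<int>(sum / static_cast<long long>(face_values.size()));
}
```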
As stated above, the parity determines the direction of the
inequality sign. In this example the parity is set such that the
result is a 1 if the Haar feature value is >= the threshold.
With parity set accordingly, everything above or on the green
line (Favg) in the graph above will register a true output
from the weak classifier equation. Images that result in a Haar
value >= the green line (Favg) are classified as faces. If we
apply the same threshold to the non-face images (noise) we can
get an idea of how well the weak classifier works.
Figure 3 - Haar feature over 313 face images
The green shaded box represents values over Favg. This Haar
feature in this position will categorize values greater than
17516 as faces. 17516 is the average value for this Haar feature
in this position across all 313 images. Using Favg for the
threshold only detects 160 out of the 313 faces (about 51%).
This makes sense, since the threshold is the average value over
all 313 images.
If the same Haar feature is moved to a position in the image
where it is known there is no face, the following data is
generated.
[Figure 3 chart: Haar feature values (y-axis -15000 to 35000) over
313 face images at position 24|24, with Favg, Fdev, and Fstd.]
The Haar feature is located in the exact same position
within each image. Pixels under the red boxes are
subtracted from pixels in the green box. The graph
shows the results from 313 images with the same
feature in the same position. When the Haar feature is
thresholded it produces a value of 1 or 0. At this
point it becomes a weak classifier.
Figure 4 - Haar feature over 313 non-face images
The above graph shows the results of placing the Haar feature
over the same portion of background in each image. Since the
lfw images use random backgrounds, this results in generally
random data being measured by the feature in this position.
This is backed up by the average being close to 0. Notice some
background images are incorrectly classified as faces. This is
determined by the number of points greater than or equal to the
green line (Favg for all images, Haar feature over face).
There were false positives in 11 of the 313 images.
Mean threshold results:
          Face       Non-face
          160/313    11/313
          51%        3.5%
Faces were correctly identified in 160 images, and incorrectly
in 11 images. This was only testing 1 background position per
image.
[Figure 4 chart: Haar feature values (y-axis -15000 to 35000) over
313 non-face images at position 0|0, with Favg, Fdev, and Fstd.]
The Haar feature is located in the exact same position
within each image. Pixels under the red boxes are
subtracted from pixels in the green box. The graph
shows the results from 313 images with the same feature
in the same position. Notice there are many peaks over
the threshold (green line). These peaks represent false
positives: background that the weak classifier mistakenly
classifies as a face.
Problem with using mean threshold
As expected, using the mean value for threshold resulted in
about half of the face images being classified correctly as
faces. This should not be a surprise. The number of false
positives was low ( 11/313 ), but the so was the number of faces
classified correctly ( 160/313 ). The data derived from using
the mean value for threshold suggests that the mean value may be
more appropriate as a ceiling for the threshold.
On the other side of the spectrum, the floor for the
threshold would be the minimum of the values across the images.
This would guarantee that all the training images would be
correctly classified as faces. The tradeoff is a high number of
false positives.
As mentioned earlier in this paper, this implementation
uses 2 separate Haar features of different polarities. This
allows the test to always be the same (value >= threshold). To
achieve this goal, a Haar feature that primarily creates
negative values is disposed of (its Haar feature of opposite
polarity would create primarily positive results, and will be
kept). As a result, minimum values that are negative are
ignored; 0 is the lowest value the threshold can be.
Figure 5 - Threshold = Min ( 0 ) Haar on faces
[Figure 5 chart: Haar feature values (y-axis -15000 to 35000) over
313 face images at position 24|24, with Favg, Fdev, and Fstd,
thresholded at the minimum (0).]
Notice, many more faces are detected using the min (0) as the
threshold. The specific Haar feature, at the specific position
over the faces in the images described by the above graph,
yields 302/313 faces detected correctly. Classifiers are ranked
not only by the number of faces correctly classified, but by how
well they filter noise by NOT classifying noise as faces.
Figure 6 - Threshold = Min ( 0 ) Haar on background ( noise )
Using the minimum (from the faces, blue, graph) as a
threshold, the above graph illustrates values from the same Haar
feature over a portion of the background (representing noise).
There are MANY false positives in the above graph when the
threshold is 0. The exact number is 171 false positives.
Min threshold results:
          Face       Non-face
          301/313    171/313
          96%        55%
Although the minimum method detects more faces than the mean
threshold, it also creates significantly more false positives.
If the mean threshold represents the ceiling, the minimum
threshold represents the floor.
[Figure 6 chart: Haar feature values (y-axis -15000 to 35000) over
313 non-face images at position 0|0, with Favg, Fdev, and Fstd,
thresholded at the minimum (0).]
The ideal threshold is between the Min and the Average
Figure 7 - Positives - False Positives
The ideal threshold is windowed by the minimum at the
bottom, and the average at the top. In the graph above, the blue
line represents the number of faces correctly detected. The red
line indicates the number of faces incorrectly identified in
random non-face images (noise); it represents false positives.
The ideal threshold would detect 100% of the faces, with no
false positives. The green line represents the difference
between the blue and red lines (positives - false positives).
The peak of the green line provides the best face detection with
the least number of false positives. This is the closest to
ideal we can get.
The blue arrow points to the peak of the green line.
This is our desired threshold. To find this peak, a simple max
operation was applied to the difference between the blue and red
curves.
[Figure 7 chart: "Positives/False Positives - Average = 17102";
y-axis: Number of Images (0 to 600); series: Positives, False
Positives, Difference, average, min.]
Threshold = threshold[max( positives - false_positives )]
The peak difference threshold yields:
          Face       Non-face
          460/500    64/500
          92%        13%
As the results indicate, the threshold does not provide a
100% detection rate, but it does provide a low false positive
rate. This represents the best this one feature can do at this
size, polarity, and location.
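The peak-difference search can be sketched as a simple linear sweep between the floor (minimum) and the ceiling (mean). This is an illustrative implementation, not the paper's exact one; all names are made up for the sketch.

```cpp
#include <vector>
#include <algorithm>

// For each candidate threshold between floor_t and ceiling_t, count the
// face images detected (positives) and the non-face images detected
// (false positives), then keep the threshold that maximizes
// positives - false_positives.
int best_threshold(const std::vector<int>& face_values,
                   const std::vector<int>& nonface_values,
                   int floor_t, int ceiling_t, int step)
{
    int best_t = floor_t;
    long long best_diff = -1000000000LL;
    for (int t = floor_t; t <= ceiling_t; t += step) {
        long long positives = std::count_if(face_values.begin(), face_values.end(),
                                            [t](int v) { return v >= t; });
        long long false_pos = std::count_if(nonface_values.begin(), nonface_values.end(),
                                            [t](int v) { return v >= t; });
        if (positives - false_pos > best_diff) {
            best_diff = positives - false_pos;
            best_t = t;
        }
    }
    return best_t;
}
```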
Using Haar Features
After the threshold has been determined, the Haar feature
becomes a weak classifier, and provides a Boolean result when
placed over any portion of the image. The pixels within the
area of the Haar feature either sum to be greater than or equal
to the threshold, or less than the threshold.
hj(x) = 1 if fj(x) >= thetaj, otherwise 0
Where: hj(x) is a single weak classifier of a set size, location,
polarity, and type.
pj is equal to 1.
fj(x) is a Haar feature of a set size, location, polarity,
and type.
thetaj is the threshold calculated using the method
described above.
With 5 weak classifier types and 2 possible polarities,
there are a total of 10 possible weak classifier types. Each
feature can have a size that varies from 12 to 28 pixels in
steps of 4 pixels (for the implementation used in this paper).
5 sizes * 10 types yields 50 variations per possible position
within the detector. The detector size for this implementation
was chosen based on the size of the faces in the lfw database.
The detector size is 96 x 96 pixels. The number of possible
positions for a weak classifier within the detector varies
depending on the size and type of classifier. Using a step size
of 4 pixels, this implementation creates over twelve thousand
weak classifier combinations of type, size, and location.
//---------------------------------------------------------------------------
//
// 1) Create all possible weak classifiers within the detector
//
//---------------------------------------------------------------------------
haar_features^ haar;
feature_array->CreatedHaars = 0;
for( int haar_feature = 0; haar_feature < 10; haar_feature++ )  // 10 Haar types (5 shapes x 2 polarities)
{
    for( int haar_size = 12; haar_size < DETECTOR_SIZE; haar_size += 4 )  // feature sizes in steps of 4
    {
        for( int x = 0; x < DETECTOR_SIZE*2; x += 4 )  // DETECTOR_SIZE = 48, detector is 96x96
        {
            for( int y = 0; y < DETECTOR_SIZE*2; y += 4 )
            {
                haar = CreateHaar( haar_feature, haar_size );
                // Discard features that extend past the detector boundary
                if( (x + haar->full_width)  >= (DETECTOR_SIZE*2) ) continue;
                if( (y + haar->full_height) >= (DETECTOR_SIZE*2) ) continue;
                feature_array->haar_array->Add( haar );
                feature_array->CreatedHaars++;
            } // for y
        } // for x
    } // for haar_size
} // for haar_feature
The Detector
The location parameter for a weak classifier is relative to
the location of the detector. The detector is a box that is
scanned across the entire image. The detector consists of many
weak classifiers. The combination of weak classifiers creates a
strong classifier (the detector). For this implementation, the
detector starts with all possible weak classifiers in all
possible locations (14,454 weak classifiers in a 96 x 96 pixel
box). The weak classifiers are pruned as the implementation
learns which weak classifiers combine to create the best strong
classifier.
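Evaluating the detector at one position can be sketched as follows. All names are illustrative, and the combination rule is simplified to a plain count of the weak classifiers that fired (the real combination is the AdaBoost-weighted sum covered in the Strong Classifiers section).

```cpp
#include <vector>
#include <functional>

// A weak classifier placed at a position (dx, dy) relative to the
// detector's upper-left corner. classify_at returns 1/0 for an
// absolute image position.
struct WeakClassifier {
    int dx, dy;
    std::function<int(int, int)> classify_at;
};

// Evaluate every weak classifier at its detector-relative position and
// count how many fired.
int detector_votes(const std::vector<WeakClassifier>& weak, int det_x, int det_y)
{
    int votes = 0;
    for (const auto& w : weak)
        votes += w.classify_at(det_x + w.dx, det_y + w.dy);
    return votes;
}

// Tiny demo: one classifier that always fires plus one that fires only
// at absolute position (7, 8).
int demo_votes(int det_x, int det_y)
{
    std::vector<WeakClassifier> weak = {
        {0, 0, [](int, int) { return 1; }},
        {2, 3, [](int x, int y) { return (x == 7 && y == 8) ? 1 : 0; }}
    };
    return detector_votes(weak, det_x, det_y);
}
```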
The following graphs show a single weak classifier of size
16x16. The graphs show the number of face images the classifier
correctly classified, and the number of background images the
classifier incorrectly classified as a face (false positives).
There are 500 face images and 500 non-face images. The peak
value is the number of positive matches minus the number of
false positives. So a classifier that returns true for all face
and non-face images will yield a peak of 0 (500 - 500) and a
positive of 500.
The positions are all relative to the upper left corner of
the detector box (blue box in picture). The classifier is a 3-
area type, in a horizontal orientation. Each area is 16 pixels
by 16 pixels. The whole classifier is 48 pixels wide by 16
pixels high.
if( ( sum(side pixels) - 2*sum(center pixels) ) >= threshold )   // 1,-2,1 polarity
if( ( 2*sum(center pixels) - sum(side pixels) ) >= threshold )   // -1,2,-1 polarity
Each weak classifier is put in every location within the detector. This
implementation uses a +4 pixel increment in both the X and Y directions
when creating the Haars. The result is over 14 thousand Haars of
various types, sizes, and locations within the 96x96 pixel wide detector.
Obviously this is considerably more then needed, so the next step in the
algorithm is how to minimize the number of required weak classifiers to
create a detector with a specific Positive to false positive ratio.
[Bar graph: "Weak Classifiers -> size 16 all possible positions";
Number of Images (0 to 600) vs. location x|y (0|0 through 44|60),
with peak and pos bars for the -1,2,-1 and 1,-2,1 classifier types.
Each bar represents a unique weak classifier with its own threshold;
the data is an indicator of consistency across object images.]
[Bar graph: "Weak Classifiers -> size 16"; Number of Images (0 to
600) vs. location x|y (24|0 through 24|76), with peak and pos bars
for the -1,2,-1 and 1,-2,1 classifier types. Each bar represents a
unique weak classifier with its own threshold. Data is based on 500
face and 500 non-face images; a high peak value indicates a high
quality classifier for the particular position, since the peak is
the number of detections on face images (positives) minus the
number of detections on non-face images (false positives). A peak
of zero indicates the classifier incorrectly detected faces in all
500 non-face images. 1 of the 500 face images is shown alongside.]
The above bar graphs are best viewed in color. Each bar
represents a size 16 horizontal 3-area weak classifier in
different positions within the detector. The blue and red bars
are -1,2,-1 type classifiers; the green and purple bars are
1,-2,1 type classifiers.
Each bar represents the number of positive classifications the
classifier detected on faces (faces detected) out of a total
of 500 face images (red and purple). The bottom part of the
bar represents the number of positives (faces correctly
detected) minus the number of false positives (faces detected
in images containing no faces). The peak lines (so called
because this data was used to select a threshold, as stated
above) provide an indicator of the quality of the specific weak
classifier. Note, each bar represents a unique weak classifier
with its own threshold. After a threshold has been chosen, a
Haar feature becomes a weak classifier; the threshold does not
change. The purpose of these graphs is to indicate that certain
aspects of the image lend themselves to specific areas of the
object (in this case a face).
For example, the bar graph above represents a zoomed-in
portion of the full bar graph. The graph has been zoomed in to
detector locations x=24, y=0 to 76. This represents a stripe
down roughly the center of the face (as shown by the picture;
note it is upside down). The graph indicates that weak
classifiers of the type 1,-2,1 provide quality data in the
regions of the eyes. This is shown by the high signal to noise
ratio (true positives / false positives) of this type and size
of classifier in this region (24|24, 24|20). The weak
classifiers at location x=24 and y=24 or 20 have high peak
values, indicating that, at least in the training images, the
pixels in these areas are similar, and this particular
classifier does a good job of detecting them out of the noise.
Stated another way, the particular pattern that the -1,2,-1
classifier is designed to detect is consistently present in all
500 images at the locations 24,24 and 24,20. Also, the threshold
selected for the particular classifier does a good job of
differentiating the pattern at these locations versus general
noise in the image.
False positives, or noise
The purpose of any filter is to differentiate a signal from
noise. Classifiers are simply binary filters. Classifiers
provide a true/false indication of a specific pattern in within
data. The pattern is equivalent to signal, while all the data
not part of the pattern ( spurious data ) represents noise.
With a image, this can be demonstrated as the background being
noise or spurious data, and the object of interest being the
signal or desired pattern within the data.
This can be illustrated in the following images. The green
indicates the portion of the face we are trying to detect using
the -1,2,-1 size 16 weak classifier described above. The red
indicates the false positives that occur when that same
classifier is tested in other regions of the image. The light
box in the image represents the total portion of the image the
weak classifier was scanned over.
Although the weak classifier does a good job of detecting
the area between the eyes, it does produce a lot of false
positives (approximately 30% on average across all 500 images).
This is why single-Haar-based classifiers are referred to as
weak classifiers. A single Haar feature cannot define a complex
enough pattern to provide the signal to noise ratio needed for
practical object detection.
The problem with weak classifiers
(green is a positive detection, red is a false positive detection)
Strong Classifiers
Constantine P. Papageorgiou, Michael Oren, and Tomaso Poggio
suggested in their paper "A General Framework for Object
Detection" that weak classifiers based on Haar features can be
combined to create strong classifiers. The Viola-Jones paper
furthered the theory by grouping strong classifiers into a
cascade, and using the AdaBoost machine learning algorithm to
choose the best weak classifiers.
As shown above, a single Haar feature is a weak classifier.
To decrease the number of false positives created by a single
Haar feature, groups of Haar features are combined to create a
strong classifier. The Haar features are grouped in a detector.
A detector is a group of Haar features of certain types, sizes,
and locations. A detector can be any size, but 20 pixels by 20
pixels appears to be standard.
A 20 pixel by 20 pixel detector loaded with weak classifiers
The pictures below show the results of adding more weak
classifiers. As the number of weak classifiers used increases,
the number of false positives (red pixels) decreases.
Unfortunately, the more Haar features used, the slower the
process becomes. All 1231 Haar features within the detector need
to be calculated for every possible detector position in the
image, as the detector is scanned across the image.
1231 Haar features ( weak classifiers )
5357 Haar features ( weak classifiers )
There are 45,396 possible weak classifiers in a 24x24
detector. How do we determine the best weak classifiers to use
in a detector, to maximize the number of positive face
detections, while minimizing the number of false positives, and
minimizing the number of Haar features to provide practical
performance?
The answer is boosting. Boosting is a machine learning meta-
algorithm for performing supervised learning.
Boosting algorithms iteratively learn weak classifiers with
respect to a distribution and add them to a final strong
classifier. When they are added, they are typically weighted in
some way that is usually related to the weak learners' accuracy.
After a weak learner is added, the data is reweighted: examples
that are misclassified gain weight and examples that are
classified correctly lose weight (some boosting algorithms
actually decrease the weight of repeatedly misclassified
examples, e.g., boost by majority and BrownBoost). Thus, future
weak learners focus more on the examples that previous weak
learners misclassified. (http://en.wikipedia.org/wiki/Boosting).
AdaBoost
AdaBoost, short for Adaptive Boosting, is a machine learning algorithm, formulated by Yoav Freund and Robert
Schapire[1]. It is a meta-algorithm, and can be used in conjunction with many other learning algorithms to improve their
performance. AdaBoost is adaptive in the sense that subsequent classifiers built are tweaked in favor of those instances
misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. However in some problems it can be
less susceptible to the overfitting problem than most learning algorithms.
AdaBoost calls a weak classifier repeatedly in a series of rounds. For each call a distribution of
weights Dt is updated that indicates the importance of examples in the data set for the classification. On each round, the
weights of each incorrectly classified example are increased (or alternatively, the weights of each correctly classified
example are decreased), so that the new classifier focuses more on those examples.
(http://en.wikipedia.org/wiki/AdaBoost)
AdaBoost derivative used in the Viola-Jones method
AdaBoost creates a strong classifier by combining weak
classifiers. It’s a machine learning algorithm that creates a
‘recipe’ for a strong classifier. The ingredients are the weak
classifiers. AdaBoost calculates how much of the result of each
weak classifier should be added to the final mix.
Equation for the Strong Classifier created by AdaBoost
A graphical representation of the above equation
The result of the AdaBoost algorithm is the weights shown
above. How the AdaBoost learning algorithm calculates those
weights is described by the learning algorithm. The algorithm
description is confusing because the term weight is used to
describe multiple aspects of the algorithm. For this summary,
the term weight is reserved for the final values calculated by
the algorithm.
AdaBoost learns by testing data (faces) and noise (non-
faces) against the weak classifiers. The data and noise are
referred to as a distribution. The result of the test is used
to score the distribution. If the weak classifier incorrectly
classifies an object in the distribution, the score for that
object is increased. If the weak classifier correctly
classifies an object, the score for that object is decreased.
This process is repeated for all weak classifiers.
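One round of this scoring/reweighting process can be sketched using the standard AdaBoost update. This follows the textbook formulation (beta = err/(1-err), with correct examples down-weighted and the distribution renormalized), not the paper's exact notation; names are illustrative, and 0 < error < 1 is assumed.

```cpp
#include <vector>
#include <cmath>

// One AdaBoost round: score the weak classifier against the weighted
// examples, compute its vote weight (alpha) from the weighted error,
// shrink the weights of correctly classified examples, and renormalize
// so misclassified examples carry relatively more weight next round.
double adaboost_round(std::vector<double>& w,             // example weights (sum to 1)
                      const std::vector<int>& predicted,  // weak classifier output, 0/1
                      const std::vector<int>& label)      // ground truth, 0/1
{
    double err = 0.0;
    for (size_t i = 0; i < w.size(); ++i)
        if (predicted[i] != label[i]) err += w[i];

    double beta  = err / (1.0 - err);
    double alpha = std::log(1.0 / beta);   // this weak classifier's final weight

    double total = 0.0;
    for (size_t i = 0; i < w.size(); ++i) {
        if (predicted[i] == label[i]) w[i] *= beta;  // correct => lose weight
        total += w[i];
    }
    for (double& wi : w) wi /= total;      // renormalize to a distribution
    return alpha;
}

// Tiny demo: four equally weighted examples, the weak classifier gets
// three right. Returns the updated weight of the misclassified example.
double demo_misclassified_weight()
{
    std::vector<double> w = {0.25, 0.25, 0.25, 0.25};
    adaboost_round(w, {0, 1, 0, 1}, {1, 1, 0, 1});
    return w[0];
}
```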
This process is illustrated very well in the video here: http://www.authorstream.com/Presentation/aSGuest79199-727683-animated-adaboost-example/
Integral Image
A key technique used by the Viola-Jones method is the integral
image. This technique minimizes the number of summations
required to calculate the value of a single Haar feature.
A summed area table (also known as an integral image) is an
algorithm for quickly and efficiently generating the sum of
values in a rectangular subset of a grid
(http://en.wikipedia.org/wiki/Summed_area_table).
The summed area table is an accumulation of pixel values, starting
from the upper left and moving towards the lower right of an
image. The value at any point (x, y) in the table is the sum of
all the pixels above and to the left of that point.
The summed area table can be computed efficiently in a single
pass over the image, using the fact that the value in the summed
area table at (x, y) is just (http://en.wikipedia.org/wiki/Summed_area_table):
I(x, y) = i(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1)
where i(x, y) is the pixel value at (x, y) and I is the summed
area table.
The above equation is used to create the integral image from the
original monochrome picture. The resulting integral image is
then used to calculate Haar feature values using only 4 sums.
D = 4 – 2 – 3 + 1, where 1, 2, 3, and 4 are the integral image
values at the four corners of region D
Calculating the sum of a region using an integral image
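Both ideas can be sketched together: build the table in one pass
with the recurrence above, then sum any rectangle with four table
lookups. The helper names here are invented for the sketch:

```python
def integral_image(img):
    # ii[y][x] = sum of all pixels above and to the left of (x, y).
    # An extra zero row/column makes the border cases uniform.
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            # single-pass recurrence:
            # I(x, y) = i(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1)
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                               + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum of the w-by-h rectangle with top-left corner (x, y),
    # using only 4 lookups: D = 4 - 2 - 3 + 1.
    return (ii[y + h][x + w] - ii[y][x + w]
            - ii[y + h][x] + ii[y][x])

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Once the table exists, a Haar feature of any size costs the same
small, fixed number of operations, which is the whole point.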
Cascade
A classifier is just another word for filter. It sorts
random information, letting the desired information pass
through while throwing away the undesired information. As with
both electrical and mechanical filters, it can be more efficient
to perform the filtering in stages.
Viola-Jones uses a classifier cascade to increase the
computational efficiency of their method. Each stage in the
cascade has progressively more Haar features, requiring
progressively more computations. The first stage has the least
number of weak classifiers, requiring the least number of
computations. If the first stage identifies a face, it passes
the data (integral image) to the next stage. On the other hand,
if the first stage does not see a face, it rejects that portion
of the integral image and moves on to the next portion.
The cascade decreases the number of computations required
when scanning across an image, based on the premise that the
majority of the integral image does not contain a face. When a
face is detected, the number of calculations that occur is the
same as if a cascade were not used.
Figure: A classifier cascade. The integral image passes from
one stage to the next: the first stage has the least number of
weak classifiers, the second stage has more weak classifiers
than the first, and the nth stage has the maximum number. Each
stage requires more computations than the previous stage. A
window judged a possible face is passed on until the final stage
reports “Face Detected”; at any stage, a window with no face
found is rejected as noise.
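The early-rejection logic can be sketched in a few lines. Here
each stage is modeled as a list of (weak classifier, weight) pairs
plus a stage threshold; this structure and all the names are
assumptions for illustration, not the paper's exact form:

```python
def cascade_classify(window, stages):
    # stages: list of (classifiers, threshold) pairs, cheapest first.
    # classifiers: list of (h, alpha) where h maps a window to +1/-1
    # and alpha is its AdaBoost weight.
    for classifiers, threshold in stages:
        score = sum(alpha * h(window) for h, alpha in classifiers)
        if score < threshold:
            return False   # rejected early: stop computing, no face here
    return True            # every stage passed: face detected

# Toy cascade: stage 1 is a cheap test, stage 2 is stricter
stage1 = ([(lambda w: 1 if sum(w) > 0 else -1, 1.0)], 0.5)
stage2 = ([(lambda w: 1 if max(w) > 5 else -1, 0.7),
           (lambda w: 1 if min(w) > -5 else -1, 0.3)], 0.9)
print(cascade_classify([1, 6, 2], [stage1, stage2]))     # prints True
print(cascade_classify([-1, -2, -3], [stage1, stage2]))  # prints False
```

The `return False` in the loop is where the savings come from:
most windows never reach the expensive later stages.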
Conclusion
Viola-Jones described a method of fast object detection in
computer vision by combining four techniques: Haar features,
the integral image, AdaBoost, and the cascade. The methods
described in the Viola-Jones paper made it possible to perform
practical face detection with minimal computing power.
Haar features are a powerful tool for classifying images
such as faces. A Haar feature classifies a region of an image
based on the difference between the pixels of the two
symmetrical sides of the feature. This results in a form of
“edge detection”. Combining Haar features creates a strong
classifier, or detector.
Boosting is a machine learning method used to determine the
best combination of weak classifiers to optimize the resulting
strong classifier. AdaBoost is a derivative of boosting that
scores the result of a weak classifier when used against a
distribution. The score is used to calculate a weight that
multiplies the result of the weak classifier in the final
strong classifier. The final strong classifier that results
from the AdaBoost method is a sum of all the weights multiplied
by the results of the weak classifiers. Boosting is only done
during the learning phase of the Viola-Jones method. After the
weak classifiers have been selected and the weights calculated,
the resulting strong classifier can be used to start filtering
desired data out of noise.
After the strong classifiers ( or filters ) have been
learned, they are combined into a cascade. Cascade is just
another term for multiple stages. The goal of using multiple
strong classifiers organized into stages is to decrease the
total number of computations required by the filter. The first
stage is a prefilter, designed to catch the majority of noise (
non-faces ). It has the least number of weak classifiers ( Haar
features ), requiring the least number of computations. The
following stages progressively increase the number of weak
classifiers, but also filter out more noise than the previous
stage.
After the resulting strong classifiers have been learned
using AdaBoost and organized into a cascade, the cascade can be
used to start filtering out noise to find the desired data.
For face detection, the data is faces, and the noise is
anything that is not a face.
After the cascade has been learned, using it is easy.
Convert the image into an integral image. Slide the detector
full of Haar features over the integral image. For each
position, calculate all the Haar features required by the first
stage of the cascade, and multiply each of them by the
corresponding weight that resulted from the AdaBoost algorithm.
Threshold the result into true or false ( face under detector,
or face not under detector ). If the first stage returns true,
then use the integral image to calculate the Haar features in
the second stage of the cascade. This process is repeated until
a stage determines that no face is under the detector, or the
final stage determines that a face is under the detector.
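The walk-through above can be sketched end to end. This is a
minimal, unoptimized illustration (no image scaling, fixed window
size); `stages` is assumed to be a list of functions, each taking
the integral image and a window position and returning True if
that stage accepts the window:

```python
def detect_faces(img, stages, win=24, step=4):
    # Build the integral image once (padded with a zero row/column),
    # then reuse it for every window position and every stage.
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    hits = []
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            # a window is a face only if every cascade stage accepts
            # it; all() stops at the first stage that rejects it
            if all(stage(ii, x, y, win) for stage in stages):
                hits.append((x, y))
    return hits

# Toy stage: accept a window if its total pixel sum is large enough
bright = lambda ii, x, y, s: (ii[y + s][x + s] - ii[y][x + s]
                              - ii[y + s][x] + ii[y][x]) >= 500
img = [[1] * 24 for _ in range(24)]
print(detect_faces(img, [bright]))  # prints [(0, 0)]
```

A real detector would also rescan at multiple scales and merge
overlapping hits, both of which are omitted here for brevity.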