Viola Jones Simplified
By Eric Gregori
Introduction
Viola-Jones refers to a paper written by Paul Viola and
Michael Jones describing a method of fast, machine-vision-based
object detection. This method revolutionized the field of face
detection. Using this method, face detection could be
implemented in embedded devices and still detect faces within a
practical amount of time.
In the paper they describe an algorithm that uses a
modified version of the AdaBoost machine learning algorithm to
train a cascade of weak classifiers. Haar features (along with a
unique concept, the integral image) are used as the weak
classifiers. The weak classifiers are combined using the
AdaBoost algorithm to create a strong classifier. The strong
classifiers are in turn combined to create a cascade. The
cascade provides the mechanism to achieve high classification
rates at a low CPU-cycle cost.
“This paper describes a machine learning approach for visual
object detection which is capable of processing images
extremely rapidly and achieving high detection rates. This
work is distinguished by three key contributions. The first
is the introduction of a new image representation called the
“Integral Image” which allows the features used by our detector
to be computed very quickly. The second is a learning
algorithm, based on AdaBoost, which selects a small number
of critical visual features from a larger set and yields
extremely efficient classifiers[6]. The third contribution is
a method for combining increasingly more complex classifiers
in a “cascade” which allows background regions of the
image to be quickly discarded while spending more computation
on promising object-like regions.”
Excerpt from Rapid Object Detection using a Boosted Cascade of Simple Features
Haar Features
Haar features are one of the mechanisms used by the Viola-
Jones algorithm. Images are made up of many pixels. A 250x250
image contains 62,500 pixels. Processing images on a pixel-by-
pixel basis is a very CPU-cycle-intensive process. In addition,
an individual pixel contains no information about the pixels
around it. Pixel data is absolute, as opposed to relative. A
side effect of the absolute nature of pixel data is sensitivity
to lighting. Since pixel data is directly affected by the
lighting of the image, large variances can occur in the pixel
data due only to changes in lighting.
Haar features solve both problems related to pixel data
(the CPU cycles required and the relativity of the data). Haar
features do not encode individual pixel information; they encode
relative pixel information. Haar features provide relative
information on multiple pixels.
Haar features are based on Haar wavelets as proposed by:
S. Mallat. A theory for multiresolution signal decomposition:
The wavelet representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 11(7):674-693, July 1989.
Haar features were originally used in the paper:
A General Framework for Object Detection. Constantine P.
Papageorgiou, Michael Oren, Tomaso Poggio. Center for Biological
and Computational Learning, Artificial Intelligence Laboratory,
MIT, Cambridge, MA 02139. {cpapa, oren, tp}@ai.mit.edu
A Haar feature is used to encode both the relative data between
pixels, and the position of that data. A Haar feature consists
of multiple adjacent areas that are subtracted from each other.
Viola-Jones suggested Haar features containing 2, 3, and 4
areas.
Haar Features ( +/- Polarities )
Value = Pixels under Green – pixels under red
The value of a Haar feature is calculated by taking the sum of
the pixels under the green square and subtracting the sum of the
pixels under the red square.
value = sum(pixels under green) - sum(pixels under red)
By encoding the difference between two adjoining areas in an
image, the Haar feature is effectively detecting edges. The
further the value is from zero, the harder or more distinct the
edge. A value of zero indicates the two areas are equal; the
pixels under the areas have equal average intensities (the lack
of an edge). It should be noted that although this process can
be done on color images, for the Viola-Jones algorithm it is
done on grayscale images. In most cases individual pixel values
range from 0 to 255, with 0 being black and 255 being white.
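As a concrete illustration, the two-area value computation can be sketched in a few lines of C++. The function and variable names here are illustrative, not from the paper; each area is simply given as the list of grayscale pixel values it covers.

```cpp
#include <vector>
#include <numeric>

// Two-area Haar feature value: the sum of the pixels under the green
// area minus the sum of the pixels under the red area. Pixel values
// are grayscale intensities in the range 0..255. The area geometry
// (position/size) is omitted for clarity.
int haar2_value(const std::vector<int>& green_pixels,
                const std::vector<int>& red_pixels)
{
    int green_sum = std::accumulate(green_pixels.begin(), green_pixels.end(), 0);
    int red_sum   = std::accumulate(red_pixels.begin(),  red_pixels.end(),  0);
    return green_sum - red_sum;   // far from 0 => hard edge, 0 => no edge
}
```

With one pixel per area this reproduces the rows of the tables below, e.g. a green average of 125 against a red average of 250 yields -125.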
Average of pixels
under Green   under Red   Haar Value
125           250         -125
125           225         -100
125           200          -75
125           175          -50
125           150          -25
125           125            0
125           100           25
125            75           50
125            50           75
125            25          100
125             0          125

Average of pixels
under Green   under Red   Haar Value
250           250            0
250           225           25
250           200           50
250           175           75
250           150          100
250           125          125
250           100          150
250            75          175
250            50          200
250            25          225
250             0          250
The values calculated using Haar features require one additional
step before being used for object detection. The values must be
converted to true or false results. This is done using
thresholding.
Thresholding
Thresholding is the process of converting an analog value
into a true or false. In this case, the analog value is the
output of the Haar feature: value = sum(pixels under green) -
sum(pixels under red). To convert the analog value into a
true/false statement, the analog value is compared to a
threshold. If the value is >= the threshold, the statement is
true; if not, it is false.
hj(x) = 1 if pj*fj(x) >= pj*thetaj, otherwise 0
Where: hj(x) - Weak classifier (basically 1 Haar feature)
pj - Parity
fj(x) - Haar feature
thetaj - Threshold
As the equation above illustrates, the output of the weak
classifier is either true or false. The threshold determines
when the function transitions the output state. The parity
determines the direction of the inequality sign. This is
demonstrated later. The threshold and parity must be set
correctly to get the full benefit of the feature.
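The weak-classifier decision can be sketched as follows. Names are illustrative; the parity is represented as +1 or -1, and multiplying both sides by it flips the direction of the inequality.

```cpp
// Weak classifier decision: compare the Haar feature value against a
// threshold; the parity (+1 or -1) flips the inequality's direction.
// Returns 1 (true, pattern present) or 0 (false).
int weak_classify(int feature_value, int parity, int threshold)
{
    // parity = +1 : true when value >= threshold
    // parity = -1 : true when value <= threshold
    return (parity * feature_value >= parity * threshold) ? 1 : 0;
}
```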
Setting the threshold and parity is not clearly defined in
the Viola-Jones paper: "For each feature, the weak learner
determines the optimal threshold classification function, such
that the minimum number of examples are misclassified" (Viola-
Jones). Many theories have been proposed to calculate the
threshold: minimum, average, standard deviation, and average
deviation. The following diagram illustrates the results of
those theories. The data is based on one feature type, in a
single known position. The feature position and type were based
on data from the Viola-Jones paper.
Example
As the above text and illustrations show, Viola-Jones found that
a 3-area Haar feature across the bridge of the nose provided a
better than average probability of detecting a face.
The eyes are darker than the bridge of the nose.
value = 2*sum(pixels under green) - sum(pixels under red)
The value increases if the eye area gets darker, or the bridge
of the nose gets brighter. The feature shown above was placed
over the eyes of 313 face images. The values were calculated for
each image and graphed.
Figure 1 - Haar Feature sums over 313 face and 313 non-face images
Each point in the blue curve above represents a Haar
feature value (like the one above) when placed over the eyes
and nose of 313 different faces. The green line represents the
average of those values. The average value is positive, and
well above zero. This indicates that the Haar feature is over a
portion of the image that matches the feature's characteristics
(in this case, light in the middle and dark on the sides).
The red line indicates the exact same feature being placed
in the upper left corner of the image, as shown above. This
represents a non-face image, or random noise. A feature is
weighted on how well it distinguishes between random noise (non-
faces) and a portion of the face (eyes/nose in this case).
The center of the 3-area Haar feature is multiplied by 2 to
balance the 2 negative sides. So the formula for the 3-area Haar
is slightly different than the other Haar types: value =
2*sum(pixels under green) - sum(pixels under red). A 3-area Haar
feature over a random noise image results in a value of 0. As
you can see from the graph above, the average over 313 images is
about 0.
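A minimal sketch of the 3-area value computation, with the center weighted by 2 (names are illustrative; both side areas are passed in combined, and all areas are assumed equal in size):

```cpp
#include <vector>
#include <numeric>

// Three-area Haar feature value (e.g. eyes / bridge of the nose):
// the center area is weighted by 2 to balance the two negative side
// areas, so a flat, noise-like region sums to roughly 0.
int haar3_value(const std::vector<int>& center_pixels,
                const std::vector<int>& side_pixels)  // both sides combined
{
    int center_sum = std::accumulate(center_pixels.begin(), center_pixels.end(), 0);
    int side_sum   = std::accumulate(side_pixels.begin(),  side_pixels.end(),  0);
    return 2 * center_sum - side_sum;
}
```

A uniform patch gives 0, while a bright bridge between two dark eye regions gives a large positive value.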
[Figure 1 chart: Haar feature values (y-axis -60000 to 50000) over
313 images; series: Face, non-Face, Face Avg, Non-Face Avg.]
Parity
The parity variable is used to adjust the value so that it
is above 0. An inverted feature, value = sum(pixels under red) -
2*sum(pixels under green), would produce a value of the same
magnitude with a different sign. The parity is used to convert
the value of an inverted feature into a positive integer,
bringing it above the zero line.
The implementation described in this paper used a slightly
different approach. Instead of using a parity variable, 2 sets
of Haar features were used: inverted and non-inverted. During
the threshold learning process, Haar features were discarded if
their average value over the 313 images was negative. In
summary, both an inverted and a non-inverted Haar feature were
placed in the same location on the image. The Haar feature with
a negative average value over all images was discarded. This was
simpler from an implementation point of view.
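The discard rule described above can be sketched as follows (names are illustrative): for each location, the feature whose average response over all training images is negative is dropped, and its opposite-polarity twin is kept.

```cpp
#include <vector>
#include <numeric>

// Keep a Haar feature only if its average value over all training
// images is positive; its opposite-polarity twin (same magnitudes,
// opposite sign) is discarded. This avoids needing a parity variable.
bool keep_feature(const std::vector<int>& values_over_images)
{
    long long sum = std::accumulate(values_over_images.begin(),
                                    values_over_images.end(), 0LL);
    return sum > 0;   // positive average => keep this polarity
}
```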
Haar feature thresholding
Figure 2 - Various statistical values from data in figure 1
The graph above shows various methods of calculating a
threshold. All methods calculate a threshold based on measuring
the Haar feature values at the same image position in all 313
images.
The goal is to calculate a single threshold for all 313 images
that maximizes the number of faces detected and minimizes the
number of non-face detections (false positives).
[Figure 2 chart: Haar feature values (y-axis -60000 to 50000) over
313 images; face series (24|24): Favg, Fdev, Fstd; non-face series
(0|0): Nfavg, Nfdev, Nstd.]
Mean Threshold
The mean threshold is the average of the values for all
images containing a face. This is done by summing the Haar
feature values obtained in the same position over the face in
all 313 images, and dividing by the number of images. This is
represented by the green line (Favg) in the graph above.
Threshold method #1 = (1/N) * sum(vi)
Where: N = number of images
i = a single image
vi = the value of a single Haar feature in the same position
on all images.
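Threshold method #1 can be sketched as follows (names are illustrative): average the feature's value, measured at the same position, across all face images.

```cpp
#include <vector>
#include <numeric>

// Mean threshold: the average of the Haar feature values measured at
// the same position over all N face images (integer division, as the
// feature values themselves are integer pixel sums).
int mean_threshold(const std::vector<int>& face_values)
{
    long long sum = std::accumulate(face_values.begin(), face_values.end(), 0LL);
    return static_cast<int>(sum / static_cast<long long>(face_values.size()));
}
```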
As stated above, the parity determines the direction of the
inequality sign. In this example the parity is set such that the
result is a 1 if the Haar feature value is >= the threshold.
With parity set accordingly, everything above or on the green
line (Favg) in the graph above will register a true output
from the weak classifier equation. Images that result in a Haar
value >= the green line (Favg) are classified as faces. If we
apply the same threshold to the non-face images (noise) we can
get an idea of how well the weak classifier works.
Figure 3 - Haar feature over 313 face images
The green shaded box represents values over Favg. This Haar
feature in this position will categorize values greater than
17516 as faces. 17516 is the average value for this Haar feature
in this position across all 313 images. Using Favg for the
threshold only detects 160 out of the 313 faces (about 51%).
This makes sense, since the threshold is the average value over
all 313 images.
If the same Haar feature is moved to a position in the image
where it is known there is no face, the following data is
generated.
[Figure 3 chart: Haar feature values (y-axis -15000 to 35000) over
313 face images at position 24|24, with Favg, Fdev, and Fstd.]
The Haar feature is located in the exact same position
within each image. Pixels under the red boxes are
subtracted from pixels in the green box. The graph
shows the results from 313 images with the same
feature in the same position. When the Haar feature is
thresholded it produces a value of 1 or 0. At this
point it becomes a weak classifier.
Figure 4 - Haar feature over 313 non-face images
The above graph shows the results of placing the Haar feature
over the same portion of background in each image. Since the
lfw images use random backgrounds, this results in generally
random data being measured by the feature in this position.
This is backed up by the average being close to 0. Notice some
background images are incorrectly classified as faces. This is
determined by the number of points greater than or equal to the
green line (Favg for all images, Haar feature over face).
There were false positives in 11 of the 313 images.
Mean threshold results:
          Face       Non-face
          160/313    11/313
          51%        3.5%
Faces were correctly identified in 160 images, and incorrectly
in 11 images. This was only testing 1 background position per
image.
[Figure 4 chart: Haar feature values (y-axis -15000 to 35000) over
313 non-face images at position 0|0, with Favg, Fdev, and Fstd.]
The Haar feature is located in the exact same position
within each image. Pixels under the red boxes are
subtracted from pixels in the green box. The graph
shows the results from 313 images with the same feature
in the same position. Notice there are many peaks over
the threshold (green line). These peaks represent false
positives: background that the weak classifier mistakenly
classifies as a face.
Problem with using mean threshold
As expected, using the mean value for threshold resulted in
about half of the face images being classified correctly as
faces. This should not be a surprise. The number of false
positives was low ( 11/313 ), but the so was the number of faces
classified correctly ( 160/313 ). The data derived from using
the mean value for threshold suggests that the mean value may be
more appropriate as a ceiling for the threshold.
On the other side of the spectrum, the floor for the
threshold would be the minimum of the values across the images.
This would guarantee that all the training images would be
correctly classified as faces. The tradeoff is a high number of
false positives.
As mentioned earlier in this paper, this implementation
uses 2 separate Haar features of different polarities. This
allows the test to always be the same (value >= threshold). To
achieve this goal, a Haar feature that primarily creates
negative values is disposed of (its Haar feature of opposite
polarity would create primarily positive results, and will be
kept). As a result, minimum values that are negative are
ignored; 0 is the lowest value the threshold can be.
Figure 5 - Threshold = Min ( 0 ) Haar on faces
[Figure 5 chart: Haar feature values (y-axis -15000 to 35000) over
313 face images at position 24|24, with Favg, Fdev, and Fstd,
thresholded at the minimum (0).]
Notice, many more faces are detected using the min (0) as the
threshold. The specific Haar feature, at the specific position
over the faces in the images described by the above graph,
yields 302/313 faces detected correctly. Classifiers are ranked
not only by the number of faces correctly classified, but by how
well they filter noise by NOT classifying noise as faces.
Figure 6 - Threshold = Min ( 0 ) Haar on background ( noise )
Using the minimum (from the faces, blue, graph) as a
threshold, the above graph illustrates values from the same Haar
feature over a portion of the background (representing noise).
There are MANY false positives in the above graph when the
threshold is 0. The exact number is 171 false positives.
Min threshold results:
          Face       Non-face
          301/313    171/313
          96%        55%
Although the minimum method detects more faces than the mean
threshold, it also creates significantly more false positives.
If the mean threshold represents the ceiling, the minimum
threshold represents the floor.
[Figure 6 chart: Haar feature values (y-axis -15000 to 35000) over
313 non-face images at position 0|0, with Favg, Fdev, and Fstd,
thresholded at the minimum (0).]
The ideal threshold is between the Min and the Average
Figure 7 - Positives - False Positives
The ideal threshold is windowed by the minimum at the
bottom, and the average at the top. In the graph above, the blue
line represents the number of faces correctly detected. The red
line indicates the number of faces incorrectly identified in
random non-face images (noise); it represents false positives.
The ideal threshold would detect 100% of the faces, with no
false positives. The green line represents the difference
between the blue and red lines (positives - false positives).
The peak of the green line provides the best face detection with
the least number of false positives. This is the closest to
ideal we can get.
The blue arrow points to the peak of the green line.
This is our desired threshold. To find this peak, a simple max
operation was applied to the difference between the blue and red
curves.
[Figure 7 chart: "Positives/False Positives - Average = 17102";
y-axis: Number of Images (0 to 600); series: Positives, False
Positives, Difference, average, min.]
Threshold = threshold[max( positives - false_positives )]
The peak difference threshold yields:
          Face       Non-face
          460/500    64/500
          92%        13%
As the results indicate, the threshold does not provide a
100% detection rate, but it does provide a low false positive
rate. This represents the best this one feature can do at this
size, polarity, and location.
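The peak-difference search can be sketched as a simple linear sweep between the floor (minimum) and the ceiling (mean). This is an illustrative implementation, not the paper's exact one; all names are made up for the sketch.

```cpp
#include <vector>
#include <algorithm>

// For each candidate threshold between floor_t and ceiling_t, count the
// face images detected (positives) and the non-face images detected
// (false positives), then keep the threshold that maximizes
// positives - false_positives.
int best_threshold(const std::vector<int>& face_values,
                   const std::vector<int>& nonface_values,
                   int floor_t, int ceiling_t, int step)
{
    int best_t = floor_t;
    long long best_diff = -1000000000LL;
    for (int t = floor_t; t <= ceiling_t; t += step) {
        long long positives = std::count_if(face_values.begin(), face_values.end(),
                                            [t](int v) { return v >= t; });
        long long false_pos = std::count_if(nonface_values.begin(), nonface_values.end(),
                                            [t](int v) { return v >= t; });
        if (positives - false_pos > best_diff) {
            best_diff = positives - false_pos;
            best_t = t;
        }
    }
    return best_t;
}
```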
Using Haar Features
After the threshold has been determined, the Haar feature
becomes a weak classifier, and provides a Boolean result when
placed over any portion of the image. The pixels within the
area of the Haar feature either sum to be greater than or equal
to the threshold, or less than the threshold.
hj(x) = 1 if fj(x) >= thetaj, otherwise 0
Where: hj(x) is a single weak classifier of a set size, location,
polarity, and type.
pj is equal to 1.
fj(x) is a Haar feature of a set size, location, polarity,
and type.
thetaj is the threshold calculated using the method
described above.
With 5 weak classifier types and 2 possible polarities,
there are a total of 10 possible weak classifier types. Each
feature can have a size that varies from 12 to 28 pixels in
steps of 4 pixels (for the implementation used in this paper).
5 sizes * 10 types yields 50 variations per possible position
within the detector. The detector size for this implementation
was chosen based on the size of the faces in the lfw database.
The detector size is 96 x 96 pixels. The number of possible
positions for a weak classifier within the detector varies
depending on the size and type of classifier. Using a step size
of 4 pixels, this implementation creates over twelve thousand
weak classifier combinations of type, size, and location.
//---------------------------------------------------------------------------
//
// 1) Create all possible weak classifiers within the detector
//
//---------------------------------------------------------------------------
haar_features^ haar;
feature_array->CreatedHaars = 0;
for( int haar_feature = 0; haar_feature < 10; haar_feature++ )  // 10 Haar types (5 shapes x 2 polarities)
{
    for( int haar_size = 12; haar_size < DETECTOR_SIZE; haar_size += 4 )  // feature sizes in steps of 4
    {
        for( int x = 0; x < DETECTOR_SIZE*2; x += 4 )  // DETECTOR_SIZE = 48, detector is 96x96
        {
            for( int y = 0; y < DETECTOR_SIZE*2; y += 4 )
            {
                haar = CreateHaar( haar_feature, haar_size );
                // Discard features that extend past the detector boundary
                if( (x + haar->full_width)  >= (DETECTOR_SIZE*2) ) continue;
                if( (y + haar->full_height) >= (DETECTOR_SIZE*2) ) continue;
                feature_array->haar_array->Add( haar );
                feature_array->CreatedHaars++;
            } // for y
        } // for x
    } // for haar_size
} // for haar_feature
The Detector
The location parameter for a weak classifier is relative to
the location of the detector. The detector is a box that is
scanned across the entire image. The detector consists of many
weak classifiers. The combination of weak classifiers creates a
strong classifier (the detector). For this implementation, the
detector starts with all possible weak classifiers in all
possible locations (14,454 weak classifiers in a 96 x 96 pixel
box). The weak classifiers are pruned as the implementation
learns which weak classifiers combine to create the best strong
classifier.
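Evaluating the detector at one position can be sketched as follows. All names are illustrative, and the combination rule is simplified to a plain count of the weak classifiers that fired (the real combination is the AdaBoost-weighted sum covered in the Strong Classifiers section).

```cpp
#include <vector>
#include <functional>

// A weak classifier placed at a position (dx, dy) relative to the
// detector's upper-left corner. classify_at returns 1/0 for an
// absolute image position.
struct WeakClassifier {
    int dx, dy;
    std::function<int(int, int)> classify_at;
};

// Evaluate every weak classifier at its detector-relative position and
// count how many fired.
int detector_votes(const std::vector<WeakClassifier>& weak, int det_x, int det_y)
{
    int votes = 0;
    for (const auto& w : weak)
        votes += w.classify_at(det_x + w.dx, det_y + w.dy);
    return votes;
}

// Tiny demo: one classifier that always fires plus one that fires only
// at absolute position (7, 8).
int demo_votes(int det_x, int det_y)
{
    std::vector<WeakClassifier> weak = {
        {0, 0, [](int, int) { return 1; }},
        {2, 3, [](int x, int y) { return (x == 7 && y == 8) ? 1 : 0; }}
    };
    return detector_votes(weak, det_x, det_y);
}
```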
The following graphs show a single weak classifier of size
16x16. The graphs show the number of face images the classifier
correctly classified, and the number of background images the
classifier incorrectly classified as a face (false positives).
There are 500 face images and 500 non-face images. The peak
value is the number of positive matches minus the number of
false positives. So a classifier that returns true for all face
and non-face images will yield a peak of 0 (500 - 500) and a
positive of 500.
The positions are all relative to the upper left corner of
the detector box (blue box in picture). The classifier is a 3-
area type, in a horizontal orientation. Each area is 16 pixels
by 16 pixels. The whole classifier is 48 pixels wide by 16
pixels high.
if( ( sum(side pixels) - 2*sum(center pixels) ) >= threshold )   // 1,-2,1 polarity
if( ( 2*sum(center pixels) - sum(side pixels) ) >= threshold )   // -1,2,-1 polarity
Each weak classifier is put in every location within the detector. This
implementation uses a +4 pixel increment in both the X and Y directions
when creating the Haars. The result is over 14 thousand Haars of
various types, sizes, and locations within the 96x96 pixel wide detector.
Obviously this is considerably more then needed, so the next step in the
algorithm is how to minimize the number of required weak classifiers to
create a detector with a specific Positive to false positive ratio.
[Bar graph: "Weak Classifiers -> size 16 all possible positions";
Number of Images (0 to 600) vs. location x|y (0|0 through 44|60),
with peak and pos bars for the -1,2,-1 and 1,-2,1 classifier types.
Each bar represents a unique weak classifier with its own threshold;
the data is an indicator of consistency across object images.]
[Bar graph: "Weak Classifiers -> size 16"; Number of Images (0 to
600) vs. location x|y (24|0 through 24|76), with peak and pos bars
for the -1,2,-1 and 1,-2,1 classifier types. Each bar represents a
unique weak classifier with its own threshold. Data is based on 500
face and 500 non-face images; a high peak value indicates a high
quality classifier for the particular position, since the peak is
the number of detections on face images (positives) minus the
number of detections on non-face images (false positives). A peak
of zero indicates the classifier incorrectly detected faces in all
500 non-face images. 1 of the 500 face images is shown alongside.]
The above bar graphs are best viewed in color. Each bar
represents a size 16 horizontal 3-area weak classifier in
different positions within the detector. The blue and red bars
are -1,2,-1 type classifiers; the green and purple bars are
1,-2,1 type classifiers.
Each bar represents the number of positive classifications the
classifier detected on faces (faces detected) out of a total
of 500 face images (red and purple). The bottom part of the
bar represents the number of positives (faces correctly
detected) minus the number of false positives (faces detected
in images containing no faces). The peak lines (so called
because this data was used to select a threshold, as stated
above) provide an indicator of the quality of the specific weak
classifier. Note, each bar represents a unique weak classifier
with its own threshold. After a threshold has been chosen, a
Haar feature becomes a weak classifier; the threshold does not
change. The purpose of these graphs is to indicate that certain
aspects of the image lend themselves to specific areas of the
object (in this case a face).
For example, the bar graph above represents a zoomed-in
portion of the full bar graph. The graph has been zoomed in to
detector locations x=24, y=0 to 76. This represents a stripe
down roughly the center of the face (as shown by the picture;
note it is upside down). The graph indicates that weak
classifiers of the type 1,-2,1 provide quality data in the
regions of the eyes. This is shown by the high signal to noise
ratio (true positives / false positives) of this type and size
of classifier in this region (24|24, 24|20). The weak
classifiers at location x=24 and y=24 or 20 have high peak
values, indicating that, at least in the training images, the
pixels in these areas are similar, and this particular
classifier does a good job of detecting them out of the noise.
Stated another way, the particular pattern that the -1,2,-1
classifier is designed to detect is consistently present in all
500 images at the locations 24,24 and 24,20. Also, the threshold
selected for the particular classifier does a good job of
differentiating the pattern at these locations versus general
noise in the image.
False positives, or noise
The purpose of any filter is to differentiate a signal from
noise. Classifiers are simply binary filters. Classifiers
provide a true/false indication of a specific pattern in within
data. The pattern is equivalent to signal, while all the data
not part of the pattern ( spurious data ) represents noise.
With a image, this can be demonstrated as the background being
noise or spurious data, and the object of interest being the
signal or desired pattern within the data.
This can be illustrated in the following images. The green
indicates the portion of the face we are trying to detect using
the -1,2,-1 size 16 weak classifier described above. The red
indicates the false positives that occur when that same
classifier is tested in other regions of the image. The light
box in the image represents the total portion of the image the
weak classifier was scanned over.
Although the weak classifier does a good job of detecting
the area between the eyes, it does produce a lot of false
positives (approximately 30% on average across all 500 images).
This is why single-Haar-based classifiers are referred to as
weak classifiers. A single Haar feature cannot define a complex
enough pattern to provide the signal to noise ratio needed for
practical object detection.
The problem with weak classifiers
(green is a positive detection, red is a false positive detection)
Strong Classifiers
Constantine P. Papageorgiou, Michael Oren, and Tomaso Poggio
suggested in their paper "A General Framework for Object
Detection" that weak classifiers based on Haar features can be
combined to create strong classifiers. The Viola-Jones paper
furthered the theory by grouping strong classifiers into a
cascade, and using the AdaBoost machine learning algorithm to
choose the best weak classifiers.
As shown above, a single Haar feature is a weak classifier.
To decrease the number of false positives created by a single
Haar feature, groups of Haar features are combined to create a
strong classifier. The Haar features are grouped in a detector.
A detector is a group of Haar features of certain types, sizes,
and locations. A detector can be any size, but 20 pixels by 20
pixels appears to be standard.
A 20 pixel by 20 pixel detector loaded with weak classifiers
The pictures below show the results of adding more weak
classifiers. As the number of weak classifiers used increases,
the number of false positives (red pixels) decreases.
Unfortunately, the more Haar features used, the slower the
process becomes. All 1231 Haar features within the detector need
to be calculated for every possible detector position in the
image, as the detector is scanned across the image.
1231 Haar features ( weak classifiers )
5357 Haar features ( weak classifiers )
There are 45,396 possible weak classifiers in a 24x24
detector. How do we determine the best weak classifiers to use
in a detector, to maximize the number of positive face
detections, while minimizing the number of false positives, and
minimizing the number of Haar features to provide practical
performance?
The answer is boosting. Boosting is a machine learning meta-
algorithm for performing supervised learning.
Boosting algorithms iteratively learn weak classifiers with
respect to a distribution and add them to a final strong
classifier. When they are added, they are typically weighted in
some way that is usually related to the weak learners' accuracy.
After a weak learner is added, the data is reweighted: examples
that are misclassified gain weight and examples that are
classified correctly lose weight (some boosting algorithms
actually decrease the weight of repeatedly misclassified
examples, e.g., boost by majority and BrownBoost). Thus, future
weak learners focus more on the examples that previous weak
learners misclassified. (http://en.wikipedia.org/wiki/Boosting).
AdaBoost
AdaBoost, short for Adaptive Boosting, is a machine learning algorithm, formulated by Yoav Freund and Robert
Schapire[1]. It is a meta-algorithm, and can be used in conjunction with many other learning algorithms to improve their
performance. AdaBoost is adaptive in the sense that subsequent classifiers built are tweaked in favor of those instances
misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. However in some problems it can be
less susceptible to the overfitting problem than most learning algorithms.
AdaBoost calls a weak classifier repeatedly in a series of rounds. For each call a distribution of
weights Dt is updated that indicates the importance of examples in the data set for the classification. On each round, the
weights of each incorrectly classified example are increased (or alternatively, the weights of each correctly classified
example are decreased), so that the new classifier focuses more on those examples.
(http://en.wikipedia.org/wiki/AdaBoost)
AdaBoost derivative used in the Viola-Jones method
AdaBoost creates a strong classifier by combining weak
classifiers. It’s a machine learning algorithm that creates a
‘recipe’ for a strong classifier. The ingredients are the weak
classifiers. AdaBoost calculates how much of the result of each
weak classifier should be added to the final mix.
Equation for the Strong Classifier created by AdaBoost
A graphical representation of the above equation
The result of the AdaBoost algorithm is the weights shown
above. How the AdaBoost learning algorithm calculates those
weights is described by the learning algorithm. The algorithm
description is confusing because the term weight is used to
describe multiple aspects of the algorithm. For this summary,
the term weight is reserved for the final values calculated by
the algorithm.
AdaBoost learns by testing data (faces) and noise (non-
faces) against the weak classifiers. The data and noise are
referred to as a distribution. The result of the test is used
to score the distribution. If the weak classifier incorrectly
classifies an object in the distribution, the score for that
object is increased. If the weak classifier correctly
classifies an object, the score for that object is decreased.
This process is repeated for all weak classifiers.
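One round of this scoring/reweighting process can be sketched using the standard AdaBoost update. This follows the textbook formulation (beta = err/(1-err), with correct examples down-weighted and the distribution renormalized), not the paper's exact notation; names are illustrative, and 0 < error < 1 is assumed.

```cpp
#include <vector>
#include <cmath>

// One AdaBoost round: score the weak classifier against the weighted
// examples, compute its vote weight (alpha) from the weighted error,
// shrink the weights of correctly classified examples, and renormalize
// so misclassified examples carry relatively more weight next round.
double adaboost_round(std::vector<double>& w,             // example weights (sum to 1)
                      const std::vector<int>& predicted,  // weak classifier output, 0/1
                      const std::vector<int>& label)      // ground truth, 0/1
{
    double err = 0.0;
    for (size_t i = 0; i < w.size(); ++i)
        if (predicted[i] != label[i]) err += w[i];

    double beta  = err / (1.0 - err);
    double alpha = std::log(1.0 / beta);   // this weak classifier's final weight

    double total = 0.0;
    for (size_t i = 0; i < w.size(); ++i) {
        if (predicted[i] == label[i]) w[i] *= beta;  // correct => lose weight
        total += w[i];
    }
    for (double& wi : w) wi /= total;      // renormalize to a distribution
    return alpha;
}

// Tiny demo: four equally weighted examples, the weak classifier gets
// three right. Returns the updated weight of the misclassified example.
double demo_misclassified_weight()
{
    std::vector<double> w = {0.25, 0.25, 0.25, 0.25};
    adaboost_round(w, {0, 1, 0, 1}, {1, 1, 0, 1});
    return w[0];
}
```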
This process is illustrated very well in the video here: http://www.authorstream.com/Presentation/aSGuest79199-727683-animated-adaboost-example/
Integral Image
A key technique used by the Viola-Jones method is the integral
image. This technique minimizes the number of summations
required to calculate the value of a single Haar feature.
A summed area table (also known as an integral image) is an
algorithm for quickly and efficiently generating the sum of
values in a rectangular subset of a grid
(http://en.wikipedia.org/wiki/Summed_area_table).
The summed area table is an accumulation of pixel values, starting
from the upper left and moving towards the lower right of an
image. The value at any point (x, y) in the table is the sum of
all the pixels above and to the left of that point.
The summed area table can be computed efficiently in a single
pass over the image, using the fact that the value in the summed
area table at (x, y) is just (http://en.wikipedia.org/wiki/Summed_area_table):
I(x, y) = i(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1)
where i(x, y) is the pixel value at (x, y) and I is the summed
area table.
The above equation is used to create the integral image from the
original monochrome picture. The resulting integral image is
then used to calculate Haar feature values using only 4 sums.
D = 4 – 2 – 3 + 1, where 1, 2, 3, and 4 are the integral image
values at the four corners of region D
Calculating the sum of a region using an integral image
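Both ideas can be sketched together: build the table in one pass
with the recurrence above, then sum any rectangle with four table
lookups. The helper names here are invented for the sketch:

```python
def integral_image(img):
    # ii[y][x] = sum of all pixels above and to the left of (x, y).
    # An extra zero row/column makes the border cases uniform.
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            # single-pass recurrence:
            # I(x, y) = i(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1)
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                               + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum of the w-by-h rectangle with top-left corner (x, y),
    # using only 4 lookups: D = 4 - 2 - 3 + 1.
    return (ii[y + h][x + w] - ii[y][x + w]
            - ii[y + h][x] + ii[y][x])

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Once the table exists, a Haar feature of any size costs the same
small, fixed number of operations, which is the whole point.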
Cascade
A classifier is just another word for filter. It sorts
random information, letting the desired information pass
through while throwing away the undesired information. As with
both electrical and mechanical filters, it can be more efficient
to perform the filtering in stages.
Viola-Jones uses a classifier cascade to increase the
computational efficiency of their method. Each stage in the
cascade has progressively more Haar features, requiring
progressively more computations. The first stage has the least
number of weak classifiers, requiring the least number of
computations. If the first stage identifies a face, it passes
the data (integral image) to the next stage. On the other hand,
if the first stage does not see a face, it rejects that portion
of the integral image and moves on to the next portion.
The cascade decreases the number of computations required
when scanning across an image, based on the premise that the
majority of the integral image does not contain a face. When a
face is detected, the number of calculations that occur is the
same as if a cascade were not used.
Figure: A classifier cascade. The integral image passes from
one stage to the next: the first stage has the least number of
weak classifiers, the second stage has more weak classifiers
than the first, and the nth stage has the maximum number. Each
stage requires more computations than the previous stage. A
window judged a possible face is passed on until the final stage
reports “Face Detected”; at any stage, a window with no face
found is rejected as noise.
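The early-rejection logic can be sketched in a few lines. Here
each stage is modeled as a list of (weak classifier, weight) pairs
plus a stage threshold; this structure and all the names are
assumptions for illustration, not the paper's exact form:

```python
def cascade_classify(window, stages):
    # stages: list of (classifiers, threshold) pairs, cheapest first.
    # classifiers: list of (h, alpha) where h maps a window to +1/-1
    # and alpha is its AdaBoost weight.
    for classifiers, threshold in stages:
        score = sum(alpha * h(window) for h, alpha in classifiers)
        if score < threshold:
            return False   # rejected early: stop computing, no face here
    return True            # every stage passed: face detected

# Toy cascade: stage 1 is a cheap test, stage 2 is stricter
stage1 = ([(lambda w: 1 if sum(w) > 0 else -1, 1.0)], 0.5)
stage2 = ([(lambda w: 1 if max(w) > 5 else -1, 0.7),
           (lambda w: 1 if min(w) > -5 else -1, 0.3)], 0.9)
print(cascade_classify([1, 6, 2], [stage1, stage2]))     # prints True
print(cascade_classify([-1, -2, -3], [stage1, stage2]))  # prints False
```

The `return False` in the loop is where the savings come from:
most windows never reach the expensive later stages.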
Conclusion
Viola-Jones described a method of fast object detection in
computer vision by combining four techniques: Haar features,
the integral image, AdaBoost, and the cascade. The methods
described in the Viola-Jones paper made it possible to perform
practical face detection with minimal computing power.
Haar features are a powerful tool for classifying images
such as faces. A Haar feature classifies a region of an image
based on the difference between the pixels of the two
symmetrical sides of the feature. This results in a form of
“edge detection”. Combining Haar features creates a strong
classifier, or detector.
Boosting is a machine learning method used to determine the
best combination of weak classifiers to optimize the resulting
strong classifier. AdaBoost is a derivative of boosting that
scores the result of a weak classifier when used against a
distribution. The score is used to calculate a weight that
multiplies the result of the weak classifier in the final
strong classifier. The final strong classifier that results
from the AdaBoost method is a sum of all the weights multiplied
by the results of the weak classifiers. Boosting is only done
during the learning phase of the Viola-Jones method. After the
weak classifiers have been selected and the weights calculated,
the resulting strong classifier can be used to start filtering
desired data out of noise.
After the strong classifiers ( or filters ) have been
learned, they are combined into a cascade. Cascade is just
another term for multiple stages. The goal of using multiple
strong classifiers organized into stages is to decrease the
total number of computations required by the filter. The first
stage is a prefilter, designed to catch the majority of noise (
non-faces ). It has the least number of weak classifiers ( Haar
features ), requiring the least number of computations. The
following stages progressively increase the number of weak
classifiers, but also filter out more noise than the previous
stage.
After the resulting strong classifiers have been learned
using AdaBoost and organized into a cascade, the cascade can be
used to start filtering out noise to find the desired data.
For face detection, the data is faces, and the noise is
anything that is not a face.
After the cascade has been learned, using it is easy.
Convert the image into an integral image. Slide the detector
full of Haar features over the integral image. For each
position, calculate all the Haar features required by the first
stage of the cascade, and multiply each of them by the
corresponding weight that resulted from the AdaBoost algorithm.
Threshold the result into true or false ( face under detector,
or face not under detector ). If the first stage returns true,
then use the integral image to calculate the Haar features in
the second stage of the cascade. This process is repeated until
a stage determines that no face is under the detector, or the
final stage determines that a face is under the detector.
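The walk-through above can be sketched end to end. This is a
minimal, unoptimized illustration (no image scaling, fixed window
size); `stages` is assumed to be a list of functions, each taking
the integral image and a window position and returning True if
that stage accepts the window:

```python
def detect_faces(img, stages, win=24, step=4):
    # Build the integral image once (padded with a zero row/column),
    # then reuse it for every window position and every stage.
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    hits = []
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            # a window is a face only if every cascade stage accepts
            # it; all() stops at the first stage that rejects it
            if all(stage(ii, x, y, win) for stage in stages):
                hits.append((x, y))
    return hits

# Toy stage: accept a window if its total pixel sum is large enough
bright = lambda ii, x, y, s: (ii[y + s][x + s] - ii[y][x + s]
                              - ii[y + s][x] + ii[y][x]) >= 500
img = [[1] * 24 for _ in range(24)]
print(detect_faces(img, [bright]))  # prints [(0, 0)]
```

A real detector would also rescan at multiple scales and merge
overlapping hits, both of which are omitted here for brevity.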