
The Detection of Faces in Color Images: EE368 Project Report

Angela Chau, Ezinne Oji, Jeff Walters

Dept. of Electrical Engineering, Stanford University, Stanford, CA 94305

angichau,ezinne,[email protected]

Abstract

A face detection system is presented. The system has been optimized to detect faces in images created for the purpose of the EE368 class. The system uses a neural network trained to detect face-colored pixels to create regions of interest. Then, a combination of low-pass filters and clustering techniques is used to further refine these regions.

1 Introduction

The problem of face detection in images and videos has been widely researched in the field of computer vision and image processing. Not only is the detection of faces useful in applications such as image classification and retrieval, it can also be thought of as an essential intermediate step in areas such as face recognition and tracking. In general, due to the variety of scene parameters such as illumination, scaling, and camera characteristics, as well as the variability within the large set of objects humans consider to be faces, face detection becomes a complex classification problem in a high-dimensional space.

For our project, we attempted only to implement a system which would detect faces within the small set of images taken of the class and did not try to build a robust algorithm for finding faces in arbitrary images. Our aim was to gain an understanding of the problems associated with face detection, to implement and test some of the well-known methods in this area, and to experiment with some of our own ideas for solving the more specific detection problem we were given.

2 High-Level System Design

As mentioned in the previous section, prior knowledge of our test set of images significantly reduced the complexity of our detection problem. In our project, an image may contain any number of faces at any position, scale, and orientation, but with generally the same illumination and with the same camera distortions (if any) across the whole set. Illumination usually causes the appearance of skin to differ across images, and

Figure 1: High-level block diagram of the face detection system.

distortions from different cameras can warp the shape of a face differently. By eliminating these two aspects of the problem, properties such as skin color, the general shape and size of faces, as well as the overall layout of the image could be utilized.

A block diagram of the face detection system is presented in Figure 1. Each block in the process is described in detail in the following sections.

3 Region of Interest Creation

The first step is to downsample the image by a factor of 2. The full detail of the original image is not needed, and downsampling speeds up the remainder of the algorithm.

3.1 Color Detection

Regions of interest are created by examining the color of each pixel and assigning it a score indicating the likelihood that the pixel is skin-colored. Note that by “color”, we do not imply a value in RGB space.

Creating a detector for skin-colored pixels amounts to determining a pair of conditional probability density functions, namely p(skin | pixel color) and p(not-skin | pixel color).

The training images provided a means by which we were able to estimate these densities. The initial method tried was to use histograms to empirically estimate the densities. A pixel's color was viewed simply as a vector in R^3, so histogramming amounts to dividing the color space into boxes and counting the number of skin pixels that lie within each color box.
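As a rough sketch of this step (our system was implemented in Matlab; this NumPy version is illustrative only, and the bin count of 32 and the ratio-based score are our assumptions rather than values from the report):

    import numpy as np

    BINS = 32  # assumed; divides each color axis into 32 boxes

    def fit_histograms(skin_pixels, nonskin_pixels):
        # skin_pixels, nonskin_pixels: (N, 3) uint8 RGB training vectors
        edges = [np.linspace(0, 256, BINS + 1)] * 3
        h_skin, _ = np.histogramdd(skin_pixels, bins=edges, density=True)
        h_bg, _ = np.histogramdd(nonskin_pixels, bins=edges, density=True)
        return h_skin, h_bg

    def skin_score(pixels, h_skin, h_bg):
        # Posterior-style score under equal priors: values near 1 mean
        # the pixel's color box is dominated by skin training samples.
        idx = np.minimum(pixels.astype(int) * BINS // 256, BINS - 1)
        p_s = h_skin[idx[:, 0], idx[:, 1], idx[:, 2]]
        p_b = h_bg[idx[:, 0], idx[:, 1], idx[:, 2]]
        return p_s / (p_s + p_b + 1e-12)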

Although this method was found to be effective, it did not produce a strong contrast between the scores assigned to skin pixels and those assigned to non-skin pixels.

The second approach taken was to train a neural network on the pixel colors. One can think of the training of the network as an optimization process which seeks out the optimal decision boundary for classifying pixels in the desired color space. In our final system, three-layer networks were employed, meaning that the decision boundary within the color space may be almost arbitrarily shaped.

Two parameters needed to be chosen at this stage. First, a network topology needed to be chosen. As mentioned, a three-layer network with three input units was used due to the dimensionality of the color spaces. The networks that were tested all included bias units to the hidden and output layers. The network was trained to emit +1 on presentation of a skin pixel and −1 otherwise. The one remaining topology parameter was the number of hidden units. This parameter was determined by training networks with the number of hidden units ranging from three to eight. It was found that performance did not improve dramatically when the number of hidden units was greater than four, so four hidden units were chosen. The “performance” of the network was determined by cross-validation with a set of color pixels that were not used as part of training. We used the mean square error between the network output and the target output as a rough indicator of performance, and the extent to which the outputs for face and non-face inputs were separated as a secondary indicator. Figure 2 illustrates the topology of the networks used.

For training, we used a method known as Stochastic Backpropagation. Training via backpropagation involves determining the sensitivity of the network output to each weight in the network and adjusting each weight in accordance with how much error it caused in the output. A learning parameter, η, that controlled the amount by which each weight changed in response to each training-pattern/training-target pair was set to 0.1. After each training epoch, η was scaled to 80% of its previous value to realize a declining learning rate as the system 'matured'.

Figure 2: Neural network topology for color detection.

During Stochastic Backpropagation, a random subset of the training data was used at each epoch. At the end of each epoch, the rate of change of all of the weights was calculated, and learning ceased when that rate was less than a pre-determined value (2% was used). Approximately 70,000 training pixels were used, half being skin pixels and half not. At each epoch, the size of the training subset was 10,000. Typically, each network required 10 to 15 epochs to complete training.
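The following sketch pulls these training details together (Python rather than our Matlab code; the tanh activations and the small random weight initialization are assumptions, while the 3-4-1 topology, η = 0.1, the 80% decay, the 10,000-sample epochs, and the 2% stopping rule come from the text above):

    import numpy as np

    def train_net(X, t, eta=0.1, subset=10_000, tol=0.02, max_epochs=50):
        # X: (N, 3) color feature vectors (pre-whitened, as described
        # below); t: (N,) targets in {+1, -1}
        rng = np.random.default_rng(0)
        W1, b1 = rng.normal(0, 0.1, (4, 3)), np.zeros(4)
        W2, b2 = rng.normal(0, 0.1, (1, 4)), np.zeros(1)
        for _ in range(max_epochs):
            w_old = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])
            for i in rng.choice(len(X), subset, replace=False):
                h = np.tanh(W1 @ X[i] + b1)          # hidden layer (4 units)
                y = np.tanh(W2 @ h + b2)             # output in (-1, 1)
                d_out = (y - t[i]) * (1 - y ** 2)    # output sensitivity
                d_hid = (W2.T @ d_out) * (1 - h ** 2)
                W2 -= eta * np.outer(d_out, h); b2 -= eta * d_out
                W1 -= eta * np.outer(d_hid, X[i]); b1 -= eta * d_hid
            eta *= 0.8                               # declining learning rate
            w_new = np.concatenate([W1.ravel(), b1, W2.ravel(), b2])
            if np.linalg.norm(w_new - w_old) < tol * np.linalg.norm(w_old):
                break                                # weight change below ~2%
        return W1, b1, W2, b2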

The final important choice required for the neural network was the colorspace. Four colorspaces were investigated, namely RGB, XYZ, Lab, and HSV. Networks trained in RGB, Lab, and HSV were all able to clearly and uniformly identify face pixels. Observation showed that the network trained on pixels in the RGB colorspace performed better than networks trained on the other colorspaces. Specifically, the RGB network was able to reject certain pieces of clothing that the other colorspaces' networks classified as skin. In the final system, therefore, we used only the RGB network for color classification. The feature vectors presented to the network were not truly members of their respective colorspaces, as they were pre-whitened. Whitening seeks to create sets of feature vectors whose covariance matrix is the identity matrix, as this can aid pattern classification systems. A whitening transform for each colorspace was pre-calculated from the training images. See Figures 3, 4, and 5 for the outputs of the individual networks and Figure 6 for the combined output of the three networks.
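A minimal sketch of the pre-calculated whitening transform (the symmetric, ZCA-style square root shown here is an assumption; the report does not specify which form was used):

    import numpy as np

    def whitening_transform(X):
        # X: (N, 3) training color vectors in one colorspace. Returns a
        # mean mu and matrix W so that W @ (x - mu) has (approximately)
        # identity covariance over the training set.
        mu = X.mean(axis=0)
        cov = np.cov(X - mu, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)
        return mu, vecs @ np.diag(vals ** -0.5) @ vecs.T

    # every feature vector is whitened before reaching the network:
    # x_white = W @ (x - mu)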

4 Region of Interest Evaluation

4.1 Finding Ovals

While the RGB neural network was quite successful in finding pixels that were considered to be


Figure 3: Output of the RGB Neural Network.

Figure 4: Output of the Lab Neural Network.

Figure 5: Output of the HSV Neural Network.

Figure 6: Average of the Three Neural Network Outputs.

Figure 7: Oval Convolution Kernel.

skin-colored, it was expected that it would not be completely accurate. Looking at the results in the figures, one can see that arms, hands, and even parts of clothing showed up clearly in the output of the neural net. However, because the camera used to take the pictures was a relatively “normal” camera with no special lens, we were able to make the general assumption that faces in the images were roughly oval in shape, around 30 pixels along the major axis and 25 pixels along the minor axis. The question then became how to pick out these ovals from the results of the neural net.

A simple yet effective method tested was convolving the results of the color classification with an oval binary kernel. Figure 7 shows the convolution kernel used.

After convolving with the kernel described above, the neural net results for regions that were not oval-shaped became blurred, while the results for oval regions remained distinct. Figure 8 shows the result of the convolution for one of the training images. As we can see, the oval regions (most likely containing faces) clearly “pop out” from the rest of the image. To create a mask from the results of the convolution (see Figure 9), we used a threshold value (0.5 of the peak value) that was heuristically determined to give the best results on the set of training images.
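A sketch of this step (assumptions on our part: the major axis is vertical and the binary kernel is left unnormalized):

    import numpy as np
    from scipy.signal import fftconvolve

    def oval_kernel(a=15, b=12):
        # binary oval, ~30 px along the major axis and ~25 px along the
        # minor axis, matching the assumed face size at half resolution
        y, x = np.mgrid[-a:a + 1, -b:b + 1]
        return ((y / a) ** 2 + (x / b) ** 2 <= 1.0).astype(float)

    def oval_mask(net_output, rel_thresh=0.5):
        # low-pass filter the per-pixel skin scores with the oval, then
        # keep responses above 0.5 of the peak (Figures 8 and 9)
        blurred = fftconvolve(net_output, oval_kernel(), mode='same')
        return blurred >= rel_thresh * blurred.max()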

4.2 Framing

In the set of images under consideration, we noticed a common property in the spatial distribution of


Figure 8: Neural Net Results (blurred).

Figure 9: Mask from Blurred Neural Net Results.

faces in the images. Because the images were taken by a standing cameraperson of a group of standing people, the resulting images tended to have the cluster of faces located in the upper portion of the image. Using this heuristic, we implemented a simple framing technique based on the mean and standard deviation of the row positions of all identified skin pixels. Once this frame was calculated, skin pixels outside the frame were no longer considered to be faces. While the limits of the frame were calculated from the statistics of the pixel locations, the exact parameters used in calculating the limits were determined by optimizing over a small range of values. Figure 10 shows the frame used for one of the training images, and Figure 11 shows the result of framing on the mask from the previous section.
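A minimal sketch of the framing rule (the width k = 2 standard deviations is a placeholder; the actual limit parameters were found by the optimization mentioned above):

    import numpy as np

    def frame_rows(mask, k=2.0):
        # zero out mask rows outside mean ± k std of the skin-pixel rows
        rows = np.nonzero(mask)[0]
        lo = max(int(rows.mean() - k * rows.std()), 0)
        hi = min(int(rows.mean() + k * rows.std()), mask.shape[0] - 1)
        framed = np.zeros_like(mask)
        framed[lo:hi + 1] = mask[lo:hi + 1]
        return framed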

4.3 Further Region Splitting using k-Means

The candidate regions identified thus far sometimes contained more than one face. In many situations, these multiple-face regions took the form of two ovals connected by a small bridge.

To break up such multiple-face regions, we applied the k-Means clustering algorithm to each candidate region. k-Means is a well-known and simple algorithm that can be used to classify patterns as

Figure 10: Framed Mask.

Figure 11: Framed Results.

belonging to one of k groups. At each iteration, each pattern is assigned to the class whose mean is closest to it, as measured by some distance metric (in this case, Euclidean distance was used). After each iteration, new means for each class are calculated, and iterations cease when the means no longer move appreciably.

k-Means, as applied to this problem, used the (x, y) position, or equivalently (row, column), of each pixel in a candidate region as the feature vector. Clearly, the difficult question was how many sub-regions any given region should be broken into. Choosing k, the number of sub-regions, was accomplished by computing the Distance Transform for each pixel in the region in question. The Distance Transform of each pixel in the region mask is the Euclidean distance between that pixel and the nearest pixel not part of the region mask. Thus, for a pixel in the middle of an oval portion of the mask, the value of the Distance Transform is high, while for any pixel in the “bridge” between two ovals, the value of the transform is low.

Once the Distance Transform has been calculated, the subregions of the mask where the value of the transform is in the upper 3% are identified. k was chosen to be the number of these subregions, and the initial pattern means were initialized to the coordinates of the centers of these subregions.
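A sketch of the splitting procedure using SciPy equivalents (here “upper 3%” is read as within 3% of the maximum transform value, which is one possible interpretation):

    import numpy as np
    from scipy.cluster.vq import kmeans2
    from scipy.ndimage import distance_transform_edt, label

    def split_region(region_mask):
        # seeds: connected components at the top of the distance transform
        dist = distance_transform_edt(region_mask)
        seeds, k = label(dist >= 0.97 * dist.max())
        init = np.array([np.argwhere(seeds == i + 1).mean(axis=0)
                         for i in range(k)])
        # k-means on (row, col), initialized at the seed centers
        coords = np.argwhere(region_mask).astype(float)
        _, assignment = kmeans2(coords, init, minit='matrix')
        return coords, assignment, k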


Figure 12: Labelled Regions.

Figure 13: Labelled Regions (Post k-Means).

Figure 13 shows the result of the region-splitting algorithm on the set of regions shown in Figure 12.

4.4 Final Calculation of Face Coordinates

After k-means region splitting, we could be fairly confident that each separate region contained a face. The last step was to return a set of coordinates on the image such that each coordinate was a point on a face. Matlab provides a simple way to do this by returning the centroid of each region when given a mask containing labels (Figure 13). Note that because the original image was downsampled by a factor of 2, the (row, column) coordinates returned by Matlab needed to be multiplied by 2 to obtain the correct final set of coordinates.
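A SciPy stand-in for the Matlab centroid computation (illustrative only):

    from scipy.ndimage import center_of_mass, label

    def face_coordinates(mask):
        # centroid of each labelled region; the factor of 2 undoes the
        # initial downsampling of the image
        labelled, n = label(mask)
        centroids = center_of_mass(mask, labelled, range(1, n + 1))
        return [(2 * r, 2 * c) for r, c in centroids]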

5 System Evaluation

At the time of writing this report, results for the test image were not available. Tests run on the set of training images are, however, quite promising. In addition, the system run time averages under 10 seconds.

Table 1 shows the results of the implemented system on the full set of training images, and Figure 14 shows the final results on one of the images in the training set. Also presented in the table are results

Figure 14: Face Detection on Training Image 3 (Score = 22).

from experiments in which some of the system parameters mentioned in the previous sections were varied.

6 Failed Approaches

This section describes methods that were attempted but not used in the final implementation.

6.1 Adaptive Thresholding

The output of the neural network was used to create regions of interest. The key step was in creating a threshold which indicated whether or not a pixel should be considered further. From the training set, the average proportion of the image occupied by faces was calculated. If this proportion were 5%, for example, then choosing a threshold that passed 5% of the pixels would seem appropriate. Of course, some pixels in the image may in fact be skin without being part of a face (they could belong to a neck, hand, arm, or skin-colored clothing). Therefore, the threshold was set such that 50% more pixels than were expected for just the faces were let through. This threshold was calculated by examining the histogram of neural network outputs and choosing the bin value that left the target percentage in the right tail of the distribution. With the training images that were provided, it was found that an average of 5.6% of the pixels were face pixels, so the threshold was set such that (1.5)(5.6%) = 8.4% of the pixels made it through. In the final implementation, however, because the neural network results were blurred prior to thresholding, this method turned out to be less effective than a hard threshold.
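A one-line sketch of this percentile rule (NumPy's quantile function stands in for the histogram-bin search described above):

    import numpy as np

    def adaptive_threshold(net_output, face_fraction=0.056, slack=1.5):
        # pass the top 1.5 x 5.6% = 8.4% of pixel scores
        return np.quantile(net_output, 1.0 - slack * face_fraction)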

6.2 Eigenfaces

Our initial approach after finding the regions of interest from the neural network and labelling them was to eliminate potential false-positive detections by


Color Threshold    0.5            0.5            0.5            0.55           0.45
k-Means Cutoff     0.94           0.96           0.98           0.96           0.96
Image              Miss Repeat FP Miss Repeat FP Miss Repeat FP Miss Repeat FP Miss Repeat FP
Image 1            1    0      0  2    0      0  2    0      1  2    0      1  2    0      1
Image 2            2    3      2  2    3      3  2    1      3  2    7      0  2    1      4
Image 3            1    0      1  1    0      1  2    1      1  1    0      1  3    0      1
Image 4            3    0      0  3    0      0  3    1      0  3    0      0  2    0      0
Image 5            2    0      1  1    1      1  1    0      1  1    0      1  3    0      1
Image 6            2    1      0  2    0      0  2    0      0  2    0      0  1    0      0
Image 7            3    1      0  3    1      0  3    1      0  1    0      2  5    0      6

Table 1: Face Detection Results on Training Images. Results from the final system implementation are shown in the second column. The other columns show the results of face detection when the parameters for color-based skin classification and k-means-based region splitting are varied.

projecting the regions of interest onto an eigenspace to see how much of the variance of each image block could be explained by the first few eigenfaces, which would indicate that the block was indeed a face. The goal was to find the mean square error (MSE) between the actual region of interest and its reconstruction from the first few eigenfaces, and to use that error to distinguish true faces from regions that were not actual faces.

Eigenfaces were created from the ORL database of faces, and the eigenfaces that made up 90% of the energy were selected to provide the projection space. Unfortunately, no distinction between true faces and non-faces was observed after projection, so the error could not be used as a classifier. Our initial conclusion was that the similarity between the faces in the class picture and the faces in the database was not adequate; faces in the class picture overlap one another, unlike the 'mug shots' in the database. To work around this perceived problem, we created eigenfaces from the faces in the class picture, but the issue of overlap was encountered again, along with the difference in size between the faces of individuals in the first row of the picture and those behind.
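A sketch of the intended test (U and mean_face are assumed to come from the ORL training step; the projection and MSE logic is standard eigenface machinery rather than our exact code):

    import numpy as np

    def eigenface_mse(block, mean_face, U):
        # U: one eigenface per column, covering ~90% of the training-set
        # energy; block is a candidate region resampled to the eigenface
        # resolution. A low MSE would have indicated a face.
        x = block.ravel().astype(float) - mean_face
        x_hat = U @ (U.T @ x)            # projection onto the face space
        return float(np.mean((x - x_hat) ** 2))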

6.3 Template Matching

Human faces can be effectively detected using template matching, but in order for a template-matching test to be meaningful, the template must be the right size and be correctly oriented. This was the first problem we encountered when attempting to perform template matching on the images. Using the set of training images, roughly 50 faces were cut out, scaled to the same size, and aligned as well as possible. An average face was then created from this set of cut-outs. Cross-correlation of this average face with each region of interest was performed to test whether the region contained a face and, if so, to determine the location of the face. In our experiments, however, a reliable threshold to distinguish a region containing a face from one not containing a face was not found. Our belief is that the average face used was not a “good enough” average face for template matching, i.e., the face images from which it was constructed may have had slight shifts and rotations.
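A crude sketch of the correlation test (the global zero-mean, unit-variance normalization here is a simplification of true normalized cross-correlation, and 'valid' mode assumes the region is at least as large as the template):

    import numpy as np
    from scipy.signal import correlate2d

    def template_peak(region, avg_face):
        # peak response of the region against the average-face template
        t = (avg_face - avg_face.mean()) / avg_face.std()
        r = (region - region.mean()) / region.std()
        return correlate2d(r, t, mode='valid').max() / t.size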

6.4 Morphological Operations

The initial regions were determined by finding all “islands” of non-adjacent pixel groups that exceeded a certain threshold during color segmentation. Another method attempted for cleaning up the set of regions was to perform a variety of morphological operations on the results of the neural network. While operations such as erosion, the majority operator, and small-region removal did improve and clean up the results of the neural network in some of the images, the improvement was not consistently observed across all the images and was also remarkably sensitive to the different scales between the images.
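For reference, a sketch of two of these operations in SciPy (the majority operator, Matlab's bwmorph 'majority', has no direct SciPy equivalent, and min_size is an arbitrary choice):

    import numpy as np
    from scipy import ndimage

    def clean_mask(mask, min_size=50):
        # erosion, then removal of components smaller than min_size pixels
        eroded = ndimage.binary_erosion(mask)
        labelled, n = ndimage.label(eroded)
        sizes = ndimage.sum(eroded, labelled, range(1, n + 1))
        return np.isin(labelled, 1 + np.flatnonzero(sizes >= min_size))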

7 Conclusions

To our surprise, the simplest face detection scheme was the most robust and accurate for this application. Some of the alternative methods involved many parameters that needed to be tuned; finding the optimal set of these parameters would have been difficult given the time constraints and the size of the training set.

Some of the methods mentioned in the previous section that looked promising were abandoned. Further improvements to the implemented system should consider re-investigating methods such as template matching and eigenfaces. In particular, template matching with a better average face should be tested, and the detection of female faces using eigenfaces should be investigated.

8 Helpful References

The papers by Gomez [2] and Jones [4] gave us inspiration regarding the color problem. The classic eigenfaces paper [3] was used, as well as the book Pattern Classification [1].

References

[1] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, Inc., 2001.

[2] G. Gomez. On selecting colour components for skin detection. In ICPR 2002, pages II: 961–964, 2002.

[3] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, Volume 3, Number 1, pages 71–86, 1991.

[4] M. J. Jones and J. M. Rehg. Statistical color models with application to skin detection. International Journal of Computer Vision, Volume 46, Number 1, pages 81–96, 2002.