
A

SYNOPSIS OF THE DISSERTATION

PROPOSED FOR THE

M.TECH. DEGREE OF THE

JAGAN NATH UNIVERSITY, JAIPUR

FACULTY: ELECTRONICS AND COMMUNICATION

TOPIC: FACE DETECTION USING NEURAL NETWORK IN MATLAB

CANDIDATE: LALITA GURJAR

DATE: 25-09-2013

CONTENTS –


1. Introduction

2. System overview

3. Implementation method

4. Result

5. Reference

INTRODUCTION –


The goal of my thesis is to show that the face detection problem can be solved efficiently

and accurately using a view-based approach implemented with artificial neural networks.

Specifically, I will demonstrate how to detect upright, tilted, and non-frontal faces in

cluttered grayscale images, using multiple neural networks whose outputs are arbitrated

to give the final output. Object detection is an important and fundamental problem in

computer vision, and there have been many attempts to address it. The techniques which

have been applied can be broadly classified into one of two approaches: matching two-

or three-dimensional geometric models to images [Seutens et al., 1992, Chin and Dyer,

1986, Besl and Jain, 1985], or matching view-specific image-based models to images.

Previous work has shown that view-based methods can effectively detect upright frontal

faces and eyes in cluttered backgrounds [Sung, 1996, Vaillant et al., 1994, Burel and

Carel, 1994]. This thesis implements the view-based approach to object detection using neural

networks, and evaluates this approach in the face detection domain.

Representation of a face detection system

In developing a view-based object detector that uses machine learning, three main

subproblems arise. First, images of objects such as faces vary considerably, depending

on lighting, occlusion, pose, facial expression, and identity. The detection algorithm

should explicitly deal with as many of these sources of variation as possible, leaving

little unmodelled variation to be learned. Second, one or more neural-networks must be

trained to deal with all remaining variation in distinguishing objects from non-objects.

Third, the outputs from multiple detectors must be combined into a single decision about

the presence of an object. The automatic recognition of human faces presents a

significant challenge to the pattern recognition research community; human faces are

very similar in structure with minor differences from person to person. They are actually

within one class of “human face”. Furthermore, changes in lighting conditions, facial expressions, and pose further complicate face recognition, making it one of the difficult problems in pattern analysis. Prior work proposed the novel concept that “faces can be


recognized using line edge detection”. A face pre-filtering technique is proposed to speed up the searching process. It is an encouraging finding that the proposed face recognition technique has performed better than most existing approaches in comparison

experiments. This describes a face detection framework that is capable of processing

images extremely rapidly while achieving high detection rates. As continual research is

being conducted in the area of computer vision, one of the most practical applications

under vigorous development is in the construction of a robust real-time face detection

system. Successfully constructing a real-time face detection system not only implies a

system capable of analyzing video streams, but also naturally leads to solutions for the problems of extremely constrained testing environments. Analyzing a video sequence

is the current challenge since faces are constantly in dynamic motion, presenting many

different possible rotational and illumination conditions. While solutions to the task of

face detection have been presented, detection performances of many systems are heavily

dependent upon a strictly constrained environment. The problem of detecting faces under

gross variations remains largely uncovered. This paper gives a face detection system

which uses an image based neural network to detect face images. Face to Face

communication is a real-time process operating at a short time scale. The level of uncertainty at this time scale is considerable, making it necessary for humans and machines to rely on sensory-rich perceptual primitives rather than slow symbolic inference processes. Because of real-time bandwidth and environmental constraints, video processing has to deal with much lower resolution and image quality than photograph processing. Video images can be easily acquired and can capture the motion of a person, which makes it possible to track people until they are in a position convenient for recognition. The face

is the most distinctive and widely used key to a person’s identity. The area of face

detection has attracted considerable attention in the advancement of human-machine

interaction as it provides a natural and efficient way to communicate between humans

and machines. The problem of detecting the faces and facial parts in image sequences

has become a popular area of research due to emerging applications in intelligent human-

computer interface, surveillance systems, content-based image retrieval, video

conferencing, financial transactions, forensic applications, pedestrian detection, image database management systems, and so on. Face detection is essentially localising

and extracting a face region from the background. This may seem like an easy task but

the human face is a dynamic object and has a high degree of variability in its appearance,

which makes face detection a difficult problem in computer vision.


Overview of the Matlab Environment – The name MATLAB stands for matrix laboratory, originally written to provide easy access to matrix software developed by the

LINPACK and EISPACK projects. Today, MATLAB engines incorporate the LAPACK and

BLAS libraries, embedding the state of the art in software for matrix

computation. MATLAB is an interactive, matrix-based system for scientific and engineering

numeric computation and visualization. Its basic data element is an array that does not require

dimensioning. It is used to solve many technical computing problems, especially those with

matrix and vector formulation, in a fraction of the time it would take to write a program in a

scalar non-interactive language such as C.

Matlab Environment

The software section is completely based on MATLAB. In our interface we have used

MATLAB for face recognition. We have used it in such a way that it matches the face from

the predefined database and generates an event. This event is used to control the device by giving the controller an input to control the output, and thus controls the door. While

some may regard face detection as simple pre-processing for the face recognition system, it is

by far the most important process in a face detection and recognition system. However face

recognition is not the only possible application of a fully automated face detection system.

There are applications in automated colour film development where information about the

exact face location is useful for determining exposure and colour levels during film

development. There are even uses in face tracking for automated camera control in the film and

television news industries.

In this project the author will attempt to detect faces in still images by using image

invariants. To do this it would be useful to study the grey-scale intensity distribution of

an average human face. The following 'average human face' was constructed from a

sample of 30 frontal view human faces, of which 12 were from females and 18 from

males. A suitably scaled colormap has been used to highlight grey-scale intensity


differences. The grey-scale differences, which are invariant across all the sample faces

are strikingly apparent. The eye-eyebrow area seems always to contain dark (low-intensity) grey levels, while the nose, forehead, and cheeks contain bright (high-intensity) grey levels. After a great deal of experimentation, the researcher found that the following

areas of the human face were suitable for a face detection system based on image

invariants and a deformable template.

Most face detection systems attempt to extract a fraction of the whole face,

thereby eliminating most of the background and other areas of an individual's head such

as hair that are not necessary for the face recognition task. With static images, this is

often done by running a 'window' across the image. The face detection system then

judges if a face is present inside the window (Brunelli and Poggio, 1993). Unfortunately,

with static images there is a very large search space of possible locations of a face in an

image. Faces may be large or small and positioned anywhere from the upper left to the

lower right of the image. Most face detection systems use an example based learning

approach to decide whether or not a face is present in the window at that given instant

(Sung and Poggio,1994 and Sung,1995). A neural network or some other classifier is

trained using supervised learning with 'face' and 'non-face' examples, thereby enabling it

to classify an image (a window, in a face detection system) as a 'face' or 'non-face'. Unfortunately, while it is relatively easy to find face examples, how would one find a representative sample of images which represent non-faces (Rowley et al., 1996)?

Therefore, face detection systems using example based learning need thousands of 'face'

and 'non-face' images for effective training. Rowley, Baluja, and Kanade (Rowley et

al.,1996) used 1025 face images and 8000 non-face images (generated from 146,212,178

sub-images) for their training set! There is another technique for determining whether

there is a face inside the face detection system's window - using Template Matching. The

difference between a fixed target pattern (face) and the window is computed and

thresholded. If the window contains a pattern which is close to the target pattern(face)

then the window is judged as containing a face.
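The template-matching test described here can be sketched as follows (in Python for illustration, though the thesis tooling is MATLAB; the mean-absolute-difference measure and the threshold value are illustrative assumptions, not the thesis's exact choices):

```python
def is_face_window(window, template, threshold=10.0):
    """Judge a window as a face if its mean absolute difference from a
    fixed face template falls below a threshold (hypothetical value)."""
    diffs = [abs(w - t)
             for wrow, trow in zip(window, template)
             for w, t in zip(wrow, trow)]
    return sum(diffs) / len(diffs) < threshold
```

A window identical to the template gives a difference of zero and is accepted; a uniformly brighter window is rejected.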

Challenges in Face Detection


Object detection is the problem of determining whether or not a sub-window of an image

belongs to the set of images of an object of interest. Thus, anything that increases the

complexity of the decision boundary for the set of images of the object will increase the

difficulty of the problem, and possibly increase the number of errors the detector will

make. Suppose we want to detect faces that are tilted in the image plane, in addition to

upright faces. Adding tilted faces into the set of images we want to detect increases the

set’s variability, and may increase the complexity of the boundary of the set. Such

complexity makes the detection problem harder. Note that it is possible that adding new

images to the set of images of the object will make the decision boundary become

simpler and easier to learn. One way to imagine this happening is that the decision

boundary is smoothed by adding more images into the set. However, the conservative

assumption is that increasing the variability of the set will make the decision boundary

more complex, and thus make the detection problem harder. There are many sources of

variability in the object detection problem, and specifically in the problem of face

detection. These sources are outlined below.

Variation in the Image Plane: The simplest type of variability of images of a face can

be ex- pressed independently of the face itself, by rotating, translating, scaling, and

mirroring its image. Also included in this category are changes in the overall brightness

and contrast of the image, and occlusion by other objects.

Pose Variation: Some aspects of the pose of a face are included in image plane

variations, such as rotation and translation. Rotations of the face that are not in the image

plane can have a larger impact on its appearance. Another source of variation is the distance of the face from the camera.

Lighting and Texture Variation: Up to now, I have described variations due to the

position and orientation of the object with respect to the camera. Now we come to

variation caused by the object and its environment, specifically the object’s surface

properties and the light sources. Changes in the light source in particular can radically

change a face’s appearance.

System overview


The face detection system is designed as shown in Fig. 1.

1 Skin color filter (Preprocessing) - The first step in preprocessing a color image consists of

passing it through a skin color filter that detects the skin pixels. This is used to discard many

of the pixels in the case of color images, thus reducing the amount of comparisons between

the window and the image.

2 Filtering the image - This part consists of continuously applying a mask of 20*20 pixels to

the preprocessed image. The mask is to some degree invariant to rotation and scale.

3 Multilayer Perceptron (MLP) - The prenetwork is a single multilayer perceptron (MLP).

This is a neural network with input and hidden layers and a single output neuron (which is

responsible for outputting either a face or a nonface). The prenetwork is trained using back-

propagation. This filter eliminates many of the pixels to be considered in the comparison and

is applied directly to grayscale images. For color images, the output of the skin filter is fed to

the MLP.

4 Detection - The output of the neural network varies between 1 and -1 according to

whether a face has been detected or not, respectively.

Implementation methods


1 Skin color filter

Detection of skin color in color images is a very popular and useful technique for face

detection. Many techniques have been reported for locating skin color regions in the input image.

While the input color image is typically in the RGB format, these techniques usually use

color components in other color spaces, such as the HSV format. That is because RGB components are sensitive to lighting conditions, so face detection may fail if the

lighting condition changes.

The basic algorithm used for face detection

The first step in preprocessing a color image consists of passing it through a skin color filter

that detects the skin pixels. This is used to discard many of the pixels in the case of color

images, thus reducing the amount of comparisons between the window and the image. The

first step in designing a skin color filter consists of changing the image from RGB to HSV,

where H stands for Hue, S for Saturation and V for value. This reduces the effect of

illumination. H, S, and V are continuous values varying between 0 and 1. Quantization is

applied to both H and S to get discrete values; ten quantization levels are used for each of these two parameters. Then, a color histogram is formed of H, S, and the

pixel value. Thus, pairs of H and S are formed and for each of them, the corresponding

number of pixels is determined. This allows us to get the first condition for a pixel to be a

skin pixel. In fact, the color histogram (H, S) is compared to a skin threshold (determined

empirically). If it is greater, the pixel can be classified as a skin pixel if it satisfies the second

condition (discussed later). Otherwise, the pixel is rejected as a non-skin pixel. The

second condition consists of comparing the edge at each pixel with an edge threshold

(determined empirically as well). The edge value is obtained by computing the gradient

image using the Sobel operator. This is useful for detecting edges in the image. If the computed


edge at the pixel is less than the threshold, and the first condition has been satisfied, the pixel

is classified as a skin pixel (and set to white). Otherwise, it is set to black. The algorithm of

the skin color filter can be summarized as follows:

1. Transform the image from RGB to HSV.

2. Compute the HSV values for each pixel and the color histogram (H, S).

3. Compute the gradient of the RGB image using the Sobel operator.

4. If color histogram (H, S) > skin threshold and edge (x, y) < edge threshold, then Pixel (x, y) = 1 (white): skin pixel; otherwise Pixel (x, y) = 0 (black): non-skin pixel.
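As a rough illustration, the steps above can be sketched as follows (in Python rather than MATLAB; both thresholds are hypothetical values, since the synopsis determines them empirically, and the gradient is computed on a grayscale copy here as a simplification of the RGB gradient):

```python
import colorsys

def sobel_magnitude(gray, x, y):
    """Approximate gradient magnitude at (x, y) using the Sobel kernels."""
    gx_k = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gy_k = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    gx = gy = 0
    for dy in range(-1, 2):
        for dx in range(-1, 2):
            v = gray[y + dy][x + dx]
            gx += gx_k[dy + 1][dx + 1] * v
            gy += gy_k[dy + 1][dx + 1] * v
    return abs(gx) + abs(gy)

def skin_filter(rgb, skin_threshold=2, edge_threshold=200):
    """Return a binary mask (1 = skin) for an image given as rows of
    (r, g, b) tuples. Both thresholds are hypothetical values."""
    h, w = len(rgb), len(rgb[0])
    # Steps 1-2: convert to HSV, quantize H and S into 10 bins each,
    # and build the (H, S) color histogram.
    bins = [[0] * 10 for _ in range(10)]
    hs = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            r, g, b = rgb[y][x]
            hh, ss, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            qh, qs = min(int(hh * 10), 9), min(int(ss * 10), 9)
            hs[y][x] = (qh, qs)
            bins[qh][qs] += 1
    # Step 3: gradient via Sobel, on a grayscale copy (a simplification).
    gray = [[(r + g + b) / 3 for r, g, b in row] for row in rgb]
    # Step 4: skin if the pixel's histogram bin is populous enough AND the
    # pixel does not lie on a strong edge.
    mask = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            qh, qs = hs[y][x]
            if bins[qh][qs] > skin_threshold and sobel_magnitude(gray, x, y) < edge_threshold:
                mask[y][x] = 1
    return mask
```

On a small patch of uniform skin-like color, every interior pixel shares one well-populated histogram bin and has zero gradient, so the whole interior is classified as skin.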

2 Multilayer Perceptron (MLP)

The prenetwork is a single multilayer perceptron (MLP). This is a neural network with input and hidden layers and a single output neuron (which is responsible for outputting either a face

or a nonface). The prenetwork is trained using back-propagation. This filter eliminates many

of the pixels to be considered in the comparison and is applied directly to grayscale images.

For color images, the output of the skin filter is fed to the MLP.
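A minimal sketch of such a prenetwork's forward pass (Python for illustration, with toy weights; the synopsis obtains the real weights by backpropagation training):

```python
import math

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden tanh layer feeding a single tanh output neuron, so the
    output lies in (-1, 1): near +1 suggests a face, near -1 a nonface."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    return math.tanh(sum(wo * h for wo, h in zip(w_out, hidden)) + b_out)
```

The tanh activations guarantee the bounded output range the detection step relies on.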

3 Detection

3.1 Filtering the image

This part consists of continuously applying a mask of 20*20 pixels to the preprocessed

image. The mask is to some degree invariant to rotation and scale. The output of this

operation (output of the neural network) varies between 1 and -1 according to whether a face

has been detected or not, respectively. If the face in the original (preprocessed) image is

larger than the window size, the image is sub-sampled (i.e., its size is reduced) and the filter

is applied to the image at each size until the new face fits the mask.
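The scan-and-subsample loop described above can be sketched as follows (Python; the 20*20 window follows the text, while the step size and subsampling factor are illustrative assumptions):

```python
def subsample(img, factor=2):
    """Shrink the image by keeping every `factor`-th row and column."""
    return [row[::factor] for row in img[::factor]]

def scan_pyramid(img, classify, win=20, step=10, factor=2):
    """Yield (scale, x, y) for every window the classifier labels a face.
    The image is repeatedly sub-sampled so that larger faces eventually
    fit the fixed win*win mask."""
    scale = 1
    while len(img) >= win and len(img[0]) >= win:
        for y in range(0, len(img) - win + 1, step):
            for x in range(0, len(img[0]) - win + 1, step):
                window = [row[x:x + win] for row in img[y:y + win]]
                if classify(window) > 0:   # network output in (-1, 1)
                    yield (scale, x, y)
        img = subsample(img, factor)
        scale *= factor
```

With a 40*40 image and a classifier that always answers "face", the scan visits nine windows at the original scale and one at half resolution.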

First step: At each step, another processing of the image is done to correct its illumination.

This is done by first creating a function that varies linearly with the intensity inside the

window. More precisely, the function varies linearly inside an oval in the window, and the

outer contour is black to discard the background pixels. This transformed version of the

image is then subtracted from the original one. Once this lighting correction has been done,

histogram equalization is applied to the image to emphasize its contrast. As before,

equalization is done in the oval part of the window. This is done to make sure that all images


have the same properties regardless of the conditions under which they were taken and of the

type of camera used.
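The histogram-equalization step can be sketched as follows (Python; the oval masking and the linear lighting-correction fit described above are omitted here for brevity):

```python
def equalize(window, levels=256):
    """Histogram equalization of one grayscale window (values 0..levels-1)."""
    flat = [p for row in window for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution function of the window's intensities.
    cdf, total = [0] * levels, 0
    for v in range(levels):
        total += hist[v]
        cdf[v] = total
    cdf_min = min(c for c in cdf if c > 0)   # first occupied bin
    denom = max(len(flat) - cdf_min, 1)
    # Standard equalization mapping; entries below cdf_min are never looked up.
    lut = [round((cdf[v] - cdf_min) / denom * (levels - 1)) for v in range(levels)]
    return [[lut[p] for p in row] for row in window]
```

A window whose four pixels sit at 0, 64, 128, and 255 is stretched to the evenly spaced values 0, 85, 170, and 255, which is the contrast emphasis the text describes.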

Second step: The extracted window is fed to the input layer of the neural network that

determines whether the image contains a face or not. The hidden layers of the network consist

of three types of units, with each type being specialized in one task. The first type is a set of

four receptive fields (hidden units) that are responsible for detecting features such as the

individual eyes, the nose, and the corners of the mouth. These units look at 10*10 pixel

regions. The second category consists of 16 units that look at 5*5 pixel regions and have the

same job as the ones described above. The third type is constituted of 6 units that look at

20*5 pixel regions and are responsible for detecting the mouth and the pair of eyes. This is possible since these units span horizontal stripes of the window.

Third step: If the output of the network is 1, a face is detected. The opposite occurs for an

output of -1. In order to train the system, a set of face and nonface images were used. Some

features such as the eyes, the nose and the mouth were labeled, and the images were scaled

and rotated using the following algorithm:

1. Initialize F, a vector that will be the average positions of each labeled feature over all the

faces, with the feature locations in the first face F1.

2. The feature coordinates in F are rotated, translated, and scaled, so that the average

locations of the eyes will appear at predetermined locations in a 20*20 pixel window.

3. For each face i, compute the best rotation, translation, and scaling to align the face‘s

features Fi with the average feature locations F. Such transformations can be written as a

linear function of their parameters. Thus, we can write a system of linear equations mapping

the features from Fi to F. The least squares solution to this over-constrained system yields the

parameters for the best alignment transformation. Call the aligned feature locations Fi.

4. Update F by averaging the aligned feature locations Fi for each face i.

5. Go to step 2.
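Step 3 above can be made concrete: because the transformation is linear in its parameters, the best alignment is a least-squares solution. A Python sketch for a similarity transform (rotation, scale, translation) follows; the (a, b, tx, ty) parameterization is an assumption about the exact form used:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def best_alignment(src, dst):
    """Least-squares similarity transform (a, b, tx, ty) mapping src points
    onto dst points:  x' = a*x - b*y + tx,  y' = b*x + a*y + ty."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, -y, 1, 0]); rhs.append(u)
        rows.append([y, x, 0, 1]); rhs.append(v)
    # Normal equations A^T A p = A^T r of the over-constrained linear system.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    Atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(4)]
    return solve(AtA, Atb)
```

Three feature points translated by (2, 3) recover a pure translation: a = 1, b = 0, tx = 2, ty = 3.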

The selection of nonface images during training is done as follows:


1. Create an initial set of nonface images by generating 1000 random images. Apply the

preprocessing steps to each of these images.

2. Train a neural network to produce an output of 1 for the face examples, and -1 for the

nonface examples. The training algorithm is standard error backpropagation with momentum.

After the first iteration, we use the weights computed by training in the previous iteration as

the starting point.

3. Run the system on an image of scenery which contains no faces. Collect sub images in

which the network incorrectly identifies a face (an output activation > 0).

4. Select up to 250 of these sub images at random, apply the preprocessing steps, and add

them into the training set as negative examples. Go to step 2.
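The bootstrapping loop above can be sketched as follows (Python, with stand-in `train` and `run_on_scenery` callables so that the control flow is runnable; a real system would train the MLP at step 2 and scan scenery images at step 3):

```python
import random

def bootstrap(train, run_on_scenery, rounds=3, batch=250):
    """Grow the negative training set from false positives on face-free scenery."""
    # Step 1: 1000 random images (flattened 20*20 windows); a real system
    # would also apply the preprocessing steps here.
    negatives = [[random.random() for _ in range(400)] for _ in range(1000)]
    model = None
    for _ in range(rounds):
        model = train(negatives, model)        # step 2 (warm-started after round 1)
        candidates = run_on_scenery(model)     # step 3: windows from scenery
        false_pos = [w for w in candidates if model(w) > 0]
        random.shuffle(false_pos)              # step 4: up to `batch` at random
        negatives.extend(false_pos[:batch])
    return model, negatives
```

Passing the previous model into `train` mirrors the text's warm start: after the first iteration, training resumes from the previous weights.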

Result of locating single Face within picture


Result of Locating Multiple Faces within picture

Locating Non-Frontal Faces

3.2 The merge of overlapping detections and arbitration

First step: merging overlapping detections. Most faces are detected at many nearby locations,

and therefore, the final detection of the face consists of taking all these detections and

combining them to find the true position of the face in the image. For each location found, the

number of nearby detections is determined, and compared to a given threshold. If the number

of detections is greater than the threshold, a face is correctly detected. Otherwise, this is a

false detection. The location of the final detection is given by the centroid of all the nearby


detections. This allows the different detections to be merged to give the final one. Once a face

has been detected using the above approach, all the other detections are considered as errors

and as such, are discarded. The only detected part we keep from the image is the one with a

high enough number of detections within a small neighborhood.
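A minimal sketch of this merging step (Python; the neighborhood radius and detection-count threshold are illustrative assumptions, determined empirically in practice):

```python
def merge_detections(detections, radius=5, threshold=2):
    """detections: list of (x, y) hits. Keep locations with at least
    `threshold` detections nearby and return their centroids."""
    merged = []
    for x, y in detections:
        near = [(a, b) for a, b in detections
                if abs(a - x) <= radius and abs(b - y) <= radius]
        if len(near) >= threshold:
            centroid = (sum(a for a, _ in near) / len(near),
                        sum(b for _, b in near) / len(near))
            if centroid not in merged:
                merged.append(centroid)
    return merged
```

Three clustered hits collapse into one centroid, while an isolated hit falls below the count threshold and is discarded as a false detection.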

Second step: arbitration among multiple networks. The above step is helpful in reducing the

number of false detections (also called false positives). To reduce this number even further, a

second step can be added, which consists of applying many networks and arbitrating between

their outputs. Each detection at a particular position and scale is saved in an output pyramid

and the outputs of different pyramids are ANDed together. When the outputs are ANDed, the

detected part of an image will be correctly classified as a face if both networks agree upon it.

Since it is rare that two networks will misclassify the faces, this strategy is helpful in

decreasing the number of false detections. However, this strategy might reject a correctly

identified face if only one of the two networks detects it.
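The ANDing of output pyramids amounts to a set intersection over (scale, x, y) detections; a minimal sketch:

```python
def arbitrate_and(pyramid_a, pyramid_b):
    """AND two output pyramids: a detection survives only if both
    networks report it at the same (scale, x, y)."""
    return set(pyramid_a) & set(pyramid_b)
```

Only positions both networks agree on survive, which is why this step cuts false positives but can also drop a face that only one network found.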

RESULTS

The training sets consist of frontal faces that are only roughly aligned. This was done

by having a person place a bounding box around each face just above the eyebrows and about

half-way between the mouth and the chin. This bounding box was then enlarged by 50% and

then cropped and scaled to 20 by 20 pixels. By observing the performance of this face detector on a number of test images, a few different failure modes were noticed. The face

detector was trained on frontal, upright faces. The faces were only very roughly aligned so

there is some variation in rotation both in plane and out of plane. Informal observation

suggests that the face detector can detect faces that are tilted up to about ±15 degrees in plane

and about ±45 degrees out of plane (toward a profile view). The detector becomes unreliable

with more rotation than this. It was also noticed that harsh backlighting, in which the faces are very dark while the background is relatively light, sometimes causes failures. It is interesting to

note that using a nonlinear variance normalization based on robust statistics to remove

outliers improves the detection rate in this situation. Finally, this face detector fails on

significantly occluded faces. If the eyes are occluded for example, the detector will usually

fail. The mouth is not as important and so a face with a covered mouth will usually still be

detected.


REFERENCE

[1] Milan Sonka, Vaclav Hlavac, Roger Boyle, "Image Processing, Analysis and Machine Vision", Tata McGraw-Hill.

[2] Rowley, Baluja, and Kanade, "Neural Network-Based Face Detection", IEEE Patt. Anal. Mach. Intell., 20:22–38.

[3] Henry A. Rowley, Shumeet Baluja, Takeo Kanade (1998). "Rotation Invariant Neural Network-Based Face Detection", 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'98), p. 38.

[4] H. A. Rowley, S. Baluja, and T. Kanade, "Neural Network-Based Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, 1998.

[5] http://engineeringprojects101.blogspot.in/2012/07/face-detection-using matlab.html

[6] Goldstein, A. J., Harmon, L. D., and Lesk, A. B., "Identification of human faces", Proc. IEEE 59, pp. 748-760, (1971).

[7] Nakamura, O., Mathur, S., and Minami, T., "Identification of human faces based on isodensity maps", Pattern Recognition, Vol. 24(3), pp. 263-272, (1991).

(Signature of Candidate)

Remarks of Supervisor:

(Signature of Supervisor )