
CHAPTER 2

    LITERATURE SURVEY

    2.1 INTRODUCTION

    This chapter presents a detailed literature survey on facial tracking

    using lip movement, skin color and mouth movement in a video sequence.

Automatic facial feature extraction, 3D model shaping, and algorithms for robust segmentation of various facial parts designed by various authors are discussed.

    2.2 FACIAL TRACKING USING LIP READING

    Yuille et al., 1992, develop an automatic facial feature extraction

    system, which is able to identify the detailed shape of eyes, eyebrows and

    mouth from facial images. The developed system not only extracts the

    location information of the features, but also estimates the parameters

pertaining to the contours and parts of the features using a parametric

    deformable templates approach. In order to extract facial features,

    deformable models for each of eye, eyebrow, and mouth are developed. The

    development steps of the geometry, imaging model and matching

    algorithms, and energy functions for each of these templates are presented

    in detail, along with the important implementation issues. An eigenface

    based multi-scale face detection algorithm which incorporates standard facial

    proportions is implemented, so that when a face is detected, the rough

    search regions for the facial features are readily available. The developed

system is tested on JAFFE (Japanese Female Facial Expression Database),

    Yale Faces, and ORL (Olivetti Research Laboratory) face image databases.

    The performance of each deformable template and the face detection

    algorithm are discussed separately.
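
To make the deformable-template idea concrete, the sketch below fits a simple circular template (for instance, an iris) to an edge-strength map by greedy local search over its parameters. It is only an illustration of energy minimization over template parameters; the function names and the coordinate-descent search are ours, not Yuille et al.'s actual formulation, which uses richer templates and energy terms.

```python
import numpy as np

def fit_circle_template(edge_map, x0, y0, r0, steps=50):
    """Fit a circular deformable template to an edge-strength map by greedy
    local search over its parameters (x, y, r). The energy rewards strong
    edges along the template contour."""
    h, w = edge_map.shape
    theta = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)

    def energy(x, y, r):
        # Sample the edge map along the circle; lower energy = stronger edges.
        xs = np.clip((x + r * np.cos(theta)).astype(int), 0, w - 1)
        ys = np.clip((y + r * np.sin(theta)).astype(int), 0, h - 1)
        return -edge_map[ys, xs].mean()

    params = np.array([x0, y0, r0], dtype=float)
    for _ in range(steps):
        best = (energy(*params), params.copy())
        # Try small perturbations of each parameter (coordinate descent).
        for i in range(3):
            for d in (-1.0, 1.0):
                trial = params.copy()
                trial[i] += d
                e = energy(*trial)
                if e < best[0]:
                    best = (e, trial)
        if np.allclose(best[1], params):
            break  # local minimum reached
        params = best[1]
    return params

# Toy usage: a synthetic edge ring centred at (40, 40) with radius 10.
yy, xx = np.mgrid[0:80, 0:80]
ring = np.exp(-((np.hypot(xx - 40, yy - 40) - 10.0) ** 2))
print(fit_circle_template(ring, 35, 35, 8))  # converges near (40, 40, 10)
```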

Rabiner, 1993, states that although the face detection algorithm is

    designed for frontal face, the same mechanism can also be applied to track

non-frontal faces with online adapted face models. Due to the essence of

    template matching, the algorithm is capable of comparing the similarity

    among different faces, which makes it suitable for tracking the same face

that occurs at disjoint temporal locations in video. While the proposed face detection method provides accuracy comparable to that of the neural network-based approach, it is much faster.

    Terzopoulos et al., 1993, present a new approach to the analysis of

    dynamic facial images for the purposes of estimating and resynthesizing

    dynamic facial expressions. The approach exploits a sophisticated generative

    model of the human face originally developed for realistic facial animation.

    The face model, which may be simulated and rendered at interactive rates

    on a graphics workstation, incorporates a physics-based synthetic facial

    tissue and a set of anatomically motivated facial muscle actuators. They

    consider the estimation of dynamic facial muscle contractions from video

    sequences of expressive human faces. They develop an estimation technique

    that uses deformable contour models (snakes) to track the non-rigid motions

of facial features in video images.
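
As an illustration of the snake idea, the following minimal Python sketch implements a greedy active contour: each point moves to the neighbouring pixel that best trades smoothness against attraction to strong image gradients. It is a simplified stand-in for the authors' formulation, and all names are ours.

```python
import numpy as np

def greedy_snake(points, grad_mag, iters=100, alpha=0.5):
    """Greedy snake: each contour point moves to the 8-neighbour position
    minimising (elastic energy - image energy). `points` is an (N, 2) array
    of (x, y); `grad_mag` is the image gradient magnitude."""
    h, w = grad_mag.shape
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
    pts = points.astype(float).copy()
    for _ in range(iters):
        moved = False
        for i in range(len(pts)):
            prev, nxt = pts[i - 1], pts[(i + 1) % len(pts)]  # closed contour
            best_e, best_p = np.inf, pts[i]
            for dx, dy in offsets:
                p = pts[i] + (dx, dy)
                if not (0 <= p[0] < w and 0 <= p[1] < h):
                    continue
                elastic = np.sum((p - (prev + nxt) / 2.0) ** 2)  # smoothness
                image = -grad_mag[int(p[1]), int(p[0])]          # edge pull
                e = alpha * elastic + image
                if e < best_e:
                    best_e, best_p = e, p
            if not np.array_equal(best_p, pts[i]):
                pts[i], moved = best_p, True
        if not moved:
            break  # converged: no point moved this sweep
    return pts
```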

    Lanitis et al., 1994, present flexible shape and flexible grey-level

    models for representing variations in the appearance of human faces. These

    models are controlled by a small number of parameters which can be used

    for coding and reconstructing a face image.

Jacquin Arnaud et al., 1995, address the issue of automatically

    tracking the faces and facial features of persons in head-and-shoulders video

    sequences. They propose two totally automatic algorithms which

    respectively perform the detection of head outlines and identify rectangular

eyes-nose-mouth regions, both from down-sampled binary thresholded edge images. Unlike methods that have been proposed recently, a priori assumptions regarding the nature and content of the sequences to be coded are minimal for their techniques, and the algorithms operate accurately and robustly, even in

    cases of significant head rotation or partial occlusion by moving objects.

    Gavrila and Davis, 1996, present a vision system for the 3-D model-

    based tracking of unconstrained human movement. Using image sequences

    acquired simultaneously from multiple views, they recover the 3D body pose

    at each time instant without the use of markers. The pose recovery problem

    is formulated as a search problem and entails finding the pose parameters of

    a graphical human model whose synthesized appearance is most similar to

    the actual appearance of the real human in the multi-view images. The

    models used for this purpose are acquired from the images. They use a

    decomposition approach and a best-first technique to search through the

    high dimensional pose parameter space. A robust variant of chamfer

    matching is used as a fast similarity measure between synthesized and real

    edge images. They present initial tracking results from a large new human-

in-action database containing more than 2500 frames in each of four

    orthogonal views. The four image streams are synchronized. They contain

subjects involved in a variety of activities of various degrees of complexity, ranging from simple one-person hand waving to the challenging close interaction of two persons in the Argentine Tango.
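
The chamfer similarity measure mentioned above can be sketched compactly: compute a distance transform of the scene edge map and average it over the template's edge points. This illustrative Python version (function names ours) assumes SciPy is available; Gavrila and Davis use a more robust variant.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(scene_edges, template_points, dx, dy):
    """Chamfer-style dissimilarity between a binary scene edge map and a set
    of integer (x, y) template edge points placed at offset (dx, dy): the
    mean distance from each shifted template point to the nearest scene
    edge. Lower is better; minimise over (dx, dy) to localise the model."""
    # Distance of every pixel to the nearest edge pixel (edges become zeros).
    dist = distance_transform_edt(~scene_edges.astype(bool))
    h, w = dist.shape
    xs = np.clip(template_points[:, 0] + dx, 0, w - 1)
    ys = np.clip(template_points[:, 1] + dy, 0, h - 1)
    return dist[ys, xs].mean()
```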

McKenna et al., 1996, describe a dynamic face tracking system based on integrated motion-based object tracking and a model-based face detection framework. The motion-based tracker focuses attention for the

    face detector whilst the latter aids the tracking process. The system

    produces segmented face sequences from complex scenes with poor viewing

    conditions in surveillance applications. They also investigate a Gabor wavelet

    transform as a representation scheme for capturing head rotations in depth.

    Principal components analysis was used to visualize the manifolds described

    by pose change. Heinzmann and Zelinsky, 1997, state that people naturally

    express themselves through facial gestures. They have implemented an

interface that tracks a person's facial features robustly in real time (30Hz)

    and does not require artificial artifacts such as special illumination or facial

    makeup. Even if features become occluded, the system is capable of

    recovering tracking in a couple of frames after the features reappear in the

    image. Based on this fault tolerant face tracker they have implemented real

    time gesture recognition capable of distinguishing 12 different gestures

    ranging from "yes", "no" and "may be" to winks, blinks and "asleep".

Sanchez et al., 1997, present a method for lip tracking intended to support personal verification. Lip contours are represented by means of

    quadratic B-splines. The lips are automatically localized in the original image

    and an elliptic B-spline is generated to start up tracking. Lip localization

    exploits grey-level gradient projections as well as chromaticity models to

    find the lips in an automatically segmented region corresponding to the face

    area. Tracking proceeds by estimating new lip contour positions according to

    a statistical chromaticity model for the lips. The current tracker

    implementation follows a deterministic second order model for the spline

    motion based on a Lagrangian formulation of contour dynamics. The method

    has been tested on the M2VTS database. Lips were accurately tracked on

sequences consisting of more than a hundred frames.
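
For reference, a closed uniform quadratic B-spline of the kind used to represent lip contours can be evaluated from its control points as in the sketch below (an illustration, not the authors' implementation).

```python
import numpy as np

def quadratic_bspline(control_points, samples_per_span=20):
    """Evaluate a closed uniform quadratic B-spline from its control polygon.
    Each span blends three consecutive control points with the standard
    quadratic B-spline basis (the three weights sum to 1 for every t)."""
    P = np.asarray(control_points, dtype=float)
    n = len(P)
    t = np.linspace(0.0, 1.0, samples_per_span, endpoint=False)
    # Standard uniform quadratic B-spline basis functions.
    b0 = 0.5 * (1.0 - t) ** 2
    b1 = 0.5 + t * (1.0 - t)
    b2 = 0.5 * t ** 2
    curve = []
    for i in range(n):  # one span per control point (closed curve)
        p0, p1, p2 = P[i], P[(i + 1) % n], P[(i + 2) % n]
        curve.append(np.outer(b0, p0) + np.outer(b1, p1) + np.outer(b2, p2))
    return np.vstack(curve)

# Usage: an elliptic control polygon gives the start-up lip contour.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
ellipse_ctrl = np.stack([30 * np.cos(angles), 15 * np.sin(angles)], axis=1)
print(quadratic_bspline(ellipse_ctrl).shape)  # (160, 2) sampled contour
```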

    Basu et al., 1998, address the problem of tracking and reconstructing

    3D human lip motions from a 2D view. They build a physically-based 3D

    model of lips and train it to cover only the subspace of lip motions. They

    then track this model in video by finding the shape within the subspace that

    maximizes the posterior probability of the model given the observed

    features. The features are the likelihoods of the lip and non-lip color classes:

    they iteratively derive forces from these values to apply to the physical

    model and converge to the final solution. Because of the full 3D nature of

the model, this framework allows the lips to be tracked from any head pose. In

    addition, because of the constraints imposed by the learned subspace of the

model, they are able to accurately estimate the full 3D lip shape from the 2D

    view.

Edwards et al., 1998, address the problem of robust face identification

    in the presence of pose, lighting, and expression variation. Previous

    approaches to the problem have assumed similar models of variation for

    each individual, estimated from pooled training data. They describe a

    method of updating a first order global estimate to identity by learning the

    class specific correlation between the estimate and the residual variation

    during a sequence. This is integrated with an optimal tracking scheme, in

    which identity variation is decoupled from pose, lighting and expression

    variation. The method results in robust tracking and a more stable estimate

    of facial identity under changing conditions.

    Schödl Arno et al., 1998, describe the use of a three-dimensional

    textured model of the human head under perspective projection to track a

    person’s face. The system is hand-initialized by projecting an image of the

    face onto a polygonal head model. Tracking is achieved by finding the six

    translation and rotation parameters to register the rendered images of the

    textured model with the video images. They find the parameters by mapping

    the derivative of the error with respect to the parameters to intensity

    gradients in the image. They use a robust estimator to pool the information

    and do gradient descent to find an error minimum.

    Stan Birchfield, 1998, presents an algorithm for tracking a person’s

    head. The head’s projection onto the image plane is modeled as an ellipse

    whose position and size are continually updated by a local search combining

    the output of a module concentrating on the intensity gradient around the

    ellipse’s perimeter with that of another module focusing on the color

    histogram of the ellipse’s interior. Since these two modules have roughly

    orthogonal failure modes, they serve to complement one another. The result

    is a robust, real-time system that is able to track a person’s head with

enough accuracy to automatically control the camera's pan, tilt, and zoom in

    order to keep the person centered in the field of view at a desired size.

    Extensive experimentation shows the algorithm’s robustness with respect to

    full 360-degree out-of-plane rotation, up to 90-degree tilting, severe but

    brief occlusion, arbitrary camera movement, and multiple moving people in

    the background.
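
The two complementary modules can be sketched as a single score evaluated for each candidate ellipse; the tracker then keeps the best-scoring ellipse in a local search. The weighting and normalisation below are our assumptions, not Birchfield's exact formulation.

```python
import numpy as np

def ellipse_score(gray, hist_model, cx, cy, a, b, bins=16):
    """Combined score for one candidate head ellipse: mean gradient magnitude
    around the perimeter plus histogram intersection of the interior with a
    normalised model histogram. Higher is better."""
    h, w = gray.shape
    gy, gx = np.gradient(gray.astype(float))

    # Perimeter term: gradient magnitude sampled on the ellipse boundary.
    theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
    xs = np.clip((cx + a * np.cos(theta)).astype(int), 0, w - 1)
    ys = np.clip((cy + b * np.sin(theta)).astype(int), 0, h - 1)
    grad_term = np.hypot(gx[ys, xs], gy[ys, xs]).mean()

    # Interior term: intersection of the interior grey-level histogram
    # with the model histogram (both normalised to sum to 1).
    yy, xx = np.mgrid[0:h, 0:w]
    inside = ((xx - cx) / a) ** 2 + ((yy - cy) / b) ** 2 <= 1.0
    hist, _ = np.histogram(gray[inside], bins=bins, range=(0, 256))
    hist = hist / max(hist.sum(), 1)
    color_term = np.minimum(hist, hist_model).sum()

    # The two cues have roughly orthogonal failure modes; sum them.
    return grad_term / 255.0 + color_term
```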

Toyama, 1998, observes that real-time 3D face tracking is a task with applications to

    animation, video teleconferencing, speech reading, and accessibility. In spite

    of advances in hardware and efficient vision algorithms, robust face tracking

    remains elusive for all of the reasons which make computer vision difficult:

    Variations in illumination, pose, expression, and visibility complicate the

    tracking process, especially under real-time constraints. They note that

    robust systems tend to possess some state-based architecture comprising

    heterogeneous algorithms, and that robust recovery from tracking failure

    requires several other facial image analysis tasks.

    Cascia et al., 2000, propose an improved technique for 3D head

tracking under varying illumination conditions. The head is modeled as a

    texture mapped cylinder. Tracking is formulated as an image registration

    problem in the cylinder's texture map image. The resulting dynamic texture

    map provides a stabilized view of the face that can be used as input to many

    existing 2D techniques for face recognition, facial expressions analysis, lip

    reading, and eye tracking.

Lievin and Luthon, 2000, propose an algorithm for a speaker's lip segmentation and feature extraction. A color video sequence of the speaker's

    face is acquired, under natural lighting conditions and without any particular

    make-up. A logarithmic color transform is performed from the RGB to HI

    (hue, intensity) color space. A statistical approach using Markov random

    field modeling determines the red hue prevailing region and motion in a

spatiotemporal neighborhood. The final label field is then used to extract

    ROI (region of interest) and geometrical features.
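
A minimal sketch of a logarithmic hue/intensity decomposition of this general kind is shown below; it is only a plausible stand-in, not necessarily the exact RGB-to-HI transform used by the authors.

```python
import numpy as np

def log_hue_intensity(rgb):
    """A simple logarithmic hue/intensity transform of an RGB image
    (float array with values in [0, 1]). Lips tend to score high on
    red-dominance measures of this kind."""
    eps = 1e-6
    log_rgb = np.log(rgb + eps)
    intensity = log_rgb.mean(axis=-1)            # log-domain intensity
    # "Red hue": how much log-red exceeds log-green, a red-prevalence cue.
    hue = log_rgb[..., 0] - log_rgb[..., 1]
    return hue, intensity
```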

    Tian et al., 2000, propose a dual state model based system of tracking

    eye features that uses convergent tracking techniques and show how it can

    be used to detect whether the eyes are open or closed, and to recover the

    parameters of the eye model.

Jian et al., 2001, develop a real-time lip tracking system whose information can be

    used to implement and control a virtual lip. The use of soft computing to

    represent the real time lip parameters enables them to have a more robust

    and flexible system which can compensate for the potential errors of lip

    tracking.

    Chan et al., 2002, state that contour model-based tracking is more

    robust if an accurate reference shape model of the underlying object is

    available. As lip shapes vary, the ability to automatically extract user-

    dependent lip models from input images is desirable. They present an

    unsupervised segmentation method to hierarchically locate the user's face

    and lips. Techniques employed include modeling in the hue / saturation color

    space using Gaussian mixture models and the use of geometric constraints.

    With the region of interest automatically located, the model extraction

    problem is formulated as a regularized model-fitting problem. The use of a

    generic shape as prior information improves the accuracy of the extracted lip

    model which is based on a cubic B-spline representation. They describe a

    method to compute automatically an optimal linear color space transform

    needed to obtain raw estimates of the lip boundary locations, as required by

    the fitting procedure.

Delman and Lievin, 2002, present an algorithm for a speaker's lip segmentation and feature extraction. A color video sequence of the speaker's

    face is acquired, under natural lighting conditions and without any particular

    make-up. A logarithmic color transform is performed from RGB to HI (hue,

intensity) color space. A statistical approach using Markov random field modeling determines the lip prevailing region and motion in spatiotemporal

    neighborhoods.

    Eveno et al., 2002, propose an accurate and robust lip segmentation

    algorithm. Characteristic points are found by using hybrid edges, which

    combine color and intensity information, and a priori knowledge about the lip

    structure. Corner position, which is crucial, is provided by a coarse-to-fine

process. A model is fitted to the lips. Unlike most model-oriented methods,

    they consider that the lip boundary is composed of several independent

cubic polynomial models. This gives the global model enough flexibility to

    reproduce the specificity of very different lip shapes. Compared to existing

    models, it brings a significant accuracy improvement. It ensures a robust

    convergence towards the edges.

Liew et al., 2002, state that the use of visual information from lip movements can improve the accuracy and robustness of a speech recognition system. A region-based lip contour extraction algorithm based on a deformable model is proposed. The algorithm employs a stochastic cost

    function to partition a color lip image into lip and non-lip regions such that

    the joint probability of the two regions is maximized. Given a discrete

    probability map generated by spatial fuzzy clustering, they show how the

    optimization of the cost function can be done in the continuous setting. The

    region-based approach makes the algorithm more tolerant to noise and

artifacts in the image. It also allows a larger region of attraction, thus making

    the algorithm less sensitive to initial parameter settings. The algorithm

    works on unadorned lips and accurate extraction of lip contour is possible.

    Mark Barnard et al., 2002, propose a robust and adaptable lip tracking

    method that uses a combination of snakes and a 2D template matching

    technique. The snake, an energy minimizing spline, is driven by 2D template

    matching techniques to find the expected lip contour of a specific speaker.

Their experiments show that the technique can track the unadorned lips in

    various colors and shapes of speakers, including the lips of a bearded

    speaker.

    Morency et al., 2002, present a robust implementation of stereo-based

    head tracking designed for interactive environments with uncontrolled

    lighting. They integrate fast face detection and drift reduction algorithms

    with a gradient-based stereo rigid motion tracking technique. Their system

    can automatically segment and track a user’s head under large rotation and

    illumination variations. Precision and usability of their approach are

    compared with previous tracking methods for cursor control and target

    selection in both desktop and interactive room environments.

Yang et al., 2002, state that images containing faces are essential to

    intelligent vision-based human computer interaction, and research efforts in

    face processing include face recognition, face tracking, pose estimation, and

    expression recognition. Given a single image, the goal of face detection is to

    identify all image regions which contain a face regardless of its three-

    dimensional position, orientation, and lighting conditions. Such a problem is

    challenging because faces are not rigid and have a high degree of variability

    in size, shape, color, and texture. Numerous techniques have been

    developed to detect faces in a single image.

    Blanz Volker and Vetter, 2003, present a method for face recognition

    across variations in pose, ranging from frontal to profile views, and across a

wide range of illuminations, including cast shadows and specular reflections.

    To account for these variations, the algorithm simulates the process of

    image formation in 3D space, using computer graphics, and it estimates 3D

    shape and texture of faces from single images. The estimate is achieved by

fitting a statistical, morphable model of 3D faces to images. The model is

    learned from a set of textured 3D scans of heads. They describe the

construction of the morphable model, an algorithm to fit the model to

images, and a framework for face identification. In this framework, faces are

    represented by model parameters for 3D shape and texture.

Liew, 2003, describes the application of a novel spatial fuzzy clustering

    algorithm to the lip segmentation problem. The proposed spatial fuzzy

    clustering algorithm is able to take into account both the distributions of

    data in feature space and the spatial interactions between neighboring pixels

during clustering. By appropriate pre- and post-processing utilizing the color

    and shape properties of the lip region, successful segmentation of most lip

    images is possible. Comparative study with some existing lip segmentation

    algorithms such as the hue filtering algorithm and the fuzzy entropy

    histogram thresholding algorithm has demonstrated the superior

    performance of their method.

    Suandi et al., 2003, introduce an extended technique in template

    matching to track eyes and mouth in real-time. The technique makes use of

    a set of ‘n’ correlation candidates from template matching. They first list all

the candidates from each face model region, and select the best candidates

    based on two selective functions. These functions are for right-left eyes pair

    and eyes-mouth pair selection, respectively. They also introduce a novel

    technique in tracking framework, called feature selective (FS), where the

    system selects the features automatically so that it is feasible for multiple

    face types and conditions.

    Wu et al., 2003, state that occlusion is a difficult problem for

appearance-based target tracking, especially when one needs to track multiple

    targets simultaneously and maintain the target identities during tracking.

    They propose a dynamic Bayesian network which accommodates an extra

    hidden process for occlusion and stipulates the conditions on which the

    image observation likelihood is calculated. The statistical inference of such a

    hidden process can reveal the occlusion relations among different targets,

which makes the tracker more robust against partial and even complete

occlusions. In addition, considering the fact that target appearances change

    with views, another generative model for multiple view representation is

    proposed by adding a switching variable to select from different view

templates. The integration of the occlusion model and multiple view model

    results in a complex dynamic Bayesian network, where extra hidden

    processes describe the switch of targets’ templates, dynamics, and the

    occlusions among different targets. The tracking and inference algorithms

    are implemented by the sampling-based sequential Monte Carlo strategies.

Their experiments show the effectiveness of the proposed probabilistic models

    and the algorithms.

Eveno Nicolas et al., 2004, propose an accurate and robust quasi-

    automatic lip segmentation algorithm. The upper mouth boundary and

    several characteristic points are detected in the first frame by using a new

    kind of active contour: the “jumping snake”. Unlike classic snakes, it can be

    initialized far from the final edge and the adjustment of its parameters is

    easy and intuitive. Then, to achieve the segmentation they propose a

    parametric model composed of several cubic curves. Its high flexibility

    enables accurate lip contour extraction even in the challenging case of very

asymmetric mouths. It brings a significant accuracy and realism

    improvement. The segmentation in the following frames is achieved by using

    an inter frame tracking of the key points and the model parameters. The

    key point’s positions become unreliable after a few frames. They propose an

    adjustment process that enables an accurate tracking even after hundreds of

frames, and the mean key point tracking errors of their algorithm are comparable to manual point selection errors.

Leung Shu-Hung et al., 2004, present a new fuzzy clustering

    method for lip image segmentation. This clustering method takes both the

    color information and the spatial distance into account while most of the

    current clustering methods only deal with the former. A new dissimilarity

measure, which integrates the color dissimilarity and the spatial distance in

    terms of an elliptic shape function, is introduced. Because of the presence of

    the elliptic shape function, the new measure is able to differentiate the pixels

having similar color information but located in different regions. A new

    iterative algorithm for the determination of the membership and centroid for

    each class is derived, which is shown to provide good differentiation between

    the lip region and the non-lip region.
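
The flavour of such a dissimilarity measure can be sketched as below, where the ellipse parameters and the spatial weight are assumptions of ours; the authors' exact elliptic shape function differs.

```python
import numpy as np

def elliptic_dissimilarity(pixel_color, pixel_xy, centroid_color, ellipse,
                           w_spatial=1.0):
    """Dissimilarity combining color distance with a spatial term shaped by
    an elliptic function, so pixels with lip-like color but located far
    outside the expected elliptic lip region are penalised.
    `ellipse` = (cx, cy, a, b) are assumed lip-region parameters."""
    cx, cy, a, b = ellipse
    color_d = np.sum((np.asarray(pixel_color) - np.asarray(centroid_color)) ** 2)
    x, y = pixel_xy
    # Elliptic shape function: ~0 inside the ellipse, grows outside it.
    shape = max(((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 - 1.0, 0.0)
    return color_d + w_spatial * shape
```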

Wang et al., 2004, note that visual information from lip shapes and movements

    helps improve the accuracy and robustness of a speech recognition system.

    A new region-based lip contour extraction algorithm that combines the

    merits of the point-based model and the parametric model is presented.

    Their algorithm uses a 16-point lip model to describe the lip contour. Given a

    robust probability map of the color lip image generated by the FCMS (fuzzy

    clustering method incorporating shape function) algorithm, a region-based

    cost function that maximizes the joint probability of the lip and non-lip

    region can be established. Then an iterative point-driven optimization

    procedure has been developed to fit the lip model to the probability map. In

    each iteration, the adjustment of the 16 lip points is governed by three

    pieces of quadratic curves that constrain the points to form a physical lip

    shape.

Narayanan et al., 2006, present a lip contour tracking algorithm

    using attractor guided particle filtering. It is difficult to robustly track the lip

    contour because the lip contour is highly deformable and the contrast

between skin and lip colors is very low. This often makes traditional blind segmentation-based algorithms fail to produce robust and realistic results.

    The lip contour is constrained by the facial muscles; the tracking

    configuration space can then be represented by a lower dimensional

    manifold. They take some representative lip shapes as the attractors in the

    lower dimensional manifold. To resolve the low contrast problem, they adopt

a color feature selection algorithm to maximize the contrast between skin and lip

    colors. Then they integrate the shape priors and the discriminative feature

    into the attractor-guided particle filtering framework to track the lip contour.

Nguyen et al., 2008, propose and evaluate a novel method for enhancing the performance of lip contour tracking, based on the concept of active shape models (ASM) and multiple features. On the first image of the video sequence, the lip region is detected using Bayes' rule

    in which lip color information is modeled by using the Gaussian Mixture

    Model (GMM) and the GMM is trained by Expectation-Maximization (EM)

    algorithm. The lip region is then used to initialize the lip shape model. A

single feature-based ASM presents good performance only in particular conditions but gets stuck in local minima in noisy conditions. To enhance the convergence, they propose to use two features, normal profiles and grey-level patches, and combine them by using a voting approach. The standard ASM

    is not able to take into account temporal information from previous frames

    therefore the lip contours are tracked by replacing the standard ASM with a

hybrid active shape model (HASM), which is capable of taking advantage of the

    temporal information.
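
The GMM-based lip detection step can be sketched with scikit-learn, whose fit() runs EM internally; the component count and the lip prior below are assumptions of ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lip_posterior(pixels, lip_samples, nonlip_samples, prior_lip=0.2):
    """Pixel-wise lip posterior via Bayes' rule, with class-conditional
    color densities modelled by GMMs trained with EM. All sample arrays
    are (N, 3) RGB rows; `prior_lip` is an assumed class prior."""
    gmm_lip = GaussianMixture(n_components=3, random_state=0).fit(lip_samples)
    gmm_bg = GaussianMixture(n_components=3, random_state=0).fit(nonlip_samples)
    # Log-likelihood of each pixel under each class model.
    ll_lip = gmm_lip.score_samples(pixels)
    ll_bg = gmm_bg.score_samples(pixels)
    # Bayes' rule in the log domain for numerical stability.
    log_num = np.log(prior_lip) + ll_lip
    log_den = np.logaddexp(log_num, np.log(1 - prior_lip) + ll_bg)
    return np.exp(log_num - log_den)  # P(lip | color) per pixel
```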

Ong Eng-Jon and Bowden, 2008, propose a learnt data-driven

    approach to the accurate, real-time tracking of lip shapes using only

intensity information. This has the advantage that constraints such as a priori shape models or temporal models for dynamics are not required or

    used. Tracking the lip shape is simply the independent tracking of a set of

    points that lie on the lip’s contour. This allows us to cope with different lip

    shapes that were not present in the training data and performs as well as

    other approaches that have pre-learnt shape models such as the AAM.

Tracking is achieved via linear predictors, where each linear predictor

    essentially linearly maps sparse template difference vectors to tracked

    feature position displacements. Multiple linear predictors are grouped into a

rigid flock to obtain increased robustness. To achieve accurate tracking, two

    approaches are proposed for selecting relevant sets of LPs within each flock.

    Analysis of the selection results show that the LPs selected for tracking a

    feature point choose areas that are strongly correlated with that of the

    tracked target and that these areas are not necessarily the region around

the feature point, as is commonly assumed in LK-based approaches. Effective fusion of acoustic and visual modalities in speech recognition has

    been an important issue in human computer interfaces, warranting further

    improvements in intelligibility and robustness. Speaker lip motion stands out

    as the most linguistically relevant visual feature for speech recognition. They

    present a new hybrid approach to improve lip localization and tracking,

    aimed at improving speech recognition in noisy environments. It begins with

    a new color space transformation for enhancing lip segmentation. In the

    color space transformation, a PCA method is employed to derive a new one

    dimensional color space which maximizes discrimination between lip and

    non-lip colors. Intensity information is also incorporated in the process to

    improve contrast of upper and corner lip segments. In the subsequent step,

    a constrained deformable lip model with high flexibility is constructed to

    accurately capture and track lip shapes. The model requires only six degrees

    of freedom, yet provides a precise description of lip shapes using a simple

    least square fitting method. Experimental results indicate that the proposed

    hybrid approach delivers reliable and accurate localization and tracking of lip

    motions under various measurement conditions.
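
As a concrete stand-in for the discriminative one-dimensional color space, the sketch below computes a Fisher discriminant direction from labelled lip and non-lip samples; note this substitutes the closely related Fisher criterion for the paper's PCA-based derivation.

```python
import numpy as np

def discriminative_color_axis(lip, nonlip):
    """Derive a single color axis separating lip from non-lip colors using
    the Fisher discriminant direction, which maximises between-class over
    within-class scatter. `lip` and `nonlip` are (N, 3) RGB sample arrays."""
    mu_l, mu_n = lip.mean(axis=0), nonlip.mean(axis=0)
    # Pooled within-class scatter matrix (regularised for invertibility).
    Sw = np.cov(lip, rowvar=False) + np.cov(nonlip, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(3), mu_l - mu_n)
    return w / np.linalg.norm(w)

# Projecting an image onto this axis yields the 1-D color space:
# proj = image.reshape(-1, 3) @ discriminative_color_axis(lip, nonlip)
```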

Rohani et al., 2008, state that lip feature extraction is one of the most challenging tasks affecting lip reading systems' performance. They propose a

    new approach for lip contour extraction based on fuzzy clustering. The

    algorithm employs a stochastic cost function to partition a color image into

    lip and non-lip regions such that the joint probability of the two regions is

    maximized. The mouth location is determined and then, lip region is

preprocessed using pseudo hue transformation. Fuzzy c-means clustering is

    applied to each transformed image along with b components of CIELAB color

space. To delete the clustered pixels around the lips, an ellipse and a Gaussian mask are used. In order to show the performance of the proposed method,

the pseudo hue segmentation and fuzzy c-means clustering without

    preprocessing are compared. The compared methods were applied to the

    VidTIMIT and M2VTS databases and the results show the superiority of the

    proposed method in comparison with other methods.
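
The fuzzy c-means core used in such methods alternates membership and centroid updates, as in this minimal sketch (without the pseudo-hue preprocessing and elliptic/Gaussian masks the authors add around it).

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, tol=1e-5, seed=0):
    """Plain fuzzy c-means on feature rows X of shape (N, d): alternately
    update the membership matrix U and the cluster centroids until the
    memberships stop changing."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1
    centers = X[:c].astype(float)
    for _ in range(iters):
        Um = U ** m                             # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared distance of every point to every centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + 1e-12
        # Standard FCM membership update: u ~ d^(-2/(m-1)), normalised.
        U_new = 1.0 / (d2 ** (1.0 / (m - 1)))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            break
        U = U_new
    return U, centers
```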

Chin Siew Wen et al., 2009, present an automatic lip detection and tracking system based on a watershed segmentation approach. For some lip detection systems, skin / non-skin detection is a prerequisite step to localize the face region, followed by detection of the lip region. A direct lip detection technique using watershed segmentation without needing

    preliminary face localization is proposed. The watershed algorithm segments

the input image into regions. Cubic spline interpolant lip color modeling and symmetry detection are used to detect the lip region from the

    segmented regions. The position of the segmented lips is passed to the

    tracking system to predict the location of the lips in the succeeding video

    frame.

Hoai Bac et al., 2010, address a narrower problem, lip tracking, which is an essential step in providing visual lip data for a lip-reading system. Inspired by the idea of AVCSR, which combines visual features with audio features to increase accuracy in noisy environments, they use the AdaBoost algorithm and a Kalman filter for the face and lip detectors.

2.3 FACIAL TRACKING USING SPEECH

Leymarie and Levine, 1993, propose a method for segmenting a noisy intensity image and tracking a non-rigid object. A technique based on an active

    contour model commonly called a snake is examined. The technique is

    applied to cell locomotion and tracking studies. The snake permits both the

    segmentation and tracking problems to be simultaneously solved in

    constrained cases. A detailed analysis of the snake model, emphasizing its

    limitations and shortcomings, is presented, and improvements to the original

    description of the model are proposed. Problems of convergence of the

    optimization scheme are considered. In particular, an improved terminating

    criterion for the optimization scheme that is based on topographic features

    of the graph of the intensity image is proposed. Hierarchical filtering

methods, as well as a continuation method based on a discrete scale-space

    representation, are discussed.

    Luettin Juergen et al., 1996, describe a robust method for extracting

    visual speech information from the shape of lips to be used for an automatic

speech reading (lip reading) system. Lip deformation is modelled by a

    statistically based deformable contour model which learns typical lip

    deformation from a training set. The main difficulty in locating and tracking

    lips consists of finding dominant image features for representing the lip

    contours. They describe the use of a statistical profile model which learns

    dominant image features from a training set. The model captures global

    intensity variation due to different illumination and different skin reflectance

    as well as intensity changes at the inner lip contour due to mouth opening

    and visibility of teeth and tongue. The method is validated for locating and

    tracking lip movements on a database of a broad variety of speakers.

Kaucic and Blake, 1998, note that human speech is inherently multi-modal, consisting of both audio and visual components. Recently researchers have

    shown that the incorporation of information about the position of the lips

into an acoustic speech recognizer enables robust recognition of noisy speech.

    In the case of Hidden Markov model recognition, they show that this

    happens because the visual signal stabilizes the alignment of states. It is

    also shown that unadorned lips, both the inner and outer contours, can be

    robustly tracked in real time on general purpose workstations. To accomplish

    this, efficient algorithms are employed which contain three key components,

    shape models, motion models, and focused color feature detectors all of

    which are learnt from examples.

Lei et al., 2004, present a robust hierarchical lip tracking

    approach (RoHiLTA) for lip-reading and audio visual speech recognition

    (AVSR) applications. Lip regions of interest are subtly detected by motion

    and facial structure information. Improvements are made on active shape

    models (ASMs) for extracting lip contours more accurately and efficiently

    from video sequences of a speaker's talking face in natural lighting

conditions and without particular make-up. Local and global ASM search

    algorithms are both improved by introducing color information, 2D mouth

    corner match, and robust estimation. For noise-free features, localization

    errors are automatically corrected by an interpolating scheme. A fast

    implementation of the hierarchical approach is also proposed. Extensive

    experiments show that the improved ASM can effectively reduce the lip

    locating errors. The fast implementation of RoHiLTA can consistently achieve

    superior performance to conventional ASMs in lip tracking tasks, and then

    can be effectively integrated in lip-reading and AVSR systems.

2.4 FACIAL TRACKING USING SKIN AND COLOR

Sobottka Karin and Pitas Ioannis, 1996, present a new approach for

    automatically segmenting and tracking of faces in color images. The

    segmentation of faces is done based on color and shape information. By

    searching for facial features, face hypotheses are verified. Afterwards

tracking is performed by using an active contour model. This ensures fast

    processing and an increase in robustness for the face recognition process.

    The exterior forces of the active contour are defined based on color features.

    Results for tracking are shown for an image sequence consisting of 150

    frames.

    Yang and Waibel, 1996, present a real-time face tracker. The system

    has achieved a rate of 30+ frames / second using an HP-9000 workstation

    with a frame grabber and a Canon VC-C1 camera. It can track a person’s

    face while the person moves freely (e.g., walks, jumps, sits down and stands

    up) in a room. Three types of models have been employed in developing the

    system. They present a stochastic model to characterize skin-color

    distributions of human faces. The information provided by the model is

    sufficient for tracking a human face in various poses and views. This model

    is adaptable to different people and different lighting conditions in real-time.

    A motion model is used to estimate image motion and to predict search

    window. A camera model is used to predict and to compensate for camera

    motion. The system can be applied to teleconferencing and many HCI

    applications including lip-reading and gaze tracking. The principle in

    developing this system can be extended to other tracking problems such as

    tracking the human hand.
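
A common way to realise such a stochastic skin-color model is a single Gaussian in normalised chromaticity space, sketched below; the single-Gaussian choice and thresholding are our assumptions, and the authors' adaptive model additionally updates its parameters over time.

```python
import numpy as np

def fit_skin_model(skin_rgb):
    """Fit a Gaussian to skin samples in normalised (r, g) chromaticity
    space; normalisation factors out brightness. `skin_rgb` is (N, 3)."""
    s = skin_rgb.astype(float)
    total = s.sum(axis=1, keepdims=True) + 1e-6
    rg = s[:, :2] / total            # r = R/(R+G+B), g = G/(R+G+B)
    return rg.mean(axis=0), np.cov(rg, rowvar=False)

def skin_likelihood(pixels_rgb, mean, cov):
    """Mahalanobis-based skin score for (N, 3) RGB pixels; thresholding
    this score yields a skin mask usable for face tracking."""
    p = pixels_rgb.astype(float)
    rg = p[:, :2] / (p.sum(axis=1, keepdims=True) + 1e-6)
    diff = rg - mean
    inv = np.linalg.inv(cov + 1e-9 * np.eye(2))
    d2 = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return np.exp(-0.5 * d2)         # unnormalised Gaussian score
```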

Jebara et al., 1997, describe automatic detection, modeling and tracking of faces in 3D. A closed loop approach is proposed which utilizes

    structure from motion to generate a 3D model of a face and then feedback

    the estimated structure to constrain feature tracking in the next frame. The

    system initializes by using skin classification, symmetry operations, 3D

warping and eigenfaces to find a face. Feature trajectories are then

    computed by SSD or correlation-based tracking. The trajectories are

    simultaneously processed by an extended Kalman filter to stably recover 3D

    structure, camera geometry and facial pose. Adaptively weighted estimation

is used in this filter by modeling the noise characteristics of the 2D image

    patch tracking technique. The structural estimate is constrained by using

    parameterized models of facial structure (eigen-heads). The Kalman filter's

    estimate of the 3D state and motion of the face predicts the trajectory of the

    features which constrains the search space for the next frame in the video

    sequence. The feature tracking and Kalman filtering closed loop system

    operates at 30Hz.

Bradski Gary, 1998, presents a first step towards a perceptual user

    interface. A computer vision color tracking algorithm is developed and

    applied towards tracking human faces. The algorithm is based on a robust

    nonparametric technique for climbing density gradients to find the mode of

    probability distributions called the mean shift algorithm. The mean shift

    algorithm is modified to deal with dynamically changing color probability

    distributions derived from video frame sequences. The modified algorithm is

    called the continuously adaptive mean shift (CAMSHIFT) algorithm.

    CAMSHIFT’s tracking accuracy is compared against a Polhemus tracker.

Bradski, 1998, develops computer vision algorithms that are intended

    to form part of a perceptual user interface. They must be able to track in

    real time yet not absorb a major share of computational resources: other

    tasks must be able to run while the visual interface is being used. The new

    algorithm developed is based on a robust nonparametric technique for

    climbing density gradients to find the mode (peak) of probability

    distributions called the mean shift algorithm. They want to find the mode of

    a color distribution within a video scene. The mean shift algorithm is

    modified to deal with dynamically changing color probability distributions

    derived from video frame sequences. The modified algorithm is called the

    Continuously Adaptive Mean Shift (CAMSHIFT) algorithm. CAMSHIFT’s

    tracking accuracy is compared against a Polhemus tracker. Tolerance to

    noise, distracters and performance is studied. CAMSHIFT is then used as a

computer interface for controlling commercial computer games and for

    exploring immersive 3D graphic worlds.
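
One CAMSHIFT-style iteration can be sketched as follows: mean-shift the search window to the centroid of the probability image inside it, then rescale the window from the zeroth moment. The scale rule below is a simplified version of Bradski's.

```python
import numpy as np

def camshift_step(prob, x, y, w, h):
    """One CAMSHIFT-style iteration on a 2-D probability image `prob`:
    move the window to the centroid of the probability mass inside it,
    then adapt the window size from the zeroth moment."""
    H, W = prob.shape
    x0, x1 = max(int(x), 0), min(int(x + w), W)
    y0, y1 = max(int(y), 0), min(int(y + h), H)
    roi = prob[y0:y1, x0:x1]
    m00 = roi.sum()
    if m00 <= 0:
        return x, y, w, h                  # no mass: keep the window
    ys, xs = np.mgrid[y0:y1, x0:x1]
    cx = (roi * xs).sum() / m00            # first moments give the centroid
    cy = (roi * ys).sum() / m00
    side = 2.0 * np.sqrt(m00)              # window scale from total mass
    return cx - side / 2, cy - side / 2, side, side

# Repeat camshift_step until the window centre moves less than a pixel.
```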

Raja Yogesh et al., 1998, describe a method used to obtain robust

    detection and tracking of people in relatively unconstrained dynamic scenes.

    Gaussian mixture models were used to estimate probability densities of color

    for skin, clothing and background. These models were used to detect, track

    and segment people, faces and hands. A technique for dynamically updating

    the models to accommodate changes in apparent color due to varying

    lighting conditions was used. Two applications are highlighted: (1) actor

    segmentation for virtual studios and (2) focus of attention for face and

    gesture recognition systems.

    Yang et al., 1998, state that a human face provides a variety of

    different communicative functions. They present approaches for real-time

    face / facial feature tracking and their applications. They present techniques

    of tracking human faces. It is revealed that human skin color can be used as

    a major feature for tracking human faces. An adaptive stochastic model has

    been developed to characterize the skin-color distributions. Based on the

    maximum likelihood method, the model parameters can be adapted for

    different people and different lighting conditions. The feasibility of the model

has been demonstrated by the development of a real-time face tracker. They

    then present a top-down approach for tracking facial features such as eyes,

    nostrils, and lip corners. These real-time tracking techniques have been

    successfully applied to many applications such as eye-gaze monitoring, head

    pose tracking, and lip-reading.

    Jordao et al., 1999, describe a method for the detection and tracking

    of human face and facial features. Skin segmentation is learnt from samples

    of an image. After detecting a moving object, the corresponding area is

    searched for clusters of pixels with a known distribution. This process is

    quite insensitive to illumination changes. The face localization procedure

looks for areas in the segmented area which resemble a head. Using simple

    heuristics, the located head is searched and its centroid is fed back to a

    camera motion control algorithm which tries to keep the face centered in the

    image using a pan-tilt camera unit. The system is capable of tracking, in

    every frame, the three main features of a human face. Since precise eye

    location is computationally intensive, an eye and mouth locator using fast

    morphological and linear filters is developed. This allows for frame-by-frame

    checking, which reduces the probability of tracking a non-basis feature,

    yielding a higher success ratio. Velocity and robustness are the main

    advantages of this fast facial feature detector.

Lievin, 2000, proposes an algorithm for a speaker's lip contour extraction. A color video sequence of the speaker's face is acquired, under natural lighting

    conditions and without any particular make-up. A logarithmic color transform

    is performed from RGB to HI (hue, intensity) color space. A Bayesian

    approach segments the mouth area using Markov random field modeling.

    Motion is combined with red hue lip information into a spatiotemporal

neighborhood. Simultaneously, a region of interest and relevant boundary

    points are automatically extracted. An active contour using spatially varying

    coefficients is initialized with the results of the preprocessing stage. An

    accurate lip shape with inner and outer borders is obtained with good quality

    results in this challenging situation.

Schwerdt and Crowley, 2000, discuss a robust tracking technique

    applied to histograms of intensity normalized color. This technique supports

    a video codec based on orthonormal basis coding. Orthonormal basis coding

    can be very efficient when the images to be coded have been normalized in

    size and position. However, an imprecise tracking procedure can have a

    negative impact on the efficiency and the quality of reconstruction of this

    technique, since it may increase the size of the required basis space. The

method has greater stability, higher precision and less jitter over

    conventional tracking techniques using color histograms.

    Zhang and Mersereau, 2000, state that the use of color information

    can significantly improve the efficiency and robustness of lip feature

    extraction capability over purely grayscale-based methods. Edge information

    provides another useful tool in characterizing lip boundaries. They present a

    method of integrating both types of information to address the problem of lip

    feature extraction for the purpose of speech reading. They first examine

    various color models and view hue as an effective descriptor to characterize

    the lips due to its invariance to luminance and human skin color, and its

    discriminative properties. They use prominent red hue as an indicator to

    locate the position of the lips. Based on the identified lip area, they further

    refine the interior and exterior lip boundary using both color and spatial edge

    information, where those two are combined within a Markov random field

    (MRF) framework.

Spors et al., 2001, present a face localization and tracking algorithm which is based upon skin color detection and principal component analysis

    (PCA) based eye localization. Skin color segmentation is performed using

    statistical models for human skin color. The skin color segmentation task

    results in a mask marking the skin color regions in the actual frame, which is

    further used to compute the position and size of the dominant facial region

    utilizing a robust statistics-based localization method. To improve the results

    of skin color segmentation a foreground / background segmentation and an

    adaptive background update scheme were added. The derived face position

is tracked with a Kalman filter.
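
Tracking the derived face position with a Kalman filter typically uses a constant-velocity model such as the sketch below; the noise levels are assumed values for illustration.

```python
import numpy as np

def make_cv_kalman(dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity Kalman matrices for a 2-D face position; q and r
    are assumed process/measurement noise levels."""
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                  [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)   # state transition
    Hm = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)  # observe (x, y)
    return F, Hm, q * np.eye(4), r * np.eye(2)

def kalman_step(x, P, z, F, Hm, Q, R):
    """One predict/update cycle: x is [px, py, vx, vy], z a measured (x, y)."""
    x = F @ x                                # predict state
    P = F @ P @ F.T + Q                      # predict covariance
    y = z - Hm @ x                           # innovation
    S = Hm @ P @ Hm.T + R
    K = P @ Hm.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ Hm) @ P
    return x, P
```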

Gargesha, 2002, states that existing techniques for facial feature point detection

    from color images include template matching, facial geometry and symmetry

    analysis, mathematical morphology, luminance and chrominance analysis,

    and PCA. These techniques are plagued by poor performance in the presence

of scale variations. A hybrid technique is proposed that employs a

    combination of the above approaches along with curvature analysis of the

    intensity surface of the face image in order to provide a superior

    performance with reduced computational complexity, even in the presence

    of scale variations.

Perez et al., 2002, propose color-based trackers for drastically shape-varying objects. The method relies on the deterministic search of a window

    whose color content matches a reference histogram color model. Relying on

    the same principle of color histogram distance, but within a probabilistic

    framework, they introduce a Monte Carlo tracking technique. The use of a

    particle filter allows them to better handle color clutter in the background, as

    well as complete occlusion of the tracked entities over few frames. The

    probabilistic approach is very flexible and can be extended in a number of

    useful ways. In particular, they introduce the following ingredients: multi-

    part color modeling to capture a rough spatial layout ignored by global

    histograms, incorporation of a background color model when relevant, and

    extension to multiple objects.
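
A minimal sampling-importance-resampling step for such a color-histogram particle filter is sketched below; the random-walk motion model, the likelihood sharpness `lam`, and the `frame_hist_fn` callback are our assumptions for illustration.

```python
import numpy as np

def particle_filter_step(particles, weights, hist_model, frame_hist_fn,
                         motion_std=5.0, lam=20.0, rng=None):
    """One SIR step for color-based tracking: diffuse the particle states
    (window centres), weight each by how well its local color histogram
    matches the reference model, then resample. `frame_hist_fn(x, y)` is
    an assumed callback returning the normalised histogram of the current
    frame around (x, y)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Predict: random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Weight: likelihood from the Bhattacharyya coefficient of histograms.
    for i, (x, y) in enumerate(particles):
        h = frame_hist_fn(x, y)
        bc = np.sum(np.sqrt(h * hist_model))       # Bhattacharyya coeff.
        weights[i] = np.exp(-lam * (1.0 - bc))
    weights = weights / weights.sum()
    estimate = weights @ particles                  # posterior mean state
    # Resample (multinomial) back to uniform weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    uniform = np.full(len(particles), 1.0 / len(particles))
    return particles[idx], uniform, estimate
```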

    Andreas et al., 2003, present a hierarchical realization of an enhanced

    active shape model for color video tracking and study the performance of

    both hierarchical and nonhierarchical implementations in the RGB, YUV, and

    HSI color spaces. Active shape models can be applied to tracking non-rigid

    objects in video image sequences.

Huang and Trivedi, 2004, note that human face analysis has been recognized as a crucial part of intelligent systems. There are several

    challenges before robust and reliable face analysis systems can be deployed

    in real-world environments. One of the main difficulties is associated with

    the detection of faces with variations in illumination conditions and viewing

    perspectives. They present the development of a computational framework

    for robust detection, tracking and pose estimation of faces captured by video

arrays. They discuss the development of a multi-primitive skin-tone and edge-

    based detection module integrated with a tracking module for efficient and

    robust detection and tracking. A multi-state continuous density Hidden

    Markov Model based pose estimation module is developed for providing an

    accurate estimate of the orientation of the face.

Varona et al., 2005, present a robust real-time 3D tracking

    system of human hands and face. This system can be used as a perceptual

    interface for virtual reality activities in a workbench environment. In front of

the virtual reality device, the user does not need any type of marker or special suit. The system includes a color segmentation module to detect in real time the skin-color pixels present in the images. The results of this skin-

    color segmentation are skin-color blobs that are the inputs of a data

    association module that labels the blobs pixels using a set of object state

    hypothesis from previous frames. The 2D tracking results are used for the

    3D reconstruction of hands and face in order to obtain the 3D positions of

    these limbs. They present several results using the H-ANIM standard to

    show the system output performance.

Stasiak and Vicente-Garcia, 2010, develop a system for parallel face detection, tracking and recognition in real-time video sequences. Particle filtering is utilized for the purpose of combined and effective

    detection, tracking and recognition. Temporal information contained in

videos is utilized. A fast, skin color-based face extraction and normalization

    technique is applied. Consequently, real-time processing is achieved.

    Implementation of face recognition mechanisms within the tracking

    framework is used for the purpose of identity recognition, and to improve

    the tracking robustness in case of multi-person tracking scenarios. Face-to-

    track assignment conflicts can often be resolved with the use of motion

modeling; however, motion-based conflict resolution can be erroneous. An identity cue can be used to improve tracking quality. They describe the concept of face

tracking corrections with the use of an identity recognition mechanism,

    implemented within a compact particle filtering-based framework for face

    detection, tracking and recognition.

    Shi and Tomasi, 1994, state that no feature-based vision system can

    work unless good features can be identified and tracked from frame to

    frame. Although tracking itself is by and large a solved problem, selecting

    features that can be tracked well and correspond to physical points in the

    world is still hard. They propose a feature selection criterion that is optimal

    by construction because it is based on how the tracker works, and a feature

    monitoring method that can detect occlusions, disocclusions, and features

    that do not correspond to points in the world. These methods are based on a

    new tracking algorithm that extends previous Newton-Raphson style search

    methods to work under affine image transformations. They test performance

    with several simulations and experiments.
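
The feature selection criterion amounts to scoring each pixel by the smaller eigenvalue of the local gradient structure tensor, as in this sketch; the window size and border handling are implementation choices of ours.

```python
import numpy as np

def shi_tomasi_response(gray, win=3):
    """'Good features to track' score: for each pixel, the smaller
    eigenvalue of the local structure tensor built from image gradients;
    features are selected where this score is large."""
    I = gray.astype(float)
    gy, gx = np.gradient(I)
    Ixx, Iyy, Ixy = gx * gx, gy * gy, gx * gy

    def box(a):
        # Box-filter over a (2*win+1)^2 window via 2-D prefix sums.
        k = 2 * win + 1
        pad = np.pad(a, win, mode='edge')
        c = pad.cumsum(0).cumsum(1)
        c = np.pad(c, ((1, 0), (1, 0)))  # prefix sums with a zero border
        return c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    # Smaller eigenvalue of [[Sxx, Sxy], [Sxy, Syy]] in closed form.
    tr = Sxx + Syy
    gap = np.sqrt((Sxx - Syy) ** 2 + 4 * Sxy ** 2)
    return 0.5 * (tr - gap)
```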

    Black et al., 1995, explore the use of local parameterized models of

    image motion for recovering and recognizing the non-rigid and articulated

    motion of human faces. Parametric models are popular for estimating motion

    in rigid scenes. They observe that within local regions in space and time,

    such models not only accurately model non-rigid facial motions but also

    provide a concise description of the motion in terms of a small number of

    parameters. These parameters are intuitively related to the motion of facial

    features during facial expressions and show how expressions can be

    recognized from the local parametric motions in the presence of significant

    head motion. The motion tracking and expression recognition approach

performs with high accuracy on movie sequences.

MacCormick and Blake, 1995, state that tracking multiple targets is a challenging

    problem, especially when the targets are identical, in the sense that the

    same model is used to describe each target. They present an observation

    density for tracking, which solves the problem by exhibiting a probabilistic

exclusion principle. Exclusion arises naturally from a systematic derivation of

the observation density, without relying on heuristics. They present

    partitioned sampling, a new sampling method for multiple object tracking.

    Partitioned sampling avoids the high computational load associated with fully

    coupled trackers, while retaining the desirable properties of coupling.

    Basu Sumit et al., 1996, describe a method for the robust tracking of

    rigid head motion from video. This method uses a 3D ellipsoidal model of the

    head and interprets the optical flow in terms of the possible rigid motions of

    the model. This method is robust to large angular and translational motions

    of the head and is not subject to the singularities of a 2D model. The method

has been successfully applied to heads with a variety of shapes and hair styles.

    This method has the advantage of accurately capturing the 3D motion

    parameters of the head. The accuracy is shown through comparison with a

    ground truth synthetic sequence. The ellipsoidal model is robust to small

    variations in the initial fit, enabling the automation of the model

    initialization.

    Darrell et al., 1996, demonstrate real-time face tracking and pose

    estimation in an unconstrained office environment with a camera. Using

vision routines previously implemented for an interactive environment, they determine the spatial location of a user's head and guide an active camera to obtain images of the face. Faces are analyzed using a set of eigenspaces indexed over both pose and world location. Closed-loop feedback from the estimated facial location is used to guide the camera when a face is present in the frontal view.

    Crowley James et al., 1997, describe a system which uses multiple

    visual processes to detect and track faces for video compression and

    transmission. The system is based on an architecture in which a supervisor

selects and activates visual processes in a cyclic manner. Control of visual

    processes is made possible by a confidence factor which accompanies each

observation. Fusion of results into a unified estimation for tracking is made

    possible by estimating a covariance matrix with each observation. Visual

    processes for face tracking are described using blink detection, normalized

    color histogram matching, and cross correlation (SSD and NCC). Ensembles

    of visual processes are organized into processing states so as to provide

    robust tracking. Transition between states is determined by events detected

by processes. The result of face detection is fed into a recursive estimator. The output from the estimator drives a PD controller for a pan/tilt/zoom camera.

    Fieguth et al., 1997, develop a simple and very fast method for object

    tracking based exclusively on color information in digitized video images.

Running on a Silicon Graphics R4600 Indy system with an IndyCam, the algorithm is capable of simultaneously tracking objects at full frame size (640 × 480 pixels) and video frame rate (50 fps). Robustness with respect to

    occlusion is achieved via an explicit hypothesis tree model of the occlusion

    process. They demonstrate the efficacy of their techniques in the challenging

task of tracking people, especially the human head and hands.

    Oliver Nuria and Pentland, 1997, describe an active-camera real-time

    system for tracking, shape description, and classification of the human face

    and mouth using only an SGI Indy computer. The system is based on use of

    2-D blob features, which are spatially-compact clusters of pixels that are

similar in terms of low-level image properties. Patterns of behavior, such as facial expressions and head movements, can be classified in real time using Hidden Markov Model (HMM) methods. The system has been tested on hundreds of

    users and has demonstrated extremely reliable and accurate performance.

Birchfield, 1998, presents an algorithm for tracking a person’s head. The head’s

    projection onto the image plane is modeled as an ellipse whose position and

    size are continually updated by a local search combining the output of a

    module concentrating on the intensity gradient around the ellipse’s

perimeter with that of another module focusing on the color histogram of the

    ellipse’s interior. These two modules have roughly orthogonal failure modes;

    they serve to complement one another. The result is a robust, real-time

    system that is able to track a person’s head with enough accuracy to

    automatically control the camera’s pan, tilt, and zoom in order to keep the

    person centered in the field of view at a desired size.
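The two modules can be combined by scoring each candidate ellipse with a boundary-gradient term and an interior color-histogram term. The sketch below illustrates the idea in Python with NumPy and OpenCV; the hue-only histogram and the histogram-intersection similarity are illustrative assumptions, not Birchfield's exact formulation. In a local search, each term would be normalized across all candidate ellipses before the two are summed, so that neither module dominates.

    import numpy as np
    import cv2

    def ellipse_terms(gray, hsv, hist_model, center, axes):
        """Return (boundary-gradient term, interior-color term) for one
        candidate ellipse; center and axes are integer tuples."""
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
        grad = cv2.magnitude(gx, gy)

        boundary = np.zeros(gray.shape, np.uint8)
        cv2.ellipse(boundary, center, axes, 0, 0, 360, 255, 1)
        g = grad[boundary > 0].mean()           # mean gradient on the perimeter

        interior = np.zeros(gray.shape, np.uint8)
        cv2.ellipse(interior, center, axes, 0, 0, 360, 255, -1)
        hist = cv2.calcHist([hsv], [0], interior, [32], [0, 180])
        cv2.normalize(hist, hist, 1.0, 0, cv2.NORM_L1)
        c = np.minimum(hist, hist_model).sum()  # histogram intersection in [0, 1]
        return g, c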

    Hager Gregory and Belhumeur, 1998, develop an efficient, general

framework for object tracking which addresses complications arising from changes in pose, changes in illumination, and partial occlusion. They first develop a computationally efficient method for handling the geometric distortions produced by changes in pose. They then combine geometry and

    illumination into an algorithm that tracks large image regions using no more

    computation than would be required to track with no accommodation for

    illumination changes. They augment these methods with techniques from

    robust statistics and treat occluded regions on the object as statistical

    outliers. Throughout, they present experimental results performed on live

    video sequences demonstrating the effectiveness and efficiency of their

    methods.

    Hager Gregory and Toyama et al., 1998, describe X Vision as a small

    set of image-level tracking primitives, and a framework for combining

    tracking primitives to form complex tracking systems. Efficiency and

    robustness are achieved by propagating geometric and temporal constraints

    to the feature detection level, where image warping and specialized image

    processing are combined to perform feature detection quickly and robustly.

    They present some of these applications as an illustration of how useful,

    robust tracking systems can be constructed by simple combinations of a few

    basic primitives combined with the appropriate task-specific constraints.

Colmenarez, 1999, presents a system that extracts information from video to keep track of people, recognize their facial expressions and gestures, and complement other forms of human-computer interfaces. A learning technique based on information-theoretic discrimination is used to construct face and facial feature detectors. A real-time system for face and facial feature detection and tracking in continuous video is implemented. A probabilistic framework for embedded face and facial expression recognition from image sequences is obtained.

Harold Hualu Wang and Chang, 1999, present FaceTrack, a system

    that detects, tracks, and groups faces from compressed video data. They

    introduce the face tracking framework based on the Kalman filter and

    multiple hypothesis techniques. They compare and discuss the effects of

    various motion models on tracking performance. They investigate constant-

    velocity, constant-acceleration, correlated-acceleration, and variable-

    dimension-filter models. They find that constant-velocity and correlated-

    acceleration models work more effectively for commercial videos sampled at

    high frame rates. They also develop novel approaches based on multiple

    hypothesis techniques to resolving ambiguity issues. Simulation results show

    the effectiveness of the proposed algorithms on tracking faces in real

    applications.
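A constant-velocity model of the kind compared in that study keeps the face position and velocity in the state vector and predicts linearly between detections. Below is a minimal NumPy sketch of such a Kalman filter for one position pair; the noise covariances are illustrative assumptions.

    import numpy as np

    dt = 1.0                       # frame interval
    F = np.array([[1, 0, dt, 0],   # state: [x, y, vx, vy]
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], float)
    H = np.array([[1, 0, 0, 0],    # only the position is observed
                  [0, 1, 0, 0]], float)
    Q = 0.01 * np.eye(4)           # process noise (assumed)
    R = 4.0 * np.eye(2)            # measurement noise (assumed)

    x = np.zeros(4)                # initial state
    P = np.eye(4)

    def step(z):
        """One Kalman predict/update cycle given a face detection z = (x, y)."""
        global x, P
        x = F @ x                        # predict state
        P = F @ P @ F.T + Q
        S = H @ P @ H.T + R              # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
        x = x + K @ (np.asarray(z) - H @ x)
        P = (np.eye(4) - K @ H) @ P
        return x[:2]                     # filtered face position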

Vieux et al., 1999, use a face-tracking system developed in the robotics area to normalize a video sequence to centered images of the face. The face tracking allowed them to implement a compression scheme based on Principal Component Analysis (PCA), which they call Orthonormal Basis Coding (OBC).

    Comaniciu et al., 2000, propose a new method for real-time tracking

    of non-rigid objects seen from a moving camera. The central computational

    module is based on the mean shift iterations and finds the most probable

target position in the current frame. The dissimilarity between the target model and the target candidates is expressed by a metric derived from the

    Bhattacharyya coefficient. The theoretical analysis of the approach shows

    that it relates to the Bayesian framework while providing a practical, fast

and efficient solution. The capability of the tracker to handle in real-time

    partial occlusions, significant clutter, and target scale variations is

    demonstrated for several image sequences.
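The Bhattacharyya coefficient between a normalized candidate histogram p and the target-model histogram q is rho = sum over u of sqrt(p_u * q_u), and the mean shift iterations minimize the derived distance sqrt(1 - rho). A short NumPy illustration (histogram construction details are assumed):

    import numpy as np

    def bhattacharyya_similarity(p, q):
        """Bhattacharyya coefficient between two color histograms."""
        p = p / p.sum()
        q = q / q.sum()
        return np.sum(np.sqrt(p * q))   # 1.0 means identical distributions

    def bhattacharyya_distance(p, q):
        """Distance minimized by the mean shift tracker."""
        return np.sqrt(max(0.0, 1.0 - bhattacharyya_similarity(p, q)))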

    Feris Rogério Schmidt et al., 2000, present a real time system for

detection and tracking of facial features in video sequences. Such a system

    may be used in visual communication applications, such as teleconferencing,

    virtual reality, intelligent interfaces, human machine interaction, and

    surveillance. They have used a statistical skin-color model to segment face-

    candidate regions in the image. The presence or absence of a face in each

    region is verified by means of an eye detector, based on an efficient

    template matching scheme. Once a face is detected, the pupils, nostrils and

    lip corners are located and these facial features are tracked in the image

    sequence, performing real time processing.
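Eye verification inside a skin-color candidate region can be sketched with normalized cross-correlation template matching, as below; the acceptance threshold and the use of OpenCV's matchTemplate are illustrative assumptions rather than the authors' exact scheme.

    import cv2

    def contains_eye(region_gray, eye_template_gray, threshold=0.6):
        """Accept a face-candidate region if an eye template matches inside it."""
        res = cv2.matchTemplate(region_gray, eye_template_gray,
                                cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(res)
        return max_val >= threshold, max_loc  # decision and best match position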

    Liu Zhu and Wang, 2000, propose a new approach for combined face

    detection and tracking in video. The face detection algorithm is a fast

    template matching procedure using iterative dynamic programming (DP).

Schneiderman and Kanade, 2000, describe a statistical method for 3D

    object detection. They represent the statistics of both object appearance and

    “non-object” appearance using a product of histograms. Each histogram

    represents the joint statistics of a subset of wavelet coefficients and their

    position on the object. Their approach is to use many such histograms

    representing a wide variety of visual attributes. Using this method, they

    have developed the first algorithm that can reliably detect human faces with

    out-of-plane rotation and the first algorithm that can reliably detect

    passenger cars over a wide range of viewpoints.

Shan et al., 2001, present a model-based bundle adjustment algorithm

    to recover the 3D model of a scene / object from a sequence of images with

    unknown motions. Instead of representing scene / object by a collection of

    isolated 3D features (usually points), their algorithm uses a surface

controlled by a small set of parameters. Compared with previous model-based approaches, their approach has the following advantages. Instead of using the model space as a regularizer, they directly use it as their search

    space, thus resulting in a more elegant formulation with fewer unknowns

    and fewer equations. Their algorithm automatically associates tracked points

    with their correct locations on the surfaces, thereby eliminating the need for

    a prior 2D-to-3D association. Regarding face modeling, they use a very

    small set of face metrics to parameterize the face geometry, resulting in a

    smaller search space and a better posed system.

Toyama and Blake, 2001, present a probabilistic paradigm for visual

    tracking. Probabilistic mechanisms are attractive because they handle fusion

    of information, especially temporal fusion, in a principled manner. Exemplars

    are selected as representatives of raw training data. They represent

    probabilistic mixture distributions of object configurations. Their use avoids

    tedious hand-construction of object models, and problems with changes of

    topology. Using exemplars in place of a parameterized model poses several

    challenges. It uses a noise model that is learned from training data. It

    eliminates any need for an assumption of probabilistic pixel wise

    independence.

    Arulampalam et al., 2002, review both optimal and suboptimal

    Bayesian algorithms for nonlinear / non-Gaussian tracking problems, with a

    focus on particle filters. Particle filters are sequential Monte Carlo methods

    based on point mass representations of probability densities, which can be

applied to any state-space model and which generalize the traditional Kalman

    filtering methods. Several variants of the particle filter such as SIR, ASIR,

    and RPF are introduced within a generic framework of the sequential

    importance sampling (SIS) algorithm and compared with the standard EKF.
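A minimal sketch of one SIR (sampling importance resampling) step, with a random-walk motion model and a user-supplied measurement likelihood, is given below in NumPy; the motion-noise scale and the likelihood function are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def sir_step(particles, weights, likelihood, motion_std=2.0):
        """One SIR particle-filter step: predict, weight, resample."""
        n = len(particles)
        # Predict: propagate each particle through a random-walk motion model.
        particles = particles + rng.normal(0.0, motion_std, particles.shape)
        # Update: weight each particle by the measurement likelihood.
        weights = weights * np.array([likelihood(p) for p in particles])
        weights /= weights.sum()
        # Resample (multinomial) to combat weight degeneracy.
        idx = rng.choice(n, size=n, p=weights)
        return particles[idx], np.full(n, 1.0 / n)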

    Chiang et al., 2003, present a real-time face detection algorithm for

    locating faces in images and videos. This algorithm finds not only the face

regions, but also the precise locations of the facial components such as eyes

    and lips. The algorithm starts from the extraction of skin pixels based upon

    rules derived from a simple quadratic polynomial model. With a minor

    modification, this polynomial model is also applicable to the extraction of

    lips. The benefits of applying these two similar polynomial models are

twofold. First, much computation time is saved. Second, both extraction

    processes can be performed simultaneously in one scan of the image or

    video frame. The eye components are then extracted after the extraction of

    skin pixels and lips. The algorithm removes the falsely extracted components

    by verifying with rules derived from the spatial and geometrical relationships

    of facial components. The precise face regions are determined accordingly.

    According to the experimental results, the proposed algorithm exhibits

    satisfactory performance in terms of both accuracy and speed for detecting

    faces with wide variations in size.
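Skin extraction of this kind reduces to evaluating a per-pixel rule on the color channels. The sketch below substitutes a widely cited explicit RGB rule for the authors' quadratic polynomial model, whose exact coefficients are not reproduced here; it is illustrative only.

    import numpy as np

    def skin_mask(rgb):
        """Per-pixel skin classification with an explicit RGB rule
        (illustrative stand-in for the paper's quadratic polynomial model)."""
        r = rgb[..., 0].astype(int)
        g = rgb[..., 1].astype(int)
        b = rgb[..., 2].astype(int)
        mx = np.maximum(np.maximum(r, g), b)
        mn = np.minimum(np.minimum(r, g), b)
        return ((r > 95) & (g > 40) & (b > 20) &
                (mx - mn > 15) &                      # sufficient color spread
                (np.abs(r - g) > 15) & (r > g) & (r > b))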

Verma et al., 2003, present a probabilistic method for detecting and

    tracking multiple faces in a video sequence. The proposed method integrates

    the information of face probabilities provided by the detector and the

    temporal information provided by the tracker to produce a method superior

    to the available detection and tracking methods. They claim 1) Accumulation

    of probabilities of detection over a sequence. This leads to coherent

    detection over time and, improves detection results. 2) Prediction of the

    detection parameters which are position, scale, and pose. This guarantees

    the accuracy of accumulation as well as a continuous detection. 3) The

    representation of pose is based on the combination of two detectors, one for

    frontal views and one for profiles.

Zhou et al., 2003, propose a time series state space model to

    fuse temporal information in a probe video, which simultaneously

    characterizes the kinematics and identity using a motion vector and an

    identity variable, respectively. The joint posterior distribution of the motion

vector and the identity variable is estimated at each time instant and then propagated to the next time instant. Marginalization over the motion vector

    yields a robust estimate of the posterior distribution of the identity variable.

    A computationally efficient sequential importance sampling (SIS) algorithm

is developed to estimate the posterior distribution. Owing to the propagation of the identity variable over time, degeneracy in the posterior probability of the identity variable is achieved, giving improved recognition. The gallery is generalized

    to videos in order to realize video-to-video recognition. An exemplar-based

    learning strategy is adopted to automatically select video representatives

    from the gallery, serving as mixture centers in an updated likelihood

    measure. The SIS algorithm is applied to approximate the posterior

    distribution of the motion vector, the identity variable, and the exemplar

    index, whose marginal distribution of the identity variable produces the

    recognition result. The model formulation is very general and it allows a

    variety of image representations and transformations.

    Okuma Kenji et al., 2004, introduce a vision system that is capable of

    learning, detecting and tracking the objects of interest. The system is

    demonstrated in the context of tracking hockey players using video

    sequences. Their approach combines the strengths of two successful

    algorithms: mixture particle filters and Adaboost. The mixture particle filter

    is ideally suited to multi-target tracking as it assigns a mixture component to

    each player. The crucial design issues in mixture particle filters are the

    choice of the proposal distribution and the treatment of objects leaving and

    entering the scene. They construct the proposal distribution using a mixture

    model that incorporates information from the dynamic models of each player

    and the detection hypotheses generated by Adaboost. The learned Adaboost

proposal distribution allows the system to quickly detect players entering the scene, while the filtering process enables it to keep track of the individual players.

Perez, 2004, states that the effectiveness of probabilistic tracking of objects in image sequences has been revolutionized by the development of particle filtering. Whereas Kalman filters are restricted to Gaussian distributions, particle filters can propagate more general distributions, albeit only approximately.

    This is of particular benefit in visual tracking because of the inherent

    ambiguity of the visual world that stems from its richness and complexity.

    One important advantage of the particle filtering framework is that it allows

    the information from different measurement sources to be fused in a

    principled manner. They introduce generic importance sampling mechanisms

    for data fusion and discuss them for fusing color with either stereo sound,

    for teleconferencing, or with motion, for surveillance with a still camera.

    They show how each of the three cues can be modeled by an appropriate

    data likelihood function, and how the intermittent cues (sound or motion)

    are best handled by generating proposal distributions from their likelihood

    functions. The effective fusion of the cues by particle filtering is

    demonstrated on real teleconference and surveillance data.

Vacchetti et al., 2004, propose an efficient real-time solution for

    tracking rigid objects in 3D using a single camera that can handle large

    camera displacements, drastic aspect changes, and partial occlusions. While

    commercial products are already available for offline camera registration,

    robust online tracking remains an open issue because many real-time

    algorithms described in the literature still lack robustness and are prone to

    drift and jitter. To address these problems, they have formulated the

    tracking problem in terms of local bundle adjustment and have developed a

    method for establishing image correspondences that can equally well handle

    short and wide baseline matching. They then can merge the information

    from preceding frames with that provided by a very limited number of key

    frames created during a training stage, which results in a real-time tracker

    that does not jitter or drift and can deal with significant aspect changes.

Dong-gil Jeong et al., 2005, propose a robust real-time head tracking

    algorithm using a pan-tilt-zoom camera. They assume the shape of a head is

    an ellipse and a model color histogram is acquired in advance. In the first

    frame, the appropriate position and scale of the head is determined based

    on the user input. In the subsequent frames, the initial position is selected

    at the same position of the ellipse as in the previous frame. The mean shift

    procedure is applied to make the ellipse position converge to the target

    center where the color histogram similarity to the model and previous one is

maximized. Here, the previous histogram denotes a color histogram adaptively

    extracted from the result of the previous frame. The position-adjusted ellipse

    is refined by using color and shape information. Large background motion

    often prohibits the initial position from converging to the target position.

    They estimate a robust initial position by compensating the background

    motion. They use vertical and horizontal 1-D projection datasets. Extensive

    experiments prove that a head is well tracked even when the person moves

    fast and the scale of the head changes drastically.
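The mean shift convergence step used here is available directly in OpenCV through histogram back projection, as sketched below; the hue-only histogram and the termination criteria are illustrative assumptions.

    import cv2

    def make_model(frame, window):
        """Model histogram from a user-selected head region (x, y, w, h)."""
        x, y, w, h = window
        hsv = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0], None, [32], [0, 180])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        return hist

    def track(frame, window, hist):
        """Shift the window to the local maximum of histogram similarity."""
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
        _, window = cv2.meanShift(backproj, window, criteria)
        return window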

Fidaleo Douglas et al., 2005, provide an extensive analysis of a state-of-the-art key frame-based tracker, quantitatively demonstrating the

    dependence of tracking performance on underlying mesh accuracy, number

    and coverage of reliably matched feature points, and initial key frame

    alignment. 3D tracking of faces in video streams is a difficult problem that

    can be assisted with the use of a priori knowledge of the structure and

    appearance of the subject’s face at predefined poses (key frames). Tracking

    with a generic face mesh can introduce an erroneous bias that leads to

    degraded tracking performance when the subject’s out-of-plane motion is far

    from the set of key frames. To reduce this bias, they show how online

    refinement of a rough estimate of face geometry may be used to re-estimate

the 3D key frame features, thereby mitigating sensitivities to initial key

    frame inaccuracies in pose and geometry. An in-depth analysis is performed

on sequences of faces with synthesized rigid head motion. Subsequent trials

    on real video sequences demonstrate that tracking performance is more

    sensitive to initial model alignment and geometry errors when fewer feature

    points are matched and/or do not adequately span the face. The analysis

    suggests several indications for most effective 3D tracking of faces in real

    environments.

Hampapur et al., 2005, state that situation awareness is the key to security. Awareness requires information that spans multiple scales of space and time. Smart video surveillance systems are capable of enhancing situational awareness across multiple scales of space and time; at the present time, however, the component technologies are evolving in isolation. To provide comprehensive, nonintrusive situation awareness, it is imperative to address the challenge of multi-scale, spatiotemporal tracking. This article explores the concepts of multi-scale spatiotemporal tracking through the use of real-time video

    analysis, active cameras, multiple object models, and long-term pattern

    analysis to provide comprehensive situation awareness.

Koterba Seth et al., 2005, study the relationship between multi-view

    Active Appearance Model (AAM) fitting and camera calibration. They propose

    to calibrate the relative orientation of a set of N > 1 cameras by fitting an

    AAM to sets of N images. They use the human face as a (non-rigid)

calibration grid. The algorithm calibrates a set of 2 × 3 weak-perspective camera

    projection matrices, projections of the world coordinate system origin into

    the images, depths of the world coordinate system origin, and focal lengths.
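A 2 × 3 weak-perspective camera of this kind projects a 3D point by a scaled orthographic map, x = s * R2x3 * X + t, where the scale s plays the role of focal length divided by average depth. A small NumPy illustration, with all numeric values assumed:

    import numpy as np

    def weak_perspective_project(X, R, s, t):
        """Project 3D points X (N x 3) with a weak-perspective camera:
        scale s, top two rows of a rotation R (2 x 3), translation t (2,)."""
        return s * (X @ R.T) + t   # N x 2 image points

    # Illustrative values (assumptions, not calibrated quantities):
    R = np.eye(3)[:2]              # top two rows of a rotation matrix
    s = 0.8                        # focal length / mean depth
    t = np.array([160.0, 120.0])   # image-plane offset
    pts = weak_perspective_project(np.array([[0.0, 0.0, 5.0]]), R, s, t)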

    Roy-Chowdhury et al., 2005, present two algorithms for 3D face

    modeling from a monocular video sequence. The first method is based on

    Structure from Motion (SFM), while the second one relies on contour

    adaptation over time. The SFM based method incorporates statistical

    measures of quality of the 3D estimate into the reconstruction algorithm.

    The initial multi-frame SFM estimate is smoothed using a generic face model

in an energy function minimization framework. Such a strategy avoids

    excessively biasing the final 3D estimate towards the generic model. The

    second method relies on matching a generic 3D face model to the outer

    contours of a face in the input video sequence, and integrating this strategy

    over all the frames in the sequence. It consists of an edge-based head pose

    estimation step, followed by global and local deformations of the generic

    face model in order to adapt it to the actual 3D face. This contour adaptation

    approach is able to separate the geometric subtleties of the human head

    from the variations in shading and texture and it does not rely on finding

    accurate point correspondences across frames.

    Adam et al., 2006, present an algorithm for tracking an object in a

    video sequence. The template object is represented by multiple image

    fragments or patches. The patches are arbitrary and are not based on an

    object model. Every patch votes on the possible positions and scales of the

    object in the current frame, by comparing its histogram with the

    corresponding image patch histogram.
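Fragment-based voting of this kind compares each template patch histogram with the histogram of the corresponding patch in a candidate window and combines the per-patch similarities robustly, so that occluded fragments do not dominate. A minimal grayscale sketch in NumPy follows; the 3 × 3 patch grid, the L1 histogram similarity and the median vote are illustrative assumptions, not the authors' exact choices.

    import numpy as np

    def patch_hist(img, bins=16):
        h, _ = np.histogram(img, bins=bins, range=(0, 256))
        return h / max(h.sum(), 1)

    def fragments_vote(template, candidate, grid=(3, 3)):
        """Score a candidate window (same size as the template) by robustly
        combining per-patch histogram similarities."""
        th, tw = template.shape
        sims = []
        for i in range(grid[0]):
            for j in range(grid[1]):
                ys = slice(i * th // grid[0], (i + 1) * th // grid[0])
                xs = slice(j * tw // grid[1], (j + 1) * tw // grid[1])
                p = patch_hist(template[ys, xs])
                q = patch_hist(candidate[ys, xs])
                sims.append(1.0 - 0.5 * np.abs(p - q).sum())  # in [0, 1]
        return float(np.median(sims))  # robust to occluded fragments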

    Dedeoglu et al., 2006, describe active appearance models (AAM) as

    compact representations of the shape and appearance of objects. Fitting

    AAMs to images is a difficult, non-linear optimization task. Traditional

    approaches minimize the L2 norm error between the model instance and the

    input image warped onto the model coordinate frame. While this works well

    for high resolution data, the fitting accuracy degrades quickly at lower

resolutions. They show that a careful design of the fitting criterion can overcome many of the low-resolution challenges. In their resolution-aware formulation

    (RAF), they explicitly account for the finite size sensing elements of digital

    cameras, and simultaneously model the processes of object appearance

variation, geometric deformation, and image formation. Their Gauss-Newton gradient descent algorithm not only synthesizes model instances as a function of the estimated parameters but also simulates the formation of low-resolution images in a digital camera. They compare the RAF algorithm against a state-

    of-the-art tracker across a variety of resolution and model complexity levels.

Fonseca Pedro Miguel et al., 2006, state that a compressed-domain generic object tracking algorithm offers, in combination with a face detection algorithm, a low-computational-cost solution to the problem of detecting and locating faces in frames of compressed video sequences (such as MPEG-1 or MPEG-2). Objects such as faces can thus be tracked through a compressed video stream using motion information provided by existing forward and backward motion vectors. The described solution requires only low computational resources on CE devices while at the same time offering sufficiently good location rates.

Lu Le and Dai Xiangtan, 2006, present a hybrid sampling solution that combines RANSAC and particle filtering. RANSAC provides proposal particles

    that, with high probability, represent the observation likelihood. Both

    conditionally independent RANSAC sampling and boosting-like conditionally

    dependent RANSAC sampling are explored. They show that the use of

    RANSAC-guided sampling reduces the necessary number of particles to

    dozens for a full 3D tracking problem. The algorithm has been applied to the

    problem of 3D face pose tracking with changing expression. They

demonstrate the validity of the approach with several video sequences acquired

    in an unstructured environment.

Xu and Roy-Chowdhury, 2007, present a theory for combining the effects of motion, illumination, 3D structure, and camera parameters in

    a sequence of images obtained by a perspective camera. The set of all

    Lambertian reflectance functions of a moving object, at any position,

    illuminated by arbitrarily distant light sources, lies “close” to a bilinear

    subspace consisting of nine illumination variables and six motion variables.

    This result implies that, given an arbitrary video sequence, it is possible to

    recover the 3D structure, motion and illumination conditions simultaneously

using the bilinear subspace formulation. The derivation builds upon existing

    work on linear subspace representations of reflectance by generalizing it to

    moving objects. Lighting can change slowly or suddenly, locally or globally,

    and can originate from a combination of point and extended sources. They

    experimentally compare the results of their theory with ground truth data

    and also provide results on real data by using video sequences of a 3D face

    and the entire human body with various combinations of motion and

    illumination directions. They show results of their theory in estimating 3D

    motion and illumination model parameters from a video sequence.

Yu et al., 2007, propose a method to incrementally super-resolve

    3D facial texture by integrating information frame by frame from a video

    captured under changing poses and illuminations. They recover illumination,

3D motion and shape parameters from their tracking algorithm. This

    information is then used to super-resolve 3D texture using Iterative Back-

    Projection (IBP) method. The super-resolved texture is fed back to the

    tracking part to improve the estimation of illumination and motion

    parameters. This closed-loop process continues to refine the texture as new

    frames come in. They also propose a local-region based scheme to handle

    non-rigidity of the human face.

Stasiak and Pacut, 2008, develop a system for parallel face detection, tracking and recognition in real-time video sequences. They

    describe its face detection and tracking modules. The solution is based on

    the particle filtering in the conditional density propagation framework of

Isard and Blake, and utilizes color information at different levels of detail.

    The use of color makes processing computationally cheap and robust in

    finding candidates for further processing.

Suandi et al., 2008, describe a technique to estimate human

    face pose from color video sequence using Dynamic Bayesian Network

    (DBN). As face and facial features trackers usually track eyes, pupils, mouth

corners and the skin region (face), their proposed method utilizes merely three of these features (pupils, mouth center and skin region) to compute the

    evidence for DBN inference. No additional image processing algorithm is

required; thus, it is simple and operates in real-time. The evidence values, called the horizontal ratio and the vertical ratio, are determined using a model-based technique and are designed to simultaneously solve two problems in the tracking task: scale factor and noise influence.

Valenti and Gevers, 2008, note that the ubiquitous application of eye tracking is precluded by the requirement of dedicated and expensive hardware, such as infrared high-definition cameras. Systems based solely on appearance have therefore been proposed in the literature. Although these systems are able to successfully locate eyes, their accuracy is significantly lower than that of commercial eye-tracking devices. Their aim is to perform very accurate eye center location and tracking using a simple webcam. By means of a novel relevance mechanism, the proposed method makes use of isophote properties to gain invariance to linear lighting changes, to achieve rotational invariance and to

    keep low computational costs. They test their approach for accurate eye

    location and robustness to changes in illumination and pose, using the BioID

    and the Yale Face B databases. They demonstrate that their system can

    achieve a considerable improvement in accuracy over state of the art

    techniques.

Yung et al., 2011, survey the state-of-the-art progress on visual tracking methods, classify them into different categories, and identify future trends. Visual tracking is a fundamental task in many computer vision applications and has been well studied in recent decades. Robust visual

    tracking remains a huge challenge. Difficulties in visual tracking can arise

    due to abrupt object motion, appearance pattern change, non-rigid object

    structures, occlusion and camera motion. They first analyze the state-of-the-

    art feature descriptors which are used to represent the appearance of

tracked objects. Then, they categorize tracking approaches into three groups, provide detailed descriptions of representative methods in each group, and examine their positive and negative aspects as well as the future trends for visual tracking research.

    2.5 SUMMARY

    This chapter has presented the various methods used for face tracking

in a continuous video. Face tracking based on local features such as eyebrows, lips, and mouth, as well as on skin color, has been presented. Chapter 3 presents the feature extraction.