
CHAPTER 2

    LITERATURE SURVEY

    2.1 INTRODUCTION

    This chapter presents a detailed literature survey on facial tracking

    using lip movement, skin color and mouth movement in a video sequence.

Automatic facial feature extraction, 3D model shaping, and algorithms for robust segmentation of various facial parts designed by various authors are discussed.

    2.2 FACIAL TRACKING USING LIP READING

    Yuille et al., 1992, develop an automatic facial feature extraction

    system, which is able to identify the detailed shape of eyes, eyebrows and

    mouth from facial images. The developed system not only extracts the

    location information of the features, but also estimates the parameters

pertaining to the contours and parts of the features using a parametric

    deformable templates approach. In order to extract facial features,

    deformable models for each of eye, eyebrow, and mouth are developed. The

    development steps of the geometry, imaging model and matching

    algorithms, and energy functions for each of these templates are presented

    in detail, along with the important implementation issues. An eigenface

    based multi-scale face detection algorithm which incorporates standard facial

    proportions is implemented, so that when a face is detected, the rough

    search regions for the facial features are readily available. The developed

system is tested on JAFFE (Japanese Female Facial Expression Database),

    Yale Faces, and ORL (Olivetti Research Laboratory) face image databases.

    The performance of each deformable template and the face detection

    algorithm are discussed separately.
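
To make the deformable-template idea concrete, the sketch below fits a simple circular template (for instance, an iris) to an edge-strength map by greedy local search over its parameters. It is only an illustration of energy minimization over template parameters; the function names and the coordinate-descent search are ours, not Yuille et al.'s actual formulation, which uses richer templates and energy terms.

```python
import numpy as np

def fit_circle_template(edge_map, x0, y0, r0, steps=50):
    """Fit a circular deformable template to an edge-strength map by greedy
    local search over its parameters (x, y, r). The energy rewards strong
    edges along the template contour."""
    h, w = edge_map.shape
    theta = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)

    def energy(x, y, r):
        # Sample the edge map along the circle; lower energy = stronger edges.
        xs = np.clip((x + r * np.cos(theta)).astype(int), 0, w - 1)
        ys = np.clip((y + r * np.sin(theta)).astype(int), 0, h - 1)
        return -edge_map[ys, xs].mean()

    params = np.array([x0, y0, r0], dtype=float)
    for _ in range(steps):
        best = (energy(*params), params.copy())
        # Try small perturbations of each parameter (coordinate descent).
        for i in range(3):
            for d in (-1.0, 1.0):
                trial = params.copy()
                trial[i] += d
                e = energy(*trial)
                if e < best[0]:
                    best = (e, trial)
        if np.allclose(best[1], params):
            break  # local minimum reached
        params = best[1]
    return params

# Toy usage: a synthetic edge ring centred at (40, 40) with radius 10.
yy, xx = np.mgrid[0:80, 0:80]
ring = np.exp(-((np.hypot(xx - 40, yy - 40) - 10.0) ** 2))
print(fit_circle_template(ring, 35, 35, 8))  # converges near (40, 40, 10)
```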

Rabiner, 1993, states that although the face detection algorithm is

    designed for frontal face, the same mechanism can also be applied to track

non-frontal faces with online adapted face models. Due to the essence of

    template matching, the algorithm is capable of comparing the similarity

    among different faces, which makes it suitable for tracking the same face

that occurs at disjoint temporal locations in video. While the proposed face detection method provides accuracy comparable to that of the neural network-based approach, it is much faster.

    Terzopoulos et al., 1993, present a new approach to the analysis of

    dynamic facial images for the purposes of estimating and resynthesizing

    dynamic facial expressions. The approach exploits a sophisticated generative

    model of the human face originally developed for realistic facial animation.

    The face model, which may be simulated and rendered at interactive rates

    on a graphics workstation, incorporates a physics-based synthetic facial

    tissue and a set of anatomically motivated facial muscle actuators. They

    consider the estimation of dynamic facial muscle contractions from video

    sequences of expressive human faces. They develop an estimation technique

    that uses deformable contour models (snakes) to track the non-rigid motions

of facial features in video images.
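
As an illustration of the snake idea, the following minimal Python sketch implements a greedy active contour: each point moves to the neighbouring pixel that best trades smoothness against attraction to strong image gradients. It is a simplified stand-in for the authors' formulation, and all names are ours.

```python
import numpy as np

def greedy_snake(points, grad_mag, iters=100, alpha=0.5):
    """Greedy snake: each contour point moves to the 8-neighbour position
    minimising (elastic energy - image energy). `points` is an (N, 2) array
    of (x, y); `grad_mag` is the image gradient magnitude."""
    h, w = grad_mag.shape
    offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
    pts = points.astype(float).copy()
    for _ in range(iters):
        moved = False
        for i in range(len(pts)):
            prev, nxt = pts[i - 1], pts[(i + 1) % len(pts)]  # closed contour
            best_e, best_p = np.inf, pts[i]
            for dx, dy in offsets:
                p = pts[i] + (dx, dy)
                if not (0 <= p[0] < w and 0 <= p[1] < h):
                    continue
                elastic = np.sum((p - (prev + nxt) / 2.0) ** 2)  # smoothness
                image = -grad_mag[int(p[1]), int(p[0])]          # edge pull
                e = alpha * elastic + image
                if e < best_e:
                    best_e, best_p = e, p
            if not np.array_equal(best_p, pts[i]):
                pts[i], moved = best_p, True
        if not moved:
            break  # converged: no point moved this sweep
    return pts
```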

    Lanitis et al., 1994, present flexible shape and flexible grey-level

    models for representing variations in the appearance of human faces. These

    models are controlled by a small number of parameters which can be used

    for coding and reconstructing a face image.

Jacquin Arnaud et al., 1995, address the issue of automatically

    tracking the faces and facial features of persons in head-and-shoulders video

    sequences. They propose two totally automatic algorithms which

    respectively perform the detection of head outlines and identify rectangular

eyes-nose-mouth regions, both from down-sampled binary thresholded edge images. Unlike methods that have been proposed recently, a priori assumptions regarding the nature and content of the sequences to be coded are minimal for their techniques, and the algorithms operate accurately and robustly, even in

    cases of significant head rotation or partial occlusion by moving objects.

    Gavrila and Davis, 1996, present a vision system for the 3-D model-

    based tracking of unconstrained human movement. Using image sequences

    acquired simultaneously from multiple views, they recover the 3D body pose

    at each time instant without the use of markers. The pose recovery problem

    is formulated as a search problem and entails finding the pose parameters of

    a graphical human model whose synthesized appearance is most similar to

    the actual appearance of the real human in the multi-view images. The

    models used for this purpose are acquired from the images. They use a

    decomposition approach and a best-first technique to search through the

    high dimensional pose parameter space. A robust variant of chamfer

    matching is used as a fast similarity measure between synthesized and real

    edge images. They present initial tracking results from a large new human-

in-action database containing more than 2500 frames in each of four

    orthogonal views. The four image streams are synchronized. They contain

subjects involved in a variety of activities of various degrees of complexity, ranging from simple one-person hand waving to the challenging close interaction of two persons in the Argentine Tango.
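
The chamfer similarity measure mentioned above can be sketched compactly: compute a distance transform of the scene edge map and average it over the template's edge points. This illustrative Python version (function names ours) assumes SciPy is available; Gavrila and Davis use a more robust variant.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(scene_edges, template_points, dx, dy):
    """Chamfer-style dissimilarity between a binary scene edge map and a set
    of integer (x, y) template edge points placed at offset (dx, dy): the
    mean distance from each shifted template point to the nearest scene
    edge. Lower is better; minimise over (dx, dy) to localise the model."""
    # Distance of every pixel to the nearest edge pixel (edges become zeros).
    dist = distance_transform_edt(~scene_edges.astype(bool))
    h, w = dist.shape
    xs = np.clip(template_points[:, 0] + dx, 0, w - 1)
    ys = np.clip(template_points[:, 1] + dy, 0, h - 1)
    return dist[ys, xs].mean()
```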

McKenna et al., 1996, describe a dynamic face tracking system based on integrated motion-based object tracking and a model-based face detection framework. The motion-based tracker focuses attention for the

    face detector whilst the latter aids the tracking process. The system

    produces segmented face sequences from complex scenes with poor viewing

    conditions in surveillance applications. They also investigate a Gabor wavelet

    transform as a representation scheme for capturing head rotations in depth.

    Principal components analysis was used to visualize the manifolds described

    by pose change. Heinzmann and Zelinsky, 1997, state that people naturally

    express themselves through facial gestures. They have implemented an

interface that tracks a person's facial features robustly in real time (30Hz)

    and does not require artificial artifacts such as special illumination or facial

    makeup. Even if features become occluded, the system is capable of

    recovering tracking in a couple of frames after the features reappear in the

    image. Based on this fault tolerant face tracker they have implemented real

    time gesture recognition capable of distinguishing 12 different gestures

    ranging from "yes", "no" and "may be" to winks, blinks and "asleep".

Sanchez et al., 1997, present a method for lip tracking intended to support personal verification. Lip contours are represented by means of

    quadratic B-splines. The lips are automatically localized in the original image

    and an elliptic B-spline is generated to start up tracking. Lip localization

    exploits grey-level gradient projections as well as chromaticity models to

    find the lips in an automatically segmented region corresponding to the face

    area. Tracking proceeds by estimating new lip contour positions according to

    a statistical chromaticity model for the lips. The current tracker

    implementation follows a deterministic second order model for the spline

    motion based on a Lagrangian formulation of contour dynamics. The method

    has been tested on the M2VTS database. Lips were accurately tracked on

sequences consisting of more than a hundred frames.
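
For reference, a closed uniform quadratic B-spline of the kind used to represent lip contours can be evaluated from its control points as in the sketch below (an illustration, not the authors' implementation).

```python
import numpy as np

def quadratic_bspline(control_points, samples_per_span=20):
    """Evaluate a closed uniform quadratic B-spline from its control polygon.
    Each span blends three consecutive control points with the standard
    quadratic B-spline basis (the three weights sum to 1 for every t)."""
    P = np.asarray(control_points, dtype=float)
    n = len(P)
    t = np.linspace(0.0, 1.0, samples_per_span, endpoint=False)
    # Standard uniform quadratic B-spline basis functions.
    b0 = 0.5 * (1.0 - t) ** 2
    b1 = 0.5 + t * (1.0 - t)
    b2 = 0.5 * t ** 2
    curve = []
    for i in range(n):  # one span per control point (closed curve)
        p0, p1, p2 = P[i], P[(i + 1) % n], P[(i + 2) % n]
        curve.append(np.outer(b0, p0) + np.outer(b1, p1) + np.outer(b2, p2))
    return np.vstack(curve)

# Usage: an elliptic control polygon gives the start-up lip contour.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
ellipse_ctrl = np.stack([30 * np.cos(angles), 15 * np.sin(angles)], axis=1)
print(quadratic_bspline(ellipse_ctrl).shape)  # (160, 2) sampled contour
```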

    Basu et al., 1998, address the problem of tracking and reconstructing

    3D human lip motions from a 2D view. They build a physically-based 3D

    model of lips and train it to cover only the subspace of lip motions. They

    then track this model in video by finding the shape within the subspace that

    maximizes the posterior probability of the model given the observed

    features. The features are the likelihoods of the lip and non-lip color classes:

    they iteratively derive forces from these values to apply to the physical

    model and converge to the final solution. Because of the full 3D nature of

the model, this framework allows the lips to be tracked from any head pose. In

    addition, because of the constraints imposed by the learned subspace of the

model, they are able to accurately estimate the full 3D lip shape from the 2D

    view.

Edwards et al., 1998, address the problem of robust face identification

    in the presence of pose, lighting, and expression variation. Previous

    approaches to the problem have assumed similar models of variation for

    each individual, estimated from pooled training data. They describe a

    method of updating a first order global estimate to identity by learning the

    class specific correlation between the estimate and the residual variation

    during a sequence. This is integrated with an optimal tracking scheme, in

    which identity variation is decoupled from pose, lighting and expression

    variation. The method results in robust tracking and a more stable estimate

    of facial identity under changing conditions.

    Schödl Arno et al., 1998, describe the use of a three-dimensional

    textured model of the human head under perspective projection to track a

    person’s face. The system is hand-initialized by projecting an image of the

    face onto a polygonal head model. Tracking is achieved by finding the six

    translation and rotation parameters to register the rendered images of the

    textured model with the video images. They find the parameters by mapping

    the derivative of the error with respect to the parameters to intensity

    gradients in the image. They use a robust estimator to pool the information

    and do gradient descent to find an error minimum.

    Stan Birchfield, 1998, presents an algorithm for tracking a person’s

    head. The head’s projection onto the image plane is modeled as an ellipse

    whose position and size are continually updated by a local search combining

    the output of a module concentrating on the intensity gradient around the

    ellipse’s perimeter with that of another module focusing on the color

    histogram of the ellipse’s interior. Since these two modules have roughly

    orthogonal failure modes, they serve to complement one another. The result

    is a robust, real-time system that is able to track a person’s head with

enough accuracy to automatically control the camera's pan, tilt, and zoom in

    order to keep the person centered in the field of view at a desired size.

    Extensive experimentation shows the algorithm’s robustness with respect to

    full 360-degree out-of-plane rotation, up to 90-degree tilting, severe but

    brief occlusion, arbitrary camera movement, and multiple moving people in

    the background.
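
The two complementary modules can be sketched as a single score evaluated for each candidate ellipse; the tracker then keeps the best-scoring ellipse in a local search. The weighting and normalisation below are our assumptions, not Birchfield's exact formulation.

```python
import numpy as np

def ellipse_score(gray, hist_model, cx, cy, a, b, bins=16):
    """Combined score for one candidate head ellipse: mean gradient magnitude
    around the perimeter plus histogram intersection of the interior with a
    normalised model histogram. Higher is better."""
    h, w = gray.shape
    gy, gx = np.gradient(gray.astype(float))

    # Perimeter term: gradient magnitude sampled on the ellipse boundary.
    theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
    xs = np.clip((cx + a * np.cos(theta)).astype(int), 0, w - 1)
    ys = np.clip((cy + b * np.sin(theta)).astype(int), 0, h - 1)
    grad_term = np.hypot(gx[ys, xs], gy[ys, xs]).mean()

    # Interior term: intersection of the interior grey-level histogram
    # with the model histogram (both normalised to sum to 1).
    yy, xx = np.mgrid[0:h, 0:w]
    inside = ((xx - cx) / a) ** 2 + ((yy - cy) / b) ** 2 <= 1.0
    hist, _ = np.histogram(gray[inside], bins=bins, range=(0, 256))
    hist = hist / max(hist.sum(), 1)
    color_term = np.minimum(hist, hist_model).sum()

    # The two cues have roughly orthogonal failure modes; sum them.
    return grad_term / 255.0 + color_term
```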

Toyama, 1998, observes that real-time 3D face tracking is a task with applications to

    animation, video teleconferencing, speech reading, and accessibility. In spite

    of advances in hardware and efficient vision algorithms, robust face tracking

    remains elusive for all of the reasons which make computer vision difficult:

    Variations in illumination, pose, expression, and visibility complicate the

    tracking process, especially under real-time constraints. They note that

    robust systems tend to possess some state-based architecture comprising

    heterogeneous algorithms, and that robust recovery from tracking failure

    requires several other facial image analysis tasks.

    Cascia et al., 2000, propose an improved technique for 3D head

tracking under varying illumination conditions. The head is modeled as a

    texture mapped cylinder. Tracking is formulated as an image registration

    problem in the cylinder's texture map image. The resulting dynamic texture

    map provides a stabilized view of the face that can be used as input to many

    existing 2D techniques for face recognition, facial expressions analysis, lip

    reading, and eye tracking.

Lievin and Luthon, 2000, propose an algorithm for a speaker's lip segmentation and feature extraction. A color video sequence of the speaker's

    face is acquired, under natural lighting conditions and without any particular

    make-up. A logarithmic color transform is performed from the RGB to HI

    (hue, intensity) color space. A statistical approach using Markov random

    field modeling determines the red hue prevailing region and motion in a

spatiotemporal neighborhood. The final label field is then used to extract

    ROI (region of interest) and geometrical features.
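
A minimal sketch of a logarithmic hue/intensity decomposition of this general kind is shown below; it is only a plausible stand-in, not necessarily the exact RGB-to-HI transform used by the authors.

```python
import numpy as np

def log_hue_intensity(rgb):
    """A simple logarithmic hue/intensity transform of an RGB image
    (float array with values in [0, 1]). Lips tend to score high on
    red-dominance measures of this kind."""
    eps = 1e-6
    log_rgb = np.log(rgb + eps)
    intensity = log_rgb.mean(axis=-1)            # log-domain intensity
    # "Red hue": how much log-red exceeds log-green, a red-prevalence cue.
    hue = log_rgb[..., 0] - log_rgb[..., 1]
    return hue, intensity
```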

    Tian et al., 2000, propose a dual state model based system of tracking

    eye features that uses convergent tracking techniques and show how it can

    be used to detect whether the eyes are open or closed, and to recover the

    parameters of the eye model.

Jian et al., 2001, develop a real-time lip tracking system whose information can be

    used to implement and control a virtual lip. The use of soft computing to

    represent the real time lip parameters enables them to have a more robust

    and flexible system which can compensate for the potential errors of lip

    tracking.

    Chan et al., 2002, state that contour model-based tracking is more

    robust if an accurate reference shape model of the underlying object is

    available. As lip shapes vary, the ability to automatically extract user-

    dependent lip models from input images is desirable. They present an

    unsupervised segmentation method to hierarchically locate the user's face

    and lips. Techniques employed include modeling in the hue / saturation color

    space using Gaussian mixture models and the use of geometric constraints.

    With the region of interest automatically located, the model extraction

    problem is formulated as a regularized model-fitting problem. The use of a

    generic shape as prior information improves the accuracy of the extracted lip

    model which is based on a cubic B-spline representation. They describe a

    method to compute automatically an optimal linear color space transform

    needed to obtain raw estimates of the lip boundary locations, as required by

    the fitting procedure.

Delman and Lievin, 2002, present an algorithm for a speaker's lip segmentation and feature extraction. A color video sequence of the speaker's

    face is acquired, under natural lighting conditions and without any particular

    make-up. A logarithmic color transform is performed from RGB to HI (hue,

intensity) color space. A statistical approach using Markov random field modeling determines the lip prevailing region and motion in spatiotemporal

    neighborhoods.

    Eveno et al., 2002, propose an accurate and robust lip segmentation

    algorithm. Characteristic points are found by using hybrid edges, which

    combine color and intensity information, and a priori knowledge about the lip

    structure. Corner position, which is crucial, is provided by a coarse-to-fine

process. A model is fitted to the lips. Unlike most model-oriented methods,

    they consider that the lip boundary is composed of several independent

cubic polynomial models. This gives the global model enough flexibility to

    reproduce the specificity of very different lip shapes. Compared to existing

    models, it brings a significant accuracy improvement. It ensures a robust

    convergence towards the edges.

Liew et al., 2002, state that the use of visual information from lip movements can improve the accuracy and robustness of a speech recognition system. A region-based lip contour extraction algorithm based on a deformable model is proposed. The algorithm employs a stochastic cost

    function to partition a color lip image into lip and non-lip regions such that

    the joint probability of the two regions is maximized. Given a discrete

    probability map generated by spatial fuzzy clustering, they show how the

    optimization of the cost function can be done in the continuous setting. The

    region-based approach makes the algorithm more tolerant to noise and

artifacts in the image. It also allows a larger region of attraction, thus making

    the algorithm less sensitive to initial parameter settings. The algorithm

    works on unadorned lips and accurate extraction of lip contour is possible.

    Mark Barnard et al., 2002, propose a robust and adaptable lip tracking

    method that uses a combination of snakes and a 2D template matching

    technique. The snake, an energy minimizing spline, is driven by 2D template

    matching techniques to find the expected lip contour of a specific speaker.

Their experiments show that the technique can track the unadorned lips in

    various colors and shapes of speakers, including the lips of a bearded

    speaker.

    Morency et al., 2002, present a robust implementation of stereo-based

    head tracking designed for interactive environments with uncontrolled

    lighting. They integrate fast face detection and drift reduction algorithms

    with a gradient-based stereo rigid motion tracking technique. Their system

    can automatically segment and track a user’s head under large rotation and

    illumination variations. Precision and usability of their approach are

    compared with previous tracking methods for cursor control and target

    selection in both desktop and interactive room environments.

Yang et al., 2002, state that images containing faces are essential to

    intelligent vision-based human computer interaction, and research efforts in

    face processing include face recognition, face tracking, pose estimation, and

    expression recognition. Given a single image, the goal of face detection is to

    identify all image regions which contain a face regardless of its three-

    dimensional position, orientation, and lighting conditions. Such a problem is

    challenging because faces are not rigid and have a high degree of variability

    in size, shape, color, and texture. Numerous techniques have been

    developed to detect faces in a single image.

    Blanz Volker and Vetter, 2003, present a method for face recognition

    across variations in pose, ranging from frontal to profile views, and across a

wide range of illuminations, including cast shadows and specular reflections.

    To account for these variations, the algorithm simulates the process of

    image formation in 3D space, using computer graphics, and it estimates 3D

    shape and texture of faces from single images. The estimate is achieved by

fitting a statistical, morphable model of 3D faces to images. The model is

    learned from a set of textured 3D scans of heads. They describe the

construction of the morphable model, an algorithm to fit the model to

images, and a framework for face identification. In this framework, faces are

    represented by model parameters for 3D shape and texture.

Liew, 2003, describes the application of a novel spatial fuzzy clustering

    algorithm to the lip segmentation problem. The proposed spatial fuzzy

    clustering algorithm is able to take into account both the distributions of

    data in feature space and the spatial interactions between neighboring pixels

during clustering. By appropriate pre- and post-processing utilizing the color

    and shape properties of the lip region, successful segmentation of most lip

    images is possible. Comparative study with some existing lip segmentation

    algorithms such as the hue filtering algorithm and the fuzzy entropy

    histogram thresholding algorithm has demonstrated the superior

    performance of their method.

    Suandi et al., 2003, introduce an extended technique in template

    matching to track eyes and mouth in real-time. The technique makes use of

    a set of ‘n’ correlation candidates from template matching. They first list all

the candidates from each face model region, and select the best candidates

    based on two selective functions. These functions are for right-left eyes pair

    and eyes-mouth pair selection, respectively. They also introduce a novel

    technique in tracking framework, called feature selective (FS), where the

    system selects the features automatically so that it is feasible for multiple

    face types and conditions.

    Wu et al., 2003, state that occlusion is a difficult problem for

appearance-based target tracking, especially when one needs to track multiple

    targets simultaneously and maintain the target identities during tracking.

    They propose a dynamic Bayesian network which accommodates an extra

    hidden process for occlusion and stipulates the conditions on which the

    image observation likelihood is calculated. The statistical inference of such a

    hidden process can reveal the occlusion relations among different targets,

which makes the tracker more robust against partial and even complete

occlusions. In addition, considering the fact that target appearances change

    with views, another generative model for multiple view representation is

    proposed by adding a switching variable to select from different view

templates. The integration of the occlusion model and multiple view model

    results in a complex dynamic Bayesian network, where extra hidden

    processes describe the switch of targets’ templates, dynamics, and the

    occlusions among different targets. The tracking and inference algorithms

    are implemented by the sampling-based sequential Monte Carlo strategies.

Their experiments show the effectiveness of the proposed probabilistic models

    and the algorithms.

Eveno Nicolas et al., 2004, propose an accurate and robust quasi-

    automatic lip segmentation algorithm. The upper mouth boundary and

    several characteristic points are detected in the first frame by using a new

    kind of active contour: the “jumping snake”. Unlike classic snakes, it can be

    initialized far from the final edge and the adjustment of its parameters is

    easy and intuitive. Then, to achieve the segmentation they propose a

    parametric model composed of several cubic curves. Its high flexibility

    enables accurate lip contour extraction even in the challenging case of very

asymmetric mouths. It brings a significant accuracy and realism

    improvement. The segmentation in the following frames is achieved by using

    an inter frame tracking of the key points and the model parameters. The

    key point’s positions become unreliable after a few frames. They propose an

    adjustment process that enables an accurate tracking even after hundreds of

frames, and the mean key point tracking errors of their algorithm are comparable to manual point selection errors.

Leung Shu-Hung et al., 2004, present a new fuzzy clustering

    method for lip image segmentation. This clustering method takes both the

    color information and the spatial distance into account while most of the

    current clustering methods only deal with the former. A new dissimilarity

measure, which integrates the color dissimilarity and the spatial distance in

    terms of an elliptic shape function, is introduced. Because of the presence of

    the elliptic shape function, the new measure is able to differentiate the pixels

having similar color information but located in different regions. A new

    iterative algorithm for the determination of the membership and centroid for

    each class is derived, which is shown to provide good differentiation between

    the lip region and the non-lip region.
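
The flavour of such a dissimilarity measure can be sketched as below, where the ellipse parameters and the spatial weight are assumptions of ours; the authors' exact elliptic shape function differs.

```python
import numpy as np

def elliptic_dissimilarity(pixel_color, pixel_xy, centroid_color, ellipse,
                           w_spatial=1.0):
    """Dissimilarity combining color distance with a spatial term shaped by
    an elliptic function, so pixels with lip-like color but located far
    outside the expected elliptic lip region are penalised.
    `ellipse` = (cx, cy, a, b) are assumed lip-region parameters."""
    cx, cy, a, b = ellipse
    color_d = np.sum((np.asarray(pixel_color) - np.asarray(centroid_color)) ** 2)
    x, y = pixel_xy
    # Elliptic shape function: ~0 inside the ellipse, grows outside it.
    shape = max(((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 - 1.0, 0.0)
    return color_d + w_spatial * shape
```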

Wang et al., 2004, note that visual information from lip shapes and movements

    helps improve the accuracy and robustness of a speech recognition system.

    A new region-based lip contour extraction algorithm that combines the

    merits of the point-based model and the parametric model is presented.

    Their algorithm uses a 16-point lip model to describe the lip contour. Given a

    robust probability map of the color lip image generated by the FCMS (fuzzy

    clustering method incorporating shape function) algorithm, a region-based

    cost function that maximizes the joint probability of the lip and non-lip

    region can be established. Then an iterative point-driven optimization

    procedure has been developed to fit the lip model to the probability map. In

    each iteration, the adjustment of the 16 lip points is governed by three

    pieces of quadratic curves that constrain the points to form a physical lip

    shape.

Narayanan et al., 2006, present a lip contour tracking algorithm

    using attractor guided particle filtering. It is difficult to robustly track the lip

    contour because the lip contour is highly deformable and the contrast

between skin and lip colors is very low. This often makes traditional blind segmentation-based algorithms fail to produce robust and realistic results.

    The lip contour is constrained by the facial muscles; the tracking

    configuration space can then be represented by a lower dimensional

    manifold. They take some representative lip shapes as the attractors in the

    lower dimensional manifold. To resolve the low contrast problem, they adopt

a color feature selection algorithm to maximize the contrast between skin and lip

    colors. Then they integrate the shape priors and the discriminative feature

    into the attractor-guided particle filtering framework to track the lip contour.

Nguyen et al., 2008, propose and evaluate a novel method for enhancing the performance of lip contour tracking, based on the concept of active shape models (ASM) and multiple features. On the first image of the video sequence, the lip region is detected using Bayes' rule

    in which lip color information is modeled by using the Gaussian Mixture

    Model (GMM) and the GMM is trained by Expectation-Maximization (EM)

    algorithm. The lip region is then used to initialize the lip shape model. A

single feature-based ASM presents good performance only in particular conditions but gets stuck in local minima in noisy conditions. To enhance the convergence, they propose to use two features, normal profiles and grey-level patches, and combine them by using a voting approach. The standard ASM

    is not able to take into account temporal information from previous frames

    therefore the lip contours are tracked by replacing the standard ASM with a

hybrid active shape model (HASM), which is capable of taking advantage of the

    temporal information.
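
The GMM-based lip detection step can be sketched with scikit-learn, whose fit() runs EM internally; the component count and the lip prior below are assumptions of ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def lip_posterior(pixels, lip_samples, nonlip_samples, prior_lip=0.2):
    """Pixel-wise lip posterior via Bayes' rule, with class-conditional
    color densities modelled by GMMs trained with EM. All sample arrays
    are (N, 3) RGB rows; `prior_lip` is an assumed class prior."""
    gmm_lip = GaussianMixture(n_components=3, random_state=0).fit(lip_samples)
    gmm_bg = GaussianMixture(n_components=3, random_state=0).fit(nonlip_samples)
    # Log-likelihood of each pixel under each class model.
    ll_lip = gmm_lip.score_samples(pixels)
    ll_bg = gmm_bg.score_samples(pixels)
    # Bayes' rule in the log domain for numerical stability.
    log_num = np.log(prior_lip) + ll_lip
    log_den = np.logaddexp(log_num, np.log(1 - prior_lip) + ll_bg)
    return np.exp(log_num - log_den)  # P(lip | color) per pixel
```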

Ong Eng-Jon and Bowden, 2008, propose a learnt data-driven

    approach to the accurate, real-time tracking of lip shapes using only

intensity information. This has the advantage that constraints such as a priori shape models or temporal models for dynamics are not required or

    used. Tracking the lip shape is simply the independent tracking of a set of

    points that lie on the lip’s contour. This allows us to cope with different lip

    shapes that were not present in the training data and performs as well as

    other approaches that have pre-learnt shape models such as the AAM.

Tracking is achieved via linear predictors, where each linear predictor

    essentially linearly maps sparse template difference vectors to tracked

    feature position displacements. Multiple linear predictors are grouped into a

rigid flock to obtain increased robustness. To achieve accurate tracking, two

    approaches are proposed for selecting relevant sets of LPs within each flock.

    Analysis of the selection results show that the LPs selected for tracking a

    feature point choose areas that are strongly correlated with that of the

    tracked target and that these areas are not necessarily the region around

the feature point, as is commonly assumed in LK-based approaches. Effective fusion of acoustic and visual modalities in speech recognition has

    been an important issue in human computer interfaces, warranting further

    improvements in intelligibility and robustness. Speaker lip motion stands out

    as the most linguistically relevant visual feature for speech recognition. They

    present a new hybrid approach to improve lip localization and tracking,

    aimed at improving speech recognition in noisy environments. It begins with

    a new color space transformation for enhancing lip segmentation. In the

    color space transformation, a PCA method is employed to derive a new one

    dimensional color space which maximizes discrimination between lip and

    non-lip colors. Intensity information is also incorporated in the process to

    improve contrast of upper and corner lip segments. In the subsequent step,

    a constrained deformable lip model with high flexibility is constructed to

    accurately capture and track lip shapes. The model requires only six degrees

    of freedom, yet provides a precise description of lip shapes using a simple

    least square fitting method. Experimental results indicate that the proposed

    hybrid approach delivers reliable and accurate localization and tracking of lip

    motions under various measurement conditions.
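
As a concrete stand-in for the discriminative one-dimensional color space, the sketch below computes a Fisher discriminant direction from labelled lip and non-lip samples; note this substitutes the closely related Fisher criterion for the paper's PCA-based derivation.

```python
import numpy as np

def discriminative_color_axis(lip, nonlip):
    """Derive a single color axis separating lip from non-lip colors using
    the Fisher discriminant direction, which maximises between-class over
    within-class scatter. `lip` and `nonlip` are (N, 3) RGB sample arrays."""
    mu_l, mu_n = lip.mean(axis=0), nonlip.mean(axis=0)
    # Pooled within-class scatter matrix (regularised for invertibility).
    Sw = np.cov(lip, rowvar=False) + np.cov(nonlip, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(3), mu_l - mu_n)
    return w / np.linalg.norm(w)

# Projecting an image onto this axis yields the 1-D color space:
# proj = image.reshape(-1, 3) @ discriminative_color_axis(lip, nonlip)
```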

Rohani et al., 2008, state that lip feature extraction is one of the most challenging tasks affecting lip reading systems' performance. They propose a

    new approach for lip contour extraction based on fuzzy clustering. The

    algorithm employs a stochastic cost function to partition a color image into

    lip and non-lip regions such that the joint probability of the two regions is

    maximized. The mouth location is determined and then, lip region is

preprocessed using pseudo hue transformation. Fuzzy c-means clustering is

    applied to each transformed image along with b components of CIELAB color

space. To delete the clustered pixels around the lips, an ellipse and a Gaussian mask are used. In order to show the performance of the proposed method,

the pseudo hue segmentation and fuzzy c-means clustering without

    preprocessing are compared. The compared methods were applied to the

    VidTIMIT and M2VTS databases and the results show the superiority of the

    proposed method in comparison with other methods.
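
The fuzzy c-means core used in such methods alternates membership and centroid updates, as in this minimal sketch (without the pseudo-hue preprocessing and elliptic/Gaussian masks the authors add around it).

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, tol=1e-5, seed=0):
    """Plain fuzzy c-means on feature rows X of shape (N, d): alternately
    update the membership matrix U and the cluster centroids until the
    memberships stop changing."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1
    centers = X[:c].astype(float)
    for _ in range(iters):
        Um = U ** m                             # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared distance of every point to every centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + 1e-12
        # Standard FCM membership update: u ~ d^(-2/(m-1)), normalised.
        U_new = 1.0 / (d2 ** (1.0 / (m - 1)))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            break
        U = U_new
    return U, centers
```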

Chin Siew Wen et al., 2009, present an automatic lip detection and tracking system based on a watershed segmentation approach. For some lip detection systems, skin / non-skin detection is a prerequisite step to localize the face region, followed by detection of the lip region. A direct lip detection technique using watershed segmentation without needing

    preliminary face localization is proposed. The watershed algorithm segments

the input image into regions. Cubic spline interpolant lip color modeling and symmetry detection are used to detect the lip region from the

    segmented regions. The position of the segmented lips is passed to the

    tracking system to predict the location of the lips in the succeeding video

    frame.

Hoai Bac et al., 2010, address a narrower problem, lip tracking, which is an essential step in providing visual lip data for a lip-reading system. Inspired by the idea of AVCSR, which combines visual features with audio features to increase accuracy in noisy environments, they use the AdaBoost algorithm and a Kalman filter for the face and lip detectors.

2.3 FACIAL TRACKING USING SPEECH

Leymarie and Levine, 1993, propose a method for segmenting a noisy intensity image and tracking a non-rigid object. A technique based on an active

    contour model commonly called a snake is examined. The technique is

    applied to cell locomotion and tracking studies. The snake permits both the

    segmentation and tracking problems to be simultaneously solved in

    constrained cases. A detailed analysis of the snake model, emphasizing its

    limitations and shortcomings, is presented, and improvements to the original

    description of the model are proposed. Problems of convergence of the

    optimization scheme are considered. In particular, an improved terminating

    criterion for the optimization scheme that is based on topographic features

    of the graph of the intensity image is proposed. Hierarchical filtering

methods, as well as a continuation method based on a discrete scale-space

    representation, are discussed.

    Luettin Juergen et al., 1996, describe a robust method for extracting

    visual speech information from the shape of lips to be used for an automatic

speech reading (lip reading) system. Lip deformation is modelled by a

    statistically based deformable contour model which learns typical lip

    deformation from a training set. The main difficulty in locating and tracking

    lips consists of finding dominant image features for representing the lip

    contours. They describe the use of a statistical profile model which learns

    dominant image features from a training set. The model captures global

    intensity variation due to different illumination and different skin reflectance

    as well as intensity changes at the inner lip contour due to mouth opening

    and visibility of teeth and tongue. The method is validated for locating and

    tracking lip movements on a database of a broad variety of speakers.

Kaucic and Blake, 1998, note that human speech is inherently multi-modal, consisting of both audio and visual components. Recently researchers have

    shown that the incorporation of information about the position of the lips

into an acoustic speech recognizer enables robust recognition of noisy speech.

    In the case of Hidden Markov model recognition, they show that this

    happens because the visual signal stabilizes the alignment of states. It is

    also shown that unadorned lips, both the inner and outer contours, can be

    robustly tracked in real time on general purpose workstations. To accomplish

    this, efficient algorithms are employed which contain three key components,

    shape models, motion models, and focused color feature detectors all of

    which are learnt from examples.

Lei et al., 2004, present a robust hierarchical lip tracking

    approach (RoHiLTA) for lip-reading and audio visual speech recognition

    (AVSR) applications. Lip regions of interest are subtly detected by motion

    and facial structure information. Improvements are made on active shape

    models (ASMs) for extracting lip contours more accurately and efficiently

    from video sequences of a speaker's talking face in natural lighting

conditions and without particular make-up. Local and global ASM search

    algorithms are both improved by introducing color information, 2D mouth

    corner match, and robust estimation. For noise-free features, localization

    errors are automatically corrected by an interpolating scheme. A fast

    implementation of the hierarchical approach is also proposed. Extensive

    experiments show that the improved ASM can effectively reduce the lip

    locating errors. The fast implementation of RoHiLTA can consistently achieve

    superior performance to conventional ASMs in lip tracking tasks, and then

    can be effectively integrated in lip-reading and AVSR systems.

2.4 FACIAL TRACKING USING SKIN AND COLOR

Sobottka Karin and Pitas Ioannis, 1996, present a new approach for

    automatically segmenting and tracking of faces in color images. The

    segmentation of faces is done based on color and shape information. By

    searching for facial features, face hypotheses are verified. Afterwards

tracking is performed by using an active contour model. This ensures fast

    processing and an increase in robustness for the face recognition process.

    The exterior forces of the active contour are defined based on color features.

    Results for tracking are shown for an image sequence consisting of 150

    frames.

    Yang and Waibel, 1996, present a real-time face tracker. The system

    has achieved a rate of 30+ frames / second using an HP-9000 workstation

    with a frame grabber and a Canon VC-C1 camera. It can track a person’s

    face while the person moves freely (e.g., walks, jumps, sits down and stands

    up) in a room. Three types of models have been employed in developing the

    system. They present a stochastic model to characterize skin-color

    distributions of human faces. The information provided by the model is

    sufficient for tracking a human face in various poses and views. This model

    is adaptable to different people and different lighting conditions in real-time.

    A motion model is used to estimate image motion and to predict search

    window. A camera model is used to predict and to compensate for camera

    motion. The system can be applied to teleconferencing and many HCI

    applications including lip-reading and gaze tracking. The principle in

    developing this system can be extended to other tracking problems such as

    tracking the human hand.
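
A common way to realise such a stochastic skin-color model is a single Gaussian in normalised chromaticity space, sketched below; the single-Gaussian choice and thresholding are our assumptions, and the authors' adaptive model additionally updates its parameters over time.

```python
import numpy as np

def fit_skin_model(skin_rgb):
    """Fit a Gaussian to skin samples in normalised (r, g) chromaticity
    space; normalisation factors out brightness. `skin_rgb` is (N, 3)."""
    s = skin_rgb.astype(float)
    total = s.sum(axis=1, keepdims=True) + 1e-6
    rg = s[:, :2] / total            # r = R/(R+G+B), g = G/(R+G+B)
    return rg.mean(axis=0), np.cov(rg, rowvar=False)

def skin_likelihood(pixels_rgb, mean, cov):
    """Mahalanobis-based skin score for (N, 3) RGB pixels; thresholding
    this score yields a skin mask usable for face tracking."""
    p = pixels_rgb.astype(float)
    rg = p[:, :2] / (p.sum(axis=1, keepdims=True) + 1e-6)
    diff = rg - mean
    inv = np.linalg.inv(cov + 1e-9 * np.eye(2))
    d2 = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return np.exp(-0.5 * d2)         # unnormalised Gaussian score
```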

Jebara et al., 1997, describe automatic detection, modeling and tracking of faces in 3D. A closed loop approach is proposed which utilizes

    structure from motion to generate a 3D model of a face and then feedback

    the estimated structure to constrain feature tracking in the next frame. The

    system initializes by using skin classification, symmetry operations, 3D

warping and eigenfaces to find a face. Feature trajectories are then

    computed by SSD or correlation-based tracking. The trajectories are

    simultaneously processed by an extended Kalman filter to stably recover 3D

    structure, camera geometry and facial pose. Adaptively weighted estimation

is used in this filter by modeling the noise characteristics of the 2D image

    patch tracking technique. The structural estimate is constrained by using

    parameterized models of facial structure (eigen-heads). The Kalman filter's

    estimate of the 3D state and motion of the face predicts the trajectory of the

    features which constrains the search space for the next frame in the video

    sequence. The feature tracking and Kalman filtering closed loop system

    operates at 30Hz.

Bradski Gary, 1998, presents a first step towards a perceptual user

    interface. A computer vision color tracking algorithm is developed and

    applied towards tracking human faces. The algorithm is based on a robust

    nonparametric technique for climbing density gradients to find the mode of

    probability distributions called the mean shift algorithm. The mean shift

    algorithm is modified to deal with dynamically changing color probability

    distributions derived from video frame sequences. The modified algorithm is

    called the continuously adaptive mean shift (CAMSHIFT) algorithm.

    CAMSHIFT’s tracking accuracy is compared against a Polhemus tracker.

Bradski, 1998, develops computer vision algorithms that are intended

    to form part of a perceptual user interface. They must be able to track in

    real time yet not absorb a major share of computational resources: other

    tasks must be able to run while the visual interface is being used. The new

    algorithm developed is based on a robust nonparametric technique for

    climbing density gradients to find the mode (peak) of probability

    distributions called the mean shift algorithm. They want to find the mode of

    a color distribution within a video scene. The mean shift algorithm is

    modified to deal with dynamically changing color probability distributions

    derived from video frame sequences. The modified algorithm is called the

    Continuously Adaptive Mean Shift (CAMSHIFT) algorithm. CAMSHIFT’s

    tracking accuracy is compared against a Polhemus tracker. Tolerance to

    noise, distracters and performance is studied. CAMSHIFT is then used as a

computer interface for controlling commercial computer games and for

    exploring immersive 3D graphic worlds.
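
One CAMSHIFT-style iteration can be sketched as follows: mean-shift the search window to the centroid of the probability image inside it, then rescale the window from the zeroth moment. The scale rule below is a simplified version of Bradski's.

```python
import numpy as np

def camshift_step(prob, x, y, w, h):
    """One CAMSHIFT-style iteration on a 2-D probability image `prob`:
    move the window to the centroid of the probability mass inside it,
    then adapt the window size from the zeroth moment."""
    H, W = prob.shape
    x0, x1 = max(int(x), 0), min(int(x + w), W)
    y0, y1 = max(int(y), 0), min(int(y + h), H)
    roi = prob[y0:y1, x0:x1]
    m00 = roi.sum()
    if m00 <= 0:
        return x, y, w, h                  # no mass: keep the window
    ys, xs = np.mgrid[y0:y1, x0:x1]
    cx = (roi * xs).sum() / m00            # first moments give the centroid
    cy = (roi * ys).sum() / m00
    side = 2.0 * np.sqrt(m00)              # window scale from total mass
    return cx - side / 2, cy - side / 2, side, side

# Repeat camshift_step until the window centre moves less than a pixel.
```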

Raja Yogesh et al., 1998, describe a method used to obtain robust

    detection and tracking of people in relatively unconstrained dynamic scenes.

    Gaussian mixture models were used to estimate probability densities of color

    for skin, clothing and background. These models were used to detect, track

    and segment people, faces and hands. A technique for dynamically updating

    the models to accommodate changes in apparent color due to varying

    lighting conditions was used. Two applications are highlighted: (1) actor

    segmentation for virtual studios and (2) focus of attention for face and

    gesture recognition systems.

    Yang et al., 1998, state that a human face provides a variety of

    different communicative functions. They present approaches for real-time

    face / facial feature tracking and their applications. They present techniques

    of tracking human faces. It is revealed that human skin color can be used as

    a major feature for tracking human faces. An adaptive stochastic model has

    been developed to characterize the skin-color distributions. Based on the

    maximum likelihood method, the model parameters can be adapted for

    different people and different lighting conditions. The feasibility of the model

has been demonstrated by the development of a real-time face tracker. They

    then present a top-down approach for tracking facial features such as eyes,

    nostrils, and lip corners. These real-time tracking techniques have been

    successfully applied to many applications such as eye-gaze monitoring, head

    pose tracking, and lip-reading.

    Jordao et al., 1999, describe a method for the detection and tracking

    of human face and facial features. Skin segmentation is learnt from samples

    of an image. After detecting a moving object, the corresponding area is

    searched for clusters of pixels with a known distribution. This process is

    quite insensitive to illumination changes. The face localization procedure

looks for areas in the segmented area which resemble a head. Using simple

    heuristics, the located head is searched and its centroid is fed back to a

    camera motion control algorithm which tries to keep the face centered in the

    image using a pan-tilt camera unit. The system is capable of tracking, in

    every frame, the three main features of a human face. Since precise eye

    location is computationally intensive, an eye and mouth locator using fast

    morphological and linear filters is developed. This allows for frame-by-frame

    checking, which reduces the probability of tracking a non-basis feature,

    yielding a higher success ratio. Velocity and robustness are the main

    advantages of this fast facial feature detector.

Lievin, 2000, proposes an algorithm for a speaker's lip contour extraction. A color video sequence of the speaker's face is acquired, under natural lighting

    conditions and without any particular make-up. A logarithmic color transform

    is performed from RGB to HI (hue, intensity) color space. A Bayesian

    approach segments the mouth area using Markov random field modeling.

    Motion is combined with red hue lip information into a spatiotemporal

neighborhood. Simultaneously, a region of interest and relevant boundary

    points are automatically extracted. An active contour using spatially varying

    coefficients is initialized with the results of the preprocessing stage. An

    accurate lip shape with inner and outer borders is obtained with good quality

    results in this challenging situation.

Schwerdt and Crowley, 2000, discuss a robust tracking technique

    applied to histograms of intensity normalized color. This technique supports

    a video codec based on orthonormal basis coding. Orthonormal basis coding

    can be very efficient when the images to be coded have been normalized in

    size and position. However, an imprecise tracking procedure can have a

    negative impact on the efficiency and the quality of reconstruction of this

    technique, since it may increase the size of the required basis space. The

method has greater stability, higher precision and less jitter over

    conventional tracking techniques using color histograms.

    Zhang and Mersereau, 2000, state that the use of color information

    can significantly improve the efficiency and robustness of lip feature

    extraction capability over purely grayscale-based methods. Edge information

    provides another useful tool in characterizing lip boundaries. They present a

    method of integrating both types of information to address the problem of lip

    feature extraction for the purpose of speech reading. They first examine

    various color models and view hue as an effective descriptor to characterize

    the lips due to its invariance to luminance and human skin color, and its

    discriminative properties. They use prominent red hue as an indicator to

    locate the position of the lips. Based on the identified lip area, they further

    refine the interior and exterior lip boundary using both color and spatial edge

    information, where those two are combined within a Markov random field

    (MRF) framework.

Spors et al., 2001, present a face localization and tracking algorithm which is based upon skin color detection and principal component analysis

    (PCA) based eye localization. Skin color segmentation is performed using

    statistical models for human skin color. The skin color segmentation task

    results in a mask marking the skin color regions in the actual frame, which is

    further used to compute the position and size of the dominant facial region

    utilizing a robust statistics-based localization method. To improve the results

    of skin color segmentation a foreground / background segmentation and an

    adaptive background update scheme were added. The derived face position

is tracked with a Kalman filter.
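
Tracking the derived face position with a Kalman filter typically uses a constant-velocity model such as the sketch below; the noise levels are assumed values for illustration.

```python
import numpy as np

def make_cv_kalman(dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity Kalman matrices for a 2-D face position; q and r
    are assumed process/measurement noise levels."""
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                  [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)   # state transition
    Hm = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)  # observe (x, y)
    return F, Hm, q * np.eye(4), r * np.eye(2)

def kalman_step(x, P, z, F, Hm, Q, R):
    """One predict/update cycle: x is [px, py, vx, vy], z a measured (x, y)."""
    x = F @ x                                # predict state
    P = F @ P @ F.T + Q                      # predict covariance
    y = z - Hm @ x                           # innovation
    S = Hm @ P @ Hm.T + R
    K = P @ Hm.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ Hm) @ P
    return x, P
```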

Gargesha, 2002, states that existing techniques for facial feature point detection

    from color images include template matching, facial geometry and symmetry

    analysis, mathematical morphology, luminance and chrominance analysis,

    and PCA. These techniques are plagued by poor performance in the presence

of scale variations. A hybrid technique is proposed that employs a

    combination of the above approaches along with curvature analysis of the

    intensity surface of the face image in order to provide a superior

    performance with reduced computational complexity, even in the presence

    of scale variations.

Perez et al., 2002, propose color-based trackers for drastically shape-varying objects. The method relies on the deterministic search of a window

    whose color content matches a reference histogram color model. Relying on

    the same principle of color histogram distance, but within a probabilistic

    framework, they introduce a Monte Carlo tracking technique. The use of a

    particle filter allows them to better handle color clutter in the background, as

    well as complete occlusion of the tracked entities over few frames. The

    probabilistic approach is very flexible and can be extended in a number of

    useful ways. In particular, they introduce the following ingredients: multi-

    part color modeling to capture a rough spatial layout ignored by global

    histograms, incorporation of a background color model when relevant, and

    extension to multiple objects.
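
A minimal sampling-importance-resampling step for such a color-histogram particle filter is sketched below; the random-walk motion model, the likelihood sharpness `lam`, and the `frame_hist_fn` callback are our assumptions for illustration.

```python
import numpy as np

def particle_filter_step(particles, weights, hist_model, frame_hist_fn,
                         motion_std=5.0, lam=20.0, rng=None):
    """One SIR step for color-based tracking: diffuse the particle states
    (window centres), weight each by how well its local color histogram
    matches the reference model, then resample. `frame_hist_fn(x, y)` is
    an assumed callback returning the normalised histogram of the current
    frame around (x, y)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Predict: random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Weight: likelihood from the Bhattacharyya coefficient of histograms.
    for i, (x, y) in enumerate(particles):
        h = frame_hist_fn(x, y)
        bc = np.sum(np.sqrt(h * hist_model))       # Bhattacharyya coeff.
        weights[i] = np.exp(-lam * (1.0 - bc))
    weights = weights / weights.sum()
    estimate = weights @ particles                  # posterior mean state
    # Resample (multinomial) back to uniform weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    uniform = np.full(len(particles), 1.0 / len(particles))
    return particles[idx], uniform, estimate
```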

    Andreas et al., 2003, present a hierarchical realization of an enhanced

    active shape model for color video tracking and study the performance of

    both hierarchical and nonhierarchical implementations in the RGB, YUV, and

    HSI color spaces. Active shape models can be applied to tracking non-rigid

    objects in video image sequences.

Huang and Trivedi, 2004, note that human face analysis has been recognized as a crucial part of intelligent systems. There are several

    challenges before robust and reliable face analysis systems can be deployed

    in real-world environments. One of the main difficulties is associated with

    the detection of faces with variations in illumination conditions and viewing

    perspectives. They present the development of a computational framework

    for robust detection, tracking and pose estimation of faces captured by video

arrays. They discuss the development of a multi-primitive skin-tone and edge-

    based detection module integrated with a tracking module for efficient and

    robust detection and tracking. A multi-state continuous density Hidden

    Markov Model based pose estimation module is developed for providing an

    accurate estimate of the orientation of the face.

Varona et al., 2005, present a robust real-time 3D tracking

    system of human hands and face. This system can be used as a perceptual

    interface for virtual reality activities in a workbench environment. In front of

the virtual reality device, the user does not need any type of marker or special suit. The system includes a color segmentation module to detect in real time the skin-color pixels present in the images. The results of this skin-

    color segmentation are skin-color blobs that are the inputs of a data

    association module that labels the blobs pixels using a set of object state

    hypothesis from previous frames. The 2D tracking results are used for the

    3D reconstruction of hands and face in order to obtain the 3D positions of

    these limbs. They present several results using the H-ANIM standard to

    show the system output performance.

Stasiak and Vicente-Garcia, 2010, develop a system for parallel face detection, tracking and recognition in real-time video sequences. Particle filtering is utilized for the purpose of combined and effective

    detection, tracking and recognition. Temporal information contained in

videos is utilized. A fast, skin color-based face extraction and normalization

    technique is applied. Consequently, real-time processing is achieved.

    Implementation of face recognition mechanisms within the tracking

    framework is used for the purpose of identity recognition, and to improve

    the tracking robustness in case of multi-person tracking scenarios. Face-to-

    track assignment conflicts can often be resolved with the use of motion

modeling; however, motion-based conflict resolution can be erroneous. An identity cue can be used to improve tracking quality. They describe the concept of face

tracking corrections with the use of an identity recognition mechanism,

    implemented within a compact particle filtering-based framework for face

    detection, tracking and recognition.

    Shi and Tomasi, 1994, state that no feature-based vision system can

    work unless good features can be identified and tracked from frame to

    frame. Although tracking itself is by and large a solved problem, selecting

    features that can be tracked well and correspond to physical points in the

    world is still hard. They propose a feature selection criterion that is optimal

    by construction because it is based on how the tracker works, and a feature

    monitoring method that can detect occlusions, disocclusions, and features

    that do not correspond to points in the world. These methods are based on a

    new tracking algorithm that extends previous Newton-Raphson style search

    methods to work under affine image transformations. They test performance

    with several simulations and experiments.
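
The feature selection criterion amounts to scoring each pixel by the smaller eigenvalue of the local gradient structure tensor, as in this sketch; the window size and border handling are implementation choices of ours.

```python
import numpy as np

def shi_tomasi_response(gray, win=3):
    """'Good features to track' score: for each pixel, the smaller
    eigenvalue of the local structure tensor built from image gradients;
    features are selected where this score is large."""
    I = gray.astype(float)
    gy, gx = np.gradient(I)
    Ixx, Iyy, Ixy = gx * gx, gy * gy, gx * gy

    def box(a):
        # Box-filter over a (2*win+1)^2 window via 2-D prefix sums.
        k = 2 * win + 1
        pad = np.pad(a, win, mode='edge')
        c = pad.cumsum(0).cumsum(1)
        c = np.pad(c, ((1, 0), (1, 0)))  # prefix sums with a zero border
        return c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    # Smaller eigenvalue of [[Sxx, Sxy], [Sxy, Syy]] in closed form.
    tr = Sxx + Syy
    gap = np.sqrt((Sxx - Syy) ** 2 + 4 * Sxy ** 2)
    return 0.5 * (tr - gap)
```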

    Black et al., 1995, explore the use of local parameterized models of

    image motion for recovering and recognizing the non-rigid and articulated

    motion of human faces. Parametric models are popular for estimating motion

    in rigid scenes. They observe that within local regions in space and time,

    such models not only accurately model non-rigid facial motions but also

    provide a concise description of the motion in terms of a small number of

    parameters. These parameters are intuitively related to the motion of facial

    features during facial expressions and show how expressions can be

    recognized from the local parametric motions in the presence of significant

    head motion. The motion tracking and expression recognition approach

performs with high accuracy on movie sequences.

MacCormick and Blake, 1995, state that tracking multiple targets is a challenging

    problem, especially when the targets are identical, in the sense that the

    same model is used to describe each target. They present an observation

    density for tracking, which solves the problem by exhibiting a probabilistic

exclusion principle. Exclusion arises naturally from a systematic derivation of

the observation density, without relying on heuristics. They present

    partitioned sampling, a new sampling method for multiple object tracking.

    Partitioned sampling avoids the high computational load associated with fully

    coupled trackers, while retaining the desirable properties of coupling.

    Basu Sumit et al., 1996, describe a method for the robust tracking of

    rigid head motion from video. This method uses a 3D ellipsoidal model of the

    head and interprets the optical flow in terms of the possible rigid motions of

    the model. This method is robust to large angular and translational motions

    of the head and is not subject to the singularities of a 2D model. The method

has been successfully applied to heads with a variety of shapes and hair styles.

    This method has the advantage of accurately capturing the 3D motion

    parameters of the head. The accuracy is shown through comparison with a

    ground truth synthetic sequence. The ellipsoidal model is robust to small

    variations in the initial fit, enabling the automation of the model

    initialization.

    Darrell et al., 1996, demonstrate real-time face tracking and pose

    estimation in an unconstrained office environment with a camera. Using

vision routines previously implemented for an interactive environment, they determine the spatial location of a user's head and guide an active camera to obtain images of the face. Faces are analyzed using a set of eigenspaces indexed over both pose and world location. Closed-loop feedback from the estimated facial location is used to guide the camera when a face is present in the frontal view.

    Crowley James et al., 1997, describe a system which uses multiple

    visual processes to detect and track faces for video compression and

    transmission. The system is based on an architecture in which a supervisor

selects and activates visual processes in a cyclic manner. Control of visual

    processes is made possible by a confidence factor which accompanies each

observation. Fusion of results into a unified estimation for tracking is made

    possible by estimating a covariance matrix with each observation. Visual

    processes for face tracking are described using blink detection, normalized

    color histogram matching, and cross correlation (SSD and NCC). Ensembles

    of visual processes are organized into processing states so as to provide

    robust tracking. Transition between states is determined by events detected

by processes. The result of face detection is fed into a recursive estimator. The output from the estimator drives a PD controller for a pan/tilt/zoom camera.

    Fieguth et al., 1997, develop a simple and very fast method for object

    tracking based exclusively on color information in digitized video images.

Running on a Silicon Graphics R4600 Indy system with an IndyCam, the algorithm is capable of simultaneously tracking objects at full frame size (640 × 480 pixels) and video frame rate (50 fps). Robustness with respect to

    occlusion is achieved via an explicit hypothesis tree model of the occlusion

    process. They demonstrate the efficacy of their techniques in the challenging

task of tracking people, especially the human head and hands.

    Oliver Nuria and Pentland, 1997, describe an active-camera real-time

    system for tracking, shape description, and classification of the human face

    and mouth using only an SGI Indy computer. The system is based on use of

    2-D blob features, which are spatially-compact clusters of pixels that are

similar in terms of low-level image properties. Patterns of behavior, such as facial expressions and head movements, can be classified in real time using Hidden Markov Model (HMM) methods. The system has been tested on hundreds of

    users and has demonstrated extremely reliable and accurate performance.

Birchfield, 1998, presents an algorithm for tracking a person’s head. The head’s

    projection onto the image plane is modeled as an ellipse whose position and

    size are continually updated by a local search combining the output of a

    module concentrating on the intensity gradient around the ellipse’s

perimeter with that of another module focusing on the color histogram of the

    ellipse’s interior. These two modules have roughly orthogonal failure modes;

    they serve to complement one another. The result is a robust, real-time

    system that is able to track a person’s head with enough accuracy to

    automatically control the camera’s pan, tilt, and zoom in order to keep the

    person centered in the field of view at a desired size.
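The two modules can be combined by scoring each candidate ellipse with a boundary-gradient term and an interior color-histogram term. The sketch below illustrates the idea in Python with NumPy and OpenCV; the hue-only histogram and the histogram-intersection similarity are illustrative assumptions, not Birchfield's exact formulation. In a local search, each term would be normalized across all candidate ellipses before the two are summed, so that neither module dominates.

    import numpy as np
    import cv2

    def ellipse_terms(gray, hsv, hist_model, center, axes):
        """Return (boundary-gradient term, interior-color term) for one
        candidate ellipse; center and axes are integer tuples."""
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
        grad = cv2.magnitude(gx, gy)

        boundary = np.zeros(gray.shape, np.uint8)
        cv2.ellipse(boundary, center, axes, 0, 0, 360, 255, 1)
        g = grad[boundary > 0].mean()           # mean gradient on the perimeter

        interior = np.zeros(gray.shape, np.uint8)
        cv2.ellipse(interior, center, axes, 0, 0, 360, 255, -1)
        hist = cv2.calcHist([hsv], [0], interior, [32], [0, 180])
        cv2.normalize(hist, hist, 1.0, 0, cv2.NORM_L1)
        c = np.minimum(hist, hist_model).sum()  # histogram intersection in [0, 1]
        return g, c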

    Hager Gregory and Belhumeur, 1998, develop an efficient, general

framework for object tracking which addresses complications arising from changes in pose, changes in illumination, and partial occlusion. They first develop a computationally efficient method for handling the geometric distortions produced by changes in pose. They then combine geometry and

    illumination into an algorithm that tracks large image regions using no more

    computation than would be required to track with no accommodation for

    illumination changes. They augment these methods with techniques from

    robust statistics and treat occluded regions on the object as statistical

    outliers. Throughout, they present experimental results performed on live

    video sequences demonstrating the effectiveness and efficiency of their

    methods.

    Hager Gregory and Toyama et al., 1998, describe X Vision as a small

    set of image-level tracking primitives, and a framework for combining

    tracking primitives to form complex tracking systems. Efficiency and

    robustness are achieved by propagating geometric and temporal constraints

    to the feature detection level, where image warping and specialized image

    processing are combined to perform feature detection quickly and robustly.

    They present some of these applications as an illustration of how useful,

    robust tracking systems can be constructed by simple combinations of a few

    basic primitives combined with the appropriate task-specific constraints.

Colmenarez, 1999, presents a system that extracts information from video to keep track of people, recognize their facial expressions and gestures, and complement other forms of human-computer interfaces. A learning technique based on information-theoretic discrimination is used to construct face and facial feature detectors. A real-time system for face and facial feature detection and tracking in continuous video is implemented. A probabilistic framework for embedded face and facial expression recognition from image sequences is obtained.

Harold Hualu Wang and Chang, 1999, present FaceTrack, a system

    that detects, tracks, and groups faces from compressed video data. They

    introduce the face tracking framework based on the Kalman filter and

    multiple hypothesis techniques. They compare and discuss the effects of

    various motion models on tracking performance. They investigate constant-

    velocity, constant-acceleration, correlated-acceleration, and variable-

    dimension-filter models. They find that constant-velocity and correlated-

    acceleration models work more effectively for commercial videos sampled at

    high frame rates. They also develop novel approaches based on multiple

    hypothesis techniques to resolving ambiguity issues. Simulation results show

    the effectiveness of the proposed algorithms on tracking faces in real

    applications.
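A constant-velocity model of the kind compared in that study keeps the face position and velocity in the state vector and predicts linearly between detections. Below is a minimal NumPy sketch of such a Kalman filter for one position pair; the noise covariances are illustrative assumptions.

    import numpy as np

    dt = 1.0                       # frame interval
    F = np.array([[1, 0, dt, 0],   # state: [x, y, vx, vy]
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], float)
    H = np.array([[1, 0, 0, 0],    # only the position is observed
                  [0, 1, 0, 0]], float)
    Q = 0.01 * np.eye(4)           # process noise (assumed)
    R = 4.0 * np.eye(2)            # measurement noise (assumed)

    x = np.zeros(4)                # initial state
    P = np.eye(4)

    def step(z):
        """One Kalman predict/update cycle given a face detection z = (x, y)."""
        global x, P
        x = F @ x                        # predict state
        P = F @ P @ F.T + Q
        S = H @ P @ H.T + R              # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
        x = x + K @ (np.asarray(z) - H @ x)
        P = (np.eye(4) - K @ H) @ P
        return x[:2]                     # filtered face position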

Vieux et al., 1999, use a face-tracking system developed in the robotics area to normalize a video sequence to centered images of the face. The face tracking allowed them to implement a compression scheme based on Principal Component Analysis (PCA), which they call Orthonormal Basis Coding (OBC).

    Comaniciu et al., 2000, propose a new method for real-time tracking

    of non-rigid objects seen from a moving camera. The central computational

    module is based on the mean shift iterations and finds the most probable

target position in the current frame. The dissimilarity between the target model and the target candidates is expressed by a metric derived from the

    Bhattacharyya coefficient. The theoretical analysis of the approach shows

    that it relates to the Bayesian framework while providing a practical, fast

and efficient solution. The capability of the tracker to handle in real-time

    partial occlusions, significant clutter, and target scale variations is

    demonstrated for several image sequences.
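The Bhattacharyya coefficient between a normalized candidate histogram p and the target-model histogram q is rho = sum over u of sqrt(p_u * q_u), and the mean shift iterations minimize the derived distance sqrt(1 - rho). A short NumPy illustration (histogram construction details are assumed):

    import numpy as np

    def bhattacharyya_similarity(p, q):
        """Bhattacharyya coefficient between two color histograms."""
        p = p / p.sum()
        q = q / q.sum()
        return np.sum(np.sqrt(p * q))   # 1.0 means identical distributions

    def bhattacharyya_distance(p, q):
        """Distance minimized by the mean shift tracker."""
        return np.sqrt(max(0.0, 1.0 - bhattacharyya_similarity(p, q)))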

    Feris Rogério Schmidt et al., 2000, present a real time system for

detection and tracking of facial features in video sequences. Such a system

    may be used in visual communication applications, such as teleconferencing,

    virtual reality, intelligent interfaces, human machine interaction, and

    surveillance. They have used a statistical skin-color model to segment face-

    candidate regions in the image. The presence or absence of a face in each

    region is verified by means of an eye detector, based on an efficient

    template matching scheme. Once a face is detected, the pupils, nostrils and

    lip corners are located and these facial features are tracked in the image

    sequence, performing real time processing.
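Eye verification inside a skin-color candidate region can be sketched with normalized cross-correlation template matching, as below; the acceptance threshold and the use of OpenCV's matchTemplate are illustrative assumptions rather than the authors' exact scheme.

    import cv2

    def contains_eye(region_gray, eye_template_gray, threshold=0.6):
        """Accept a face-candidate region if an eye template matches inside it."""
        res = cv2.matchTemplate(region_gray, eye_template_gray,
                                cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(res)
        return max_val >= threshold, max_loc  # decision and best match position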

    Liu Zhu and Wang, 2000, propose a new approach for combined face

    detection and tracking in video. The face detection algorithm is a fast

    template matching procedure using iterative dynamic programming (DP).

Schneiderman and Kanade, 2000, describe a statistical method for 3D

    object detection. They represent the statistics of both object appearance and

    “non-object” appearance using a product of histograms. Each histogram

    represents the joint statistics of a subset of wavelet coefficients and their

    position on the object. Their approach is to use many such histograms

    representing a wide variety of visual attributes. Using this method, they

    have developed the first algorithm that can reliably detect human faces with

    out-of-plane rotation and the first algorithm that can reliably detect

    passenger cars over a wide range of viewpoints.

Shan et al., 2001, present a model-based bundle adjustment algorithm

    to recover the 3D model of a scene / object from a sequence of images with

    unknown motions. Instead of representing scene / object by a collection of

    isolated 3D features (usually points), their algorithm uses a surface

controlled by a small set of parameters. Compared with previous model-based approaches, their approach has the following advantages. Instead of using the model space as a regularizer, they directly use it as their search

    space, thus resulting in a more elegant formulation with fewer unknowns

    and fewer equations. Their algorithm automatically associates tracked points

    with their correct locations on the surfaces, thereby eliminating the need for

    a prior 2D-to-3D association. Regarding face modeling, they use a very

    small set of face metrics to parameterize the face geometry, resulting in a

    smaller search space and a better posed system.

Toyama and Blake, 2001, present a probabilistic paradigm for visual

    tracking. Probabilistic mechanisms are attractive because they handle fusion

    of information, especially temporal fusion, in a principled manner. Exemplars

    are selected as representatives of raw training data. They represent

    probabilistic mixture distributions of object configurations. Their use avoids

    tedious hand-construction of object models, and problems with changes of

    topology. Using exemplars in place of a parameterized model poses several

    challenges. It uses a noise model that is learned from training data. It

    eliminates any need for an assumption of probabilistic pixel wise

    independence.

    Arulampalam et al., 2002, review both optimal and suboptimal

    Bayesian algorithms for nonlinear / non-Gaussian tracking problems, with a

    focus on particle filters. Particle filters are sequential Monte Carlo methods

    based on point mass representations of probability densities, which can be

applied to any state-space model and which generalize the traditional Kalman

    filtering methods. Several variants of the particle filter such as SIR, ASIR,

    and RPF are introduced within a generic framework of the sequential

    importance sampling (SIS) algorithm and compared with the standard EKF.
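A minimal sketch of one SIR (sampling importance resampling) step, with a random-walk motion model and a user-supplied measurement likelihood, is given below in NumPy; the motion-noise scale and the likelihood function are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def sir_step(particles, weights, likelihood, motion_std=2.0):
        """One SIR particle-filter step: predict, weight, resample."""
        n = len(particles)
        # Predict: propagate each particle through a random-walk motion model.
        particles = particles + rng.normal(0.0, motion_std, particles.shape)
        # Update: weight each particle by the measurement likelihood.
        weights = weights * np.array([likelihood(p) for p in particles])
        weights /= weights.sum()
        # Resample (multinomial) to combat weight degeneracy.
        idx = rng.choice(n, size=n, p=weights)
        return particles[idx], np.full(n, 1.0 / n)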

    Chiang et al., 2003, present a real-time face detection algorithm for

    locating faces in images and videos. This algorithm finds not only the face

regions, but also the precise locations of the facial components such as eyes

    and lips. The algorithm starts from the extraction of skin pixels based upon

    rules derived from a simple quadratic polynomial model. With a minor

    modification, this polynomial model is also applicable to the extraction of

    lips. The benefits of applying these two similar polynomial models are

twofold. First, much computation time is saved. Second, both extraction

    processes can be performed simultaneously in one scan of the image or

    video frame. The eye components are then extracted after the extraction of

    skin pixels and lips. The algorithm removes the falsely extracted components

    by verifying with rules derived from the spatial and geometrical relationships

    of facial components. The precise face regions are determined accordingly.

    According to the experimental results, the proposed algorithm exhibits

    satisfactory performance in terms of both accuracy and speed for detecting

    faces with wide variations in size.
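Skin extraction of this kind reduces to evaluating a per-pixel rule on the color channels. The sketch below substitutes a widely cited explicit RGB rule for the authors' quadratic polynomial model, whose exact coefficients are not reproduced here; it is illustrative only.

    import numpy as np

    def skin_mask(rgb):
        """Per-pixel skin classification with an explicit RGB rule
        (illustrative stand-in for the paper's quadratic polynomial model)."""
        r = rgb[..., 0].astype(int)
        g = rgb[..., 1].astype(int)
        b = rgb[..., 2].astype(int)
        mx = np.maximum(np.maximum(r, g), b)
        mn = np.minimum(np.minimum(r, g), b)
        return ((r > 95) & (g > 40) & (b > 20) &
                (mx - mn > 15) &                      # sufficient color spread
                (np.abs(r - g) > 15) & (r > g) & (r > b))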

Verma et al., 2003, present a probabilistic method for detecting and

    tracking multiple faces in a video sequence. The proposed method integrates

    the information of face probabilities provided by the detector and the

    temporal information provided by the tracker to produce a method superior

    to the available detection and tracking methods. They claim 1) Accumulation

    of probabilities of detection over a sequence. This leads to coherent

    detection over time and, improves detection results. 2) Prediction of the

    detection parameters which are position, scale, and pose. This guarantees

    the accuracy of accumulation as well as a continuous detection. 3) The

    representation of pose is based on the combination of two detectors, one for

    frontal views and one for profiles.

Zhou et al., 2003, propose a time series state space model to

    fuse temporal information in a probe video, which simultaneously

    characterizes the kinematics and identity using a motion vector and an

    identity variable, respectively. The joint posterior distribution of the motion

vector and the identity variable is estimated at each time instant and then propagated to the next time instant. Marginalization over the motion vector

    yields a robust estimate of the posterior distribution of the identity variable.

    A computationally efficient sequential importance sampling (SIS) algorithm

is developed to estimate the posterior distribution. Owing to the propagation of the identity variable over time, degeneracy in the posterior probability of the identity variable is achieved, giving improved recognition. The gallery is generalized

    to videos in order to realize video-to-video recognition. An exemplar-based

    learning strategy is adopted to automatically select video representatives

    from the gallery, serving as mixture centers in an updated likelihood

    measure. The SIS algorithm is applied to approximate the posterior

    distribution of the motion vector, the identity variable, and the exemplar

    index, whose marginal distribution of the identity variable produces the

    recognition result. The model formulation is very general and it allows a

    variety of image representations and transformations.

    Okuma Kenji et al., 2004, introduce a vision system that is capable of

    learning, detecting and tracking the objects of interest. The system is

    demonstrated in the context of tracking hockey players using video

    sequences. Their approach combines the strengths of two successful

    algorithms: mixture particle filters and Adaboost. The mixture particle filter

    is ideally suited to multi-target tracking as it assigns a mixture component to

    each player. The crucial design issues in mixture particle filters are the

    choice of the proposal distribution and the treatment of objects leaving and

    entering the scene. They construct the proposal distribution using a mixture

    model that incorporates information from the dynamic models of each player

    and the detection hypotheses generated by Adaboost. The learned Adaboost

proposal distribution allows the system to quickly detect players entering the scene, while the filtering process enables it to keep track of the individual players.

Perez, 2004, states that the effectiveness of probabilistic tracking of objects in image sequences has been revolutionized by the development of particle filtering. Whereas Kalman filters are restricted to Gaussian distributions, particle filters can propagate more general distributions, albeit only approximately.

    This is of particular benefit in visual tracking because of the inherent

    ambiguity of the visual world that stems from its richness and complexity.

    One important advantage of the particle filtering framework is that it allows

    the information from different measurement sources to be fused in a

    principled manner. They introduce generic importance sampling mechanisms

    for data fusion and discuss them for fusing color with either stereo sound,

    for teleconferencing, or with motion, for surveillance with a still camera.

    They show how each of the three cues can be modeled by an appropriate

    data likelihood function, and how the intermittent cues (sound or motion)

    are best handled by generating proposal distributions from their likelihood

    functions. The effective fusion of the cues by particle filtering is

    demonstrated on real teleconference and surveillance data.

Vacchetti et al., 2004, propose an efficient real-time solution for

    tracking rigid objects in 3D using a single camera that can handle large

    camera displacements, drastic aspect changes, and partial occlusions. While

    commercial products are already available for offline camera registration,

    robust online tracking remains an open issue because many real-time

    algorithms described in the literature still lack robustness and are prone to

    drift and jitter. To address these problems, they have formulated the

    tracking problem in terms of local bundle adjustment and have developed a

    method for establishing image correspondences that can equally well handle

    short and wide baseline matching. They then can merge the information

    from preceding frames with that provided by a very limited number of key

    frames created during a training stage, which results in a real-time tracker

    that does not jitter or drift and can deal with significant aspect changes.

Dong-gil Jeong et al., 2005, propose a robust real-time head tracking

    algorithm using a pan-tilt-zoom camera. They assume the shape of a head is

    an ellipse and a model color histogram is acquired in advance. In the first

    frame, the appropriate position and scale of the head is determined based

    on the user input. In the subsequent frames, the initial position is selected

    at the same position of the ellipse as in the previous frame. The mean shift

    procedure is applied to make the ellipse position converge to the target

    center where the color histogram similarity to the model and previous one is

maximized. Here, the previous histogram denotes a color histogram adaptively

    extracted from the result of the previous frame. The position-adjusted ellipse

    is refined by using color and shape information. Large background motion

    often prohibits the initial position from converging to the target position.

    They estimate a robust initial position by compensating the background

    motion. They use vertical and horizontal 1-D projection datasets. Extensive

    experiments prove that a head is well tracked even when the person moves

    fast and the scale of the head changes drastically.
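The mean shift convergence step used here is available directly in OpenCV through histogram back projection, as sketched below; the hue-only histogram and the termination criteria are illustrative assumptions.

    import cv2

    def make_model(frame, window):
        """Model histogram from a user-selected head region (x, y, w, h)."""
        x, y, w, h = window
        hsv = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0], None, [32], [0, 180])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        return hist

    def track(frame, window, hist):
        """Shift the window to the local maximum of histogram similarity."""
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
        _, window = cv2.meanShift(backproj, window, criteria)
        return window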

Fidaleo Douglas et al., 2005, provide an extensive analysis of a state-of-the-art key frame-based tracker, quantitatively demonstrating the

    dependence of tracking performance on underlying mesh accuracy, number

    and coverage of reliably matched feature points, and initial key frame

    alignment. 3D tracking of faces in video streams is a difficult problem that

    can be assisted with the use of a priori knowledge of the structure and

    appearance of the subject’s face at predefined poses (key frames). Tracking

    with a generic face mesh can introduce an erroneous bias that leads to

    degraded tracking performance when the subject’s out-of-plane motion is far

    from the set of key frames. To reduce this bias, they show how online

    refinement of a rough estimate of face geometry may be used to re-estimate

the 3D key frame features, thereby mitigating sensitivities to initial key

    frame inaccuracies in pose and geometry. An in-depth analysis is performed

on sequences of faces with synthesized rigid head motion. Subsequent trials

    on real video sequences demonstrate that tracking performance is more

    sensitive to initial model alignment and geometry errors when fewer feature

    points are matched and/or do not adequately span the face. The analysis

    suggests several indications for most effective 3D tracking of faces in real

    environments.

Hampapur et al., 2005, state that situation awareness is the key to security. Awareness requires information that spans multiple scales of space and time. Smart video surveillance systems are capable of enhancing situational awareness across multiple scales of space and time; at the present time, however, the component technologies are evolving in isolation. To provide comprehensive, nonintrusive situation awareness, it is imperative to address the challenge of multi-scale, spatiotemporal tracking. This article explores the concepts of multi-scale spatiotemporal tracking through the use of real-time video

    analysis, active cameras, multiple object models, and long-term pattern

    analysis to provide comprehensive situation awareness.

Koterba Seth et al., 2005, study the relationship between multi-view

    Active Appearance Model (AAM) fitting and camera calibration. They propose

    to calibrate the relative orientation of a set of N > 1 cameras by fitting an

    AAM to sets of N images. They use the human face as a (non-rigid)

calibration grid. The algorithm calibrates a set of 2 × 3 weak-perspective camera

    projection matrices, projections of the world coordinate system origin into

    the images, depths of the world coordinate system origin, and focal lengths.
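A 2 × 3 weak-perspective camera of this kind projects a 3D point by a scaled orthographic map, x = s * R2x3 * X + t, where the scale s plays the role of focal length divided by average depth. A small NumPy illustration, with all numeric values assumed:

    import numpy as np

    def weak_perspective_project(X, R, s, t):
        """Project 3D points X (N x 3) with a weak-perspective camera:
        scale s, top two rows of a rotation R (2 x 3), translation t (2,)."""
        return s * (X @ R.T) + t   # N x 2 image points

    # Illustrative values (assumptions, not calibrated quantities):
    R = np.eye(3)[:2]              # top two rows of a rotation matrix
    s = 0.8                        # focal length / mean depth
    t = np.array([160.0, 120.0])   # image-plane offset
    pts = weak_perspective_project(np.array([[0.0, 0.0, 5.0]]), R, s, t)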

    Roy-Chowdhury et al., 2005, present two algorithms for 3D face

    modeling from a monocular video sequence. The first method is based on

    Structure from Motion (SFM), while the second one relies on contour

    adaptation over time. The SFM based method incorporates statistical

    measures of quality of the 3D estimate into the reconstruction algorithm.

    The initial multi-frame SFM estimate is smoothed using a generic face model

in an energy function minimization framework. Such a strategy avoids

    excessively biasing the final 3D estimate towards the generic model. The

    second method relies on matching a generic 3D face model to the outer

    contours of a face in the input video sequence, and integrating this strategy

    over all the frames in the sequence. It consists of an edge-based head pose

    estimation step, followed by global and local deformations of the generic

    face model in order to adapt it to the actual 3D face. This contour adaptation

    approach is able to separate the geometric subtleties of the human head

    from the variations in shading and texture and it does not rely on finding

    accurate point correspondences across frames.

    Adam et al., 2006, present an algorithm for tracking an object in a

    video sequence. The template object is represented by multiple image

    fragments or patches. The patches are arbitrary and are not based on an

    object model. Every patch votes on the possible positions and scales of the

    object in the current frame, by comparing its histogram with the

    corresponding image patch histogram.
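Fragment-based voting of this kind compares each template patch histogram with the histogram of the corresponding patch in a candidate window and combines the per-patch similarities robustly, so that occluded fragments do not dominate. A minimal grayscale sketch in NumPy follows; the 3 × 3 patch grid, the L1 histogram similarity and the median vote are illustrative assumptions, not the authors' exact choices.

    import numpy as np

    def patch_hist(img, bins=16):
        h, _ = np.histogram(img, bins=bins, range=(0, 256))
        return h / max(h.sum(), 1)

    def fragments_vote(template, candidate, grid=(3, 3)):
        """Score a candidate window (same size as the template) by robustly
        combining per-patch histogram similarities."""
        th, tw = template.shape
        sims = []
        for i in range(grid[0]):
            for j in range(grid[1]):
                ys = slice(i * th // grid[0], (i + 1) * th // grid[0])
                xs = slice(j * tw // grid[1], (j + 1) * tw // grid[1])
                p = patch_hist(template[ys, xs])
                q = patch_hist(candidate[ys, xs])
                sims.append(1.0 - 0.5 * np.abs(p - q).sum())  # in [0, 1]
        return float(np.median(sims))  # robust to occluded fragments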

    Dedeoglu et al., 2006, describe active appearance models (AAM) as

    compact representations of the shape and appearance of objects. Fitting

    AAMs to images is a difficult, non-linear optimization task. Traditional

    approaches minimize the L2 norm error between the model instance and the

    input image warped onto the model coordinate frame. While this works well

    for high resolution data, the fitting accuracy degrades quickly at lower

resolutions. They show that a careful design of the fitting criterion can overcome many of the low-resolution challenges. In their resolution-aware formulation

    (RAF), they explicitly account for the finite size sensing elements of digital

    cameras, and simultaneously model the processes of object appearance

variation, geometric deformation, and image formation. Their Gauss-Newton gradient descent algorithm not only synthesizes model instances as a function of the estimated parameters but also simulates the formation of low-resolution images in a digital camera. They compare the RAF algorithm against a state-

    of-the-art tracker across a variety of resolution and model complexity levels.

Fonseca Pedro Miguel et al., 2006, state that a compressed-domain generic object tracking algorithm offers, in combination with a face detection algorithm, a low-computational-cost solution to the problem of detecting and locating faces in frames of compressed video sequences (such as MPEG-1 or MPEG-2). Objects such as faces can thus be tracked through a compressed video stream using motion information provided by existing forward and backward motion vectors. The described solution requires only low computational resources on CE devices while at the same time offering sufficiently good location rates.

Lu Le and Dai Xiangtan, 2006, present a hybrid sampling solution that combines RANSAC and particle filtering. RANSAC provides proposal particles

    that, with high probability, represent the observation likelihood. Both

    conditionally independent RANSAC sampling and boosting-like conditionally

    dependent RANSAC sampling are explored. They show that the use of

    RANSAC-guided sampling reduces the necessary number of particles to

    dozens for a full 3D tracking problem. The algorithm has been applied to the

    problem of 3D face pose tracking with changing expression. They

demonstrate the validity of the approach with several video sequences acquired

    in an unstructured environment.

Xu and Roy-Chowdhury, 2007, present a theory for combining the effects of motion, illumination, 3D structure, and camera parameters in

    a sequence of images obtained by a perspective camera. The set of all

    Lambertian reflectance functions of a moving object, at any position,

    illuminated by arbitrarily distant light sources, lies “close” to a bilinear

    subspace consisting of nine illumination variables and six motion variables.

    This result implies that, given an arbitrary video sequence, it is possible to

    recover the 3D structure, motion and illumination conditions simultaneously

using the bilinear subspace formulation. The derivation builds upon existing

    work on linear subspace representations of reflectance by generalizing it to

    moving objects. Lighting can change slowly or suddenly, locally or globally,

    and can originate from a combination of point and extended sources. They

    experimentally compare the results of their theory with ground truth data

    and also provide results on real data by using video sequences of a 3D face

    and the entire human body with various combinations of motion and

    illumination directions. They show results of their theory in estimating 3D

    motion and illumination model parameters from a video sequence.

Yu et al., 2007, propose a method to incrementally super-resolve

    3D facial texture by integrating information frame by frame from a video

    captured under changing poses and illuminations. They recover illumination,

3D motion and shape parameters from their tracking algorithm. This

    information is then used to super-resolve 3D texture using Iterative Back-

    Projection (IBP) method. The super-resolved texture is fed back to the

    tracking part to improve the estimation of illumination and motion

    parameters. This closed-loop process continues to refine the texture as new

    frames come in. They also propose a local-region based scheme to handle

    non-rigidity of the human face.

Stasiak and Pacut, 2008, develop a system for parallel face detection, tracking and recognition in real-time video sequences. They

    describe its face detection and tracking modules. The solution is based on

    the particle filtering in the conditional density propagation framework of

Isard and Blake, and utilizes color information at different levels of detail.

    The use of color makes processing computationally cheap and robust in

    finding candidates for further processing.

Suandi et al., 2008, describe a technique to estimate human

    face pose from color video sequence using Dynamic Bayesian Network

    (DBN). As face and facial features trackers usually track eyes, pupils, mouth

corners and the skin region (face), their proposed method utilizes merely three of these features (pupils, mouth center and skin region) to compute the

    evidence for DBN inference. No additional image processing algorithm is

required; thus, it is simple and operates in real-time. The evidence values, called the horizontal ratio and the vertical ratio, are determined using a model-based technique and are designed to simultaneously solve two problems in the tracking task: scale factor and noise influence.

Valenti and Gevers, 2008, note that the ubiquitous application of eye tracking is precluded by the requirement of dedicated and expensive hardware, such as infrared high-definition cameras. Systems based solely on appearance have therefore been proposed in the literature. Although these systems are able to successfully locate eyes, their accuracy is significantly lower than that of commercial eye-tracking devices. Their aim is to perform very accurate eye center location and tracking using a simple webcam. By means of a novel relevance mechanism, the proposed method makes use of isophote properties to gain invariance to linear lighting changes, to achieve rotational invariance and to

    keep low computational costs. They test their approach for accurate eye

    location and robustness to changes in illumination and pose, using the BioID

    and the Yale Face B databases. They demonstrate that their system can

    achieve a considerable improvement in accuracy over state of the art

    techniques.

Yung et al., 2011, survey the state-of-the-art progress on visual tracking methods, classify them into different categories, and identify future trends. Visual tracking is a fundamental task in many computer vision applications and has been well studied in recent decades. Robust visual

    tracking remains a huge challenge. Difficulties in visual tracking can arise

    due to abrupt object motion, appearance pattern change, non-rigid object

    structures, occlusion and camera motion. They first analyze the state-of-the-

    art feature descriptors which are used to represent the appearance of

tracked objects. Then, they categorize tracking approaches into three groups, provide detailed descriptions of representative methods in each group, and examine their positive and negative aspects as well as the future trends for visual tracking research.

    2.5 SUMMARY

    This chapter has presented the various methods used for face tracking

in a continuous video. Face tracking based on local features such as eyebrows, lips, and mouth, as well as on skin color, has been presented. Chapter 3 presents the feature extraction.