7/29/2019 Image Comm Pardas Bonafonte
1/30
FACIAL ANIMATION PARAMETERS EXTRACTION AND EXPRESSION
RECOGNITION USING HIDDEN MARKOV MODELS
Montse Pardàs, Antonio Bonafonte
Universitat Politècnica de Catalunya, C/Jordi Girona 1-3, D-5
08034 Barcelona Spain
email: {montse,antonio}@gps.tsc.upc.es
This work has been supported by the European project InterFace and TIC2001-0996 of the Spanish
Government
The video analysis system described in this paper aims at facial expression recognition
consistent with the MPEG4 standardized parameters for facial animation, FAP. For this
reason, two levels of analysis are necessary: low level analysis to extract the MPEG4
compliant parameters and high level analysis to estimate the expression of the sequence
using these low level parameters.
The low level analysis is based on an improved active contour algorithm that uses high
level information based on Principal Component Analysis to locate the most significant
contours of the face (eyebrows and mouth), and on motion estimation to track them.
The high level analysis takes as input the FAP produced by the low level analysis tool
and, by means of a Hidden Markov Model classifier, detects the expression of the
sequence.
1. INTRODUCTION
The critical role that emotions play in rational decision-making, in perception and in
human interaction has opened an interest in introducing the ability to recognize and
reproduce emotions in computers. In [26] many applications which could benefit from
this ability are explored. The importance of introducing non verbal communication in
automatic dialogue systems is highlighted in [4] and [5]. However, the research that has
been carried out on understanding human facial expressions cannot be directly applied
to a dialogue system, as it has mainly worked on static images or on image sequences
where the subjects show a specific emotion but there is no speech. Psychological studies
have indicated that at least six emotions are universally associated with distinct facial
expressions: happiness, sadness, surprise, fear, anger and disgust [32]. Several other
emotions and many combinations of emotions have been studied but remain
unconfirmed as universally distinguishable. Thus, most of the research up to now has
been oriented towards detecting these six basic expressions.
Some research has been conducted on pictures that capture the subject's expression at
its peak. These pictures allow detecting the presence of static cues (such as wrinkles) as
well as the position and shape of the facial features [32]. For instance, in [23] and [7]
classifiers for facial expressions were based on neural networks. Static faces were
presented to the network as projections of blocks from feature regions onto the principal
component space generated from the image data set. They obtain around 86% accuracy
in distinguishing the six basic emotions.
Research has also been conducted on the extraction of facial expressions from video
sequences. Most works in this area develop a video database from subjects making
expressions on demand. Their aim has been to classify the six basic expressions that we
have mentioned above, and they have not been tested in a dialogue environment. Most
approaches in this area rely on the Ekman and Friesen Facial Action Coding System
[10]. The FACS is based on the enumeration of all Action Units of a face that cause
facial movements. The combination of these actions units results in a large set of
possible facial expressions. In [31] the directions of rigid and non-rigid motions that are
caused by human facial expressions are identified computing the optical flow at the
points with high gradient at each frame. Then they propose a dictionary to describe the
facial actions of the FACS through the motion of the features and a rule-based system to
recognize facial expressions from these facial actions. A similar approach is presented
in [3], where a local parameterized model of the image motion in specific facial areas is
used for recovering and recognising the non-rigid and articulated motion of the faces.
The parameters of this detected motion are related to the motion of facial features
during facial expressions. In [28] a radial basis function network architecture is
developed that learns the correlation between facial feature motion patterns and human
emotions. The motion of the features is also obtained computing the optical flow at the
points with high gradient. The accuracy of all these systems is also around 85%.
Other approaches employ physically-based models of heads including skin and
musculature [11], [12]. They combine this physical model with registered optical flow
measurements from human faces to estimate the muscle actuations. They propose two
approaches, the first one creates typical patterns of muscle actuation for the different
facial expressions. A new image sequence is classified by its similarity to the typical
patterns of muscle actuation. The second one builds on the same methodology to
generate, from the muscle actuation, the typical pattern of motion energy associated
with each facial expression. In [19] a system is presented that uses facial feature point
tracking, dense flow tracking and high gradient component analysis in the spatio-
temporal domain to extract the FACS Action Units, from which expressions can be
derived. A complete review of these techniques can be found in [24].
In this work we have developed an expression recognition technique consistent with the
MPEG4 standardized parameters for facial definition and animation, FDP and FAP.
Thus, the expression recognition process can be divided in two steps: facial parameter
extraction and facial parameter analysis, which we also refer to as low level and high level
analysis. The facial parameter extraction process is based on a feature point detection
and tracking system which uses active contours. Conventional active contours (snakes)
approaches [16] find the position of the snake by finding a minimum of its energy,
composed of internal and external forces. The external forces pull the contours toward
features such as lines and edges. However, in many applications this minimization leads
to contours that do not represent correctly the feature we are looking for. We propose in
this paper to introduce some higher level information by a statistical characterization of
the snaxels that should represent the contour. From the automatically produced
initialization, the MPEG-4 compliant FAP are computed. These FAP will be used for
training of the expressions extraction system. These techniques are developed in Section
2.
Using the MPEG4 parameters for the analysis of facial emotions has several
advantages. First, the developed high level analysis can benefit from already
existing low level analysis techniques for FDP and FAP extraction, as well as from any
advances that will be made in this area in the future years. Besides, the low-level FAP
constitute a concise representation of the evolution of the expression of the face. From
the training database, and using the available FAP, spatio-temporal patterns for
expressions will be constructed. Our approach for the interpretation of facial
expressions will use Hidden Markov Models (HMM) to recognize different patterns of
FAP evolution. For every defined facial expression, a HMM will be trained with the
extracted feature vectors. In the recognition phase, the HMM system will use as
classification criterion the probability of the input sequence with respect to all the
models. Tests are also performed in sequences composed of different emotions and with
periods of speech, using connected HMM. This process will be explained in Section 3.
Finally, Section 4 will summarize the conclusions of this work.
2. LOW LEVEL ANALYSIS
This Section describes the low level video processing required for the extraction of
Facial Animation Parameters (FAPs). The first step in this analysis is the face detection.
Different techniques can be used for this aim, like [17] or [30], and we will not go into
details of this part. After the face is detected, facial features have to be located and
tracked. Two different approaches are possible: contour based or model based. The first
approach localizes the most important contours of the face and tracks them
in 2D (that is, tracks their projection in the image plane). The second approach consists
in the adaptation, in each frame, of a 3D wireframe model of a face to the image.
Example of this second approach can be found in [1], [6], [8] and [9]. In this work we
will use a technique belonging to the first class because, as it will be explained later on,
we will work on a restricted environment. Facial Animation Parameters (FAPs) will be
computed from the tracked contours or from the adapted model. First, in section 2.1 we
will review the meaning of these parameters. The following subsections present a 2D
technique for the extraction of these parameters based on active contours: section 2.2
reviews the general framework of active contours, section 2.3 applies this technique to
facial feature detection and section 2.4 to facial feature tracking.
2.1 FACIAL ANIMATION PARAMETERS
Facial Animation Parameters (FAPs) are defined in the ISO MPEG-4 standard [21],
together with the Facial Definition Parameters (FDPs), to allow the definition of a facial
shape and its animation reproducing expressions, emotions, and speech pronunciation.
FDPs are illustrated in Figure 1 (figures 1 and 2 have been provided by Roberto Pockaj,
from Genova University); they represent key points in a human face. All the feature
points can be used for the calibration of a model to a face, while only those represented
by black dots in the image are also used for the animation (that is, there are FAPs
describing their motion).
The FAPs are based on the study of minimal facial actions and are closely related to
muscle actions. They represent a complete set of basic facial actions, such as squeeze or
raise eyebrows, open or close eyelids, and therefore allow the representation of most
natural expressions. All FAPs involving translational movement are expressed in terms
of the Facial Animation Parameter Units (FAPUs). These units aim at allowing
interpretation of the FAPs on any facial model in a consistent way, producing
reasonable results in terms of expression and speech pronunciation. FAPUs are
illustrated in Figure 1 and correspond to fractions of distances between some key facial
features.
We will be interested in those FAPs which a) convey important information about the
emotion of the face and b) can be reliably extracted from a natural video sequence. In
particular, we have focused on those FAPs related to the motion of the contour of the
eyebrows and the mouth. These FAPs are specified in Table 1.
Figure 1. Facial Definition Parameters
FAPU value
IRISD = IRISD0 / 1024
ES = ES0 / 1024
ENS = ENS0 / 1024
MNS = MNS0 / 1024
MW = MW0 / 1024
AU = 10^-5 rad
Figure 2. Facial Animation Parameter Units
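To make the definitions in Figure 2 concrete, the sketch below computes the translational FAPUs and one FAP value from hypothetical pixel distances measured on a neutral first frame. The distances and the 5-pixel displacement are illustrative assumptions, not values from the paper.

```python
# Hypothetical neutral-face distances (in pixels) for the five translational FAPUs.
neutral = {"IRISD0": 34.0, "ES0": 180.0, "ENS0": 120.0, "MNS0": 60.0, "MW0": 110.0}

# Each translational FAPU is the corresponding neutral-face distance over 1024.
fapu = {name[:-1]: value / 1024.0 for name, value in neutral.items()}

# A FAP value is a displacement expressed in its FAPU. For example, FAP 31
# (raise_l_i_eyebrow) in ENS units, for an assumed 5-pixel upward displacement:
raise_l_i_eyebrow = 5.0 / fapu["ENS"]
```

This is why a FAP file animates any calibrated model consistently: displacements are stored relative to the face's own proportions rather than in pixels.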
#   FAP name               FAP description                                                                        Unit
31  raise_l_i_eyebrow      Vertical displacement of left inner eyebrow                                            ENS
32  raise_r_i_eyebrow      Vertical displacement of right inner eyebrow                                           ENS
33  raise_l_m_eyebrow      Vertical displacement of left middle eyebrow                                           ENS
34  raise_r_m_eyebrow      Vertical displacement of right middle eyebrow                                          ENS
35  raise_l_o_eyebrow      Vertical displacement of left outer eyebrow                                            ENS
36  raise_r_o_eyebrow      Vertical displacement of right outer eyebrow                                           ENS
37  squeeze_l_eyebrow      Horizontal displacement of left eyebrow                                                ES
38  squeeze_r_eyebrow      Horizontal displacement of right eyebrow                                               ES
51  lower_t_midlip_o       Vertical top middle outer lip displacement                                             MNS
52  raise_b_midlip_o       Vertical bottom middle outer lip displacement                                          MNS
53  stretch_l_cornerlip_o  Horizontal displacement of left outer lip corner                                       MW
54  stretch_r_cornerlip_o  Horizontal displacement of right outer lip corner                                      MW
55  lower_t_lip_lm_o       Vertical displacement of midpoint between left corner and middle of top outer lip      MNS
56  lower_t_lip_rm_o       Vertical displacement of midpoint between right corner and middle of top outer lip     MNS
57  raise_b_lip_lm_o       Vertical displacement of midpoint between left corner and middle of bottom outer lip   MNS
58  raise_b_lip_rm_o       Vertical displacement of midpoint between right corner and middle of bottom outer lip  MNS
59  raise_l_cornerlip_o    Vertical displacement of left outer lip corner                                         MNS
60  raise_r_cornerlip_o    Vertical displacement of right outer lip corner                                        MNS
Table 1. FAPs used for the eyebrows and the mouth
To extract these FAPs in frontal faces, the tracking of the 2-D contour of the eyebrows
and mouth is sufficient. If the sequences present rotation and translation of the head,
then a system based on a 3D face model tracking would be more robust. In our case the
FAPs are used to train the HMM classifier. For this reason, it is better to constrain the
kind of sequences that are going to be analyzed. We will use frontal faces and a
supervised tracking procedure, in order not to introduce any errors in the training of the
system. Once the system has been trained, any FAP extraction tool can be used. If the
results it produces are accurate, the high level analysis will have a higher probability of
success. The low level analysis system that we propose for training can also be used for
testing with other sequences, as long as the faces are frontal or we previously apply a
global motion estimation for the head.
The detection and tracking algorithm that will be explained in the next section has been
integrated in a Graphical User Interface (GUI) that supports corrections to the results
being produced by the automatic algorithm. From the detection of facial features
produced, it computes the FAP units and then from the results of the automatic tracking
it writes the MPEG-4 compliant FAP files. These FAP files will be used for training of
the expressions extraction system.
The algorithm is applied to the tracking of the eyebrows and the mouth, thus producing
the MPEG4 FAPs 31-38 for the eyebrows and 51-60 for the mouth, described in Table 1.
2.2 ACTIVE CONTOURS
2.2.1 Introduction
Active contours were first introduced by Kass et al. [16]. They proposed energy
minimization as a framework where low-level information (such as image gradient or
image intensity) can be combined with higher-level information (such as shape,
continuity of the contour or user interactivity). In their original work the energy
minimization problem was solved using a variational technique. In [2] Amini et al.
proposed Dynamic Programming (DP) as a different solution to the minimization
problem. The use of the Dynamic Programming approach will allow us to introduce the
concept of 'candidates' for the control points of the contour, thus avoiding the risk of
falling into local minima located near the initial position.
In [25], we proposed a first approach for tracking facial features, based on dynamic
programming, introducing a new term in the energy of the snake, and selecting the
candidate pixels for the contour (snaxels) using motion estimation. Currently the DP
approach has been extended to be able to automatically initialize the facial contours. In
this case, the candidate snaxels are selected by a statistical characterization of the
contour based on Principal Component Analysis (PCA). In this subsection we will
review the basic formulation of active contours, while the next subsections explain how
we apply them for facial feature detection (2.3) and for facial feature tracking (2.4).
2.2.2 Active contours formulation
In the discrete formulation of active contour models a contour is represented as a set of
snaxels v_i = (x_i, y_i), i = 0, ..., N-1, where x_i and y_i are the x and y coordinates of
snaxel v_i. The energy of the contour, which is going to be minimized, is defined by:

E_{snake} = \sum_{i=0}^{N-1} \left[ E_{int}(v_i) + E_{ext}(v_i) \right]    (1)

We can use a discrete approximation of the second derivative to compute E_{int}:

E_{int}(v_i) = \| v_{i-1} - 2 v_i + v_{i+1} \|^2    (2)
This is an approximation to the curvature of the contour at snaxel v_i, if the snaxels are
equidistant. Minimizing this energy will produce smooth curves. This is only
appropriate for snaxels that are not corners of the contour. More appropriate
definitions for the internal energy will be proposed in Section 2.3.2 for the initialisation
of the snake and in Section 2.4.2 for tracking.
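As a minimal numerical sketch of the discrete energies in Eqs. (1) and (2); the function names and the toy contour are ours, not from the paper:

```python
import numpy as np

def internal_energy(snaxels):
    """Curvature term of Eq. (2): ||v_{i-1} - 2 v_i + v_{i+1}||^2 per interior snaxel."""
    v = np.asarray(snaxels, dtype=float)          # shape (N, 2)
    second_diff = v[:-2] - 2.0 * v[1:-1] + v[2:]  # discrete second derivative
    return np.sum(second_diff ** 2, axis=1)       # one value per interior snaxel

def snake_energy(snaxels, external):
    """Total energy of Eq. (1): sum of internal and external terms.
    `external` is a caller-supplied array with one E_ext value per interior snaxel."""
    return float(np.sum(internal_energy(snaxels)) + np.sum(external))

# A straight, equally spaced contour has zero curvature energy.
line = [(i, 0.0) for i in range(6)]
print(snake_energy(line, external=np.zeros(4)))  # -> 0.0
```

Minimizing this quantity over the snaxel positions is exactly what the dynamic programming scheme of the next subsection does.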
The purpose of the term E_ext is to attract the snake to desired feature points or contours
in the image. In this work we have used the gradient of the image intensity I(x, y)
along the contour from v_i to v_{i+1}. Thus, E_ext at snaxel v_i will depend only on the
positions of the snaxels v_i and v_{i+1}. That is,
E_{ext}(v_i) = E_{cont}(v_i) = f(I, v_i, v_{i+1})    (3)
However, a new term will also be added to the external energy, in order to be able to
track contours that have stronger edges nearby.
2.2.3 Dynamic programming
We will use the DP approach to minimize the energy in Eq. (1). Let us express the
energy of the snake making explicit the dependencies of its terms:

E_{snake}(v) = \sum_{i=0}^{N-1} \left[ E_{int}(v_{i-1}, v_i, v_{i+1}) + E_{ext}(v_i, v_{i+1}) \right] = \sum_{i=0}^{N-1} E(v_{i-1}, v_i, v_{i+1})    (4)
Although snakes can be open or closed, the DP approach can be applied directly only to
open snakes. To apply DP to open snakes, the limits of Eq. 4 are adjusted to 1 and N-2
respectively.
Now, as described in [2], this energy can be minimized via discrete DP by defining a two-
element vector of state variables at the ith decision stage: (v_{i+1}, v_i). The optimal value
function S_i(v_{i+1}, v_i) is a function of two adjacent points on the contour, and can be
calculated, for every pair of possible candidate positions for snaxels v_{i+1} and v_i, as:

S_i(v_{i+1}, v_i) = \min_{v_{i-1}} \left[ S_{i-1}(v_i, v_{i-1}) + E(v_{i-1}, v_i, v_{i+1}) \right]    (5)

S_0(v_1, v_0) is initialized to E_{ext}(v_0, v_1) for every possible candidate pair (v_0, v_1) and,
from this, S_i can be computed iteratively from i = 1 up to i = N-2 for every candidate
position for v_i. The total energy of the snake will be

E_{snake} = \min_{v_{N-1}, v_{N-2}} S_{N-2}(v_{N-1}, v_{N-2})    (6)
Besides, at every step i we have to store a matrix M_i which records the position of v_{i-1}
that minimizes Eq. (5), that is,

M_i(v_{i+1}, v_i) = v_{i-1} such that v_{i-1} minimizes (5).

By backtracking from the final energy of the snake and using the matrices M_i, the optimal
position for every snaxel can be found.
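The recursion of Eq. (5) and the backtracking step can be sketched as follows. This is a didactic implementation, not the authors' code; for simplicity S_0 is initialized to zero, so the external term of the first pair is assumed to be folded into the caller-supplied energy function E(v_{i-1}, v_i, v_{i+1}).

```python
import numpy as np

def dp_open_snake(candidates, energy):
    """Minimize Eq. (4) for an open snake via the DP recursion of Eq. (5).
    candidates[i]: list of candidate positions for snaxel v_i.
    energy(a, b, c): E(v_{i-1}=a, v_i=b, v_{i+1}=c)."""
    N = len(candidates)
    # S[-1] plays the role of S_{i-1}(v_i, v_{i-1}); back[i-1] stores the argmin
    # index of v_{i-1} for backtracking (the matrix M_i of the text).
    S = [np.zeros((len(candidates[1]), len(candidates[0])))]  # simplified S_0 = 0
    back = []
    for i in range(1, N - 1):
        ci_1, ci, ci1 = candidates[i - 1], candidates[i], candidates[i + 1]
        Si = np.empty((len(ci1), len(ci)))
        Bi = np.empty((len(ci1), len(ci)), dtype=int)
        for k, c in enumerate(ci1):
            for j, b in enumerate(ci):
                costs = [S[-1][j, h] + energy(a, b, c) for h, a in enumerate(ci_1)]
                Bi[k, j] = int(np.argmin(costs))
                Si[k, j] = costs[Bi[k, j]]
        S.append(Si)
        back.append(Bi)
    # Best final pair (v_{N-1}, v_{N-2}), Eq. (6), then backtrack through M_i.
    k, j = np.unravel_index(np.argmin(S[-1]), S[-1].shape)
    path = [int(k), int(j)]
    for Bi in reversed(back):
        path.append(int(Bi[path[-2], path[-1]]))
    path.reverse()
    return [candidates[i][p] for i, p in enumerate(path)]
```

With m candidates per snaxel the inner loops cost O(m^3) per stage, matching the complexity discussed in section 2.2.4.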
In the case of a closed contour the solution proposed in [13] is to impose the first and
last snaxels to be the same, and fix it to a given candidate for this position. The
application of the DP algorithm will produce the best result under this restriction. Then
this initial (and final) snaxel is successively changed to all the possible candidates, and
the one that produces the smallest energy is selected. We use an approximation proposed
in [14] that requires only two open contour optimisation steps.
2.2.4 Selection of candidates
Up to now, we have assumed that for every snaxel vi there is a finite (and hopefully
small) number of candidates, but we have omitted how to select these candidates. The
computational complexity of each optimisation step is O(nm^3), where n is the number of
snaxels and m the number of candidates for every snaxel. Thus, it is very important to
keep m low.
In [2] only a small neighbourhood around the previous position of the snaxel was
considered. However, the algorithm was iteratively applied starting from the obtained
solution until there was no change in the total energy of the snake. This method has
several disadvantages. First, as in the approaches which use variational techniques for
the minimization, the snake can fall into a local minimum. Second, the computational
time can be very high if the initialisation is far from the minimum.
In [13] and [14] a different set of candidates is considered for every snaxel. In
particular, [13] establishes uncertainty lists for the high curvature points and defines a
search space between these uncertainty lists. In [14] the search zone is defined with two
initial concentric contours. Each contour point is constrained to lie on a line joining
these two initial contours. This approach gives very good results if the two concentric
contours that contain the expected contour are available and the contour being tracked is
the absolute minimum in this area. However, these concentric contours are not always
available.
In the next sections we will describe how we can select these candidates and how the
snake energy is defined for facial feature point detection and tracking, respectively.
2.3 FACIAL FEATURE POINT DETECTION
2.3.1 Selection of candidates
We propose a new method that first fixes the topology of the snakes. In
our case, we are using a snake with 16 equally spaced snaxels for the mouth and one with
8 equally spaced snaxels for the eyebrows. To select the best candidates for each of these
snaxels we compute what we call the v_i-eigen-snaxels, by extracting samples of them
from a database. That is, after resizing the faces from our database to the same size
(200×350 pixels), we extract for each snaxel v_i the 16×16 area around the snaxel in every
image of the database. After an illumination normalization using a histogram
specification for every snaxel, the extracted sub-images are used to form the training set
of vectors for the snaxel vi, and from them, the eigenvectors (eigen-snaxels) are
computed by classical PCA techniques.
The first step for initialising the snakes in a new image is to roughly locate the face.
Different techniques can be used for this aim [17], [30]. After size normalization of the
face area, a large area around the rough position for every snaxel vi is examined by
computing the reconstruction error with the corresponding v_i-eigen-snaxel. Those pixels
leading to the smallest reconstruction error are considered as candidates for the snaxel v_i.
Principal Component Analysis (PCA)
The principle of PCA is to construct a low dimensional space with decorrelated features
which preserve the majority of the variation in the training set [20]. PCA has been used
in the past to detect faces (by means of eigen-faces) or facial features (by means of
eigen-features). In this paper, we extend its use to detect the snake control points for
specific facial features (by using eigen-snaxels).
For each snaxel vi, the vectors {xt} are constructed by lexicographic ordering of the
pixel elements of each sub-image from the training set. A partial KLT is performed on
these vectors to identify the largest-eigenvalue eigenvectors.
Distance measure
To evaluate the feasibility of a given pixel being a given snaxel we first construct the
vector x using the corresponding sub-image. Then we obtain a principal component
feature vector y = \Phi_M^T \tilde{x}, where \tilde{x} = x - \bar{x} is the mean-normalized image vector, \Phi is
the eigenvector matrix for snaxel v_i and \Phi_M is a sub-matrix of \Phi that contains the M
principal eigenvectors. The reconstruction error is calculated as:

\epsilon^2 = \sum_{i=M+1}^{N} y_i^2 = \| \tilde{x} \|^2 - \sum_{i=1}^{M} y_i^2    (7)

We have used M = 5 in our experiments. The pixels producing the smallest reconstruction
error are selected as candidate locations for the snaxel v_i.
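A sketch of the eigen-snaxel training and of the residual of Eq. (7). Using the SVD of the centred data is a standard, equivalent way to obtain the principal eigenvectors of the covariance; the function names are ours, not from the paper.

```python
import numpy as np

def fit_eigen_snaxel(patches, M=5):
    """PCA on 16x16 training patches for one snaxel: returns the mean vector
    and the M principal eigenvectors Phi_M (as columns)."""
    X = np.stack([p.ravel() for p in patches]).astype(float)  # one row per patch
    mean = X.mean(axis=0)
    # Right singular vectors of the centred data = eigenvectors of the covariance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:M].T

def reconstruction_error(patch, mean, Phi_M):
    """Eq. (7): ||x_tilde||^2 minus the sum of the M squared projections."""
    x_tilde = patch.ravel().astype(float) - mean
    y = Phi_M.T @ x_tilde
    return float(x_tilde @ x_tilde - y @ y)
```

Candidate selection then amounts to scanning the pixels of the search area and keeping those with the smallest `reconstruction_error`.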
2.3.2. Snake energy
External energy
A new term is added to the external energy function in Eq. (3) to take into account the
result obtained in the PCA process. The energy will be lower in those positions where
the reconstruction error is lower:

E_{ext}(v_i) = \alpha E_{cont}(v_i, v_{i+1}) + (1 - \alpha) \epsilon^2(v_i)    (8)

In our experiments the value of \alpha has been set to 0.5.
Internal energy
As mentioned, the energy term defined in (2) is only appropriate for snaxels that are not
corners of the contour. In the case of corners the energy has to be low when the second
derivative is high. For the energy of the snaxels belonging to the eyebrows and
the mouth we use:

E_{int}(v_i) = \beta_i \| v_{i-1} - 2 v_i + v_{i+1} \|^2 + (1 - \beta_i) \left[ B - \| v_{i-1} - 2 v_i + v_{i+1} \|^2 \right]    (9)

where \beta_i is set to 1 if v_i is not a corner point and to 0 if it is (note that we work with a
fixed topology for the snakes). B represents the maximum value that the approximation
of the second derivative can take.
2.4 FEATURE POINT TRACKING
2.4.1. Selection of candidates
In the initialisation of the snake, the candidates for every snaxel were selected on the
basis of a statistical characterization of the texture around them. In the case of
tracking, however, we have a better knowledge of this texture if we use the previous frame. We
propose to find the candidates for every snaxel in the tracking process using motion
estimation to select the search space for every snaxel.
A small region around every snaxel is selected as basis for the motion estimation. The
shape of this region is rectangular and its size is application dependent. However, the
region should be small enough so that its motion can be approximated by a translational
motion. The motion compensation error (MCE) for all the possible displacements
(dx, dy) of the block in a given range is computed as:

MCE_{v_i^0}(dx, dy) = \sum_{i=-R_x}^{R_x} \sum_{j=-R_y}^{R_y} \left[ I_{t-1}(x_0 + i, y_0 + j) - I_t(x_0 + i + dx, y_0 + j + dy) \right]^2    (10)

where (x_0, y_0) are the x and y coordinates of the snaxel v_i in the previous frame, which we
have called v_i^0. The region under consideration is centred at the snaxel and has size 2R_x
in the horizontal dimension and 2R_y in the vertical dimension.

The range for (dx, dy) determines the maximum displacement that a snaxel can suffer.
The M positions with the minimum MCE_{v_i^0}(dx, dy) are selected as possible new
locations for snaxel v_i.
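The block-matching search of Eq. (10) can be sketched as follows; this is a didactic implementation, and the function name, default window sizes and search range are illustrative assumptions.

```python
import numpy as np

def mce_candidates(prev, curr, x0, y0, Rx=8, Ry=8, search=6, M=5):
    """Eq. (10): motion-compensation error of the 2Rx x 2Ry block centred on
    snaxel (x0, y0) of the previous frame, for every displacement (dx, dy)
    within +/- search; returns the M displacements with the smallest MCE."""
    block = prev[y0 - Ry:y0 + Ry, x0 - Rx:x0 + Rx].astype(float)
    scores = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = curr[y0 + dy - Ry:y0 + dy + Ry, x0 + dx - Rx:x0 + dx + Rx]
            scores.append(((dx, dy), float(np.sum((block - cand) ** 2))))
    scores.sort(key=lambda s: s[1])          # smallest MCE first
    return [disp for disp, _ in scores[:M]]
```

The displaced positions returned here become the candidate set fed to the same DP optimisation used for the initialisation.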
2.4.2 Snake energy
External energy
We use for the external energy in the tracking procedure the same principle as for the
initialisation. That is, it will be composed, as in (8), of two terms. The first one is the
gradient along the contour; the second one is slightly different, as in this case we
can use the texture provided by the previous frame instead of the statistical
characterization. Thus, the second term corresponds to the compensation error obtained
in the motion estimation. In this way preference is given to those positions with the
smaller compensation error; that is, the energy will be lower in those positions whose
texture is most similar to the texture around the position of the corresponding snaxel in
the previous frame. Therefore, the expression for the external energy will be:

E_{ext}(v_i) = \alpha E_{cont}(v_i, v_{i+1}) + (1 - \alpha) MCE_{v_i^0}(v_i)    (11)

The constant \alpha is also set to 0.5.
Internal energy
The internal energy we use to track feature points has the same formulation as in section
2.3.2. As before, we assume here that we know the geometry of the different face
features to correctly set \beta_i to 1 or 0 in Eq. (9).

To solve problems of snaxel grouping we also add, as in [18], another term to the
internal energy that forces snaxels to preserve the distance between them from one
frame to the next one:

E_{dist,i} = | u_i^t - u_i^{t-1} | + | u_{i+1}^t - u_{i+1}^{t-1} |    (12)

where u_i^t = v_i - v_{i-1} in frame t. So when the distance is altered the energy increases
proportionally.
2.5 RESULTS
In Figures 3 and 4 we show some examples of automatic initialisation and tracking. To
perform the tests, the algorithms have been introduced in a Graphical User Interface
(GUI) that allows manual correction of the snaxels position. In the tests performed, the
contours obtained with the automatic initialisation always correspond to the facial
features, if the face has been located accurately. However, as shown in Figure 4,
sometimes some points should be manually modified if more accuracy is required. The
tracking has been performed on 276 sequences from the Cohn-Kanade facial expression
database [15]. From these sequences, 180 needed no manual correction, 48 needed
corrections of 1, 2 or 3 points along the sequence, and 40 sequences had a major problem
in at least one frame for one tracked feature.
From the initialization, the Facial Animation Parameter Units (FAPUs) are computed. This means
that the first frame needs to present a neutral expression. Then, from the output of the
tracking, the FAPs for the sequence are extracted. These FAPs are going to be used in
the second part of the system for the expression recognition.
Figure 3. First row: Initialization examples. Second and third row: tracking examples.
Figure 4. Automatic initialisation examples with minor errors
3. HIGH LEVEL ANALYSIS
The second part of the system is based on the modeling of the expressions by means of
Hidden Markov Models. The observations to model are the MPEG-4 standardized
Facial Animation Parameters (FAPs) computed in the previous step. The FAPs of a
video sequence are first extracted and then analyzed using semi-continuous HMM.
The Cohn-Kanade facial expression database [15] has been selected as basis for doing
the training of the models and the evaluation of the system. The whole database has
been processed in order to extract a subset of the Low Level FAPs and perform the
expression recognition experiments. Different experiments have been carried out to
determine the best topology of the HMM to recognize expressions from the Low Level
FAPs.
HMM are one of the basic probabilistic tools used for time series modelling. They are
definitely the most used model in speech recognition, and they are beginning to be used
in vision, mainly because they can be learned from data and they implicitly handle time-
varying signals by dynamic time warping. They have already been successfully used to
classify the mouth shape in video sequences [22] and to combine different information
sources (optical flow, point tracking and furrow detection) for expression intensity
estimation [19]. HMM were also used in [29] for expression recognition. In that work a
spatio-temporal dependent shape/texture eigenspace is learned in an HMM-like structure.
We will extend the use of HMM creating the feature vectors from the available low-
level FAP. In the following sections we review the basic concepts of HMM (3.1), define
the topology of the models that we will use (3.2) and explain how we estimate the
parameters of these models (3.3).
This first approach tries to recognize isolated expressions. That is, we take a sequence
which contains the transition from a neutral face to a given expression and we have to
decide which expression it is. All the sequences from the Cohn-Kanade facial
expression database belong to this type. We will evaluate the system using this
approach in section 3.4. In section 3.5 we will introduce a new model which will allow
us to use the system in more complex environments. It will model the parts of the
sequences where the person is talking. Finally, in section 3.6 a more complex
experiment is presented, where the system classifies the different parts of the sequences
using a one-stage dynamic programming algorithm.
3.1 HIDDEN MARKOV MODELS DEFINITION
HMMs model temporal series assuming a (hidden) temporal structure. For each emotion
an HMM can be estimated from training examples. Once these models are trained we
will be able to compute, for a given sequence of observations (the FAPs of the video
sequence in our case), the probability that this sequence was produced by each of the
models. Thus, this sequence of observations will be assigned to the expression whose
HMM gives the highest probability of generating it.
An HMM [27] is defined by a number N of states connected by transitions. The
outputs of these states are the observations, which belong to a set of symbols. The
time variation is associated with transitions between states.
To completely define the models, the following parameters are used:
The set of states S = (s1, s2, ..., sN)
The set of symbols that can be observed from the states, (O1, ..., OM); in our case, the
FAPs of each frame
The probability of transition from state i to state j, defined by the transition matrix
A, where
aij = p(qt+1 = sj | qt = si),  1 ≤ i, j ≤ N
The probability distribution function of the different observations in state j, B,
where
bj(Oi) = P(Oi | q = sj),  1 ≤ j ≤ N
The initial state probabilities, πi = p(q1 = si),  1 ≤ i ≤ N
To train a HMM for an emotion (that is, to adjust its parameters), we will use as training
sequences all those sequences that we select as representatives of the given emotion. In
our case, we use all those sequences from the Cohn-Kanade facial expression database
manually classified as belonging to this emotion.
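The recognition rule described above, i.e. scoring the observation sequence under each emotion HMM and selecting the most likely model, can be sketched as follows. This is a minimal Python/NumPy illustration; the function and variable names are our own and not part of any standard tool.

```python
import numpy as np

def forward_loglik(obs_loglik, log_A, log_pi):
    """Forward algorithm in the log domain.

    obs_loglik : (T, N) array with log b_j(O_t) for each frame t and state j
    log_A      : (N, N) log transition matrix
    log_pi     : (N,)   log initial-state probabilities
    Returns log p(O | model).
    """
    log_alpha = log_pi + obs_loglik[0]
    for t in range(1, len(obs_loglik)):
        # log-sum-exp over predecessor states, then add the emission term
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + log_A, axis=0) \
                    + obs_loglik[t]
    return np.logaddexp.reduce(log_alpha)

# The sequence is assigned to the emotion whose HMM scores highest, e.g.:
# best = max(models, key=lambda m: forward_loglik(emission_ll[m],
#                                                 models[m]["log_A"],
#                                                 models[m]["log_pi"]))
```

Working in the log domain avoids the numerical underflow that would otherwise occur when multiplying many small probabilities over long sequences.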
3.2 SELECTION OF THE HMM TOPOLOGY
The Baum-Welch algorithm [27] can estimate the model parameters (A, B and π). However, better
results can be obtained if the topology of the Markov chain is restricted using a priori
knowledge. In our case, each considered emotion (sad, anger, fear, joy, disgust and
surprise) reflects a temporal structure: let's name it start, middle and end of the emotion.
This structure can be modeled using left-to-right HMM. This topology is appropriate
for signals whose properties change over time in a successive manner. It implies aij = 0 if
i > j, and π1 = 1, πi = 0 for i ≠ 1. As time increases, the observable symbols in each sequence either stay at the same state or advance in a successive manner. In our case,
we have defined for each emotion a three-state HMM.
To select this topology different configurations have been tested. The experiments show
that using HMM with only 2 states the recognition rate is 4% lower. Using more than 3
states increases the complexity of the model without producing any improvement in the
recognition results. The selected topology is represented in the following figure.
Figure 5. Topology of Hidden Markov Models
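The left-to-right constraint can be encoded directly in the transition matrix and the initial-state vector. The sketch below (illustrative Python/NumPy; the initial values are arbitrary and would be re-estimated by Baum-Welch from the training sequences) builds such a three-state model:

```python
import numpy as np

N = 3  # start, middle and end of the emotion

# Illustrative initial values; Baum-Welch re-estimates them from training data.
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i] = 0.6        # stay in the current state
    A[i, i + 1] = 0.4    # advance to the next state
A[N - 1, N - 1] = 1.0    # the final state only loops on itself

pi = np.zeros(N)
pi[0] = 1.0              # every sequence starts in the first state

# Left-to-right constraint: a_ij = 0 whenever i > j (no backward transitions)
assert np.allclose(np.tril(A, k=-1), 0.0)
```

Only the self-loop and advance-by-one entries are non-zero, so re-estimation cannot introduce backward transitions: Baum-Welch preserves any zero entry of A.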
3.3 ESTIMATION OF THE OBSERVATION PROBABILITIES
Each state of each emotion models the observed FAPs using a probability function,
bj(O) which can be either a discrete or a continuous one. In the first case, each set of
FAPs for a given image has to be discretized using vector quantization. In the
continuous case, a parametric pdf is defined for each state; typically, multivariate
Gaussian or Laplacian pdfs are used. With sparse data, as in our case,
better results can be achieved if all the states of all the emotion models share the
Gaussian mixtures (means and variances). The estimation of the pdf for each state
is then reduced to estimating the contribution of each mixture component.
Each set of FAPs has been divided in two subsets corresponding to eyebrows and
mouth, following the MPEG4 classification. At each frame, the probability of the whole
set of FAPs is computed as the product of the probability of each subset (independence
assumption). For each of these subsets, a set of 32 Gaussian mixtures has been
estimated (mean and variance) applying a clustering algorithm on the FAPs of the
training database. To select the number of Gaussian mixtures, different
experiments have been performed. Using fewer than 16 mixtures produces a recognition
rate more than 10% lower than that obtained with 16 mixtures. Using 32 mixtures
produces a slightly better recognition rate than 16, while more than 32 do not produce
any improvement.
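A sketch of this tied-mixture emission computation (hypothetical Python/NumPy; the dictionary layout and names are our own): the Gaussian pool is shared across all states, each state keeps only its mixture weights, and the eyebrow and mouth log-probabilities add under the independence assumption.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log-density of x under each diagonal-covariance Gaussian in the pool.
    means, variances: (M, D) arrays; x: (D,). Returns an (M,) array."""
    d = x - means
    return -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                   + np.sum(d * d / variances, axis=1))

def state_log_emission(fap_eyebrows, fap_mouth, state):
    """Tied-mixture log b_j(O): the Gaussian pool (means/variances) is shared
    by all states of all emotion models; only the mixture weights are
    state-specific.  The eyebrow and mouth FAP subsets are assumed
    independent, so their log-probabilities add."""
    logp = 0.0
    for x, pool, log_w in [
            (fap_eyebrows, state["pool_brows"], state["log_w_brows"]),
            (fap_mouth,    state["pool_mouth"], state["log_w_mouth"])]:
        comp = log_gauss_diag(x, pool["means"], pool["vars"]) + log_w
        logp += np.logaddexp.reduce(comp)  # log sum_m w_m N(x; mu_m, S_m)
    return logp
```

Sharing the pool means each state only needs M weight parameters instead of M full Gaussians, which is what makes the estimation feasible with sparse training data.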
3.4 EVALUATION OF THE SYSTEM
To evaluate the system we use the Cohn-Kanade facial expression database [15], which
consists of recordings from 90 subjects, each showing several basic expressions. The
recordings were made with a standard VHS video camera at 30 frames
per second, under constant illumination, and only full-face frontal views were captured.
Although the subjects were not previously trained in displaying facial expressions, they
practiced the expressions with an expert prior to video recording. Each posed expression
begins from neutral and ends at peak expression. The number of sequences for each
expression is the following:
        F    Sa   Su   J    A    D
# Seq   33   52   60   61   33   37
Table 2. Number of available sequences for each expression (F: Fear, Sa: Sad, Su:
Surprise, J: Joy, A: Anger, D: Disgust).
The FAPs of these sequences were first extracted using the FAP extraction tool
described in Section 2. The system will train, for every defined facial expression (sad,
anger, fear, joy, disgust and surprise), a HMM with the extracted FAPs. In the
recognition phase, the HMM system computes the maximum probability of the input
sequence with respect to all the models and assigns to the sequence the expression with
the highest probability.
The system has been evaluated by first training the six models with all the subjects
except one, and then testing the recognition with this subject which has not participated
in the training. This process has been repeated for all the subjects, obtaining the
following recognition results:
     F    Sa   Su   J    A    D    %Cor
F    26   3    0    3    0    1    78.7%
Sa   8    31   2    2    6    3    59.6%
Su   0    0    60   0    0    0    100%
J    4    0    0    57   0    0    93.4%
A    0    2    0    1    24   6    72.7%
D    0    0    0    0    1    36   97.3%
Table 3. Number of sequences correctly detected and number of confusions.
The overall recognition rate obtained is 84%.
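The leave-one-subject-out protocol described above can be sketched as follows (illustrative Python; `train_fn` and `classify_fn` are placeholders standing for the HMM training and recognition steps, not functions from any actual tool):

```python
def leave_one_subject_out(sequences, train_fn, classify_fn):
    """Leave-one-subject-out evaluation: for every subject, train the
    emotion HMMs on all other subjects and test on the held-out one.

    sequences: list of (subject_id, emotion_label, fap_sequence) tuples.
    Returns the overall recognition rate."""
    subjects = sorted({s for s, _, _ in sequences})
    correct = total = 0
    for held_out in subjects:
        train = [(e, f) for s, e, f in sequences if s != held_out]
        test  = [(e, f) for s, e, f in sequences if s == held_out]
        models = train_fn(train)              # one HMM per emotion
        for emotion, faps in test:
            correct += (classify_fn(models, faps) == emotion)
            total += 1
    return correct / total
```

This protocol guarantees that the tested subject never contributes to training, so the reported rate reflects generalization to unseen faces rather than memorization of a subject's appearance.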
These results are consistent with our expectations: surprise, joy and disgust have the
highest recognition rates, because they involve clear motion of the mouth or eyebrows,
while sadness, anger and fear are more often confused. Experiments have been
repeated involving only three emotions, obtaining 98% recognition rate when joy,
surprise and anger are involved, and 95% with joy, surprise and sadness.
We have also studied which facial feature conveys more information about
facial expressions, by obtaining the recognition results using only the FAPs related
to the motion of the eyebrows, or only the FAPs related to the motion of the
mouth. The overall recognition rate is 50% in the first case and 78% in the
second. This result shows that the mouth conveys most of the information, although the
information of the eyebrows helps to improve the results.
Although comparison is difficult, due to the use of different databases, the
recognition rates obtained using all the extracted information are of the same order as
those obtained by other systems that use dense motion fields combined with
physically-based models, showing that the MPEG4 FAPs convey the necessary
information for the extraction of these emotions.
3.5 INTRODUCTION OF THE TALK MODEL
An important objective of this work is to use sequences that represent a conversation
with an agent, in order to consider a real situation where the system could be applied.
The system developed should be able to classify expressions in silence frames as well as
to detect important clues in speech frames. A first step has been to work on silence
frames. However, we also want to separate the motion due to the expressions from the
motion due to speech. One possibility is to combine audio analysis with video analysis.
The alternative that we have developed consists of generating a new model,
in addition to those of the six emotions, to classify the speech sequences.
In order to train this additional model, which we have designated 'talk', we have
selected 28 sequences that contain speech from the available data sequences. A few more
sequences corresponding to different expressions have also been added to our dataset.
These sequences have been processed with the FAP extraction tool, in the same way
as the emotion sequences had previously been analysed. Then, the recognition tests were
repeated including these sequences. The performance of the recognition algorithm in
this case is shown in the following table. The overall recognition rate is 81%, showing
that talk can be distinguished from emotion using this methodology.
     F    Sa   Su   J    A    D    T    %Cor
F    27   2    1    3    0    1    0    79.4%
Sa   6    29   3    2    7    1    5    54.7%
Su   0    0    64   0    0    0    0    100%
J    4    0    1    57   0    0    0    91.9%
A    0    1    0    2    22   7    2    64.7%
D    1    0    0    0    1    34   0    91.9%
T    0    3    0    0    4    0    21   75.0%
Table 4. Number of sequences correctly detected and number of confusions when adding
the talk (T) sequences. The overall recognition rate obtained is 81%.
3.6 CONNECTED SEQUENCES RECOGNITION
As discussed before, the first experiments were oriented towards extracting a single
expression from the input sequence. However, in real situations we will have a
continuous video sequence where the person is talking or just looking at the camera and,
at a given point, performs an expression. We therefore want to explore the capability
of the system to segment a video sequence into parts where different emotions occur.
In this case the emotions can be decoded using the same HMMs by means of a one-stage
dynamic programming algorithm, widely used in connected and continuous speech
recognition. The basic idea is to allow, during decoding, a transition from the final
state of each HMM (associated with an emotion) to the first
state of the other HMMs. This can be interpreted as one large HMM composed of the
emotion HMMs. The transitions along this large HMM indicate both the decoded
emotions and the frames associated to each emotion. In order to recover the transitions,
the algorithm needs to save backtracking information, as is usually the case in
dynamic programming.
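The decoding over the composed model can be sketched as a standard Viterbi search with backpointers (illustrative Python/NumPy; it assumes the stacked `log_A` already contains the exit-to-entry transitions between the emotion HMMs, and `labels[j]` names the model that state j belongs to):

```python
import numpy as np

def viterbi(emit_ll, log_A, log_pi, labels):
    """Best state path through the large composed HMM, with backtracking.

    emit_ll : (T, N) array of per-frame, per-state log-likelihoods
    labels  : labels[j] is the emotion model that state j belongs to
    Returns the decoded model label for every frame."""
    T, N = emit_ll.shape
    delta = log_pi + emit_ll[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: from state i to j
        back[t] = np.argmax(scores, axis=0)  # best predecessor of each state j
        delta = scores[back[t], np.arange(N)] + emit_ll[t]
    # Backtracking recovers the best path, i.e. both the decoded emotions
    # and the frames assigned to each one.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return [labels[s] for s in path]
```

The per-frame labels directly give the segmentation: a change of label marks a transition from one decoded emotion (or talk) segment to the next.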
In the absence of a more appropriate database, the connected recognition algorithm
has been tested by concatenating sequences of six FAP files extracted from the
emotion and talk sequences. Results are given in the following table. The recognition
rate for these sequences is 64%, showing a lower performance than isolated sequence
recognition.
     F    Sa   Su   J    A    D    T    %Cor
F    25   4    1    3    0    1    0    73.5%
Sa   5    25   4    2    8    2    7    47.1%
Su   2    2    37   0    6    4    9    57.8%
J    2    2    1    56   0    0    0    91.8%
A    0    2    1    2    16   4    8    47.0%
D    2    0    9    0    3    19   3    51.3%
T    1    8    4    3    6    1    56   70.9%
Table 5. Number of sequences correctly detected and number of confusions in the
connected sequences experiment. The overall recognition rate obtained is 64%.
4. CONCLUSIONS
We have presented a video analysis system that aims at expression recognition in two
steps. In the first step, the Low Level Facial Animation Parameters are extracted. The
second step analyses these FAPs by means of HMM to estimate the expressions. For
training purposes, a supervised active-contour-based algorithm is applied to extract these
Low Level FAPs from a facial expression database. This algorithm introduces a criterion
for the selection of candidate snaxels in a dynamic programming implementation of the
active contours algorithm. In addition, an extra term is introduced in the external
energy, related to the specific texture around the snaxels, to prevent the contour
from falling onto stronger edges that might lie near the one we are aiming at. This technique
allows us to extract the Low Level FAPs that we will use for training the recognition
system. Once the system has been trained, the recognition procedure can be applied to
the FAPs extracted by any other method. The results obtained show that the FAPs
convey the necessary information for the extraction of emotions. The system shows
good performance for distinguishing isolated expressions and can also be used, with
lower accuracy, to extract the expressions in long video sequences where speech is
mixed with silence frames.
5. REFERENCES
[1] J. Ahlberg, An Active Model for Facial Feature Tracking. Accepted for publication in
EURASIP Journal on Applied Signal Processing, to appear in June 2002.
[2] A. Amini, T. Weymouth and R. Jain, Using Dynamic Programming for Solving
Variational Problems in Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 9, September 1990.
[3] M. Black, Y. Yacoob, Recognizing facial expressions in image sequences using local
parameterized models of image motion, Int. Journal of Computer Vision, 25 (1), 1997,
23-48.
[4] J. Cassell, Embodied Conversation: Integrating Face and Gesture into Automatic Spoken
Dialogue Systems, In Luperfoy (ed.), Spoken Dialogue Systems, Cambridge, MA: MIT
Press.
[5] J. Cassell, K.R. Thorisson, The power of a nod and a glance: Envelope vs. Emotional
Feedback in Animated Conversational Agents, Applied Artificial Intelligence 13: 519-
538, 1999.
[6] T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models, IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 6, pp. 681-685,
June 2001.
[7] G.W. Cottrell and J. Metcalfe, EMPATH: Face, emotion, and gender recognition using
holons, in Neural Information Processing Systems, vol.3, pp. 564-571, 1991.
[8] D. DeCarlo, D. Metaxas, Adjusting shape parameters using model-based optical flow
residuals, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24,
No. 6, pp. 814-823, June 2002.
[9] P. Eisert, B. Girod, Analyzing Facial Expression for Virtual Conferencing, IEEE
Computer Graphics and Applications, Vol. 18, N. 5, 1998.
[10] P. Ekman and W. Friesen, Facial Action Coding System. Consulting Psychologists Press
Inc., 577 College Avenue, Palo Alto, California 94306, 1978.
[11] I. Essa and A. Pentland, Facial Expression Recognition using a Dynamic Model and
Motion Energy, Proc. of the International Conference on Computer Vision 1995,
Cambridge, MA, May 1995.
[12] I. Essa, Analysis, Interpretation and Synthesis of Facial Expressions, Ph.D. Thesis,
Massachusetts Institute of Technology (MIT Media Laboratory).
[13] D. Geiger, A. Gupta, L. Costa and J. Vlontzos, Dynamic Programming for Detection,
Tracking, and Matching Deformable Contours, IEEE Transactions on Pattern Analysis
and Machine Intelligence, Vol. 17, No. 3, March 1995.
[14] S. Gunn and M. Nixon, Global and Local Active Contours for Head Boundary Extraction, International Journal of Computer Vision 30(1), pp. 43-54, 1998.
[15] Kanade, T., Cohn, J.F., Tian. Y., Comprehensive Database for Facial Expression
Analysis, Proceedings of the Fourth IEEE International Conference on Automatic Face
and Gestures Recognition, Grenoble, France, 2000.
[16] M. Kass, A. Witkin and D. Terzopoulos, Snakes: Active Contour Models, International
Journal of Computer Vision, Vol. 1, No. 4, pp. 321-331, 1988.
[17] C. Kervrann, F. Davoine, P. Perez, R. Forchheimer and C. Labit, Generalized likelihood
ratio-based face detection and extraction of mouth features, Pattern Recognition letters
18, pp. 899-912, 1997.
[18] C. L. Lam, S. Y. Yuen, An unbiased active contour algorithm for object tracking,
Pattern Recognition Letters 19, pp. 491-498, 1998.
[19] J. Lien, T. Kanade, J. Cohn, C. Li, Subtly different Facial Expression Recognition And
Expression Intensity Estimation, in Proc. Of the IEEE Int. Conference on Computer
Vision and Pattern Recognition, pp. 853-859, Santa Barbara, Ca, June 1998.
[20] B. Moghaddam, A. Pentland, Probabilistic Visual Learning for Object Detection, in 5th
International Conference on Computer Vision, Cambridge, MA, June 95.
[21] ISO/IEC MPEG-4 Part 2 (Visual)
[22] N. Oliver, A. Pentland, F. Berard, LAFTER: A Real-time Lips and Face Tracker with
Facial Expression Recognition, in Proc. of IEEE Conf. on Computer Vision, Puerto
Rico, 1997.
[23] C. Padgett, G. Cottrell, Identifying emotion in static face images, in Proc. Of the 2nd Joint
Symposium on Neural Computation, Vol.5, pp.91-101, La Jolla, CA, University of
California, San Diego.
[24] M. Pantic , L. Rothkrantz, Automatic Analysis of Facial Expressions: The State of the
Art, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, N. 12,
pp. 1424-1445, 2000.
[25] M. Pardàs, E. Sayrol, Motion estimation based tracking of active contours, Pattern
Recognition Letters, Vol. 22, No. 13, pp. 1447-1456, November 2001.
[26] R.W. Picard, Affective Computing, MIT Press, Cambridge 1997.
[27] L. Rabiner, A tutorial on Hidden Markov Models and selected applications in Speech
Recognition, Proceedings IEEE, pp. 257-284, February 1989.
[28] M. Rosenblum, Y. Yacoob, L.S. Davis, Human Expression Recognition from Motion
Using a Radial Basis Function Network Architecture, IEEE Trans. On Neural Networks, 7
(5), 1121-1138, 1996.
[29] F. de la Torre, Y. Yacoob, L. Davis, A probabilistic framework for rigid and non-rigid
appearance based tracking and recognition, Int. Conf. on Automatic Face and Gesture
Recognition, pp. 491-498, 2000.
[30] V. Vilaplana, F. Marqués, P. Salembier, L. Garrido, Region Based Segmentation and
Tracking of Human Faces, Proceedings of European Signal Processing Conference
(EUSIPCO-98), pp. 311-314, 1998.
[31] Y. Yacoob, L.S. Davis, Computing Spatio-Temporal Representation of Human Faces,
IEEE Trans. on PAMI, 18 (6), pp. 636-642.
[32] A. Young and H. Ellis (eds.), Handbook of Research on Face Processing, Elsevier
Science Publishers 1989.