7/29/2019 Image Comm Pardas Bonafonte
1/30
FACIAL ANIMATION PARAMETERS EXTRACTION AND EXPRESSION
RECOGNITION USING HIDDEN MARKOV MODELS
Montse Pardàs, Antonio Bonafonte
Universitat Politècnica de Catalunya, C/Jordi Girona 1-3, D-5
08034 Barcelona Spain
email: {montse,antonio}@gps.tsc.upc.es
This work has been supported by the European project InterFace and TIC2001-0996 of the Spanish
Government
The video analysis system described in this paper aims at facial expression recognition
consistent with the MPEG4 standardized parameters for facial animation, FAP. For this
reason, two levels of analysis are necessary: low level analysis to extract the MPEG4
compliant parameters and high level analysis to estimate the expression of the sequence
using these low level parameters.
The low level analysis is based on an improved active contour algorithm that uses high
level information based on Principal Component Analysis to locate the most significant
contours of the face (eyebrows and mouth), and on motion estimation to track them.
The high level analysis takes as input the FAP produced by the low level analysis tool
and, by means of a Hidden Markov Model classifier, detects the expression of the
sequence.
1. INTRODUCTION
The critical role that emotions play in rational decision-making, in perception and in
human interaction has opened an interest in introducing the ability to recognize and
reproduce emotions in computers. In [26] many applications which could benefit from
this ability are explored. The importance of introducing non verbal communication in
automatic dialogue systems is highlighted in [4] and [5]. However, the research that has
been carried out on understanding human facial expressions cannot be directly applied
to a dialogue system, as it has mainly worked on static images or on image sequences
where the subjects show a specific emotion but there is no speech. Psychological studies
have indicated that at least six emotions are universally associated with distinct facial
expressions: happiness, sadness, surprise, fear, anger and disgust [32]. Several other
emotions and many combinations of emotions have been studied but remain
unconfirmed as universally distinguishable. Thus, most of the research up to now has
been oriented towards detecting these six basic expressions.
Some research has been conducted on pictures that capture the subject's expression at
its peak. These pictures allow detecting the presence of static cues (such as wrinkles) as
well as the position and shape of the facial features [32]. For instance, in [23] and [7]
classifiers for facial expressions were based on neural networks. Static faces were
presented to the network as projections of blocks from feature regions onto the principal
component space generated from the image data set. They obtain around 86% accuracy
in distinguishing the six basic emotions.
Research has also been conducted on the extraction of facial expressions from video
sequences. Most works in this area develop a video database from subjects making
expressions on demand. Their aim has been to classify the six basic expressions that we
have mentioned above, and they have not been tested in a dialogue environment. Most
approaches in this area rely on the Ekman and Friesen Facial Action Coding System
[10]. The FACS is based on the enumeration of all Action Units of a face that cause
facial movements. The combination of these actions units results in a large set of
possible facial expressions. In [31] the directions of rigid and non-rigid motions that are
caused by human facial expressions are identified computing the optical flow at the
points with high gradient at each frame. Then they propose a dictionary to describe the
facial actions of the FACS through the motion of the features and a rule-based system to
recognize facial expressions from these facial actions. A similar approach is presented
in [3], where a local parameterized model of the image motion in specific facial areas is
used for recovering and recognising the non-rigid and articulated motion of the faces.
The parameters of this detected motion are related to the motion of facial features
during facial expressions. In [28] a radial basis function network architecture is
developed that learns the correlation between facial feature motion patterns and human
emotions. The motion of the features is also obtained computing the optical flow at the
points with high gradient. The accuracy of all these systems is also around 85%.
Other approaches employ physically-based models of heads including skin and
musculature [11], [12]. They combine this physical model with registered optical flow
measurements from human faces to estimate the muscle actuations. They propose two
approaches, the first one creates typical patterns of muscle actuation for the different
facial expressions. A new image sequence is classified by its similarity to the typical
patterns of muscle actuation. The second one builds on the same methodology to
generate, from the muscle actuation, the typical pattern of motion energy associated
with each facial expression. In [19] a system is presented that uses facial feature point
tracking, dense flow tracking and high gradient component analysis in the spatio-
temporal domain to extract the FACS Action Units, from which expressions can be
derived. A complete review of these techniques can be found in [24].
In this work we have developed an expression recognition technique consistent with the
MPEG4 standardized parameters for facial definition and animation, FDP and FAP.
Thus, the expression recognition process can be divided in two steps: facial parameter
extraction and facial parameter analysis, which we also refer to as low level and high level
analysis. The facial parameter extraction process is based on a feature point detection
and tracking system which uses active contours. Conventional active contours (snakes)
approaches [16] find the position of the snake by finding a minimum of its energy,
composed of internal and external forces. The external forces pull the contours toward
features such as lines and edges. However, in many applications this minimization leads
to contours that do not represent correctly the feature we are looking for. We propose in
this paper to introduce some higher level information by a statistical characterization of
the snaxels that should represent the contour. From the automatically produced
initialization, the MPEG-4 compliant FAP are computed. These FAP will be used for
training of the expressions extraction system. These techniques are developed in Section
2.
Using the MPEG4 parameters for the analysis of facial emotions has several
advantages. First, the developed high level analysis can benefit from already
existing low level analysis techniques for FDP and FAP extraction, as well as from any
advances that will be made in this area in the future years. Besides, the low-level FAP
constitute a concise representation of the evolution of the expression of the face. From
the training database, and using the available FAP, spatio-temporal patterns for
expressions will be constructed. Our approach for the interpretation of facial
expressions will use Hidden Markov Models (HMM) to recognize different patterns of
FAP evolution. For every defined facial expression, a HMM will be trained with the
extracted feature vectors. In the recognition phase, the HMM system will use as
classification criterion the probability of the input sequence with respect to all the
models. Tests are also performed in sequences composed of different emotions and with
periods of speech, using connected HMM. This process will be explained in Section 3.
Finally, Section 4 will summarize the conclusions of this work.
2. LOW LEVEL ANALYSIS
This Section describes the low level video processing required for the extraction of
Facial Animation Parameters (FAPs). The first step in this analysis is the face detection.
Different techniques can be used for this aim, like [17] or [30], and we will not go into
details of this part. After the face is detected, facial features have to be located and
tracked. Two different approaches are possible: contour based or model based. The first
approach localizes the most important contours of the face and tracks them
in 2D (that is, tracks their projection in the image plane). The second approach consists
in the adaptation, in each frame, of a 3D wireframe model of a face to the image.
Example of this second approach can be found in [1], [6], [8] and [9]. In this work we
will use a technique belonging to the first class because, as it will be explained later on,
we will work on a restricted environment. Facial Animation Parameters (FAPs) will be
computed from the tracked contours or from the adapted model. First, in section 2.1 we
will review the meaning of these parameters. The following subsections present a 2D
technique for the extraction of these parameters based on active contours: section 2.2
reviews the general framework of active contours, section 2.3 applies this technique to
facial feature detection and section 2.4 to facial feature tracking.
2.1 FACIAL ANIMATION PARAMETERS
Facial Animation Parameters (FAPs) are defined in the ISO MPEG-4 standard [21],
together with the Facial Definition Parameters (FDPs), to allow the definition of a facial
shape and its animation reproducing expressions, emotions, and speech pronunciation.
FDPs are illustrated in Figure 1 (figures 1 and 2 have been provided by Roberto Pockaj,
from Genova University); they represent key points in a human face. All the feature
points can be used for the calibration of a model to a face, while only those represented
by black dots in the image are also used for the animation (that is, there are FAPs
describing their motion).
The FAPs are based on the study of minimal facial actions and are closely related to
muscle actions. They represent a complete set of basic facial actions, such as squeeze or
raise eyebrows, open or close eyelids, and therefore allow the representation of most
natural expressions. All FAPs involving translational movement are expressed in terms
of the Facial Animation Parameter Units (FAPUs). These units aim at allowing
interpretation of the FAPs on any facial model in a consistent way, producing
reasonable results in terms of expression and speech pronunciation. FAPUs are
illustrated in Figure 1 and correspond to fractions of distances between some key facial
features.
We will be interested in those FAPs which a) convey important information about the
emotion of the face and b) can be reliably extracted from a natural video sequence. In
particular, we have focused on those FAPs related to the motion of the contour of the
eyebrows and the mouth. These FAPs are specified in Table 1.
Figure 1. Facial Definition Parameters
FAPU value
IRISD = IRISD0 / 1024
ES = ES0 / 1024
ENS = ENS0 / 1024
MNS = MNS0 / 1024
MW = MW0 / 1024
AU = 10^-5 rad
Figure 2. Facial Animation Parameter Units
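To make the definitions in Figure 2 concrete, the sketch below computes the translational FAPUs and one FAP value from hypothetical pixel distances measured on a neutral first frame. The distances and the 5-pixel displacement are illustrative assumptions, not values from the paper.

```python
# Hypothetical neutral-face distances (in pixels) for the five translational FAPUs.
neutral = {"IRISD0": 34.0, "ES0": 180.0, "ENS0": 120.0, "MNS0": 60.0, "MW0": 110.0}

# Each translational FAPU is the corresponding neutral-face distance over 1024.
fapu = {name[:-1]: value / 1024.0 for name, value in neutral.items()}

# A FAP value is a displacement expressed in its FAPU. For example, FAP 31
# (raise_l_i_eyebrow) in ENS units, for an assumed 5-pixel upward displacement:
raise_l_i_eyebrow = 5.0 / fapu["ENS"]
```

This is why a FAP file animates any calibrated model consistently: displacements are stored relative to the face's own proportions rather than in pixels.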
#   FAP name               FAP description                                                                        Unit
31  raise_l_i_eyebrow      Vertical displacement of left inner eyebrow                                            ENS
32  raise_r_i_eyebrow      Vertical displacement of right inner eyebrow                                           ENS
33  raise_l_m_eyebrow      Vertical displacement of left middle eyebrow                                           ENS
34  raise_r_m_eyebrow      Vertical displacement of right middle eyebrow                                          ENS
35  raise_l_o_eyebrow      Vertical displacement of left outer eyebrow                                            ENS
36  raise_r_o_eyebrow      Vertical displacement of right outer eyebrow                                           ENS
37  squeeze_l_eyebrow      Horizontal displacement of left eyebrow                                                ES
38  squeeze_r_eyebrow      Horizontal displacement of right eyebrow                                               ES
51  lower_t_midlip_o       Vertical top middle outer lip displacement                                             MNS
52  raise_b_midlip_o       Vertical bottom middle outer lip displacement                                          MNS
53  stretch_l_cornerlip_o  Horizontal displacement of left outer lip corner                                       MW
54  stretch_r_cornerlip_o  Horizontal displacement of right outer lip corner                                      MW
55  lower_t_lip_lm_o       Vertical displacement of midpoint between left corner and middle of top outer lip      MNS
56  lower_t_lip_rm_o       Vertical displacement of midpoint between right corner and middle of top outer lip     MNS
57  raise_b_lip_lm_o       Vertical displacement of midpoint between left corner and middle of bottom outer lip   MNS
58  raise_b_lip_rm_o       Vertical displacement of midpoint between right corner and middle of bottom outer lip  MNS
59  raise_l_cornerlip_o    Vertical displacement of left outer lip corner                                         MNS
60  raise_r_cornerlip_o    Vertical displacement of right outer lip corner                                        MNS
Table 1. FAPs used for the eyebrows and the mouth
To extract these FAPs in frontal faces, the tracking of the 2-D contour of the eyebrows
and mouth is sufficient. If the sequences present rotation and translation of the head,
then a system based on a 3D face model tracking would be more robust. In our case the
FAPs are used to train the HMM classifier. For this reason, it is better to constrain the
kind of sequences that are going to be analyzed. We will use frontal faces and a
supervised tracking procedure, in order not to introduce any errors in the training of the
system. Once the system has been trained, any FAP extraction tool can be used. If the
results it produces are accurate, the high level analysis will have a higher probability of
success. The low level analysis system that we propose for training can also be used for
testing with other sequences, as long as the faces are frontal or we previously apply a
global motion estimation for the head.
The detection and tracking algorithm that will be explained in the next section has been
integrated in a Graphical User Interface (GUI) that supports corrections to the results
being produced by the automatic algorithm. From the detection of facial features
produced, it computes the FAP units and then from the results of the automatic tracking
it writes the MPEG-4 compliant FAP files. These FAP files will be used for training of
the expressions extraction system.
The algorithm is applied to the tracking of the eyebrows and the mouth, thus producing
the MPEG4 FAPs 31-38 for the eyebrows and 51-60 for the mouth, described in Table 1.
2.2 ACTIVE CONTOURS
2.2.1 Introduction
Active contours were first introduced by Kass et al. [16]. They proposed energy
minimization as a framework where low-level information (such as image gradient or
image intensity) can be combined with higher-level information (such as shape,
continuity of the contour or user interactivity). In their original work the energy
minimization problem was solved using a variational technique. In [2] Amini et al.
proposed Dynamic Programming (DP) as a different solution to the minimization
problem. The use of the Dynamic Programming approach will allow us to introduce the
concept of 'candidates' for the control points of the contour, thus avoiding the risk of
falling into local minima located near the initial position.
In [25], we proposed a first approach for tracking facial features, based on dynamic
programming, introducing a new term in the energy of the snake, and selecting the
candidate pixels for the contour (snaxels) using motion estimation. Currently the DP
approach has been extended to be able to automatically initialize the facial contours. In
this case, the candidate snaxels are selected by a statistical characterization of the
contour based on Principal Component Analysis (PCA). In this subsection we will
review the basic formulation of active contours, while the next subsections explain how
we apply them for facial feature detection (2.3) and for facial feature tracking (2.4).
2.2.2 Active contours formulation
In the discrete formulation of active contour models a contour is represented as a set of
snaxels v_i = (x_i, y_i), i = 0, ..., N-1, where x_i and y_i are the x and y coordinates of
snaxel v_i. The energy of the contour, which is going to be minimized, is defined by:

E_{snake} = \sum_{i=0}^{N-1} \left[ E_{int}(v_i) + E_{ext}(v_i) \right]    (1)

We can use a discrete approximation of the second derivative to compute E_{int}:

E_{int}(v_i) = \| v_{i-1} - 2 v_i + v_{i+1} \|^2    (2)
This is an approximation to the curvature of the contour at snaxel v_i, if the snaxels are
equidistant. Minimizing this energy will produce smooth curves. This is only
appropriate for snaxels that are not corners of the contour. More appropriate
definitions for the internal energy will be proposed in Section 2.3.2 for the initialisation
of the snake and in Section 2.4.2 for tracking.
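As a minimal numerical sketch of the discrete energies in Eqs. (1) and (2); the function names and the toy contour are ours, not from the paper:

```python
import numpy as np

def internal_energy(snaxels):
    """Curvature term of Eq. (2): ||v_{i-1} - 2 v_i + v_{i+1}||^2 per interior snaxel."""
    v = np.asarray(snaxels, dtype=float)          # shape (N, 2)
    second_diff = v[:-2] - 2.0 * v[1:-1] + v[2:]  # discrete second derivative
    return np.sum(second_diff ** 2, axis=1)       # one value per interior snaxel

def snake_energy(snaxels, external):
    """Total energy of Eq. (1): sum of internal and external terms.
    `external` is a caller-supplied array with one E_ext value per interior snaxel."""
    return float(np.sum(internal_energy(snaxels)) + np.sum(external))

# A straight, equally spaced contour has zero curvature energy.
line = [(i, 0.0) for i in range(6)]
print(snake_energy(line, external=np.zeros(4)))  # -> 0.0
```

Minimizing this quantity over the snaxel positions is exactly what the dynamic programming scheme of the next subsection does.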
The purpose of the term E_ext is to attract the snake to desired feature points or contours
in the image. In this work we have used the gradient of the image intensity I(x, y)
along the contour from v_i to v_{i+1}. Thus, E_ext at snaxel v_i will depend only on the
positions of the snaxels v_i and v_{i+1}. That is,
E_{ext}(v_i) = E_{cont}(v_i) = f(I, v_i, v_{i+1})    (3)
However, a new term will also be added to the external energy, in order to be able to
track contours that have stronger edges nearby.
2.2.3 Dynamic programming
We will use the DP approach to minimize the energy in Eq. (1). Let us express the
energy of the snake making explicit the dependencies of its terms:

E_{snake}(v) = \sum_{i=0}^{N-1} \left[ E_{int}(v_{i-1}, v_i, v_{i+1}) + E_{ext}(v_i, v_{i+1}) \right] = \sum_{i=0}^{N-1} E(v_{i-1}, v_i, v_{i+1})    (4)
Although snakes can be open or closed, the DP approach can be applied directly only to
open snakes. To apply DP to open snakes, the limits of Eq. 4 are adjusted to 1 and N-2
respectively.
Now, as described in [2], this energy can be minimized via discrete DP by defining a two-
element vector of state variables at the ith decision stage: (v_{i+1}, v_i). The optimal value
function S_i(v_{i+1}, v_i) is a function of two adjacent points on the contour, and can be
calculated, for every pair of possible candidate positions for snaxels v_{i+1} and v_i, as:

S_i(v_{i+1}, v_i) = \min_{v_{i-1}} \left[ S_{i-1}(v_i, v_{i-1}) + E(v_{i-1}, v_i, v_{i+1}) \right]    (5)

S_0(v_1, v_0) is initialized to E_{ext}(v_0, v_1) for every possible candidate pair (v_0, v_1) and,
from this, S_i can be computed iteratively from i = 1 up to i = N-2 for every candidate
position for v_i. The total energy of the snake will be

E_{snake} = \min_{v_{N-1}, v_{N-2}} S_{N-2}(v_{N-1}, v_{N-2})    (6)
Besides, at every step i we have to store a matrix M_i which records the position of v_{i-1}
that minimizes Eq. (5), that is,

M_i(v_{i+1}, v_i) = v_{i-1} such that v_{i-1} minimizes (5).

By backtracking from the final energy of the snake and using the matrices M_i, the optimal
position for every snaxel can be found.
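The recursion of Eq. (5) and the backtracking step can be sketched as follows. This is a didactic implementation, not the authors' code; for simplicity S_0 is initialized to zero, so the external term of the first pair is assumed to be folded into the caller-supplied energy function E(v_{i-1}, v_i, v_{i+1}).

```python
import numpy as np

def dp_open_snake(candidates, energy):
    """Minimize Eq. (4) for an open snake via the DP recursion of Eq. (5).
    candidates[i]: list of candidate positions for snaxel v_i.
    energy(a, b, c): E(v_{i-1}=a, v_i=b, v_{i+1}=c)."""
    N = len(candidates)
    # S[-1] plays the role of S_{i-1}(v_i, v_{i-1}); back[i-1] stores the argmin
    # index of v_{i-1} for backtracking (the matrix M_i of the text).
    S = [np.zeros((len(candidates[1]), len(candidates[0])))]  # simplified S_0 = 0
    back = []
    for i in range(1, N - 1):
        ci_1, ci, ci1 = candidates[i - 1], candidates[i], candidates[i + 1]
        Si = np.empty((len(ci1), len(ci)))
        Bi = np.empty((len(ci1), len(ci)), dtype=int)
        for k, c in enumerate(ci1):
            for j, b in enumerate(ci):
                costs = [S[-1][j, h] + energy(a, b, c) for h, a in enumerate(ci_1)]
                Bi[k, j] = int(np.argmin(costs))
                Si[k, j] = costs[Bi[k, j]]
        S.append(Si)
        back.append(Bi)
    # Best final pair (v_{N-1}, v_{N-2}), Eq. (6), then backtrack through M_i.
    k, j = np.unravel_index(np.argmin(S[-1]), S[-1].shape)
    path = [int(k), int(j)]
    for Bi in reversed(back):
        path.append(int(Bi[path[-2], path[-1]]))
    path.reverse()
    return [candidates[i][p] for i, p in enumerate(path)]
```

With m candidates per snaxel the inner loops cost O(m^3) per stage, matching the complexity discussed in section 2.2.4.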
In the case of a closed contour the solution proposed in [13] is to impose the first and
last snaxels to be the same, and fix it to a given candidate for this position. The
application of the DP algorithm will produce the best result under this restriction. Then
this initial (and final) snaxel is successively changed to all the possible candidates, and
the one that produces the smallest energy is selected. We use an approximation proposed
in [14] that requires only two open contour optimisation steps.
2.2.4 Selection of candidates
Up to now, we have assumed that for every snaxel vi there is a finite (and hopefully
small) number of candidates, but we have omitted how to select these candidates. The
computational complexity of each optimisation step is O(nm^3), where n is the number of
snaxels and m the number of candidates for every snaxel. Thus, it is very important to
keep m low.
In [2] only a small neighbourhood around the previous position of the snaxel was
considered. However, the algorithm was iteratively applied starting from the obtained
solution until there was no change in the total energy of the snake. This method has
several disadvantages. First, as in the approaches which use variational techniques for
the minimization, the snake can fall into a local minimum. Second, the computational
time can be very high if the initialisation is far from the minimum.
In [13] and [14] a different set of candidates is considered for every snaxel. In
particular, [13] establishes uncertainty lists for the high curvature points and defines a
search space between these uncertainty lists. In [14] the search zone is defined with two
initial concentric contours. Each contour point is constrained to lie on a line joining
these two initial contours. This approach gives very good results if the two concentric
contours that contain the expected contour are available and the contour being tracked is
the absolute minimum in this area. However, these concentric contours are not always
available.
In the next sections we will describe how we can select these candidates and how the
snake energy is defined for facial feature point detection and tracking, respectively.
2.3 FACIAL FEATURE POINT DETECTION
2.3.1 Selection of candidates
We propose a new method that first fixes the topology of the snakes. In
our case, we are using a snake with 16 equally spaced snaxels for the mouth and one with
8 equally spaced snaxels for the eyebrows. To select the best candidates for each of these
snaxels we compute what we call the v_i-eigen-snaxels, by extracting samples of them
from a database. That is, after resizing the faces from our database to the same size
(200×350 pixels), we extract for each snaxel v_i the 16×16 area around the snaxel in every
image of the database. After an illumination normalization using a histogram
specification for every snaxel, the extracted sub-images are used to form the training set
of vectors for the snaxel vi, and from them, the eigenvectors (eigen-snaxels) are
computed by classical PCA techniques.
The first step for initialising the snakes in a new image is to roughly locate the face.
Different techniques can be used for this aim [17], [30]. After size normalization of the
face area, a large area around the rough position for every snaxel vi is examined by
computing the reconstruction error with the corresponding v_i-eigen-snaxel. Those pixels
leading to the smallest reconstruction error are considered as candidates for the snaxel v_i.
Principal Component Analysis (PCA)
The principle of PCA is to construct a low dimensional space with decorrelated features
which preserve the majority of the variation in the training set [20]. PCA has been used
in the past to detect faces (by means of eigen-faces) or facial features (by means of
eigen-features). In this paper, we extend its use to detect the snake control points for
specific facial features (by using eigen-snaxels).
For each snaxel vi, the vectors {xt} are constructed by lexicographic ordering of the
pixel elements of each sub-image from the training set. A partial KLT is performed on
these vectors to identify the largest-eigenvalue eigenvectors.
Distance measure
To evaluate the feasibility of a given pixel being a given snaxel we first construct the
vector x using the corresponding sub-image. Then we obtain a principal component
feature vector y = \Phi_M^T \tilde{x}, where \tilde{x} = x - \bar{x} is the mean-normalized image vector, \Phi is
the eigenvector matrix for snaxel v_i and \Phi_M is a sub-matrix of \Phi that contains the M
principal eigenvectors. The reconstruction error is calculated as:

\epsilon^2 = \sum_{i=M+1}^{N} y_i^2 = \| \tilde{x} \|^2 - \sum_{i=1}^{M} y_i^2    (7)

We have used M = 5 in our experiments. The pixels producing the smallest reconstruction
error are selected as candidate locations for the snaxel v_i.
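A sketch of the eigen-snaxel training and of the residual of Eq. (7). Using the SVD of the centred data is a standard, equivalent way to obtain the principal eigenvectors of the covariance; the function names are ours, not from the paper.

```python
import numpy as np

def fit_eigen_snaxel(patches, M=5):
    """PCA on 16x16 training patches for one snaxel: returns the mean vector
    and the M principal eigenvectors Phi_M (as columns)."""
    X = np.stack([p.ravel() for p in patches]).astype(float)  # one row per patch
    mean = X.mean(axis=0)
    # Right singular vectors of the centred data = eigenvectors of the covariance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:M].T

def reconstruction_error(patch, mean, Phi_M):
    """Eq. (7): ||x_tilde||^2 minus the sum of the M squared projections."""
    x_tilde = patch.ravel().astype(float) - mean
    y = Phi_M.T @ x_tilde
    return float(x_tilde @ x_tilde - y @ y)
```

Candidate selection then amounts to scanning the pixels of the search area and keeping those with the smallest `reconstruction_error`.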
2.3.2. Snake energy
External energy
A new term is added to the external energy function in Eq. (3) to take into account the
result obtained in the PCA process. The energy will be lower in those positions where
the reconstruction error is lower:

E_{ext}(v_i) = \alpha E_{cont}(v_i, v_{i+1}) + (1 - \alpha) \epsilon^2(v_i)    (8)

In our experiments the value of \alpha has been set to 0.5.
Internal energy
As mentioned, the energy term defined in (2) is only appropriate for snaxels that are not
corners of the contour. In the case of corners the energy has to be low when the second
derivative is high. For the energy of the snaxels belonging to the eyebrows and
the mouth we use:

E_{int}(v_i) = \beta_i \| v_{i-1} - 2 v_i + v_{i+1} \|^2 + (1 - \beta_i) \left[ B - \| v_{i-1} - 2 v_i + v_{i+1} \|^2 \right]    (9)

where \beta_i is set to 1 if v_i is not a corner point and to 0 if it is (note that we work with a
fixed topology for the snakes). B represents the maximum value that the approximation
of the second derivative can take.
2.4 FEATURE POINT TRACKING
2.4.1. Selection of candidates
In the initialisation of the snake, the candidates for every snaxel were selected on the
basis of a statistical characterization of the texture around them. In the case of
tracking, however, we have a better knowledge of this texture if we use the previous frame. We
propose to find the candidates for every snaxel in the tracking process using motion
estimation to select the search space for every snaxel.
A small region around every snaxel is selected as basis for the motion estimation. The
shape of this region is rectangular and its size is application dependent. However, the
region should be small enough so that its motion can be approximated by a translational
motion. The motion compensation error (MCE) for all the possible displacements
(dx, dy) of the block in a given range is computed as:

MCE_{v_i^0}(dx, dy) = \sum_{i=-R_x}^{R_x} \sum_{j=-R_y}^{R_y} \left[ I_{t-1}(x_0 + i, y_0 + j) - I_t(x_0 + i + dx, y_0 + j + dy) \right]^2    (10)

where (x_0, y_0) are the x and y coordinates of the snaxel v_i in the previous frame, which we
have called v_i^0. The region under consideration is centred at the snaxel and has size 2R_x
in the horizontal dimension and 2R_y in the vertical dimension.

The range for (dx, dy) determines the maximum displacement that a snaxel can suffer.
The M positions with the minimum MCE_{v_i^0}(dx, dy) are selected as possible new
locations for snaxel v_i.
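The block-matching search of Eq. (10) can be sketched as follows; this is a didactic implementation, and the function name, default window sizes and search range are illustrative assumptions.

```python
import numpy as np

def mce_candidates(prev, curr, x0, y0, Rx=8, Ry=8, search=6, M=5):
    """Eq. (10): motion-compensation error of the 2Rx x 2Ry block centred on
    snaxel (x0, y0) of the previous frame, for every displacement (dx, dy)
    within +/- search; returns the M displacements with the smallest MCE."""
    block = prev[y0 - Ry:y0 + Ry, x0 - Rx:x0 + Rx].astype(float)
    scores = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = curr[y0 + dy - Ry:y0 + dy + Ry, x0 + dx - Rx:x0 + dx + Rx]
            scores.append(((dx, dy), float(np.sum((block - cand) ** 2))))
    scores.sort(key=lambda s: s[1])          # smallest MCE first
    return [disp for disp, _ in scores[:M]]
```

The displaced positions returned here become the candidate set fed to the same DP optimisation used for the initialisation.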
2.4.2 Snake energy
External energy
We use for the external energy in the tracking procedure the same principle as for the
initialisation. That is, it will be composed, as in (8), of two terms. The first one is the
gradient along the contour; the second one is slightly different, as in this case we
can use the texture provided by the previous frame instead of the statistical
characterization. Thus, the second term corresponds to the compensation error obtained
in the motion estimation. In this way preference is given to those positions with the
smaller compensation error; that is, the energy will be lower in those positions whose
texture is most similar to the texture around the position of the corresponding snaxel in
the previous frame. Therefore, the expression for the external energy will be:

E_{ext}(v_i) = \alpha E_{cont}(v_i, v_{i+1}) + (1 - \alpha) MCE_{v_i^0}(v_i)    (11)

The constant \alpha is also set to 0.5.
Internal energy
The internal energy we use to track feature points has the same formulation as in section
2.3.2. As before, we assume here that we know the geometry of the different face
features to correctly set \beta_i to 1 or 0 in Eq. (9).

To solve problems of snaxel grouping we also add, as in [18], another term to the
internal energy that forces snaxels to preserve the distance between them from one
frame to the next one:

E_{dist,i} = | u_i^t - u_i^{t-1} | + | u_{i+1}^t - u_{i+1}^{t-1} |    (12)

where u_i^t = v_i - v_{i-1} in frame t. So when the distance is altered the energy increases
proportionally.
2.5 RESULTS
In Figures 3 and 4 we show some examples of automatic initialisation and tracking. To
perform the tests, the algorithms have been introduced in a Graphical User Interface
(GUI) that allows manual correction of the snaxels position. In the tests performed, the
contours obtained with the automatic initialisation always correspond to the facial
features, if the face has been located accurately. However, as shown in Figure 4,
sometimes some points should be manually modified if more accuracy is required. The
tracking has been performed on 276 sequences from the Cohn-Kanade facial expression
database [15]. From these sequences, 180 needed no manual correction, 48 needed
corrections of 1, 2 or 3 points along the sequence, and 40 sequences had a major problem
in at least one frame for one tracked feature.
From the initialization, the Facial Animation Parameter Units (FAPUs) are computed. This means
that the first frame needs to present a neutral expression. Then, from the output of the
tracking, the FAPs for the sequence are extracted. These FAPs are going to be used in
the second part of the system for the expression recognition.
Figure 3. First row: Initialization examples. Second and third row: tracking examples.
Figure 4. Automatic initialisation examples with minor errors
3. HIGH LEVEL ANALYSIS
The second part of the system is based on the modeling of the expressions by means of
Hidden Markov Models. The observations to model are the MPEG-4 standardized
Facial Animation Parameters (FAPs) computed in the previous step. The FAPs of a
video sequence are first extracted and then analyzed using semi-continuous HMM.
The Cohn-Kanade facial expression database [15] has been selected as basis for doing
the training of the models and the evaluation of the system. The whole database has
been processed in order to extract a subset of the Low Level FAPs and perform the
expression recognition experiments. Different experiments have been carried out to
determine the best topology of the HMM to recognize expressions from the Low Level
FAPs.
HMM are one of the basic probabilistic tools used for time series modelling. They are
definitely the most used model in speech recognition, and they are beginning to be used
in vision, mainly because they can be learned from data and they implicitly handle time-
varying signals by dynamic time warping. They have already been successfully used to
classify the mouth shape in video sequences [22] and to combine different information
sources (optical flow, point tracking and furrow detection) for expression intensity
estimation [19]. HMM were also used in [29] for expression recognition. In that work a
spatio-temporal dependent shape/texture eigenspace is learned in an HMM-like structure.
We will extend the use of HMM creating the feature vectors from the available low-
level FAP. In the following sections we review the basic concepts of HMM (3.1), define
the topology of the models that we will use (3.2) and explain how we estimate the
parameters of these models (3.3).
This first approach tries to recognize isolated expressions. That is, we take a sequence
which contains the transition from a neutral face to a given expression and we have to
decide which expression it is. All the sequences from the Cohn-Kanade facial
expression database belong to this type. We will evaluate the system using this
approach in section 3.4. In section 3.5 we will introduce a new model which will allow
us to use the system in more complex environments. It will model the parts of the
sequences where the person is talking. Finally, in section 3.6 a more complex
experiment is presented, where the system classifies the different parts of the sequences
using a one-stage dynamic programming algorithm.
3.1 HIDDEN MARKOV MODELS DEFINITION
HMMs model temporal series assuming a (hidden) temporal structure. For each emotion
an HMM can be estimated from training examples. Once these models are trained we
will be able to compute, for a given sequence of observations (the FAPs of the video
sequence in our case), the probability that this sequence was produced by each of the
models. Thus, this sequence of observations will be assigned to the expression whose
HMM gives the highest probability of generating it.
An HMM [27] is defined by a number N of states connected by transitions. The
outputs of these states are the observations, which belong to a set of symbols. The
time variation is associated with transitions between states.
To completely define the models, the following parameters are used:
The set of states S = (s1, s2, ..., sN)
The set of symbols that can be observed from the states, (O1, ..., OM); in our case, the
FAPs of each frame
The probability of transition from state i to state j, defined by the transition matrix
A, where
aij = p(qt+1 = sj | qt = si),  1 ≤ i, j ≤ N
The probability distribution function of the different observations in state j, B,
where
bj(Oi) = P(Oi | q = sj),  1 ≤ j ≤ N
The initial state probabilities, πi = p(q1 = si),  1 ≤ i ≤ N
To train a HMM for an emotion (that is, to adjust its parameters), we will use as training
sequences all those sequences that we select as representatives of the given emotion. In
our case, we use all those sequences from the Cohn-Kanade facial expression database
manually classified as belonging to this emotion.
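The recognition rule described above, i.e. scoring the observation sequence under each emotion HMM and selecting the most likely model, can be sketched as follows. This is a minimal Python/NumPy illustration; the function and variable names are our own and not part of any standard tool.

```python
import numpy as np

def forward_loglik(obs_loglik, log_A, log_pi):
    """Forward algorithm in the log domain.

    obs_loglik : (T, N) array with log b_j(O_t) for each frame t and state j
    log_A      : (N, N) log transition matrix
    log_pi     : (N,)   log initial-state probabilities
    Returns log p(O | model).
    """
    log_alpha = log_pi + obs_loglik[0]
    for t in range(1, len(obs_loglik)):
        # log-sum-exp over predecessor states, then add the emission term
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + log_A, axis=0) \
                    + obs_loglik[t]
    return np.logaddexp.reduce(log_alpha)

# The sequence is assigned to the emotion whose HMM scores highest, e.g.:
# best = max(models, key=lambda m: forward_loglik(emission_ll[m],
#                                                 models[m]["log_A"],
#                                                 models[m]["log_pi"]))
```

Working in the log domain avoids the numerical underflow that would otherwise occur when multiplying many small probabilities over long sequences.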
3.2 SELECTION OF THE HMM TOPOLOGY
The Baum-Welch algorithm [27] can estimate the model parameters (A, B and π). However, better
results can be obtained if the topology of the Markov chain is restricted using a priori
knowledge. In our case, each considered emotion (sad, anger, fear, joy, disgust and
surprise) reflects a temporal structure: let's name it start, middle and end of the emotion.
This structure can be modeled using left-to-right HMM. This topology is appropriate
for signals whose properties change over time in a successive manner. It implies aij = 0 if
i > j, and π1 = 1, πi = 0 for i ≠ 1. As time increases, the observable symbols in each sequence either stay at the same state or advance in a successive manner. In our case,
we have defined for each emotion a three-state HMM.
To select this topology different configurations have been tested. The experiments show
that using HMM with only 2 states the recognition rate is 4% lower. Using more than 3
states increases the complexity of the model without producing any improvement in the
recognition results. The selected topology is represented in the following figure.
Figure 5. Topology of Hidden Markov Models
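The left-to-right constraint can be encoded directly in the transition matrix and the initial-state vector. The sketch below (illustrative Python/NumPy; the initial values are arbitrary and would be re-estimated by Baum-Welch from the training sequences) builds such a three-state model:

```python
import numpy as np

N = 3  # start, middle and end of the emotion

# Illustrative initial values; Baum-Welch re-estimates them from training data.
A = np.zeros((N, N))
for i in range(N - 1):
    A[i, i] = 0.6        # stay in the current state
    A[i, i + 1] = 0.4    # advance to the next state
A[N - 1, N - 1] = 1.0    # the final state only loops on itself

pi = np.zeros(N)
pi[0] = 1.0              # every sequence starts in the first state

# Left-to-right constraint: a_ij = 0 whenever i > j (no backward transitions)
assert np.allclose(np.tril(A, k=-1), 0.0)
```

Only the self-loop and advance-by-one entries are non-zero, so re-estimation cannot introduce backward transitions: Baum-Welch preserves any zero entry of A.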
3.3 ESTIMATION OF THE OBSERVATION PROBABILITIES
Each state of each emotion models the observed FAPs using a probability function,
bj(O) which can be either a discrete or a continuous one. In the first case, each set of
FAPs for a given image has to be discretized using vector quantization. In the
continuous case, a parametric pdf is defined for each state; typically, multivariate
Gaussian or Laplacian pdfs are used. With sparse data, as in our case,
better results can be achieved if all the states of all the emotion models share the
Gaussian mixtures (means and variances). The estimation of the pdf for each state
is then reduced to estimating the contribution of each mixture component.
Each set of FAPs has been divided in two subsets corresponding to eyebrows and
mouth, following the MPEG4 classification. At each frame, the probability of the whole
set of FAPs is computed as the product of the probability of each subset (independence
assumption). For each of these subsets, a set of 32 Gaussian mixtures has been
estimated (mean and variance) applying a clustering algorithm on the FAPs of the
training database. To select the number of Gaussian mixtures, different
experiments have been performed. Using fewer than 16 mixtures produces a recognition
rate more than 10% lower than that obtained with 16 mixtures. Using 32 mixtures
produces a slightly better recognition rate than 16, while more than 32 do not produce
any improvement.
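A sketch of this tied-mixture emission computation (hypothetical Python/NumPy; the dictionary layout and names are our own): the Gaussian pool is shared across all states, each state keeps only its mixture weights, and the eyebrow and mouth log-probabilities add under the independence assumption.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log-density of x under each diagonal-covariance Gaussian in the pool.
    means, variances: (M, D) arrays; x: (D,). Returns an (M,) array."""
    d = x - means
    return -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                   + np.sum(d * d / variances, axis=1))

def state_log_emission(fap_eyebrows, fap_mouth, state):
    """Tied-mixture log b_j(O): the Gaussian pool (means/variances) is shared
    by all states of all emotion models; only the mixture weights are
    state-specific.  The eyebrow and mouth FAP subsets are assumed
    independent, so their log-probabilities add."""
    logp = 0.0
    for x, pool, log_w in [
            (fap_eyebrows, state["pool_brows"], state["log_w_brows"]),
            (fap_mouth,    state["pool_mouth"], state["log_w_mouth"])]:
        comp = log_gauss_diag(x, pool["means"], pool["vars"]) + log_w
        logp += np.logaddexp.reduce(comp)  # log sum_m w_m N(x; mu_m, S_m)
    return logp
```

Sharing the pool means each state only needs M weight parameters instead of M full Gaussians, which is what makes the estimation feasible with sparse training data.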
3.4 EVALUATION OF THE SYSTEM
To evaluate the system we use the Cohn-Kanade facial expression database [15], which
consists of recordings from 90 subjects, each showing several basic expressions. The
recordings were made with a standard VHS video camera at 30 frames
per second, under constant illumination, and only full-face frontal views were captured.
Although the subjects were not previously trained in displaying facial expressions, they
practiced the expressions with an expert prior to video recording. Each posed expression
begins from neutral and ends at peak expression. The number of sequences for each
expression is the following:
        F    Sa   Su   J    A    D
# Seq   33   52   60   61   33   37
Table 2. Number of available sequences for each expression (F: Fear, Sa: Sad, Su:
Surprise, J: Joy, A: Anger, D: Disgust).
The FAPs of these sequences were first extracted using the FAP extraction tool
described in Section 2. The system will train, for every defined facial expression (sad,
anger, fear, joy, disgust and surprise), a HMM with the extracted FAPs. In the
recognition phase, the HMM system computes the maximum probability of the input
sequence with respect to all the models and assigns to the sequence the expression with
the highest probability.
The system has been evaluated by first training the six models with all the subjects
except one, and then testing the recognition with this subject which has not participated
in the training. This process has been repeated for all the subjects, obtaining the
following recognition results:
     F    Sa   Su   J    A    D    %Cor
F    26   3    0    3    0    1    78.7%
Sa   8    31   2    2    6    3    59.6%
Su   0    0    60   0    0    0    100%
J    4    0    0    57   0    0    93.4%
A    0    2    0    1    24   6    72.7%
D    0    0    0    0    1    36   97.3%
Table 3. Number of sequences correctly detected and number of confusions.
The overall recognition rate obtained is 84%.
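The leave-one-subject-out protocol described above can be sketched as follows (illustrative Python; `train_fn` and `classify_fn` are placeholders standing for the HMM training and recognition steps, not functions from any actual tool):

```python
def leave_one_subject_out(sequences, train_fn, classify_fn):
    """Leave-one-subject-out evaluation: for every subject, train the
    emotion HMMs on all other subjects and test on the held-out one.

    sequences: list of (subject_id, emotion_label, fap_sequence) tuples.
    Returns the overall recognition rate."""
    subjects = sorted({s for s, _, _ in sequences})
    correct = total = 0
    for held_out in subjects:
        train = [(e, f) for s, e, f in sequences if s != held_out]
        test  = [(e, f) for s, e, f in sequences if s == held_out]
        models = train_fn(train)              # one HMM per emotion
        for emotion, faps in test:
            correct += (classify_fn(models, faps) == emotion)
            total += 1
    return correct / total
```

This protocol guarantees that the tested subject never contributes to training, so the reported rate reflects generalization to unseen faces rather than memorization of a subject's appearance.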
These results are consistent with our expectations: surprise, joy and disgust have the
highest recognition rates, because they involve clear motion of the mouth or eyebrows,
while sadness, anger and fear are more often confused. Experiments have been
repeated involving only three emotions, obtaining 98% recognition rate when joy,
surprise and anger are involved, and 95% with joy, surprise and sadness.
We have also studied which facial feature conveys more information about
facial expressions, by obtaining the recognition results using only the FAPs related
to the motion of the eyebrows, or only the FAPs related to the motion of the
mouth. The overall recognition rate is 50% in the first case and 78% in the
second. This result shows that the mouth conveys most of the information, although the
information of the eyebrows helps to improve the results.
Although comparison is difficult, due to the use of different databases, the
recognition rates obtained using all the extracted information are of the same order as
those obtained by other systems that use dense motion fields combined with
physically-based models, showing that the MPEG4 FAPs convey the necessary
information for the extraction of these emotions.
3.5 INTRODUCTION OF THE TALK MODEL
An important objective of this work is to use sequences that represent a conversation
with an agent, in order to consider a real situation where the system could be applied.
The system developed should be able to classify expressions in silence frames as well as
to detect important clues in speech frames. A first step has been to work on silence
frames. However, we also want to separate the motion due to the expressions from the
motion due to speech. One possibility is to combine audio analysis with video analysis.
The alternative that we have developed consists of generating a new model,
in addition to those of the six emotions, to classify the speech sequences.
In order to train this additional model, which we have designated 'talk', we have
selected 28 sequences that contain speech from the available data sequences. A few more
sequences corresponding to different expressions have also been added to our dataset.
These sequences have been processed with the FAP extraction tool, in the same way
as the emotion sequences had previously been analysed. Then, the recognition tests were
repeated including these sequences. The performance of the recognition algorithm in
this case is shown in the following table. The overall recognition rate is 81%, showing
that talk can be distinguished from emotion using this methodology.
     F    Sa   Su   J    A    D    T    %Cor
F    27   2    1    3    0    1    0    79.4%
Sa   6    29   3    2    7    1    5    54.7%
Su   0    0    64   0    0    0    0    100%
J    4    0    1    57   0    0    0    91.9%
A    0    1    0    2    22   7    2    64.7%
D    1    0    0    0    1    34   0    91.9%
T    0    3    0    0    4    0    21   75.0%
Table 4. Number of sequences correctly detected and number of confusions when adding
the talk (T) sequences. The overall recognition rate obtained is 81%.
3.6 CONNECTED SEQUENCES RECOGNITION
As discussed before, the first experiments were oriented towards extracting a single
expression from the input sequence. However, in real situations we will have a
continuous video sequence where the person is talking or just looking at the camera and,
at a given point, performs an expression. We therefore want to explore the capability
of the system to segment a video sequence into parts where different emotions occur.
In this case the emotions can be decoded using the same HMMs by means of a one-stage
dynamic programming algorithm, widely used in connected and continuous speech
recognition. The basic idea is to allow, during decoding, a transition from the final
state of each HMM (associated with an emotion) to the first
state of the other HMMs. This can be interpreted as one large HMM composed of the
emotion HMMs. The transitions along this large HMM indicate both the decoded
emotions and the frames associated to each emotion. In order to recover the transitions,
the algorithm needs to save backtracking information, as is usually the case in
dynamic programming.
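The decoding over the composed model can be sketched as a standard Viterbi search with backpointers (illustrative Python/NumPy; it assumes the stacked `log_A` already contains the exit-to-entry transitions between the emotion HMMs, and `labels[j]` names the model that state j belongs to):

```python
import numpy as np

def viterbi(emit_ll, log_A, log_pi, labels):
    """Best state path through the large composed HMM, with backtracking.

    emit_ll : (T, N) array of per-frame, per-state log-likelihoods
    labels  : labels[j] is the emotion model that state j belongs to
    Returns the decoded model label for every frame."""
    T, N = emit_ll.shape
    delta = log_pi + emit_ll[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: from state i to j
        back[t] = np.argmax(scores, axis=0)  # best predecessor of each state j
        delta = scores[back[t], np.arange(N)] + emit_ll[t]
    # Backtracking recovers the best path, i.e. both the decoded emotions
    # and the frames assigned to each one.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return [labels[s] for s in path]
```

The per-frame labels directly give the segmentation: a change of label marks a transition from one decoded emotion (or talk) segment to the next.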
In the absence of a more appropriate database, the connected recognition algorithm
has been tested by concatenating sequences of six FAP files extracted from the
emotion and talk sequences. Results are given in the following table. The recognition
rate for these sequences is 64%, showing a lower performance than isolated sequence
recognition.
     F    Sa   Su   J    A    D    T    %Cor
F    25   4    1    3    0    1    0    73.5%
Sa   5    25   4    2    8    2    7    47.1%
Su   2    2    37   0    6    4    9    57.8%
J    2    2    1    56   0    0    0    91.8%
A    0    2    1    2    16   4    8    47.0%
D    2    0    9    0    3    19   3    51.3%
T    1    8    4    3    6    1    56   70.9%
Table 5. Number of sequences correctly detected and number of confusions in the
connected sequences experiment. The overall recognition rate obtained is 64%.
4. CONCLUSIONS
We have presented a video analysis system that aims at expression recognition in two
steps. In the first step, the Low Level Facial Animation Parameters are extracted. The
second step analyses these FAPs by means of HMM to estimate the expressions. For
training purposes, a supervised active-contour-based algorithm is applied to extract these
Low Level FAPs from a facial expression database. This algorithm introduces a criterion
for the selection of candidate snaxels in a dynamic programming implementation of the
active contours algorithm. In addition, an extra term is introduced in the external
energy, related to the specific texture around the snaxels, to prevent the contour
from falling onto stronger edges that might lie near the one we are aiming at. This technique
allows us to extract the Low Level FAPs that we will use for training the recognition
system. Once the system has been trained, the recognition procedure can be applied to
the FAPs extracted by any other method. The results obtained show that the FAPs
convey the necessary information for the extraction of emotions. The system shows
good performance for distinguishing isolated expressions and can also be used, with
lower accuracy, to extract the expressions in long video sequences where speech is
mixed with silence frames.
5. REFERENCES
[1] J. Ahlberg, An Active Model for Facial Feature Tracking. Accepted for publication in
EURASIP Journal on Applied Signal Processing, to appear in June 2002.
[2] A. Amini, T. Weymouth and R. Jain, Using Dynamic Programming for Solving
Variational Problems in Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 9, September 1990.
[3] M. Black, Y. Yacoob, Recognizing facial expressions in image sequences using local
parameterized models of image motion, Int. Journal of Computer Vision, 25 (1), 1997,
23-48.
[4] J. Cassell, Embodied Conversation: Integrating Face and Gesture into Automatic Spoken
Dialogue Systems, In Luperfoy (ed.), Spoken Dialogue Systems, Cambridge, MA: MIT
Press.
[5] J. Cassell, K.R. Thorisson, The power of a nod and a glance: Envelope vs. Emotional
Feedback in Animated Conversational Agents, Applied Artificial Intelligence 13: 519-
538, 1999.
[6] T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models, IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, No. 6, pp. 681-685,
June 2001.
[7] G.W. Cottrell and J. Metcalfe, EMPATH: Face, emotion, and gender recognition using
holons, in Neural Information Processing Systems, vol.3, pp. 564-571, 1991.
[8] D. DeCarlo, D. Metaxas, Adjusting shape parameters using model-based optical flow
residuals, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24,
No. 6, pp. 814-823, June 2002.
[9] P. Eisert, B. Girod, Analyzing Facial Expression for Virtual Conferencing, IEEE
Computer Graphics and Applications, Vol. 18, N. 5, 1998.
[10] P. Ekman and W. Friesen, Facial Action Coding System. Consulting Psychologists Press
Inc., 577 College Avenue, Palo Alto, California 94306, 1978.
[11] I. Essa and A. Pentland, Facial Expression Recognition using a Dynamic Model and
Motion Energy, Proc. of the International Conference on Computer Vision 1995,
Cambridge, MA, May 1995.
[12] I. Essa, Analysis, Interpretation and Synthesis of Facial Expressions, Ph.D. Thesis,
Massachusetts Institute of Technology (MIT Media Laboratory).
[13] D. Geiger, A. Gupta, L. Costa and J. Vlontzos, Dynamic Programming for Detection,
Tracking, and Matching Deformable Contours, IEEE Transactions on Pattern Analysis
and Machine Intelligence, Vol. 17, No. 3, March 1995.
[14] S. Gunn and M. Nixon, Global and Local Active Contours for Head Boundary Extraction, International Journal of Computer Vision 30(1), pp. 43-54, 1998.
[15] Kanade, T., Cohn, J.F., Tian. Y., Comprehensive Database for Facial Expression
Analysis, Proceedings of the Fourth IEEE International Conference on Automatic Face
and Gestures Recognition, Grenoble, France, 2000.
[16] M. Kass, A. Witkin and D. Terzopoulos, Snakes: Active Contour Models, International
Journal of Computer Vision, Vol. 1, No. 4, pp. 321-331, 1988.
[17] C. Kervrann, F. Davoine, P. Perez, R. Forchheimer and C. Labit, Generalized likelihood
ratio-based face detection and extraction of mouth features, Pattern Recognition letters
18, pp. 899-912, 1997.
[18] C. L. Lam, S. Y. Yuen, An unbiased active contour algorithm for object tracking,
Pattern Recognition Letters 19, pp. 491-498, 1998.
[19] J. Lien, T. Kanade, J. Cohn, C. Li, Subtly different Facial Expression Recognition And
Expression Intensity Estimation, in Proc. Of the IEEE Int. Conference on Computer
Vision and Pattern Recognition, pp. 853-859, Santa Barbara, Ca, June 1998.
[20] B. Moghaddam, A. Pentland, Probabilistic Visual Learning for Object Detection, in 5th
International Conference on Computer Vision, Cambridge, MA, June 95.
[21] ISO/IEC MPEG-4 Part 2 (Visual)
[22] N. Oliver, A. Pentland, F. Berard, LAFTER: A Real-time Lips and Face Tracker with
Facial Expression Recognition, in Proc. of IEEE Conf. on Computer Vision, Puerto
Rico, 1997.
[23] C. Padgett, G. Cottrell, Identifying emotion in static face images, in Proc. Of the 2nd Joint
Symposium on Neural Computation, Vol.5, pp.91-101, La Jolla, CA, University of
California, San Diego.
[24] M. Pantic , L. Rothkrantz, Automatic Analysis of Facial Expressions: The State of the
Art, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, N. 12,
pp. 1424-1445, 2000.
[25] M. Pardàs, E. Sayrol, Motion estimation based tracking of active contours, Pattern
Recognition Letters, Vol. 22, No. 13, pp. 1447-1456, November 2001.
[26] R.W. Picard, Affective Computing, MIT Press, Cambridge 1997.
[27] L. Rabiner, A tutorial on Hidden Markov Models and selected applications in Speech
Recognition, Proceedings IEEE, pp. 257-284, February 1989.
[28] M. Rosenblum, Y. Yacoob, L.S. Davis, Human Expression Recognition from Motion
Using a Radial Basis Function Network Architecture, IEEE Trans. On Neural Networks, 7
(5), 1121-1138, 1996.
[29] F. de la Torre, Y. Yacoob, L. Davis, A probabilistic framework for rigid and non-rigid
appearance based tracking and recognition, Int. Conf. on Automatic Face and Gesture
Recognition, pp. 491-498, 2000.
[30] V. Vilaplana, F. Marqués, P. Salembier, L. Garrido, Region Based Segmentation and
Tracking of Human Faces, Proceedings of European Signal Processing Conference
(EUSIPCO-98), pp. 311-314, 1998.
[31] Y. Yacoob, L.S. Davis, Computing Spatio-Temporal Representation of Human Faces,
IEEE Trans. on PAMI, 18 (6), pp. 636-642.
[32] A. Young and H. Ellis (eds.), Handbook of Research on Face Processing, Elsevier
Science Publishers 1989.