
3D Dynamic Expression Recognition based on a Novel Deformation Vector Field and Random Forest

Hassen Drira
LIFL (UMR Lille1/CNRS 8022), France.
[email protected]

Boulbaba Ben Amor, Mohamed Daoudi
LIFL (UMR Lille1/CNRS 8022), France.
Institut TELECOM; TELECOM Lille 1, France.

Anuj Srivastava
Department of Statistics, Florida State University, USA.

Stefano Berretti
Dipartimento di Sistemi e Informatica, University of Firenze, Italy.

Abstract

This paper proposes a new method for facial motion extraction to represent, learn, and recognize observed expressions from 4D video sequences. The approach, called Deformation Vector Field (DVF), is based on Riemannian facial shape analysis and densely captures dynamic information from the entire face. The resulting temporal vector field is used to build the feature vector for expression recognition from 3D dynamic faces. By applying an LDA-based feature space transformation for dimensionality reduction, followed by a multi-class Random Forest learning algorithm, the proposed approach achieves a 93% average recognition rate on the BU-4DFE database and outperforms state-of-the-art approaches.

1 Introduction and related work

Facial expression analysis and recognition from 2D and/or 3D data has emerged as an active research topic with applications in several different areas, such as human-centered user interfaces, computer graphics, facial animation, and ambient intelligence. In recent years, there has been tremendous interest in tracking and recognizing facial expressions over time, since the dynamics of facial expressions provide important cues about the underlying emotions that are not available in static 2D or 3D images.

Among 3D-based approaches, only a few works use 4D data for facial expression analysis. In [7], a spatio-temporal expression analysis approach based on 3D dynamic geometric facial model sequences is proposed. The approach integrates a 3D facial surface descriptor and Hidden Markov Models (HMMs) to recognize facial expressions. Experiments were performed on the BU-4DFE database, and the approach achieved recognition rates of 90.44% (spatio-temporal) and 80.04% (temporal only) for distinguishing the six prototypic facial expressions (anger, disgust, fear, happiness, sadness, and surprise). The main limitation of this solution resides in the use of 83 proprietary annotated landmarks of the BU-4DFE that are not released for public use. An extension of this work to 4D face recognition appeared in [6]. Sandbach et al. [3] proposed to exploit the dynamics of 3D facial motions to recognize observed expressions. First, they captured motion between frames using Free-Form Deformations and extracted motion features using a quad-tree decomposition of several motion fields. Second, they used HMMs to model the full expression sequence temporally as four segments: neutral, onset, apex, and offset. In their approach, feature selection is performed with the GentleBoost technique applied to the onset and offset temporal segments of the expression. The average correct classification rate for three basic expressions (happy, angry, and surprise) reached 81.93% for whole-sequence classification and 73.61% for frame-by-frame classification. An extension of this work is presented in [4]. Another work on 4D facial expression recognition was proposed by Le et al. [2]. They proposed a level-curve-based approach to capture the shape of 3D facial models. The level curves are parameterized using the arc-length function, and the Chamfer distance is applied to measure the distances between corresponding normalized segments partitioned from the level curves of two 3D facial shapes. These measures are then used as spatio-temporal features to train an HMM; since the training data were not sufficient for learning the HMM, the authors applied universal background modeling (UBM) to overcome the over-fitting problem. Using the BU-4DFE database to evaluate their approach, they reached an overall recognition accuracy of 92.22% for three prototypic expressions (happy, sad, and surprise) when classifying whole sequences.

In the present work, we present a fully automatic facial expression recognition approach based on 4D (3D+t) data. To capture the deformation across the sequences, we propose a new method called Deformation Vector Field, obtained via a parametrization of facial surfaces by collections of radial curves emanating from the tip of the nose. Our paper is organized as follows. In Sect. 2, a face representation model is proposed that captures facial features relevant to categorize expression variations in 3D dynamic sequences. In Sect. 3, the LDA-based feature space transformation and the multi-class Random Forest classification are addressed. Experimental results and a comparative evaluation on the BU-4DFE are reported and discussed in Sect. 4. Finally, conclusions are outlined in Sect. 5.

2 The Deformation Vector Field

One basic idea for capturing facial deformation across 3D video sequences is to densely track mesh vertices along successive 3D frames. Since mesh resolutions vary across 3D video frames, establishing a dense matching between consecutive frames is necessary. Sun et al. [7] proposed to adapt a generic model (a tracking model) to each 3D frame; however, a set of 83 pre-defined key points is required to control the adaptation based on radial basis functions. A second solution is presented by Sandbach et al. [3], where the authors used an existing non-rigid registration algorithm (FFD) based on B-spline interpolation between a lattice of control points. The dense matching is a preprocessing step used to estimate a motion vector field between frames t and t-1; however, the results provided by the authors are limited to three facial expressions: happy, angry, and surprise. To address this problem, we propose to represent facial surfaces by a set of parameterized radial curves emanating from the tip of the nose. Such an approximation of facial surfaces by an indexed collection of curves can be seen as a facial surface parametrization that captures the local shape.

2.1 Shape Deformation Capture

A parameterized curve on the face, $\beta: I \to \mathbb{R}^3$, where $I = [0,1]$, is represented mathematically using the square-root velocity function [5], denoted by $q(t)$ and defined as $q(t) = \dot{\beta}(t)/\sqrt{\|\dot{\beta}(t)\|}$. This specific parametrization has the advantage of capturing the shape of the curve and provides simple calculus. Let us define the space of such functions: $\mathcal{C} = \{q: I \to \mathbb{R}^3, \|q\| = 1\} \subset \mathbb{L}^2(I, \mathbb{R}^3)$, where $\|\cdot\|$ denotes the $\mathbb{L}^2$ norm. With the $\mathbb{L}^2$ metric on its tangent spaces, $\mathcal{C}$ becomes a Riemannian manifold. Given two curves $q_1$ and $q_2$, let $\psi$ denote a path on the manifold $\mathcal{C}$ between $q_1$ and $q_2$; $\dot{\psi} \in T_\psi(\mathcal{C})$ is a tangent vector field on the curve $\psi \in \mathcal{C}$, and $\langle \cdot, \cdot \rangle$ denotes the $\mathbb{L}^2$ inner product on the tangent space. In our case, as the elements of $\mathcal{C}$ have unit $\mathbb{L}^2$ norm, $\mathcal{C}$ is a hypersphere in the Hilbert space $\mathbb{L}^2(I, \mathbb{R}^3)$. The geodesic path between any two points $q_1, q_2 \in \mathcal{C}$ is simply the minor arc of the great circle connecting them on this hypersphere, $\psi^*: [0,1] \to \mathcal{C}$, given by Eq. (1):

$$\psi^*(\tau) = \frac{1}{\sin(\theta)} \left( \sin((1-\tau)\theta)\, q_1 + \sin(\tau\theta)\, q_2 \right) \qquad (1)$$

where $\theta = d_{\mathcal{C}}(q_1, q_2) = \cos^{-1}(\langle q_1, q_2 \rangle)$. We point out that $\sin(\theta) = 0$ if the distance between the two curves is null, in other words $q_1 = q_2$; in this case, for each $\tau$, $\psi^*(\tau) = q_1 = q_2$. The tangent vector field on this geodesic, $\frac{d\psi^*}{d\tau}: [0,1] \to T_\psi(\mathcal{C})$, is then given by Eq. (2):

$$\frac{d\psi^*}{d\tau} = \frac{-\theta}{\sin(\theta)} \left( \cos((1-\tau)\theta)\, q_1 - \cos(\tau\theta)\, q_2 \right) \qquad (2)$$

Knowing that on a geodesic path the covariant derivative of its tangent vector field is equal to $0$, $\frac{d\psi^*}{d\tau}$ is parallel along the geodesic $\psi^*$, and we shall represent it by $\frac{d\psi^*}{d\tau}|_{\tau=0}$. Accordingly, Eq. (2) becomes:

$$\frac{d\psi^*}{d\tau}\Big|_{\tau=0} = \frac{\theta}{\sin(\theta)} \left( q_2 - \cos(\theta)\, q_1 \right), \qquad (3)$$

with $\theta \neq 0$. Thus, $\frac{d\psi^*}{d\tau}|_{\tau=0}$ is sufficient to represent this vector field; the remaining vectors can be obtained by parallel transport of $\frac{d\psi^*}{d\tau}|_{\tau=0}$ along the geodesic $\psi^*$.

In practice, the first step to capture the deformation between two given 3D faces $S^1$ and $S^2$ is to extract the radial curves. Let $\beta^1_\alpha$ and $\beta^2_\alpha$ denote the radial curves that make an angle $\alpha$ with a reference radial curve on faces $S^1$ and $S^2$, respectively. The reference curve is chosen to be the vertical curve, as the faces have been rotated to the upright position during the preprocessing step. The tangent vector field $\dot{\psi}^*_\alpha$, which represents the energy needed to deform $\beta^1_\alpha$ into $\beta^2_\alpha$ along the geodesic of Eq. (1), is then calculated for each index $\alpha$. We consider the magnitude of this vector field at each point of index $k$ on the curve $\beta_\alpha$ to build a Deformation Vector Field on the facial surface, $V^k_\alpha = \|\dot{\psi}^*_\alpha|_{\tau=0}(k)\|$, where $\alpha$ denotes the angle to the vertical radial curve and $k$ denotes a point on this curve. This scalar field quantifies the local (per-point) deformation between the faces $S^1$ and $S^2$.
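To make the construction concrete, the following minimal NumPy sketch computes the SRV representation of a discretized radial curve, the geodesic distance $\theta$ on the hypersphere, the initial tangent vector of Eq. (3), and the per-point magnitudes $V^k_\alpha$. It assumes each radial curve is given as an array of 3D points already in correspondence across the two faces; function names and numerical details are ours, not taken from the authors' implementation.

```python
import numpy as np

def l2_inner(q1, q2):
    """Discrete L2(I, R^3) inner product, with I = [0, 1] sampled at n points."""
    return np.sum(q1 * q2) / q1.shape[0]

def srv(beta):
    """Square-root velocity function q(t) = beta'(t)/sqrt(||beta'(t)||),
    rescaled to unit L2 norm so that q lies on the hypersphere C."""
    vel = np.gradient(beta, axis=0)                    # finite-difference velocity
    speed = np.maximum(np.linalg.norm(vel, axis=1), 1e-8)
    q = vel / np.sqrt(speed)[:, None]
    return q / np.sqrt(l2_inner(q, q))

def initial_tangent(q1, q2):
    """Initial tangent of the geodesic from q1 to q2 on C, i.e. Eq. (3):
    dpsi*/dtau|_{tau=0} = (theta/sin(theta)) (q2 - cos(theta) q1)."""
    cos_theta = np.clip(l2_inner(q1, q2), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:                                   # q1 = q2: null deformation
        return np.zeros_like(q1)
    return (theta / np.sin(theta)) * (q2 - cos_theta * q1)

def dvf_magnitudes(curve1, curve2):
    """Per-point magnitudes V_alpha^k = ||dpsi*/dtau|_{tau=0}(k)|| for one
    pair of corresponding radial curves (one column of the DVF)."""
    v = initial_tangent(srv(curve1), srv(curve2))
    return np.linalg.norm(v, axis=1)                   # one scalar per point k
```

Stacking dvf_magnitudes over all radial curves (100 curves of 50 points each in our setting) yields the 5000-dimensional deformation map used in the following.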

2.2 Dynamic Shape Deformation Analysis

To capture the dynamics of facial deformations across 3D face sequences, we consider the deformation map computed between successive frames, as described in Section 2.1. Similarly to the recognition scheme proposed by Sun et al. [7], and in order to make recognition possible starting from any frame of a given video, we consider subsequences of n frames. We take the first n frames as the first subsequence, then the n consecutive frames starting from the second frame as the second subsequence, and so on, shifting the starting index by one frame until the end of the sequence. For each subsequence, the first frame is taken as the reference and the deformation map is computed between it and each of the remaining frames. The feature vector for the subsequence is the average of the n-1 resulting deformation maps, so each subsequence is represented by a feature vector whose size equals the number of points on the face. We point out that we consider 5000 points: 50 for each of the 100 radial curves.
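This sliding-window construction can be sketched as follows, reusing dvf_magnitudes from the previous snippet; the frame data layout and function name are illustrative assumptions.

```python
import numpy as np

def subsequence_features(frames, n=6):
    """Sliding-window DVF features: for each window of n consecutive frames,
    take the first frame as reference, compute the deformation map to each
    of the remaining n-1 frames, and average the n-1 maps.

    frames: list of per-frame radial-curve arrays, each of shape
            (num_curves, num_points, 3), e.g. (100, 50, 3) -> 5000-d features.
    """
    feats = []
    for start in range(len(frames) - n + 1):
        ref = frames[start]
        maps = [np.concatenate([dvf_magnitudes(c_ref, c_t)   # one curve at a time
                                for c_ref, c_t in zip(ref, frames[t])])
                for t in range(start + 1, start + n)]
        feats.append(np.mean(maps, axis=0))                  # mean deformation map
    return np.asarray(feats)    # shape: (num_windows, num_curves * num_points)
```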

Figure 1 illustrates one subsequence per expression with n = 6 frames. Each expression is shown in two rows: the upper row gives the reference frame of the subsequence and the n-1 successive frames, while the lower row gives the corresponding deformation maps computed for each frame. The mean deformation map, shown on the right, represents the feature vector for that subsequence. This deformation map thus summarizes the temporal deformation undergone by the facial surface while conveying the expression.

3 Classification

We now represent each subsequence by its Deformation Vector Field $V^k_\alpha = \|\dot{\psi}^*_\alpha|_{\tau=0}(k)\|$, as described in Section 2. Since the dimensionality of the feature vector is high, we use an LDA-based transformation (Linear Discriminant Analysis) to map the feature space to an optimized one that is relatively insensitive to subject identity while preserving the discriminative expression information. LDA defines the within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$, and transforms an n-dimensional feature vector into an optimized d-dimensional one with d < n. In our experiments the discriminative classes are the 6 expressions, so the reduced dimension d is 5. For the classification task we used the multi-class version of the Random Forest algorithm, proposed by Leo Breiman in [1] and defined as a meta-learner comprised of many individual trees.

[Figure 1. Calculus of dynamic shape deformation on subsequences taken from the BU-4DFE sequences. For each expression (Sad, Surprise, Happy, Fear, Disgust, Angry), the figure shows the radial curves-based approximation of the face, the reference frame followed by the n-1 successive frames, and the corresponding deformation maps with their mean, color-coded from low to high deformations.]

The Random Forest was designed to operate quickly over large datasets and, more importantly, to be diverse by using random samples to build each tree in the forest. Diversity is obtained by randomly choosing attributes at each node of the tree and then using the attribute that provides the highest level of learning. Once trained, the Random Forest classifies a new expression by pushing its feature vector down each tree in the forest; each tree votes for a class, and the forest outputs the class with the most votes over all trees. In our experiments we used the Weka multi-class implementation of the Random Forest algorithm with 40 trees.
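As a hedged illustration of this pipeline (the paper used Weka; scikit-learn serves here only as a stand-in, and the feature matrix below is a random placeholder for real DVF subsequence features), the LDA projection to 5 dimensions followed by a 40-tree forest could look like:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# LDA reduces the 5000-d DVF features to C-1 = 5 dimensions for the
# 6 expression classes; the forest then votes over 40 trees.
clf = make_pipeline(
    LinearDiscriminantAnalysis(n_components=5),
    RandomForestClassifier(n_estimators=40, random_state=0),
)

# Placeholder data standing in for real DVF subsequence feature vectors.
X = np.random.rand(120, 5000)
y = np.repeat(np.arange(6), 20)        # 6 expression labels
clf.fit(X, y)
print(clf.predict(X[:3]))
```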

4 Experiments and Discussion

To demonstrate the effectiveness of the proposed approach, we perform expression recognition from 3D face sequences following the experimental setting described in [7]. The experiments were conducted on a subset of 60 subjects (arbitrarily selected) from the BU-4DFE dataset and include the 6 prototypic expressions: Angry (An), Disgust (Di), Fear (Fe), Happy (Ha), Sad (Sa), and Surprise (Su). We note that the experiments are conducted in an identity-independent fashion. Following a setup similar to [7], we randomly divided the feature vectors (computed from the subsequences) of the 60 subjects into a training set of 54 subjects and a test set of 6 subjects. We applied the Random Forest algorithm to these feature vectors and report the results in Table 1. The reported rates are obtained by averaging the results of 10 independent, randomly split runs (10-fold cross-validation). The average recognition rate is 93.21%. It can also be observed that the best-classified expressions are (Ha) and (Su), with recognition accuracies of 95.47% and 94.53%, respectively, whereas the (Fe) expression is more difficult to classify. This is mainly due to the subtle changes in facial shape motion for this expression compared to (Ha) and (Su).
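A sketch of this identity-independent protocol, again with scikit-learn as an assumed stand-in, synthetic placeholders for the real features, and clf taken from the previous snippet; interpreting the 10 runs as random subject splits is our reading of the setup:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((600, 5000))             # placeholder DVF feature vectors
y = np.tile(np.arange(6), 100)          # expression label per subsequence
groups = np.repeat(np.arange(60), 10)   # subject ID per subsequence

# 10 random splits, each holding out 6 of the 60 subjects for testing.
splitter = GroupShuffleSplit(n_splits=10, test_size=6, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(X, y, groups):
    clf.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
print(f"average recognition rate: {np.mean(scores):.2%}")
```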

Table 1. Confusion matrix (%).

      An     Di     Fe     Ha     Sa     Su
An  93.11   2.42   1.71   0.46   1.61   0.66
Di   2.30  92.46   2.44   0.92   1.27   0.58
Fe   1.89   1.75  91.24   1.50   1.76   1.83
Ha   0.57   0.84   1.71  95.47   0.77   0.62
Sa   1.70   1.52   2.01   1.09  92.46   1.19
Su   0.71   0.85   1.84   0.72   1.33  94.53

Average recognition rate = 93.21%

We note that the proposed approach outperforms state-of-the-art approaches under the same experimental settings. The recognition rates reported in [7] were 80.04% with temporal analysis only and 90.44% with spatio-temporal analysis; in both cases, subsequences of constant window width 6 (Win = 6) were used. We emphasize that their approach is not completely automatic, requiring 83 manually marked key points on the first frame of the sequence to allow accurate model tracking. We can also compare with the performance reported in [3] on only three expressions (Ha, An, and Su): their approach achieved an average frame-by-frame recognition rate of about 73.61%, whereas ours reaches 93.21%, an improvement of more than 19 percentage points. We note also that the subjects considered in our study are arbitrarily selected, whereas in [3] the sequences are carefully selected. Approaches like [2] and [3] report recognition results on whole facial sequences, which prevents them from adhering to a real-time protocol: their recognition results depend on the preprocessing of whole sequences, unlike our approach and the one described in [7], which can provide recognition results after processing only a few 3D frames.

5 Conclusion

This paper proposes a new Deformation Vector Field (DVF) that accurately describes local deformations across 3D facial sequences. A parametrization of facial surfaces by their radial curves allows the definition of this descriptor at each facial point based on Riemannian geometry. The well-known Random Forest learning algorithm is then used for the classification task. Experiments conducted on the BU-4DFE dataset, following the state-of-the-art setting, demonstrate the effectiveness of the proposed approach based only on temporal analysis of facial sequences.

References

[1] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[2] V. Le, H. Tang, and T. Huang. Expression recognition from 3D dynamic faces using robust spatio-temporal shape features. In IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (FG 2011), pages 414–421, 2011.

[3] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert. A dynamic approach to the recognition of 3D facial expressions and their temporal models. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG'11), Special Session: 3D Facial Behavior Analysis and Understanding, Santa Barbara, CA, USA, March 2011.

[4] G. Sandbach, S. Zafeiriou, M. Pantic, and D. Rueckert. Recognition of 3D facial expression dynamics. Image and Vision Computing, February 2012 (in press).

[5] A. Srivastava, E. Klassen, S. Joshi, and I. Jermyn. Shape analysis of elastic curves in Euclidean spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011 (to appear).

[6] Y. Sun, X. Chen, M. Rosato, and L. Yin. Tracking vertex flow and model adaptation for three-dimensional spatiotemporal face analysis. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 40:461–474, May 2010.

[7] Y. Sun and L. Yin. Facial expression recognition based on 3D dynamic range model sequences. In Proceedings of the 10th European Conference on Computer Vision (ECCV '08), Part II, pages 58–71, 2008.