
Computer Vision and Image Understanding 115 (2011) 456–465


Dynamic soft encoded patterns for facial event analysis

Peng Yang *, Qingshan Liu, Dimitris Metaxas
Computer Science Department, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854-8019, USA

Article info

Article history:
Received 9 February 2010
Accepted 8 November 2010
Available online 31 December 2010

Keywords:
Facial expression
Dynamic feature
Time resolution

doi:10.1016/j.cviu.2010.11.015

* Corresponding author. E-mail addresses: [email protected] (P. Yang), [email protected] (Q. Liu), [email protected] (D. Metaxas).

Abstract

In this paper, we propose a new feature, the dynamic soft encoded pattern (DSEP), for facial event analysis. We first develop similarity features to describe complicated variations of facial appearance; they take the similarities between a haar-like feature in a given image and the corresponding ones in reference images as the feature vector. The reference images are selected from the apex images of facial expressions, and k-means clustering is applied to obtain the references. We further perform a temporal clustering on the similarity features to produce several temporal patterns along the temporal domain, and then we map the similarity features into DSEPs to describe the dynamics of facial expressions, as well as to handle the issue of time resolution. Finally, a boosting-based classifier is designed based on DSEPs. Different from previous works, the proposed method makes no assumption on the time resolution. The effectiveness is demonstrated by extensive experiments on the Cohn–Kanade database.

Published by Elsevier Inc.

1. Introduction

As early as the 1970s, facial expression analysis attracted much attention from psychologists [18], who tried to explore human emotion with the help of facial expressions. In [18], Izard proposed to categorize facial expressions into six basic expressions: happiness, sadness, disgust, surprise, anger, and fear, which hold across different races and cultures. In [10], Ekman and Friesen designed a comprehensive standard to decompose each expression into several special action units (AUs), i.e., the Facial Action Coding System (FACS). These two works can be regarded as the pioneering works of facial expression analysis. Due to its potential applications in human–computer interfaces, multimedia, and surveillance, automatic facial expression recognition has become a hot topic in the computer vision and pattern recognition communities. Because the definition of an AU is a qualitative semantic description, automatic AU detection and AU-based expression analysis are very difficult in practice. Therefore, most works focus on classifying an input facial image or sequence into one of the six basic expressions.

The previous works on automatic facial expression recognition can be categorized into two main categories: image-based methods [23,28,30] and video-based methods [12,25,45]. The image-based methods take only mug shots as observations, which capture characteristic images at the apex of the expressions, and recognize expressions according to appearance features [28,3,24,29,23,44]. For example, Gabor features were used in [23], Haar features were employed in [35], and Local Binary Pattern features were adopted in [28]. In some cases, it is good enough to do expression recognition based on a static image. However, a natural facial expression is dynamic: it evolves over time through the onset, the apex, and the offset. The image-based methods ignore such dynamic characteristics, so they cannot perform well in most real-world settings. Psychological research has also demonstrated that, besides the category of an expression, facial expression dynamics is important to decipher its meaning [19,2]. Therefore, video-based methods have become popular in recent years [5,38,8,16,30,42,41,44]; they aim to analyze the dynamics of facial expression for recognition.

For the video-based methods, how to extract and represent the dynamics of facial expression is a key issue. Typical approaches track facial key points and analyze their motion and the geometric variation of facial appearance [15,22,33,46,11]. This kind of approach highly depends on facial key point detection and tracking, which are easily influenced by illumination. Some researchers assumed that the dynamics of facial expression are embedded in a manifold subspace, and they tried to learn such a manifold subspace for facial expression recognition [28,7,21]. However, manifold learning has an ''out of sample'' issue, and how to decide the dimension of the manifold is still an open problem. In [16], Zhao and Pietikainen proposed a Volume Local Binary Pattern (VLBP) descriptor to capture the dynamics of facial expression, which treats the video as volumetric data in the spatio-temporal domain. The volume feature has the advantage of tightly coupling temporal dynamics with spatial appearance. Similar volume features have also been introduced to action recognition [43], video-based face recognition [17], and pedestrian detection [37]. The volume features make the assumption that the training and the testing data must have the same time resolution, i.e., the same video length and speed rates.

Fig. 1. Some examples of smile events from different subjects and different cameras. (The happiness evolves from onset to apex, different cameras capture different numbers of frames for the first sequence, and the three subjects perform the same expression at different speeds.)


However, this assumption is hard to satisfy in practice. Different cameras have different capture rates; different subjects often show an expression at different speeds; even for the same person, it is almost impossible to duplicate an expression with the same time resolution. Fig. 1 shows an illustration. Thus, a time-warping pre-processing step is usually required, but few works have discussed this issue [44]. In [39], a dynamic binary pattern was proposed for facial expression recognition. Different from [16], it makes no assumption on time resolution. However, the dynamic binary pattern is simply based on the haar-like visual descriptor, so it is not sufficient to capture complex nonlinear variations including illumination, pose, and facial appearance diversity.

In this paper, we propose a new feature called the dynamic soft encoded pattern (DSEP), which handles the time resolution problem very well for facial event analysis. We develop the similarity feature as a vector of similarities between a sample and reference samples on the corresponding raw feature. If the similarity is calculated by a nonlinear dot-product kernel, the similarity feature is similar to the kernel feature composed of dot-products between a sample and all the training samples [1]. In this sense, the similarity feature has a similar ability to capture nonlinear variation as the kernel feature. Taking account of facial appearance diversity, we conduct k-means clustering on each raw feature of all the apex images, and we adopt the centers of the clusters as the references. The reason for using the apex images is a natural observation [32]: expressions around the apex are easy for humans to recognize, and it is natural to compare one mug shot with the apex expressions to decide whether the image belongs to a particular emotion.

Fig. 2. Clustering one haar-like feature from the mug shots at the apex of the angry expression.

The similarity is based on the haar-like visual descriptor and is calculated by a Gaussian kernel. To handle the issue of time resolution, we further perform a temporal clustering on the similarity features to produce several temporal patterns along the temporal domain, and then we map the similarity features into the DSEPs for the final classifier design with boosting learning [13]. Inspired by [23], Support Vector Machines (SVM) are also used to build the classifier based on the features selected by Adaboost. An SVM constructs a hyperplane in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane with maximum margin, since in general a larger margin means a lower generalization error of the classifier.

Compared with our previous work [40], the main differences are: (1) the newly proposed similarity feature, as a five-dimensional vector, is more robust to noise than the previous one; (2) soft assignment is applied to the dynamic binary pattern encoding, which describes the status of expression change more precisely; (3) more detailed analysis and extensive experiments are done to verify the performance of DSEP. Considering the continuous variation of the expression along the time domain, we embed soft assignment into our encoding strategy. Soft assignment is used to represent a continuous value with a weighted combination of nearby bins, so that the count in one bin is distributed to neighboring bins. Therefore it is more reasonable to use soft assignment in this work than the hard assignment of our previous work.

The rest of this paper is organized as follows. The similarity features are introduced in Section 2; Section 3 presents the clustering of temporal patterns and the construction of the dynamic soft encoded patterns; the classifier design is introduced in Section 4; Section 5 shows the experimental results on the Cohn–Kanade database, followed by the conclusion in Section 6.

2. Similarity features

Visually, the dynamics of facial expression is displayed by facial appearance variations, and facial appearance is the intuitive visual cue of a facial event, so we first need to describe facial appearance. The haar-like descriptor is a good choice: it has achieved much success in face detection [34], face recognition [14], and facial expression recognition [41,35]. In this paper, we also use haar-like descriptors to represent facial appearance. A haar-like descriptor is defined as the difference of the sums of pixels of sub-areas inside a rectangle [31]. Though it is simple, it indicates certain micro-structures in the image, such as edges or changes in texture. Fig. 3 shows the haar-like descriptors defined in OpenCV that we use in the experiments.

Due to the variation of scale and position of the haar-like descriptors, we can obtain thousands of haar-like features in one image. For simplicity, we denote $H_i = \{h_{i,j}\}, j = 1, 2, \ldots, M$ as the haar-like features of the image $I_i$, where the subscript $j$ indexes the $j$th haar-like feature in $I_i$ and $M$ is the number of features. For an image sequence with $N$ frames, $A = \{I_i\}, i = 1, 2, \ldots, N$, we extract the haar-like features from each frame $I_i$, so we obtain an ensemble of haar-like features, $A_H = \{H_i\}, i = 1, 2, \ldots, N$.
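To make the raw descriptor concrete, here is a minimal sketch of a two-rectangle haar-like response computed with an integral image; the rectangle layout, window position, and random test image are illustrative assumptions rather than the exact OpenCV templates of Fig. 3.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, so any rectangle sum costs four lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in the rectangle [top, top+height) x [left, left+width)."""
    b, r = top + height - 1, left + width - 1
    total = ii[b, r]
    if top > 0:
        total -= ii[top - 1, r]
    if left > 0:
        total -= ii[b, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def haar_two_rect_horizontal(ii, top, left, height, width):
    """Two-rectangle haar-like feature: left half minus right half (an edge-like response)."""
    half = width // 2
    return rect_sum(ii, top, left, height, half) - rect_sum(ii, top, left + half, height, half)

# Toy usage on a random 64x64 "face" image.
img = np.random.rand(64, 64)
ii = integral_image(img)
h_value = haar_two_rect_horizontal(ii, top=20, left=16, height=8, width=12)
```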

For facial expression recognition, we need to take account of the complex nonlinear variations of facial appearance caused by illumination and face shape. The simple haar-like appearance descriptors above are not sufficient to capture these variations. To handle this issue, we design a new similarity feature representation based on these haar-like features. The basic idea is to compare each haar-like feature to the corresponding ones in the reference images, and take the similarities as the similarity feature of that haar-like feature. Thus, we define the similarity feature $\vec{S}(h(x))$ as:

\vec{S}(h(x)) = \{ f(h(x), h(r_1)), f(h(x), h(r_2)), \ldots, f(h(x), h(r_m)) \}, \qquad (1)

where $f$ is the similarity measurement function, $\{r_1, r_2, \ldots, r_m\}$ are the $m$ references, and $h(r)$ represents the corresponding haar-like feature in the reference $r$. If $f$ is defined as a nonlinear dot-product kernel, $\vec{S}$ is actually similar to the kernel feature used in kernel learning [36,6]. We use the Gaussian kernel to calculate the similarity, so the proposed similarity feature has the characteristic of capturing nonlinear variation, similar to kernel features.

Fig. 3. The haar-like descriptors.
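As a sketch of Eq. (1), under the assumption that each haar-like feature is a scalar response, the similarity feature is simply the vector of Gaussian-kernel similarities between a sample's response and the m reference responses for the same haar-like feature; the bandwidth value is an illustrative choice.

```python
import numpy as np

def similarity_feature(h_x, h_refs, sigma=0.2):
    """Eq. (1): S(h(x)) = [f(h(x), h(r_1)), ..., f(h(x), h(r_m))] with a Gaussian kernel f."""
    h_refs = np.asarray(h_refs, dtype=float)
    return np.exp(-((h_x - h_refs) ** 2) / (2.0 * sigma ** 2))

# One haar-like response compared against five reference responses (m = 5 in the paper).
s = similarity_feature(h_x=0.3, h_refs=[0.1, 0.25, 0.4, 0.6, 0.9])
# s is the 5-dimensional similarity feature for this haar-like feature.
```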

How to select the references $\{r_1, r_2, \ldots, r_m\}$ is another key issue. In [32], psychological experiments showed that expressions around the apex are easy for human beings to recognize, and it is natural to compare one mug shot with the apex expressions to decide whether the image belongs to a particular emotion. Inspired by this idea, we select the reference images from the apex images. Taking account of facial appearance diversity, we conduct k-means clustering on the apex images collected from the training data, and take the centers of the clusters as the references. In our experiments, we set the number of clusters to five empirically. Fig. 2 shows the procedure of reference selection for the angry expression.
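Reference selection can be sketched as follows: for one haar-like feature, pool its responses over the apex images of an expression, run k-means with five clusters, and keep the cluster centers as the references r_1, ..., r_5. The use of scikit-learn's KMeans and the random test data are implementation assumptions, not prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_references(apex_responses, n_refs=5, seed=0):
    """Cluster the scalar responses of one haar-like feature over all apex images
    of an expression, and return the cluster centers as reference values."""
    x = np.asarray(apex_responses, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_refs, n_init=10, random_state=seed).fit(x)
    return np.sort(km.cluster_centers_.ravel())

# apex_responses: responses of one haar-like feature on the apex frames of "angry".
refs = select_references(np.random.rand(200))  # five reference values
```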

3. Temporal patterns clustering and representation

As mentioned in Section 1, it is almost impossible to make all the training data and the testing data have the same time resolution, due to several factors. In this section, we propose a dynamic soft encoding technique to address this issue, as well as to encode the dynamics of facial expressions. We first address how to explore the intrinsic temporal patterns of each expression, and then we describe how to map the similarity features into DSEPs with the temporal patterns.

3.1. Clustering intrinsic temporal patterns

Facial expression is a dynamic event, which evolves over time through the onset, the apex, and the offset. In most of the literature [21,29], researchers focus only on the expression at the apex and ignore the frames at low intensity levels. In this work, we take the dynamics of the expression from the onset to the apex into account. To describe the dynamics effectively, we quantize the evolution of an expression into several intrinsic states along the temporal domain. Following the idea in [10] that separates the expression status into five levels, we assume that each expression can be divided into five temporal states from the onset to the apex. Correspondingly, each similarity feature $\vec{S}(h)$ can be quantized into five temporal patterns as it evolves from the onset to the apex.

Without loss of generality, in the following we discuss how to model these five temporal patterns of a similarity feature $\vec{S}(h)$ for an expression $E$. Given all the $\vec{S}(h)$s in the training data, we cluster them into five clusters with the k-means algorithm. We model the five clusters as five Gaussian distributions $N^k_j\{\vec{\mu}^k_j, \sigma^k_j\}, k = 1, 2, \ldots, 5$, where $\vec{\mu}$ and $\sigma$ represent the mean and the variance of the cluster, respectively. Then we use an ensemble of such five-level Gaussian distributions to model the five temporal states of the expression $E$, as in Eq. (2):

B_E = \begin{cases}
N^1_1(\vec{\mu}^1_1, \sigma^1_1), \; N^2_1(\vec{\mu}^2_1, \sigma^2_1), \; \ldots, \; N^5_1(\vec{\mu}^5_1, \sigma^5_1) \\
N^1_2(\vec{\mu}^1_2, \sigma^1_2), \; N^2_2(\vec{\mu}^2_2, \sigma^2_2), \; \ldots, \; N^5_2(\vec{\mu}^5_2, \sigma^5_2) \\
\vdots \\
N^1_M(\vec{\mu}^1_M, \sigma^1_M), \; N^2_M(\vec{\mu}^2_M, \sigma^2_M), \; \ldots, \; N^5_M(\vec{\mu}^5_M, \sigma^5_M)
\end{cases} \qquad (2)

where $M$ is the total number of similarity features extracted from an image. For convenience, we call $B_E$ the temporal pattern model.
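A sketch of fitting the temporal pattern model of Eq. (2) for a single haar-like feature, assuming its similarity features are 5-dimensional vectors pooled over all training frames of expression E: k-means finds the five temporal states, and each state is summarized by a mean vector and an isotropic variance (the isotropic choice and the helper names are assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_temporal_patterns(sim_feats, n_states=5, seed=0):
    """Fit the five temporal states N^k{mu^k, sigma^k} of one similarity feature.

    sim_feats: (num_frames, 5) similarity features of one haar-like feature,
    pooled over all training sequences of expression E.
    Returns (means, variances) with shapes (n_states, 5) and (n_states,).
    """
    sim_feats = np.asarray(sim_feats, dtype=float)
    km = KMeans(n_clusters=n_states, n_init=10, random_state=seed).fit(sim_feats)
    means = km.cluster_centers_
    variances = np.array([
        sim_feats[km.labels_ == k].var() + 1e-6  # one isotropic variance per state
        for k in range(n_states)
    ])
    return means, variances

def state_likelihoods(s, means, variances):
    """P(S(h) | N^k) under an isotropic Gaussian for each of the five states."""
    d = means.shape[1]
    sq_dist = ((means - s) ** 2).sum(axis=1)
    return np.exp(-sq_dist / (2.0 * variances)) / np.sqrt((2.0 * np.pi * variances) ** d)

# The full model B_E stacks (means, variances) for all M haar-like features.
```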

3.2. Dynamic soft encoded pattern mapping

In [40], we proposed the dynamic binary encoding strategy to handle the issue of time resolution. Its basic idea is to map each feature $\vec{S}(h)$ into a 5-dimensional binary vector $\vec{V}(h) = (v_k)_{k=1,2,\ldots,5}$ by the Bayesian decision rule:

v_k = \begin{cases} 1 & \text{if } k = \arg\max_c P(\vec{S}(h) \mid N^c_h), \; c = 1, 2, \ldots, 5, \\ 0 & \text{otherwise}, \end{cases} \qquad (3)

where $P(\vec{S}(h) \mid N^c_h)$ is the probability of $\vec{S}(h)$ given the model $N^c_h$. For the binary feature $\vec{V}(h)$, exactly one dimension is 1 and the other four dimensions are 0, which means that a $\vec{S}(h)$ can be projected into only one of the corresponding five clusters.

The encoding strategy in Eq. (3) is a typical hard assignment. Due to the continuous variation of expression, the hard assignment is somewhat inaccurate. In [27], Reilly et al. used Gaussian models to fit the distributions of the five intensity levels of action units in FACS based on locally linear embedding, and they found a large overlap between neighboring levels, as shown in Fig. 4. This result implies that the hard assignment is inaccurate.

In this paper, we propose a soft assignment to improve the above encoding strategy. The idea is inspired by ''soft assignment'' in histogram calculation [4], where it is used to represent a continuous value by a weighted combination of nearby bins, or to ''smooth'' a histogram so that the count in one bin is distributed to neighboring bins. Philbin et al. used soft assignment in object retrieval and obtained a large improvement [26]. Because the expression change is continuous, each sample should be assigned to different clusters with different weights. We adopt a soft assignment to account for the variation of expression intensity. We set the number of nearest neighbors to 2, because the expression intensity always changes monotonically; in other words, ideally, the intensity lies between two neighboring clusters. We convert $\vec{S}(h)$ into a five-dimensional encoded vector by the soft assignment, $\vec{S}(h) \to \vec{V}(h) = (v_k), v_k = w$, where $w \in [0,1]$. $\vec{V}(h)$ is computed by the Bayesian rule as:

v_{k_1} = \frac{P(\vec{S}(h) \mid N^{k_1}_h)}{P(\vec{S}(h) \mid N^{k_1}_h) + P(\vec{S}(h) \mid N^{k_2}_h)}, \quad
v_{k_2} = \frac{P(\vec{S}(h) \mid N^{k_2}_h)}{P(\vec{S}(h) \mid N^{k_1}_h) + P(\vec{S}(h) \mid N^{k_2}_h)}, \quad
v_{k_i} = 0 \text{ otherwise}, \qquad (4)

where $k_1$ and $k_2$ are the indices of the two clusters closest to $\vec{S}(h)$.
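A minimal sketch of the soft assignment in Eq. (4), reusing the hypothetical state_likelihoods helper from the previous sketch: only the two most likely states receive non-zero weights, given by their normalized likelihoods.

```python
import numpy as np

def soft_encode(s, means, variances, n_states=5):
    """Eq. (4): map a similarity feature S(h) to a 5-dimensional soft code V(h).

    Only the two closest states k1, k2 get weights (their likelihoods normalized
    to sum to one); all other entries are zero.
    """
    p = state_likelihoods(s, means, variances)   # P(S(h) | N^k), k = 1..5
    k1, k2 = np.argsort(p)[-2:][::-1]            # indices of the two closest states
    v = np.zeros(n_states)
    denom = p[k1] + p[k2] + 1e-12
    v[k1] = p[k1] / denom
    v[k2] = p[k2] / denom
    return v
```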

We map all the similarity features to five-dimensional soft encoded feature vectors for each frame of the sequence. Inspired by the idea in [9], we calculate the histogram of all the soft encoded feature vectors over the whole sequence for each feature, and normalize it as in Eq. (5). Based on this mapping, the normalization is done on each sequence as

\vec{D}(h) = \frac{1}{N} \sum_{i=1}^{N} \vec{V}_i(h), \qquad (5)

where $\vec{D}(h)$ is a five-dimensional vector. No matter how long a sequence is, the final feature is always five-dimensional, and the time-warping operation is done implicitly. The effectiveness of this encoding strategy was demonstrated by extensive experiments in [40]. Basically, our method is a video-sequence based method and it handles the time resolution problem very well. Because DSEP does not depend on the number of frames, it can even handle the extreme case where the input sequence holds just one frame.

Fig. 4. Intensity over time for AU25. (From [27].)

Based on Eq. (5), the sequence is represented by $M$ five-dimensional features $\vec{D}(h)$, in which the dynamics of the sequence are encoded, and $\vec{D}(h)$ is independent of the time resolution of the sequence. We call $\vec{D}(h)$ the DSEP. Similar to [16], we convert DSEPs into decimal values, and we use them to construct the expression classifier. Fig. 5 shows an example of extracting a DSEP.
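Aggregating the per-frame soft codes into a DSEP (Eq. (5)) is an average over the frames, which is what makes the representation independent of the sequence length; the sketch below assumes the soft_encode helper from the previous sketch.

```python
import numpy as np

def dsep(sequence_sim_feats, means, variances):
    """Eq. (5): D(h) = (1/N) * sum_i V_i(h) for one haar-like feature.

    sequence_sim_feats: (N, 5) similarity features of one haar-like feature over
    the N frames of a sequence (N may differ between sequences).
    """
    codes = np.array([soft_encode(s, means, variances) for s in sequence_sim_feats])
    return codes.mean(axis=0)   # always 5-dimensional, regardless of N
```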

4. Classifier design

The number of DSEPs is tremendous, since it equals the number of haar-like features in one image. From a single 64 × 64 image, we can extract as many as 195,552 haar-like features. For each expression, only a few parts of the facial appearance actually play the dominant role, so most DSEPs are not significant for expression recognition. It is well known that Adaboost learning is a good tool to select good features and combine them into a strong classifier [34]. Therefore we adopt Adaboost to learn a set of discriminative DSEPs and use them to construct the expression classifier. In this paper, we consider the six basic expressions, so it is a six-class recognition problem. We use the one-against-all strategy to decompose the six-class problem into multiple two-class problems. For each expression, we set the corresponding samples as the positive samples, and the samples of the other expressions as the negative samples. Table 1 summarizes the learning algorithm.
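The one-against-all procedure of Table 1 can be sketched with scikit-learn's AdaBoostClassifier, whose default weak learner is a decision stump (one threshold on a single DSEP dimension per round); flattening all DSEPs of a sequence into one vector and the number of rounds are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def train_one_vs_all(dsep_matrix, labels, target_expression, n_rounds=150):
    """Table 1 (sketch): learn a strong classifier H(x) for one expression E.

    dsep_matrix: (num_sequences, num_features) matrix of flattened DSEPs.
    Samples of E are positive (+1); samples of all other expressions are negative (-1).
    """
    y = np.where(np.asarray(labels) == target_expression, 1, -1)
    # Each boosting round fits a depth-1 stump, i.e. one weak classifier on one
    # DSEP dimension, and the strong classifier is their weighted vote.
    clf = AdaBoostClassifier(n_estimators=n_rounds)
    return clf.fit(np.asarray(dsep_matrix, dtype=float), y)

# Six such classifiers (one per basic expression) form the one-against-all recognizer.
```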

The work in [23] proposed to use an SVM on the features selected by Adaboost to further improve the final performance. Following [23], we also design an SVM classifier based on the DSEPs selected by Adaboost.
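A hedged sketch of the DSEP + AdaSVM variant: keep the DSEP dimensions that the boosted stumps relied on most and train an SVM on that reduced representation. Using feature_importances_ as the selection signal and an RBF kernel are assumptions; the paper follows the selection scheme of [23].

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

def train_adasvm(dsep_matrix, y, n_selected=150):
    """Select discriminative DSEP dimensions with Adaboost, then train an SVM on them."""
    X = np.asarray(dsep_matrix, dtype=float)
    ada = AdaBoostClassifier(n_estimators=n_selected).fit(X, y)
    selected = np.argsort(ada.feature_importances_)[::-1][:n_selected]
    svm = SVC(kernel="rbf").fit(X[:, selected], y)
    return svm, selected
```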

5. Experiments

5.1. Data set

We conduct our experiments on the Cohn–Kanade facial expression database [20], which is widely used to evaluate facial expression recognition algorithms. This database consists of 100 students aged from 18 to 30 years, of which 65% were female, 15% were African-American, and 3% were Asian or Latino. Subjects were instructed to perform a series of 23 facial displays, six of which were the prototypic emotions mentioned above. For our experiments, we selected 352 image sequences from 96 subjects. The selection criterion was that a sequence could be labeled as one of the six basic emotions. Fig. 6 shows samples of the six expressions.

In previous works [30,23,16,28], the researchers only picked the frames at the apex level for recognition and discarded the frames with low intensities. Such data collection simplifies the experimental setup, but it does not follow the suggestion from psychological studies that facial expression dynamics is important when attempting to decipher its meaning [2]. Different from the previous works, we take into account all the frames from the onset to the apex. On average, there are around 12 frames per sequence in our experimental data. Besides including the frames with low intensities, our data set is four times larger than those of [28,30]. All the images are normalized to 64 × 64 based on the location of the eyes. We use five-fold cross-validation to test the proposed method and compare it with the related works. Sliding windows with different sampling rates are applied to both the training and testing sets to simulate different time resolutions, and DSEPs are extracted on these subsequences.

Fig. 5. Calculating the normalized dynamic soft encoded pattern.

Table 1
Learning procedure.

1. Given example image sequences $(x_1, y_1), \ldots, (x_n, y_n)$, $y_i \in \{1, -1\}$, for the specified expression $E$ and the other expressions, respectively.
2. Extract the similarity features on each image sequence.
3. Encode the dynamic similarity features based on the corresponding temporal pattern model $B_E$ to get $\vec{D}_i$. Build one weak classifier on each dynamic soft encoded pattern.
4. Use Adaboost to learn the strong classifier $H(x_i)$.
5. Output the strong classifier: $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

Table 2
The area under the ROC curves (3D haar-like feature and DSEP).

Expression     9 Frames               7 Frames
               3D-Haar     DSEP       3D-Haar     DSEP
Angry          0.795       0.958      0.848       0.948
Disgust        0.733       0.924      0.869       0.926
Fear           0.740       0.925      0.777       0.939
Happiness      0.912       0.987      0.944       0.974
Sadness        0.836       0.994      0.921       0.982
Surprise       0.958       0.998      0.984       0.998


5.2. Evaluation of dynamic soft encoded patterns

Similar to the haar-like volume feature [37] (denoted as 3D-haar for simplicity), the proposed DSEP is based on haar-like features and integrates spatio-temporal information. However, different from 3D-haar, DSEP is built on the similarity features of the haar-like descriptors and is encoded with the temporal pattern models. Therefore DSEP is more robust and discriminative than 3D-haar, and in particular more robust to time resolution variation. To evaluate the performance of DSEP, we first compare it with 3D-haar under the same experimental setting, with both using Adaboost as the classifier. Because 3D-haar requires the training data and the testing data to have the same length, we use a fixed-length window to slide over the sequences to produce fixed-length samples. We test two sliding windows, of 7 frames and 9 frames respectively. Table 2 reports the area under the ROC curves, and Fig. 7 shows the ROC curves. It is clear that the performance of DSEP is better than that of 3D-haar.

Fig. 6. Examples of the six basic expressions (anger, disgust, fear, happiness, sadness and surprise).

Compared to 3D-haar, DSEP has a special advantage: it makes no assumption on the time resolution of the data. To show this advantage, we use two experimental settings: (1) we use the fixed sliding window (7 frames) to produce the training samples, but the testing samples are collected with various time resolutions, including various lengths; (2) both the training samples and the testing samples are collected with various time resolutions. To simulate the various time resolutions, we first use a fixed sliding window to produce samples with a fixed length, and then we randomly drop some frames from these fixed samples to obtain samples with variable time resolutions. For example, as shown in the following figures and tables, ''xxxxxxx'' means a sample produced from the sliding window with 7 frames, and ''x0x0xxx'' means a sample with 5 frames extracted from a 7-frame sliding window, where ''0'' indicates that the frame is discarded. Fig. 8 and Table 3 report the results of the first experiment, where the length of the testing samples varies from 10 to 5. Tables 4 and 5 show the results of the second experiment. Two cases are considered: (a) learn the model on the training set with mixed resolutions, and test it individually on testing sets with different time resolutions; (b) learn the models individually on training sets with different time resolutions, and test each model on the testing set with mixed time resolutions. We use the recognition rate to evaluate the performance. From the results, we can see that DSEP is insensitive to varied time resolutions.
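The frame-dropping masks such as ''x0x0xxx'' used above can be simulated with a helper like the one below; the mask strings come from the tables, while the helper itself is illustrative.

```python
def apply_mask(frames, mask):
    """Keep frame i of a fixed-length window iff mask[i] == 'x' (a '0' drops the frame)."""
    return [f for f, m in zip(frames, mask) if m == "x"]

# A 7-frame window subsampled to 5 frames with one of the masks from the experiments.
window = list(range(7))                        # stand-in for 7 consecutive frames
subsequence = apply_mask(window, "x0x0xxx")    # -> frames [0, 2, 4, 5, 6]
```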


Fig. 7. ROC curves of six expressions in Table 2.



Fig. 8. ROC curves of six expressions in Table 3.



Table 3
The area under the ROC curves (training on 7 (xxxxxxx) frames).

Testing set    Angry     Disgust   Fear      Happiness   Sadness   Surprise
xxxxxxxxxx     0.9662    0.9207    0.9355    0.9735      0.9700    0.9994
xx0xx0xx0x     0.9763    0.9111    0.9523    0.9838      0.9690    1.0000
xxxxxxxxx      0.9500    0.9343    0.9216    0.9726      0.9769    0.9985
xxxxxxx        0.9481    0.9262    0.9391    0.9742      0.9825    0.9986
x000xxx        0.9550    0.9265    0.9315    0.9745      0.9794    1.0000
xx0xxxx        0.9620    0.9159    0.9117    0.9749      0.9808    0.9992
x0x0xxx        0.9525    0.9253    0.9163    0.9776      0.9788    0.9995
xx000xx        0.9626    0.9266    0.9380    0.9787      0.9807    1.0000
0xx0x          0.9610    0.9249    0.9201    0.9762      0.9807    0.9982

Mean           0.9593    0.9235    0.9296    0.9762      0.9776    0.9993
Standard variance  0.0089    0.0068    0.0131    0.0034      0.0049    0.0007

Table 4
The performance of the model trained on various mixed resolutions.

Resolution (testing set)    Recognition rate (%)
Mixture                     82.51
xxx                         82.57
xxxxxxx                     82.38
0xx0x                       83.04
xx000xx                     81.92
x000xxx                     82.81
xx0xxxx                     82.21
x0x0xxx                     82.57

Mean                        82.50

Table 5
The performance of the model trained on a single resolution.

Resolution (training set)    Recognition rate (%)
Mixture                      82.51
xxx                          82.23
xxxxxxx                      80.40
0xx0x                        81.34
xx000xx                      80.99
x000xxx                      82.07
xx0xxxx                      80.31
x0x0xxx                      81.59

Mean                         81.49

Table 6
The recognition results of different methods.

Method                           Recognition rate (%)
Haar + AdaSVM                    76.24
LBP-TOP 8,8,8,3,2,3 + SVM        76.93
LBP-TOP 8,8,8,3,3,3 + SVM        77.54
Our method ([40]), xxx           80.81
DSEP + AdaSVM, xxx               82.20
DSEP + AdaSVM, xxxxxxx           81.35
DSEP + Ada, xxx                  82.60
DSEP + Ada, xxxxxxx              81.20

Table 7
The confusion matrix based on static haar features.

Recognition rate (%)  Angry     Disgust   Fear      Happiness   Sadness   Surprise
Angry                 56.72     14.57     7.88      1.41        19.42     0
Disgust               19.28     75.92     2.03      0           2.77      0
Fear                  4.34      1.86      68.07     21.86       1.96      1.91
Happiness             0.46      0.56      13.15     85.44       0.20      0.19
Sadness               13.58     2.50      8.60      2.11        67.41     5.80
Surprise              2.18      0         1.34      1.55        2.61      92.32

Table 8
The confusion matrix based on VLBP (LBP-TOP 8,8,8,3,2,3).

Recognition rate (%)  Angry     Disgust   Fear      Happiness   Sadness   Surprise
Angry                 58.03     7.22      3.59      0           29.67     1.47
Disgust               16.06     56.83     6.02      0.73        20.34     0
Fear                  3.00      2.77      69.16     17.09       5.06      2.89
Happiness             1.07      0.22      9.17      88.73       0.78      0
Sadness               13.00     3.48      6.53      0           76.01     0.95
Surprise              2.19      0         1.84      0           3.64      92.31

Table 9
The confusion matrix based on DSEP (xxx frames).

Recognition rate (%)  Angry     Disgust   Fear      Happiness   Sadness   Surprise
Angry                 71.35     11.89     1.40      0.60        14.74     0
Disgust               10.35     77.56     0         0           12.08     0
Fear                  3.34      0.70      72.19     21.86       1.89      0
Happiness             0.40      0         10.14     89.44       0         0
Sadness               8.55      0.79      4.85      0           81.93     3.86
Surprise              0.35      0         1.99      0           3.31      94.34


5.3. Experiment results

We design two classifiers based on DSEP for expression recognition. One is Adaboost; the other is an SVM based on the DSEPs selected by Adaboost. For simplicity, we denote them as DSEP + Ada and DSEP + AdaSVM, respectively.

In order to evaluate the proposed method, we compare it with two popular methods: volume local binary patterns with SVM [16] and Gabor features with AdaSVM [23]. In [16], facial appearance dynamics are represented by the volume local binary patterns (VLBP), and an SVM is adopted as the classifier. In our experiments, two kinds of VLBP are implemented, i.e., LBP-TOP 8,8,8,3,2,3 and LBP-TOP 8,8,8,3,3,3. In [23], Littlewort et al. first used Adaboost on Gabor features, and then applied an SVM on the Gabor features selected by Adaboost for expression recognition. Here, we replace Gabor features with haar features because [35] reports that haar features are comparable with Gabor features for expression recognition. To compare our method with the method in [23], 195,552 haar-like features are extracted from each image as the raw feature pool, and 150 features are selected from the pool for each expression in both our method and the method of [23]. The results are shown in Table 6. The recognition rate is the number of correctly recognized samples divided by the total number of samples. Compared with our previous work [40], we introduce two techniques into the current work: five-dimensional encoding on apex images and soft assignment. Without soft assignment, we obtain 81.32% with a standard variance of 1.39 in the case using xxx.

Table 10
The recognition results of different time resolutions.

Testing set    Training set
               xxx       xxxxxxx   0xx0x     xx000xx   x000xxx   xx0xxxx   x0x0xxx
xxx            82.20     81.36     80.08     80.28     81.45     80.06     81.12
xxxxxxx        82.07     81.35     81.44     80.67     82.19     80.52     81.75
0xx0x          82.09     81.15     81.33     80.07     82.29     80.34     81.84
xx000xx        82.44     81.49     81.09     80.35     82.79     80.45     81.87
x000xxx        82.06     81.60     81.71     80.70     82.86     80.73     82.01
xx0xxxx        82.81     81.61     81.10     80.60     82.04     80.34     81.87
x0x0xxx        81.86     81.10     81.32     80.04     82.27     80.05     81.42

Mean           82.21     81.38     81.15     80.38     82.27     80.35     81.69
Variance       0.314     0.202     0.518     0.276     0.474     0.244     0.313


Combining with soft assignment, the recognition rate increases to 82.20% with a standard variance of 1.26. Although the soft assignment does not improve the performance dramatically, both of these two techniques have a positive impact on the recognition rate. Tables 7–9 show the confusion matrices of the method based on static features [23], VLBP [16], and our method. From the results in Tables 6–9, we can clearly see that our method is better than the other two methods. The recognition rates are lower for the expressions angry, disgust, and fear compared to the other three expressions; the same trend was reported in [16]. It is not surprising that all three methods do not reach recognition rates as high as those reported in [23,16], because we include more frames that cover the expressions at low intensity. This implies that recognizing low intensity expressions is still a hard problem.

To show that our method is robust to the time resolution problem, the model is trained on one specific snippet type and applied to various snippets. We use the recognition rate directly to show the performance on different time resolutions and sequence lengths, and the results are listed in Table 10. We can see the robustness of our method.

6. Conclusions

This paper presents a novel feature, DSEP, for video-based facial expression recognition. First, we build the similarity features, which take the apexes of facial expressions as the references. In order to capture the dynamics of facial expression and handle the issue of time resolution, the similarity features are further mapped into DSEPs. Adaboost is adopted to build the facial expression classifier and to select discriminative patterns. An SVM is also applied on the selected DSEPs to build classifiers. Compared to the state of the art, our method is robust to the time resolution and achieves a promising performance on low intensity expression recognition. Experiments on the well-known Cohn–Kanade facial expression database show the power of the proposed method.

References

[1] A. Aizerman, E.M. Braverman, L.I. Rozoner, Theoretical foundations of the potential function method in pattern recognition learning, Automation and Remote Control 25 (1964) 821–837.
[2] Z. Ambadar, J. Schooler, J.F. Cohn, Deciphering the enigmatic face: the importance of facial dynamics in interpreting subtle facial expression, Psychological Science (2005).
[3] M. Bartlett, G. Littlewort, I. Fasel, J. Movellan, Real time face detection and facial expression recognition: development and applications to human–computer interaction, Computer Vision and Pattern Recognition Workshop on Human–Computer Interaction (2003).
[4] C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[5] M.J. Black, Y. Yacoob, Recognizing facial expressions in image sequences using local parameterized models of image motion, International Journal of Computer Vision 25 (1) (1997) 23–48.
[6] B. Cao, D. Shen, J.-T. Sun, Q. Yang, Z. Chen, Feature selection in a kernel space, 2007.
[7] Y. Chang, C. Hu, M. Turk, Manifold of facial expression, in: IEEE International Workshop on Analysis and Modeling of Faces and Gestures, 2003.
[8] I. Cohen, N. Sebe, L. Chen, A. Garg, T. Huang, Facial expression recognition from video sequences: temporal and static modeling, Computer Vision and Image Understanding 91 (1–2) (2003) 160–187.
[9] J. Daugman, Demodulation by complex-valued wavelets for stochastic pattern recognition, International Journal of Wavelets, Multiresolution and Information Processing (2003).
[10] P. Ekman, W.V. Friesen, Facial Action Coding System, Consulting Psychologists Press, 1978.
[11] I. Essa, A.P. Pentland, Facial expression recognition using a dynamic model and motion energy (1995).
[12] B. Fasel, J. Luettin, Automatic facial expression analysis: a survey, Pattern Recognition 36 (2003) 259–275.
[13] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in: European Conference on Computational Learning Theory, 1995.
[14] B. Fröba, S. Stecher, C. Küblbeck, Boosting a haar-like feature set for face verification, Audio- and Video-Based Biometric Person Authentication (2003).
[15] H. Gu, Q. Ji, Facial event classification with task oriented dynamic bayesian network, IEEE Computer Vision and Pattern Recognition (2004).
[16] G. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6) (2007) 915–928.
[17] A. Hadid, M. Pietikainen, S.Z. Li, Learning personal specific facial dynamics for face recognition from videos, Analysis and Modeling of Faces and Gestures (2007).
[18] C.E. Izard, The Face of Emotion, Appleton-Century-Crofts, New York, 1971.
[19] J. Bassili, Emotion recognition: the role of facial movement and the relative importance of upper and lower areas of the face, Journal of Personality and Social Psychology 37 (1979).
[20] T. Kanade, J. Cohn, Y.-L. Tian, Comprehensive database for facial expression analysis, in: Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition, 2000.
[21] C.-S. Lee, A. Elgammal, Facial expression analysis using nonlinear decomposable generative models, in: IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG05) with ICCV'05, 2005.
[22] J. Lien, T. Kanade, J. Cohn, C. Li, Detection, tracking, and classification of action units in facial expression, Journal of Robotics and Autonomous Systems (1999).
[23] G. Littlewort, M.S. Bartlett, I. Fasel, J. Susskind, J. Movellan, Dynamics of facial expression extracted automatically from video, Journal of Image and Vision Computing (2006).
[24] M. Pantic, J. Rothkrantz, Facial action recognition for facial expression analysis from static face images, IEEE Transactions on Systems, Man and Cybernetics (2004).
[25] M. Pantic, L.J.M. Rothkrantz, Automatic analysis of facial expressions: the state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1424–1445.
[26] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Lost in quantization: improving particular object retrieval in large scale image databases, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[27] J. Reilly, J. Ghent, J. McDonald, Investigating the dynamics of facial expression, in: International Symposium on Visual Computing, 2006.
[28] C. Shan, S. Gong, P.W. McOwan, Conditional mutual information based boosting for facial expression recognition, in: British Machine Vision Conference, 2005.
[29] C. Shan, S. Gong, P.W. McOwan, Robust facial expression recognition using local binary patterns, in: IEEE International Conference on Image Processing, 2005.
[30] Y. Tian, Evaluation of face resolution for expression analysis, in: Computer Vision and Pattern Recognition Workshop on Face Processing in Video, 2004.
[31] K. Tieu, P. Viola, Boosting image retrieval, IEEE Computer Vision and Pattern Recognition (2000).
[32] A. Tversky, Features of similarity, Psychological Review (1977).
[33] M.F. Valstar, I. Patras, M. Pantic, Facial action unit detection using probabilistic actively learned support vector machines on tracked facial point data, CVPRW'05 on Vision for Human-Interaction (2005).
[34] P. Viola, M. Jones, Robust real-time object detection, International Journal of Computer Vision 57 (2) (2001) 137–154.
[35] J. Whitehill, C.W. Omlin, Haar features for FACS AU recognition, in: Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition, 2006.
[36] L. Wolf, A. Shashua, Kernel feature selection, 2003.
[37] X. Cui, Y. Liu, S. Shan, X. Chen, W. Gao, 3D haar-like features for pedestrian detection, in: IEEE International Conference on Multimedia and Expo, 2007.
[38] Y. Yacoob, L. Davis, Computing spatio-temporal representations of human faces, Computer Vision and Pattern Recognition (1994).
[39] P. Yang, Q. Liu, D. Metaxas, Facial expression recognition using encoded dynamic features, in: The IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[40] P. Yang, Q. Liu, D. Metaxas, Similarity features for facial event analysis, in: The Proceedings of the 10th European Conference on Computer Vision, 2008.
[41] P. Yang, Q. Liu, D.N. Metaxas, Boosting coded dynamic features for facial action units and facial expression recognition, Computer Vision and Pattern Recognition (2007).
[42] M. Yeasin, B. Bullot, R. Sharma, From facial expression to level of interest: a spatio-temporal approach, Computer Vision and Pattern Recognition (2004).
[43] Y. Ke, R. Sukthankar, M. Hebert, Efficient visual event detection using volumetric features, in: IEEE International Conference on Computer Vision, 2005.
[44] Z. Zeng, M. Pantic, G. Roisman, T. Huang, A survey of affect recognition methods: audio, visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (1) (2009) 39–58.
[45] Z. Zeng, M. Pantic, G.I. Roisman, T.S. Huang, A survey of affect recognition methods: audio, visual and spontaneous expressions, in: International Conference on Multimodal Interfaces, 2007.
[46] Z. Zhang, M. Lyons, M. Schuster, S. Akamatsu, Comparison between geometry-based and gabor-wavelets-based facial expression recognition using multi-layer perceptron, in: Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition, 1998.