
Real-time facial animation on mobile devices



Graphical Models 76 (2014) 172–179



1524-0703/$ - see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.gmod.2013.10.002

* Corresponding author. E-mail address: [email protected] (Y. Weng).

Yanlin Weng*, Chen Cao, Qiming Hou, Kun Zhou
State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China

Article info

Article history:
Received 13 August 2013
Accepted 1 October 2013
Available online 19 October 2013

Keywords:
Video tracking
3D avatars
Facial performance
User-specific blendshapes
Shape regression

Abstract

We present a performance-based facial animation system capable of running on mobile devices at real-time frame rates. A key component of our system is a novel regression algorithm that accurately infers the facial motion parameters from 2D video frames of an ordinary web camera. Compared with the state-of-the-art facial shape regression algorithm [1], which takes a two-step procedure to track facial animations (i.e., first regressing the 3D positions of facial landmarks, and then computing the head poses and expression coefficients), we directly regress the head poses and expression coefficients. This one-step approach greatly reduces the dimension of the regression target and significantly improves the tracking performance while preserving the tracking accuracy. We further propose to collect training images of the user under different lighting environments, and make use of the data to learn a user-specific regressor, which can robustly handle lighting changes that frequently occur when using mobile devices.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

Performance-based facial animation has attracted a lot of research attention in the fields of computer vision and computer graphics. Facial animation techniques based on special equipment (e.g., facial markers and structured light projectors) have achieved great success in film and game production. However, these techniques cannot be adopted by ordinary users, who do not have access to such equipment. With the rapid spread of mobile devices, a real-time, single-camera facial animation system could play an important role in many applications such as video chat, social networking and online games.

Weise et al. [2] developed a real-time facial animation system that utilizes depth and color information from a Kinect RGBD camera. It can, however, only work in indoor environments, and none of the existing mobile devices integrates a depth camera. Earlier this year, Cao et al. [1] proposed a more practical solution for ordinary users, which only needs a single web camera. Although most mobile devices contain a video camera, two major issues still prevent that system from running efficiently on mobile devices: first, the 3D shape regression algorithm in [1] is time consuming, significantly limiting the system's performance on mobile devices; second, its limited ability to handle dramatic lighting changes becomes a serious problem on mobile devices, which are frequently used in different environments.

In this paper, we present a facial animation system suitable for mobile devices by making two contributions. First, instead of using a two-step procedure to track facial animations (i.e., first regressing the 3D positions of facial landmarks, and then computing the head poses and expression coefficients) as in [1], we directly regress the facial motion parameters (i.e., head poses and expression coefficients) from 2D video frames of an ordinary web camera. This one-step approach greatly reduces the dimension of the regression target and significantly improves the performance of the tracking process while preserving the tracking accuracy. Second, we propose to collect training images of the user under different lighting environments, and make use of this data to learn a user-specific regressor. This regressor can robustly handle lighting changes that frequently occur when using mobile devices.


Fig. 1. Our facial animation system running at over 30 fps on a smartphone. The insert on the top right is a screenshot of the phone, showing the input video frame, reconstructed landmarks and avatar.


1.1. Related work

Many facial animation systems work by tracking facial features and using the information derived from these features to animate digital avatars. Tracking algorithms based on special equipment (e.g., markers [3,4], structured light [5,6], and camera arrays [7,8]) have been widely used in film production, where high-fidelity animations are required. While able to generate high-quality animations, these methods are not suitable for ordinary users because they require special equipment.

Face tracking and animation based on simple devices, such as an ordinary video camera, has also been explored. According to how facial features are extracted from video, these tracking algorithms can be divided into two main categories: optimization-based and regression-based. Optimization-based methods formulate a fitting energy related to the matching error between the features and the image appearance. For example, Active Appearance Model (AAM) methods [9,10] reconstruct the face shape based on an appearance model by minimizing the error between the reconstructed appearance and the image appearance. Regression-based methods learn a regression function that maps the image appearance to the target output. Some methods [11,12] use parametric models (e.g., AAM) and regress the model parameters. Cao et al. [13] directly regress facial landmarks without using any parametric shape model, by minimizing the alignment error over the training data in a holistic manner. All these algorithms learn a model or a regressor from a large pool of training data, and try to find a general routine to track facial features in any face image. However, such general models or regressors may not generate satisfactory tracking results for images of a specific person, especially for non-frontal faces and exaggerated expressions.

Weise et al. [2] developed a novel facial animation system based on a Microsoft Kinect RGBD camera. They combine geometry and texture registration with animation priors into a single optimization problem. With the help of a user-specific blendshape model pre-constructed for the user, they formulate the optimization as a maximum a posteriori estimation in a reduced parameter space. Cao et al. [1] demonstrated comparable tracking results using a single web camera. They developed a novel user-specific 3D shape regressor, which regresses the 3D positions of landmarks directly from 2D video frames. These 3D landmarks are then used to track the facial motion, including the rigid transformation and the non-rigid blendshape coefficients, which can be mapped to any digital avatar.

1.2. Overview

We propose to learn a user-specific regressor that directly regresses facial motion parameters from 2D video frames. To learn this regressor, we first prepare the training data as in [1] (Section 2.1), then construct the training set for the regressor (Section 2.2), and finally learn a facial motion regressor (Section 2.3). At run time (Section 2.4), the regressor takes the video frame and the previous frame's facial motion parameters as input and computes the facial motion parameters for the current frame.

In Section 3, we describe two strategies for handling dramatic lighting changes in the system. First, we collect images taken under different lighting environments and use them together for training, which broadens the range of lighting environments our regressor can handle. Second, to deal with the global intensity adjustment caused by the camera's white balance, we perform a histogram normalization on all training images and runtime video frames.

2. Facial motion regression

Our facial motion regression algorithm is an extension of the 3D facial shape regression algorithm proposed by Cao et al. [1]. Unlike [1], where the regression target is the 3D facial shape, our regression target is the set of facial motion parameters that describe the rigid and non-rigid facial motion. We take an image I and an initial guess of its facial motion parameters x^c as input, iteratively update x^c in an additive manner, and output the final motion parameters. As in [1], the regression algorithm consists of four parts: training data preparation, training set construction, training, and run-time regression.

2.1. Training data preparation

Following [1], we need to collect training data from the user.

2.1.1. Image capturing and labeling

First, we capture 60 images {I_i} of the user performing a sequence of pre-defined facial poses and expressions. Then the 2D facial alignment algorithm described in [13] is used to automatically locate 75 facial landmarks in each image. These landmarks describe the relatively distinctive facial features of the image, such as the face contour, eye boundaries, and mouth boundary. The user can also manually adjust the landmark positions if the automatic detection results are not accurate enough. Two examples of labeled images are shown in Fig. 2.


Fig. 2. Two captured images with automatically located landmarks (a and c), and the manually corrected landmarks (b and d).
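As a concrete illustration of the setup data described above, one possible in-memory layout for the 60 labeled captures is sketched below. This is only a bookkeeping sketch; the class and function names are ours and not part of the paper's implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LabeledCapture:
    """One setup image of the user together with its 75 annotated 2D landmarks."""
    image: np.ndarray      # H x W (grayscale) or H x W x 3 (color) pixel array
    landmarks: np.ndarray  # 75 x 2 array of (x, y) landmark positions in pixels

def bundle_setup_data(images, landmark_lists):
    """Pair the 60 captured images with their (possibly hand-corrected) landmarks."""
    assert len(images) == len(landmark_lists) == 60
    return [LabeledCapture(img, np.asarray(lms, dtype=np.float64).reshape(75, 2))
            for img, lms in zip(images, landmark_lists)]
```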


2.1.2. User-specific blendshape generation

Making use of these labeled images and the FaceWarehouse database [14], we extract the user-specific blendshapes B = {B_0, B_1, ..., B_46} and obtain the intrinsic matrix Q of the camera by minimizing the following energy, as in [1]:

E_t(Q, w_id, w_exp,i, M_i) = \sum_{i=1}^{60} \sum_{k=1}^{75} \| P_Q(M_i F^{(v_k)}) - s_i^{(k)} \|^2,  with  F = C_r \times_2 w_id^T \times_3 w_exp,i^T,    (1)

where P_Q is the projection operator that transforms a 3D position in camera coordinates to a 2D position in screen coordinates, M_i is the rigid transformation matrix for each image, F is the mesh reconstructed by tensor contraction, C_r is the core tensor from FaceWarehouse, w_id and w_exp,i are the column vectors of identity and expression weights in the tensor (the identity weights are the same for all images), s_i^{(k)} is the 2D position of the k-th landmark in image I_i, and v_k is the related vertex index on the mesh.

Following the assumption of an ideal pinhole camera, the intrinsic matrix Q of the camera can be represented as

Q = \begin{pmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{pmatrix},    (2)

where f represents the focal length in pixels, and (c_x, c_y) represents the principal point, which is the center of the image. Thus f is the only unknown parameter in the matrix.
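For illustration, the projection operator P_Q with the ideal pinhole intrinsics of Eq. (2) amounts to only a few lines of code. This is a minimal sketch assuming points are already expressed in camera coordinates; the function name is ours.

```python
import numpy as np

def project_PQ(points_cam, f, cx, cy):
    """Project 3D points in camera coordinates to 2D screen coordinates using
    the ideal pinhole intrinsics of Eq. (2): u = f*x/z + cx, v = f*y/z + cy."""
    points_cam = np.atleast_2d(np.asarray(points_cam, dtype=np.float64))  # N x 3
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = f * x / z + cx
    v = f * y / z + cy
    return np.stack([u, v], axis=1)  # N x 2 screen positions
```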

We solve for all the unknowns (Q, w_id, w_exp,i, M_i) using the same approach described in [1]. The expression blendshapes {B_i} for the user can then be computed as

B_i = C_r \times_2 w_id \times_3 (\bar{U}_exp d_i),   0 ≤ i < 47,    (3)

where \bar{U}_exp is the truncated transform matrix for the expression mode in FaceWarehouse, and d_i is an expression weight vector whose i-th element is 1 and whose other elements are 0.
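The mode contractions in Eqs. (1) and (3) can be written compactly with einsum. The sketch below assumes the core tensor C_r is stored as a (3V × identity × expression) array and that the truncated expression basis has 47 columns; this storage layout and the names are our assumptions, not the FaceWarehouse file format.

```python
import numpy as np

def build_blendshapes(Cr, w_id, U_exp_trunc):
    """Contract the core tensor with the recovered identity weights and each
    expression indicator to obtain the 47 user-specific blendshapes (Eq. (3))."""
    # Cr: (3V, n_id, n_exp) core tensor; w_id: (n_id,); U_exp_trunc: (n_exp, 47)
    blendshapes = []
    for i in range(U_exp_trunc.shape[1]):
        d_i = np.zeros(U_exp_trunc.shape[1])
        d_i[i] = 1.0                                   # indicator for the i-th expression
        w_exp = U_exp_trunc @ d_i                      # expression weights for blendshape i
        # mode-2 contraction with w_id, mode-3 contraction with w_exp
        B_i = np.einsum('vie,i,e->v', Cr, w_id, w_exp)
        blendshapes.append(B_i.reshape(-1, 3))         # V x 3 vertex positions
    return blendshapes
```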

2.1.3. Motion parameter recovery

With the user-specific blendshapes and the camera's intrinsic matrix, we can recover the 3D face meshes for all input images. For each image, we minimize the error between the labeled 2D landmark positions and the projections of the 3D landmark vertices, and solve for the facial motion parameters:

\arg\min_{M, a} \sum_{l=1}^{75} \| P_Q( M ( B_0 + \sum_{i=1}^{46} a_i B_i )^{(v_l)} ) - q^{(l)} \|^2,    (4)

where a = {a_1, a_2, ..., a_46} are the expression coefficients and M is the rigid transformation matrix. Note that the setup images consist of pre-defined standard expressions. To make the blendshape expression coefficients consistent with the pre-defined standard blendshape coefficients a*, we also add a regularization term to the above energy: ||a - a*||^2.

We further represent the rigid transformation matrix M as a 4D rotation quaternion vector R and a 3D translation vector T. Finally, for each input image I_i, we concatenate its motion parameters to form a 46 + 4 + 3 = 53-dimensional vector: x_i^g = (a_i^g, R_i^g, T_i^g).
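Packing and unpacking the 53-dimensional regression target is pure bookkeeping; a sketch is given below, with the component ordering (a, R, T) taken from the text and everything else our own convention.

```python
import numpy as np

def pack_motion(a, R, T):
    """Concatenate 46 expression coefficients, a 4D rotation quaternion and a
    3D translation into one 53D motion parameter vector x = (a, R, T)."""
    return np.concatenate([np.asarray(a, float).ravel(),   # 46 blendshape coefficients
                           np.asarray(R, float).ravel(),   # 4D quaternion
                           np.asarray(T, float).ravel()])  # 3D translation

def unpack_motion(x):
    """Split a 53D motion parameter vector back into (a, R, T)."""
    x = np.asarray(x, float)
    return x[:46], x[46:50], x[50:53]
```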

2.2. Training set construction

For each image and its computed motion parameter vector (I_i, x_i^g), we construct a set of augmented parameter vectors to serve as initial guesses in the regression process. Specifically, for each parameter vector x_i^g = (a_i^g, R_i^g, T_i^g), the augmented parameter vectors are computed as follows:

• Translation: (a_i^g, R_i^g, T_i^g + D_ij), where the translation vectors D_ij, 1 ≤ j ≤ 27, form a 3 × 3 × 3 jittered grid.
• Rotation: (a_i^g, R_{i'}^g, T_i^g + D_ij), where {R_{i'}^g, i' ≠ i} cover the rotations of all other input images. For each R_{i'}^g, several random translations D_ij, 1 ≤ j ≤ 32, are applied.
• All other expressions: (a_{i'}^g, R_i^g, T_i^g + D_ij), where {a_{i'}^g, i' ≠ i} cover the blendshape coefficients of all other input images. For each a_{i'}^g, several random translations D_ij, 1 ≤ j ≤ 10, are applied.

We denote each augmented parameter vector as x_ij^c; combining it with its image and ground truth yields a training tuple (I_i, x_i^g, x_ij^c). For simplicity of description, we denote all these training data as {(I_i, x_i, x_i^c)}.
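The augmentation above can be sketched as follows, assuming each ground-truth entry is already a triple of numpy arrays (a, R, T). The jitter magnitude and the uniform distribution of the random translations are placeholders; the paper does not specify them.

```python
import numpy as np

def jittered_grid(step):
    """27 translation offsets on a 3 x 3 x 3 grid centered at the origin."""
    axis = np.array([-step, 0.0, step])
    return np.array([[dx, dy, dz] for dx in axis for dy in axis for dz in axis])

def augment_training_set(ground_truth, step, rng, n_rot=32, n_exp=10):
    """Build (image index, ground truth x_i^g, initial guess x_ij^c) tuples by
    jittering translations and borrowing other images' rotations/expressions."""
    tuples = []
    for i, (a_i, R_i, T_i) in enumerate(ground_truth):
        # 1. Translation: jitter T_i on a 3 x 3 x 3 grid (27 samples).
        for D in jittered_grid(step):
            tuples.append((i, (a_i, R_i, T_i), (a_i, R_i, T_i + D)))
        for j, (a_j, R_j, _T_j) in enumerate(ground_truth):
            if j == i:
                continue
            # 2. Rotation: borrow R_j from every other image, plus random translations.
            for _ in range(n_rot):
                D = rng.uniform(-step, step, size=3)
                tuples.append((i, (a_i, R_i, T_i), (a_i, R_j, T_i + D)))
            # 3. Expression: borrow a_j from every other image, plus random translations.
            for _ in range(n_exp):
                D = rng.uniform(-step, step, size=3)
                tuples.append((i, (a_i, R_i, T_i), (a_j, R_i, T_i + D)))
    return tuples
```

With 60 setup images and the counts above, this produces 60 × (27 + 59 × (32 + 10)) = 150,300 training tuples.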

2.3. Regressor training

With these N training data {(I_i, x_i, x_i^c)}, we train a motion parameter regression function from x_i^c to x_i based on the pixel intensities of image I_i. We use the two-level boosted regression algorithm of [1], as described in Algorithm 1. In the first level, we reconstruct the 3D landmarks from the current motion parameters x_i^c and sample pixels of image I_i to construct the appearance vector. In the second level, we build a sequence of weak regressors based on this appearance vector, and update the current motion parameters x_i^c by minimizing the error between x_i and x_i^c.

2.3.1. Appearance vector generation

We randomly generate P points around the face mesh and use them to sample the images. Each point is represented as an index-offset pair (k_p, d_p), where k_p is the landmark index and d_p is the offset. In our experiment, we randomly choose a landmark k_p and determine a random offset as in [1].


Table 1. The scaling factors for the different components of the motion parameters, where σ is the standard deviation of the translations in the training data.

Coefficients a   Rotation R   Translation T
4                8            1/σ


For each training datum (I_i, x_i, x_i^c), we first reconstruct the 3D landmark positions at the rest pose (i.e., without rotation and translation), S_i^c = {s_ik^c}, by

s_ik^c = ( B_0 + \sum_{j=1}^{46} a_ij^c B_j )^{(v_k)}.    (5)

Based on the positions of these landmarks at the rest pose, we generate the P points at the rest pose according to the index-offset pairs {(k_p, d_p)}. We then transform these 3D points via the transformation in x_i^c, and project the transformed points to image space via the mapping function P_Q:

u_p = P_Q( R_i^c ( s_{i k_p}^c + d_p ) + T_i^c ).    (6)

The image intensity values at {u_p} constitute the appearance vector V_i = App(I_i, x_i^c, {(k_p, d_p)}). For each appearance vector V_i, we generate P^2 index-pair features by computing the differences between each pair of elements in V_i.
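A sketch of App(I, x^c, {(k_p, d_p)}) from Eqs. (5) and (6) is given below. It reuses the hypothetical unpack_motion and project_PQ helpers from the earlier sketches, uses nearest-pixel sampling, and omits boundary handling, so it is an outline rather than the actual implementation.

```python
import numpy as np

def quat_to_matrix(q):
    """Rotation matrix from a quaternion q = (w, x, y, z)."""
    w, x, y, z = np.asarray(q, float) / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]])

def appearance_vector(image, x_c, pairs, blendshapes, landmark_vertices, f, cx, cy):
    """Sample image intensities at the P points attached to the landmarks
    reconstructed from the current motion parameters x_c (Eqs. (5) and (6))."""
    a, R_quat, T = unpack_motion(x_c)                      # helpers from earlier sketches
    B0, B = blendshapes[0], blendshapes[1:]
    rest = B0 + sum(a_j * B_j for a_j, B_j in zip(a, B))   # rest-pose mesh (Eq. (5))
    R = quat_to_matrix(R_quat)
    values = []
    for k_p, d_p in pairs:
        p_cam = R @ (rest[landmark_vertices[k_p]] + np.asarray(d_p)) + T   # Eq. (6)
        u, v = project_PQ(p_cam, f, cx, cy)[0]
        values.append(float(image[int(round(v)), int(round(u))]))          # nearest pixel
    return np.array(values)
```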

In the second level, based on the appearance vector V_i obtained in the first level, we construct a set of weak regressors in an additive manner, which minimize the error between the ground truth parameters x_i and the current parameters x_i^c. To train these weak regressors, we need to efficiently choose features from the P^2 index-pair features.

2.3.2. Feature selection

We choose the feature that has the highest correlation with the regression target. To select effective features, we first generate a random 53-dimensional direction Y. For each training datum, we calculate the difference vector dx_i = x_i - x_i^c. We then project this difference vector onto the random direction Y to produce a scalar, and choose the index-pair feature with the highest correlation with this scalar.

In the parameter vector x = (a, R, T), the three components live in different spaces and have different units, so we cannot treat them in the same way. We need to carefully choose weights that reflect their influence on the features. According to our experiments, the scaling values listed in Table 1 generate satisfactory results.
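The selection step can be sketched as below, with the component scaling of Table 1 applied to the residuals before the random projection. The translation scale is left at 1 here and should be replaced by 1/σ; the brute-force P^2 loop and the names are ours.

```python
import numpy as np

SCALE = np.concatenate([np.full(46, 4.0),   # expression coefficients a
                        np.full(4, 8.0),    # rotation quaternion R
                        np.full(3, 1.0)])   # translation T: use 1/sigma in practice

def select_feature(appearances, targets, currents, rng):
    """Pick the pixel-difference (index-pair) feature most correlated with the
    projection of the scaled regression residuals onto a random 53D direction."""
    Y = rng.standard_normal(53)
    c = ((targets - currents) * SCALE) @ Y            # one scalar per training sample
    P = appearances.shape[1]
    best_pair, best_corr = (0, 1), -np.inf
    for p in range(P):
        for q in range(P):
            if p == q:
                continue
            feat = appearances[:, p] - appearances[:, q]   # pixel-difference feature
            if feat.std() < 1e-12:
                continue
            corr = np.corrcoef(feat, c)[0, 1]
            if corr > best_corr:
                best_pair, best_corr = (p, q), corr
    return best_pair
```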

Algorithm 1. Facial motion regression training

Input: N training data (I_i, x_i, x_i^c)
Output: Two-level boosted regressor
/* level one */
for t = 1 to T do
    {(k_p, d_p)} ← randomly generate P offsets
    for i = 1 to N do
        V_i ← App(I_i, x_i^c, {(k_p, d_p)})
        Compute the P^2 feature values for V_i
    end for
    /* level two */
    for k = 1 to K do
        for f = 1 to F do
            Y_f ← randomly generate a direction
            for i = 1 to N do
                dx_i ← x_i - x_i^c
                dx_i' ← scale dx_i
                c_i ← dx_i' · Y_f
            end for
            Find the index-pair with the highest correlation with {c_i}, and randomly choose a threshold
        end for
        for i = 1 to N do
            Calculate the features in V_i using the F index-pairs
            Compare the features to the thresholds and determine which bin the datum belongs to
        end for
        for b = 1 to 2^F do
            Compute dx_b according to Eq. (7)
            for each training datum l in X_b do
                x_l^c ← x_l^c + dx_b
            end for
        end for
    end for
end for

2.3.3. Fern construction

We repeat the selection F times to obtain F different index-pair features. For each index-pair feature, we assign a random threshold value between the minimum and maximum differences of element pairs in the appearance vectors. These features and their thresholds are used to build a primitive regressor called a fern. In each fern, we calculate the F index-pair features for each training datum, compare the feature values with the threshold values, and determine which bin in the index-pair space it belongs to. In each bin b, for the training data falling into it, which we denote X_b, we find an additional offset that minimizes the error between the ground truth parameters and the current parameters. Following [1], this additional offset can be calculated as

dx_b = \frac{1}{1 + \beta / |X_b|} \cdot \frac{\sum_{i \in X_b} (x_i - x_i^c)}{|X_b|},    (7)

where |X_b| is the number of training samples falling into bin b, and β is a free shrinkage parameter that helps to avoid overfitting when the number of training samples in a bin is too small. We take dx_b as the regression output for the bin, and for all the data in the bin, we update their current parameters by x^c = x^c + dx_b.
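For illustration, one fern's bin outputs and the corresponding parameter update of Eq. (7) could look as follows; the array layout and helper names are our assumptions.

```python
import numpy as np

def fern_bin_outputs(bin_ids, targets, currents, F, beta=150.0, dim=53):
    """Average the residuals x_i - x_i^c in each of the 2^F bins and shrink the
    average by 1 / (1 + beta / |X_b|), as in Eq. (7)."""
    outputs = np.zeros((2 ** F, dim))
    residuals = targets - currents                     # N x 53
    for b in range(2 ** F):
        members = np.flatnonzero(bin_ids == b)         # training samples falling into bin b
        if members.size == 0:
            continue
        outputs[b] = residuals[members].mean(axis=0) / (1.0 + beta / members.size)
    return outputs

def apply_fern(bin_ids, currents, outputs):
    """Update every sample's current parameters with its bin's output dx_b."""
    return currents + outputs[bin_ids]
```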

As described in [1], we run T stages in the first level and K stages in the second level, iteratively building the primitive regressors, calculating the regression outputs and updating the current motion parameters of each training datum. In our experiment, we choose T = 7, P = 300, K = 200, F = 5 and β = 150.

Fig. 4. Histogram normalization. Left: before normalization, most pixels are located in the dark region of the histogram and the face appears very dark. Right: after normalization, the histogram is distributed more uniformly and the face region becomes brighter.

Algorithm 2. Runtime regression

Input: Previous frame's motion parameters x^0, current frame's image I
Output: Current frame's motion parameters x
S^0 ← RC(x^0)
Get the parameter vector x^s in {x_i^g} most similar to x^0
R^b ← R^s (R^0)^{-1}
S^{0*} ← R^b S^0
{x_l^*} ← get the L parameter vectors most similar to S^{0*} in the training data
for l = 1 to L do
    M_l^* ← merge(R^s, T_l^*)
    M^b ← M^0 (M_l^*)^{-1}
    x_l ← M^b x_l^*
    for t = 1 to T do
        V_l ← App(I, x_l, {(k_p, d_p)})
        for k = 1 to K do
            Get the F index-pairs recorded during training
            Calculate the F appearance feature values for V_l
            Use the values to locate its bin b in the fern
            Get dx_b for bin b
            x_l ← x_l + dx_b
        end for
    end for
end for
x ← the median of {x_l, 1 ≤ l ≤ L}


2.4. Runtime regression

With the facial motion regressor trained in the last section, we can compute the facial motion parameters for an input video frame I in real time. We start from x^0, the regression result of the previous frame, and find motion parameters similar to x^0 in the training set to serve as the initial parameters for the regression.

2.4.1. Initial parameters

Similar to the training process, we first reconstruct the 3D landmark vector S^0 = {s_k^0} using the previous frame's parameter vector x^0 via

s_k^0 = R^0 ( B_0 + \sum_{j=1}^{46} a_j^0 B_j )^{(v_k)}.    (8)

Fig. 3. Captured images under different lighting conditions. From left to right: in the office, outdoors with direct sunshine, and in a hotel room with dim light.

Notice that we omit the translation T^0 here; we denote this reconstruction operator as S = RC(x). For the ground truth motion parameters of each input image x_i^g, we reconstruct the 3D landmark positions via S_i^g = RC(x_i^g). We then find the landmark vector S^s in {S_i^g} that is most similar to S^0, and take the corresponding parameters as the most similar parameters x^s. To compare two landmark vectors, we directly compute the Euclidean distance between them. We then transform S^0 to the pose of x^s via the rotation R^b = R^s (R^0)^{-1}. Finally, we compare the rotated landmarks R^b S^0 with the reconstructed landmarks {S_i^c = RC(x_i^c)} of all augmented parameters in the training data, find the L most similar ones, and take the related parameters as {x_l^*}.

Given that we rotate the landmarks according to R^b when finding the similar landmarks and omit all translations when choosing similar parameters, we need to transform each similar parameter vector x_l^* to the pose of the previous frame's parameters x^0. We first combine the rotation R^s in x^s with the translation in x_l^*, merging them into a new transformation M_l^* = merge(R^s, T_l^*). The transformation from x_l to the x^0 pose is then M^b = M^0 (M_l^*)^{-1}, where M^0 is the transformation in x^0. Applying M^b to x_l^*, we finally obtain the L initial parameter vectors: x_l = M^b x_l^*.
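The nearest-neighbor part of this initialization can be sketched as below, reusing helpers from the earlier sketches. The final re-posing step M^b = M^0 (M_l^*)^{-1} is deliberately omitted, so this should be read as an outline of the similarity search only.

```python
import numpy as np

def reconstruct_RC(x, blendshapes, landmark_vertices):
    """RC(x): rotated, untranslated 3D landmark positions for parameters x (cf. Eq. (8))."""
    a, R_quat, _T = unpack_motion(x)
    B0, B = blendshapes[0], blendshapes[1:]
    rest = B0 + sum(a_j * B_j for a_j, B_j in zip(a, B))           # rest-pose mesh
    return (quat_to_matrix(R_quat) @ rest[landmark_vertices].T).T  # 75 x 3 landmarks

def similar_parameters(x_prev, ground_truth, augmented, blendshapes,
                       landmark_vertices, L=5):
    """Find the most similar ground-truth pose x^s, rotate the previous frame's
    landmarks into that pose, and return the L closest augmented parameter vectors."""
    S0 = reconstruct_RC(x_prev, blendshapes, landmark_vertices)
    S_g = [reconstruct_RC(x, blendshapes, landmark_vertices) for x in ground_truth]
    s = int(np.argmin([np.linalg.norm(S0 - S) for S in S_g]))      # index of x^s
    R0 = quat_to_matrix(unpack_motion(x_prev)[1])
    Rs = quat_to_matrix(unpack_motion(ground_truth[s])[1])
    S0_star = (Rs @ R0.T @ S0.T).T                                 # R^b S^0, R^b = R^s (R^0)^-1
    dists = [np.linalg.norm(S0_star - reconstruct_RC(x, blendshapes, landmark_vertices))
             for x in augmented]
    return [augmented[i] for i in np.argsort(dists)[:L]]
```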

2.4.2. Appearance vector

We take each parameter vector x_l as the initial parameters for the regressor, and update it through the regressor. In each stage of the first-level regression, we construct the appearance vector based on the current motion parameters x_l and the index-offset pairs {(k_p, d_p)}.

2.4.3. Fern passing

In the second level of the regression, we take the appearance vector V_l and pass it through all the ferns in an additive manner. For each fern, we take the recorded F index-pairs and calculate the related features from V_l. These features are compared with the related threshold values to locate the bin b among the 2^F bins. We then take the regression output dx_b associated with b and update the parameters via x_l = x_l + dx_b.

For each initial parameter vector, after passing all the ferns in the regressor, we obtain an updated parameter vector. We finally take the median of all the updated parameter vectors {x_l} as the final result x. The runtime regression algorithm is described in Algorithm 2.

Note that to obtain realistic facial animation, the range of the expression coefficients needs to be restricted to between 0 and 1. To do this, we clamp the coefficients to [0, 1] each time the parameter vector is updated in the regression process. This simple approach generates satisfactory results in our experiments. Interestingly, the results from our motion parameter regressor are already very smooth and temporally coherent. Although it would be possible to apply an animation prior in a post-processing step as in [1], we found this unnecessary in our experiments.

Table 2. The training timing and memory consumption of different regression algorithms. All regressors are trained using the same training data.

Target               2D shape   3D shape   Ours
Dimension            150        225        53
Training (min)       8          10         3.5
Regression (ms)      4.3        6.2        1.8
Fitting (ms)         /          8.2        0.0
Runtime total (ms)   /          14.4       1.8
Memory (MB)          21.2       32.3       9.17

Table 3. Percentages of frames with RMSE less than given thresholds, and the average errors.

                  <2 px    <3 px    <4 px    Avg. Err. (px)
Cao et al. [1]    66.2%    87.1%    97.1%    2.08
Ours              20.1%    65.5%    95.7%    2.93

Fig. 5. Comparison with Cao et al. [1]. Top: our tracking results. Bottom: results from Cao et al. [1].

Table 4. Percentages of frames with RMSE less than given thresholds, and the average errors, using four different methods.

                   <3.5 px   <5.0 px   <6.5 px   <8.0 px   Avg. Err. (px)
Basic              3.0%      17.4%     29.9%     44.3%     12.12
Basic + AD         11.4%     46.1%     69.5%     80.2%     6.18
Basic + HN         22.2%     46.7%     60.5%     77.8%     7.22
Basic + AD + HN    61.6%     88.0%     97.6%     100.0%    3.31

Fig. 6. Comparison of different algorithms for dynamic lighting adaptation. From top to bottom: Basic, Basic + AD, Basic + HN, Basic + AD + HN.

3. Handling lighting changes

Unlike desktop PCs, which are used indoors most of the time, mobile devices may be used in many different environments. For example, if the user is using our system while walking, the lighting and background are changing all the time. Cao et al. [1] can handle lighting changes to a modest extent, but may fail to generate accurate tracking results if the lighting environment changes dramatically.

Fig. 7. Real-time facial animation results for two examples. For each example, from left to right: input video frame, tracked mesh, reconstructed landmarks, driven avatar.

Our first strategy is to include training data from multiple environments when training the regressor. We collect the user's setup images under different environments. In our experiments, as shown in Fig. 3, we include images taken in the office, outdoors with direct sunshine, and in a hotel room with dim light.

Second, mobile cameras often perform white balancing, which changes the overall intensity of the image, making the whole image darker or brighter. Since we compare index-pair features computed from absolute pixel intensities, this global adjustment makes the intensity values inconsistent in range. To handle this problem, we additionally perform a histogram normalization on the appearance vectors, in both training and run-time testing.

An example of histogram normalization is shown in Fig. 4. Due to the bright background, white balancing makes the face appear dark. Histogram normalization adjusts the intensity of the entire face, unifies the global intensity of the image, and makes the tracking more robust.
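The normalization we have in mind can be as simple as plain histogram equalization of the 8-bit intensities, sketched below with numpy. The paper does not pin down the exact variant (or whether it is applied to the whole frame or only to the sampled appearance values), so treat the details as an assumption.

```python
import numpy as np

def histogram_normalize(gray):
    """Spread an 8-bit grayscale image's histogram over the full [0, 255] range,
    so that global brightness shifts (e.g. from auto white balance) change the
    sampled intensity-difference features as little as possible."""
    gray = np.asarray(gray, dtype=np.uint8)
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf_min = cdf[np.nonzero(cdf)[0][0]]                   # first occupied bin
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min + 1e-12) * 255.0)
    return np.clip(lut, 0, 255).astype(np.uint8)[gray]     # remap via lookup table
```

Applying the same mapping to the training images and to every runtime frame keeps the features used for training and testing in a comparable intensity range.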

1 http://gaps-zju.org/publication/2013/mface-divx.avi.

4. Experimental results

We have implemented the system on a PC with an Intel Core i7 (3.5 GHz) CPU and an ordinary web camera recording 640 × 480 images at 30 fps. The runtime algorithm runs at over 200 fps on this device, eight times faster than [1]. We also tested the regression algorithm on a Motorola MT788 cell phone, with an Intel Atom 2.0 GHz CPU and the Android 4.0 operating system. The performance is still very high, at about 30 fps (Fig. 1). Fig. 7 shows additional results produced with our system. Please see the accompanying video¹ for animation demos.

Unlike [13,1], which regress face shapes, we directly regress the facial motion parameters, which are represented as a 53-dimensional vector. The dimension of our regression target is thus much lower than the regression targets of [13] (75 × 2D) and [1] (75 × 3D). The efficiency of our algorithm comes from several aspects: (1) the reduced dimension of the regression target greatly improves the regression efficiency, in both training and runtime testing; (2) given that we have the rotation R and translation T in each iteration, we do not need to compute the transformation via singular value decomposition (SVD); and (3) we do not need to fit facial parameters after the regression, which in [1] involves SVD and a BFGS solver. In Table 2, we compare the training and testing timings and the memory consumption for the different regression targets.

To compare the accuracy of the different algorithms, we take a set of video sequences as input, and use the trained regressors to compute the landmark positions (or motion parameters), respectively. In Table 3, we measure the errors (in pixels) between the screen projections of the landmarks obtained by the two regression algorithms and the ground truth positions, which are manually labeled for each frame.
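For reference, the per-frame error used in Tables 3 and 4 can be computed as sketched below, assuming the projected and ground-truth landmarks are given as 75 × 2 pixel arrays; whether the averaging is done exactly this way is our assumption.

```python
import numpy as np

def landmark_rmse(projected, ground_truth):
    """Root-mean-square distance (in pixels) between projected landmarks and the
    manually labeled ground truth for one frame."""
    diff = np.asarray(projected, float) - np.asarray(ground_truth, float)  # 75 x 2
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def threshold_stats(rmse_per_frame, thresholds=(2.0, 3.0, 4.0)):
    """Fraction of frames with RMSE below each threshold, plus the average error."""
    rmse = np.asarray(rmse_per_frame, float)
    return {t: float((rmse < t).mean()) for t in thresholds}, float(rmse.mean())
```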



Fig. 5 shows comparisons of our algorithm with Cao et al. [1]. It can be seen that our algorithm achieves comparable tracking results. We note that the landmark positions reconstructed from our motion parameters are not as accurate as those generated by Cao et al.'s direct shape regression, and some facial features in our results do not match very well. In practice, however, we found that the animations generated from our motion parameters are visually plausible and resemble the user's facial motion very well. Please see the accompanying video for animation results.

To evaluate the effects of our two strategies for handling lighting changes, we trained four different regressors: the basic regressor trained with images captured under one lighting environment (Basic), the regressor trained with images captured under three lighting environments (Basic + AD), the regressor trained with images captured under one lighting environment using histogram normalization (Basic + HN), and the regressor trained with images captured under three lighting environments using histogram normalization (Basic + AD + HN). We tested the four regressors on a video recorded under a new lighting environment that was not used in the training process. From Table 4 and Fig. 6, it can be seen that Basic + AD + HN produces the best results in our experiment.

5. Conclusion

We have introduced a novel facial motion regression algorithm and shown that it can be used to generate accurate facial animations in real time, even on mobile devices. We also described two strategies to improve the robustness of our algorithm to lighting changes. The whole system provides a practical solution for real-time facial animation on mobile devices.

Acknowledgments

This work was supported by the NSF of China (Nos. 61003145 and 61272305) and the National High-Tech R&D Program of China (No. 2012AA011503).

References

[1] C. Cao, Y. Weng, S. Lin, K. Zhou, 3D shape regression for real-time facial animation, ACM Trans. Graph. 32 (4) (2013) 41:1–41:10.
[2] T. Weise, S. Bouaziz, H. Li, M. Pauly, Realtime performance-based facial animation, ACM Trans. Graph. 30 (4) (2011) 77:1–77:10.
[3] L. Williams, Performance-driven facial animation, in: Proceedings of SIGGRAPH, 1990, pp. 235–242.
[4] H. Huang, J. Chai, X. Tong, H.-T. Wu, Leveraging motion capture and 3D scanning for high-fidelity facial performance acquisition, ACM Trans. Graph. 30 (4) (2011) 74:1–74:10.
[5] L. Zhang, N. Snavely, B. Curless, S.M. Seitz, Spacetime faces: high resolution capture for modeling and animation, ACM Trans. Graph. 23 (3) (2004) 548–558.
[6] T. Weise, H. Li, L.V. Gool, M. Pauly, Face/Off: live facial puppetry, in: Eurographics/SIGGRAPH Symposium on Computer Animation, 2009.
[7] D. Bradley, W. Heidrich, T. Popa, A. Sheffer, High resolution passive facial performance capture, ACM Trans. Graph. 29 (4) (2010) 41:1–41:10.
[8] T. Beeler, B. Bickel, P. Beardsley, R. Sumner, M. Gross, High-quality single-shot capture of facial geometry, ACM Trans. Graph. 29 (4) (2010) 40:1–40:9.
[9] T. Cootes, G. Edwards, C. Taylor, Active appearance models, in: ECCV'98, 1998, pp. 484–498.
[10] I. Matthews, S. Baker, Active appearance models revisited, Int. J. Comput. Vis. 60 (2) (2004) 135–164.
[11] P. Dollár, P. Welinder, P. Perona, Cascaded pose regression, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 1078–1085.
[12] P. Sauer, T. Cootes, C. Taylor, Accurate regression procedures for active appearance models, in: British Machine Vision Conference, 2011.
[13] X. Cao, Y. Wei, F. Wen, J. Sun, Face alignment by explicit shape regression, in: Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2887–2894.
[14] C. Cao, Y. Weng, S. Zhou, Y. Tong, K. Zhou, FaceWarehouse: A 3D Facial Expression Database for Visual Computing, Tech. Rep., 2012.