
Learning to Regress 3D Face Shape and Expression from an Image without 3D Supervision

Soubhik Sanyal  Timo Bolkart  Haiwen Feng  Michael J. Black
Perceiving Systems Department

Max Planck Institute for Intelligent Systems
{soubhik.sanyal, timo.bolkart, haiwen.feng, black}@tuebingen.mpg.de

Figure 1: Without 3D supervision, RingNet learns a mapping from the pixels of a single image to the 3D facial parameters of the FLAME model [21]. Top: Images are from the CelebA dataset [22]. Bottom: estimated shape, pose, and expression.

Abstract

The estimation of 3D face shape from a single image must be robust to variations in lighting, head pose, expression, facial hair, makeup, and occlusions. Robustness requires a large training set of in-the-wild images, which, by construction, lack ground-truth 3D shape. To train a network without any 2D-to-3D supervision, we present RingNet, which learns to compute 3D face shape from a single image. Our key observation is that an individual's face shape is constant across images, regardless of expression, pose, lighting, etc. RingNet leverages multiple images of a person and automatically detected 2D face features. It uses a novel loss that encourages the face shape to be similar when the identity is the same and different for different people. We achieve invariance to expression by representing the face using the FLAME model. Once trained, our method takes a single image and outputs the parameters of FLAME, which can be readily animated. Additionally, we create a new database of faces "not quite in-the-wild" (NoW) with 3D head scans and high-resolution images of the subjects in a wide variety of conditions. We evaluate publicly available methods and find that RingNet is more accurate than methods that use 3D supervision. The dataset, model, and results are available for research purposes at http://ringnet.is.tuebingen.mpg.de.

1. Introduction

Our goal is to estimate 3D head and face shape from a single image of a person. In contrast to previous methods, we are interested in more than just a tightly cropped region around the face. Instead, we estimate the full 3D face, head, and neck. Such a representation is necessary for applications in VR/AR, virtual glasses try-on, animation, biometrics, etc. Furthermore, we seek a representation that captures the 3D facial expression, factors face shape from expression, and can be reposed and animated. While there have been numerous methods proposed in the computer vision literature to address the problem of facial shape estimation [40], no previous methods address all of our goals.

Specifically, we train a neural network that regresses from image pixels directly to the parameters of a 3D face model. Here we use FLAME [21] because it is more accurate than other models, captures a wide range of shapes, models the whole head and neck, can be easily animated, and is freely available. Training a network to solve this problem, however, is challenging because there is little paired data of 3D heads/faces together with natural images of people. For robustness to imaging conditions, pose, facial hair, camera noise, lighting, etc., we wish to train from a large corpus of in-the-wild images. Such images, by definition, lack controlled ground-truth 3D data.

This is a generic problem in computer vision – finding 2D training data is easy, but learning to regress 3D from 2D is hard when paired 3D training data is very limited and difficult to acquire. Without ground-truth 3D, there are several options but each has problems. Synthetic training data typically does not capture real-world complexity. One can fit a 3D model to 2D image features, but this mapping is ambiguous and, consequently, inaccurate. Because of the ambiguity, training a neural network using only a loss between observed 2D and projected 3D features does not lead to good results (cf. [17]).

To address the lack of training data, we propose a new method that learns the mapping from pixels to 3D shape without any supervised 2D-to-3D training data. To do so, we learn the mapping using only 2D facial features, automatically extracted with OpenPose [29]. To make this possible, our key observation is that multiple images of the same person provide strong constraints on 3D face shape because the shape remains constant although other things may change, such as pose, lighting, and expression. FLAME factors pose and shape, allowing our model to learn what is constant (shape) and factor out what changes (pose and expression).

While it is a fact that face shape is constant for an individual across images, we need to define a training approach that lets a neural network exploit this shape constancy. To that end, we introduce RingNet. RingNet takes multiple images of a person and enforces that the shape should be similar between all pairs of images, while minimizing the 2D error between observed features and projected 3D features. While this encourages the network to encode the shapes similarly, we find this is not sufficient. We also add to the "ring" a face belonging to a different, random person and enforce that the distance in the latent space between this image and all other images in the ring is larger than the distance between images of the same person. Similar ideas have been used in manifold learning (e.g. the triplet loss) [37] and face recognition [26], but, to our knowledge, our approach has not previously been used to learn a mapping from 2D to 3D geometry. We find that going beyond a triplet to a larger ring is critical in learning accurate geometry.

While we train with multiple images of a person, note that, at run time, we only need a single image. With this formulation, we are able to train a network to regress the parameters of FLAME directly from image pixels. Because we train with "in the wild" images, the network is robust across a wide range of conditions, as illustrated in Fig. 1. The approach is more general, however, and could be applied to other 2D-to-3D learning problems.

Figure 2: The NoW dataset includes a variety of images taken in different conditions (top) and high-resolution 3D head scans (bottom). The dark blue region is the part we consider for the face challenge.

Evaluating the accuracy of 3D face estimation methods remains a challenge and, despite the many methods that have been published, there are no rigorous comparisons of 3D accuracy across a wide range of imaging conditions, poses, lighting, and occlusion. To address this, we collected a new dataset called NoW (Not quite in-the-Wild), with high-resolution ground-truth scans and high-quality images of 100 subjects taken in a range of conditions (Fig. 2). NoW is more complex than previous datasets and we use it to evaluate all recent methods with publicly available implementations. Specifically, we compare with [34], [35], and [9], which are trained with 3D supervision. Despite not having any 2D-to-3D supervision, our RingNet method recovers more accurate 3D face shape. We also evaluate the method qualitatively on challenging in-the-wild face images.

In summary, the main contributions of our paper are: (1) full face, head, and neck reconstruction from a single face image; (2) RingNet, an end-to-end trainable network that enforces shape consistency across face images of the same subject with varying viewing angle, lighting conditions, resolution, and occlusion; (3) a novel shape consistency loss for learning 3D geometry from 2D input; (4) NoW, a benchmark dataset for qualitative and quantitative evaluation of 3D face reconstruction methods; and (5) the model, training code, and new dataset, which we make freely available for research purposes to encourage quantitative comparison [25].

2. Related work

There are several approaches to the problem of 3D face estimation from images. One approach estimates depth maps, normals, etc.; that is, these methods produce a representation of object shape tied to pixels but specialized for faces. The other approach estimates a 3D shape model that can be animated. We focus on methods in the latter category. In a recent review paper, Zollhöfer et al. [40] describe the state of the art in monocular face reconstruction and provide a forward-looking set of challenges for the field. Note that the boundary between supervised, weakly supervised, and unsupervised methods is a blurry one. Most methods use some form of 3D shape model, which is learned from scans in advance; we do not call this supervision here. Here the term supervised implies that paired 2D-to-3D data is used; this might be real or synthetic data. If a 3D model is first optimized to fit 2D image features, then we say this uses 2D-to-3D supervision. If 2D image features are used but there is no 3D data in training the network, then this is weakly supervised in general and unsupervised relative to the 2D-to-3D task.

Quantitative evaluation: Quantitative comparison between methods has been limited by a lack of common datasets with complex images and high-quality ground truth. Recently, Feng et al. [10] organized a single-image-to-3D face reconstruction challenge for which they provide ground-truth scans of the subjects. Our NoW benchmark is complementary to this effort, as its focus is on extreme viewing angles, facial expressions, and partial occlusions.

Optimization: Most existing methods require tightly cropped input images and/or reconstruct only a tightly cropped region of the face for which existing shape priors are appropriate. Most current shape models are descendants of the original Blanz and Vetter 3D morphable model (3DMM) [3]. While there are many variations of and improvements to this model, such as [13], we use FLAME [21] here because both its shape space and its expression space are trained from more scans than other methods. Only FLAME includes the neck region in the shape space and models the pose-dependent deformations of the neck with head rotation; tightly cropped face regions make the estimation of head rotation ambiguous. Until very recently, optimization-based fitting has been the dominant paradigm [2, 30, 11]. For example, Kemelmacher-Shlizerman and Seitz [18] use multi-image shading to reconstruct from a collection of images, allowing changes in viewpoint and shape. Thies et al. [33] achieve accurate results on monocular video sequences. While these approaches can achieve good results with high realism, they are computationally expensive.

Learning with 3D supervision: Deep learning methods are quickly replacing the optimization-based approaches [35, 39, 19, 16]. For example, Sela et al. [27] use a synthetic dataset to generate an image-to-depth mapping and a pixel-to-vertex mapping, which are combined to generate the face mesh. Tran et al. [34] directly regress the 3DMM parameters of a face model with a dense network. Their key idea is to use multiple images of the same subject and fit a 3DMM to each image using 2D landmarks. They then take a weighted average of the fitted meshes and use it as the ground truth to train their network. Feng et al. [9] regress from the image to a UV position map that records the position information of the 3D face and provides dense correspondence to the semantic meaning of each point in UV space. All the aforementioned methods use some form of 3D supervision, such as synthetic rendering, optimization-based fitting of a 3DMM, or a 3DMM used to generate UV maps or volumetric representations. None of the fitting-based methods produce true ground truth for real-world face images, while synthetically generated faces may not generalize well to the real world [31]. Methods that rely on fitting a 3DMM to images using 2D-to-3D correspondences to create a pseudo ground truth are always limited by the expressiveness of the 3DMM and the accuracy of the fitting process.

Learning with weak 3D supervision: Sengupta et al. [28] learn to mimic a Lambertian rendering process by using a mixture of synthetically rendered images and real images. They work with tightly cropped faces and do not produce a model that can be animated. Genova et al. [12] propose an end-to-end learning approach using a differentiable rendering process. They also train their encoder using synthetic data and its corresponding 3D parameters. Tran and Liu [36] learn a nonlinear 3DMM model using an analytically differentiable rendering layer, in a weakly supervised fashion with 3D data.

Learning with no 3D supervision: MoFA [32] estimates the parameters of a 3DMM and is trained end-to-end using a photometric loss and an optional 2D feature loss. It is effectively a neural network version of the original Blanz and Vetter model in that it models shape, skin reflectance, and illumination to produce a realistic image that is matched to the input. The advantage of this is that the approach is significantly faster than optimization methods [31]. MoFA estimates a tight crop of the face and produces good-looking results but has trouble with extreme expressions. The authors only perform quantitative evaluation on real images using the FaceWarehouse model as the "ground truth"; this is not an accurate representation of true 3D face shape.

The methods that learn without any 2D-to-3D supervision all explicitly model the image formation process (like Blanz and Vetter) and formulate a photometric loss, and typically also incorporate 2D face feature detections with known correspondence to the 3D model. The problem with the photometric loss is that the model of image formation is always approximate (e.g. Lambertian). Ideally, one would like a network to learn not just about face shape but about the complexity of real-world images and how they relate to shape. To that end, our RingNet approach uses only the 2D face features and no photometric term. Despite (or because of) this, the method is able to learn a mapping from pixels directly to 3D face shape. This is the least supervised of published methods.

3. Proposed method

The goal of our method is to estimate 3D head and face shape from a single face image I. Given an image, we assume the face is detected, loosely cropped, and approximately centered. During training, our method leverages 2D landmarks and identity labels as input. During inference it uses only image pixels; 2D landmarks and identity labels are not used.

Figure 3: RingNet takes multiple images of the same person (Subject A) and an image of a different person (Subject B) during training, and enforces shape consistency between the same subjects and shape inconsistency between different subjects. The 3D landmarks computed from the predicted 3D mesh are projected into the 2D domain to compute a loss against the ground-truth 2D landmarks. During inference, RingNet takes a single image as input and predicts the corresponding 3D mesh. Images are taken from [6]. The figure is a simplified version for illustration purposes.

Figure 4: Ring element that outputs a 3D mesh for an image.

Key idea: The key idea can be summarized as follows: 1) The face shape of a person remains unchanged, even though an image of the face may vary in viewing angle, lighting condition, resolution, occlusion, expression, or other factors. 2) Every person has a unique face shape (not considering identical twins).

We leverage this idea by introducing a shape consistency loss, embodied in our ring-structured network. RingNet (Fig. 3) is a multiple encoder-decoder based architecture, with weight sharing between the encoders and shape constraints on the shape variables. Each encoder in the ring is a combination of a feature extractor network and a regressor network. Imposing shape constraints on the shape variables forces the network to disentangle facial shape, expression, head pose, and camera parameters. We use FLAME [21] as a decoder to reconstruct 3D faces from the semantically meaningful embedding, and to obtain a decoupling within the embedding space into semantically meaningful parameters (i.e. shape, expression, and pose parameters).

We introduce the FLAME decoder, the RingNet architecture, and the losses in more detail in the following.

3.1. FLAME model

FLAME uses linear transformations to describe identity- and expression-dependent shape variations, and standard linear blend skinning (LBS) to model neck, jaw, and eyeball rotations around $K = 4$ joints. Parametrized by coefficients for shape $\vec{\beta} \in \mathbb{R}^{|\vec{\beta}|}$, pose $\vec{\theta} \in \mathbb{R}^{3K+3}$, and expression $\vec{\psi} \in \mathbb{R}^{|\vec{\psi}|}$, FLAME returns $N = 5023$ vertices. FLAME models identity-dependent shape variations $B_S(\vec{\beta}; \mathcal{S}): \mathbb{R}^{|\vec{\beta}|} \rightarrow \mathbb{R}^{3N}$, corrective pose blendshapes $B_P(\vec{\theta}; \mathcal{P}): \mathbb{R}^{3K+3} \rightarrow \mathbb{R}^{3N}$, and expression blendshapes $B_E(\vec{\psi}; \mathcal{E}): \mathbb{R}^{|\vec{\psi}|} \rightarrow \mathbb{R}^{3N}$ as linear transformations with learned bases $\mathcal{S}$, $\mathcal{E}$, and $\mathcal{P}$. Given a template $\mathbf{T} \in \mathbb{R}^{3N}$ in the "zero pose", the identity, pose, and expression blendshapes are modeled as vertex offsets from $\mathbf{T}$.

Each of the pose vectors $\vec{\theta} \in \mathbb{R}^{3K+3}$ contains $(K+1)$ rotation vectors in axis-angle representation, i.e. one vector per joint plus the global rotation. The blend skinning function $W(\mathbf{T}, \mathbf{J}, \vec{\theta}, \mathcal{W})$ then rotates the vertices around the joints $\mathbf{J} \in \mathbb{R}^{3K}$, linearly smoothed by blendweights $\mathcal{W} \in \mathbb{R}^{K \times N}$. More formally, FLAME is given as

$$M(\vec{\beta}, \vec{\theta}, \vec{\psi}) = W(T_P(\vec{\beta}, \vec{\theta}, \vec{\psi}), \mathbf{J}(\vec{\beta}), \vec{\theta}, \mathcal{W}), \quad (1)$$

with

$$T_P(\vec{\beta}, \vec{\theta}, \vec{\psi}) = \mathbf{T} + B_S(\vec{\beta}; \mathcal{S}) + B_P(\vec{\theta}; \mathcal{P}) + B_E(\vec{\psi}; \mathcal{E}). \quad (2)$$

The joints $\mathbf{J}$ are defined as a function of $\vec{\beta}$ since different face shapes require different joint locations. We use Equation 1 for decoding our embedding space to generate a 3D mesh of a complete head and face.
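To make Equations 1 and 2 concrete, here is a minimal NumPy sketch, not the FLAME implementation itself: the basis matrices S, P, E are random placeholders, the pose correctives are simplified to a linear map of the pose vector (the actual model uses a more elaborate pose feature), and the skinning function is a stand-in for the LBS step.

```python
import numpy as np

def flame_decode_sketch(beta, theta, psi, T, S, P, E, skin):
    # Sketch of Eqs. (1)-(2). S, P, E are placeholder linear bases and
    # `skin` stands in for the blend-skinning function W(., J(beta), theta, W).
    N = T.shape[0]                       # number of vertices (5023 in FLAME)
    B_S = (S @ beta).reshape(N, 3)       # identity blendshapes B_S(beta; S)
    B_P = (P @ theta).reshape(N, 3)      # pose correctives (simplified input)
    B_E = (E @ psi).reshape(N, 3)        # expression blendshapes B_E(psi; E)
    T_P = T + B_S + B_P + B_E            # Eq. (2): offsets from the template
    return skin(T_P, theta)              # Eq. (1): rotate vertices around joints

# Toy usage with random placeholder bases (shapes chosen for illustration only).
N, n_shape, n_exp, K = 5023, 100, 50, 4
T = np.zeros((N, 3))
S = 1e-3 * np.random.randn(3 * N, n_shape)
P = 1e-3 * np.random.randn(3 * N, 3 * K + 3)
E = 1e-3 * np.random.randn(3 * N, n_exp)
no_skinning = lambda verts, theta: verts          # stand-in for LBS
verts = flame_decode_sketch(np.zeros(n_shape), np.zeros(3 * K + 3),
                            np.zeros(n_exp), T, S, P, E, no_skinning)
```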

3.2. RingNet

The recent advances in face recognition (e.g. [38]) and facial landmark detection (e.g. [4, 29]) have led to large image datasets with identity labels and 2D face landmarks. For training, we assume a corpus of 2D face images $I_i$, corresponding identity labels $c_i$, and landmarks $k_i$.

The shape consistency assumption can be formalized by $\vec{\beta}_i = \vec{\beta}_j \; \forall c_i = c_j$ (i.e. the face shape of one subject should remain the same across multiple images) and $\vec{\beta}_i \neq \vec{\beta}_j \; \forall c_i \neq c_j$ (i.e. the face shapes of different subjects should be distinct). RingNet introduces a ring-shaped architecture that jointly optimizes for shape consistency for an arbitrary number of input images in parallel. For details regarding shape consistency, see Section 3.3.

RingNet is divided into $R$ ring elements $\{e_i\}_{i=1}^{R}$ as shown in Figure 3, where each $e_i$ consists of an encoder and a decoder network (see Figure 4). The encoders share weights across the $e_i$; the decoder weights remain fixed during training. The encoder is a combination of a feature extractor network $f_{feat}$ and a regression network $f_{reg}$. Given an image $I_i$, $f_{feat}$ outputs a high-dimensional vector, which is then encoded by $f_{reg}$ into a semantically meaningful vector (i.e., $f_{enc}(I_i) = f_{reg}(f_{feat}(I_i))$). This vector can be expressed as a concatenation of the camera, pose, shape, and expression parameters, i.e., $f_{enc}(I_i) = [cam_i, \vec{\theta}_i, \vec{\beta}_i, \vec{\psi}_i]$, where $\vec{\theta}_i, \vec{\beta}_i, \vec{\psi}_i$ are FLAME parameters.

For simplicity we omit $I$ in the following and use $f_{enc}(I_i) = f_{enc,i}$ and $f_{feat}(I_i) = f_{feat,i}$. The regression network regresses $f_{enc,i}$ in an iterative error feedback loop [17, 7], instead of directly regressing $f_{enc,i}$ from $f_{feat,i}$. In each iteration step, progressive shifts from the previous estimate are made to reach the current estimate. Formally, the regression network takes the concatenation $[f_{feat,i}^{t}, f_{enc,i}^{t}]$ as input and outputs $\delta f_{enc,i}^{t}$. We then update the current estimate by

$$f_{enc,i}^{t+1} = f_{enc,i}^{t} + \delta f_{enc,i}^{t}. \quad (3)$$

This iterative network performs multiple regression iterations per iteration of the entire RingNet training. The initial estimate is set to $\vec{0}$. The output of the regression network is fed to the differentiable FLAME decoder network, which outputs the 3D head mesh.
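A minimal sketch of the iterative error feedback of Equation 3, assuming a 2048-dimensional image feature and a 159-dimensional parameter vector; `regressor` is a placeholder for the learned two-layer MLP, and the toy linear map below exists only to make the example runnable.

```python
import numpy as np

def iterative_regression(f_feat, regressor, n_iters=3, param_dim=159):
    # Eq. (3): start from a zero estimate and repeatedly add the predicted
    # correction, feeding the current estimate back in with the image feature.
    f_enc = np.zeros(param_dim)
    for _ in range(n_iters):
        delta = regressor(np.concatenate([f_feat, f_enc]))
        f_enc = f_enc + delta            # f_enc^{t+1} = f_enc^t + delta f_enc^t
    return f_enc

# Toy usage with a random linear map standing in for the learned regressor.
rng = np.random.default_rng(0)
W = 1e-3 * rng.standard_normal((159, 2048 + 159))
params = iterative_regression(rng.standard_normal(2048), lambda x: W @ x)
```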

The number of ring elements $R$ is a hyperparameter of our network, which determines the number of images processed in parallel with optimized consistency on the $\vec{\beta}$. RingNet allows the use of any combination of images of the same subject and images of different subjects in parallel. However, without loss of generality, we feed face images of the same identity to $\{e_j\}_{j=1}^{R-1}$ and a different identity to $e_R$. Hence for each input training batch, each slice consists of $R-1$ images of the same person and one image of another person (see Fig. 3).
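A small sketch of how a ring could be assembled under the convention above (R-1 images of one identity plus one image of a different identity); `images_by_identity` is a hypothetical mapping from identity label to image paths, not a data structure from the paper's code.

```python
import random

def make_ring(images_by_identity, R=6):
    # Pick one identity for the first R-1 slots and a different identity for
    # the last slot e_R, matching the convention described above.
    same_id, other_id = random.sample(list(images_by_identity), 2)
    ring = random.choices(images_by_identity[same_id], k=R - 1)  # with replacement
    ring.append(random.choice(images_by_identity[other_id]))
    return ring
```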

3.3. Shape consistency loss

For simplicity, let us call two subjects who have the same identity label "matched pairs" and two subjects who have different identity labels "unmatched pairs". A key goal of our work is to make a robust end-to-end trainable network that produces the same shapes from images of the same subject and different shapes for different subjects. In other words, we want to make our shape generators discriminative. We enforce this by requiring matched pairs to have a distance in shape space that is smaller, by a margin $\eta$, than the distance for unmatched pairs. Distance is computed in the space of face shape parameters, which corresponds to a Euclidean space of vertices in the neutral pose.

In the RingNet structure, $e_j$ and $e_k$ produce $\vec{\beta}_j$ and $\vec{\beta}_k$, which are matched pairs when $j \neq k$ and $j, k \neq R$. Similarly, $e_j$ and $e_R$ produce $\vec{\beta}_j$ and $\vec{\beta}_R$, which are unmatched pairs when $j \neq R$. Our shape constancy term is then

$$\left\| \vec{\beta}_j - \vec{\beta}_k \right\|_2^2 + \eta \leq \left\| \vec{\beta}_j - \vec{\beta}_R \right\|_2^2. \quad (4)$$

Thus we minimize the following loss while training RingNet end-to-end:

$$L_S = \sum_{i=1}^{n_b} \sum_{j,k=1}^{R-1} \max\left(0, \left\| \vec{\beta}_{ij} - \vec{\beta}_{ik} \right\|_2^2 - \left\| \vec{\beta}_{ij} - \vec{\beta}_{iR} \right\|_2^2 + \eta\right), \quad (5)$$

which is normalized to

$$L_{SC} = \frac{1}{n_b \times R} L_S, \quad (6)$$

where $n_b$ is the batch size for each element in the ring.
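A minimal NumPy sketch of Equations 5 and 6, assuming the convention that the last ring element holds the different identity; it is written with explicit loops for clarity rather than as the batched TensorFlow loss used in practice.

```python
import numpy as np

def shape_consistency_loss(betas, eta=0.5):
    # betas: (n_b, R, |beta|) array; for each batch slice the first R-1 shape
    # vectors belong to the same person and the last one to a different person.
    n_b, R, _ = betas.shape
    loss = 0.0
    for i in range(n_b):
        for j in range(R - 1):
            for k in range(R - 1):
                d_same = np.sum((betas[i, j] - betas[i, k]) ** 2)
                d_diff = np.sum((betas[i, j] - betas[i, R - 1]) ** 2)
                loss += max(0.0, d_same - d_diff + eta)   # hinge term of Eq. (5)
    return loss / (n_b * R)                               # normalization of Eq. (6)
```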

3.4. 2D feature loss

Finally, we compute the $L_1$ loss between the ground-truth landmarks provided during the training procedure and the predicted landmarks. Note that we do not directly predict 2D landmarks, but 3D meshes with known topology, from which the landmarks are retrieved.

Given the FLAME template mesh, we define for each OpenPose [29] keypoint the corresponding 3D point in the mesh surface. Note that this is the only place where we provide supervision that connects 2D and 3D, and it is done only once. While the mouth, nose, eye, and eyebrow keypoints have a fixed corresponding 3D point (referred to as static 3D landmarks), the position of the contour features changes with head pose (referred to as dynamic 3D landmarks). Similar to [5, 31], we model the contour landmarks as dynamically moving with the global head rotation (see Sup. Mat.). To automatically compute this dynamic contour, we rotate the FLAME template between -20 and 40 degrees to the left and right, render the mesh with texture, run OpenPose to predict 2D landmarks, and project these 2D points to the 3D surface. The resulting trajectories are symmetrically transferred between the left and right side of the face.

During training, RingNet outputs 3D meshes, computes the static and dynamic 3D landmarks for these meshes, and projects them into the image plane using the camera parameters predicted in the encoder output. We then compute the following $L_1$ loss between the projected landmarks $kp_i$ and the ground-truth 2D landmarks $k_i$:

$$L_{proj} = \left\| w_i \times (kp_i - k_i) \right\|_1, \quad (7)$$

where $w_i$ is the confidence score of each ground-truth landmark, provided by the 2D landmark predictor. We set it to 1 if the confidence is above 0.41 and to 0 otherwise. The total loss $L_{tot}$, which trains RingNet end-to-end, is

$$L_{tot} = \lambda_{SC} L_{SC} + \lambda_{proj} L_{proj} + \lambda_{\vec{\beta}} \left\| \vec{\beta} \right\|_2^2 + \lambda_{\vec{\psi}} \left\| \vec{\psi} \right\|_2^2, \quad (8)$$

where the $\lambda$ are the weights of each loss term and the last two terms regularize the shape and expression coefficients. Since $B_S(\vec{\beta}; \mathcal{S})$ and $B_E(\vec{\psi}; \mathcal{E})$ are scaled by the squared variance, the $L_2$ norms of $\vec{\beta}$ and $\vec{\psi}$ represent the Mahalanobis distance in the orthogonal shape and expression spaces.
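A hedged NumPy sketch of Equations 7 and 8: the confidence binarization at 0.41 and the default loss weights follow the values reported in Section 3.5, while the array shapes (68 landmarks with 2D coordinates) are assumptions for illustration.

```python
import numpy as np

def landmark_loss(pred_2d, gt_2d, conf, conf_thresh=0.41):
    # Eq. (7): L1 distance between projected and detected 2D landmarks,
    # weighted by the binarized detector confidence w_i.
    w = (conf > conf_thresh).astype(np.float64)
    return np.sum(np.abs(w[:, None] * (pred_2d - gt_2d)))

def total_loss(L_SC, L_proj, beta, psi,
               lam_sc=1.0, lam_proj=60.0, lam_beta=1e-4, lam_psi=1e-4):
    # Eq. (8): weighted sum of the shape consistency loss, the projection
    # loss, and L2 regularizers on the shape and expression coefficients.
    return (lam_sc * L_SC + lam_proj * L_proj
            + lam_beta * np.sum(beta ** 2) + lam_psi * np.sum(psi ** 2))
```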

3.5. Implementation details

The feature extractor network uses a pretrained ResNet-50 [15] architecture, which is also optimized during training. The feature extractor outputs a 2048-dimensional vector that serves as input to the regression network. The regression network consists of two fully-connected layers of dimension 512 with ReLU activation and dropout, followed by a final linear fully-connected layer with a 159-dimensional output. This 159-dimensional output vector is the concatenation of the camera, pose, shape, and expression parameters. The first three elements represent scale and 2D image translation. The following 6 elements are the global rotation and jaw rotation, each in axis-angle representation. The neck and eyeball rotations of FLAME are not regressed since the facial landmarks do not impose any constraints on the neck. The next 100 elements are the shape parameters, followed by the 50 expression parameters of FLAME. The differentiable FLAME layer is kept fixed during training. We train RingNet for 10 epochs with a constant learning rate of 1e-4 and use Adam [20] for optimization. The model parameters are $R = 6$, $\lambda_{SC} = 1$, $\lambda_{proj} = 60$, $\lambda_{\vec{\beta}} = \lambda_{\vec{\psi}} = 10^{-4}$, and $\eta = 0.5$. The RingNet architecture is implemented in TensorFlow [1] and will be made publicly available. We use the VGGFace2 database [6] as our training dataset, which consists of face images and their corresponding identity labels. We run OpenPose [29] on the database and compute 68 landmark points on each face. OpenPose fails for many cases; after removing the failure cases we have around 800K images with their corresponding labels and facial landmarks for our training corpus. We also consider around 3000 extreme-pose images with corresponding landmarks provided by [4]. Since these extreme images do not have identity labels, we replicate each image with random crops and scales for matched-pair consideration.
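A hedged Keras sketch of the encoder described above (ResNet-50 features followed by the two 512-d layers and the 159-d linear output); the 224x224 input resolution and the 0.5 dropout rate are assumptions, and the iterative error feedback loop of Section 3.2 is omitted for brevity.

```python
import tensorflow as tf

def build_encoder_sketch():
    # ResNet-50 backbone with global average pooling -> 2048-d feature vector.
    backbone = tf.keras.applications.ResNet50(
        include_top=False, pooling='avg', input_shape=(224, 224, 3))
    # Two 512-d fully-connected layers with ReLU and dropout, then a linear
    # layer producing the 159-d [camera(3), pose(6), shape(100), expr(50)] vector.
    regressor = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dropout(0.5),        # dropout rate is an assumption
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(159, activation=None),
    ])
    image = tf.keras.Input(shape=(224, 224, 3))
    return tf.keras.Model(image, regressor(backbone(image)))
```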

4. Benchmark dataset and evaluation metric

This section introduces our NoW benchmark for the task of 3D face reconstruction from single monocular images. The goal of this benchmark is to introduce a standard evaluation metric to measure the accuracy and robustness of 3D face reconstruction methods under variations in viewing angle, lighting, and common occlusions.

Dataset: The dataset contains 2054 2D images of 100 subjects, captured with an iPhone X, and a separate 3D head scan for each subject. This head scan serves as ground truth for the evaluation. The subjects are selected to contain variations in age, BMI, and sex (55 female, 45 male).

We categorize the captured data into four challenges: neutral (620 images), expression (675 images), occlusion (528 images), and selfie (231 images). Neutral, expression, and occlusion contain neutral, expressive, and partially occluded face images of all subjects in multiple views, ranging from frontal to profile view. Expression contains different acted facial expressions such as happiness, sadness, surprise, disgust, and fear. Occlusion contains images with varying occlusions, e.g. from glasses, sunglasses, facial hair, hats, or hoods. For the selfie category, participants are asked to take selfies with the iPhone, without imposing constraints on the performed facial expression. The images are captured indoors and outdoors to provide variations of natural and artificial light.

The challenge for all categories is to reconstruct a neutral 3D face given a single monocular image. Note that facial expressions are present in several images, which requires methods to disentangle identity and expression in order to evaluate the quality of the predicted identity.

Capture setup: For each subject we capture a raw head scan in neutral expression with an active stereo system (3dMD LLC, Atlanta). The multi-camera system consists of six gray-scale stereo camera pairs, six color cameras, five speckle-pattern projectors, and six white LED panels. The reconstructed 3D geometry contains about 120K vertices for each subject. Each subject wears a hair cap during scanning to avoid occlusions and scanner noise in the face or neck region due to hair.

Data processing: Most existing 3D face reconstruction methods require a localization of the face. To mitigate the influence of this pre-processing step, we provide for each image a bounding box that covers the face. To obtain bounding boxes for all images, we first run a face detector on all images [38] and then predict keypoints for each detected face [4]. We manually select 2D landmarks for failure cases. We then expand the bounding box of the landmarks on each side by 5% (bottom), 10% (left and right), and 30% (top) to obtain a box covering the entire face including the forehead. For the face challenge, we follow a processing protocol similar to [10]. For each scan, the face center is selected, and the scan is cropped by removing everything outside of a specified radius. The selected radius is subject-specific, computed as $0.7 \times (\text{outer eye dist} + \text{nose dist})$ (see Figure 2).
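A small sketch of the bounding-box expansion described above; the function name and the assumption that image y-coordinates grow downward are ours, and only the 5%/10%/30% margins come from the text.

```python
def expand_face_bbox(x_min, y_min, x_max, y_max):
    # Widen the landmark bounding box by 10% left/right, 30% at the top, and
    # 5% at the bottom so the crop covers the whole face including the forehead.
    w, h = x_max - x_min, y_max - y_min
    return (x_min - 0.10 * w, y_min - 0.30 * h,
            x_max + 0.10 * w, y_max + 0.05 * h)
```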

Evaluation metric: Given a single monocular image, the challenge consists of reconstructing a 3D face. Since the predicted meshes occur in different local coordinate systems, the reconstructed 3D mesh is rigidly aligned (rotation, translation, and scaling) to the scan using a set of corresponding landmarks between the prediction and the scan. We further refine this rigid alignment based on the scan-to-mesh distance (the absolute distance between each scan vertex and the closest point in the mesh surface) between the ground-truth scan and the reconstructed mesh, using the landmark alignment as initialization. The error for each image is then computed as the scan-to-mesh distance between the ground-truth scan and the reconstructed mesh. Different errors are then reported, including cumulative error plots over all distances, median distance, average distance, and standard deviation.
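A rough sketch of the error computation, assuming the rigid alignment has already been applied; it approximates the closest point on the mesh surface by the closest mesh vertex, whereas the benchmark uses the true point-to-surface distance.

```python
import numpy as np
from scipy.spatial import cKDTree

def scan_to_mesh_error(scan_points, mesh_vertices):
    # For every ground-truth scan point, distance to the nearest vertex of the
    # (already aligned) predicted mesh; report median, mean, and std as in the paper.
    dists, _ = cKDTree(mesh_vertices).query(scan_points)
    return np.median(dists), np.mean(dists), np.std(dists)
```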

How to participate: To participate in the challenge, we provide a website [25] to download the test images and to upload the reconstruction results and selected landmarks for each registration. The error metrics are then automatically computed and returned. Note that we do not provide the ground-truth scans, to prevent fine-tuning on the test data.

5. Experiments

We evaluate RingNet qualitatively and quantitatively and compare our results with publicly available methods, namely PRNet (ECCV 2018 [9]), Extreme3D (CVPR 2018 [35]), and 3DMM-CNN (CVPR 2017 [34]).

Quantitative evaluation: We compare methods on [10] and on our NoW dataset.

Feng et al. benchmark: Feng et al. [10] describe a benchmark dataset for evaluating 3D face reconstruction from single images. They provide a test dataset that contains facial images and their 3D ground-truth face scans, corresponding to a subset of the Stirling/ESRC 3D face database. The test dataset contains 2000 2D neutral face images, including 656 high-quality (HQ) and 1344 low-quality (LQ) images. The high-quality images are taken in controlled scenarios and the low-quality images are extracted from video frames. The data focuses on neutral faces whereas our data has higher variety in expression, occlusion, and lighting, as explained in Section 4.

Recall that the methods we compare with (PRNet, Extreme3D, 3DMM-CNN) use 3D supervision for training, whereas our approach does not. PRNet [9] requires a very tightly cropped face region to give good results and performs poorly when given the loosely cropped input images that come with the benchmark database (see Sup. Mat.). Rather than trying to crop the images for PRNet, we run it on the given images and note when it succeeds: it outputs meshes for 918 of the low-quality test images and for 509 of the high-quality images. To be able to compare with PRNet, we run all the other methods only on the 1427 images for which PRNet succeeds.

We compute the error using the method in [10], which computes the distance from ground-truth scan points to the estimated mesh surface. Figure 5 (left and middle) shows the cumulative error curves for the different approaches for the low-quality and high-quality images, respectively; RingNet outperforms the other methods. Table 1 reports the mean, standard deviation, and median errors.

                     Median (mm)     Mean (mm)      Std (mm)
Method               LQ     HQ       LQ     HQ      LQ     HQ
PRNet [9]            1.79   1.60     2.38   2.06    2.19   1.79
Extreme3D [35]       2.40   2.37     3.49   3.58    6.15   6.75
3DMM-CNN [34]        1.88   1.85     2.32   2.29    1.89   1.88
Ours                 1.63   1.58     2.08   2.02    1.79   1.69

Table 1: Statistics on the Feng et al. [10] benchmark.

Method                Median (mm)   Mean (mm)   Std (mm)
PRNet [9]             1.51          1.99        1.90
3DMM-CNN [34]         1.83          2.33        2.05
FLAME-neutral [21]    1.24          1.57        1.34
Ours                  1.23          1.55        1.32

Table 2: Statistics for the NoW dataset face challenge.

R    Median (mm)   Mean (mm)   Std (mm)
3    1.25          1.68        1.51
4    1.24          1.67        1.50
5    1.20          1.63        1.48
6    1.19          1.63        1.48

Table 3: Effect of varying the number of ring elements R, evaluated on the validation set described in the ablation study.

NoW face challenge: For this challenge we use cropped scans, as in [10], to evaluate the different methods. We first perform a rigid alignment of the predicted meshes to the scans for all the compared methods. Then we compute the scan-to-mesh distance [10] between the predicted meshes and the scans as above. Figure 5 (right) shows the cumulative error curves for the different methods; again RingNet outperforms the others. We provide the mean, median, and standard deviation errors in Table 2.

Qualitative results: Here we show qualitative results of estimating a 3D face/head mesh from a single face image on the CelebA [22] and MultiPIE [14] datasets. Figure 1 shows a few results for RingNet, illustrating its robustness to expression, gender, head pose, hair, occlusions, etc. We show the robustness of our approach under different conditions like lighting, pose, and occlusion in Figures 6 and 7. Qualitative comparisons are provided in the Sup. Mat.

Ablation study: Here we provide some motivation for the choice of a ring architecture in RingNet by comparing different values of R in Table 3. We evaluate on a validation set that contains 2D images and 3D scans of 10 subjects (six subjects from [8], four from [21]). For each subject we choose one neutral scan and two to four scanner images, reconstruct the 3D meshes for the images, and measure the scan-to-mesh reconstruction error after rigid alignment. The error decreases when using a ring structure with more elements over using a single triplet loss only, but it also increases training time. As a trade-off between time and error, we chose R = 6 in our experiments.


Figure 5: Cumulative error curves. Left to right: LQ data of [10]. HQ data of [10]. NoW dataset face challenge.

Figure 6: Robustness of RingNet to varying lighting conditions. Images from the MultiPIE dataset [14].

Figure 7: Robustness of RingNet to occlusions, variations in pose, and lighting. Images from the NoW dataset.

6. Conclusion

We have addressed the challenging problem of learning to estimate a 3D, articulated, and deformable shape from a single 2D image with no paired 3D training data. We have applied our RingNet model to faces, but the formulation is general. The key idea is to exploit a ring of pairwise losses that encourage the solution to share the same shape for images of the same person and a different shape when the people differ. We exploit the FLAME face model to factor face pose and expression from shape so that RingNet can constrain the shape while letting the other parameters vary. Our method requires a dataset in which some of the people appear multiple times, as well as 2D facial features, which can be estimated by existing methods. We provide only the relationship between the standard 2D face features and the vertices of the 3D FLAME model. Unlike previous methods, we do not optimize a 3DMM to 2D features, nor do we use synthetic data. Competing methods typically exploit a photometric loss using an approximate generative model of facial albedo, reflectance, and shading. RingNet does not need this to learn the relationship between image pixels and 3D shape. In addition, our formulation captures the full head and its pose. Finally, we have created a new public dataset with accurate ground-truth 3D head shape and high-quality images taken in a wide range of conditions. Surprisingly, RingNet outperforms methods that use 3D supervision. This opens many directions for future research, for example extending RingNet with [24]. Here we focused on the case with no 3D supervision, but we could relax this and use supervision when it is available. We expect that a small amount of supervision would increase accuracy while the large dataset of in-the-wild images provides robustness to illumination, occlusion, etc. Our 2D feature detector does not include the ears, though these are highly distinctive features; adding 2D ear detections would further improve the 3D head pose and shape. While our model stops with the neck, we plan to extend it to the full body [23]. It would be interesting to see whether RingNet can be extended to reconstruct 3D body pose and shape from images solely using 2D joints. This could go beyond current methods, like HMR [17], to learn about body shape. While RingNet learns a mapping to an existing 3D model of the face, we could relax this and also optimize over the low-dimensional shape space, enabling us to learn a more detailed shape model from examples. For this, incorporating shading cues [32, 28] would help constrain the problem.

Acknowledgement: We thank T. Alexiadis for building the NoW dataset, J. Tesch for rendering results, D. Lleshaj for annotations, A. Osman for the supplementary video, and S. Tang for useful discussions.

Disclosure: Michael J. Black has received research gift funds from Intel, Nvidia, Adobe, Facebook, and Amazon. He is a part-time employee of Amazon and has financial interests in Amazon and Meshcapade GmbH. His research was performed solely at, and funded solely by, MPI.


References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] A. Bas, W. A. Smith, T. Bolkart, and S. Wuhrer. Fitting a 3D morphable model to edges: A comparison between hard and soft correspondences. In ACCV, pages 377–391, 2016.
[3] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194, 1999.
[4] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In ICCV, 2017.
[5] C. Cao, Q. Hou, and K. Zhou. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Trans. Graph., 33(4):43:1–43:10, July 2014.
[6] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face and Gesture Recognition, 2018.
[7] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, pages 4733–4742, 2016.
[8] H. Dai, N. Pears, W. A. P. Smith, and C. Duncan. A 3D morphable model of craniofacial shape and texture variation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[9] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou. Joint 3D face reconstruction and dense alignment with position map regression network. In ECCV, 2018.
[10] Z. H. Feng, P. Huber, J. Kittler, P. Hancock, X. J. Wu, Q. Zhao, and M. Rätsch. Evaluation of dense 3D reconstruction from 2D face images in the wild. In FG, pages 780–786, 2018.
[11] P. Garrido, M. Zollhöfer, D. Casas, L. Valgaerts, K. Varanasi, P. Pérez, and C. Theobalt. Reconstruction of personalized 3D face rigs from monocular video. ACM Transactions on Graphics, 35(3), 2016.
[12] K. Genova, F. Cole, A. Maschinot, A. Sarna, D. Vlasic, and W. T. Freeman. Unsupervised training for 3D morphable model regression. In CVPR, pages 8377–8386, 2018.
[13] T. Gerig, A. Morel-Forster, C. Blumer, B. Egger, M. Lüthi, S. Schönborn, and T. Vetter. Morphable face models – an open framework. In FG, pages 75–82, 2018.
[14] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645, 2016.
[16] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos. Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In ICCV, 2017.
[17] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
[18] I. Kemelmacher-Shlizerman and S. M. Seitz. Face reconstruction in the wild. In ICCV, pages 1746–1753, 2011.
[19] H. Kim, M. Zollhöfer, A. Tewari, J. Thies, C. Richardt, and C. Theobalt. InverseFaceNet: Deep monocular inverse face rendering. In CVPR, pages 4625–4634, 2018.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[21] T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics, 36(6):194, 2017.
[22] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision (ICCV), Dec. 2015.
[23] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2019.
[24] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3D faces using convolutional mesh autoencoders. In European Conference on Computer Vision (ECCV), Lecture Notes in Computer Science, vol. 11207, pages 725–741. Springer, Cham, Sept. 2018.
[25] RingNet project page. http://ringnet.is.tuebingen.mpg.de, 2019.
[26] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
[27] M. Sela, E. Richardson, and R. Kimmel. Unrestricted facial geometry reconstruction using image-to-image translation. In ICCV, pages 1585–1594, 2017.
[28] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. Jacobs. SfSNet: Learning shape, reflectance and illuminance of faces in the wild. In CVPR, 2018.
[29] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
[30] S. Suwajanakorn, I. Kemelmacher-Shlizerman, and S. M. Seitz. Total moving face reconstruction. In ECCV, pages 796–812, 2014.
[31] A. Tewari, M. Zollhöfer, P. Garrido, F. Bernard, H. Kim, P. Pérez, and C. Theobalt. Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In CVPR, 2018.
[32] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV, pages 1274–1283, 2017.
[33] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In CVPR, pages 2387–2395, 2016.
[34] A. T. Tran, T. Hassner, I. Masi, and G. Medioni. Regressing robust and discriminative 3D morphable models with a very deep neural network. In CVPR, pages 1493–1502, 2017.
[35] A. T. Tran, T. Hassner, I. Masi, E. Paz, Y. Nirkin, and G. Medioni. Extreme 3D face reconstruction: Seeing through occlusions. In CVPR, pages 3935–3944, 2018.
[36] L. Tran and X. Liu. Nonlinear 3D face morphable model. In CVPR, 2018.
[37] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, pages 1473–1480, 2006.
[38] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3FD: Single shot scale-invariant face detector. In ICCV, 2017.
[39] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In CVPR, pages 146–155, 2016.
[40] M. Zollhöfer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt. State of the art on monocular 3D face reconstruction, tracking, and applications. Computer Graphics Forum, 37(2):523–550, 2018.

Appendix

In the following, we show the cumulative error plots for the individual challenges of the NoW dataset: neutral (Figure 8), expression (Figure 9), occlusion (Figure 10), and selfie (Figure 11). The right of Figure 5 shows the cumulative error across all challenges.

Figure 8: Cumulative error curves for the neutral challenge.

Figure 9: Cumulative error curves for the expression challenge.

Figure 10: Cumulative error curves for the occlusion challenge.

Figure 11: Cumulative error curves for the selfie challenge.
