Detailed Human Avatars from Monocular Video

Thiemo Alldieck 1,2   Marcus Magnor 1   Weipeng Xu 2   Christian Theobalt 2   Gerard Pons-Moll 2

1 Computer Graphics Lab, TU Braunschweig, Germany   2 Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
{alldieck,magnor}@cg.cs.tu-bs.de   {wxu,theobalt,gpons}@mpi-inf.mpg.de

Figure 1: Our method creates a detailed avatar from a monocular video of a person turning around. Based on the SMPL model, we first compute a medium-level avatar, then add subject-specific details and finally generate a seamless texture.

Abstract

We present a novel method for high detail-preserving human avatar creation from monocular video. A parameterized body model is refined and optimized to maximally resemble subjects from a video showing them from all sides. Our avatars feature a natural face, hairstyle, clothes with garment wrinkles, and high-resolution texture. Our paper contributes facial landmark and shading-based human body shape refinement, a semantic texture prior, and a novel texture stitching strategy, resulting in the most sophisticated-looking human avatars obtained from a single video to date. Numerous results show the robustness and versatility of our method. A user study illustrates its superiority over the state-of-the-art in terms of identity preservation, level of detail, realism, and overall user preference.

1. Introduction

The automatic generation of personalized 3D human models is needed for many applications, including virtual and augmented reality, entertainment, teleconferencing, virtual try-on, biometrics or surveillance. A personal 3D human model should comprise all the details that make us different from each other, such as hair, clothing, facial details and shape. Failure to faithfully recover all details results in users not feeling identified with their self-avatar.

To address this challenging problem, researchers have used very expensive recording equipment including 3D and 4D scanners [64, 11, 50] or multi-camera studios with controlled lighting [68, 46]. An alternative is to use passive stereo reconstruction [27, 55] with a camera moving around the person, but the person has to maintain a static pose, which is not feasible in practice. Using depth data as input, the field has seen significant progress in reconstructing accurate 3D body models [9, 81, 91], free-form geometry [95, 53, 59, 24], or both jointly [77]. Depth cameras are, however, much less ubiquitous than RGB cameras.

Monocular RGB methods are typically restricted to predicting the parameters of a statistical body model [58, 42, 60, 10, 5, 35]. To the best of our knowledge, the only exception is a recent method [3] that can reconstruct shape, clothing and hair geometry from a monocular video sequence of a person rotating in front of the camera. The basic idea is to fuse the information from frame-wise silhouettes into a canonical pose, and optimize a free-form shape regularized by the SMPL body model [50]. While this is a significant step in 3D human reconstruction from monocular video, the reconstructions are overly smooth, lack facial details and the textures are blurry. This results in avatars that do not fully retain the identity of the real subjects.

In this work, we extend [3] in several important ways to improve the quality of the 3D reconstructions and textures. Specifically, we incorporate information from facial landmark detectors and shape-from-shading, and we introduce a new algorithm to efficiently stitch partial textures coming from frames of the moving person. Since the person is moving, information (projection rays from face landmarks and normal fields from shading cues) cannot be directly fused into a single reconstruction. Hence, we track the person's pose using SMPL [50]; then we apply an inverse pose transformation to frame-wise projection rays and normal fields
to fuse all the evidence in a canonical T-pose; in that space, we optimize a high-resolution shape regularized by SMPL. Precisely, with respect to previous work, our approach differs in four important aspects that allow us to better preserve subject identity and details in the reconstructions:

Facial landmarks: Since the face is a crucial part of the body, we incorporate 2D facial landmark detections into the 3D reconstruction objective. To gain robustness against misdetections, we fuse temporal detections by transforming the landmark projection rays into the joint T-pose space.

Illumination and shape-from-shading: Shading is a strong cue to recover fine details such as wrinkles. Most shape-from-shading approaches focus on adding detail to static objects. Here, we perform shape-from-shading at every frame, obtaining frame-wise partial 3D normal fields that are then fused in T-pose space for final reconstruction.

Efficient texture stitching: Seamless stitching of partial textures from different camera views is particularly hard for moving articulated objects. To prevent blurry textures, one typically assigns the RGB value of one of the views to each texture pixel (texel), while preserving spatial smoothness. Such an assignment problem can be formulated as a multi-label assignment, where the number of possible labels grows with the number of views. Consequently, the computational time and memory become intractable for a large number of labels – we define a novel texture update energy function which can be minimized efficiently with a graph cut for every new incoming view.

Semantic texture stitching: Aside from stitching artifacts, texture spilling is another common problem. For example, texture that corresponds to the clothing often floods into the skin region. To minimize spilling, we add an additional semantic term to the texture update energy. The term penalizes updating a texel with an RGB value that is unlikely under a part-based appearance distribution. This semantic appearance term significantly reduces spilling, and implicitly "connects" texels belonging to the same part.

The result is the most sophisticated method to obtain detailed 3D human shape reconstructions from a single monocular video. Since metric-based evaluations such as scan-to-mesh distances do not reflect perceptual quality, we performed a user study to assess the improvement of our method. The results show that users prefer our avatars over the state of the art 89.64% of the time and consider our reconstructions more detailed 95.72% of the time.

2. Related work

Modeling the human body is a long-standing problem in computer vision. Given a densely distributed multi-camera system, one can make use of multi-view stereo methods [43] for reconstructing the human body [27, 29, 40, 93]. More advanced systems allow reconstruction of body shape under clothing [90, 87, 85], joint shape, pose and clothing reconstruction [63], or capture of body pose and facial expressions [41]. However, such setups are expensive and require complicated calibration.

Hence, monocular 3D reconstruction methods [54, 55] are appealing but require depth images from many viewpoints around a static object, and humans cannot hold a static pose for a long time. Therefore, nonrigid deformation of the human body has to be taken into account. Many methods are based on depth sensors and require the subject to hold the same pose. For example, in [48, 20, 71, 89], the subject alternately takes a certain pose and rotates in front of the sensor. Then, several depth snapshots taken from different view points are fused to generate a complete 3D model. Similarly, [78] proposes to use a turntable to rotate the subject to minimize pose variations. In contrast, the methods of [9, 81, 91] allow a user to move freely in front of the sensor. In recent years, real-time nonrigid depth fusion has been achieved [53, 39, 74]. These methods usually maintain a growing template and consist of two alternating steps, i.e. a registration step, where the current template is aligned to the new frame, and a fusion step, where the observation in the new frame is merged into the template. However, these methods typically suffer from "phantom surface" artifacts during fast motion. In [77], this problem is alleviated by using SMPL to constrain tracking. Model-based monocular methods [10, 23, 32, 5, 35, 66, 65] have recently been integrated with deep learning [58, 42, 60]. However, they are restricted to predicting the parameters of a statistical body model [50, 4, 36, 96, 64]. There are two exceptions that recover clothing and shape from a single image [33, 18], but these methods require manual initialization of pose and clothing parameters. [3] is the first method capable of reconstructing full 3D shape and clothing geometry from a single RGB video. Users can freely rotate in front of the camera while roughly holding an A-pose. Unfortunately, this approach is restricted to recovering only medium-level details. The fine-level details such as garment wrinkles, subtle geometry on the clothes and facial features, which are essential elements for preserving identity information, are missing. Our goal is to recover the missing fine-level details of the geometry and improve the texture quality such that the appearance identity information can be faithfully recovered.

Another branch of work in human body reconstruction is more focused on capturing the dynamic motion of the character. Works either recover articulated skeletal motion [76, 51, 30, 72, 2, 38], or surfaces with deforming clothing, usually called performance capture. In performance capture, many approaches reconstruct a 3D model for each individual frame [75, 47, 19] or fuse a window of frames [59, 24]. However, these methods cannot generate a temporally coherent representation of the model, which is an important characteristic for many applications.
To solve this, methods register a common model to the results of all frames [15], use a volumetric representation for surface tracking [1, 37], or assume a pre-built static template. Again, most of those methods are based on multi-view images [21, 28, 62, 17, 67, 68]. There are attempts at reducing the number of cameras, such as the stereo method [82], the single-view depth-based method [95] and the recent monocular RGB-based method [86]. Note that the result of our method can be used as the initial template for the above-mentioned template-based performance capture methods.

Shape-from-shading is also highly related to our method. A comprehensive survey can be found in [92]. We only discuss the application of shape-from-shading in the context of human body modeling. Geometric details, e.g. folds in non-textured regions, are difficult to capture with silhouette or photometric information. In contrast, shape-from-shading captures such details [82, 83, 34]. There are also approaches for photometric stereo which recover the shape using a controlled light stage setup [79].

Texture generation is an essential task for modeling a realistic virtual character, since a texture image can describe the material properties that cannot be modeled by the surface geometry. The key to a texture generation method is how to combine texture fragments created from different views. Many early works blend the texture fragments using weighted averaging across the entire surface [7, 22, 57, 61]. Others make use of mosaicing strategies, which yield sharper results [6, 45, 56, 69]. [44] is the first to formulate texture stitching as a graph cut problem. Such a formulation has been commonly used in texture generation for multi-view 3D reconstruction. However, without accurately reconstructed 3D geometry and registered images, these methods usually suffer from blurring or ghosting artifacts. To this end, many methods focus on compensating registration errors [25, 8, 80, 26, 94]. In our scenario, the registration misalignment problem is even more severe, due to our challenging monocular nonrigid setting. Therefore, we propose to take advantage of semantic information to better constrain our problem.

3. Method

In this paper, our goal is to create a detailed avatar from an RGB video of a subject rotating in front of the camera. The focus hereby lies on the fine-level details that model a subject's identity and individual appearance. As shown in Fig. 2, our method reconstructs a textured mesh model in a coarse-to-fine manner, which consists of three steps: First, we estimate a rough body shape of the subject, similar to [3], where the medium-level geometry of the clothing and skin is reconstructed. Then we add fine-level geometric details, such as garment wrinkles and facial features, based on shape-from-shading. Finally, we compute a seamless texture to capture the texel-level appearance details.


Figure 2. Our 3-step method: We first estimate a medium-level body shape based on segmentations (a), then we add details using shape-from-shading (b). Finally, we compute a texture using a semantic prior and a novel graph cut optimization strategy (c).

In the following, we first describe our body shape model, and then discuss the details of our three steps.

3.1. Subdivided SMPL body model

Our method is based on the SMPL body model [50]. However, the original SMPL model is too coarse to model fine-level details such as garment wrinkles and fine facial features. To this end, we adapt the model as follows.

The SMPL model is a parameterized human body model described by a function of pose \theta and shape \beta returning N = 6890 vertices and F = 13776 faces. As SMPL only models naked humans, we use the extended formulation from [3] allowing offsets D from the template \mathbf{T}:

M(\beta, \theta, D) = W(T(\beta, \theta, D), J(\beta), \theta, \mathcal{W}) \quad (1)

T(\beta, \theta, D) = \mathbf{T} + B_s(\beta) + B_p(\theta) + D \quad (2)

where W is a linear blend-skinning function applied to a rest pose T(\beta, \theta, D) based on the skeleton joints J(\beta) and after pose-dependent B_p(\theta) and shape-dependent B_s(\beta) deformations. The inverse function M^{-1}(\beta, \theta, D) unposes the model and brings the vertices back into the canonical T-pose. As we aim for fine details and a subject's identity, we further extend the formulation. As shown in Fig. 3, we subdivide every edge of the SMPL model twice. Every new vertex is defined as:

v_{N+e} = 0.5\,(v_i + v_j) + s_e n_e, \quad (i, j) \in E_e \quad (3)

where E defines the pairs of vertices forming an edge and n_e is the average normal between the normals of the vertex pair. s_e \in s defines the displacement in normal direction n_e. n_e is calculated at initialization time in unposed space and can be posed according to W. The new finer model M_f(\beta, \theta, D, s) consists of N = 110210 vertices and F = 220416 faces. To recover the high-resolution smooth surface we calculate an initial set s_0 = \{s_0, \ldots, s_e\} by minimizing

\arg\min_{s} \; \| L M_f \|, \qquad (L M_f)_i = \sum_{j \in \mathcal{N}(i)} w_{ij} (v_i - v_j) \quad (4)

where L is the Laplace matrix with cotangent weights w_{ij} and \mathcal{N}(i) defines the neighbors around v_i.
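As a rough illustration of Eq. 3, the sketch below inserts one displaced mid-edge vertex per edge for a generic triangle mesh; the NumPy-based interface and the array names are our own assumptions, not the paper's released code, and the accompanying re-triangulation of the faces is omitted.

```python
import numpy as np

def subdivide_edges(vertices, vertex_normals, edges, s):
    """Insert one new vertex per edge following Eq. 3 (illustrative sketch).

    vertices:       (N, 3) rest-pose vertex positions
    vertex_normals: (N, 3) unit vertex normals, computed once at initialization
    edges:          (E, 2) vertex index pairs E_e = (i, j)
    s:              (E,)   scalar offsets s_e along the averaged edge normals
    """
    vi = vertices[edges[:, 0]]
    vj = vertices[edges[:, 1]]

    # n_e: average of the two endpoint normals, re-normalized
    n_e = vertex_normals[edges[:, 0]] + vertex_normals[edges[:, 1]]
    n_e /= np.linalg.norm(n_e, axis=1, keepdims=True)

    # Eq. 3: v_{N+e} = 0.5 (v_i + v_j) + s_e n_e
    new_vertices = 0.5 * (vi + vj) + s[:, None] * n_e
    return np.vstack([vertices, new_vertices]), n_e
```

Applying such a subdivision twice to the 6890 SMPL vertices, together with splitting every face into four, yields the fine model M_f with 110210 vertices and 220416 faces described above.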


3.2. Medium-level body shape reconstruction

In recent work, a pipeline to recover a subject's body shape, hair and clothing in the same setup as ours has been presented [3]. They first select a number of key-frames (K ≈ 120) evenly distributed over the sequence and segment them into foreground and background using a CNN [14]. Then they recover the 3D pose for each selected frame based on 2D landmarks [16]. At the core of their method, they transform the silhouette cone of every key-frame back into the canonical T-pose of the SMPL model using the inverse formulation of SMPL. This allows efficient optimization of the body shape independent of pose. We follow their pipeline and optimize for the subject's body shape in unposed space. However, we notice that the face estimation of [3] is not accurate enough. This prevents us from further recovering fine-level facial features in the following steps, since precise face alignment is necessary for that. To this end, we propose a new objective for body shape estimation (dependency on parameters removed for clarity):

\arg\min_{\beta, D} \; E_{\text{silh}} + E_{\text{face}} + E_{\text{regm}} \quad (5)

The silhouette term E_{\text{silh}} measures the distance between boundary vertices and silhouette rays; see [3] for details and the regularization E_{\text{regm}}. The face alignment term E_{\text{face}} penalizes the distance between the 2D facial landmark detections and the 2D projection of 3D facial landmarks. We use OpenPose [73] to detect 2D facial landmarks for every key-frame. In order to incorporate the detections into the method, we establish a static mapping between landmarks and points on the mesh. Every landmark l is mapped to the surface via barycentric interpolation of neighboring vertices. During optimization, we measure the point-to-line distance between the landmark l on the model and the corresponding camera ray r describing the 2D landmark detection in unposed space:

\delta(l, r) = l \times r_n - r_m \quad (6)

where r = (r_m, r_n) is given in Plücker coordinates. The face alignment term finally is:

E_{\text{face}} = \sum_{(l, r) \in L} w_l \, \rho(\delta(l, r)) \quad (7)

where L defines the mapping between mesh points and landmarks, w_l is the confidence of the landmark given by the CNN and \rho is the Geman-McClure robust cost function. To speed up computation time, we use the coarse SMPL model formulation (Eq. 1) for the medium-level shape estimation.
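For illustration, a minimal sketch of the face term, assuming each detection ray has already been unposed and expressed in Plücker coordinates (r_m, r_n); the function and argument names are ours, and the Geman-McClure scale is an arbitrary choice.

```python
import numpy as np

def geman_mcclure(x, sigma=1.0):
    # robust cost rho(x) = x^2 / (x^2 + sigma^2)
    return x ** 2 / (x ** 2 + sigma ** 2)

def face_energy(landmarks_3d, rays_m, rays_n, confidences, sigma=1.0):
    """E_face of Eq. 7 from 3D model landmarks and unposed detection rays.

    landmarks_3d:    (L, 3) landmark positions on the mesh (barycentric interp.)
    rays_m, rays_n:  (L, 3) Plücker moment and direction of each detection ray
    confidences:     (L,)   per-landmark confidence w_l from the detector
    """
    # Eq. 6: residual delta(l, r) = l x r_n - r_m, one 3-vector per landmark
    delta = np.cross(landmarks_3d, rays_n) - rays_m
    dist = np.linalg.norm(delta, axis=1)
    # Eq. 7: confidence-weighted robust sum over all landmark/ray pairs
    return np.sum(confidences * geman_mcclure(dist, sigma))
```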

3.3. Modeling fine-level surface details

In Sec. 3.2, we capture the medium-level details by globally integrating the silhouette information from all key-frames. Now our goal is to obtain fine-level surface details, which cannot be estimated from silhouettes, based on shape-from-shading.


Figure 3. One face of the new SMPL formulation. The displacement field vectors d∗ and the normal displacements s∗n∗ form the subdivided surface.

Note that estimating shape-from-shading globally over all frames would lead to a smooth shape without details, due to fabric movement and misalignments. Thus, we first capture the details for a number of key-frames individually, and then incrementally merge the details into the model as new triangles become visible in a consecutive key-frame. We found that the number of key-frames can be lower than in the first step and choose K = 60. Now we describe how to capture the fine-level details for a single key-frame k based on shape-from-shading. To make this process robust, we estimate shading normals individually in a window around the key-frame and then jointly optimize for the surface.

Shape-from-shading: For each frame, we first decompose the image into reflectance I_r and shading I_s using the CNN-based intrinsic decomposition method of [52]. The function H_c calculates the shading of a vertex with spherical harmonic components c. We estimate the spherical harmonic components c that minimize the difference between the simulated shading and the observed image shading I_s jointly for the given window of frames [84]:

\arg\min_{c} \sum_{i \in V} \left| H_c(n_i) - I_s(P v_i) \right| \quad (8)

where V denotes the subset of visible vertices, i.e. those for which the angle between the normal and the viewing direction satisfies 0 < \alpha \leq \alpha_{\max}, and P is the projection matrix. Having the scene illumination and the shading for every pixel, we can now estimate auxiliary normals N = \{n_0, \ldots, n_N\} for every vertex per frame:

\arg\min_{N} \; E_{\text{grad}} + w_{\text{lapn}} E_{\text{lapn}} \quad (9)

The Laplacian smoothness term E_{\text{lapn}} = L N enforces the normals to be locally smooth. E_{\text{grad}} penalizes shading errors by calculating the difference between the shading gradient between a vertex and its neighbors \mathcal{N}(i) and the image gradient at the projected vertex positions:

E_{\text{grad}} = \sum_{i \in V} \sum_{j \in \mathcal{N}(i) \cap V} \left\| \Delta_{H_c}(n_i, n_j) - \Delta_{I_s}(P v_i, P v_j) \right\|^2 \quad (10)

with \Delta_f(a, b) = f(a) - f(b).
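If H_c is taken as the usual nine-coefficient second-order spherical harmonics model, the illumination fit of Eq. 8 can be approximated by a linear least-squares problem. The following sketch assumes grayscale shading and per-vertex normals as inputs; it replaces the absolute-value objective of Eq. 8 with least squares for brevity and is not the solver used in the paper.

```python
import numpy as np

def sh_basis(n):
    """Nine second-order spherical harmonic basis values for unit normals n (V, 3)."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([np.ones_like(x), x, y, z,
                     x * y, x * z, y * z,
                     x ** 2 - y ** 2, 3 * z ** 2 - 1], axis=1)

def fit_illumination(normals, observed_shading):
    """Estimate SH coefficients c such that H_c(n_i) ~ I_s(P v_i), cf. Eq. 8.

    normals:          (V, 3) unit normals of the visible vertices
    observed_shading: (V,)   shading image I_s sampled at the projected vertices
    """
    B = sh_basis(normals)                       # (V, 9) design matrix
    c, *_ = np.linalg.lstsq(B, observed_shading, rcond=None)
    return c

def shade(normals, c):
    """H_c(n): simulated shading of vertices under the estimated illumination."""
    return sh_basis(normals) @ c
```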

Figure 4. We calculate a semantic segmentation for every key-frame. The semantic labels are mapped into texture space and combined into a semantic texture prior.

Surface reconstruction: In order to merge information about all estimated normals within the window, we
transform the normals back into the canonical T-pose using the inverse pose function of SMPL, M^{-1}. Then we optimize for the surface which explains the merged normals. Further, we include the silhouette term and face term of Sec. 3.2 to enforce the surface to be well aligned to the images. Specifically, we minimize:

\arg\min_{D, s} \; \sum_{j \in C} \left( \lambda_j E_{\text{silh},j} + \lambda_j w_{\text{face}} E_{\text{face},j} \right) + w_{\text{sfs}} E_{\text{sfs}} + E_{\text{regf}} \quad (11)

with weights w_* and \lambda_j = 1 for j = k and \lambda_j < 1 otherwise. E_{\text{silh}} and E_{\text{face}} are evaluated over a number of control frames C, and matches in E_{\text{silh}} are limited to vertices in the original SMPL model. The shape-from-shading term is defined as:

E_{\text{sfs}} = \sum_{f=k-m}^{k+m} \sum_{i \in V} \left\| n_i - n_i^f \right\|^2 \quad (12)

where k is the current key-frame and m specifies the window size, usually m = 1. n_i^f denotes the auxiliary normal of vertex i calculated from frame f. All normals are in T-pose space. E_{\text{regf}} regularizes the optimization as described in the following:

E_{\text{regf}} = w_{\text{match}} E_{\text{match}} + w_{\text{lap}} E_{\text{lap}} + w_{\text{struc}} E_{\text{struc}} + w_{\text{cons}} E_{\text{cons}} \quad (13)

E_{\text{match}} penalizes the discrepancy between two neighboring key-frames. Specifically, for a perfect estimation, the following assumption should hold: When warping a key-frame into a neighboring key-frame based on the warp-field described by the projected vertex displacement, the warped frame and the target frame should be similar. E_{\text{match}} describes this metric: First, we calculate the described warp. Then we calculate warping errors based on optical flow [13]. Based on the sum of the initial warp-field and the calculated error, we establish a grid of correspondences between neighboring key-frames. Every correspondence c should be explained by a particular point of the mesh surface. We first find a candidate for every correspondence:

Figure 5. Based on part textures from key frames (left), we stitch a complete texture using graph-cut based optimization. The associated key frames for each texel are shown as colors on the right.

\arg\min_{i \in V} \; \frac{\cos(\alpha_i^k)\,\delta(v_i^k, r_c^k) + \cos(\alpha_i^j)\,\delta(v_i^j, r_c^j)}{\cos(\alpha_i^k) + \cos(\alpha_i^j)} \quad (14)

where \alpha_i^k is the viewing angle under which the vertex i has been seen in key-frame k, and r_c^k is the projection ray of
correspondence c in posed space of key-frame k. Then we minimize the point-to-line distance in unposed space:

E_{\text{match}} = \sum_{(i, c) \in M} \rho(\delta(v_i, r_c)) \quad (15)

where M is the set of matches established in Eq. 14. The remaining regularization terms of Eq. 13 are as
follows: E_{\text{lap}} is the Laplacian smoothness term with anisotropic weights [84]. E_{\text{struc}} aims to keep the structure of the mesh by penalizing edge length variations. E_{\text{cons}} penalizes large deviations from the consensus shape.

We optimize using a dog-leg trust region method with the chumpy auto-differentiation framework. We alternate between minimizing and finding silhouette point-to-line correspondences. Regularization is reduced step-wise.

3.4. Texture generation

A high-quality texture image is an essential component of a realistic virtual character, since it can describe the material properties that cannot be modeled by the surface geometry. In order to obtain a sharp and seamless texture, we solve the texture stitching at the per-texel level (Fig. 5), in contrast to the per-face level used in other works [44]. In other words, our goal is to color each pixel in the texture image with a pixel value taken from one out of K key-frames. However, this makes the scale of our problem much larger, and therefore does not allow us to perform global optimization. To this end, we propose a novel texture merging method based on graph cut, which translates our problem into a series of binary labeling subproblems that can be efficiently solved. Furthermore, meshes and key-frames are not perfectly aligned. To reduce color spilling and artifacts caused by misalignments, we compute a semantic prior before stitching the final texture (Fig. 4).

Partial texture generation: For every key-frame, we first project all visible surface points to the frame and write the color at the projected position into the corresponding texture coordinates. In order to factor out the illumination in the texture images, we unshade the input images by dividing them by the shading images as used in Sec. 3.3. The partial texture calculation can easily be achieved using the
OpenGL rasterization pipeline. Apart from the partial color texture image, we calculate two additional texture maps for the merging step, i.e. the viewing-angle map and the semantic map. For the viewing-angle map, we compute the viewing angle \alpha_t^k under which the surface point t has been seen in key-frame k.
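A simplified CPU-side sketch of the per-key-frame texture maps follows (the paper uses the OpenGL rasterization pipeline instead). The projection function, the precomputed texel-to-surface mapping and the visibility mask are assumed to be available; all names are hypothetical.

```python
import numpy as np

def partial_texture_maps(points, normals, visible, uv, image, shading,
                         project, view_dir, tex_size=1000):
    """Write unshaded colors and viewing angles into texture maps for one key-frame.

    points:   (T, 3) surface point behind each texel
    normals:  (T, 3) unit surface normals at these points
    visible:  (T,)   boolean visibility of each texel in this key-frame
    uv:       (T, 2) integer texel coordinates in [0, tex_size)
    image, shading: input frame and its shading layer, (H, W, 3) and (H, W)
    project:  function mapping a 3D point to integer pixel coordinates (px, py)
    view_dir: (3,)   unit viewing direction of the camera
    """
    color_map = np.zeros((tex_size, tex_size, 3))
    angle_map = np.full((tex_size, tex_size), np.pi)      # pi marks "not seen"

    for t in np.flatnonzero(visible):
        px, py = project(points[t])
        albedo = image[py, px] / np.maximum(shading[py, px], 1e-3)  # unshade
        u, v = uv[t]
        color_map[v, u] = albedo
        # viewing angle alpha_t^k between surface normal and viewing direction
        angle_map[v, u] = np.arccos(np.clip(-normals[t] @ view_dir, -1.0, 1.0))
    return color_map, angle_map
```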

The semantic prior is generated by re-projecting the human semantic segmentation into texture space. Specifically, we first calculate a semantic label for every pixel in the input frames using a CNN-based human parsing method [49]. Each frame is segmented into 10 semantic classes such as hair, face, left leg and upper clothes. Then the semantic information of all frames is fused into the global semantic map by minimizing over the labeling x:

\arg\min_{x} \sum_{t=0}^{T} \varphi_t(x_t) + \sum_{(t, q) \in N} \psi(x_t, x_q) \quad (16)

\varphi_t(x_t) = 1 - \frac{\sum_{k=0}^{K} X_k(\cos^2 \alpha_t^k)}{K} \quad (17)

Here \varphi is the energy term describing the compatibility of a label x with the texel t, where X_k returns the given value if the texel was labeled with x in view k and 0 otherwise. \psi gives the label compatibility of neighboring texels t and q. We solve Eq. 16 by multi-label graph-cut optimization with alpha-beta swaps [12]. While constructing the graph, we connect every texel not only with its neighbors in texture space but with all neighbors on the surface. In particular, this means texels are connected across texture seams. To have a strong prior for the texture completion, we calculate Gaussian mixture models (GMMs) of the colors in HSV space per label using the part-textures and corresponding labels.
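A minimal sketch of the per-label color prior, using scikit-learn's GaussianMixture as one possible GMM implementation (the paper does not name one) and HSV colors as described above; the helper names are ours.

```python
import numpy as np
from skimage.color import rgb2hsv
from sklearn.mixture import GaussianMixture

def fit_label_gmms(part_textures, part_labels, n_labels=10, n_components=3):
    """Fit one Gaussian mixture over HSV colors per semantic label.

    part_textures: list of (H, W, 3) RGB partial textures, one per key-frame
    part_labels:   list of (H, W) integer semantic label maps in texture space
    """
    gmms = {}
    for label in range(n_labels):
        samples = []
        for tex, lab in zip(part_textures, part_labels):
            mask = lab == label
            if mask.any():
                samples.append(rgb2hsv(tex)[mask])   # HSV colors of this part
        if samples:
            gmms[label] = GaussianMixture(n_components).fit(np.vstack(samples))
    return gmms
```

The term m(U_t^k, x_t) in Eq. 19 below can then, for instance, be evaluated as a Mahalanobis-style distance of a texel color to the mixture of its semantic label.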

Texture merging: Next, we calculate the complete texture by merging the partial textures. While keeping the same graph structure, the objective function is:

\arg\min_{u} \sum_{t=0}^{T} \theta_t(u_t) + \sum_{(t, q) \in N} \eta_{t,q}(u_t, u_q) \quad (18)

where the labeling u assigns every texel to a partial texture k. The first term seeks to find the best image for each texel:

\theta_t(k) = w_{\text{vis}} \sin^2 \alpha_t^k + w_{\text{gmm}}\, m(U_t^k, x_t) + w_{\text{face}}\, d(U_t^k) + w_{\text{silh}} E_{\text{silh},k} \quad (19)

with weights w_*. m returns the Mahalanobis distance of the color value for t in part-texture k given the semantic label x_t. d calculates the structural dissimilarity between the first and the given key-frame. d is only evaluated on texels belonging to the facial region and ensures a consistent facial expression over the texture.

The smoothness term \eta ensures similar colors for neighboring texels. For neighboring texels assigned to different key-frames u_t \neq u_q, while belonging to the same semantic region x_t = x_q, \eta_{t,q} equals the gradient magnitude between the texel colors \| U_t^{u_t} - U_q^{u_q} \|.


Figure 6. Side-by-side comparisons of our reconstructions (b) and the input frame (a). As can be seen from (b), our method closely resembles the subject in the video (a).

Since the number of combinations in \eta is very high, it is computationally not feasible to solve Eq. 18 as a multi-label graph-cut problem. Thus, we propose the following strategy for an approximate solution: We convert the multi-label problem into a binary labeling decision b \in \{update, keep\}. We initialize the texture with M = U^0. Then we randomly choose a key-frame k and test it against the current solution. The likelihood of selecting a key-frame is inversely proportional to its remaining silhouette error E_{\text{silh},k} in order to favor well-aligned key-frames. Further, \eta is approximated with:

\eta_{t,q} = \begin{cases} \max\left(\| M_t - U_q^k \|,\; \| M_q - U_t^k \|\right), & \text{if } b_t \neq b_q \wedge x_t = x_q \\ 0, & \text{otherwise} \end{cases} \quad (20)

Convergence is usually reached after 2K to 3K iterations. Finally, we cross-blend between different labels to reduce visible seams. The run-time per iteration on a 1000 × 1000 px texture with Python code using a standard graph cut library is ~2 sec. No attempts at run-time optimization have been made.
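The binary update step can be sketched with a generic max-flow library such as PyMaxflow (one possible "standard graph cut library"; the unary and pairwise helpers stand in for Eqs. 19 and 20 and are hypothetical):

```python
import numpy as np
import maxflow  # PyMaxflow; one possible choice of graph cut library

def binary_texture_update(M, U_k, cost_keep, cost_update, pairwise_cost, neighbors):
    """Decide per texel whether to keep the current color or update from key-frame k.

    M, U_k:        (T, 3) current texture colors and candidate partial texture colors
    cost_keep:     (T,)   unary cost of keeping the current label (cf. Eq. 19)
    cost_update:   (T,)   unary cost theta_t(k) of switching to key-frame k
    pairwise_cost: function(t, q) -> eta_{t,q} for a keep/update disagreement (Eq. 20)
    neighbors:     iterable of texel index pairs, including pairs across seams
    """
    T = len(M)
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(T)
    for t in range(T):
        # source side = keep (segment 0), sink side = update (segment 1)
        g.add_tedge(nodes[t], cost_update[t], cost_keep[t])
    for t, q in neighbors:
        w = pairwise_cost(t, q)
        g.add_edge(nodes[t], nodes[q], w, w)
    g.maxflow()

    take_update = np.array([g.get_segment(nodes[t]) == 1 for t in range(T)])
    M[take_update] = U_k[take_update]
    return M, take_update
```

Repeating this against randomly drawn key-frames, with the selection probability weighted by the silhouette error as described above, converges to the stitched texture.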

4. Experiments

We evaluate our method on two publicly available datasets: the People-Snapshot dataset [3] and the dataset used in [9]. To validate the perceived quality of our results, we performed a user study.

4.1. Qualitative results and comparisons

We compare our method to the recent method of [3] on their People-Snapshot dataset. The approach of [3] is the only other monocular 3D person reconstruction method. The People-Snapshot dataset consists of 24 sequences of different subjects rotating in front of the camera while roughly holding an A-pose. In Fig. 6, we show some examples of our reconstruction results, which precisely overlay the subjects in the image.
Figure 7. Our results (b) in comparison against the RGB-D method [9] (a). Note that the texture prior has not been used (see Sec. 4.1).

Figure 8. In comparison to the method of [3] (left), the faces in our results (right) have finer details in the mesh and closely resemble the subject in the photograph.

Note that the level of detail of the input images is captured by our reconstructed avatars. In Fig. 11, we show side-by-side comparisons to [3]. Our results (right) reconstruct the face better and preserve many more details, e.g. clothing wrinkles and t-shirt prints.

Additionally, we compare against the state-of-the-art RGB-D method [9], also using their dataset of people in minimal clothing¹. While their method relies on depth data, we only use the RGB video, which makes the problem much harder. Despite this, as shown in Fig. 7, our results are comparable in quality to theirs.

4.2. Face similarity measure

One goal of our method is to preserve the individual appearance of subjects in their avatars. Since the face is crucial for this, we leverage facial landmark detections and shape-from-shading. As seen in Fig. 11, our method adds a significant level of detail to the facial region in comparison to the state of the art. In Fig. 8, we show the same comparison for untextured meshes. Our result closely resembles the subject in the photograph. To further demonstrate the effectiveness of our method for face similarity preservation, we perform the following experiment: FaceNet [70] is a deep network that is trained to map face images to a Euclidean space where distance corresponds to face similarity. We use FaceNet trained on the CASIA-WebFace dataset [88] to measure the similarity between photos of the subjects in the People-Snapshot dataset and their reconstructions. Two distinct subjects in the dataset have a mean similarity distance of 1.33 ± 0.13. The same subject in different settings differs by 0.55 ± 0.18. Our reconstructions feature a mean distance of 0.99 ± 0.11 to their photo counterparts.

¹The deep-learning-based segmentation [31] only works for fully clothed people, so we had to deactivate the semantic prior on this dataset.

Figure 9. Comparison of a result of our method before (left) and after (right) applying shape-from-shading based detail enhancement.

Figure 10. The semantic prior for texture stitching successfully removes color spilling (left) in our final texture (right).

Reconstructions of [3] perform significantly worse, with a mean distance of 1.09 ± 0.15. While our reconstructions can be reliably identified using FaceNet, reconstructions of [3] have a similarity distance close to that of distinct people, making them less likely to be identified correctly.
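The identity check reduces to comparing embedding distances; the sketch below assumes some pretrained FaceNet-style embedding network producing one vector per image (not part of our method) and is only meant to make the protocol explicit.

```python
import numpy as np

def identity_distances(photo_embeddings, render_embeddings):
    """Euclidean distances between matching photo/rendering face embeddings.

    photo_embeddings, render_embeddings: (S, D) arrays, one row per subject,
    produced by a FaceNet-style network (assumed to be available).
    """
    d = np.linalg.norm(photo_embeddings - render_embeddings, axis=1)
    return d.mean(), d.std()
```

Mean distances well below the inter-subject distance of 1.33 reported above indicate that an avatar remains identifiable as the correct person.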

4.3. Ablation analysis

In the following, we qualitatively demonstrate the effectiveness of further design choices of our method.

Shape-from-shading: In order to render the avatars under different illuminations, detailed geometry should be present in the mesh. In Fig. 9, we demonstrate the level of detail added to the meshes by shape-from-shading. While the mesh on the left only describes the low-frequency shape, our refined result on the right contains fine-grained details such as wrinkles and buttons.

Influence of the texture prior: In Fig. 10, we show the effectiveness of the semantic prior for texture stitching. While the texture on the left, computed without the prior, contains noticeable color spills on the arms and hands, the final texture on the right contains no color spills and fewer stitching artifacts along semantic boundaries.

4.4. User study

Finally, we conducted a user study in order to validate the visual fidelity of our results. Each participant was asked four questions about 6 randomly chosen results out of the 24 reconstructed subjects in the People-Snapshot dataset. The avatars shown to each participant and the questions asked were randomized. In every question, the participants had to decide between our method and the method of [3]. The four questions were:

− Which avatar preserves the identity of the person in the image better? (identity)
− Which avatar has more detail? (detail)
− Which avatar looks more real to you? (realism)
− Which avatar do you like better? (preference)

                     Identity   Details   Realism   Preference
Textured Avatars     83.12 %       -      92.27 %    89.64 %
Untextured Avatars   65.70 %    95.72 %   89.73 %       -

Table 1. Results of the user study. Percentage of answers where users preferred our method over [3]. We asked about four different aspects. See Sec. 4.4 for details.

We presented the users with renderings of the meshes in consistent pose and illumination. The users were allowed to zoom into the images. For the identity and realism questions, we showed the participants either textured or untextured meshes. For the identity comparison, we additionally showed a photo of the subject next to the renderings. When asking for detail, we only showed untextured meshes, and when asking for preference, we only showed textured results. Additionally, we asked for the level of experience with 3D data (None, Beginner, Proficient, Expert). 74 people participated in our online survey, covering the whole range of expertise.

The results of the study are summarized in Table 1. The participants clearly preferred our results in all scenarios over the current state of the art. Admittedly, when asked about identity preservation for untextured meshes, users still preferred our method, but this time only 65.70% of the time. Further inspection of the results shows that users with high experience with 3D data think our method preserves the identity better (90.48%, versus 60.49% for novice users). We hypothesize that inexperienced users find it more difficult to recognize people from 3D meshes without textures. Most importantly, by a large margin, our results are perceived as more realistic (92.27%), preserve more details (95.72%) and were preferred 89.64% of the time.

5. Discussion and Conclusion

We have proposed a novel method to create highly detailed personalized avatars from monocular video. We improve over the state of the art in several important aspects: Our optimization scheme allows us to integrate face landmark detections and shape-from-shading from multiple frames. Experiments demonstrate that this results in better face reconstruction and better identity preservation. This is also confirmed by our user study, which shows that people think our method preserves identity better 83.12% of the time, and captures more detail 95.72% of the time.

We introduced a new texture stitching binary optimization, which allows us to efficiently merge the appearance of multiple frames into a single coherent texture. The optimization includes a semantic texture term that incorporates appearance models for each semantic segmentation part. Results demonstrate that the common artifact of color spilling from skin to clothing or vice versa is reduced.

Figure 11. In comparison to the method of [3] (left), our results (right) look much more natural and have finer details.

We have argued for a method to capture the subtle but very important details that make avatars look realistic. Indeed, details matter: the user study shows that users think our results are more realistic than the state of the art 92.27% of the time, and prefer our avatars 89.64% of the time.

Future work should address the capture of subjects wearing clothing with topology different from the body, including skirts and coats. Furthermore, to obtain full texturing, subjects have to be seen from all sides – it may be possible to infer occluded appearance using sufficient training data. Another avenue to explore is reconstruction in an uncooperative setting, e.g. from online videos of people.

Having cameras all around us, we can now serve the growing demand for personalized avatars in virtual and augmented reality applications, e.g. in the fields of entertainment, communication or e-commerce.

Acknowledgments: The authors gratefully acknowledge funding by the German Science Foundation under project DFG MA2555/12-1.


References

[1] B. Allain, J.-S. Franco, and E. Boyer. An Efficient Volumetric Framework for Shape Tracking. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 268–276, Boston, United States, 2015. IEEE. 3

[2] T. Alldieck, M. Kassubeck, B. Wandt, B. Rosenhahn, andM. Magnor. Optical flow-based 3d human motion estimationfrom monocular video. In German Conf. on Pattern Recog-nition, pages 347–360, 2017. 2

[3] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. InIEEE Conf. on Computer Vision and Pattern Recognition,2018. 1, 2, 3, 4, 6, 7, 8

[4] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers,and J. Davis. SCAPE: shape completion and animation ofpeople. In ACM Transactions on Graphics, volume 24, pages408–416. ACM, 2005. 2

[5] A. O. Balan and M. J. Black. The naked truth: Estimatingbody shape under clothing. In European Conf. on ComputerVision, pages 15–29. Springer, 2008. 1, 2

[6] A. Baumberg. Blending images for texturing 3d models. InBritish Machine Vision Conference, volume 3, page 5. Cite-seer, 2002. 3

[7] F. Bernardini, I. M. Martin, and H. Rushmeier. High-qualitytexture reconstruction from multiple scans. IEEE Transac-tions on Visualization and Computer Graphics, 7(4):318–332, 2001. 3

[8] S. Bi, N. K. Kalantari, and R. Ramamoorthi. Patch-basedoptimization for image-based texture mapping. ACM Trans-actions on Graphics, 36(4), 2017. 3

[9] F. Bogo, M. J. Black, M. Loper, and J. Romero. Detailedfull-body reconstructions of moving people from monocularRGB-D sequences. In IEEE International Conf. on Com-puter Vision, pages 2300–2308, 2015. 1, 2, 6, 7

[10] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero,and M. J. Black. Keep it SMPL: Automatic estimation of3D human pose and shape from a single image. In EuropeanConf. on Computer Vision. Springer International Publish-ing, 2016. 1, 2

[11] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black. Dy-namic FAUST: Registering human bodies in motion. In IEEEConf. on Computer Vision and Pattern Recognition, 2017. 1

[12] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate en-ergy minimization via graph cuts. IEEE Transactions on Pat-tern Analysis and Machine Intelligence, 23(11):1222–1239,2001. 6

[13] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. Highaccuracy optical flow estimation based on a theory for warp-ing. In European Conf. on Computer Vision, pages 25–36.Springer, 2004. 5

[14] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixe, D. Cre-mers, and L. Van Gool. One-shot video object segmentation.In IEEE Conf. on Computer Vision and Pattern Recognition,2017. 4

[15] C. Cagniart, E. Boyer, and S. Ilic. Probabilistic deformable surface tracking from multiple videos. In K. Daniilidis, P. Maragos, and N. Paragios, editors, European Conf. on Computer Vision, volume 6314 of Lecture Notes in Computer Science, pages 326–339, Heraklion, Greece, 2010. Springer. 3

[16] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In IEEEConf. on Computer Vision and Pattern Recognition, 2017. 4

[17] J. Carranza, C. Theobalt, M. A. Magnor, and H.-P. Seidel.Free-viewpoint video of human actors. In ACM Transactionson Graphics, volume 22, pages 569–577. ACM, 2003. 3

[18] X. Chen, Y. Guo, B. Zhou, and Q. Zhao. Deformable modelfor estimating clothed and naked human shapes from a singleimage. The Visual Computer, 29(11):1187–1196, 2013. 2

[19] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev,D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan. High-quality streamable free-viewpoint video. ACM Transactionson Graphics, 34(4):69, 2015. 2

[20] Y. Cui, W. Chang, T. Noll, and D. Stricker. Kinectavatar:fully automatic body capture using a single kinect. In AsianConf. on Computer Vision, pages 133–147, 2012. 2

[21] E. De Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Sei-del, and S. Thrun. Performance capture from sparse multi-view video. In ACM Transactions on Graphics, volume 27,page 98. ACM, 2008. 3

[22] P. E. Debevec, C. J. Taylor, and J. Malik. Modeling and ren-dering architecture from photographs: A hybrid geometry-and image-based approach. In Annual Conf. on ComputerGraphics and Interactive Techniques, pages 11–20. ACM,1996. 3

[23] E. Dibra, H. Jain, C. Oztireli, R. Ziegler, and M. Gross. Hu-man shape from silhouettes using generative hks descriptorsand cross-modal neural networks. In IEEE Conf. on Com-puter Vision and Pattern Recognition, pages 4826–4836,2017. 2

[24] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello,A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Tay-lor, et al. Fusion4d: Real-time performance capture of chal-lenging scenes. ACM Transactions on Graphics, 35(4):114,2016. 1, 2

[25] M. Eisemann, B. De Decker, M. Magnor, P. Bekaert,E. De Aguiar, N. Ahmed, C. Theobalt, and A. Sellent. Float-ing textures. In Computer Graphics Forum, volume 27,pages 409–418. Wiley Online Library, 2008. 3

[26] Y. Fu, Q. Yan, L. Yang, J. Liao, and C. Xiao. Texture map-ping for 3d reconstruction with rgb-d sensor. In IEEE Conf.on Computer Vision and Pattern Recognition, 2018. 3

[27] S. Fuhrmann, F. Langguth, and M. Goesele. Mve-a multi-view reconstruction environment. In Eurographics Work-shops on Graphics and Cultural Heritage, pages 11–18,2014. 1, 2

[28] J. Gall, C. Stoll, E. De Aguiar, C. Theobalt, B. Rosenhahn,and H.-P. Seidel. Motion capture using joint skeleton track-ing and surface estimation. In IEEE Conf. on ComputerVision and Pattern Recognition, pages 1746–1753. IEEE,2009. 3

[29] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In IEEE International Conf. on Computer Vision, pages 873–881, 2015. 2

[30] D. M. Gavrila and L. S. Davis. 3-d model-based trackingof humans in action: a multi-view approach. In IEEE Conf.on Computer Vision and Pattern Recognition, pages 73–80.IEEE, 1996. 2

[31] K. Gong, X. Liang, X. Shen, and L. Lin. Look into per-son: Self-supervised structure-sensitive learning and a newbenchmark for human parsing. In IEEE Conf. on ComputerVision and Pattern Recognition, 2017. 7

[32] P. Guan, A. Weiss, A. O. Balan, and M. J. Black. Estimatinghuman shape and pose from a single image. In IEEE Inter-national Conf. on Computer Vision, pages 1381–1388. IEEE,2009. 2

[33] Y. Guo, X. Chen, B. Zhou, and Q. Zhao. Clothed and nakedhuman shapes estimation from a single image. Computa-tional Visual Media, pages 43–50, 2012. 2

[34] B. Haefner, Y. Queau, T. Mollenhoff, and D. Cremers.Fight ill-posedness with ill-posedness: Single-shot varia-tional depth super-resolution from shading. In Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition, pages 164–174, 2018. 3

[35] N. Hasler, H. Ackermann, B. Rosenhahn, T. Thormahlen,and H.-P. Seidel. Multilinear pose and body shape estisma-tion of dressed subjects from image sets. In IEEE Conf.on Computer Vision and Pattern Recognition, pages 1823–1830. IEEE, 2010. 1, 2

[36] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H.-P. Sei-del. A statistical model of human pose and body shape.In Computer Graphics Forum, volume 28, pages 337–346,2009. 2

[37] C.-H. Huang, B. Allain, J.-S. Franco, N. Navab, S. Ilic, andE. Boyer. Volumetric 3d tracking by detection. In IEEEConf. on Computer Vision and Pattern Recognition, pages3862–3870, 2016. 3

[38] Y. Huang, F. Bogo, C. Classner, A. Kanazawa, P. V. Gehler,I. Akhter, and M. J. Black. Towards accurate markerless hu-man shape and pose estimation over time. In InternationalConf. on 3D Vision, 2017. 2

[39] M. Innmann, M. Zollhofer, M. Nießner, C. Theobalt, andM. Stamminger. Volumedeform: Real-time volumetric non-rigid reconstruction. In European Conf. on Computer Vision,2016. 2

[40] M. Jancosek and T. Pajdla. Multi-view reconstruction pre-serving weakly-supported surfaces. In IEEE Conf. on Com-puter Vision and Pattern Recognition, pages 3121–3128.IEEE, 2011. 2

[41] H. Joo, T. Simon, and Y. Sheikh. Total capture: A 3d defor-mation model for tracking faces, hands, and bodies. In Pro-ceedings of the IEEE Conference on Computer Vision andPattern Recognition, pages 8320–8329, 2018. 2

[42] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In IEEE Conf. onComputer Vision and Pattern Recognition. IEEE ComputerSociety, 2018. 1, 2

[43] R. Koch, M. Pollefeys, and L. Van Gool. Multi viewpointstereo from uncalibrated video sequences. In European con-ference on computer vision, pages 55–71. Springer, 1998. 2

[44] V. Lempitsky and D. Ivanov. Seamless mosaicing of image-based texture maps. In IEEE Conf. on Computer Vision andPattern Recognition, pages 1–6. IEEE, 2007. 3, 5

[45] H. P. Lensch, W. Heidrich, and H.-P. Seidel. A silhouette-based algorithm for texture registration and stitching. Graph-ical Models, 63(4):245–262, 2001. 3

[46] V. Leroy, J.-S. Franco, and E. Boyer. Multi-view dynamicshape refinement using local temporal integration. In IEEEInternational Conf. on Computer Vision, 2017. 1

[47] V. Leroy, J.-S. Franco, and E. Boyer. Multi-View DynamicShape Refinement Using Local Temporal Integration. InIEEE International Conf. on Computer Vision, Venice, Italy,2017. 2

[48] H. Li, E. Vouga, A. Gudym, L. Luo, J. T. Barron, andG. Gusev. 3d self-portraits. ACM Transactions on Graph-ics, 32(6):187, 2013. 2

[49] X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin,and S. Yan. Deep human parsing with active template re-gression. Pattern Analysis and Machine Intelligence, IEEETransactions on, 37(12):2402–2414, Dec 2015. 6

[50] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J.Black. SMPL: A skinned multi-person linear model. ACMTransactions on Graphics, 34(6):248:1–248:16, 2015. 1, 2,3

[51] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin,M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt.Vnect: Real-time 3d human pose estimation with a single rgbcamera. ACM Transactions on Graphics, 36(4):44, 2017. 2

[52] T. Nestmeyer and P. V. Gehler. Reflectance adaptive fil-tering improves intrinsic image estimation. In IEEE Conf.on Computer Vision and Pattern Recognition, pages 1771–1780. IEE, 2017. 4

[53] R. A. Newcombe, D. Fox, and S. M. Seitz. Dynamicfusion:Reconstruction and tracking of non-rigid scenes in real-time.In IEEE Conf. on Computer Vision and Pattern Recognition,pages 343–352, 2015. 1, 2

[54] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux,D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, andA. Fitzgibbon. Kinectfusion: Real-time dense surface map-ping and tracking. In IEEE International Symposium onMixed and Augmented Reality, pages 127–136, 2011. 2

[55] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. Dtam:Dense tracking and mapping in real-time. In IEEE Interna-tional Conf. on Computer Vision, pages 2320–2327, 2011. 1,2

[56] W. Niem and J. Wingbermuhle. Automatic reconstructionof 3d objects using a mobile monoscopic camera. In Proc.Inter. Conf. on Recent Advances in 3-D Digital Imaging andModeling, pages 173–180. IEEE, 1997. 3

[57] E. Ofek, E. Shilat, A. Rappoport, and M. Werman. Mul-tiresolution textures from image sequences. IEEE ComputerGraphics and Applications, 17(2):18–29, Mar. 1997. 3

[58] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, andB. Schiele. Neural body fitting: Unifying deep learning andmodel based human pose and shape estimation. In Interna-tional Conf. on 3D Vision, 2018. 1, 2


[59] S. Orts-Escolano, C. Rhemann, S. Fanello, W. Chang,A. Kowdle, Y. Degtyarev, D. Kim, P. L. Davidson,S. Khamis, M. Dou, et al. Holoportation: Virtual 3d telepor-tation in real-time. In Symposium on User Interface Softwareand Technology, pages 741–754. ACM, 2016. 1, 2

[60] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis. Learning toestimate 3D human pose and shape from a single color im-age. In IEEE Conf. on Computer Vision and Pattern Recog-nition, 2018. 1, 2

[61] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. H.Salesin. Synthesizing realistic facial expressions from pho-tographs. In ACM SIGGRAPH 2006 Courses, page 19.ACM, 2006. 3

[62] R. Plankers and P. Fua. Articulated soft objects for video-based body modeling. In IEEE International Conf. on Com-puter Vision, number CVLAB-CONF-2001-005, pages 394–401, 2001. 3

[63] G. Pons-Moll, S. Pujades, S. Hu, and M. Black. ClothCap:Seamless 4D clothing capture and retargeting. ACM Trans-actions on Graphics, 36(4), 2017. 2

[64] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black.Dyna: a model of dynamic human shape in motion. ACMTransactions on Graphics, 34:120, 2015. 1, 2

[65] G. Pons-Moll and B. Rosenhahn. Model-Based Pose Estima-tion, chapter 9, pages 139–170. Springer, 2011. 2

[66] G. Pons-Moll, J. Taylor, J. Shotton, A. Hertzmann, andA. Fitzgibbon. Metric regression forests for correspondenceestimation. International Journal of Computer Vision, pages1–13, 2015. 2

[67] H. Rhodin, N. Robertini, D. Casas, C. Richardt, H.-P. Seidel,and C. Theobalt. General automatic human shape and motioncapture using volumetric contour cues. In European Conf. onComputer Vision, pages 509–526. Springer, 2016. 3

[68] N. Robertini, D. Casas, E. De Aguiar, and C. Theobalt.Multi-view performance capture of surface details. Inter-national Journal of Computer Vision, pages 1–18, 2017. 1,3

[69] C. Rocchini, P. Cignoni, C. Montani, and R. Scopigno. Mul-tiple textures stitching and blending on 3d objects. In Ren-dering Techniques 99, pages 119–130. Springer, 1999. 3

[70] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A uni-fied embedding for face recognition and clustering. In IEEEConf. on Computer Vision and Pattern Recognition, pages815–823, 2015. 7

[71] A. Shapiro, A. Feng, R. Wang, H. Li, M. Bolas, G. Medioni,and E. Suma. Rapid avatar capture and simulation usingcommodity depth sensors. Computer Animation and VirtualWorlds, 25(3-4):201–211, 2014. 2

[72] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Track-ing loose-limbed people. In IEEE Conf. on Computer Vi-sion and Pattern Recognition, volume 1, pages I–421. IEEE,2004. 2

[73] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand key-point detection in single images using multiview bootstrap-ping. In IEEE Conf. on Computer Vision and Pattern Recog-nition, 2017. 4

[74] M. Slavcheva, M. Baust, D. Cremers, and S. Ilic. Killingfu-sion: Non-rigid 3d reconstruction without correspondences.In IEEE Conf. on Computer Vision and Pattern Recognition,volume 3, page 7, 2017. 2

[75] J. Starck and A. Hilton. Surface capture for performance-based animation. IEEE Computer Graphics and Applica-tions, 27(3), 2007. 2

[76] C. Stoll, N. Hasler, J. Gall, H.-P. Seidel, and C. Theobalt.Fast articulated motion tracking using a sums of gaussiansbody model. In IEEE International Conf. on Computer Vi-sion, pages 951–958. IEEE, 2011. 2

[77] Y. Tao, Z. Zheng, K. Guo, J. Zhao, D. Quionhai, H. Li,G. Pons-Moll, and Y. Liu. Doublefusion: Real-time cap-ture of human performance with inner body shape from adepth sensor. In IEEE Conf. on Computer Vision and PatternRecognition, 2018. 1, 2

[78] J. Tong, J. Zhou, L. Liu, Z. Pan, and H. Yan. Scanning 3dfull human bodies using kinects. IEEE Transactions on Vi-sualization and Computer Graphics, 18(4):643–650, 2012.2

[79] D. Vlasic, P. Peers, I. Baran, P. Debevec, J. Popovic,S. Rusinkiewicz, and W. Matusik. Dynamic shape captureusing multi-view photometric stereo. In ACM Transactionson Graphics, volume 28, page 174. ACM, 2009. 3

[80] M. Waechter, N. Moehrle, and M. Goesele. Let there becolor! large-scale texturing of 3d reconstructions. In Euro-pean Conf. on Computer Vision, pages 836–850. Springer,2014. 3

[81] A. Weiss, D. Hirshberg, and M. J. Black. Home 3d bodyscans from noisy image and range data. In IEEE Interna-tional Conf. on Computer Vision, pages 1951–1958. IEEE,2011. 1, 2

[82] C. Wu, C. Stoll, L. Valgaerts, and C. Theobalt. On-set perfor-mance capture of multiple actors with a stereo camera. ACMTransactions on Graphics, 32(6):161, 2013. 3

[83] C. Wu, K. Varanasi, Y. Liu, H.-P. Seidel, and C. Theobalt.Shading-based dynamic shape refinement from multi-viewvideo under general illumination. In IEEE InternationalConf. on Computer Vision, pages 1108–1115. IEEE, 2011.3

[84] C. Wu, B. Wilburn, Y. Matsushita, and C. Theobalt. High-quality shape from multi-view stereo and shading under gen-eral illumination. In IEEE Conf. on Computer Vision andPattern Recognition, pages 969–976, 2011. 4, 5

[85] S. Wuhrer, L. Pishchulin, A. Brunton, C. Shu, and J. Lang.Estimation of human body shape and posture under cloth-ing. Computer Vision and Image Understanding, 127:31–42,2014. 2

[86] W. Xu, A. Chatterjee, M. Zollhoefer, H. Rhodin, D. Mehta,H.-P. Seidel, and C. Theobalt. Monoperfcap: Human perfor-mance capture from monocular video. ACM Transactions onGraphics, 2018. 3

[87] J. Yang, J.-S. Franco, F. Hetroy-Wheeler, and S. Wuhrer. Es-timation of Human Body Shape in Motion with Wide Cloth-ing. In European Conf. on Computer Vision, Amsterdam,Netherlands, 2016. 2


[88] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face represen-tation from scratch. arXiv preprint arXiv:1411.7923, 2014.7

[89] M. Zeng, J. Zheng, X. Cheng, and X. Liu. Templatelessquasi-rigid shape modeling with implicit loop-closure. InIEEE Conf. on Computer Vision and Pattern Recognition,pages 145–152, 2013. 2

[90] C. Zhang, S. Pujades, M. Black, and G. Pons-Moll. Detailed,accurate, human shape estimation from clothed 3D scan se-quences. In IEEE Conf. on Computer Vision and PatternRecognition, 2017. 2

[91] Q. Zhang, B. Fu, M. Ye, and R. Yang. Quality dynamichuman body modeling using a single low-cost depth camera.In IEEE Conf. on Computer Vision and Pattern Recognition,pages 676–683. IEEE, 2014. 1, 2

[92] R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah. Shape-from-shading: a survey. IEEE Transactions on Pattern Analysisand Machine Intelligence, 21(8):690–706, 1999. 3

[93] E. Zheng, E. Dunn, V. Jojic, and J.-M. Frahm. Patchmatchbased joint view selection and depthmap estimation. In IEEEConf. on Computer Vision and Pattern Recognition, pages1510–1517, 2014. 2

[94] Q.-Y. Zhou and V. Koltun. Color map optimization for 3dreconstruction with consumer depth cameras. ACM Trans-actions on Graphics, 33(4):155, 2014. 3

[95] M. Zollhofer, M. Nießner, S. Izadi, C. Rehmann, C. Zach,M. Fisher, C. Wu, A. Fitzgibbon, C. Loop, C. Theobalt, et al.Real-time non-rigid reconstruction using an rgb-d camera.ACM Transactions on Graphics, 33(4):156, 2014. 1, 3

[96] S. Zuffi and M. J. Black. The stitched puppet: A graph-ical model of 3d human shape and pose. In IEEE Conf.on Computer Vision and Pattern Recognition, pages 3537–3546. IEEE, 2015. 2