
Learning 3D Human Shape and Pose from Dense Body Parts

Hongwen Zhang, Jie Cao, Guo Lu, Wanli Ouyang, Senior Member, IEEE, and Zhenan Sun, Senior Member, IEEE

Abstract—Reconstructing 3D human shape and pose from monocular images is challenging despite the promising results achieved by the most recent learning-based methods. The commonly occurring misalignment stems from two facts: the mapping from images to the model space is highly non-linear, and the rotation-based pose representation of the body model is prone to result in drift of the joint positions. In this work, we investigate learning 3D human shape and pose from dense correspondences of body parts and propose a Decompose-and-aggregate Network (DaNet) to address these issues. DaNet adopts dense correspondence maps, which densely build a bridge between 2D pixels and 3D vertexes, as intermediate representations to facilitate the learning of the 2D-to-3D mapping. The prediction modules of DaNet are decomposed into one global stream and multiple local streams to enable global and fine-grained perceptions for the shape and pose predictions, respectively. Messages from local streams are further aggregated to enhance the robust prediction of the rotation-based poses, where a position-aided rotation feature refinement strategy is proposed to exploit spatial relationships between body joints. Moreover, a Part-based Dropout (PartDrop) strategy is introduced to drop out dense information from intermediate representations during training, encouraging the network to focus on more complementary body parts as well as neighboring position features. The efficacy of the proposed method is validated on both indoor and real-world datasets including Human3.6M, UP-3D, and DensePose-COCO, showing that our method could significantly improve the reconstruction performance in comparison with previous state-of-the-art methods. Our code will be publicly available at https://hongwenzhang.github.io/dense2mesh.

Index Terms—3D human shape and pose estimation, decompose-and-aggregate network, position-aided rotation feature refinement, part-based dropout.


1 INTRODUCTION

Reconstructing human shape and pose from a monocular image is an appealing yet challenging task, which typically involves the prediction of the camera and parameters of a statistical body model (e.g., the most commonly used SMPL [1] model). Fig. 1(a) shows an example of the reconstructed result. The challenges of this task come from the fundamental depth ambiguity, the complexity and flexibility of human bodies, and variations in clothing and viewpoint, etc. Traditional approaches [2], [3] fit the SMPL model to 2D evidence such as 2D body joints or silhouettes in images, which involves complex non-linear optimization and iterative refinement. Recently, learning-based approaches [4], [5], [6], [7] integrate the SMPL model within neural networks and predict model parameters directly in an end-to-end manner.

Though great progress has been made, the direct prediction of the body model from the image space is still complex and difficult even for deep neural networks. In this work, we propose to adopt IUV maps as intermediate representations to facilitate the learning of the mapping from images to models. As depicted in Fig. 1(b), compared with other 2D representations [4], [6], [7], the IUV map provides richer information, because it

• H. Zhang, J. Cao, and Z. Sun are with CRIPAC, NLPR, CAS CEBSIT, Institute of Automation, Chinese Academy of Sciences, and University of Chinese Academy of Sciences, Beijing, China. E-mail: [email protected]; [email protected]; [email protected]. (Corresponding author: Zhenan Sun.)

• G. Lu is with Shanghai Jiao Tong University, Shanghai, China. E-mail: [email protected].

• W. Ouyang is with the University of Sydney, Sydney, Australia. E-mail: [email protected].

This work was done when H. Zhang was a visiting student at the University of Sydney.

Fig. 1. Illustration of our main ideas. (a) A human image with the reconstructed 3D model. The rotation-based pose representation of the body model is prone to result in drift of joint positions. (b) Comparison of the raw RGB image, silhouette, segmentation, and IUV map. (c) Local visual cues are crucial for perceiving joint rotation status. (d) Our DaNet learns 3D human shape and pose from IUV maps with decomposed perception, aggregated refinement, and part-based dropout strategies.

encodes the dense correspondence between foreground pixels on 2D images and vertexes on 3D meshes. Such a dense semantic map not only contains essential information for shape and pose estimation from RGB images, but also eliminates the interference of unrelated factors such as appearance, clothing, and illumination variations.

The representation of the 3D body model [1], [8] can be factorized into the shape and pose components, depicting the model at different scales. The body shape gives an overall description of the model, such as its height and weight, while the body pose


provides more detailed descriptions about the rotation status of each body joint. Previous learning-based methods [5], [7] typically predict them simultaneously using global information from the last layer of the neural network. We observe that the detailed pose of body joints should be captured by local visual cues instead of global information. As shown in Fig. 1(c), we can estimate the rotation status of visible body joints based only on local visual cues, while the information from other body joints and background regions would be irrelevant.

For the rotation-based pose representation of commonly used body models [1], [8], small rotation errors accumulated along the kinematic chain could lead to large position drift at the leaf joints. Moreover, the rotation estimation is error-prone for occluded body joints since their perceptions are less reliable under occlusions. Hence, it is crucial to utilize information from visible body joints and the prior knowledge about the structure of human bodies. As shown in previous work [9], [10], structural information at the feature level is helpful for more robust and accurate pose estimation. However, it is non-trivial to apply these feature refinement methods to our case due to the weak correlation between the rotation-based poses of different joints. For instance, the shoulder, elbow, and wrist are three consecutive body joints, and one can hardly infer the relative rotation of the wrist w.r.t. the elbow given the relative rotation of the elbow w.r.t. the shoulder. On the other hand, we observe that the 3D locations of body joints have stronger correlations than the rotations of body joints. For instance, the positions of the shoulder, elbow, and wrist are strongly constrained by the length of the arm.

Based on the observations above, we propose a Decompose-and-aggregate Network (DaNet) to learn 3D human shape and pose from dense correspondences of body parts. As illustrated in Fig. 1(d), DaNet utilizes IUV maps as the intermediate information for more efficient learning, and decomposes the prediction modules into multiple streams considering that the prediction of different parameters requires receptive fields of different sizes. To robustly predict the rotation status of body joints, DaNet aggregates messages from different streams and refines the rotation features via an auxiliary position feature space to exploit the spatial relationships between body joints. For better generalization, a Part-based Dropout (PartDrop) strategy is further introduced to drop out dense information from intermediate representations during training, which could effectively regularize the network and encourage it to learn features from complementary body parts and leverage information from neighboring body joints. As will be validated in our experiments, all the above new designs contribute to better part-based learning and improve the reconstruction performance. To sum up, the main contributions of this work are listed as follows.

• We comprehensively study the effectiveness of adopting IUV maps, which contain dense semantic information of body parts, in both global and local scales as intermediate representations for the task of 3D human pose and shape estimation.

• Our reconstruction network is designed to have decomposed streams that provide global perception for the camera and shape predictions and detailed perception for the pose prediction of each body joint.

• A part-based dropout strategy is introduced to drop dense information from intermediate representations during training. Such a strategy can encourage the network to learn features from complementary body parts, and it also has potential for other structured image understanding tasks.

• A position-aided rotation feature refinement strategy is proposed to aggregate messages from position features of different body joints. It is more efficient to exploit the spatial relationship in an auxiliary position feature space since the correlations between position features are much stronger.

An early version of this work appeared in [11]. We have made significant extensions to our previous work in three main aspects. First, the methodology is improved to be more accurate and robust thanks to several new designs, including the part-based dropout strategy for better generalization performance and the customized graph convolutions for faster and better feature mapping and refinement. Second, more extensive evaluations and comparisons are included to validate the effectiveness of our method, including evaluations on additional datasets and comparisons of the reconstruction errors across different human actions and model surface areas. Third, more discussions are provided in our ablation studies, including comprehensive evaluations on the benefit of adopting IUV as intermediate representations and in-depth analyses on the refinement upon the rotation feature space and position feature space.

The remainder of this paper is organized as follows. Section 2 briefly reviews previous work related to ours. Section 3 provides preliminary knowledge about the SMPL model and IUV maps. Details of the proposed network are presented in Section 4. Experimental results and analyses are included in Section 5. Finally, Section 6 concludes the paper.

2 RELATED WORK

2.1 3D Human Shape and Pose Estimation

Early pioneering work on 3D human model reconstruction mainly focuses on the optimization of the fitting process. Among them, [12], [13] fit the body model SCAPE [8] with the requirement of ground-truth silhouettes or manual initialization. Bogo et al. [2] introduce the optimization method SMPLify and make the first attempt to automatically fit the SMPL model to 2D body joints by leveraging multiple priors. Lassner et al. [3] extend this method and improve the reconstruction performance by incorporating silhouette information in the fitting procedure. These optimization-based methods typically rely on accurate 2D observations and the prior terms imposed on the shape and pose parameters, making the procedure time-consuming and sensitive to the initialization. Alternatively, recent learning-based methods employ neural networks to predict the shape and pose parameters directly and learn the priors in a data-driven manner. These efforts mainly focus on several aspects including intermediate representation leveraging, architecture designs, structural information modeling, and re-projection loss designs, etc. Our work makes contributions to the first three aspects above and is also complementary to the work focusing on re-projection loss designs [4], [14], [15], reconstruction from videos or multi-view images [16], [17], [18], [19], and detailed or holistic body model learning [20], [21], [22].

2.1.1 Intermediate Representation

The recovery of the 3D human pose from a monocular image is challenging. Common strategies use intermediate estimations as a proxy representation to alleviate the difficulty. These methods can benefit from existing state-of-the-art networks for lower-level tasks. For 3D human pose estimation, 2D joint positions [23], [24], [25], [26], volumetric representations [27], joint heat maps [28]


and 3D orientation fields [29] are adopted in the literature as intermediate representations to facilitate the learning task. Similarly, for 3D human model reconstruction, silhouettes [6], [30], [31], joint heatmaps [4], [6], segmentation [7], depth maps [32], 3D volumetric representations [33], [34], [35], and 3D orientation fields [36] have also been adopted in existing methods as proxy representations. Though the aforementioned representations are helpful for the task, detailed information contained within body parts is missing in these coarse representations, which becomes the bottleneck for subsequent predictions. Recently, DensePose [37] regresses IUV maps directly from images, which provides a dense correspondence mapping from the image to the human body model. However, the 3D model cannot be directly retrieved from such a 2.5D projection. In our work, we propose to adopt such a densely semantic map as the intermediate representation for the task of 3D human shape and pose estimation. To the best of our knowledge, we are among the first attempts [15], [38], [39] to leverage IUV maps for 3D human model recovery. In comparison, the major differences between concurrent efforts and ours lie in three aspects: 1) [15], [38], [39] obtain IUV predictions from a pretrained network of DensePose [37], while our work augments the annotations of 3D human pose datasets with rendered ground-truth IUV maps and imposes dense supervision on the intermediate representations; 2) [15], [38], [39] only leverage global IUV maps, while our work exploits IUV maps in both global and local scales; 3) DenseRaC [39] resorts to involving more synthetic IUV maps as additional training data, while our work introduces the part-based dropout upon IUV maps to improve generalization. We believe these concurrent works complement each other and enrich the research community.

2.1.2 Architecture Design

Existing approaches to 3D human shape and pose estimation have designed a number of network architectures for more effective learning of the highly nonlinear image-to-model mapping. Tan et al. [40] develop an encoder-decoder based framework where the decoder learns the SMPL-to-silhouette mapping from synthetic data and the encoder learns the image-to-SMPL mapping with the decoder frozen. Kanazawa et al. [5] present an end-to-end framework, HMR, to reconstruct the SMPL model directly from images using a single CNN with an iterative regression module. Kolotouros et al. [41] enhance HMR with the fitting process of SMPLify [2] to incorporate regression- and optimization-based methods. Pavlakos et al. [6] propose to predict the shape and pose parameters from the estimated silhouettes and joint locations respectively. Sun et al. [42] also leverage joint locations and further involve deep features in the prediction process. Instead of regressing the shape and pose parameters directly, Kolotouros et al. [38] employ a Graph CNN [43] to regress the 3D coordinates of the human mesh vertices, while Yao et al. [44] regress the 3D coordinates in the form of an unwrapped position map. All aforementioned learning-based methods predict the pose in a global manner. In contrast, our DaNet predicts joint poses from multiple streams, hence the visual cues could be captured in a fine-grained manner. Recently, Guler et al. [14] also introduce a part-based reconstruction method to predict poses from deep features pooled around body joints. In comparison, the pooling operation of our DaNet is performed on intermediate representations, enabling detailed perception for better pose feature learning. Moreover, existing approaches for rotation-based pose estimation do not consider feature refinement, while DaNet includes an effective rotation feature refinement scheme for robust pose predictions.

2.1.3 Structural Information Modeling

Leveraging the articulated structure information is crucial for human pose modeling [45], [46], [47]. Recent deep learning-based approaches to human pose estimation [9], [10], [48], [49], [50] incorporate structured feature learning in their network architecture designs. All these efforts exploit the relationship between the position features of body joints, and their feature refinement strategies are only validated on the position-based pose estimation problem. Our work is complementary to them by investigating the refinement of rotation features under the context of the rotation-based pose representation, which paves a new way to impose structural constraints upon rotation features. Our solution aggregates the rotation features into a position feature space, where the aforementioned structural feature learning approaches could be easily applied.

For more geometrically reasonable pose predictions, different types of pose priors [51], [52], [53], [54], [55] are also employed as constraints in the learning procedure. For instance, Akhter and Black [51] learn the pose prior in the form of joint angle constraints. Sun et al. [53] design handcrafted constraints such as limb lengths and their proportions. Similar constraints are exploited in [54] under the weakly-supervised setting. For the rotation-based pose representation in the SMPL model, though it inherently satisfies structure constraints such as limb proportions, the pose prior is still essential for better reconstruction performance. SMPLify [2] imposes several penalizing terms on predicted poses to prevent unnatural results. Kanazawa et al. [5] introduce an adversarial prior for guiding the prediction to be realistic. All these methods consider the pose prior at the output level. In our work, we will exploit the relationship at the feature level for better 3D pose estimation in the SMPL model.

2.2 Regularization in Neural Networks

Regularization is important to neural networks for better generalization performance. A number of regularization techniques have been proposed to remove features from neural networks at different granularity levels. Among them, dropout [56] is commonly used at the fully connected layers of neural networks to drop unit-wise features independently. The introduction of dropout has inspired the development of other dropping out strategies with structured forms. For instance, SpatialDropout [57] drops channel-wise features across the entire feature map, while DropBlock [58] drops block-wise features in contiguous regions. Different from these techniques, our PartDrop strategy drops part-wise features at the granularity level of semantic body parts. Such a part-wise dropping strategy could remove patterns in a more structured manner and perform better in our learning task. Moreover, our PartDrop strategy is applied on intermediate representations, which is also different from data augmentation methods such as Cutout [59].

3 SMPL MODEL AND IUV MAPS

SMPL Model. The Skinned Multi-Person Linear model (SMPL) [1] is one of the widely used statistical human body models, which represents the body mesh with two sets of parameters, i.e., the shape and pose parameters. The shape indicates the model's height, weight, and limb proportions, while the pose


Fig. 2. Illustration of our method. (a)(b)(c) show the Index, U, and V values defined in DensePose [37], respectively. Note that the original Index values (ranging from 1 to 24) are also normalized into the [0, 1] interval. (d) Preparation of ground-truth IUV maps for 3D human body models. (e) Overview of the proposed Decompose-and-aggregate Network (DaNet).

indicates how the model deforms with the rotated skeleton joints. Such a decomposition of shape and pose makes it convenient for algorithms to focus on one of these two factors independently. In the SMPL model, the shape parameters β ∈ R^10 denote the coefficients of the PCA basis of the body shape. The pose parameters θ ∈ R^{3K} denote the axis-angle representations of the relative rotation of K skeleton joints with respect to their parents in the kinematic tree, where K = 24 in the SMPL model. For simplicity, the root orientation is also included as the pose parameters of the root joint in our formulation. Given the pose and shape parameters, the model deforms accordingly and generates a triangulated mesh with N = 6890 vertices, M(θ, β) ∈ R^{3×N}. The deformation process M(θ, β) is differentiable with respect to the pose θ and shape β, which means that the SMPL model could be integrated within a neural network as a typical layer without any learnable weights. After obtaining the final mesh, vertices could be further mapped to sparse 3D keypoints by a pretrained linear regressor.
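To make the "SMPL as a differentiable layer" idea concrete, the sketch below embeds an SMPL module in PyTorch and backpropagates through it. It assumes the third-party smplx package is installed and that MODEL_DIR is a hypothetical path to the downloaded SMPL model files; the paper integrates an equivalent SMPL layer inside its own network rather than this exact code.

```python
import torch
import smplx  # third-party SMPL implementation, assumed to be installed

# Load SMPL as a differentiable module without learnable weights.
# MODEL_DIR is a hypothetical path to the SMPL model files.
smpl = smplx.create(model_path="MODEL_DIR", model_type="smpl", gender="neutral")

batch_size = 1
betas = torch.zeros(batch_size, 10, requires_grad=True)          # shape coefficients beta
global_orient = torch.zeros(batch_size, 3, requires_grad=True)   # root orientation (axis-angle)
body_pose = torch.zeros(batch_size, 69, requires_grad=True)      # 23 joints x 3 (axis-angle)

output = smpl(betas=betas, global_orient=global_orient, body_pose=body_pose)
vertices = output.vertices   # (1, 6890, 3): the mesh M(theta, beta)
joints = output.joints       # sparse 3D keypoints regressed from the mesh

# Gradients flow back to the pose and shape parameters,
# so the SMPL layer can sit inside a larger network.
vertices.sum().backward()
print(betas.grad.shape, body_pose.grad.shape)
```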

IUV Maps. Reconstructing the 3D object model from a monocular image is ambiguous, but there are determinate correspondences between foreground pixels on 2D images and vertexes on 3D surfaces. Such correspondences could be represented in the form of UV maps, where the foreground pixels contain the corresponding UV coordinate values. In this way, the pixels on the foreground could be projected back to vertexes on the template mesh according to a predefined bijective mapping between the 3D surface space and the 2D UV space. For the human body model, the correspondence could have finer granularity by introducing the Index of the body parts [37], [60], which results in the IUV maps H = (H_i | H_u | H_v) ∈ R^{(1+P)×h_iuv×w_iuv×3}, where P denotes the number of body parts, and h_iuv and w_iuv denote the height and width of the IUV maps. The Index channels H_i indicate whether a pixel belongs to the background or a specific body part, while the UV channels H_u and H_v contain the corresponding U and V values of visible body parts respectively. The IUV maps H encode the Index, U, and V values individually for P body parts in a one-hot manner along the (1+P) ways. The Index values for body parts count from 1, and Index 0 is reserved for the background. For each body part, the UV space is independent so that the representation could be more fine-grained. The IUV annotation of the human body was first introduced in DenseReg [60] and DensePose [37]. Figs. 2(a)(b)(c) show the Index, U, and V values on the SMPL model as defined in DensePose [37].

Preparation of IUV Maps for 3D Human Pose Datasets. Currently, there is no 3D human pose dataset providing IUV annotations. In this work, for those datasets providing SMPL parameters with human images, we augment their annotations by rendering the corresponding ground-truth IUV maps based on the same IUV mapping protocol of DensePose [37]. Specifically, we first construct a template texture map from the IUV values of each vertex on the SMPL model, and then employ a renderer to generate IUV maps. As illustrated in Fig. 2(d), for each face in the triangulated mesh, the texture value used for rendering is a triplet vector denoting the corresponding Index, U, and V values. Then, given SMPL models, the corresponding IUV maps can be generated by existing rendering algorithms such as [61], [62]. Specifically, the renderer takes the template texture map and the 3D model as inputs and outputs a rendered image with the size of h_iuv × w_iuv × 3. Afterwards, the rendered image is reorganized into the shape of (1+P) × h_iuv × w_iuv × 3 by converting values into one-hot representations.
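The following sketch illustrates the final reorganization step, i.e., converting a rendered h_iuv × w_iuv × 3 IUV image into the one-hot (1+P) × h_iuv × w_iuv × 3 layout. The function name and the exact normalization of the Index channel are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rendered_iuv_to_onehot(iuv_image, num_parts=24):
    """Reorganize a rendered IUV image of shape (H, W, 3) into the
    (1 + P) x H x W x 3 layout described above. Channel 0 of the last axis
    holds the one-hot part Index, channels 1 and 2 hold the U and V values
    of the active part. The [0, 1] normalization of the Index channel is
    assumed, following Fig. 2(a)."""
    h, w, _ = iuv_image.shape
    part_index = torch.round(iuv_image[..., 0] * num_parts).long()   # (H, W), values in {0,...,P}
    one_hot = F.one_hot(part_index, num_classes=num_parts + 1)       # (H, W, 1+P)
    one_hot = one_hot.permute(2, 0, 1).float()                       # (1+P, H, W)

    maps = torch.zeros(num_parts + 1, h, w, 3)
    maps[..., 0] = one_hot                         # Index channels H_i
    maps[..., 1] = one_hot * iuv_image[..., 1]     # U values H_u of the active part
    maps[..., 2] = one_hot * iuv_image[..., 2]     # V values H_v of the active part
    return maps

# Example with a dummy 56 x 56 rendered IUV image.
print(rendered_iuv_to_onehot(torch.rand(56, 56, 3)).shape)  # torch.Size([25, 56, 56, 3])
```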

4 METHODOLOGY

As illustrated in Fig. 2(e), our DaNet decomposes the prediction task into one global stream for the camera and shape predictions and multiple local streams for joint pose predictions. The overall pipeline involves two consecutive stages, where the IUV maps are first estimated from a fully convolutional network and then taken as inputs for subsequent parameter predictions.

In the first stage, the IUV maps are estimated from global and local perspectives in consideration of the different sizes of the receptive fields required by the prediction of different parameters. In the second stage, the global and local IUV maps are used for different feature extraction and prediction tasks. The global features are extracted from global IUV maps and then directly used to predict camera and body shape. The rotation features are extracted from partial IUV maps and further fed into the aggregated refinement module before the final prediction of joint poses. During training, the part-based dropout is applied to the estimated IUV maps between the above two stages.

Overall, our objective function is a combination of three objectives:

L = L_inter + L_target + L_refine,   (1)


where L_inter is the objective for estimating the intermediate representations (Sec. 4.1), L_target is the objective for predicting the camera and SMPL parameters (Sec. 4.2), and L_refine is the objective involved in the aggregated refinement module (Sec. 4.3). In the following subsections, we will present the technical details and rationale of our method.

4.1 Global and Partial IUV Estimation

The first stage in our method aims to estimate the corresponding IUV maps from input images for subsequent prediction tasks. Specifically, a fully convolutional network is employed to produce K+1 sets of IUV maps, including one set of global IUV maps and K sets of partial IUV maps for the corresponding K body joints. The global IUV maps are aligned with the original image through up-sampling, while the partial IUV maps are centered around the body joints. Fig. 3 visualizes a sample of the global and partial IUV maps. The feature maps output by the last layer of the FCN are shared by the estimation tasks of both global and partial IUV maps. The estimation of the global IUV maps is quite straightforward since they can be obtained by simply feeding these feature maps into a convolutional layer. For the estimation of each set of partial IUV maps, a joint-centric RoI pooling is first performed on these feature maps to extract appropriate sub-regions, which results in K sets of partial feature maps. Then, the K sets of partial IUV maps are estimated independently from these partial feature maps. We now give details about the RoI pooling process for partial IUV estimation.

Joint-centric RoI Pooling. The pose parameters in the SMPL model represent the relative rotation of each body joint with respect to its parent in the kinematic tree. Hence, the perception of joint poses should individually focus on the corresponding body parts. In other words, globally zooming or translating the human in the image should have no effect on the pose estimation of body joints. Moreover, the ideal scale factors for the perception of joint poses should vary from one joint to another since the proportions of body parts are different. To this end, we perform joint-centric RoI pooling on feature maps for partial IUV estimation. Particularly, for each body joint, a sub-region of the feature maps is extracted and spatially transformed to a fixed resolution for subsequent partial IUV map estimation and joint pose prediction. In our implementation, the RoI pooling is accomplished by a Spatial Transformer Network (STN) [63]. In comparison with conventional STNs, the pooling process in our network is learned in an explicitly supervised manner.

As illustrated in Fig. 4(a), the joint-centric RoI pooling operations are guided by 2D joint positions so that each sub-region is centered around the target joint. Specifically, 2D joint heatmaps are estimated along with the global IUV maps in a multi-task learning manner, and 2D joint positions are retrieved from the heatmaps using the soft-argmax [64] operation. Without loss of generality, let j_k denote the position of the k-th body joint. Then, the center and scale parameters used for spatial transformation are determined individually for each set of partial IUV maps. Specifically, for the k-th set of partial IUV maps, the center c_k is the position of the k-th joint, while the scale s_k is proportional to the size of the foreground region, i.e.,

c_k = j_k,
s_k = α_k max(w_bbox, h_bbox) + δ,   (2)

where α_k and δ are two constants, and w_bbox and h_bbox denote the width and height of the foreground bounding box respectively.

Fig. 3. Visualization of (a) global, (b) partial, and (c) simplified partial IUV maps.

Fig. 4. Joint-centric RoI pooling. (a) The RoI pooling is implemented as an STN. (b) The evolution of the α_k values of different body joints over learning iterations.

In our implementation, the foreground is obtained from the part segmentation (i.e., the Index channels of the estimated IUV maps). Compared with our previous work [11], which calculates s_k from 2D joints, the s_k values determined by foreground regions here are more robust to 2D joint localization.

Note that the above constants α_k and δ can be handcrafted or learned in the STN by taking ground-truth IUV maps as inputs. For learned α_k values, Fig. 4(b) shows how the values of different body joints evolve over learning iterations. It can be observed that the α_k values are enlarged for some joints while shrunk for others, which provides more suitable RoI sizes for each body joint.

After obtaining the transformation parameters in Eq. 2, the feature maps extracted from the last layer of the fully convolutional network are spatially transformed to a fixed resolution and used to estimate the partial IUV maps, where the corresponding ground-truth ones are also extracted from the ground-truth global IUV maps using the same pooling process.
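A minimal PyTorch sketch of this pooling step is given below: joint positions are read off the heatmaps with a soft-argmax, and each joint-centric crop is realized with an affine spatial transformer (F.affine_grid / F.grid_sample). The function names, tensor layouts, and the normalized-coordinate convention are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def soft_argmax_2d(heatmaps):
    """Differentiable 2D joint positions from heatmaps (B, K, H, W),
    returned in normalized [-1, 1] coordinates."""
    b, k, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    ys = torch.linspace(-1, 1, h, device=heatmaps.device)
    xs = torch.linspace(-1, 1, w, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # (B, K)
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # (B, K)
    return torch.stack([x, y], dim=-1)        # (B, K, 2)

def joint_centric_roi_pool(features, centers, scales, out_size=56):
    """Crop one sub-region per joint with a spatial transformer.
    features: (B, C, H, W); centers: (B, K, 2) in [-1, 1]; scales: (B, K)."""
    b, c, h, w = features.shape
    k = centers.shape[1]
    theta = features.new_zeros(b * k, 2, 3)
    theta[:, 0, 0] = scales.reshape(-1)        # zoom in x
    theta[:, 1, 1] = scales.reshape(-1)        # zoom in y
    theta[:, :, 2] = centers.reshape(-1, 2)    # translate to the joint center
    grid = F.affine_grid(theta, (b * k, c, out_size, out_size), align_corners=False)
    feats = features.unsqueeze(1).expand(-1, k, -1, -1, -1).reshape(b * k, c, h, w)
    crops = F.grid_sample(feats, grid, align_corners=False)
    return crops.view(b, k, c, out_size, out_size)
```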

Simplified Partial IUV Maps. Considering that the pose of a body joint is only related to its adjacent body parts, we further introduce the simplified partial IUV maps by discarding the irrelevant body parts. For each set of partial IUV maps, we retain the specific channels corresponding to the body parts surrounding the target joint. By doing so, the resulting simplified partial IUV maps are much cleaner, which eliminates interference from irrelevant body parts. The partial IUV maps before and after the simplification are depicted in Fig. 3(b) and Fig. 3(c) respectively.

Loss Functions. A classification loss and several regression losses are involved in the training of this stage. For both global and partial IUV maps, the loss is calculated in the same manner and denoted as L_iuv. Specifically, a classification loss is imposed on the Index channels of the IUV maps, where a (1+P)-way cross-entropy loss is employed to classify a pixel as belonging to either the background or one among the P body parts. For the UV channels of the IUV maps, an L1-based regression loss is adopted, and it is only taken into account for foreground pixels. In other words, the estimated UV channels are first masked by the ground-truth Index channels before applying the regression loss. For the 2D joint heatmaps and 2D joint positions estimated for RoI pooling, an L1-based regression loss is adopted and denoted as L_roi. Overall, the


objective in the IUV estimation stage involves two main losses:

L_inter = λ_iuv L_iuv + λ_roi L_roi,   (3)

where λ_iuv and λ_roi are used to balance the two terms.
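A sketch of L_iuv under the conventions above is shown below: a (1+P)-way cross-entropy on the Index channels plus an L1 regression on the U and V channels masked by the ground-truth foreground parts. The tensor layouts and the equal weighting of the terms are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def iuv_loss(pred_index_logits, pred_u, pred_v, gt_index, gt_u, gt_v):
    """L_iuv sketch: (1+P)-way cross-entropy on the Index channels plus an
    L1 regression on U/V masked by the ground-truth foreground parts.

    pred_index_logits: (B, 1+P, H, W); pred_u, pred_v: (B, P, H, W)
    gt_index: (B, H, W) with labels in {0,...,P}; gt_u, gt_v: (B, P, H, W)."""
    loss_index = F.cross_entropy(pred_index_logits, gt_index)

    # One-hot foreground mask per part (background channel dropped).
    mask = F.one_hot(gt_index, pred_index_logits.shape[1]).permute(0, 3, 1, 2)[:, 1:].float()
    denom = mask.sum().clamp(min=1.0)
    loss_u = (mask * (pred_u - gt_u).abs()).sum() / denom
    loss_v = (mask * (pred_v - gt_v).abs()).sum() / denom
    return loss_index + loss_u + loss_v
```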

4.2 Camera, Shape and Pose Prediction

After obtaining the global and partial IUV maps, the camera and shape parameters would be predicted in the global stream, while the pose parameters would be predicted in the local streams.

The global stream consists of a ResNet [65] as the backbone network and a fully connected layer added at the end with 13 outputs, corresponding to the camera scale s ∈ R, translation t ∈ R², and the shape parameters β ∈ R^10. In the local streams, a tailored ResNet acts as the backbone network shared by all body joints and is followed by K residual layers for rotation feature extraction individually. For the k-th body joint, the extracted rotation features are refined (see Sec. 4.3) and then used to predict the rotation matrix R_k ∈ R^{3×3} via a fully connected layer. Here, we follow previous work [6], [7] and predict the rotation matrix representation of the pose parameters θ rather than the axis-angle representation defined in the SMPL model. An L1 loss is imposed on the predicted camera, shape, and pose parameters, and we denote it as L_smpl.

Following previous work [5], [6], [7], we also add an additional constraint and regression objectives for better performance. The predicted rotation matrices must lie on the manifold of rotation matrices. In our method, we impose an orthogonal constraint loss L_orth upon the predicted rotation matrices {R_k}_{k=0}^{K-1} to guarantee their orthogonality, i.e.,

L_orth = Σ_{k=0}^{K-1} ‖R_k R_k^T − I‖².   (4)
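A batched PyTorch sketch of Eq. (4) could look as follows; the reduction over the batch is an assumption, since the paper only specifies the sum over joints.

```python
import torch

def orthogonal_loss(rot_mats):
    """L_orth of Eq. (4): penalize the deviation of each predicted rotation
    matrix from orthogonality. rot_mats: (B, K, 3, 3)."""
    eye = torch.eye(3, device=rot_mats.device).expand_as(rot_mats)
    diff = rot_mats @ rot_mats.transpose(-1, -2) - eye
    per_joint = (diff ** 2).sum(dim=(-1, -2))   # squared Frobenius norm per matrix
    return per_joint.sum(dim=-1).mean()         # sum over joints, mean over the batch
```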

Given the predicted SMPL parameters, the performance could be further improved by adding supervision explicitly on the resulting model M(θ, β). Specifically, three L1-based loss functions are used to measure the difference between the ground-truth positions and the predicted ones. The corresponding losses are denoted as L_vert for vertexes on the 3D mesh, L_3Dkp for sparse 3D human keypoints, and L_reproj for the reprojected 2D human keypoints, respectively. For the sparse 3D human keypoints, the predicted positions are obtained via a pretrained linear regressor by mapping the mesh vertices to the 3D keypoints defined in human pose datasets. Overall, the objective in this prediction stage is the weighted sum of multiple losses:

L_target = λ_smpl L_smpl + λ_orth L_orth + λ_point (L_vert + L_3Dkp + L_reproj),   (5)

where λ_smpl, λ_orth, and λ_point are balance weights.

Part-based Dropout. Our approach learns the shape and pose from the IUV intermediate representation, which contains dense correspondences of the body parts. Inspired by previous work on data augmentation [59] and model regularization [56], [58], we propose a Part-based Dropout (PartDrop) strategy to drop out semantic information from intermediate representations during training. PartDrop has a dropping rate γ as the probability of dropping values in the estimated IUV maps. In contrast to other dropping out strategies such as Dropout [56] and DropBlock [58], the proposed PartDrop strategy drops features in contiguous regions at the granularity level of body parts. Specifically, for

Fig. 5. Comparison of different dropping out strategies. (a) Original IUV map. (b)(c)(d) PartDrop (ours), DropBlock [58], and Dropout [56] drop IUV values in part-wise, block-wise, and unit-wise manners, respectively. The corresponding binary masks are shown in the top row.

each training sample, the index subset I_drop of the body parts to be dropped is randomly selected from {1, 2, ..., P} with the probability of γ. Then, for both global and partial IUV maps, the estimated IUV values of the selected body parts are dropped out by setting the corresponding body parts to zeros:

H[p, :, :, :] = 0,  for p ∈ I_drop,   (6)

where H[p, :, :, :] denotes the IUV maps with part index p. Fig. 5 visualizes how PartDrop, DropBlock [58], and Dropout [56] drop values in part-wise, block-wise, and unit-wise manners. As observed, in comparison with DropBlock and Dropout, the proposed PartDrop can remove semantic information in a more structured manner, which consequently enforces the neural network to learn features from complementary body parts and improves its generalization.
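The following sketch implements the PartDrop operation of Eq. (6) on IUV maps in the one-hot layout used above. The per-sample sampling of the dropped part set is written out explicitly; the exact tensor layout is an assumption.

```python
import torch

def part_drop(iuv_maps, drop_rate=0.3, num_parts=24, training=True):
    """PartDrop sketch (Eq. 6): zero out whole body parts of the estimated
    IUV maps during training. iuv_maps: (B, 1+P, H, W, 3) in the one-hot
    layout used above; the same dropped part set would be applied to the
    global and partial maps of a sample."""
    if not training:
        return iuv_maps
    out = iuv_maps.clone()
    for b in range(out.shape[0]):
        drop = torch.rand(num_parts) < drop_rate     # select parts with probability gamma
        idx = torch.nonzero(drop).flatten() + 1      # part indices count from 1
        out[b, idx] = 0.0                            # H[p, :, :, :] = 0 for p in I_drop
    return out
```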

4.3 Rotation Feature Refinement

In our approach, the rotation features extracted in the local streams are aggregated to exploit spatial relationships among body joints. As illustrated in Fig. 6(a), the position-aided rotation feature refinement involves three consecutive steps, namely rotation feature collection, position feature refinement, and refined feature conversion. Specifically, the rotation features are first collected into the position feature space where the feature refinement is performed. After that, the rotation feature refinement is accomplished by converting the refined position features back to the rotation feature space. All three steps are performed by customized graph convolution layers. In particular, we consider the following graph-based convolution layer G(·), which employs one popular formulation of Graph Convolutional Networks as proposed by Kipf et al. [43]:

Z_out = G(A, Z_in) = σ(Â Z_in W),   (7)

where Z_in and Z_out are the input and output features respectively, σ(·) is the activation function, W denotes the parameters of the convolution kernels, and Â denotes the normalized version of the graph adjacency matrix A, i.e., Â = D^{-1/2} A D^{-1/2} if A is a symmetric matrix, and otherwise Â = D^{-1} A, where D is the diagonal node degree matrix of A with D_ii = Σ_j A_ij. For simplicity, we also refer to the graph with adjacency matrix A as graph A.
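For reference, a minimal graph convolution layer in the form of Eq. (7) might be implemented as below; the choice of ReLU activation and the clamping of zero degrees are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    """Graph convolution of Eq. (7): Z_out = sigma(A_hat Z_in W).
    A minimal sketch; the paper's exact layer may differ in details."""

    def __init__(self, adjacency, in_dim, out_dim):
        super().__init__()
        A = adjacency.float()
        deg = A.sum(dim=1)
        if torch.allclose(A, A.t()):
            d = torch.diag(deg.clamp(min=1).pow(-0.5))
            A_hat = d @ A @ d                 # D^{-1/2} A D^{-1/2}
        else:
            d = torch.diag(1.0 / deg.clamp(min=1))
            A_hat = d @ A                     # D^{-1} A
        self.register_buffer("A_hat", A_hat)
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, z):                      # z: (B, K, C_in)
        return self.act(self.A_hat @ self.linear(z))
```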

Step 1: Rotation Feature Collection. Note that the rotation

of each body joint could be viewed as sequential data along the kinematic chain. This is inspired by the fact that the human body articulates in a recurrent manner according to the kinematic tree


shown in Fig. 6(b). The position of a specific body joint can be calculated from the collection of the relative rotations and bone lengths of those joints belonging to the same kinematic chain. At the feature level, we propose to learn the mapping from the rotation feature space to the position feature space. To that end, one graph convolution layer is customized to gather information from body joints along the kinematic chain and learn such a mapping. Formally, let X ∈ R^{K×C} denote the rotation features extracted from the K sets of partial IUV maps, with C being the feature dimension. The position features Y ∈ R^{K×C} of the K joints are obtained by feeding X to the graph convolution, i.e.,

Y = G(A^{r2p}, X),   (8)

where the graph with adjacency matrix A^{r2p} is customized as a collection graph for mapping rotation features into the position feature space, in which A^{r2p}_{ij} = 1 if the j-th joint is one of the ancestors of the i-th joint along the kinematic chain, and A^{r2p}_{ij} = 0 otherwise. The adjacency matrix A^{r2p} of the collection graph is depicted in Fig. 6(c).
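The collection graph itself can be constructed directly from the kinematic tree. The sketch below assumes the standard SMPL joint ordering and parent list, which the paper does not spell out explicitly.

```python
import torch

# Parents of the K = 24 SMPL joints in the kinematic tree (-1 marks the root).
# The standard SMPL ordering is assumed here.
SMPL_PARENTS = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8,
                9, 9, 9, 12, 13, 14, 16, 17, 18, 19, 20, 21]

def collection_graph(parents=SMPL_PARENTS):
    """Build A^{r2p} of Eq. (8): A[i, j] = 1 if joint j is an ancestor of
    joint i along the kinematic chain (a sketch of the graph construction)."""
    k = len(parents)
    A = torch.zeros(k, k)
    for i in range(k):
        p = parents[i]
        while p != -1:            # walk up the chain collecting ancestors
            A[i, p] = 1.0
            p = parents[p]
    return A

A_r2p = collection_graph()
print(A_r2p.shape, int(A_r2p.sum()))   # (24, 24) and the total number of ancestor links
```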

Step 2: Position Feature Refinement. Since there are strong spatial correlations among neighboring body joints, utilizing such structured constraints could effectively improve the features learned at each joint. Towards this goal, a graph-based convolution network is employed to exploit spatial relationships between joints. Specifically, the position features Y are fed into L graph convolution layers with the following layer-wise formulation:

Y^{(l)} = G(Ā^{rf}, Y^{(l−1)}),   (9)

where Y^{(l)} denotes the position features obtained from the l-th layer with Y^{(0)} = Y, and the graph with adjacency matrix Ā^{rf} = I + A^{rf} serves as a refinement graph for feature refinement, in which A^{rf}_{ij} = 1 if the i-th and j-th joints are neighboring, and A^{rf}_{ij} = 0 otherwise. After the graph convolutions, the refined position features Ỹ are obtained by adding Y^{(L)} to the original position features Y in a residual manner, i.e., Ỹ = Y + Y^{(L)}. Fig. 6(d) shows an example of the adjacency matrix A^{rf} which considers both one-hop and two-hop neighbors. Note that A^{rf} could have various forms according to the neighbor definition of body joints.

Inspired by previous work [50], [66], we also add a learnable edge weighting mask on the graph convolution of this step, since messages from different joints should have different contributions to the feature refinement of the target joint. In this way, the adjacency matrix Ā^{rf} is improved as

Ā^{rf} = I + M ∘ A^{rf},   (10)

where ∘ denotes the element-wise product, and M ∈ [0, 1]^{K×K} is the learnable edge weighting matrix serving as an attention mask of the graph to balance the contributions of neighboring features to the target feature.
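A sketch of the weighted refinement layer of Eqs. (9) and (10) is given below; the sigmoid parameterization of M and the row normalization are illustrative choices, not necessarily the authors' exact design.

```python
import torch
import torch.nn as nn

class WeightedRefineLayer(nn.Module):
    """Refinement graph convolution of Eqs. (9)-(10): neighbors contribute
    through a learnable edge-weighting mask M in [0, 1]. A sketch only."""

    def __init__(self, neighbor_adj, feat_dim):
        super().__init__()
        self.register_buffer("A_rf", neighbor_adj.float())   # A^{rf}: 1 for neighboring joints
        self.mask_logits = nn.Parameter(torch.zeros_like(self.A_rf))
        self.linear = nn.Linear(feat_dim, feat_dim, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, y):                                     # y: (B, K, C)
        k = self.A_rf.shape[0]
        mask = torch.sigmoid(self.mask_logits)                # M in [0, 1]^{K x K}
        A_bar = torch.eye(k, device=y.device) + mask * self.A_rf   # Eq. (10)
        deg = A_bar.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return self.act((A_bar / deg) @ self.linear(y))        # row-normalized propagation

# Usage: stack L such layers and add the result back to Y (Y_tilde = Y + Y^(L)).
```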

Step 3: Refined Feature Conversion. The last step of refinement is to convert the refined features back to the original rotation feature space. Since the rotation and position of body joints are two mutually related representations of the 3D human pose, after the refinement of the position features, the rotation features can be refined accordingly¹. Specifically, for the k-th body joint, its rotation features can be refined by aggregating messages from the refined position features of three consecutive body joints, i.e., the joint itself and its parent and child joints.

1. Strictly speaking, the joint rotation status cannot be fully retrieved from the joint positions due to the fewer DoFs specified in position-based poses. This issue is mild at the feature level since features could be more redundant.


Fig. 6. Illustration of the aggregated refinement module. (a) Three steps of the proposed refinement strategy. (b) The kinematic tree with K = 24 joints in the SMPL model. The pelvis joint with index 0 is the root node of the tree. Joints belonging to the same kinematic chain are linked by lines with the same color. (c)(d)(e) Adjacency matrices of the graphs used in the three steps for feature collection, refinement, and conversion.

Similar to the first step, the mapping from position features to rotation features is also learned via a graph-based convolution layer, where the difference lies in the adjacency matrix of the graph. Formally, the refined position features Ỹ are fed into the graph to obtain features in the rotation space, resulting in the refined rotation features X̃ for the final prediction of the joint pose parameters, i.e.,

X̃ = G(Ā^{p2r}, Ỹ),   (11)

where the graph with adjacency matrix Ā^{p2r} = I + A^{p2r} is customized as a conversion graph for mapping position features to rotation features, in which A^{p2r}_{ij} = 1 if the j-th joint is the parent or child of the i-th joint, and A^{p2r}_{ij} = 0 otherwise. The adjacency matrix A^{p2r} of the conversion graph is depicted in Fig. 6(e).

Supervision in Refinement. The rotation and position feature spaces are built under corresponding supervisions during training. As illustrated in Fig. 6(a), the rotation features X and X̃ are used to predict joint rotations, while the position features Y and Ỹ are used to predict joint positions. L1-based rotation and position supervisions are imposed on these predictions correspondingly, which compose the objective L_refine involved in the refinement procedure. Note that these intermediate predictions are unnecessary during testing.
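Putting the three steps together, the tensor-level walk-through below chains collection, residual refinement, and conversion, with single stand-in weight matrices W_c, W_r, and W_p replacing the learned graph-convolution layers. It is a simplified sketch of the data flow under the equations above, not the full module.

```python
import torch

def normalize(A):
    """Row-normalize a (possibly asymmetric) adjacency matrix."""
    return A / A.sum(dim=1, keepdim=True).clamp(min=1e-6)

def refine_rotation_features(X, A_r2p, A_rf, A_p2r, W_c, W_r, W_p):
    """Minimal walk-through of the three refinement steps (Eqs. 8-11).
    X: (B, K, C) rotation features; A_*: (K, K) adjacency matrices;
    W_*: (C, C) stand-in weight matrices."""
    Y = torch.relu(normalize(A_r2p) @ X @ W_c)                       # Step 1: collect -> position space
    Y_l = torch.relu(normalize(torch.eye(A_rf.shape[0]) + A_rf) @ Y @ W_r)
    Y_tilde = Y + Y_l                                                # Step 2: residual refinement
    X_tilde = torch.relu(normalize(torch.eye(A_p2r.shape[0]) + A_p2r) @ Y_tilde @ W_p)
    return X_tilde                                                   # Step 3: back to rotation space

# During training, intermediate rotation/position predictions from X, X_tilde,
# Y, and Y_tilde are supervised (L_refine); they are not needed at test time.
```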

5 EXPERIMENTS

5.1 Implementation Details

The FCN for IUV estimation in our framework adopts the architecture of HRNet-W48 [67], which is one of the most recent state-of-the-art networks for dense estimation tasks. The FCN receives a 224×224 input and produces 56×56 feature maps for estimating the global and local IUV maps, which have the same resolution of 56×56. Two ResNet-18 networks [65] are employed as the backbone networks for global and rotation feature extraction respectively.


During testing, due to the fundamental depth-scale ambiguity, we follow previous work [5], [7] to center the person within the image and perform scaling such that the inputs have the same setting as training. Our experiments are implemented in PyTorch [68] and run with a TITAN Xp GPU. More details can be found in the supplementary material and the publicly available code.

5.2 Datasets and Evaluation Metrics

Human3.6M. Human3.6M [69] is a large-scale dataset which consists of 3.6 million video frames captured in a controlled environment, and it is currently the most commonly used benchmark dataset for 3D human pose estimation. Kanazawa et al. [5] generated the ground-truth SMPL parameters by applying MoSH [70] to the sparse 3D MoCap marker data. Following the common protocols [5], [6], [27], we use five subjects (S1, S5, S6, S7, S8) for training and two subjects (S9, S11) for evaluation. We also down-sample the original videos from 50 fps to 10 fps to remove redundant frames, resulting in 312,188 frames for training and 26,859 frames for testing.

UP-3D. UP-3D [3] is a dataset collected from existing 2D human pose datasets (i.e., LSP [71], LSP-extended [72], MPII HumanPose [73], and FashionPose [74]), containing 5,703 images for training, 1,423 images for validation, and 1,389 images for testing. The SMPL parameter annotations of these real-world images are augmented in a semi-automatic way by using an extended version of SMPLify [3].

DensePose-COCO. DensePose-COCO [37] provides the dense correspondences from 2D images to the 3D surface of the human body model for 50K humans appearing in the COCO dataset [75]. Different from our rendered IUV maps, the correspondence annotations in DensePose-COCO only consist of approximately 100-150 points per person, which are a sparse subset of the foreground pixels of human images. In our experiments, we discard those persons without 2D keypoint annotations, resulting in 39,210 samples for training and 7,297 samples for evaluation.

Evaluation Metrics. Following previous work [6], [15], [33], for evaluating the reconstruction performance, we adopt the mean Per-vertex Error (PVE) as the primary metric, which is defined as the average point-to-point Euclidean distance between the predicted model vertices and the ground-truth model vertices. Besides the PVE metric, we further adopt PVE-S and PVE-P as secondary metrics to separately evaluate the shape and pose prediction results. The PVE-S computes the per-vertex error with the pose parameters of the ground-truth and predicted models set to zeros (i.e., models under the rest pose [1]), while the PVE-P computes the analogous per-vertex error with the shape parameters set to zeros. For the Human3.6M dataset, the widely used Mean Per Joint Position Error (MPJPE) and the MPJPE after rigid alignment of the prediction with the ground truth using Procrustes Analysis (MPJPE-PA) are also adopted to quantitatively evaluate the 3D human pose estimation performance. The above metrics will be reported in millimeters (mm) by default.
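For reference, the two pose metrics can be computed as sketched below; this follows the standard similarity-Procrustes formulation and is not necessarily the authors' exact evaluation script. The same point-to-point error applied to mesh vertices gives the PVE.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error, in the same units as the inputs.
    pred, gt: (K, 3) arrays of 3D joint positions."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjpe_pa(pred, gt):
    """MPJPE after rigid (similarity) alignment via Procrustes analysis."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        s[-1] *= -1
        R = Vt.T @ U.T
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ R.T + mu_g  # best similarity alignment of pred onto gt
    return mpjpe(aligned, gt)
```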

5.3 Comparison with State-of-the-art Methods

Comparison on the Indoor Dataset. We evaluate the reconstruction as well as the 3D human pose estimation performance for quantitative comparison on Human3.6M. Table 1 reports the comparison results with previous methods that output more than sparse 3D keypoint positions. Among them, HMR [5] adopts a single CNN and an iterative regression module to produce all

Table 1
Quantitative comparison with state-of-the-art methods on the Human3.6M dataset.

Method                 PVE     MPJPE   MPJPE-PA
Zhou et al. [52]       -       107.3   -
Tung et al. [4]        -       -       98.4
SMPLify [2]            202.0   -       82.3
SMPLify++ [3]          -       -       80.7
Pavlakos et al. [6]    155.5   -       75.9
HMR [5]                -       88.0    56.8
NBF [7]                -       -       59.9
Xiang et al. [36]      -       65.6    -
Arnab et al. [17]      -       77.8    54.3
CMR [38]               -       -       50.1
HoloPose [14]          -       64.3    50.6
TexturePose [19]       -       -       49.7
DenseRaC [39]          -       76.8    48.0
DaNet-LSTM [11]        75.1    61.5    48.6
Ours                   66.5    54.6    42.9

Table 2
Quantitative comparison of PVE with state-of-the-art methods on the UP-3D dataset.

Method                 LSP     MPII    FashionPose   Full
SMPLify++ [3]          174.4   184.3   108.0         169.8
HMR [5]                -       -       -             149.2
NBF [7]                -       -       -             134.6
Pavlakos et al. [6]    127.8   110.0   106.5         117.7
BodyNet [33]           102.5   -       -             -
Rong et al. [15]       -       -       -             122.2
DaNet-LSTM [11]        90.4    83.0    61.8          83.7
Ours                   88.5    82.1    60.8          82.3

parameters. Pavlakos et al. [6] decompose the shape and pose prediction tasks, while their pose parameters are predicted from 2D joint positions. NBF [7] adopts segmentation as the intermediate representation and learns all parameters from it. CMR [38] directly regresses 3D meshes with a graph-based convolutional network. All these methods except [14] estimate pose parameters through a single stream, and our method outperforms them significantly. HoloPose [14] predicts pose parameters using a part-based model and is the most competitive one to ours.

Comparison on In-the-wild Datasets. Reconstructing the 3D human model from real-world images is much more challenging due to factors such as extreme poses and heavy occlusions. In our network, the aggregated refinement module and the PartDrop training strategy are proposed to enhance its robustness and generalization. We conduct evaluation experiments on UP-3D and DensePose-COCO to demonstrate the efficacy of our method.

For evaluation on UP-3D, we report quantitative results in terms of the PVE of the reconstructed meshes in Table 2. Our method outperforms previous methods across all subsets of UP-3D by a large margin. Our closest competitor BodyNet [33] has a PVE value of 102.5 on LSP, while ours is 88.5. Moreover, BodyNet [33] uses both 2D and 3D estimations as intermediate representations, which is much more time-consuming than ours. Reconstruction results on UP-3D are visualized in Fig. 7. Compared with other methods, our DaNet produces more satisfactory results under challenging scenarios.

For evaluation on DensePose-COCO, our model is trainedon the mixture of training data from both DensePose-COCO


Fig. 7. Qualitative comparison of reconstruction results on the UP-3D dataset (rows: Image, HMR [5], NBF [7], CMR [38], Ours).

and Human3.6M datasets. We only apply PartDrop to samples from Human3.6M, considering that PartDrop somewhat imitates the occlusions that already commonly occur in DensePose-COCO samples. Since no ground truth human model is provided in DensePose-COCO, we only perform qualitative evaluations on this dataset. We show reconstruction examples in Fig. 8 and make comparisons with HMR [5] and Rong et al. [15], the two methods trained on COCO with publicly available code. It can be observed that our method generalizes better in real-world scenarios, producing more accurate and better-aligned reconstructions. Our method can produce reasonable results even in cases of extreme poses, occlusions, and incomplete human bodies, while competitors fail or produce visually displeasing results.

Running Time. During inference, our method takes about 93 ms on a Titan Xp GPU, where the IUV estimation accounts for 60 ms and the parameter prediction accounts for the remaining 33 ms. The running time and platform of other state-of-the-art methods are included in Table 3. Numbers are obtained from the respective literature or evaluated using the official implementations. Overall, our method has a moderate computation cost among learning-based reconstruction methods.

Table 3
Comparison of running time (ms) with state-of-the-art methods.

Method               Run Time   GPU
HMR [5]              40         GTX 1080 Ti
Pavlakos et al. [6]  50         Titan X
NBF [7]              110        Titan Xp
BodyNet [33]         280        Modern GPU
CMR [38]             33         RTX 2080 Ti
DenseRaC [39]        75         Tesla V100
Ours                 93         Titan Xp

5.4 Ablation Study

To evaluate the effectiveness of the key components proposed in our method, we conduct ablation experiments on Human3.6M under various settings. We begin with a baseline network obtained by removing the local streams, the aggregated refinement module, and the PartDrop strategy from our method. In other words, the baseline simply uses the global stream of DaNet to predict all parameters. Moreover, it adopts ResNet-101 [65] as the backbone network for parameter prediction so that the model size of the baseline is comparable to that of the networks used in the ablation experiments.

5.4.1 Intermediate Representation

To show the superiority of adopting the IUV map as the intermediate representation, our baseline network is also trained with its alternatives for the shape and pose prediction tasks. Specifically, the IUV maps are replaced by the convolutional feature maps output from the last layer of the FCN or by the part segmentation (i.e., the Index channels of the IUV maps). Note that there is actually no intermediate representation for the approach adopting feature maps as the "intermediate representation". As observed from Table 4, the approach adopting IUV maps as intermediate representations achieves the best performance. Compared with the alternative representations, adopting IUV maps reduces the PVE value from 98.9 (feature maps) and 90.4 (part segmentation) to 87.8. In our experiments, we found that the approach without any intermediate representation is more prone to overfitting to the training set.

Effect of IUV Estimation Quality. We further conduct experiments to investigate the impact of the quality of dense estimation on the final shape and pose prediction performance. To this end, different architectures or initializations of the IUV estimators are adopted in ablation experiments to produce IUV maps of different qualities. Specifically, the IUV estimator adopts the pose estimation networks [76] built upon ResNet-50 and ResNet-101 as alternative architectures, and these models are pretrained


Fig. 8. Qualitative comparison of reconstruction results on the DensePose-COCO dataset (rows: Image, HMR [5], Rong et al. [15], Ours).

Table 4
Performance of approaches adopting different intermediate representations on the Human3.6M dataset.

Method         PVE     MPJPE   MPJPE-PA
ConvFeat       98.9    82.5    60.3
Segmentation   90.4    74.6    57.1
IUV            87.8    71.6    55.4

on ImageNet [77] or COCO [75]. Following the protocol of DensePose [37], we measure the quality of dense correspondence estimations via the pointwise evaluation [37], where the area under the curve at the threshold of 10 cm (i.e., AUC10) is adopted as the metric. Fig. 9(a) reports the reconstruction results of the ablation approaches versus the quality of their IUV estimations. As can be seen, networks with better IUV estimations consistently achieve better reconstruction performance. To investigate the performance upper bound of adopting IUV maps as intermediate representations, we also report the results of the approach using ground truth IUV maps as input, with the IUV estimator removed. As shown in the rightmost result of Fig. 9(a), the approach learning from the ground truth IUV maps achieves much better performance than learning from the estimated ones output by the networks, which means that there is still a large margin for improvement when adopting IUV maps as intermediate representations.
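The following is a minimal sketch of how an AUC-at-threshold metric of this kind can be computed, assuming the per-annotated-point surface errors between estimated and ground-truth correspondences are already available; the threshold sampling and the hypothetical helper for computing errors are assumptions, not the exact DensePose protocol.

```python
import numpy as np

def auc_at_threshold(point_errors_cm, max_threshold_cm=10.0, steps=100):
    """Area under the ratio-of-correct-points curve, normalized to [0, 1].

    point_errors_cm: 1D array of per-annotated-point surface errors (cm)
                     between estimated and ground-truth correspondences.
    """
    thresholds = np.linspace(0.0, max_threshold_cm, steps)
    # Fraction of points whose error falls below each threshold.
    ratios = [(point_errors_cm < t).mean() for t in thresholds]
    # Normalize the area by the threshold range so AUC10 lies in [0, 1].
    return np.trapz(ratios, thresholds) / max_threshold_cm

# Example usage (compute_surface_errors is a hypothetical helper):
# errors = compute_surface_errors(pred_correspondences, gt_correspondences)
# print(100.0 * auc_at_threshold(errors, 10.0))   # AUC10 in percent
```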

In contrast to the concurrent work [15], [38], [39], which obtains IUV maps from the pretrained network of DensePose [37], our approach augments the annotations of Human3.6M with rendered IUV maps so that our IUV estimator can be trained on Human3.6M with dense supervision, which yields a higher quality of IUV estimation. To verify this, the IUV estimator is first trained on DensePose-COCO or Human3.6M, and then frozen to generate IUV maps for training the reconstruction task on Human3.6M. As can be seen from Fig. 9(b), the approaches with IUV estimators trained on Human3.6M consistently achieve better performance on both the IUV estimation and the model reconstruction tasks.

Fig. 9. Reconstruction performance on Human3.6M versus the IUV estimation quality (AUC10, %) for approaches adopting IUV estimators with different architectures and training strategies. (a) Higher IUV estimation quality generally contributes to better reconstruction performance; IUV estimators are all trained on Human3.6M but initialized with different models (Res50, Res101, and HR, each pretrained on ImgNet or COCO, plus IUV_GT). (b) The IUV estimators trained on Human3.6M with dense supervision have higher IUV estimation quality; IUV estimators are all pretrained on COCO and then trained on different datasets (Res50, Res101, and HR, each trained on DP or H36M). Different IUV estimators are denoted as †(⋆) or †[∗], where † is the architecture, and ⋆ and ∗ denote the pretraining and training datasets, respectively. IUV_GT stands for taking ground-truth IUV as input. ImgNet, DP, and H36M abbreviate ImageNet, DensePose-COCO, and Human3.6M, respectively.

5.4.2 Decomposed Perception

The decomposed perception provides fine-grained information for detailed pose estimation. To validate the effectiveness of such a design, we report the performance of the approaches using one stream and multiple streams in Table 5, where D-Net denotes the variant of our DaNet without the aggregated refinement module and the PartDrop strategy. Results in PVE-S and PVE-P are also reported in Table 5 to study the efficacy of the decomposed design on the shape and pose predictions separately. It can be seen that the reconstruction performance metric PVE is actually dominated by the PVE-P metric. Comparison of the first and second rows in Table 5 shows that using multiple streams has barely any effect on the shape prediction but brings a significant improvement on the pose prediction (i.e., the PVE-P value drops by more than 14%). We also report results to validate the use of different ratios αk and the simplification of partial IUV maps. In


Fig. 10. Reconstruction performance (per-vertex error, mm) of the ablation approaches (Baseline, +Decomposition, +PartDrop, +Aggregation) across different actions on the Human3.6M dataset.

Fig. 11. Comparison of the average per-vertex error (color-coded from 0 cm to 15 cm) over the model surface for ablation approaches on the Human3.6M dataset. (a) The baseline approach using one stream only. (b) The approach using multiple streams for decomposed perception. (c) The approach using decomposed perception and the PartDrop strategy. (d) Our final approach with the aggregated refinement.

the 3rd and 4th rows of Table 5, D-Net-ES adopts equal scales with all αk set to 0.5, while D-Net-AP adopts partial IUV maps containing all body parts. It can be seen that such modifications degrade the performance, which is due to two facts: (i) the proportions of body parts are different, and (ii) the pose status of different body joints is relatively independent, and involving irrelevant body parts could disturb the inference of the target joint poses.
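As a concrete illustration of the partial IUV maps consumed by the local streams, the following is a minimal sketch under our reading of the design: for each body joint, a square window centered at its estimated 2D position is sampled from the full IUV map, with a side length proportional to a per-joint ratio αk. The crop-size convention, output resolution, and function name are assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def crop_partial_iuv(iuv, joints_2d, alphas, out_size=56):
    """Crop one partial IUV map per joint from the full IUV map.

    iuv:       (C, H, W) estimated IUV map (assumed square, H == W).
    joints_2d: (K, 2) estimated 2D joint positions in pixel coordinates.
    alphas:    (K,) per-joint ratios; the crop side is alpha_k * H (assumption).
    Returns a (K, C, out_size, out_size) tensor of partial IUV maps.
    """
    C, H, W = iuv.shape
    crops = []
    for (x, y), alpha in zip(joints_2d.tolist(), alphas.tolist()):
        half = 0.5 * alpha * H
        # Sampling grid over the square window around the joint.
        ys = torch.linspace(y - half, y + half, out_size)
        xs = torch.linspace(x - half, x + half, out_size)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        # grid_sample expects (x, y) coordinates normalized to [-1, 1].
        grid = torch.stack([2 * gx / (W - 1) - 1, 2 * gy / (H - 1) - 1], dim=-1)
        crops.append(F.grid_sample(iuv[None], grid[None], align_corners=True)[0])
    return torch.stack(crops)
```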

To visualize the reconstruction performance on different body areas, Fig. 11 depicts the average per-vertex error with respect to the surface areas of the human model. As shown in Fig. 11(a), for the baseline network, the per-vertex errors of limb parts (hands, feet) are much higher than those of the torso. By comparing Figs. 11(a) and 11(b), we can conclude that our decomposed perception design alleviates this issue and achieves much better reconstruction performance on limb parts. Reconstruction performance across different actions on Human3.6M is also reported in Fig. 10 for comprehensive evaluation. We can see that the decomposed perception design reduces reconstruction errors consistently for all actions.

Table 5
Performance of approaches using different perception strategies on the Human3.6M dataset.

Method      PVE    PVE-S   PVE-P   MPJPE   MPJPE-PA
Baseline    87.8   38.0    76.3    71.6    55.4
D-Net       74.3   36.3    64.0    61.8    48.5
D-Net-ES    76.1   36.6    65.5    63.1    49.8
D-Net-AP    76.8   36.8    65.8    63.4    49.5

5.4.3 Part-based Dropout

The proposed Part-based Dropout (PartDrop) strategy drops IUV values in contiguous regions at the granularity of body parts. Such a dropping strategy effectively regularizes the neural network by removing semantic information from the intermediate representations. In this subsection, we conduct experiments to validate its effectiveness and evaluate the impact of the dropping rate on the reconstruction performance.
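The following is a minimal sketch of how PartDrop can be realized on an IUV map, assuming the part index channel tells which body part each foreground pixel belongs to; the greedy part-selection loop and the function name are our assumptions, not the exact implementation.

```python
import torch

def part_drop(iuv, part_index, drop_rate=0.3):
    """Drop the IUV values of randomly selected body parts during training.

    iuv:        (C, H, W) intermediate IUV representation.
    part_index: (H, W) integer map assigning each foreground pixel to a
                body part (0 denotes background).
    drop_rate:  target fraction of foreground pixels to drop (gamma).
    """
    part_ids = part_index.unique()
    part_ids = part_ids[part_ids > 0]              # foreground parts only
    # Randomly pick whole parts until roughly `drop_rate` of the foreground
    # pixels are covered; dropping entire parts keeps the removal contiguous.
    perm = part_ids[torch.randperm(len(part_ids))]
    foreground = (part_index > 0).sum().float()
    mask = torch.ones_like(part_index, dtype=torch.bool)
    dropped = 0.0
    for pid in perm:
        if dropped / foreground >= drop_rate:
            break
        region = part_index == pid
        mask &= ~region
        dropped += region.sum().float()
    return iuv * mask.unsqueeze(0)                  # zero out dropped parts
```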

To validate the superiority of our PartDrop strategy, we adopt DropBlock [58] and Dropout [56] as alternative strategies to drop values from the intermediate representations during training. For DropBlock, following the setting of [58], the size of the block to be dropped is set to 7 in our experiments. Fig. 12 reports the shape and pose reconstruction results of the different strategies across different dropping rates. For fair comparison, only the foreground pixels are involved in counting the dropping rate. It can be seen that the performance gains brought by the dropping strategies mainly come from the pose prediction task. Among the three strategies, Dropout is the worst, and its performance deteriorates quickly as the dropping rate increases. DropBlock works better than Dropout and brings marginal gains when the dropping rate is less than 20%. The proposed PartDrop clearly outperforms its counterparts and achieves the best results at a dropping rate around 30%. The above comparison of unit-wise, block-wise, and part-wise dropping strategies suggests that removing features in a structured manner is crucial to our reconstruction task, and PartDrop performs best among them. The efficacy of PartDrop can also be validated from the reconstruction error reduction shown in Fig. 10 and Fig. 11(c).

Fig. 12. Comparison of reconstruction performance for approaches using different dropping strategies (Dropout, DropBlock, PartDrop) on the Human3.6M dataset. (a), (b), and (c) report results with the PVE, PVE-S, and PVE-P metrics to reveal the quality of the full model recovery, shape recovery, and pose recovery across different dropping rates, respectively.


Fig. 13. Example results of approaches without refinement, or using the direct / position-aided refinement strategies (rows: Image, w/o Ref., Direct Ref., Pos.-aided Ref.).

Table 6
Performance of approaches using different feature refinement strategies on the Human3.6M dataset.

Refinement Strategy    PVE    MPJPE   MPJPE-PA
w/o Ref.               71.7   59.1    46.1
Direct Ref.            70.3   58.1    45.5
Pos.-implicit Ref.     69.2   56.5    44.7
Pos.-aided Ref.        66.5   54.6    42.9

5.4.4 Aggregated Refinement

Our aggregated refinement module is proposed to impose spatial structure constraints upon the rotation-based pose features. As observed from Fig. 11(d) and Fig. 10, the aggregation in DaNet considerably reduces the reconstruction errors across all surface areas and human actions.

A straightforward strategy to refine the features would be to conduct refinement between the rotation features directly. In such a direct refinement strategy, the first and third steps of our refinement procedure are removed, and the rotation features are directly refined by the graph convolution layers of the second step. The features output from the last refinement layer are added to the original rotation features in a residual manner and then used to predict the joint rotation status. For fair comparison, the number of refinement layers in the direct strategy is equal to the number of layers involved in the three steps of the position-aided strategy.

Rotation Feature Space vs. Position Feature Space. The proposed position-aided refinement strategy performs refinement in the position feature space instead of the rotation feature space. The graphs Ar2p and Ap2r of the first and last refinement steps are customized to connect the rotation and position feature spaces. The graph Ar2p collects rotation features into the position feature space, while the graph Ap2r converts position features back to the rotation feature space. To validate their functions, we discard the position supervision from the objective Lrefine during refinement. We refer to this strategy as the position-implicit refinement strategy, since the position feature space is built in an implicit manner. The only difference between the direct and position-implicit refinement strategies is that, in the latter one,

Fig. 14. Correlation matrices of the features extracted from the (a) rotation, (b) implicit position, and (c) position feature spaces (values range from 0 to 1).

Table 7
Ablation study of using the learnable graph edge and PartDrop strategies on the Human3.6M dataset.

Method                 PVE    MPJPE   MPJPE-PA
D-Net                  74.3   61.8    48.5
  + PartDrop           71.7   59.1    46.1
D-Net+Direct           72.1   59.4    46.9
  + LearntEdge         72.7   59.6    47.0
    + PartDrop         70.3   58.1    45.5
D-Net+Pos.-aided       70.8   57.1    45.9
  + LearntEdge         68.9   55.8    44.9
    + PartDrop         66.5   54.6    42.9

there are two mapping operations performed before and after the refinement. We report the results of the approaches using the direct, position-implicit, and position-aided strategies in Table 6 for comparison. It can be seen that the position-implicit strategy achieves inferior results to the position-aided strategy but better results than the direct strategy, which means that the implicit position space still works better than the rotation space for feature refinement. Example results of the approaches using the direct or position-aided refinement strategy are also depicted in Fig. 13 for comparison. We can see that the position-aided refinement helps to handle challenging cases and produces more realistic and better-aligned results, while the direct refinement brings marginal to no improvement.
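The following is a minimal sketch of the three-step position-aided refinement under our reading of the description above: rotation features are mapped into a position feature space via the graph Ar2p, refined there by graph convolution layers, and mapped back through Ap2r with a residual connection. The layer count, feature dimension, activation, and module names are assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """A basic graph convolution: H' = act(A_hat @ H @ W)."""
    def __init__(self, adj, in_dim, out_dim):
        super().__init__()
        self.register_buffer("adj", adj)          # (J_out, J_in), row-normalized
        self.linear = nn.Linear(in_dim, out_dim)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                         # x: (B, J_in, in_dim)
        return self.act(self.adj @ self.linear(x))

class PositionAidedRefine(nn.Module):
    """Refine rotation features via an auxiliary position feature space."""
    def __init__(self, A_r2p, A_refine, A_p2r, feat_dim=128):
        super().__init__()
        self.to_pos = GraphConv(A_r2p, feat_dim, feat_dim)       # step 1
        self.refine = nn.ModuleList(                              # step 2
            [GraphConv(A_refine, feat_dim, feat_dim) for _ in range(3)])
        self.to_rot = GraphConv(A_p2r, feat_dim, feat_dim)       # step 3

    def forward(self, rot_feat):                  # rot_feat: (B, J, feat_dim)
        pos_feat = self.to_pos(rot_feat)          # rotation -> position space
        for layer in self.refine:                 # refinement among joints
            pos_feat = layer(pos_feat)
        # Map back and fuse with the original rotation features residually.
        return rot_feat + self.to_rot(pos_feat)
```

In this sketch, the direct refinement variant corresponds to keeping only the step-2 layers and applying them to the rotation features themselves.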

The reason behind the inferior performance of the direct refinement is that the correlation between rotation features is weak, so the messages from neighboring rotation features are generally irrelevant for refining the target rotation feature. Our refinement module builds an auxiliary position feature space for feature refinement, making the refinement much more effective than in the original rotation feature space. To verify this, we extract the features before refinement from the rotation, implicit position, and position spaces of the three strategies mentioned above, and compute the correlations between the features of different body joints. Fig. 14 compares the correlation matrices of these three types of features. As observed in Fig. 14(a), the correlation matrix of the rotation features approximates an identity matrix, meaning that the correlations between the rotation features of different joints are rather weak, even for two adjacent joints. By contrast, for the implicit position features in Fig. 14(b) and the position features in Fig. 14(c), the correlations between the features of adjacent joints are much higher, making it more feasible to refine features with messages from neighboring joints.
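A minimal sketch of how such a joint-wise correlation matrix can be computed from a batch of per-joint feature vectors is given below; averaging the absolute per-dimension Pearson correlation is one plausible choice, not necessarily the exact protocol used for Fig. 14.

```python
import torch

def joint_correlation_matrix(features):
    """Correlation between the features of different body joints.

    features: (N, J, D) tensor of per-joint feature vectors collected
              over N samples.
    Returns a (J, J) matrix whose (i, j) entry summarizes how correlated
    the features of joints i and j are across the samples.
    """
    N, J, D = features.shape
    # Standardize each joint/dimension over the N samples.
    z = (features - features.mean(0)) / (features.std(0) + 1e-8)
    corr = torch.zeros(J, J)
    for d in range(D):
        # Per-dimension correlation between joints, accumulated over dims.
        corr += (z[:, :, d].T @ z[:, :, d]).abs() / (N - 1)
    return corr / D
```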

Benefit from the Learnable Graph Edge and PartDrop. The learnable edge weighting matrix M of the refinement graph contributes to better balancing the importance of neighboring


Fig. 15. Visualization of the learned edge weighting matrices under different training settings (values range from 0 to 0.5). (a)(b) Direct refinement without and with PartDrop. (c)(d) Position-aided refinement without and with PartDrop.

messages, while the PartDrop strategy encourages the network to leverage more information from neighboring joints. To verify their effectiveness during feature refinement, Table 7 reports the results of ablation approaches incrementally adopting the learnable edge in the refinement graph and the PartDrop strategy, where D-Net+Direct and D-Net+Pos.-aided adopt the refinement module with the direct and position-aided strategies, respectively. It can be seen that, for the direct refinement, the performance gains mainly come from the PartDrop strategy. In contrast, for the position-aided refinement, the performance gains are attributed to both the learnable edge and the PartDrop strategy. Fig. 15 depicts the learned edge weighting matrices of the different ablation approaches. As observed, the learned edge weighting matrices of the direct refinement are relatively flat with lower values. When the PartDrop strategy is used, the learnable weights of most edges in the refinement graph rise for the position-aided refinement, while such a phenomenon is not observed for the direct refinement. We conjecture that the PartDrop strategy brings gains from two perspectives. First, PartDrop regularizes the backbone feature extractor to focus on more complementary regions of the intermediate representations for better feature exploitation. Second, PartDrop encourages the refinement module to borrow more information from neighbors in the position feature space for better feature refinement.
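The following is a minimal sketch of a graph convolution layer whose edge importances are learned through a weighting matrix M applied to a fixed adjacency, as described above; the normalization and the initialization of M to ones are assumptions.

```python
import torch
import torch.nn as nn

class EdgeWeightedGraphConv(nn.Module):
    """Graph convolution with learnable edge importance.

    A fixed binary adjacency A encodes which joints may exchange messages;
    a learnable matrix M re-weights those edges during training.
    """
    def __init__(self, adj, in_dim, out_dim):
        super().__init__()
        self.register_buffer("adj", adj)                        # (J, J), fixed topology
        self.edge_weight = nn.Parameter(torch.ones_like(adj))   # learnable M
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                       # x: (B, J, in_dim)
        # Element-wise product keeps the topology but learns edge importance;
        # row normalization keeps the scale of aggregated messages stable.
        weighted = self.adj * self.edge_weight
        weighted = weighted / weighted.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return weighted @ self.linear(x)
```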

6 CONCLUSION

In this work, a Decompose-and-aggregate Network (DaNet) is proposed to learn 3D human shape and pose from dense correspondences of body parts with the decomposed perception, aggregated refinement, and part-based dropout strategies. All these new designs contribute to better part-based learning and effectively improve the reconstruction performance by providing well-suited part perception, leveraging spatial relationships for part pose refinement, and encouraging the exploitation of complementary part features. Extensive experiments have been conducted to validate the efficacy of the key components of our method. In comparison with previous methods, our network produces more accurate results while being robust to extreme poses, heavy occlusions, and incomplete human bodies. In future work, we will explore integrating dense refinement [14] to further improve the shape and pose recovery results.

REFERENCES

[1] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "SMPL: A skinned multi-person linear model," ACM Transactions on Graphics, vol. 34, no. 6, p. 248, 2015.
[2] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, "Keep it SMPL: Automatic estimation of 3d human pose and shape from a single image," in European Conference on Computer Vision. Springer, 2016, pp. 561-578.
[3] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler, "Unite the people: Closing the loop between 3d and 2d human representations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6050-6059.
[4] H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki, "Self-supervised learning of motion capture," in Advances in Neural Information Processing Systems, 2017, pp. 5236-5246.
[5] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, "End-to-end recovery of human shape and pose," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7122-7131.
[6] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis, "Learning to estimate 3d human pose and shape from a single color image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 459-468.
[7] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele, "Neural body fitting: Unifying deep learning and model based human pose and shape estimation," in International Conference on 3D Vision. IEEE, 2018, pp. 484-494.
[8] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, "SCAPE: Shape completion and animation of people," in ACM Transactions on Graphics, vol. 24, no. 3. ACM, 2005, pp. 408-416.
[9] X. Chen and A. L. Yuille, "Articulated pose estimation by a graphical model with image dependent pairwise relations," in Advances in Neural Information Processing Systems, 2014, pp. 1736-1744.
[10] X. Chu, W. Ouyang, H. Li, and X. Wang, "Structured feature learning for pose estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4715-4723.
[11] H. Zhang, J. Cao, G. Lu, W. Ouyang, and Z. Sun, "DaNet: Decompose-and-aggregate network for 3d human shape and pose estimation," in Proceedings of the 27th ACM International Conference on Multimedia. ACM, 2019, pp. 935-944.
[12] L. Sigal, A. Balan, and M. J. Black, "Combined discriminative and generative articulated pose and non-rigid shape estimation," in Advances in Neural Information Processing Systems, 2008, pp. 1337-1344.
[13] P. Guan, A. Weiss, A. O. Balan, and M. J. Black, "Estimating human shape and pose from a single image," in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2009, pp. 1381-1388.
[14] R. A. Guler and I. Kokkinos, "HoloPose: Holistic 3d human reconstruction in-the-wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10884-10894.
[15] Y. Rong, Z. Liu, C. Li, K. Cao, and C. C. Loy, "Delving deep into hybrid annotations for 3d human recovery in the wild," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5340-5348.
[16] A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik, "Learning 3d human dynamics from video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5614-5623.
[17] A. Arnab, C. Doersch, and A. Zisserman, "Exploiting temporal context for 3d human pose estimation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3395-3404.
[18] J. Liang and M. C. Lin, "Shape-aware human pose and shape reconstruction using multi-view images," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4352-4362.
[19] G. Pavlakos, N. Kolotouros, and K. Daniilidis, "TexturePose: Supervising human mesh estimation with texture consistency," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 803-812.
[20] H. Joo, T. Simon, and Y. Sheikh, "Total capture: A 3d deformation model for tracking faces, hands, and bodies," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8320-8329.
[21] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black, "Expressive body capture: 3d hands, face, and body from a single image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10975-10985.
[22] H. Zhu, X. Zuo, S. Wang, X. Cao, and R. Yang, "Detailed human shape estimation from a single image by hierarchical mesh deformation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4491-4500.
[23] J. Martinez, R. Hossain, J. Romero, and J. J. Little, "A simple yet effective baseline for 3d human pose estimation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2640-2649.


[24] B. X. Nie, P. Wei, and S.-C. Zhu, "Monocular 3d human pose estimation by predicting depth on joints," in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2017, pp. 3467-3475.
[25] F. Moreno-Noguer, "3d human pose estimation from a single image via distance matrix regression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2823-2832.
[26] K. Lee, I. Lee, and S. Lee, "Propagating LSTM: 3d pose estimation based on joint interdependency," in Proceedings of the European Conference on Computer Vision, 2018, pp. 119-135.
[27] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, "Coarse-to-fine volumetric prediction for single-image 3d human pose," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7025-7034.
[28] B. Tekin, P. Marquez-Neila, M. Salzmann, and P. Fua, "Learning to fuse 2d and 3d image cues for monocular body pose estimation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3941-3950.
[29] C. Luo, X. Chu, and A. L. Yuille, "OriNet: A fully convolutional network for 3d human pose estimation," in British Machine Vision Conference, 2018, p. 92.
[30] E. Dibra, H. Jain, C. Oztireli, R. Ziegler, and M. Gross, "HS-Nets: Estimating human body shape from silhouettes with convolutional neural networks," in International Conference on 3D Vision. IEEE, 2016, pp. 108-117.
[31] B. M. Smith, V. Chari, A. Agrawal, J. M. Rehg, and R. Sever, "Towards accurate 3d human body reconstruction from silhouettes," in International Conference on 3D Vision. IEEE, 2019, pp. 279-288.
[32] V. Gabeur, J.-S. Franco, X. Martin, C. Schmid, and G. Rogez, "Moulding humans: Non-parametric 3d human shape estimation from single images," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2232-2241.
[33] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid, "BodyNet: Volumetric inference of 3d human body shapes," in Proceedings of the European Conference on Computer Vision, 2018, pp. 20-36.
[34] A. S. Jackson, C. Manafas, and G. Tzimiropoulos, "3d human body reconstruction from a single image via volumetric regression," in Proceedings of the European Conference on Computer Vision, 2018.
[35] Z. Zheng, T. Yu, Y. Wei, Q. Dai, and Y. Liu, "DeepHuman: 3d human reconstruction from a single image," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7739-7749.
[36] D. Xiang, H. Joo, and Y. Sheikh, "Monocular total capture: Posing face, body, and hands in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10965-10974.
[37] R. Alp Guler, N. Neverova, and I. Kokkinos, "DensePose: Dense human pose estimation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7297-7306.
[38] N. Kolotouros, G. Pavlakos, and K. Daniilidis, "Convolutional mesh regression for single-image human shape reconstruction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4501-4510.
[39] Y. Xu, S.-C. Zhu, and T. Tung, "DenseRaC: Joint 3d pose and shape estimation by dense render-and-compare," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7760-7770.
[40] V. Tan, I. Budvytis, and R. Cipolla, "Indirect deep structured learning for 3d human body shape and pose prediction," in British Machine Vision Conference, 2017.
[41] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis, "Learning to reconstruct 3d human pose and shape via model-fitting in the loop," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2252-2261.
[42] Y. Sun, Y. Ye, W. Liu, W. Gao, Y. Fu, and T. Mei, "Human mesh recovery from monocular images via a skeleton-disentangled representation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5349-5358.
[43] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations, 2017.
[44] P. Yao, Z. Fang, F. Wu, Y. Feng, and J. Li, "DenseBody: Directly regressing dense 3d human pose and shape from a single color image," arXiv preprint arXiv:1903.10153, 2019.
[45] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele, "Poselet conditioned pictorial structures," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 588-595.
[46] Y. Yang and D. Ramanan, "Articulated pose estimation with flexible mixtures-of-parts," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2011, pp. 1385-1392.
[47] S. Zuffi and M. J. Black, "The stitched puppet: A graphical model of 3d human shape and pose," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3537-3546.
[48] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in Advances in Neural Information Processing Systems, 2014, pp. 1799-1807.
[49] H.-S. Fang, Y. Xu, W. Wang, X. Liu, and S.-C. Zhu, "Learning pose grammar to encode human body configuration for 3d pose estimation," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[50] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas, "Semantic graph convolutional networks for 3d human pose regression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3425-3435.
[51] I. Akhter and M. J. Black, "Pose-conditioned joint angle limits for 3d human pose reconstruction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1446-1455.
[52] X. Zhou, X. Sun, W. Zhang, S. Liang, and Y. Wei, "Deep kinematic pose regression," in European Conference on Computer Vision. Springer, 2016, pp. 186-201.
[53] X. Sun, J. Shang, S. Liang, and Y. Wei, "Compositional human pose regression," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2602-2611.
[54] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei, "Towards 3d human pose estimation in the wild: A weakly-supervised approach," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 398-407.
[55] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang, "3d human pose estimation in the wild by adversarial learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5255-5264.
[56] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[57] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, "Efficient object localization using convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 648-656.
[58] G. Ghiasi, T.-Y. Lin, and Q. V. Le, "DropBlock: A regularization method for convolutional networks," in Advances in Neural Information Processing Systems, 2018, pp. 10727-10737.
[59] T. DeVries and G. W. Taylor, "Improved regularization of convolutional neural networks with cutout," arXiv preprint arXiv:1708.04552, 2017.
[60] R. Alp Guler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos, "DenseReg: Fully convolutional dense shape regression in-the-wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6799-6808.
[61] M. M. Loper and M. J. Black, "OpenDR: An approximate differentiable renderer," in European Conference on Computer Vision. Springer, 2014, pp. 154-169.
[62] H. Kato, Y. Ushiku, and T. Harada, "Neural 3d mesh renderer," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3907-3916.
[63] M. Jaderberg, K. Simonyan, A. Zisserman et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems, 2015, pp. 2017-2025.
[64] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, "Integral human pose regression," in Proceedings of the European Conference on Computer Vision, 2018, pp. 529-545.
[65] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[66] S. Yan, Y. Xiong, and D. Lin, "Spatial temporal graph convolutional networks for skeleton-based action recognition," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[67] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[68] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8024-8035.
[69] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325-1339, 2014.


[70] M. Loper, N. Mahmood, and M. J. Black, "MoSh: Motion and shape capture from sparse markers," ACM Transactions on Graphics, vol. 33, no. 6, p. 220, 2014.
[71] S. Johnson and M. Everingham, "Clustered pose and nonlinear appearance models for human pose estimation," in British Machine Vision Conference, vol. 2, no. 4, 2010, p. 5.
[72] ——, "Learning effective human pose estimation from inaccurate annotation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2011, pp. 1465-1472.
[73] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2d human pose estimation: New benchmark and state of the art analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3686-3693.
[74] M. Dantone, J. Gall, C. Leistner, and L. Van Gool, "Body parts dependent joint regressors for human pose estimation in still images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2131-2143, 2014.
[75] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740-755.
[76] B. Xiao, H. Wu, and Y. Wei, "Simple baselines for human pose estimation and tracking," in Proceedings of the European Conference on Computer Vision, 2018, pp. 466-481.
[77] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248-255.
[78] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2014.
[79] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid, "Learning from synthetic humans," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 109-117.

APPENDIX

TRAINING DETAILS

During training, data augmentation techniques, including rotation (±30°), color jittering (±30% channel-wise), and flipping, are applied randomly to the input images. The FCN is initialized with the model pre-trained on the COCO keypoint detection dataset [75] for 2D human pose estimation, which is essential for robust 2D joint position localization and partial IUV estimation. The αk in Eq. (2) are first learned using ground-truth IUV maps as inputs and then frozen as constants for the other experiments, while δ is empirically set to 0.1. The hyper-parameters λ are decided based on the scales of the values in the objectives. The dropping rate γ for PartDrop is set to 0.3 in our experiments. For more robust pose prediction from the estimated partial IUV, we perform random jittering on the estimated 2D joint positions (±5%) and the scale of the partial IUV maps (±10%) during training. The IUV estimation task is first trained for 5k iterations before the parameter prediction task is involved. We adopt the ADAM [78] optimizer with an initial learning rate of 1 × 10⁻⁴ to train the model, and reduce the learning rate to 1 × 10⁻⁵ after 30k iterations. The learning process converges after around 60k iterations. For faster runtime, the local streams are implemented to run in parallel. Specifically, the partial IUV maps of all body joints are concatenated batch-wise and then fed into the backbone feature extractor. Moreover, the individual rotation feature extraction is implemented with group convolution. The training process takes about 25 hours on a single TITAN Xp GPU.
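For reference, a minimal sketch of the optimizer and learning-rate schedule stated above is given below; the model is a placeholder and the loss computation is only indicated in comments.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # placeholder standing in for the DaNet network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Reduce the learning rate from 1e-4 to 1e-5 after 30k iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30_000], gamma=0.1)

for iteration in range(60_000):
    optimizer.zero_grad()
    # loss = compute_losses(model, batch)   # task losses weighted by the lambdas
    # loss.backward()
    optimizer.step()
    scheduler.step()                        # stepped per iteration, not per epoch
```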

ADDITIONAL QUANTITATIVE EVALUATION

Evaluation on SURREAL. SURREAL [79] is a large-scale dataset containing synthetic images with ground truth SMPL parameters. The accurate ground truth labels make the SURREAL dataset ideal for quantitative evaluation of reconstruction performance. The original dataset provides 1,964 video sequences of 115 subjects for training, and 703 video sequences of 30 subjects for testing. We downsample the videos by selecting one out of every three frames to remove redundancy, resulting in 1.7 million frames of training data. Since training samples are abundant in this dataset, we double the number of training iterations by reducing the learning rate after 60k iterations and finishing training after 120k iterations. For testing, we follow the protocol in [33] to select 507 video clips from the test sequences as a subset, in which the middle frame of each clip is chosen as testing data. Frames containing no human body or incomplete human bodies are further discarded, resulting in a total of 465 testing frames.

Table 8 shows the evaluation results in terms of the Per-vertex Error (PVE) on the SURREAL dataset. It can be seen that our method achieves the best result compared with previous methods. Our closest competitor BodyNet [33] predicts volumetric representations of the human body with both 2D and 3D estimations as intermediate representations, which is much more time-consuming than ours.

ADDITIONAL ABLATION STUDY

Effect of Image Features. The IUV maps are the re-projection of the 3D human body models onto image planes, containing highly abstract information about the shape and pose of the models while eliminating other irrelevant factors. Compared with IUV maps, the image features contain the raw information extracted from


Table 8
Quantitative comparison of PVE on the SURREAL dataset.

Method            PVE
SMPLify++ [3]     75.3
Tung et al. [4]   74.5
BodyNet [33]      65.8
Ours              65.1

RGB images. To evaluate how important the image features are to our task, we concatenate the image features with the IUV maps and feed them as input for the shape and pose prediction. Our experiments found that image features are not always helpful; their usefulness depends on the initialization of the networks and the quality of the IUV maps. As shown in Table 9, concatenating the image features of the network pretrained on COCO results in slightly better performance. However, image features are not helpful and even degrade the performance when the network is trained from scratch. The degraded performance is also observed when concatenating image features with the ground-truth IUV maps, even though the network is pretrained on COCO. Based on these observations, we do not resort to using image features in our approach and leave improving the quality of the estimated IUV maps for future work.

Table 9
Performance of approaches with/without concatenated image features as intermediate representations on the Human3.6M dataset.

Pre-trained   Input          PVE     MPJPE   MPJPE-PA
None          IUV            103.5   85.8    64.5
None          IUV+Feat       120.7   100.1   73.1
ImageNet      IUV            94.9    77.9    59.5
ImageNet      IUV+Feat       94.2    77.7    56.8
COCO          IUV            87.8    71.6    55.4
COCO          IUV+Feat       88.9    71.4    52.2
COCO          IUV_GT         69.5    57.9    46.9
COCO          IUV_GT+Feat    83.5    67.4    49.9

Number of Refinement Layers. We also evaluate the impact of the number of graph convolutional layers used for feature refinement. Fig. 16 reports the results of approaches adopting the direct or position-aided refinement strategy across different numbers of refinement layers. For fair comparison, all layers from the three steps of the position-aided strategy are counted in the number of refinement layers. Thus, there are N − 2 layers in Step 2 of the position-aided refinement when N layers are deployed in the corresponding direct refinement. As can be seen, the position-aided strategy outperforms its direct counterpart consistently. The direct refinement strategy obtains marginal to no improvement, no matter how many layers are deployed for refinement.

Refinement Technique. An LSTM-based refinement approach was proposed in our previous work [11], where both feature mapping and refinement are conducted sequentially for each body joint. Compared with the LSTM-based refinement, the GCN-based refinement proposed in this work performs feature mapping and refinement in parallel for all body joints, which is faster and has better performance. The approaches with different refinement techniques are compared in Table 10. The results of the GCN-based refinement with the refinement graph Arf considering only one-hop neighbors are also reported in Table 10 for comprehensive

Fig. 16. Comparison of the direct and position-aided refinement strategies (PVE, MPJPE, and MPJPE-PA) across different numbers of refinement layers (0, 3, 5, 7, 9).

comparisons. We can see that the GCN-based refinement techniques bring gains for both the direct and position-aided refinement, and that the position-aided refinement is consistently superior to the direct refinement regardless of the refinement technique. For the GCN-based refinement, involving two-hop neighbors in Arf yields slightly better performance for the position-aided refinement but worse performance for the direct refinement, since the rotation status of two-hop neighbors is much more independent.

Table 10
Performance of approaches using different feature refinement strategies on the Human3.6M dataset.

Refinement Strategy                       PVE    MPJPE   MPJPE-PA
w/o Refinement                            71.7   59.1    46.1
LSTM-based [11]         Direct            71.9   59.0    46.0
                        Position-aided    68.4   57.0    45.1
GCN-based (1-hop)       Direct            69.9   57.7    45.3
                        Position-aided    66.7   54.4    43.6
GCN-based (1- & 2-hop)  Direct            70.3   58.1    45.5
                        Position-aided    66.5   54.6    42.9

ADDITIONAL QUALITATIVE RESULTS

Fig. 17 depicts more example results of our method, including visualization from side viewpoints. Moreover, failure cases of our method, together with the estimated IUV maps, are visualized in Fig. 18. These failure cases are typically caused by extreme poses, heavy occlusion/self-occlusion, and poor IUV estimation.


Fig. 17. Successful results of our method. For each example from left to right: image, our reconstruction result, and our reconstruction result from a side view. All images come from the DensePose-COCO [37] dataset.


Fig. 18. Erroneous reconstructions of our method. For each example from left to right: image, estimated IUV, our reconstruction result, and our reconstruction result from a side view. All images come from the DensePose-COCO [37] dataset.