
Page 1: Unsupervised 3D Reconstruction Networks (cpslab.snu.ac.kr/publications/papers/2019_iccv_3durn.pdf)

Unsupervised 3D Reconstruction Networks

Geonho Cha†∗ Minsik Lee‡∗ Songhwai Oh†

Electrical and Computer Engineering, ASRI, Seoul National University, Korea†

Division of Electrical Engineering, Hanyang University, Korea‡

[email protected] [email protected] [email protected]

Abstract

In this paper, we propose 3D unsupervised reconstruction networks (3D-URN), which reconstruct the 3D structures of instances in a given object category from their 2D feature points under an orthographic camera model. 3D-URN consists of a 3D shape reconstructor and a rotation estimator, which are trained in a fully-unsupervised manner incorporating the proposed unsupervised loss functions. The role of the 3D shape reconstructor is to reconstruct the 3D shape of an instance from its 2D feature points, and the rotation estimator infers the camera pose. After training, 3D-URN can infer the 3D structure of an unseen instance in the same category, which is not possible in the conventional schemes of non-rigid structure from motion and structure from category. The experimental results show the state-of-the-art performance, which demonstrates the effectiveness of the proposed method.

1. Introduction

Unsupervised 3D reconstruction from 2D observations is one of the key problems in computer vision, with numerous applications such as human-computer interaction and augmented reality. Structure from motion (SfM) and non-rigid structure from motion (NRSfM) have handled the unsupervised 3D reconstruction problem, of which the goals are to reconstruct 3D trajectories and motions from 2D feature trajectories of rigid and non-rigid objects, respectively. Thanks to significant advances [16, 29], SfM is considered a mature area in computer vision. However, NRSfM has generally been considered an ill-conditioned problem due to the increased degrees of freedom. To alleviate this issue, some prior information on the nature of deformation has been incorporated. For example, the shape space model [4, 7, 8] and the trajectory space model [3, 15] assume that the aligned deformations of a non-rigid instance can be represented by a low-rank matrix. However, most

∗Authors contributed equally.

Figure 1. The proposed 3D unsupervised reconstruction networks (3D-URN) reconstruct the 3D structures of instances in an object category from their 2D feature points. After training, 3D-URN can reconstruct the 3D structure of an unseen instance in the same category as the one used in the training process.

schemes of NRSfM have focused on the reconstruction of a specific instance of an object and have assumed smooth input trajectories.

These limitations have been alleviated in the field of structure from category (SfC) [1, 2, 12, 20], the problem of reconstructing the 3D structures of instances in an object category without any smoothness assumption on the input feature points. Furthermore, SfC can handle a general object category, unlike NRSfM, which has conventionally dealt with limited object categories such as the human face or body [20]. However, despite their promising results, most works of NRSfM and SfC can only produce the 3D reconstruction of a given input, i.e., the information learned from reconstructing a certain input cannot be reused, and a sufficient number of samples is needed in the reconstruction process. That is, the optimization process should be repeated with a sufficient number of samples to reconstruct


the 3D structure of each instance, even though the 2D feature points of instances in the same object category are similar to each other.

In this paper, we propose 3D unsupervised reconstruction networks (3D-URN), a neural-network-based framework which reconstructs the 3D structures of instances in an object category from their 2D feature points under an orthographic camera model (see Figure 1). 3D-URN consists of a 3D shape reconstructor and a rotation estimator, which are based on neural networks. The role of the 3D shape reconstructor is to reconstruct the 3D shape of an instance from its 2D feature points, which is a highly non-linear procedure. To solve this problem, inspired by the shape-basis methods in NRSfM [4, 7, 8], we incorporate multiple 3D reconstructors, and the final 3D shape is represented as a weighted sum of their outputs. Here, the weights of the 3D reconstructors are also estimated by a neural-network-based estimator. On the other hand, the rotation estimator estimates a rotation matrix from the 2D feature points of an instance. Unlike the 3D shape reconstructor, the output of the rotation estimator should meet the orthogonality constraint. To handle this issue, we propose a rotation refiner that decomposes the output of the rotation estimator to make it orthogonal. The proposed rotation refiner is composed of differentiable operations, which is useful in back-propagation.

The proposed 3D-URN is trained in a fully-unsupervised manner based on the proposed loss functions, namely, the projection loss and the low-rank prior loss. After the proposed network is trained, it can infer the 3D structure of an unseen instance in the same object category, unlike previous works of NRSfM and SfC. The proposed 3D-URN shows the state-of-the-art performance, which demonstrates the effectiveness of the scheme.

The contributions of the proposed method are summarized as follows:

• We propose a neural-network-based framework which solves the structure from category problem.

• We propose a novel rotation refiner, which is a differentiable layer that resolves the issues of orthogonality constraints and reflection ambiguities among the estimated rotation matrices.

• The proposed scheme shows the state-of-the-art performance on the popular benchmark data sets.

The remainder of this paper is organized as follows: we discuss related work in Section 2. The proposed 3D-URN and the proposed unsupervised loss functions are explained in Section 3. Section 4 shows the experimental results, and we conclude the paper in Section 5.

2. Related work

In this section, we introduce recent work on non-rigid structure from motion, and present some related work on structure from category.

2.1. Non-rigid structure from motion

The goal of NRSfM is to reconstruct 3D trajectories and camera poses from 2D feature trajectories of non-rigid objects. NRSfM is generally considered an ill-conditioned problem because of its large degrees of freedom. That is, we have to find the camera pose and the shape of the object, which change in each frame, from an observation of 2D feature points. To resolve the ill-conditioned nature of NRSfM, some prior information has been incorporated. The low-rank assumption [3, 4, 7, 8, 15] is one of the promising prior assumptions. The local-rigidity prior [5, 6, 17, 23, 27] and the sparsity assumption [19] have also been incorporated to make the problem easier. However, most NRSfM schemes have difficulties reconstructing a set of arbitrary instances in a general object category, and so they have focused on limited object categories, such as the human face and body. Furthermore, they often assume smooth input trajectories, which limits the range of practical uses.

2.2. Structure from category

To resolve the limitations of NRSfM, some frameworks have been proposed to reconstruct multiple arbitrary instances of an object category without the assumption of smooth input trajectories [1, 2, 12, 20]. Gao et al. [12] extended two existing schemes to exploit the proposed symmetry constraints. Agudo et al. [1] proposed a formulation based on a dual low-rank shape representation, which is learned through a variant of probabilistic linear discriminant analysis. Meanwhile, Kong et al. [20] introduced the concept of SfC, of which the goal is to reconstruct the 3D shapes and camera poses of multiple instances in an object category from their 2D feature points. They formulated an augmented sparse shape-space model that can reconstruct multiple instances of a rigid object category. Note that all of these previous works [1, 12, 20] only handled rigid object categories. To resolve this issue, Agudo et al. [2] proposed a framework for SfC which can handle both rigid and non-rigid object categories. They modeled the object deformation based on multiple unions of subspaces, which was solved with augmented Lagrange multipliers. In spite of the promising results of NRSfM and SfC, most of these methods can only produce the 3D reconstruction of a given input, i.e., the information learned from reconstructing a certain input cannot be reused, which hinders their utility in real-world applications.


Figure 2. An overview of the proposed 3D shape reconstructor. It consists of nf 3D reconstructors and a weight estimator. Here, each 3D reconstructor consists of nd modules, and each module produces an intermediate reconstruction which is used in the intermediate losses to prevent the gradient vanishing problem. The reconstructions are summed with the estimated weights to produce the final 3D shape. Here, the number written under each layer represents the channel size of the layer.

3. 3D reconstruction networks

In this section, we introduce the proposed 3D-URN, which consists of a 3D shape reconstructor and a rotation estimator. The network reconstructs the 3D shape of the i-th sample, Xi ∈ R^{3×p}, where p is the total number of points, and the camera matrix Ri ∈ R^{3×3} from the 2D feature points of the i-th sample, xi ∈ R^{2×p}. In this paper, we assume that the 2D feature points and the reconstructed 3D shapes are translated so that their means are zero.

3.1. 3D shape reconstructor

An illustration of the proposed 3D shape reconstructor is visualized in Figure 2. The role of the 3D shape reconstructor, f, is to reconstruct the 3D shape Xi given the 2D feature points xi, which can be expressed as

Xi = f(xi). (1)

In (1), the 3D shape is reconstructed based on a single reconstructor. However, 3D reconstruction from a 2D cue is a highly non-linear procedure, which might be hard to deal with based on a single reconstructor. To resolve this issue, we incorporate multiple reconstructors, and the final result is estimated as a weighted sum of reconstructions, which can be written as

Xi = ∑_{j=1,...,nf} wj fj(xi),   (2)

where fj is the j-th reconstructor, w ∈ R^{nf} is the weight vector of the reconstructions, which is estimated by a weight estimator, wj is the j-th element of w, and nf is the total number of reconstructors. The design of the 3D shape reconstructor has been inspired by the shape-space-based methods in NRSfM [4, 7, 8] that represent the 3D shape as a weighted sum of basis shapes. However, in their formulation, a batch of 2D feature points is always necessary to infer the basis shapes, unlike the proposed scheme, which generates several basis shapes for each frame. In the next section, the details of the proposed 3D reconstructor and the weight estimator are introduced.
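As a minimal numerical sketch of the weighted combination in (2), the following NumPy example combines the outputs of several stand-in reconstructors. The random linear maps and the weight estimator here are placeholders for the paper's neural networks, not the actual 3D-URN architecture.

```python
import numpy as np

def weighted_reconstruction(x, reconstructors, weight_estimator):
    """Eq. (2): X_i = sum_j w_j * f_j(x_i).

    x: (2, p) 2D feature points.
    reconstructors: list of callables mapping (2, p) -> (3, p).
    weight_estimator: callable mapping (2, p) -> (n_f,) weights.
    """
    w = weight_estimator(x)                        # (n_f,)
    return sum(wj * f(x) for wj, f in zip(w, reconstructors))

# Toy stand-ins: random linear "reconstructors"; weights are taken as
# absolute values, mirroring the choice described in Sec. 3.1.2.
rng = np.random.default_rng(0)
p, n_f = 8, 4
fs = [lambda x, A=rng.standard_normal((3, 2)): A @ x for _ in range(n_f)]
west = lambda x: np.abs(rng.standard_normal(n_f))
X = weighted_reconstruction(rng.standard_normal((2, p)), fs, west)
assert X.shape == (3, p)
```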

3.1.1 Design of the 3D reconstructor

The proposed 3D reconstructor is designed based on a neural network. The vectorized 2D feature points, vec(xi), where vec(·) denotes the vectorization operator, is mapped into a feature space with a fully-connected layer, which is fed into nd cascaded modules. Each module consists of two fully-connected layers, each followed by a ReLU activation function. Note that the output of each module is also fed into another fully-connected layer without an activation function to infer an intermediate result, which is denoted by f_j^k(xi) ∈ R^{3p}, k = 1, ..., nd. Here, the output of the last module is the final reconstruction of each 3D reconstructor, i.e., fj(xi) = f_j^{nd}(xi).

There are two things to consider in designing the 3D reconstructor. The first one is the scale ambiguity between the weights and the reconstructions in (2). Specifically, numerous combinations of wj and fj(xi) with different scales can produce the same 3D shape Xi, which can make the solution ambiguous. To resolve this issue, we normalize the reconstructions so that they have unit Frobenius norms. The


Figure 3. An overview of the proposed rotation estimator. The output of the estimation network is fed into the proposed rotation refiner, which resolves the issues of orthogonality constraints and reflection ambiguities among the estimated rotation matrices.

second issue is that the training procedure of the proposed network could suffer from the gradient vanishing problem in the case of a large nd. To alleviate this issue, we apply the proposed losses to the output of each module, which will be introduced in Section 3.3.

3.1.2 Design of the weight estimator

An illustration of the weight estimator is visualized in Figure 2. The weight estimator infers the weights of the reconstructions from the 2D feature points. It is also designed based on a neural network, which consists of six fully-connected layers. The output of each fully-connected layer is followed by a ReLU activation function, except for the last layer. Here, we empirically found that incorporating the absolute value of the network output as the weight improves the performance. Hence, we utilize the absolute value of the output from the weight estimator as the weight, w.

3.2. Rotation estimator

An overview of the rotation estimator is visualized in Figure 3. The role of the rotation estimator, g, is to infer the rotation matrix Ri given the 2D feature points xi, which can be expressed as

Ri = g(xi). (3)

The structure of the rotation estimation network is similar to that of the weight estimation network, with different channel dimensions. However, in contrast to the design of the weight estimation network, there are two things to consider in designing the rotation estimation network. The first one is the orthogonality constraint of the rotation matrix:

Ri Ri^T = I,   (4)

where I is the identity matrix. The second one is the reflection ambiguities among the estimated rotation matrices. To resolve these issues, we propose a novel rotation refining procedure, which is applied to the output of the rotation estimation network. The proposed rotation refining procedure is given as

Ri ← U V^T,
Ri ← U diag(1, 1, |Ri|) V^T,   (5)

where g(xi) = UΣV^T is the singular value decomposition (SVD) of g(xi) and |·| denotes the matrix determinant. The first step enforces the orthogonality constraint. The reflection ambiguities are also resolved by putting the estimated rotation matrices into the special orthogonal group, i.e., |Ri| = 1, which is achieved in the second step.

Note that many deep learning tools like TensorFlow provide the ability to back-propagate through SVD, which enables us to use the proposed rotation refiner without any problem.
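For illustration, the refining procedure in (5) can be sketched numerically as follows; this is a standalone NumPy sketch of the projection onto SO(3), not the differentiable TensorFlow layer used in the paper.

```python
import numpy as np

def refine_rotation(R_raw):
    """Sketch of Eq. (5): project an arbitrary 3x3 matrix onto the
    special orthogonal group SO(3) via SVD, which enforces the
    orthogonality constraint and removes the reflection ambiguity."""
    U, _, Vt = np.linalg.svd(R_raw)
    R = U @ Vt                                  # first step: nearest orthogonal matrix
    D = np.diag([1.0, 1.0, np.linalg.det(R)])   # second step: force det = +1
    return U @ D @ Vt

# Any 3x3 input is mapped to a proper rotation:
R = refine_rotation(np.random.default_rng(1).standard_normal((3, 3)))
assert np.allclose(R @ R.T, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)
```

Since the SVD of a generic matrix is differentiable almost everywhere, the same two steps carry over directly to the framework's differentiable SVD operation.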

3.3. Unsupervised losses

We train the proposed network in a fully-unsupervised manner incorporating the proposed losses, which are the projection loss and the low-rank loss. Before explaining the proposed loss functions, we introduce some additional notation. Let S_j^k be the collection of reconstructions which are estimated from the k-th module of the j-th reconstruction network, given a batch of training samples, which is defined as

S_j^k = [vec(f_j^k(x1)), ..., vec(f_j^k(x_{nb}))]^T,   (6)

where nb is the batch size. Meanwhile, the weighted sum of the intermediate reconstructions is denoted by X_i^k, i.e., X_i^k = ∑_{j=1,...,nf} wj f_j^k(xi).

3.3.1 Projection loss

The projection loss measures how well the projection of the rotated reconstruction coincides with the given 2D feature points. The projection loss is defined as

Lproj = (1 / (p nd nb)) ∑_{i∈B} ∑_{k=1,...,nd} ‖P Ri X_i^k − xi‖_F^2,   (7)

where ‖·‖F denotes the Frobenius norm, P is the projection matrix, and B is a set of training batch samples. Here, we assume that the 2D observations are given by an orthographic camera model, which is reasonable for modeling objects that are far enough from the camera compared to the depth variation. Hence, the projection matrix P is given as

P = [1 0 0
     0 1 0].   (8)
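As a numerical sketch, the projection loss of (7) for a single module level (nd = 1) can be written in NumPy as follows. The array shapes are illustrative; in the actual pipeline the batched reconstructions and rotations would come from the networks.

```python
import numpy as np

P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # orthographic projection matrix of Eq. (8)

def projection_loss(Xs, Rs, xs):
    """Sketch of Eq. (7) with n_d = 1.

    Xs: (n_b, 3, p) reconstructed shapes.
    Rs: (n_b, 3, 3) estimated rotations.
    xs: (n_b, 2, p) observed 2D feature points.
    """
    n_b, _, p = Xs.shape
    err = np.einsum('ij,bjk,bkl->bil', P, Rs, Xs) - xs  # P R_i X_i^k - x_i
    return (err ** 2).sum() / (p * n_b)

# A perfect reconstruction projects exactly onto the observations:
rng = np.random.default_rng(2)
Xs = rng.standard_normal((4, 3, 6))
Rs = np.tile(np.eye(3), (4, 1, 1))
xs = np.einsum('ij,bjk->bik', P, Xs)
assert np.isclose(projection_loss(Xs, Rs, xs), 0.0)
```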


3.3.2 Low-rank prior loss

The low-rank assumption has generally been incorporated in the field of NRSfM [3, 4, 7, 8, 15] to reconstruct 3D shapes in an unsupervised manner. In general, a smooth surrogate of the rank cost, such as the log-determinant or the nuclear norm, is adopted in the formulation [10]. We found empirically that a nuclear-norm-based cost gives superior performance, which is given as

Llr = (1 / (nd nf)) ∑_{k=1,...,nd} ∑_{j=1,...,nf} ‖S_j^k (S_j^k)^T‖_*,   (9)

where ‖·‖∗ denotes the nuclear norm.
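For reference, the nuclear-norm term of (9) can be evaluated numerically as below. Note that for the positive semi-definite matrix S S^T, the nuclear norm equals the sum of its eigenvalues, which in turn equals the squared Frobenius norm of S; this sketch is written for generic collections S_j^k.

```python
import numpy as np

def lowrank_loss(S_list):
    """Sketch of Eq. (9): average nuclear norm of S S^T over the
    collections S_j^k (each S has shape (n_b, 3p)). The nuclear
    norm is the sum of singular values."""
    total = sum(np.linalg.svd(S @ S.T, compute_uv=False).sum()
                for S in S_list)
    return total / len(S_list)

# Sanity check: ||S S^T||_* equals ||S||_F^2 for any S.
S = np.random.default_rng(3).standard_normal((5, 9))
assert np.isclose(lowrank_loss([S]), (S ** 2).sum())
```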

3.3.3 Total loss

The total loss function of the proposed network is given as

Ltotal = λ1Lproj + λ2Llr + λ3Lreg, (10)

where λ1, λ2, and λ3 are weighting parameters, and Lreg is the regularization loss for the estimated weights, which is defined as

Lreg = ‖w‖_F^2.   (11)

The loss function is defined on every intermediate output as well as the last output of the reconstruction network, which facilitates the training procedure by preventing the gradient vanishing problem.

4. Experimental results

In this section, we show experimental results of 3D-URN on various data sets. For all experiments, we used the following parameter setting unless stated otherwise: nf = 12, nd = 12, nb = 32, λ1 = 800, λ2 = 80, and λ3 = 5.

4.1. Implementation details

We have empirically found that constraining the parameters of each layer by clipping their norms at 1 improves the performance. Specifically, we have normalized the parameter matrices whose Frobenius norms are larger than 1 so that they have unit Frobenius norms. We have also found that alternately updating the parameters of the 3D shape reconstructor and the rotation estimator shows better performance. Note that the rotation estimation performance might be crucial for training the 3D shape reconstructor. Hence, we updated the parameters of the rotation estimation network 25 times before every update of the parameters of the 3D shape reconstructor. The parameters of both networks were updated for 100 epochs based on Adam [18] with a learning rate of 1.0 × 10−3. Lastly, in our framework, the x and y coordinates of the output correspond to the input coordinates. Hence, we concatenated the input 2D points with the estimated depths to form the final output.
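The norm-clipping constraint described above amounts to the following simple operation, sketched here in NumPy; during training it would be applied to each layer's parameter matrix after every update.

```python
import numpy as np

def clip_frobenius(W, max_norm=1.0):
    """Rescale a parameter matrix whose Frobenius norm exceeds
    `max_norm` so that it has exactly that norm; matrices with a
    smaller norm are left unchanged (cf. Sec. 4.1)."""
    n = np.linalg.norm(W)          # Frobenius norm for matrices
    return W if n <= max_norm else W * (max_norm / n)

W = np.full((4, 4), 10.0)
assert np.isclose(np.linalg.norm(clip_frobenius(W)), 1.0)
```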

4.2. Quantitative evaluation

For the quantitative evaluation, we have applied 3D-URN to the PASCAL3D+ data set [30] and the face sequence [14]. In the PASCAL3D+ data set, ground-truth 3D CAD models are labeled for the samples of the PASCAL VOC data set [9]. Following the practices in [2], we have evaluated the performance on the instances in eight object categories which have at least eight keypoints, and the performance is measured in terms of the normalized mean 3D error, e_3D^1, as in [7, 13], which is defined as

e_3D^1 = (1 / (σ ns p)) ∑_{i=1}^{ns} ∑_{j=1}^{p} e_ij,   σ = (1 / (3 ns)) ∑_{i=1}^{ns} (σ_ix + σ_iy + σ_iz),   (12)

where σ_ix, σ_iy, and σ_iz are the standard deviations in the x, y, and z coordinates of the ground-truth 3D shape for the i-th instance, ns is the total number of instances, and e_ij is the Euclidean distance error of the j-th point in the i-th instance. Note that we have calculated e_ij for both the original reconstructed shape and the reflected shape and picked the smaller one, due to the inherent reflection ambiguity. We have also considered full and clean 2D feature points in each object category, following the practice in [2]. We have compared the performance of 3D-URN to various methods [2, 7, 11, 13, 14, 21, 22, 24, 29, 31].
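The error measure (12), including the reflection check, can be computed as in the following NumPy sketch. The shapes are stacked as arrays of size (ns, 3, p); taking the reflected candidate as a flip of the depth axis is an assumption of this sketch, one plausible reading of the paper's reflected shape.

```python
import numpy as np

def normalized_mean_3d_error(X_gt, X_rec):
    """Sketch of Eq. (12). X_gt, X_rec: (n_s, 3, p) arrays of
    ground-truth and reconstructed shapes. For each instance the
    smaller error of the original and depth-reflected shape is
    kept, due to the reflection ambiguity."""
    n_s, _, p = X_gt.shape
    sigma = X_gt.std(axis=2).sum() / (3 * n_s)   # mean of (sigma_x + sigma_y + sigma_z)
    flip = np.array([[1.0], [1.0], [-1.0]])      # reflect the depth axis
    total = 0.0
    for G, R in zip(X_gt, X_rec):
        e  = np.linalg.norm(G - R, axis=0).sum()         # sum_j e_ij
        ef = np.linalg.norm(G - R * flip, axis=0).sum()  # reflected candidate
        total += min(e, ef)
    return total / (sigma * n_s * p)

G = np.random.default_rng(4).standard_normal((3, 3, 10))
assert np.isclose(normalized_mean_3d_error(G, G), 0.0)
```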

The results are summarized in Table 1. The proposed method shows the state-of-the-art performance on average, which demonstrates the effectiveness of the proposed method. Some reconstruction results are visualized in Figure 6. We have also evaluated 3D-URN on noisy input. For this evaluation, zero-mean Gaussian noise is added to the 2D feature points, with the standard deviation of the noise set to 0.01 dmax, where dmax is the maximum absolute value of the input feature points. The comparison results on the noisy data are summarized in Table 2. In this case as well, 3D-URN shows the state-of-the-art performance on average. When the standard deviation of the noise was doubled, the average performance of 3D-URN was 0.189, which is comparable to the performance of [2] with small noise.

The face sequence is one of the popular benchmark data sets in the field of NRSfM; it is motion capture data of facial feature points. We have compared the performance of 3D-URN to various methods [7, 22, 26] based on the error measure e_3D^2 used in [21, 22], which is defined as

e_3D^2 = (1 / ns) ∑_{i=1}^{ns} ‖X_i^GT − X_i^infer‖_F / ‖X_i^GT‖_F,   (13)

where X_i^GT and X_i^infer are the ground-truth 3D shape and the reconstructed 3D shape, respectively. Again, for this measure, both the original reconstructed shape and the reflected shape have been used in calculating the error. Table 3 summarizes the comparison results. Although the other NRSfM


Category     TK [29]  MC [24]  CSF [13]  KSTA [14]  BMM [7]  EM-PND [21]  TUS [31]  GBNR [11]  CNR [22]  MUS [2]  Ours   Ours*
Aeroplane    0.679    0.584    0.363     0.145      0.843    0.578        0.294     -          0.263     0.261    0.121  0.157
Bicycle      0.309    0.440    0.424     0.442      0.308    0.763        0.182     0.221      -         0.178    0.328  0.305
Bus          0.202    0.238    0.217     0.214      0.300    1.048        0.129     0.214      -         0.113    0.097  0.105
Car          0.239    0.256    0.195     0.159      0.266    0.496        0.084     0.217      0.099     0.078    0.104  0.097
Chair        0.356    0.447    0.398     0.399      0.357    0.687        0.211     -          -         0.210    0.115  0.141
Diningtable  0.386    0.512    0.406     0.372      0.422    0.670        0.265     0.351      -         0.264    0.115  0.107
Motorbike    0.339    0.346    0.278     0.270      0.336    0.740        0.228     0.268      -         0.222    0.287  0.265
Sofa         0.381    0.390    0.409     0.298      0.279    0.692        0.179     0.264      0.214     0.167    0.181  0.153
Average      0.361    0.402    0.336     0.287      0.388    0.709        0.196     0.256      0.192     0.186    0.168  0.166

Table 1. Results on the PASCAL3D+ data set based on the error measure e_3D^1. The performances of the other methods are quoted from [2]. Here, "-" means failure of reconstruction, and "Ours*" is the performance of 3D-URN measured on unseen instances.

Category     TK [29]  MC [24]  CSF [13]  KSTA [14]  BMM [7]  EM-PND [21]  TUS [31]  GBNR [11]  CNR [22]  MUS [2]  Ours   Ours*
Aeroplane    0.677    0.583    0.233     0.183      0.566    0.760        0.297     -          0.294     0.271    0.144  0.163
Bicycle      0.308    0.442    0.455     0.457      0.307    0.808        0.195     0.231      -         0.188    0.320  0.355
Bus          0.204    0.241    0.227     0.218      0.255    1.197        0.139     0.223      -         0.122    0.103  0.127
Car          0.241    0.259    0.169     0.164      0.161    0.624        0.100     0.222      0.122     0.093    0.119  0.109
Chair        0.358    0.447    0.398     0.396      0.258    0.818        0.221     -          -         0.220    0.125  0.151
Diningtable  0.392    0.522    0.414     0.383      0.358    0.807        0.268     0.370      -         0.267    0.121  0.113
Motorbike    0.342    0.348    0.295     0.290      0.299    0.748        0.237     0.277      -         0.233    0.295  0.240
Sofa         0.384    0.392    0.303     0.294      0.240    0.726        0.188     0.271      0.228     0.174    0.185  0.157
Average      0.363    0.404    0.312     0.298      0.305    0.811        0.206     0.266      0.215     0.196    0.177  0.177

Table 2. Results on the PASCAL3D+ data set with noise based on the error measure e_3D^1. The performances of the other methods are quoted from [2]. Here, "-" means failure of reconstruction, and "Ours*" is the performance of 3D-URN measured on unseen instances.

schemes show better performance, the performance of 3D-URN is comparable and good enough for practical applications. This result shows that 3D-URN can reconstruct instances not only in rigid object categories but also in non-rigid object categories.
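The relative error (13) used for this comparison can likewise be sketched in NumPy; as with the sketch of e_3D^1, treating the reflected candidate as a flip of the depth axis is an assumption of this illustration.

```python
import numpy as np

def relative_3d_error(X_gt, X_rec):
    """Sketch of Eq. (13): mean relative Frobenius error over
    instances, keeping the smaller error of the original and the
    depth-reflected reconstruction."""
    flip = np.array([[1.0], [1.0], [-1.0]])
    errs = [min(np.linalg.norm(G - C) / np.linalg.norm(G)
                for C in (R, R * flip))
            for G, R in zip(X_gt, X_rec)]
    return float(np.mean(errs))

G = np.random.default_rng(5).standard_normal((4, 3, 7))
assert np.isclose(relative_3d_error(G, G), 0.0)
```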

Meanwhile, to verify that the proposed 3D-URN can reconstruct the 3D shape of an unseen instance after training, we have divided the instances of each object category into a training set and a test set. Specifically, we put 80% of the instances into the training set and the remaining instances into the test set. The results on clean and noisy inputs are summarized in Table 1 and Table 2, respectively, in the columns denoted "Ours*". We can verify that 3D-URN is capable of reconstructing the 3D structures of unseen instances.

4.3. Qualitative evaluation

We have evaluated the proposed framework on the MUCT data set [25] and the two cloths sequence [28]1 for the qualitative evaluation. MUCT is a face image data set with labeled 2D landmark feature points. It consists of 86 RGB face images of diverse ages, races, and lighting conditions, and each image has 68 2D feature points. Meanwhile, the two cloths sequence is one of the popular sequences in the field of NRSfM; it shows two independently moving cloths. It consists of 163 frames, and each frame has 525 points. In this case, we set nf = 24 to handle the large number of 2D feature points. Since no ground-truth 3D points are provided in either case, we only report qualitative results. An illustration of the reconstruction results on the MUCT data set is visualized in Figure 4. We can verify that the proposed scheme has successfully reconstructed the 3D shapes of faces. Some reconstruction results of the two cloths sequence are visualized in Figure 5. Notably, 3D-URN has successfully reconstructed multiple parts of an instance at once, without a separate segmentation process. Again, from the results on the MUCT data set and the two cloths sequence, we can verify that 3D-URN can reconstruct instances in non-rigid object categories.

1The reconstruction video is provided in the supplementary material.

4.4. Additional experiments

We have performed ablation experiments on the PASCAL3D+ data set. The performance was evaluated by removing some components of the proposed scheme. The results are summarized in Table 4. Removing the low-rank prior loss increased the error by 0.007. The error increased by 0.006 when we removed the weight regularization loss, and by 0.100 when the intermediate loss was not incorporated.

In addition, we have evaluated the performance of 3D-URN on the PASCAL3D+ data set with various numbers of 3D reconstructors; the results are summarized in Table 5. We can see that the performance improves as the number of 3D reconstructors increases, which shows the validity of incorporating multiple 3D reconstructors. Lastly, we have performed experiments on the PASCAL3D+ data set with other low-rank losses. The use of the determinant cost increases the error by 0.009, and using the log-determinant cost causes numerical instability, resulting in a training failure.


Figure 4. Some example reconstructions of the MUCT data set. Left: the input 2D feature points and the projection of the reconstructed 3D shape; a green circle marks an input 2D feature point, and a red "+" marks an estimated 2D feature point. Middle, Right: the reconstructed 3D shape in the camera view and the side view.

Figure 5. Some example reconstructions of the two cloths sequence. In each pair, the left figure represents the input 2D feature points, and the right figure shows the reconstruction result.

        MP [26]  BMM [7]  CNR [22]  Ours
Error   0.032    0.023    0.025     0.044

Table 3. Results on the face sequence based on the error measure e_3D^2.

Variant    Ours   w/o Llr  w/o Lreg  w/o Lint
Error      0.168  0.175    0.174     0.268
Increase   -      0.007    0.006     0.100

Table 4. Ablation experiments on different components in our framework.

nf 2 3 5 9 10 12Error 0.257 0.211 0.211 0.211 0.176 0.168

Table 5. Average errors for various numbers of 3D reconstructors on the PASCAL3D+ data set.

5. Conclusion

In this paper, we have proposed 3D-URN, a 3D structure reconstruction network whose goal is to reconstruct the 3D structures of instances in an object category given their 2D feature points. The proposed network consists of a 3D shape reconstructor and a rotation estimator, which are trained in a fully-unsupervised manner based on the proposed loss functions. The 3D shape reconstructor recovers the 3D shape of an instance from its 2D feature points, which is a highly non-linear procedure. To alleviate this difficulty, we incorporated multiple 3D reconstructions, which are combined in a weighted sum to estimate the final 3D shape. The rotation estimator estimates the camera pose of the given instance. To handle the orthogonality constraint and the reflection ambiguity in the rotations, we proposed a rotation refiner, which consists of differentiable operations. The proposed method shows state-of-the-art performance on the popular PASCAL3D+ data set, demonstrating the effectiveness of the proposed scheme. Meanwhile, reconstructing multiple object categories simultaneously with a single network and extending the proposed scheme to handle missing data and the temporal smoothness assumption are key problems that could broaden the practical applications; these are left as future work.
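As a point of reference for the orthogonality and reflection issues mentioned above, one common differentiable way to map an unconstrained 3x3 output onto a proper rotation is the SVD-based projection sketched below. This is only an illustration of the general technique; the actual rotation refiner of 3D-URN is defined in the paper and may differ.

```python
import numpy as np

def refine_rotation(M):
    # Project an arbitrary 3x3 matrix onto SO(3): orthonormal columns and
    # det = +1.  SVD is differentiable almost everywhere, so this kind of
    # projection can be used inside a network trained by backpropagation.
    U, _, Vt = np.linalg.svd(M)
    # Flip the sign of the last singular direction when needed so that the
    # determinant is +1, which resolves the reflection ambiguity.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    return U @ D @ Vt

M = np.random.default_rng(1).standard_normal((3, 3))
R = refine_rotation(M)
print(np.allclose(R @ R.T, np.eye(3)))     # True: orthogonal
print(np.isclose(np.linalg.det(R), 1.0))   # True: proper rotation, no reflection
```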

Acknowledgment

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01190, [SW Star Lab] Robot Learning: Efficient, Safe, and Socially-Acceptable Machine Learning, and No. 2019-0-01371, Development of Brain-Inspired AI with Human-Like Intelligence).


Figure 6. Some example reconstructions of the PASCAL3D+ data set. Left: given 2D feature points. Middle, Right: 3D reconstruction results in two different views. Here, the empty circles represent the ground-truth 3D structures, and the filled circles represent the reconstructed 3D shapes.


References

[1] Antonio Agudo and Francesc Moreno-Noguer. Recovering Pose and 3D Deformable Shape from Multi-instance Image Ensembles. In Proc. of the Asian Conference on Computer Vision, pages 291–307, 2016.

[2] Antonio Agudo, Melcior Pijoan, and Francesc Moreno-Noguer. Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2018.

[3] Ijaz Akhter, Yaser Sheikh, Sohaib Khan, and Takeo Kanade. Nonrigid structure from motion in trajectory space. In Advances in Neural Information Processing Systems, pages 41–48, 2009.

[4] Christoph Bregler, Aaron Hertzmann, and Henning Biermann. Recovering Non-Rigid 3D Shape from Image Streams. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2000.

[5] Geonho Cha, Minsik Lee, Jungchan Cho, and Songhwai Oh. Non-rigid surface recovery with a robust local-rigidity prior. Pattern Recognition Letters, 110:51–57, July 2018.

[6] Ajad Chhatkuli, Daniel Pizarro, Toby Collins, and Adrien Bartoli. Inextensible Non-Rigid Shape-from-Motion by Second-Order Cone Programming. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016.

[7] Yuchao Dai, Hongdong Li, and Mingyi He. A Simple Prior-free Method for Non-Rigid Structure-from-Motion Factorization. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2012.

[8] Alessio Del Bue. A Factorization Approach to Structure from Motion with Shape Priors. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2008.

[9] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[10] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In Proc. American Control Conference, July 2003.

[11] Katerina Fragkiadaki, Marta Salas, Pablo Arbelaez, and Jitendra Malik. Grouping-Based Low-Rank Trajectory Completion and 3D Reconstruction. In Advances in Neural Information Processing Systems, pages 55–63, 2014.

[12] Yuan Gao and Alan L. Yuille. Symmetric Non-Rigid Structure from Motion for Category-Specific Object Structure Estimation. In Proc. of the European Conference on Computer Vision, pages 408–424, 2016.

[13] Paulo F. U. Gotardo and Aleix M. Martinez. Computing Smooth Time-Trajectories for Camera and Deformable Shape in Structure from Motion with Occlusion. IEEE Trans. Pattern Analysis and Machine Intelligence, 33(10):2051–2065, October 2011.

[14] Paulo F. U. Gotardo and Aleix M. Martinez. Kernel Non-Rigid Structure from Motion. In Proc. IEEE Int'l Conf. Computer Vision, November 2011.

[15] Paulo F. U. Gotardo and Aleix M. Martinez. Non-Rigid Structure from Motion with Complementary Rank-3 Spaces. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2011.

[16] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edition, 2004.

[17] Pan Ji, Hongdong Li, Yuchao Dai, and Ian Reid. Maximizing Rigidity Revisited: a Convex Programming Approach for Generic 3D Shape Reconstruction from Multiple Perspective Views. In Proc. IEEE Int'l Conf. Computer Vision, October 2017.

[18] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] Chen Kong and Simon Lucey. Prior-less Compressible Structure from Motion. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016.

[20] Chen Kong, Rui Zhu, Hamed Kiani, and Simon Lucey. Structure from Category: A Generic and Prior-less Approach. In Proc. Fourth International Conference on 3D Vision, October 2016.

[21] Minsik Lee, Jungchan Cho, Chong-Ho Choi, and Songhwai Oh. Procrustean Normal Distribution for Non-Rigid Structure from Motion. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2013.

[22] Minsik Lee, Jungchan Cho, and Songhwai Oh. Consensus of Non-Rigid Reconstructions. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016.

[23] Xiu Li, Hongdong Li, Hanbyul Joo, Yebin Liu, and Yaser Sheikh. Structure from Recurrent Motion: From Rigidity to Recurrency. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2018.

[24] Manuel Marques and João Costeira. Optimal shape from motion estimation with missing and degenerate data. In Proc. IEEE Workshop on Motion and Video Computing, January 2008.

[25] Stephen Milborrow, John Morkel, and Fred Nicolls. The MUCT Landmarked Face Database. Pattern Recognition Association of South Africa, 2010.

[26] Marco Paladini, Alessio Del Bue, Marko Stosic, Marija Dodig, João Xavier, and Lourdes Agapito. Factorization for Non-Rigid and Articulated Structure using Metric Projections. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2009.

[27] Shaifali Parashar, Daniel Pizarro, and Adrien Bartoli. Isometric Non-Rigid Shape-from-Motion in Linear Time. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016.

[28] Jonathan Taylor, Allan D. Jepson, and Kiriakos N. Kutulakos. Non-rigid Structure from Locally-rigid Motion. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2010.

[29] Carlo Tomasi and Takeo Kanade. Shape and Motion from Image Streams under Orthography: a Factorization Method. Int'l J. Computer Vision, 9(2):137–154, November 1992.

[30] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild. In Proc. IEEE Winter Conference on Applications of Computer Vision, March 2014.

[31] Yingying Zhu, Dong Huang, Fernando De La Torre, and Simon Lucey. Complex Non-rigid Motion 3D Reconstruction by Union of Subspaces. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2014.