
Deep Marching Cubes: Learning Explicit Surface Representations

Yiyi Liao 1,2   Simon Donné 1,3   Andreas Geiger 1,4
1 Autonomous Vision Group, MPI for Intelligent Systems and University of Tübingen

2 Institute of Cyber-Systems and Control, Zhejiang University   3 imec - IPI - Ghent University   4 CVG Group, ETH Zürich

{yiyi.liao, simon.donne, andreas.geiger}@tue.mpg.de

Abstract

Existing learning based solutions to 3D surface prediction cannot be trained end-to-end as they operate on intermediate representations (e.g., TSDF) from which 3D surface meshes must be extracted in a post-processing step (e.g., via the marching cubes algorithm). In this paper, we investigate the problem of end-to-end 3D surface prediction. We first demonstrate that the marching cubes algorithm is not differentiable and propose an alternative differentiable formulation which we insert as a final layer into a 3D convolutional neural network. We further propose a set of loss functions which allow for training our model with sparse point supervision. Our experiments demonstrate that the model allows for predicting sub-voxel accurate 3D shapes of arbitrary topology. Additionally, it learns to complete shapes and to separate an object's inside from its outside even in the presence of sparse and incomplete ground truth. We investigate the benefits of our approach on the task of inferring shapes from 3D point clouds. Our model is flexible and can be combined with a variety of shape encoder and shape inference techniques.

1. Introduction

3D reconstruction is a core problem in computer vision,

yet despite its long history many problems remain unsolved. Ambiguities or noise in the input require the integration of strong geometric priors about our 3D world. Towards this goal, many existing approaches formulate 3D reconstruction as inference in a Markov random field [2, 21, 41, 46] or as a variational problem [17, 47]. Unfortunately, the expressiveness of such prior models is limited to simple local smoothness assumptions [2, 17, 21, 47] or very specialized shape models [1, 15, 16, 42]. Neither can such simple priors resolve strong ambiguities, nor are they able to reason about missing or occluded parts of the scene. Hence, existing 3D reconstruction systems work either in narrow domains where specialized shape knowledge is available, or require well captured and highly-textured environments.

(a) Sparse Point Prediction (e.g., [12])
(b) Implicit Surface Prediction (e.g., [35, 45])
(c) Explicit Surface Prediction (ours)

Figure 1: Illustration comparing point prediction (a), implicit surface prediction (b) and explicit surface prediction (c). The encoder is shared across all approaches and depends on the input (we use point clouds in this paper). The decoder is specific to the output representation. All trainable components are highlighted in yellow. Note that only (c) can be trained end-to-end for the surface prediction task.

However, the recent success of deep learning [19, 20, 38] and the availability of large 3D datasets [5, 6, 9, 26, 37] nourishes hope for models that are able to learn powerful 3D shape representations from data, allowing reconstruction even in the presence of missing, noisy and incomplete observations. And indeed, recent advances in this area [7, 12, 18, 24, 34, 36, 39, 40] suggest that this goal can ultimately be achieved.

Existing 3D representation learning approaches can be classified into two categories: voxel based methods and point based methods, see Fig. 1 for an illustration.


Voxel based methods [34, 35, 45] use a grid of voxels as output representation and predict either voxel occupancy [34, 45] or a truncated signed distance field (TSDF) which implicitly determines the surface [35]. Point based methods [12] directly regress a fixed number of points as output. While voxel and point based representations are easy to implement, both require a post processing step to retrieve the actual 3D surface mesh which is the quantity of interest in 3D reconstruction. For point based methods, meshing techniques such as Poisson surface reconstruction [25] or SSD [4] can be employed. In contrast, implicit voxel based methods typically use marching cubes [29] to extract the zero level set.

As both techniques cannot be trained end-to-end for the 3D surface prediction task, an auxiliary loss (e.g., Chamfer distance on point sets, ℓ1 loss on the signed distance field) must be used during training. However, there are two major limitations in this setup: firstly, while implicit methods require 3D supervision on the implicit model, the ground truth of the implicit representation is often hard to obtain, e.g., in the presence of a noisy and incomplete point cloud or when the inside and outside of the object is unknown. Secondly, these methods only optimize an auxiliary loss defined on an intermediate representation and require an additional post-processing step for surface extraction. Thus they are unable to directly constrain the properties of the predicted surface.

In this work, we propose Deep Marching Cubes (DMC), a model which predicts explicit surface representations of arbitrary topology. Inspired by the seminal work on Marching Cubes [29], we seek an end-to-end trainable model that directly produces an explicit surface representation and optimizes a geometric loss function. This avoids the need for defining auxiliary losses or converting target meshes to implicit distance fields. Instead, we directly train our model to predict surfaces that agree with the 3D observations. We demonstrate that direct surface prediction can lead to more accurate reconstructions while also handling noise and missing observations. Besides, this allows for separating inside from outside even if the ground truth is sparse or not watertight, as well as easily integrating additional priors about the surface (e.g., smoothness). We summarize our contributions as follows:

• We demonstrate that Marching Cubes is not differentiable with respect to topological changes and propose a modified representation which is differentiable.

• We present a model for end-to-end surface prediction and derive appropriate geometric loss functions. Our model can be trained from unstructured point clouds and does not require explicit surface ground truth.

• We propose a novel loss function which allows for separating an object's inside from its outside even when learning with sparse unstructured 3D data.

• We apply our model to several surface prediction tasks and demonstrate its ability to recover surfaces even in the presence of incomplete or noisy ground truth.

Our code and data are available on the project website: https://avg.is.tue.mpg.de/research_projects/deep-marching-cubes.

2. Related Work

Point Based Representations: Point based representations have a long history in robotics and computer graphics. However, the irregular structure complicates the usage of point clouds in deep learning. Qi et al. [31] proposed PointNet for point cloud classification and segmentation. Invariance wrt. the order of the points is achieved by means of a global pooling operation over all points. As global pooling does not preserve local information, a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set has been proposed in follow-up work [33]. Fan et al. [12] proposed a model for sparse 3D reconstruction, predicting a point set from a single image. While point sets require fewer parameters to store compared to dense volumetric grids, the maximal number of points which can be predicted is limited to a few thousand due to the simple fully connected decoder. In contrast to the method proposed in this paper, an additional post-processing step [4, 25] is required to "lift" the 3D point cloud to a dense surface mesh.

Implicit Surface Representations: Implicit surface representations are amongst the most widely adopted representations in 3D deep learning as they can be processed by means of standard 3D CNNs. By far the most popular representation is the binary occupancy grid, which has been applied to a series of discriminative tasks such as 3D object classification [30, 32], 3D object detection [38] and 3D reconstruction [7, 13, 34, 44, 45]. Its drawback, however, is obvious: the accuracy of the predicted reconstruction is limited to the size of a voxel. While most existing approaches are limited to a resolution of 32^3 voxels, methods that exploit adaptive space partitioning techniques [18, 39] scale up to 256^3 or 512^3 voxel resolution. Yet, without sub-voxel estimation, the resulting reconstructions exhibit voxel-based discretization artefacts. Sub-voxel precision can be achieved by exploiting the truncated signed distance function (TSDF) [8] as representation, where each voxel stores the truncated signed distance to the closest 3D surface point [10, 28, 35].

While the aforementioned works require post-processing for isosurface extraction, e.g., using Marching Cubes [29], here we propose an end-to-end trainable solution which integrates this step into one coherent model. This allows for training the model directly using point based supervision and geometric loss functions. Thus, our model avoids the need for converting the ground truth point cloud or mesh into an intermediate representation (e.g., TSDF) and defining auxiliary loss functions. It is worth noting that this conversion is not only undesirable but often also very difficult, i.e., when learning from unordered point sets or non-watertight meshes for which inside/outside distinction is difficult. Our approach avoids the conversion step and instead infers such relationships using weak prior knowledge.

Explicit Surface Representations: Compared to implicit surface representations, explicit surface representations are less structured as they are typically organized as a set of vertices and faces, complicating their deployment in deep learning. Several works consider the problem of shape classification and segmentation by defining neural networks which operate on the graph spanned by the edges and vertices of a 3D mesh [3, 14, 43]. However, these methods assume a fixed input graph while in 3D reconstruction the graph (i.e., mesh) itself needs to be inferred. Very limited results have been presented for mesh based inference, and existing works are restricted by a fixed 3D topology or mild deviations from a 3D template. Rezende et al. [34] predict a small number of vertices using a fully connected network. Each vertex is constrained to move along a pre-defined line. Thus, their method is limited to very simple convex shapes (they consider spheres, cuboids and cylinders) with a small number of vertices. Kong et al. [27] predict a mesh by deforming the vertices of a nearest neighbor CAD model, resulting in predictions close to the original shape templates. Kanazawa et al. [23] also predict meshes; however, their method is specialized to human body shapes.

Our goal is to overcome these difficulties by combining voxel and mesh based representations. Our decoder operates in a volumetric space, but predicts the local face parameters of the surface mesh. Compared to the aforementioned methods, our representation is scalable regarding the number of vertex points while allowing for arbitrary topologies and the prediction of non-convex shapes. No shape templates are required at test time and the model generalizes well to unseen shape categories.

3. Deep Marching Cubes

We tackle the problem of predicting an explicit surface representation (i.e., a mesh) directly from raw observations (e.g., a mesh, point cloud, volumetric data or an image). Existing works formulate this problem as the prediction of an intermediate signed distance representation using an auxiliary (typically ℓ1) loss [10, 35], followed by applying the Marching Cubes (MC) algorithm [29]. In this work we aim at making this last step differentiable, hence allowing for end-to-end training using surface based geometric loss functions.

We first provide a formal introduction to the Marching Cubes algorithm [29].

Figure 2: Mesh Topology. The 2^8 = 256 topologies can be grouped into 15 equivalence classes due to rotational symmetry. In this paper, we consider only the singly connected topologies (highlighted in yellow).

(a) Marching Cubes
(b) Differentiable MC

Figure 3: Representation used by Marching Cubes (a) and the proposed Differentiable Marching Cubes (b). The former uses an implicit surface representation based on signed distances D while the latter exploits an explicit surface representation which is parameterized in terms of occupancy probabilities O and vertex displacements X.

We then demonstrate that backpropagation through this algorithm is intractable and propose a modified differentiable representation which avoids these problems. We exploit this representation as a Differentiable Marching Cubes Layer (DMCL) in a neural network for end-to-end surface prediction of arbitrary topology.

3.1. Marching Cubes

The Marching Cubes (MC) algorithm extracts the zero level set of a signed distance field and represents it as a set of triangles. It comprises two steps: estimation of the topology (i.e., the number and connectivity of triangles in each cell of the volumetric grid) and the prediction of the vertex locations of the triangles, determining the geometry.

More formally, let D ∈ R^{N×N×N} denote a (discretized) signed distance field obtained using volumetric fusion [8] or predicted by a neural network [10, 35], where N denotes the number of voxels along each dimension. Let further d_n ∈ R denote the n'th element of D where n = (i, j, k) ∈ N^3 is a multi-index (i, j, k correspond to the 3 dimensions of D). As D is a signed distance field, |d_n| is the distance between voxel n and its closest surface point. Without loss of generality, let us assume that d_n > 0 if voxel n is located inside an object and d_n < 0 otherwise. The zero level set of the signed distance field D defines the surface which can be represented by means of a triangular mesh M. This mesh M can be extracted from D using the Marching Cubes (MC) algorithm [29] which iterates ("marching") through all cells of the grid connecting the voxel centers and inserts triangular faces whenever a sign change is detected². More specifically, MC performs the following two steps:

First, the cell's surface topology T is determined based on the sign of d_n at its 8 corners. T can be represented as a binary tensor T ∈ {0, 1}^{2×2×2} where each element represents a corner. The total number of configurations equals 2^8 = 256, see Fig. 2 for an illustration. A vertex is created in case of a sign change of the distance values of two adjacent corners of the cell (i.e., corners connected by an edge). The vertex is placed at the edge connecting both corners.

In a second step, the vertex location of each triangular face along the edge is determined using linear interpolation. More formally, let x ∈ [0, 1] denote the relative location of a triangle vertex w along edge e = (v, v′) where v and v′ are the corresponding edge vertices as illustrated in Fig. 3a. In particular, let's assume x = 0 if w = v and x = 1 if w = v′. Let further d ∈ R and d′ ∈ R denote the signed distance values at v and v′, respectively. In the Marching Cubes algorithm, x is determined from d and d′ as the zero crossing of the linear interpolant of d and d′. This interpolant is given as f(x) = d + x(d′ − d). Setting f(x) = 0 yields x = d/(d − d′), see also Fig. 3a.
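As a minimal illustration of this step (a sketch in Python, not the authors' implementation; the function name and the explicit sign-change guard are our own), the per-edge vertex computation of classic Marching Cubes can be written as:

```python
def mc_edge_vertex(d, d_prime):
    """Vertex location on a cell edge (v, v') carrying signed distances d and d'.

    Returns the relative offset x in [0, 1] from v towards v', i.e. the zero
    crossing of the linear interpolant f(x) = d + x * (d' - d), or None if the
    edge shows no sign change (Marching Cubes creates no vertex there).
    """
    if d * d_prime > 0:           # same sign: the surface does not cross this edge
        return None
    return d / (d - d_prime)      # note: singular for d == d' (cf. the discussion below)
```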

Discussion: Given the MC algorithm, can we construct a deep neural network for end-to-end surface prediction? Indeed, we could try to construct a deep neural network which predicts a signed distance field that is converted into a triangular mesh using MC. We could then compare this surface to a ground truth surface or point cloud and backpropagate errors through the MC layer and the neural network. Unfortunately, this approach is intractable for two reasons:

• First, x = d/(d − d′) is singular at d = d′, thus preventing topological changes during training. However, the topology is unknown at training time if a point cloud or a partial mesh is used as input. Instead, the network needs to learn the topology during training.

• Second, observations affect only grid cells in their immediate vicinity, i.e., they act solely on cells where the surface passes through. Thus gradients are not propagated to cells further away from the predicted surface.

² We distinguish voxels and cells in this paper: voxels are the regular representation used by occupancy maps, while cells are displaced by a distance of 0.5 voxels and connect the voxel centers. Marching Cubes as well as our algorithm operates on the edges and vertices of these cells.

To circumvent these problems we propose a modified differentiable representation which separates the mesh topology from the geometry. In contrast to predicting signed distance values, we predict the probability of occupancy for each voxel. The mesh topology is then implicitly (and probabilistically) defined by the state of the occupancy variables at its corners. In addition, we predict a vertex location for every edge of each cell. The combination of both implicitly defined topology and vertex location defines a distribution over meshes which is differentiable and can be used for backpropagation. The second problem can be tackled by introducing appropriate loss functions on the occupancy and the vertex location variables.

Note that predicting occupancies instead of distance values is not a limitation as the surface computed via MC does not depend on cells further away. Similar to MC, our representation is flexible in terms of the output topology.

3.2. Differentiable Marching Cubes

We now formalize our Differentiable Marching Cubes Layer (DMCL). Let again n = (i, j, k) ∈ N^3 denote a multi-index into a 3D tensor and let 1 = (1, 1, 1) index the first element of the tensor. Let O ∈ [0, 1]^{N×N×N} denote the occupancy field and let X ∈ [0, 1]^{N×N×N×3} denote the vertex displacement field predicted by a neural network (see Section 3.3 for details on the network architecture). Let o_n ∈ [0, 1] denote the n'th element of O, representing the occupancy probability of that voxel with o_n = 1 if the voxel is occupied. Similarly, let x_n ∈ [0, 1]^3 denote the n'th element of X, representing the displacements of the triangle vertices along the edges associated with x_n. Note that x_n is a 3-dimensional vector as we need to specify one vertex displacement for each dimension of the 3D space (see Fig. 3b). Let w denote a vertex of the output mesh located on edge e = (v, v′). As before, we have x = 0 if w = v and x = 1 if w = v′. In other words, w is displaced linearly between v and v′ based on x.

The topology can be implicitly defined via the occupancy variables. We consider the predictions of the neural network o_n ∈ [0, 1] as parameters of a Bernoulli distribution

    p_n(t) = (o_n)^t (1 − o_n)^{1−t}    (1)

where t ∈ {0, 1} is a random variable and p_n(t) is the probability of voxel n being occupied (t = 1) or unoccupied (t = 0). Let now o_n, ..., o_{n+1} denote the 2^3 = 8 occupancy variables corresponding to the 8 corners of the n'th grid cell. Let further T ∈ {0, 1}^{2×2×2} denote a binary random tensor representing the topology. The probability for topology T at grid cell n is the product of the 8 occupancy probabilities at its corners

    p_n(T) = ∏_{m ∈ {0,1}^3} (o_{n+m})^{t_m} (1 − o_{n+m})^{1−t_m}    (2)
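As a hedged sketch of how (1) and (2) can be evaluated (NumPy, not the authors' code; the function name and array layout are our own assumptions):

```python
import numpy as np

def topology_prob(occ, n, T):
    """Probability p_n(T) of corner configuration T at grid cell n, cf. Eq. (2).

    occ : (N, N, N) array of per-voxel occupancy probabilities o in [0, 1].
    n   : tuple (i, j, k) indexing the cell; its 8 corners are the voxels n + m
          for m in {0, 1}^3.
    T   : (2, 2, 2) binary array encoding one of the 2^8 corner configurations.
    """
    i, j, k = n
    corners = occ[i:i + 2, j:j + 2, k:k + 2]            # the 8 occupancy probabilities
    factors = np.where(T == 1, corners, 1.0 - corners)  # Bernoulli factor per corner (Eq. 1)
    return float(factors.prod())
```

Restricting to the 140 singly connected topologies and renormalizing, as described in the remark below, amounts to dividing this value by the sum over the admissible configurations.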

Figure 4: Network Architecture. The input point cloud P is converted into a volumetric representation using grid pooling. The grid pooling operation (highlighted in yellow) takes as input a set of K points with their D = 16 dimensional feature maps and performs max pooling within each cell. Empty cells are associated with the zero vector. The pooled features are processed by an encoder-decoder network with skip connections. The decoder has two heads: one for occupancies O and one for vertex displacements X. All details of the architecture can be found in the supplementary material.

with t_m ∈ {0, 1} denoting the m'th element of T. Note that jointly with the vertex displacement field X, the distribution over topologies p_n(T) at cell n defines a distribution over triangular meshes within cell n. Considering all cells n ∈ 𝒯 we obtain a distribution over meshes in the entire grid as

    p(T_n | n ∈ 𝒯) = ∏_{n ∈ 𝒯} p_n(T_n)    (3)

where 𝒯 = {1, ..., N−1}^3 and the vertex displacements X are fixed to the predictions of the neural net.

Remark: While the total number of possible topologies within a voxel is 2^8 = 256, many of them represent disconnected meshes. As those are unlikely to occur in practice given a fine enough voxel resolution, in this paper we only consider the 140 singly connected topologies (highlighted in yellow in Fig. 2) and renormalize (2) accordingly.

3.3. Network Architecture

This section describes our complete network architecture which integrates the Differentiable Marching Cubes Layer described in the previous section as a final layer for explicit surface prediction. We adopt an encoder-decoder architecture as illustrated in Fig. 4. The encoder extracts features from the raw observations and the decoder predicts an explicit surface. In this paper we consider a 3D point cloud P ∈ R^{K×3} with K points as input. However, note that the encoder could be easily adapted to other types of observations including 3D volumetric information or 2D images.

Our point cloud encoder is a variation of PointNet++ [33] which is invariant to the local point ordering while retaining local information. Similar to PointNet++, we first extract a local feature vector for each point using fully connected layers. The major difference is that our feature representation is tightly coupled with the discrete structure of the voxel grid. While PointNet++ recursively samples points for grouping, we group all points falling into a voxel into one set and apply pooling within this voxel. Thus, we retain the regular grid structure of the decoder which allows for exploiting skip connections in our model (see Fig. 4).
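A minimal sketch of such a grid pooling step is given below (NumPy; the coordinate normalization and the non-negativity of the features after a ReLU are assumptions on our part, not taken from the paper):

```python
import numpy as np

def grid_pool(points, feats, N):
    """Max-pool per-point features into an N x N x N voxel grid.

    points : (K, 3) point coordinates, assumed normalized to [0, 1)^3 (assumption).
    feats  : (K, D) per-point features, assumed non-negative (e.g. after a ReLU)
             so that max-pooling against the zero initialization leaves empty
             cells at the zero vector, as described in the text.
    Returns a (D, N, N, N) feature volume.
    """
    K, D = feats.shape
    grid = np.zeros((D, N, N, N), dtype=feats.dtype)
    idx = np.clip((points * N).astype(int), 0, N - 1)   # voxel index of each point
    for p in range(K):                                   # max pooling within each voxel
        i, j, k = idx[p]
        grid[:, i, j, k] = np.maximum(grid[:, i, j, k], feats[p])
    return grid
```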

The result of the grid pooling operation is fed into a standard 3D encoder-decoder CNN for increasing the size of the receptive field. This subnetwork is similar to the one used in [11, 36] and comprises convolution, pooling and unpooling layers as well as ReLU non-linearities. Following common practice, we exploit skip connections to preserve details. The decoder head is split into two branches, one for estimating the occupancy probabilities O and one for predicting the vertex displacement field X. A sigmoid layer is added to both O and X to ensure valid probabilities between 0 and 1 for O, and valid vertex displacements for X. The distribution over topologies is given by equation (3). For more details we refer to the supplementary material.

3.4. Loss Functions

At training time, our goal is to minimize the distance between the ground truth point cloud and the predicted surface mesh M. Note that our model predicts a distribution over surface meshes p(M), thus we minimize the expected surface error. We add additional constraints to regularize the occupancy variables and the smoothness of the estimated mesh. Our loss function decomposes into four parts

    L(θ) = w_1 Σ_n L^mesh_n(θ) + w_2 L^occ(θ) + w_3 Σ_{n∼m} L^smooth_{n,m}(θ) + w_4 Σ_{n∼m} L^curve_{n,m}(θ)    (4)

where θ represents the parameters of the neural network in Fig. 4, w_i are the weights of the loss function and n ∼ m denotes the set of adjacent cells in the grid. Each part of this loss function will be described in the following paragraphs.

Point to Mesh Loss: We first introduce a geometric loss which measures the compatibility of the predicted 3D surface mesh with respect to the observed 3D points. Let Y denote the set of observed 3D points (i.e., the ground truth) and let Y_n ⊆ Y denote the set of observed points falling into cell n. As our model predicts a distribution of topologies p_n(T) and hence also meshes at every cell n, we seek to minimize the expected error with respect to this distribution. More formally, we have

    L^mesh_n(θ) = E_{p_n(T|θ)} [ Σ_{y ∈ Y_n} Δ(M_n(T, X(θ)), y) ]    (5)

where y ∈ R^3 is an observed 3D point, M_n(T, X) represents the mesh induced by topology T and vertex displacement field X at cell n, and Δ(M, y) denotes the point-to-mesh distance. The point-to-mesh distance is calculated by finding the triangle closest to y in terms of Euclidean (ℓ2) distance. Note that in contrast to losses defined on implicit surface representations (e.g., TSDF), the loss in (5) directly measures the geometric error of the inferred mesh.
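The structure of this expectation can be sketched as follows (Python; `point_to_mesh` is a hypothetical helper for the closest-triangle distance and the topology indexing is our own illustration, not the authors' interface):

```python
def expected_point_to_mesh_loss(topo_probs, cell_meshes, cell_points, point_to_mesh):
    """Expected point-to-mesh error for a single cell, in the spirit of Eq. (5).

    topo_probs    : dict mapping a topology id to its probability p_n(T).
    cell_meshes   : dict mapping the same topology id to the triangles M_n(T, X)
                    induced by the predicted vertex displacements.
    cell_points   : observed 3D points y falling into the cell.
    point_to_mesh : hypothetical helper returning the Euclidean distance from a
                    point to the closest triangle of a mesh (not defined here).
    """
    loss = 0.0
    for T, p in topo_probs.items():
        loss += p * sum(point_to_mesh(cell_meshes[T], y) for y in cell_points)
    return loss
```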

While (5) ensures that the inferred mesh covers all observations, the converse is not true. That is, surface predictions far from the observations are not penalized as long as all observations are covered by the predicted surface mesh. Unfortunately, such a penalty is not feasible in our case as the ground truth may be incomplete. We therefore add a small constant loss on all non-empty topologies for cells without observed points. Moreover, we introduce additional loss functions that prefer simple solutions in the following paragraphs. In particular, these constraints enforce free space at the boundary of the volume and smoothness of the surface.

Occupancy Loss: As mentioned above, the occupancy status is ambiguous when considering unstructured 3D point clouds as observations. That is, flipping the occupied with the free voxels will result in exactly the same geometric loss as only the distance to the surface can be measured, but no information about what is inside or outside is present in the data. However, we observe that for most scenes objects are surrounded by free space, thus we can safely assume that the 6 faces of the cube bounding the 3D scene are unoccupied. Defining a prior for occupied voxels is more challenging. One could naïvely assume that the center of the bounding cube must be occupied, yet this is not true in general. Thus, we relax this assumption by encouraging a sub-volume inside the scene to be occupied. More formally, we have:

    L^occ(θ) = (1/|Γ|) Σ_{n ∈ Γ} o_n(θ) + w (1 − (1/|Ω|) Σ_{n ∈ Ω} o_n(θ))    (6)

where Γ denotes the boundary of the scene cube (i.e., all voxels on its six faces) and Ω denotes a sub-volume inside the cube (e.g., half the size of the scene cube). Minimizing the first term of (6) encourages the boundary voxels to become unoccupied. Minimizing the second term enforces a region within the scene cube to become occupied depending on the adaptive weight w, which decreases with the number of highly confident occupied voxels in the scene.
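A rough sketch of this prior is shown below (NumPy; the concrete choice of Ω and the schedule for the adaptive weight w are assumptions on our part, since the text leaves them to the implementation):

```python
import numpy as np

def occupancy_loss(occ, w):
    """Occupancy prior in the spirit of Eq. (6).

    occ : (N, N, N) predicted occupancy probabilities.
    w   : adaptive weight of the occupied sub-volume term; its schedule (decreasing
          with the number of confidently occupied voxels) is not reproduced here.
    """
    N = occ.shape[0]
    boundary = np.ones(occ.shape, dtype=bool)
    boundary[1:-1, 1:-1, 1:-1] = False                 # Gamma: the six faces of the cube
    q = N // 4                                         # Omega: central cube with half the
    interior = np.zeros(occ.shape, dtype=bool)         # side length (assumed choice)
    interior[q:N - q, q:N - q, q:N - q] = True
    return occ[boundary].mean() + w * (1.0 - occ[interior].mean())
```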

Smoothness Loss: Note that both L^mesh as well as L^occ act only locally on the volume. To propagate occupancy information within the volume, we therefore introduce an additional smoothness loss. In particular, we assume that the majority of all neighboring voxels take the same occupancy state. This assumption is justified by the fact that transitions happen only at the surface of an object (covering the minority of voxels). We therefore introduce the following pairwise loss, encouraging occupancy smoothness:

    L^smooth_{n,m} = |o_n(θ) − o_m(θ)|    (7)
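Summed over all pairs of face-adjacent voxels, this pairwise term amounts to an anisotropic total-variation penalty on the occupancy grid; a minimal sketch (how the individual terms are weighted or averaged is left to the loss weight w_3 in (4)):

```python
import numpy as np

def smoothness_loss(occ):
    """Sum of |o_n - o_m| over all pairs of face-adjacent voxels, cf. Eq. (7)."""
    return (np.abs(np.diff(occ, axis=0)).sum()
            + np.abs(np.diff(occ, axis=1)).sum()
            + np.abs(np.diff(occ, axis=2)).sum())
```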

Curvature Loss: Similarly to the smoothness loss on the occupancy variables we can encourage smoothness of the predicted mesh geometry. This is particularly important if the ground truth point cloud is sparse and noisy as assumed in this paper. We therefore add a curvature loss which enforces smooth transitions between adjacent cells by minimizing the expected difference in normal orientation:

    L^curve_{n,m}(θ) = E_{p_{n,m}(T,T′|θ)} [ φ_{n,m}(T, T′, X(θ)) ]    (8)

Here, p_{n,m}(T, T′|θ) = p_n(T|θ) p_m(T′|θ) is the joint distribution over the topologies of voxel n and voxel m. Furthermore, φ_{n,m}(·) denotes a function which returns the squared ℓ2 distance between the normals of the faces in cells n and m which are connected by a joint edge, and 0 if the faces in both cells are not topologically connected.

4. Experimental Evaluation

In this section, we first thoroughly evaluate the effectiveness and robustness of the proposed method in 2D. Then we demonstrate the ability of our method to predict 3D meshes from 3D point clouds.

4.1. Model Validation in 2D

For clarity, we validate our model in 2D before we consider the 3D case. In 2D, the total number of topologies reduces to 2^4 = 16 as illustrated in the supplementary material. We rendered silhouettes of 1547 different car instances from ShapeNet [5], which we split into 1237 training samples and 310 test samples. We randomly sampled 300 points from the silhouette boundaries which we feed as input to the network. We use a voxel grid of size N × N × N with N = 32 throughout all of our experiments. All other hyper-parameters are specified in the supplementary material. For evaluation, we use Chamfer distance, accuracy and completeness. We follow common practice [22] and specify all measures as distances, thus lower accuracy / completeness values indicate better results.
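For reference, these evaluation measures can be sketched on sampled surface points as follows (the exact sampling, units and averaging convention used in the paper are not reproduced here):

```python
import numpy as np

def chamfer_metrics(pred_pts, gt_pts):
    """Chamfer-style evaluation from sampled surface points.

    pred_pts : (P, d) points sampled from the predicted surface.
    gt_pts   : (Q, d) points sampled from the ground-truth surface.
    Accuracy     = mean distance from predicted points to the ground truth.
    Completeness = mean distance from ground-truth points to the prediction.
    Chamfer      = mean of the two (one common convention; an assumption).
    """
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    accuracy = dists.min(axis=1).mean()
    completeness = dists.min(axis=0).mean()
    return 0.5 * (accuracy + completeness), accuracy, completeness
```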

(a) L^mesh  (b) +L^occ  (c) +L^smooth  (d) +L^curve  (e) Car→Bottle  (f) Topology

(g) Quantitative Results (Lower is Better):

              Chamfer   Acc.    Comp.   Hamming
  L^mesh      0.339     0.388   0.289   83.69%
  +L^occ      0.357     0.429   0.285    4.67%
  +L^smooth   0.240     0.224   0.255    0.56%
  +L^curve    0.245     0.219   0.272    0.53%

Figure 5: 2D Ablation Study. (a)-(d)+(g) show our results when incrementally adding the loss functions of (4). (e)+(f) demonstrate the ability of our model to generalize to novel categories (train: car, test: bottle) and more complex surface topologies (in this case, two separated objects). The top row shows the input points in gray and the estimated occupancy field O with red indicating occupied voxels. The bottom row shows the most probable surface M in red.

Ablation Study: We first validate the effectiveness of each component of our loss function in Fig. 5. Starting with the point to mesh loss L^mesh, we incrementally add the occupancy loss L^occ, smoothness loss L^smooth and curvature loss L^curve. We evaluate the quality of the predicted mesh by measuring the Chamfer distance in voxels, which considers both accuracy and completeness of the predicted mesh. For this experiment, we also evaluated the Hamming distance between our occupancy prediction and the ground truth occupancy to assess the ability of our model in separating inside from outside. Using only L^mesh, the network predicts multiple surfaces around the true surface and fails to predict occupancy (a). Adding the occupancy loss L^occ allows the network to separate inside from outside, but still leads to fragmented surface boundaries (b). Adding the smoothness loss L^smooth removes these fragmentations (c). The curvature loss L^curve further enhances the smoothness of the surface without decreasing performance. Thus, we adopt the full model in the following evaluation.

Generalization & Topology: To demonstrate the flexibility of our approach, we apply our model trained on the category "car" to point clouds from the category "bottle". As evidenced by Fig. 5e, our model generalizes well to novel categories; it learns local shape representations rather than capturing purely global shape properties. Fig. 5f shows that our method, trained and tested with multiple separated car instances, also handles complex topologies, correctly separating inside from outside even when the center voxel is not occupied, validating the robustness of our occupancy loss.

Model Robustness: In practice, 3D point cloud measurements are often noisy or incomplete due to sensor occlusions. In this section, we demonstrate that our method is able to reconstruct surfaces even in the presence of noisy and incomplete observations. Note that this is a challenging problem which is typically not considered in learning-based approaches to 3D reconstruction which assume that the ground truth is densely available.

            Chamfer   Accuracy   Complete.
σ = 0.00    0.245     0.219      0.272
σ = 0.15    0.246     0.219      0.273
σ = 0.30    0.296     0.267      0.325

Table 1: Robustness wrt. Noisy Ground Truth.

            Chamfer   Accuracy   Complete.
θ = 15°     0.234     0.210      0.257
θ = 30°     0.250     0.227      0.273
θ = 45°     0.308     0.261      0.354

Table 2: Robustness wrt. Incomplete Ground Truth.

We vary the level of noise and completeness in Table 1 and Table 2. For moderate levels of noise, the predicted mesh degrades only slightly. Moreover, our model correctly predicts the shape of the car in Table 2 even though information within an angular range of up to 45° was not available during training.

4.2. 3D Shape Prediction from Point Clouds

In this section, we verify the main hypothesis of this paper, namely whether end-to-end learning for 3D shape prediction is beneficial wrt. regressing an auxiliary representation and extracting the 3D shape in a post-processing step. Towards this goal, we compare our model to two baseline methods which regress an implicit representation as widely adopted in the 3D deep learning literature [7, 13, 34, 44, 45], as well as to the well-known Screened Poisson Surface Reconstruction (PSR) [25]. Specifically, given the same point cloud encoder as introduced in Section 3.3, we construct two baselines which predict occupancy and Truncated Signed Distance Functions (TSDFs), respectively, followed by classical Marching Cubes (MC) for extracting the meshes. For a fair comparison, we use the same decoder architecture as our occupancy branch and predict at the same resolution (32 × 32 × 32 voxels).

Resolution   Method        Chamfer   Accuracy   Complete.
32^3         Occ. + MC     0.407     0.246      0.567
32^3         TSDF + MC     0.412     0.236      0.588
32^3         wTSDF + MC    0.354     0.219      0.489
32^3         PSR-5         0.352     0.405      0.298
32^3         Ours          0.218     0.182      0.254
256^3        PSR-8         0.198     0.196      0.200

Table 3: 3D Shape Prediction from Point Clouds.

We apply PSR with its default parameters³. While the default resolution of the underlying grid (with reconstruction depth d = 8) is 256 × 256 × 256, we also evaluate PSR with d = 5 (and hence a 32 × 32 × 32 grid as in our method) for a fair comparison.

Again, we conduct our experiments on the ShapeNet dataset, but this time we directly use the provided 3D models. More specifically, we train our models jointly on objects from 3 classes (bottle, car, sofa). As ShapeNet models comprise interior faces such as car seats, we rendered depth images and applied TSDF fusion at a high resolution (128 × 128 × 128 voxels) for extracting clean meshes and occupancy grids. We randomly sampled points on these meshes which are used as input to the encoder as well as observations. Note that training the implicit representation baselines requires dense ground truth of the implicit surface / occupancy grid while our approach only requires a sparse unstructured 3D point cloud for supervision. For the input point cloud we add Gaussian noise with σ = 0.15 voxels.

Table 3 shows our results. All predicted meshes are compared to the ground truth mesh extracted from the TSDF at 128 × 128 × 128 voxels resolution. Here, wTSDF refers to a TSDF variant where higher importance is given to voxels closer to the surface, resulting in better meshes.

Our method outperforms both baseline methods and PSR in all three metrics given the same resolution. This validates our hypothesis that directly optimizing a surface loss leads to better surface reconstructions. Note that our method infers occupancy using only unstructured points as supervision while both baselines require this knowledge explicitly.

A qualitative comparison is shown in Fig. 6. Our method significantly outperforms the baseline methods in reconstructing small details (e.g., wheels of the cars in rows 1-4) and thin structures (e.g., back of the sofa in rows 6+8). The reason for this is that implicit representations require discretization of the ground truth while our method does not. Furthermore, the baseline methods fail completely when the ground truth mesh is not closed (e.g., car underbody is missing in row 4) or has holes (e.g., car windows in row 2). In this case, large portions of the space are incorrectly labeled free space. While the baselines use this information directly as training signal, our method uses a surface-based loss and is thus less affected by errors in the occupancy ground truth.

³ PSR: https://github.com/mkazhdan/PoissonRecon; we use Meshlab to estimate normal vectors as input to PSR.

Figure 6: 3D Shape Prediction from Point Clouds. Columns from left to right: Input, Occ, wTSDF, PSR-5, PSR-8, Ours, GT. Surfaces are colored: the outer surface is yellow, the inner red.

Even though PSR-8 beats our method on completeness given its far higher resolution, it is less robust to noisy inputs compared to PSR-5, while our method handles the trade-off between reconstruction and robustness more gracefully. Furthermore, PSR sometimes flips inside and outside (rows 2+4+6+7) as estimating oriented normal vectors from a sparse point set is a non-trivial task.

We also provide some failure cases of our method in the last two rows of Fig. 6. Our method might fail on very thin surfaces (row 9) or connect disconnected parts (row 10), although in both cases our method still convincingly outperforms the other methods. Those failures are caused by the rather low-resolution output (a 32^3 grid), which could be addressed using octree networks [18, 35, 36, 39].

5. Conclusion

We proposed a flexible framework for learning 3D mesh prediction. We demonstrated that training the surface prediction task end-to-end leads to more accurate and complete reconstructions. Moreover, we showed that surface-based supervision results in better predictions in case the ground truth 3D model is incomplete. In future work, we plan to adapt our method to higher resolution outputs using octree techniques [18, 36, 39] and integrate our approach with other input modalities like the ones illustrated in Fig. 1.

Acknowledgements: Yiyi Liao was partially supported by NSFC under grant U1509210.

References

[1] S. Bao, M. Chandraker, Y. Lin, and S. Savarese. Dense object reconstruction with semantic priors. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 23:2001, 1999.
[3] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. Signal Processing Magazine, 34(4):18–42, 2017.
[4] F. Calakli and G. Taubin. SSD: Smooth signed distance surface reconstruction. Computer Graphics Forum, 30(7):1993–2002, 2011.
[5] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. arXiv.org, 1512.03012, 2015.
[6] S. Choi, Q. Zhou, S. Miller, and V. Koltun. A large dataset of object scans. arXiv.org, 1602.02481, 2016.
[7] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proc. of the European Conf. on Computer Vision (ECCV), 2016.
[8] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In ACM Trans. on Graphics (SIGGRAPH), 1996.
[9] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Niessner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[10] A. Dai, C. R. Qi, and M. Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[11] A. Dosovitskiy, P. Fischer, E. Ilg, P. Haeusser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2015.
[12] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In Proc. of the European Conf. on Computer Vision (ECCV), 2016.
[14] K. Guo, D. Zou, and X. Chen. 3d mesh labeling via deep convolutional neural networks. In ACM Trans. on Graphics (SIGGRAPH), 2015.
[15] F. Guney and A. Geiger. Displets: Resolving stereo ambiguities using object knowledge. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[16] C. Haene, N. Savinov, and M. Pollefeys. Class specific 3d object shape priors using surface normals. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[17] C. Haene, C. Zach, A. Cohen, R. Angst, and M. Pollefeys. Joint 3D scene reconstruction and class segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
[18] C. Hane, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. arXiv.org, 1704.00710, 2017.
[19] K. He, G. Gkioxari, P. Dollar, and R. B. Girshick. Mask R-CNN. arXiv.org, 1703.06870, 2017.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[21] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 30(2):328–341, 2008.
[22] R. R. Jensen, A. L. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs. Large scale multi-view stereopsis evaluation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[23] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. arXiv.org, 1712.06584, 2017.
[24] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Category-specific object reconstruction from a single image. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[25] M. M. Kazhdan and H. Hoppe. Screened poisson surface reconstruction. ACM Trans. on Graphics (SIGGRAPH), 32(3):29, 2013.
[26] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. on Graphics (SIGGRAPH), 36(4), 2017.
[27] C. Kong, C.-H. Lin, and S. Lucey. Using locally corresponding cad models for dense 3d reconstructions from a single image. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[28] L. Ladicky, O. Saurer, S. Jeong, F. Maninchedda, and M. Pollefeys. From point clouds to mesh using regression. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2017.
[29] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In ACM Trans. on Graphics (SIGGRAPH), 1987.
[30] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Proc. IEEE International Conf. on Intelligent Robots and Systems (IROS), 2015.
[31] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[32] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[33] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NIPS), 2017.
[34] D. J. Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems (NIPS), 2016.
[35] G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger. OctNetFusion: Learning depth fusion from data. In Proc. of the International Conf. on 3D Vision (3DV), 2017.
[36] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[37] T. Schops, J. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[38] S. Song and J. Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[39] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2017.
[40] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[41] A. O. Ulusoy, M. Black, and A. Geiger. Patches, planes and probabilities: A non-local prior for volumetric 3d reconstruction. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[42] A. O. Ulusoy, M. Black, and A. Geiger. Semantic multi-view stereo: Jointly estimating objects and voxels. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[43] P. Wang, Y. Gan, Y. Zhang, and P. Shui. 3d shape segmentation via shape fully convolutional networks. Computers & Graphics, 1702.08675, 2017.
[44] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems (NIPS), 2016.
[45] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2015.
[46] K. Yamaguchi, D. McAllester, and R. Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In Proc. of the European Conf. on Computer Vision (ECCV), 2014.
[47] C. Zach, T. Pock, and H. Bischof. A globally optimal algorithm for robust tv-l1 range image integration. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2007.