
GEOMetrics: Exploiting Geometric Structure for Graph-Encoded Objects

Edward J. Smith 1   Scott Fujimoto 1 2   Adriana Romero 3 1   David Meger 1

arXiv:1901.11461v1 [cs.CV] 31 Jan 2019

Abstract

Mesh models are a promising approach for encoding the structure of 3D objects. Current mesh reconstruction systems predict uniformly distributed vertex locations of a predetermined graph through a series of graph convolutions, leading to compromises with respect to performance or resolution. In this paper, we argue that the graph representation of geometric objects allows for additional structure, which should be leveraged for enhanced reconstruction. Thus, we propose a system which properly benefits from the advantages of the geometric structure of graph-encoded objects by introducing (1) a graph convolutional update preserving vertex information; (2) an adaptive splitting heuristic allowing detail to emerge; and (3) a training objective operating both on the local surfaces defined by vertices as well as the global structure defined by the mesh. Our proposed method is evaluated on the task of 3D object reconstruction from images with the ShapeNet dataset, where we demonstrate state-of-the-art performance, both visually and numerically, while having far smaller space requirements by generating adaptive meshes.

1. Introduction

Surfaces in our physical world exhibit highly non-uniform curvature; compare a plane's wing to a microscopic screw that fastens its engine. Traditionally, deep 3D understanding systems have relied upon representations such as voxels and point clouds, which capture structure with either uniform volumetric or surface detail (Choy et al., 2016; Wu et al., 2016; Fan et al., 2017). By representing unimportant, or uninteresting, regions with high detail, these systems scale poorly to higher resolutions and object complexity (see Figures 1a and 1b). While efforts have been made to rectify this issue through intermediary representations (Tatarchenko et al., 2017; Häne et al., 2017; Smith et al., 2018; Zhang et al., 2018), these either continue to rely on sparse volumetric units, or maintain uniform detail through alternative representations.

1 Department of Computer Science, McGill University, Montreal, Canada  2 Mila Québec AI Institute  3 Facebook AI Research. Correspondence to: Edward Smith <[email protected]>.

(a) Voxels (262,144 units)  (b) Point cloud (30,000 points)  (c) Uniform mesh (2,416 vertices)  (d) Adaptive mesh (120 vertices)

Figure 1. Comparison of 3D shape encoding techniques, including their respective encoding sizes for the level of quality viewed.

A triangle mesh is a graph-based shape representation that encodes 3D structure through a set of vertices and corresponding planar faces. Recently, advances in deep learning on graphs have enabled mesh-based 3D shape methods (Kato et al., 2017; Wang et al., 2018; Kanazawa et al., 2018; Jack et al., 2018; Henderson & Ferrari, 2018; Groueix et al., 2018b). However, these approaches produce mesh predictions which uniformly space vertices and faces over their surface, providing no significant improvement over the previously highlighted representations (see Figure 1c) and hence not exploiting the advantages of the mesh representation. In particular, by placing many vertices in regions of fine detail while using large triangles to summarize nearly planar regions, one could define adaptive meshes (see Figure 1d), enabling flexible scaling by effectively localizing complexity in object surfaces, and allowing the 3D structure of complicated shapes to be encoded with smaller space requirements and higher precision.


Moreover, these deep learning mesh-based approaches rely on Graph Convolutional Networks (GCNs) (Kipf & Welling, 2016; Defferrard et al., 2016; Cucurull et al., 2018; Wang et al., 2018). Although effective in many node/graph classification and regression tasks, we argue that GCNs may be inadequate for understanding, reconstructing or generating 3D structure, as they may induce over-smoothing while aggregating neighboring information at the vertex level (Li et al., 2018). This aggregation bias could in turn lead to a harder learning problem when vital information held at each vertex cannot be derived from its neighbors, and as a direct consequence must not be lost.

Last, an important question when reconstructing 3D objects is how to define a loss between a prediction and its target. A common approach is to employ the Chamfer Distance over some parametrization of the two surfaces (Barrow et al., 1977; Insafutdinov & Dosovitskiy, 2018; Wang et al., 2018; Groueix et al., 2018a; Sun et al., 2018; Fan et al., 2017). However, this loss penalizes point positions exclusively, and thus its direct application to mesh vertex positions leads to poor accuracy, as no information about the faces they define is provided and the placement of vertices over a surface is, to a large degree, arbitrary. In addition, this local loss function takes no consideration of the global structure of the predicted object, preventing class-specific attributes from emerging and creating global inconsistencies.

Therefore, in this paper, we aim to address the above-mentioned limitations by introducing an adaptive mesh reconstruction system, called Geometrically Exploited Object Metrics (GEOMetrics), which properly capitalizes on the advantages and geometric structure of graph-encoded objects. GEOMetrics reformulates graph convolutional layers to prevent vertex smoothing. Moreover, it incorporates an adaptive face splitting heuristic allowing non-uniform detail to emerge. Finally, it introduces a training objective operating both on the local surfaces defined by vertices, via a differentiable sampling procedure, as well as the global structure defined by the graph, through a perceptual loss reminiscent of that of style transfer applications (Gatys et al., 2016; Johnson et al., 2016). To the best of our knowledge, our system is the first deep approach to describing shape as an adaptive mesh, through advances in geometrically-aware graph operations. We extensively evaluate our system on the task of 3D object reconstruction from single RGB images and show that the interplay of our introduced components encourages mesh reconstructions which properly localize detail, while maintaining structural consistency. As a result, we are able to obtain mesh predictions which outperform previous methods and have far smaller space requirements.

The contributions of this paper can be summarized as:

• We introduce the Zero-Neighbor GCN (0N-GCN), an extension of Kipf & Welling (2016), which allows the information at each vertex to be maintained and, as a result, better suits the understanding and reconstruction of 3D meshes.

• We present an adaptive face splitting procedure to encourage local complexity to emerge when reconstructing meshes, taking advantage of the mesh's flexible scaling (see Figure 1d).

• We propose a training objective which operates locally and globally over the surface to produce mesh reconstructions which are highly accurate and benefit from the graceful scaling of mesh representations.

• We highlight through extensive evaluation the substantial benefits provided by the previous contributions and show, on the task of 3D object reconstruction from single RGB images, that by properly exploiting the mesh's properties and geometry, our GEOMetrics system is able to notably outperform prior methods visually and quantitatively, while requiring far fewer vertices/faces.

Note that the above-mentioned contributions are not specific to the reconstruction system nor the chosen task, and thus can be easily adapted to arbitrary mesh problems. Code for our system is publicly available on a GitHub repository, to ensure reproducible experimental comparison.¹

2. Related Work

3D Mesh Reconstruction. Mesh models have only recently been used in generation and reconstruction tasks due to the challenging nature of their complex definition (Wang et al., 2018). Recent mesh approaches rely on graph representations of meshes, and use GCNs (Kipf & Welling, 2016) to effectively process them. Our work most closely relates to Neural 3D Mesh Renderer (Kato et al., 2017) and Pixel2Mesh (Wang et al., 2018), which use deformations of a generic pre-defined input mesh, generally a sphere, to form 3D structures. Similarly, AtlasNet (Groueix et al., 2018a) uses deformations over a set of primitive square faces to form 3D shapes. Conceptually similar, there exist numerous papers using class-specific input meshes which are deformed with respect to the given input image (Pontes et al., 2017; Kanazawa et al., 2018; Jack et al., 2018; Henderson & Ferrari, 2018; Groueix et al., 2018b; Kar et al., 2015). While effective, these approaches require prior knowledge of the target class or access to a model repository.

Graph Convolutional Networks. The great success of convolutional neural networks in numerous image-based tasks (He et al., 2016; 2017; Huang et al., 2017; Jégou et al., 2017; Casanova et al., 2018) has led to increasing efforts to extend deep networks to domains where graph-structured data is ubiquitous.

¹ https://github.com/EdwardSmith1884/GEOMetrics

Early attempts to extend neural networks to deal with arbitrarily structured graphs relied on recursive neural networks (Frasconi et al., 1998; Gori et al., 2005; Scarselli et al., 2009). Recently, spectral approaches have emerged as an effective alternative, formulating the convolution as an operation on the spectrum of the graph (Henaff et al., 2015; Bruna et al., 2014; Bronstein et al., 2017; Levie et al., 2017). Methods operating directly on the graph domain have also been presented. Defferrard et al. (2016) proposed to approximate the filters using Chebyshev polynomials applied to the Laplacian operator. This approximation was further simplified by Kipf & Welling (2016). Finally, several works have explored well-established deep learning ideas, improving previously reported results (Duvenaud et al., 2015; Hamilton et al., 2017; Monti et al., 2017; Veličković et al., 2018).

3D Object Representation. Deep learning approaches for understanding 3D shapes have, for a long time, employed voxels as a default 3D object representation (Choy et al., 2016; Wu et al., 2016; Smith & Meger, 2017; Tulsiani et al., 2017; Wu et al., 2017; 2018). While straightforward to use, voxels induce a cubic computational cost, scaling poorly to higher resolutions and complex objects. Numerous computationally efficient approaches have arisen, such as octree methods (Riegler et al., 2017; Tatarchenko et al., 2017; Häne et al., 2017), which represent voxel objects with adaptive degrees of detail. Most similar to mesh models are point cloud methods, which represent 3D objects through a set of points in 3D space (Fan et al., 2017; Qi et al., 2017; Insafutdinov & Dosovitskiy, 2018; Novotny et al., 2017). Point clouds represent only the surface information of 3D objects, making them more efficient and scalable. However, as they do not define surface information beyond each point's local neighborhood, they must be uniformly sampled over a surface, and so, to encode high levels of detail, the sampling density over the entire surface must increase.

3. Background

In this section, we review GCNs (Kipf & Welling, 2016), a key component of mesh generation and reconstruction systems, and outline how the Chamfer Distance has been previously employed as a loss for mesh reconstruction.

3.1. Graph Convolutional Networks

Let G be a graph with N vertices defined by an adjacency matrix A ∈ R^{N×N} and a vertex feature matrix H ∈ R^{N×F}, where F denotes the number of features. A GCN layer takes as input both A and H, and produces a new vertex feature matrix H′ ∈ R^{N×F′} with F′ features as follows:

$$ H' = \sigma(AHW + b), \tag{1} $$

where σ is an arbitrary activation function, and W ∈ R^{F×F′} and b ∈ R^{F′} are the learnable weight matrix and bias vector, respectively. By stacking multiple such layers, information is exchanged throughout the graph such that, after k layers, the information at a given vertex will, for the first time, reach its k-th depth neighbor. Alternatively, k-th depth information can be reached immediately by using the k-th power of the adjacency matrix in Eq. 1 (Defferrard et al., 2016; Levie et al., 2017; Cucurull et al., 2018). Note that, as discussed in Section 1, GCN layers equate any given vertex to a summary of its neighbourhood, and their repeated application may over-smooth important local information (Li et al., 2018).
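For concreteness, here is a minimal PyTorch sketch of the GCN layer in Eq. 1, assuming a dense adjacency matrix; the class and variable names are illustrative, not from any released code.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Minimal sketch of the GCN layer in Eq. 1 for a dense adjacency
    matrix A (N x N) and vertex features H (N x F)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(in_features, out_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, A, H):
        # Neighbor aggregation (AH), shared linear map (W), bias, nonlinearity.
        return torch.relu(A @ H @ self.weight + self.bias)
```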

3.2. Chamfer Loss: Vertex-To-Point Loss

The Chamfer Distance between predicted and ground truth objects has become a standard metric for 3D reconstruction (Wang et al., 2018; Insafutdinov & Dosovitskiy, 2018; Groueix et al., 2018a; Sun et al., 2018; Fan et al., 2017). This loss is defined as:

$$ \mathcal{L}_{Chamfer} = \sum_{p \in S} \min_{q \in \hat{S}} \|p - q\|_2^2 + \sum_{q \in \hat{S}} \min_{p \in S} \|p - q\|_2^2 \tag{2} $$

and is computed between two sets of points, S and Ŝ, sampled from the predicted surface and the ground truth surface. As discussed in Section 1, this metric performs poorly when directly applied to two sets of mesh vertices, as it does not take into account the faces which they define, and because of the difficult learning problem associated with matching highly arbitrary vertex placements on a surface. To avoid these issues, Wang et al. (2018) define S as a large set of predicted vertex positions, and Ŝ as a large, pre-computed set of points sampled from the ground truth surface (see Figure 3a). Defining the ground truth set as a dense uniform sampling over the target surface avoids issues with inconsistent vertex positions across similar objects. However, this leads to predictions with a high number of vertices packed tightly over the full surface. In this way, the mesh predictions resemble a point cloud, and thus fail to take advantage of the graceful scaling properties of their representation.
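A compact PyTorch sketch of this symmetric distance over two point sets (function name is ours):

```python
import torch

def chamfer_distance(S, S_hat):
    """Symmetric Chamfer distance of Eq. 2 between two point sets of
    shapes (N, 3) and (M, 3); returns a scalar tensor."""
    d = torch.cdist(S, S_hat).pow(2)  # (N, M) pairwise squared distances
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()
```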

4. GEOMetrics Mesh Reconstruction

In this section, we describe our pipeline for reconstructing adaptive meshes from single images and outline our proposed 0N-GCN as well as our suggested adaptive face splitting.

4.1. Mesh Reconstruction from Images Pipeline

Figure 2. Mesh reconstruction module, with its three main components highlighted. Feature Extraction describes the process through which image features are extracted for each vertex. Mesh Deformation outlines the deformation of the input mesh through 0N-GCN layers. Adaptive Face Splitting illustrates how high-curvature faces are split to increase local complexity.

Figure 2 depicts our mesh reconstruction module, which takes as input a mesh model, defined by a set of vertex positions and an adjacency matrix, together with an RGB image depicting an object view, and outputs a new mesh prediction. The module is composed of three distinct phases: feature extraction, mesh deformation and face splitting, which are cascaded m = 3 times to obtain incrementally refined mesh predictions. Note that the initial module takes as input a predefined mesh model (e.g. a sphere), whereas each subsequent module is fed the preceding module's prediction. In this manner, the initial mesh is iteratively deformed and updated to match the input image.

Our feature extraction is based on the method proposed by Wang et al. (2018), where the input image is passed through a deep CNN and the features from 4 intermediary layers are output. The feature vector for each vertex of the mesh is then defined by projecting the vertices of the input mesh onto the CNN outputs and extracting their corresponding features. In addition, each vertex feature vector is provided its 3D coordinates x, y, z and also, if available, the final feature vector it possessed in the preceding reconstruction module. Our mesh deformation consists of a graph convolutional model, which takes as input a mesh and deforms it by making a residual prediction for the position of each vertex. The residual prediction is then added to the original position to complete the deformation. The graph convolutional model of this mesh deformation phase is made up of a series of the proposed 0N-GCN layers (see Subsection 4.2 for details). Finally, our face splitting phase, described in Subsection 4.3, encourages local complexity to emerge in regions that require additional detail.

4.2. Zero-Neighbor Graph Convolutional Networks

As described in Subsection 3.1, a potential shortcoming of the standard GCN formulation is that a vertex has no capacity to maintain and directly draw conclusions upon its own information, as this information is smoothed with outside influence at each layer. This outside influence, while useful in global graph understanding contexts, may be detrimental when vital information held at each vertex cannot be derived from its neighbors. This situation is exemplified by meshes, where, if optimally defined, every vertex defines some new surface structure (see Figure 1d). To rectify this problem, we define a Zero-Neighbor update, in which a fraction of a vertex's feature vector is not updated with the neighbors' information. This is accomplished by, instead of applying higher powers of the adjacency matrix to reach further depths, taking the adjacency matrix to the power 0 (equivalent to the identity matrix) to exchange with no further depths:

$$ H' = HW, \qquad H'' = \sigma\left(\left[\, A H'_{0:i} \,\|\, A^0 H'_{i:} \,\right] + b\right), \tag{3} $$

where [· ‖ ·] denotes concatenation along the feature dimension and i is a feature index. This 0N-GCN provides a soft middle ground between full exchange of information and no vertex communication, where the network can choose how heavily a portion of the features of a given vertex will be influenced by the rest of the graph.
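A sketch of this update in PyTorch, where `split` plays the role of the feature index i; as a slight simplification of Eq. 3, the bias is folded into the linear map rather than added after concatenation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroNeighborGCN(nn.Module):
    """Minimal sketch of the 0N-GCN update in Eq. 3. The first `split`
    output features are aggregated over neighbors via A; the remaining
    features use A^0 = I, i.e. they keep only the vertex's own
    information. Names are illustrative."""

    def __init__(self, in_features, out_features, split):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.split = split  # the feature index i in Eq. 3

    def forward(self, A, H):
        Hp = self.linear(H)                # H' = HW (+ b, folded in)
        mixed = A @ Hp[:, :self.split]     # neighbor-aggregated block
        kept = Hp[:, self.split:]          # zero-neighbor block
        return F.elu(torch.cat([mixed, kept], dim=1))
```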

4.3. Adaptive Face Splitting

In the final step of each reconstruction module, the mesh's set of vertices is redefined over its surface by adding vertices in regions of high detail. To do so, we introduce a face splitting method, which adaptively increases the set of vertices and the connections between them by analyzing the local curvature of the surface at each mesh face. The curvature at each face is computed by taking the average of the angles between a face's normal and its neighboring faces' normals. For a given face f, made up of vertices v1, v2, v3, its face normal N_f is calculated as:

$$ N_f = \frac{e_1 \times e_2}{\|e_1 \times e_2\|}, \tag{4} $$

where e1 = v1 − v2 and e2 = v3 − v2. The curvature C_f at face f is then computed as:

$$ C_f = \frac{180}{|\mathcal{H}_f|\,\pi} \sum_{i \in \mathcal{H}_f} \arccos(N_f \cdot N_i), \tag{5} $$

where H_f is the set of neighboring faces of f.

(a) Vertex-to-point  (b) Point-to-point  (c) Point-to-surface

Figure 3. A comparison of different surface losses. Vertex-to-point is the technique used by Wang et al. (2018). Point-to-point sampling and point-to-surface sampling are the sampling procedures introduced in our approach.

All faces with curvature over a given threshold, α, are then selected to be updated. A selected face is updated by adding a new vertex at its center and connecting it to its 3 original vertices, creating 3 new faces. As the new vertex positions are defined by the positions of already existing vertices, the gradients from each vertex are easily defined to flow back through all previous modules. In this way, each reconstruction module is able to identify areas of the current mesh which require increased detail and prescribe them a higher vertex density. This allows the mesh to fully take advantage of the scaling properties of its representation, by concentrating the generation process in areas of high detail.
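Eqs. 4 and 5 and the splitting rule translate directly into tensor operations. The following PyTorch sketch assumes a precomputed (F, 3) table of adjacent-face indices and uses our own function names; it is an illustration, not the paper's implementation.

```python
import torch

def face_normals(verts, faces):
    """Unit face normals of Eq. 4. verts: (V, 3) float, faces: (F, 3) long."""
    v1, v2, v3 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    n = torch.cross(v1 - v2, v3 - v2, dim=1)
    return n / n.norm(dim=1, keepdim=True).clamp(min=1e-8)

def face_curvature(normals, neighbors):
    """Mean angle in degrees (Eq. 5) between each face normal and those of
    its adjacent faces; `neighbors` is an (F, 3) index tensor."""
    cos = (normals.unsqueeze(1) * normals[neighbors]).sum(-1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean(dim=1)

def split_faces(verts, faces, curvature, alpha=70.0):
    """Add a centroid vertex to every face whose curvature exceeds alpha,
    replacing it with 3 new faces. Centroids are averages of existing
    vertices, so gradients flow back through them."""
    sel = curvature > alpha
    centers = verts[faces[sel]].mean(dim=1)                        # (K, 3)
    new_ids = torch.arange(len(verts), len(verts) + len(centers))
    a, b, c = faces[sel].unbind(dim=1)
    new_faces = torch.cat([torch.stack([a, b, new_ids], dim=1),
                           torch.stack([b, c, new_ids], dim=1),
                           torch.stack([c, a, new_ids], dim=1)])
    return torch.cat([verts, centers]), torch.cat([faces[~sel], new_faces])
```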

5. GEOMetrics Losses

In this section, we describe the key contributions made to the mesh prediction task when considering 3D geometry. In particular, we introduce a training objective considering the local topology and the global structure to produce mesh predictions that properly benefit from the graph representation.

5.1. Differentiable Surface Sampling Losses

We introduce a differentiable sampling procedure which enables us to penalize vertices by the surface they implicitly define, rather than their explicit positions. This approach allows predicted meshes to match the target surface without emulating the target vertex positions, which are entirely arbitrary when randomly sampled from the ground truth, while also optimally positioning their vertices and faces.

To do so, we define a discrete probability distribution based on the relative area of each face and sample n times from this distribution to determine the number of points to sample per face. Then, we sample the previously chosen number of points uniformly over each corresponding surface. More precisely, given a triangular face defined by vertices v1, v2, v3, following Osada et al. (2002), a point r can be sampled uniformly from the surface of the triangle as:

$$ r = (1 - \sqrt{u})\, v_1 + \sqrt{u}\,(1 - w)\, v_2 + \sqrt{u}\, w\, v_3, \tag{6} $$

where u, w ∼ U(0, 1).

Figure 4. Mesh-to-voxel mapping. An encoder-decoder architecture is trained to map ground truth meshes to their corresponding voxelizations. The latent representations produced by the encoder from predicted and ground truth meshes are then compared in our global reconstruction loss.

This formulation allows us to differentiate through the random selection via the reparametrization trick (Kingma & Welling, 2013; Rezende et al., 2014), as the sampling procedure is defined by a deterministic transformation on the vertex coordinates and the independent stochastic terms.

We apply this sampling procedure to both the predicted mesh and the ground truth, and define an alternative Chamfer loss which operates over the previously sampled points rather than the predicted vertices (see Figure 3b):

$$ \mathcal{L}_{PtP} = \sum_{p \in S} \min_{q \in \hat{S}} \|p - q\|_2^2 + \sum_{q \in \hat{S}} \min_{p \in S} \|p - q\|_2^2, \tag{7} $$

where S and Ŝ are the sampled points of the predicted mesh and the ground truth, respectively. An algorithmic description of the entire process is provided in the supplementary material. Note that this loss differs from the vertex-to-point loss in that it properly penalizes the surface of the predicted mesh instead of the predicted vertex positions.
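A minimal PyTorch sketch of this sampling procedure (function name is ours). Only the face selection itself is non-differentiable; the returned points are differentiable functions of the vertices, per the reparametrization argument above.

```python
import torch

def sample_surface(verts, faces, n):
    """Differentiable surface sampling per Eq. 6 (Osada et al., 2002).
    Faces are drawn in proportion to their area; gradients flow to
    `verts` through the deterministic transform of the noise terms."""
    v1, v2, v3 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    areas = 0.5 * torch.cross(v2 - v1, v3 - v1, dim=1).norm(dim=1)
    idx = torch.multinomial(areas / areas.sum(), n, replacement=True)
    ru = torch.rand(n, 1).sqrt()   # sqrt(u) in Eq. 6
    w = torch.rand(n, 1)
    return (1 - ru) * v1[idx] + ru * (1 - w) * v2[idx] + ru * w * v3[idx]

# Eq. 7 then reuses the chamfer_distance sketch from Section 3.2:
# loss_ptp = chamfer_distance(sample_surface(pred_v, pred_f, 3000),
#                             sample_surface(gt_v, gt_f, 3000))
```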

Building on these ideas, we define an improved loss term to more accurately compare the surfaces of two meshes:

$$ \mathcal{L}_{PtS} = \sum_{p \in S} \min_{\hat{f} \in \hat{M}} \mathrm{dist}(p, \hat{f}) + \sum_{q \in \hat{S}} \min_{f \in M} \mathrm{dist}(q, f), \tag{8} $$

where M and M̂ are the predicted and ground truth meshes, f and f̂ the faces, S and Ŝ the sets of points sampled from the surfaces of M and M̂, and dist is a function computing the distance between a point and a triangular face². This loss is shown in Figure 3c, and an algorithmic description can be found in the supplementary material. Note that this loss provides an exact measure of the distance between a point and a mesh surface. This is in contrast to the Chamfer loss in Eq. 2 and the point-to-point loss in Eq. 7, which are faster to compute, but can drastically lose accuracy if too few points are sampled on either surface. A quantitative analysis of the improvement from these losses is provided in the supplementary material through a toy problem.

² Calculated using an optimized adaptation of the Distance Between Point and Triangle in 3D algorithm (Eberly, 1999); details provided in the supplementary material.

Figure 5. Renderings from a single reconstructed chair, demonstrating the variety of different local vertex densities which are produced by our approach.

5.2. Global Encoding of Graphs

In order to consider the global structure of an object during the reconstruction process, we introduce a global mesh loss. This loss relies on features extracted from a pre-trained mesh-to-voxel model, which is designed as an encoder-decoder network. The mesh-to-voxel encoder takes as input a mesh graph and produces a latent embedding, from which the 3D object is reconstructed, through the decoder, in a voxelized format. In this manner, objects with structural similarity in voxel space will have similar latent representations, without requiring similar placements of vertices. The proposed global mesh loss is then defined as

$$ \mathcal{L}_{Latent} = \| E(M) - E(\hat{M}) \|_2^2, \tag{9} $$

where E corresponds to the encoder function of the mesh-to-voxel network. This process is depicted in Figure 4.

The encoder network of the mesh-to-voxel model is built by stacking 0N-GCN layers, followed by a max pooling operation applied to the set of vertices, as in Ciregan et al. (2012), to produce a single fixed-length latent representation. The decoder is a 3D deconvolutional network (Choy et al., 2016), following the network defined in Smith & Meger (2017) to perform image-to-voxel mappings. The complete mesh-to-voxel network is pre-trained by minimizing the mean squared error on the voxelized representations prior to being used in the GEOMetrics system.
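A skeleton of such an encoder and of Eq. 9, reusing the ZeroNeighborGCN sketch from Section 4.2; the layer widths and split points here are illustrative, not the exact configuration given in the supplementary Table 6.

```python
import torch
import torch.nn as nn

class MeshEncoder(nn.Module):
    """Sketch of the mesh-to-voxel encoder: stacked 0N-GCN layers followed
    by a max-pool over the vertex set, yielding a fixed-length latent."""

    def __init__(self, latent_dim=50):
        super().__init__()
        self.layers = nn.ModuleList([ZeroNeighborGCN(3, 60, 30),
                                     ZeroNeighborGCN(60, 120, 60),
                                     ZeroNeighborGCN(120, latent_dim, 25)])

    def forward(self, A, H):
        for layer in self.layers:
            H = layer(A, H)
        return H.max(dim=0).values   # pool over the vertex set

def latent_loss(encoder, A_pred, H_pred, A_gt, H_gt):
    """Eq. 9: squared L2 distance between the two latent codes."""
    return (encoder(A_pred, H_pred) - encoder(A_gt, H_gt)).pow(2).sum()
```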

5.3. Optimization Details

Finally, we present the complete training objective for our mesh reconstruction system. This objective combines our differentiable surface sampling losses and our global structure loss with two regularization techniques defined in Wang et al. (2018): an edge-length-minimizing regularizer L_Edge and a Laplacian-maintaining regularizer L_Δ, pushing the predicted mesh to be smooth and visually appealing.

(a) Input Image (b) GEOMetrics (c) Pixel2Mesh

Figure 6. Visual comparison between GEOMetrics and Pixel2Mesh (Wang et al., 2018) chair reconstructions.

The final loss function of our system is defined as:

$$ \mathcal{L} = \gamma_1 \mathcal{L}_{Latent} + \gamma_2 \mathcal{L}_{PtS} + \gamma_3 \mathcal{L}_{Edge} + \gamma_4 \mathcal{L}_{\Delta}, \tag{10} $$

where the γ_i are hyper-parameters weighting the importance of each term. During the initial stages of training, we approximate the L_PtS term using the L_PtP loss function for faster computation. Note that the loss L is applied to the output of each mesh reconstruction module.
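As a sketch, the weighted combination of Eq. 10 is a one-liner; the regularizer implementations are assumed to follow Wang et al. (2018) and are not shown. The default weights are the paper's reported settings (see Section 6.2).

```python
def total_loss(l_latent, l_pts, l_edge, l_laplacian,
               g1=0.001, g2=1.0, g3=0.3, g4=1.0):
    # Eq. 10: weighted sum of the four loss terms.
    return g1 * l_latent + g2 * l_pts + g3 * l_edge + g4 * l_laplacian
```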

6. Experiments

In this section, we demonstrate our algorithm's ability to reconstruct the surface information of 3D objects from single RGB images by taking advantage of the benefits of the mesh representation. We evaluate on this task across 13 classes of the ShapeNet (Chang et al., 2015) dataset. In addition, we present an ablation study to demonstrate how our algorithm's individual components contribute to its overall performance.

6.1. Dataset

The dataset consists of mesh models, voxel models, and RGB images computed from 13 large classes of CAD models found in the ShapeNet dataset (Chang et al., 2015). Mesh models were computed from the CAD models by removing all texture information and downscaling their size so that each model possesses fewer than 2000 vertices, where possible. From these mesh models, voxelized counterparts were produced at 32³ resolution. From each CAD model, 24 RGB images were produced from random viewpoints, with the camera projection matrix recorded for use in the feature extraction method. The data in each class was then split into training, validation and test sets with a ratio of 70:10:20, respectively. This matches the dataset used for empirical evaluation by Wang et al. (2018) and Smith et al. (2018).

Table 1. Results on ShapeNet 3D object reconstruction, reported as per-class surface-sampling F1 scores and mean F1 score. 3D-R2N2 is from Choy et al. (2016), PSG from Fan et al. (2017), N3MR from Kato et al. (2017), and Pixel2Mesh from Wang et al. (2018).

Category    3D-R2N2  PSG    N3MR   Vertices  Pixel2Mesh  Vertices  Ours   Vertices
Plane       41.46    68.20  62.10  642       71.12       2466      89.00  645.03
Bench       34.09    49.29  35.84  642       57.57       2466      72.11  514.54
Cabinet     49.88    39.93  21.04  642       60.39       2466      59.52  556.68
Car         37.80    50.70  36.66  642       67.86       2466      74.64  509.33
Chair       40.22    41.60  30.25  642       54.38       2466      56.61  619.13
Monitor     34.38    40.53  28.77  642       51.39       2466      59.50  449.65
Lamp        32.35    41.40  27.97  642       48.15       2466      58.65  743.28
Speaker     45.30    32.61  19.46  642       48.84       2466      49.53  550.06
Firearm     28.34    69.96  52.22  642       73.20       2466      88.36  638.35
Couch       40.01    36.59  25.04  642       51.90       2466      59.54  561.79
Table       43.79    53.44  28.40  642       66.30       2466      66.33  732.82
Cellphone   42.31    55.95  27.96  642       70.24       2466      73.65  416.05
Watercraft  37.10    51.28  43.71  642       55.12       2466      68.32  526.04
Mean        39.01    48.58  33.80  642       59.72       2466      67.37  574.06

6.2. Implementation Details

Mesh-to-Voxel Mapping. For each class, we train a mesh-to-voxel mapping from the mesh and voxel ground truths, for use in our latent loss. These mappings are trained with the Adam optimizer (Kingma & Ba, 2014) (β₁ = 0.9, β₂ = 0.999), a learning rate of 10⁻⁴, and a mini-batch size of 16. We train for 10⁵ iterations and practice early stopping, with the best model selected by evaluating on the validation set every 100 iterations.

GEOMetrics. We train the full system on each class in our dataset with the Adam optimizer (Kingma & Ba, 2014) at a learning rate of 10⁻⁴ for 300k iterations, and then again for 150k iterations at a learning rate of 10⁻⁵, with a mini-batch size of 5. We practice early stopping by evaluating on the validation set every 50 iterations. The hyper-parameter settings used, as described in Eq. 10, are γ₁ = 0.001, γ₂ = 1, γ₃ = 0.3, and γ₄ = 1. As mentioned above, L_PtP is employed as a faster approximation to the L_PtS loss, specifically for the first 300k iterations. This is because L_PtS is slow to compute and we found it sufficient to only use it to finetune pre-trained models. All hyper-parameters were initially tuned on the validation set of the chair class. The generic pre-defined mesh fed to the first reconstruction module is an ellipsoid. A face is split at the end of each module only if the curvature at that face is greater than 70°. Architecture details for all networks are provided in the supplementary material.

6.3. Single Image Reconstruction

We evaluate our method's performance quantitatively by comparing its ability to reconstruct mesh surfaces from single RGB images against an array of high-performing 3D object reconstruction algorithms. To do this, we sample points from both the surface of the predicted object and the ground-truth object and compute the F1 score. Following Wang et al. (2018), precision and recall are calculated using the percentage of sampled points which exist within a threshold of 0.0001 of any sampled point on the compared surface. State-of-the-art results of the mesh approaches N3MR (Kato et al., 2017) and Pixel2Mesh (Wang et al., 2018), a point cloud method, PSG (Fan et al., 2017), and a voxel baseline, 3D-R2N2 (Choy et al., 2016), are reported from Wang et al. (2018). We also compare mesh-based approaches in terms of space requirements (number of vertices). The results of this comparison are summarized in Table 1. As shown, our GEOMetrics system boasts far higher performance than previous approaches, with an average increase in F1 score of 7.65 points across all classes, and improved scores in all classes but one, where we experience a negligible drop of 0.87 points. In addition, in all cases, our system requires notably fewer vertices than the previous mesh-based state of the art, Pixel2Mesh; e.g. cellphone objects require 6× fewer vertices, whereas lamp objects require 3× fewer vertices. With an average of 574.06 vertices used across all classes, the vertex requirements drop as much as 4.3× on average, highlighting the potential of the adaptive face splitting. Moreover, when compared to the point cloud and voxel baselines, we also exhibit state-of-the-art results.

Qualitative reconstruction results for each of the 13 classes are displayed in Figure 7³. We obtain highly accurate reconstructions of the input object, effectively capturing both global structure and local detail. In addition, we render an un-smoothed chair reconstruction in Figure 5 with its edges heavily outlined, demonstrating the diverse vertex density obtained across a single object and highlighting the way our system represents simple surfaces with a small number of faces, and shifts to higher density where required. Lastly, Figure 6 depicts a visual comparison between GEOMetrics and Pixel2Mesh reconstructions, where we can observe that GEOMetrics is able to provide reconstructions with higher detail (e.g. the sharpness of the chair legs).

³ See supplementary material for additional visualizations.

Figure 7. Qualitative results: renderings of meshes reconstructed from each ShapeNet object class.

6.4. Single Image Reconstruction Ablation Study

In this subsection, we study the influence of our system's components and demonstrate their individual importance by comparing our full method's results on the chair class to ablated versions of the system. We assess the impact of our 0N-GCN layers by replacing them with standard GCNs (Kipf & Welling, 2016) in both the mesh reconstruction and the mesh-to-voxel models. We validate the effectiveness of the proposed adaptive face splitting by substituting it with a procedure in which all faces are split at the end of each reconstruction module, keeping uniform detail and maintaining approximately the same number of vertices as the full approach. We then check the importance of each of the newly introduced losses by removing them at training time. Note that, when the face sampling losses L_PtP and L_PtS are removed, we replace them with the vertex-to-point loss (L_VtP) proposed by Wang et al. (2018). Finally, we compare our method to Pixel2Mesh when roughly equivalent in terms of number of vertices.

The results of this ablation study are reported in Table 2. As shown in the table, the biggest effect comes from the introduction of the adaptive face splitting, with a drop of 6.23 points when replacing it with a uniform splitting heuristic. Moreover, helping the model give more importance to vertex features through the 0N-GCN also appears to be relevant. The losses proposed to train the whole system also play an important role, as ignoring them leads to decreases in performance of 3.69 points and 1.02 points, respectively. Moreover, training the Pixel2Mesh baseline to use as few vertices as GEOMetrics leads to notably worse performance. These results empirically justify the contributions of our GEOMetrics system.

Table 2. GEOMetrics ablations compared to the full method (Ours) and Pixel2Mesh. Results reported as mean F1 score on the chair class.

Ours   GCN    Unif. Split.  No L_Latent  L_VtP  Pixel2Mesh
56.61  54.57  50.33         55.59        52.92  38.13

7. Conclusion

In this paper, we presented GEOMetrics, a novel approach for adaptive mesh reconstruction which focuses on exploiting the geometry of the mesh representation. The GEOMetrics system reformulates GCNs to explicitly preserve local vertex information and incorporates an adaptive face splitting procedure to enhance local complexity when necessary. Furthermore, the system is trained with a training objective which operates both locally and globally at the mesh level, and capitalizes on the geometric structure of graph-encoded objects. We demonstrated the potential of the approach through extensive evaluation on the challenging task of 3D object reconstruction from single images of the ShapeNet dataset. Finally, we reported visually appealing state-of-the-art results, outperforming existing mesh-based methods by a large margin, while requiring (on average) as many as 4.3× fewer vertices. Future research directions include addressing the restrictive constant topology prescribed by the initial mesh object through reconstruction and generation methods which adapt the topology to match the target mesh.

References

Barrow, H., Tenenbaum, J., Bolles, R., and Wolf, H. Parametric correspondence and chamfer matching: two new techniques for image matching. In Proceedings of the 5th International Joint Conference on Artificial Intelligence - Volume 2, pp. 659-663. Morgan Kaufmann Publishers Inc., 1977.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34:18-42, 2017.

Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations, 2014.

Casanova, A., Cucurull, G., Drozdzal, M., Romero, A., and Bengio, Y. On the iterative refinement of densely connected representation levels for semantic segmentation. In Workshop on Autonomous Driving, CVPR, 2018.

Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

Choy, C. B., Xu, D., Gwak, J., Chen, K., and Savarese, S. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 628-644. Springer, 2016.

Ciregan, D., Meier, U., and Schmidhuber, J. Multi-column deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

Cucurull, G., Wagstyl, K., Casanova, A., Veličković, P., Jakobsen, E., Drozdzal, M., Romero, A., Evans, A., and Bengio, Y. Convolutional neural networks for mesh-based parcellation of the cerebral cortex. In Proceedings of the International Conference on Medical Imaging with Deep Learning, 2018.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 3844-3852. Curran Associates Inc., 2016.

Duvenaud, D., Maclaurin, D., Aguilera-Iparraguirre, J., Gómez-Bombarelli, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pp. 2224-2232. MIT Press, 2015.

Eberly, D. Distance between point and triangle in 3D. Geometric Tools, 1999.

Fan, H., Su, H., and Guibas, L. A point set generation network for 3D object reconstruction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Frasconi, P., Gori, M., and Sperduti, A. A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5):768-786, 1998.

Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414-2423, 2016.

Gori, M., Monfardini, G., and Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pp. 729-734, 2005.

Groueix, T., Fisher, M., Kim, V., Russell, B., and Aubry, M. AtlasNet: A papier-mâché approach to learning 3D surface generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018a.

Groueix, T., Fisher, M., Kim, V. G., Russell, B. C., and Aubry, M. 3D-CODED: 3D correspondences by deep deformation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 230-246, 2018b.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024-1034, 2017.

Häne, C., Tulsiani, S., and Malik, J. Hierarchical surface prediction for 3D object reconstruction. arXiv preprint arXiv:1704.00710, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 630-645. Springer, 2016.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pp. 2980-2988. IEEE, 2017.

Henaff, M., Bruna, J., and LeCun, Y. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

Henderson, P. and Ferrari, V. Learning to generate and reconstruct 3D meshes with only 2D supervision. arXiv preprint arXiv:1807.09259, 2018.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261-2269. IEEE, 2017.

Insafutdinov, E. and Dosovitskiy, A. Unsupervised learning of shape and pose with differentiable point clouds. arXiv preprint arXiv:1810.09381, 2018.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456, 2015.

Jack, D., Pontes, J. K., Sridharan, S., Fookes, C., Shirazi, S., Maire, F., and Eriksson, A. Learning free-form deformations for 3D object reconstruction. arXiv preprint arXiv:1803.10932, 2018.

Jégou, S., Drozdzal, M., Vazquez, D., Romero, A., and Bengio, Y. The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1175-1183. IEEE, 2017.

Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 694-711. Springer, 2016.

Kanazawa, A., Tulsiani, S., Efros, A. A., and Malik, J. Learning category-specific mesh reconstruction from image collections. arXiv preprint arXiv:1803.07549, 2018.

Kar, A., Tulsiani, S., Carreira, J., and Malik, J. Category-specific object reconstruction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1966-1974, 2015.

Kato, H., Ushiku, Y., and Harada, T. Neural 3D mesh renderer. arXiv preprint arXiv:1711.07566, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Levie, R., Monti, F., Bresson, X., and Bronstein, M. M. CayleyNets: Graph convolutional neural networks with complex rational spectral filters. arXiv preprint arXiv:1705.07664, 2017.

Li, Q., Han, Z., and Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.

Monti, F., Boscaini, D., Masci, J., Rodolà, E., Svoboda, J., and Bronstein, M. M. Geometric deep learning on graphs and manifolds using mixture model CNNs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5425-5434, 2017.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807-814, 2010.

Novotny, D., Larlus, D., and Vedaldi, A. Learning 3D object categories by looking around them. In IEEE International Conference on Computer Vision (ICCV), pp. 5228-5237. IEEE, 2017.

Osada, R., Funkhouser, T., Chazelle, B., and Dobkin, D. Shape distributions. ACM Transactions on Graphics (TOG), 21(4):807-832, 2002.

Pontes, J. K., Kong, C., Sridharan, S., Lucey, S., Eriksson, A., and Fookes, C. Image2Mesh: A learning framework for single image 3D reconstruction. arXiv preprint arXiv:1711.10669, 2017.

Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Riegler, G., Ulusoy, A. O., and Geiger, A. OctNet: Learning deep 3D representations at high resolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6620-6629, 2017.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2009.

Shelhamer, E., Long, J., and Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640-651, 2017.

Smith, E., Fujimoto, S., and Meger, D. Multi-view silhouette and depth decomposition for high resolution 3D object representation. In Advances in Neural Information Processing Systems, pp. 6479-6489, 2018.

Smith, E. J. and Meger, D. Improved adversarial systems for 3D object generation and reconstruction. In Conference on Robot Learning, pp. 87-96, 2017.

Sun, X., Wu, J., Zhang, X., Zhang, Z., Zhang, C., Xue, T., Tenenbaum, J. B., and Freeman, W. T. Pix3D: Dataset and methods for single-image 3D shape modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Tatarchenko, M., Dosovitskiy, A., and Brox, T. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2088-2096, 2017.

Tulsiani, S., Zhou, T., Efros, A. A., and Malik, J. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 209-217. IEEE, 2017.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018.

Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., and Jiang, Y.-G. Pixel2Mesh: Generating 3D mesh models from single RGB images. arXiv preprint arXiv:1804.01654, 2018.

Wu, J., Zhang, C., Xue, T., Freeman, W. T., and Tenenbaum, J. B. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pp. 82-90, 2016.

Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, W. T., and Tenenbaum, J. B. MarrNet: 3D shape reconstruction via 2.5D sketches. In Advances in Neural Information Processing Systems, 2017.

Wu, J., Zhang, C., Zhang, X., Zhang, Z., Freeman, W. T., and Tenenbaum, J. B. Learning shape priors for single-view 3D completion and reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 673-691. Springer, 2018.

Zhang, X., Zhang, Z., Zhang, C., Tenenbaum, J., Freeman, B., and Wu, J. Learning to reconstruct shapes from unseen classes. In Advances in Neural Information Processing Systems 31, pp. 2263-2274, 2018.


GEOMetrics: Supplemental Material

A. Point to Surface Loss

In this section, we describe the Distance Between Point and Triangle in 3D algorithm (Eberly, 1999). For a given point P and triangle T, the algorithm computes the minimum distance between the point and any point contained within the triangle. Assuming the triangle is defined by a corner point B and directions E0 and E1, any point T(s, t) contained in the triangle can be written as T(s, t) = B + sE0 + tE1, where (s, t) ∈ D = {(s, t) : s ≥ 0, t ≥ 0, s + t ≤ 1}. We can now define the squared distance Q between the point P and any point in the triangle T(s, t) by the following quadratic function:

$$ Q(s, t) = as^2 + 2bst + ct^2 + 2ds + 2et + f, \tag{11} $$

where for clarity we denote a = E0·E0, b = E0·E1, c = E1·E1, d = E0·(B − P), e = E1·(B − P), and f = (B − P)·(B − P). Selecting the (s, t) which minimizes Q(s, t) provides the minimum distance between the point P and the triangle T. As Q is continuously differentiable, the minimizer can be found either at an interior point where ∇Q = 0 or on the boundary of the set D.

In the first case, note that ∇Q(s′, t′) = 0 if and only if s′ and t′ satisfy:

$$ s' = \frac{be - cd}{ac - b^2}, \qquad t' = \frac{bd - ae}{ac - b^2}. \tag{12} $$

If (s′, t′) ∈ D, we have the minimum distance. Otherwise, the distance-minimizing (s, t) must lie on the boundary of D, where either {s = 0, t ∈ [0, 1]}, {s ∈ [0, 1], t = 0}, or {s ∈ [0, 1], t = 1 − s}. In each case, Q(s, t) reduces to a quadratic in one variable, which can be minimized by setting its gradient to 0.
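A minimal Python sketch of this procedure for a single point; the function names are ours, and the sketch assumes a non-degenerate triangle. It tries the interior stationary point of Eq. 12 first, then handles the three boundary edges as clamped 1-D quadratics.

```python
import torch

def clamp01(x):
    return max(0.0, min(1.0, x))

def point_triangle_sq_dist(P, B, E0, E1):
    """Squared distance from point P to the triangle B + s*E0 + t*E1,
    (s, t) in the simplex D. Inputs are 3-vector tensors; returns a float.
    Assumes E0, E1 are non-degenerate (a, c > 0)."""
    D = B - P
    a, b, c = float(E0 @ E0), float(E0 @ E1), float(E1 @ E1)
    d, e, f = float(E0 @ D), float(E1 @ D), float(D @ D)

    def Q(s, t):  # Eq. 11
        return a*s*s + 2*b*s*t + c*t*t + 2*d*s + 2*e*t + f

    det = a*c - b*b
    if det > 0:
        s, t = (b*e - c*d) / det, (b*d - a*e) / det
        if s >= 0 and t >= 0 and s + t <= 1:
            return Q(s, t)           # interior minimum (Eq. 12)
    # Edge t = 0: Q(s, 0) is quadratic in s, minimized at s = -d/a.
    # Edge s = 0: Q(0, t) is quadratic in t, minimized at t = -e/c.
    # Edge s + t = 1: substitute t = 1 - s and minimize the quadratic in s.
    candidates = [Q(clamp01(-d / a), 0.0), Q(0.0, clamp01(-e / c))]
    denom = a - 2*b + c
    s = clamp01((c + e - b - d) / denom) if denom > 0 else 0.0
    candidates.append(Q(s, 1.0 - s))
    return min(candidates)
```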

B. Mesh-to-Voxel Mapping Ablation

In this section, we perform an ablation study over the use of the 0N-GCN as the building block for our mesh-to-voxel mapping network to highlight its impact with respect to standard GCN layers. To that end, we compare our model on 3 different object classes to an analogous network composed of standard GCN layers with the same number of parameters. Additionally, we assess the influence of pooling across a set of vertices by comparing it to other forms of aggregation, such as the one introduced by the Neural Graph Fingerprint (NGF) model (Duvenaud et al., 2015). The results of this ablation study can be found in Table 3 in terms of mean squared error (MSE). As shown in the table, the results demonstrate the benefits of the 0N-GCN layers, as well as the max-pooling vertex set aggregation, for this mesh understanding task.

Table 3. Mesh-to-voxel mapping reconstruction MSE scores.

Category  Ours    GCN     NGF
Plane     0.0089  0.0104  0.0108
Table     0.0310  0.0393  0.0360
Chair     0.0412  0.0526  0.0486
Mean      0.0270  0.0341  0.0318

C. Differentiable Surface Loss Algorithms

This section provides the algorithmic details of both the point-to-point loss (Algorithm 1) and the point-to-surface loss (Algorithm 2).

Algorithm 1 Point-to-Point Loss
 1: Input: two mesh surfaces M and M̂, and a number of points n
 2: for face f in mesh M do
 3:   A_f = Area(f)
 4:   A_T += Area(f)
 5: end for
 6: Define F s.t. P(F = f) = A_f / A_T
 7: Define U = Uniform(0, 1)
 8: Define S = []
 9: for i = 1 to n do
10:   f ~ F
11:   v1, v2, v3 = vertices(f)
12:   u ~ U, w ~ U
13:   r = (1 − √u) v1 + √u (1 − w) v2 + √u w v3
14:   S.append(r)
15: end for
16: Apply lines 2 to 15 to mesh M̂ to produce Ŝ
17: L_PtP = Σ_{p∈S} min_{q∈Ŝ} ‖p − q‖₂² + Σ_{q∈Ŝ} min_{p∈S} ‖p − q‖₂²

D. Loss Analysis

In this section, we present further analysis of the GEOMetrics losses to emphasize the benefits of the introduced point-to-point and point-to-surface losses over the vertex-to-point loss. To that end, we design a toy problem, which consists of optimizing the placement of the vertices of an initial square surface to match the surface area of a target triangle in 2D.

Algorithm 2 Point-to-Surface Loss
 1: Input: two mesh surfaces M and M̂, and a number of points n
 2: for face f in mesh M do
 3:   A_f = Area(f)
 4:   A_T += Area(f)
 5: end for
 6: Define F s.t. P(F = f) = A_f / A_T
 7: Define U = Uniform(0, 1)
 8: Define S = []
 9: for i = 1 to n do
10:   f ~ F
11:   v1, v2, v3 = vertices(f)
12:   u ~ U, w ~ U
13:   r = (1 − √u) v1 + √u (1 − w) v2 + √u w v3
14:   S.append(r)
15: end for
16: Apply lines 2 to 15 to mesh M̂ to produce Ŝ
17: L_PtS = Σ_{p∈S} min_{f̂∈M̂} dist(p, f̂) + Σ_{q∈Ŝ} min_{f∈M} dist(q, f)

Figure 9 (top) depicts the above-mentioned initial and target surfaces. We optimize the placement of the vertices of the initial square by performing gradient descent on each of the losses independently, and calculate the intersection over union (IoU) of the predicted object and the target triangle. Moreover, in order to assess the impact of the number of points sampled, we repeat this experiment 100 times, increasing the number of sampled points from 1 to 100. Figure 8 shows the results of this experiment. Firstly, we observe that the vertex-to-point loss fails to match the target surface entirely, no matter the number of sampled points. Secondly, we observe that the point-to-point loss performance is notably affected by the number of sampled points. While it exhibits poor performance for a low number of sampled points (e.g. below 20), it rapidly improves as the number of sampled points increases and, ultimately, converges to an average performance only slightly lower than that of the point-to-surface loss. Finally, the point-to-surface loss begins with a far higher IoU and remains the stronger option for nearly all numbers of sampled points.

Figure 9 illustrates qualitative results for the three compared losses when optimizing with 50 sampled points. As can be seen, the point-to-surface deformation of the square best matches the target triangle shape, followed by point-to-point, which somewhat emulates the triangle, and vertex-to-point, which exhibits the poorest results.

Figure 8. Comparison of the vertex-to-point, point-to-point and point-to-surface losses, in terms of IoU, on a toy problem: optimizing the placement of the vertices of a square to match those of a target triangle. Results are compared while increasing the number of points sampled on the surfaces they optimize. Vertex-to-point is the loss employed by Wang et al. (2018). Point-to-point and point-to-surface are the losses introduced in our paper.

Figure 9. Qualitative comparison of the vertex-to-point (Wang et al., 2018), point-to-point and point-to-surface losses on the square-to-triangle problem when using 50 sampled points. We highlight the correspondence between the points in the initial surface and the target surface which are chosen to be compared, and show the result of optimizing the placement of the vertices under each of the losses.

E. Network Architectures

In this section, we provide details on the architectures of the networks used in the paper. Table 4 describes the feature extractor network of the mesh reconstruction module. Similarly, Table 5 specifies the mesh deformation network of the reconstruction module. Finally, Tables 6 and 7 detail the mesh-to-voxel encoder and decoder architectures, respectively.

F. Single Image Reconstruction Visualizations

Figures 10 and 11 depict additional reconstruction results from each ShapeNet object class, with three objects shown per class.


Layers             1-2      3-8      9        10-11    12     13-14  15     16-18
Output Resolution  224×224  224×224  112×112  112×112  56×56  56×56  28×28  28×28
# Channels         16       32       128      128      256    256    512    512
Kernel Size        3×3      3×3      3×3      3×3      3×3    3×3    3×3    3×3
Stride             1        1        2        1        2      1      2      1
Extracted Layer    -        8        -        11       -      14     -      18

Table 4. Feature extraction network: details of the convolutional neural network architecture used to extract image features. Each layer performs a 2D convolution, followed by batch normalization (Ioffe & Szegedy, 2015) and a ReLU activation function (Nair & Hinton, 2010). The last row indicates which layers' features are extracted for use in the mesh reconstruction module.

Layers                        1    2-14  15
Input Feature Vector Length   3    192   192
Output Feature Vector Length  192  192   3

Table 5. Mesh deformation network: details of the graph convolutional network architecture used to compute the mesh deformation in each reconstruction module. Each layer is composed of a 0N-GCN followed by an ELU activation function (Clevert et al., 2015).

Layers                    1   2-4  5    6-7  8    9    10   11   12   13-16  17
Input Feature Dimension   3   60   60   120  120  150  200  210  250  300    300
Output Feature Dimension  60  60   120  120  150  200  210  250  300  300    50

Table 6. Mesh-to-voxel encoder: details of the graph convolutional network architecture used to encode mesh graphs into latent vectors. Each layer is composed of a 0N-GCN followed by an ELU activation function (Clevert et al., 2015), except for the final layer, which is a max pooling aggregation over the set of vertices.

Layers             1       2       3         4         5
Output Resolution  4×4×4   8×8×8   16×16×16  32×32×32  32×32×32
# Channels         64      64      32        8         1
Stride             2       2       2         2         1
Type               DeConv  DeConv  DeConv    DeConv    Conv

Table 7. Mesh-to-voxel decoder: details of the 3D convolutional neural network architecture used to decode latent vectors into voxel grids. Each layer performs a 3D deconvolution (Shelhamer et al., 2017) with batch normalization (Ioffe & Szegedy, 2015) and an ELU activation function (Clevert et al., 2015), except for the final layer, which is a standard 3D convolutional layer.


(a) bench (b) cabinet (c) car

(d) cellphone (e) chair (f) lamp

(g) monitor (h) plane (i) rifle

Figure 10. Single image reconstruction results on bench, cabinet, car, cellphone, chair, lamp, monitor, plane and rifle classes.


(a) sofa (b) speaker

(c) table (d) watercraft

Figure 11. Single image reconstruction results on sofa, speaker, table and watercraft classes.