
PatchNets: Patch-Based Generalizable Deep Implicit 3D Shape Representations

Edgar Tretschk1 Ayush Tewari1 Vladislav Golyanik1

Michael Zollhöfer2 Carsten Stoll2 Christian Theobalt1

1 Max Planck Institute for Informatics, Saarland Informatics Campus
2 Facebook Reality Labs

Abstract. Implicit surface representations, such as signed-distance functions, combined with deep learning have led to impressive models which can represent detailed shapes of objects with arbitrary topology. Since a continuous function is learned, the reconstructions can also be extracted at arbitrary resolution. However, large datasets such as ShapeNet are required to train such models.

In this paper, we present a new mid-level patch-based surface representation. At the level of patches, objects across different categories share similarities, which leads to more generalizable models. We then introduce a novel method to learn this patch-based representation in a canonical space, such that it is as object-agnostic as possible. We show that our representation, trained on one category of objects from ShapeNet, can also represent detailed shapes from any other category well. In addition, it can be trained using far fewer shapes compared to existing approaches. We show several applications of our new representation, including shape interpolation and partial point cloud completion. Due to explicit control over positions, orientations and scales of patches, our representation is also more controllable compared to object-level representations, which enables us to deform encoded shapes non-rigidly.

Keywords: implicit functions, patch-based surface representation, intra-object class generalizability

1 Introduction

Several 3D shape representations exist in the computer vision and computer graphics communities, such as point clouds, meshes, voxel grids and implicit functions. Learning-based approaches have mostly focused on voxel grids due to their regular structure, suited for convolutions. However, voxel grids [5] come with large memory costs, limiting the output resolution of such methods. Point cloud based approaches have also been explored [23]. While most approaches assume a fixed number of points, recent methods also allow for variable resolution outputs [28, 17]. Point clouds only offer a sparse representation of the surface. Meshes with fixed topology are commonly used in constrained settings with known object categories [32]. However, they are not suitable for representing objects with varying topology. Very recently, implicit function-based representations were introduced [21, 17, 4]. DeepSDF [21] learns a network which represents the continuous signed distance functions for a class of objects. The surface is represented as the 0-isosurface. Similar approaches [17, 4] use occupancy networks, where only the occupancy values are learned (similar to voxel grid-based approaches), but in a continuous representation. Implicit functions allow for representing (closed) shapes of arbitrary topology. The reconstructed surface can be extracted at any resolution, since a continuous function is learned.

Fig. 1. In contrast to a global approach, our patch-based method generalizes to human shapes after being trained on rigid ShapeNet objects.

All existing implicit function-based methods rely on large datasets of 3D shapes for training. Our goal is to build a generalizable surface representation which can be trained with far fewer shapes, and can also generalize to different object categories. Instead of learning an object-level representation, our PatchNet learns a mid-level representation of surfaces, at the level of patches. At the level of patches, objects across different categories share similarities. We learn these patches in a canonical space to further abstract from object-specific details. Patch extrinsics (position, scale and orientation of a patch) allow each patch to be translated, rotated and scaled. Multiple patches can be combined to represent the full surface of an object. We show that our patches can be learned using very few shapes, and can generalize across different object categories, see Fig. 1. Our representation also allows us to build object-level models, ObjectNets, which are useful for applications that require an object-level prior.

We demonstrate several applications of our trained models, including partial point cloud completion from depth maps, shape interpolation, and a generative model for objects. While implicit function-based approaches can reconstruct high-quality and detailed shapes, they lack controllability. We show that our patch-based implicit representation natively allows for controllability due to the explicit control over patch extrinsics. By user-guided rigging of the patches to the surface, we allow for articulated deformation of humans without re-encoding the deformed shapes. In addition to the generalization and editing capabilities, our representation includes all advantages of implicit surface modeling. Our patches can represent shapes of arbitrary topology, and our reconstructions can be extracted at any resolution using Marching Cubes [16]. Similar to DeepSDF [21], our network uses an auto-decoder architecture, combining classical optimization with learning, resulting in high-quality geometry.


2 Related Work

Our patch-based representation relates to many existing data structures and approaches in classical and learning-based visual computing. In the following, we focus on the most relevant existing representations, methods and applications.

Global Data Structures. There are multiple widely used data structures for geometric deep learning, such as voxel grids [5], point clouds [23], meshes [32] and implicit functions [21]. To alleviate the memory limitations and speed up training, improved versions of voxel grids with hierarchical space partitioning [24] and tri-linear interpolation [27] were recently proposed. A mesh is an explicit discrete surface representation which can be useful in monocular rigid 3D reconstruction [32, 13]. Combined with patch-based policies, this representation can suffer from stitching artefacts [12]. All these data structures enable a limited level of detail given a constant memory size. In contrast, other representations such as signed distance functions (SDFs) [6] represent surfaces implicitly as the zero-crossing of a volumetric level set function.

Recently, neural counterparts of implicit representations and approaches operating on them were proposed in the literature [21, 17, 4, 18]. Similarly to SDFs, these methods extract surfaces as zero level sets or decision boundaries, while differing in the type of the learned function. Thus, DeepSDF is a learnable variant of SDFs [21], whereas Mescheder et al. [17] train a spatial classifier (indicator function) for regions inside and outside of the scene. In theory, both methods allow for surface extraction at unlimited resolution. Neural implicit functions have already demonstrated their effectiveness and robustness in many follow-up works and applications such as single-view 3D reconstruction [25, 15] as well as static [28] and dynamic [19] object representation. While SAL [1] performs shape completion from noisy full raw scans, one of our applications is shape completion from partial data with local refinement. Unlike the aforementioned global approaches, PatchNets generalize much better, for example to new categories.

Patch-Based Representations. Ohtake et al. [20] use a combination of implicit functions for versatile shape representation and editing. Several neural techniques use mixtures of geometric primitives as well [31, 11, 7, 9, 33]. The latter have been shown to be helpful abstractions for tasks such as shape segmentation, interpolation, classification and recognition, as well as 3D reconstruction. Tulsiani et al. [31] learn to assemble shapes of various categories from explicit 3D geometric primitives (e.g., cubes and cuboids). Their method discovers a consistent structure and allows semantic correspondences to be established between the samples. Genova et al. [11] further develop the idea and learn a general template from data which is composed of implicit functions with local support. Due to the function choice, i.e., scaled axis-aligned anisotropic 3D Gaussians, shapes with sharp edges and thin structures are challenging for their method. In CVXNets [7], solid objects are assembled in a piecewise manner from convex elements. This results in a differentiable form which is directly usable in physics and graphics engines. Deprelle et al. [9] decompose shapes into learnable combinations of deformable elementary 3D structures. VoronoiNet [33] is a deep generative network which operates on a differentiable version of Voronoi diagrams. The concurrent NASA method [8] focuses on articulated deformations, which is one of our applications. In contrast to other patch-based approaches, our learned patches are not limited to hand-crafted priors but instead are more flexible and expressive.

3 Proposed Approach

We represent the surface of any object as a combination of several surface patches. The patches form a mid-level representation, where each patch represents the surface within a specified radius from its center. This representation is generalizable across object categories, as most objects share similar geometry at the patch level. In the following, we explain how the patches are represented using artificial neural networks, the losses required to train such networks, as well as the algorithm to combine multiple patches for smooth surface reconstruction.

3.1 Implicit Patch Representation

We represent a full object i as a collection of N_P = 30 patches. A patch p represents a surface within a sphere of radius r_{i,p} ∈ R, centered at c_{i,p} ∈ R^3. Each patch can be oriented by a rotation relative to a canonical frame, parametrized by Euler angles φ_{i,p} ∈ R^3. Let e_{i,p} = (r_{i,p}, c_{i,p}, φ_{i,p}) ∈ R^7 denote all extrinsic patch parameters. Representing the patch surface in a canonical frame of reference lets us normalize the query 3D point, leading to more object-agnostic and generalizable patches.
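A minimal sketch of how a world-space query point might be mapped into a patch's canonical frame using the extrinsics (r, c, φ); the Euler-angle convention, the function name, and the division by the patch radius are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def to_patch_frame(x, center, radius, euler_angles):
    """Map a world-space query point x into a patch's canonical frame.
    Assumed convention: XYZ Euler angles and scaling by the patch radius."""
    R = Rotation.from_euler("xyz", euler_angles).as_matrix()
    # Translate to the patch center, rotate into the canonical frame,
    # and scale so the patch sphere becomes the unit sphere.
    return (R.T @ (np.asarray(x) - np.asarray(center))) / radius
```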

The patch surface is represented as an implicit signed-distance function (SDF), which maps 3D points to their signed distance from the closest surface. This offers several advantages, as these functions are a continuous representation of the surface, unlike point clouds or meshes. In addition, the surface can be extracted at any resolution without large memory requirements, unlike for voxel grids. In contrast to prior work [33, 11], which uses simple patch primitives, we parametrize the patch surface as a neural network (PatchNet). Our network architecture is based on the auto-decoder of DeepSDF [21]. The input to the network is a patch latent code z ∈ R^{N_z} of length N_z = 128, which describes the patch surface, and a 3D query point x ∈ R^3. The output is the scalar SDF value of the surface at x. Similar to DeepSDF, we use eight weight-normalized [26] fully-connected layers with 128 output dimensions and ReLU activations, and we also concatenate z and x to the input of the fifth layer. The last fully-connected layer outputs a single scalar to which we apply tanh to obtain the SDF value.
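The following PyTorch sketch mirrors this description; the exact layer widths around the skip concatenation are an assumption, so this is an illustration rather than the released implementation:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class PatchNet(nn.Module):
    """Eight weight-normalized FC layers with 128 units, ReLU activations,
    a (z, x) skip concatenation at the fifth layer, and a tanh output."""
    def __init__(self, latent_size=128, hidden=128, num_layers=8):
        super().__init__()
        in_dim = latent_size + 3
        layers = []
        for i in range(num_layers):
            d_in = in_dim if i == 0 else hidden + (in_dim if i == 4 else 0)
            d_out = 1 if i == num_layers - 1 else hidden
            layers.append(weight_norm(nn.Linear(d_in, d_out)))
        self.layers = nn.ModuleList(layers)
        self.relu = nn.ReLU()

    def forward(self, z, x):
        inp = torch.cat([z, x], dim=-1)
        h = inp
        for i, layer in enumerate(self.layers):
            if i == 4:                      # re-concatenate (z, x) at the fifth layer
                h = torch.cat([h, inp], dim=-1)
            h = layer(h)
            if i < len(self.layers) - 1:
                h = self.relu(h)
        return torch.tanh(h).squeeze(-1)    # scalar (truncated) SDF value
```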

3.2 Preliminaries

Preprocessing: Given a watertight mesh, we preprocess it to obtain SDF values for 3D point samples. First, we center each mesh and fit it tightly into the unit sphere. We then sample points, mostly close to the surface, and compute their truncated signed distance to the object surface, with truncation at 0.1. For more details on the sampling strategy, please refer to [21].
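A sketch of this normalization and truncation step, assuming mesh vertices and precomputed SDF samples are given as NumPy arrays; the bounding-sphere fit is a simplifying assumption:

```python
import numpy as np

def normalize_and_truncate(vertices, points, sdf, truncation=0.1):
    """Center the mesh, scale it into the unit sphere, and truncate SDF values.
    Sketch only: the surface-biased point sampling (following [21]) is omitted."""
    center = 0.5 * (vertices.min(axis=0) + vertices.max(axis=0))
    scale = np.linalg.norm(vertices - center, axis=1).max()
    vertices = (vertices - center) / scale               # mesh now fits the unit sphere
    points = (points - center) / scale                   # SDF sample locations
    sdf = np.clip(sdf / scale, -truncation, truncation)  # truncate at 0.1
    return vertices, points, sdf
```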

Auto-Decoding: Unlike the usual setting, we do not use an encoder that regresses patch latent codes and extrinsics. Instead, we follow DeepSDF [21] and auto-decode shapes: we treat the patch latent codes and extrinsics of each object as free variables to be optimized for during training. That is, instead of back-propagating into an encoder, we use the gradients to learn these parameters directly during training.
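In code, auto-decoding simply means allocating the per-object codes and extrinsics as trainable tensors. A sketch with sizes taken from the text (30 patches, 128-dimensional codes, 7 extrinsic parameters); the extrinsics layout is an assumption:

```python
import torch

num_objects, num_patches, code_size = 100, 30, 128
# Free variables, optimized directly by gradient descent instead of being
# regressed by an encoder. Assumed layout: (r, c_x, c_y, c_z, phi_x, phi_y, phi_z).
patch_codes = torch.zeros(num_objects, num_patches, code_size, requires_grad=True)
extrinsics = torch.zeros(num_objects, num_patches, 7, requires_grad=True)
# Both tensors are later passed to the optimizer alongside the PatchNet weights.
```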

Initialization: Since we perform auto-decoding, we treat the patch latent codes and extrinsics as free variables, similar to classical optimization. Therefore, we can directly initialize them. All patch latent codes are initially set to zero, and the patch positions are initialized by greedy farthest point sampling of point samples of the object surface. We set each patch radius to the minimum such that each surface point sample is covered by its closest patch. The patch orientation aligns the z-axis of the patch coordinate system with the surface normal.
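A NumPy sketch of this initialization; the conversion of the normal into Euler angles is omitted, and the function and argument names are illustrative:

```python
import numpy as np

def init_patch_extrinsics(surface_points, surface_normals, num_patches=30):
    """Greedy farthest point sampling for centers, minimal covering radii,
    and the surface normal at each center for the z-axis alignment."""
    centers = [surface_points[0]]
    dists = np.linalg.norm(surface_points - centers[0], axis=1)
    for _ in range(num_patches - 1):
        idx = int(np.argmax(dists))                     # farthest remaining point
        centers.append(surface_points[idx])
        dists = np.minimum(dists, np.linalg.norm(surface_points - surface_points[idx], axis=1))
    centers = np.stack(centers)

    # Assign every surface point to its closest center; a patch radius is the
    # smallest value that still covers all points assigned to that patch.
    closest = np.argmin(np.linalg.norm(surface_points[:, None] - centers[None], axis=2), axis=1)
    radii = np.array([np.linalg.norm(surface_points[closest == p] - centers[p], axis=1).max()
                      for p in range(num_patches)])

    # Orientation: the normal at the surface point closest to each center.
    nearest = [int(np.argmin(np.linalg.norm(surface_points - c, axis=1))) for c in centers]
    return centers, radii, surface_normals[nearest]
```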

3.3 Loss Functions

We train PatchNet by auto-decoding N full objects. The patch latent codes of an object i are z_i = [z_{i,0}, z_{i,1}, ..., z_{i,N_P−1}], with each patch latent code of length N_z. Patch extrinsics are represented as e_i = [e_{i,0}, e_{i,1}, ..., e_{i,N_P−1}]. Let θ denote the trainable weights of PatchNet. We employ the following loss function:

\[
\mathcal{L}(z_i, e_i, \theta) = \mathcal{L}_{\mathrm{recon}}(z_i, e_i, \theta) + \mathcal{L}_{\mathrm{ext}}(e_i) + \mathcal{L}_{\mathrm{reg}}(z_i) . \tag{1}
\]

Here, L_recon is the surface reconstruction loss, L_ext is the extrinsic loss guiding the extrinsics of each patch, and L_reg is a regularizer on the patch latent codes.

Reconstruction Loss: The reconstruction loss penalizes the difference between the predicted and ground-truth SDF values for each patch:

\[
\mathcal{L}_{\mathrm{recon}}(z_i, e_i, \theta) = \frac{1}{N_P} \sum_{p=0}^{N_P-1} \frac{1}{|S(e_{i,p})|} \sum_{x \in S(e_{i,p})} \big\| f(x, z_{i,p}, \theta) - s(x) \big\|_1 , \tag{2}
\]

where f(·) and s(x) denote a forward pass of the network and the ground-truth truncated SDF value at point x, respectively; S(e_{i,p}) is the set of all (normalized) point samples that lie within the bounds of patch p with extrinsics e_{i,p}.
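A PyTorch sketch of Eq. (2) for a single object, reusing the PatchNet sketch above; the extrinsics layout and the omission of the patch rotation when normalizing the samples are simplifying assumptions:

```python
import torch

def reconstruction_loss(patchnet, latents, extrinsics, samples, sdf_gt):
    """latents: (NP, 128), extrinsics: (NP, 7) as (r, c, phi), samples: (M, 3),
    sdf_gt: (M,) ground-truth truncated SDF values."""
    per_patch = []
    for p in range(latents.shape[0]):
        radius, center = extrinsics[p, 0], extrinsics[p, 1:4]
        inside = (samples - center).norm(dim=1) <= radius    # S(e_{i,p})
        if inside.sum() == 0:
            continue
        x = (samples[inside] - center) / radius              # normalized samples (rotation omitted)
        z = latents[p].expand(x.shape[0], -1)
        pred = patchnet(z, x)
        per_patch.append((pred - sdf_gt[inside]).abs().mean())
    return torch.stack(per_patch).mean() if per_patch else samples.new_zeros(())
```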

Extrinsic Loss: The composite extrinsic loss ensures all patches contribute to the surface and are placed such that the surfaces are learned in a canonical space:

\[
\mathcal{L}_{\mathrm{ext}}(e_i) = \mathcal{L}_{\mathrm{sur}}(e_i) + \mathcal{L}_{\mathrm{cov}}(e_i) + \mathcal{L}_{\mathrm{rot}}(e_i) + \mathcal{L}_{\mathrm{scl}}(e_i) + \mathcal{L}_{\mathrm{var}}(e_i) . \tag{3}
\]


Lsur ensures that every patch stays close to the surface:

\[
\mathcal{L}_{\mathrm{sur}}(e_i) = \omega_{\mathrm{sur}} \cdot \frac{1}{N_P} \sum_{p=0}^{N_P-1} \max\Big( \min_{x \in O_i} \big\| c_{i,p} - x \big\|_2^2 ,\; t \Big) . \tag{4}
\]

Here, O_i is the set of surface points of object i. We use this term only when the distance between a patch and the surface is greater than the threshold t = 0.06.

A symmetric coverage loss L_cov encourages each point on the surface to be covered by at least one patch:

\[
\mathcal{L}_{\mathrm{cov}}(e_i) = \omega_{\mathrm{cov}} \cdot \frac{1}{|U_i|} \sum_{x \in U_i} \sum_{p} \frac{w_{i,p,x}}{\sum_{p'} w_{i,p',x}} \Big( \big\| c_{i,p} - x \big\|_2 - r_{i,p} \Big) , \tag{5}
\]

where U_i ⊆ O_i are all surface points that are not covered by any patch, i.e., outside the bounds of all patches. w_{i,p,x} weighs the patches based on their distance from x, with w_{i,p,x} = exp(−0.5 · ((‖c_{i,p} − x‖_2 − r_{i,p}) / σ)^2), where σ = 0.05.

We also introduce a loss to align the patches with the surface normals. This encourages the patch surface to be learned in a canonical frame of reference:

\[
\mathcal{L}_{\mathrm{rot}}(e_i) = \omega_{\mathrm{rot}} \cdot \frac{1}{N_P} \sum_{p=0}^{N_P-1} \big( 1 - \langle \phi_{i,p} \cdot [0,0,1]^{T} , \; n_{i,p} \rangle \big)^2 . \tag{6}
\]

Here, φ_{i,p} · [0,0,1]^T is the patch z-axis rotated by the patch orientation, and n_{i,p} is the surface normal at the point o_{i,p} closest to the patch center, i.e., o_{i,p} = argmin_{x ∈ O_i} ‖x − c_{i,p}‖_2.

Finally, we introduce two losses for the extent of the patches. The first loss encourages the patches to be reasonably small. This prevents significant overlap between different patches:

\[
\mathcal{L}_{\mathrm{scl}}(e_i) = \omega_{\mathrm{scl}} \cdot \frac{1}{N_P} \sum_{p=0}^{N_P-1} r_{i,p}^2 . \tag{7}
\]

The second loss encourages all patches to be of similar sizes. This prevents the surface from being reconstructed by only a few large patches:

\[
\mathcal{L}_{\mathrm{var}}(e_i) = \omega_{\mathrm{var}} \cdot \frac{1}{N_P} \sum_{p=0}^{N_P-1} \big( r_{i,p} - m_i \big)^2 , \tag{8}
\]

where mi is the mean patch radius of object i.
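A short PyTorch sketch of Eqs. (7) and (8) for the radii of one object; the default weights follow the training details in Sec. 4.1:

```python
import torch

def scale_losses(radii, w_scl=0.01, w_var=0.01):
    """radii: tensor of the N_P patch radii of a single object."""
    l_scl = w_scl * (radii ** 2).mean()                    # keep patches small, Eq. (7)
    l_var = w_var * ((radii - radii.mean()) ** 2).mean()   # keep sizes similar, Eq. (8)
    return l_scl + l_var
```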

Regularizer: Similar to DeepSDF, we add an ℓ2-regularizer on the patch latent codes, assuming a Gaussian prior distribution:

\[
\mathcal{L}_{\mathrm{reg}}(z_i) = \omega_{\mathrm{reg}} \cdot \frac{1}{N_P} \sum_{p=0}^{N_P-1} \big\| z_{i,p} \big\|_2^2 . \tag{9}
\]


Optimization: At training time, we optimize the following problem:

\[
\underset{\theta, \{z_i\}_i, \{e_i\}_i}{\arg\min} \;\; \sum_{i=0}^{N-1} \mathcal{L}(z_i, e_i, \theta) . \tag{10}
\]

At test time, we can reconstruct any surface using our learned patch-based representation. Using the same initialization of extrinsics and patch latent codes, and given point samples with their SDF values, we optimize for the patch latent codes and the patch extrinsics with fixed network weights.

3.4 Blended Surface Reconstruction

For a smooth surface reconstruction of object i, e.g. for Marching Cubes, we blend between different patches in the overlapping regions to obtain the blended SDF prediction g_i(x). Specifically, g_i(x) is computed as a weighted linear combination of the SDF values f(x, z_{i,p}, θ) of the overlapping patches:

\[
g_i(x) = \sum_{p \in P_{i,x}} \frac{w_{i,p,x}}{\sum_{p' \in P_{i,x}} w_{i,p',x}} \, f(x, z_{i,p}, \theta) , \tag{11}
\]

with P_{i,x} denoting the patches which overlap at point x. For empty P_{i,x}, we set g_i(x) = 1. The blending weights are defined as:

\[
w_{i,p,x} = \exp\!\left( -\frac{1}{2} \left( \frac{\| c_{i,p} - x \|_2}{\sigma} \right)^{\!2} \right) - \exp\!\left( -\frac{1}{2} \left( \frac{r_{i,p}}{\sigma} \right)^{\!2} \right) , \tag{12}
\]

with σ = ri,p/3. The offset ensures that the weight is zero at the patch boundary.
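A PyTorch sketch of Eqs. (11) and (12) for a single query point, reusing the PatchNet sketch above; the patch-frame normalization again omits the rotation for brevity:

```python
import torch

def blended_sdf(patchnet, latents, extrinsics, x):
    """latents: (NP, 128), extrinsics: (NP, 7) as (r, c, phi), x: (3,) query point."""
    weights, values = [], []
    for p in range(latents.shape[0]):
        radius, center = extrinsics[p, 0], extrinsics[p, 1:4]
        d = (x - center).norm()
        if d > radius:                       # patch p does not overlap x
            continue
        sigma = radius / 3.0
        # Offset Gaussian weight that vanishes at the patch boundary, Eq. (12).
        w = torch.exp(-0.5 * (d / sigma) ** 2) - torch.exp(-0.5 * (radius / sigma) ** 2)
        weights.append(w)
        values.append(patchnet(latents[p], (x - center) / radius))
    if not weights:                          # empty P_{i,x}
        return torch.tensor(1.0)
    w = torch.stack(weights)
    return (w / w.sum() * torch.stack(values)).sum()   # Eq. (11)
```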

4 Experiments

In the following, we show the effectiveness of our patch-based representation on several different problems. For an ablation study of the loss functions, please refer to the supplemental.

4.1 Settings

Datasets We employ ShapeNet [3] for most experiments. We perform preprocessing with the code of Stutz et al. [29], similar to [17, 10], to make the meshes watertight and normalize them within a unit cube. For training and test splits, we follow Choy et al. [5]. The results in Tables 1 and 2 use the full test set. Other results refer to a reduced test set, where we randomly pick 50 objects from each of the 13 categories. In the supplemental, we show that our results on the reduced test set are representative of the full test set. In addition, we use Dynamic FAUST [2] for testing. We subsample the test set from DEMEA [30] by concatenating all test sequences and taking every 20th mesh. We generate 200k SDF point samples per shape during preprocessing.


Metrics We use three error metrics. For Intersection-over-Union (IoU), higher is better. For Chamfer distance (Chamfer), lower is better. For F-score, higher is better. The supplementary material contains further details on these metrics.

Training Details We train our networks using PyTorch [22]. The number of epochs is 1000; the learning rate for the network is initially 5·10^−4, and for the patch latent codes and extrinsics 10^−3. We halve both learning rates every 200 epochs. For optimization, we use Adam [14] and a batch size of 64. For each object in the batch, we randomly sample 3k SDF point samples. The weights for the losses are: ω_scl = 0.01, ω_var = 0.01, ω_sur = 5, ω_rot = 1, ω_cov = 200. We linearly increase ω_reg from 0 to 10^−4 for 400 epochs and then keep it constant.
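A minimal sketch of this optimization schedule; the parameter tensors are placeholders, and the data loading and loss computation are omitted:

```python
import torch

net_params = [torch.nn.Parameter(torch.zeros(128, 128))]   # stands in for the PatchNet weights
code_params = [torch.nn.Parameter(torch.zeros(30, 135))]   # stands in for latents + extrinsics
optimizer = torch.optim.Adam([
    {"params": net_params, "lr": 5e-4},
    {"params": code_params, "lr": 1e-3},
])
# Halve both learning rates every 200 epochs over 1000 epochs of training.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
for epoch in range(1000):
    # ... one epoch over batches of 64 objects with 3k SDF samples each ...
    scheduler.step()
```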

Baseline We design a "global-patch" baseline similar to DeepSDF, which only uses a single patch without extrinsics. The patch latent size is 4050, matching the total size of our representation (30 patches with a 128-dimensional code and 7 extrinsic parameters each). The learning rate scheme is the same as for our method.

4.2 Surface Reconstruction

We first consider surface reconstruction.

Results We train our approach on a subset of the training data, where we randomly pick 100 shapes from each category. In addition to comparing with our baseline, we compare with DeepSDF [21] using the setup from their paper. Both DeepSDF and our baseline use the subset. Qualitative results are shown in Figs. 2 and 3.

Fig. 2. Surface Reconstruction. From left to right: DeepSDF, baseline, ours, ground truth.

Table 1 shows the quantitative results for surface reconstruction. We significantly outperform DeepSDF and our baseline almost everywhere, demonstrating the higher quality afforded by our patch-based representation.

We also compare with several state-of-the-art approaches for implicit surface reconstruction: OccupancyNetworks [17], Structured Implicit Functions [11] and Deep Structured Implicit Functions [10]1.

1 DSIF is also known as Local Deep Implicit Functions for 3D Shape.


Table 1. Surface Reconstruction. We significantly outperform DeepSDF [21] and our baseline on all categories of ShapeNet almost everywhere.

| Category | IoU: DeepSDF | IoU: Baseline | IoU: Ours | Chamfer: DeepSDF | Chamfer: Baseline | Chamfer: Ours | F-score: DeepSDF | F-score: Baseline | F-score: Ours |
|---|---|---|---|---|---|---|---|---|---|
| airplane | 84.9 | 65.3 | 91.1 | 0.012 | 0.077 | 0.004 | 83.0 | 72.9 | 97.8 |
| bench | 78.3 | 68.0 | 85.4 | 0.021 | 0.065 | 0.006 | 91.2 | 80.6 | 95.7 |
| cabinet | 92.2 | 88.8 | 92.9 | 0.033 | 0.055 | 0.110 | 91.6 | 86.4 | 91.2 |
| car | 87.9 | 83.6 | 91.7 | 0.049 | 0.070 | 0.049 | 82.2 | 74.5 | 87.7 |
| chair | 81.8 | 72.9 | 90.0 | 0.042 | 0.110 | 0.018 | 86.6 | 75.5 | 94.3 |
| display | 91.6 | 86.5 | 95.2 | 0.030 | 0.061 | 0.039 | 93.7 | 87.0 | 97.0 |
| lamp | 74.9 | 63.0 | 89.6 | 0.566 | 0.438 | 0.055 | 82.5 | 69.4 | 94.9 |
| rifle | 79.0 | 68.5 | 93.3 | 0.013 | 0.039 | 0.002 | 90.9 | 82.3 | 99.3 |
| sofa | 92.5 | 85.4 | 95.0 | 0.054 | 0.226 | 0.014 | 92.1 | 84.2 | 95.3 |
| speaker | 91.9 | 86.7 | 92.7 | 0.050 | 0.094 | 0.243 | 87.6 | 79.4 | 88.5 |
| table | 84.2 | 71.9 | 89.4 | 0.074 | 0.156 | 0.018 | 91.1 | 79.2 | 95.0 |
| telephone | 96.2 | 95.0 | 98.1 | 0.008 | 0.016 | 0.003 | 97.7 | 96.2 | 99.4 |
| watercraft | 85.2 | 79.1 | 93.2 | 0.026 | 0.041 | 0.009 | 87.8 | 80.2 | 96.4 |
| mean | 77.4 | 76.5 | 92.1 | 0.075 | 0.111 | 0.044 | 89.9 | 80.6 | 94.8 |

While they are trained on the full set of ShapeNet shapes, we train our model only on a small subset. Even in this disadvantageous and challenging setting, we outperform these approaches on most categories, see Table 2. Note that we compute the metrics consistently with Genova et al. [10] and thus can directly compare to the numbers reported in their paper.

Table 2. Surface Reconstruction. We outperform OccupancyNetworks (OccNet) [17], Structured Implicit Functions (SIF) [11], and Deep Structured Implicit Functions (DSIF) [10] almost everywhere.

| Category | IoU: OccNet | IoU: SIF | IoU: DSIF | IoU: Ours | Chamfer: OccNet | Chamfer: SIF | Chamfer: DSIF | Chamfer: Ours | F-score: OccNet | F-score: SIF | F-score: DSIF | F-score: Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| airplane | 77.0 | 66.2 | 91.2 | 91.1 | 0.016 | 0.044 | 0.010 | 0.004 | 87.8 | 71.4 | 96.9 | 97.8 |
| bench | 71.3 | 53.3 | 85.6 | 85.4 | 0.024 | 0.082 | 0.017 | 0.006 | 87.5 | 58.4 | 94.8 | 95.7 |
| cabinet | 86.2 | 78.3 | 93.2 | 92.9 | 0.041 | 0.110 | 0.033 | 0.110 | 86.0 | 59.3 | 92.0 | 91.2 |
| car | 83.9 | 77.2 | 90.2 | 91.7 | 0.061 | 0.108 | 0.028 | 0.049 | 77.5 | 56.6 | 87.2 | 87.7 |
| chair | 73.9 | 57.2 | 87.5 | 90.0 | 0.044 | 0.154 | 0.034 | 0.018 | 77.2 | 42.4 | 90.9 | 94.3 |
| display | 81.8 | 69.3 | 94.2 | 95.2 | 0.034 | 0.097 | 0.028 | 0.039 | 82.1 | 56.3 | 94.8 | 97.0 |
| lamp | 56.5 | 41.7 | 77.9 | 89.6 | 0.167 | 0.342 | 0.180 | 0.055 | 62.7 | 35.0 | 83.5 | 94.9 |
| rifle | 69.5 | 60.4 | 89.9 | 93.3 | 0.019 | 0.042 | 0.009 | 0.002 | 86.2 | 70.0 | 97.3 | 99.3 |
| sofa | 87.2 | 76.0 | 94.1 | 95.0 | 0.030 | 0.080 | 0.035 | 0.014 | 85.9 | 55.2 | 92.8 | 95.3 |
| speaker | 82.4 | 74.2 | 90.3 | 92.7 | 0.101 | 0.199 | 0.068 | 0.243 | 74.7 | 47.4 | 84.3 | 88.5 |
| table | 75.6 | 57.2 | 88.2 | 89.4 | 0.044 | 0.157 | 0.056 | 0.018 | 84.9 | 55.7 | 92.4 | 95.0 |
| telephone | 90.9 | 83.1 | 97.6 | 98.1 | 0.013 | 0.039 | 0.008 | 0.003 | 94.8 | 81.8 | 98.1 | 99.4 |
| watercraft | 74.7 | 64.3 | 90.1 | 93.2 | 0.041 | 0.078 | 0.020 | 0.009 | 77.3 | 54.2 | 93.2 | 96.4 |
| mean | 77.8 | 66.0 | 90.0 | 92.1 | 0.049 | 0.118 | 0.040 | 0.044 | 81.9 | 59.0 | 92.2 | 94.8 |

Generalization Our patch-based representation is more generalizable compared to existing representations. To demonstrate this, we design several experiments with different training data. We modify the learning rate schemes to equalize the number of network weight updates. For each experiment, we compare our method with the baseline approaches described above. We use a reduced ShapeNet test set, which consists of 50 shapes from each category. Fig. 3 shows qualitative results and comparisons. We also show cross-dataset generalization by evaluating on 647 meshes from the Dynamic FAUST [2] test set. In the first experiment, we train the network on shapes from the Cabinet category and try to reconstruct shapes from every other category. We significantly outperform the baselines almost everywhere, see Table 3. The improvement is even more noticeable for cross-dataset generalization, with around 70% improvement in F-score compared to our global-patch baseline.

Fig. 3. Generalization. From left to right: DeepSDF, baseline, ours on one category, ours on one shape, ours on 1 shape per category, ours on 3 per category, ours on 10 per category, ours on 30 per category, ours on 100 per category, and ground truth.

In the second experiment, we evaluate the amount of training data required to train our network. We train both our network and the baselines on 30, 10, 3 and 1 shapes per category of ShapeNet. In addition, we also include an experiment training the networks on a single randomly picked shape from ShapeNet. Fig. 4 shows the errors for ShapeNet (mean across categories) and Dynamic FAUST. The performance of our approach degrades only slightly with a decreasing number of training shapes. However, the DeepSDF baseline degrades much more severely. This is even more evident for cross-dataset generalization on Dynamic FAUST, where the baseline cannot perform well even with a larger number of training shapes, while we perform similarly across datasets.

Fig. 4. Generalization. We train our PatchNet (green), the global-patch baseline (orange), and DeepSDF (blue) on different numbers of shapes (x-axis). Results for different metrics on our reduced test sets are shown on the y-axis. For IoU and F-score, higher is better. For Chamfer distance, lower is better.


Table 3. Generalization. Networks trained on the Cabinet category, but evaluated on every category of ShapeNet, as well as on Dynamic FAUST. We significantly outperform the baseline (BL) and DeepSDF (DSDF) almost everywhere.

| Category | IoU: BL | IoU: DSDF | IoU: Ours | Chamfer: BL | Chamfer: DSDF | Chamfer: Ours | F-score: BL | F-score: DSDF | F-score: Ours |
|---|---|---|---|---|---|---|---|---|---|
| airplane | 33.5 | 56.9 | 88.2 | 0.668 | 0.583 | 0.005 | 33.5 | 61.7 | 96.3 |
| bench | 49.1 | 58.8 | 80.4 | 0.169 | 0.093 | 0.006 | 63.6 | 76.3 | 93.3 |
| cabinet | 86.0 | 91.1 | 91.4 | 0.045 | 0.025 | 0.121 | 86.4 | 92.6 | 91.7 |
| car | 78.4 | 83.7 | 92.0 | 0.101 | 0.074 | 0.050 | 62.7 | 73.9 | 87.2 |
| chair | 50.7 | 61.8 | 86.9 | 0.473 | 0.287 | 0.012 | 49.1 | 65.2 | 92.5 |
| display | 83.2 | 87.6 | 94.4 | 0.111 | 0.065 | 0.052 | 83.9 | 89.6 | 96.9 |
| lamp | 49.7 | 59.3 | 86.6 | 0.689 | 2.645 | 0.082 | 50.4 | 64.5 | 93.4 |
| rifle | 56.4 | 56.1 | 91.8 | 0.114 | 2.669 | 0.002 | 71.0 | 54.7 | 99.1 |
| sofa | 81.1 | 87.3 | 94.8 | 0.245 | 0.193 | 0.010 | 74.2 | 84.6 | 95.2 |
| speaker | 83.2 | 88.3 | 90.5 | 0.163 | 0.080 | 0.232 | 71.8 | 80.1 | 84.9 |
| table | 55.0 | 73.6 | 88.4 | 0.469 | 0.222 | 0.020 | 61.8 | 82.8 | 95.0 |
| telephone | 90.4 | 94.7 | 97.3 | 0.051 | 0.015 | 0.004 | 90.8 | 96.1 | 99.2 |
| watercraft | 66.5 | 73.5 | 91.8 | 0.115 | 0.157 | 0.006 | 63.0 | 74.2 | 96.2 |
| mean | 66.4 | 74.8 | 90.3 | 0.263 | 0.547 | 0.046 | 66.3 | 76.6 | 93.9 |
| DFAUST | 57.8 | 71.2 | 94.4 | 0.751 | 0.389 | 0.012 | 25.0 | 45.4 | 94.0 |

Table 4. Ablative Analysis. We evaluate the performance using different numbers of patches, different sizes of the patch latent code/hidden dimensions, and different amounts of training data. The training time is measured on an Nvidia V100 GPU.

| Setting | IoU | Chamfer | F-score | Time |
|---|---|---|---|---|
| N_P = 3 | 73.8 | 0.15 | 72.9 | 1h |
| N_P = 10 | 85.2 | 0.049 | 88.0 | 1.5h |
| size 32 | 82.8 | 0.066 | 84.7 | 1.5h |
| size 512 | 95.3 | 0.048 | 97.2 | 8h |
| full dataset | 92.2 | 0.050 | 94.8 | 156h |
| ours | 91.6 | 0.045 | 94.5 | 2h |

Ablation Experiments We perform several ablation experiments to evaluate our approach. We first evaluate the number of patches required to reconstruct surfaces. Table 4 reports these numbers on the reduced test set. The patch networks here are trained on the reduced training set, consisting of 100 shapes per ShapeNet category. As expected, the performance becomes better with a larger number of patches, since this leads to smaller patches which can capture more details and generalize better. We also evaluate the impact of different sizes of the latent codes and hidden dimensions used for the patch network. Larger latent codes and hidden dimensions lead to higher-quality results. Similarly, training on the full training dataset, consisting of 33k shapes, leads to higher quality. However, all design choices with better performance come at the cost of longer training times, see Table 4.

4.3 Object-Level Priors

We also experiment with category-specific object priors. We add ObjectNet (four FC layers with hidden dimension 1024 and ReLU activations) in front of PatchNet and our baselines. From object latent codes of size 256, ObjectNet regresses patch latent codes and extrinsics as an intermediate representation usable with PatchNet. ObjectNet effectively increases the network capacity of our baselines.

Training We initialize all object latents with zeros and the weights of ObjectNet's last layer with very small numbers. We initialize the bias of ObjectNet's last layer with zeros for the patch latent codes, and with the extrinsics of an arbitrary object from the category, as computed by our initialization in Sec. 3.2, for the extrinsics. We pretrain PatchNet on ShapeNet. For our method, the PatchNet is kept fixed from this point on. As training set, we use the full training split of the ShapeNet category for which we train. We remove L_rot completely, as it significantly lowers quality. The L2 regularization is only applied to the object latent codes. We set ω_var = 5. ObjectNet is trained in three phases, each lasting 1000 epochs. We use the same initial learning rates as when training PatchNet, except in the last phase, where we reduce them by a factor of 5. The batch size is 128.
Phase I: We pretrain ObjectNet to ensure good patch extrinsics. For this, we use the extrinsic loss L_ext in Eq. 3 and the regularizer. We set ω_scl = 2.
Phase II: Next, we learn to regress patch latent codes. First, we add a layer that multiplies the regressed scales by 1.3. We then store these extrinsics. Afterwards, we train using L_recon and two L2 losses that keep the regressed position and scale close to the stored extrinsics, with respective weights 1, 3, and 30.
Phase III: The complete loss L in Eq. 1, with ω_scl = 0.02, yields the final refinements.

Coarse Correspondences Fig. 5 shows that the learned patch distribution is consistent across objects, establishing coarse correspondences between objects.

Fig. 5. Coarse Correspondences. Note the consistent coloring of the patches.

Interpolation Due to the implicitly learned coarse correspondences, we can encode test objects into object latent codes and then linearly interpolate between them. Fig. 6 shows that interpolation of the latent codes leads to a smooth morph between the decoded shapes in 3D space.

Generative Model We can explore the learned object latent space further by turning ObjectNet into a generative model. Since auto-decoding does not yield an encoder that constrains the latent codes to a known distribution, we have to estimate the unknown latent distribution. Therefore, we fit a multivariate Gaussian to the object latent codes obtained at training time. We can then sample new object latent codes from the fitted Gaussian and use them to generate new objects, see Fig. 6.
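A sketch of this sampling step; the stored training codes are replaced by a random stand-in so the snippet runs on its own:

```python
import numpy as np

# Stand-in for the object latent codes (size 256) collected during training.
codes = np.random.randn(1000, 256)
# Fit a multivariate Gaussian to the codes and sample new object latents.
mean, cov = codes.mean(axis=0), np.cov(codes, rowvar=False)
new_codes = np.random.multivariate_normal(mean, cov, size=4)
# Each sampled code would be decoded by ObjectNet into patch latents and
# extrinsics, which PatchNet then turns into a blended SDF for Marching Cubes.
```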

Partial Point Cloud Completion Given a partial point cloud, we can optimize for the object latent code which best explains the visible region. ObjectNet acts as a prior which completes the missing parts of the shape. For our method, we pretrain our PatchNet on a different object category and keep it fixed, and then train ObjectNet on the target category, which makes this task more challenging for us. We choose the versions of our baselines where the eight final layers are pretrained on all categories and finetuned on the target shape category.


Fig. 6. Interpolation (top). The left and right end points are encoded test objects. Generative Models (bottom). We sample object latents from ObjectNet's fitted prior.

We evaluated several other settings, with this one being the most competitive. See the supplemental for more on surface reconstruction with object-level priors.

Optimization: We initialize with the average of the object latent codes obtained at training time. We optimize for 600 iterations, starting with a learning rate of 0.01 and halving it every 200 iterations. Since our method regresses the patch latent codes and extrinsics as an intermediate step, we can further refine the result by treating this intermediate patch-level representation as free variables. Specifically, we refine the patch latent codes for the last 100 iterations with a learning rate of 0.001, while keeping the extrinsics fixed. This allows us to integrate details not captured by the object-level prior. Fig. 7 demonstrates this effect. During optimization, we use the reconstruction loss, the L2 regularizer and the coverage loss. The other extrinsic losses have a detrimental effect on patches that are outside the partial point cloud. We use 8k samples per iteration.

We obtain the partial point clouds from depth maps, similar to Park et al. [21]. We also employ their free-space loss, which encourages the network to regress positive values for samples between the surface and the camera. We use 30% free-space samples. We consider depth maps from a fixed viewpoint and from a per-scene random viewpoint. For shape completion, we report the F-score between the full groundtruth mesh and the reconstructed mesh. Similar to Park et al. [21], we also compute the mesh accuracy for shape completion. It is the 90th percentile of shortest distances from the surface samples of the reconstructed shape to surface samples of the full groundtruth.
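A small sketch of this mesh-accuracy metric, computed brute-force over point samples:

```python
import numpy as np

def mesh_accuracy(recon_samples, gt_samples, percentile=90):
    """90th percentile of shortest distances from reconstructed surface samples
    to ground-truth surface samples (both given as (N, 3) arrays)."""
    dists = np.linalg.norm(recon_samples[:, None] - gt_samples[None], axis=2).min(axis=1)
    return np.percentile(dists, percentile)
```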

Fig. 7. Shape Completion. (Sofa) From left to right: baseline, DeepSDF, ours unrefined, ours refined. (Airplane) From left to right: ours unrefined, ours refined.


Table 5. Partial Point Cloud Completion from Depth Maps. We complete depth maps from a fixed camera viewpoint and from per-scene random viewpoints.

| Method | sofas fixed: acc. | sofas fixed: F-score | sofas random: acc. | sofas random: F-score | airplanes fixed: acc. | airplanes fixed: F-score | airplanes random: acc. | airplanes random: F-score |
|---|---|---|---|---|---|---|---|---|
| baseline | 0.094 | 43.0 | 0.092 | 42.7 | 0.069 | 58.1 | 0.066 | 58.7 |
| DeepSDF-based baseline | 0.106 | 33.6 | 0.101 | 39.5 | 0.066 | 56.9 | 0.065 | 55.5 |
| ours | 0.091 | 48.1 | 0.077 | 49.2 | 0.058 | 60.5 | 0.056 | 59.4 |
| ours+refined | 0.052 | 53.6 | 0.053 | 52.4 | 0.041 | 67.7 | 0.043 | 65.8 |

4.4 Articulated Deformation

Our patch-level representation can model some articulated deformations by only modifying the patch extrinsics, without needing to adapt the patch latent codes. Given a template surface and patch extrinsics for this template, we first encode it into patch latent codes. After manipulating the patch extrinsics, we can obtain an articulated surface with our smooth blending from Eq. 11, as Fig. 8 demonstrates.

Fig. 8. Articulated Motion. We encode a template shape into patch latent codes (first pair). We then modify the patch extrinsics, while keeping the patch latent codes fixed, leading to non-rigid deformations (middle two pairs). The last pair shows a failure case due to large non-rigid deformations away from the template. Note that the colored patches move rigidly across poses while the mixture deforms non-rigidly.

5 Concluding Remarks

Limitations. We sample the SDF using DeepSDF's sampling strategy, which might limit the level of detail. Generalizability at test time requires optimizing patch latent codes and extrinsics, a problem shared with other auto-decoders. We fit the reduced test set in 71 min due to batching, and a single object in 10 min.

Conclusion. We have presented a mid-level geometry representation based on patches. This representation leverages the similarities of objects at the patch level, leading to a highly generalizable neural shape representation. For example, we show that our representation, trained on one object category, can also represent other categories. We hope that our representation will enable a large variety of applications that go far beyond shape interpolation and point cloud completion.

Acknowledgements. This work was supported by the ERC Consolidator Grant 4DReply (770784) and an Oculus research grant.


References

1. Atzmon, M., Lipman, Y.: SAL: Sign agnostic learning of shapes from raw data. In: Computer Vision and Pattern Recognition (CVPR) (2020)

2. Bogo, F., Romero, J., Pons-Moll, G., Black, M.J.: Dynamic FAUST: Registering human bodies in motion. In: Computer Vision and Pattern Recognition (CVPR) (2017)

3. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)

4. Chen, Z., Zhang, H.: Learning implicit fields for generative shape modeling. In: Computer Vision and Pattern Recognition (CVPR) (2019)

5. Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In: European Conference on Computer Vision (ECCV) (2016)

6. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: SIGGRAPH (1996)

7. Deng, B., Genova, K., Yazdani, S., Bouaziz, S., Hinton, G., Tagliasacchi, A.: CvxNets: Learnable convex decomposition. In: Advances in Neural Information Processing Systems Workshops (2019)

8. Deng, B., Lewis, J., Jeruzalski, T., Pons-Moll, G., Hinton, G., Norouzi, M., Tagliasacchi, A.: NASA: Neural articulated shape approximation (2020)

9. Deprelle, T., Groueix, T., Fisher, M., Kim, V., Russell, B., Aubry, M.: Learning elementary structures for 3D shape generation and matching. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)

10. Genova, K., Cole, F., Sud, A., Sarna, A., Funkhouser, T.: Local deep implicit functions for 3D shape. In: Computer Vision and Pattern Recognition (CVPR) (2020)

11. Genova, K., Cole, F., Vlasic, D., Sarna, A., Freeman, W.T., Funkhouser, T.: Learning shape templates with structured implicit functions. In: International Conference on Computer Vision (ICCV) (2019)

12. Groueix, T., Fisher, M., Kim, V., Russell, B., Aubry, M.: A papier-mâché approach to learning 3D surface generation. In: Computer Vision and Pattern Recognition (CVPR) (2018)

13. Kato, H., Ushiku, Y., Harada, T.: Neural 3D mesh renderer. In: Computer Vision and Pattern Recognition (CVPR) (2018)

14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)

15. Liu, S., Saito, S., Chen, W., Li, H.: Learning to infer implicit surfaces without 3D supervision. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)

16. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3D surface construction algorithm. In: Conference on Computer Graphics and Interactive Techniques (1987)

17. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: Learning 3D reconstruction in function space. In: Computer Vision and Pattern Recognition (CVPR) (2019)

18. Michalkiewicz, M., Pontes, J.K., Jack, D., Baktashmotlagh, M., Eriksson, A.: Implicit surface representations as layers in neural networks. In: International Conference on Computer Vision (ICCV) (2019)


19. Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Occupancy flow: 4D reconstruction by learning particle dynamics. In: International Conference on Computer Vision (ICCV) (2019)

20. Ohtake, Y., Belyaev, A., Alexa, M., Turk, G., Seidel, H.P.: Multi-level partition of unity implicits. In: ACM Transactions on Graphics (TOG) (2003)

21. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: DeepSDF: Learning continuous signed distance functions for shape representation. In: Computer Vision and Pattern Recognition (CVPR) (2019)

22. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)

23. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)

24. Riegler, G., Osman Ulusoy, A., Geiger, A.: OctNet: Learning deep 3D representations at high resolutions. In: Computer Vision and Pattern Recognition (CVPR) (2017)

25. Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In: International Conference on Computer Vision (ICCV) (2019)

26. Salimans, T., Kingma, D.P.: Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems (NeurIPS) (2016)

27. Shimada, S., Golyanik, V., Tretschk, E., Stricker, D., Theobalt, C.: DispVoxNets: Non-rigid point set alignment with supervised learning proxies. In: International Conference on 3D Vision (3DV) (2019)

28. Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks: Continuous 3D-structure-aware neural scene representations. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)

29. Stutz, D., Geiger, A.: Learning 3D shape completion under weak supervision. International Journal of Computer Vision (IJCV) (2018)

30. Tretschk, E., Tewari, A., Zollhöfer, M., Golyanik, V., Theobalt, C.: DEMEA: Deep mesh autoencoders for non-rigidly deforming objects. In: European Conference on Computer Vision (ECCV) (2020)

31. Tulsiani, S., Su, H., Guibas, L.J., Efros, A.A., Malik, J.: Learning shape abstractions by assembling volumetric primitives. In: Computer Vision and Pattern Recognition (CVPR) (2017)

32. Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2Mesh: Generating 3D mesh models from single RGB images. In: European Conference on Computer Vision (ECCV) (2018)

33. Williams, F., Parent-Levesque, J., Nowrouzezahrai, D., Panozzo, D., Moo Yi, K., Tagliasacchi, A.: VoronoiNet: General functional approximators with local support. In: Computer Vision and Pattern Recognition Workshops (CVPRW) (2020)