
An Overview of Recent Progress in Volumetric Semantic 3D Reconstruction

(Invited Paper)

Christian Häne
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley, USA
Email: [email protected]

Marc Pollefeys
Department of Computer Science
ETH Zürich, Switzerland and Microsoft, Redmond, USA
Email: [email protected]

Abstract—This paper gives an overview of a recently proposed method of solving dense 3D reconstruction and semantic segmentation from multiple input images in a joint fashion, i.e. as semantic 3D reconstruction. The formulation is cast as a volumetric fusion of depth maps and pixel-wise semantic classification scores. By posing the two problems as a joint optimization problem, both tasks can benefit from the other task's information. This leads to formulations which can reconstruct hidden, unobserved surfaces. We give an overview of several papers which describe different ways of modeling the data term from the input data, and also works which introduce object shape priors to the formulation. We present the basic convex multi-label formulation on which the method builds and also discuss the relation to other reconstruction algorithms which extract semantically annotated 3D models from images.

I. INTRODUCTION

Two of the core tasks in computer vision are semantic understanding of the content of images and dense 3D reconstruction. Algorithms for both tasks, assigning a semantic label to each of the pixels of an image and recovering the dense 3D geometry from a dataset of multiple input images, have reached a level of maturity where good results can robustly be acquired from well-conditioned input data. In practical applications the data is often challenging, and hence 3D reconstruction systems or semantic segmentation algorithms can easily fail. For example, when generating a 3D reconstruction of a building from terrestrial images, the roof of the building can typically not be observed, or in some images strong sunlight might make a semantic segmentation extremely difficult. Furthermore, reflective or translucent objects cannot easily be reconstructed, and surfaces observed under very small viewing angles often cannot be estimated reliably. Another problem is surfaces which are hidden due to occlusion: for example, when vegetation grows close to a building facade, the geometry of the facade might be hidden and therefore cannot be measured.

Many of these limitations can be overcome by formulating dense 3D reconstruction and semantic segmentation as a joint optimization problem instead of two individual tasks. In a recent line of research we focused on stating the joint problem as a voxel labeling problem, in which each voxel gets assigned either free space or one out of several different occupied space labels (e.g. building, ground or vegetation). This not only encodes the potentially observable geometry, i.e. the boundary surface between free and occupied space, it also encodes all the unobservable, hidden surfaces. For example, for a building standing on the ground, a surface between ground and building is estimated as well. By using a continuously inspired convex multi-label formulation [1], prior knowledge about the orientation of boundary interfaces (surfaces) between pairs of semantic labels is included in the formulation. This can be used as a geometric prior which, for example, encodes that a ground surface is more likely to be horizontal than vertical, or as class-specific object shape priors.

This paper gives an overview of this joint 3D reconstruction and segmentation formulation and discusses the similarities and differences to other related systems which acquire a semantically annotated 3D model.

A. Related Work

The approaches most closely related to the ones described in this overview paper are presented directly together with the reviewed approaches. Therefore this section covers only approaches related to volumetric multi-label formulations for semantic 3D reconstruction, not volumetric reconstruction itself.

A line of research reasons about semantics and 3D reconstruction in terms of a 2.5D representation in the form of a depth map. One often-utilized strategy in energy minimization formulations for stereo matching is to prefer discontinuities at locations with a strong image gradient. The regularizer used in [2] furthermore prefers a discontinuity if its direction is aligned with the image gradient. This formulation has been further generalized in [3] to take into account surface normal directions which are predicted using a pixel-wise surface normal direction classifier [4], which predicts the surface normals from a single color image. Semantic appearance and 3D information are coupled through the height above ground in the joint optimization framework for binocular stereo matching presented in [5]. Similarly, object proposals and scene constraints such as physical support and non-intersection are used in [6] in an optimization framework for computational stereo. Highly reflective objects such as cars are a source of errors in computational stereo; special treatment for such objects is proposed in [7].

The even more challenging problem of dense 3D reconstruction from a single color image has also become feasible. Most of the current approaches extract the scene in the form of 2.5D depth maps. Many of the early approaches try to reason about surface direction [8], [9] or room layouts [10], [11], or directly train a classifier or regressor that estimates depth for each image pixel [12], [13], [14], [15]. Jointly inferring semantic classes and depth from a single image leads to better predictions than treating the problems individually [15]. Using a single Convolutional Neural Network architecture, [16] recovers depth, surface normals and semantic classes from a single color image. However, the network is trained (fine-tuned) for each of the problems individually.

Also in Structure-from-Motion (SfM), semantic information has been exploited, see e.g. [17].

The remainder of this paper gives an overview of a recent line of research in dense semantic 3D reconstruction. The overview focuses on work in which the authors of this paper have been involved. Closely related previous work and the similarities to other concurrently explored methods are discussed directly together with the overview.

B. Overview

We first discuss the main ideas of volumetric 3D reconstruction in Sec. II. We start by introducing the basic geometry-only volumetric two-label segmentation formulation and then introduce the specific multi-label segmentation formulation that is used by the approaches presented in the remainder. Sec. III presents how geometric priors can be used in joint volumetric reconstruction and semantic class segmentation. The focus of Sec. IV is on the definition of a data term that takes into account the configuration of a whole viewing ray. Secs. V and VI focus on how object shape priors can be included in semantic 3D reconstruction. Sec. VII concludes the overview with a discussion of remaining challenges in semantic 3D reconstruction.

II. VOLUMETRIC MULTI-LABEL SEGMENTATION

This section introduces volumetric 3D reconstruction. The idea of volumetric 3D reconstruction is to segment a volume into free space and occupied space (the exterior and interior of objects; for an illustration see Fig. 2(a)), which implicitly represents the reconstructed surface as the interface between the two regions. It dates back to [18], which uses a truncated signed distance field in a voxel grid in order to fuse range images from laser scans. Adding regularization has been proposed [19], [20] for situations where noisier data from computational stereo is fused. With the availability of consumer depth cameras, very similar approaches to the original formulation have more recently been used to fuse depth maps in real-time [21], [22], [23]. The methods presented in this manuscript use a formulation which works as a two-label segmentation as opposed to a truncated signed distance field. Different formulations have been proposed in the spatially discrete graph-based setting [24] and also in the spatially continuous setting [20], [25]. We will give an introduction to the latter option and describe a generalization to multi-label segmentation, which is used for semantic 3D reconstruction. The main advantage of the spatially continuous approach is that it is less affected by artifacts than discrete graph-based methods. This is of crucial importance for the anisotropic smoothness term which is used in many of the formulations for the joint modeling of reconstruction and segmentation.

A. 3D Reconstruction as Volumetric Segmentation

One specific version of volumetric 3D reconstruction is to treat the whole problem in the continuum and only use discretization for numerical optimization [19], [20], [25]. Let Ω ⊂ R³ denote the reconstruction volume and u : Ω → [0, 1] a function which maps each location to a value in the interval [0, 1]. We call this function the indicator function and assign it the meaning that u(z) = 0 means z is in free space and u(z) = 1 means z is in occupied space. We allow the function to take values in the whole interval instead of just 0 or 1, such that the volumetric 3D reconstruction energy

E(u) = ∫_Ω ( u(z) ρ(z) + ‖∇u(z)‖₂ ) dz   s.t.  u(z) ∈ [0, 1]    (1)

is convex. Therefore we also call this type of formulation a convex relaxation. The energy has two parts: a unary term and a smoothness term. The function ρ : Ω → R is a unary cost for assigning occupied space to a location z. Note that the cost can be negative and therefore act as a reward. The smoothness term ‖∇u(z)‖₂ is called total variation (TV) and penalizes the perimeter of the level sets of u(·). If u(·) only assumes binary values, the TV exactly corresponds to the surface area of the boundary between free space and occupied space. Due to the tightness of the relaxation [26], the optimum is binary (up to discretization artifacts). For a more detailed coverage of the TV see for example [27]. The TV does not take into account the surface normal. Later in this paper we will see generalizations of the TV to the anisotropic case, where the surface area is penalized depending on the surface normal.

Before we discuss the numerical optimization in the next section, we introduce a specific data term which leads to the TV-Flux fusion from [20]. The data term is represented using the unary term ρ(z). The input is a collection of depth maps. The idea is to assign weights to the voxels that represent the observed depth maps. To this end we assign weights to the voxels in a region around the measured depth along viewing rays. Each location gets a weight from each of the depth maps, which are eventually summed up. The weight is positive if the location lies along the ray in the band [d − δ, d) in front of the measured depth d and negative if it lies in the band (d, d + δ] behind it. We state the weights assigned from one single depth map within the band [d − δ, d + δ]. For this we describe the location by its depth d̄ in the considered view. The cost then reads as

ρ_d̄ = β sign(d − d̄),    (2)


Fig. 1. Illustration of the data term along a single viewing ray. Figure adapted from [29].

with β a constant that influences the strength of the data term. An illustration is given in Fig. 1. Such a data term can be understood as minimizing the flux of an oriented point cloud through the reconstructed surface, hence the name TV-Flux fusion [24], [20]. Note that it is also possible to add a weight that prefers free space to all voxels along the whole line-of-sight to the depth measurement. However, this weight needs to be smaller in order to stay robust enough against outliers (see e.g. [28]).
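As a concrete illustration, the per-ray weight assignment can be sketched in a few lines. The integer ray sampling and the helper name `tv_flux_ray_weights` are illustrative assumptions for this sketch, not part of the original formulation.

```python
import numpy as np

def tv_flux_ray_weights(d, delta, beta, n_voxels):
    """Sketch of the TV-Flux data term along one viewing ray.

    Voxel k sits at (integer) depth k along the ray. Inside the band
    [d - delta, d + delta] around the measured depth d it receives the
    occupied-space cost beta * sign(d - k) (Eq. (2)): positive in front
    of the measurement, negative (a reward) behind it.
    """
    rho = np.zeros(n_voxels)
    for k in range(n_voxels):
        if d - delta <= k <= d + delta:
            rho[k] = beta * np.sign(d - k)
    return rho
```

Summing such per-ray contributions over all depth maps of a dataset yields the unary term ρ of energy (1).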

B. Discretization and Optimization

In order to numerically solve the volumetric 3D reconstruction problem from above, the continuous formulation is first discretized. There are different options for discretization; in the following we will utilize a forward-difference discretization for the gradient operator, which is commonly done. Another effective strategy is the staggered grid [30]. A comparison of several different ways of discretization can be found in [31].

We discretize Ω into a set of voxels with coordinates s. The indicator function u(z) is therefore turned into per-voxel indicator variables u_s ∈ [0, 1]. The discretized gradient operator using forward differences then reads as

∇u_s = ( u_{s+(1,0,0)ᵀ} − u_s,  u_{s+(0,1,0)ᵀ} − u_s,  u_{s+(0,0,1)ᵀ} − u_s )ᵀ    (3)
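A minimal sketch of this forward-difference operator on a voxel grid, assuming NumPy arrays and zero one-sided differences at the far boundary (a common convention the text does not fix):

```python
import numpy as np

def forward_gradient(u):
    """Forward-difference gradient of a voxel grid u (Eq. (3)).

    Returns an array of shape u.shape + (3,); entries at the far
    boundary of each axis, where no forward neighbor exists, are zero.
    """
    g = np.zeros(u.shape + (3,))
    g[:-1, :, :, 0] = u[1:, :, :] - u[:-1, :, :]
    g[:, :-1, :, 1] = u[:, 1:, :] - u[:, :-1, :]
    g[:, :, :-1, 2] = u[:, :, 1:] - u[:, :, :-1]
    return g
```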

This leads to the discretized version of the volumetric reconstruction energy:

E(u) = Σ_{s∈Ω} ( u_s ρ_s + ‖∇u_s‖₂ )   s.t.  u_s ∈ [0, 1]    (4)

Using convex analysis [32], [33], this energy is rewritten in the following primal-dual saddle-point form and optimized using the iterative first-order primal-dual algorithm [34], [35]:

min_u max_p Σ_{s∈Ω} ( (∇u_s)ᵀ p_s + ρ_s u_s )    (5)
s.t.  u_s ∈ [0, 1],  ‖p_s‖₂ ≤ 1

The first-order primal-dual algorithm is very well suited to such problems and straightforward to implement. However, any convex optimization algorithm which is able to handle non-smooth functions can be used to minimize such an energy (c.f. [36]).
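A minimal NumPy sketch of the primal-dual iterations for energy (5); the step sizes, iteration count and boundary handling are illustrative choices, not values from the cited papers.

```python
import numpy as np

def grad(u):
    """Forward-difference gradient, shape u.shape + (3,)."""
    g = np.zeros(u.shape + (3,))
    g[:-1, :, :, 0] = u[1:] - u[:-1]
    g[:, :-1, :, 1] = u[:, 1:] - u[:, :-1]
    g[:, :, :-1, 2] = u[:, :, 1:] - u[:, :, :-1]
    return g

def grad_adj(p):
    """Adjoint of grad, so that (grad(u) * p).sum() == (u * grad_adj(p)).sum()."""
    out = np.zeros(p.shape[:-1])
    for k in range(3):
        pk = p[..., k].copy()
        end = [slice(None)] * 3
        end[k] = slice(-1, None)
        pk[tuple(end)] = 0.0            # grad produced no entry on this slice
        shifted = np.zeros_like(pk)
        src = [slice(None)] * 3; src[k] = slice(None, -1)
        dst = [slice(None)] * 3; dst[k] = slice(1, None)
        shifted[tuple(dst)] = pk[tuple(src)]
        out += shifted - pk
    return out

def primal_dual(rho, iters=300, tau=0.25, sigma=0.25):
    """First-order primal-dual iterations for the saddle-point form (5)."""
    u = np.zeros(rho.shape)
    p = np.zeros(rho.shape + (3,))
    u_bar = u.copy()
    for _ in range(iters):
        # dual ascent, then projection onto the unit balls ||p_s||_2 <= 1
        p = p + sigma * grad(u_bar)
        p /= np.maximum(1.0, np.linalg.norm(p, axis=-1, keepdims=True))
        # primal descent, then projection onto the box u_s in [0, 1]
        u_new = np.clip(u - tau * (grad_adj(p) + rho), 0.0, 1.0)
        u_bar = 2.0 * u_new - u         # over-relaxation step
        u = u_new
    return u
```

The product tau · sigma must stay below 1/‖∇‖²; for this 3D forward-difference stencil ‖∇‖² is at most 12, so 0.25 · 0.25 is a safe choice.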

The described TV-Flux 3D reconstruction formulation is used as a baseline in many of the works presented below. We therefore omit results here and refer to the later sections instead (e.g. Fig. 5).

C. Multi-Body Scenes

Before we cover semantic reconstruction approaches, we briefly cover two works which are closely related to semantic 3D reconstruction but do not utilize an actual multi-label formulation.

The first work [37] can be seen as a first step towards a multi-label formulation. The scene is composed of multiple rigid objects which move with respect to each other, i.e. a multi-body scene. This means that for a pair of input images taken at different times, between which objects have moved, a relative transformation between the two images can be computed with respect to each of the rigid objects. It also means that for each of the images multiple depth maps, one with respect to each object, can be computed. The goal is now to get a dense volumetric reconstruction for each of the objects. The idea is to use the reconstruction formulation introduced above with an individual reconstruction volume for each of the objects. For each of the volumes, only the depth maps which are computed with respect to the corresponding rigid object are used. Doing this, however, does not lead to a satisfactory result. The problem is that the computed depth maps contain a lot of noise and “ghost” reconstructions at wrong places, which are due to the fact that the camera poses are only valid for a specific portion of the image. The key observation of [37] is that adding the basic non-intersection constraint, which says that solid objects cannot intersect, resolves this issue. Besides the camera poses with respect to each of the rigid objects, the transformations between the individual objects are also recovered. This leads to a set of relative positions between the reconstruction volumes corresponding to the different objects, and hence determines how they overlap. The constraint that solid objects cannot intersect is added to the formulation as a constraint that the sum of the occupancy variables u of voxels that can overlap is at most 1.
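The non-intersection constraint can be illustrated with a toy feasibility projection. This simple rescaling is a hypothetical stand-in used only for illustration, not the constraint handling actually used in [37].

```python
def project_non_intersecting(occupancies):
    """Enforce that occupancy variables u of voxels (from different
    objects' volumes) that overlap in world space sum to at most 1.

    A violating group is rescaled back onto the constraint surface;
    feasible groups are returned unchanged.
    """
    total = sum(occupancies)
    if total <= 1.0:
        return list(occupancies)
    return [u / total for u in occupancies]
```

For instance, `project_non_intersecting([0.8, 0.6])` rescales both overlapping voxels so their occupancies sum to 1, while a feasible pair such as `[0.3, 0.2]` is left unchanged.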

In [38], an approach is presented which formulates a symmetry prior by introducing additional constraints to the TV-Flux fusion. This is closely related to the non-intersection constraint from above. In the case of the symmetry prior, a voxel is preferred to be in the occupied space if the reflected voxel is occupied as well. The non-intersection constraint enforces that a voxel is empty if another voxel is already occupied.

D. Continuous Multi-Label Formulation

We have seen in the previous section that the standard 3D reconstruction formulation can be extended to reconstruct multi-body scenes. In the following we will introduce a multi-label formulation. It is used by all of the approaches reviewed in the remainder of this paper. The basis of the method is a convex formulation which originates from [39], [40], [41] and is given in a spatially continuous, variational, form. One of the key drawbacks of continuous multi-label formulations is that the smoothness term needs to form a metric over the label space. This is a restriction which is not present in closely related spatially discrete graph-based models [42], [43]. This restriction can be avoided by only looking at the discretized version of the continuous energy while still having the benefits of a formulation which does not suffer from metrication artifacts [44], [1]. We will first introduce the continuous version and then focus on the discretized version.

Let Ω ⊂ R³ be a volumetric domain in which we assign a label i ∈ {0, …, L} to each location z. In order to formalize label assignment we introduce indicator functions x^i : Ω → [0, 1], which indicate if label i is assigned at a particular location. We also use functions ρ^i : Ω → R to define the unary cost of assigning label i to a specific location. The objective function of the convex continuous multi-label energy reads as

E(x, y) = ∫_Ω ( Σ_i ρ^i(z) x^i(z) + Σ_{i,j: i<j} φ^{ij}_z(y^{ij}(z)) ) dz.    (6)

In addition to the label indicator functions, the energy also contains label transition functions y^{ij} : Ω → [−1, 1]³, which describe the gradient of the change from label i to label j. In order to connect the label indicator and label transition functions, the following marginalization constraint is used:

∇x^i = Σ_{j: j>i} y^{ij} + Σ_{j: j<i} y^{ji}.    (7)

In order to ensure that one label gets assigned at each location, we also need to enforce a normalization constraint Σ_i x^i(z) = 1 ∀z. The functions φ^{ij}_z : R³ → R⁺₀ act as transition-specific, direction- and location-dependent penalizers of the area of the surface formed as an interface between labels i and j. They have to be convex and positively 1-homogeneous. They are an extension of the TV to the anisotropic case [45]. As mentioned above, one of the main drawbacks of the continuous multi-label formulation is the restriction that the smoothness term needs to form a metric over the label space (see for example [46]); otherwise the cost implicitly gets converted into a metric (c.f. [1]). In the next section we will present the discretized version, which does not suffer from this restriction.

E. Discretized Multi-Label Formulation

The discretized version that we present here is a version of the energy that is derived in [1]. The formulation is closely related to linear programming (LP) relaxations for approximate MAP inference in Markov Random Fields (MRFs). A review of the MRF LP relaxation can be found in [42]; the connection between the formulations is described in [44], [1].

We consider a discretized domain Ω ⊂ R³, i.e. a voxel space. For the label assignment we use indicator variables x^i_s ∈ [0, 1] to indicate if label i is assigned at voxel s, and ρ^i_s is the cost for assigning label i to voxel s. The energy is discretized using a forward-difference discretization for the gradient operator and reads as

E(x) = Σ_{s∈Ω} ( Σ_i ρ^i_s x^i_s + Σ_{i,j: i<j} φ^{ij}_s(x^{ij}_s − x^{ji}_s) ).    (8)

Fig. 2. 2D illustrations of the standard volumetric 3D reconstruction ((a): free space, occupied space) and the semantic volumetric 3D reconstruction ((b): free space, building, ground, vegetation).

The variables x^{ij}_s ∈ [0, 1]³ are non-negative label transition variables. To extract signed label transition gradients, we can subtract the corresponding variables, i.e. y^{ij}_s = x^{ij}_s − x^{ji}_s. The above energy is subject to the following set of constraints.

x^i_s = Σ_j (x^{ij}_s)_k,   x^i_s = Σ_j (x^{ji}_{s−e_k})_k,   k ∈ {1, 2, 3}
x^i_s ∈ [0, 1],   Σ_i x^i_s = 1,   x^{ji}_s ≥ 0    (9)

The first line describes the marginalization constraints that are used to ensure that the per-voxel label indicator variables and the label transition variables are consistent. The vector e_k denotes the canonical basis vector for dimension k, i.e. e₁ = (1, 0, 0)ᵀ. Intuitively, the constraints mean that if there is a transition from label i to label j occurring between two voxels, the transition variables x^{ij}_s have to reflect such a change. They are related to the marginalization constraint from the continuous case in the sense that by subtracting the left- and right-hand-side equations from each other in a suitable way, we end up with the constraint given above for the continuous case. However, by doing so the variables x^{ij}_s cancel out of the whole energy formulation, and with them the non-negativity constraints from the second line of the constraints. This is the reason why the use of non-metric smoothness priors leads to problems in the continuous formulation. More details are given in the original publication [1].
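A toy 1-D analogue can make the marginalization constraints concrete. The helper below constructs consistent indicator and transition variables for a hard labeling; the 1-D chain and the variable layout are illustrative simplifications of the 3-D voxel grid, not the paper's implementation.

```python
import numpy as np

def transition_variables(labels, n_labels):
    """Indicator and transition variables for a hard 1-D labeling.

    x[s, i] indicates label i at site s; x_t[s, i, j] indicates that
    site s carries label i while site s+1 carries label j (with a
    trivial self-transition at the last site as boundary convention).
    """
    n = len(labels)
    x = np.zeros((n, n_labels))
    x[np.arange(n), labels] = 1.0
    x_t = np.zeros((n, n_labels, n_labels))
    for s in range(n - 1):
        x_t[s, labels[s], labels[s + 1]] = 1.0
    x_t[n - 1, labels[-1], labels[-1]] = 1.0
    return x, x_t
```

For such hard labelings the 1-D analogues of the constraints in (9) hold: summing the transition variables over the outgoing label reproduces the indicator at site s, and summing over the incoming label reproduces the indicator at site s+1.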

III. JOINT RECONSTRUCTION AND CLASS SEGMENTATION

In this section we give an overview of a method presented in [47], [48] which uses the discretized but continuously inspired multi-label formulation that we described in the previous section to facilitate joint 3D reconstruction and class segmentation. The scenes which are studied are outdoor scenes such as single buildings, building facades or town squares. The exposition given in this paper is meant to give an overview of the method and put it into context with other related approaches; for a more in-depth explanation we refer the reader to the original publications.

The general idea of the formulation is that each of the voxels gets assigned one out of L + 1 labels, where label i = 0 denotes the free space label and the L labels with i > 0 indicate occupied space, which is segmented into several semantic classes such as building, ground or vegetation (see Fig. 2(b) for an illustration). One of the key differences to other related approaches is that not only the occupied space voxels close to an observed surface are labeled with semantic labels, as in [49], [50], [51], but the whole occupied space is densely labeled into semantic classes. This means interfaces between any pair of labels can be recovered; for example, for a building standing on the ground, the ground hidden underneath the building is recovered as a transition between ground and building. Another related work, [52], also segments a voxel space into semantic classes using geometric priors. Compared to the method presented here, which uses dense depth maps and continuous regularization, they use sparse 3D points as geometry measurement and a graph-cut based inference method in a discrete graph-based formulation.

In order to apply the multi-label segmentation energy, we need to define the unary costs ρ^i_s and the smoothness terms φ^{ij}_s(·). The smoothness term is used to encode learnt geometric priors on the likelihood of the surface direction between individual labels, as described in Sec. III-B, and the unary term is used to represent the input data, as described in the next section. In Sec. III-C we show some of the results of [48], which demonstrate how the learnt geometric prior and the multi-label formulation are able to recover weakly observed and hidden, unobservable surfaces.

A. Data Term as Unary Potential

The input to the formulation are the camera poses of a multi-view dataset together with the pixel-wise likelihoods of a semantic classifier and the depth maps computed using multi-view stereo. The goal is to transfer the input data, which is given in the 2D image space, into the 3D volume in which the energy is formulated. A first way to do this is to represent the data term as a unary potential per voxel, i.e. the input data is transferred to a cost ρ^i_s, which is the cost for assigning label i to voxel s. This is the approach of [47]. We will see other ways of defining the data term throughout this paper. In order to define a unary potential data term, the information from the depth maps is used to decide at which location in the volume the information of the semantic classifier should be placed. We take into account two cases: in the first case a depth observation at the considered pixel is given, and in the other case sky is the most likely label of the considered pixel.

1) Observed depth: This case is a multi-label generalization of the TV-Flux data term (c.f. Sec. II-A). The idea is that if we follow a viewing ray from the camera center, we will at some point hit the depth measurement d. The depth measurement might be noisy; therefore we use a region before and behind the depth measurement, [d − δ, d + δ], to set weights for free space (i = 0) or occupied space (i > 0) by setting a negative or positive weight for all the occupied space classes, respectively. Once we are far enough behind the depth measurement (along the viewing ray) that we are very certain we are now inside the observed object, we place the score of the semantic classifier σ_i (for all labels except sky) in this particular voxel. In order to define the cost formally, we index the voxels by their depth along the viewing ray. The formulas below only state the non-zero costs in the band around the depth measurement that a single depth map adds to a single voxel. The unary potential over the whole dataset is formed by adding up the contributions from each input image to each voxel.

Fig. 3. Visualization of the weights of a single occupied space class along a single viewing ray. Figure adapted from [47].

ρ^i_{d+δ} = σ_{class i},   ρ^0_{d+δ} = (Σ_{i=1}^{L} σ_{class i}) / L
ρ^i_{d̄} = 0 for i = 0,   ρ^i_{d̄} = β sign(d − d̄) for i > 0.    (10)

See Fig. 3 for a visualization of the weights entered along a viewing ray. As in the geometry-only case, also in the multi-label case a weight for free space can be used for all the voxels on the whole line-of-sight. However, here too it needs to be small to stay robust against outliers.

2) Sky: The cost imposed by the sky label is treated as a special case, in the sense that if we observe sky in the image we add a reward for free space along the whole viewing ray. Formally, the weights induced by the sky classifier are given as

ρ^0_s = γ min(0, σ_sky − min_{i≠sky} σ_i),    (11)

with γ > 0 a suitable weight, and ρ^i_s = 0 for i > 0.
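To make the two cases concrete, the following minimal sketch builds the unary costs along one discretized viewing ray in the spirit of Eqs. 10 and 11. The function name, array layout, and the convention that lower classifier scores mean more likely are our own illustrative assumptions, not from [47]:

```python
import numpy as np

def unary_along_ray(n_voxels, d, delta, beta, sigma, sky=None, gamma=1.0):
    """Unary costs rho[v, i] along one viewing ray (sketch of Eqs. 10 and 11).

    d:     voxel index of the depth measurement along the ray (ignored if sky given)
    sigma: scores of the L occupied classes (costs: lower = more likely)
    sky:   pair (sigma_sky, other_scores) if sky is the most likely pixel label
    Label 0 is free space, labels 1..L are the occupied space classes.
    """
    L = len(sigma)
    rho = np.zeros((n_voxels, L + 1))
    if sky is not None:
        sigma_sky, others = sky
        # Eq. 11: reward free space along the whole ray
        rho[:, 0] = gamma * min(0.0, sigma_sky - float(np.min(others)))
        return rho
    # Eq. 10: band around the depth measurement
    for v in range(max(0, d - delta), min(n_voxels, d + delta)):
        rho[v, 1:] = beta * np.sign(d - v)   # positive in front, negative behind
    if d + delta < n_voxels:
        rho[d + delta, 1:] = sigma           # classifier scores behind the band
        rho[d + delta, 0] = np.mean(sigma)   # free space gets the average score
    return rho
```

The full unary data term would then be accumulated by adding such per-ray contributions from every pixel of every input image into the voxel grid.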

B. Geometric Priors on Label Transitions

The key component of the joint reconstruction and segmentation formulation are the geometric priors which are encoded in the smoothness term. As explained in Sec. II-E, the smoothness term is a transition-, surface-orientation- and location-dependent penalization of the surface area. This term is used in the described formulation to define geometric priors that take into account the likelihood of two labels i and j forming a transition with a specific surface normal. This facilitates, for example, encoding that a transition between ground and building is more likely to be horizontal than vertical, and so forth. For this we first drop the location dependence and only take into account spatially homogeneous smoothness priors, i.e. φ^ij_s(·) = φ^ij(·) ∀s.

The φ^ij need to be convex and positively 1-homogeneous. One option is to directly define φ^ij(·) in the primal and ensure that the function fulfills the requirements imposed by the formulation, but this is not easily possible in general. A different way to define the smoothness term is by using the so-called Wulff shape. Mathematically, the Wulff shape is a closed convex shape which contains the origin [45], and it can be used to define any convex positively 1-homogeneous function as

φ^ij(x) = max_{p ∈ W^ij} (p^T x).    (12)
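Eq. 12 is the support function of the convex set W^ij. For a polyhedral Wulff shape given by its vertices this is just a maximum over dot products. The following sketch, with illustrative 2D shapes of our own choosing (not the learnt priors), shows how deforming the Wulff shape makes some surface normal directions cheaper than others:

```python
import numpy as np

def phi(x, wulff_vertices):
    """Support function of a polyhedral Wulff shape: phi(x) = max_{p in W} p.x (Eq. 12)."""
    return np.max(wulff_vertices @ np.asarray(x, dtype=float))

# Isotropic TV: W is (a polygonal approximation of) the unit ball, so phi(x) ~ |x|.
angles = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
ball = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Anisotropic prior: squashing the ball along y makes the vertical normal
# direction cheap, i.e. horizontal (ground-like) surfaces are penalized less.
squashed = ball * np.array([1.0, 0.25])

print(phi([0.0, 1.0], squashed))   # cheap: vertical normal, horizontal surface
print(phi([1.0, 0.0], squashed))   # expensive: horizontal normal, vertical surface
```

The Minkowski sum of two such vertex sets would correspondingly add their support functions, which is how the isotropic and anisotropic parts mentioned below combine.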


Throughout this overview we will see different ways to extract Wulff shapes from training data. They all share the idea that the energy formulation in Eq. 8 naturally corresponds to a negative log-probability, and derive the Wulff shape by considering the likelihood of training data. They mainly differ in the way the Wulff shape is parametrized. The first way, presented in the following, uses a collection of different very low dimensional parameterizations of specific convex shapes and uses Maximum Likelihood Estimation (MLE) to decide which parameterization and which parameter values to use. A second way, presented in Sec. V, is to parameterize the convex shape as an intersection of half spaces and estimate the distances of the half spaces to the origin from the training data. The first way is used to define geometric priors for urban environments in the remainder of this section and for segment based shape priors in Sec. VI. The second way is used to define normal based shape priors in Sec. V.

The specific way of defining the Wulff shape which is presented next further splits the smoothness term into a sum of an isotropic and an anisotropic part. In terms of Wulff shapes the addition of multiple terms can easily be handled as the Minkowski sum of multiple Wulff shapes [2].

We follow the derivation of [47]. The functions φ^ij are interpreted as negative log-probabilities. Let ↔^ij denote a transition event between labels i and j, and let n^ij denote a transition event (with unit surface area) between labels i and j with (unit-length) normal direction n. We use

P(n^ij) = P(n^ij | ↔^ij) P(↔^ij).    (13)

The conditional probability P(n^ij | ↔^ij) is then modeled as a Gibbs probability measure

P(n^ij | ↔^ij) = exp(−ψ^ij(n^ij)) / Z^ij,    (14)

for a positively 1-homogeneous function ψ^ij. Z^ij is the respective partition function, Z^ij := ∫_{n∈S²} exp(−ψ^ij(n)) dn, where S² is the unit sphere in 3D. Consequently, φ^ij is now given by

φ^ij(n) = ψ^ij(n) + log Z^ij − log P(↔^ij).    (15)

Note that only the first term depends on the surface direction. The last term only captures the general likelihood of observing a given transition and can therefore be extracted from training data by simply looking at the frequency of transitions. Before we mention the parameterization of ψ^ij(·) we briefly outline the training strategy. As mentioned above, the parameterization is very low dimensional. Therefore, a grid search strategy, i.e. evaluating the likelihood of the training data for a dense grid of parameters, is utilized to find a maximum likelihood (ML) estimate of the parameters. The training data are surface mesh models of the individual transitions which are extracted from a cadastral city model. The only non-straightforward term to compute is the partition function, which is difficult due to the integral over all the normal directions. Monte-Carlo integration is utilized to get an accurate estimate of the integral.
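As an illustration of this training strategy, the sketch below fits the single parameter a of a hypothetical family ψ(n) = a·|nᵀv| by grid search, estimating the partition function by Monte-Carlo integration over the sphere. The parameterization, sample counts and variable names are our own toy choices, not the ones used in [47], [48]:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sphere(m):
    """Uniform samples on the unit sphere (Gaussian normalization trick)."""
    u = rng.normal(size=(m, 3))
    return u / np.linalg.norm(u, axis=1, keepdims=True)

def log_likelihood(a, normals, v, mc_samples):
    """Log-likelihood of training normals under P(n) = exp(-a|n.v|) / Z(a)."""
    psi = lambda n: a * np.abs(n @ v)
    # Monte-Carlo estimate of Z(a) = int_{S^2} exp(-psi(n)) dn (sphere area 4*pi)
    log_z = np.log(4.0 * np.pi * np.mean(np.exp(-psi(mc_samples))))
    return -psi(normals).sum() - len(normals) * log_z

v = np.array([0.0, 0.0, 1.0])        # known vertical direction
mc = sample_sphere(200_000)

# synthetic facade-like training normals: nearly orthogonal to the vertical
angles = rng.uniform(0.0, 2.0 * np.pi, 500)
train = np.stack([np.cos(angles), np.sin(angles),
                  0.2 * rng.normal(size=500)], axis=1)
train /= np.linalg.norm(train, axis=1, keepdims=True)

grid = np.linspace(0.0, 10.0, 51)    # dense grid over the single parameter
best_a = grid[int(np.argmax([log_likelihood(a, train, v, mc) for a in grid]))]
```

Because the training normals are concentrated orthogonal to v, the grid search picks a clearly positive concentration a, i.e. a prior penalizing normals aligned with the vertical.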

Fig. 4. Visualization of the 2D version of the used Wulff shapes (favoring a single direction, and favoring directions orthogonal to a specific direction). The Wulff shapes are plotted in red and the black curves are polar plots of the function ψ. The penalization of a surface with normal direction n corresponds to the distance of the origin to the black line in direction n. Figure adapted from [48].

This training strategy derives the smoothness term from training data. Note that not all the transitions are learnt, as the specific training data used does not include information about all the possible transitions. Therefore, some of the transitions use a smoothness prior which is adopted from the learnt ones; for example, there is no training data about vegetation included, therefore an isotropic prior is used.

Which parameterization is picked for ψ^ij, and its parameter values, are part of the MLE. For details on how the complete function φ^ij(·) is defined from the estimated constants and ψ^ij(·) we refer the reader to [47], [48]. Also note that a fully isotropic penalization can be represented using this model due to the non-direction-dependent part in Eq. 15; in such a case the direction-dependent part vanishes.

The two used Wulff shape parameterizations are depicted in Fig. 4. They are designed to model two frequent surface priors encountered in urban environments: one prior favors surface normals that are in alignment with a specific direction (e.g. ground surface normals prefer to be aligned with the vertical direction), and the second Wulff shape favors surface normals orthogonal to a given direction (such as facade surfaces generally having normals perpendicular to the vertical direction). In order for these priors to make sense the vertical direction needs to be known from the input data. Remember that the model also facilitates fully isotropic penalization.

C. Results

In this section we show a result which is from [48]. The input to the reconstruction is a set of images. First the camera parameters are recovered using the publicly available structure from motion pipeline of [53]. The vertical direction is estimated with the method of [54]. The semantic class likelihoods are extracted using the context-based classifier of [55], [4] and the depth maps are computed with the publicly available plane sweeping stereo implementation of [56] using zero mean normalized cross correlation as similarity measure.

The semantic 3D reconstruction results are compared to the geometry-only results (TV-Flux fusion from [20], c.f. II-A) in Fig. 5. The semantic 3D model has more complete geometry due to the learnt geometric priors. Looking at the semantic classes individually, we can see that the building facade behind the vegetation is recovered and the ground underneath the building is estimated.

Fig. 5. Dataset with 127 input images. (top left) TV-Flux fusion result, (top right) joint reconstruction and segmentation result all labels, (bottom from left to right) building class only, ground class only, vegetation and clutter class. Results adapted from [48].

The approach presented above has also been implemented on an octree for better spatial scalability [57]. There is a crucial difference to octree or voxel hashing implementations for geometry-only formulations. In the case of the joint reconstruction and segmentation formulation, surfaces are recovered which are not observable, such as the ground underneath a building, and hence the structure of the octree cannot be determined directly from the input data. Therefore a coarse-to-fine optimization strategy is utilized.

IV. RAY POTENTIAL DATA TERM

In Sec. III we have seen how joint reconstruction and semantic segmentation can be formulated as a volumetric, convex energy minimization problem. The data term was formulated as unary potentials, i.e. as costs for assigning specific labels to specific voxels. Remember, the data is given as per-pixel semantic class likelihoods and depth estimates. Therefore, the data naturally induces a cost on the configuration of the whole viewing ray. The approximation using a unary potential does not fully capture this information. There are two main cases where the unary potential induces errors (c.f. [58], [59]). First, if geometry is very thin along the viewing direction, the uncertainty window used to define the unary potential (Eq. 10) might be bigger than the actual geometry, and therefore surfaces resulting from the reconstruction can be artificially fattened. Another problem is the long range interactions along viewing rays. If on a viewing ray the measurement indicates a geometry at depth d, a cost should be induced for reconstructing a geometry at a depth which is smaller than d. This might not be the case using a unary potential which only represents the data in a band around the depth measurement. Therefore, a more accurate modeling of the input data within the optimization is desirable. In this section we give an overview of the method of using ray potentials from [59].

Using formulations over viewing rays has also been proposed in other works. A line of works tries to directly estimate geometry from images without prior computation of depth maps by formulating an energy over viewing rays [60], [61], [62]. One of the main drawbacks of these works is that they estimate the photo-consistency using single color values and therefore are not able to capture weak texture in images. [63] uses a formulation over viewing rays to model silhouette consistency in the energy formulation for object-centric 3D reconstruction.

A. Energy Formulation over Viewing Rays

The idea of the ray potential is to formulate the data term as a cost which captures the configuration of whole viewing rays. The ray potential measures the agreement with the 2D input data, i.e. pixel-wise depth and semantic scores. The 2D input gives us information along each viewing ray about the depth at which we expect the first non-free-space label and which semantic label this first occupied space voxel should have. Therefore, the ray potential should only depend on the position of the first occupied space label and its semantic class. This is introduced in the multi-label energy formulation by formulating the ray potential as a non-convex integer program, which is then transformed into an optimization problem with a single non-convex constraint that can be handled using a majorize-minimize scheme. Finally, the unary potential of the energy formulation Eq. 8 is replaced by the ray potential.

The ray potential data term is composed out of potentials over viewing rays r ∈ R. Further, let x_r denote the per-voxel label indicator variables along viewing ray r. We index the voxel positions along the ray r using s_rd, with d ∈ {0, . . . , N_r} the position index along the ray. Like in Sec. III we use the convention that we have L + 1 semantic labels, where label i = 0 denotes the free space label. The integer program reads as

ψ_R(x) = ∑_{r∈R} ψ_r(x_r)    (16)

ψ_r(x_r) = ∑_{i=1}^{L} ∑_{d=0}^{N_r} c^i_{rd} (min_{d'<d} x^0_{s_{rd'}}) x^i_{s_{rd}} + c^0_r min_{d'≤N_r} x^0_{s_{rd'}},

subject to x^i_s ∈ {0, 1} and ∑_i x^i_s = 1. The c^i_{rd} are the costs for having the first occupied space label along ray r at position d with semantic label i, and c^0_r is the cost for assigning free space to the whole ray. The costs c represent the input data in the optimization problem and are therefore computed from depth maps and pixel-wise semantic class likelihoods. In order to use the ray potential together with the geometric priors from Sec. III-B, the potential gets transformed and relaxed before it is inserted. Here we give a brief outline of the procedure and refer to [59] for details.
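For fixed integer labels, the potential of Eq. 16 only charges the first occupied voxel along the ray, since the inner minimum over the free-space indicators vanishes behind it. The following minimal sketch evaluates this cost for one ray; the cost arrays are illustrative placeholders, not the actual costs derived in [59]:

```python
import numpy as np

def ray_potential(labels, c_occ, c_free):
    """Cost of one ray's integer labeling under the ray potential of Eq. 16.

    labels: integer label per voxel along the ray (0 = free space)
    c_occ:  c_occ[d, i-1] = cost if the first occupied voxel is at depth d with label i
    c_free: cost if the whole ray is free space
    Only the first occupied voxel contributes, mirroring min_{d'<d} x^0 in Eq. 16.
    """
    occupied = np.nonzero(labels)[0]
    if len(occupied) == 0:
        return c_free
    d = occupied[0]                  # first non-free voxel seen from the camera
    return c_occ[d, labels[d] - 1]
```

Voxels hidden behind the first occupied voxel change nothing, which is exactly the long-range behavior the unary potential of Sec. III cannot express.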

Fig. 6. Global view represented by the x^i_s variables and surfaces visible in view Π1 represented with the visibility variables z^i_{rd}. Figure adapted from [59].

In a first step the costs c are transformed to non-positive costs c^i_{rd} for all i, including free space. This can be done for arbitrary input costs c without changing the energy. By introducing additional visibility variables z^i_{rd} := min(z^0_{r,d−1}, x^i_{s_{rd}}), an equivalent ray potential formulation is stated as (ray index r left out for better readability)

ψ_r(x_r) = ∑_{i=0}^{L} ∑_{d=0}^{N} c^i_d z^i_d    (17)

s.t.  z^i_d ≤ z^0_{d−1},  z^i_d ≤ x^i_{s_d},  z^i_d ≥ 0  ∀i, ∀d.

The visibility variables indicate where the visible surface along a specific ray lies. If z^i_{rd} = 1, this means that the ray has assigned free space until position d − 1 and label i at position d. In case of occupied space (i > 0) this means the first visible surface is at position d and has semantic label i. If z^i_{rd} = 0 ∀i, then position d is invisible along ray r, i.e. an occupied space label has been assigned closer to the camera center than position d. A visualization of the variable types is given in Fig. 6.
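For integer labelings, the visibility variables can be computed by a single sweep along the ray, which makes their meaning concrete. A minimal sketch (function name and array layout are our own):

```python
import numpy as np

def visibility_vars(x):
    """Visibility variables z[d, i] = min(z0[d-1], x[d, i]) for one ray (integer case).

    x: (N, L+1) one-hot label indicators along the ray, column 0 = free space.
    z[d, i] = 1 iff voxels 0..d-1 are all free space and voxel d has label i.
    """
    n, _ = x.shape
    z = np.zeros_like(x)
    free_prefix = 1                  # z0 at d-1; by convention 1 at the camera center
    for d in range(n):
        z[d] = np.minimum(free_prefix, x[d])
        free_prefix = z[d, 0]        # visibility drops after the first occupied voxel
    return z

# ray: two free voxels, then label 2, then a hidden voxel with label 1
x = np.eye(4, dtype=int)[[0, 0, 2, 1]]
z = visibility_vars(x)               # z marks label 2 at depth 2; depth 3 is invisible
```

This sweep also makes it easy to check that Eq. 17's constraints hold with equality in the integer case, which is exactly what the relaxation below loosens.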

Relaxing the integer constraint on the x^i_s would directly lead to a convex program for the ray potential. However, it turns out that this relaxation is very weak. The underlying reason is that the crucial constraint which says that the observed surface along the ray needs to correspond to the position where the visibility of free space drops gets weakened when relaxing the integer constraint. Therefore, adding this constraint, which reads as

∑_{i=1}^{L} z^i_d ≤ max(0, z^0_{d−1} − x^0_{s_d})  ∀d,    (18)

back to the formulation strengthens the relaxation. To get to the final ray potential formulation the integer constraint is dropped, and Eq. 17 together with the visibility consistency constraint Eq. 18 replaces the unary potential in the energy formulation presented in Sec. II-D. For the regularization term the geometric priors from Sec. III-B are used.

B. Majorize-Minimize Optimization

In order to minimize the non-convex program a majorize-minimize scheme [64] is used. The only non-convex term of the energy is the added-back visibility consistency constraint. It can be majorized by a linear function which coincides with one of the piece-wise linear parts of the definition of the constraint (for details see [59]).

The optimization procedure works by choosing the piecewise linear part which leads to a feasible energy given the current assignment (initialization is done by assigning all free space). This leads to a surrogate convex energy which is minimized using the first order primal-dual algorithm [34], [35]. The standard way of the majorize-minimize scheme would find the global optimum of the surrogate function before majorizing again. For the ray potential energy it turns out that much faster convergence is achieved when the re-estimation of the surrogate function (i.e. the majorization step) is executed frequently. A detailed description of the optimization scheme is given in [59].

Fig. 7. Comparison between the unary potential based data term and the ray potential data term. (left) example input image, (middle) unary potential, (right) ray potential. Results adapted from [59].
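The core mechanics of majorize-minimize with a linear majorizer can be shown on a toy problem: a non-convex 1D energy split into a convex part and a concave part, where the concave part is majorized by its tangent (a linear function, as in the ray potential case) and the resulting convex surrogate is minimized. This is only a generic illustration of the scheme, not the actual ray-potential solver of [59]:

```python
import numpy as np

def mm_minimize(x0, steps=50):
    """Majorize-minimize on f(x) = x^2 - 2*sqrt(x^2 + 1) (convex + concave part).

    The concave part h(x) = -2*sqrt(x^2 + 1) is majorized by its tangent at the
    current iterate, so the surrogate x^2 + h(xk) + h'(xk)*(x - xk) is convex
    and minimized in closed form at x = -h'(xk)/2.
    """
    x = x0
    for _ in range(steps):
        slope = -2.0 * x / np.sqrt(x * x + 1.0)   # h'(xk), tangent slope
        x = -slope / 2.0                          # argmin of the convex surrogate
    return x

f = lambda x: x * x - 2.0 * np.sqrt(x * x + 1.0)
x_star = mm_minimize(3.0)                         # approaches the minimizer x* = 0
```

Because the tangent of a concave function lies above it, the surrogate upper-bounds f and touches it at the current iterate, so each step can only decrease the true energy; this monotone-decrease property is what also holds for the ray potential energy, even with frequent re-majorization.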

C. Results

In this section we present a result from [59]. The depth maps and camera poses are obtained using the same procedure as in Sec. III-C. The results in Fig. 7 show that when using the unary potential data term surfaces are fattened. Using the ray potential, thin structures are recovered more accurately. Very thin structures, such as road signs, have been treated with a specialized formulation in [65]; the ray potential method is able to reconstruct even such scenes without special treatment of the thin structures. Furthermore, the ray potential method also leads to state-of-the-art results on the Middlebury Multi-View Stereo benchmark [66].

V. NORMAL DIRECTION BASED OBJECT SHAPE PRIORS

So far we have seen how to use geometric priors in semantic 3D reconstruction. These priors help to recover weakly observed and unobserved surfaces. However, they are rather generic priors, preferring single directions or groups of directions over all others. When reconstructing challenging objects which have translucent or reflective parts, such priors are insufficient. In this part we give an overview of two papers [29], [67] which introduce class-specific 3D object shape priors into the volumetric semantic 3D reconstruction introduced in Sec. III.

There are two reasons for using a multi-label formulation for object shape priors. Suppose we are reconstructing an object such as a car. The car will be very challenging to reconstruct due to the reflective and translucent surfaces. However, the supporting ground generally is much easier to reconstruct and will be strongly present in the data term. Using a multi-label formulation allows us to distinguish between the object and the surrounding; for the case of the car we can even estimate the unobserved ground underneath the car. The second reason is objects that might themselves be segmented into semantic classes, for example human heads which are segmented into skin, hair, beard and so on. Using a multi-label formulation facilitates the recovery of hidden surfaces like the skin underneath the hair and leads to a consistent semantic segmentation for the whole dataset.

One of the most popular approaches to object shape priors is probably low-dimensional statistical shape models, which can for example be acquired using Principal Component Analysis (PCA). For face reconstruction the model presented in [68], [69] is one of the most widespread shape priors. Similar models have been proposed for other objects in [70], [71]. Also other priors such as connectivity constraints [72] have been proposed. Recently, learning based approaches have been utilized to reconstruct complete 3D objects from a single color image [73].

Low dimensional shape models have difficulties with, or are unable to represent, shape details which are instance specific. This is even the case when the shape details are well observed in the input data. The approach presented in this section overcomes this problem by allowing any shape, but modeling the regularization cost such that a shape which is likely, given the training data, is penalized less than an unlikely shape. With strong enough data evidence a reconstruction of any general shape is possible, and hence surface details which are observed in the input data will also naturally be reconstructed.

The idea of the shape prior formulation presented in the following is to look at surface normal distributions locally. If we for example look at the surface normals of a car, we observe that in many places the surface normals of different instances are very similar at corresponding locations. For example, the roof is close to horizontal, the bottom plate as well but in the opposite direction, and the windshield is close to 45°. Choosing a surface regularization term such that likely surface normal directions are penalized less than unlikely ones at each location in the volume leads to a class-specific shape prior. Anisotropic regularization has also been used in [74] to improve dense 3D reconstruction. But compared to the approach presented here, the surface normal field was there extracted from the input images for the reconstruction and therefore cannot help in regions where the input data is unreliable. The shape prior presented here is derived from training data.

A. Shape Prior Formulation

We work with a voxel space Ω understood as a discretization of a continuous space, which is now aligned with an object of known type. This means, for the case of the object type car, if we consider a voxel in the region where we expect the roof of the car, we also have a high likelihood that the surface is horizontal with upward pointing surface normal. And in turn this means that during the regularization such a direction should impose a lower regularization cost than any other direction. One straightforward solution would be to use the Wulff shape which prefers a given direction from Sec. III-B and consequently learn the direction and strength from training data. This would not be sufficient, as in general more complex situations occur. If we look at the back of a car we might have a sedan or a hatchback. Looking at the two of them together, still only a limited set of surface normal directions is present in any given voxel. In order to capture these statistics a more general way of parameterizing the Wulff shape is used.

B. Discrete Wulff Shape

As we know from Sec. III-B, the smoothness terms φ^ij_s that can be inserted into Eq. 8 can be represented by a convex shape W^ij_s called the Wulff shape. Any convex shape can be represented as the intersection of half spaces. This is used to define a parameterization of the Wulff shape which facilitates general shapes and is hence capable of capturing the per-voxel surface normal variation accurately.

A set of half spaces H^ij_s is used to define the Wulff shape W_{H^ij_s} for a transition between label i and j at voxel s as the intersection of all these half spaces. Each half space is defined by an outward pointing normal direction n ∈ S and a distance of its boundary to the origin d^{n,ij}_s, where S is a discrete subset of all the 3D unit vectors S². The set S is chosen to be spatially homogeneous and therefore does not have a location index.

We follow the derivations of [29], [67]. The energy formulation in Eq. 8 naturally corresponds to a negative log-probability, and therefore the probabilistic interpretation leads to

P(n^ij_s) = e^{−φ^ij_s(n^ij_s)} = e^{−max_{p∈W_{H^ij_s}} (p^T n^ij_s)} = e^{−d^{n,ij}_s}.    (19)

The last equation holds under the condition that all the half spaces in H^ij_s that are used to form the Wulff shape W_{H^ij_s} are active, meaning they share a common boundary with the generated Wulff shape. It follows that the desired value for d^{n,ij}_s is determined by

d^{n,ij}_s := −log(P(n^ij_s)).    (20)

The Wulff shapes can be inferred from training shapes given as mesh models as follows. First, a resolution for the reconstruction volume Ω is selected, and then each training shape is aligned to the volume by rotation and non-uniform scaling. The surface normal of each training shape in each voxel is collected, and a histogram over the training shapes' normal directions is built per voxel s using the normal directions n ∈ S as bin centers. Therefore, we can directly read off the required probabilities from the histogram frequencies. It may happen that a specific direction n is not active in a voxel, which means that this direction is penalized less than the training data suggests (c.f. [29] for more details). A visualization from [29] of the derived Wulff shapes that form the shape prior for bottles is given in Fig. 8.
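The histogram-to-distance step of Eqs. 19 and 20 can be sketched compactly. The following toy implementation (function name, bin assignment by nearest direction, and the floor probability `eps` are our own assumptions) turns collected training normals into per-voxel half-space distances:

```python
import numpy as np

def wulff_distances(normals_per_voxel, bin_dirs, eps=1e-6):
    """Per-voxel half-space distances d_s^n = -log P(n) (sketch of Eqs. 19/20).

    normals_per_voxel: list over voxels; each entry is an (M, 3) array of unit
                       surface normals collected from the aligned training shapes.
    bin_dirs:          (K, 3) array of unit vectors, the discrete set S of
                       histogram bin centers (= half-space normals).
    """
    K = len(bin_dirs)
    dists = []
    for normals in normals_per_voxel:
        if len(normals) == 0:
            # no evidence in this voxel: all directions equally expensive
            dists.append(np.full(K, -np.log(eps)))
            continue
        # assign each training normal to its nearest bin direction
        bins = np.argmax(normals @ bin_dirs.T, axis=1)
        freq = np.bincount(bins, minlength=K) / len(normals)
        dists.append(-np.log(np.maximum(freq, eps)))
    return np.array(dists)
```

Frequent directions get small distances (cheap transitions), rare ones get large distances; the intersection of the resulting half spaces is the discrete Wulff shape for that voxel.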

C. Results on Cars and Bottles

In this section we show some example reconstructions from [29] using the normal direction based shape prior formulation. The model from Sec. III, which utilizes unary potentials ρ^i_s for the data term, is used. But unlike in Sec. III no semantic classifier is used. All the occupied space classes use the same unary potential, namely the one for geometry-only reconstruction given in Sec. II-A. The camera poses are computed using the pipeline of [53] and the depth maps with the publicly available plane sweeping implementation of [56]. The shape prior is trained from mesh models which were downloaded from the Internet. The semantic label is chosen solely based on whether or not the geometry is consistent with the learnt shape prior. The alignment between the shape prior and the input data is done manually. The results from [29] shown in Fig. 9 give a comparison between the shape prior formulation and the TV-Flux fusion. Besides the better reconstruction of the shape, also the ground underneath the car is correctly recovered using the shape prior formulation.

Fig. 8. Vertical slice through the shape prior for bottles and two close ups. Figure adapted from [29].

Fig. 9. Reconstructions using the TV-Flux fusion next to the corresponding reconstruction using the normal direction based shape prior. (top row) Reconstruction of a car. (bottom row) Reconstructions of two different bottles. Results adapted from [29].

D. Image Based Classification in the Regularization Term

We have seen above that it is possible to segment an object from its surroundings purely based on a shape prior. In the following we give an overview of [67], which utilizes and extends the method from above for semantic 3D reconstructions of human heads. For this case it is not enough any more to just rely on the geometry for the segmentation. The eyebrows for example have a very similar geometry to the skin underneath them; the same holds for a short beard or short hair. Therefore, it is crucial to bring an image based semantic classifier back into the formulation. We have seen two options already in Secs. III and IV. Before we briefly introduce the method used for human heads, we mention the disadvantages of the approaches mentioned earlier. The unary potential based data term has the tendency to fatten thin objects, as discussed in Sec. IV. This is especially problematic in the case of thin layers of semantic classes. As an example, suppose the eyebrows are reconstructed fattened. This would mean they extend into the space of the skin underneath them and therefore cause an unwanted indentation in the skin. The ray potential data term from Sec. IV is very powerful but computationally very demanding. This becomes especially problematic when including an alignment transformation into the formulation, as outlined in the next section. Therefore, the formulation of [67] uses an alternative way of including an image based semantic classifier that supports thin layers of semantic classes but does not increase the computational complexity of the model.

The idea is to include the scores of the semantic classifier in the regularization term and keep the data from the depth maps represented in the unary term. Looking at the example of the eyebrows, the unary term would indicate that the space behind the observed depth should be occupied, but not which semantic class it should have. The regularization term knows that in the region around the observed depth eyebrows are visible in the image. Therefore a transition between free space and eyebrow is made less costly and hence preferred. Note that this does not affect the transition between eyebrow and skin, and also the unary potential, which prefers all occupied space labels equally, does not hinder the model from putting a transition from eyebrow to skin at the correct place.

More formally, the per-pixel knowledge about the semantic labels Γ is included in the regularization term. The regularization term also includes a dependency on the alignment transformation T. This transformation aligns the input data with the shape prior and will be discussed in the subsequent section. The probability of a surface element between labels i and j with normal direction n at position s is modeled as

P(n^ij_s | T, Γ) := P(n^ij_s | ↔^ij_s) P(↔^ij_s | ↔_s, T, Γ) P(↔_s).    (21)

The term P(↔_s) captures the probability of having a surface at voxel s. P(↔^ij_s | ↔_s, T, Γ) is the probability of the presence of a surface between the two specific labels i and j, given there is a surface. It includes the knowledge about the semantic label likelihoods Γ and hence is dependent on the alignment T. P(n^ij_s | ↔^ij_s) essentially captures the implicit normal direction based shape prior like in the sections above.

Using the idea of the discretized Wulff shape from above, and further approximating the semantic information Γ using a scaling factor w^ij_s(T, Γ), in analogy to edge adaptive priors in image segmentation, the formula for the half space distances d^{n,ij}_s reads as

d^{n,ij}_s := w^ij_s(T, Γ) (−log P(n^ij_s | ↔^ij_s) − log P(↔^ij_s | ↔_s) − log P(↔_s)).    (22)

Details and further explanations of this model are found in the original publication [67].


E. Alignment Transformation

For the results above the reconstruction volume, and hence the shape prior, has been manually aligned with respect to the input data. For faces such a manual alignment would be very difficult. Also, due to the implicit nature of the shape prior there is no clear correct alignment. Therefore a similarity transformation T: R³ → R³ defined as y ↦ αRy + t, with a positive scaling factor α > 0, a rotation matrix R and a translation vector t, which maps the input data to the reconstruction volume, and hence the shape prior, is included in the energy formulation. The objective function from Eq. 8 with inserted transformation T reads as follows

E(x, T) = ∑_{s∈Ω} ∑_i ρ^i_s(T) x^i_s + (1/α²) ∑_{i,j: i<j} φ^ij_s(T, x^ij_s − x^ji_s).    (23)

The constraints from Eq. 9 are not affected by the transformation T and can hence directly be used for the augmented formulation. The normalization with respect to α ensures that the optimization does not shrink the model to decrease the surface area. This new energy is still convex with respect to the reconstruction x but is non-convex with respect to the transformation T.
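The 1/α² factor can be motivated with a quick numerical check: a similarity transform with scale α multiplies every surface area, and hence the area-based smoothness term, by α², so dividing by α² removes the incentive to shrink the model. A small sketch (the tetrahedron is just an illustrative stand-in for a reconstructed surface):

```python
import numpy as np

def surface_area(vertices, faces):
    """Total area of a triangle mesh via cross products."""
    v = vertices[faces]                           # (F, 3, 3) triangle corners
    cross = np.cross(v[:, 1] - v[:, 0], v[:, 2] - v[:, 0])
    return 0.5 * np.linalg.norm(cross, axis=1).sum()

# unit tetrahedron as a stand-in for a reconstructed surface
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])

alpha = 2.5
a0 = surface_area(verts, faces)
a1 = surface_area(alpha * verts, faces)           # scaled by the similarity transform
print(a1 / a0)                                    # ratio is alpha**2 (up to rounding)
```

Since the smoothness term grows like the surface area, the ratio α² is exactly what the 1/α² normalization in Eq. 23 cancels.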

The minimization of this energy is done by alternating between recovering the geometry and refining the alignment, changing the alignment such that it better agrees with the regularization term and hence the shape prior. This part of the optimization is done using gradient descent. There are two key ingredients in the optimization of the alignment. First, the refinement needs to be done as soon as parts of the geometry are recovered, and only on transitions between free space and occupied space. This is to avoid aligning with respect to something that was hallucinated by the prior. Second, the optimization with respect to the alignment can be done on a surface level, which avoids a re-sampling of the volume for each gradient descent iteration. This leads to a robust and efficient alignment procedure which reliably finds a good alignment. The alternation starts with reconstructing the geometry; therefore an initial registration between the shape prior and the input data is necessary. This can be easily achieved using landmark detectors; for the example of human heads these can be the nose tip, the corners of the mouth and so on. We refer the reader to [67] for more details about the optimization procedure.

F. Results on Human Heads

In this section we show a semantic head reconstructionresult from [67]. The input is a set of multiple images takenwith a mobile phone or standard digital camera. The cameraposes are computed using either the system of [75] or [76]. Theimage based semantic scores are computed with the method of[55], [4] and the depth maps with the publicly available planesweeping implementation of [56]. The shape prior is trainedon mesh models of heads sampled form the statistical shapemodel of [69]. The initial registration between the shape priorand the input data is computed from land mark detections with

Fig. 10. Result for semantic 3D reconstruction of heads. (from left to right)TV-Flux fusion, fitting of the statistical shape model of [69], semantic headreconstruction skin label only, semantic head reconstruction all labels. Resultsadapted from [67].

[77]. Fig. 10 depicts a semantic 3D reconstruction result on a head. The mole on the cheek is reconstructed correctly despite the fact that such instance specific details are not represented in the shape prior. The comparison to fitting a low dimensional shape model shows that such an approach cannot capture these details.

VI. SEGMENT BASED OBJECT SHAPE PRIORS

One major drawback of the shape prior formulations presented above is that they need to store a Wulff shape at each location. In this section we give an overview of the method of [78], which proposes an alternative that uses only global, spatially homogeneous smoothness terms. This has the advantage of a smaller model, and at the same time no alignment between the shape prior and the data with respect to translation is necessary. Instead of a complicated spatially varying smoothness term, this shape prior formulation works by splitting the object into multiple segments.

A. Formulation

The main objective is to formulate a smoothness prior which is the same at each voxel, i.e. spatially homogeneous, but is still able to describe an object shape prior. In order to understand the motivation for splitting the object into multiple segments we consider the example of a table. The common problems when acquiring a 3D reconstruction of a table from color images are holes in the tabletop and disconnected table legs, mostly caused by texture-less and reflective surfaces. Let us assume we want to define a spatially homogeneous smoothness prior which resolves these issues. In order to avoid disconnected legs, horizontal surfaces need to be expensive and vertical surfaces cheap, but this makes holes in the tabletop cheap and hence likely to appear. On the other hand, avoiding holes in the tabletop means horizontal surfaces need to be cheap and vertical surfaces expensive, which makes disconnected table legs likely. As soon as we split the object into tabletop and table legs these problems disappear, and a single spatially homogeneous smoothness prior is able to capture the shape of a table (c.f. Fig. 11).
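The dilemma and its resolution can be illustrated with a toy anisotropic transition cost per label pair. The function `phi` and its weights below are purely illustrative, not the trained Wulff shapes of [78]; the point is that once the object is split into segments, each label pair can prefer its own surface orientation while the cost stays the same at every voxel.

```python
def phi(pair, n):
    """Toy boundary cost for a transition between a pair of labels with unit
    boundary normal n = (nx, ny, nz). Spatially homogeneous: the same cost
    applies at every voxel."""
    nz = abs(n[2])  # |vertical component| of the boundary normal
    if pair == ('free', 'tabletop'):
        return 1.0 + 4.0 * (1.0 - nz)  # horizontal surface (vertical normal) cheap
    if pair == ('free', 'leg'):
        return 1.0 + 4.0 * nz          # vertical surface (horizontal normal) cheap
    return 1.0                         # isotropic cost for all other pairs
```

With a single occupied label this preference would have to be a compromise; with separate `tabletop` and `leg` labels, closing holes in the top and keeping the legs connected are both cheap at the same time.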

The underlying reason is the convexity of the Wulff shapes, which directly implies that only convex shapes can be preferred using a Wulff shape. The Wulff shape was first described in crystallography [79], where it explains the shape of crystals. Therefore, splitting the object into convex or


Fig. 11. Visualization of the shape prior for the table class, (from left to right): table leg and tabletop; tabletop, table leg and ground. Figure adapted from [78].

Fig. 12. Visualization of the shape prior for the mug class, (from left to right): mug and inner free space; mug, ground and inner free space. Figure adapted from [78].

approximately convex segments leads to strong shape priors for many real-world objects. Splitting objects into convex or approximately convex segments is also a topic in related areas such as computational geometry and computer graphics [80], [81], [82], [83], and algorithms to automatically segment objects have been proposed. Furthermore, the segments often naturally correspond to semantic parts of an object. This segmentation is possible without any image based semantic classifier, purely based on the different geometry of the different segments of an object. Including an image based semantic classifier to strengthen the formulation would be possible; however, this was not done in [78], in order to illustrate the strength of the shape prior formulation.
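The notion of an "approximately convex" segment can be made quantitative by comparing a shape's area (volume in 3D) to that of its convex hull. The 2D sketch below is one such illustrative measure, not the decomposition criterion used in [80]-[83]: a convex segment scores 1.0, and the score drops as the shape becomes more concave.

```python
def shoelace_area(pts):
    # Absolute area of a simple polygon via the shoelace formula.
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2.0

def convex_hull(pts):
    # Andrew's monotone chain; returns hull vertices in counter-clockwise order.
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def convexity(polygon):
    # 1.0 for convex polygons, < 1.0 the more concave the polygon is.
    return shoelace_area(polygon) / shoelace_area(convex_hull(polygon))
```

An L-shaped cross-section (like a tabletop with a leg attached) scores below 1.0, while each of its two rectangular segments scores exactly 1.0, which is the intuition behind decomposing objects into convex parts.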

Special care needs to be taken for objects which are inherently non-convex and for which splitting into convex segments is therefore not feasible. One such example is a mug, which has a concave inside. This can be tackled by also segmenting the free space: in the case of a mug, a special free space class is defined for the inside of the mug (c.f. Fig. 12). Note that we naturally end up with a non-metric smoothness prior. A horizontal transition from inside free space directly to outside free space needs to be more expensive than the transition from inside free space to mug and then to outside free space. This violates the triangle inequality, which is a condition for a metric. It is therefore crucial that the underlying convex multi-label formulation from [1] facilitates non-metric smoothness priors.
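The non-metric property is easy to check programmatically. The costs below are illustrative, not the trained costs of [78]: the direct transition from inner free space to outer free space is priced higher than the detour via the mug label, which breaks the triangle inequality.

```python
LABELS = ('outer_free', 'mug', 'inner_free')
# Symmetric pairwise transition costs; the direct inner->outer transition is
# deliberately more expensive than going inner -> mug -> outer (1.0 + 1.0).
COST = {('outer_free', 'mug'): 1.0,
        ('mug', 'inner_free'): 1.0,
        ('outer_free', 'inner_free'): 3.0}

def cost(a, b):
    if a == b:
        return 0.0
    return COST.get((a, b), COST.get((b, a)))

def satisfies_triangle_inequality(labels, c):
    # A metric requires c(a, b) <= c(a, k) + c(k, b) for all label triples.
    return all(c(a, b) <= c(a, k) + c(k, b)
               for a in labels for b in labels for k in labels)
```

For these costs `satisfies_triangle_inequality(LABELS, cost)` is `False`, which is exactly why the formulation of [1] must accept non-metric smoothness priors.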

The Wulff shapes can be trained from training data using a slightly adapted version of the method presented in Sec. III-B. In order to be able to represent sharp corners, the splitting into an isotropic and an anisotropic part is discarded and a slightly more complicated training procedure is utilized; see [78] for details.

Fig. 13. Reconstruction of a table with no shape prior using TV-Flux (left) and with the segment based shape prior (right). Figure adapted from [78].

Fig. 14. Reconstruction of a mug with no shape prior using TV-Flux (left) and with the segment based shape prior, with inside free space in blue (middle) and without inside free space (right). Figure adapted from [78].

B. Results

In this section we show some results from [78]. The input data is preprocessed in the same way as in Sec. V-C. The alignment only needs to be done with respect to rotation. This was done manually for the results presented here but could be automated, for example through the extraction of vanishing points. The results given in Figs. 13 and 14 show that the segment based shape prior is able to overcome the problems of the geometry-only reconstruction.

VII. CONCLUSION

In this paper we gave an overview of a line of research that uses a continuously inspired volumetric multi-label formulation to facilitate semantic 3D reconstruction. The formulations are able to handle geometric priors for outdoor scenes and category specific object shape priors. This enables the reconstruction algorithm to recover hidden surfaces such as building facades behind vegetation or the skin underneath hair. Different ways of formulating the data term have been proposed for various applications. As for any volumetric method, the question of scalability naturally arises; for the basic formulation, a method which optimizes over an octree in a coarse-to-fine manner was proposed.

At the current point the complexity of the scenes that can be reconstructed is limited to scenes with only a few labels. This is mostly due to the quadratic complexity of the energy in terms of the number of labels. It has already been pointed out in [47] that this can potentially be overcome by only taking into account relevant transitions, an idea utilized in [84], but the approach is still much more limited in the number of labels than current image based semantic classifiers can handle. The recovery of hidden surfaces has been limited to simple surfaces such as flat facades or the skin underneath hair. Recently, methods


which are able to predict a 3D object from just a single image using CNNs have started to appear (for example [73]). The advantage of such formulations is that they are able to learn complex representations of objects from a large set of training data. Currently, most of these methods have focused on the reconstruction of single objects at limited resolution. Further extending these ideas and bringing them into semantic 3D reconstruction from multiple images might lead to semantic 3D reconstruction formulations for complex scenes from only a few input images.

ACKNOWLEDGMENT

This work has received funding from the Swiss National Science Foundation under Project 157101 and in the form of an Early Postdoc.Mobility fellowship for Christian Hane.

REFERENCES

[1] C. Zach, C. Hane, and M. Pollefeys, "What is optimized in convex relaxations for multilabel problems: Connecting discrete and continuously inspired MAP inference," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 36, no. 1, pp. 157–170, 2014.

[2] C. Zach, M. Niethammer, and J.-M. Frahm, "Continuous maximal flows and Wulff shapes: Application to MRFs," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[3] C. Hane, L. Ladicky, and M. Pollefeys, "Direction matters: Depth estimation with a surface normal classifier," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[4] L. Ladicky, B. Zeisl, and M. Pollefeys, "Discriminatively trained dense surface normal estimation," in European Conference on Computer Vision (ECCV), 2014.

[5] L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. Clocksin, and P. H. Torr, "Joint optimization for object class segmentation and dense stereo reconstruction," International Journal of Computer Vision (IJCV), vol. 100, no. 2, pp. 122–133, 2012.

[6] M. Bleyer, C. Rhemann, and C. Rother, "Extracting 3D scene-consistent object proposals and depth from stereo images," in European Conference on Computer Vision (ECCV), 2012.

[7] F. Guney and A. Geiger, "Displets: Resolving stereo ambiguities using object knowledge," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[8] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics (TOG), vol. 24, no. 3, pp. 577–584, 2005.

[9] O. Barinova, V. Konushin, A. Yakubenko, K. Lee, H. Lim, and A. Konushin, "Fast automatic single-view 3-D reconstruction of urban scenes," in European Conference on Computer Vision (ECCV), 2008.

[10] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun, "Efficient structured prediction for 3D indoor scene understanding," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[11] V. Hedau, D. Hoiem, and D. Forsyth, "Recovering the spatial layout of cluttered rooms," in IEEE International Conference on Computer Vision (ICCV), 2009.

[12] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 31, no. 5, pp. 824–840, 2009.

[13] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems (NIPS), 2005.

[14] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems (NIPS), 2014.

[15] L. Ladicky, J. Shi, and M. Pollefeys, "Pulling things out of perspective," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[16] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," in IEEE International Conference on Computer Vision (ICCV), 2015.

[17] S. Y. Bao, M. Bagra, Y.-W. Chao, and S. Savarese, "Semantic structure from motion with points, regions, and objects," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[18] B. Curless and M. Levoy, "A volumetric method for building complex models from range images," in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques. ACM, 1996.

[19] C. Zach, T. Pock, and H. Bischof, "A globally optimal algorithm for robust TV-L1 range image integration," in International Conference on Computer Vision (ICCV), 2007.

[20] C. Zach, "Fast and high quality fusion of depth maps," in International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), 2008.

[21] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011.

[22] J. Chen, D. Bautembach, and S. Izadi, "Scalable real-time volumetric surface reconstruction," ACM Transactions on Graphics (TOG), vol. 32, no. 4, p. 113, 2013.

[23] M. Nießner, M. Zollhofer, S. Izadi, and M. Stamminger, "Real-time 3D reconstruction at scale using voxel hashing," ACM Transactions on Graphics (TOG), vol. 32, no. 6, p. 169, 2013.

[24] V. Lempitsky and Y. Boykov, "Global optimization for shape fitting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.

[25] K. Kolev, M. Klodt, T. Brox, and D. Cremers, "Continuous global optimization in multiview 3D reconstruction," International Journal of Computer Vision (IJCV), vol. 84, no. 1, pp. 80–96, 2009.

[26] T. F. Chan, S. Esedoglu, and M. Nikolova, "Algorithms for finding global minimizers of image segmentation and denoising models," SIAM Journal on Applied Mathematics, vol. 66, no. 5, pp. 1632–1648, 2006.

[27] A. Chambolle, V. Caselles, D. Cremers, M. Novaga, and T. Pock, "An introduction to total variation for image analysis," Theoretical Foundations and Numerical Methods for Sparse Recovery, vol. 9, pp. 263–340, 2010.

[28] C. Hane, C. Zach, J. Lim, A. Ranganathan, and M. Pollefeys, "Stereo depth map fusion for robot navigation," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2011.

[29] C. Hane, N. Savinov, and M. Pollefeys, "Class specific 3D object shape priors using surface normals," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[30] M. Niethammer, A. Boucharin, C. Zach, Y. Shi, E. Maltbie, M. Sanchez, and M. Styner, "DTI connectivity by segmentation," in International Workshop on Medical Imaging and Virtual Reality, 2010.

[31] J. Lellmann, B. Lellmann, F. Widmann, and C. Schnorr, "Discrete and continuous models for partitioning problems," International Journal of Computer Vision (IJCV), vol. 104, no. 3, pp. 241–269, 2013.

[32] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1997.

[33] J. M. Borwein and A. S. Lewis, Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer Science & Business Media, 2010.

[34] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120–145, 2011.

[35] T. Pock and A. Chambolle, "Diagonal preconditioning for first order primal-dual algorithms in convex optimization," in International Conference on Computer Vision (ICCV), 2011, pp. 1762–1769.

[36] P. L. Combettes and J.-C. Pesquet, "Proximal splitting methods in signal processing," in Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2011, pp. 185–212.

[37] B. Jacquet, C. Hane, R. Angst, and M. Pollefeys, "Multi-body depth-map fusion with non-intersection constraints," in European Conference on Computer Vision (ECCV), 2014.

[38] P. Speciale, M. R. Oswald, A. Cohen, and M. Pollefeys, "A symmetry prior for convex variational 3D reconstruction," in European Conference on Computer Vision (ECCV), 2016.

[39] T. Pock, A. Chambolle, D. Cremers, and H. Bischof, "A convex relaxation approach for computing minimal partitions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[40] J. Lellmann and C. Schnorr, "Continuous multiclass labeling approaches and algorithms," SIAM Journal on Imaging Sciences, vol. 4, no. 4, pp. 1049–1096, 2011.

[41] A. Chambolle, D. Cremers, and T. Pock, "A convex approach to minimal partitions," SIAM Journal on Imaging Sciences, vol. 5, no. 4, pp. 1113–1158, 2012.


[42] T. Werner, "A linear programming approach to max-sum problem: A review," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 29, no. 7, pp. 1165–1179, 2007.

[43] V. Kolmogorov and R. Zabih, "What energy functions can be minimized via graph cuts?" IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 26, no. 2, pp. 147–159, 2004.

[44] C. Zach, C. Hane, and M. Pollefeys, "What is optimized in tight convex relaxations for multi-label problems?" in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[45] S. Esedoglu and S. J. Osher, "Decomposition of images by the anisotropic Rudin-Osher-Fatemi model," Communications on Pure and Applied Mathematics, vol. 57, no. 12, pp. 1609–1626, 2004.

[46] E. Strekalovskiy and D. Cremers, "Generalized ordering constraints for multilabel optimization," in International Conference on Computer Vision (ICCV), 2011.

[47] C. Hane, C. Zach, A. Cohen, R. Angst, and M. Pollefeys, "Joint 3D scene reconstruction and class segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[48] C. Hane, "Semantic 3D modeling from images with geometric priors," Ph.D. dissertation, ETH Zurich, 2016.

[49] B.-S. Kim, P. Kohli, and S. Savarese, "3D scene understanding by voxel-CRF," in IEEE International Conference on Computer Vision (ICCV), 2013.

[50] S. Sengupta, E. Greveson, A. Shahrokni, and P. H. Torr, "Urban 3D semantic modelling using stereo vision," in International Conference on Robotics and Automation (ICRA), 2013.

[51] V. Vineet, O. Miksik, M. Lidegaard, M. Nießner, S. Golodetz, V. A. Prisacariu, O. Kahler, D. W. Murray, S. Izadi, P. Perez et al., "Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction," in IEEE International Conference on Robotics and Automation (ICRA), 2015.

[52] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg, "Joint semantic segmentation and 3D reconstruction from monocular video," in European Conference on Computer Vision (ECCV), 2014.

[53] C. Zach, M. Klopschitz, and M. Pollefeys, "Disambiguating visual relations using loop constraints," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[54] A. Cohen, C. Zach, S. N. Sinha, and M. Pollefeys, "Discovering and exploiting 3D symmetries in structure from motion," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[55] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, "Associative hierarchical CRFs for object class image segmentation," in IEEE International Conference on Computer Vision (ICCV), 2009.

[56] C. Hane, L. Heng, G. H. Lee, A. Sizov, and M. Pollefeys, "Real-time direct dense matching on fisheye images using plane-sweeping stereo," in International Conference on 3D Vision (3DV), 2014.

[57] M. Blaha, C. Vogel, A. Richard, J. D. Wegner, T. Pock, and K. Schindler, "Large-scale semantic 3D reconstruction: An adaptive multi-resolution model for multi-class volumetric labeling," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[58] N. Savinov, L. Ladicky, C. Hane, and M. Pollefeys, "Discrete optimization of ray potentials for semantic 3D reconstruction," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[59] N. Savinov, C. Hane, L. Ladicky, and M. Pollefeys, "Semantic 3D reconstruction with continuous regularization and ray potentials using a visibility consistency constraint," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[60] S. Liu and D. B. Cooper, "Ray Markov random fields for image-based 3D modeling: Model and efficient inference," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[61] ——, "A complete statistical inverse ray tracing approach to multi-view stereo," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

[62] A. O. Ulusoy, A. Geiger, and M. J. Black, "Towards probabilistic volumetric reconstruction using ray potentials," in International Conference on 3D Vision (3DV), 2015.

[63] K. Kolev and D. Cremers, "Integration of multiview stereo and silhouettes via convex functionals on convex domains," in European Conference on Computer Vision (ECCV), 2008.

[64] K. Lange, D. R. Hunter, and I. Yang, "Optimization transfer using surrogate objective functions," Journal of Computational and Graphical Statistics, vol. 9, no. 1, pp. 1–20, 2000.

[65] B. Ummenhofer and T. Brox, "Point-based 3D reconstruction of thin objects," in IEEE International Conference on Computer Vision (ICCV), 2013.

[66] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A comparison and evaluation of multi-view stereo reconstruction algorithms," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2006.

[67] F. Maninchedda, C. Hane, B. Jacquet, A. Delaunoy, and M. Pollefeys, "Semantic 3D reconstruction of heads," in European Conference on Computer Vision (ECCV), 2016.

[68] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 25, no. 9, pp. 1063–1074, 2003.

[69] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter, "A 3D face model for pose and illumination invariant face recognition," in IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2009.

[70] S. Y. Bao, M. Chandraker, Y. Lin, and S. Savarese, "Dense object reconstruction with semantic priors," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[71] A. Dame, V. A. Prisacariu, C. Y. Ren, and I. Reid, "Dense reconstruction using 3D object shape priors," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[72] J. Stuhmer, P. Schroder, and D. Cremers, "Tree shape priors with connectivity constraints using convex relaxation on general graphs," in IEEE International Conference on Computer Vision (ICCV), 2013.

[73] A. Kar, S. Tulsiani, J. Carreira, and J. Malik, "Category-specific object reconstruction from a single image," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[74] K. Kolev, T. Pock, and D. Cremers, "Anisotropic minimal surfaces integrating photoconsistency and normal information for multiview stereo," in European Conference on Computer Vision (ECCV), 2010.

[75] P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer, and M. Pollefeys, "Live metric 3D reconstruction on mobile phones," in International Conference on Computer Vision (ICCV), 2013.

[76] C. Wu, "VisualSFM: A visual structure from motion system," http://ccwu.me/vsfm/, 2011.

[77] J. M. Saragih, S. Lucey, and J. F. Cohn, "Deformable model fitting by regularized landmark mean-shift," International Journal of Computer Vision (IJCV), 2011.

[78] R. K. Mahabadi, C. Hane, and M. Pollefeys, "Segment based 3D object shape priors," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[79] G. Wulff, "Zur Frage der Geschwindigkeit des Wachsthums und der Auflosung der Krystallflachen," Zeitschrift fur Kristallographie - Crystalline Materials, vol. 34, no. 1, pp. 449–530, 1901.

[80] B. Chazelle and D. P. Dobkin, "Optimal convex decompositions," Computational Geometry, 1985.

[81] X. Li, T. W. Woon, T. S. Tan, and Z. Huang, "Decomposing polygon meshes for interactive applications," in Symposium on Interactive 3D Graphics, 2001.

[82] J.-M. Lien and N. M. Amato, "Approximate convex decomposition of polyhedra," in ACM Symposium on Solid and Physical Modeling (SPM), 2007.

[83] K. Mamou and F. Ghorbel, "A simple and efficient approach for 3D mesh approximate convex decomposition," in IEEE International Conference on Image Processing (ICIP), 2009.

[84] I. Cherabier, C. Hane, M. Oswald, and M. Pollefeys, "Multi-label semantic 3D reconstruction using voxel blocks," in International Conference on 3D Vision (3DV), 2016.