


Automatic occlusion removal from facades for 3D urban reconstruction

C. Engels1, D. Tingdahl1, M. Vercruysse1, T. Tuytelaars1, H. Sahli2, and L. Van Gool1,3

1 K.U.Leuven, ESAT-PSI/IBBT
2 V.U.Brussel, ETRO
3 ETH Zurich, BIWI

Abstract. Object removal and inpainting approaches typically require a user to manually create a mask around occluding objects. While creating masks for a small number of images is possible, it rapidly becomes untenable for longer image sequences. Instead, we accomplish this step automatically using an object detection framework to explicitly recognize and remove several classes of occlusions. We propose using this technique to improve 3D urban reconstruction from street level imagery, in which building facades are frequently occluded by vegetation or vehicles. By assuming facades in the background are planar, 3D scene estimation provides important context to the inpainting process by restricting input sample patches to regions that are coplanar to the occlusion, leading to more realistic final textures. Moreover, because non-static and reflective occlusion classes tend to be difficult to reconstruct, explicitly recognizing and removing them improves the resulting 3D scene.

1 Introduction

We seek to reconstruct buildings within urban areas from street level imagery. Most earlier approaches to 3D reconstruction have worked solely on low level image data, finding correspondences and backprojecting these into 3D. In contrast, we believe that in order to obtain high quality models, higher level knowledge can best be incorporated into the 3D reconstruction process from the very start, i.e. information about what the image actually represents should be extracted in combination with the geometry.

Here, we focus on one particular example of such top-down, cognitive-level processing: detecting cars or vegetation allows us to remove these occluding objects and their textures from the 3D reconstructions and to focus attention on the relevant buildings behind.

On the other hand, as soon as 3D information becomes available, it is helpful in the interpretation of the scene content, as the probabilities of finding objects at different locations in the scene depend on the geometry, e.g. whether there is a supporting horizontal plane available. So the geometric analysis helps the semantic analysis and vice versa. Obtaining a better understanding of the scene content not only helps the 3D modeling, but can also be useful in its own right,


[Figure: pipeline diagram. Geometric analysis branch: structure from motion → plane fitting. Semantic analysis branch: vehicle detection → vehicle false positive removal → vehicle segmentation, plus vegetation detection and removal. Both branches feed occlusion removal and dense reconstruction.]

Fig. 1. Overview of our processing pipeline.

creating a semantically rich model. Instead of just having the geometry (and possibly texture) of a scene, we now also have a partial interpretation. Ultimately, we aim for a system that knows what it is looking at, recognizing common objects such as doors, windows, cars, pedestrians, etc. Such a semantically rich model can be considered as an example of a new generation of representations in which knowledge is explicitly represented and therefore can be retrieved, processed, shared, and exploited to construct new knowledge. We envision that this will be a key component for the next generation of GPS systems, providing a more intuitive user interface and/or additional information (e.g. warnings concerning speed limits, pedestrian crossings, tourist information about specific buildings, etc.). Automatic extraction of such models is crucial if one wants to keep the system up to date without having to employ an army of image annotators.

The specific problem we will be focusing on here is the occlusion of building facades by foreground objects. Occluders are often non-static (e.g. vehicles or pedestrians) and therefore usually not relevant from a user perspective. Moreover, they are often difficult to reconstruct accurately due to specular reflections (e.g. cars), non-rigidity (e.g. pedestrians), or very fine structures (e.g. trees). Some measures can be used to mitigate the presence of these occlusions, such as increasing camera height or fusing images from different viewpoints to help complete textures, but some regions of a building facade may simply not be visible from any viewpoint. In such cases, it is possible to estimate the appearance of the occluded region using established inpainting approaches (e.g. [4]). However, these approaches frequently assume that occlusions will be spatially limited and manually masked, which is not feasible for a larger dataset.

Instead, we propose automatically finding occlusions using object-specific detectors. In practical situations, occlusions originate almost exclusively from a limited number of known object categories such as cars, pedestrians, or vegetation. These can be detected and recognized using state-of-the-art object detection methods (e.g. [6]). We add a refinement step based on Grab-cut [14] to obtain clean segmentations, and show that this allows the foreground objects to be removed by inpainting without any manual intervention. We compare this method with a baseline scheme where objects are removed based on depth masking.


Additionally, we show that superior results can be obtained by exploiting the planar structure of building facades and ground. We first rectify the image with respect to the plane, so as to reduce the effects of perspective deformation. During the inpainting process, we restrict the input sample patches to regions belonging to the particular plane being completed, leading to more realistic final textures. Finally, we fill in the missing depth information by extending the planes until they intersect.

In summary, our processing pipeline goes as follows (see also Fig. 1): given an image sequence, our approach initializes by estimating camera parameters and creating a sparse reconstruction. We estimate an initial dense, multi-view 3D reconstruction, from which facade and ground planes are detected. This geometric analysis is described in Sec. 3. In parallel, we detect vehicles within a detection-by-parts framework, while vegetation is detected with a patch-based texture classifier. The vehicle detections provide only a rough location of occlusions, which we then refine to obtain a final segmentation mask. These steps are described in Sec. 4. We proceed to eliminate occlusions using a patch-based inpainting approach that constrains input samples to a neighboring facade. We replace depths corresponding to occlusions with those of background facades, thereby eliminating the occlusions from the final reconstruction. This is explained in Sec. 5. Finally, in Sec. 6 we show some experimental results, and Sec. 7 concludes the paper.

2 Previous work

2.1 Cognitive 3D

This approach to 3D scene modeling in a sense comes close to the seminal work of Dick et al. [5], who combine single view recognition with multiview reconstruction for architectural scenes. They use a probabilistic framework that incorporates priors on shape (derived from architectural principles) and texture (based on learned appearance models). A different kind of constraint is used in [21], where a coarse piecewise-planar model of the principal scene planes and their delineations is reconstructed. This method has been extended by fitting shape models to windows [20], recognizing these models in the images using a Bayesian framework similar to the method described in [5]. However, these methods are limited to scenes that are highly constrained geometrically, resulting in a relatively strict model of what the scene may look like. This limits their applicability: they cannot easily be relaxed to general street views with trees, pedestrians, moving cars, etc. We want to exploit the recent advances in the recognition of generic object classes [6, 12] to apply similar ideas to more general scenes. Note that we will not use strict geometric or probabilistic models during 3D modeling, though. Instead, the higher level information is exploited to focus the attention, i.e. to select interesting parts to reconstruct.

The opposite scheme, where geometry is exploited to help recognition, has been explored further in [9]. Also worth mentioning here are a series of works that try to estimate depth from a single image [15, 8, 9]. Finally, [16] investigates


how to infer depth from recognition, by transferring depth profiles from training images of a particular object class to a newly detected instance.

2.2 Occlusion removal

Occlusion removal has been extensively studied within the computer vision and graphics communities, mostly building on advances made in the work on texture synthesis. Most approaches rely on the prior manual annotation of occlusions.

Our inpainting strategy is based on the patch exemplar-based technique of Criminisi et al. [4]. Wang et al. [19] extended this approach to also infer depth from stereo pairs.

Several works have noted that the manual workload can be greatly decreased using interactive methods that allow a user to quickly mark foreground and background areas, while exact segmentations are determined using graph cuts. The PatchMatch algorithm of Barnes et al. [1] allows for interactive rates for inpainting and reshuffling via simple user-defined constraints and efficient nearest neighbor search.

Within the context of building facades and urban reconstruction, increased contextual knowledge is available by assuming a structure's planarity and repetition of floors, windows, etc. Konushin and Vezhnevets [11] reconstruct building models by detecting floors and estimating building height. Occlusions are removed by cloning upper floors and propagating them downward. Rasmussen et al. [13] use multiple views and median fusion to remove most occlusions, requiring inpainting only for smaller regions. Benitez et al. [2] rely on a LIDAR point cloud to find and remove occlusions closer than detected planes by combining image fusion and inpainting. Xiao et al. [22] semantically segment street-side scenes into several classes, including vegetation and vehicles, but do not actively fill in missing data. Instead, they rely on the missing information being available from other views.

3 Geometric analysis

Planes are the most dominant geometric primitives found in urban environments and form a natural way of representing metropolitan scenes in 3D. Parameterizing a scene into planes not only gives a lightweight representation of the scene, but also provides the information necessary for geometric algorithms such as image rectification and occlusion detection. This section describes how we obtain a 3D reconstruction from a set of images, from which the dominant planes are extracted.

3.1 3D reconstruction with ARC3D

A sparse 3D reconstruction does not typically provide enough geometry for plane extraction. The ground plane can be especially difficult due to weak texture containing very few salient feature points to match between the images. However,


many dense reconstruction methods can do better here, as the search space between corresponding image pixels is limited once the cameras are calibrated. To this end, we use the publicly available ARC3D web service [18], which computes dense depth maps from a set of input images. It is composed of an uncalibrated structure-from-motion pipeline together with a dense stereo matcher. The user uploads images to a server via a client tool, which then computes the depth maps and notifies the user via email when the computation is done.

3.2 Plane extraction

The depth maps from ARC3D are merged into a 3D point cloud, which is used as input to a plane detection algorithm. A maximum likelihood RANSAC scheme is employed to find the most dominant plane in the point cloud. We use a thresholded squared Euclidean distance between the candidate plane and a test point to determine whether the test point is an inlier, and keep the candidate plane with the highest number of inliers. Subsequent planes are found by iteratively running RANSAC on the outliers from the previous detection.
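For concreteness, a minimal sketch of this iterative scheme follows. The inlier threshold, iteration count, and least-squares refit step are illustrative choices of the sketch, not values from the paper.

```python
import numpy as np

def fit_plane(pts):
    """Least-squares plane through >= 3 points: returns (unit normal n, offset d) with n.p + d = 0."""
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    n = vt[-1]                                   # direction of least variance
    return n, -n.dot(centroid)

def ransac_planes(points, n_planes=3, thresh=0.05, iters=1000, seed=0):
    """Iteratively extract dominant planes; each round runs RANSAC on the previous outliers."""
    rng = np.random.default_rng(seed)
    planes, remaining = [], points
    for _ in range(n_planes):
        best = None
        for _ in range(iters):
            n, d = fit_plane(remaining[rng.choice(len(remaining), 3, replace=False)])
            inliers = (remaining @ n + d) ** 2 < thresh ** 2   # thresholded squared distance
            if best is None or inliers.sum() > best.sum():
                best = inliers
        planes.append(fit_plane(remaining[best]))              # refit on all inliers
        remaining = remaining[~best]                           # iterate on the outliers
    return planes
```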

3.3 Image rectification

Projective distortion may cause unwanted artifacts from the inpainting algorithm. Thus we need to rectify the input images such that each imaged plane is seen from a fronto-parallel viewpoint. This can be achieved by applying the homography

H = KRK⁻¹    (1)

to each pixel in the input image [7], where K is the camera calibration matrix and

R = [r₁ r₂ r₃]ᵀ    (2)

is a rotation matrix with rows r₁ᵀ, r₂ᵀ, r₃ᵀ. With πF and πG as the facade and ground plane normals in the camera frame, the rotation is formed as follows. First, we need to align the camera viewing direction (z-axis) with the plane normal, thus r₃ = πF. Further, r₁ is selected to align the image x-axis with the intersection between πF and πG, r₁ = πF × πG. Finally, we set r₂ = r₁ × r₃ to complete the orthogonal basis.
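As an illustration, the rotation and homography above can be assembled as follows. This sketch assumes the calibration matrix K and the two plane normals are already available from the reconstruction; applying the result with cv2.warpPerspective is one option.

```python
import numpy as np

def rectifying_homography(K, n_facade, n_ground):
    """Build H = K R K^{-1} from Eqs. (1)-(2) using the facade and ground normals."""
    r3 = n_facade / np.linalg.norm(n_facade)      # align viewing direction with facade normal
    r1 = np.cross(n_facade, n_ground)             # image x-axis along the plane intersection
    r1 /= np.linalg.norm(r1)
    r2 = np.cross(r1, r3)                         # complete the orthogonal basis (r2 = r1 x r3)
    R = np.stack([r1, r2, r3])                    # rows r1^T, r2^T, r3^T
    return K @ R @ np.linalg.inv(K)

# e.g. rectified = cv2.warpPerspective(image, rectifying_homography(K, pi_F, pi_G), (w, h))
```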

4 Semantic analysis

4.1 Vehicle detection

To detect foreground occluding objects such as cars or pedestrians, we use a local feature-based approach, namely the Implicit Shape Model (ISM) [12]. Here, interest points are matched to object parts and vote for possible locations of the object center using a probabilistic generalized Hough transform. Models for different viewpoints are integrated using the method of [17].

If needed, false detections can be removed in an extra processing step, exploiting the 3D and geometric constraints. Here, we use the fact that cars should always be supported by the ground plane.
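One possible form of such a support test is sketched below; the paper does not specify its exact criterion, so the use of the detection's lowest 3D points and the 0.3-unit tolerance are assumptions of the sketch.

```python
import numpy as np

def supported_by_ground(points_3d, n_ground, d_ground, max_gap=0.3):
    """Accept a car detection only if its lowest 3D points lie near the ground plane.

    max_gap is an assumed tolerance in scene units; the plane is (n, d) with n.p + d = 0."""
    dist = np.abs(points_3d @ n_ground + d_ground)   # point-to-plane distances
    return dist.min() < max_gap                      # False -> reject as false positive
```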


Fig. 2. Left: source image. Center: initial segmentation mask. Right: refined segmentation mask.

4.2 Vehicle segmentation

Since we know which interest points contributed to the object detection, we can use them to obtain a rough segmentation of the detected object, by transferring segmentations available for the training images (see [12]).

However, the interest points typically do not cover the entire object, and as a result these segmentations often contain holes. As we will show later, this has a detrimental effect on the inpainting results.

Therefore, we propose to add an extra step to refine the segmentation. To this end, we build on the work of [14, 3] for interactive scene segmentation. We replace the interactive selection of foreground with the initial segmentation based on object detection. This results in significantly cleaner segmentations, while the method remains fully automatic, as illustrated in Fig. 2. This segmentation results in an occlusion mask M_car.
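A sketch of this refinement with OpenCV's GrabCut follows, under one assumption the paper does not spell out: pixels inside the rough ISM mask are seeded as probable foreground and everything else as probable background.

```python
import cv2
import numpy as np

def refine_with_grabcut(image_bgr, ism_mask, iters=5):
    """Refine a rough ISM segmentation with GrabCut, replacing the interactive seed."""
    mask = np.where(ism_mask > 0, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
    bgd = np.zeros((1, 65), np.float64)            # GMM model buffers required by the API
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd, fgd, iters, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))   # boolean occlusion mask M_car
```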

4.3 Vegetation detection

Unlike vehicles, vegetation tends to have a more random structure, which we detect using a patch-based texture classifier. For each patch, we construct a 13-dimensional vector containing mean RGB and HSV values, a five-bin hue histogram, and an edge orientation histogram containing orientation and number of modes. The classifier uses an approximate k-nearest neighbor search to match the patch to vegetation or non-vegetation descriptors from a training set, which is supplied by a separate set of manually segmented images. Finally, we perform morphological closing on the detections to refine the vegetation occlusion mask M_veg and combine the vegetation mask with the segmented vehicle detections to obtain an occlusion mask M = M_veg ∪ M_car.
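One plausible reading of the 13-dimensional descriptor is sketched below (3 mean RGB + 3 mean HSV + 5 hue bins + dominant edge orientation + mode count = 13). The gradient-histogram details (bin count, mode test) are assumptions; the paper does not specify them.

```python
import cv2
import numpy as np

def patch_descriptor(patch_bgr):
    """13-D texture descriptor per the text; edge-histogram details are assumed."""
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    mean_rgb = patch_bgr.reshape(-1, 3).mean(axis=0)[::-1]          # BGR -> RGB means
    mean_hsv = hsv.reshape(-1, 3).mean(axis=0)
    hue_hist = np.histogram(hsv[..., 0], bins=5, range=(0, 180))[0].astype(np.float64)
    hue_hist /= max(hue_hist.sum(), 1)
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    ori_hist = np.histogram(np.arctan2(gy, gx), bins=8, range=(-np.pi, np.pi),
                            weights=np.hypot(gx, gy))[0]
    dominant = ori_hist.argmax()                                    # dominant edge orientation bin
    n_modes = (ori_hist > 0.5 * ori_hist.max()).sum()               # crude mode count
    return np.concatenate([mean_rgb, mean_hsv, hue_hist, [dominant, n_modes]])
```

A patch would then be labeled vegetation when the majority of its k nearest training descriptors are vegetation.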

5 Occlusion removal

5.1 Inpainting

Our approach to occlusion removal closely follows that of Criminisi et al. [4]. They assume a user-provided occlusion mask M and a source region Φ = M̄


from which to sample patches that are similar to a patch Ψ′ on the boundary of Φ. Given the local neighborhood, the patch Ψ̂ ∈ Φ minimizing some distance function d(Ψ′, Ψ) is used to fill the occluded pixels in Ψ′. The authors recommend defining d as the sum of squared differences of observed pixels in CIE Lab color space. The key insight of that work is that the order in which patches are filled is important. By prioritizing patches containing linear elements oriented perpendicularly to the boundary of M, the algorithm preserves linear structures on a larger scale and leads to a more convincing replacement of the occlusion.

Rather than sampling over the entire image, we segment the image by planes and perform the inpainting in the rectified images produced in Sec. 3.3. We limit both the source and mask regions to the area lying on the facade. This has the effect of excluding vegetation, sky, other facades, etc. from the fill region. We examine the effects of this sampling strategy in Sec. 6.1.
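A brute-force sketch of the plane-constrained exemplar search: source patches must be unoccluded and on-plane, and the match cost is the SSD over observed pixels in CIE Lab, as in [4]. The patch size, scan stride, boolean mask conventions, and the assumption that the target center lies at least `half` pixels from the border are illustrative; a real implementation would also maintain the fill-front priorities.

```python
import cv2
import numpy as np

def best_source_patch(image_bgr, occl_mask, plane_mask, center, half=4, stride=2):
    """Find the source patch center minimizing SSD (CIE Lab, observed pixels only)
    against the target patch at `center` on the fill front. Masks are boolean arrays."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    y, x = center
    target = lab[y - half:y + half + 1, x - half:x + half + 1]
    known = ~occl_mask[y - half:y + half + 1, x - half:x + half + 1]   # observed pixels
    best, best_cost = None, np.inf
    h, w = occl_mask.shape
    for sy in range(half, h - half, stride):
        for sx in range(half, w - half, stride):
            s_occl = occl_mask[sy - half:sy + half + 1, sx - half:sx + half + 1]
            s_plane = plane_mask[sy - half:sy + half + 1, sx - half:sx + half + 1]
            if s_occl.any() or not s_plane.all():    # source must be unoccluded and on-plane
                continue
            cand = lab[sy - half:sy + half + 1, sx - half:sx + half + 1]
            cost = ((cand - target)[known] ** 2).sum()
            if cost < best_cost:
                best, best_cost = (sy, sx), cost
    return best      # center of the best patch, or None if no valid source exists
```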

5.2 Removing occlusions in 3D

After the textures are inpainted, we still need to remove the geometric aspect of the occlusion. Rather than simply discarding the 3D vertices in the occluded areas, we again make use of the planes, this time to fill in the occluded areas. All 3D points that project into the combined occlusion mask M are part of the occluding geometry and must be dealt with. For each such 3D point X, a line l is formed between the camera center C and X. This line is then extended to find the first plane it intersects:

ΠX = arg min_{Πi} d(C, Πi),    (3)

where d denotes the Euclidean distance between a point and a plane. The new position of X is selected as the point of intersection between l and ΠX. The effect of this is shown in Fig. 3.

Fig. 3. Left: original point cloud. Right: the geometric occlusions have been removed by transferring each occluded point to its corresponding plane.
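A sketch of this relocation step: each occluded point is pushed along its viewing ray to the nearest plane intersection in front of the camera. The (n, d) plane representation and the parallel-ray guard are conventions of the sketch.

```python
import numpy as np

def relocate_to_plane(C, X, planes):
    """Move occluded point X to the first plane hit by the ray from camera center C through X."""
    ray = X - C
    best_t, best_pt = np.inf, X
    for n, d in planes:                      # planes as (unit normal n, offset d), n.p + d = 0
        denom = n.dot(ray)
        if abs(denom) < 1e-9:                # ray parallel to this plane: no intersection
            continue
        t = -(n.dot(C) + d) / denom          # C + t*ray lies on the plane
        if 0 < t < best_t:                   # nearest intersection in front of the camera
            best_t, best_pt = t, C + t * ray
    return best_pt
```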


5.3 Mesh generation

We create a mesh from the filled point cloud using the Poisson reconstruction algorithm [10]. The method casts the remeshing into a spatial Poisson problem and produces a watertight, triangulated surface from a dense 3D point cloud. Its resilience to noise makes it a favorable approach for image-based 3D reconstruction. The Poisson reconstruction algorithm assumes a complete and closed surface, and covers areas of low point density with large triangles. We clean the mesh by removing all triangles with a side longer than 5 times the mean edge length of the mesh. Finally, we texture the mesh with the inpainted texture.
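The cleaning step can be written compactly. This sketch assumes an indexed triangle mesh (a vertex array plus faces as vertex-index triples) and interprets the threshold as 5× the mean edge length; pruning orphaned vertices afterwards is left out.

```python
import numpy as np

def drop_long_triangles(vertices, faces, factor=5.0):
    """Remove triangles with any edge longer than factor * mean edge length,
    pruning the oversized faces Poisson reconstruction stretches over sparse areas."""
    tri = vertices[faces]                                           # (F, 3, 3) corner coordinates
    edges = np.linalg.norm(tri - np.roll(tri, 1, axis=1), axis=2)   # (F, 3) edge lengths
    return faces[edges.max(axis=1) <= factor * edges.mean()]
```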

6 Results

Fig. 4 shows the same scene as Fig. 3 after the mesh has been generated and the updated texture has been added. Although an air conditioner is falsely classified as a vehicle and removed, the remaining artifacts are still minor compared to those created by the difficult reconstruction of reflective cars.

Fig. 4. Reconstructed facade before and after occlusion removal.

6.1 Planar constraint on inpainting

Fig. 5 shows an example of the effects of using knowledge of the facade and ground planes to rectify and constrain the source and masked regions. On the left is essentially the approach of Criminisi, where Φ = M̄, i.e. the source region is simply the complement of the occlusion mask. Both approaches have difficulty with the ground because the cars' shadows are not included in the mask. However, the planar constraints of our approach prevent inpainted content from appearing to grow out of the facade.

Fig. 5. Example inpainted region without (left) and with (right) planar constraints.

6.2 Mask selection

As discussed earlier, selecting the correct region for occlusion masking is critical. Leaving occluding pixels labeled as background means the boundary of the masked region M is already inside the occlusion, which prevents the inpainting algorithm from even initializing correctly. Creating too large a mask allows for a reasonable initialization but may cause the system to miss critical structures that would otherwise provide local cues to the unobserved facade. Fig. 6 shows a comparison between the raw ISM mask and the refined mask, while Fig. 7 shows the resulting images. Because ISM searches for parts and centers of an object, it is possible that boundaries or non-discriminative sections of an object may not be captured. The refinement enlarges the area out to stronger edges in the image, which are more likely to form the object boundary.

Fig. 6. Occlusion masks from ISM segmentation before (above) and after (below) refinement with Grab-cut.

7 Conclusion

In this work we have demonstrated the use of cognitive-level information about the scene to improve the quality of 3D reconstructions of urban environments. In particular, we have investigated how occluding objects such as cars or vegetation can be removed automatically, without the need for manual intervention. As future work, we want to further evaluate the proposed method on a larger scale, as well as integrate more semantic information into the 3D model, including occlusion filling.

Acknowledgments

This work was supported by the Flemish FWO project Context and Scene dependent 3D Modeling (FWO G.0301.07N) and the European project V-City (ICT-231199-VCITY).

References

1. Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D.B.: PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH) 28(3) (Aug 2009)



2. Benitez, S., Denis, E., Baillard, C.: Automatic production of occlusion-free rectified facade textures using vehicle-based imagery. In: Photogrammetric Computer Vision and Image Analysis. p. A:275 (2010)

3. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9), 1124–1137 (September 2004)

4. Criminisi, A., Perez, P., Toyama, K.: Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing 13, 1200–1212 (2004)

5. Dick, A.R., Torr, P.H.S., Cipolla, R.: Modelling and interpretation of architecture from several images. Int. J. Comput. Vision 60, 111–134 (November 2004)

6. Felzenszwalb, P., Girshick, R., McAllester, D.: Cascade object detection with deformable part models. In: Computer Vision and Pattern Recognition (2010)

7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN 0521540518, second edn. (2004)

8. Hedau, V., Hoiem, D., Forsyth, D.: Recovering the spatial layout of cluttered rooms. In: International Conference on Computer Vision (2009)

9. Hoiem, D., Efros, A.A., Hebert, M.: Putting objects in perspective. International Journal of Computer Vision 80(1), 3–15 (2008)

10. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proceedings of the Fourth Eurographics Symposium on Geometry Processing. pp. 61–70. SGP '06, Eurographics Association, Aire-la-Ville, Switzerland (2006)


11. Konushin, V., Vezhnevets, V.: Automatic building texture completion. In: Graphicon (2007)

12. Leibe, B., Leonardis, A., Schiele, B.: Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision 77(1-3), 259–289 (May 2008)

13. Rasmussen, C., Korah, T., Ulrich, W.: Randomized view planning and occlusion removal for mosaicing building facades. In: IEEE International Conference on Intelligent Robots and Systems (2005), http://nameless.cis.udel.edu/pubs/2005/RKU05

14. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23, 309–314 (August 2004)

15. Saxena, A., Chung, S.H., Ng, A.Y.: 3-D depth reconstruction from a single still image. International Journal of Computer Vision 76 (2007)

16. Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Van Gool, L.: Shape-from-recognition: Recognition enables meta-data transfer. Computer Vision and Image Understanding 113(12), 1222–1234 (2009)

17. Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Schiele, B., Van Gool, L.: Towards multi-view object class detection. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 1589–1596. IEEE Computer Society, Washington, DC, USA (2006)

18. Vergauwen, M., Van Gool, L.: Web-based 3D reconstruction service. Mach. Vision Appl. 17(6), 411–426 (2006)

19. Wang, L., Jin, H., Yang, R., Gong, M.: Stereoscopic inpainting: Joint color and depth completion from stereo images. In: Conference on Computer Vision and Pattern Recognition (2008)

20. Werner, T., Zisserman, A.: Model selection for automated reconstruction from multiple views. In: British Machine Vision Conference. pp. 53–62 (2002)

21. Werner, T., Zisserman, A.: New techniques for automated architecture reconstruction from photographs. In: European Conference on Computer Vision. vol. 2, pp. 541–555. Springer-Verlag (2002)

22. Xiao, J., Fang, T., Zhao, P., Lhuillier, M., Quan, L.: Image-based street-side city modeling. ACM Trans. Graph. 28, 114:1–114:12 (December 2009)


Fig. 7. Effects of segmentation refinement. Left: completed texture image; right: inset. From top to bottom: original image; inpainted image with ISM mask; inpainted image with refined mask.