

Combining Maps and Street Level Images for Building Height and Facade Estimation∗

Jiangye Yuan
Computational Sciences & Engineering Division
Oak Ridge National Laboratory
Oak Ridge, Tennessee 37831
[email protected]

Anil M. Cheriyadat
Computational Sciences & Engineering Division
Oak Ridge National Laboratory
Oak Ridge, Tennessee 37831
[email protected]

ABSTRACT

We propose a method that integrates two widely available data sources, building footprints from 2D maps and street level images, to derive valuable information that is generally difficult to acquire: building heights and building facade masks in images. Building footprints are elevated in world coordinates and projected onto images. Building heights are estimated by scoring projected footprints based on their alignment with building features in images. Building footprints with estimated heights can be converted to simple 3D building models, which are projected back onto images to identify buildings. In this procedure, accurate camera projections are critical. However, camera position errors inherited from external sensors commonly exist and adversely affect results. We derive a solution that precisely locates cameras on maps using correspondence between image features and building footprints. Experiments on real-world datasets show the promise of our method.

Keywords

data fusion; GIS map; building model

1. INTRODUCTION

The rapid development of data collection capabilities gives rise to multi-modal datasets. Fusing the available data leads to new solutions to problems that are challenging, or even impossible, with single-modal datasets. In particular, integrating geographic information with image data has been

∗This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.


Figure 1: Building height estimation and building facade identification using building footprints and street level images.

increasingly exploited [20, 17, 18], which significantly enhances the information extraction capability. In this paper, we utilize building footprint data from GIS maps and street level images, and develop a fully automatic method to estimate building heights and generate building facade masks in images. The concept is illustrated in Fig. 1.

Building heights are important information for many applications related to urban modeling. Stereo pairs of high resolution aerial images and aerial LiDAR data are typically used to acquire height information [22, 15, 5]. Although highly accurate results can be obtained, those types of data are not easily accessible due to the expensive collection process. We derive height information in a novel way. Through camera projections, we map edges of building footprints at different heights to images. The building height is determined by matching the projected edges with building rooflines in street view images. We design an indicator that incorporates color and texture information and reliably finds the projected edges aligned with rooflines. This method does not require aerial sensing data. In addition, by extending building footprints vertically with estimated heights, we can generate 3D building models. Despite being significantly simplified, such building models are useful in applications requiring low storage and fast rendering.

Identifying buildings in images is an important yet challenging task for scene understanding. A main strategy is to learn a classifier from labeled data [19, 16]. Deep neural networks trained with massive labeled data have shown excellent performance on segmenting semantic objects in images, including buildings [9, 13].


In contrast, we generate building masks by projecting the 3D building models onto images. This approach provides an alternative solution with the following advantages. First, it is capable of dealing with large appearance variations without complicated training and expensive labeled data collection. Second, each individual building in images is associated with a building on maps. Such a link allows attribute information from map data to be easily transferred to the images. For example, we can identify the buildings in an image corresponding to a specific restaurant if that information is given on the map. Note that while extracting buildings from images has been addressed through a similar strategy of projecting georeferenced models [6, 17], those studies use existing building models instead of 2D maps.

Our approach involves matching building features between maps and images. However, even when camera projection parameters are available from sensors, cross alignment between maps and images still poses a major challenge. A main reason is that the measurement of camera positions mostly relies on Global Positioning Systems (GPS), which tend to be affected by measurement conditions, especially in high density urban areas. Despite various new techniques combined with additional sensory information [14, 8, 11], GPS errors remain at a noticeable level. For example, a median position error of 5-8.5 meters is reported for current generation smartphones [21]. To overcome inaccurate spatial alignments, we propose an effective approach based on correspondence between map and image features to refine camera positions on maps. The method is fully automated and works reliably on real-world data.

In this work, we use building footprint layers from OpenStreetMap (OSM)¹, the largest crowd-sourced map, produced by contributors using aerial imagery and low-tech field maps. The OSM platform has millions of contributors and is able to generate map data with high efficiency. We use Google Street View images, which capture street scenes with world-wide coverage. Images can be downloaded through a publicly available API² with all camera parameters provided.

2. RELATED WORK

A number of previous studies have explored the idea of creating building models based on 2D maps, where a critical step is to obtain height information. In [10], building heights and roof shapes are set to a few fixed values based on known building usage. In more recent work, stereo pairs of high resolution aerial images and aerial LiDAR data are used to estimate height information [22, 15, 5]. However, since those types of data are not available for most areas of the world, the methods do not scale well. In this work, a new method is proposed to acquire building heights, which utilizes map data and street view images. Since the input data are widely available, the method can scale up to very large areas.

In order to correctly project map data onto images, images need to be registered with maps. This is a difficult task because it requires matching between abstract shapes on a ground plane and images from very different views. This has been pursued in a few recent studies.

¹ http://www.openstreetmap.org/
² https://developers.google.com/maps/documentation/streetview/

Figure 2: Building height estimation. (a) A map containing building footprints. The green marker represents the camera location. Blue edges indicate the part within the field of view. (b) A street level image. All blue lines are the projected edges with different elevation values.

In [4], omnidirectional images are used to extract building corner edges and plane normals of neighboring walls, which are then compared with structures on maps to identify camera positions. The method in [7] follows the same framework and aims to refine camera positions initially obtained from GPS. It works on a single view image, but it involves manual segmentation of buildings in images to obtain highly accurate building corner lines. Instead of 2D maps, digital elevation models (DEMs) have also been utilized, where building roof corners are extracted and matched with those in images [2]. However, DEMs are much less accessible than maps. It should be noted that existing techniques for camera pose estimation given 2D-to-3D point correspondences (the PnP problem) are not applicable here because point correspondences are not available. We propose a method that registers a single view image with maps through a voting scheme. The method is fully automated and works reliably on real-world data.

The rest of the paper is organized as follows. Section 3 presents the method to estimate building heights and identify buildings in images. The method for estimating accurate camera positions on maps is discussed in Section 4. In Section 5 we conduct experiments on large datasets and provide quantitative evaluation. We conclude in Section 6.

3. BUILDING HEIGHT AND FACADE ESTIMATION

For a building footprint in a georeferenced map, we have access to its 2D geographic coordinates (e.g., latitude and longitude). If we set the elevation of the building footprint to the ground elevation, we have 3D world coordinates. We then project the edges of the building footprint onto the image through camera projections. The projected edges should outline the building extent at its bottom. As we increase the elevation of the building footprint, the projected edges move toward the building roof. When the projected edges are aligned with the building rooflines, the corresponding elevation is the roof elevation of the building. The building height is simply the increase in elevation. This procedure is illustrated in Fig. 2, where we project footprint edges within the field of view.
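As a concrete illustration of this scan, the following minimal Python/NumPy sketch projects footprint corners lifted to a trial elevation, using the standard pinhole model formalized later in Eq. (1). The function names and interfaces are ours for illustration (the paper's implementation is in MATLAB), not part of the original method description.

import numpy as np

def project_points(P_world, K, R, C):
    """Pinhole projection of Nx3 world points to pixel coordinates,
    following Eq. (1): lambda * p = K R P - K R C."""
    P_cam = (R @ (P_world - C).T).T        # world frame -> camera frame
    p = (K @ P_cam.T).T                    # camera frame -> image plane
    return p[:, :2] / p[:, 2:3]            # perspective divide -> (u, v)

def project_footprint_at_height(corners_xy, ground_z, h, K, R, C):
    """Lift Nx2 footprint corners to elevation ground_z + h and project
    them, yielding the polyline that should trace the roofline when h
    equals the true building height."""
    z = np.full((len(corners_xy), 1), ground_z + h)
    return project_points(np.hstack([corners_xy, z]), K, R, C)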

We need to determine whether projected footprint edges are aligned with building rooflines.


Figure 3: Height estimation based on edgeness scores (edgeness versus height). Each color represents one building. Dashed lines indicate the actual heights of buildings.

A straightforward way is to compute image gradients and select the projected edges with the maximum gradient magnitude as the roofline. However, because lighting conditions change significantly and there exist many straight edges other than rooflines, gradient magnitude often fails to indicate the presence of rooflines.

Here we use an edgeness indicator that incorporates both color and texture information [12]. It is fast to compute and performs reliably. We first compute spectral histograms in a local window around each pixel location (the window size is set to 17 × 17 in our experiments). A spectral histogram of an image window is a feature vector consisting of histograms of different filter responses. It has been shown that spectral histograms are powerful for characterizing image appearance [12]. Here we use a spectral histogram concatenating histograms of the RGB bands and two Laplacian of Gaussian (LoG) filter responses. The two LoG filters are applied to the grayscale image converted from the RGB bands (σ is set to 0.5 and 1, respectively). The histogram of each band has 11 equally spaced bins. Local histograms can be computed efficiently using the integral histogram approach. Since histograms of RGB bands are sensitive to lighting conditions, we weight those histograms by 0.5. The edgeness indicator value at (x, y) is then defined as the sum of the two feature distances between pixel locations <(x + h, y), (x − h, y)> and <(x, y + h), (x, y − h)>, where h is half of the window side length. The χ² difference is used as the distance metric. This edgeness indicator reflects the appearance change between neighboring windows and thus better captures rooflines, which are essentially boundaries separating two distinctive regions. Fig. 3 shows a plot of edgeness scores versus height for three buildings, where the actual building heights are well captured by the maximum edgeness.
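The edgeness computation can be sketched in Python/SciPy as follows. This is a simplified illustration: it substitutes a per-bin box filter for the integral-histogram trick (equivalent output, less efficient), and all names are ours.

import numpy as np
from scipy.ndimage import gaussian_laplace, uniform_filter

def local_histograms(band, n_bins=11, win=17):
    """Per-pixel histogram of `band` over a win x win window, one channel
    per bin (a simple stand-in for integral histograms)."""
    edges = np.linspace(band.min(), band.max() + 1e-6, n_bins + 1)
    hists = [uniform_filter(((band >= edges[k]) & (band < edges[k + 1]))
                            .astype(float), size=win)   # windowed bin fraction
             for k in range(n_bins)]
    return np.stack(hists, axis=-1)                     # H x W x n_bins

def edgeness(rgb, win=17, rgb_weight=0.5):
    """Edgeness at (x, y): chi-squared distance between spectral histograms
    of the windows centered at (x +/- h, y) plus those at (x, y +/- h)."""
    gray = rgb.mean(axis=2)
    logs = [gaussian_laplace(gray, sigma=s) for s in (0.5, 1.0)]
    feats = [local_histograms(b, win=win) for b in logs]
    feats += [rgb_weight * local_histograms(rgb[..., c], win=win)
              for c in range(3)]                        # down-weight color
    F = np.concatenate(feats, axis=-1)
    h = win // 2
    chi2 = lambda a, b: ((a - b) ** 2 / (a + b + 1e-9)).sum(axis=-1)
    E = np.zeros(gray.shape)
    E[h:-h, h:-h] = (chi2(F[h:-h, 2 * h:], F[h:-h, :-2 * h]) +
                     chi2(F[2 * h:, h:-h], F[:-2 * h, h:-h]))
    return E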

The main steps of the proposed method are described as follows.

1. Identify visible building edges on the map. Based on camera parameters, we determine the field of view on the map. We select the edges of building footprints that are not occluded by other buildings. Buildings far away from the camera may not show clear rooflines in an image; therefore, we only consider buildings within a distance of 60 meters.

2. Project selected edges onto the image. The 2D geographic coordinates of the edges are known. We assign ground elevations to the edges and map them onto the image through a projective camera projection, which results in polylines in the image. Highly accurate ground elevations can be obtained from public geographic databases.

3. Determine building heights. For each visible building, we gradually increase the elevation of the selected edges and project them onto the image. The projected edges scan the image as a series of polylines. When the sum of the edgeness scores along a polyline reaches its maximum, the corresponding elevation gives the building height (see the code sketch after this list). There are cases where taller buildings behind the target building are also visible, but they usually have rooflines with different lengths and shapes, which do not give maximum edgeness values.

4. Process a set of images. The height of a building can be estimated sufficiently well from one image as long as the building rooflines are visible in the image. If a collection of images is available, where a building appears in multiple images, we can utilize the redundancy to improve accuracy. For each building, we create a one-dimensional array with each element representing a height value and initialize all elements to zero. After scanning an image for a building, we add one to the element corresponding to the estimated height value. Once all images containing the building are processed, we choose the element with the maximum value to obtain the building height. In some images where only the lower part of a building can be seen, estimated heights are incorrect. We use a simple technique to deal with this issue. Because in such a case the projected edges are within building facades, the edgeness scores are generally small. We ignore the estimated height if the corresponding indicator value is smaller than a threshold, so that incorrect estimates are not recorded in the array.

5. Generate building facade masks in images. Simple 3D building models can be generated by extruding boxes from building footprints with estimated heights. Buildings in the field of view are projected back onto images, which labels the facades of each building. To address the visibility problem (building models partially occluded by others), we use the painter's algorithm, where the farthest facade is projected first.
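The following sketch outlines steps 3 and 4 in code: scan discrete heights (3-100 m at a 0.2 m step, the setting reported in Section 5), score each projected polyline against the edgeness map, and fuse per-image estimates through a vote array. It reuses project_footprint_at_height from the earlier sketch; polyline_score, fuse_views, and the min_score threshold are illustrative names and values, not the paper's.

import numpy as np

def polyline_score(E, poly):
    """Sum edgeness values along the straight segments between
    consecutive projected vertices (step 3)."""
    total = 0.0
    for (u0, v0), (u1, v1) in zip(poly[:-1], poly[1:]):
        n = int(max(abs(u1 - u0), abs(v1 - v0))) + 1
        us = np.linspace(u0, u1, n).round().astype(int)
        vs = np.linspace(v0, v1, n).round().astype(int)
        ok = (us >= 0) & (us < E.shape[1]) & (vs >= 0) & (vs < E.shape[0])
        total += E[vs[ok], us[ok]].sum()       # sample in-image pixels only
    return total

def estimate_height(E, corners_xy, ground_z, K, R, C,
                    heights=np.arange(3.0, 100.0, 0.2)):
    """Return the candidate height whose projected polyline maximizes
    the summed edgeness, together with that score."""
    scores = [polyline_score(E, project_footprint_at_height(
        corners_xy, ground_z, h, K, R, C)) for h in heights]
    i = int(np.argmax(scores))
    return heights[i], scores[i]

def fuse_views(estimates, heights=np.arange(3.0, 100.0, 0.2), min_score=0.0):
    """Step 4: vote over discrete heights across images; estimates with
    weak edgeness (e.g., only the facade was visible) are discarded."""
    votes = np.zeros(len(heights))
    for h_est, score in estimates:
        if score > min_score:
            votes[np.argmin(np.abs(heights - h_est))] += 1
    return heights[int(np.argmax(votes))]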

Our method treats building roofs as flat planes. For buildings with other roof shapes, the resulting height generally lies between the top and the base of the roof. Since most non-flat roofs have a small height, our method can still give an estimate close to the mean height. The method may not work well when building rooflines in an image are completely occluded by other objects (e.g., trees and other buildings). However, with multiple views available, the method can exploit the rooflines of the same building visible in other images taken from different angles. If a building is completely obstructed on the map but visible in images because it is taller than the obstructing buildings, the method will skip that building.

4. CAMERA LOCALIZATION ON MAPS

As discussed earlier, we need to register images with maps for accurate projections.


Figure 4: Impact of camera position errors. (a) A building footprint edge projected onto an image with a correct camera position. A series of elevation values are used. (b) and (c) The projections with camera positions moved 3 meters away from the correct one.

In this paper, we focus on camera positions, whose measurement is much noisier than orientation measurements. To illustrate the effect of camera position errors on projection results, Fig. 4 shows building footprint edges projected onto images with shifted camera positions. As can be seen, there are clear misalignments between projected footprint edges and buildings in the images even though the camera positions are shifted by only 3 meters. Such misalignments may cause projected building footprints to fail to match the corresponding rooflines, and hence incorrect height estimates. They also lead to incorrect building facade masks. In the following, we propose a method that takes a possibly noisy initial position and estimates the accurate position on the map.

4.1 Camera position from point correspondence

A 3D point P = (X, Y, Z)^T in world coordinates can be projected onto a pixel location p = (u, v, 1)^T in the image plane through the camera projection equation:

\[
\lambda p = [K \mid \mathbf{0}_3]
\begin{bmatrix} R & -RC \\ \mathbf{0}_3^T & 1 \end{bmatrix}
\begin{bmatrix} P \\ 1 \end{bmatrix}
= KRP - KRC. \tag{1}
\]

λ is the scaling factor, K the camera intrinsic matrix, R the camera rotation matrix, and C the location of the camera center in world coordinates. Note that −RC equals the camera translation. Given a pair of p and P, we have

\[
C' = P + \lambda R^{-1} K^{-1} p. \tag{2}
\]

That is, given a correspondence between a pixel location and a 3D point in world coordinates, the possible camera positions C′ lie on a 3D line defined by (2).

In this work, we assume that camera position errors lie mainly in a horizontal plane, because vertical camera positions can be reliably obtained from existing ground elevation data. Let C′ = (X_C, Y_C, Z_C)^T. We can discard the Z dimension and simplify (2) to

\[
\begin{bmatrix} X_C \\ Y_C \end{bmatrix}
= \begin{bmatrix} X \\ Y \end{bmatrix}
+ \lambda \begin{bmatrix} \Delta X \\ \Delta Y \end{bmatrix}, \tag{3}
\]

where ∆X and ∆Y are truncated from (∆X, ∆Y, ∆Z)^T = R⁻¹K⁻¹p. This defines a line on a 2D plane. If correspondences for two pairs of points are given, the camera position is uniquely determined as the intersection of the two lines.
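A minimal Python sketch of this geometry under the planar-error assumption; the helper names are ours.

import numpy as np

def camera_position_line(P, p, K, R):
    """Eqs. (2)-(3): for a correspondence between world point P and pixel
    p = (u, v), possible camera centers lie on the 2D line
    (X, Y) + lambda * (dX, dY); the Z component is dropped."""
    d = np.linalg.inv(R) @ np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    return np.asarray(P[:2], dtype=float), d[:2]   # anchor point, direction

def intersect_lines(a0, d0, a1, d1):
    """Unique camera position from two correspondences: solve
    a0 + s*d0 = a1 + t*d1 for (s, t), assuming non-parallel lines."""
    s, _ = np.linalg.solve(np.column_stack([d0, -d1]), a1 - a0)
    return a0 + s * d0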

4.2 Camera position from image and map correspondence

Based on the above analysis, we can determine camera positions from point correspondences between images and maps. We use corners of building footprints, which can be easily identified from map data. However, it is difficult to find the corresponding points in an image. To provide better features for matching, we place a vertical line segment on each building footprint corner (the two end points have the same 2D coordinates but different elevation values). We refer to such a line as a building corner line (BCL). The line length is fixed. Note that at this stage the building height is not available. When a BCL is projected onto an image with an accurate camera position, it should be well aligned with building edges in the image.

Under the assumption that position errors are within a horizontal plane, for a BCL projected onto an image with an inaccurate camera position, the displacement is along image columns. We first project a BCL onto an image using the initial camera position. Next, we horizontally move the projected BCL in both directions within a certain range. The range is set to be inversely proportional to the distance from the BCL to the camera position. At each moving step, we have a pair of P, the building footprint corner, and p, the projected point moved along columns. Using (2), we can compute a line on the map that represents potential camera positions. We create an accumulator, which is a 2D array centered at the initial camera position. For each line, we increment the values of the bins the line passes through. The increment is determined by how well the moved BCL is aligned with building edges. At the end, the bin with the highest value gives the most likely camera position, which results in the best match between BCLs and building edges in images. The procedure is illustrated in Fig. 5.
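The voting scheme can be sketched as a small Hough-style accumulator. The grid below mirrors the 0.3 m map resolution and 40 × 40 window reported in Section 5; the naive line rasterization and all names are our assumptions, and camera_position_line comes from the previous sketch.

import numpy as np

def vote_camera_position(bcl_votes, C0, half_extent=20, res=0.3):
    """Accumulate over a (2*half_extent+1)^2 grid centered at the initial
    position C0 (about +/- 6 m at 0.3 m resolution). bcl_votes yields
    (P, p, K, R, w): a footprint corner, a shifted BCL projection, camera
    parameters, and the alignment weight of the moved BCL."""
    n = 2 * half_extent + 1
    acc = np.zeros((n, n))
    xs = C0[0] + res * (np.arange(n) - half_extent)
    ys = C0[1] + res * (np.arange(n) - half_extent)
    for P, p, K, R, w in bcl_votes:
        a, d = camera_position_line(P, p, K, R)
        if abs(d[0]) >= abs(d[1]):                 # rasterize along x
            for i, x in enumerate(xs):
                lam = (x - a[0]) / d[0]
                j = int(round((a[1] + lam * d[1] - ys[0]) / res))
                if 0 <= j < n:
                    acc[j, i] += w
        else:                                      # rasterize along y
            for j, y in enumerate(ys):
                lam = (y - a[1]) / d[1]
                i = int(round((a[0] + lam * d[0] - xs[0]) / res))
                if 0 <= i < n:
                    acc[j, i] += w
    j, i = np.unravel_index(int(np.argmax(acc)), acc.shape)
    return np.array([xs[i], ys[j]]), acc           # best position, votes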

When moving a projected BCL, we need to know whether it is aligned with a building edge so that proper values can be assigned to the accumulator. Detecting building edges is another challenging task. In an image of a street scene, the contrast of building edges varies significantly due to changes in lighting conditions. Building edges can be fully or partially occluded by other objects. Moreover, there exist a large number of edges from other objects. Although the voting-based method is able to tolerate a reasonable amount of errors in detection results, a large number of false positives and false negatives can still confuse the method.

We exploit the idea of line support regions [3], which is widely used for straight line extraction. A line support region consists of pixels with similar gradient orientations, from which a line segment can be extracted. A common practice for building edge extraction in previous work [4, 7] is to assume that building edges have relatively large contrast. However, we observe that in many cases there are only slight color changes around building edges. Line support regions are formed based on the consistency of gradient orientations regardless of magnitudes and are hence better suited to this task.

Since street level images are captured by a camera with its Y-axis orthogonal to the ground, building edges in images should be close to vertical lines. We select pixels with gradient orientations within 22.5° of the vertical direction and find connected regions with large vertical extents and small horizontal extents (in the experiments the two thresholds are set to 50 and 20 pixels, respectively).


Figure 5: Illustration of camera position estimation. By moving projected BCLs (blue lines) in images, we compute potential camera positions (blue lines) on the map, which add values to an accumulator based on an alignment measure between projected BCLs and building edges. The green marker indicates the initial camera position, and the black window shows the extent of the accumulator. In the accumulator, red pixels represent large values.

When a projected BCL hits a region center, the corresponding accumulator bins are incremented by one. A Gaussian density kernel is placed on each region center, and a projected BCL close to a center point casts a value equal to the maximum density value it reaches.
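A sketch of this extraction using SciPy connected-component labeling. The 22.5°, 50-pixel, and 20-pixel values follow the text; treating near-vertical edges as pixels whose gradient direction is near horizontal is our interpretation, and plain connected components are a simplification of line support region grouping.

import numpy as np
from scipy import ndimage

def vertical_line_support_regions(gray, tol_deg=22.5, min_h=50, max_w=20):
    """Return centers (row, col) of tall, narrow regions of consistent
    near-vertical edge orientation. A vertical edge has a horizontal
    intensity gradient, so we keep pixels whose gradient angle folds to
    within tol_deg of zero degrees."""
    gy, gx = np.gradient(gray.astype(float))
    ang = np.degrees(np.arctan2(gy, gx))           # gradient direction
    folded = ((ang + 90.0) % 180.0) - 90.0         # fold into [-90, 90)
    labels, _ = ndimage.label(np.abs(folded) <= tol_deg)
    centers = []
    for sl in ndimage.find_objects(labels):
        h = sl[0].stop - sl[0].start               # vertical extent
        w = sl[1].stop - sl[1].start               # horizontal extent
        if h >= min_h and w <= max_w:
            centers.append(((sl[0].start + sl[0].stop) / 2.0,
                            (sl[1].start + sl[1].stop) / 2.0))
    return centers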

4.3 Integration with height estimation

Given an image and a map, the procedure includes two steps: first, apply the voting-based method to determine the camera position; second, estimate building heights by examining projected building footprints. In certain scenarios, a set of line support regions that do not correspond to building edges happens to form a strong peak, resulting in an incorrect camera location. To address this issue, we integrate height estimation with camera position estimation, which utilizes the information of building footprint edges in addition to corners to obtain accurate positions and simultaneously produces building heights.

From the resulting accumulator, we select the top k local maxima as candidate camera positions. For each of them, we project building footprint edges to match building rooflines in the image, as described in Section 3. We calculate the sum of edgeness scores over all projected building footprint edges that match rooflines. We select the candidate with the largest sum, for which rooflines in the image should be best aligned with building footprints. This strategy integrates two separate steps and reduces ambiguities in the correspondence between BCLs and building edges. Algorithm 1 summarizes the procedure for processing a street view image.

Algorithm 1 Building height estimation with camera position refinement

 1: Extract line support regions
 2: Compute an accumulator using the voting method; select the locations of the top k peaks pos(i), i = 1, . . . , k
 3: for i := 1 to k do
 4:     Set the camera position to pos(i)
 5:     Identify visible buildings on the map
 6:     for each visible building do
 7:         Project building footprint edges onto the image and estimate the height based on edgeness values
 8:     end for
 9:     Compute S(i): the sum of edgeness values of projected edges matching rooflines
10: end for
11: Output the building heights of the candidate with max(S)

5. EXPERIMENTS

To evaluate the proposed method, we compile a dataset consisting of Google Street View images and OSM building footprints.

Images are collected by a camera mounted on a moving vehicle. The camera faces the front of the vehicle so that more building rooflines are visible. The image size is 905 × 640 pixels. The intrinsic and extrinsic camera parameters are provided, with the extrinsic parameters derived from GPS, wheel encoder, and inertial navigation sensor data [1]. We use 400 images covering an area in San Francisco, CA. Map data of the same area are downloaded from OSM in vector form. We convert the map data into an image plane with a spatial resolution of 0.3 meter. Geolocation information of the map data and camera positions is converted into the UTM coordinate system for easy distance computation. Although camera positions in the dataset are more accurate than those based solely on GPS, we observe clear misalignments when projecting map data onto images.

We apply the proposed method to the dataset with the following parameter settings.


Figure 7: Images causing incorrect height estimation. Blue lines represent detected rooflines.

Error tolerance                2 m      3 m      4 m
Accuracy (refined positions)   72.2%    83.6%    90.1%
Accuracy (initial positions)   67.3%    78.6%    87.7%

Table 1: Accuracy of height estimation under different error tolerances.

For camera position estimation, we use an accumulator corresponding to a local window on the image plane converted from the map. The local window is centered at the initial camera position and is 40 × 40 pixels in size, which corrects position errors of up to 6 meters. When there are very few detected line support regions corresponding to building edges, estimated camera positions are not reliable. To address this issue, we set an accumulator threshold (set to 1.5) and use estimated camera positions only when the detected peak is above the threshold. We select the 5 largest peaks in an accumulator as candidate camera positions. To scan an image with projected building footprints for height estimation, we use discrete elevation values from 3 to 100 meters with a step size of 0.2 meter, which covers the range of building heights in the dataset.

Fig. 6 shows a subset of the resulting building models around a city block. We use building height information derived from LiDAR data as ground truth. Building models using ground truth heights are also displayed in Fig. 6. As we can see, the two sets are very close to each other. There is one building whose estimated height differs significantly from ground truth, indicated by a red arrow. We examine the images from which the building height is derived and find that the error is caused by low image quality combined with unusual lighting conditions. Two such images are shown in Fig. 7, where the proposed method mistakes shadow borders for rooflines because the actual rooflines have extremely low contrast.

To provide quantitative measurements, we calculate errors by comparing the height values from our method with ground truth. We set different error tolerance values (the maximum allowable deviation from ground truth) and compute the percentage of the buildings that have correct height estimation, which is reported in Table 1. The results agree with the observation in Fig. 6. We also evaluate the heights obtained without applying camera position estimation and show accuracy measurements in the table. As can be seen, refined camera positions lead to a clear improvement over initial ones.

With images aligned with maps, we label buildings by projecting visible parts of building models onto images.

Figure 9: Building facade masks obtained by projecting building models with the raw camera position (left) and the refined position (right).

Fig. 8 presents example results, where the facade masks obtained by projecting building models are well aligned with building facades in the images. To assess the quality of segmented buildings, we select 100 images and manually generate building masks. Each building has a unique label corresponding to its footprint on the map. We calculate an accuracy rate as correctly labeled pixels divided by overall building pixels, where pixels of a building labeled as other buildings are also penalized. Our results reach an accuracy rate of 85.3%. Camera position accuracy is particularly important for facade mask quality. We find that with raw positions the accuracy drops by 9.2%. Fig. 9 shows facade masks from projections using raw and refined camera positions. It can be seen that with the raw position, projected buildings deviate severely from the corresponding buildings in the images.
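Our reading of this accuracy rate as a short Python sketch, where pred and gt are per-pixel building-ID maps (0 = background); the function name is ours.

import numpy as np

def facade_mask_accuracy(pred, gt):
    """Fraction of ground-truth building pixels whose predicted building
    ID matches; a pixel assigned to a different building counts as
    wrong, per the penalty described above."""
    building = gt > 0
    return float((pred[building] == gt[building]).mean())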

We implement the method using MATLAB on a 3.2-GHz Intel processor. The current version of the code takes on average 3 seconds to process one image, including camera position estimation and height estimation. The efficiency can be further enhanced by using parallel computing resources because each image is processed independently.

6. CONCLUSIONS

We have presented a new approach that fuses 2D maps and street level images, both of which are easily accessible, to perform challenging tasks: building height estimation and building facade identification in images. The method makes effective use of complementary information in data generated from two distinct platforms. Due to the wide availability of the input data, the method is highly scalable. Our experiments show that the method performs reliably on real world data.

The proposed method to estimate camera positions can work as an add-on for enhancing GPS accuracy. Although in this work we do not measure exact improvements in location accuracy because of the lack of ground truth, we do find that even a one meter shift of an estimated camera position causes noticeable misalignments between projected building footprints and buildings in images. Worth mentioning is that this method is particularly suited for dense urban neighborhoods, which are often the most challenging environment for acquiring accurate GPS measurements.

We find that more accurate roofline detection and building edge detection can further improve results.


Figure 6: 3D building models. Left: building models generated by our automatic method. Right: building models using LiDAR-derived height information (ground truth). The red arrow indicates one building with incorrect height estimation.

Figure 8: Example results of projecting building models onto images. Different buildings are represented by distinct colors.


In future work, we will investigate supervised learning based approaches, where a building boundary detector learned from labeled data is expected to provide more meaningful boundary maps.

7. REFERENCES

[1] D. Anguelov, C. Dulong, D. Filip, C. Frueh, S. Lafon, R. Lyon, A. Ogale, L. Vincent, and J. Weaver. Google Street View: Capturing the world at street level. Computer, 43(6):32–38, 2010.
[2] M. Bansal and K. Daniilidis. Geometric urban geo-localization. In CVPR, 2014.
[3] J. B. Burns, A. R. Hanson, and E. M. Riseman. Extracting straight lines. TPAMI, 8(4):425–455, 1986.
[4] T.-J. Cham, A. Ciptadi, W.-C. Tan, M.-T. Pham, and L.-T. Chia. Estimating camera pose from a single urban ground-view omnidirectional image and a 2D building outline map. In CVPR, 2010.
[5] L.-C. Chen, T.-A. Teo, C.-Y. Kuo, and J.-Y. Rau. Shaping polyhedral buildings by the fusion of vector maps and LiDAR point clouds. Photogrammetric Engineering & Remote Sensing, 74(9):1147–1157, 2008.
[6] P. Cho and N. Snavely. 3D exploitation of 2D imagery. Lincoln Laboratory Journal, 20(1):105–137, 2013.
[7] H. Chu, A. Gallagher, and T. Chen. GPS refinement and camera orientation estimation from a single image and a 2D map. In CVPR Workshops, 2014.
[8] N. M. Drawil and O. Basir. Intervehicle-communication-assisted localization. IEEE Transactions on Intelligent Transportation Systems, 11(3):678–691, 2010.
[9] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.
[10] N. Haala and K.-H. Anders. Fusion of 2D-GIS and image data for 3D building reconstruction. International Archives of Photogrammetry and Remote Sensing, 31:285–290, 1996.
[11] K. Jo, K. Chu, and M. Sunwoo. Interacting multiple model filter-based sensor fusion of GPS with in-vehicle sensors for real-time vehicle positioning. IEEE Transactions on Intelligent Transportation Systems, 13(1):329–343, 2012.
[12] X. Liu and D. Wang. A spectral histogram model for texton modeling and texture discrimination. Vision Research, 42(23):2617–2634, 2002.
[13] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[14] H. Qi and J. B. Moore. Direct Kalman filtering approach for GPS/INS integration. IEEE Transactions on Aerospace and Electronic Systems, 38(2):687–693, 2002.
[15] F. Tack, G. Buyuksalih, and R. Goossens. 3D building reconstruction based on given ground plan information and surface models extracted from spaceborne imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 67:285–290, 2012.
[16] J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. In ECCV, pages 352–365, 2010.
[17] C.-P. Wang, K. Wilson, and N. Snavely. Accurate georegistration of point clouds using geographic data. In International Conference on 3D Vision (3DV), pages 33–40, 2013.
[18] S. Wang, S. Fidler, and R. Urtasun. Holistic 3D scene understanding from a single geo-tagged image. In CVPR, pages 3964–3972, 2015.
[19] J. Xiao, T. Fang, P. Zhao, M. Lhuillier, and L. Quan. Image-based street-side city modeling. ACM Transactions on Graphics, 28(5):114, 2009.
[20] J. Yuan and A. M. Cheriyadat. Road segmentation in aerial images by exploiting road vector data. In COM.Geo, pages 16–23, 2013.
[21] P. A. Zandbergen and S. J. Barbeau. Positional accuracy of assisted GPS data from high-sensitivity GPS-enabled mobile phones. Journal of Navigation, 64(3):381–399, 2011.
[22] L. Zebedin, J. Bauer, K. Karner, and H. Bischof. Fusion of feature- and area-based information for urban buildings modeling from aerial imagery. In ECCV, 2008.