
QuickSketch: Building 3D Representations in Unknown Environments using Crowdsourcing

Fawad Ahmad∗, Hang Qiu∗, Xiaochen Liu∗, Fan Bai†, Ramesh Govindan∗
∗University of Southern California

†General Motors Inc.

Abstract—Disaster and emergency response operations require rapid situational assessment of the affected area for timely and efficient rescue operations. A 3D map, collected after a disaster, can provide such awareness, but constructing this map quickly is a significant challenge. In this paper, we explore the design of a capability called QuickSketch that rapidly builds 3D representations of an unknown environment using crowdsourcing. QuickSketch employs multiple vehicles equipped with 3D sensors (stereo cameras) to explore different areas of an unknown territory and then combines 3D data from all the vehicles to build a single 3D map. QuickSketch annotates the 3D map with important landmarks and enables rapid contextualization of visual intelligence (photos) received from first responders and disaster victims to guarantee timely backup and rescue operations. Our evaluation results show that QuickSketch can stitch a 3D map for a large campus with sub-meter mapping accuracy under certain conditions, position landmarks an order of magnitude more accurately than other image matching techniques, and contextualize visual intelligence accurately.

I. INTRODUCTION

Situational awareness of disaster-affected areas (e.g., a city, town, or a large building) requires up-to-date knowledge of the environment's layout, as well as the positions of common landmarks (e.g., street signs, fire hydrants, light poles, etc.) for timely and efficient rescue operations. It is equally important to be able to rapidly contextualize images (determine where they have been taken) received from first responders and disaster victims to provision backup and first aid. This information, when represented in a map, can be useful for planning rescue operations.

Leveraging the increasing availability of 3D sensors (e.g., stereo cameras), we explore how to rapidly construct, post-disaster, a 3D map of a disaster area (§II). Such a map should (a) contain an accurate 3D model of the environment, (b) indicate the precise positioning of common landmarks within the model, and (c) be able to contextualize visual intelligence (such as images) within the model. Constructing this 3D map rapidly remains a challenge, since collecting a map can require a traversal of the entire area, and processing the sensor data to generate the map can require significant computing resources.

Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

In this paper, we consider the design of a capability called QuickSketch that uses crowdsourcing to collect map segments concurrently, then develops algorithms to stitch these segments together, to detect and annotate landmarks, and to accurately position visual intelligence (§III). To collect and stitch map segments, QuickSketch leverages an open-source visual SLAM algorithm, ORB-SLAM2 [17]. It stitches short, self-corrected 3D map chunks from different participants to prevent the accumulation of sensor and algorithmic errors and thus build an accurate 3D map. To annotate landmarks on the 3D map, QuickSketch uses an off-the-shelf 2D object recognition module [20] to detect important landmarks, converts their 2D pixels to 3D voxels, and positions them in the 3D map relative to the vehicle's current position. To quickly contextualize visual intelligence, QuickSketch detects landmarks in the received photos and performs a guided search along those landmarks in the 3D map.

Using QuickSketch, we have been able to construct a 3D map of our campus. Our evaluations using this map (§IV) show that it can achieve sub-meter mapping accuracy by carefully calibrating cameras, using short (50m) segments, and ensuring closed-loop maps1 that permit map self-correction. Using an analytical model, we show that it is possible to assemble the map in the order of minutes with modest collection and computing resources. QuickSketch has high coverage of landmarks on our campus, and can position these landmarks to within 0.89 m. It can also contextualize images, in some cases in under a second and to within 1.5 m of where the image was actually taken.

II. BACKGROUND, MOTIVATION, CHALLENGES

In this section, we first present some background for QuickSketch, then discuss the capability that it enables and the associated challenges.

A. Background

3D Sensors. Increasingly, modern vehicles are being equipped with 3D sensors such as LiDAR, radar, and stereo cameras. These sensors are capable of depth perception, so they can generate 3D representations of the environment within their field of view. Specifically, these sensors generate data structures called 3D point clouds, which define every point perceived by the sensor in terms of its 3D position (x, y, and z) with respect to the center of the sensor.

1Loop closure refers to the ability of a robot to recognize a previously visited place after exploring new terrain. Loop closure counters the drift error accumulated during the traversal of the robot. SLAM algorithms use loop closure to minimize errors in tracking the robot and mapping the environment.



Stereo cameras. In this paper, we use stereo cameras, which consist of two mono cameras placed at a known distance (the baseline distance) from each other to give two slightly different images of the same environment. Stereo matching algorithms make use of the disparity in the positions of all points in both images, along with the stereo camera baseline, to estimate the depth of common points in both images, resulting in a 3D point cloud representing a single frame.
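To make the geometry concrete, the sketch below (illustrative only, not QuickSketch's code; the focal length, baseline, and disparity values are hypothetical, with the ~0.12 m baseline loosely modeled on the ZED) computes per-point depth from disparity using the standard pinhole-stereo relation depth = f·B/d:

```python
import numpy as np

def stereo_depth(disparity_px: np.ndarray, focal_px: float, baseline_m: float) -> np.ndarray:
    """Pinhole-stereo triangulation: depth = f * B / d.

    disparity_px: per-pixel horizontal offset between left and right images.
    focal_px:     focal length in pixels; baseline_m: camera separation in meters.
    """
    d = np.asarray(disparity_px, dtype=float)
    depth = np.full_like(d, np.inf)
    valid = d > 0                     # zero disparity means a point at infinity
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Hypothetical numbers: a 700 px focal length and a ~0.12 m baseline give a
# point with 8.4 px disparity a depth of 700 * 0.12 / 8.4 = 10 m.
print(stereo_depth(np.array([8.4]), focal_px=700.0, baseline_m=0.12))  # [10.]
```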

SLAM. Simultaneous Localization and Mapping (SLAM) algorithms localize a robot while at the same time building a map of an unknown environment. SLAM algorithms essentially take as input sensor data about the external environment and use that to estimate the state (pose) of the robot while incrementally building a map. In this paper, we use a SLAM algorithm as a building block for mapping an unknown environment. Specifically, we use ORB-SLAM2 [17], a visual SLAM (camera input) algorithm that ranks the highest on the KITTI [12] benchmark among open-source stereo SLAM algorithms.

The primary sensor input to ORB-SLAM2 is either a mono, stereo, or RGB-D camera feed. From each frame, ORB-SLAM2 extracts a certain number of visual features (key-points) from the camera input and tracks them in subsequent frames to determine whether they belong to static or dynamic objects. ORB-SLAM2 also uses these features and other computer vision techniques like bundle adjustment to estimate the motion of the camera. After a specific number of frames, ORB-SLAM2 arrives at a keyframe. In a keyframe, it stores the most stable tracked visual features as map points along with their descriptors in a map file. It also associates the optimized pose and intrinsic properties of the camera with the same keyframe in the map file. It then incrementally builds a map file consisting of keyframes and map points along with other data structures for localization. It also uses a bag of visual words (BoVW) to represent images as histograms of visual descriptors and uses these to localize a camera in a given map file.

ORB-SLAM2 can localize a robot in a region given a map for that region. For example, suppose we want to localize an agent/robot equipped with a camera in an environment for which we've built a map using ORB-SLAM2. To do this, ORB-SLAM2 extracts visual features from the input image and converts them into a bag of visual words. It then queries its recognition database and other data structures for the keyframe that is most similar to the current image based on their bags of visual words. Once this operation is successful, ORB-SLAM2 is able to localize the given agent/robot in its map file by determining the transformation between the visual features of the input image and the map points associated with the keyframe. After the robot/agent has been localized, ORB-SLAM2 continuously tracks its motion.
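As an illustration of this retrieval step, the toy sketch below implements simplified BoVW candidate selection: each image is reduced to quantized visual-word IDs, histograms are compared by cosine similarity, and the best-matching stored keyframe is returned. This is a stand-in for ORB-SLAM2's recognition database, not its actual implementation, and the vocabulary size and data are hypothetical.

```python
import numpy as np

def bovw_histogram(word_ids: np.ndarray, vocab_size: int) -> np.ndarray:
    """Represent an image as an L2-normalized histogram over visual-word IDs."""
    hist = np.bincount(word_ids, minlength=vocab_size).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def best_keyframe(query_words, keyframe_words, vocab_size=1000):
    """Index and cosine similarity of the stored keyframe most similar to the query."""
    q = bovw_histogram(query_words, vocab_size)
    scores = [q @ bovw_histogram(kf, vocab_size) for kf in keyframe_words]
    return int(np.argmax(scores)), max(scores)

# Hypothetical data: each "image" is an array of visual-word IDs for its features.
rng = np.random.default_rng(0)
keyframes = [rng.integers(0, 1000, size=500) for _ in range(100)]
# A query that shares 400 of keyframe 42's words should retrieve keyframe 42.
query = np.concatenate([keyframes[42][:400], rng.integers(0, 1000, size=100)])
print(best_keyframe(query, keyframes))  # -> (42, <high similarity>)
```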

QuickSketch Goal. Accurate situational awareness is crucial for strategic and tactical planning of rescue missions in disaster-struck areas. An up-to-date map can provide important contextual information for these tasks. Publicly available representations such as OpenStreetMap (OSM [19]) and Google Earth [13] are infrequently updated, so they may be inaccurate. Military-grade GPS-based positioning might be very accurate, but does not give contextual information. We posit that it would be most useful for disaster recovery missions to have an accurate 3D map of the theater of operations before the rescue mission begins. Such a map should be annotated with common landmarks such as stop signs, streetlights, speed limits, and so forth: knowing the positions of these landmarks can be crucial for rescue planning. Finally, disaster operation centers need to be able to contextualize visual intelligence, namely, find the location in the 3D map of pictures taken in the area by rescue personnel and disaster victims.

Approach and Challenges. QuickSketch's approach is based on the observation that 3D sensors are starting to be available in civilian vehicles, so it is possible that they will be available on vehicles used by rescue squads as well. Given this, QuickSketch would work as follows:

• Before commencing a rescue operation in an area, the operation commander would recruit civilian or rescue squad vehicles for reconnaissance. The actual recruitment procedures are beyond the scope of this paper.

• Each vehicle would collect 3D sensor information (we call these map segments) from disjoint areas and deliver or transmit them to a central disaster relief center.

• At the relief center, QuickSketch would stitch together these map segments to obtain a complete 3D map of the area, while at the same time detecting and annotating landmarks.

• Subsequently, when given an image taken somewhere in the area, QuickSketch would be able to rapidly contextualize the image, i.e., identify the location where that image was taken.

The design of QuickSketch must surmount three challenges: a) how to accurately and rapidly stitch together map segments to obtain the 3D map; b) how to detect and position common landmarks of interest in the 3D map; and c) how to quickly contextualize visual information within the 3D map.

III. QUICKSKETCH DESIGN

In this section, we describe how QuickSketch surmounts these challenges. We begin by presenting an overview of QuickSketch's components.

A. Overview

QuickSketch builds a 3D map of an unknown environment in two steps: data collection and map construction. The data collection phase employs multiple vehicles (either civilian or rescue squad assets) to traverse the unknown territory and collect raw stereo camera data. The vehicles can either transmit this data through wireless communication or return to the relief center and offload the data. The map construction phase processes raw data from every vehicle, extracts 3D visual features, annotates them with common landmarks, and builds a 3D representation with a single coordinate frame of reference. This map is then used for relative localization, strategic path-planning, and contextualizing visual intelligence from rescue squad personnel or disaster affectees in the same region.

QuickSketch consists of three major modules: stitching, annotation, and contextualization. The stitching module uses SLAM techniques in a novel way: after extracting visual features from the different map segments, the stitching module stitches all the processed map segments into a single 3D map. To do this, it uses ORB-SLAM2 [17] on common/overlapping visual features between two map segments to align them and then stitch them into a single map.

The annotation module combines an off-the-shelf 2D object localization algorithm [20] with the 3D views produced by stereo cameras to achieve two goals: (a) detect all common landmarks of interest within the area, and (b) position those landmarks accurately.

Finally, QuickSketch's contextualization module provides context for visual intelligence received from rescue squad personnel and disaster affectees. Given an already built 3D map for a region, if a person sends the relief center a photo taken within the same region, QuickSketch can find the accurate 3D position of the photo taker and hence expedite rescue efforts.

B. Stitching 3D Map Segments

After data collection, vehicles offload their collected map segments at the relief center. In the first step, QuickSketch uses ORB-SLAM2 to process these map segments. ORB-SLAM2 goes through them frame by frame to track the position of the vehicle and map the unknown area.

ORB-SLAM2 Operational Details. ORB-SLAM2 runs three threads in parallel: tracking, mapping, and loop closure. The tracking thread localizes the camera/vehicle at every frame. To do this, it extracts key-points from the environment (Figure 1). Then, it estimates changes in the camera's position by tracking the relative change in the position of these key-points.

The mapping thread builds a 3D map along the path followed by the vehicle. ORB-SLAM2 monitors the key-points extracted by the tracking thread for several frames and stores only the most stable features in the map as map-points. Figure 2 illustrates this: red points represent the key-points that are being monitored, the black points represent map-points that have become an integral part of the map, and the blue rectangles represent keyframes and the trajectory of the vehicle.

SLAM algorithms suffer from various sources of error (algorithmic and sensor) that cause drift in the 3D map along 7 degrees of freedom (3 translations, 3 rotations, and scale). To mitigate these errors, the loop closure thread uses a place recognition sub-module to find loops in the map. Once loops are found, ORB-SLAM2 aligns the two ends of the loop, removes the duplicate 3D points in the map, and runs a pose graph optimization over the whole map to achieve global consistency. Figures 4 and 5 show the map built for a block before and after loop closure.
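ORB-SLAM2's actual loop closing optimizes a Sim(3) pose graph; the sketch below isolates only the core alignment step: recovering the 7-DoF similarity transform (scale, rotation, translation) that maps the drifted points at one end of a loop onto their earlier counterparts, here via Umeyama's closed-form method on toy data.

```python
import numpy as np

def umeyama_sim3(src: np.ndarray, dst: np.ndarray):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    mapping src points onto dst: dst ~= s * R @ src + t (Umeyama, 1991).
    This covers the 7 drift degrees of freedom mentioned above."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                    # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# Toy check: dst is src scaled by 2, rotated 90 degrees about z, and shifted.
theta = np.pi / 2
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
src = np.random.default_rng(1).normal(size=(50, 3))
dst = 2.0 * src @ Rz.T + np.array([1.0, -2.0, 0.5])
s, R, t = umeyama_sim3(src, dst)
print(round(s, 3), np.allclose(R, Rz), np.round(t, 3))  # 2.0 True [ 1. -2.  0.5]
```

In a loop-closing context, the duplicated map-points at the two ends of a detected loop would serve as src and dst; the recovered (s, R, t) warps the drifted end of the trajectory onto the earlier one before pose graph optimization.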

Stitching using ORB-SLAM2. QuickSketch's stitching module is built on top of ORB-SLAM2. It uses overlaps between two distinct map segments to align them in a single frame of reference, and does this progressively for the whole region. To determine which map segments to stitch, QuickSketch uses the GPS traces collected during the map collection phase to find overlapping map segments.
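The paper does not detail this GPS matching step; a minimal sketch of one plausible realization, with a hypothetical 15 m proximity radius and toy traces expressed in a local meter grid, is:

```python
from itertools import combinations

def traces_overlap(trace_a, trace_b, radius_m=15.0):
    """Two GPS traces (lists of (x, y) positions in a local meter grid) overlap
    if any point of one lies within radius_m of a point of the other."""
    return any((ax - bx) ** 2 + (ay - by) ** 2 <= radius_m ** 2
               for ax, ay in trace_a for bx, by in trace_b)

def overlap_pairs(traces: dict):
    """Return the segment pairs that are candidates for stitching."""
    return [(i, j) for i, j in combinations(traces, 2)
            if traces_overlap(traces[i], traces[j])]

# Hypothetical traces for block ABCD: AB and BC share the corner near (100, 0).
traces = {"AB": [(x, 0.0) for x in range(0, 101, 10)],
          "BC": [(100.0, y) for y in range(0, 101, 10)]}
print(overlap_pairs(traces))  # -> [('AB', 'BC')]
```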

To explain how stitching works, assume we want to build a map for a block ABCD using four vehicles. The first vehicle traverses AB, the second BC, the third CD, and the fourth DA. The traversals are done in such a manner that there is some amount of overlap between two consecutive segments, i.e., a small portion of the map segment for AB contains the same view as the first few frames of the map segment for BC, and so on. QuickSketch starts with the map segment from the first traversal, AB, and uses ORB-SLAM2 to generate a map file, mAB, as described above. QuickSketch then loads the map file, mAB, and the map segment, BC, in ORB-SLAM2. Under normal circumstances, ORB-SLAM2 would attempt to build a new map file for the map segment BC, with its initial position as the origin. However, in this situation ORB-SLAM2 already contains a map file, mAB, in memory. Therefore, ORB-SLAM2 will attempt to localize BC in mAB instead of building a new map file. In other words, the tracking, mapping, and loop closure threads are, by virtue of loading mAB, "tricked" into converting features in BC to AB's frame of reference. This is the crucial idea behind QuickSketch's stitching algorithm.

In order to localize BC in mAB, ORB-SLAM2 converts the first frame from BC into a bag of visual words. It then attempts to find keyframes with similar bag-of-words vectors in the map file mAB. Once it finds a list of candidates, it computes correspondences between the current frame's key-points and the similar keyframes' associated map-points. This allows ORB-SLAM2 to find the closest keyframe match. After this, using the common features (key-points and map-points) between the similar keyframe and the current frame, ORB-SLAM2 is able to localize the first frame of BC in the map segment mAB. After localization, ORB-SLAM2 starts updating the map file mAB with the new frames from BC. After ORB-SLAM2 is done with BC, QuickSketch saves the new map file mABC, covering AB and BC. QuickSketch repeats this process incrementally for CD and DA to get the final stitched map file mABCDA.
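The sketch below captures the control flow of this incremental stitching as runnable toy code. run_slam and localize_and_extend are hypothetical stand-ins for ORB-SLAM2 operations: a "map" here is just the set of frames it covers, and localization succeeds only when the loaded map shares at least one frame with the new segment (the overlap requirement discussed next).

```python
def run_slam(segment):
    """Stand-in for building a map file from a raw segment: here a map is
    simply the set of frames it covers."""
    return set(segment)

def localize_and_extend(map_frames, segment):
    """Stand-in for relocalization + mapping: the new segment can only be
    stitched if at least one of its frames already appears in the loaded map."""
    if not map_frames & set(segment):
        raise ValueError("no overlap: cannot localize segment in loaded map")
    return map_frames | set(segment)

def stitch_segments(segments):
    """Incrementally stitch ordered segments, e.g. [AB, BC, CD, DA], into one map."""
    map_frames = run_slam(segments[0])       # mAB anchors the shared frame
    for seg in segments[1:]:
        map_frames = localize_and_extend(map_frames, seg)
    return map_frames

# Toy frames for block ABCD: consecutive segments share one corner frame.
AB, BC, CD, DA = range(0, 11), range(10, 21), range(20, 31), range(30, 41)
print(len(stitch_segments([AB, BC, CD, DA])))  # 41 frames in one map
```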

For QuickSketch, speed of map construction is important. To this end, we note that stitching is easily parallelizable. In our example, QuickSketch can stitch AB and BC in parallel with stitching CD and DA. Then, it can stitch together the two resulting maps, mABC and mCDA. Map collection also impacts map construction speed. In §IV, we explore the trade-offs between these two factors.

Improving Stitching Accuracy. Stitching two map segments together is possible only if there exists an overlap between them. In other words, there must be at least one frame in the second map segment that ORB-SLAM2's place recognition module can recognize. If no overlap exists, we cannot localize the second map segment in the first, and hence cannot stitch the two together.


Figure 1. Extracting 3D visual features from the environment

Figure 2. Map construction from extracted visual features

Figure 3. Object recognition module

Figure 4. Before loop closure
Figure 5. After loop closure

A greater overlap generally means that the overlapping region will be sampled more than once by ORB-SLAM2 and hence will be a more accurate reconstruction of the environment. In §IV-A, we explore the relationship between the accuracy of stitching and the degree of overlap.

Although QuickSketch uses multiple map segments to stitch together the 3D map, it does not make any changes to the map formation process in ORB-SLAM2. Thus, ORB-SLAM2 behaves exactly as it would for a single stereo camera feed covering the whole region. QuickSketch leverages this and uses two techniques to reduce the overall mapping error: (i) it uses shorter map segments, and (ii) it performs at least one loop closure in each map segment.

Long map segments are more prone to the accumulation of drift errors than short map segments. However, short map segments can increase map construction time because they increase the total number of stitches required in a given area. In §IV-A, we explore map accuracy as a function of segment length. QuickSketch users can pick the appropriate trade-off point between accuracy and speed of map construction using these results. For a similar reason, it is always better to leverage loop closure by closing small loops, as opposed to larger loops. In §IV-A, we validate this assertion experimentally. Thus, when stitching, QuickSketch picks the smallest loop to close, ensuring high overall accuracy. In summary, QuickSketch uses both of these techniques to prevent error accumulation: it builds a large 3D map from accurate, shorter map segments with closed loops.

C. Annotating the Map with Common Landmarks

Common landmarks like stop signs, street signs, traffic lights, etc. are critical for operation planning, excavation strategies, setting up relief points, and so on. The accurate locations of these landmarks on the 3D map can enable rescue operation commanders to make more informed decisions during operation planning. To this end, QuickSketch annotates the 3D map with the precise locations of landmarks of interest by using an off-the-shelf object detector to detect and localize common landmarks in the 2D image captured by one camera of the stereo pair, then transforming the position of the object in the 2D image to a 3D position using a technique described below.

Annotation is integrated with map processing: as QuickSketch converts map segments into map files, it also runs an off-the-shelf object detection neural network, YOLO [20], frame by frame on one of the two mono cameras (Figure 3). In this paper, we detect landmarks like stop signs and fire hydrants, but YOLO can be trained to detect other landmarks of interest as well. YOLO processes the 2D images and returns labels (identifiers) of landmarks along with their 2D bounding boxes.

QuickSketch converts these 2D points to 3D voxels using the depth perception of the stereo camera. It can do this because there is a one-to-one correspondence between a pixel in a 2D image and a voxel in a 3D image. Using this correspondence, it transforms the positions of these landmarks to the 3D map's coordinate frame of reference using their 3D position in the camera's frame of reference and the camera's position in the 3D map. Thus, at the end of this process, QuickSketch is able to (a) enumerate common landmarks in an area, and (b) position these landmarks accurately in the map's coordinate frame of reference. In §IV-B, we evaluate the coverage and accuracy of annotation.
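In equation form, this is a back-projection followed by a rigid transform: X_map = R·(Z·K⁻¹[u, v, 1]ᵀ) + t, where Z is the stereo depth at the pixel and (R, t) is the keyframe's camera-to-world pose. A minimal numeric sketch (the intrinsics, pose, and depth below are hypothetical values, not QuickSketch's calibration):

```python
import numpy as np

def pixel_to_map(u, v, depth_m, K, R_wc, t_wc):
    """Back-project pixel (u, v) with stereo depth into the camera frame,
    then transform into the map's world frame.

    K:          3x3 camera intrinsics.
    R_wc, t_wc: camera-to-world rotation/translation (the pose ORB-SLAM2
                associates with the current keyframe).
    """
    # Camera-frame point: X_c = depth * K^-1 [u, v, 1]^T
    p_cam = depth_m * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    return R_wc @ p_cam + t_wc

# Hypothetical values: a stop sign whose bounding-box center is at pixel
# (640, 360) with 12 m stereo depth, seen by a camera 50 m along the map's x-axis.
K = np.array([[700.0, 0, 640], [0, 700.0, 360], [0, 0, 1]])
R_wc, t_wc = np.eye(3), np.array([50.0, 0, 0])
print(pixel_to_map(640, 360, 12.0, K, R_wc, t_wc))  # [50.  0. 12.]
```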

D. Quickly Contextualizing Visual Intelligence

During a rescue mission, rescue squad personnel or disaster affectees might send visual intelligence in the form of images to a command post. In such situations, it is desirable for the commanding officer to determine where the image was taken, namely, to associate the image with a 3D position in the 3D map. We call this contextualizing visual intelligence.

One way to contextualize visual intelligence is to do a brute-force search of the complete map. This approach is very inefficient.


ORB-SLAM2 attempts to solve this problem using an inverted-index recognition database in its place recognition module. This inverted index maps bags of visual words to keyframes, permitting faster search of features. However, as we shall demonstrate later, this approach can be inefficient for larger maps because of the time required to load large map files into memory.

To solve this problem, QuickSketch leverages its annotation module to partition the 3D map into smaller map files containing specific landmarks. For example, for every landmark we have a separate map file containing the landmark and the region around it. Therefore, whenever the commanding officer receives visual intelligence, QuickSketch passes it through the object recognition module and attempts to find common landmarks in the image. Based on the landmarks found in the image, QuickSketch loads the map segment for each such landmark into memory and localizes the image there using the same place recognition module. Since landmarks have sparse spatial distributions, their corresponding map segments are much smaller than the whole 3D map.
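A sketch of this guided search is shown below. detect_landmarks stands in for the YOLO pass and localize_in_map for ORB-SLAM2's place recognition; both, along with the map-file index, are hypothetical stand-ins rather than QuickSketch's actual interfaces.

```python
# Toy index mapping landmark labels to their small per-landmark map files.
LANDMARK_MAPS = {
    "stop sign":    ["m_stop_01", "m_stop_02"],
    "fire hydrant": ["m_hydrant_01"],
}

def contextualize(photo, detect_landmarks, localize_in_map):
    """Try the small per-landmark maps first; fall back to the full map only
    if the photo contains no known landmark."""
    for label in detect_landmarks(photo):
        for map_file in LANDMARK_MAPS.get(label, []):
            pose = localize_in_map(map_file, photo)
            if pose is not None:        # place recognition succeeded
                return map_file, pose
    return "m_full_campus", localize_in_map("m_full_campus", photo)

# Toy usage: a photo containing a stop sign that lives in m_stop_02.
detect = lambda photo: ["stop sign"]
localize = lambda m, photo: (12.3, 4.5, 0.0) if m == "m_stop_02" else None
print(contextualize("photo.jpg", detect, localize))  # ('m_stop_02', (12.3, 4.5, 0.0))
```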

IV. QUICKSKETCH EVALUATION

To evaluate QuickSketch, we use vehicles mounted with ZED stereo cameras [3] and equipped with GPU laptops to collect overlapping map segments covering our campus. We use an Alienware laptop equipped with an Intel 7th-generation quad-core i7 CPU clocked at 4.4 GHz, 16 GB of DDR4 RAM, and an NVIDIA GTX 1080 GPU with 2560 CUDA cores. QuickSketch runs the ZED stereo cameras at 30 fps and extracts at most 2000 key-points per frame. Our dataset contains over 100,000 stereo camera frames covering 20 km of roads. A single traversal of our campus covers around 2 km of road length and 15,000 stereo camera frames. The corresponding 3D map consists of over 5,600 keyframes and approximately 480,000 map-points. The map file takes approximately 1.5 GB of disk space, while the point cloud alone occupies 15 MB.

A. Stitching Performance

We evaluate the map error, defined as the position error of an object in the stitched map relative to an "unstitched" map (i.e., one collected using a single session). To ensure both map files have the same coordinate frame of reference, we initialize the stitched map with a small starting portion of the single-session map. After constructing the maps, we use ORB-SLAM2 to localize a single vehicle's motion frame by frame in both map files. Map error is then determined by the vehicle's position error in every frame. Since the coordinate frames of reference for both maps are the same, ideally there should be no difference in the positions reported by the two maps. We measure QuickSketch's map accuracy as a function of: (a) the length of map segments; (b) the number of stitches in the map; and (c) the degree of overlap between map segments.

Length of map segments. To estimate map error as a function of segment length, we build two map files for different segment lengths: (a) a stitched map consisting of two equal-length map segments from two different vehicles, and (b) an unstitched single-session map from a single vehicle covering the same segments. As we increase the length of the map segments from 50m to 250m, the overlap (common frames/area) between the map segments and the number of stitches are kept constant. We repeated this experiment for three different roads and present the averaged results. We present only x- and y-axis errors since our experiments were conducted in flat terrain.

As we increase the length of the map segments from 50m to 250m, the error increases from 0.9m to 2.5m (Figure 6). This is because increasing the length of map segments allows more room for drift error to accumulate. Thus, it is important to use shorter map segments (50m) to obtain sub-meter accuracy.

Number of stitches. To study map error as a function of the number of stitches, we keep the road length covered by the map file and the overlap between map segments constant. We take a fixed length of road (200m) and construct a stitched map file for it using map segments collected from two different vehicles. For a single stitch, each vehicle yields a map segment of approximately 100m. For a double stitch, the first vehicle contributes the first and third map segments while the second vehicle contributes the second map segment, and so on for three and four stitches.

Map error ranges between 0.5m and 1m, but does not deteriorate as the number of stitches in the final map file increases, as one might expect (Figure 7). This is because QuickSketch stitches a 3D map from different map segments by progressively localizing a new map segment in an already built map file and then adding the new map segment to the map file (§III-B). Thus, at each step, a new map file is generated, and the next stitch initializes ORB-SLAM2's state from this file, resulting in no error accumulation, since the process resembles what ORB-SLAM2 does for a single session.

Degree of map segment overlap. To study the effect of the degree of overlap on map error, we generate three files: (a) an unstitched map file for the ground truth, (b) a stitched map file from two different vehicles, and (c) a stitched map file from the same vehicle to serve as a baseline. For (b), we take the two map segments and stitch them at different points to get different degrees of overlap. For (c), we split a single vehicle's traversal of the road into two map segments at different points and then stitch them together to emulate different degrees of overlap.

Stitching accuracy is largely independent of the degree of overlap (Figure 8). As we increase the overlap from 0.5% to 50% for the heterogeneous case, there is only a 0.02m change in the overall map error. This is because ORB-SLAM2 needs only a single image to localize itself in a pre-loaded/pre-existing map file (§III-B). Increasing the overlap increases the likelihood of a successful stitch, but cannot reduce error. Interestingly, when stitching two adjacent traces collected by the same vehicle, ORB-SLAM2 can stitch them even with zero overlap because it can match features in frames at the end of one trace with those at the beginning of the next.

Importance of loop closure. To understand the impact of loop closure on map error, we define map error as the difference between the position of a landmark (e.g., a stop sign) on the 3D map and its ground-truth position found by pinpointing on Google Maps.


Figure 6. Map accuracy as a function of the length of map segments

Figure 7. Map accuracy as a function of the number of stitches

Figure 8. Map accuracy as a function of the degree of overlap

Table I. LOOP CLOSURE REDUCES MAP ERROR

ID              Distance from Origin   Map File      Mapping Error
1st Stop Sign   260m                   Closed Loop   0.31m
1st Stop Sign   260m                   Open Loop     0.85m
2nd Stop Sign   650m                   Closed Loop   3.8m
2nd Stop Sign   650m                   Open Loop     47.65m

Using an object detector [20], we annotate both maps with the stop signs found on the first and third corners of the block. We measure the distance of the stop signs from the origin of the two maps and compare them with the ground-truth locations from map pinpointing on Google Maps. Map error in closed-loop map files is less than half that in open-loop map files (Table I). Map error also increases with distance from the origin due to the accumulation of drift, i.e., from 0.31m to 3.8m.

From these experiments, we conclude that QuickSketch can achieve low error by stitching together many short map segments, with minimal overlap, while closing loops whenever possible.

Speed of map construction. QuickSketch's map construction speed depends on the number of vehicles and the compute power available to stitch maps. Suppose we have a road of length d divided into k map segments of equal length. Let there be k vehicles covering these map segments, traveling with constant velocity v. Furthermore, suppose that the total number of frames across all map segments is f, the time to process a single frame is c, and the number of machines working to stitch the map is n. The time required by k vehicles and n machines to stitch the map is:

t = \begin{cases}
  \frac{d}{k v} + c f, & \text{if } n = 1 \\
  \frac{d}{k v} + \frac{c f \, [\lg(k) + 1]}{k}, & \text{if } n \ge k \\
  \frac{d}{k v} + \frac{c f}{k} \cdot \frac{\sum_{i=0}^{\lg(n)} k / 2^i}{n}, & \text{if } n < k
\end{cases}    (1)
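A direct transcription of Equation 1 makes it easy to reproduce the trends discussed next; the per-frame processing cost c below is a hypothetical value chosen so that the single-vehicle, single-machine case lands above an hour, consistent with the numbers reported in this section.

```python
import math

def construction_time(d_m, k, v_mps, f_frames, c_s, n):
    """Map construction time per Equation 1: driving time for k vehicles plus
    stitching time for f frames at c seconds/frame on n machines."""
    drive = d_m / (k * v_mps)
    if n == 1:
        return drive + c_s * f_frames
    if n >= k:
        return drive + c_s * f_frames * (math.log2(k) + 1) / k
    # n < k: the lg(n) merge rounds cannot all run fully in parallel
    work = sum(k / 2 ** i for i in range(int(math.log2(n)) + 1))
    return drive + (c_s * f_frames / k) * work / n

# Paper's setting: 20 km of road, 14,000 frames, vehicles at 25 mph (~11.2 m/s).
# c_s = 0.15 s/frame is an illustrative guess, not a measured value.
for k in (1, 5, 10, 20):
    t = construction_time(20_000, k, 11.2, 14_000, 0.15, n=k)
    print(f"{k:2d} segments/machines: {t / 60:5.1f} min")
```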

Using this model, we plot the time required to build a 3D map for a region with 20 km of road length, 14,000 stereo camera frames, and vehicles traveling at 25 mph, as we vary the number of map segments (Figure 9). In this case, the number of machines is always equal to the number of map segments. Beyond 5 vehicles for data collection (multiple vehicles-single machine curve), the bottleneck changes from data collection to the compute power at hand. Parallelism overcomes this bottleneck, as shown by the curve for multiple vehicles-multiple machines. For the given 20 km area, QuickSketch can build a 3D map in just 10 minutes with 20 machines in parallel, after which we get diminishing returns. If we vary the number of machines while keeping the number of segments fixed (Figure 10), the knee of the curve shows that we cannot significantly reduce the map construction time once the number of machines is twice the number of map segments. This suggests that we should further sub-divide the current map segments to reap the benefits of parallelism with QuickSketch. To put these results in context, with 20 machines working in parallel on a stitching operation consisting of 10 map segments, QuickSketch takes approximately 8 minutes to build the whole 3D map. On the other hand, if the same machines sub-divide the 10 map segments into 20, the time reduces to only 4 minutes. Finally, QuickSketch can build a 3D map an order of magnitude faster than a single vehicle with a single machine (more than an hour).

B. Annotation Performance

In this section, we evaluate the annotation performance of QuickSketch on two metrics: landmark coverage and positional accuracy. Landmark coverage is the percentage of landmarks detected by QuickSketch out of all landmarks present in a region. Positional accuracy is the ability of QuickSketch to accurately position these landmarks in the 3D map. We perform four different traversals of our campus using a stereo camera mounted on top of a vehicle to build four different annotated 3D maps. Our methodology for positional accuracy is similar to that discussed above for loop closure.

Of the 33 stop signs over 7 km of road length in our dataset, QuickSketch detected 37 stop signs, with a coverage precision of 90% and a recall of 100%. Of the four false positives, two were hand-held stop signs being used by the campus police to regulate traffic at small intersections. The other two were other road signs mis-detected by YOLO [20]. Moreover, QuickSketch is able to localize landmarks with an average positional error of 0.89m. Both these results are encouraging, suggesting that QuickSketch can be effectively used to annotate 3D maps for disaster recovery.


Figure 9. Map construction time as a function of the number of vehicles

Figure 10. Map construction time as a function of the number of machines

Figure 11. Error distribution for position estimation

C. Contextualization Performance

We evaluate QuickSketch's ability to contextualize visual intelligence using three metrics: recognition capability, positional accuracy, and localization time. Recognition capability is the ability of QuickSketch to successfully recognize an image in a region, given a 3D map of that region. Positional accuracy refers to the ability of QuickSketch to estimate the exact position of the camera viewpoint for a given region. Localization time refers to the time required to localize a photo in a given region.

Recognition capability. For recognition capability, after building a stitched 3D map of our campus, we selected 200 images from a separate traversal of our campus, then attempted to localize these images. QuickSketch localized 87% of these images in the 3D map. Since ORB-SLAM2 only stores map-points (features) at keyframes, there is a tradeoff between the ability to localize and the map size. A greater number of keyframes increases the size of the 3D map, taking more time to load and occupying more space on disk, but could have localized more images.

Positional accuracy. To estimate image positional accuracy, we take 20 stereo photos on our campus and attempt to position them using QuickSketch and the feature matching method described in §III-D, with ground truth derived as described in §IV-A. As Figure 11 shows, QuickSketch is able to localize the 20 photos with an average positional error of 1.5m.

By comparison, an approach that uses existing street-level imagery, like Google Street View, incurs an average error of 30m. To obtain these results, we used ALPS [15] to download and generate all of Google Street View's viewpoints on our campus. At each viewpoint, we download four images facing four mutually orthogonal directions. Each image has a field of view of 90°, so the four images combined cover a 360° view of the surroundings. We extract ORB feature points from these images and label them with their corresponding GPS coordinates. After that, we use a bag-of-words model [5] to train an SVM (support vector machine) with the labeled data. After training, we can feed the SVM input images, from which it extracts ORB features and uses the pre-trained classifier to predict their GPS positions. Note that Google Street View does not cover all countries, and most of its data is collected in urban areas. The training time for the SVM is 15 minutes, and ALPS downloads 10 images per second from Google Street View. The order-of-magnitude higher error in image matching is caused by three main factors. First, Google Street View images are sparse, i.e., there is a 10m distance between consecutive Google Street View images. Second, since Google Street View images are not updated regularly, there might exist significant differences in visual features between the 20 photos and those in Google Street View. Third, the BoW model depends only on the frequency of visual features, so images containing common features, e.g., stop signs, windows, street lights, etc., have a high similarity score [7], resulting in the large errors shown in Figure 11 for image matching.
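For reference, the baseline's training pipeline can be sketched as follows (assuming OpenCV and scikit-learn; this is our reconstruction of the described BoW+SVM approach, not ALPS's actual code, and the vocabulary size is a hypothetical choice):

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def orb_descriptors(image):
    """Extract ORB descriptors from a grayscale image (empty array if none)."""
    orb = cv2.ORB_create(nfeatures=500)
    _, desc = orb.detectAndCompute(image, None)
    return desc if desc is not None else np.empty((0, 32), np.uint8)

def bow_histogram(desc, vocab: KMeans):
    """Quantize descriptors against the visual vocabulary and count word hits."""
    words = vocab.predict(desc.astype(np.float32))
    return np.bincount(words, minlength=vocab.n_clusters)

def train_baseline(images, viewpoint_labels, vocab_size=200):
    """images: grayscale Street View crops; viewpoint_labels: GPS-tagged
    viewpoint IDs. Returns the vocabulary and a classifier that maps a new
    image's BoW histogram to the nearest labeled viewpoint."""
    all_desc = np.vstack([orb_descriptors(im) for im in images]).astype(np.float32)
    vocab = KMeans(n_clusters=vocab_size, n_init=4).fit(all_desc)
    X = np.array([bow_histogram(orb_descriptors(im), vocab) for im in images])
    clf = SVC().fit(X, viewpoint_labels)
    return vocab, clf
```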

Localization time. QuickSketch can speed up localization by using landmarks in the target image. When ORB-SLAM2 is fed a picture, it loads the whole 3D map into memory and attempts to localize the picture in the 3D map using its place recognition module. QuickSketch, on the other hand, runs the picture through its annotation module and attempts to find landmarks. Once it finds landmarks, it loads the smaller map files around each landmark and tries to localize the photo in those instead. When we attempted to localize an image containing a stop sign, QuickSketch was able to load all stop-sign-proximate map files in 0.495 seconds, while it took ORB-SLAM2 5.51 seconds to load the whole 3D map: an order of magnitude faster.

V. RELATED WORK

Natural disaster management and recovery. Disaster management systems are motivated by the significant loss of life and property during disasters: for example, the Tohoku earthquake and tsunami resulted in over 15,800 casualties and an economic impact of $240 billion [1]. Systems for disaster management [4] use multiple mobile nodes, vehicle-to-vehicle, and vehicle-to-infrastructure communication to achieve more coverage, propagate data faster, and reduce the overall information transmission delay as compared to traditional approaches. Similarly, these emergency response systems use the ubiquity of mobile phones and social media applications [11] to increase situational awareness and use genetic algorithms for efficient, priority-based path-planning for first responders. Practical emergency response systems consist of distributed, decision-making multi-UAV (unmanned air vehicle) swarms [16] and have proven to be practically efficient in challenging disaster management scenarios.


To cater to the needs of disasters on a large geographic scale, multi-cloud architectures can be used for rapid capacity scaling to increase tolerance to pressure in short amounts of time [9]. Complementary to such systems, QuickSketch focuses on quickly generating a 3D map of a disaster-affected region for situational and context awareness.

SLAM and Mapping. Recent developments in SLAM algorithms [6] over the past few decades have paved the way for more accurate localization and mapping techniques. These SLAM algorithms use a variety of input sensors, like mono cameras [8], stereo cameras [10], infrared [18], and LiDARs [14], for localization and mapping. QuickSketch leverages a visual SLAM algorithm, ORB-SLAM2 [17], to use stereo camera input for mapping unknown environments. Other 2D mapping systems, Google Maps [13], Waze [2], and OpenStreetMap (OSM) [19], also rely on crowdsourced information to annotate maps with traffic information, landmarks, etc., but these do not provide rapidly updated 3D maps. Panoramic image stitching methods [21], for both monocular and stereo cameras, stitch multiple overlapping images to create a wide-angle view from the camera's viewpoint.

VI. CONCLUSIONS AND FUTURE WORK

QuickSketch rapidly builds 3D representations of unknown environments for contextual and situational awareness, especially during emergency and rescue operations. It uses techniques for combining 3D data from multiple vehicles into a single 3D map, for annotating important landmarks, and for rapidly contextualizing visual intelligence. Our evaluations show that QuickSketch can be used for building accurate 3D maps of large areas rapidly, identifying and annotating important landmarks with over 90% accuracy, and contextualizing photos an order of magnitude faster than other systems. We envision several future directions for QuickSketch, including improving stitching accuracy using more sophisticated computer vision techniques, reducing map stitching time by using more parallelism and offloading computations to GPUs, and increasing collaboration between multiple sensing vehicles using V2V communication.

REFERENCES

[1] 2011 Tohoku Earthquake and Tsunami. https://en.wikipedia.org/wiki/2011_Tohoku_earthquake_and_tsunami.

[2] Waze. https://www.waze.com/.
[3] ZED Stereo Camera. https://www.stereolabs.com/.
[4] Z. Alazawi, O. Alani, M. B. Abdljabar, S. Altowaijri, and R. Mehmood. A Smart Disaster Management System for Future Cities. In Proceedings of the 2014 ACM International Workshop on Wireless and Mobile Technologies for Smart Cities, WiMobCity '14, pages 1–10, New York, NY, USA, 2014. ACM.

[5] T. Botterill, S. Mills, and R. Green. Bag-of-Words-Driven, Single-Camera Simultaneous Localization and Mapping. Journal of Field Robotics, 28(2):204–226, 2011.

[6] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard. Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Transactions on Robotics, 32(6):1309–1332, 2016.

[7] M. Cummins and P. Newman. FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance. The International Journal of Robotics Research, 27(6):647–665, 2008.

[8] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time Single Camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell., 29(6), June 2007.

[9] M. Dong, H. Li, K. Ota, L. T. Yang, and H. Zhu. Multicloud-Based Evacuation Services for Emergency Management. IEEE Cloud Computing, 1(4):50–59, Nov 2014.

[10] J. Engel, J. Stueckler, and D. Cremers. Large-Scale Direct SLAM with Stereo Cameras. In IROS, September 2015.

[11] J. T. B. Fajardo and C. M. Oppus. A Mobile Disaster Management System using the Android Technology. WSEAS Transactions on Communications, 9(6):343–353, 2010.

[12] A. Geiger, P. Lenz, and R. Urtasun. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[13] Google Earth. https://www.google.com/earth/.
[14] W. Hess, D. Kohler, H. Rapp, and D. Andor. Real-Time Loop Closure in 2D LIDAR SLAM. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1271–1278, 2016.

[15] Y. Hu, X. Liu, S. Nath, and R. Govindan. ALPS: Accurate Landmark Positioning at City Scales. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 1147–1158. ACM, 2016.

[16] I. Maza, F. Caballero, J. Capitán, J. R. M. Dios, and A. Ollero. Experimental Results in Multi-UAV Coordination for Disaster Management and Civil Security Applications. Journal of Intelligent & Robotic Systems, 61(1):563–585, Jan 2011.

[17] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics, 2015.

[18] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time Dense Surface Mapping and Tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, ISMAR '11, 2011.

[19] OpenStreetMap. https://www.openstreetmap.org/.
[20] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. CoRR, abs/1506.02640, 2015.

[21] F. Zhang and F. Liu. Casual Stereoscopic Panorama Stitching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.