
Localization and Mapping using Instance-specific Mesh Models

Qiaojun Feng Yue Meng Mo Shan Nikolay Atanasov

Abstract— This paper focuses on building semantic maps, containing object poses and shapes, using a monocular camera. This is an important problem because robots need rich understanding of geometry and context if they are to shape the future of transportation, construction, and agriculture. Our contribution is an instance-specific mesh model of object shape that can be optimized online based on semantic information extracted from camera images. Multi-view constraints on the object shape are obtained by detecting objects and extracting category-specific keypoints and segmentation masks. We show that the errors between projections of the mesh model and the observed keypoints and masks can be differentiated in order to obtain accurate instance-specific object shapes. We evaluate the performance of the proposed approach in simulation and on the KITTI dataset by building maps of car poses and shapes.

I. INTRODUCTION

Autonomous robots bring compelling promises of revolutionizing many aspects of our lives, including transportation, agriculture, mining, construction, security, and environmental monitoring. Transitioning robotic systems from highly structured factory environments to partially known, dynamically changing operational conditions, however, requires perceptual capabilities and contextual reasoning that rival those of biological systems. The foundations of artificial perception lie in the twin technologies of inferring geometry (e.g., occupancy mapping) and semantic content (e.g., scene and object recognition). Visual-inertial odometry (VIO) [1]–[3] and Simultaneous Localization And Mapping (SLAM) [4] are approaches capable of tracking the pose of a robotic system while simultaneously reconstructing a sparse [5], [6] or dense [7], [8] geometric representation of the environment. Current VIO and SLAM techniques achieve impressive performance, yet most rely on low-level geometric features such as points [9], [10] and planes [11], [12] and are unable to extract semantic content. Computer vision techniques based on deep learning have recently emerged as a potentially revolutionary way for context comprehension. A major research challenge today is to exploit information provided by deep learning, such as category-specific object keypoints [13], [14], semantic edges [15], and segmentation masks [16], in VIO and SLAM algorithms to build rich models of the shape, structure, and function of objects.

This paper addresses camera localization and object-level mapping, incorporating object categories, poses, and shapes. Our main contribution is the development of an instance-specific object shape model based on a triangular mesh and differentiable functions that measure the discrepancy in the image plane between projections of the model and detected semantic information. We utilize semantic keypoints [13], [14], [17] and segmentation masks [16] trained on open-source datasets [18] as observations for optimizing the error functions. Initialized from a pre-defined mean category-level model, the optimization steps are inspired by the recently proposed differentiable mesh renderer [19], which allows back-propagation of mask errors measured on a rendered image to update the mesh vertices. We generalize this approach to full SE(3) camera and object pose representations and allow multi-view observation constraints and multi-object reconstruction. The pixel-level information from the segmentation masks is used to incrementally refine the instance-specific object models, which are significantly more accurate than generic category-level ones.

We gratefully acknowledge support from ARL DCIST CRA W911NF-17-2-0181 and ONR N00014-18-1-2828.

The authors are with the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093, USA {qjfeng,yum107,moshan,natanasov}@ucsd.edu

Fig. 1: Our objective is to build detailed environment maps incorporating object poses and shapes. The figure from KITTI [20] in the top row shows the kind of information that our method relies on: bounding boxes (green), segmentation masks (magenta) and semantic keypoints (multiple colors). The middle row includes the reconstructed mesh models and 3D configuration. The last row shows the projection result.

II. RELATED WORK

The problem of incorporating semantic information in SLAM has gained increasing attention in recent years [4], [21]. In an early approach [22], objects are inserted in the map based on matching of feature descriptors to the models in a database, constructed offline using structure from motion. The camera trajectory provides multi-view information for object pose estimation, but the object detections are not used as constraints to optimize the camera trajectory. Recent works often consider the optimization of object and camera poses simultaneously. SLAM++ [23] optimizes the camera and object poses jointly using a factor graph and moreover reconstructs dense surface mesh models of pre-defined object categories. A limitation of this work is that the estimated object shapes are pre-defined and rigid instead of being optimized to match the specific instances detected online.

The popularity of joint optimization of camera and object poses keeps increasing with the advent of robust 2-D object detectors based on structured models [24] and deep neural networks [25], [26]. The stacked hourglass model [13] is used by several works [14], [27] to extract mid-level object parts and, in turn, perform factor graph inference to recover the global positions and orientations of objects detected from a monocular camera. In [28], a deep network object detector is used to generate object hypotheses, which are subsequently validated using geometric and semantic cues and optimized via nonlinear filtering. Some of these approaches [27]–[29] use inertial measurements and probabilistic data association among detections and objects as additional constraints in the optimization. While most approaches focus on static objects, [30] uses a stereo camera to track ego-motion and dynamic 3-D objects in urban driving scenarios. The authors use bundle adjustment to fuse 3-D object measurements obtained from detection and viewpoint classification.

Various 3D object representations, including point clouds [31], [32], voxels [33], [34], meshes [19], [35], and quadrics [12], [36], have been proposed in the literature. We are particularly interested in object models that can be constrained from multi-view observations and can adapt their pose and shape to specific object instances observed online. Tulsiani, Kar, et al. [32] learn a deformable point-cloud model with mean shape and deformation bases to fit object silhouettes at test time. The perspective transformer nets [33] use perspective projection to synthesize observations from different views and can be utilized for multi-view shape optimization. Introducing object models into the online inference process of SLAM requires compact representations that can be optimized and stored efficiently. QuadricSLAM [12], [36] parameterizes objects using dual ellipsoids, which can be extracted directly from bounding box detections and optimized using few parameters. A triangular mesh model of object shapes is proposed by [35] and is optimized from a single image using object keypoints and silhouettes. The optimization process uses approximate gradients of a mesh rasterization function obtained via the neural mesh renderer [19]. In this work, we generalize the deformable mesh model to SE(3) camera and object poses and allow multi-view constraints and multi-object reconstruction.

III. PROBLEM FORMULATION

We consider the problem of detecting, localizing, and estimating the shape of object instances present in the scene, and estimating the pose of a camera over time. The states we are interested in estimating are the camera poses $\mathcal{C} := \{c_t\}_{t=1}^{T}$ with $c_t \in SE(3)$ and the object shapes and poses $\mathcal{O} := \{o_n\}_{n=1}^{N}$. More precisely, a camera pose $c_t := (R_{c_t}, p_{c_t})$ is determined by its position $p_{c_t} \in \mathbb{R}^3$ and orientation $R_{c_t} \in SO(3)$, while an object state $o_n = (\mu_n, R_{o_n}, p_{o_n})$ consists of a pose $R_{o_n} \in SO(3)$, $p_{o_n} \in \mathbb{R}^3$ and a shape $\mu_n$, specified as a 3-D triangular mesh $\mu_n = (V_n, F_n)$ in the object canonical frame with vertices $V_n \in \mathbb{R}^{3 \times |V_n|}$ and faces $F_n \in \mathbb{R}^{3 \times |F_n|}$. Each row of $F_n$ contains the indices of the 3 vertices that form a triangular face. A subset of the mesh vertices are designated as keypoints – distinguishing locations on an object's surface (e.g., car door, windshield, or tires) that may be detected using a camera sensor. We define a keypoint association matrix $A_n \in \mathbb{R}^{|V_n| \times |K_n|}$ that generates the $|K_n|$ keypoints $K_n = V_n A_n$ from all mesh vertices.

Suppose that a sequence $\mathcal{I} := \{i_t\}_{t=1}^{T}$ of $T$ images $i_t \in \mathbb{R}^{W \times H}$, collected from the corresponding camera poses $\{c_t\}_{t=1}^{T}$, is available for the estimation task. From each image $i_t$, we extract a set of object observations $\mathcal{Z}_t := \{z_{lt} = (\xi_{lt}, s_{lt}, y_{lt})\}_{l=1}^{L_t}$, consisting of a detected object's category $\xi_{lt} \in \Xi$, a segmentation mask $s_{lt} \in \{0,1\}^{W \times H}$, and the pixel coordinates of detected keypoints $y_{lt} \in \mathbb{R}^{2 \times |K_{lt}|}$. We suppose that $\Xi$ is a finite set of pre-defined detectable object categories and that the data association $n = \pi_t(l)$ of observations to object instances is known (we describe an object tracking approach in Sec. IV-A but global data association can also be performed [29], [37], [38]). See Fig. 1 for example object observations.
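To make the mesh-and-keypoint notation concrete, here is a minimal numpy sketch (our illustration, not the authors' implementation; the toy tetrahedron and all variable names are assumptions) of an object shape $\mu_n = (V_n, F_n)$ and of extracting keypoints through the association matrix $A_n$:

```python
import numpy as np

# Toy object shape: a tetrahedron mesh in the object canonical frame.
V = np.array([[0.0, 1.0, 0.0, 0.0],      # vertices, shape 3 x |V|
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
F = np.array([[0, 1, 2],                 # faces, each row holds 3 vertex indices
              [0, 1, 3],
              [0, 2, 3],
              [1, 2, 3]])

# Keypoint association matrix A in {0,1}^{|V| x |K|}: column k selects the
# vertex serving as the k-th semantic keypoint, so K = V A.
A = np.zeros((V.shape[1], 2))
A[1, 0] = 1.0   # keypoint 0 is vertex 1
A[3, 1] = 1.0   # keypoint 1 is vertex 3
K = V @ A       # 3 x |K| keypoint coordinates in the object frame
print(K)
```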

For a given estimate of the camera pose $c_t$ and the object state $o_n$, we can predict an expected semantic mask $\bar{s}_{lt}$ and semantic keypoint observations $\bar{y}_{lt}$ using a perspective projection model:

$$\bar{s}_{lt} = \mathcal{R}_{mask}(c_t, o_n), \qquad \bar{y}_{lt} = \mathcal{R}_{kps}(c_t, o_n, A_n) \tag{1}$$

where the mask and keypoint projection functions $\mathcal{R}_{mask}$, $\mathcal{R}_{kps}$ will be defined precisely in Sec. IV. The camera and object estimates can be optimized by reducing the error between the predicted $\bar{\mathcal{Z}}_{1:T}$ and the actual $\mathcal{Z}_{1:T}$ observations. We define loss functions $\mathcal{L}_{mask}$, measuring discrepancy between semantic masks, and $\mathcal{L}_{kps}$, measuring discrepancy among semantic keypoints, as follows:

$$\mathcal{L}_{mask}(s, \bar{s}) = -\frac{\|s \odot \bar{s}\|_1}{\|s + \bar{s} - s \odot \bar{s}\|_1}, \qquad \mathcal{L}_{kps}(y, \bar{y}) = \|y - \bar{y} \cdot \mathrm{vis}(y)\|_F^2 \tag{2}$$

where $\odot$ is an element-wise product and $\mathrm{vis}(y_{lt}) \in \{0,1\}^{|K_n| \times |K_{lt}|}$ is a binary selection matrix that discards unobservable object keypoints.
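As a reference point, the two losses in (2) take only a few lines of numpy. This is an illustrative sketch, under the simplifying assumption that visibility is given as a per-keypoint boolean vector rather than the full selection matrix $\mathrm{vis}(y)$:

```python
import numpy as np

def mask_loss(s, s_hat):
    """Negative intersection-over-union between (binary) masks, cf. eq. (2)."""
    inter = np.sum(s * s_hat)              # ||s ⊙ s̄||_1
    union = np.sum(s + s_hat - s * s_hat)  # ||s + s̄ − s ⊙ s̄||_1
    return -inter / union

def kps_loss(y, y_hat, visible):
    """Squared Frobenius error over the visible keypoints, cf. eq. (2).

    y: 2 x K observed pixel coordinates, y_hat: 2 x K projected keypoints,
    visible: length-K boolean vector (a simplification of vis(y) in the paper).
    """
    diff = y[:, visible] - y_hat[:, visible]
    return np.sum(diff ** 2)
```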

Problem. Given object observations $\mathcal{Z}_{1:T}$, determine the camera poses $\mathcal{C}$ and object states $\mathcal{O}$ that minimize the mask and keypoint losses:

$$\min_{\mathcal{C},\mathcal{O}} \; \sum_{t=1}^{T} \sum_{l=1}^{L_t} \Big( w_{mask}\,\mathcal{L}_{mask}\big(s_{lt}, \mathcal{R}_{mask}(c_t, o_{\pi_t(l)})\big) + w_{kps}\,\mathcal{L}_{kps}\big(y_{lt}, \mathcal{R}_{kps}(c_t, o_{\pi_t(l)}, A_{\pi_t(l)})\big) \Big) \tag{3}$$

where $w_{mask}$, $w_{kps}$ are scalar weight parameters specifying the relative importance of the mask and keypoint loss functions.

IV. TECHNICAL APPROACH

We begin by describing how the object observations $\mathcal{Z}_t$ are obtained. Next, we provide a rigorous definition of the perspective projection models in (1), which, in turn, define the loss functions in (2) precisely. Finally, in order to perform the optimization in (3), we derive the gradients of $\mathcal{L}_{mask}$ and $\mathcal{L}_{kps}$ with respect to $c_t$ and $o_n$.

A. Semantic Perception

We extract both category-level (object category $\xi_{lt}$ and keypoints $y_{lt}$) and instance-level (segmentation mask $s_{lt}$) semantic information from the camera images. For each frame, we first use a pre-trained model [39] to obtain object detection results, represented as bounding boxes with instance segmentations inside the boxes. Each object is assigned one of the class labels in $\Xi$. Then we extract semantic keypoints $y_{lt}$ within the bounding box of each detected object using the pre-trained stacked hourglass model of [17], which is widely used as a human-joint/object-keypoint detector. The $l$-th detection result at time $t$ contains the object category $\xi_{lt} \in \Xi$, keypoints $y_{lt} \in \mathbb{R}^{2 \times |K_{lt}|}$, mask $s_{lt} \in \{0,1\}^{W \times H}$, bounding box $\beta_{lt} \in \mathbb{R}^4$ (2-D location, width, and height) as shown in Fig. 1, object detection confidence $u_{lt} \in \mathbb{R}$, and keypoint detection confidences $q_{lt} \in \mathbb{R}^{|K_{lt}|}$.

We develop an object tracking approach in order to associate the semantic observations obtained over time with the object instance that generated them. We extend the KLT-based ORB-feature-tracking method of [40] to track semantic features $y_{lt}$ by accounting for their individual labels (e.g., car wheel, car door) and shared object category $\xi_{lt}$. In detail, let $z_{lt}$ be a semantic observation from a newly detected object at time $t$. The objective is to determine if $z_{lt}$ matches any of the semantic observations $z_{m,t+1} \in \mathcal{Z}_{t+1}$ at time $t+1$, given that both have the same category label, i.e., $\xi_{lt} = \xi_{m,t+1}$. We apply the KLT optical flow algorithm [41] to estimate the locations $y_{l,t+1}$ of the semantic features $y_{lt}$ in the image plane at time $t+1$. We use the segmentation mask $s_{m,t+1}$ of the $m$-th semantic observation to determine if $y_{l,t+1}$ are inliers (i.e., if the segmentation mask $s_{m,t+1}$ is 1 at image location $y_{l,t+1}$) with respect to observation $m$. Let $\mathrm{in}(y_{l,t+1}, s_{m,t+1}) \in \{0,1\}^{|K_{lt}|}$ return a binary vector indicating whether each keypoint is an inlier or not. We repeat the process in reverse to determine if the back-propagated keypoints $y_{m,t}$ of observation $m$ are inliers with respect to observation $l$. Eventually, we compute a matching score based on the inliers and their detection confidences:

$$M_{lm} = \sum_{k=1}^{|K_{lt}|} \mathrm{in}\big(y^{(k)}_{l,t+1}, s_{m,t+1}\big) \cdot \mathrm{in}\big(y^{(k)}_{m,t}, s_{lt}\big) \cdot q^{(k)}_{lt} \tag{4}$$

where $q^{(k)}_{lt}$ is the $k$-th element of $q_{lt}$. Finally, we match observation $l$ to the observation at time $t+1$ that maximizes the score, i.e., $m^* = \arg\max_m M_{lm}$. If the object bounding boxes $\beta_{lt}$ and $\beta_{m^*,t+1}$ have compatible width and height, we declare that object $l$ has been successfully tracked to time $t+1$. Otherwise, we declare that object track $l$ has been lost.
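The sketch below illustrates the matching score (4) under simplifying assumptions: the keypoints are taken to be already propagated forward and backward by the KLT tracker, and the helper names (inliers, matching_score) are ours, not from the paper's code:

```python
import numpy as np

def inliers(kps, mask):
    """Binary vector: 1 where a keypoint lands inside the segmentation mask."""
    u = np.round(kps[0]).astype(int)   # pixel columns
    v = np.round(kps[1]).astype(int)   # pixel rows
    ok = (u >= 0) & (u < mask.shape[1]) & (v >= 0) & (v < mask.shape[0])
    out = np.zeros(kps.shape[1])
    out[ok] = mask[v[ok], u[ok]]
    return out

def matching_score(kps_fwd, mask_next, kps_bwd, mask_prev, conf):
    """Score (4): forward/backward inlier agreement weighted by confidence.

    kps_fwd: keypoints of track l flowed to frame t+1 (2 x K),
    kps_bwd: keypoints of candidate m flowed back to frame t (2 x K),
    conf:    per-keypoint detection confidences q_lt (length K).
    """
    return np.sum(inliers(kps_fwd, mask_next) *
                  inliers(kps_bwd, mask_prev) * conf)
```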

B. Mesh Renderer as an Observation Model

Next, we develop the observation models $\mathcal{R}_{mask}$ and $\mathcal{R}_{kps}$ that specify how a semantic observation $z = (\xi, s, y)$ is generated by a camera with pose $(R_c, p_c) \in SE(3)$ observing an object of class $\xi \in \Xi$ with pose $(R_o, p_o) \in SE(3)$ and mesh shape $\mu = (V, F)$ with keypoint association matrix $A$. Let $K$ be the intrinsic matrix of the camera, defined as:

$$K = \begin{bmatrix} f s_u & f s_n & c_u \\ 0 & f s_v & c_v \end{bmatrix} \in \mathbb{R}^{2 \times 3}, \tag{5}$$

where $f$ is the focal length in meters, $(s_u, s_v)$ is the pixels-per-meter resolution of the image array, $(c_u, c_v)$ is the image center in pixels, and $s_n$ is a rectangular pixel scaling factor. Let $x := V A e_k \in \mathbb{R}^3$ be the coordinates of the $k$-th object keypoint in the object frame, where $e_k$ is a standard basis vector. The projection of $x$ onto the image frame can be determined by first projecting it from the object frame to the camera frame using $(R_o, p_o)$ and $(R_c, p_c)$, then applying the perspective projection $\pi : \mathbb{R}^3 \to \mathbb{R}^3$, and finally the linear transformation $K$. In detail, this sequence of transformations leads to the pixel coordinates of $x$ as follows:

$$y^{(k)} = K\,\pi\big(R_c^T (R_o x + p_o - p_c)\big) \in \mathbb{R}^2 \tag{6}$$

where the standard perspective projection function is:

$$\pi(x) = \begin{bmatrix} x_1/x_3 & x_2/x_3 & x_3/x_3 \end{bmatrix}^T \tag{7}$$

Applying the same transformation to all object keypoints $VA$ simultaneously leads to the keypoint projection model:

$$\mathcal{R}_{cam}(c, o, A) := R_c^T\big(R_o V A + (p_o - p_c)\mathbf{1}^T\big), \qquad \mathcal{R}_{kps}(c, o, A) := K\,\pi\big(\mathcal{R}_{cam}(c, o, A)\big) \tag{8}$$
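A compact numpy sketch of the keypoint projection model (6)–(8); this is an illustration only, assuming rotations are given directly as 3x3 matrices and $K$ is the 2x3 calibration matrix of (5):

```python
import numpy as np

def project_keypoints(K, R_c, p_c, R_o, p_o, V, A):
    """R_kps(c, o, A) = K π(R_cᵀ (R_o V A + (p_o − p_c) 1ᵀ)), cf. eqs. (6)-(8)."""
    X_obj = V @ A                                          # keypoints in object frame, 3 x K
    X_cam = R_c.T @ (R_o @ X_obj + (p_o - p_c)[:, None])   # camera frame, eq. (8)
    X_norm = X_cam / X_cam[2:3, :]                         # perspective projection π, eq. (7)
    return K @ X_norm                                      # pixel coordinates, 2 x K
```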

where $\mathbf{1}$ is a vector whose elements are all equal to 1. To define $\mathcal{R}_{mask}$, we need an extra rasterization step, which projects the object faces $F$ to the image frame. A rasterization function, $\mathrm{Raster}(\cdot)$, can be defined using the standard method in [42], which assumes that if multiple faces are present, only the frontmost one is drawn at each pixel. Kato et al. [19] also show how to obtain an approximate gradient for the rasterization function. Relying on [42] and [19] for $\mathrm{Raster}(\cdot)$, we can define the mask projection model:

$$\mathcal{R}_{mask}(c, o) := \mathrm{Raster}\big(\mathcal{R}_{cam}(c, o, I), F\big) \tag{9}$$

Now that the projection models (1) and hence the loss functions (2) have been well defined, the final step needed to perform the optimization in (3) is to derive their gradients. We assume that the connectivity $F$ of the object mesh is fixed and the mesh is deformed only by changing the locations of the vertices $V$. We use the results of [19] for the gradient $\nabla_V \mathrm{Raster}(V, F)$. Since $\mathcal{R}_{mask}$ is a function of $\mathcal{R}_{cam}$ according to (9), we only need to derive the following:

$$\nabla_{\bar{s}} \mathcal{L}_{mask}(s, \bar{s}), \quad \nabla_{\bar{y}} \mathcal{L}_{kps}(y, \bar{y}), \quad \nabla_{c} \mathcal{R}_{kps}(c, o, A), \quad \nabla_{o} \mathcal{R}_{kps}(c, o, A). \tag{10}$$

Our results are summarized in the following propositions.

Proposition 1. The gradients of the loss functions $\mathcal{L}_{mask}(s, \bar{s})$ and $\mathcal{L}_{kps}(y, \bar{y})$ in (2) with respect to the estimated mask $\bar{s} \in \{0,1\}^{W \times H}$ and keypoint pixel coordinates $\bar{y} \in \mathbb{R}^{2 \times K}$ are:

$$\nabla_{\bar{y}} \mathcal{L}_{kps}(y, \bar{y}) = 2\,\big(\bar{y} \cdot \mathrm{vis}(y) - y\big)\,\mathrm{vis}(y)^T \tag{11}$$

$$\nabla_{\bar{s}} \mathcal{L}_{mask}(s, \bar{s}) = -\frac{1}{U(s, \bar{s})}\, s + \frac{I(s, \bar{s})}{U^2(s, \bar{s})}\,\big(\mathbf{1}\mathbf{1}^T - s\big)$$

where $I(s, \bar{s}) := \|s \odot \bar{s}\|_1$ and $U(s, \bar{s}) := \|s + \bar{s} - s \odot \bar{s}\|_1$.
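The keypoint gradient (11) is straightforward to verify numerically. The following sketch (illustrative only; the shapes and random test data are assumptions) compares the closed-form gradient against a central finite difference:

```python
import numpy as np

def kps_loss(y, y_hat, Vis):
    """||y − ȳ vis(y)||_F² with Vis the full |K_n| x |K_lt| selection matrix."""
    return np.sum((y - y_hat @ Vis) ** 2)

def kps_grad(y, y_hat, Vis):
    """Proposition 1, eq. (11): 2 (ȳ vis(y) − y) vis(y)ᵀ."""
    return 2.0 * (y_hat @ Vis - y) @ Vis.T

rng = np.random.default_rng(0)
Kn, Klt = 5, 3
y = rng.normal(size=(2, Klt))                  # observed keypoints
y_hat = rng.normal(size=(2, Kn))               # projected model keypoints
Vis = np.zeros((Kn, Klt))
Vis[[0, 2, 4], [0, 1, 2]] = 1.0                # visibility selection matrix

g = kps_grad(y, y_hat, Vis)
eps = 1e-6
e = np.zeros_like(y_hat); e[0, 2] = eps        # perturb one entry of ȳ
num = (kps_loss(y, y_hat + e, Vis) - kps_loss(y, y_hat - e, Vis)) / (2 * eps)
print(np.isclose(g[0, 2], num))                # True
```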

Proposition 2. Let $y^{(i)} = \mathcal{R}_{kps}(c, o, I) \in \mathbb{R}^2$ be the pixel coordinates of the $i$-th vertex $v_i := V e_i$ of an object with pose $(R_o, p_o) \in SE(3)$, obtained by projecting $v_i$ onto the image plane of a camera with pose $(R_c, p_c) \in SE(3)$ and calibration matrix $K \in \mathbb{R}^{2 \times 3}$. Let $\theta_c, \theta_o \in \mathbb{R}^3$ be the axis-angle representations of $R_c$ and $R_o$, respectively, so that $R_c = \exp(\lfloor\theta_c\times\rfloor)$ and $R_o = \exp(\lfloor\theta_o\times\rfloor)$, where $\lfloor\cdot\times\rfloor$ is the hat map. Then, the derivative of $y^{(i)}$ with respect to $\alpha \in \{\theta_c, p_c, \theta_o, p_o, v_i\}$ is:

$$\frac{\partial y^{(i)}}{\partial \alpha} = K\,\frac{\partial \pi}{\partial x}(\gamma)\,\frac{\partial \gamma}{\partial \alpha} \tag{12}$$

where:

$$\frac{\partial \pi}{\partial x}(x) = \frac{1}{x_3}\begin{bmatrix} 1 & 0 & -x_1/x_3 \\ 0 & 1 & -x_2/x_3 \\ 0 & 0 & 0 \end{bmatrix}, \qquad \gamma = R_c^T(R_o v_i + p_o - p_c) \tag{13}$$

$$\frac{\partial \gamma}{\partial p_c} = -R_c^T, \qquad \frac{\partial \gamma}{\partial p_o} = R_c^T, \qquad \frac{\partial \gamma}{\partial v_i} = R_c^T R_o,$$

$$\frac{\partial \gamma}{\partial \theta_c} = R_c^T \lfloor(R_o v_i + p_o - p_c)\times\rfloor\, J_r^{SO(3)}(-\theta_c), \qquad \frac{\partial \gamma}{\partial \theta_o} = -R_c^T R_o \lfloor v_i\times\rfloor\, J_r^{SO(3)}(\theta_o)$$

and $J_r^{SO(3)}(\theta)$ is the right Jacobian of $SO(3)$, which is necessary because the gradient needs to be projected from the tangent space to the $SO(3)$ manifold [43, Ch. 10.6], and can be computed in closed form:

$$J_r^{SO(3)}(\theta) = I_3 - \frac{1 - \cos\|\theta\|}{\|\theta\|^2}\lfloor\theta\times\rfloor + \frac{\|\theta\| - \sin\|\theta\|}{\|\theta\|^3}\lfloor\theta\times\rfloor^2. \tag{14}$$
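For reference, the hat map and the closed-form right Jacobian (14) translate directly into numpy; the small-angle fallback below is an added numerical guard, not part of the paper:

```python
import numpy as np

def hat(w):
    """Hat map: 3-vector to skew-symmetric matrix, ⌊w×⌋."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def right_jacobian_SO3(theta):
    """J_r(θ) = I − (1−cos‖θ‖)/‖θ‖² ⌊θ×⌋ + (‖θ‖−sin‖θ‖)/‖θ‖³ ⌊θ×⌋², eq. (14)."""
    a = np.linalg.norm(theta)
    S = hat(theta)
    if a < 1e-8:                      # small-angle fallback (first-order Taylor)
        return np.eye(3) - 0.5 * S
    return (np.eye(3)
            - (1.0 - np.cos(a)) / a**2 * S
            + (a - np.sin(a)) / a**3 * (S @ S))
```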

Proof. By definition (8), $y^{(i)} = K\pi(\gamma)$, so most steps follow by the chain rule. We only discuss the relationship between the axis-angle vectors $\theta_c, \theta_o$ and the orientations $R_c, R_o$. Any rotation matrix $R \in SO(3)$ can be associated with a vector $\theta \in \mathbb{R}^3$ specifying it as a rotation about a fixed axis $\frac{\theta}{\|\theta\|_2}$ through an angle $\|\theta\|_2$. The axis-angle representation $\theta$ is related to $R$ through the exponential and logarithm maps:

$$R = \exp(\lfloor\theta\times\rfloor) = I + \left(\frac{\sin\|\theta\|}{\|\theta\|}\right)\lfloor\theta\times\rfloor + \left(\frac{1 - \cos\|\theta\|}{\|\theta\|^2}\right)\lfloor\theta\times\rfloor^2$$

$$\lfloor\theta\times\rfloor = \log(R) = \frac{\|\theta\|}{2\sin\|\theta\|}\big(R - R^T\big)$$

See [44] and [43] for details. Consider the derivative of $\gamma$ with respect to $\theta_o$. The right Jacobian of $SO(3)$ satisfies the following for small $\delta\theta$:

$$\exp\big(\lfloor(\theta + \delta\theta)\times\rfloor\big) \approx \exp(\lfloor\theta\times\rfloor)\exp\big(\lfloor(J_r^{SO(3)}(\theta)\,\delta\theta)\times\rfloor\big)$$

Using this and $R_o = \exp(\lfloor\theta_o\times\rfloor)$, we can compute:

$$\frac{\partial \gamma}{\partial \theta_o} = R_c^T \frac{\partial \exp(\lfloor\theta_o\times\rfloor) v_i}{\partial \theta_o} = R_c^T R_o \frac{\partial}{\partial \delta\theta_o}\big\lfloor(J_r^{SO(3)}(\theta_o)\,\delta\theta_o)\times\big\rfloor v_i = -R_c^T R_o \lfloor v_i\times\rfloor\, J_r^{SO(3)}(\theta_o) \tag{15}$$

The derivative of $\gamma$ with respect to $\theta_c$ can be obtained using the same approach.

Fig. 2: Left: Localization results from a simulated dataset, showing car poses (blue), the ground truth camera trajectory (blue), the inertial odometry used for initialization (red), and the optimized camera trajectory (green). Right: Change of Absolute Trajectory Error versus number of optimization iterations.

In conclusion, we derived explicit definitions for the observation models $\mathcal{R}_{kps}$, $\mathcal{R}_{mask}$, the loss functions $\mathcal{L}_{mask}$, $\mathcal{L}_{kps}$, and their gradients, directly taking the $SO(3)$ constraints into account via the axis-angle parameterization. As a result, we can treat (3) as an unconstrained optimization problem and solve it using gradient descent. The explicit gradient equations in Prop. 2 allow solving an object mapping-only problem by optimizing with respect to $\mathcal{O}$, a camera localization-only problem by optimizing with respect to $\mathcal{C}$, or a simultaneous localization and mapping problem.
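To illustrate how Prop. 2 is used inside gradient descent, the sketch below performs one descent step on the keypoint loss with respect to the object pose, assembling the chain rule of (12)-(13). It is a simplified illustration, not the authors' implementation: the mask term, which requires the differentiable rasterizer [19], is omitted; the hat/right-Jacobian helpers are repeated from the previous sketch so the block runs on its own; the learning rate and all names are assumptions.

```python
import numpy as np

def hat(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(theta):
    """Rodrigues formula, exp(⌊θ×⌋), cf. the exponential map in the proof."""
    a = np.linalg.norm(theta)
    S = hat(theta)
    if a < 1e-8:
        return np.eye(3) + S
    return np.eye(3) + np.sin(a) / a * S + (1 - np.cos(a)) / a**2 * (S @ S)

def right_jacobian_SO3(theta):
    """Closed-form right Jacobian of SO(3), eq. (14)."""
    a = np.linalg.norm(theta)
    S = hat(theta)
    if a < 1e-8:
        return np.eye(3) - 0.5 * S
    return (np.eye(3) - (1 - np.cos(a)) / a**2 * S
            + (a - np.sin(a)) / a**3 * (S @ S))

def object_pose_step(theta_o, p_o, R_c, p_c, K, V, A, y_obs, visible, lr=1e-6):
    """One gradient-descent step on the keypoint loss w.r.t. the object pose
    (θ_o, p_o), using the chain rule of Prop. 2; the mask term is omitted."""
    R_o = exp_so3(theta_o)
    X = V @ A                                   # keypoints in the object frame
    g_theta, g_p = np.zeros(3), np.zeros(3)
    for k in np.flatnonzero(visible):
        x = X[:, k]
        gamma = R_c.T @ (R_o @ x + p_o - p_c)   # point in the camera frame, eq. (13)
        y = K @ (gamma / gamma[2])              # projected pixel coordinates, eq. (6)
        dL_dy = 2.0 * (y - y_obs[:, k])         # per-keypoint form of gradient (11)
        dpi = np.array([[1.0, 0.0, -gamma[0] / gamma[2]],
                        [0.0, 1.0, -gamma[1] / gamma[2]],
                        [0.0, 0.0, 0.0]]) / gamma[2]          # ∂π/∂x, eq. (13)
        g_p += dL_dy @ K @ dpi @ R_c.T                        # ∂γ/∂p_o = R_cᵀ
        g_theta += dL_dy @ K @ dpi @ (-R_c.T @ R_o @ hat(x)
                                      @ right_jacobian_SO3(theta_o))
    return theta_o - lr * g_theta, p_o - lr * g_p
```

Optimizing the camera pose instead uses the $\partial\gamma/\partial p_c$ and $\partial\gamma/\partial\theta_c$ entries of (13) in the same way.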

C. Optimization Initialization

We implemented the localization and mapping tasks separately. In the localization task, we initialize the camera pose using inertial odometry obtained from integration of IMU measurements [1]. The camera pose is optimized sequentially between every two images via (3), leading to an object-level visual-inertial odometry algorithm.

To initialize the object model in the mapping task, we collect high-quality keypoints (according to $q_{lt}$ defined in Sec. IV-A) from multiple frames until an object track is lost. The 3-D positions of these keypoints are estimated by optimizing $\mathcal{L}_{kps}$ only, using the Levenberg-Marquardt algorithm. Using a predefined category-level mesh model (mean model) with known keypoints, we apply the Kabsch algorithm [45] to initialize the object pose (i.e., the transformation from the detected 3-D keypoints to the category-level model keypoints). After initialization, we take two steps to optimize the object states. First, we fix the mesh vertices and optimize the pose based on the combined loss function in (3). Next, we fix the object pose and optimize the mesh vertices using only the mask loss, because the keypoint loss affects only a few vertices. To improve the deformation optimization and obtain a smooth mesh model, we add regularization using the mean mesh curvature. The curvature is computed using a discretization of the continuous Laplace-Beltrami operator [46], [47]. Constraints from symmetric object categories can be enforced by directly defining the mesh shape model to be symmetric.
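The Kabsch alignment [45] used for pose initialization is the standard SVD-based rigid registration; a generic sketch (variable names are ours) is shown below:

```python
import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) aligning points P (3 x N) onto Q (3 x N): Q ≈ R P + t."""
    cP = P.mean(axis=1, keepdims=True)
    cQ = Q.mean(axis=1, keepdims=True)
    H = (P - cP) @ (Q - cQ).T                    # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t.ravel()
```

Here P would hold the triangulated 3-D keypoints and Q the corresponding keypoints of the category-level mean model (or vice versa, depending on the direction of the desired transformation).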

V. EXPERIMENTS

We evaluated the ability of the proposed localization and mapping technique to optimize the camera trajectory and reconstruct object poses and shapes using both simulated and real data. Our experiments used images from a monocular camera and inertial odometry information and focused on detecting, localizing, and reconstructing cars because 1) the KITTI dataset with multiple cars is widely used and 2) pre-trained object detectors and keypoint extractors are available for cars. We represented cars using a symmetric mesh model with 642 vertices and 1280 faces.

Fig. 3: Views used to evaluate the object-level mapping approach in simulation.

Fig. 4: Mask and 3-D voxel intersection over union (IoU) results obtained with different numbers of object views for four different object instances (see Fig. 5).

A. Simulation Dataset

To model a realistic IMU, we chose a sub-sequence of IMU measurements and the associated ground-truth poses from the synchronized KITTI odometry dataset. We collected camera images following the ground-truth poses in a simulated Gazebo environment populated with car mesh models, so that we simulated a real camera-IMU sensor (see Fig. 2). The car models were annotated with keypoints, and both the car surface and the keypoints were colored in contrasting colors to simplify the semantic segmentation and keypoint detection tasks. The simulated experiments used ground-truth data association among the observations. We evaluated both the localization and the mapping tasks.

For the localization task, we used a sequence of 70 frames with synchronized IMU measurements and 6 known cars placed around the trajectory. We initialized the estimation by predicting the transformation between two camera poses based on the IMU odometry. Then, we optimized the predicted camera pose by solving problem (3) and used the IMU to predict the next pose. An example camera trajectory and the associated localization results are shown in Fig. 2. We can see that our optimization successfully reduced the error accumulated from IMU integration.

TABLE I: 2D projection mIoU with respect to object segmentation on three KITTI sequences [20]

Dataset                             09_26_0048   09_26_0035   09_30_0020
Frames                                      22          131         1101
Detected objects                             6           28           77
Single image mesh prediction [35]        0.692        0.642        0.641
With pose estimation                     0.735        0.656        0.689
With pose and shape estimation           0.778        0.675        0.725

The mapping performance was evaluated on a sequence of images obtained from different views of a single object (see Fig. 3). The optimization was initialized using a generic category-level car mesh, and its vertices were optimized based on the detected keypoints and segmentation masks. The mapping quality is evaluated using the Intersection over Union (IoU) ratio between the predicted and ground-truth car masks and volumes. In detail, the mask IoU compares the area differences between predicted binary car masks, while the voxel IoU compares the voxelized volumes of the predicted and ground-truth car models. Fig. 4 shows the dependence of the mapping accuracy on the number of different views used. The optimized car meshes are shown in Fig. 5. The differences among car models are clearly visible in the reconstructed meshes, and their shapes are very close to the corresponding ground-truth shapes. Using only a few views, the optimization process is able to deform the mesh vertices to fit the segmentation masks but not necessarily align the estimated model with the real 3-D shape. As more observations become available, the 3D IoU increases, which makes sense since different views provide information about additional instance-level characteristics. Based on only 3 views, the IoU reaches over 0.8, while the generic category-level mesh has an average IoU of 0.63 with respect to the different object instances.

B. KITTI Dataset

Experiments with real observations were carried out using the KITTI dataset [20]. We chose three sequences with different lengths. The experiments used the ground-truth camera poses and evaluated only the mapping task. The object detector, semantic segmentation [39], and keypoint detector [17] algorithms used pre-trained weights. Semantic observations were collected as described in Sec. IV-A. The poses and shapes of the detected cars were initialized and optimized as described in Sec. IV-C. Fig. 6 shows a bird's-eye view of the estimated car poses and compares the results with the ground truth car positions provided in [48]. The poses and shapes of 56 out of 62 marked cars were reconstructed, with an average position error across all cars of 1.9 meters. Fig. 7 shows some estimated 3-D car mesh models projected back onto the camera images. Fig. 8 compares the category-level and instance-specific models. Fig. 9 shows the estimated model shapes and poses in 3D. Since ground-truth object shapes are not available, we evaluate the quality of shape reconstruction based on the 2D IoU compared with the observed instance segmentation masks. We trained a single-image mesh predictor [35] on car data from the PASCAL3D+ dataset [18] and calculated its mean IoU for individual objects over multiple frames. Table I shows that our multi-view optimization method improves the IoU by leveraging semantic information from multiple images. The reconstruction quality on the real dataset is limited by the accuracy of the semantic information because the optimization objective is to align the predicted car shapes with the observed semantic masks and keypoints. The viewpoint changes on the real dataset are smaller, making the reconstruction task harder than in simulation. The pose estimation relies heavily on the keypoint detections, which in some cases are not robust enough. Nevertheless, our approach is able to generate accurate instance-specific mesh models in an environment containing occlusions and different lighting conditions.

Fig. 5: Qualitative comparison between estimated car shapes (bottom row) obtained from a simulation sequence and ground truth object meshes (top row). The numbers indicate 3D IoU.

Fig. 6: Qualitative results showing the accuracy of the estimated car positions (green) on sequence 06 of the KITTI odometry dataset in comparison with hand-labeled ground truth (red) obtained from [48]. The camera trajectory (blue) is shown as well.

Fig. 7: Top: the semantic observations. Bottom: the projection of reconstructed mesh models.

Fig. 8: Top: category-level model before shape optimization. Bottom: instance-level model after shape optimization.

Fig. 9: Left: 2D observation of mesh models. Right: corresponding 3D configuration. Trajectory in blue.

VI. CONCLUSION

This work demonstrates that object categories, shapes, and poses can be recovered from visual semantic observations. The key innovation is the development of differentiable keypoint and segmentation mask projection models that allow object shape to be used for simultaneous semantic mapping and camera pose optimization. In contrast with existing techniques, our method generates accurate instance-level reconstructions of multiple objects, incorporating multi-view semantic information. Future work will extend the mesh reconstruction to multiple object categories and will focus on data association techniques for object re-identification and loop closure. Our ultimate goal is to develop an online SLAM algorithm that unifies semantic, geometric, and inertial measurements to allow rich environment understanding and contextual reasoning.


REFERENCES

[1] A. Mourikis and S. Roumeliotis, "A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation," University of Minnesota, Tech. Rep., 2006.
[2] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, "Robust visual inertial odometry using a direct EKF-based approach," in IEEE/RSJ IROS, 2015, pp. 298–304.
[3] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[4] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[5] J. Engel, T. Schops, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in ECCV, 2014.
[6] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An open-source SLAM system for monocular, stereo and RGB-D cameras," IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[7] R. Newcombe, "Dense Visual SLAM," Ph.D. dissertation, Imperial College London, 2012.

[8] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, "ElasticFusion: Real-Time Dense SLAM and Light Source Estimation," IJRR, vol. 35, no. 14, pp. 1697–1716, 2016.

[9] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[10] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2564–2571.
[11] M. Kaess, "Simultaneous localization and mapping with infinite planes," in IEEE Intl. Conf. on Robotics and Automation (ICRA), 2015.
[12] M. Hosseinzadeh, Y. Latif, T. Pham, N. Sunderhauf, and I. D. Reid, "Towards semantic SLAM: points, planes and objects," arXiv, vol. 1804.09111, 2018.
[13] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in European Conference on Computer Vision. Springer, 2016, pp. 483–499.
[14] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis, "6-DoF Object Pose from Semantic Keypoints," in IEEE International Conference on Robotics and Automation (ICRA), 2017.
[15] Z. Yu, C. Feng, M. Liu, and S. Ramalingam, "CASENet: Deep Category-Aware Semantic Edge Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[16] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[17] X. Zhou, A. Karpur, L. Luo, and Q. Huang, "StarMap for category-agnostic keypoint and viewpoint estimation," in European Conference on Computer Vision (ECCV), 2018, pp. 318–334.
[18] Y. Xiang, R. Mottaghi, and S. Savarese, "Beyond PASCAL: A benchmark for 3D object detection in the wild," in IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
[19] H. Kato, Y. Ushiku, and T. Harada, "Neural 3D Mesh Renderer," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
[20] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[21] A. J. Davison, "FutureMapping: The Computational Structure of Spatial AI Systems," arXiv, vol. 1803.11288, 2018.
[22] J. Civera, D. Galvez-Lopez, L. Riazuelo, J. D. Tardos, and J. Montiel, "Towards semantic SLAM using a monocular camera," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2011.
[23] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1352–1359.
[24] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part-Based Models," IEEE Trans. on PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
[25] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE CVPR, 2014.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE CVPR, 2016.
[27] N. Atanasov, S. Bowman, K. Daniilidis, and G. Pappas, "A Unifying View of Geometry, Semantics, and Data Association in SLAM," in International Joint Conference on Artificial Intelligence (IJCAI), 2018.
[28] X. Fei and S. Soatto, "Visual-inertial object detection and mapping," in European Conference on Computer Vision (ECCV), 2018.
[29] S. Bowman, N. Atanasov, K. Daniilidis, and G. Pappas, "Probabilistic data association for semantic SLAM," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 1722–1729.
[30] P. Li, T. Qin, and S. Shen, "Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving," in European Conference on Computer Vision (ECCV), 2018, pp. 664–679.
[31] H. Fan, H. Su, and L. Guibas, "A Point Set Generation Network for 3D Object Reconstruction from a Single Image," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[32] S. Tulsiani, A. Kar, J. Carreira, and J. Malik, "Learning category-specific deformable 3D models for object reconstruction," IEEE Trans. on PAMI, vol. 39, no. 4, pp. 719–731, 2017.
[33] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, "Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision," in Advances in Neural Information Processing Systems, 2016, pp. 1696–1704.
[34] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, "Multi-view supervision for single-view reconstruction via differentiable ray consistency," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 209–217.
[35] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik, "Learning category-specific mesh reconstruction from image collections," in European Conference on Computer Vision (ECCV), 2018.
[36] L. Nicholson, M. Milford, and N. Sunderhauf, "QuadricSLAM: Constrained dual quadrics from object detections as landmarks in semantic SLAM," arXiv preprint arXiv:1804.04011, 2018.
[37] J. Neira and J. D. Tardos, "Data association in stochastic mapping using the joint compatibility test," IEEE Transactions on Robotics and Automation, vol. 17, no. 6, pp. 890–897, 2001.
[38] M. Kaess and F. Dellaert, "Covariance recovery from a square root information matrix for data association," Robotics and Autonomous Systems, vol. 57, no. 12, pp. 1198–1210, 2009.

[39] F. Massa and R. Girshick, "maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch," 2018.

[40] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, "Robust stereo visual inertial odometry for fast autonomous flight," IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 965–972, 2018.
[41] B. D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," in International Joint Conference on Artificial Intelligence (IJCAI), 1981.
[42] S. Marschner and P. Shirley, Fundamentals of Computer Graphics. CRC Press, 2015.
[43] G. Chirikjian, Stochastic Models, Information Theory, and Lie Groups, Volume 2: Analytic Methods and Modern Applications. Springer Science & Business Media, 2011, vol. 2.
[44] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry, An Invitation to 3-D Vision: From Images to Geometric Models. Springer Science & Business Media, 2012, vol. 26.
[45] W. Kabsch, "A solution for the best rotation to relate two sets of vectors," Acta Crystallographica, vol. 32, no. 5, pp. 922–923, 1976.
[46] U. Pinkall and K. Polthier, "Computing discrete minimal surfaces and their conjugates," Experimental Mathematics, vol. 2, no. 1, 1993.
[47] O. Sorkine, "Differential representations for mesh processing," in Computer Graphics Forum, vol. 25, no. 4, 2006, pp. 789–807.
[48] N. Atanasov, M. Zhu, K. Daniilidis, and G. J. Pappas, "Localization from semantic observations via the matrix permanent," The International Journal of Robotics Research, vol. 35, no. 1-3, pp. 73–99, 2016.