arXiv:1910.10672v3 [cs.RO] 19 Nov 2020

∇SLAM: Automagically differentiable SLAM
https://gradslam.github.io

Krishna Murthy J.*1,2,3, Soroush Saryazdi*4, Ganesh Iyer5, and Liam Paull†1,2,3,6

1 Université de Montréal, 2 Mila, 3 Robotics and Embodied AI Lab (REAL), 4 Concordia University, 5 Carnegie Mellon University, 6 Canadian CIFAR AI Chair

Figure 1. ∇SLAM (gradSLAM) is a fully differentiable dense simultaneous localization and mapping (SLAM) system. The central idea of ∇SLAM is to construct a computational graph representing every operation in a dense SLAM system. We propose differentiable alternatives to several non-differentiable components of traditional dense SLAM systems, such as optimization, odometry estimation, raycasting, and map fusion. This creates a pathway for gradient-flow from 3D map elements to sensor observations (e.g., pixels). We implement differentiable variants of three dense SLAM systems that operate on voxels, surfels, and pointclouds respectively. ∇SLAM thus is a novel paradigm to integrate representation learning approaches with classical SLAM.

Abstract

Blending representation learning approaches with simultaneous localization and mapping (SLAM) systems is an open question, because of their highly modular and complex nature. Functionally, SLAM is an operation that transforms raw sensor inputs into a distribution over the state(s) of the robot and the environment. If this transformation (SLAM) were expressible as a differentiable function, we could leverage task-based error signals to learn representations that optimize task performance. However, several components of a typical dense SLAM system are non-differentiable. In this work, we propose ∇SLAM (gradSLAM), a methodology for posing SLAM systems as differentiable computational graphs, which unifies gradient-based learning and SLAM. We propose differentiable trust-region optimizers, surface measurement and fusion schemes, and raycasting, without sacrificing accuracy. This amalgamation of dense SLAM with computational graphs enables us to backprop all the way from 3D maps to 2D pixels, opening up new possibilities in gradient-based learning for SLAM1.

TL;DR: We leverage the power of automatic differentiation frameworks to make dense SLAM differentiable.

* Equal contribution
† No students were harmed in the making of this work.
1 Video abstract: https://youtu.be/2ygtSJTmo08

1. Introduction

Simultaneous localization and mapping (SLAM) has—for decades—been a central problem in robot perception and state estimation. A large portion of the SLAM literature has focused either directly or indirectly on the question of map representation. This fundamental choice dramatically impacts the choice of processing blocks in the SLAM pipeline, as well as all other downstream tasks that depend on the outputs of the SLAM system. Of late, gradient-based learning approaches have transformed the outlook of several domains (e.g., image recognition [24], language modeling [42], speech recognition [17]). However, such techniques have had limited success in the context of SLAM, primarily because many of the elements in the standard SLAM pipeline are not differentiable. A fully differentiable SLAM system would enable task-driven representation learning, since the error signals indicating task performance could be backpropagated all the way through the SLAM system, to the raw sensor observations.

This is particularly true for dense 3D maps generated from RGB-D cameras, where there has been a lack of consensus on the right representation (pointclouds, meshes, surfels, etc.). Several methods have demonstrated a capability for producing dense 3D maps from sequences of RGB or RGB-D frames [21, 29, 44]. However, none of these methods is able to solve the inverse mapping problem, i.e., to answer the question: "How much does a specific pixel measurement contribute to the resulting 3D map?"


Formally, we desire an expression that relates a pixel in an image (or, in general, a sensor measurement s) to a 3D map M of the environment. We propose to solve this through the development of a differentiable mapping function M = GSLAM(s). The gradient of that mapping, ∇s M, intuitively tells us that perturbing the sensor measurement s by an infinitesimal δs causes the map M to change by ∇s GSLAM(s) δs.

Central to our goal of realizing a fully differentiable SLAM system are computational graphs, which underlie most gradient-based learning techniques. We make the observation that, if an entire SLAM system can be decomposed into elementary operations, all of which are differentiable, we can compose these elementary operations2 to preserve differentiability. However, modern dense SLAM systems are quite sophisticated, with several non-differentiable subsystems (optimizers, raycasting, surface mapping) that make such a construct challenging.

We propose ∇SLAM (gradSLAM), a differentiable computational-graph view of SLAM. We show how all non-differentiable functions in SLAM can be realised as smooth mappings. First, we propose a differentiable trust-region optimizer for nonlinear least squares systems. Building on it, we present differentiable strategies for mapping, raycasting, and global measurement fusion.

The ∇SLAM framework is very general, and can be extended to most dense SLAM systems to endow them with differentiability. In Sec. 4, we provide three examples of SLAM systems that can be realized as differentiable computation graphs: implicit-surface mapping (KinectFusion [29]), surfel-based mapping (PointFusion [21]), and iterative closest point (ICP) mapping (ICP-SLAM). We show that the differentiable approaches maintain performance similar to their non-differentiable counterparts, with the added advantage that they allow gradients to flow through them.

To foster further research on differentiable SLAM systems and their applications to spatially-grounded learning, ∇SLAM is available as an open-source PyTorch framework. Our project page and code can be accessed at https://gradslam.github.io.

2. Related Work

Several works in recent years have applied machine learning advances to SLAM, or have reformulated a subset of the components of the full SLAM system in a differentiable manner.

2 Again, using differentiable composition operators.

2.1. Learning-based SLAM approaches

There is a large body of work on deep learning-based SLAM systems. For example, CodeSLAM [2] and SceneCode [46] attempt to represent scenes using compact codes that encode 2.5D depth maps. DeepTAM [47] trains a tracking network and a mapping network, which learn to reconstruct a voxel representation from a pair of images. CNN-SLAM [40] extends LSD-SLAM [9], a popular monocular SLAM system, to use single-image depth predictions from a convnet. Another recent trend has been to formulate the SLAM problem over higher-level features such as objects, which may be detected with learned detectors [45] [28] [32]. De Brabandere et al. [3] perform lane detection by backpropagating least-squares residuals into a frontend module. Recent work has also formulated the passive [22] and active [4, 11] localization problems in an end-to-end differentiable manner. While all of these approaches leverage differentiability in submodules of SLAM systems (e.g., odometry, optimization, etc.), there is no single framework that models an entire SLAM pipeline as a differentiable graph.

2.2. Differentiable visual odometry

The beginnings of differentiable visual odometry can be traced back to the seminal Lucas-Kanade iterative matching algorithm [27]. Kerl et al. [23]3 apply the Lucas-Kanade algorithm to perform real-time dense visual odometry. Their system is differentiable, and has been extensively used for self-supervised depth and motion estimation [10, 26, 48]. Coupled with the success of Spatial Transformer Networks (STNs) [20], several libraries (gvnn [15], kornia [34]) have since implemented these techniques as differentiable layers, for use in neural networks.

However, extending differentiability beyond the two-view case (frame-frame alignment) is not straightforward. Global consistency necessitates fusing measurements from live frames into a global model (model-frame alignment), which is not trivially differentiable.

2.3. Differentiable optimization

Some approaches have recently proposed to learn the optimization of nonlinear least squares objective functions. This is motivated by the fact that similar cost functions have similar loss landscapes, and learning methods can help converge faster, or potentially to better minima.

In BA-Net [39], the authors learn to predict the damping coefficient of the Levenberg-Marquardt optimizer, while in LS-Net [5], the authors entirely replace the Levenberg-Marquardt optimizer with an LSTM network [18] that predicts update steps. In GN-Net [43], a differentiable version of the Gauss-Newton loss is used to show better robustness to weather conditions.

3 The formulation first appeared in Steinbrücker et al. [37].


RegNet [14] employs a learning-based optimization approach based on photometric error for image-to-image pose registration. However, all the aforementioned approaches require training additional neural networks, and this requirement imposes severe limitations on generalizability. OptNet [1] introduces differentiable optimization layers for quadratic programs that do not involve learnable parameters.

Concurrently, Grefenstette et al. [12] propose to unroll optimizers as computational graphs, which allows for the computation of arbitrarily high-order gradients. Our proposed differentiable Levenberg-Marquardt optimizer is similar in spirit, with the addition of gating functions that result in better gradient flows.

In summary, to the best of our knowledge, there is no single approach that models the entire SLAM pipeline as a differentiable model, and it is this motivation that underlies ∇SLAM.

3. ∇SLAM

In this section, we give an overview of our proposed ∇SLAM method and detail the individual differentiable subcomponents.

3.1. Preliminaries: Computational graphs

Figure 2. A computational graph. Nodes in red represent variables. Nodes in blue represent operations on variables. Edges represent data flow. This graph computes the function 3(xy + z). Dashed lines indicate (local, i.e., per-node) gradients in the backward pass.

In gradient-based learning architectures, all functions and approximators are conventionally represented as computational graphs. Formally, a computation graph is a directed acyclic graph G = (V, E), where each node v ∈ V holds an operand or an operator, and each (directed) edge e ∈ E indicates the control flow in the graph. Further, each node in the graph also specifies computation rules for the gradient of the outputs of the node with respect to the inputs to the node. Computational graphs can be nested and composed in nearly any manner, whilst preserving differentiability. An example computation graph for the function 3(xy + z) is shown in Fig. 2.
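To make this concrete, the graph of Fig. 2 can be built and differentiated in a few lines of PyTorch (the library ∇SLAM builds on, cf. Appendix A); this is a minimal illustrative sketch, not part of the ∇SLAM pipeline itself:

import torch

# Leaf variables (red nodes in Fig. 2); requires_grad registers them in the graph.
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = torch.tensor(1.0, requires_grad=True)

# Operators (blue nodes); each records a rule for its local gradient.
f = 3 * (x * y + z)

# Backward pass: chain the per-node (local) gradients along the dashed edges.
f.backward()
print(x.grad, y.grad, z.grad)  # df/dx = 3y = 9, df/dy = 3x = 6, df/dz = 3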

In a standard SLAM pipeline there are several subsystems/components that are not differentiable (i.e., for a few forward computations in the graph, gradients are unspecifiable). For example, in the context of dense 3D SLAM [29] [21], nonlinear least squares modules, raycasting routines, and discretizations are non-differentiable. Further, for several operations such as index selection / sampling, gradients exist, but are zero almost everywhere, which results in extremely sparse gradient flows.

Figure 3. Computational graph for ∇LM.

3.2. Method Overview

The objective of ∇SLAM is to realise every computation in SLAM exactly as a composition of differentiable functions4. Broadly, the sequence of operations in dense SLAM systems can be grouped into odometry estimation (frame-to-frame alignment), map building (model-to-frame alignment/local optimization), and global optimization. An overview of the approach is shown in Fig. 1.

First, we describe the precise issues that render nearly all of the aforementioned modules non-differentiable, and propose differentiable counterparts for each module. Finally, we show that the proposed differentiable variants allow the realization of several classic dense mapping algorithms (KinectFusion [29], PointFusion [21], ICP-SLAM) in the ∇SLAM framework.5

3.3. ∇LM: A differentiable nonlinear least squares solver

Most state-of-the-art SLAM solutions optimize nonlinear least squares objectives to obtain locally/globally consistent estimates of the robot state and the map. Such objectives are of the form ½ Σ r(x)², where r(x) is a nonlinear function of residuals. Example application scenarios that induce this nonlinear least squares form include visual odometry, depth measurement registration (e.g., ICP), and pose-graph optimization, among others. Such objective functions are minimized using a succession of linear approximations, r(x + δx)|x=x0 ≈ r(x0) + J(x0) δx, using Gauss-Newton (GN) or Levenberg-Marquardt (LM) solvers. GN solvers are extremely sensitive to initialization and numerical precision and, moreover, provide no guarantees on non-divergent behavior. Hence most SLAM systems use LM solvers.
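For reference, a single damped linear step solves the normal equations below; this sketch uses our own variable names and is the standard textbook update, with λ = 0 recovering the Gauss-Newton step:

import torch

def damped_step(r, J, lam):
    """Solve (J^T J + lam * I) dx = -J^T r for the update dx.

    r: residual vector of shape (m,); J: Jacobian of shape (m, n);
    lam: damping coefficient (lam = 0 gives the Gauss-Newton step).
    """
    A = J.t() @ J + lam * torch.eye(J.shape[1])
    return torch.linalg.solve(A, -J.t() @ r)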

Trust-region methods (such as LM) are not differentiable, as at each optimization step they involve a recalibration of optimizer parameters, based on a lookahead operation over subsequent iterates [25]. Specifically, after a new iterate is computed, LM solvers need to make a discrete decision between damping or undamping the linear system. Furthermore, when undamping, the iterate must be restored to its previous value. This discrete switching behavior of LM does not allow for gradient flow in the backward pass.

4 Wherever exact differentiable realizations are not possible, we desire as-exact-as-possible realizations.

5 That is, realizable as fully differentiable computational graphs.

Figure 4. An example curve-fitting problem, showing that ∇LM performs near-identically to LM, with the added advantage of being fully differentiable.

We propose a computationally efficient soft reparametrization of the damping mechanism to enable differentiability in LM solvers. Our key insight is that, if r0 = r(x0)ᵀr(x0) is the norm of the error at the current iterate, and r1 = r(x1)ᵀr(x1) is the norm of the error at the lookahead iterate, the value of r1 − r0 determines whether to damp or to undamp, and only when we choose to undamp do we revert to the current iterate. We define two smooth gating functions Qx and Qλ based on the generalized logistic function [35] to update the iterate and determine the next damping coefficient.

λ1 = Qλ(r0, r1) = λmin + (λmax − λmin) / (1 + D e^(−σ(r1 − r0)))

Qx(r0, r1) = x0 + δx0 / (1 + e^(−(r1 − r0)))        (1)

where D and σ are tunable parameters that control the falloff [35], and [λmin, λmax] is the range of values the damping function can assume. Notice that this smooth parameterization of the LM update allows the optimizer to be expressed as a fully differentiable computational graph (Fig. 3).

It must be noted that this scheme can be modified to accommodate other kinds of gating functions, such as hyperbolic curves. We, however, choose the above gating functions, as they provide sufficient flexibility. A thorough treatment of the impact of the choice of gating functions on performance is left for future work.
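For illustration, Eq. 1 can be written directly as differentiable PyTorch operations; the default constants below (λmin, λmax, D, σ) are placeholders we chose for the sketch, not tuned values from our system:

import torch

def Q_lambda(r0, r1, lam_min=1e-5, lam_max=1e5, D=1.0, sigma=10.0):
    # Smooth damping update of Eq. 1: as the lookahead error r1 grows
    # relative to r0, lambda moves smoothly toward lam_max (more damping);
    # as it shrinks, lambda moves toward lam_min.
    return lam_min + (lam_max - lam_min) / (1.0 + D * torch.exp(-sigma * (r1 - r0)))

def Q_x(x0, dx0, r0, r1):
    # Smooth iterate update of Eq. 1: a sigmoid gate on (r1 - r0)
    # interpolates between reverting to x0 and taking the full step
    # x0 + dx0, replacing LM's discrete accept/revert decision.
    return x0 + dx0 / (1.0 + torch.exp(-(r1 - r0)))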

3.4. Differentiable mapping

Another non-smooth operation in dense SLAM is map construction (surface measurement).

Figure 5. Computation graph for the differentiable mapping module: the map M and the live image I are intersected to obtain active map elements Ma and active pixels Pa; valid pixels Pvalid are softly associated to map elements via a kernel K(Pa) with weights exp(−r(Pvalid)²/(2σ²)), followed by the surface-measurement update and map fusion. The uncolored boxes indicate intermediate variables, while the colored boxes indicate processing blocks. Note that the specific choices of the update-surface-measurement and map-fusion functions depend on the map representation used.

For example, consider a global map M being built in the reference frame of the first image-sensor measurement I0. When a new frame Ik arrives at time k, dense SLAM methods need to align the surface measurement made in the live frame with the map M. Notwithstanding the specific choice of map representation (i.e., pointclouds, signed distances, surfels), a generic surface alignment process comprises the following steps.

1. The map M is intersection-tested with the live frame, to determine the active set Ma of map elements and the active set of image pixels Pa. The remaining map elements are clipped.

2. Active image pixels Pa are checked for measurement validity (e.g., missing depth values, blurry pixels, etc.). This results in a valid active set of image pixels Pvalid.

3. The set of pixels in Pvalid is backprojected to 3D and compared with the map. At this stage, it must be discerned whether these pixels measure existing elements in Ma, or if they measure a new set of elements that need to be added to the global map.

4. Once the above decision is made, these surface measurements are fused into the global map. The choice of the fusion mechanism depends on the underlying representation of each map element (points, surfels, TSDF, etc.).

The above process involves a number of differentiable yet non-smooth operations (clipping, indexing, thresholding, new/old decisions, active/inactive decisions, etc.). Although the above sequence of operations can be represented as a computation graph, it will not necessarily serve our purpose here: even though (local) derivatives can be defined for operations such as clipping, indexing, thresholding, and discrete decisions, these derivatives exist only at that single point. The overall function represented by the computation graph will have undefined gradients "almost everywhere" (akin to step functions). We propose to mitigate this issue by making the functions locally smooth. Concretely, we propose the following corrective measures.


1. The surface measurement made by each valid pixel p in the live frame (i.e., p ∈ Pvalid) is not a function of p alone. Rather, it is a function of the pixel p and its (active/inactive) neighbours nbd(p), as determined by a kernel K(p, nbd(p)).

2. When a surface measurement is transformed to the global frame, rather than using a hard (one-to-one) association between a surface measurement and a map element, we use a soft association to multiple map elements, in accordance with the sensor characteristics.

3. Every surface measurement is, by default, assumed to represent a new map element, which is passed to a differentiable fusion step (cf. Sec 3.5).

The kernel K(p, nbd(p)) can be a discrete approximation (e.g., constant within a pixel) or can vary at the sub-pixel level, based on the choice of the falloff function. For faster computation and coarse gradients, we use a bilinear interpolation kernel. While bilinear interpolation is a sensible approximation for image pixels, it is often a poor choice for 3D soft associations. For forming 3D associations, we instead leverage characteristics of RGB-D sensors in defining the soft falloff functions. Specifically, we compute, for each point P in the live surface measurement, a set of closest candidate points in a region weighted by exp(−r(P)²/(2σ²)), where r(P) is the radial depth of the point from the camera ray, and σ affects the falloff region.6

6 This is a well-known falloff function, commonly used with Kinect-style depth sensors [6, 21, 30].
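A sketch of such a soft association is given below. For simplicity, it weights candidates by their Euclidean distance to the backprojected point, whereas the text defines r(P) as the radial depth from the camera ray; the function and variable names are ours:

import torch

def soft_association_weights(point, candidates, sigma):
    """Soft weights between one live-frame 3D point and K candidate map elements.

    point: (3,) backprojected pixel; candidates: (K, 3) nearby map elements;
    sigma controls the falloff region (Sec 3.4). Weights follow
    exp(-r^2 / (2 sigma^2)) and are normalized so each surface measurement
    distributes unit weight over its candidates.
    """
    r = torch.norm(candidates - point, dim=-1)
    w = torch.exp(-r.pow(2) / (2.0 * sigma ** 2))
    return w / w.sum()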

3.5. Differentiable map fusion

The aforementioned differentiable mapping strategy, while providing us with a smooth observation model, also causes an undesirable effect: the number of map elements increases in proportion to exploration time. Ideally, however, the number of map elements should grow in proportion to the explored volume of occupied space, rather than with exploration time. Conventional dense mapping techniques (e.g., KinectFusion [29], PointFusion [21]) address this through fusion of redundant observations of the same map element. As a consequence, the recovered map has a more manageable size but, more importantly, the reconstruction quality improves greatly. While most fusion strategies are differentiable (e.g., [21, 29]), they impose falloff thresholds that cause an abrupt change in gradient flow at the truncation point. We use a logistic falloff function, similar to Eq. 1, to ease gradient flow through these truncation points.

3.6. Differentiable ray backprojection

Some dense SLAM systems [29, 44] perform global pose estimation by raycasting a map to a live frame. Such an operation inherently involves non-differentiable steps.


Figure 6. Ray differentials: the inset shows the computation graph of the ray-value computation, where a ray Rc through pixel pc with depth d(pc) is intersected with the map M, and the pierced voxels Vc are aggregated by Φ(Vc) under a Gaussian falloff N(d(pc), σr). The dashed rectangle is not differentiable, and its derivatives are approximated as shown in Eq. 2.

First, from each pixel in the image, a ray from the camera is backprojected, and its intersection with the first map element along the direction of the ray is determined. This involves marching along the ray until a map element is found, or until we exit the bounds of the reconstruction. Usual (non-differentiable) versions of ray marching use max-min acceleration schemes [31] or rely on the existence of volumetric signed distance functions [29]. Several attempts have been made to make the raycasting operation differentiable. Scene representation networks [36] predict ray marching steps using an LSTM. In other works, such as DRC [41] and WS-GAN [13], the authors pool over all voxels along a ray to compute the potential of the ray. In this work, we make one enhancement to the ray pooling operation. We pool over all voxels along a ray, but with a Gaussian falloff defined around the depth measurement of the image pixel through which the ray passes. Further, we use finite differences to compute the derivative of the ray potential with respect to the pixel neighbourhood, following the finite-difference ray differentials defined in Igehy [19]. If pc is the image pixel that the ray Rc pierces, and Vc = {vc} is the set of all voxels it pierces, then the aggregated value of the ray is denoted vc (with respect to an aggregation function Φ(ψ(vc) ∀ vc ∈ Vc)). The aggregation function simply multiplies each value ψ(vc) by the density of the Gaussian falloff at vc, and normalizes. Similarly, vl, vr, vu, and vd are the aggregated values of rays emanating from the pixels to the left of, to the right of, above, and below pc, respectively. Then, the partial derivative ∂vc/∂pc can be approximated as

∂vc/∂pc ≈ ( (vr − vl)/2 , (vu − vd)/2 )        (2)

An illustration of the ray differential computation scheme can be found in Fig. 6.
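A minimal sketch of this Gaussian-falloff pooling and of the finite-difference derivative in Eq. 2 (the variable and function names are ours):

import torch

def ray_value(voxel_vals, voxel_depths, pixel_depth, sigma_r):
    """Aggregate psi(v) over the voxels a ray pierces, weighting each voxel
    by a Gaussian falloff centred at the pixel's depth measurement d(pc)."""
    w = torch.exp(-(voxel_depths - pixel_depth).pow(2) / (2.0 * sigma_r ** 2))
    w = w / w.sum()  # normalize the falloff densities
    return (w * voxel_vals).sum()

def ray_differential(v_l, v_r, v_u, v_d):
    """Central finite differences over the four neighbouring rays (Eq. 2)."""
    return torch.stack([(v_r - v_l) / 2.0, (v_u - v_d) / 2.0])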

4. Case Studies: KinectFusion, PointFusion, and ICP-SLAM

To demonstrate the applicability of the ∇SLAM framework, we leverage the differentiable computation graphs specified in Sec 3 and compose them to realise three practical SLAM solutions. In particular, we implement differentiable versions of the KinectFusion [29] algorithm that constructs TSDF-based volumetric maps, the PointFusion [21] algorithm that constructs surfel maps, and a pointcloud-only SLAM framework that we call ICP-SLAM.

4.1. KinectFusion

Recall that KinectFusion [29] alternates between tracking and mapping phases. In the tracking phase, the entire up-to-date TSDF volume is raycast onto the live frame, to enable a point-to-plane ICP that aligns the live frame to the raycast model. Subsequently, in the mapping phase, surface measurements from the current frame are fused into the volume, using the TSDF fusion method proposed in [29]. The surface measurement is given as (cf. [29])

sdf(p) = trunc(‖K⁻¹x‖₂⁻¹ ‖t − p‖₂ − depth(x))

trunc(sdf) = min(1, sdf/µ) · sign(sdf)   iff sdf ≥ −µ        (3)

Here, p is the location of a voxel in the camera frame, and x = ⌊π(Kp)⌋ is the live-frame pixel to which p projects. µ is a parameter that determines the threshold beyond which a surface measurement is invalid. However, we note that the floor operator is non-differentiable "almost everywhere". Also, the truncation operator, while differentiable within a distance of µ from the surface, cuts off abruptly, which hinders gradient flow. Instead, we again use a generalized logistic function [35] to create a smooth truncation, which provides better-behaved gradients at the truncation boundary. The other steps involved here, such as raycasting, ICP, etc., are already differentiable in the ∇SLAM framework (cf. Sec 3). Fusion of surface measurements is performed using the same approach as in [29] (weighted averaging of TSDFs).
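One possible instantiation of such a smooth truncation is sketched below: it squashes sdf/µ through a logistic ramp rather than clipping at µ. The exact generalized-logistic parameterization used in our system is not reproduced here, and the sharpness constant is illustrative:

import torch

def smooth_trunc(sdf, mu, sharpness=6.0):
    """Smooth stand-in for the hard truncation of Eq. 3.

    Maps sdf smoothly into (-1, 1): approximately linear in sdf/mu near the
    surface, and saturating (instead of abruptly clipping) beyond the
    truncation band, so gradients remain well-behaved at the boundary.
    """
    return 2.0 / (1.0 + torch.exp(-sharpness * sdf / mu)) - 1.0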

4.2. PointFusion

As a second example, we implement PointFusion [21], which incrementally fuses surface measurements to obtain a global surfel map. Surfel maps compare favourably to volumetric maps due to their reduced memory usage.7 We closely follow our differentiable mapping formulation (cf. Sec 3.4) and use surfels as map elements. We adopt the fusion rules from [21] to perform map fusion.

7 On the flip side, surfel-based algorithms are harder to parallelize compared to volumetric fusion.

4.3. ICP-SLAM

As a baseline example, we implement a simple pointcloud-based SLAM technique, which uses ICP to incrementally register pointclouds to a global pointcloud set. In particular, we implement two variants, contrasted in the sketch below. The first aligns every pair of consecutive incoming frames to obtain an odometry estimate (frame-to-frame alignment, or ICP-Odometry). The second performs what we call frame-to-model alignment (ICP-SLAM): each incoming frame is aligned (using ICP) with a pointcloud containing the entire set of points observed thus far.
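In the sketch, icp_align stands in for a differentiable point-to-plane ICP returning a 4 × 4 relative transform; the names and structure are illustrative, not our exact implementation:

import torch

def icp_odometry(frames, icp_align):
    """Frame-to-frame: each new pose chains the relative transform between
    consecutive frames. frames: list of (N_i, 3) pointcloud tensors."""
    poses = [torch.eye(4)]
    for prev, live in zip(frames[:-1], frames[1:]):
        poses.append(poses[-1] @ icp_align(live, prev))
    return poses

def icp_slam(frames, icp_align):
    """Frame-to-model: each frame is registered against (and then merged
    into) the pointcloud of all points observed thus far."""
    poses, global_map = [torch.eye(4)], frames[0]
    for live in frames[1:]:
        pose = poses[-1] @ icp_align(live, global_map)
        poses.append(pose)
        live_h = torch.cat([live, torch.ones(live.shape[0], 1)], dim=1)
        global_map = torch.cat([global_map, (live_h @ pose.t())[:, :3]], dim=0)
    return poses, global_map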

5. Experiments and results

5.1. Differentiable optimization

In Sec 3.3, we introduced two generalized logistic functions Qλ and Qx to compute the damping function as well as the subsequent iterates. We conduct multiple experiments to verify the impact of this approximation on the performance (convergence speed, quality of solution) of nonlinear least squares solvers.

We first design a test suite of nonlinear curve-fitting problems (inspired by [5]) to compare the performance of ∇LM to its non-differentiable counterpart. We consider three nonlinear functions, viz. exponential, sine, and sinc, each with three parameters a, t, and w.

f(x) = a exp(−(x − t)²/(2w²))
f(x) = sin(ax + tx + w)
f(x) = sinc(ax + tx + w)        (4)

For each of these functions, we uniformly sample the parameters p = {a, t, w} to create a suite of ground-truth curves, and uniformly sample an initial guess p0 = {a0, t0, w0} in the interval [−6, 6]. We sample 100 problem instances for each of the three functions. We run a variety of optimizers (such as gradient descent (GD), Gauss-Newton (GN), LM, and ∇LM) for a maximum of 10, 50, and 100 iterations. We compute the mean squared error in parameter space (independently for each parameter a, t, w) as well as in function space (i.e., ‖f(x)pred − f(x)gt‖²). Note that these two errors are not necessarily linearly related, as the interactions between the parameters and the function variables are highly nonlinear. The results are presented in Table 1.
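As a sketch of this protocol for the exponential case (the variable names are ours; the optimizer loop itself is omitted):

import torch

def exponential(x, a, t, w):
    # f(x) = a * exp(-(x - t)^2 / (2 w^2)), the first function in Eq. 4.
    return a * torch.exp(-(x - t).pow(2) / (2.0 * w ** 2))

x = torch.linspace(-6.0, 6.0, 100)
p_gt = torch.empty(3).uniform_(-6.0, 6.0)   # ground-truth parameters (a, t, w)
p0 = torch.empty(3).uniform_(-6.0, 6.0)     # initial guess (a0, t0, w0)
f_gt = exponential(x, *p_gt)

# After running an optimizer (GD / GN / LM / grad-LM) from p0 to p_pred:
p_pred = p0  # placeholder for the optimizer's output
param_err = (p_pred - p_gt).pow(2)                          # parameter-space MSE
func_err = (exponential(x, *p_pred) - f_gt).pow(2).mean()   # function-space error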



                          Exponential                       Sine                              Sinc
                          GD      GN      LM      ∇LM       GD      GN      LM      ∇LM      GD      GN       LM      ∇LM
Tmax = 10 iters
‖a_pred − a_gt‖²          0.422   0.483   0.483   0.483     0.379   0.341   0.342   0.342    2.929   0.304    0.304   0.304
‖t_pred − t_gt‖²          0.606   0.50    0.550   0.550     0.222   0.359   0.360   0.360    3.024   0.304    0.304   0.040
‖w_pred − w_gt‖²          1.268   0.667   0.075   0.075     1.215   0.080   0.084   0.085    0.462   10⁻⁷     0.023   10⁻⁴
‖f(x)_pred − f(x)_gt‖²    0.716   0.160   0.163   0.160     0.666   0.148   0.152   0.148    0.700   5×10⁻⁸   0.005   4×10⁻⁵

Tmax = 50 iters
‖a_pred − a_gt‖²          0.365   0.275   0.231   0.275     0.486   0.429   0.434   0.434    3.329   0.380    0.380   0.380
‖t_pred − t_gt‖²          0.263   0.219   0.231   0.218     0.519   0.455   0.459   0.460    2.739   0.380    0.380   0.380
‖w_pred − w_gt‖²          1.220   0.205   0.007   0.369     1.327   0.273   0.376   0.383    0.383   2×10⁻⁷   0.202   4×10⁻⁵
‖f(x)_pred − f(x)_gt‖²    0.669   0.083   0.004   0.078     0.673   0.153   0.153   0.151    0.795   2×10⁻⁷   0.005   3×10⁻⁵

Tmax = 100 iters
‖a_pred − a_gt‖²          0.431   0.475   0.480   0.487     0.486   0.429   0.434   0.434    2.903   0.196    0.196   0.196
‖t_pred − t_gt‖²          0.466   0.311   0.378   0.323     0.519   0.455   0.459   0.460    2.847   0.196    0.196   0.196
‖w_pred − w_gt‖²          1.140   0.364   0.066   0.065     1.327   0.273   0.376   0.382    0.601   10⁻⁷     0.026   9×10⁻⁵
‖f(x)_pred − f(x)_gt‖²    0.662   0.243   0.162   0.230     0.673   0.153   0.153   0.151    0.707   6×10⁻⁸   0.005   4×10⁻⁵

Table 1. ∇LM performs quite similarly to its non-differentiable counterpart, on a variety of non-linear functions, and at various stages ofoptimization. Here, GD, GN, and LM refer to gradient descent, Gauss-Newton, and Levenberg-Marquardt optimizers respectively.

It can be seen that ∇LM performs near-identically to LM.

5.2. Comparative analysis of case studies

In Sec 4, we implemented KinectFusion [29], PointFusion [21], and ICP-SLAM as differentiable computational graphs. Here, we present an analysis of how each of the approaches compares to its non-differentiable counterpart. Table 2 shows the trajectory tracking performance of the non-differentiable and differentiable (∇) versions of ICP-Odometry, ICP-SLAM, and PointFusion; we observe virtually no change in performance when utilizing the differentiable mapping modules and ∇LM for optimization. These results are computed over split subsets of the living_room_traj0 sequence.

We also compare the reconstruction quality of ∇KinectFusion with that of Kintinuous [44]. On a subsection of the living_room_traj0 sequence of the ICL-NUIM [16] benchmark, the surface reconstruction quality of Kintinuous is 18.625, while that of differentiable KinectFusion is 21.301 (better). However, this quantity is misleading, as Kintinuous only retains a subset of high-confidence points in the extracted mesh, while our differentiable KinectFusion outputs (see Fig. 8) contain a few noisy artifacts, due to our smooth truncation functions.

5.3. Qualitative results

∇SLAM works out of the box on multiple other RGB-D datasets. Specifically, we present qualitative results of running our differentiable SLAM systems on RGB-D sequences from the TUM RGB-D dataset [38] and ScanNet [7], as well as on an in-house sequence captured with an Intel RealSense D435 camera.

Figs. 9-11 show qualitative results obtained by running ∇SLAM on a variety of sequences from the TUM RGB-D benchmark (Fig. 9), ScanNet (Fig. 10), and the in-house sequence (Fig. 11).

Method                               ATE       RPE
ICP-Odometry (non-differentiable)    0.029     0.0318
∇ICP-Odometry                        0.01664   0.0237
ICP-SLAM (non-differentiable)        0.0282    0.0294
∇ICP-SLAM                            0.01660   0.0204
PointFusion (non-differentiable)     0.0071    0.0099
∇PointFusion                         0.0072    0.0101
KinectFusion (non-differentiable)    0.013     0.019
∇KinectFusion                        0.016     0.021

Table 2. Performance of ∇SLAM. The differentiable counterparts perform nearly identically to their non-differentiable counterparts (ATE: Absolute Trajectory Error, RPE: Relative Pose Error).

These differentiable SLAM systems all execute fully on the GPU, and are capable of computing gradients with respect to any intermediate variable (e.g., camera poses, pixel intensities/depths, optimization parameters, camera intrinsics, etc.).

5.4. Analysis of Gradients

The computational graph approach of ∇SLAM allows us to recover meaningful gradients of 2D (or 2.5D) measurements with respect to a 3D surface reconstruction. In Fig. 12, the top row shows an RGB-D image differentiably transformed—using ∇SLAM—into a (noisy) TSDF surface measurement, and then compared to a more precise global TSDF map. Elementwise comparison of the aligned volumes gives us a reconstruction error, whose gradients are backpropagated through to the input depthmap using the computational graph maintained by ∇SLAM (and visualized in depth-image space). In the second row, we intentionally introduce an occluder that masks out a small (40 × 40) region of the RGB-D image, thereby introducing a reconstruction artifact.


Figure 7. ∇LM performs comparably to LM optimizers. In this figure, we show example curve fitting problems from the test suite.

Figure 8. Qualitative results on the living room lr kt0 sequence of the ICL-NUIM dataset [16] (panels, left to right: KinectFusion, PointFusion, ICP-Odometry, ICP-SLAM). The reconstructions are near-identical to their non-differentiable counterparts. However, distinct from classic SLAM approaches, these reconstructions allow gradients to flow from a 3D map element all the way to the entire set of pixel-space measurements of that element.

Figure 9. Reconstruction obtained upon running the differentiable ICP-Odometry pipeline on a subsection of the rgbd_dataset_freiburg1_xyz sequence.

Computing the volumetric error between the global and locally occluded TSDF volumes and inspecting the gradients with respect to the input indicates the per-pixel contribution of the occluding surface to the volumetric error. Thus, ∇SLAM provides a rich interpretation of the computed gradients: they denote the contribution of each pixel towards the eventual 3D reconstruction.

5.5. Application: RGB and depth completion

In Fig. 13, we similarly introduce such occluders (top row) and pixel noise (bottom row) in one of the depth maps of a sequence and reconstruct the scene using ∇PointFusion. We then calculate the Chamfer distance between the noisy and true surfel maps and backpropagate the error to each pixel; a minimal sketch of this loop is given below. Minimizing this loss leads to targeted recovery of the noisy and occluded regions. We additionally show an RGB-D image completion task (from uniform noise) in Fig. 14.
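In the sketch, chamfer is a straightforward implementation, while reconstruct stands in for a differentiable SLAM forward pass (e.g., our ∇PointFusion) and is a hypothetical placeholder not defined here:

import torch

def chamfer(a, b):
    """Symmetric Chamfer distance between pointclouds a: (N, 3) and b: (M, 3)."""
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def complete_depth(depth, rgb, clean_cloud, reconstruct, steps=100, lr=1e-2):
    """Optimize a corrupted depth map so its reconstruction matches a clean one."""
    depth = depth.clone().requires_grad_(True)
    opt = torch.optim.SGD([depth], lr=lr)
    for _ in range(steps):
        loss = chamfer(reconstruct(rgb, depth), clean_cloud)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return depth.detach()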

6. Conclusion

We introduce ∇SLAM, a differentiable computational graph framework that enables gradient-based learning for a large set of localization and mapping based tasks, by providing explicit gradients with respect to the input image and depth maps. We demonstrate a diverse set of case studies, and showcase how gradients propagate through the tracking, mapping, and fusion stages. Future efforts will enable ∇SLAM to be directly plugged into and optimized in conjunction with downstream tasks. ∇SLAM can also enable a variety of self-supervised learning applications, as any gradient-based learning architecture can now be equipped with a sense of spatial understanding.

Figure 10. Qualitative results on sequences from the ScanNet [7] dataset. Owing to GPU memory constraints, we use each of the differentiable SLAM systems (∇KinectFusion, ∇PointFusion, and ∇ICP-SLAM) to reconstruct parts of the scene. We also show outputs from BundleFusion [8] for reference.

Figure 11. In-house sequence collected from an Intel RealSense D435 camera. The reconstruction (right) is obtained by running ∇PointFusion. Note that we do not perform any noise removal. Differentiable noise filtering is left for future work.

Figure 12. Analysis of gradients: ∇SLAM enables gradients to flow all the way back to the input images. Top: an RGB-D image pair (depth not shown) is passed through ∇SLAM, and reconstruction error is computed using a precise fused map. Backpropagation passes these gradients all the way back to the depth map (blue map). Bottom: an explicit occluder is added to the center of the RGB-D pair. This occluder distorts the reconstruction by creating a gaping hole through it. But, using the backpropagated gradients, one can identify the set of image/depthmap pixels that cause a particular area to be reconstructed imperfectly.

Figure 13. End-to-end gradient propagation. Top: a chunk of a depth map is chopped out. The resultant sequence is reconstructed using ∇PointFusion and the pointcloud is compared to a clean one reconstructed using the unmodified depth map. The Chamfer distance between these two pointclouds defines a reconstruction error, which is backpropagated through to the input depth map and minimized by gradient descent. Bottom: similar to Fig. 12, we show that ∇SLAM can fill in holes in the depthmap by leveraging multi-view gradient information.

Figure 14. RGB-D completion using end-to-end gradient propagation: three RGB-D images and a noise image are passed through ∇PointFusion, and compared to a clean reconstruction obtained from four RGB-D images. The reconstruction loss is used to optimize the noise image by gradient descent. We can recover most of the artifacts from the raw RGB and depth images. Note that finer features are hard to recover from a random initialization, as the overall SLAM function is only locally differentiable.

References

[1] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In ICML, 2017.

[2] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J. Davison. CodeSLAM: Learning a compact, optimisable representation for dense visual SLAM. In CVPR, 2018.

[3] Bert De Brabandere, Wouter Van Gansbeke, Davy Neven, Marc Proesmans, and Luc Van Gool. End-to-end lane detection through differentiable least-squares fitting. CoRR, abs/1902.00293, 2019.

[4] Devendra Singh Chaplot, Emilio Parisotto, and Ruslan Salakhutdinov. Active neural localization. In ICLR, 2018.

[5] Ronald Clark, Michael Bloesch, Jan Czarnowski, Stefan Leutenegger, and Andrew J. Davison. LS-Net: Learning to solve nonlinear least squares for monocular stereo. In ECCV, 2018.

[6] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In SIGGRAPH, 1996.

[7] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.

[8] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface re-integration. ACM Transactions on Graphics (TOG), 2017.

[9] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.

[10] Ravi Garg, Vijay Kumar B G, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.

[11] S. K. Gottipati, K. Seo, D. Bhatt, V. Mai, K. Murthy, and L. Paull. Deep active localization. IEEE Robotics and Automation Letters, 4(4):4394-4401, Oct 2019.

[12] Edward Grefenstette, Brandon Amos, Denis Yarats, Phu Mon Htut, Artem Molchanov, Franziska Meier, Douwe Kiela, Kyunghyun Cho, and Soumith Chintala. Generalized inner loop meta-learning. arXiv preprint arXiv:1910.01727, 2019.

[13] JunYoung Gwak, Christopher B. Choy, Manmohan Chandraker, Animesh Garg, and Silvio Savarese. Weakly supervised 3D reconstruction with adversarial constraint. In 3DV, 2017.

[14] Lei Han, Mengqi Ji, Lu Fang, and Matthias Nießner. RegNet: Learning the optimization of direct image-to-image pose registration. CoRR, abs/1812.10212, 2018.

[15] Ankur Handa, Michael Bloesch, Viorica Patraucean, Simon Stent, John McCormac, and Andrew Davison. gvnn: Neural network library for geometric computer vision. In ECCV Workshop on Geometry Meets Deep Learning, 2016.

[16] A. Handa, T. Whelan, J. B. McDonald, and A. J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In ICRA, 2014.

[17] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.

[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735-1780, 1997.

[19] Homan Igehy. Tracing ray differentials. In SIGGRAPH, 1999.

[20] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NeurIPS, 2015.

[21] Maik Keller, Damien Lefloch, Martin Lambers, Shahram Izadi, Tim Weyrich, and Andreas Kolb. Real-time 3D reconstruction in dynamic scenes using point-based fusion. In 3DV, 2013.

[22] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In ICCV, 2015.

[23] Christian Kerl, Jürgen Sturm, and Daniel Cremers. Robust odometry estimation for RGB-D cameras. In ICRA, 2013.

[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.

[25] Michael Lampton. Damping-undamping strategies for the Levenberg-Marquardt nonlinear least-squares method. Computers in Physics, 1997.

[26] Ruihao Li, Sen Wang, Zhiqiang Long, and Dongbing Gu. UnDeepVO: Monocular visual odometry through unsupervised deep learning. In ICRA, 2018.

[27] Bruce D. Lucas, Takeo Kanade, et al. An iterative image registration technique with an application to stereo vision, 1981.

[28] B. Mu, S. Liu, L. Paull, J. Leonard, and J. P. How. SLAM with objects using a nonparametric pose graph. In IROS, 2016.

[29] Richard A. Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J. Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew W. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In ISMAR, 2011.

[30] Chuong V. Nguyen, Shahram Izadi, and David Lovell. Modeling Kinect sensor noise for improved 3D reconstruction and tracking. In International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission (3DIMPVT), 2012.

[31] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and P.-P. Sloan. Interactive ray tracing for isosurface rendering. In IEEE Visualization, 1998.

[32] P. Parkhiya, R. Khawad, J. K. Murthy, B. Bhowmick, and K. M. Krishna. Constructing category-specific models for monocular object-SLAM. In ICRA, 2018.

[33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch, 2017.

[34] Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Kornia: An open source differentiable computer vision library for PyTorch. In WACV, 2020.

[35] F. J. Richards. A flexible growth function for empirical use. Journal of Experimental Botany, 1959.

[36] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. In NeurIPS, 2019.

[37] Frank Steinbrücker, Jürgen Sturm, and Daniel Cremers. Real-time visual odometry from dense RGB-D images. In ICCV Workshops, 2011.

[38] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IROS, 2012.

[39] Chengzhou Tang and Ping Tan. BA-Net: Dense bundle adjustment network. In ICLR, 2019.

[40] K. Tateno, F. Tombari, I. Laina, and N. Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In CVPR, 2017.

[41] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.

[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

[43] Lukas von Stumberg, Patrick Wenzel, Qadeer Khan, and Daniel Cremers. GN-Net: The Gauss-Newton loss for deep direct SLAM. CoRR, abs/1904.11932, 2019.

[44] Thomas Whelan, Michael Kaess, Hordur Johannsson, Maurice Fallon, John J. Leonard, and John McDonald. Real-time large-scale dense RGB-D SLAM with volumetric fusion. The International Journal of Robotics Research, 34(4-5):598-626, 2015.

[45] S. Yang and S. Scherer. CubeSLAM: Monocular 3-D object SLAM. IEEE Transactions on Robotics, 35(4):925-938, 2019.

[46] Shuaifeng Zhi, Michael Bloesch, Stefan Leutenegger, and Andrew J. Davison. SceneCode: Monocular dense semantic reconstruction using learned encoded scene representations. In CVPR, 2019.

[47] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. DeepTAM: Deep tracking and mapping. In ECCV, 2018.

[48] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.

A. ∇SLAM: Library

In this paper, we demonstrated that classical dense SLAM systems such as KinectFusion [29], PointFusion [21], and ICP-SLAM can all be realized as differentiable computations. However, the set of differentiable modules introduced herein can be used to construct several newer differentiable SLAM systems. To this end, we intend to make the ∇SLAM framework publicly available as open-source software.

∇SLAM is built on top of PyTorch [33], a reverse-mode automatic differentiation library that supports computation over multi-dimensional arrays (often misnomered tensors). At the time of writing this article, ∇SLAM supports the following functionality8:

1. Non-linear least squares optimization

2. Depth-based perspective warping (dense visual odometry [23])

3. Point-to-plane ICP

4. Raycasting

5. TSDF volumetric fusion

6. PointFusion (surfel map building)

7. ICP-SLAM

8. Boilerplate operations (Lie algebraic utilities, differentiable vertex and normal map computation, etc.)

8 All of these operations are performed fully differentiably.

∇SLAM is intended to be an out-of-the-box PyTorch-based SLAM framework. It currently interfaces with popular datasets such as ScanNet [7], the TUM RGB-D benchmark [38], and ICL-NUIM [16]. At present, low-resolution reconstructions (for example, a 128 × 128 × 128 TSDF volume) run at about 10 Hz on a mid-range laptop GPU (NVIDIA GeForce 1060). A hypothetical usage sketch is shown below.
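The snippet illustrates the intended flavour of the interface; the module and class names are illustrative placeholders, not the documented API of the released library:

import torch
# Hypothetical imports; the released gradslam API may differ.
from gradslam import RGBDImages
from gradslam.slam import PointFusion

colors = torch.rand(1, 4, 240, 320, 3)                      # (batch, seq, H, W, 3)
depths = torch.rand(1, 4, 240, 320, 1, requires_grad=True)  # gradients flow here
intrinsics = torch.eye(4).expand(1, 1, 4, 4)

slam = PointFusion()
pointclouds, poses = slam(RGBDImages(colors, depths, intrinsics))
# Any loss defined on `pointclouds` or `poses` can now be backpropagated
# all the way to `depths` (and, through the pixels, to upstream networks).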

For more details on release timelines, and for more visualizations/results, one can visit the project webpage at https://gradslam.github.io.

B. Frequently asked questions (FAQ)

1. Q: So, ∇SLAM is just classical dense SLAM, implemented using an autograd-compatible language/library?
A: Yes and no. Technically, while it is possible to "simply" implement dense SLAM in an autograd-compatible library (e.g., PyTorch [33]), in such a case the obtained gradients would not be meaningful enough to be used in a gradient-based learning pipeline. We believe that ∇SLAM addresses many such problems (of the gradients being zero "almost everywhere", akin to impulse functions).

2. Q: The paper paints a rosy picture of ∇SLAM. What are some of the shortcomings of the framework?
A: Unrolling each computation in dense SLAM as a graph requires an enormous amount of memory. For example, running a differentiable KinectFusion [29] algorithm using a coarse voxel resolution of 128 × 128 × 128 ends up requiring 6 GB of GPU memory on average. This severely restricts the size of scenes that can be reconstructed in this framework, and is one of the primary concerns we are tackling at the moment. Another aspect of ∇SLAM we are improving upon is adding more robust (differentiable) filters into several stages of the pipeline, such as ICP, photometric warping, etc. We are also working on integrating M-estimators into the optimization routine.

3. Q: What is the application of such a system?
A: We envisage a plethora of applications for a differentiable SLAM system, ranging from enabling spatially-grounded learning, to self-supervision, to task-oriented learning and beyond. We believe that ∇SLAM benefits greatly from following the same modular structure as conventional SLAM systems. This could potentially allow localized learning in only the submodules of a SLAM system that actually need to be learnt.

C. Acknowledgements

The authors thank Gunshi Gupta for helping out with some of the ∇LM experiments. KM would like to thank Ankur Handa for providing valuable feedback/advice on this work. KM also thanks Zeeshan Zia, Ronald Clark, and Sajad Saeedi, who participated in the initial brainstorming sessions. KM and GI acknowledge timely help from Aaditya Saraiya, Parv Parkhiya, Akshit Gandhi, and Shubham Garg, all from CMU, who helped capture a live sequence that features in the ∇SLAM video. During the early stages of the project, KM benefitted from participating in brainstorming sessions with Tejas Khot and Gautham Swaminathan. The authors acknowledge the impact of feedback from colleagues at the Robotics and Embodied AI Lab (REAL), Université de Montréal, on this work; in particular, Gunshi Gupta, Bhairav Mehta, and Mark Van der Merwe.

D. Contributions

Krishna Murthy Jatavallabhula came up with this idea, inspired by an unrelated question on one of the all-nighter coding projects he chose to put up on GitHub (https://github.com/krrish94/dvo_python). He implemented initial prototypes and wrote large parts of the paper.

Soroush Saryazdi led the development of the gradslam library. Although he only came aboard after our ICRA 2020 submission, he was extremely enthusiastic about the project and was instrumental in its release.

Ganesh Iyer ran several experiments and implemented the differentiable ICP and PointFusion pipelines. He led several interesting discussions on follow-up work, and contributed to shaping the core ideas of gradslam.

Liam Paull was the ideal advisor on this project. He supported the rest of us by inspiring us to push harder, and by instilling the spirit of slow science. He wrote a majority of this paper!
