
Notes on Least-Squares and SLAM

DRAFT

Giorgio Grisetti

November 2, 2014

Contents

Part I: Least Squares and SLAM

1 Simultaneous Localization and Mapping
  1.1 Probabilistic Formulation of SLAM
  1.2 Document Structure

2 State Estimation of a Stationary Non-Linear System
  2.1 Problem Definition
  2.2 Least Squares Estimation: Basics and Probabilistic Interpretation
    2.2.1 Direct Minimization Methods: Gauss, Newton and their friends
    2.2.2 Gaussian Conditional: p(x|z) ∼ N(x∗, H^{-1})
  2.3 Smooth Manifolds (a.k.a. Non-Euclidean Pain)
  2.4 Approaching a Least Squares Estimation Problem: Methodology

3 Sparse Least Squares
  3.1 Typical Problems
    3.1.1 Example: 2D Multirobot Point Registration
    3.1.2 Example: Pose SLAM
  3.2 A Graphical Representation: Factor Graphs
  3.3 Structure of the Linearized System
  3.4 Sparse Least Squares on Manifold
  3.5 Robust Least Squares

4 Data Association
  4.1 Data Association
  4.2 RANSAC

Part II: Other Stuff

5 Introduction to Mobile Robot Navigation and SLAM
  5.1 Localization
  5.2 Mapping
  5.3 Planning

6 State Estimation: Filtering
  6.1 Recursive Bayes Formulation
  6.2 Gaussian Filters
    6.2.1 Kalman Filter
    6.2.2 Information Filter
    6.2.3 Example: Localization
  6.3 Non-Parametric Filters
    6.3.1 Discrete Filter and Hidden Markov Models
    6.3.2 Particle Filters
    6.3.3 Example: Localization

A Gaussian Distribution
  A.1 Partitioned Gaussian Densities
  A.2 Marginalization of a Partitioned Gaussian Density
  A.3 Conditioning of a Partitioned Gaussian Density
  A.4 Affine Transformations
  A.5 Chain Rule

Part I

Least Squares and SLAM


Chapter 1

Simultaneous Localization and Mapping

Many tasks envisioned to be carried out by mobile robots, including transportation, search and rescue, or automated vacuum cleaning, require a map of the environment. The availability of an accurate map allows for the design of systems that can operate in complex environments based only on their on-board sensors, without relying on an external reference system such as GPS. The acquisition of maps of indoor environments, where GPS is typically not available, has been a major research focus in the robotics community over the last decades. Learning maps under pose uncertainty is often referred to as the simultaneous localization and mapping (SLAM) problem. In the literature, a large variety of solutions to this problem is available. These approaches can be classified as either filtering or smoothing. Filtering approaches model the problem as an on-line state estimation where the state of the system consists of the current robot position and the map. The estimate is augmented and refined by incorporating the new measurements as they become available. Popular techniques like Kalman filters [24, 2], particle filters [20, 9, 8], or information filters [5, 25] fall into this category. To highlight their incremental nature, the filtering approaches are usually referred to as on-line SLAM methods. Conversely, smoothing approaches estimate the full trajectory of the robot from the full set of measurements [19, 3, 21]. These approaches address the so-called full SLAM problem, and they typically rely on least-squares error minimization techniques.

Figure 1.1 shows three examples of real robotic systems that use SLAM technology: an autonomous car, a tour-guide robot, and an industrial mobile manipulation robot. Image (a) shows the autonomous car Junior as well as a model of a parking garage that has been mapped with that car. Thanks to the acquired model, the car is able to park itself autonomously at user-selected locations in the garage. Image (b) shows the TPR-Robina robot developed by Toyota, which is also used in the context of guided tours in museums. This robot uses SLAM technology to update its map whenever the environment has been changed. Robot manufacturers such as KUKA recently presented mobile manipulators as shown in Image (c). Here, SLAM technology is needed to operate such devices in a flexible way in changing industrial environments. Figure 1.2 illustrates 2D and 3D maps that can be estimated by the SLAM algorithm discussed in this paper.

An intuitive way to address the SLAM problem is via its so-called graph-based formulation. Solving a graph-based SLAM problem involves constructing a graph whose nodes represent robot poses or landmarks and in which an edge between two nodes encodes a sensor measurement that constrains the connected poses. Obviously, such constraints can be contradictory since observations are always affected by noise. Once such a graph is constructed, the crucial problem is to find a configuration of the nodes that is maximally consistent with the measurements. This involves solving a large error minimization problem.

Figure 1.1: Applications of SLAM technology. (a) An autonomous instrumented car developed at Stanford. This car can acquire maps by utilizing only its on-board sensors. These maps can subsequently be used for autonomous navigation. (b) The museum guide robot TPR-Robina developed by Toyota (picture courtesy of Toyota Motor Company). This robot acquires a new map every time the museum is reconfigured. (c) The KUKA Concept robot “Omnirob”, a mobile manipulator designed to autonomously navigate and operate in the environment with the sole use of its on-board sensors (picture courtesy of KUKA Roboter GmbH).

The graph-based formulation of the SLAM problem was proposed by Lu and Milios in 1997 [19]. However, it took several years to make this formulation popular, due to the comparably high complexity of solving the error minimization problem using standard techniques. Recent insights into the structure of the SLAM problem and advancements in the field of sparse linear algebra resulted in efficient approaches to the optimization problem at hand. Consequently, graph-based SLAM methods have undergone a renaissance and currently belong to the state-of-the-art techniques with respect to speed and accuracy. The aim of this tutorial is to introduce the SLAM problem in its probabilistic form and to guide the reader to the synthesis of an effective and state-of-the-art graph-based SLAM method. To understand this tutorial, a good knowledge of linear algebra, multivariate minimization, and probability theory is required.

1.1 Probabilistic Formulation of SLAM

Solving the SLAM problem consists of estimating the robot trajectory and the map of the environment as the robot moves in it. Due to the inherent noise in the sensor measurements, a SLAM problem is usually described by means of probabilistic tools. The robot is assumed to move in an unknown environment, along a trajectory described by the sequence of random variables x1:T = x1, . . . , xT. While moving, it acquires a sequence of odometry measurements u1:T = u1, . . . , uT and perceptions of the environment z1:T = z1, . . . , zT. Solving the full SLAM problem consists of estimating the posterior probability of the robot’s trajectory x1:T and the map m of the environment, given all the measurements plus an initial position x0:

p(x1:T ,m | z1:T ,u1:T ,x0). (1.1)

The initial position x0 defines the position of the map and can be chosen arbitrarily. For convenience of notation, in the remainder of this document we will omit x0. The poses x1:T and the odometry u1:T are usually represented as 2D or 3D transformations in SE(2) or in SE(3), while the map can be represented in different ways. Maps can be parametrized as a set of spatially located landmarks, by dense representations like occupancy grids, surface maps, or by raw sensor measurements. The choice of a particular map representation depends on the sensors used, on the characteristics of the environment, and on the estimation algorithm. Landmark maps [24, 20] are often preferred in environments where locally distinguishable features can be identified, especially when cameras are used. In contrast, dense representations [26, 9, 8] are usually used in conjunction with range sensors. Independently of the type of the representation, the map is defined by the measurements and the locations where these measurements have been acquired [14, 15]. Figure 1.2 illustrates three typical dense map representations for 3D and 2D: multilevel surface maps, point clouds and occupancy grids. Figure 1.3 shows a typical 2D landmark-based map.

Estimating the posterior given in (1.1) involves operating in high-dimensional state spaces. This would not be tractable if the SLAM problem did not have a well-defined structure. This structure arises from certain commonly made assumptions, namely the static world assumption and the Markov assumption. A convenient way to describe this structure is via the dynamic Bayesian network (DBN) depicted in Figure 1.4. A Bayesian network is a graphical model that describes a stochastic process as


Figure 1.2: (a) A 3D map of the Stanford parking garage acquired with an instrumented car (bottom), and the corresponding satellite view (top). This map has subsequently been used to realize an autonomous parking behavior. (b) Point cloud map acquired at the University of Freiburg (courtesy of Kai M. Wurm) and the relative satellite image. (c) Occupancy grid map acquired at the hospital of Freiburg. Top: a bird’s-eye view of the area; bottom: the occupancy grid representation. The gray areas represent unobserved regions, the white parts represent traversable space, while the black points indicate occupied regions.


Figure 1.3: Landmark-based map acquired at the German Aerospace Center. In this setup the landmarks consist of white circles painted on the ground that are detected by the robot through vision, as shown in the left image. The right image illustrates the trajectory of the robot and the estimated positions of the landmarks. These images are courtesy of Udo Frese and Christoph Hertzberg.



Figure 1.4: Dynamic Bayesian Network of the SLAM process.

a directed graph. The graph has one node for each random variable in the process, and a directed edge (or arrow) between two nodes models a conditional dependence between them.

In Figure 1.4, one can distinguish blue/gray nodes indicating the observed variables (here z1:T and u1:T) and white nodes which are the hidden variables. The hidden variables x1:T and m model the robot’s trajectory and the map of the environment. The connectivity of the DBN follows a recurrent pattern characterized by the state transition model and by the observation model. The transition model p(xt | xt−1, ut) is represented by the two edges leading to xt and represents the probability that the robot at time t is in xt, given that at time t − 1 it was in xt−1 and it acquired an odometry measurement ut.

The observation model p(zt | xt, m) models the probability of performing the observation zt given that the robot is at location xt in the map. It is represented by the arrows entering zt. The exteroceptive observation zt depends only on the current location xt of the robot and on the (static) map m. Expressing SLAM as a DBN highlights its temporal structure, and therefore this formalism is well suited to describe filtering processes that can be used to tackle the SLAM problem.

An alternative representation to the DBN is the so-called “graph-based” or “network-based” formulation of the SLAM problem, which highlights the underlying spatial structure. In graph-based SLAM, the poses of the robot are modeled by nodes in a graph and labeled with their position in the environment [19, 15]. Spatial constraints between poses that result from observations zt or from odometry measurements ut are encoded in the edges between the nodes. More in detail, a graph-based SLAM algorithm constructs a graph out of the raw sensor measurements. Each node in the graph represents a robot position and a measurement acquired at that position. An edge between two nodes represents a spatial constraint relating the two robot poses. A constraint consists of a probability distribution over the relative transformations between the two poses. These transformations are either odometry measurements between sequential robot positions or are determined by aligning the observations acquired at the two robot locations. Once the graph is constructed, one seeks to find the configuration of the robot poses that best satisfies the constraints. Thus, in graph-based SLAM the problem is decoupled into two tasks: constructing the graph from the raw measurements (graph construction), and determining the most likely configuration of the poses given the edges of the graph (graph optimization). The graph construction is usually called front-end and it is heavily sensor dependent, while the second part is called back-end and relies on an abstract representation of the data which is sensor agnostic. A short example of a front-end for 2D laser SLAM is described in Section ??. In this tutorial we will describe an easy-to-implement but efficient back-end for graph-based SLAM. Figure 1.5 depicts an uncorrected pose-graph and the corresponding corrected one.
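As a concrete (if simplified) sketch of the objects involved, the pose-graph described above can be represented by a node list and an edge list. The names used here (PoseGraph, add_pose, add_edge) are illustrative and not taken from any particular library:

```python
import numpy as np

# Minimal pose-graph container: nodes are 2D robot poses (x, y, theta),
# edges are relative-pose constraints with an information matrix.
class PoseGraph:
    def __init__(self):
        self.poses = []   # node i: 2D pose (x, y, theta)
        self.edges = []   # (i, j, z_ij, omega_ij): measured transform + information

    def add_pose(self, pose):
        self.poses.append(np.asarray(pose, dtype=float))
        return len(self.poses) - 1

    def add_edge(self, i, j, z_ij, omega_ij):
        # z_ij: measured transform from pose i to pose j, e.g. from odometry
        # or from aligning the observations taken at the two poses
        self.edges.append((i, j, np.asarray(z_ij, dtype=float),
                           np.asarray(omega_ij, dtype=float)))

# front-end sketch: two sequential poses linked by an odometry edge
g = PoseGraph()
i0 = g.add_pose([0.0, 0.0, 0.0])
i1 = g.add_pose([1.0, 0.0, 0.0])
g.add_edge(i0, i1, [1.0, 0.0, 0.0], np.eye(3))  # identity information matrix
```

The back-end would then optimize `g.poses` so that the relative transforms between connected nodes match the stored `z_ij` as well as possible.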

1.2 Document Structure

Figure 1.5: Pose-graph corresponding to a dataset recorded at MIT Killian Court (courtesy of Mike Bosse and John Leonard), before (left) and after (right) optimization. The maps are obtained by rendering the laser scans according to the robot positions in the graph.

The purpose of this document is to give you an understanding of SLAM. However, you may have realized that SLAM is a problem that can be solved in many ways. In this block of lectures I will discuss one possible solution to these problems, which is now regarded as the most promising. To this end we will make extensive use of least squares optimization, and of our knowledge about Gaussian distributions.

In the remainder of this document you will learn a methodology for approaching problems with least squares. Naive implementations of least squares, however, are problematic in circumstances where the state space is not a vector space. This happens, for instance, when our robot rotates and we want to estimate its orientation. In these cases, however, if the state space is locally smooth (more formally, if it is a smooth manifold), we can still deal with it. You will learn these wonders in Chapter 2.

Once you become confident in dealing with least squares problems involving a small state space, and you try to use the very same approach on large problems, you will be disappointed by its slowness. But you will be enlightened by looking at the structure of the problem, and you will discover that you can save a lot of computation. You will reach this wisdom after reading Chapter 3.

After you master least squares, you will be telling yourself: “Yes, I know how to solve a least squares problem, but how do I make one out of the raw data of the robot?”. This requires solving the so-called data association problem, that is, determining which variable in your state your system is currently observing. Imagine you have a million needles in a pot, each one with its position. You can observe the pot, and thus determine the position of each needle. But how do you associate the portion of image capturing a needle to the right variable in the state? In SLAM things don’t look that bad, but still you will have to figure out which needle is which. I will disclose these secrets to you in Chapter 4.


Chapter 2

State Estimation of a Stationary Non-Linear System

2.1 Problem Definition

Let W be a stationary system, whose state is parameterized by a set of state variables x = x1, . . . , xN; these variables can span over arbitrary spaces. We observe the state of the system indirectly through a set of sensors, from which we obtain measurements z = z1, . . . , zK. Here zk is the kth measurement. This situation is illustrated in Fig. 2.1. The measurements are affected by noise, so in fact the zk are random variables. We assume that the state models all the knowledge required to predict the distribution of a measurement. Since the measurements are affected by noise, it is impossible to determine the exact state of the system given the measurements. Instead, we can compute a distribution over the potential states of the system given these measurements. More formally, we want to estimate the probability distribution of the state x, given the measurements z:

p(x | z) = p(x1, . . . ,xN | z1, . . . , zK) (2.1)

= p(x1:N |z1:K). (2.2)

In Eq. 2.2, x1:N simply indicates all state variables with indices from 1 to N, and similarly z1:K indicates all measurements between 1 and K.

Unfortunately, there is in general no closed form to compute Eq. 2.2, for many reasons. The most important are listed below:

• the measurements zk typically allow us to observe only a subset of the state variables. Additionally, the mapping between the measurements and the states is in general non-linear. Thus, the probability distribution can be multimodal and have complex shapes

• we can have an arbitrary number of measurements. Sometimes they are not enough to fully characterize the distribution over states; at other times the measurements cannot be explained by any particular state, due to noise, and we thus have an overdetermined problem

• some of the measurements might be wrong, for instance because we cannot correctly identify which state variables generated the observations.

Figure 2.1: Illustration of state estimation on a stationary system. W is the phenomenon being observed. The state of this phenomenon is indirectly measured through a set of sensors, each of them measuring a quantity zk = hk(x) which depends on the system’s state.

However, we notice that there is another distribution that is in general easy to obtain. This is the conditional distribution of a measurement given the state, p(zk|x), commonly known as the sensor model or observation model. This is a predictive distribution that tells us the probability of obtaining a certain measurement zk, assuming the state x to be known. Another hint comes from the fact that the measurements are independent, given the state. Put in other words, this means that if I know the stationary state, taking a measurement za does not tell me anything about another measurement zb, because I know the state, and the state captures all the relevant information necessary to predict zb. We can put this prose in probabilistic form as follows:

p(z1:K | x1:N) = ∏_{k=1}^{K} p(zk | x1:N)    (2.3)

The term p(z1:K | x1:N) is known as the likelihood of the measurements given the states. It is a conditional distribution that changes its shape when the states change.

Our initial goal, however, was to compute the distribution of the states given the measurements. We know that the likelihood can be easily computed. Can we use this fact to simplify our life? The answer is yes (otherwise you would probably not be reading this document). With the Bayes rule we can manipulate Eq. 2.2 to highlight the contribution of the likelihood:

p(x1:N | z1:K) = p(z1:K | x1:N) · p(x1:N) / p(z1:K)    (2.4)
               = p(z1:K | x1:N) px / pz    (2.5)
               = η px p(z1:K | x1:N)    (2.6)
               ∝ ∏_{k=1}^{K} p(zk | x1:N)    (2.7)

Here p(z1:K | x1:N) is the likelihood, p(x1:N) is the prior and p(z1:K) is the normalizer.

In Eq. 2.4 the prior p(x1:N) models the potential knowledge we have about the states before we make the measurements. If we know nothing, the prior is a uniform distribution whose value is a constant px, and the normalizer p(z1:K) does not depend on the states. We know the measurements, and they do not change, so p(z1:K) is just a constant number pz (see Eq. 2.5). To drop the fraction, we let η = 1/pz (Eq. 2.6). Eq. 2.7 is obtained from the previous one by dropping the constants and finally utilizing the independence assumption.
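The chain of Eqs. 2.4–2.7 can be grounded with a small numeric sketch: for a scalar state observed by several independent Gaussian measurements, the posterior is just the normalized product of the per-measurement likelihoods. All numbers below are invented for illustration:

```python
import numpy as np

# Toy illustration of Eqs. (2.4)-(2.7) for a scalar state x:
# with a uniform prior, the posterior is proportional to the
# product of the per-measurement likelihoods.
x = np.linspace(-5.0, 5.0, 1001)      # discretized state space
measurements = [0.9, 1.1, 1.3]        # three noisy observations of x
sigma = 0.5                           # assumed sensor noise std deviation

posterior = np.ones_like(x)           # uniform prior (constant p_x)
for z in measurements:
    likelihood = np.exp(-0.5 * ((z - x) / sigma) ** 2)
    posterior *= likelihood           # independence assumption, Eq. (2.7)

posterior /= posterior.sum()          # eta = 1/p_z, recovered by normalizing
x_star = x[np.argmax(posterior)]      # most likely state, near mean(measurements)
```

For equal-variance Gaussian likelihoods and a uniform prior, the posterior peaks at the mean of the measurements (here 1.1), which anticipates the least-squares solution derived later.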

Example: Point Registration Imagine the following scenario: we have a robot moving in a 2D world populated by unique 2D points. Unique means that, for instance, each point has a unique color, so that if we observe a point from different viewpoints, we can always determine to which point in the world the observation refers. Our robot is equipped with a powerful 2D sensor capable of determining the color and the position of a point relative to the robot. Let x = (tx ty θ) be the position of the robot w.r.t. the origin of our world. Let pi = (xi yi 1) be the position of one of the points w.r.t. the origin, in homogeneous coordinates. Our problem is to estimate the position of the robot x, given a set of measurements of the surrounding points. We assume to know the positions of these points in the global frame. Put in other words, we want to estimate p(x|z1:N), where zi refers to a measurement of one of the points.

Figure 2.2: Illustration of the scenario of a robot moving in a 2D world and sensing 2D point features.

To this end we just need to construct a sensor model, that is, the function describing the measurement density given the state: p(zi|x). If the robot is located at a certain position, we can predict the measurement the robot would gather about a certain point. This is the so-called observation model

hi(x) = T(x)^{-1} pi    (2.8)

that is, the position of the point w.r.t. the robot reference frame. Here I denote by T(x) the transformation matrix corresponding to the three-dimensional vector x, as follows:

T(x) = ( Rθ  t )      Rθ = ( cos θ  −sin θ )      t = ( tx )
       ( 0   1 )           ( sin θ   cos θ )          ( ty )     (2.9)

If the measurements are affected by a zero-mean additive Gaussian noise characterized by an information matrix Ωi, we can express the distribution of the measurement zi of the ith point, given a state x, as p(zi|x) = N(zi; hi(x), Σi), with Σi = Ωi^{-1}.
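Under these assumptions, the observation model of Eq. 2.8 can be sketched in a few lines; the function names are illustrative, and the Gaussian noise of the sensor is omitted since we only predict the mean:

```python
import numpy as np

# Sketch of the observation model of Eq. (2.8): h_i(x) = T(x)^{-1} p_i,
# predicting where a robot at pose x = (tx, ty, theta) sees world point p_i.
def transform(x):
    tx, ty, theta = x
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def h(x, p_world):
    # p_world expressed in homogeneous coordinates, as in the text
    p = np.array([p_world[0], p_world[1], 1.0])
    return (np.linalg.inv(transform(x)) @ p)[:2]

# robot at the origin, rotated by 90 degrees, observing the point (1, 0):
z_pred = h(np.array([0.0, 0.0, np.pi / 2]), np.array([1.0, 0.0]))
# the point appears at approximately (0, -1) in the robot frame
```

Evaluating h at a candidate pose and comparing with the actual measurement is exactly what the error function of the next section will do.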

2.2 Least Squares Estimation: Basics and Probabilistic Interpretation

The conclusion of the previous section is captured by Eq. 2.7. In short, the distribution over the possible states given the measurements is proportional to the likelihood of the measurements given the states. The latter is easy to compute, just by knowing the observation model, which can easily be constructed once we know how the sensor works. If the noise affecting the measurements is zero-mean and normally distributed, the likelihood of a measurement will also be Gaussian:

p(zk|x) ∝ exp(−(ẑk − zk)^T Ωk (ẑk − zk))    (2.10)

Here Ωk = Σk^{-1} is the information matrix of the conditional measurement, and ẑk is the prediction of the measurement given the state x, representing the mean of the conditional distribution. The likelihood of all the measurements p(z|x) will also be Gaussian, being a product of Gaussians, according to Eq. 2.7.

The predicted measurement is obtained from the state by applying the sensor model hk(·):

ẑk = hk(x)    (2.11)

In case the sensor model is an affine transformation hk(x) = Ax + b and the noise in the measurements is Gaussian, p(x|z) is normally distributed as N(µx, Σx). If you have some problems with this statement, [23] might be a good read. Unluckily for us, the sensor model is in general a non-linear function of the states. However, if it is smooth enough we can approximate it in the neighborhood of a linearization point x by its Taylor expansion:

hk(x + ∆x) ≃ hk(x) + Jk ∆x,    with  ẑk := hk(x)  and  Jk := ∂hk(x)/∂x.    (2.12)

If we choose as linearization point the state x∗ that best explains the measurements,

x∗ = argmax_x p(z | x),    (2.13)

and we plug Eq. 2.12 into Eq. 2.10, we obtain a new distribution p(z|∆x) governed by the increments ∆x:

p(zk|x∗ + ∆x) ∝ exp[−(hk(x∗) + Jk∆x − zk)^T Ωk (hk(x∗) + Jk∆x − zk)].    (2.14)

Since x∗ is the optimal state given the measurements, we can treat it as a constant. Evaluating Eq. 2.14 at different ∆x tells us how well the likelihood fits the measurements under a perturbation of the optimal state.


As I said before, the measurement models are in general non-linear, and finding the state x∗ that best explains the measurements might require some work. However, once we have found this state, we can construct a local Gaussian approximation of our likelihood. Thus, let us split the problem in two parts:

• find the optimal x∗ = argmaxx p(z|x)

• determine the parameters of the local Gaussian approximation p(x|z) ≃ N(x; x∗, Σx|z). Note that in computing the Gaussian approximation around the optimum, I directly set its mean to be the optimum. This makes sense, since the mean is the point where the Gaussian is highest.

We will discuss how to carry out these steps in the remainder of this section.

2.2.1 Direct Minimization Methods: Gauss, Newton and their friends

We want to find the x∗ that maximizes the likelihood of the measurements:

x∗ = argmax_x ∏_{k=1}^{K} p(zk|x)    (2.15)

[Gaussian assumption]

   = argmax_x ∏_{k=1}^{K} exp[−(hk(x) − zk)^T Ωk (hk(x) − zk)]    (2.16)

[taking the logarithm]    (2.17)

   = argmax_x ∑_{k=1}^{K} −(hk(x) − zk)^T Ωk (hk(x) − zk)    (2.18)

[removing the minus]

   = argmin_x ∑_{k=1}^{K} (hk(x) − zk)^T Ωk (hk(x) − zk).    (2.19)

It might be a matter of personal taste, but I like Eq. 2.19 far more than Eq. 2.15. Let us introduce a new function, the error ek(x), which is the difference between the prediction and the observation:

ek(x) = hk(x)− zk. (2.20)

Similarly, we define the argument of the minimization as

F(x) = ∑_{k=1}^{K} ek(x)^T Ωk ek(x)    (2.21)

Let us assume that we have a reasonable initial guess x of the optimum. By exploiting the Taylor approximation in Eq. 2.12, we can rewrite one of the summands of Eq. 2.19 as

ek(x + ∆x) = (hk(x + ∆x) − zk)^T Ωk (hk(x + ∆x) − zk)    (2.22)

[plugging in Eq. 2.12]

≃ (Jk∆x + hk(x) − zk)^T Ωk (Jk∆x + hk(x) − zk)    (2.23)

[substituting hk(x) − zk = ek]

= (Jk∆x + ek)^T Ωk (Jk∆x + ek)    (2.24)

[expanding the products and grouping the terms]

= ∆x^T Jk^T Ωk Jk ∆x + 2 ek^T Ωk Jk ∆x + ek^T Ωk ek    (2.25)

[renaming Hk := Jk^T Ωk Jk and bk := Jk^T Ωk ek]

= ∆x^T Hk ∆x + 2 bk^T ∆x + ek^T Ωk ek.    (2.26)


Now we are ready to plug Eq. 2.26 into Eq. 2.21:

F(x + ∆x) ≃ ∑_{k=1}^{K} ( ∆x^T Hk ∆x + 2 bk^T ∆x + ek^T Ωk ek )    (2.27)

[pushing in the summations]

= ∆x^T H ∆x + 2 b^T ∆x + c,    (2.28)

where H := ∑_{k=1}^{K} Hk, b := ∑_{k=1}^{K} bk and c := ∑_{k=1}^{K} ek^T Ωk ek.

What we really did in Eq. 2.28 is write an expression for the objective function under a linear approximation of the sensor model in a neighborhood of the initial estimate x. We found out that, if we fix the initial estimate, the value of the function in the neighborhood can be approximated by a quadratic form in the increments ∆x. Thus, we can minimize this quadratic form and get the increment ∆x∗ that, applied to the current guess, gives a better solution:

x∗ = x+∆x∗. (2.29)

The question now is: how do we get the ∆x∗ that minimizes the quadratic form in Eq. 2.28? As we did in high school: we set the derivative of the function to zero and solve the corresponding equation. The derivative of Eq. 2.28 is

∂(∆x^T H ∆x + 2 b^T ∆x + c) / ∂∆x = 2 H ∆x + 2 b,    (2.30)

thus we can find the minimum by solving the linear system

H ∆x∗ = −b.    (2.31)

If the measurement function were linear, we would find the minimum in just one go. Since this is typically not the case, we have to iterate the procedure until no substantial improvements are made. The whole procedure is the Gauss-Newton (GN) algorithm and is described in detail in Algorithm 1.

Note that the Gauss-Newton algorithm is not guaranteed to converge in general. The convergence depends on a variety of factors, namely:

• how close our initial guess is to the actual minimum,

• how smooth the error functions are, and

• how far our guess is from potential singularities.

Note that the matrix H should be full rank if we do not want an under-determined solution.

It might also happen that one iteration of Gauss-Newton results in a new solution that is worse than the initial one. To overcome this problem we can use the Levenberg-Marquardt (LM) algorithm, which is a damped version of Gauss-Newton that introduces backup/restore steps to recover from wrong steps. At every iteration, instead of solving the system H∆x∗ = −b, LM solves a damped version of the system:

(H + λI) ∆x∗ = −b.    (2.32)

Intuitively, if λ → ∞, then ∆x∗ → 0. So, the higher the damping factor λ, the smaller the increments. An abstract implementation of LM is provided in Algorithm 2.

In general, the LM algorithm always converges, but the minimum found might not be the global one. In the remainder of this document we will further investigate techniques to make the optimization more robust; for now, however, you have two powerful tools to find the most likely state of a system given its measurements. In the next section we will see how to determine the distribution of the states, instead of just the most likely state. Let us now focus on a simple example that can be used to ground this theory.


Algorithm 1 Gauss-Newton minimization algorithm

Require: x: initial guess; C = {⟨zk, Ωk⟩}: measurements
Ensure: x∗: new solution
  // compute the current error
  Fnew ← F(x)
  // iterate until no substantial improvements are made
  repeat
    F ← Fnew
    b ← 0, H ← 0
    for all k = 1 . . . K do
      // compute the prediction ẑk, the error ek and the Jacobian Jk
      ẑk ← hk(x)
      ek ← ẑk − zk
      Jk ← ∂hk(x)/∂x (evaluated at the current x)
      // compute the contribution of this measurement to the linear system
      Hk ← Jk^T Ωk Jk
      bk ← Jk^T Ωk ek
      // accumulate the contributions to construct the overall system
      H ← H + Hk
      b ← b + bk
    end for
    // solve the linear system
    ∆x ← solve(H ∆x = −b)
    // update the state
    x ← x + ∆x
    // compute the new error
    Fnew ← F(x)
  until F − Fnew < ε
  return x
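Algorithm 1 can be sketched in a few lines of Python for the point registration example below. This is a toy illustration, not a reference implementation: the Jacobian is computed by finite differences to keep the sketch short (a real implementation would use the analytic Jacobian of Eq. 2.37), the loop runs a fixed number of iterations instead of testing the error decrease, and an optional damping term lam gives the LM-style system of Eq. 2.32:

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def error(x, p, z):
    # e_i(x) = R_theta^T (p_i - t) - z_i, the registration error of Eq. (2.33)
    return rot(x[2]).T @ (p - x[:2]) - z

def gauss_newton(points, zs, x0, iters=20, lam=0.0, eps=1e-6):
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        H = lam * np.eye(3)          # lam > 0 gives the damped system of Eq. (2.32)
        b = np.zeros(3)
        for p, z in zip(points, zs):
            e = error(x, p, z)
            J = np.zeros((2, 3))     # finite-difference Jacobian of e_i w.r.t. x
            for j in range(3):
                dx = np.zeros(3)
                dx[j] = eps
                J[:, j] = (error(x + dx, p, z) - e) / eps
            H += J.T @ J             # Omega_k = I for simplicity
            b += J.T @ e
        x += np.linalg.solve(H, -b)  # solve H dx = -b, then apply the increment
    return x

# synthetic, noise-free data: true pose (1, 2, 0.3) and three known world points
x_true = np.array([1.0, 2.0, 0.3])
points = [np.array([3.0, 1.0]), np.array([0.0, 4.0]), np.array([-2.0, 1.0])]
zs = [error(x_true, p, np.zeros(2)) for p in points]   # ideal measurements
x_est = gauss_newton(points, zs, [0.0, 0.0, 0.0])
# x_est converges to the true pose for this noiseless, well-conditioned toy problem
```

Since the data are noise-free, the minimum of F is exactly the true pose; with noisy measurements the estimate would instead converge to the maximum likelihood pose.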

Example: Point Registration Let us back up to the previous example on point registration. As Ipromised you, once you know the error function (or the likelihood function p(z|x) you can approach theproblem with least squares.

In our previous example on point registration we found the likelihood function, from which we can easily determine the error function e_i(x) = h(x) − z_i:

e_i(x) = R_θ^T (x_i − t_x ; y_i − t_y) − z_i   (2.33)

Here we dropped the last row of the homogeneous notation, since it does not carry any information. Thus the measurements will be represented by two-dimensional vectors.

To implement Algorithm 1, we only need to compute the Jacobian J_k = ∂e_k(x)/∂x |_{x=x̆}. Let us rewrite Eq. 2.33 to separate the contributions of p_i = (x_i y_i)^T and t = (t_x t_y)^T, as follows:

e_i(x) = R_θ^T p_i − R_θ^T t − z_i.   (2.34)

In this straightforward case the Jacobian w.r.t. the first two components of the state vector, t_x and t_y, is just:

∂e_i(x)/∂t = −R_θ^T.   (2.35)

The Jacobian w.r.t. the angular component θ is

∂e_i(x)/∂θ = (∂R_θ^T/∂θ) (p_i − t) = R'_θ p'_i,   (2.36)

where we abbreviate R'_θ = ∂R_θ^T/∂θ and p'_i = p_i − t.

14

The full Jacobian will then be

∂e_i(x)/∂x = ( −R_θ^T   R'_θ p'_i ).   (2.37)

If you plug these equations into Algorithm 1, it will work. The reader might notice that we could have substantially simplified our existence if, instead of seeking the transform x that maps the position of the robot in the world frame, we had looked for its inverse. This comes from the algebraic form of Eq. 2.33. Let us then parameterize the state of the robot by the transform x' = (t'_x t'_y θ') such that T(x') = T^{-1}(x). In this case, the error function would have been

e_i(x') = R_{θ'} p_i + t' − z_i.   (2.38)

If you compare Eq. 2.38 with Eq. 2.33 you will notice that the former is much simpler than the latter. Accordingly, the Jacobian of Eq. 2.38 has the following form:

∂e_i(x')/∂x' = ( I_{2×2}   R'_{θ'} p_i ).   (2.39)
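As a sanity check, the whole pipeline of Algorithm 1 applied to this registration problem fits in a few lines. The sketch below (Python with NumPy; helper names are ours, and Ω_i = I for simplicity) uses the simpler error of Eq. 2.38, e_i = R_θ p_i + t − z_i, with the Jacobian of Eq. 2.39:

```python
import numpy as np

def rot(theta):
    """2D rotation matrix R_theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def drot(theta):
    """Derivative of R_theta w.r.t. theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[-s, -c], [c, -s]])

def gauss_newton_registration(points, measurements, x0, iterations=10):
    """Estimate x = (tx, ty, theta) such that R_theta p_i + t ~ z_i (Eq. 2.38)."""
    x = np.array(x0, dtype=float)
    for _ in range(iterations):
        H, b = np.zeros((3, 3)), np.zeros(3)
        for p, z in zip(points, measurements):
            e = rot(x[2]) @ p + x[:2] - z                               # error, Eq. 2.38
            J = np.hstack([np.eye(2), (drot(x[2]) @ p).reshape(2, 1)])  # Jacobian, Eq. 2.39
            H += J.T @ J                                                # Omega_i = I
            b += J.T @ e
        x += np.linalg.solve(H, -b)                                     # solve H dx = -b
    return x
```

On noise-free data and with a reasonable initial guess, the loop typically converges to the true transform within a few iterations.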

Example: Bearing-only Registration In the previous example we assumed to have a powerful sensor capable of detecting the Cartesian coordinates of a point in its reference frame. What if we are poor and can only afford a much cheaper bearing-only sensor? Can we still localize our sensor?

In this simple case, trying is faster than thinking. Let p_i be a world point and R_θ|t be the transform of the origin w.r.t. the reference frame of the robot. The point sensed in the robot frame will be

p'_i = g_i(x) = R_θ p_i + t = (x'_i ; y'_i).   (2.40)

Here g_i(x) is a function introduced to highlight the dependency of the projected point on the transform in x. Once we have the point in the robot reference frame, we can compute its bearing as:

ẑ_i = h_i(x) = atan(y'_i / x'_i).   (2.41)

The error function is just the difference between the predicted and the sensed bearing:

e_i(x) = ẑ_i − z_i.   (2.42)

Ok, the error function is done. To be nice we should fix the angular wraparounds of our estimator, but we will deal with this in the next section. What remains to compute is the Jacobian, which in this case will be a 1×3 matrix. We notice that Eq. 2.42 can be rewritten as

e_i(x) = atan(p'_i) − z_i = atan(g_i(x)) − z_i.   (2.43)

Here we assumed atan(·) can be fed with a 2D point and returns its bearing. By using the chain rule of derivatives, the Jacobian at the point x̆ becomes:

J_i(x) = ∂atan(p)/∂p |_{p=p'_i} · ∂g_i(x)/∂x |_{x=x̆}.   (2.44)

We already computed the derivative of g_i w.r.t. x in Eq. 2.39, so we only have to compute ∂atan(p)/∂p. Recalling that d atan(a)/da = 1/(1 + a²) and that atan(p) = atan(y/x), where x and y are the coordinates of p, we can write:

∂atan(p)/∂p = 1/(1 + (y/x)²) · ( −y/x²   1/x ).   (2.45)

Putting it all together in Eq. 2.44 we get

J_i(x) = 1/(x'_i² + y'_i²) · ( −y'_i   x'_i ) · ( I_{2×2}   R'_θ p_i ).   (2.46)

With our utmost satisfaction we observe that Eq. 2.46 is dimensionally correct.
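A cheap way to validate Eq. 2.46 is to compare it against a numeric Jacobian. A minimal sketch (Python/NumPy; helper names are ours):

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def drot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[-s, -c], [c, -s]])

def bearing_jacobian(x, p):
    """Analytic 1x3 Jacobian of the bearing prediction (Eq. 2.46)."""
    pp = rot(x[2]) @ p + x[:2]                                    # p' = g(x), Eq. 2.40
    xp, yp = pp
    J_atan = np.array([-yp, xp]) / (xp**2 + yp**2)                # Eq. 2.45
    J_g = np.hstack([np.eye(2), (drot(x[2]) @ p).reshape(2, 1)])  # Eq. 2.39
    return J_atan @ J_g                                           # chain rule, Eq. 2.44

def bearing_jacobian_numeric(x, p, eps=1e-6):
    """Central-difference Jacobian of the same prediction, for comparison."""
    def h(x):
        pp = rot(x[2]) @ p + x[:2]
        return np.arctan2(pp[1], pp[0])
    J = np.zeros(3)
    for i in range(3):
        d = np.zeros(3); d[i] = eps
        J[i] = (h(x + d) - h(x - d)) / (2 * eps)
    return J
```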


2.2.2 Gaussian Conditional: p(x|z) ∼ N (x∗,H−1)

In this section we will show that the statement made in the title is true. We recall some concepts already introduced at the beginning of Section 2.2, namely:

• the measurements z_k are affected by Gaussian noise with information matrices Ω_k;

• under smoothness assumptions, the measurement function ẑ_k = h_k(x) can be reasonably approximated around the optimum x* by its Taylor expansion;

• the conditional distribution of the measurements given the increments is normally distributed:

p(z_k | Δx + x*) ∼ N(J_k Δx + z*_k, Ω_k^{-1}).   (2.47)

Here z*_k = h_k(x*) is the prediction at the optimum.

The conditional over all measurements z is again a multivariate Gaussian

p(z | Δx + x*) ∼ N(μ_z, Ω_z^{-1})   (2.48)

with

μ_z = (J_1 Δx + z*_1 ; J_2 Δx + z*_2 ; … ; J_K Δx + z*_K) = J Δx + z*,

Ω_z = blockdiag(Ω_1, Ω_2, …, Ω_K),   (2.49)

where J = (J_1 ; J_2 ; … ; J_K) stacks the Jacobians and z* = (z*_1 ; z*_2 ; … ; z*_K) stacks the predictions.

If we have no prior about the states, but we assume they are normally distributed, we can describe them with p(Δx) ∼ N(0, Σ_x). To express our total ignorance we set Σ_x to infinity, or more elegantly we set Ω_x = 0. Recalling Eq. A.14, we have that the joint distribution p(Δx, z) has the following form:

p(Δx, z) ∼ N((0, z*)^T, Ω_{x,z}^{-1})   (2.50)

where z* = (z*_1, …, z*_K) and

Ω_{x,z} = ( J^T Ω_z J   −J^T Ω_z ; −Ω_z J   Ω_z ).   (2.51)

Let us now consider the upper left block of Ω_{x,z}:

J^T Ω_z J = (J_1^T  J_2^T  ···  J_K^T) · blockdiag(Ω_1, Ω_2, …, Ω_K) · (J_1 ; J_2 ; … ; J_K)   (2.52)
          = J_1^T Ω_1 J_1 + J_2^T Ω_2 J_2 + ··· + J_k^T Ω_k J_k + ··· + J_K^T Ω_K J_K   (2.53)
          = Σ_{k=1}^{K} H_k,   (2.54)

with H_k = J_k^T Ω_k J_k as in Algorithm 1. In substance, H is the upper left block of Ω_{x,z}, which is the information matrix of the joint Gaussian distribution of the states and the measurements p(Δx, z).

What we want to do now is to compute the conditional distribution p(Δx|z). In particular we are interested in the information matrix Ω_{Δx|z}. Thanks to Eq. A.9, conditioning a Gaussian distribution expressed in information form can be easily done by suppressing the rows and the columns of the information matrix corresponding to the variables we condition on. In our case, these blocks are Ω_z, −J^T Ω_z and −Ω_z J. So only H remains. Intuitively, the mean μ_{Δx} is 0 because we assumed x* to be our optimal state.

To conclude, we show the initial claim of this section, that is p(x|z) ∼ N(x*, H^{-1}). To this end, we consider the random variable x = x* + Δx. Since Δx ∼ N(0, Σ_{Δx|z}) and x* is a constant, x has the same covariance as Δx, and its mean is just shifted by x*.
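The claim p(x|z) ∼ N(x*, H⁻¹) can also be checked numerically in the linear case, where the least squares solution is exact. A minimal sketch (Python/NumPy; toy Jacobian and unit information matrices chosen by us):

```python
import numpy as np

# Toy linear problem: z = J x + noise, with unit information Omega = I.
# For a linear model the least squares estimate is exact, so the claim
# p(x|z) ~ N(x*, H^{-1}) can be verified by Monte Carlo.
rng = np.random.default_rng(1)
J = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
Omega = np.eye(3)
H = J.T @ Omega @ J

x_true = np.array([0.5, -1.0])
estimates = []
for _ in range(20000):
    z = J @ x_true + rng.normal(size=3)                    # noisy measurement
    estimates.append(np.linalg.solve(H, J.T @ Omega @ z))  # least squares estimate
cov = np.cov(np.array(estimates).T)                        # sample covariance ~ H^{-1}
```

The sample covariance of the estimates approaches H⁻¹ as the number of trials grows.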


2.3 Smooth Manifolds (a.k.a. Non-Euclidean Pain)

In the previous section we laid the foundations of least squares estimation, and we provided some basic tools for estimating the optimal solution and its uncertainty. However, all the derivations we made strongly rely on two aspects of the input problem, namely: the smoothness of the measurement functions and the fact that the state space spans over a Euclidean domain. All in all, to find a solution with iterative approaches we linearize the problem, compute a quadratic approximation of it and compute a set of increments that should decrease the value of the quadratic form. Then, once we have these increments, we sum them to the previous solution and we pray that the result is still good.

Everything works as long as the sum is a legal operation in the domain of the state variables considered. In robotics this is often not the case, since angles and rotations are involved. Summing a triplet of Euler angles is not such an elegant thing, and might lead to singular configurations. In fact, rotations are not Euclidean: although they admit local Euclidean mappings, they are smooth manifolds [17].

To understand the contents of this section, it is a good thing to keep in mind, for instance, the space of 3D rotations. It is well known that a 3D rotation can be parameterized with a rotation matrix R. A rotation matrix, however, uses 9 numbers to represent 3 angles. If we used rotation matrices as parameters in our optimization machinery, we would find non-valid solutions, since the orthogonality constraint is not enforced. Thus, it seems there is no alternative to carrying on the optimization by using a minimal representation, for instance Euler angles. But we just said that Euler angles suffer from singularities. Luckily for us, the rotations are a manifold. That is, they admit a local parameterization which is homeomorphic to the vector space ℜ^n. To explain this sentence, think of a generic rotation R(ρ, θ, ψ). We can easily define a mapping between rotation matrices and Euler angles, and vice versa:

u = toVector(R) (2.55)

R = fromVector(u) (2.56)

where u = (ρ θ ψ)^T is the vector containing the 3 Euler angles. Clearly fromVector(toVector(R)) = R and toVector(fromVector(u)) = u. If we are around the origin of u, the Euler angles are away from singularities, and the Euclidean distance measured between two orientations u_1 and u_2 around the origin is closely related to the true distance between their orientations. This situation changes, however, when we are close to a singularity, where a small difference in orientation might lead to an abrupt difference in the local parameters u, with potentially catastrophic consequences on our optimization, which strongly relies on the smoothness of the problem.

However, given any arbitrary initial rotation R_0, we can compute a local mapping to Euler angles that is away from singularities. In the domain of rotations, this can be easily done by setting the reference frame so that toVector(R_0) = 0 is the origin:

u = R ⊟ R_0 = toVector(R_0^{-1} · R)   (2.57)
R = R_0 ⊞ u = R_0 · fromVector(u)   (2.58)

Eq. 2.57 returns the Euler angles of the rotation that moves R_0 to R, in the reference frame of R_0. It computes the coordinates in a local map, centered around R_0. If the two rotations have a small difference, u will also be small, independently of whether R_0 is close to a singularity or not. Similarly, Eq. 2.58 computes a new rotation from R_0, by applying a local perturbation u. Clearly R_0 ⊞ (R ⊟ R_0) = R and (R_0 ⊞ u) ⊟ R_0 = u. The operators ⊟ and ⊞ are powerful tools that convert a global difference in the manifold space into a local perturbation, and vice versa. We have given an example of manifold mapping in the domain of rotation matrices. The question now is how to use this result to improve the optimization.

To this end, let us assume our state space is a smooth manifold. The values in this space are parameterized redundantly by a certain variable X, and minimally by a vector x. A sensor model has the form h_k(X) or h_k(x). If X is the current estimate, we are interested in computing the perturbation of the predicted observation under a small perturbation of the state variables: h_k(X ⊞ Δx). Consider that X is a constant; still, we can span the whole state space by varying Δx. We can then compute the Taylor expansion of h_k(X ⊞ Δx), around the point Δx = 0, with respect to Δx, considering X as a constant:

h_k(X ⊞ Δx) ≃ h_k(X) + ∂h_k(X ⊞ Δx)/∂Δx |_{Δx=0} · Δx   (2.59)
            = h_k(X) + J̆_k Δx.   (2.60)

Here J̆_k denotes the Jacobian computed in the space of the increments.

This Taylor expansion is very similar to Eq. 2.12, with some notable differences:

• the current linearization point is X, and is fixed.

• the expansion gives us the change of the prediction as a function of the variation of the increments, which are applied to the current linearization point through the ⊞ operator.

As a consequence of that, once we find a solution Δx* for the increments, we need to apply the perturbation to the current estimate through the ⊞ operator:

X ← X ⊞ Δx*.   (2.61)

Appropriately dealing with manifold state spaces is essential for direct methods (like GN or LM) to work. Usually, local parameterizations greatly increase the convergence basin of an algorithm.

That said, not only the states can be non-Euclidean, but also the measurements. Again, by exploiting smooth manifolds, we can come up with a general approach that in most cases results in a smoother error function, and thus generally leads to faster and more robust convergence. In Section 2.2.1 we introduced the error function e_k(x) = h_k(x) − z_k, defined as the Euclidean difference between the prediction and the measurement. Let us consider this alternative error function:

e_k(x) = e_k(ẑ_k, z_k) = ẑ_k ⊟ z_k = h_k(x) ⊟ z_k   (2.62)

where we replace the “−” with “⊟”. This error function computes the displacement between prediction and observation in a minimal space, and selects as center of the “chart” the actual measurement z_k. Since in general prediction and measurement are close in the manifold space of the observations, their displacement computed through ⊟ will also be small and regular.

The prediction is distributed according to p(ẑ_k|x) = N(ẑ_k, Ω_k^{-1}), with ẑ_k = h_k(x). As usual, we can compute the Taylor expansion of the new error, under a small perturbation of the prediction:

e_k(ẑ_k + Δz_k, z_k) = (ẑ_k + Δz_k) ⊟ z_k   (2.63)
                     ≃ ẑ_k ⊟ z_k + ∂((ẑ_k + Δz_k) ⊟ z_k)/∂Δz_k |_{Δz_k=0} · Δz_k,   (2.64)

where we denote by J_{z_k} the Jacobian of the ⊟ operator appearing in Eq. 2.64.

Based on this result, we can compute the Gaussian approximation of the conditional distribution of the error as

p(e_k|x) ∼ N(ẑ_k ⊟ z_k, Σ_{e_k|x}),   with   Σ_{e_k|x} = J_{z_k} Ω_k^{-1} J_{z_k}^T.   (2.65)

Note that when we project the measurement into a minimal space through ⊟, the conditional covariance of the error Σ_{e_k|x} depends on ẑ_k = h_k(x), thus on x, and needs to be recomputed at every iteration.

The reader might be worried by the complexity of the derivatives one has to compute once exploiting this smooth manifold structure. However, the procedure can be simplified by using the chain rule of partial derivatives:

∂e_k(x ⊞ Δx)/∂Δx = ∂e_k(x)/∂x |_{x=x̆} · ∂(x ⊞ Δx)/∂Δx |_{Δx=0},   (2.66)

where the first factor is the ordinary Jacobian J_k and the second is the Jacobian M of the ⊞ operator.


Accordingly, one can easily derive the Jacobian on a manifold from the non-manifold Jacobian of Eq. 2.12, just by multiplying its non-zero blocks by the derivative M of the ⊞ operator computed in x̆:

J̆_k = J_k M.   (2.67)

With these things in place, we can derive a modified version of the Gauss-Newton algorithm described in Algorithm 1, as shown in Algorithm 3.

Example: Point Registration Again, let us think about how we can modify the point registration example to take manifolds into account. This is easily done just by

• characterizing the state space, in this case SE(2), in terms of extended and minimal representations. The extended representation can be the homogeneous transformation matrices. The minimal representation could be the vector x = (t_x t_y θ).

• characterizing the ⊞ and ⊟ operators

– ⊞ : SE(2)×ℜ3 → SE(2) is defined as follows:

X⊞∆x = X · fromVector(∆x). (2.68)

Here X denotes a homogeneous transform and fromVector(Δx) denotes the transform obtained by converting the vector Δx = (Δt_x Δt_y Δθ) to a transformation matrix, as explained before in this document. An alternative valid form for the ⊞ would be to multiply the increment to the left, as

fromVector(∆x) ·X. (2.69)

Clearly, when applying the increments to the state, you have to be consistent with the definition of the ⊞ of your choice.

– ⊟ : SE(2) × SE(2) → ℜ³ is the operator that computes a minimal vector Δx that, applied to the first transform, moves it into the second. It is defined as

X_2 ⊟ X_1 = toVector(X_1^{-1} · X_2).   (2.70)

In sum, we compute the “difference” between X_2 and X_1, and convert it to a minimal vector through the toVector(·) function.

• writing the error function, highlighting the contribution of the perturbation. We refer to the simplified version of the error in Eq. 2.38, for our convenience. In the remainder, I will drop the point index i for the sake of notation, and I will denote the point p = (p_x p_y), the transformation matrix X and the delta increments Δx = (Δt_x Δt_y Δθ).

e_i(X ⊞ Δx) = fromVector(Δx) · X p_i − z_i = fromVector(Δx) · p'_i − z_i,   with p'_i = X p_i.   (2.71)

Note that in the previous equation we used the definition of the ⊞ of Eq. 2.69. We can then compute the Jacobian as

∂e_i(X ⊞ Δx)/∂Δx |_{Δx=0} = ∂/∂Δx ( p'_x cosΔθ − p'_y sinΔθ + Δt_x ; p'_x sinΔθ + p'_y cosΔθ + Δt_y ) |_{Δx=0}   (2.72)
                          = ( I_{2×2}   (−p'_y ; p'_x) ).   (2.73)

Once you have the error function and the ⊞ operator, you can happily construct the least squares problem. In this case you will have to apply the incremental transformation to the left of the initial one, according to the choice we have made for the ⊞.
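The ingredients above can be sketched in a few lines. The helpers below (Python/NumPy; names are ours) implement fromVector/toVector for SE(2), together with the right-multiplication ⊞ of Eq. 2.68 and the ⊟ of Eq. 2.70:

```python
import numpy as np

def from_vector(v):
    """(tx, ty, theta) -> 3x3 homogeneous transform."""
    tx, ty, th = v
    c, s = np.cos(th), np.sin(th)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

def to_vector(X):
    """3x3 homogeneous transform -> (tx, ty, theta)."""
    return np.array([X[0, 2], X[1, 2], np.arctan2(X[1, 0], X[0, 0])])

def boxplus(X, dx):
    """X ⊞ Δx with the right-multiplication convention of Eq. 2.68."""
    return X @ from_vector(dx)

def boxminus(X2, X1):
    """X2 ⊟ X1 = toVector(X1^{-1} · X2), Eq. 2.70."""
    return to_vector(np.linalg.inv(X1) @ X2)
```

With this pairing, (X ⊞ u) ⊟ X = u holds; with the left-multiplication ⊞ of Eq. 2.69, a matching ⊟ must be defined accordingly.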


2.4 Approaching a Least Squares Estimation Problem: Methodology

In the previous sections we presented a bunch of concepts and algorithms for approaching a least squares estimation problem. Here we wrap these ideas together and give you a general methodology to construct your solver for a given problem. In fact, we have already seen this methodology at work in the point registration example.

A general estimation problem might be expressed as: find p(x|z), where x are the state variables and z = z_{1:K} are the measurements. To construct a solver you have to follow these sequential steps:

1. Identify a feasible parameterization for the state variables: This means that you have to determine a way to represent the possible states of your system. To this end, choose a convenient representation, like transformation matrices for SE(2) or SE(3). This “extended” parameterization need not be minimal. In case the space is Euclidean, however, the parameterization will be minimal, and it will live in the same space as the perturbation. Let X denote this parameterization.

2. If the state variables do not live in a Euclidean space:

• Define a parameterization for the increments: It is often convenient to solve the linearized system in a minimal representation, for the reasons outlined above. Let Δx denote this minimal parameterization.

• Define the ⊞ operator: The ⊞ operator should apply to a state variable the perturbation, expressed in the minimal form.

3. Define a parameterization for the measurements z_k: The measurements are “what my sensor observes”, and finding a parameterization means defining how a bunch of numbers can be used to describe a particular measurement. Let Z denote these parameters.

4. If the measurements do not live in a Euclidean space:

• Define a minimal parameterization for the difference in the measurements: Analogously to the case of the state variables, if the measurements do not live in a Euclidean space, we need to map “differences” between measurements into a Euclidean space, in order to cast our linear system. The difference between two identical measurements should result in a Euclidean vector of 0 norm. Let Δz denote this parameterization.

• Define the ⊟ operator: The ⊟ operator should return the Euclidean difference between two measurements (in particular, between the prediction and the measurement).

5. Define the measurement function: This means “describing what your sensor should measure, given a known configuration of the state variables”. Let Z = h(X) denote this function.

6. Define the error function: With the ⊟ operator in place, you can construct the error function as

e(X) = h(X) ⊟ Z,   (2.74)

where Z is the measurement.

7. Compute the Jacobians: that is,

J̆ = ∂e(X ⊞ Δx)/∂Δx |_{Δx=0}.   (2.75)

Note that if you do not feel like computing large derivatives by hand, you can always compute the Jacobians numerically.

8. Put the things together in Algorithm 3.
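Regarding step 7, a numeric Jacobian on a manifold only needs the error function and the ⊞ operator. A minimal sketch (Python/NumPy; names are ours):

```python
import numpy as np

def numeric_jacobian(err, boxplus, X, dim, eps=1e-7):
    """Numeric Jacobian of e(X ⊞ Δx) w.r.t. Δx at Δx = 0 (cf. Eq. 2.75).

    err:     maps a state X to the error vector e(X)
    boxplus: the ⊞ operator, maps (X, Δx) to a perturbed state
    dim:     dimension of the minimal perturbation Δx
    """
    e0 = np.atleast_1d(err(X))
    J = np.zeros((e0.size, dim))
    for i in range(dim):
        d = np.zeros(dim)
        d[i] = eps
        # central differences around Δx = 0
        J[:, i] = (np.atleast_1d(err(boxplus(X, d)))
                   - np.atleast_1d(err(boxplus(X, -d)))) / (2 * eps)
    return J
```

In the Euclidean case, boxplus reduces to plain vector addition and the routine returns the ordinary Jacobian.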


Algorithm 2 Levenberg-Marquardt minimization algorithm

Require: x: initial guess. C = {⟨e_k(·), Ω_k⟩}: measurements
Ensure: x*: new solution
// compute the current error
F_new ← F(x)
// back up the solution
x_backup ← x
λ ← computeInitialLambda(C, x)
// iterate until no substantial improvements are made
repeat
    F ← F_new
    b ← 0; H ← 0
    for all k = 1 … K do
        // compute the prediction ẑ_k, the error e_k and the Jacobian J̆_k
        ẑ_k ← h_k(x)
        e_k ← ẑ_k − z_k
        J̆_k ← ∂h_k(x ⊞ Δx)/∂Δx |_{Δx=0}
        // compute the contribution of this measurement to the linear system
        H_k ← J̆_k^T Ω_k J̆_k
        b_k ← J̆_k^T Ω_k e_k
        // accumulate the contribution to construct the overall system
        H += H_k
        b += b_k
    end for
    // adjust the damping factor and perform backup/restore actions
    t ← 0 // number of times the damping factor has been adjusted
    repeat
        // solve the damped linear system
        Δx ← solve((H + λI)Δx = −b)
        // update the state
        x += Δx
        // compute the new error
        F_new ← F(x)
        if F_new < F then
            // good step: accept the solution and update the temporaries
            λ ← λ/2
            x_backup ← x
            t ← −1
        else
            // bad step: revert the solution
            λ ← λ · 4
            x ← x_backup
            t += 1
        end if
    until t < 0 ∨ t ≥ t_max
until F − F_new < ε ∨ t ≥ t_max
return x
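The damping logic of Algorithm 2 can be condensed into a few lines. A minimal sketch (Python/NumPy; Ω = I, and no explicit backup variable is needed because a rejected step is simply not applied):

```python
import numpy as np

def levenberg_marquardt(residual, jac, x0, iterations=100, lam=1e-3):
    """Minimal LM loop in the spirit of Algorithm 2 (Omega = I)."""
    x = np.array(x0, dtype=float)
    F = float(residual(x) @ residual(x))
    for _ in range(iterations):
        e, J = residual(x), jac(x)
        H, b = J.T @ J, J.T @ e
        dx = np.linalg.solve(H + lam * np.eye(len(x)), -b)  # damped system
        x_new = x + dx
        F_new = float(residual(x_new) @ residual(x_new))
        if F_new < F:                  # good step: accept, relax damping
            x, F, lam = x_new, F_new, lam / 2
        else:                          # bad step: reject, increase damping
            lam *= 4
    return x
```

For example, fitting a ⋅ e^{b t} to noise-free samples recovers (a, b) to machine precision.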


Algorithm 3 Gauss-Newton minimization algorithm for manifold measurement and state spaces

Require: x: initial guess. C = {⟨z_k, Ω_k⟩}: measurements
Ensure: x*: new solution
// compute the current error
F_new ← F(x)
// iterate until no substantial improvements are made
repeat
    F ← F_new
    b ← 0; H ← 0
    for all k = 1 … K do
        // compute the prediction ẑ_k and the error e_k
        // note that the measurements z_k are overparameterized
        ẑ_k ← h_k(x)
        e_k ← ẑ_k ⊟ z_k
        // compute the Jacobian of the error function, in the space of the increments
        J̆_k ← ∂e_k(h_k(x ⊞ Δx), z_k)/∂Δx |_{Δx=0}
        // compute the Jacobian of the ⊟ operator in the measurements
        J_{z_k} ← ∂((ẑ_k + Δz_k) ⊟ z_k)/∂Δz_k |_{Δz_k=0}
        // remap the information matrix to the minimal error space (inverse of Eq. 2.65)
        Ω̆_k ← (J_{z_k} Ω_k^{-1} J_{z_k}^T)^{-1}
        // compute the contribution of this measurement to the linear system
        H_k ← J̆_k^T Ω̆_k J̆_k
        b_k ← J̆_k^T Ω̆_k e_k
        // accumulate the contribution to construct the overall system
        H += H_k
        b += b_k
    end for
    // solve the linear system
    Δx ← solve(HΔx = −b)
    // update the state
    x ← x ⊞ Δx
    // compute the new error
    F_new ← F(x)
until F − F_new < ε
return x


Chapter 3

Sparse Least Squares

Least squares estimation is a powerful tool that allows us to estimate a set of hidden variables from measurements that depend on these variables. In the previous sections we introduced the general concepts, which remain valid. The dimension of the state vector, for certain classes of problems, might be very large. The least squares solution of the problem involves the solution of a very large linear system, which might become a performance bottleneck.

It is not uncommon to encounter problems where, despite a large dimension of the state vector, a measurement depends only on a subset of the state variables. In this section we will discover how to take advantage of this fact to enhance the performance of our solver, and to approach problems that otherwise would not be tractable. We start by introducing a simple example that captures one of these problems, and then we reason on how we can exploit the intuition (hopefully) given to us by the example in a more general way.

3.1 Typical Problems

3.1.1 Example: 2D Multirobot Point Registration

For instance, imagine the following problem: we have a robot moving in SE(2), equipped with a point sensor. Let x^r_{1:n} be the robot positions, with x^r_i ∈ SE(2). This robot moves in a world populated by a set of distinguishable points, and senses them. Let x^p_{1:m} be the point positions, with x^p_j ∈ ℜ². For each robot position, our sensor will report a set of measurements corresponding to the x–y positions of the landmarks, in the reference frame of the robot. Let the generic measurement z_{i,j} be a measurement made by the robot at position x^r_i about the landmark x^p_j, with z_{i,j} ∈ ℜ².

Our goal is to estimate:

• the positions of all robots x^r_{1:n},

• the positions of all landmarks x^p_{1:m},

given the measurements z_{i,j}. To construct this solver we will follow the methodology outlined in Section 2.4.

• Identification of the state variables

The state variables are all robot positions and all landmarks.

– Robot positions: The extended parameterization for the ith robot position is a transformation matrix, denoted with X^r_i. To ease the calculation of the Jacobians in the remainder of this section, we choose to describe the position of the robot as the transformation that maps the origin w.r.t. the local frame of the robot, as we did in the example of Section 2.2.1.

– Landmarks: It is with immense pleasure that we note that the landmarks live in ℜ², and this is the way we will represent these blocks of the state.

23

The overall state x will then be a data structure consisting of n transformation matrices X^r_{1:n} and m two-dimensional vectors x^p_{1:m}, as follows:

x = (X^r_{1:n}, x^p_{1:m})   (3.1)

• Parameterization of the increments

– Robot positions: An increment in a robot position can be described just by the 3-vector Δx^r_i = (t_{x_i} t_{y_i} θ_i). The boxplus will be the same as in Section 2.3, that is

X^r_i ⊞ Δx^r_i = fromVector(Δx^r_i) · X^r_i   (3.2)

– Landmarks: the landmarks live in a Euclidean space, so the increment will just be a Euclidean displacement that can be applied to a state variable with the regular vector addition.

Our displacement vector Δx will consist of the chaining of the displacements of all state variables, that is

Δx = (Δx^r_1 Δx^r_2 ··· Δx^r_n Δx^p_1 ··· Δx^p_m)   (3.3)

This displacement can be applied to the state by applying the individual perturbations to each state block, according to the ⊞ defined before. More specifically, we can construct the global perturbation operator x' = x ⊞ Δx, where x' is the global state after applying the perturbation, x is the initial state and Δx is the perturbation defined in Eq. 3.3, as follows:

x = (X^r_{1:n}, x^p_{1:m})   (3.4)
x' = (X'^r_{1:n}, x'^p_{1:m})   (3.5)
X'^r_i = X^r_i ⊞ Δx^r_i   (3.6)
x'^p_j = x^p_j + Δx^p_j   (3.7)

• Measurement function: The measurement function predicting the measurement z_{i,j} will depend on the ith robot pose and on the jth landmark variable, and will map the landmark variable from the global frame to the local frame of the robot, as follows:

h_{i,j}(X) = X^r_i · (x^p_j ; 1).   (3.8)

• Error function: The measurement space is Euclidean, thus the error function is just the difference between the prediction and the measurement

e_{i,j}(X) = [ 1 0 0 ; 0 1 0 ] · X^r_i · (x^p_j ; 1) − z_{i,j}.   (3.9)

The matrix [ 1 0 0 ; 0 1 0 ] is just used to eliminate the useless homogeneous component.

• Jacobians: To compute the Jacobian we first have to write the error function at the perturbation

e_{i,j}(X ⊞ Δx) = [ 1 0 0 ; 0 1 0 ] · fromVector(Δx^r_i) · X^r_i · (x^p_j + Δx^p_j ; 1) − z_{i,j}.   (3.10)

We note that the error function depends only on two variables, thus the Jacobian will be non-zero only in the 2×3 block corresponding to the variable Δx^r_i and in the 2×2 block corresponding to Δx^p_j, as

∂e_{i,j}(X ⊞ Δx)/∂Δx = ( 0 ··· 0   ∂e_{i,j}(X^r_i ⊞ Δx^r_i, x^p_j)/∂Δx^r_i   0 ··· 0   ∂e_{i,j}(X^r_i, x^p_j + Δx^p_j)/∂Δx^p_j   0 ··· 0 )   (3.11)

24

Analogously to the example of point registration, evaluating the Jacobian at Δx = 0 we have that

∂e_{i,j}(X^r_i ⊞ Δx^r_i, x^p_j)/∂Δx^r_i = ( I_{2×2}   (−p'_y ; p'_x) )   (3.12)

where p' = (p'_x p'_y)^T is the transformed landmark X^r_i x^p_j (after dropping the homogeneous component). Being a linear transformation,

∂e_{i,j}(X^r_i, x^p_j + Δx^p_j)/∂Δx^p_j = R(θ_i).   (3.13)
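To make the sparse structure concrete: the contribution of a single measurement z_{i,j} touches only two blocks of the state. A minimal sketch of the accumulation (Python/NumPy; Ω = I, and the block layout is our choice):

```python
import numpy as np

def accumulate(H, b, Jr, Jp, e, i, j, n):
    """Add the contribution of measurement z_{i,j} to H and b (Omega = I).

    Jr: 2x3 block of Eq. 3.12 (w.r.t. the robot increment)
    Jp: 2x2 block of Eq. 3.13 (w.r.t. the landmark increment)
    Block layout (our choice): robot i occupies slots [3i, 3i+3),
    landmark j occupies slots [3n + 2j, 3n + 2j + 2).
    """
    r = slice(3 * i, 3 * i + 3)
    p = slice(3 * n + 2 * j, 3 * n + 2 * j + 2)
    H[r, r] += Jr.T @ Jr
    H[r, p] += Jr.T @ Jp
    H[p, r] += Jp.T @ Jr
    H[p, p] += Jp.T @ Jp
    b[r] += Jr.T @ e
    b[p] += Jp.T @ e
```

Every other block of H is left untouched, which is exactly the sparsity that Chapter 3 exploits.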

3.1.2 Example: Pose SLAM

Here we propose another example: Pose SLAM. Imagine we have a set of robots living in a 2D world, each located at a position x_i ∈ SE(2). Each robot is equipped with a fancy sensor capable of reporting the positions of the other robots in its own reference frame. A measurement made by robot i about the location of robot j will be z_{i,j} ∈ SE(2). Once again we assume the sensor is capable of determining which other robot it is sensing.

Our goal is to estimate:

• the positions of all robots x_{1:n},

given the measurements z_{i,j}. Again, we follow our beloved methodology.

• Identification of the state variables

The state variables are all robot positions. The extended parameterization for the ith robot position is a transformation matrix, denoted with X_i. The overall state x will then be a data structure consisting of n transformation matrices X_{1:n}:

x = X_{1:n}   (3.14)

• Parameterization of the increments: An increment in a robot position can be described just by the 3-vector Δx_i = (t_{x_i} t_{y_i} θ_i). The boxplus will be the same as in Section 2.3, that is

X_i ⊞ Δx_i = X_i · fromVector(Δx_i)   (3.15)

Our displacement vector Δx will consist of the chaining of the displacements of all state variables, that is

Δx = (Δx_1 Δx_2 ··· Δx_n)   (3.16)

This displacement can be applied to the state by applying the individual perturbation to each state block, as in the previous example. Let x' = x ⊞ Δx be the global state after applying the perturbation, x the initial state and Δx the perturbation defined in Eq. 3.16, as follows:

x = X_{1:n}   (3.17)
x' = X'_{1:n}   (3.18)
X'_i = X_i ⊞ Δx_i   (3.19)

• Measurement function: The measurement function predicting the measurement z_{i,j} will depend on the ith and the jth robot poses, and will map the position of robot j into the frame of robot i as follows:

h_{i,j}(X) = X_i^{-1} · X_j.   (3.20)

• Error function: The measurement space is a manifold, so we need to define the ⊟ operator as the operator that maps the “difference” between two reference frames into a minimal perturbation:

X_a ⊟ X_b = toVector(X_b^{-1} · X_a)   (3.21)

The error will then be

e_{i,j}(X) = h_{i,j}(X) ⊟ Z_{i,j}   (3.22)
           = toVector(Z_{i,j}^{-1} · X_i^{-1} · X_j).   (3.23)

25

• Jacobians: Again, to compute the Jacobian we have to write the error function at the perturbation

e_{i,j}(X ⊞ Δx) = toVector[ Z_{i,j}^{-1} · (X_i · fromVector(Δx_i))^{-1} · (X_j · fromVector(Δx_j)) ]   (3.24)
                = toVector[ Z_{i,j}^{-1} · fromVector(Δx_i)^{-1} · X_i^{-1} · X_j · fromVector(Δx_j) ]   (3.25)
                = toVector[ Z_{i,j}^{-1} · fromVector(Δx_i)^{-1} · X_{i,j} · fromVector(Δx_j) ]   (3.26)

where we abbreviate X_{i,j} = X_i^{-1} · X_j.

As in the previous case, we note that the error function depends only on two variables, thus the Jacobian will be non-zero only in the 3×3 blocks corresponding to the variables Δx_i and Δx_j:

∂e_{i,j}(X ⊞ Δx)/∂Δx = ( 0 ··· 0   ∂e_{i,j}(X_i ⊞ Δx_i, X_j)/∂Δx_i   0 ··· 0   ∂e_{i,j}(X_i, X_j ⊞ Δx_j)/∂Δx_j   0 ··· 0 )   (3.27)

If the reader feels comfortable and rested, I would suggest pursuing the challenge of computing the Jacobians above in analytic form. That said, everything works well even with numeric Jacobians.
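A numeric version of the blocks of Eq. 3.27 is indeed straightforward. A minimal sketch (Python/NumPy; helper names are ours) evaluates the error of Eq. 3.23 and its two 3×3 blocks via central differences with the ⊞ of Eq. 3.15:

```python
import numpy as np

def from_vector(v):
    tx, ty, th = v
    c, s = np.cos(th), np.sin(th)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

def to_vector(X):
    return np.array([X[0, 2], X[1, 2], np.arctan2(X[1, 0], X[0, 0])])

def error(Xi, Xj, Zij):
    """e_{i,j} = toVector(Z_{i,j}^{-1} X_i^{-1} X_j), Eq. 3.23."""
    return to_vector(np.linalg.inv(Zij) @ np.linalg.inv(Xi) @ Xj)

def jacobian_blocks(Xi, Xj, Zij, eps=1e-6):
    """Numeric 3x3 blocks of Eq. 3.27, using the ⊞ of Eq. 3.15."""
    Ji, Jj = np.zeros((3, 3)), np.zeros((3, 3))
    for k in range(3):
        d = np.zeros(3); d[k] = eps
        Ji[:, k] = (error(Xi @ from_vector(d), Xj, Zij)
                    - error(Xi @ from_vector(-d), Xj, Zij)) / (2 * eps)
        Jj[:, k] = (error(Xi, Xj @ from_vector(d), Zij)
                    - error(Xi, Xj @ from_vector(-d), Zij)) / (2 * eps)
    return Ji, Jj
```

As a quick check, when the measurement exactly matches the relative pose, the error vanishes and the block w.r.t. Δx_j reduces to the identity.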

3.2 A Graphical Representation: Factor Graphs

A least squares minimization problem can be described by the following equations:

F(x) = Σ_{k∈C} e_k(x_k, z_k)^T Ω_k e_k(x_k, z_k)   (3.28)

x* = argmin_x F(x),   (3.29)

where we denote each summand of Eq. 3.28 by F_k. Here

• x = (x_1^T, …, x_n^T)^T is a vector of parameters, where each x_i represents a generic parameter block.

• x_k = (x_{k_1}^T, …, x_{k_q}^T)^T ⊂ (x_1^T, …, x_n^T)^T is the subset of the parameters involved in the kth constraint.

• z_k and Ω_k represent respectively the mean and the information matrix of a constraint relating the parameters in x_k.

• e_k(x_k, z_k) is a vector error function that measures how well the parameter blocks in x_k satisfy the constraint z_k. It is 0 when x_k perfectly matches the constraint. As an example, if one has a measurement function ẑ_k = h_k(x_k) that generates a synthetic measurement ẑ_k given an actual configuration of the nodes in x_k, a straightforward error function would then be e(x_k, z_k) = h_k(x_k) − z_k.

For simplicity of notation, in the rest of this section we will encode the measurement in the indices of the error function:

ek(xk, zk) = ek(xk) = ek(x). (3.30)

Note that each error function and each parameter block can span over a different space. A problem in this form can be effectively represented by a directed hyper-graph. A node i of the graph represents the parameter block x_i ∈ x_k and a hyper-edge among the nodes x_i ∈ x_k represents a constraint involving all nodes in x_k. In case the hyper-edges have size 2, the hyper-graph becomes an ordinary graph. Figure 3.1 shows an example of the mapping between a hyper-graph and an objective function.

26


Figure 3.1: This example illustrates how to represent an objective function by a hyper-graph. Here we illustrate a portion of a small SLAM problem [16]. In this example we assume that the measurement functions are governed by some unknown calibration parameters K. The robot poses are represented by the variables p_{1:n}. These variables are connected by constraints z_{ij}, depicted by the square boxes. The constraints arise, for instance, by matching nearby laser scans in the laser reference frame. The relation between a laser match and a robot pose, however, depends on the position of the sensor on the robot, which is modeled by the calibration parameters K. Conversely, subsequent robot poses are connected by binary constraints u_k arising from odometry measurements. These measurements are made in the frame of the robot mobile base.

3.3 Structure of the Linearized System

According to (2.28), the matrix H and the vector b are obtained by summing up a set of matrices and vectors, one for every constraint. If we set b_k = J_k^T Ω_k e_k and H_k = J_k^T Ω_k J_k, we can rewrite H and b as

b = Σ_{k∈C} b_k   (3.31)
H = Σ_{k∈C} H_k.   (3.32)

Every constraint will contribute to the system with an additive term. The structure of this term depends on the Jacobian of the error function. Since the error function of a constraint depends only on the values of the nodes x_i ∈ x_k, the Jacobian in (2.12) has the following form:

J_k = ( 0 ··· 0   J_{k_1}   ···   J_{k_i}   ···   J_{k_q}   0 ··· 0 ).   (3.33)

Here J_{k_i} = ∂e(x_k)/∂x_{k_i} are the derivatives of the error function with respect to the parameter blocks x_{k_i} ∈ x_k connected by the kth hyper-edge.


From (2.25) we obtain the following structure for the block matrix H_k:

H_k = | J_{k_1}^T Ω_k J_{k_1}  ···  J_{k_1}^T Ω_k J_{k_i}  ···  J_{k_1}^T Ω_k J_{k_q} |
      |          ⋮                       ⋮                        ⋮            |
      | J_{k_i}^T Ω_k J_{k_1}  ···  J_{k_i}^T Ω_k J_{k_i}  ···  J_{k_i}^T Ω_k J_{k_q} |
      |          ⋮                       ⋮                        ⋮            |
      | J_{k_q}^T Ω_k J_{k_1}  ···  J_{k_q}^T Ω_k J_{k_i}  ···  J_{k_q}^T Ω_k J_{k_q} |   (3.34)

b_k = ( J_{k_1}^T Ω_k e_k  ···  J_{k_i}^T Ω_k e_k  ···  J_{k_q}^T Ω_k e_k )^T   (3.35)

For simplicity of notation we omitted the zero blocks. The reader might notice that the block structure of the matrix H is the adjacency matrix of the hyper-graph. Additionally, the Hessian H is a symmetric matrix, since all the H_k are symmetric. A single hyper-edge connecting q vertices introduces q^2 non-zero blocks in the Hessian, one for each pair ⟨x_{k_i}, x_{k_j}⟩ of connected nodes.
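The accumulation in (3.31)-(3.35) can be sketched in a few lines of code. The following toy example (our own illustration, not code from the text) uses three 2D parameter blocks and two binary constraints with made-up Jacobians and errors; it only shows where each block of H_k and b_k lands in the global system.

```python
import numpy as np

BLOCK = 2                      # dimension of every parameter block (an assumption)
NUM_NODES = 3                  # nodes x_0, x_1, x_2 of the hyper-graph
H = np.zeros((BLOCK * NUM_NODES, BLOCK * NUM_NODES))
b = np.zeros(BLOCK * NUM_NODES)

# Each constraint: (involved node indices, per-node Jacobian blocks, Omega_k, e_k).
# The values are made up; only the accumulation pattern matters here.
factors = [
    ([0, 1], [np.eye(BLOCK), -np.eye(BLOCK)], np.eye(BLOCK), np.array([0.5, -0.2])),
    ([1, 2], [np.eye(BLOCK), -np.eye(BLOCK)], np.eye(BLOCK), np.array([0.1, 0.3])),
]

for nodes, jacobians, omega, err in factors:
    for i, J_i in zip(nodes, jacobians):
        si = slice(BLOCK * i, BLOCK * (i + 1))
        b[si] += J_i.T @ omega @ err                # contribution to b_k (3.35)
        for j, J_j in zip(nodes, jacobians):
            sj = slice(BLOCK * j, BLOCK * (j + 1))
            H[si, sj] += J_i.T @ omega @ J_j        # block <i,j> of H_k (3.34)

# H is symmetric, and its non-zero block pattern mirrors the hyper-graph
# adjacency: nodes 0 and 2 share no constraint, so their block stays zero.
```

Note how the resulting H stays sparse: only pairs of nodes that share a hyper-edge produce non-zero blocks.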

3.4 Sparse Least Squares on Manifold

To deal with parameter blocks that span over non-Euclidean spaces, it is common to apply the error minimization on a manifold. A manifold is a mathematical space that is not necessarily Euclidean on a global scale, but can be seen as Euclidean on a local scale [18].

For example, in the context of the SLAM problem, each parameter block x_i consists of a translation vector t_i and a rotational component α_i. The translation t_i clearly forms a Euclidean space. In contrast to that, the rotational components α_i span over the non-Euclidean 2D or 3D rotation group SO(2) or SO(3). To avoid singularities, these spaces are usually described in an over-parameterized way, e.g., by rotation matrices or quaternions. Directly applying (2.29) to these over-parameterized representations breaks the constraints induced by the over-parameterization. The over-parameterization results in additional degrees of freedom and thus introduces errors in the solution. To overcome this problem, one can use a minimal representation for the rotation (like Euler angles in 3D). This, however, is then subject to singularities.

As discussed in Section 2.3, an alternative idea is to consider the underlying space as a manifold and to define an operator ⊞ that maps a local variation ∆x in the Euclidean space to a variation on the manifold, ∆x ↦ x ⊞ ∆x. We refer the reader to [11, §1.3] for more mathematical details on manifolds in sparse problems. With this operator, a new error function can be defined as

e_k(∆x_k) = e_k(x_k ⊞ ∆x_k) (3.36)
          = e_k(x ⊞ ∆x) ≃ e_k + J_k ∆x, (3.37)

where x spans over the original over-parameterized space, for instance quaternions. The term ∆x is a small increment around the original position x and is expressed in a minimal representation. A common choice for SO(3) is to use the vector part of the unit quaternion. In more detail, one can represent the increments ∆x as 6D vectors ∆x^T = (∆t^T ∆q̃^T), where ∆t denotes the translation and ∆q̃ = (∆q_x ∆q_y ∆q_z)^T is the vector part of the unit quaternion representing the 3D rotation. Conversely, x^T = (t^T q^T) uses a full quaternion q to encode the rotational part. Thus, the operator ⊞ can be expressed by first converting ∆q̃ to a full quaternion ∆q and then applying the transformation ∆x^T = (∆t^T ∆q^T) to x. In the equations describing the error minimization, these operations can nicely be encapsulated by the ⊞ operator. The Jacobian J_k can be expressed by

J_k = ∂e_k(x ⊞ ∆x)/∂∆x |_{∆x=0}. (3.38)

Since in the previous equation e_k depends only on ∆x_{k_i} ∈ ∆x_k, we can further expand it as follows:

J_k = ∂e_k(x ⊞ ∆x)/∂∆x |_{∆x=0} (3.39)
    = (0 ··· 0  J_{k_1} ··· J_{k_i} ··· J_{k_q}  0 ··· 0). (3.40)

With a straightforward extension of notation, we set

J_{k_i} = ∂e_k(x ⊞ ∆x)/∂∆x_{k_i} |_{∆x=0}. (3.41)

We can now insert (3.37) in (3.31) and (3.32). This leads to the following linear system for the increments:

H ∆x* = −b. (3.42)

Since the increments ∆x* are computed in the local Euclidean surroundings of the initial guess x, they need to be re-mapped into the original redundant space by the ⊞ operator. Accordingly, the update rule of (2.29) becomes

x* = x ⊞ ∆x*, (3.43)

meaning that we have to carry out a ⊞ for each variable:

x*_k = x_k ⊞ ∆x*_k. (3.44)

In summary, formalizing the minimization problem on a manifold consists of first computing a set of increments in a local Euclidean approximation around the initial guess by (3.42), and second accumulating the increments in the global non-Euclidean space by (3.43). Note that the linear system computed on a manifold representation has the same structure as the linear system computed on a Euclidean space. One can easily derive a manifold version of a graph minimization from a non-manifold version, only by defining an ⊞ operator and its Jacobian J_{k_i} w.r.t. the corresponding parameter block.
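As a concrete (and deliberately simplistic) illustration of an ⊞ operator, consider a single 2D pose block (x, y, θ): increments are applied additively, but the angle is renormalized so that the updated state stays on the manifold. This parameterization is our own assumption for illustration, not the book's implementation.

```python
import math

def boxplus(x, dx):
    """Apply a local Euclidean increment dx = (dx, dy, dtheta) to the
    2D pose x = (x, y, theta), in the spirit of Eq. (3.43)."""
    theta = x[2] + dx[2]
    # wrap the angle back into (-pi, pi] so the state stays on the manifold
    theta = math.atan2(math.sin(theta), math.cos(theta))
    return (x[0] + dx[0], x[1] + dx[1], theta)

x_new = boxplus((1.0, 2.0, 3.0), (0.1, -0.1, 0.5))
# the angle 3.0 + 0.5 = 3.5 rad leaves (-pi, pi] and is wrapped to 3.5 - 2*pi
```

With an over-parameterized rotation (e.g. a quaternion), the wrapping above would be replaced by composing with the full quaternion built from the minimal increment.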

3.5 Robust Least Squares

Optionally, the least squares optimization can be robustified. Note that the error terms in Eq. 3.28 have the following form:

F_k = e_k^T Ω_k e_k = ρ_2( √(e_k^T Ω_k e_k) )  with  ρ_2(x) := x^2. (3.45)

Thus, the error vector e_k has quadratic influence on F, so that a single potential outlier would have a major negative impact. In order to be more robust to outliers, the quadratic error function ρ_2 can be replaced by a cost function which weighs large errors less. A common choice is the Huber cost function ρ_H,

ρ_H(x) := { x^2          if |x| < b
          { 2b|x| − b^2  otherwise, (3.46)

which is quadratic for small |x| but linear for large |x|. Compared to other, even more robust cost functions, the Huber kernel has the advantage that it is still convex and thus does not introduce new local minima in F [10, p. 616]. In practice, we do not need to modify Eq. 3.28. An alternative is to use


the following scheme: First, compute the error e_k as usual. Then, replace e_k with a weighted version w_k e_k such that

(w_k e_k)^T Ω_k (w_k e_k) = ρ_H( √(e_k^T Ω_k e_k) ). (3.47)

The weights w_k are then calculated as

w_k = √(ρ_H(||e_k||_Ω)) / ||e_k||_Ω  with  ||e_k||_Ω := √(e_k^T Ω_k e_k). (3.48)
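The reweighting of (3.45)-(3.48) reduces, for a scalar error norm, to a few lines; the function names and the choice of the Huber threshold b below are illustrative assumptions.

```python
import math

def huber(x, b):
    """Huber cost rho_H of (3.46): quadratic near zero, linear in the tails."""
    return x * x if abs(x) < b else 2.0 * b * abs(x) - b * b

def huber_weight(err_norm, b):
    """Weight w_k of (3.48), so that the weighted quadratic error equals
    rho_H(||e_k||): w_k = sqrt(rho_H(||e_k||)) / ||e_k||."""
    if err_norm == 0.0:
        return 1.0          # a zero error needs no reweighting
    return math.sqrt(huber(err_norm, b)) / err_norm

w_in = huber_weight(0.5, b=1.0)    # inside the quadratic region: weight 1
w_out = huber_weight(10.0, b=1.0)  # in the linear tail: weight < 1
```

Inliers thus keep their full weight, while gross outliers are progressively down-weighted instead of dominating F.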


Chapter 4

Data Association

So far, we always assumed to know which observation/measurement was generated by which set of state variables. In the point registration exercise, for example, we assumed to know which point in one frame corresponded to which point in the other frame. As you can easily understand, this is typically not the case. In practice we construct our associations incrementally, and we use a large set of aggregated information about the scene to figure out which landmark/point/pose generated a specific observation.

A lot can be done in a simple way, by using the fact that we are estimating a distribution over state variables. This gives us the first simplistic version of data association, that is the Mahalanobis test. At a more advanced level we might want to match multiple observations at once. To this end the most commonly used algorithm is RANSAC.

Since committing to a wrong data association is extremely dangerous, we must be careful when accepting a new match. Even when using a strategy that matches multiple landmarks, we may end up making wrong associations. This is the case of our beloved SLAM algorithm in self-similar environments. Imagine we have several rooms that look the same. If we believe we are in one room while we actually are in another, the two will appear identical to the sensors.

What could be used to disambiguate the situation, in this case, is the layout of the rooms in the environment. To this end we can use the same approach, on a larger scale.

In the remainder of this section, we will first introduce the data association problem. Subsequently we will present the first and easiest method to implement a statistically valid, yet error-prone data association. Then we will introduce a generalized RANSAC algorithm to perform robust association either on a single sensing frame or a sequence of them.

4.1 Data Association

Imagine we have a state space characterized by the variables x^r_{1:n}, x^l_{1:m}. To keep things concrete, you might think of the multi-point registration example of Section 3.1.1, where the x^r_{1:n} are the robot poses and the x^l_{1:m} are the landmark poses. The robot is equipped with a sensor capable of measuring the relative pose of a landmark with respect to its location. From a pose, let's say x^r_i, the robot will be able to measure all the landmarks that are in the field of view of the sensor, x^l_{i_1,…,i_k}. But since all landmarks look the same, all we get from our sensor is a set of measurements z^l_{1,…,k}. We do not know which measurement arises from which landmark. In absence of this information we cannot cast a least squares problem, and all we have learned so far is of little use for SLAM. Additionally, a landmark might not have been seen by our system, thus the corresponding state variable might not exist.

Solving the data association, in this context, consists in finding for each measurement in z^l_{1,…,k} an index that tells which landmark originated this measurement. Data association is thus a combinatorial problem. In practice we do have a guess of the robot pose and of the position of the landmarks we have seen in the previous frames, and this renders the problem tractable in most of the cases. Still, a single association error is sufficient to lead the system to a catastrophic failure.


4.2 RANSAC

• Ask yourself the question: how many correspondences should I know to align these two frames? This depends on the problem.

• Sample a minimal set of correspondences.

• Align the others, given the minimal set.

• Count the inliers/outliers.

• Recompute the transform.
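The steps above can be sketched for the simplest possible model, a pure 2D translation between two point sets, where the minimal set is a single correspondence. Data, threshold, and iteration count are made-up values for illustration.

```python
import random

random.seed(0)
t_true = (2.0, -1.0)
points_a = [(float(i), float(i % 5)) for i in range(20)]
points_b = [(x + t_true[0], y + t_true[1]) for x, y in points_a]
points_b[0] = (100.0, 100.0)       # inject one outlier correspondence

best_t, best_inliers = None, []
for _ in range(20):
    i = random.randrange(len(points_a))          # sample a minimal set (1 pair)
    t = (points_b[i][0] - points_a[i][0],
         points_b[i][1] - points_a[i][1])        # fit the model to it
    inliers = [j for j, (a, b) in enumerate(zip(points_a, points_b))
               if abs(b[0] - a[0] - t[0]) + abs(b[1] - a[1] - t[1]) < 1e-3]
    if len(inliers) > len(best_inliers):         # count the inliers
        best_t, best_inliers = t, inliers

# recompute the transform from all inliers of the best hypothesis
n = len(best_inliers)
best_t = (sum(points_b[j][0] - points_a[j][0] for j in best_inliers) / n,
          sum(points_b[j][1] - points_a[j][1] for j in best_inliers) / n)
```

For richer models (2D rigid, 3D rigid, homographies) only the minimal-set size and the model-fitting step change; the sample/score/refit loop stays identical.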


Part II

Other Stuff


Chapter 5

Introduction to Mobile Robot Navigation and SLAM

5.1 Localization

5.2 Mapping

5.3 Planning


Chapter 6

State Estimation: Filtering

Most of the localization, mapping and SLAM approaches have a probabilistic formulation. In this chapter, we review the Bayesian state estimation framework, and in particular the filtering problem, in order to provide the reader with background on the mathematical tools used, and to introduce the notation.

The filtering problem can be expressed as estimating the state x of a dynamic discrete system (see Figure 6.1), given:

• the analytical knowledge of the state transition function f_t(x_t) and the statistical knowledge of the state noise w_t,

• the analytical knowledge of the measurement function h_t(x_t) and the statistical knowledge of the observation noise v_t,

• a realization of the system output z_{1:t} up to time t.

A probabilistic filter for a dynamic system is a mathematical tool whose goal is to estimate a distribution over the possible system state histories given the measurements:

p(x_{1:t} | z_{1:t}) (6.1)

Here x_{1:t} are the system states from the instant 1 to the instant t, while z_{1:t} is the history of measurements z from 1 to t.

In most of the cases, one is interested in evaluating only the marginal distribution of the current state given the observations:

p(x_t | z_{1:t}) (6.2)

instead of the full state history. Several on-line and off-line techniques for solving the filtering problem have been proposed [13, 7, 1, 22]. Most of them rely on the assumption that the process being observed is Markovian. A process is Markovian if the current measurement is independent of the past ones, given the current state: p(z_t | z_{1:t−1}, x_t) = p(z_t | x_t).

Figure 6.1: Generic system model, with state transition f_t, measurement function h_t, state noise w_t, observation noise v_t, state x_t, and output z_t. The ∆ block is a one step delay.


In the rest of this chapter we describe a wide range of useful tools for Bayes filtering, which is a framework that can be used for estimating the state of Markovian systems. In particular we will describe:

• The Kalman Filter (KF), which is an exact, closed form filter working with linear systems affected by zero mean Gaussian noise.

• The Particle Filters (PF), which are Monte Carlo methods suitable for the state estimation of non-linear, non-Gaussian dynamic systems.

6.1 Recursive Bayes Formulation

Now we shortly present the key filtering equations and formalize the problem. Let p(z_t|x_t) be the observation model, that is, the density of the measurement z given that the system state is x, and let p(x_t|x_{t−1}) be the transition model¹. If the Markov assumption holds, the posterior of the state chain up to time t is

p(x_{1:t}|z_{1:t}) = p(z_t|x_{1:t}, z_{1:t−1}) p(x_{1:t}|z_{1:t−1}) / p(z_t|z_{1:t−1})

                 = p(z_t|x_t) p(x_{1:t}|z_{1:t−1}) / p(z_t|z_{1:t−1})   [by the Markov assumption]

                 = [ p(z_t|x_t) p(x_t|x_{t−1}) / p(z_t|z_{1:t−1}) ] p(x_{1:t−1}|z_{1:t−1}) (6.3)

If one is interested in estimating the current state distribution, the filtering equation becomes the following:

p(x_t|z_{1:t}) = p(z_t|x_t) p(x_t|z_{1:t−1}) / p(z_t|z_{1:t−1})

              = p(z_t|x_t) ∫ p(x_t|x_{t−1}) p(x_{t−1}|z_{1:t−1}) dx_{t−1} / p(z_t|z_{1:t−1})

              = p(z_t|x_t) ∫ p(x_t|x_{t−1}) p(x_{t−1}|z_{1:t−1}) dx_{t−1} / ∫ p(z_t|z_{1:t−1}, x_t) p(x_t|z_{1:t−1}) dx_t

              = η p(z_t|x_t) ∫ p(x_t|x_{t−1}) p(x_{t−1}|z_{1:t−1}) dx_{t−1} (6.4)

here η is a normalization factor ensuring that Eq. 6.4 correctly represents a probability distribution.

Usually, the evaluation of Eq. 6.4 is done in two steps: prediction and update. In the prediction step, the result of the state transition from x_{t−1} to x_t is computed. In the update step, the last observation z_t is incorporated into the previously computed probability density. Referring to Eq. 6.4, one can argue that the predict step consists in computing the integral term. The update step is performed by weighting the predicted belief ∫ p(x_t|x_{t−1}) p(x_{t−1}|z_{1:t−1}) dx_{t−1} with the last observation likelihood p(z_t|x_t). These two steps can be found in all of the filters described in the remainder of this chapter.

Bayes filtering in this form is exact and can be used on any kind of system for which the Markov assumption holds. Unfortunately, the above equations contain integrations over the state space. In many cases the state space is high dimensional, and the Bayes filter cannot be directly implemented. For instance, in the SLAM problem the dimension of the system state is the sum of the robot location dimension and the map space dimension, which can easily be in the order of hundreds or thousands. A straightforward evaluation of Eq. 6.4 would require an integration over the entire state space. For this reason approximate techniques are needed.

In the remainder of this chapter we will derive two classes of filters, addressing parametric and non-parametric pdfs. A parametric pdf is a distribution that can be described by a fixed, finite set of parameters.

¹In the following sections, the evolution model is also referred to as motion model, since it is used for describing the change of the robot state after motion. In the SLAM context the state transitions are governed by the proprioceptive sensings u_{0:t−1}. For clarity of notation, but without loss of generality, in this chapter we omit the terms u_{0:t−1} which affect the transition model. This is equivalent to considering a time dependent transition model. In practice p(x_t|x_{t−1}, u_{t−1}) = p(x_t|x_{t−1}).


6.2 Gaussian Filters

A probability density is a function that maps the domain of the random variable to the positive reals, and must integrate to one. This means that one might have virtually any shape of distribution, and the shape might change during the filtering process. In the lucky case that the system has linear (or close to linear) dynamics and measurement models, and in the even luckier case that the noise is normally distributed, the pdfs in the game remain Gaussian. This means that one needs to estimate only how the parameters of the Gaussian change as more measurements are integrated (while the shape stays fixed: it is a Gaussian). In this section we will review some properties of the Gaussian distribution, and we will shortly introduce the Kalman filter, which is nothing else than a development of Eq. 6.4 where one plugs in the fancy Gaussian properties.

6.2.1 Kalman Filter

The Kalman Filter (KF) [13, 28] is an exact filter that can be derived directly from Equation 6.4 under the assumptions that the system is linear and the noise is Gaussian. Under these linearity hypotheses the system can be described by

x_t = F_t x_{t−1} + a_{t−1} + w_t
z_t = H_t x_t + b_t + v_t

The system noise w_t ∼ N(0, Σ_{w_t}) and the observation noise v_t ∼ N(0, Σ_{v_t}) are zero mean and normally distributed. The key advantage of the Kalman Filter is that it represents the distributions in closed form, in terms of mean and covariance matrix. The update of the Kalman filter can be carried out in the time of a matrix multiplication (O(n^3), where n is the state dimension).

The iterative algorithm of the filter is the following:

• predict:

x'_t = F_t x_{t−1} + a_{t−1},   Σ'_t = F_t Σ_{t−1} F_t^T + Σ_{w_t} (6.5)

• update:

K_t = Σ'_t H_t^T (H_t Σ'_t H_t^T + Σ_{v_t})^{−1} (6.6)

x_t = x'_t + K_t (z_t − H_t x'_t − b_t),   Σ_t = (I − K_t H_t) Σ'_t (6.7)

One can derive the predict step of a Kalman filter by applying Eq. A.11 to the input distribution. The update step is obtained by assuming that p(z_t|x_t) is distributed according to N(z_t; H_t x_t + b_t, Σ_{v_t}). Based on that and on the p(x_t|z_{1:t−1}) obtained in the predict step, one can compute the joint distribution p(z_t, x_t|z_{1:t−1}) by using Eq. A.14, and subsequently condition on the real measurement z_t by Eq. A.8. The procedure is described in detail in [23].
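A single predict/update cycle of Eqs. (6.5)-(6.7) can be checked numerically on a toy 1D system; all matrices and noise values below are illustrative assumptions, not taken from the text.

```python
import numpy as np

F = np.array([[1.0]]); a = np.array([0.0])   # constant-position transition
H = np.array([[1.0]]); b = np.array([0.0])   # direct observation of the state
Sigma_w = np.array([[0.1]])                  # process noise covariance
Sigma_v = np.array([[0.5]])                  # measurement noise covariance

x = np.array([0.0]); Sigma = np.array([[1.0]])   # prior belief
z = np.array([1.0])                              # incoming measurement

# predict, Eq. (6.5)
x_pred = F @ x + a
S_pred = F @ Sigma @ F.T + Sigma_w
# update, Eqs. (6.6)-(6.7)
K = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + Sigma_v)
x = x_pred + K @ (z - H @ x_pred - b)
Sigma = (np.eye(1) - K @ H) @ S_pred
# the posterior mean moves toward the measurement, and the variance shrinks
```

Here the gain is K = 1.1/1.6 = 0.6875, so the posterior mean is 0.6875 and the posterior variance 0.34375, smaller than both the prior and the measurement variance.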

Unfortunately, in the mobile robot domain the evolution model, as well as the observation model, are non-linear, so the estimate does not remain Gaussian. However, for mild evolution laws, a non-linear extension can be used: the Extended Kalman Filter (EKF) [28], in which local linearizations of the state transition function f(x) and the observation model h(x) are performed. The extended Kalman filter algorithm can be expressed as

• predict:

x'_t = f_{t−1}(x_{t−1}),   Σ'_t = F_t Σ_{t−1} F_t^T + Σ_{w_t} (6.8)

• update:

K_t = Σ'_t H_t^T (H_t Σ'_t H_t^T + Σ_{v_t})^{−1} (6.9)

x_t = x'_t + K_t (z_t − h_t(x'_t)),   Σ_t = (I − K_t H_t) Σ'_t (6.10)

here F_t = ∇_x f_t|_{x_t} and H_t = ∇_x h_t|_{x_t}.

The key limitation in the use of the extended Kalman filter lies in the strong assumptions that have to be made on the estimated system, namely: Gaussian noise and linearizability. In practical situations


mild violations of the above assumptions lead only to a loss of optimality. However, in most of the robotic systems used for localization and SLAM, the uncertainty is neither expressible nor approximable as a Gaussian distribution, being multi-modal and irregularly shaped. When more modes are present in a distribution, dealing with multiple hypotheses is needed, while the Kalman Filter works on their mean. In such situations, its use is prone to failure. Moreover, the linearization of the system can introduce some systematic error in the estimate.

Finally, some systems cannot be linearized (the first-order derivatives of the state transition function being null), thus the extended Kalman Filter cannot be applied. In these contexts the Unscented Kalman Filter (UKF) [12, 27] can be used. The key difference between EKF and UKF lies in the filter predict step. The first computes a local linearization of the transition function around the current mean estimate, and uses this local linearization for computing the predicted distribution. The second systematically selects a set of sampling points (σ-points) in the state space, around the mean. The σ-points are selected along the axes of the Cholesky decomposition of the covariance matrix. Subsequently, the σ-points are propagated through the transition function, and their images are used for computing the predicted mean and covariance. While the UKF in general behaves better than the EKF, the hypothesis of Gaussian noise is still required to hold.

Despite the above outlined limitations, the Kalman Filter is one of the most used tools in localization and SLAM, due to its simplicity. Moreover, when the underlying hypotheses hold, it exhibits a strong convergence rate compared with other filtering techniques.

6.2.2 Information Filter

The information filter is a variant of the Kalman filter that uses the canonical parameterization instead of the moments parameterization to represent the Gaussian. It has the same properties as the Kalman filter, but it can better handle cases or measurements where some variable is affected by an infinite noise (zero information). Whereas the computationally expensive step in the Kalman filter is the update, in the information filter the expensive step is the predict. Another issue of the information filter is that retrieving the mean from the information vector requires inverting Ω. The big advantage of this representation is that keeping an extended representation for the states (that is, maintaining the joint distribution over x_{1:t}) results in a sparse pattern of the information matrix, which allows for substantial speedups in the computation. This has been exploited in [6]. For the actual purpose of these notes, we do not enter into the details. Indeed, any proactive investigation by the students is welcome.

6.2.3 Example: Localization

6.3 Non-Parametric Filters

6.3.1 Discrete Filter and Hidden Markov Models

6.3.2 Particle Filters

A particle filter is a Bayes filter that works by representing a probability distribution as a set of samples (particles):

p(x) ≃ (1/N) Σ_i δ_{x^{(i)}}(x), (6.11)

where δ_{x^{(i)}}(x) is the impulse function centered in x^{(i)}. The denser the samples x^{(i)} are in a region, the higher is the probability that the current state falls within that region.

In principle, in order to maintain a sampled representation of the feasible system state histories, one should draw the samples x^{(i)}_{1:t} from the probability distribution p(x_{1:t} | z_{1:t}) of the current state given the observation history. Such a distribution is in general not available in a form suitable for sampling. However, the Importance Sampling (IS) principle ensures that if one can:

• evaluate point-wise and draw samples from an arbitrarily chosen importance function π(x_{1:t} | z_{1:t}), such that p(x_{1:t} | z_{1:t}) > 0 ⇒ π(x_{1:t} | z_{1:t}) > 0, and

• evaluate point-wise p(x_{1:t} | z_{1:t}),


Figure 6.2: The Importance Sampling principle, illustrated in three panels a), b) and c). In picture a) the goal distribution and some samples drawn from it are depicted. The denser the samples are in a region, the higher is the probability density in that region. If we do not have a closed form for the goal distribution we are not able to draw samples from it. However, we can draw samples from another distribution, for instance the uniform, shown in picture b). Picture c) shows the samples, weighted according to Importance Sampling. We compute the probability mass falling in an interval ∆p by summing the weights of the samples falling in the interval. If using a high number of samples, the probability masses computed from a) and c) are similar, becoming equal as the number of samples goes to infinity.

then it is possible to recover a sampled approximation of p(x_{1:t} | z_{1:t}) as

p(x_{1:t} | z_{1:t}) ∝ Σ_i w^{(i)} δ_{x^{(i)}_{1:t}}(x_{1:t}). (6.12)

Here the x^{(i)}_{1:t} are samples drawn from π(x_{1:t} | z_{1:t}), and

w^{(i)}_t = p(x^{(i)}_{1:t} | z_{1:t}) / π(x^{(i)}_{1:t} | z_{1:t})

is the importance weight related to the i-th sample, which takes into account the mismatch between the target distribution p(x_{1:t} | z_{1:t}) and the importance function. Observe that, in case we are able to draw samples from the target distribution, such that p(x^{(i)}_{1:t} | z_{1:t}) ∝ π(x^{(i)}_{1:t} | z_{1:t}), all of the weights are the same, and the variance of w^{(i)} is 0. An intuitive explanation of how the importance sampling principle works is given in Figure 6.2.
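Numerically, the importance sampling principle can be demonstrated in a few lines: we estimate the mean of a target density p using samples drawn from a different proposal π, corrected by the weights w = p/π. The two densities below are arbitrary choices made for illustration.

```python
import math
import random

random.seed(1)

def gauss_pdf(x, mu, sigma):
    """Density of a 1D Gaussian N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# target p = N(1, 1), proposal pi = N(0, 3^2): we never sample from p directly
samples = [random.gauss(0.0, 3.0) for _ in range(200000)]          # draw from pi
weights = [gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 3.0) for x in samples]

# self-normalized estimate of E_p[x]; the true value is 1
total = sum(weights)
mean_est = sum(w * x for w, x in zip(weights, samples)) / total
```

With a proposal wider than the target, the weights stay bounded and the estimate converges; a proposal narrower than the target would instead produce occasional huge weights and a high-variance estimate.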

Sequential Importance Sampling By restricting to the set of Markovian systems, and in particular focusing the choice on a particular class of importance functions, such that

π(x_{1:t} | z_{1:t}) = π(x_t | x_{1:t−1}, z_{1:t}) π(x_{1:t−1} | z_{1:t−1}), (6.13)

it is possible to recursively compute the importance weights, without revising the past generated trajectories, since

w^{(i)}_t = p(x^{(i)}_{1:t} | z_{1:t}) / π(x^{(i)}_{1:t} | z_{1:t})

         = η [ p(z_t | x^{(i)}_t) p(x^{(i)}_t | x^{(i)}_{t−1}) / π(x^{(i)}_t | x^{(i)}_{1:t−1}, z_{1:t}) ] · [ p(x^{(i)}_{1:t−1} | z_{1:t−1}) / π(x^{(i)}_{1:t−1} | z_{1:t−1}) ]

         ∝ [ p(z_t | x^{(i)}_t) p(x^{(i)}_t | x^{(i)}_{t−1}) / π(x^{(i)}_t | x^{(i)}_{1:t−1}, z_{1:t}) ] w^{(i)}_{t−1} (6.14)

where η = 1/p(z_t | x^{(i)}_{1:t−1}, z_{1:t−1}) is a normalization factor. Several approaches select the importance function to be the transition model p(x_t | x_{t−1}). According to the importance sampling principle, the weights w^{(i)}_t can then be computed as follows:

w^{(i)}_t = p(x^{(i)}_{1:t} | z_{1:t}) / π(x^{(i)}_{1:t} | z_{1:t})      [using Eq. 6.14]

         ∝ [ p(z_t | x^{(i)}_t) p(x^{(i)}_t | x^{(i)}_{t−1}) / p(x^{(i)}_t | x^{(i)}_{t−1}) ] w^{(i)}_{t−1}

         ∝ w^{(i)}_{t−1} p(z_t | x^{(i)}_t). (6.15)


Sampling Importance Resampling The direct use of a Sequential Importance Sampling filter requires a huge number of samples, since as the system evolves all of the particles but one will have a negligible weight. For this reason the Sampling Importance Resampling (SIR) filter [7] has been introduced.

A SIR filter sequentially processes the observations z_t and the state transitions x_{t−1} → x_t as they are perceived, by updating a set of samples representing the estimated distribution p(x_{1:t} | z_{1:t}).

This is done by performing the following three steps:

1. Sampling: the next generation of particles x^{(i)}_t is obtained from the previous generation x^{(i)}_{t−1} by sampling from a proposal distribution π(x_t | x^{(i)}_{1:t−1}, z_{1:t}).

2. Importance Weighting: an individual importance weight w^{(i)} is assigned to each particle, according to the IS principle:

w^{(i)} = w^{(i)}_{t−1} p(x^{(i)}_t | x^{(i)}_{1:t−1}, z_{1:t}) / π(x^{(i)}_t | x^{(i)}_{1:t−1}, z_{1:t}). (6.16)

The weights w^{(i)} account for the fact that the proposal distribution π in general is not equal to the true distribution of successor states.

3. Resampling: particles with a low importance weight w are typically replaced by samples with a high weight. This step is necessary since only a finite number of particles are used to approximate a continuous distribution. Furthermore, resampling allows to apply a particle filter in situations in which the true distribution differs from the proposal one. For completeness we report a common resampling technique in Algorithm 4.

Please note that the SIR filter described here assumes the proposal to be suitable for sequential estimation. This means that π(·) satisfies Eq. 6.13.

In principle, one would evaluate point-wise and sample from the target distribution p(x_t | z_{1:t}), but this is hard, due to the following reasons:

• it is usually not available in closed form, and

• it depends on the whole input history z_{1:t}, up to time t.

In the rest of this section, we discuss a crucial problem affecting particle filters: particle depletion. Such a problem can be detected by the analysis of some indicators of the filter behavior, while it can be bounded by a proper choice of the proposal distribution.

Algorithm 4 Systematic Resampling

Require: SampleSet S, the input sample set
Ensure: SampleSet R, the unweighted output distribution

  S' ← normalize(S)
  R ← ∅
  n ← (1/|S|) · sampleFromUniform()
  t ← n
  c ← 0
  for all s^{(i)} = ⟨w^{(i)}, x^{(i)}⟩ ∈ S' do
    c ← c + w^{(i)}
    while c > t do
      r ← ⟨1/|S|, x^{(i)}⟩
      R ← R ∪ {r}
      t ← t + 1/|S|
    end while
  end for
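Algorithm 4 translates almost line by line into Python; the (weight, state) pair representation of a sample is our own assumption for illustration.

```python
import random

def systematic_resample(samples):
    """samples: list of (weight, state) pairs. Returns an equally
    weighted sample set of the same size, as in Algorithm 4."""
    total = sum(w for w, _ in samples)
    normalized = [(w / total, x) for w, x in samples]   # normalize(S)
    n = len(samples)
    step = 1.0 / n
    t = step * random.random()        # a single random offset in (0, 1/n)
    out, c = [], 0.0
    for w, x in normalized:
        c += w                        # cumulative weight
        while c > t:
            out.append((step, x))     # resampled particles share weight 1/n
            t += step
    return out

random.seed(0)
particles = [(0.70, 'a'), (0.20, 'b'), (0.10, 'c')]
resampled = systematic_resample(particles)
```

Because a single offset is shared by all the equally spaced thresholds, each particle is replicated roughly in proportion to its weight, with lower variance than independent multinomial draws.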


6.3.3 Example: Localization


Appendix A

Gaussian Distribution

The Gaussian density is described through its parameters, the mean µ and the covariance matrix Σ. A remarkable property of the Gaussian is that the marginalization, the conditioning and the affine transformation of a normally distributed random variable are still normally distributed.

The pdf of a normal distribution is the following:

p(x) = N(x; µ, Σ) = 1/((2π)^{n/2} √(det Σ)) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) ), (A.1)

where the notation N(x; µ, Σ) makes the parameters explicit, and n is the dimension of the random variable x. Instead of parameterizing the Gaussian through its mean and covariance matrix, one can use the information vector ν and the (Fisher) information matrix Ω, where

ν = Σ−1µ (A.2)

Ω = Σ−1. (A.3)

This leads to the following equivalent expression for Eq. A.1:

N^{−1}(x; ν, Ω) = ( exp(−(1/2) ν^T Ω^{−1} ν) √(det Ω) / (2π)^{n/2} ) exp( −(1/2) x^T Ω x + x^T ν ). (A.4)

These two parameterizations of the Gaussian are called, respectively, moments parameterization (Eq. A.1) and canonical parameterization (Eq. A.4).

A.1 Partitioned Gaussian Densities

In the remainder of this appendix, we will discuss how to perform the basic operations on Gaussian distributions, namely: marginalization, conditioning and affine transformation. Without loss of generality, we will consider partitioned Gaussian densities, where the random variable is an n-dimensional vector x^T = (x_a^T x_b^T) that consists of two subvectors x_a and x_b. We can make the components of x explicit in the parameterizations of the Gaussian, as follows:

x = (x_a; x_b),  µ = (µ_a; µ_b),  ν = (ν_a; ν_b),  Σ = (Σ_aa, Σ_ab; Σ_ba, Σ_bb),  Ω = (Ω_aa, Ω_ab; Ω_ba, Ω_bb). (A.5)

Since both the covariance matrix Σ and the information matrix Ω are symmetric, Σ_ba = Σ_ab^T and Ω_ba = Ω_ab^T.

A.2 Marginalization of a Partitioned Gaussian Density

Let x^T = (x_a^T x_b^T) be a Gaussian random variable such that x ∼ N(x; µ, Σ).

The marginal probability p(x_a) = ∫_{x_b} p(x_a, x_b) dx_b is Gaussian, with parameters

p(x_a) = N(x_a; µ_a, Σ_aa). (A.6)

The proof of this statement is in [23].


A.3 Conditioning of a Partitioned Gaussian Density

Let x^T = (x_a^T x_b^T) be a Gaussian random variable such that x ∼ N(x; µ, Σ).

The conditional probability p(x_a | x_b) = p(x_a, x_b) / ∫_{x_a} p(x_a, x_b) dx_a is Gaussian with parameters

p(x_a | x_b) = N(x_a; µ_{a|b}, Σ_{a|b}), (A.7)

where

µ_{a|b} = µ_a + Σ_ab Σ_bb^{−1} (x_b − µ_b)  and  Σ_{a|b} = Σ_aa − Σ_ab Σ_bb^{−1} Σ_ba, (A.8)

or alternatively, in information form,

ν_{a|b} = ν_a − Ω_ab x_b  and  Ω_{a|b} = Ω_aa. (A.9)

The proof of this statement is in [23]. Note that if the conditioning value is the mean of the b component, x_b = µ_b, we have µ_{a|b} = µ_a and, in information form, ν_{a|b} = ν_a − Ω_ab µ_b = Ω_aa µ_a.
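The consistency between (A.8) and (A.9) can be verified numerically: the conditional covariance Σ_{a|b} from (A.8) must equal the inverse of the top-left block Ω_aa of Ω = Σ^{−1}. The example covariance below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Sigma = M @ M.T + 4.0 * np.eye(4)      # an arbitrary valid (SPD) 4x4 covariance

Saa, Sab = Sigma[:2, :2], Sigma[:2, 2:]
Sba, Sbb = Sigma[2:, :2], Sigma[2:, 2:]
Sigma_cond = Saa - Sab @ np.linalg.inv(Sbb) @ Sba   # Sigma_{a|b} from (A.8)

Omega = np.linalg.inv(Sigma)                        # information matrix
Omega_aa = Omega[:2, :2]                            # Omega_{a|b} = Omega_aa (A.9)
# the two parameterizations agree: Sigma_{a|b} equals Omega_aa^{-1}
```

This is a numerical instance of the block matrix inversion lemma, which is the reason why conditioning is trivial in information form while it requires a Schur complement in moments form.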

A.4 Affine Transformations

Let x_a be a random variable normally distributed according to N(x_a; µ_a, Σ_a), and let A and c be, respectively, a matrix and a vector of compatible dimensions. The random variable

x_b = A x_a + c (A.10)

is normally distributed according to N(x_b; µ_b, Σ_b), with parameters

µ_b = A µ_a + c,  Σ_b = A Σ_a A^T. (A.11)

A.5 Chain Rule

Let x_a be a random variable normally distributed according to p(x_a) = N(x_a; µ_a, Σ_a), and let p(x_b | x_a) be the distribution of another variable x_b, conditioned on x_a, distributed according to

p(x_b | x_a) = N(x_b; A x_a + c, Σ_{b|a}), (A.12)

where A and c are, respectively, a matrix and a vector of appropriate dimensions. If the above assumptions hold, the joint distribution p(x_a, x_b) will still be normally distributed according to

p(x_a, x_b) = N(x_{a,b}; µ_{a,b}, Σ_{a,b}). (A.13)

Eq. A.13 denotes a Gaussian of dimension Dim(x_a) + Dim(x_b), where

x_{a,b} = (x_a; x_b),  µ_{a,b} = (µ_a; A µ_a + c),  Σ_{a,b} = (Σ_a, Σ_a A^T; A Σ_a, Σ_{b|a} + A Σ_a A^T). (A.14)

The information matrix Ω_{a,b} = Σ_{a,b}^{−1} is

Ω_{a,b} = (A^T Ω_{b|a} A + Ω_a, −A^T Ω_{b|a}; −Ω_{b|a} A, Ω_{b|a}). (A.15)
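The joint parameters in (A.14) can be checked with a quick Monte Carlo experiment: sampling x_a and then x_b = A x_a + c plus conditional noise reproduces the predicted covariance blocks. All numbers below are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
mu_a = np.array([1.0, -1.0])
Sigma_a = np.array([[1.0, 0.3], [0.3, 2.0]])
A = np.array([[2.0, 0.0], [1.0, 1.0]])
c = np.array([0.5, 0.0])
Sigma_b_given_a = 0.2 * np.eye(2)      # conditional covariance Sigma_{b|a}

N = 200000
xa = rng.multivariate_normal(mu_a, Sigma_a, size=N)
noise = rng.multivariate_normal(np.zeros(2), Sigma_b_given_a, size=N)
xb = xa @ A.T + c + noise              # x_b | x_a ~ N(A x_a + c, Sigma_{b|a})

joint_cov = np.cov(np.hstack([xa, xb]).T)
# off-diagonal block ~ Sigma_a A^T, bottom block ~ Sigma_{b|a} + A Sigma_a A^T
```

Up to sampling noise, the empirical blocks match the closed-form expressions in (A.14), which is exactly the construction used in the Kalman filter update to build p(z_t, x_t | z_{1:t−1}).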


Bibliography

[1] S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, XX, 2001.

[2] J.A. Castellanos, J.M.M. Montiel, J. Neira, and J.D. Tardos. The SPmap: A probabilistic framework for simultaneous localization and map building. IEEE Transactions on Robotics and Automation, 15(5):948–953, 1999.

[3] F. Dellaert and M. Kaess. Square Root SAM: Simultaneous localization and mapping via square root information smoothing. Int. Journal of Robotics Research, 25(12):1181–1204, Dec 2006.

[4] A. Doucet, N. De Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice.Springer Verlag, 2001.

[5] R. Eustice, H. Singh, and J.J. Leonard. Exactly sparse delayed-state filters. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), pages 2428–2435, Barcelona, Spain, 2005.

[6] R. Eustice, M. Walter, and J.J. Leonard. Sparse extended information filters: Insights into sparsification. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pages 641–648, Edmonton, Canada, 2005.

[7] N. Gordon, D. Salmond, and A.F.M. Smith. A novel approach to nonlinear/non-Gaussian Bayesian estimation. In IEE Proceedings F, pages 107–113, 1993.

[8] G. Grisetti, C. Stachniss, and W. Burgard. Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Trans. on Robotics, 23(1):34–46, 2007.

[9] D. Hahnel, W. Burgard, D. Fox, and S. Thrun. An efficient FastSLAM algorithm for generating mapsof large-scale cyclic environments from raw laser range measurements. In Proc. of the IEEE/RSJInt. Conf. on Intelligent Robots and Systems (IROS), pages 206–211, Las Vegas, NV, USA, 2003.

[10] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge UniversityPress, second edition, 2004.

[11] C. Hertzberg. A framework for sparse, non-linear least squares problems on manifolds. Master’sthesis, Univ. Bremen, 2008.

[12] S Julier, J. Uhlmann, and H. Durrant-White. A new approach for filtering nonlinear systems. InProceedings of the American Control Conference, pages 1628–1632, 1995.

[13] R.E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Enge-neering, 1960.

[14] K. Konolige. A gradient method for realtime robot control. In Proc. of the IEEE/RSJ Int. Conf. onIntelligent Robots and Systems (IROS), 2000.

[15] K. Konolige, J. Bowman, J. D. Chen, P. Mihelich, M. Calonder, V. Lepetit, and P. Fua. View-basedmaps. International Journal of Robotics Research (IJRR), 29(10), 2010.

[16] K. Konolige, G. Grisetti, R. Kummerle, W. Burgard, B. Limketkai, and R. Vincent. Sparse poseadjustment for 2d mapping. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems(IROS), 2010.

44

[17] J.M. Lee. Introduction to Smooth Manifolds. Graduate Texts in Mathematics. Springer, 2002.

[18] J.M. Lee. Introduction to Smooth Manifolds, volume 218 of Graduate Texts in Mathematics. SpringerVerlag, 2003.

[19] F. Lu and E. Milios. Globally consistent range scan alignment for environment mapping. Au-tonomous Robots, 4:333–349, 1997.

[20] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to simul-taneous localization and mapping. In Proc. of the National Conference on Artificial Intelligence(AAAI), pages 593–598, Edmonton, Canada, 2002.

[21] E. Olson, J. Leonard, and S. Teller. Fast iterative optimization of pose graphs with poor initialestimates. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), pages 2262–2269,2006.

[22] M. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the AmericanStatistical Association, 94(446):590–599, 1999.

[23] T. Schoen and F. Lindsten. Manipulating the multivariate gaussian density. Technical report,Linkoeping University, 2011. http://www.rt.isy.liu.se/ schon/Publications/SchonL2011.pdf.

[24] R. Smith, M. Self, and P. Cheeseman. Estimating uncertain spatial realtionships in robotics. InI. Cox and G. Wilfong, editors, Autonomous Robot Vehicles, pages 167–193. Springer Verlag, 1990.

[25] S. Thrun, Y. Liu, D. Koller, A.Y. Ng, Z. Ghahramani, and H. Durrant-Whyte. Simultaneouslocalization and mapping with sparse extended information filters. Int. Journal of Robotics Research,23(7/8):693–716, 2004.

[26] R. Triebel, P. Pfaff, and W. Burgard. Multi-level surface maps for outdoor terrain mapping andloop closing. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2006.

[27] R. van der Merwe, N. de Freitas, A. Doucet, and E. Wan. The unscented particle filter. TechnicalReport CUED/F-INFENG/TR380, Cambridge University Engineering Department, August 2000.

[28] G. Welch and G. Bishop. An introduction to the kalman filter. Technical report, University of NorthCarolina at Chapel Hill, Chapel Hill, NC 27599-3175, 2001.

45