Vision-Aided Localization For Ground Robots · Top left: An image used for localization with red dots representing the extracted visual features. Top right and bottom: Different views

Vision-Aided Localization For Ground Robots

Mingming Zhang, Yiming Chen and Mingyang Li

Abstract— In this paper, we focus on the problem of vision-based localization for ground robotic applications. In recentyears, camera only or camera-IMU (inertial measurementunit) based localization methods are widely studied, in termsof theoretical properties, algorithm design, and real-worldapplications. However, we experimentally find that none ofexisting methods is able to perform high-precision and robustlocalization for ground robots in large-scale complicated 3Denvironments. To this end, in this paper, we propose a novelvision-based localization algorithm dedicatedly designed forground robots, by fusing measurements from a camera, an IMU,and the wheel odometer. The first contribution of this paper isthat we propose a novel algorithm for approximating the motionmanifold for ground robots by parametric representation andperforming pose integrating via IMU and wheel odometermeasurements. Secondly, we propose a complete localizationalgorithm, by using a sliding-window based estimator. Theestimator is designed based on iterative optimization to fusemeasurements from multiple sensors on the proposed manifoldrepresentation. We show that, based on a variety of real-worldexperiments, the proposed algorithm outperforms a numberof the state-of-the-art vision based localization algorithms bya significant margin, especially when deployed in large-scalecomplicated environments.

I. INTRODUCTION AND RELATED WORK

The localization problem for ground robots has been underactive research for about 30 years [1] [2] [3] [4] [5]. Themajority of real-world ground robotic applications (e.g.,self-driving cars) rely on high-quality GPS-INS systemstogether with 3D laser range-finders for precise localiza-tion [3] [6] [7]. However, those systems are at high manu-facturing and maintenance costs, requiring thousands or eventens or hundreds of thousands of dollars, which inevitablyprevent their wide applications. Alternatively, low-cost local-ization approaches have gained increased interests in recentyears, especially the ones that rely on cameras [4] [8].Camera’s size, low cost, and 3D sensing capability makesitself a popular sensor. Additionally, recent work showsthat, when used together with an IMU, the localizationperformance can be significantly improved [9] [10] [11] [12].In fact, camera-IMU localization1 is widely used in realapplications, e.g., smart drones, augmented and virtual realityheadsets, or mobile phones.

However, all localization methods mentioned above arenot optimized for ground robots. Although camera-IMUlocalization generally performs better than camera only lo-calization by resolving the ambiguities in estimating scale,

The authors are with A.I. Labs, Alibaba Group Inc., Hangzhou,China. mingmingzhang, yimingchen, [email protected]

1Camera-IMU localization can also be termed as vision-aided inertialnavigation, visual-inertial odometry (VIO), or visual inertial navigationsystem (VINS) in other papers.

Fig. 1: Representative visualization of the proposed algo-rithm 1. Top left: An image used for localization with reddots representing the extracted visual features. Top right andbottom: Different views of localization with pre-built map,green line is the estimated trajectory, red dots are positionsof visual landmarks in the pre-built map, and blue linesrepresent landmark-to-feature correspondences.

roll, and pitch [9] [13] [11], it has its own limitationswhen used for localizing ground robots. Firstly, there area couple of degenerate cases that can result in large errorswhen performing motion estimation, e.g., system under staticmotion, zero rotational velocity motion, constant local linearacceleration motion, and so on [9] [14] [15]. The likelihoodof encountering those degenerate cases in ground robots issignificantly larger than that in hand-held mobile devices.Secondly, unlike drones or smart headset devices whichmove freely in 3D environment, ground robots can onlymove on a manifold (e.g., ground surfaces) due to nature ofmechanical design. This makes it possible to use additionallow-cost sensors and derive extra mathematical constraintsfor improving the localization performance [14] [16].

In recent years, there are a couple of low-costvision-based localization methods designed for groundrobots [14] [16] [17]. Specifically, Wu et al. [14] proposedto introduce planar-motion constraints to camera-IMU lo-calization system, and also add wheel odometer measure-ments for stochastic optimization. The proposed methodis shown to improve localization performance in indoor

1Attached video demonstration online is available at https://youtu.be/_gQ3ky_GTsA

Image

IMU

Buffering and Tracking

Pose Estimation

Global Re-Localization

Synchronization- Buffering raw measurements. - Synchronize across threads and

interpolate data to generate frame.

Create new keyframe?No

Solve Pose - Add local visual/IMU/wheel

encoder constraints. - Add re-localization constraints. - Estimate keyframe pose.

Marginalization - Marginalize old poses out of the sliding window. - Marginalize landmarks. - Generate information matrix

and vector.

Prior information

Visual re-localization results

Geometric Verification - Gravity PnP. - Multi-view Consistency.

Visual tracking- Predict and track existing features. - Extract new features. Robot poseInitialization

- Compute local gravity vector.

- Motion Variables. - Manifold Variables.

Yes

Pose Predict - predict latest robot pose with wheel odometer.

Wheel Odometer

Fig. 2: System overview of the proposed approach.

environments. [17] designed a complete framework forvisual-odometer SLAM, in which IMU measurements andwheel odometer measurements were used together for poseintegration. Additionally, to better utilize wheel odometermeasurements, the corresponding intrinsic parameters canalso be calibrated online for performance enhancement [16].However, [14] [17] [16] only focus on robotic navigation ona single piece of planar surface. While this is typically truefor most indoor environments, applying those algorithms incomplicated outdoor 3D environments is highly risky.

By contrast, in this paper, we propose to approximatemotion manifolds for ground robots by a parametric repre-sentation, supporting most indoor and outdoor human-madeenvironments. This allows ground robot to conduct generalcommercial applications without mathematical formulationviolation and reduced performance. Additionally, we proposea novel localization framework for fusing measurementsfrom cameras, an IMU, and the wheel odometer, with aniterative sliding-window estimator. The overall localizationsystem using our proposed method is shown in Fig. 2,which is similar to the structure of other state-of-the-artalgorithms [10] [11] [18]. By extensive real-world experi-ments, we show that, the proposed method outperforms otherstate-of-the-art algorithms, specifically [9] [14] [10], by asignificant margin.

II. NOTATIONS

In this work, we assume a ground robot navigating withrespect to a global reference frame, G. When a pre-built map is used for persistent localization, its referenceframe is denoted as M. The reference frames for eachsensor are denoted by C, I, and O, for camera, IMU,and wheel odometer respectively. The center of frame Olocates at the center of the robot wheels, with its x-axispointing forward and z-axis pointing up. Additionally, we useApB and A

Bq to represent the position and unit quaternion

orientation of frame B with respect to the frame A. ABR is

the corresponding rotation matrix of ABq.

III. LOCALIZATION ALGORITHM

A. Sensor Measurements and Sensor Data Pre-processing

We describe our localization system by firstly presentingsensor models and the sensor data pre-processing step. Also,in this paper, we assume that, the used sensors are perfectlysynchronized by hardware, which is true in our experiments.

1) Wheel Odometer: Similarly to [14], at time t, themeasurements for an intrinsically calibrated wheel odometersystem are given by:

uo(t) =

[vo(t)ωo(t)

]=

[eT1 · O(t)v + nv

eT3 · O(t)ω + nw

](1)

where ei is a 3× 1 vector, with the ith element to be 1 andother elements to be 0. O(t)v and O(t)ω are the local linearvelocity and rotational velocity expressed in the frame O attime t, and nv and nw are measurement noises.

2) Inertial Measurement Unit (IMU): On the other hand,the IMU measurements are expressed as:

ui(t) =

[ai(t)ωi(t)

](2)

with

ai(t) =I(t)G R

(GaI(t)− Gg

)+ ba + na (3)

ωi(t) = I(t)ω + bg + ng (4)

where GaI is the IMU’s linear acceleration with respect toframe G, Gg is the known gravity vector expressed alsoin G. bg and ba are gyroscope and accelerometer biases,and ng and na are the corresponding measurement noises,respectively.

3) Data Interpolation: A couple of operations in theproposed system require performing pose (i.e., position andorientation) integration between images (e.g., the operationin Sec. III-A.4). However, due to the nature of multi-sensor system, we might not have IMU and wheel odometermeasurements at exactly the same time when images arecaptured. To this end, we propose to compute extra ‘virtual’IMU and wheel odometer measurements by interpolating theclosest measurements before and after the image’s times-tamp. Since IMU and wheel odometer measurements are atrelatively high frequencies (e.g., 200Hz and 30Hz in ourcase) and ground robots in human-made environments aretypical under smoothed motion, we choose to apply linearinterpolation.

4) Image Processing: Once a new image is received,we proceed to perform pose integration to compute thecorresponding predicted pose by wheel odometer measure-ments (see Sec. III-B). Once enough translational or ro-tational displacement is detected by pose prediction (e.g.,20 centimeters and 3 degrees in our tests), the new imagewill be processed, otherwise it will be dropped. For featureprocessing, FREAK [19] feature is computed due to itsefficiency on low-cost processors, which is followed byfeature matching and RANSAC outlier rejection steps. Notethat, a couple of IMU-camera localization algorithms relyon IMU-based pose integration for detecting displacementto decide keyframes [9]. However, if a device is under slowmotion or ‘stop-and-go-style’ motion for a long time, thelong-time IMU integration becomes inaccurate. Due to theavailability of wheel odometer, we simply decide keyframesby wheel odometry based pose integration, which is moretemporally robust.

B. State Vector and Iterative Optimization

To illustrate the localization algorithm, we first introducethe state vector. At timestamp tk, the state vector is 2:

xk =[Ok

T sTk dTk

]T, (5)

where

dk =[bT

gkbT

ak

GvTI

GpTOi

GOi

qT]T

(6)

and Ok is the sliding-window pose at timestamp k:

Ok =[xT

Ok–N+1· · · xT

Ok–1

]T,xOi

=[GpT

Oi

GOi

qT]T(7)

for i = k − N + 1, · · · , k − 1. sk represents the motionmanifold parameters at the timestamp k, which will beexplained in details in Sec. III-C.

When a new image at tk+1 is recorded, pose integration isperformed to compute dk+1 (see Sec. III-D). Subsequently,

2For simpler representation, we ignore sensor extrinsic parameters inour presentation in this section. However, those parameters are explicitlymodeled in our formulation and used in experiments.

we employ the following cost function to refine our state:

Ck+1(xk,dk+1, fk+1) = 2ηTk (xk − xk) + ||xk − xk||Σk

+∑

i,j∈Si,j

γi,j +

k+1∑i=k−N+1

ψi + βi(dk,dk+1) (8)

where ηk and Σk are estimated prior information vectorand matrix from the previous timestamp respectively, xk

is the estimate of xk, ||a||Σ is computed by aTΣa, fk+1

is the set of visual landmarks involved in the optimizationprocess, Si,j represents the set of pairs between keyframesand observed features, and γi,j is the computed camerareprojection residual vector. Additionally, ψi is the residualcorresponding to the motion manifold, and βi to the poseprediction cost by IMU and odometer measurements betweentime tk and tk+1. Specifically, the camera cost function is:

γi,j =∑

i,j∈Si,j

∥∥zij − h(xOi, fj)∥∥

RC(9)

where zij represents camera measurement corresponding tothe pose i and visual landmark fj , RC is the measurementinformation matrix, and the function h(·) is the model of acalibrated perspective camera [13].

To model h(·), fj is required. Recent state-of-the-artlocalization algorithms can choose to estimate fj online [10]or not model it in the state vector [9], or a hybrid methodbetween them [12]. In this work, we choose to not modelfj in our state vector (see Eq. 5), similar to [9]. To achievethat, during optimization process, we first compute features’positions via Gauss-Newton based multi-view triangulationmethod, which are used as initial values in minimizingCk+1. We note that, to avoid using the same informationmultiple times, at each timestamp, we choose un-processedfeatures only [14]. With computed positions of features, allprerequisites of solving Ck+1 are prepared.

At each timestamp, we use iterative nonlinear optimizationto solve Ck+1, for obtaining posterior estimates x?

k+1 andf?k+1. To keep the computational cost constrained, after theoptimization process, we marginalize oldest pose, biases andvelocity term in dk, and features fk+1. We also marginalizethe IMU and wheel odometer measurements between posesat tk and tk+1. This process computes new prior terms ηk+1

and Σk+1. We here denote the gradient and Hessian matrixof Ck with respect to [xT

k ,dTk , f

Tk+1]T as ψ and Ω:

ψ =

[ψr

ψm

],Ω =

[Ωrr ΩT

mr

Ωmr Ωmm

](10)

where ψm and Ωmm are corresponding to the terms thatare going to be marginalized, e.g., oldest pose and features,ψr and Ωrr to the terms to be kept, and Ωmr is the crossterm. Similarly to other marginalization methods [10], wecompute:

ηk+1 = ψr −ΩTmrΩ

−1mmψm (11)

Σk+1 = Ωrr −ΩTmrΩ

−1mmΩmr (12)

The remaining question is how to represent motion manifoldand perform pose prediction, which will be discussed indetails in the next section.

C. Manifold Representation

We focus on ground robot navigating in outdoor environ-ments, especially in large-scale human-made scenes (e.g.,company and university campuses, streets, residential areas,etc.). In those scenarios, the terrains are typically smoothand without sharp slope changes. Based on that, we chooseto approximate the motion manifold at any 3D location x byquadratic polynomial:

M(x) = z + c+ BT

[xy

]+

1

2

[xy

]TA

[xy

],x =

xyz

(13)

with

B =

[b1b2

], A =

[a1 a2a2 a3

]. (14)

The manifold parameters are:

s =[c b1 b2 a1 a2 a3

]T. (15)

For any local x′ that is close to x, the manifold equation canbe expressed as:

M(x′) = 0, if ||x′ − x|| < ε (16)

where ε is a distance constraint. Since we use a limitednumber of polynomial parameters to represent the motionmanifold, ε is necessary by limiting the representation in alocal region. As mentioned in Sec. III-A.4, in our localizationformulation, we choose keyframes of the state vector basedon geometrical pose displacement. Therefore, by defining alocal motion manifold at GpOk

, for any keyframe index i,the following equation holds:

M(GpOk) = 0, if ||GpOk

− GpOi|| < ε (17)

To use motion manifold as optimization cost functions inEq. 8, we express Eq. 13 and 17 as stochastic constraints.Specifically, we define the residuals of ψi for any validkeyframe (see Eq. 8) as ψi = [fpi

, fqi ]T , with

fpi=

1

σ2p

(M(GpOi

))2(18)

and σ2p is the corresponding noise variance. Additionally, we

apply an orientation cost as:

fq =

∥∥∥∥∥b(GOiR · e3

)×c12 ·

∂ M∂ p

∣∣∣∣p=GpOi

∥∥∥∥∥Rq

(19)

where ba×c12 are the first two rows of the cross-productmatrix of a, Rq is the measurement noise information matrix,and

∂ M∂ p

∣∣∣∣p=GpOi

=

B + A

[GxOiGyOi

]1

(20)

The physical interpretation of Eq. 19 is that, the motionmanifoldM has explicitly defined roll and pitch of a groundrobot which should be consistent with G

OiR.

It is important to point out that, in complicated outdoorenvironments, the ‘true’ local motion manifold will be chang-ing over time. However, for most human-made environments,e.g., university campuses, streets, residential areas, and etc.,the terrain is typically smooth and changing slowly. To thisend, we define the manifold parameter s as:

sk+1 = sk + nsk , eTi nsk ∼ N (0, σ2

sk) with i = 1, · · · , 6

(21)

where N (0, σ2i,sk

) represents Gaussian distribution withmean 0 and variance σ2

sk. Specifically, σsk is defined by:

σsk =αp||GpOk+1− GpOk

||+αq||GOkq−1 ⊗ G

Ok+1q|| (22)

where ⊗ represents quaternion multiplication, αp and αq arecontrol parameters.

It is important to point out that the work similarly toours on parametric manifold approximation is [20]. Howeverthe lack of a sequential statistical estimator makes [20] notsuitable for high-precision localization. We also note thatin Eq. 13 we choose to fix the coefficient of z to be 1,which means that our manifold representation is not generic.However, this perfectly fits our applications, since mostground robots can not climb vertical walls.

D. Pose Prediction

To present the details of odometry based pose prediction,we assume there are κ wheel odometer measurements be-tween timestamps tk and tk+1. The timestamps of thosemeasurements are denoted by tk(1), · · · , tk(κ), with tk(1) =tk and tk(κ) = tk+1. Starting from tk(1), the pose predictioncan be computed by integrating:

GOR(t) = G

OR(t)⌊

O(t)ω(t)⌋

(23)GpO(t) = GvO(t) = G

OR(t)O(t)v(t) (24)

where bac represents the skew-symmetric matrix of vectora. However, the nature of wheel odometer measurementmakes it difficult for integrating Eq. 23 and 24, since it onlyprovides estimates of the first element in O(t)v(t) and thethird element in O(t)ω(t), shown in Eq. 1. If pose integrationis performed with odometer measurement only (let othervalues in O(t)v(t) and O(t)ω(t) to be zero), it is equivalentto say the pose integration is calculated on a planar surface,which is the tangent plane of the manifold at the integrationstarting point. From tk(1) to tk(2), if the odometry-onlyintegrated poses are denoted by G

OR(tk(2)) and GpO(tk(2)),we can write:

GOR(tk(2)) = G

OR(tk(2)) · δR (25)GpO(tk(2)) = GpO(tk(2)) + G

OR(tk(1)) · δp (26)

where δR and δp are correction terms from the tangentplane of manifold to the manifold itself. Since the tangentplane can be considered as the manifold’s first-order ap-proximation, δR and δp should be small. Thus, we here

adopt the odometry-only integration described above, andmodel δR and δp as Gaussian noises. We also note that, thisapproximation is similar to [21], in which unknown externalforce is modeled as Gaussian noises for flying drones.

It is also important to point out that our method ofapproximating odometry based pose integration is not theonly solution. However, based on our high-FPS odometermeasurements and testing environments, we experimentallyfind that the proposed one already achieves high accuracy andoutperforms competing state-of-the-art methods. Exploringand comparing the performance of different pose integrationmethods will be our focus in future work.

On the other hand, pose prediction using IMU is per-formed similar to other work on camera-IMU localiza-tion [9] [10] 3, and thus we omit the details. In this paper,based on pose integration of wheel odometer and IMU,the cost β in localization optimization function Eq. 8 canbe computed, consisting of both wheel odometer term andIMU term. We also point out pose prediction for keyframeselection (see Sec. III-A.4) is performed by wheel odometerbased integration. This is due to the fact that errors of wheelodometer integration will not grow as a function of time(especially important for stop-and-go motion), leading tobounded errors of prior estimates for keyframes.

E. Localization with Pre-built Map

The method described in Sec. III-B defines an open-looplocalization algorithm, in which long-term drift is inevitable.To cope with this problem, online simultaneous localizationand mapping (SLAM) [8] or localization with pre-builtmap [11] [18] can be performed to generate pose estimateswith bounded localization errors. In this paper, we focuson the latter approach. Note that, to enable localizationwith pre-built map, any type of visual map can be used,if the following conditions can be satisfied. Firstly and mostimportantly, the generated map must be accurate. Secondly,when building the localization map, the used feature detectorand descriptor should match that is used for our localizationalgorithm.

Our framework of persistent re-localization is similar tothat of [11] [18], and we here briefly go over the steps. Tobe able to use a pre-built map, the position and orientationbetween the map frame and global frame, i.e., GpM andGqM, need to be modeled into state vector (Eq. 5) and beestimated online. In our persistent re-localization framework,all operations of normal open-loop localization remain thesame, with one additional operation added: associate featuresdetected from incoming image to visual 3D landmarks in themap and use that for formulating extra cost functions. This isachieved by combination of appearance based loop closurequery and geometrical consistent verification [11].

3Minor difference exists due to our pose parameterization on frame Oinstead of I. As a result, extrinsic parameters between O and I will beused during the integration.

TABLE I: Localization errors of different motion man-ifold representation methods: using 0th-order, 1st-order,2nd-order (proposed) representation, and the one withoutmodeling the manifold.

2nd-order 1st-order 0th-order w/o manifold

Dataset 01pos. err. (m) 0.4993 3.7528 3.2917 2.1186rot. err. (deg.) 2.78423 3.8094 4.22689 2.95411

Dataset 02pos. err. (m) 0.7330 2.6927 2.3419 2.3178rot. err. (deg.) 2.59439 4.32612 4.44717 3.51778

TABLE II: Position and rotation (yaw only) errors ofthe proposed approach, compared to other state-of-the-artmethods VINS-Wheel[14], Hybrid-MSCKF[12] and VINS-Mono[10], on different indoor and outdoor datasets.

Proposed [14] [12] [10]

INDOOR 01position err. (m) 1.229 1.070 2.849 5.236rotation err. (deg.) 1.032 1.164 1.340 2.437




OUTDOOR 01position err. (m) 1.141 2.988 2.829 2.44rotation err. (deg.) 3.225 5.564 6.101 2.032


OUTDOOR 03position err. (m) 3.119 3.496 — —rotation err. (deg.) 4.229 4.513 — —




a —: Fails due to severe drift.

IV. EXPERIMENTS

To the best of the authors’ knowledge, there does notexist any proper public dataset with camera, IMU, and wheelodometer measurements for large scale ground robots4.Therefore, we collected more than 60 datasets from July2018 to Feb. 2019, for our experiments. During the datacollection, the ground robots were operated in both indoorand outdoor environments, under various types of motion,and under different light and weather conditions. In ourexperiments, the images were recorded at 10Hz with 640×400-pixels resolution, IMU measurements were at 200Hz,

4This statement is at the time of the paper submission of this work.

01 02 03 04 05 06 07 08 09

10 11 12 13 14 15 16 17 18

19 20 21 22 23 24 25 26 27

28 29 30 31 32 33 34 35 36

37 38 39 40 41 42 43 44 45

46 47 48 49 50 51 52 53 54

Fig. 3: Sample images for long-term deployment scenarios. Hall (1st row), office (2nd row), hotel (3rd row) are all indoorscenarios, while other three rows are different weather conditions for outdoors from sunny (4th row), rainy (5th row) tosnowy (6th row).

and wheel odometer measurements were at 30Hz. Based onthe motion during our data collection process, the medianfrequency of placed keyframes of the proposed localizationalgorithm was at about 3.5Hz.

A. Evaluation of Manifold Representation

The first experiment is to evaluate the necessity andperformance of manifold representation. Specifically, weimplemented the proposed localization algorithm in fourdifferent modes: 1) removing manifold constraint in Eq. 8,2) using zero-order approximation for motion manifold (onlyusing c in Eq. 15), 3) using first-order approximation (usingc, b1, and b2 in Eq. 15), and 4) using the proposed quadraticrepresentation.

Table I shows localization errors for four different modes,evaluated by two different outdoor datasets in which groundrobot started and ended at the same location. Due to the lackof full-trajectory ground truth, we computed the localizationerrors by evaluating the final drifts. A couple of conclusionscan be made from the results. First, by using ‘poor’ pa-rameterization for motion manifold, the localization accuracycan even be reduced. In fact, when approximated by zero-thorder or first order polynomials, using motion manifold forlocalization will be worse than not using it. This is due to thefact that outdoor environment is typically complicated andsimple representation is not enough. However, when propermodeling (i.e. the proposed method) is used, better local-ization precision can be achieved. This experiment validatesthe most important assumption in our paper: with properlydesigned motion manifold representation the localizationaccuracy can be improved. In fact, for the two datasetsinvolved, the improvement of position precision is at least69%, which is significant.

B. Overall Localization Performance

The next experiment is to evaluate the overall localizationperformance of the proposed method, compared to a coupleof the state-of-the-art algorithms, i.e., VINS-Wheel [14],Hybrid-MSCKF [12] and VINS-Mono [10]. Ten experimentswere conducted, with four in indoor 2D environments andsix in outdoor 3D environments. Similarly to the previousexperiment, the final errors were computed as the metric fordifferent methods, on open-loop estimation setup.

Table II shows the localization errors for all methods inall testing datasets. In indoor cases, the proposed method isable to obtain best overall performance, but the differencebetween the proposed method and VINS-Wheel is not large.This is due to the fact that, in indoor environments, theplanar surface assumption of VINS-Wheel is true and thishelps with the localization accuracy. However, in outdoordatasets, the proposed method outperforms all competingmethods by a wide margin. In fact, except for the thirddataset, the position error of the proposed algorithm is at least62% better than VINS-Wheel, 60% than Hybrid-MSCKF,and 53% than VINS-Mono. We also note that in the thirddataset a lot of rotation-only time periods were involved,which was known to cause degenerate cases for camera-IMUlocalization. In fact, both Hybrid-MSCKF and VINS-Monodiverges in this case, however, the proposed method performsrelatively well. This experiment shows that when used invision-based localization for ground robots, the proposedalgorithm performs significantly better than alternative state-of-the-art methods, in terms of both accuracy and robustness.

C. Re-localization Tests

The proposed re-localization method is also tested withdifferent ground robots platforms in 60 different environ-mental conditions. The involved testing environments contain

0.0

0.1

0.2

0.3

0.4

erro

r[m

]

50 100 150 200 250 300

t [s]

0.00

0.25

0.50

0.75

1.00

1.25

1.50

erro

r[de

gree

]w/o manifold w/o online calibration proposed

Fig. 4: Translation error and rotation error of the ‘full’proposed approach, the one without manifold representation,and the one without online sensor extrinsic calibration.

ones that are highly challenging, e.g., in front of fully-glass-covered building, with low light conditions, in low texturearea, and so on (see Fig. 3). Note that, we do not aim toseek for difficult cases on purpose, those are cases commonfor real robotic applications.

In all experiments, the proposed algorithm always workedwell, i.e., globally re-localized in the end of the dataset andjumps in pose estimates did not show up during the entiretrajectory. We note that this is achieved by both accuratelocal pose estimation and global loop closure estimation. Infact, in some challenging datasets, it is possible that therewere 30 meters’ trajectory or 40 seconds’ time without goodloop closure query (also partially due to the FREAK featurewe use [19]). The accurate local pose estimate ensureslimited local drift which makes it possible for later globalre-localization correction without ‘jumps’.

Additionally, based on a representative re-localizationdataset, we also evaluated the effectiveness of the proposedmanifold-based localization approach, by using or not usingit during the test. The result is shown in Fig. 4, which clearlydemonstrates that with the proposed manifold representa-tion the re-localization errors can also be largely reduced.Although it is not the focus of this paper, it is mentionedin III-B that the proposed system is implemented with onlinesensor extrinsic calibration. Fig. 4 also shows that if this isnot implemented, the errors will be increased.

V. CONCLUSIONS

In this paper, we propose a vision-aided localization algo-rithm dedicatedly designed for ground robots. Specifically,a novel method is proposed to approximate the motionmanifold and use that to derive localization cost function.Additionally, an optimization-based sliding window estima-tor is designed to fuse camera, IMU, and wheel odometer

measurements together, to perform precise localization tasks.By extensive experimental results, we show that the proposedmethod outperforms competing state-of-the-art localizationalgorithms by a significant margin.

REFERENCES

[1] Teddy Yap, Mingyang Li, Anastasios I Mourikis, and Christian RShelton. A particle filter for monocular vision-aided odometry. InIEEE International Conference on Robotics and Automation, 2011.

[2] Ji Zhang and Sanjiv Singh. Low-drift and real-time lidar odometryand mapping. Autonomous Robots, 41(2):401–416, 2017.

[3] Jesse Levinson, Michael Montemerlo, and Sebastian Thrun. Map-based precision vehicle localization in urban environments. InRobotics: Science and Systems, volume 4, page 1, 2007.

[4] Ryan W Wolcott and Ryan M Eustice. Visual localization withinlidar maps for automated urban driving. In IEEE/RSJ InternationalConference on Intelligent Robots and Systems, pages 176–183, 2014.

[5] Mingming Zhang, Yiming Chen, and Mingyang Li. SDF-loc: Signeddistance field based 2D relocalization and map update in dynamicenvironments. In American Control Conference, 2019.

[6] Guowei Wan, Xiaolong Yang, Renlan Cai, Hao Li, Yao Zhou, HaoWang, and Shiyu Song. Robust and precise vehicle localization basedon multi-sensor fusion in diverse city scenes. In IEEE InternationalConference on Robotics and Automation (ICRA), 2018.

[7] Jesse Levinson, Jake Askeland, Jan Becker, Jennifer Dolson, DavidHeld, Soeren Kammel, J Zico Kolter, Dirk Langer, Oliver Pink,Vaughan Pratt, et al. Towards fully autonomous driving: Systems andalgorithms. In Intelligent Vehicles Symposium (IV), 2011.

[8] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos.ORB-SLAM: a versatile and accurate monocular slam system. IEEETransactions on Robotics, 31(5):1147–1163, 2015.

[9] Mingyang Li and Anastasios I Mourikis. High-precision, consistentekf-based visual-inertial odometry. The International Journal ofRobotics Research, 32(6):690–711, 2013.

[10] Tong Qin, Peiliang Li, and Shaojie Shen. VINS-Mono: A robust andversatile monocular visual-inertial state estimator. IEEE Transactionson Robotics, 34(4):1004–1020, 2018.

[11] Simon Lynen, Torsten Sattler, Michael Bosse, Joel A Hesch, MarcPollefeys, and Roland Siegwart. Get out of my lab: Large-scale, real-time visual-inertial localization. In Robotics: Science and Systems,2015.

[12] Mingyang Li and Anastasios I Mourikis. Optimization-based estimatordesign for vision-aided inertial navigation. In Robotics: Science andSystems, pages 241–248, 2013.

[13] Mingyang Li and Anastasios I Mourikis. Online temporal calibrationfor camera–imu systems: Theory and algorithms. The InternationalJournal of Robotics Research, 33(7):947–964, 2014.

[14] Kejian J Wu, Chao X Guo, Georgios Georgiou, and Stergios IRoumeliotis. Vins on wheels. In IEEE International Conference onRobotics and Automation (ICRA), pages 5155–5162, 2017.

[15] Dimitrios G Kottas, Kejian J Wu, and Stergios I Roumeliotis. De-tecting and dealing with hovering maneuvers in vision-aided inertialnavigation systems. In IEEE/RSJ International Conference on Intelli-gent Robots and Systems, pages 3172–3179. IEEE, 2013.

[16] Andrea Censi, Antonio Franchi, Luca Marchionni, and GiuseppeOriolo. Simultaneous calibration of odometry and sensor parametersfor mobile robots. IEEE Transactions on Robotics, 29(2), 2013.

[17] Meixiang Quan, Songhao Piao, Minglang Tan, and Shi-Sheng Huang.Tightly-coupled monocular visual-odometric slam using wheels and amems gyroscope. arXiv preprint arXiv:1804.04854, 2018.

[18] Thomas Schneider, Marcin Dymczyk, Marius Fehr, Kevin Egger, Si-mon Lynen, Igor Gilitschenski, and Roland Siegwart. maplab: An openframework for research in visual-inertial mapping and localization.IEEE Robotics and Automation Letters, 3(3):1418–1425, 2018.

[19] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. Freak:Fast retina keypoint. In 2012 IEEE Conference on Computer Visionand Pattern Recognition, pages 510–517, 2012.

[20] Bhoram Lee, Kostas Daniilidis, and Daniel D Lee. Online self-supervised monocular visual odometry for ground vehicles. In IEEEInternational Conference on Robotics and Automation (ICRA), pages5232–5238, Seattle, WA, May 2015.

[21] Barza Nisar, Philipp Foehn, Davide Falanga, and Davide Scaramuzza.VIMO: Simultaneous visual inertial model-based odometry and forceestimation. IEEE Robotics and Automation Letters, 2019.

Documents

Vision-Aided Localization For Ground Robots · Top left: An image used for localization with red dots representing the extracted visual features. Top right and bottom: Different views