
2013 IEEE International Conference on Robotics and Automation (ICRA) Karlsruhe, Germany, May 6-10, 2013

Gaussian Process Gauss-Newton for 3D Laser-Based Visual Odometry

Chi Hay Tong and Timothy D. Barfoot

Abstract—In this paper, we present a method for obtaining Visual Odometry (VO) estimates using a scanning laser rangefinder. Though common VO implementations utilize stereo camera imagery, cameras are dependent on ambient light. In contrast, actively-illuminated sensors such as laser rangefinders work in a variety of lighting conditions, including full darkness. We leverage previous successes by applying sparse appearance-based methods to laser intensity images, and address the issue of motion distortion by considering the estimation problem in continuous time. This is facilitated by Gaussian Process Gauss-Newton (GPGN), an algorithm for non-parametric, continuous-time, nonlinear, batch state estimation. We include a concise derivation of GPGN, along with details on the extension to three-dimensions (3D). Validation of the 3D laser-based VO framework is provided using 1.1km of experimental data, which was gathered by a field robot equipped with a two-axis scanning lidar.

I. INTRODUCTION

The ability to compute motion estimates from onboard sensing is a core competency for mobile robots, since it is a fundamental building block for many autonomy algorithms. For operation in three-dimensional (3D), unstructured terrain, stereo camera-based visual odometry (VO) has emerged as the dominant form of motion estimation, largely due to the success of sparse, appearance-based techniques. These methods offer robustness and accuracy, at reasonable computational costs. Stereo camera-based VO has been employed on the Mars Exploration Rovers (MERs) [1], and has demonstrated high accuracies over long distances [2], [3].

Sparse appearance-based VO methods identify and track distinctive points in the camera imagery. Commonly-used interest points include Harris corners [4], Scale Invariant Feature Transforms (SIFT) [5], and Speeded Up Robust Features (SURF) [6]. These features are then input into a state estimation algorithm, which determines the motion estimate that best explains the feature tracks.

Unfortunately, these methods are sensitive to ambient lighting conditions due to the passive nature of the camera sensor. This can pose serious issues for exploration of poorly-lit areas on a planetary surface, or long-term navigation on Earth over the course of a day. Alternatively, the dependence on external lighting can be avoided by using an actively illuminated sensor, such as a laser rangefinder. Recent work [7], [8], [9] has shown that images constructed from the laser intensity data resemble greyscale camera imagery, and can be utilized in the same manner in sparse appearance-based VO algorithms.

The authors are with the Autonomous Space Robotics Lab at the University of Toronto Institute for Aerospace Studies, Toronto, Ontario, M3H 5T6, Canada. Email: {chihay.tong, tim.barfoot}@utoronto.ca

Fig. 1. The ROC6 field robot utilized in our hardware experiments. The relevant payloads are a high-framerate Autonosys lidar, and a DGPS antenna for ground truth data.

However, a key distinction between camera and laser sensors is the method of data acquisition. A camera sensor acquires an entire image at a single time instant, while a scanning laser rangefinder measures points individually. As a result, intensity images constructed while the rover is in motion exhibit distortion effects. Though a stop-scan-go approach may be employed to avoid the motion distortion issues [7], [10], this mode of operation severely reduces the capabilities of the autonomous platform. Inertial measurement units (IMUs) can be used for motion correction, but this solution is undesirable due to the reliance on an additional sensor. Furthermore, calibration between multiple sensor types is very challenging [11]. If a motion estimate can be obtained using the laser data alone, the algorithm can benefit when IMU data is available, and still operate when it is not.

We address the issue of motion distortion by considering the timestamps of the interest points detected in each image. Since each feature is detected at a different time instant, we formulate the state estimation problem in continuous time. This is facilitated by Gaussian Process Gauss-Newton (GPGN) [12], an algorithm for non-parametric, continuous-time, nonlinear, batch state estimation. GPGN models the rover poses as a Gaussian Process (GP), and utilizes GP interpolation to obtain the poses at the measurement times.

In summary, we consider the 3D laser-based VO problem in this paper. An appearance-based approach is taken, where intensity images are constructed, and sparse features are tracked for motion estimation. The effects of motion distortion are handled by considering the estimation problem in continuous time using GPGN, and experimental validation is presented using 1.1km of data gathered by the field robot depicted in Figure 1.



This paper presents a practical application of the theoretical contributions introduced in [12] and [13]. We also demonstrate how to extend GPGN to the 3D domain, and utilize it in a multi-frame sliding window formulation.

The remainder of this paper is organized as follows. We begin with a review of related work in Section II. This is followed by how the feature measurements are obtained from intensity imagery in Section III, and an overview of the estimation algorithm in Section IV. The framework is validated through hardware experimentation in Section V, and concluding remarks are provided in Section VI.

II. LITERATURE REVIEW

The use of laser intensity data in mobile robotics is not a new concept. Intensity images have been used for indoor localization against a known map [14], identification of retroreflective markers in the scene [15], [16], and road localization for self-driving cars [17].

Previous work in laser motion estimation includes the continuously-spinning laser VO algorithm by Bosse and Zlot [18]. In this approach, the dense data were subdivided into discrete segments, and the point cloud was iteratively corrected as the motion estimate was improved. These concepts have been recently applied towards large-scale underground mapping [19], where interpolation was introduced to smooth the trajectory estimate, and an IMU was included to assist in motion compensation. A similar formulation is utilized in the Zebedee mobile mapping system [20].

Appearance-based methods for laser data have been demonstrated to provide comparable accuracy to stereo camera VO for stop-scan-go operation [7], and to be sufficient for Visual Teach and Repeat (VT&R) even without motion compensation [8]. Improved estimates were achieved by Dong and Barfoot [9], who considered the motion distortion issues by employing a novel pose interpolation scheme to obtain the intermediate poses between successive frames. Similarly, Ringaby and Forssén [21] addressed video rectification for rolling shutter camera imagery by employing the SLERP [22] quaternion interpolation method.

Further improvements can be achieved by considering the estimation problem in continuous time. Rather than simply performing interpolation between frames, a richer state representation can be chosen that better reflects the rover motion. Though continuous-time filtering methods predate much of mobile robotics [23], [24], it is only recently that continuous-time batch state estimation algorithms have appeared in the literature. These include a piecewise spline representation of the state [25], [26], and a GP in GPGN [13].

An alternative approach to laser-based VO is also presented in [27], which utilizes a piecewise spline representation for relative continuous-time SLAM. This work involves the same dataset utilized in this paper, and provides an interesting extension to the continuous-time state estimation formulation in the form of velocity estimation.

In summary, we address the issue of motion distortion by employing a significantly different state estimation method than previous applications in the literature. The primary contributions of this paper are the application of GPGN to a real-world 3D estimation problem, and the experimental validation of our laser-based VO framework.

III. FEATURE MEASUREMENTS

Since the experimental data utilized in this work comes from a preacquired dataset [28], only a brief overview of the measurement generation process is provided in this section. We begin with the formation of intensity images. These are constructed by simply placing the laser data into a 2D array. To account for the effect of distance on intensity, we perform a correction by multiplying by the range values, and scaling the data to produce 8-bit images [7].
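To make the correction concrete, the following is a minimal sketch of this image-formation step, assuming the raw intensity and range returns have already been arranged into a 2D scan grid; the array shapes and normalization constants are illustrative rather than taken from the authors' pipeline.

```python
import numpy as np

def range_corrected_intensity(intensity, rng):
    """Build an 8-bit intensity image from raw lidar returns.

    intensity, rng: 2D arrays (scan grid) of raw intensity and range values
    for one scan. Multiplying by range compensates for the falloff of return
    strength with distance; the result is then rescaled to the 8-bit range
    for use with standard feature detectors.
    """
    corrected = intensity * rng
    lo, hi = corrected.min(), corrected.max()
    scaled = (corrected - lo) / max(hi - lo, 1e-12)  # normalize to [0, 1]
    return (255.0 * scaled).astype(np.uint8)

# Example with synthetic data on a hypothetical 360x480 grid.
rng_img = np.random.uniform(1.0, 30.0, size=(360, 480))
intensity_img = np.random.uniform(0.0, 1.0, size=(360, 480)) / rng_img
image_8bit = range_corrected_intensity(intensity_img, rng_img)
```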

As can be seen in Figure 2(a), the result of this process resembles a greyscale camera image. Therefore, we are able to utilize sparse appearance-based feature detectors common in the computer vision literature. In our implementation, we use a GPU-accelerated implementation of SURF [6]. Some sample SURF features are depicted in Figure 2(b). Though the entire image is subject to motion distortion, each interest point is detected locally. As a result, the motion effects are minimized for each feature.

Fig. 2. Sample images constructed from laser intensity data utilized for VO: (a) range-corrected intensity image; (b) intensity image with detected SURF features. Since these resemble greyscale camera images, an appearance-based approach is taken where sparse SURF features are identified and tracked in consecutive images. The SURF detector identifies distinct blobs in the image, which are displayed in red for dark blobs on a light background, and blue for light blobs on a dark background.

Fig. 3. The inlier feature tracks obtained after applying RANSAC on candidate matches between consecutive frames: (a) feature tracks obtained using the standard rigid formulation; (b) feature tracks obtained using a motion-compensated formulation. Due to motion distortion, the standard rigid transformation employed in RANSAC incorrectly rejects many valid matches. This issue was overcome by introducing a simple constant velocity model into RANSAC. Development of this modification to RANSAC is ongoing.

In addition to the intensity image, azimuth, elevation, range, and time images are also formed, creating an image stack [7]. Since the features are detected at sub-pixel locations, the image stack allows for sub-pixel lookup of these quantities through bilinear interpolation.
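As an illustration of the sub-pixel lookup, a minimal sketch of bilinear interpolation into one channel of the image stack is given below; the channel layout and the synthetic time channel are assumptions made for the example, not details of the authors' implementation.

```python
import numpy as np

def bilinear_lookup(channel, u, v):
    """Sub-pixel lookup of one image-stack channel (e.g. azimuth, range,
    or timestamp) at a feature location (u, v) in pixel coordinates,
    where u is the column and v is the row."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    # Weighted average of the four surrounding pixels.
    return ((1 - du) * (1 - dv) * channel[v0,     u0] +
            du       * (1 - dv) * channel[v0,     u0 + 1] +
            (1 - du) * dv       * channel[v0 + 1, u0] +
            du       * dv       * channel[v0 + 1, u0 + 1])

# Example: recover the acquisition time of a keypoint detected at a
# sub-pixel location in a hypothetical 360x480 time channel.
time_channel = np.cumsum(np.full((360, 480), 1e-5)).reshape(360, 480)
t_feature = bilinear_lookup(time_channel, u=123.4, v=56.7)
```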

Similarly to camera sensors, the measurements produced by laser rangefinders also need to be calibrated. This was addressed by gathering a series of checkerboard images to perform characterization of the geometric distortions [29]. Correcting for these distortions results in data that fit an idealized spherical camera model.

Following the standard VO pipeline, the features detected in each image are then matched using their descriptors between consecutive frames. Outlier matches are then rejected using the Random Sample Consensus (RANSAC) [30] algorithm. The resulting measurements and matches are the inputs for estimation.

However, we found that many valid matches were incorrectly rejected by a standard RANSAC implementation. This can be attributed to the fact that though the features were detected locally, RANSAC employs a rigid transformation between the entire group of features to classify candidate matches. As a result, the uneven motion distortion present in the imagery and measurements resulted in many false positives. This was overcome by introducing a simple constant velocity model into the RANSAC transformation, which was iteratively refined to capture the overall motion between frames. The improvement in classification performance is shown in Figure 3, where we show the inlier matches produced by both approaches¹.
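The exact modification to RANSAC is still under development (see the footnote), so the following is only an illustrative sketch of the idea under a strong simplification: a translation-only constant-velocity hypothesis fit to per-feature timestamps, with the function name, sample size, and threshold chosen purely for the example.

```python
import numpy as np

def ransac_const_velocity(p_prev, p_curr, t_prev, t_curr,
                          n_iters=200, inlier_thresh=0.1):
    """Classify candidate feature matches with a constant-velocity model.

    p_prev, p_curr: (N, 3) matched 3D feature positions from consecutive
    frames; t_prev, t_curr: (N,) per-feature acquisition timestamps.
    A translation-only constant-velocity hypothesis v is fit from a random
    minimal sample and scored by how well p_curr ~ p_prev + v * dt holds.
    Returns a boolean inlier mask.
    """
    dt = t_curr - t_prev
    best_inliers = np.zeros(len(p_prev), dtype=bool)
    for _ in range(n_iters):
        idx = np.random.choice(len(p_prev), size=3, replace=False)
        # Least-squares velocity from the minimal sample.
        num = ((p_curr[idx] - p_prev[idx]) * dt[idx, None]).sum(axis=0)
        den = (dt[idx] ** 2).sum()
        v = num / den
        residuals = np.linalg.norm(p_curr - (p_prev + v * dt[:, None]), axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers
```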

IV. STATE ESTIMATION

In this section, we address how to obtain motion estimates using the feature measurements. We approach this task as a batch estimation problem, where a set of N measurements are obtained over a period of time, and we seek to estimate the robot trajectory during the time window. A sliding window approach is employed to limit the computational requirements, while still providing a smooth estimate.

¹Further development of this modification to RANSAC is currently being conducted by other members of the Autonomous Space Robotics Lab at the University of Toronto, and will be published at a later date.

For this problem formulation, we consider a state composed of both a time-varying quantity, x(t), and a time-invariant component, ℓ. In the case of VO, the time-varying quantity represents one or more robot poses, and the time-invariant component represents feature locations. In addition, each feature measurement has an associated timestamp, which must be incorporated appropriately in the estimator. This continuous-time state estimation problem is addressed using GPGN [13].

In GPGN, we model the time-varying component as a GP, and incorporate the feature measurement times through appropriate kernel function evaluations. Since there are a large number of feature measurements, we maintain computational tractability by identifying key poses for estimation, and employing GP interpolation to obtain intermediate poses at the measurement times. In our implementation, the key poses are defined to be at the midpoint acquisition time of each intensity image. For clarity, we provide an illustration summarizing our problem formulation in Figure 4. This approach differs from prior interpolation methods [9], [21] because multiple poses are considered in GP interpolation, instead of simply the adjacent ones. This allows for a richer state representation with fewer key poses.

In the remainder of this section, we define the system models, and provide a concise derivation of GPGN. This is followed by details of our sliding window formulation and the extension of GPGN to the 3D domain.

Fig. 4. An illustration of the multi-frame state estimation formulation. The state consists of an underlying time-varying state anchored by rover poses at key instants of time, x(t_k), and the feature locations, ℓ_j. GP interpolation is utilized to obtain intermediate poses at the measurement times, and a sliding window approach is employed to maintain computational tractability.


A. System Models

The state space models are defined to be

$$\mathbf{x}(t) \sim \mathcal{GP}\big(\boldsymbol{\mu}(t), \mathbf{K}(t, t')\big), \qquad (1)$$
$$\boldsymbol{\ell} \sim \mathcal{N}(\mathbf{d}, \mathbf{L}), \qquad (2)$$
$$\mathbf{z}_i = \mathbf{h}_i\big(\mathbf{x}(t_i), \boldsymbol{\ell}\big) + \mathbf{n}_i, \quad \mathbf{n}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{R}_i), \qquad (3)$$

where x(t) is a GP with mean and covariance functions μ(t) and K(t, t'), respectively, ℓ is a time-invariant discrete Gaussian random variable with prior mean d and covariance L, and the measurements, z_i, are obtained through a conventional nonlinear, non-invertible measurement model, h_i(·), at N discrete times, t_i. For simplicity, we have modelled the measurement noise, n_i, as additive, zero-mean, and Gaussian with covariance R_i.

To simplify the expressions in the following sections, we combine the two state components by defining
$$\boldsymbol{\theta}(t) := \begin{bmatrix} \mathbf{x}(t) \\ \boldsymbol{\ell} \end{bmatrix}, \quad \boldsymbol{\eta}(t) := \begin{bmatrix} \boldsymbol{\mu}(t) \\ \mathbf{d} \end{bmatrix}, \quad \boldsymbol{\mathcal{P}}(t, t') := \begin{bmatrix} \mathbf{K}(t, t') & \mathbf{0} \\ \mathbf{0} & \mathbf{L} \end{bmatrix}, \qquad (4)$$
which results in
$$\boldsymbol{\theta}(t) \sim \mathcal{GP}\big(\boldsymbol{\eta}(t), \boldsymbol{\mathcal{P}}(t, t')\big), \qquad (5)$$
$$\mathbf{z}_i = \mathbf{h}_i\big(\boldsymbol{\theta}(t_i)\big) + \mathbf{n}_i. \qquad (6)$$

B. Function-Space GPGN Derivation

In this section, we provide a brief derivation of the GPGN algorithm. While two different derivations are possible [13], reflecting both the weight-space and function-space approaches from the GP regression literature [31], we provide only the concise function-space derivation in this paper.

Inspired by the Gauss-Newton approach [32], we begin by assuming the state is approximated by the value of the current estimate, θ̄(t), and an additive perturbation, δθ(t). This assumption is used to linearize our system models, which produces
$$\delta\boldsymbol{\theta}(t) \sim \mathcal{GP}\big(\boldsymbol{\eta}(t) - \bar{\boldsymbol{\theta}}(t), \boldsymbol{\mathcal{P}}(t, t')\big), \qquad (7)$$
$$\mathbf{z}_i \approx \mathbf{h}_i\big(\bar{\boldsymbol{\theta}}(t_i)\big) + \mathbf{H}_i\,\delta\boldsymbol{\theta}(t_i) + \mathbf{n}_i, \quad \mathbf{H}_i := \left.\frac{\partial \mathbf{h}_i}{\partial \boldsymbol{\theta}}\right|_{\bar{\boldsymbol{\theta}}(t_i)}. \qquad (8)$$
At each iteration, we seek the optimal value of the perturbation at certain times of interest, δθ*(t), which we apply to bring our estimate progressively closer to the optimal value of the state. It should be noted that we require a value for the state at each measurement time to perform linearization. As a result, these values are either updated and stored at each iteration, or obtained through GP interpolation.

In the function-space approach, we determine the optimal state perturbations by first expressing the states and measurements as a jointly Gaussian distribution, and then conditioning the measurements onto the state. The expression for the joint distribution is obtained by utilizing the linearized models to obtain Gaussian approximations. That is, the mean of the measurement function is

$$E[\mathbf{z}_i] = E\big[\mathbf{h}_i(\bar{\boldsymbol{\theta}}(t_i)) + \mathbf{H}_i\,\delta\boldsymbol{\theta}(t_i) + \mathbf{n}_i\big] = \mathbf{h}_i(\bar{\boldsymbol{\theta}}(t_i)) + \mathbf{H}_i\big(\boldsymbol{\eta}(t_i) - \bar{\boldsymbol{\theta}}(t_i)\big), \qquad (9)$$

and the covariance is

$$\begin{aligned} E\big[(\mathbf{z}_i - E[\mathbf{z}_i])(\mathbf{z}_i - E[\mathbf{z}_i])^T\big] &= E\big[\big(\mathbf{n}_i - \mathbf{H}_i(\boldsymbol{\eta}(t_i) - \boldsymbol{\theta}(t_i))\big)\big(\mathbf{n}_i - \mathbf{H}_i(\boldsymbol{\eta}(t_i) - \boldsymbol{\theta}(t_i))\big)^T\big] \\ &= E\big[\mathbf{n}_i\mathbf{n}_i^T\big] + \mathbf{H}_i\,E\big[(\boldsymbol{\eta}(t_i) - \boldsymbol{\theta}(t_i))(\boldsymbol{\eta}(t_i) - \boldsymbol{\theta}(t_i))^T\big]\mathbf{H}_i^T \\ &= \mathbf{R}_i + \mathbf{H}_i\,\boldsymbol{\mathcal{P}}(t_i, t_i)\,\mathbf{H}_i^T. \end{aligned} \qquad (10)$$

Next, we derive the cross-covariance between the state and the measurements, which can be expressed as

$$\begin{aligned} E\big[(\boldsymbol{\theta}(t) - E[\boldsymbol{\theta}(t)])(\mathbf{z}_i - E[\mathbf{z}_i])^T\big] &= E\big[\big(\bar{\boldsymbol{\theta}}(t) + \delta\boldsymbol{\theta}(t) - \boldsymbol{\eta}(t)\big)\big(\mathbf{n}_i - \mathbf{H}_i(\boldsymbol{\eta}(t_i) - \boldsymbol{\theta}(t_i))\big)^T\big] \\ &= E\big[(\boldsymbol{\theta}(t) - \boldsymbol{\eta}(t))(\boldsymbol{\theta}(t_i) - \boldsymbol{\eta}(t_i))^T\big]\mathbf{H}_i^T \\ &= \boldsymbol{\mathcal{P}}(t, t_i)\,\mathbf{H}_i^T. \end{aligned} \qquad (11)$$

Finally, we simplify our expressions by defining
$$\mathbf{z} := \begin{bmatrix} \mathbf{z}_1^T & \cdots & \mathbf{z}_N^T \end{bmatrix}^T, \quad \mathbf{h} := \begin{bmatrix} \mathbf{h}_1(\bar{\boldsymbol{\theta}}(t_1))^T & \cdots & \mathbf{h}_N(\bar{\boldsymbol{\theta}}(t_N))^T \end{bmatrix}^T, \qquad (12a,b)$$
$$\boldsymbol{\eta} := \begin{bmatrix} \boldsymbol{\eta}(t_1)^T & \cdots & \boldsymbol{\eta}(t_N)^T \end{bmatrix}^T, \quad \bar{\boldsymbol{\theta}} := \begin{bmatrix} \bar{\boldsymbol{\theta}}(t_1)^T & \cdots & \bar{\boldsymbol{\theta}}(t_N)^T \end{bmatrix}^T, \qquad (12c,d)$$
$$\mathbf{H} := \operatorname{diag}\{\mathbf{H}_1, \ldots, \mathbf{H}_N\}, \quad \mathbf{R} := \operatorname{diag}\{\mathbf{R}_1, \ldots, \mathbf{R}_N\}, \qquad (12e,f)$$
$$\boldsymbol{\mathcal{P}}(t) := \begin{bmatrix} \boldsymbol{\mathcal{P}}(t, t_1) & \cdots & \boldsymbol{\mathcal{P}}(t, t_N) \end{bmatrix}, \quad \boldsymbol{\mathcal{P}} := \begin{bmatrix} \boldsymbol{\mathcal{P}}(t_1, t_1) & \cdots & \boldsymbol{\mathcal{P}}(t_1, t_N) \\ \vdots & \ddots & \vdots \\ \boldsymbol{\mathcal{P}}(t_N, t_1) & \cdots & \boldsymbol{\mathcal{P}}(t_N, t_N) \end{bmatrix}, \qquad (12g,h)$$

which allows us to express the joint distribution between the batch of measurements and a single state perturbation as:

$$\begin{bmatrix} \mathbf{z} \\ \delta\boldsymbol{\theta}(t) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mathbf{h} + \mathbf{H}(\boldsymbol{\eta} - \bar{\boldsymbol{\theta}}) \\ \boldsymbol{\eta}(t) - \bar{\boldsymbol{\theta}}(t) \end{bmatrix}, \begin{bmatrix} \mathbf{R} + \mathbf{H}\boldsymbol{\mathcal{P}}\mathbf{H}^T & \mathbf{H}\boldsymbol{\mathcal{P}}(t)^T \\ \boldsymbol{\mathcal{P}}(t)\mathbf{H}^T & \boldsymbol{\mathcal{P}}(t, t) \end{bmatrix} \right). \qquad (13)$$

Using this joint distribution, we obtain the optimal perturba­tion by conditioning out the measurements, which produces

$$\begin{aligned} \delta\boldsymbol{\theta}^*(t) \sim \mathcal{N}\Big( &\big(\boldsymbol{\eta}(t) - \bar{\boldsymbol{\theta}}(t)\big) + \boldsymbol{\mathcal{P}}(t)\mathbf{H}^T\big(\mathbf{R} + \mathbf{H}\boldsymbol{\mathcal{P}}\mathbf{H}^T\big)^{-1}\big(\mathbf{z} - \mathbf{h} - \mathbf{H}(\boldsymbol{\eta} - \bar{\boldsymbol{\theta}})\big), \\ &\boldsymbol{\mathcal{P}}(t, t') - \boldsymbol{\mathcal{P}}(t)\mathbf{H}^T\big(\mathbf{R} + \mathbf{H}\boldsymbol{\mathcal{P}}\mathbf{H}^T\big)^{-1}\mathbf{H}\boldsymbol{\mathcal{P}}(t')^T \Big). \end{aligned} \qquad (14)$$

While (14) could be used directly as the estimate update equation for each iteration, simplifications can be achieved if we only consider the state at the measurement times. That is, if we define δθ* := [δθ*(t_1)^T ⋯ δθ*(t_N)^T]^T, algebraic manipulation produces
$$\big(\boldsymbol{\mathcal{P}}^{-1} + \mathbf{H}^T\mathbf{R}^{-1}\mathbf{H}\big)\,\delta\boldsymbol{\theta}^* = \boldsymbol{\mathcal{P}}^{-1}(\boldsymbol{\eta} - \bar{\boldsymbol{\theta}}) + \mathbf{H}^T\mathbf{R}^{-1}(\mathbf{z} - \mathbf{h}). \qquad (15)$$


The Schur complement [33] can be employed for an efficient solution to this linear system. After solving for the optimal perturbation of the state at the measurement times, it is applied as an additive update to the current state estimate by θ̄ ← θ̄ + δθ*. The system is then relinearized using the improved estimate, and the process repeats until convergence.
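A minimal dense sketch of one such iteration, implementing the solve of (15) directly with NumPy, is given below; the callables h_fun and H_fun are placeholders for the stacked measurement model and its Jacobian, and the sketch omits the Schur-complement and sparsity exploitation described above, so it is an illustration of the update rather than the authors' implementation.

```python
import numpy as np

def gpgn_iteration(theta_bar, eta, P, z, h_fun, H_fun, R):
    """One Gauss-Newton step of GPGN at the measurement times, eq. (15).

    theta_bar: current stacked state estimate at the measurement times.
    eta, P:    stacked GP prior mean and kernel matrix at those times.
    z:         stacked measurements; R: block-diagonal measurement covariance.
    h_fun, H_fun: callables returning the stacked predicted measurements
    and the measurement Jacobian linearized at theta_bar.
    """
    h = h_fun(theta_bar)
    H = H_fun(theta_bar)
    P_inv = np.linalg.inv(P)
    R_inv = np.linalg.inv(R)
    A = P_inv + H.T @ R_inv @ H                      # left-hand side of (15)
    b = P_inv @ (eta - theta_bar) + H.T @ R_inv @ (z - h)
    delta = np.linalg.solve(A, b)                    # optimal perturbation
    return theta_bar + delta                         # additive update

# The step is repeated (relinearizing each time) until delta is negligible.
```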

Additional algebraic manipulation can also produce an expression for the time-varying component of the state as a linear combination of the perturbations at the measurement times. This expression is

$$\bar{\mathbf{x}}(t) + \delta\mathbf{x}^*(t) = \boldsymbol{\mu}(t) - \mathbf{K}(t)\mathbf{K}^{-1}(\boldsymbol{\mu} - \bar{\mathbf{x}}) + \mathbf{K}(t)\mathbf{K}^{-1}\delta\mathbf{x}^*, \qquad (16)$$
and upon convergence of Gauss-Newton, δx*(t) = δx* = 0, which results in
$$\bar{\mathbf{x}}(t) = \boldsymbol{\mu}(t) - \mathbf{K}(t)\mathbf{K}^{-1}(\boldsymbol{\mu} - \bar{\mathbf{x}}). \qquad (17)$$

This equation provides state predictions at other times of interest, and can also be utilized for pose interpolation during estimation [13].
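A sketch of this interpolation step, applying (17) with a shared scalar kernel across the state dimensions (an assumption made to keep the example small; the block-kernel case follows the same pattern), might look as follows.

```python
import numpy as np

def interpolate_state(t_query, t_est, x_est, mu_fun, kernel_fun):
    """Query the continuous-time state at arbitrary times using eq. (17).

    t_query: (M,) query times; t_est: (N,) times with estimated states;
    x_est: (N, d) estimated state values (each column is one state
    dimension, all assumed to share the same scalar kernel for brevity);
    mu_fun(t) returns the prior mean; kernel_fun(t, t') is vectorized.
    """
    K = kernel_fun(t_est[:, None], t_est[None, :])          # N x N
    K_q = kernel_fun(t_query[:, None], t_est[None, :])      # M x N
    mu = mu_fun(t_est)                                       # N x d
    mu_q = mu_fun(t_query)                                   # M x d
    # x(t) = mu(t) - K(t) K^{-1} (mu - x), applied column-wise.
    return mu_q - K_q @ np.linalg.solve(K, mu - x_est)
```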

C. Sliding Window

Since computational limits prevent us from estimating the entire rover trajectory as a single problem, we utilize the batch formulation of GPGN in a multi-frame sliding window implementation. In this approach, only a small portion of the trajectory is considered in each estimation problem. This consists of a number of fixed poses to maintain consistency, as well as some free poses to be estimated. After convergence, the feature position estimates are discarded, and the estimation window is slid forward. Alternative formulations such as the Sliding Window Filter [34] maintain a local map by marginalizing out old poses and landmarks. While this allows the estimator to benefit from the improved map estimates, the sparsity in the linear system is lost.

The primary consideration in implementing a sliding window is the tradeoff between accuracy and computation time. Though the computational requirements of batch estimation scale with the number of measurements, maintaining a fixed window size allows for approximately constant computation. However, the use of GPGN introduces an additional factor. Since GP interpolation tends to utilize many frames, a sufficiently large number of poses is required to maintain the richness of the state representation. In our implementation, we used a sliding window configuration of 5 fixed frames and 2 free poses.
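A toy sketch of the bookkeeping for such a window is shown below, using the 5-fixed / 2-free configuration mentioned above; the id-based representation is purely illustrative and not drawn from the authors' code.

```python
from collections import deque

def slide_window(window, new_pose_id, n_fixed=5, n_free=2):
    """Maintain a sliding window of key-pose ids: the oldest poses are held
    fixed for consistency, the newest are free to be estimated, and the
    window advances one frame at a time (feature estimates are discarded
    after each solve)."""
    window.append(new_pose_id)
    while len(window) > n_fixed + n_free:
        window.popleft()                 # drop the oldest key pose
    fixed = list(window)[:-n_free]
    free = list(window)[-n_free:]
    return fixed, free

window = deque()
for frame in range(10):
    fixed_ids, free_ids = slide_window(window, frame)
```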

D. Extension to 3D

To extend GPGN to the 3D domain, an appropriate state representation must be chosen. The primary challenge in 3D estimation stems from the representation of rotations, which conflicts with the vector space assumptions that underlie the estimation algorithms.

In our implementation, we parametrize the rover poses with transformation matrices, and feature positions with homogeneous vectors. That is, the 4×4 transformation that takes a point in the estimate reference frame and transforms it to the sensor frame at time t_k is denoted as T_k, and the position of feature j with respect to the estimate reference frame is represented by the 4×1 homogeneous coordinate vector, p_j.

The state perturbations are composed of 6×1 vectors for the robot poses, δπ_k, and 3×1 vectors for the feature positions, δε_j. These are defined to be
$$\mathbf{T}_k = e^{-\delta\boldsymbol{\pi}_k^\boxplus}\,\bar{\mathbf{T}}_k \approx \big(\mathbf{1} - \delta\boldsymbol{\pi}_k^\boxplus\big)\bar{\mathbf{T}}_k, \qquad (18)$$
$$\mathbf{p}_j = \bar{\mathbf{p}}_j + \mathbf{D}\,\delta\boldsymbol{\varepsilon}_j, \qquad (19)$$
where the exponential map [35] is utilized to transform the vector parameterization into transformation matrices, (·)^⊞ and (·)^× are the matrix operators
$$\mathbf{w}^\boxplus := \begin{bmatrix} \mathbf{u} \\ \mathbf{v} \end{bmatrix}^\boxplus := \begin{bmatrix} \mathbf{v}^\times & -\mathbf{u} \\ \mathbf{0}^T & 0 \end{bmatrix}, \qquad (20)$$
$$\mathbf{v}^\times := \begin{bmatrix} 0 & -v_3 & v_2 \\ v_3 & 0 & -v_1 \\ -v_2 & v_1 & 0 \end{bmatrix}, \qquad (21)$$
and D is a dilation matrix defined as
$$\mathbf{D} := \begin{bmatrix} \mathbf{1} \\ \mathbf{0}^T \end{bmatrix}. \qquad (22)$$
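A hedged sketch of these operators and the corresponding perturbation updates (18)-(19) is given below; the ordering of the translation and rotation parts of the 6×1 vector, and the signs in the ⊞ operator, are reconstructed from the text and should be treated as assumptions about the convention rather than a statement of the authors' code.

```python
import numpy as np
from scipy.linalg import expm

def skew(v):
    """3x3 skew-symmetric matrix, eq. (21): skew(v) @ a == np.cross(v, a)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def boxplus(w):
    """4x4 operator of eq. (20) for a 6x1 vector w = [u; v]
    (translation part u, rotation part v, as reconstructed from the text)."""
    u, v = w[:3], w[3:]
    out = np.zeros((4, 4))
    out[:3, :3] = skew(v)
    out[:3, 3] = -u
    return out

D = np.vstack([np.eye(3), np.zeros((1, 3))])   # dilation matrix, eq. (22)

def update_pose(T_bar, delta_pi):
    """Apply a 6x1 pose perturbation via the exponential map, eq. (18)."""
    return expm(-boxplus(delta_pi)) @ T_bar

def update_landmark(p_bar, delta_eps):
    """Apply a 3x1 landmark perturbation to a homogeneous point, eq. (19)."""
    return p_bar + D @ delta_eps
```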

We consider the generic camera model
$$\mathbf{z}_{jk} = \mathbf{f}\big(\mathbf{T}_k\,\mathbf{p}_j\big) + \mathbf{n}_{jk}, \qquad (23)$$

which is composed of two nonlinearities: one that transforms a homogeneous point into a local frame, and another for the sensor model. The linearized point transformation is
$$\mathbf{g}_{jk} := \mathbf{T}_k\,\mathbf{p}_j \approx \big(\mathbf{1} - \delta\boldsymbol{\pi}_k^\boxplus\big)\bar{\mathbf{T}}_k\big(\bar{\mathbf{p}}_j + \mathbf{D}\,\delta\boldsymbol{\varepsilon}_j\big) \approx \bar{\mathbf{T}}_k\bar{\mathbf{p}}_j + \mathbf{G}_{jk}\,\delta\mathbf{x}_{jk}, \qquad (24)$$
where
$$\mathbf{G}_{jk} := \begin{bmatrix} \big(\bar{\mathbf{T}}_k\bar{\mathbf{p}}_j\big)^{\boxplus_3} & \bar{\mathbf{T}}_k\mathbf{D} \end{bmatrix}, \quad \delta\mathbf{x}_{jk} := \begin{bmatrix} \delta\boldsymbol{\pi}_k \\ \delta\boldsymbol{\varepsilon}_j \end{bmatrix}, \qquad (25)$$

and (·)^⊞₃ is the matrix operator
$$\begin{bmatrix} \boldsymbol{\varepsilon} \\ \eta \end{bmatrix}^{\boxplus_3} := \begin{bmatrix} \eta\,\mathbf{1} & \boldsymbol{\varepsilon}^\times \\ \mathbf{0}^T & \mathbf{0}^T \end{bmatrix}. \qquad (26)$$

Using the chain rule, this result can be multiplied with the Jacobian of the sensor model to obtain
$$\mathbf{z}_{jk} \approx \mathbf{f}\big(\bar{\mathbf{T}}_k\bar{\mathbf{p}}_j\big) + \mathbf{F}_{jk}\mathbf{G}_{jk}\,\delta\mathbf{x}_{jk} + \mathbf{n}_{jk}, \quad \mathbf{F}_{jk} := \left.\frac{\partial \mathbf{f}}{\partial \mathbf{g}}\right|_{\bar{\mathbf{T}}_k\bar{\mathbf{p}}_j}. \qquad (27)$$

This linearized model can be used in place of (8), where δx_jk is one of the many state components that form δθ. After solving for the optimal perturbations at each iteration, the current state estimate is updated using the exponential map.
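The point-transformation Jacobian of (25)-(26) can be assembled as in the following sketch, again treating the reconstructed operator definitions as assumptions about the convention.

```python
import numpy as np

def boxplus3(q):
    """4x6 operator of eq. (26) for a homogeneous point q = [eps; eta],
    constructed so that -(delta_pi)^boxplus q = boxplus3(q) delta_pi,
    a reconstruction consistent with eqs. (20) and (24)."""
    eps, eta = q[:3], q[3]
    out = np.zeros((4, 6))
    out[:3, :3] = eta * np.eye(3)
    out[:3, 3:] = np.array([[0.0, -eps[2], eps[1]],     # eps^x
                            [eps[2], 0.0, -eps[0]],
                            [-eps[1], eps[0], 0.0]])
    return out

def point_jacobian(T_bar, p_bar):
    """G_jk of eq. (25): Jacobian of the transformed point with respect to
    the stacked perturbation delta_x_jk = [delta_pi_k; delta_eps_j]."""
    D = np.vstack([np.eye(3), np.zeros((1, 3))])         # dilation, eq. (22)
    return np.hstack([boxplus3(T_bar @ p_bar), T_bar @ D])   # 4 x 9

# Chaining with the sensor Jacobian F_jk (d f / d g at T_bar p_bar) then
# gives the measurement Jacobian H_jk = F_jk @ point_jacobian(T_bar, p_bar).
```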


Fig. 5. An image of the ROC6 field robot in the Ethier Sand and Gravel Pit in Sudbury, Ontario, Canada, where the dataset was collected. The test site offers large-scale, 3D, unstructured, natural terrain, composed primarily of rocks and dirt.

V. EXPERIMENTAL VALIDATION

For experimental validation, we utilized a dataset gathered at the Ethier Sand and Gravel Pit in Sudbury, Ontario, Canada. This dataset was previously collected during lighting-invariant Visual Teach and Repeat experiments [8]. The test site, depicted in Figure 5, offers large-scale, 3D, unstructured terrain, primarily composed of rocks and dirt.

The ROC6 field robot utilized to gather the dataset, depicted in Figure 1, was equipped with an Autonosys LVC0702 two-axis scanning lidar. The lidar was configured to produce 480×360-pixel images, with a 30°×90° field of view. These images were collected at 2Hz, and ground truth data was provided by a combination of a DGPS base station and an onboard antenna. The traverse used in this work consists of 6880 image stacks obtained over a 1154m route.

For comparison, we implemented our proposed GPGN formulation alongside the conventional discrete-time batch Gauss-Newton approach. In this approach, the timestamps of the feature observations are replaced by the assumption that features detected in a single intensity image all arrive at the same time. In other words, the motion distortion effects are ignored, and the rover is assumed to be stationary during a single image acquisition. While this does not reflect the actual acquisition process, this compensation-free approach provides a point of reference that reflects the current discrete-time state estimation literature. Though discrete-time batch Gauss-Newton may be extended to conduct pose interpolation [9], [21], we only provide a basic comparison to the standard formulation in this section. In both cases, a Huber M-estimator [36] was incorporated for additional robustness against measurement outliers, and a maximum motion threshold was enforced to reject unrealistic pose estimates suggested by poor feature matches.

A. Implementation Details

Since a GP describes a distribution over a vector space, we require a vector parametrization of the state. This parametrization can be provided by the exponential map [35]. That is, a transformation matrix, T_{k,k0}, is related to its 6×1 vector parametrization, π(t_k), by T_{k,k0} = e^{-π(t_k)^⊞}.

Using this definition, we can model the time-varying state acceleration as a white noise process with

$$\ddot{\boldsymbol{\pi}}(t) \sim \mathcal{GP}\big(\mathbf{0}, \mathbf{W}\,\delta(t - t')\big), \qquad (28)$$
where W is the power spectral density matrix, and δ(·) is the Dirac delta function. This simple physically-motivated model provides a smooth trajectory, but does not address the motion constraints imposed by the hardware configuration. Integrating the mean and covariance functions twice produces the GP for π(t), where
$$\boldsymbol{\pi}(t) \sim \mathcal{GP}\big(\boldsymbol{\mu}(t), \mathbf{K}(t, t')\big), \qquad (29)$$
$$\boldsymbol{\mu}(t) = \boldsymbol{\pi}(0), \qquad (30)$$
$$\mathbf{K}(t, t') = \mathbf{W}\left(\frac{\min(t, t')^2\max(t, t')}{2} - \frac{\min(t, t')^3}{6}\right). \qquad (31)$$

The derivation of this covariance function is provided in [13]. Unfortunately, this covariance function is fully dense. However, the number of features greatly outnumbers the number of estimated poses in our scenario. As a result, the overall system remains sparse.
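For reference, a small sketch of the covariance function (31) and its evaluation into kernel matrices is given below, written for a single state dimension with a scalar power spectral density (an assumption made for brevity; the full model scales the same expression by the matrix W).

```python
import numpy as np

def wna_kernel(t, t_prime, W=1.0):
    """Covariance function of eq. (31) for a white-noise-on-acceleration
    prior, for one state dimension with scalar power spectral density W."""
    t_min = np.minimum(t, t_prime)
    t_max = np.maximum(t, t_prime)
    return W * (t_min**2 * t_max / 2.0 - t_min**3 / 6.0)

# Kernel matrix over hypothetical key-pose times and a cross-kernel row
# to a query time, as needed for the interpolation of eq. (17).
t_keys = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
K = wna_kernel(t_keys[:, None], t_keys[None, :])
K_query = wna_kernel(1.25, t_keys)
```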

Since the two-axis scanning lidar provided azimuth, elevation, and range data, it was modelled as a spherical camera. That is, the measurement model that provided the transformation between Cartesian points seen in a local frame to the spherical sensor space was
$$\mathbf{f}(\mathbf{p}) := \begin{bmatrix} \operatorname{atan2}\!\big(y, \sqrt{x^2 + z^2}\big) \\ \operatorname{atan2}(z, x) \\ \sqrt{x^2 + y^2 + z^2} \end{bmatrix}. \qquad (32)$$
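A sketch of this spherical measurement model, together with a finite-difference Jacobian that could stand in for the analytic sensor Jacobian in the chain rule of (27), is given below; the axis convention follows the reconstructed equation and should be treated as an assumption about the sensor frame.

```python
import numpy as np

def spherical_model(p):
    """Spherical camera model of eq. (32), mapping a Cartesian point in the
    sensor frame to two bearing angles and a range."""
    x, y, z = p[:3]
    return np.array([np.arctan2(y, np.sqrt(x**2 + z**2)),
                     np.arctan2(z, x),
                     np.sqrt(x**2 + y**2 + z**2)])

def numerical_jacobian(f, p, h=1e-6):
    """Finite-difference Jacobian of the sensor model, usable in place of
    an analytic F_jk for the chain-rule linearization."""
    f0 = f(p)
    J = np.zeros((f0.size, 3))
    for i in range(3):
        dp = np.zeros(3)
        dp[i] = h
        J[:, i] = (f(p[:3] + dp) - f0) / h
    return J

F = numerical_jacobian(spherical_model, np.array([5.0, 1.0, 2.0]))
```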

Finally, tuning was conducted using a small section of a representative traverse, where the best estimator performance was achieved by a sliding window configuration of 5 fixed and 3 free poses. Similarly, the sensor noise covariance, M-estimator threshold, and GP hyperparameters were determined using a combination of the manufacturer's specifications, ground truth data, and manual experimentation. The sliding window and sensor noise parameters were kept identical for both algorithms.


Fig. 6. An overhead and side plot of the estimates over the entire 1.1km traverse. The ground truth GPS data is depicted in black, the compensation-free estimate in red, and our GPGN estimate in blue. Projected 3σ covariance envelopes are also represented by the shaded ellipses at two locations to illustrate the estimator consistency. The first 30m of each estimate was used for alignment to the ground truth trajectory. As can be seen, considering the motion distortion in the continuous-time GPGN formulation greatly outperforms the discrete-time compensation-free approach, especially in vertical drift. The accompanying video depicts these trajectories alongside the intensity images and feature tracks.

B. Results

The resulting laser-based VO estimates are depicted in Figure 6. Since the VO estimates and ground truth data did not share a common reference frame, the first 30m of each estimate was used for alignment to the ground truth trajectory. As can be seen, the continuous-time GPGN formulation provides some resemblance to the ground truth trajectory, and greatly outperforms the discrete-time compensation-free approach, especially in vertical drift.

We quantify the growth in estimation error by plotting the translation errors against the actual distance travelled in Figure 7(a). These results confirm our qualitative observations that GPGN is significantly more accurate. However, the direction switches conducted by the robot proved to have beneficial effects, allowing intrinsic estimator biases to cancel out intermittently. This can be visually observed by the jagged growth and decline of the error plots.

To remove the beneficial effects of the direction switches from the estimator performance, we subdivided the dataset, creating subsets bordered by each direction switch. The estimator accuracy for these subsets is depicted in Figure 7(b), where the first 30m of each trajectory was also used for alignment to the ground truth data. Once again, the GPGN translation error grew much more slowly with distance travelled than that of the compensation-free approach.

Fig. 7. Plots of the translation estimate errors compared to the actual distance travelled: (a) translation estimate errors for the entire trajectory; (b) translation estimate errors split into multiple traverses by the direction switches. Since the direction switches tended to cancel out error, the trajectory was also split into multiple subproblems. In both cases, the first 30m of each trajectory was used for alignment. As can be seen, GPGN greatly outperforms the compensation-free approach in estimator drift.

Finally, we provide a timing analysis to compare the two estimator implementations. These estimators were implemented in Matlab on a MacBook Pro with a 2.66GHz Core 2 Duo and 4GB of 1067MHz DDR3 RAM, and utilized both cores during operation. In total, the compensation-free approach took 2.9h for state estimation, and GPGN took 3.8h. These durations, divided by 6880, the number of frames, resulted in an average of 1.5s per estimation window in the compensation-free implementation, and 2.0s for GPGN.

While not yet suitable for real-time operation, it is expected that further optimization and implementation in C++ will make this a viable choice for online operation. Furthermore, the increase in computation time from the conventional discrete-time approach to the continuous-time GPGN formulation appears to be small, which provides promise for addressing online operation by augmenting existing online discrete-time state estimation code.


VI. CONCLUSION

In summary, we have presented a novel 3D laser-based VO algorithm in this paper. We constructed intensity images, tracked sparse visual features, and overcame the motion distortion issues by utilizing GPGN for continuous-time state estimation. The novel contributions of this paper are the extension of GPGN to the 3D domain, and the validation of a laser-based VO framework on 1.1km of real-world data.

Possible avenues for future work include investigation of other covariance functions that may be more suitable for the VO problem, and comparisons to other motion compensation methods, such as integration of an IMU. Furthermore, we would like to apply these concepts to other mobile robot laser configurations, such as a 3D sensor composed of a sweeping planar laser unit.

ACKNOWLEDGEMENTS

We would like to thank Sean Anderson, Colin McManus, Hang Dong, Braden Stenning, and Erik Beerepoot for collecting and preparing the dataset used in this work. We also acknowledge the support of NSERC, the Canada Foundation for Innovation, DRDC Suffield, the Canadian Space Agency, and MDA Space Missions.

REFERENCES

[1] M. Maimone, Y. Cheng, and L. Matthies, "Two years of visual odometry on the Mars exploration rovers," Journal of Field Robotics, vol. 24, no. 3, pp. 169-186, March 2007.

[2] K. Konolige, M. Agrawal, and J. Sola, "Large scale visual odometry for rough terrain," in Proceedings of the International Symposium on Robotics Research, Hiroshima, Japan, November 26-29 2007, pp. 201-212.

[3] G. Sibley, C. Mei, I. Reid, and P. Newman, "Vast-scale outdoor navigation using adaptive relative bundle adjustment," International Journal of Robotics Research, vol. 29, no. 8, pp. 958-980, July 2010.

[4] C. Harris and M. Stephens, "A combined corner and edge detector," in Proceedings of the Fourth Alvey Vision Conference, Manchester, UK, 31 August - 2 September 1988, pp. 147-151.

[5] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, November 2004.

[6] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, "SURF: Speeded up robust features," Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346-359, 2008.

[7] C. McManus, P. Furgale, and T. D. Barfoot, "Towards appearance-based methods for lidar sensors," in Proceedings of the IEEE Conference on Robotics and Automation (ICRA), Shanghai, China, 9-13 May 2011, pp. 1930-1935.

[8] C. McManus, P. Furgale, B. Stenning, and T. D. Barfoot, "Visual teach and repeat using appearance-based lidar," in Proceedings of the IEEE Conference on Robotics and Automation (ICRA), St. Paul, MN, USA, 14-18 May 2012, pp. 389-396.

[9] H. Dong and T. D. Barfoot, "Lighting-invariant visual odometry using lidar intensity imagery and pose interpolation," in Proceedings of the International Conference on Field and Service Robotics (FSR), Matsushima, Japan, 16-19 July 2012.

[10] A. Nüchter, K. Lingemann, J. Hertzberg, and H. Surmann, "6D SLAM - 3D mapping outdoor environments," Journal of Field Robotics, vol. 24, no. 8-9, pp. 699-722, August/September 2007.

[11] J. Kelly and G. S. Sukhatme, "Visual-inertial sensor fusion: Localization, mapping and sensor-to-sensor self-calibration," International Journal of Robotics Research (IJRR), vol. 30, no. 1, pp. 56-79, January 2011.

[12] C. Tong, P. Furgale, and T. D. Barfoot, "Gaussian Process Gauss-Newton: Non-parametric state estimation," in Proceedings of the 9th Canadian Conference on Computer and Robot Vision (CRV), Toronto, Ontario, Canada, 27-30 May 2012, pp. 206-213.


[13] ——, "Gaussian Process Gauss-Newton for non-parametric simultaneous localization and mapping," International Journal of Robotics Research (IJRR), 2013, manuscript #IJR-12-1510, to appear.

[14] J. Neira, J. Tardos, J. Horn, and G. Schmidt, "Fusing range and intensity images for mobile robot localization," IEEE Transactions on Robotics and Automation, vol. 15, no. 1, 1999.

[15] J. Guivant, E. Nebot, and S. Baiker, "Localization and map building using laser range sensors in outdoor applications," Journal of Robotic Systems, vol. 17, pp. 565-583, 2000.

[16] C. Tong and T. D. Barfoot, "A self-calibrating ground-truth localization system using retroreflective landmarks," in Proceedings of the IEEE Conference on Robotics and Automation (ICRA), Shanghai, China, 9-13 May 2011, pp. 3601-3606.

[17] J. Levinson, "Automatic laser calibration, mapping, and localization for autonomous vehicles," Ph.D. dissertation, Stanford University, 2011.

[18] M. Bosse and R. Zlot, "Continuous 3D scan-matching with a spinning 2D laser," in Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, May 12-17 2009, pp. 4312-4319.

[19] R. Zlot and M. Bosse, "Efficient large-scale 3D mobile mapping and surface reconstruction of an underground mine," in Proceedings of the International Conference on Field and Service Robotics (FSR), Matsushima, Japan, 16-19 July 2012.

[20] M. Bosse, R. Zlot, and P. Flick, "Zebedee: Design of a spring-mounted 3-D range sensor with application to mobile mapping," IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1104-1119, October 2012.

[21] E. Ringaby and P.-E. Forssén, "Efficient video rectification and stabilisation for cell-phones," International Journal of Computer Vision, vol. 96, no. 3, pp. 335-352, February 2012.

[22] K. Shoemake, "Animating rotation with quaternion curves," ACM SIGGRAPH Computer Graphics, vol. 19, no. 3, pp. 245-254, 1985.

[23] R. E. Kalman and R. S. Bucy, "New results in linear filtering and prediction theory," Journal of Basic Engineering, vol. 83, no. 1, pp. 95-108, 1961.

[24] A. H. Jazwinski, Stochastic Processes and Filtering Theory. New York: Academic Press, 1970.

[25] C. Bibby and I. D. Reid, "A hybrid SLAM representation for dynamic marine environments," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK, USA, 3-7 May 2010, pp. 257-264.

[26] P. Furgale, T. Barfoot, and G. Sibley, "Continuous-time batch estimation using temporal basis functions," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), St. Paul, MN, USA, 14-18 May 2012, pp. 2088-2095.

[27] S. Anderson and T. D. Barfoot, "Towards relative continuous-time SLAM," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), to appear, Karlsruhe, Germany, 6-10 May 2013.

[28] S. Anderson, C. McManus, H. Dong, E. Beerepoot, and T. D. Barfoot, "The gravel pit lidar intensity imagery dataset," submitted to the International Journal of Robotics Research, 2013.

[29] H. Dong, S. Anderson, and T. D. Barfoot, "Two-axis scanning lidar geometric calibration using intensity imagery and distortion mapping," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), to appear, Karlsruhe, Germany, 6-10 May 2013.

[30] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.

[31] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.

[32] C. F. Gauss, Méthode des Moindres Carrés. Paris: Mallet-Bachelier, Imprimeur-Libraire de l'École Polytechnique, 1855.

[33] D. C. Brown, "A solution to the general problem of multiple station analytical stereotriangulation," Patrick Air Force Base, Florida, USA, Tech. Rep. 43, 1958.

[34] G. Sibley, L. Matthies, and G. Sukhatme, "Sliding window filter with applications to planetary landing," Journal of Field Robotics, vol. 27, no. 5, pp. 587-608, September/October 2010.

[35] J. Stillwell, Naive Lie Theory. Springer, 2008.

[36] P. J. Huber, Robust Statistics. New York, NY, USA: John Wiley & Sons, 1981.