
High-Precision Globally-Referenced Position and

Attitude via a Fusion of Visual SLAM,

Carrier-Phase-Based GPS, and Inertial Measurements

Daniel P. Shepard and Todd E. Humphreys

The University of Texas at Austin, Austin, TX

Abstract—A novel navigation system for obtaining high-precision globally-referenced position and attitude is presented and analyzed. The system is centered on a bundle-adjustment-based visual simultaneous localization and mapping (SLAM) algorithm which incorporates carrier-phase differential GPS (CDGPS) position measurements into the bundle adjustment in addition to measurements of point features identified in a subset of the camera images, referred to as keyframes. To track the motion of the camera in real-time, a navigation filter is employed which utilizes the point feature measurements from all non-keyframes, the point feature positions estimated by bundle adjustment, and inertial measurements. Simulations have shown that the system obtains centimeter-level or better absolute positioning accuracy and sub-degree-level absolute attitude accuracy in open outdoor areas. Moreover, the position and attitude solution only drifts slightly with the distance traveled when the system transitions to a GPS-denied environment (e.g., when the navigation system is carried indoors). A novel technique for initializing the globally-referenced bundle adjustment algorithm is also presented which solves the problem of relating the coordinate systems for position estimates based on two disparate sensors while accounting for the distance between the sensors. Simulation results are presented for the globally-referenced bundle adjustment algorithm which demonstrate its performance in the challenging scenario of walking through a hallway where GPS signals are unavailable.

I. INTRODUCTION

Cameras remain one of the most attractive sensors for motion estimation because of their inherently high information content, low cost, and small size. Visual simultaneous localization and mapping (SLAM) leverages this vast amount of information provided by a camera to estimate the motion of the user and a map of the environment seen by the camera with a high degree of precision as the user moves around the environment. However, the utility of stand-alone visual SLAM is severely limited due to its scale ambiguity (for monocular cameras) and lack of a global reference.

Much prior work in visual SLAM has focused on either eliminating the scale ambiguity, through the inclusion of inertial measurements [1], [2] or GPS carrier-phase measurements [3], or employing previously-mapped visually recognizable markers, referred to as fiduciary markers [4]. In contrast, there has been little prior work that attempts to solve the problem of anchoring the local navigation solution produced by visual SLAM to a global reference frame without the use of an a priori map of the environment, even though the no-prior-map technique is preferred or required for many applications. The few papers addressing this issue typically employ estimation architectures with performance significantly inferior to an optimal estimator and lean heavily on magnetometers and inertial measurement units (IMUs) for attitude determination, which results in poor attitude precision for all but the highest quality magnetometers and IMUs [5]–[9].

The visual SLAM framework reported in prior literature that comes closest to that reported in this paper is Bryson's visual SLAM algorithm from [10]. Bryson used a combination of monocular visual SLAM, inertial navigation, and GPS to create a map of the terrain from a UAV flying 100 m above the ground. However, this algorithm was incapable of running in real-time; estimation of the map of the environment and vehicle motion was performed after-the-fact. Additionally, the accuracy of the solution in a global sense was severely limited because standard positioning service (SPS) GPS, which is accurate to only a few meters, was used. The inertial measurements and visual SLAM help to tie the GPS measurements together during the batch estimation procedure to increase the accuracy of the solution, but the resulting position estimates were still only accurate to decimeter level.

The visual SLAM framework presented in this paper is inspired by Klein and Murray's Parallel Tracking and Mapping (PTAM) technique [11], which is a stand-alone visual SLAM algorithm that separates tracking of the position and attitude of the camera and mapping of the environment into two separate threads. The mapping thread performs a batch estimation procedure, referred to as bundle adjustment, that operates on a subset of the camera images, referred to as keyframes, that are chosen for their spatial diversity to estimate the position of identified point features and the position and orientation of the camera at each keyframe. The tracking thread identifies point features in each frame received from the camera and determines the current position and attitude of the camera using the point feature measurements from the current frame and the current best estimate of the positions of the point features from the mapping thread.

Like PTAM, the visual SLAM framework presented in this paper separates tracking of the position and attitude of the camera and mapping of the environment into two separate threads. However, the mapping thread's bundle adjustment algorithm presented in this paper additionally employs carrier-phase differential GPS (CDGPS) position estimates, interpolated to the time the keyframes were taken. This allows the mapping thread to determine the position of each point feature and the position and attitude of the camera at each keyframe to high precision in a global coordinate system without the use of a magnetometer or IMU and without an a priori map of the environment. When CDGPS position estimates are not available (e.g., when the navigation system is carried indoors), the mapping thread continues to operate without these measurements at new keyframes. However, the accuracy of the estimates of any newly identified point features, and thus the position and attitude of the system, decays slowly with the distance traveled in the GPS-denied environment. Since information about previous keyframes is maintained, returning to a previously-visited area in the GPS-denied environment will aid in correcting any accumulated errors, a condition referred to as loop closure.

The tracking thread for the visual SLAM framework presented in this paper maintains the point feature identification functionality of PTAM's tracking thread but incorporates a navigation filter. This filter greatly improves the accuracy of the best estimate of the current position and attitude of the camera by providing a better motion model, through the incorporation of IMU measurements, and utilizing information obtained from all non-keyframes. The filter additionally improves the robustness and computational efficiency of the tracking and mapping threads by aiding in recovery during rough dynamics, reducing the search space for feature identification, and reducing the number of required batch iterations for the mapping thread.

One significant advantage of this navigation system over other high-precision navigation systems is that it can be implemented using inexpensive sensors. Modern digital cameras are inexpensive, high-information-content sensors. Inexpensive GPS receivers are available today that produce the single-frequency carrier-phase and pseudorange measurements required to determine a CDGPS position solution. An inexpensive IMU can also be employed because the navigation system does not rely on the IMU for long-term state propagation or attitude determination.

One promising application for this type of navigation system is augmented reality. Augmented reality (AR) is a concept closely related to virtual reality (VR), but has a fundamentally different goal. Instead of replacing the real world with a virtual one like VR does, AR seeks to produce a blended version of the real world and context-relevant virtual elements that enhance or augment the user's experience in some way, typically through visuals. The relation of AR to VR is best explained by imagining a continuum of perception with the real world on one end and VR on the other. On this continuum, AR would be placed in between the real world and VR, with the exact placement depending on the goal of the particular application of AR.

The primary limiting factor for AR is the fact that AR requires extremely precise navigation to maintain the illusion of realism of the virtual objects augmented onto the view of the real world. AR applications simply fail to impress without this illusion of realism, and the human eye is fairly good at picking up on its absence. Additionally, large errors in the registration of virtual objects (i.e., the position and orientation of virtual objects relative to the real world) make it impossible for a user to interact with these objects.

Many current successful AR applications rely on visual SLAM for relative navigation, which results in accuracy suitable for many applications. However, there are many AR applications, such as construction, utility work, social networking, and multiplayer games, that are awkward or impossible to do using relative navigation alone because of the need to relate navigation information in a consistent coordinate system. The navigation system presented in this paper has the required precision in a global reference frame to serve as a viable AR platform for all of these applications.

This paper begins with a discussion of the estimation architecture that details the differences between the proposed estimation architecture and that of the optimal estimator. Next, the state vectors and measurement models for the bundle adjustment and navigation filter are described. Then, the bundle adjustment algorithm is detailed, including a novel technique for initialization of the globally-referenced bundle adjustment. This is followed by a detailed description of the navigation filter. Finally, results from simulations of the bundle adjustment are presented.

II. ESTIMATION ARCHITECTURE

The eventual goal of the work presented in this paper is the creation of a high-precision globally-referenced navigation system based on a fusion of visual SLAM, CDGPS, and inertial measurements that is capable of operating in real-time. An important consideration for any multi-sensor navigation system is how the information from these sensors will be combined to estimate the state. An optimal estimator would, by definition, attain the highest precision state estimate for any given set of measurements, but operation of an optimal estimator in real-time is impractical for this system due to finite computational resources and the high computational demand of visual SLAM. Therefore, compromises must be made with regard to the optimality of the estimator to enable real-time operation, as is typically the case. The remainder of this section details the compromises made to enable real-time performance by describing the differences between the optimal estimator and the estimation architecture proposed in this paper. Note that the optimal estimator and the intermediate architectures leading to the final proposed architecture are only notional and are used simply to draw a comparison with the final proposed architecture.

A. Optimal Estimator

To highlight the differences between the proposed estimation architecture and the optimal estimator, the optimal estimator for this problem must first be presented. Due to non-linearities, the optimal estimator, in a least-squares sense, requires that the measurements from all sensors at all time epochs received thus far be processed together in a single least-squares batch estimator. Before introducing the state vector for this batch estimator, it is necessary to define what measurements from each sensor the estimator employs, since this choice may alter the state. For example, there are three types of measurements that can be taken from a GPS receiver which represent different stages in the processing and, for CDGPS, will change the state of the estimator depending on which of these types of measurements is employed. When coupling GPS measurements with those from another navigation sensor, it is conventional to consider three levels of coupling based on the types of GPS measurements used by the estimator and the details of the estimation architecture. These levels of coupling and the types of GPS measurements associated with those levels of coupling are as follows:

1) Loosely-coupled, which uses position and time estimates

2) Tightly-coupled, which uses pseudorange and carrier-phase measurements for each GPS signal

3) Ultra-tightly-coupled, which uses in-phase and quadrature accumulations for each correlator tap and GPS signal

As a general rule-of-thumb, more tightly coupled estimation architectures will result in better performance. Therefore, the optimal estimator considered here employs an ultra-tightly-coupled architecture where the tracking loops of both GPS receivers, a reference receiver at a known location and the mobile receiver, are driven by the estimate of the state (i.e., a vector tracking loop). The sensor measurements used in this estimator are (1) the in-phase and quadrature accumulations for each GPS signal for the prompt, early, and late correlator taps from both GPS receivers, reference and mobile, (2) the image feature coordinates in each image, and (3) the specific force and angular velocity measurements from the IMU. The state vector for this estimator would include the following:

1) Camera poses (i.e., position and attitude of the camera) for each image

2) The velocity of the camera for each image

3) The local clock offset and offset rate from GPS time for both receivers at the time each image was taken

4) The image feature positions

5) Either the accelerometer and gyro biases at each image or the coefficients of a piece-wise polynomial model for the accelerometer and gyro biases

6) The integer ambiguities on the double-differenced carrier-phase measurements, which are formed based on the prompt tap in-phase and quadrature accumulations from both receivers

B. Removal of Inertial Measurements

The first compromise made to reduce the required computational expense of the estimator was to remove the IMU measurements from the batch estimator and the accelerometer and gyro biases from the state of the batch estimator. Due to the already high precision of the GPS and vision measurements, the measurements from the IMU do not significantly contribute to the accuracy of the state estimate and are not necessary for observability of the state. The vision measurements act as an extremely high quality IMU by relating the poses of each image and allow for determination of attitude to a high precision, even without the IMU. Additionally, it is awkward and computationally burdensome to deal with the IMU biases in the batch estimator. Thus, incorporating the IMU measurements into the batch estimator is simply not worth the marginal benefits gained. However, the IMU measurements are useful for propagating the state between frames, which can be performed external to the batch estimator.

C. Scalar GPS Tracking Loops

While the vector GPS tracking loops in an ultra-tightly-coupled architecture do significantly improve robustness of signal tracking and acquisition, this benefit comes at an extremely high price in terms of the computational requirements of the estimator because the estimator has to update the state at a much faster rate to drive the tracking loops. Additionally, the incorporation of visual SLAM results in extremely slow drift in the state estimate during GPS outages and aids in detecting and eliminating carrier-phase cycle slips, which minimizes the impact a vector tracking loop would have on performance over scalar tracking loops. Therefore, a tightly-coupled architecture for the estimator, where the in-phase and quadrature accumulations are replaced with pseudorange and carrier-phase measurements, can be employed instead of an ultra-tightly-coupled architecture with little loss in performance. The local clock offset and offset rate from GPS time for both GPS receivers can also be removed from the state vector. These parameters were necessary in the ultra-tightly-coupled architecture because the vector tracking loops needed an estimate of time, but the tightly-coupled architecture does not require these parameters, so long as the pseudorange and carrier-phase measurements for both receivers are aligned in time to at least GPS standard positioning service (SPS) accuracy. This is a result of the effects of the local clock canceling out when forming the double-differenced pseudorange and carrier-phase measurements in the CDGPS algorithm [12].
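
The clock-cancellation property mentioned above is easy to see numerically. The following sketch is not from the paper; the measurement values and function names are illustrative. It forms between-receiver and between-satellite differences of carrier-phase measurements and shows that a common receiver clock term drops out of the double differences.

```python
# Illustrative sketch (not from the paper): forming double-differenced
# carrier-phase measurements so that receiver and satellite clock terms cancel.
# Assumes phase measurements in cycles and a common reference satellite.
import numpy as np

def single_difference(phi_mobile, phi_reference):
    """Between-receiver single difference removes the satellite clock term."""
    return phi_mobile - phi_reference

def double_difference(sd, ref_sat_index=0):
    """Between-satellite difference of single differences removes the
    receiver clock terms, leaving geometry plus integer ambiguities."""
    sd = np.asarray(sd)
    others = np.delete(sd, ref_sat_index)
    return others - sd[ref_sat_index]

# Example with a common receiver clock bias added to every channel:
phi_B = np.array([1200.25, 987.10, 1543.80, 1105.45])   # mobile receiver (cycles)
phi_A = np.array([1199.75, 986.40, 1543.10, 1104.95])   # reference receiver (cycles)
clock_bias_B = 3.2                                      # cycles; cancels below
dd = double_difference(single_difference(phi_B + clock_bias_B, phi_A))
print(dd)   # independent of clock_bias_B
```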

D. Separation of CDGPS Processing and Batch Estimator

The third compromise was the separation of CDGPS position estimation from the batch estimator, which was done primarily to simplify the design of the batch estimator. In this loosely-coupled architecture, the CDGPS-based position estimates are incorporated into the batch estimator instead of the pseudorange and carrier-phase measurements, and the double-differenced carrier-phase integer ambiguities are removed from the state vector of the batch estimator. To better understand the effects of removing the CDGPS algorithm from the batch estimator, there are two pertinent questions one must ask:

1) How is the estimation of the double-differenced carrier-phase integer ambiguities affected by incorporation into the batch estimator?

2) How much does the accuracy of the state estimate change when double-differenced carrier-phase measurements with resolved integer ambiguities are incorporated into the batch estimator instead of only the position estimates suggested by those measurements?

As for the first question, it is well understood in the literature on CDGPS that the addition of any constraints significantly aids the resolution of the integer ambiguities [13]. The vision measurements are able to constrain pose in a local coordinate system, which will greatly aid in resolving the integer ambiguities (especially during motion). As for the second question, it is difficult to say how much the state estimate will improve without implementing this approach. However, the accuracy of the state estimate is certain to improve at least somewhat. As opposed to the removal of inertial measurements and the transition to a tightly-coupled architecture discussed previously, the separation of CDGPS position estimation may have a significant effect on the performance of the estimator with little change in the computational burden, provided that the integer ambiguities are fixed after convergence rather than continually estimated. Therefore, the authors hope to later reincorporate the CDGPS algorithm into the batch estimator, which is not a trivial matter.

E. Hybrid Batch/Sequential Approach

Computing global solutions for the keyframe poses and image feature positions in the batch estimator is a computationally intensive process that would require immense computational resources if it were performed at the frame-rate of the camera. Therefore, an approach to this problem employing a single batch estimator performing these global solutions for each frame is impractical for real-time applications. Thankfully, the highest accuracy for visual SLAM is obtained by employing a large number of image features, which results in a sparse structure that can be exploited by the batch estimator, and a smaller number of geographically-diverse images, as was shown in [14]. In other words, not all frames are created equal. The principle of diminishing returns applies to frames taken from nearly the same camera pose. This means that there is no need to process most of the images in the batch estimator because incorporating these images into the batch estimator will result in little improvement in the accuracy of the global solution. Therefore, new frames should be incorporated into the batch estimator only when the frame to be incorporated is, in some sense, geographically diverse from the other frames already incorporated into the batch estimator. These geographically-diverse frames are referred to as keyframes and can be chosen based on a comparison of prior keyframe poses and an estimate of the current camera pose. As the system moves around, eventually the number of keyframes will become too large to perform global solutions in a reasonable amount of time. At this point, the batch estimator can shed old keyframes and point features that no longer contribute significantly to the estimate of the point feature locations near the system's current position and save the image feature measurements for these keyframes for later use if the system returns to that area.
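
A keyframe-selection heuristic of the kind described above might look like the following sketch. The translation and rotation thresholds and the pose representation are assumptions for illustration; the paper does not specify its exact heuristics.

```python
# Minimal sketch of a geographic-diversity keyframe test.  Thresholds and
# the (position, rotation matrix) pose representation are assumed values,
# not the heuristics actually used by the system in the paper.
import numpy as np

def is_new_keyframe(position, R_cam, keyframes,
                    min_translation=0.5, min_rotation_deg=15.0):
    """Accept a frame as a keyframe only if it is geographically diverse:
    far enough in translation OR rotation from every existing keyframe."""
    for kf_position, kf_R in keyframes:
        dt = np.linalg.norm(position - kf_position)
        # Rotation angle between the two attitudes from the trace of R1^T R2.
        cos_angle = (np.trace(kf_R.T @ R_cam) - 1.0) / 2.0
        angle_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
        if dt < min_translation and angle_deg < min_rotation_deg:
            return False   # too close to an existing keyframe
    return True

keyframes = [(np.zeros(3), np.eye(3))]
print(is_new_keyframe(np.array([0.1, 0.0, 0.0]), np.eye(3), keyframes))  # False
print(is_new_keyframe(np.array([1.0, 0.0, 0.0]), np.eye(3), keyframes))  # True
```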

To be useful as a real-time navigation system, however, the system must maintain a highly accurate estimate of the current pose of the camera. This goal is at odds with the compromise that only select frames be incorporated into the batch estimator. While the non-keyframes do not contribute significantly to the accuracy of the global solution, they do contain valuable information about the pose of the camera at the time they were taken. To account for this, a second estimator can be employed that takes as measurements the non-keyframes and the inertial measurements from the IMU, which aid in propagating the camera pose between frames. The second estimator additionally utilizes the estimates of the image feature locations from the first estimator to tie its navigation solution to the global map. This approach results in a federated estimation architecture where one estimator, the batch estimator discussed in the previous paragraph, is tasked with mapping the environment and a second estimator is tasked with tracking the motion of the camera. This is similar to the approach taken by Klein and Murray's stand-alone visual SLAM algorithm called PTAM [11].

One issue with using this federated approach for the estimator, which is discussed in further detail in Sec. VI, is that the second estimator cannot maintain cross-covariances between the image feature positions from the batch estimator without destroying its ability to operate in real-time. This means the second estimator cannot hope to truly maintain a consistent estimate of the state covariance, and an ad-hoc inflation of the covariance matrix must be used. However, these cross-covariances between image feature positions will typically be small in practice, which is demonstrated by the simulation results in Sec. VII.

The system reported in this paper employs a sequential estimator, or filter, as this second estimator. Thus, the estimation architecture presented in this paper describes a hybrid batch/sequential estimator. One could instead employ a batch estimator that uses only the last few frames, but incorporating the IMU measurements into this framework would be somewhat awkward. However, this is a viable option that might be explored in future work.

F. Proposed Estimation Architecture

Based on the previous discussion, an estimation architecture was designed that has the potential for real-time operation. A block diagram of this estimation architecture is shown in Fig. 1. The blocks on the far left of Fig. 1 are all the sensors for the system: camera, reference GPS receiver, mobile GPS receiver, and IMU.

The most important components of this estimation architecture are the two blocks on the far right of Fig. 1. The upper block is the batch estimator responsible for creating a high-precision globally-referenced map of the environment based on image feature measurements from the keyframes and CDGPS-based position estimates, when available, interpolated to the time the keyframes were taken. This process of estimating a map of the environment based on keyframes in a batch estimator is commonly referred to as bundle adjustment. In this particular case, the bundle adjustment is augmented with CDGPS-based position estimates which anchor the bundle adjustment to a global coordinate system. The lower block is the sequential estimator or navigation filter responsible for maintaining an accurate estimate of the current pose of the camera based on image feature measurements from non-keyframes, IMU specific force and angular velocity measurements, and the image feature position estimates from the bundle adjustment. In addition to being the primary output of the estimator as a whole, the camera pose estimated by the navigation filter is also used in several other components of the estimator to aid in improving computational efficiency and performance. These two components of the estimator, bundle adjustment and navigation filter, are the main focus of the remainder of this paper.

[Figure omitted: block diagram in which the camera, reference and mobile GPS receivers, and IMU feed the feature identifier, keyframe selector, and CDGPS filter, which in turn feed the bundle adjustment (batch estimator) and navigation filter (sequential estimator) with keyframe and non-keyframe feature measurements, CDGPS position estimates, specific force and angular velocity measurements, and propagated pose and feature position estimates and covariances.]
Fig. 1. A block diagram of the proposed estimation architecture.

In addition to the bundle adjustment and the navigation filter, there are several other components of the estimation architecture that are responsible for producing the measurements that are later input to the bundle adjustment and the navigation filter. The CDGPS filter, shown in the middle of Fig. 1, is responsible for estimating the position of the GPS antenna and the double-differenced carrier-phase integer ambiguities based on the pseudorange and carrier-phase measurements from the reference and mobile GPS receivers. An estimate of the current camera pose from the navigation filter is provided to the CDGPS filter only for linearization and, thus, does not create correlation between the CDGPS-based position estimates and the navigation filter's pose estimate. This is important because any correlation between the output of the navigation filter and the input to the bundle adjustment would destroy the consistency of the map created by the bundle adjustment and could cause divergence of the estimator. The feature identifier, shown to the right of the camera in Fig. 1, finds and matches features in each image received from the camera. To reduce the computational burden of the feature identifier, estimates of the camera pose and image feature positions are provided to the feature identifier by the navigation filter and bundle adjustment, respectively, which reduces the search space for each image feature. Although not shown explicitly in Fig. 1, the feature identifier also identifies new features matched between multiple keyframes. The keyframe selector, shown to the right of the feature identifier in Fig. 1, employs a set of heuristics that determines whether a frame is geographically diverse enough, relative to the current keyframes, to be considered a new keyframe. These heuristics are based on the estimate of the current camera pose provided by the navigation filter. While these components (CDGPS filter, feature identifier, and keyframe selector) serve important roles in this estimation architecture, they will only be discussed superficially throughout the remainder of this paper because this is not where the contributions of this paper lie and there is already a plethora of literature on these components individually.

III. STATE VECTOR

Before discussing the measurement models and estimation algorithms for the bundle adjustment and the navigation filter, it is appropriate to first introduce the state vectors for each estimator.

A. Bundle Adjustment State

The bundle adjustment is responsible for producing a globally-referenced map of the environment and, as such, must include the position of each image feature in the global coordinate system in its state vector. As a byproduct of producing this map, the camera poses at each keyframe must also be estimated in this global coordinate system. Therefore, the state vector for the bundle adjustment is as follows:

$$X_{BA} = \left[\left(x^{C_1}_G\right)^T \left(q^{C_1}_G\right)^T \cdots \left(x^{C_N}_G\right)^T \left(q^{C_N}_G\right)^T, \ \left(x^{p_1}_G\right)^T \cdots \left(x^{p_M}_G\right)^T\right]^T \tag{1}$$

where $x^{C_i}_G$ is the position of the camera at the $i$th keyframe in the global coordinate system, $q^{C_i}_G$ is the quaternion representation of the attitude of the camera at the $i$th keyframe relative to the global coordinate system, $N$ is the number of keyframes in the bundle adjustment, $x^{p_j}_G$ is the position of the $j$th image feature in the global coordinate system, and $M$ is the number of image features in the map.

The $C_i$ frame is the camera frame at the time the $i$th keyframe was taken. The camera frame, which will be denoted as $C$, is defined as the reference frame centered on the camera lens with the z-axis pointing down the bore-sight of the camera, the x-axis pointing to the right, and the y-axis pointing down to complete the right-handed triad. The $G$ frame is the Earth-Centered Earth-Fixed (ECEF) coordinate system. Note that for any attitude representation in this paper, $(\cdot)^A_B$ represents a rotation from the $A$ frame to the $B$ frame.
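
For concreteness, the stacking of Eq. (1) could be implemented as below. This is only an illustrative sketch of the 7N + 3M layout (position plus unit quaternion for each of the N keyframes, followed by M feature positions); the function names are not from the paper.

```python
# Sketch of how the bundle adjustment state of Eq. (1) could be stacked in
# memory: N camera poses (position + unit quaternion, 7 states each) followed
# by M feature positions (3 states each), for 7N + 3M states total.
import numpy as np

def pack_ba_state(camera_positions, camera_quaternions, feature_positions):
    """camera_positions: (N,3), camera_quaternions: (N,4) unit quaternions,
    feature_positions: (M,3).  Returns the stacked state of length 7N + 3M."""
    poses = np.hstack([camera_positions, camera_quaternions]).ravel()
    return np.concatenate([poses, np.asarray(feature_positions).ravel()])

def unpack_ba_state(X, N, M):
    """Inverse of pack_ba_state."""
    poses = X[:7 * N].reshape(N, 7)
    feats = X[7 * N:].reshape(M, 3)
    return poses[:, :3], poses[:, 3:], feats

N, M = 2, 3
X = pack_ba_state(np.zeros((N, 3)),
                  np.tile([0.0, 0.0, 0.0, 1.0], (N, 1)),
                  np.arange(M * 3, dtype=float).reshape(M, 3))
assert X.size == 7 * N + 3 * M
```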

B. Navigation Filter State

The navigation filter is responsible for maintaining an estimate of the current camera pose, which must be included in the state vector. As part of maintaining an estimate of the current camera pose, the camera pose after each measurement update must be propagated forward in time to the next measurement. This is accomplished through the use of accelerometer and gyro measurements, which include bias terms that must be estimated. Therefore, the state vector for the navigation filter is as follows:

$$X_{F} = \left[\left(x^{C}_G\right)^T \left(v^{C}_G\right)^T \left(b^{f}_B\right)^T \left(q^{C}_G\right)^T \left(b^{\omega}_B\right)^T\right]^T \tag{2}$$

where $x^C_G$ and $v^C_G$ are the current position and velocity of the camera in the global coordinate system, $q^C_G$ is the quaternion representation of the current attitude of the camera relative to the global coordinate system, and $b^f_B$ and $b^{\omega}_B$ are the current accelerometer and gyro biases, respectively.

The accelerometer and gyro biases are expressed in the $B$ frame, which is the IMU's reference frame. The transform between the $B$ frame and the $C$ frame, which are both body-fixed coordinate systems, is assumed to be fixed and either measured or calibrated ahead of time. This transformation is given by

$$(\cdot)^C = R^B_C (\cdot)^B - x^B_C \tag{3}$$

where $R^B_C$ is the rotation matrix relating the two coordinate systems and $x^B_C$ is the location of the IMU in the $C$ frame.

While this transformation could be estimated on-the-fly instead of simply measured or calibrated ahead of time, the transformation is only weakly observable and does not need to be known with great precision, since the estimator does not rely heavily on the accuracy of this transformation.
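
A minimal numerical illustration of Eq. (3), applying the fixed B-to-C transform to an IMU-frame vector, is given below. The calibration values R_BC and x_B_C are made up for the example.

```python
# Minimal numerical illustration of the fixed B-to-C transform of Eq. (3),
# applied to a vector expressed in the IMU (B) frame.  R_BC and x_B_C are
# assumed calibration values, not values from the paper.
import numpy as np

def b_to_c(v_B, R_BC, x_B_C):
    """Express a B-frame quantity in the C frame per Eq. (3)."""
    return R_BC @ v_B - x_B_C

# 90-degree rotation about the camera z-axis and a small lever arm (meters).
R_BC = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
x_B_C = np.array([0.02, 0.00, -0.05])   # assumed IMU location in the C frame
print(b_to_c(np.array([1.0, 0.0, 0.0]), R_BC, x_B_C))
```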

IV. MEASUREMENT MODELS

This section presents the measurement models employed by both the bundle adjustment and the navigation filter. As a matter of notation for this paper, parameters, when substituted into models, will be denoted with either a bar, $\bar{(\cdot)}$, for a priori estimates or a hat, $\hat{(\cdot)}$, for a posteriori estimates. Any parameter without these accents is the true value of that parameter. When a state vector or an element of a state vector has a delta in front of it, $\delta(\cdot)$, this represents a linearized correction term to the current value of that state variable. The same accent rules also apply to delta states.

Before presenting the measurement models, it is appropriate to discuss how the quaternions representing the attitude of the camera will be handled within the estimator. Quaternions are a non-minimal attitude representation that is constrained to have unit norm. To enforce this constraint, the quaternion elements of the state are replaced in the state with a minimal attitude representation, generally denoted as $\delta e$, during measurement updates and state propagation [15]. This is accomplished through the use of differential quaternions, which represent a small rotation from the current attitude to give an updated estimate of the attitude through the equation

$$q' = \delta q(\delta e) \otimes q \tag{4}$$

where $q'$ is the updated attitude estimate, $\otimes$ represents quaternion multiplication, and $\delta q(\delta e)$ is the differential quaternion, which is closely approximated as follows:

$$\delta q(\delta e) = \begin{bmatrix} e\sin\left(\frac{\delta\theta}{2}\right) \\ \cos\left(\frac{\delta\theta}{2}\right) \end{bmatrix} \approx \begin{bmatrix} e\,\frac{\delta\theta}{2} \\ \sqrt{1 - \left\|e\,\frac{\delta\theta}{2}\right\|^2} \end{bmatrix} = \begin{bmatrix} \delta e \\ \sqrt{1 - \|\delta e\|^2} \end{bmatrix} \tag{5}$$

where $\cos\left(\frac{\delta\theta}{2}\right)$ is approximated as $\sqrt{1 - \|\delta e\|^2}$ instead of the typical 1 to comply with the quaternion constraint. This approximation allows for reduction of the quaternion to a minimal three-element representation, $\delta e$, and is useful for preserving the quaternion constraint in an estimator, as shown in [15]. During initial convergence of an estimator, the assumption that $\delta\theta$ is small may be violated and could cause $\sqrt{1 - \|\delta e\|^2}$ to become imaginary. To protect against this scenario, a less accurate form of the differential quaternion is used whenever $\|\delta e\|^2 > 1$. This form of the differential quaternion is

$$\delta q(\delta e) = \frac{1}{\sqrt{1 + \|\delta e\|^2}} \begin{bmatrix} \delta e \\ 1 \end{bmatrix} \tag{6}$$

This completely specifies the multiplicative update to the quaternion. All other states are updated in the typical additive fashion

$$(\cdot)' = (\cdot) + \delta(\cdot) \tag{7}$$
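
The multiplicative update of Eqs. (4)-(7) can be sketched as follows. A scalar-last Hamilton quaternion convention is assumed here, which may differ from the convention used in the paper.

```python
# Sketch of the multiplicative attitude update of Eqs. (4)-(7), assuming a
# scalar-last Hamilton quaternion convention [x, y, z, w].
import numpy as np

def delta_q(delta_e):
    """Differential quaternion of Eq. (5), falling back to Eq. (6) when the
    small-angle form would make the scalar part imaginary."""
    n2 = float(np.dot(delta_e, delta_e))
    if n2 <= 1.0:
        return np.concatenate([delta_e, [np.sqrt(1.0 - n2)]])        # Eq. (5)
    return np.concatenate([delta_e, [1.0]]) / np.sqrt(1.0 + n2)      # Eq. (6)

def quat_mult(p, q):
    """Hamilton product p (x) q with scalar-last storage."""
    pv, ps = p[:3], p[3]
    qv, qs = q[:3], q[3]
    return np.concatenate([ps * qv + qs * pv + np.cross(pv, qv),
                           [ps * qs - np.dot(pv, qv)]])

q = np.array([0.0, 0.0, 0.0, 1.0])              # current attitude estimate
q_new = quat_mult(delta_q(np.array([0.01, -0.02, 0.005])), q)   # Eq. (4)
print(q_new, np.linalg.norm(q_new))             # remains (near) unit norm
```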

A. CDGPS Position Measurements

The CDGPS filter provides estimates of the position of the GPS antenna that are accurate to within a couple centimeters. One important observation is that these are not estimates of the position of the camera lens. Therefore, the position of the GPS antenna relative to the camera lens, which is assumed to be fixed and measured or calibrated ahead of time, must be taken into account. Unlike the transformation between the $B$ frame and the $C$ frame described previously, it is of prime importance that the position of the GPS antenna in the $C$ frame be known as accurately as possible because any errors in this parameter directly translate to errors in the estimated state. Thankfully, this parameter is observable provided that the system is rotated at least somewhat in all directions. If it is not possible to measure this vector to at least millimeter accuracy, then a calibration procedure can be defined using the bundle adjustment approach presented in this paper with a state vector that is augmented with the position of the GPS antenna in the $C$ frame.

These CDGPS position estimates can be modeled as follows

$$x^A_G = h_x\left(x^C_G, q^C_G\right) + w_x = x^C_G + R\left(q^C_G\right) x^A_C + w_x \tag{8}$$

where $h_x(\cdot)$ is the non-linear measurement model for the CDGPS position estimates, $R(\cdot)$ is the rotation matrix corresponding to the argument, $x^A_C$ is the position of the GPS antenna in the $C$ frame, and $w_x$ is zero-mean Gaussian white noise with covariance matrix given by the CDGPS filter. The non-zero partial derivatives of this model with respect to the state variables are

$$\left.\frac{\partial h_x\left(x^C_G, q^C_G\right)}{\partial x^C_G}\right|_{\bar{X}} = I \tag{9}$$

$$\left.\frac{\partial h_x\left(x^C_G, q^C_G\right)}{\partial \delta e^C_G}\right|_{\bar{X}} = 2\left[\left(R\left(q^C_G\right) x^A_C\right)\times\right] \tag{10}$$

where $I$ is the identity matrix and $[(\cdot)\times]$ is the cross product equivalent matrix of the argument defined as

$$[x\times] = \begin{bmatrix} 0 & -x_3 & x_2 \\ x_3 & 0 & -x_1 \\ -x_2 & x_1 & 0 \end{bmatrix} \tag{11}$$

where $x_i$ is the $i$th element of $x$.
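
A sketch of the CDGPS measurement model and its Jacobians, Eqs. (8)-(11), is given below. The rotation matrix corresponding to $R(q^C_G)$ is passed in directly to sidestep quaternion-convention issues, and the antenna lever arm is an assumed calibration value.

```python
# Sketch of the CDGPS position measurement model of Eq. (8) and its
# Jacobians, Eqs. (9)-(11).  R_CG plays the role of R(q_G^C); x_A_C is an
# assumed antenna lever arm, not a value from the paper.
import numpy as np

def cross_matrix(x):
    """Cross-product-equivalent matrix of Eq. (11)."""
    return np.array([[0.0, -x[2], x[1]],
                     [x[2], 0.0, -x[0]],
                     [-x[1], x[0], 0.0]])

def h_x(x_C_G, R_CG, x_A_C):
    """Modeled GPS antenna position, Eq. (8) without the noise term."""
    return x_C_G + R_CG @ x_A_C

def h_x_jacobians(R_CG, x_A_C):
    """Partials with respect to camera position and attitude correction."""
    d_dx = np.eye(3)                                   # Eq. (9)
    d_de = 2.0 * cross_matrix(R_CG @ x_A_C)            # Eq. (10)
    return d_dx, d_de

x_A_C = np.array([0.0, -0.10, 0.05])   # assumed antenna location in C frame (m)
print(h_x(np.array([6378137.0, 0.0, 0.0]), np.eye(3), x_A_C))
print(h_x_jacobians(np.eye(3), x_A_C)[1])
```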

B. Image Feature Measurements

To simplify the model for the image feature measurements, it is assumed that a calibrated camera is used and that any distortion caused by the lens is removed by passing the raw measurements through the inverted distortion model prior to passing the measurements to the estimators. This allows the bundle adjustment and the navigation filter to be agnostic to the distortion model used for the camera. For the simulations presented in Sec. VII, the field-of-view (FOV) model for a fish-eye lens from [16] was employed using parameters calibrated from a real camera using the calibration procedure reported in [11]. The primary effect of using the distortion model in the simulations is to properly model the error covariances on the measurements.

Once the lens distortions have been removed from the raw measurements, what remains is the result of a perspective projection. A perspective projection, also known as a central projection, projects a view of a three-dimensional scene onto an image plane, normal to the camera bore-sight and located 1 unit in front of the camera, through rays connecting three-dimensional locations and a center of projection. A perspective projection can be expressed mathematically as

$$s^{p_j}_I = h_s\left(x^{p_j}_C\right) + w_{p_j} = \begin{bmatrix} \dfrac{x^{p_j}_C}{z^{p_j}_C} & \dfrac{y^{p_j}_C}{z^{p_j}_C} \end{bmatrix}^T + w_{p_j} \tag{12}$$

where $s^{p_j}_I$ is the distortion-free image feature measurement for the $j$th feature, $h_s(\cdot)$ is the perspective projection function, $w_{p_j}$ is zero-mean Gaussian white noise with covariance matrix given by the feature identifier, and $x^{p_j}_C$ is related to the state variables through the equation

$$x^{p_j}_C = \begin{bmatrix} x^{p_j}_C \\ y^{p_j}_C \\ z^{p_j}_C \end{bmatrix} = \left(R\left(q^C_G\right)\right)^T \left(x^{p_j}_G - x^C_G\right) \tag{13}$$

The non-zero partial derivatives of this model with respect to the state variables are

$$\left.\frac{\partial h_s\left(x^{p_j}_C\right)}{\partial x^C_G}\right|_{\bar{X}} = -\left.\frac{\partial h_s\left(x^{p_j}_C\right)}{\partial x^{p_j}_C}\right|_{\bar{X}} \left(R\left(q^C_G\right)\right)^T \tag{14}$$

$$\left.\frac{\partial h_s\left(x^{p_j}_C\right)}{\partial \delta e^C_G}\right|_{\bar{X}} = -2\left.\frac{\partial h_s\left(x^{p_j}_C\right)}{\partial x^{p_j}_C}\right|_{\bar{X}} \left(R\left(q^C_G\right)\right)^T \left[\left(x^{p_j}_G - x^C_G\right)\times\right] \tag{15}$$

$$\left.\frac{\partial h_s\left(x^{p_j}_C\right)}{\partial x^{p_j}_G}\right|_{\bar{X}} = \left.\frac{\partial h_s\left(x^{p_j}_C\right)}{\partial x^{p_j}_C}\right|_{\bar{X}} \left(R\left(q^C_G\right)\right)^T \tag{16}$$

where $\left.\frac{\partial h_s\left(x^{p_j}_C\right)}{\partial x^{p_j}_C}\right|_{\bar{X}}$ is given by

$$\left.\frac{\partial h_s\left(x^{p_j}_C\right)}{\partial x^{p_j}_C}\right|_{\bar{X}} = \begin{bmatrix} \dfrac{1}{z^{p_j}_C} & 0 & \dfrac{-x^{p_j}_C}{\left(z^{p_j}_C\right)^2} \\ 0 & \dfrac{1}{z^{p_j}_C} & \dfrac{-y^{p_j}_C}{\left(z^{p_j}_C\right)^2} \end{bmatrix} \tag{17}$$
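
The projection model of Eqs. (12)-(13) and the partial of Eq. (17) can be sketched as follows; the chain-rule factors of Eqs. (14)-(16) are obtained by multiplying this 2x3 partial by the indicated rotation and cross-product terms. The numbers are illustrative only.

```python
# Sketch of the perspective-projection measurement model of Eqs. (12)-(13)
# and the 2x3 partial of Eq. (17).
import numpy as np

def project(x_p_G, x_C_G, R_CG):
    """Camera-frame feature position (Eq. 13) and its projection (Eq. 12)."""
    x_p_C = R_CG.T @ (x_p_G - x_C_G)
    x, y, z = x_p_C
    return np.array([x / z, y / z]), x_p_C

def dproj_dxC(x_p_C):
    """Partial of the projection with respect to the camera-frame feature
    position, Eq. (17)."""
    x, y, z = x_p_C
    return np.array([[1.0 / z, 0.0, -x / z**2],
                     [0.0, 1.0 / z, -y / z**2]])

s, x_p_C = project(np.array([2.0, 1.0, 10.0]), np.zeros(3), np.eye(3))
print(s)                    # image-plane coordinates
print(dproj_dxC(x_p_C))     # 2x3 Jacobian used in Eqs. (14)-(16)
```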

C. Inertial Measurements

The inertial measurements consist of 3-axis accelerometer measurements and 3-axis gyro measurements and are used in the navigation filter to aid in propagating the state forward in time. The measurement models presented in this section simply model the relations between these measurements and the acceleration and angular velocity of the IMU using state variables, and should not be interpreted as modeling the dynamics of the state. Filter state dynamics models will be presented in Sec. VI-B.

1) Accelerometer Measurements: A subtle, but extremely important, point regarding accelerometers is that they measure the specific force they experience and not the acceleration. This means that accelerometer measurements include gravitational acceleration, which must be subtracted out. A walking bias term, which was included in the filter's state vector, is also present in these measurements and is typically modeled by the first-order Gauss-Markov process

$$\dot{b}^f_B = \nu^f_2 \tag{18}$$

where $\nu^f_2$ is zero-mean Gaussian white noise with a diagonal covariance matrix, $\sigma^2_{f2} I$. The covariance of $\nu^f_2$ can be obtained from the IMU specifications. The acceleration of the IMU with respect to the global frame can be modeled as

$$a^B_G = R\left(q^C_G\right) R^B_C \left(f_B - b^f_B\right) - \frac{G_E}{\left\|x^C_G\right\|^3} x^C_G + \nu^f_1 \tag{19}$$

where $f_B$ is the accelerometer measurement, $G_E$ is the gravitational constant of Earth, and $\nu^f_1$ is zero-mean Gaussian white noise with a diagonal covariance matrix, $\sigma^2_{f1} I$. The covariance of $\nu^f_1$ can be obtained from the IMU specifications. Note that using $x^C_G$ in the gravity term is an approximation to $x^C_G + R\left(q^C_G\right) x^B_C$, but the term $R\left(q^C_G\right) x^B_C$ is extremely small compared to $x^C_G$ and, thus, negligible.

2) Gyro Measurements: Like the accelerometer measurements, the gyro measurements have a walking bias term, which was included in the filter's state vector. This bias term is also typically modeled by the first-order Gauss-Markov process

$$\dot{b}^{\omega}_B = \nu^{\omega}_2 \tag{20}$$

where $\nu^{\omega}_2$ is zero-mean Gaussian white noise with a diagonal covariance matrix, $\sigma^2_{\omega2} I$. The covariance of $\nu^{\omega}_2$ can be obtained from the IMU specifications. The angular velocity of the IMU, which is also the angular velocity of the camera, can be modeled as

$$\omega^C_G = R\left(q^C_G\right) R^B_C \left(\omega_B - b^{\omega}_B\right) + \nu^{\omega}_1 \tag{21}$$

where $\omega_B$ is the gyro measurement and $\nu^{\omega}_1$ is zero-mean Gaussian white noise with a diagonal covariance matrix, $\sigma^2_{\omega1} I$. The covariance of $\nu^{\omega}_1$ can be obtained from the IMU specifications.
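
Inverting the IMU measurement models of Eqs. (19) and (21) to recover acceleration and angular velocity from bias-corrected measurements can be sketched as below. The bias values, attitudes, and calibration matrices are illustrative assumptions.

```python
# Sketch inverting the IMU measurement models of Eqs. (19) and (21), ignoring
# the noise terms.  G_E is Earth's gravitational constant; all other numeric
# values are illustrative.
import numpy as np

G_E = 3.986004418e14  # m^3/s^2

def accel_from_specific_force(f_B, b_f, R_CG, R_BC, x_C_G):
    """Eq. (19) without the noise term: rotate the bias-corrected specific
    force into the global frame and subtract gravitational acceleration."""
    gravity = -G_E / np.linalg.norm(x_C_G)**3 * x_C_G
    return R_CG @ R_BC @ (f_B - b_f) + gravity

def omega_from_gyro(omega_B, b_w, R_CG, R_BC):
    """Eq. (21) without the noise term."""
    return R_CG @ R_BC @ (omega_B - b_w)

x_C_G = np.array([6378137.0, 0.0, 0.0])             # roughly on the ECEF x-axis
f_B = np.array([9.80, 0.0, 0.0])                    # nearly cancels gravity
print(accel_from_specific_force(f_B, np.zeros(3), np.eye(3), np.eye(3), x_C_G))
print(omega_from_gyro(np.array([0.0, 0.0, 0.01]), np.zeros(3), np.eye(3), np.eye(3)))
```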

V. BUNDLE ADJUSTMENT

Bundle adjustment is a batch estimation procedure employed by many visual SLAM algorithms that takes advantage of the inherent sparsity of the visual SLAM problem. By exploiting this sparse structure, the computational complexity of bundle adjustment is linear in the number of image features and cubic in the number of images included in the bundle adjustment. Compared to a sequential estimator, which has computational complexity that is cubic in the number of image features and linear in the number of images, this is a significant improvement because the highest accuracy per CPU cycle for the visual SLAM problem comes from including more image features rather than images [14]. In other words, the compromise of reducing the number of frames incorporated into the bundle adjustment to maintain real-time operation is preferred to reducing the number of image features, which is required for a sequential estimator to operate in real-time. As discussed in Sec. II-E, the frames included in the bundle adjustment are selected based on their geographic diversity and are referred to as keyframes.

The bundle adjustment algorithm presented in this section also incorporates CDGPS position estimates interpolated to the keyframes, which serve to anchor the bundle adjustment solution to a global reference frame. For ease of notation, the equations presented in this section will assume that every keyframe has an associated CDGPS position estimate and every image feature is seen in each keyframe. Obviously, this is not a requirement of the bundle adjustment, and the terms in these equations corresponding to any non-existent measurements are simply ignored by the bundle adjustment. For observability of the globally-referenced visual SLAM problem, four non-coplanar image features seen in three keyframes that have corresponding globally-referenced position estimates are sufficient, provided that the camera positions for these three keyframes are not collinear [12].

This section begins by defining the robust objective function to be minimized by the bundle adjustment. This objective function is linearized and conditions for a minimum of the objective are presented. Then, the sparse Levenberg-Marquardt algorithm used to minimize the objective function is presented. Finally, a novel technique for initialization of the globally-referenced bundle adjustment is detailed.

A. Objective Function

The objective function most commonly used in estimation algorithms is the sum of the squares of the measurement residuals, which is commonly referred to as the least-squares objective function. The cost function for a least-squares estimator is defined as

$$\rho_{LS}(r) = \frac{r^2}{2} \tag{22}$$

where $r$ is the normalized measurement residual. This cost function performs well for the CDGPS position measurements because the measurement error distribution is well approximated by a normal distribution.

The image feature measurements, on the other hand, often have large outliers, due to mismatches of the point features and moving point features in the images, and thus the measurement error distribution has much larger tails than a normal distribution. This necessitates the use of a cost function which is robust to outliers in order to obtain an accurate estimate of the state. Ideally, outliers should be entirely suppressed. This can be accomplished using the Tukey bi-weight cost function [17] given as

$$\rho_{T}(r) = \begin{cases} \dfrac{c^2}{6}\left(1 - \left(1 - \left(\dfrac{r}{c}\right)^2\right)^3\right) & ; \ |r| \le c \\ \dfrac{c^2}{6} & ; \ |r| > c \end{cases} \tag{23}$$

where $c$ is a constant typically set to 4.6851 in order to obtain 95% asymptotic efficiency on the standard normal distribution. However, the Tukey bi-weight cost function is not convex, which can lead to difficulties with convergence if the initial guess for the state is too far from the global minimum of the objective function. Another option is the Huber cost function [17] which does not completely suppress outliers, but is convex and significantly reduces the effect of outliers. The Huber cost function is given as

$$\rho_{H}(r) = \begin{cases} \dfrac{r^2}{2} & ; \ |r| \le k \\ k\left(|r| - \dfrac{k}{2}\right) & ; \ |r| > k \end{cases} \tag{24}$$

where $k$ is a constant typically set to 1.345 in order to obtain 95% asymptotic efficiency on the standard normal distribution.
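
For reference, the three cost functions of Eqs. (22)-(24) can be written compactly as follows, using the tuning constants quoted above. This is an illustrative sketch, not code from the paper.

```python
# Sketch of the least-squares, Tukey bi-weight, and Huber cost functions of
# Eqs. (22)-(24) with the tuning constants quoted in the text.
import numpy as np

def rho_ls(r):
    return r**2 / 2.0                                            # Eq. (22)

def rho_tukey(r, c=4.6851):                                      # Eq. (23)
    r = np.abs(r)
    return np.where(r <= c,
                    c**2 / 6.0 * (1.0 - (1.0 - (r / c)**2)**3),
                    c**2 / 6.0)

def rho_huber(r, k=1.345):                                       # Eq. (24)
    r = np.abs(r)
    return np.where(r <= k, r**2 / 2.0, k * (r - k / 2.0))

residuals = np.array([0.5, 2.0, 10.0])
print(rho_ls(residuals), rho_huber(residuals), rho_tukey(residuals))
# The Tukey cost saturates for the large outlier; the Huber cost grows linearly.
```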

Huber proposed in [17] that one could begin with the Huber cost function to obtain an estimate of the state with the outliers somewhat suppressed and then perform a few iterations using the Tukey bi-weight cost function to obtain an estimate of the state with completely suppressed outliers. Once enough keyframes are obtained, the initial guesses for the pose of new keyframes and the positions of new point features become so precise that there is no need to begin with the Huber cost function because the initial guesses are close enough that the non-convexity of the Tukey bi-weight cost function is not an issue. At this point, one can transition to simply starting with the Tukey bi-weight cost function. This will be the approach taken in this paper.

In summary, the estimation procedure will first employ the objective function given as

$$f_1\left(X_{BA}\right) = \sum_{i=1}^{N}\left[\rho_{LS}\left(\left\|\Delta x^{A_i}_G\right\|\right) + \sum_{j=1}^{M}\rho_{H}\left(\left\|\Delta s^{p_j}_{I_i}\right\|\right)\right] \tag{25}$$

where $\Delta x^{A_i}_G$ and $\Delta s^{p_j}_{I_i}$ are the normalized measurement residuals defined as

$$\Delta x^{A_i}_G = R^{-1/2}_{x^{A_i}_G}\left(\tilde{x}^{A_i}_G - x^{A_i}_G\right) \tag{26}$$

$$\Delta s^{p_j}_{I_i} = R^{-1/2}_{s^{p_j}_{I_i}}\left(\tilde{s}^{p_j}_{I_i} - s^{p_j}_{I_i}\right) \tag{27}$$

where $\tilde{x}^{A_i}_G$ and $\tilde{s}^{p_j}_{I_i}$ are the actual CDGPS position and image feature measurements, respectively, and $R^{-1/2}_{x^{A_i}_G}$ and $R^{-1/2}_{s^{p_j}_{I_i}}$ are the inverses of the Cholesky factorizations of the measurement covariance matrices for the CDGPS position and image feature measurements, respectively. After the estimation procedure converges to a solution to the problem $\operatorname{argmin}_{X_{BA} \in S} f_1\left(X_{BA}\right)$, where $S = \left\{X_{BA} \in \mathbb{R}^{7N+3M} \,\middle|\, \left|q^{C_i}_G\right| = 1 \text{ for } i = 1, 2, \ldots, N\right\}$, the objective function is changed to the following

$$f_2\left(X_{BA}\right) = \sum_{i=1}^{N}\left[\rho_{LS}\left(\left\|\Delta x^{A_i}_G\right\|\right) + \sum_{j=1}^{M}\rho_{T}\left(\left\|\Delta s^{p_j}_{I_i}\right\|\right)\right] \tag{28}$$

The estimation procedure then determines the solution to the problem $\operatorname{argmin}_{X_{BA} \in S} f_2\left(X_{BA}\right)$ using the solution to $\operatorname{argmin}_{X_{BA} \in S} f_1\left(X_{BA}\right)$ as an initial guess. After enough keyframes are obtained, the estimation procedure will simply skip minimizing the first objective function and go straight to the second.

B. Linearized Objective

The optimization problem defined in the previous section is a non-linear, equality-constrained optimization problem. The equality constraints on the quaternions are enforced by reducing the quaternion to minimal form during the iterative updates, as described in Sec. IV, resulting in an unconstrained, non-linear optimization problem at each iteration. At each iteration of the solution procedure, the objective function is linearized about the current best estimate of the state to obtain a linear, unconstrained optimization problem. The normalized measurement residuals are linearized using the partial derivatives of the measurement models defined in Sec. IV as follows

$$\begin{aligned} \Delta x^{A_i}_G &\approx \Delta\bar{x}^{A_i}_G - \left.H^{A_i}\right|_{\bar{X}_{BA}} \delta X_{BA} \\ &= \Delta\bar{x}^{A_i}_G - R^{-1/2}_{x^{A_i}_G}\left(\left.\frac{\partial h_x\left(x^{C_i}_G, q^{C_i}_G\right)}{\partial x^{C_i}_G}\right|_{\bar{X}_{BA}} \delta x^{C_i}_G + \left.\frac{\partial h_x\left(x^{C_i}_G, q^{C_i}_G\right)}{\partial \delta e^{C_i}_G}\right|_{\bar{X}_{BA}} \delta e^{C_i}_G\right) \end{aligned} \tag{29}$$

$$\begin{aligned} \Delta s^{p_j}_{I_i} &\approx \Delta\bar{s}^{p_j}_{I_i} - \left.H^{p_j}_{I_i}\right|_{\bar{X}_{BA}} \delta X_{BA} \\ &= \Delta\bar{s}^{p_j}_{I_i} - R^{-1/2}_{s^{p_j}_{I_i}}\left(\left.\frac{\partial h_s\left(x^{p_j}_{C_i}\right)}{\partial x^{C_i}_G}\right|_{\bar{X}_{BA}} \delta x^{C_i}_G + \left.\frac{\partial h_s\left(x^{p_j}_{C_i}\right)}{\partial \delta e^{C_i}_G}\right|_{\bar{X}_{BA}} \delta e^{C_i}_G + \left.\frac{\partial h_s\left(x^{p_j}_{C_i}\right)}{\partial x^{p_j}_G}\right|_{\bar{X}_{BA}} \delta x^{p_j}_G\right) \end{aligned} \tag{30}$$

Since the objective function is being linearized, the terms involving the image feature measurements in both objectives can be equivalently written as weighted least-squares terms of the form

$$\frac{1}{2} w_V\left(\left\|\Delta\bar{s}^{p_j}_{I_i}\right\|\right) \left\|\Delta\bar{s}^{p_j}_{I_i} - \left.H^{p_j}_{I_i}\right|_{\bar{X}_{BA}} \delta X_{BA}\right\|^2 \tag{31}$$

where $w_V(\cdot)$ is a weight function for the vision measurements that is defined based on the cost function. For the Huber and Tukey cost functions, the corresponding weight functions are

$$w_H(r) = \begin{cases} 1 & ; \ |r| \le k \\ \dfrac{k}{|r|} & ; \ |r| > k \end{cases} \tag{32}$$

$$w_T(r) = \begin{cases} \left(1 - \left(\dfrac{r}{c}\right)^2\right)^2 & ; \ |r| \le c \\ 0 & ; \ |r| > c \end{cases} \tag{33}$$

C. Conditions for a Minimum of the Linearized Objective

The first-order necessary conditions for a minimum of the objective function state that the derivative with respect to the state must be zero. This condition results in the following set of linear equations, commonly referred to as the normal equations, for the linearized objective function

$$\begin{aligned} &\sum_{i=1}^{N}\left[\left(\left.H^{A_i}\right|_{\bar{X}_{BA}}\right)^T \left.H^{A_i}\right|_{\bar{X}_{BA}} + \sum_{j=1}^{M} w_V\left(\left\|\Delta\bar{s}^{p_j}_{I_i}\right\|\right)\left(\left.H^{p_j}_{I_i}\right|_{\bar{X}_{BA}}\right)^T \left.H^{p_j}_{I_i}\right|_{\bar{X}_{BA}}\right] \delta X_{BA} \\ &\qquad = \sum_{i=1}^{N}\left[\left(\left.H^{A_i}\right|_{\bar{X}_{BA}}\right)^T \Delta\bar{x}^{A_i}_G + \sum_{j=1}^{M} w_V\left(\left\|\Delta\bar{s}^{p_j}_{I_i}\right\|\right)\left(\left.H^{p_j}_{I_i}\right|_{\bar{X}_{BA}}\right)^T \Delta\bar{s}^{p_j}_{I_i}\right] \end{aligned} \tag{34}$$

The terms in the summations in Eq. 34 are extremely sparse and should be treated as such for computational efficiency by ignoring all known zero elements in the summation. The terms corresponding to the CDGPS measurements only have non-zero elements in the rows and columns corresponding to the camera pose for that measurement. The terms corresponding to the image feature measurements only have non-zero elements in the rows and columns corresponding to the camera pose and point feature for that measurement. After performing the summations, the resulting matrix equation is of the form

$$\begin{bmatrix} U & W \\ W^T & V \end{bmatrix} \begin{bmatrix} \delta c \\ \delta p \end{bmatrix} = \begin{bmatrix} \epsilon_c \\ \epsilon_p \end{bmatrix} \tag{35}$$

where $\delta c$ and $\delta p$ represent the portions of the state update vector corresponding to the camera poses and the point feature locations, respectively. The matrices $U$ and $V$ in Eq. 35 are both block diagonal, with $U$ having six-by-six blocks corresponding to each camera pose and $V$ having three-by-three blocks corresponding to each point feature location. The matrix $W$ is dense and corresponds to the cross terms between the camera poses and the point features observed from those camera poses.

The matrix coefficient in Eq. 35 is also the second derivative of the objective function with respect to the state. The second-order sufficient condition for a minimum of the objective function states that this second derivative must be positive definite. The second derivative matrix is equal to a sum of positive semi-definite matrices and is thus at least positive semi-definite. So long as the full state is observable, this matrix will be positive definite. Conditions for observability are stated in [12] and are repeated here without proof. Sufficient conditions for observability of the camera poses are that four non-coplanar point features are observed in three images taken from camera locations that are not collinear. Once the sufficient conditions for observability of the camera poses are satisfied, a necessary and sufficient condition for observability of the remaining point feature locations is that each point feature is observed in at least two images where the lines between each camera location and the point feature location are not the same line. Therefore, the solution to Eq. 35 provides the minimizer for the linearized objective function at each iteration provided that these observability conditions are satisfied.

D. Sparse Levenberg-Marquardt Algorithm

This optimization problem was solved using the sparse Levenberg-Marquardt (LM) algorithm from Appendix 6 of [18]. This algorithm first inflates the diagonal elements of U and V by a multiplicative factor as follows

U^*_{ij} = \begin{cases} (1+\lambda)\, U_{ii} & i = j \\ U_{ij} & \text{otherwise} \end{cases} \qquad V^*_{ij} = \begin{cases} (1+\lambda)\, V_{ii} & i = j \\ V_{ij} & \text{otherwise} \end{cases}   (36)

Upon completion of an iteration, the value of the objective function is checked to see whether it decreased. If the objective increased, then λ is multiplied by a certain factor and the iteration is repeated. This has the effect of decreasing the step size and bringing the solution closer to that which would be obtained if each state variable were optimized independently. If the objective decreased, then λ is divided by the same factor and the iteration is accepted. A common value for the update factor for λ is 10.
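This accept/reject logic can be summarized by the following Python sketch; the functions objective and solve_damped are hypothetical placeholders for evaluating the bundle-adjustment cost and for solving the damped normal equations of Eqs. 36-38.

def lm_iterate(objective, solve_damped, X, lam, factor=10.0, max_retries=20):
    """One accept/reject cycle of the damping logic described above.

    objective(X)         : evaluates the bundle-adjustment cost
    solve_damped(X, lam) : returns the state update from Eqs. 36-38
    Both are hypothetical placeholders for the machinery in this section.
    """
    f0 = objective(X)
    for _ in range(max_retries):
        dX = solve_damped(X, lam)
        X_new = X + dX     # quaternion states would use a multiplicative update
        if objective(X_new) < f0:
            return X_new, lam / factor    # accept the step and relax the damping
        lam *= factor                     # reject the step and increase the damping
    return X, lam                         # no acceptable step within the retry budget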

To solve Eq. 35 after replacing U and V with U* and V*, the algorithm exploits the block diagonal structure of V* in Eq. 35 by pre-multiplying the equation by a special matrix as follows

\begin{bmatrix} I & -W (V^*)^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} U^* & W \\ W^T & V^* \end{bmatrix} \begin{bmatrix} \delta c \\ \delta p \end{bmatrix} = \begin{bmatrix} U^* - W (V^*)^{-1} W^T & 0 \\ W^T & V^* \end{bmatrix} \begin{bmatrix} \delta c \\ \delta p \end{bmatrix} = \begin{bmatrix} I & -W (V^*)^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} \epsilon_c \\ \epsilon_p \end{bmatrix}   (37)

where the matrix $(V^*)^{-1}$ is a block diagonal matrix composed of three-by-three blocks that are the inverses of the corresponding blocks in V*. Therefore, the state update vector can be determined as follows

\delta c = \left( U^* - W (V^*)^{-1} W^T \right)^{-1} \left( \epsilon_c - W (V^*)^{-1} \epsilon_p \right) \qquad \delta p = (V^*)^{-1} \left( \epsilon_p - W^T \delta c \right)   (38)

If implemented properly to exploit the sparsity of $(V^*)^{-1}$, this algorithm has computational complexity that is linear in the number of point features and cubic in the number of camera poses. With only a few minor changes, one could rewrite this algorithm such that it has computational complexity that is linear in the number of camera poses and cubic in the number of point features. However, the number of point features being tracked at one time tends to number in the thousands for visual SLAM algorithms, while the number of camera poses is typically kept in the tens or hundreds. This choice is based on the fact that tracking more point features yields higher accuracy than retaining more keyframes, as shown by Strasdat in [14].
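The following Python (NumPy) sketch illustrates the damped Schur-complement solve of Eqs. 36-38 under the assumption that U is handled densely and V is supplied as a list of its 3×3 blocks; a production implementation would additionally exploit the block structure of U and the sparsity of W rather than forming dense arrays as done here for clarity.

import numpy as np

def sparse_lm_solve(U, W, V_blocks, eps_c, eps_p, lam):
    """Damped Schur-complement solve of Eqs. 36-38 for one LM iteration.

    U        : (6N, 6N) camera-pose block (handled densely here for clarity)
    W        : (6N, 3M) camera/point-feature cross terms
    V_blocks : list of M (3, 3) point-feature blocks (block-diagonal V)
    eps_c, eps_p : right-hand sides of Eq. 35
    lam      : Levenberg-Marquardt damping parameter
    """
    M = len(V_blocks)
    # Inflate the diagonals of U and V (Eq. 36).
    U_star = U + lam * np.diag(np.diag(U))
    V_star_inv = np.zeros((3 * M, 3 * M))
    for j, Vb in enumerate(V_blocks):
        Vb_star = Vb + lam * np.diag(np.diag(Vb))
        V_star_inv[3*j:3*j+3, 3*j:3*j+3] = np.linalg.inv(Vb_star)  # 3x3 inverses only

    # Reduced camera system and back-substitution (Eqs. 37-38).
    WVinv = W @ V_star_inv
    S = U_star - WVinv @ W.T
    delta_c = np.linalg.solve(S, eps_c - WVinv @ eps_p)
    delta_p = V_star_inv @ (eps_p - W.T @ delta_c)
    return delta_c, delta_p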

Convergence of the solution is determined by comparing the norm of the state update vector and the change in the objective at the end of each iteration to threshold values. If both threshold checks are passed, then the algorithm is declared to have converged and the covariance of the state estimate is computed. This covariance is given by the following equation

P_{BA} = \begin{bmatrix} P_{cc} & -P_{cc} W V^{-1} \\ -V^{-T} W^T P_{cc}^T & V^{-T} W^T P_{cc} W V^{-1} + V^{-1} \end{bmatrix}   (39)

where $P_{cc} = \left( U - W V^{-1} W^T \right)^{-1}$. Note that the entire covariance matrix is not computed by the bundle adjustment since only the 3×3 blocks on the diagonal of the bottom-right sub-matrix are required by the navigation filter, which will be discussed in Sec. VI. Computing the entire covariance matrix would add significantly to the computational requirements of the bundle adjustment. By only computing the necessary elements of the covariance matrix, the covariance computation is comparable in cost to one additional LM iteration.
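For reference, the 3×3 diagonal blocks of the point-feature covariance in Eq. 39 can be extracted without forming the full matrix, as in the following sketch (variable names are illustrative):

import numpy as np

def point_feature_cov_blocks(Pcc, W, V_inv_blocks):
    """3x3 diagonal blocks of the point-feature covariance in Eq. 39.

    Pcc          : camera-pose covariance, (U - W V^-1 W^T)^-1
    W            : cross-term matrix from Eq. 35
    V_inv_blocks : list of the 3x3 inverses of the blocks of V
    """
    blocks = []
    for j, Vj_inv in enumerate(V_inv_blocks):
        Wj = W[:, 3*j:3*j+3]                  # columns tied to point feature j
        blocks.append(Vj_inv + Vj_inv.T @ Wj.T @ Pcc @ Wj @ Vj_inv)
    return blocks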

E. Initialization

A convenient and highly accurate method for initializing an algorithm for combined CDGPS and visual SLAM is to begin with a stand-alone visual SLAM algorithm. A stand-alone visual SLAM algorithm, such as PTAM [11], is capable of determining the camera poses and point feature locations to high accuracy in an arbitrarily-defined local reference frame up to a scale factor. This stand-alone visual SLAM mode can be used on start-up, when CDGPS position measurements are not yet available, and then transitioned into a combined CDGPS and visual SLAM mode.

In [19], Horn derived a closed-form solution to the least-squares problem of estimating the similarity transformation relating two coordinate systems given estimates of the locations of at least three points in space in both coordinate systems. This transform could be applied directly to the initialization problem at hand, using the camera position estimates from the stand-alone visual SLAM algorithm and the CDGPS position measurements, if the camera and the GPS antenna were collocated. However, this is physically unrealizable. The distance between the camera and GPS antenna could also easily be accounted for by shifting the camera position estimates from the stand-alone visual SLAM algorithm to the location of the GPS antenna if the coordinate frame for the visual SLAM estimates did not have a scale-factor ambiguity, but this is not the case. Therefore, a modified form of the Horn transform must be derived that accounts for this separation between sensors.

For this modified form of the Horn transform, it is assumed that the vector between the camera and the GPS antenna is known in the C frame and that attitude estimates of the system are available in the vision frame (i.e., the frame defined by the stand-alone visual SLAM algorithm), which is true for stand-alone visual SLAM. Given N stand-alone visual SLAM pose estimates and CDGPS position estimates, the objective function to be minimized is given as

g = \frac{1}{2} \sum_{i=1}^{N} \left\| \frac{1}{\sqrt{s}} \left( x_G^{A_i} - x_G^{V} - R\left(q_G^{V}\right) R\left(q_V^{C_i}\right) x_C^{A} \right) - \sqrt{s}\, R\left(q_G^{V}\right) x_V^{C_i} \right\|^2   (40)

where $s$, $x_G^V$, and $q_G^V$ are the scale factor, translation, and rotation, respectively, parameterizing the transform from the vision frame to the global coordinate system, and $x_V^{C_i}$ and $q_V^{C_i}$ are the position and attitude estimates, respectively, from the stand-alone visual SLAM algorithm. Expressing the error in a symmetric form with respect to the scale factor in Eq. 40 has the effect of weighting the errors in the measurements from both coordinate systems equally and results in a more convenient expression for the resulting solution for the transform [19].

The method for solving this estimation problem follows the same procedure as in [19], but results in some additional complications that prevent a completely analytical solution. First, the measurements are averaged and re-expressed as deviations from these averages as follows

\mu_{x_G^A} = \frac{1}{N} \sum_{i=1}^{N} x_G^{A_i} \qquad \mu_{x_V^C} = \frac{1}{N} \sum_{i=1}^{N} x_V^{C_i} \qquad \mu_{x_V^A} = \frac{1}{N} \sum_{i=1}^{N} R\left(q_V^{C_i}\right) x_C^{A}   (41)

\bar{x}_G^{A_i} = x_G^{A_i} - \mu_{x_G^A} \qquad \bar{x}_V^{C_i} = x_V^{C_i} - \mu_{x_V^C} \qquad \bar{x}_V^{A_i} = R\left(q_V^{C_i}\right) x_C^{A} - \mu_{x_V^A}   (42)

Substituting these relations into Eq. 40 for the measurements and noting that the linear terms in $\bar{x}_G^{A_i}$, $\bar{x}_V^{C_i}$, and $\bar{x}_V^{A_i}$ sum to zero results in the equation

g = \frac{1}{2} \sum_{i=1}^{N} \left[ \left\| \frac{1}{\sqrt{s}} \left( \bar{x}_G^{A_i} - R\left(q_G^{V}\right) \bar{x}_V^{A_i} \right) - \sqrt{s}\, R\left(q_G^{V}\right) \bar{x}_V^{C_i} \right\|^2 + \left\| \frac{1}{\sqrt{s}} \left( \mu_{x_G^A} - x_G^{V} - R\left(q_G^{V}\right) \mu_{x_V^A} \right) - \sqrt{s}\, R\left(q_G^{V}\right) \mu_{x_V^C} \right\|^2 \right]   (43)

The second term in Eq. 43 is the only term that contains the translation, $x_G^V$, and is a quadratic form. Therefore, this term can be minimized by setting the translation to

x_G^{V} = \mu_{x_G^A} - R\left(q_G^{V}\right) \left( s\,\mu_{x_V^C} + \mu_{x_V^A} \right)   (44)


However, this relation for the translation depends on the rotation and scale factor, which must still be determined. It is interesting to note that the translation is, quite sensibly, just the difference in the means of the positions of the GPS antenna in both coordinate systems. This is the same result as the original Horn transform from [19] with the addition of the term corresponding to the GPS antenna location relative to the camera.

After substituting this solution for the translation into Eq. 43, only the first term remains. This term is then expanded and the scale factor is pulled out of the summations, resulting in the following

g = \frac{1}{2} \left( \frac{1}{s} S_G - 2D + s\, S_V \right)   (45)

where

S_G = \sum_{i=1}^{N} \left\| \bar{x}_G^{A_i} - R\left(q_G^{V}\right) \bar{x}_V^{A_i} \right\|^2   (46)

S_V = \sum_{i=1}^{N} \left\| \bar{x}_V^{C_i} \right\|^2   (47)

D = \sum_{i=1}^{N} \left( \left( \bar{x}_G^{A_i} \right)^T R\left(q_G^{V}\right) - \left( \bar{x}_V^{A_i} \right)^T \right) \bar{x}_V^{C_i}   (48)

The scale factor can be determined by completing the square and setting that term equal to zero. This results in the following relation for the scale factor

s = \frac{\sqrt{S_G}}{\sqrt{S_V}} = \sqrt{ \frac{ \sum_{i=1}^{N} \left\| \bar{x}_G^{A_i} - R\left(q_G^{V}\right) \bar{x}_V^{A_i} \right\|^2 }{ \sum_{i=1}^{N} \left\| \bar{x}_V^{C_i} \right\|^2 } }   (49)

Unlike the original Horn transform [19], this solution for the scale factor depends on the rotation because of the term corresponding to the GPS antenna location relative to the camera. The solution is otherwise the same as that of the original Horn transform. It is interesting to note that the solution for the scale factor is, quite sensibly, simply a ratio of metrics describing the spread of the camera positions in both coordinate systems.

After substituting the solution for the scale factor into the objective function, this leaves an objective function that is only a function of the rotation, which is given by the equation

g = \sqrt{S_V S_G} - D   (50)

To aid in solving for the rotation, a useful relation between rotation matrices and quaternions is employed, which is given as follows

\left(x_1\right)^T R(q)\, x_2 = \left( q \otimes x_2 \otimes q^* \right)^T x_1 = \left( q \otimes x_2 \right)^T \left( x_1 \otimes q \right) = q^T \{x_2\}^T [x_1]\, q   (51)

where $[\cdot]$ and $\{\cdot\}$ are the quaternion left and right multiplication matrices, respectively, which, in the case of position vectors, are given as

[x] = \begin{bmatrix} 0 & x_3 & -x_2 & x_1 \\ -x_3 & 0 & x_1 & x_2 \\ x_2 & -x_1 & 0 & x_3 \\ -x_1 & -x_2 & -x_3 & 0 \end{bmatrix}   (52)

\{x\} = \begin{bmatrix} 0 & -x_3 & x_2 & x_1 \\ x_3 & 0 & -x_1 & x_2 \\ -x_2 & x_1 & 0 & x_3 \\ -x_1 & -x_2 & -x_3 & 0 \end{bmatrix}   (53)

Substituting this relation into the expanded objective function results in

g = \sqrt{S_V} \left( \sum_{i=1}^{N} \left( \left\| \bar{x}_G^{A_i} \right\|^2 + \left\| \bar{x}_V^{A_i} \right\|^2 \right) - 2 \left( q_G^{V} \right)^T \left( \sum_{i=1}^{N} \left\{ \bar{x}_V^{A_i} \right\}^T \left[ \bar{x}_G^{A_i} \right] \right) q_G^{V} \right)^{1/2} - \left( q_G^{V} \right)^T \left( \sum_{i=1}^{N} \left\{ \bar{x}_V^{C_i} \right\}^T \left[ \bar{x}_G^{A_i} \right] \right) q_G^{V} + \sum_{i=1}^{N} \left( \bar{x}_V^{A_i} \right)^T \bar{x}_V^{C_i}   (54)

In general, there is no analytic solution to the problem of determining the minimizer of Eq. 54. However, it is safe to assume that the solution will be close to the minimizer of $-\left(q_G^{V}\right)^T \left( \sum_{i=1}^{N} \{\bar{x}_V^{C_i}\}^T [\bar{x}_G^{A_i}] \right) q_G^{V}$, which is the result from the original Horn transform from [19]. This is primarily because $\bar{x}_V^{C_i}$ is greater in magnitude than $\bar{x}_V^{A_i}$ in realistic scenarios, since the separation between sensors will be smaller than the distance the whole system moves. The solution to this reduced problem is the eigenvector corresponding to the largest eigenvalue of the matrix $\sum_{i=1}^{N} \{\bar{x}_V^{C_i}\}^T [\bar{x}_G^{A_i}]$, which is used as an initial guess for an iterative estimator that uses the Newton-Raphson method to determine the rotation. Although difficult to interpret from Eq. 54, it can easily be seen from the solution to the original Horn transform in [19] that the rotation is, quite sensibly, the direction that best aligns the deviations from the means of the camera positions in both coordinate systems.

Once the estimate of the rotation has been determined by the iterative estimator, the scale factor and translation can be computed directly from Eqs. 49 and 44, respectively. This estimate of the similarity transform relating the V frame to the G frame can be used to transform the camera poses estimated by the stand-alone visual SLAM algorithm into globally-referenced pose estimates, which are then used as an initial guess for the bundle adjustment.
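The complete initialization procedure can be sketched as follows in Python. The sketch reuses quat_left_mat and quat_right_mat from the earlier listing, computes only the eigenvector-based initial guess for the rotation (omitting the Newton-Raphson refinement of Eq. 54), and delegates the quaternion-to-rotation-matrix conversion to an assumed helper, rot_from_quat, that must follow the convention of Eqs. 51-53.

import numpy as np

def initialize_similarity(x_A_G, x_C_V, R_C_V, x_A_C, rot_from_quat):
    """Sketch of the modified Horn-transform initialization (Eqs. 41-49).

    x_A_G : (N, 3) CDGPS antenna positions in the global (G) frame
    x_C_V : (N, 3) camera positions from stand-alone visual SLAM (V frame)
    R_C_V : (N, 3, 3) visual SLAM attitude matrices R(q^Ci_V)
    x_A_C : (3,) known camera-to-antenna lever arm in the camera frame
    rot_from_quat : assumed helper mapping a quaternion (convention of
        Eqs. 51-53) to the rotation matrix R(q^V_G)
    Returns the scale factor, rotation quaternion (initial guess only), and
    translation of the V-to-G similarity transform.
    """
    # Antenna position in the V frame implied by each visual SLAM pose.
    x_A_V = np.einsum('nij,j->ni', R_C_V, x_A_C)

    # Means and deviations from the means (Eqs. 41-42).
    mu_AG, mu_CV, mu_AV = x_A_G.mean(0), x_C_V.mean(0), x_A_V.mean(0)
    dG, dV, dA = x_A_G - mu_AG, x_C_V - mu_CV, x_A_V - mu_AV

    # Rotation initial guess: eigenvector of the largest eigenvalue of
    # sum_i {dV_i}^T [dG_i] (the reduced, original-Horn problem).  The
    # Newton-Raphson refinement of the full objective in Eq. 54 is omitted.
    Nmat = sum(quat_right_mat(v).T @ quat_left_mat(g) for v, g in zip(dV, dG))
    eigvals, eigvecs = np.linalg.eigh(0.5 * (Nmat + Nmat.T))
    q_V_G = eigvecs[:, -1]
    R_V_G = rot_from_quat(q_V_G)

    # Scale factor (Eq. 49) and translation (Eq. 44).
    s = np.sqrt(np.sum(np.linalg.norm(dG - dA @ R_V_G.T, axis=1) ** 2)
                / np.sum(np.linalg.norm(dV, axis=1) ** 2))
    x_V_G = mu_AG - R_V_G @ (s * mu_CV + mu_AV)
    return s, q_V_G, x_V_G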


VI. NAVIGATION FILTER

While the bundle adjustment is responsible for maintaining a high-precision, globally-referenced map of the environment, it cannot be relied upon to maintain an estimate of the current pose of the camera due to real-time constraints. Instead, a navigation filter is employed that maintains an estimate of the current pose of the camera using image feature measurements from non-keyframes and inertial measurements from an IMU for propagation of the system dynamics. The navigation filter also leverages the map of the environment maintained by the bundle adjustment so that the image feature measurements can be tied to a global reference. In other words, the navigation filter treats image features as fiduciary markers, but with some known error distribution. The remainder of this section outlines the algorithm used for the navigation filter.

A. Measurement Update Step

During the measurement update step, the filter state is temporarily augmented with the image feature positions corresponding to the features with measurements in the current frame. This temporary augmentation of the state vector allows the estimates of the image feature positions to be treated as a prior with error covariances provided by the bundle adjustment. Taking this approach allows the information from the measurement update to be weighted properly between the filter state and the image feature positions.

This type of state augmentation would normally destroy the ability of the filter to perform in real time. However, several steps can be taken, with little loss in performance, to reduce the computational burden of the filter. These steps result in only about twice the computational burden of simply taking the image feature position estimates from bundle adjustment as "truth" (i.e., conditioning on the image feature position estimates from bundle adjustment). These steps are as follows:

1) The cross-covariances between image feature position estimates from bundle adjustment can be neglected. If these cross-covariances were retained, the filter would be required to invert a dense 3M×3M matrix at most measurement updates because different features would be seen in different frames. This would make real-time operation impossible. By neglecting cross-covariances, the filter only needs to invert M 3×3 matrices, which can also be stored for use in all frames between bundle adjustment updates. Additionally, these cross-covariances between image feature position estimates are small provided that the system moves a fair amount (∼10 m), which is required for good observability anyway. This point is demonstrated by the simulations in Sec. VII.

2) Since the cross-covariances between image feature position estimates from bundle adjustment are being neglected, the normal equations for the measurement update have nearly the same sparse structure as the bundle adjustment normal equations discussed in Sec. V-D. By exploiting the sparse structure of the normal equations, a 15×15 system of linear equations corresponding to the filter state is the largest system of equations that must be solved. Under the scenario where the feature position estimates from bundle adjustment were taken as "truth," the filter would also be required to solve a 15×15 system of equations.

3) The image feature positions are marginalized out of the state after the measurement update and are not updated within the measurement update. If the image feature positions were maintained in the state, the sparsity of the next measurement update would not be preserved and the filter would need to compute the full covariance matrix for the augmented state. Since the feature position measurements are not maintained in the filter between measurement updates, there is also no need to compute the measurement update for these variables.

However, this reduction in computational burden does not come free. The price paid for the ability to operate the filter in real time is that the filter covariance estimate is no longer consistent, because the cross-covariances between the image feature positions were ignored. This effect should be small provided that the system moves a fair amount (∼10 m) and the environment is rich with visually recognizable features. An ad hoc method, such as fudge factors [20], can be used to slightly inflate the covariance in an attempt to maintain filter consistency; simulations can be used to tune these fudge factors. Note that, while the filter's estimate may not be consistent, the bundle adjustment is not affected by this issue and maintains a consistent estimate of the map.

Given these considerations, the normal equations for the measurement update step are given as

\left( \begin{bmatrix} P_F^{-1} & 0 \\ 0 & P_p^{-1} \end{bmatrix} + \sum_{j=1}^{M} w_T\!\left(\left\|\Delta s_I^{p_j}\right\|\right) \left( H_I^{p_j}\big|_{X_F,\,p} \right)^T H_I^{p_j}\big|_{X_F,\,p} \right) \begin{bmatrix} \delta X_F \\ \delta p \end{bmatrix} = \sum_{j=1}^{M} w_T\!\left(\left\|\Delta s_I^{p_j}\right\|\right) \left( H_I^{p_j}\big|_{X_F,\,p} \right)^T \Delta s_I^{p_j}   (55)

where $P_F$ and $P_p$ are the a priori covariance matrices for the filter state and the image feature positions, respectively. To reiterate, $P_p$ is approximated as block diagonal with 3×3 blocks corresponding to each point feature position. After performing the summations, these normal equations are given as

\begin{bmatrix} U_F & W_F \\ W_F^T & V_F \end{bmatrix} \begin{bmatrix} \delta X_F \\ \delta p \end{bmatrix} = \begin{bmatrix} \epsilon_F \\ \epsilon_p \end{bmatrix}   (56)

These equations are of nearly the same form as Eq. 35, with the only difference being that the upper-left block corresponding to the filter state is dense. This means that the same sparse transformation applied during the bundle adjustment in Eq. 37 can be applied to the filter measurement update. Therefore, the solution to the normal equations for the filter state update and covariance is given by the equations

\delta X_F = \left( U_F - W_F V_F^{-1} W_F^T \right)^{-1} \left( \epsilon_F - W_F V_F^{-1} \epsilon_p \right)   (57)

P_F = \left( U_F - W_F V_F^{-1} W_F^T \right)^{-1}   (58)
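A sketch of this measurement update, exploiting the block-diagonal approximation of $P_p$ described above, is given below; matrix and variable names mirror Eqs. 56-58 and are otherwise illustrative.

import numpy as np

def filter_measurement_update(U_F, W_F, VF_blocks, eps_F, eps_p):
    """Sketch of the sparse filter measurement update of Eqs. 57-58.

    U_F       : (15, 15) dense block for the filter state
    W_F       : (15, 3M) cross terms with the augmented feature positions
    VF_blocks : list of M (3, 3) blocks of V_F (P_p treated as block diagonal)
    eps_F, eps_p : right-hand sides of Eq. 56
    The feature positions are marginalized out afterward, so no update is
    computed for them.
    """
    WVinv = np.zeros_like(W_F)
    VFinv_eps_p = np.zeros(3 * len(VF_blocks))
    for j, Vb in enumerate(VF_blocks):
        Vb_inv = np.linalg.inv(Vb)                        # only 3x3 inverses
        WVinv[:, 3*j:3*j+3] = W_F[:, 3*j:3*j+3] @ Vb_inv
        VFinv_eps_p[3*j:3*j+3] = Vb_inv @ eps_p[3*j:3*j+3]

    S = U_F - WVinv @ W_F.T                               # 15 x 15 reduced system
    P_F = np.linalg.inv(S)                                # Eq. 58
    dX_F = P_F @ (eps_F - W_F @ VFinv_eps_p)              # Eq. 57
    return dX_F, P_F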


B. Propagation Step

The propagation step utilizes accelerometer and gyro measurements to aid in the propagation of the filter state. The models for these measurements were defined in Sec. IV-C, but the full state dynamics have yet to be defined. Dynamics equations for the accelerometer and gyro biases were given in Eqs. 18 and 20. The time derivative of the differential quaternion is simply

\dot{\delta e}_G^{C} = \frac{1}{2}\, \omega_G^{C} = \frac{1}{2} \left( R\left(q_G^{C}\right) R_C^{B} \left( \omega_B - b_{\omega_B} \right) + \nu_{\omega 1} \right)   (59)

The acceleration is significantly more complicated because the IMU and camera are not collocated, which means that the angular velocity of the system couples into the acceleration when expressed using the accelerometer measurements. From rigid-body kinematics, the acceleration of the camera can be derived as

a_G^{C} = a_G^{B} - \alpha_G^{C} \times \left( R\left(q_G^{C}\right) x_C^{B} \right) - \omega_G^{C} \times \left( \omega_G^{C} \times \left( R\left(q_G^{C}\right) x_C^{B} \right) \right) - 2\, \omega_G^{E} \times v_G^{C} - \alpha_G^{E} \times x_G^{C} - \omega_G^{E} \times \left( \omega_G^{E} \times x_G^{C} \right)   (60)

where $\omega_G^{E}$ and $\alpha_G^{E}$ are the angular velocity and angular acceleration of the Earth, respectively, and $\alpha_G^{C}$ is the angular acceleration of the camera. This equation can be slightly simplified by recognizing that the angular acceleration of the Earth is negligibly small and that the angular acceleration of the camera, while not always small, will not be large over significant time intervals and will roughly average to zero. Removing these terms and substituting the accelerometer and gyro measurement models into Eq. 60 results in the equation

a_G^{C} = R\left(q_G^{C}\right) R_C^{B} \left( f_B - b_{f_B} \right) - \frac{G_E}{\left\| x_G^{C} \right\|^3}\, x_G^{C} + \nu_{f1} - \left( R\left(q_G^{C}\right) R_C^{B} \left( \omega_B - b_{\omega_B} \right) + \nu_{\omega 1} \right) \times \left( \left( R\left(q_G^{C}\right) R_C^{B} \left( \omega_B - b_{\omega_B} \right) + \nu_{\omega 1} \right) \times \left( R\left(q_G^{C}\right) x_C^{B} \right) \right) - 2\, \omega_G^{E} \times v_G^{C} - \omega_G^{E} \times \left( \omega_G^{E} \times x_G^{C} \right)   (61)

To summarize, the dynamics model for the full state is given by

\dot{X}_F = \frac{d}{dt} \begin{bmatrix} x_G^{C} \\ v_G^{C} \\ b_{f_B} \\ \delta e_G^{C} \\ b_{\omega_B} \end{bmatrix} = \begin{bmatrix} v_G^{C} \\ a_G^{C} \\ \nu_{f2} \\ \frac{1}{2}\, \omega_G^{C} \\ \nu_{\omega 2} \end{bmatrix}   (62)

Since the time between IMU updates is small (∼10 ms), Euler integration can be used to determine the state update from time k to k+1 as

\delta X_F(k) = \Delta t\, \dot{X}_F \big|_{X_F(k),\, u(k+1)}   (63)

where $\Delta t$ is the time interval between times k and k+1 and $u(\cdot)$ is the IMU measurement vector given as

u(k) = \begin{bmatrix} f_B(k) \\ \omega_B(k) \end{bmatrix}   (64)

To determine the propagated covariance, the state transition matrix and process noise Jacobian matrix must first be determined as

F(k) = I + \Delta t\, \frac{\partial \dot{X}_F}{\partial X_F} \bigg|_{X_F(k),\, u(k+1)}   (65)

D(k) = \begin{bmatrix} \dfrac{\partial \dot{X}_F}{\partial \nu_{f1}} & \dfrac{\partial \dot{X}_F}{\partial \nu_{f2}} & \dfrac{\partial \dot{X}_F}{\partial \nu_{\omega 1}} & \dfrac{\partial \dot{X}_F}{\partial \nu_{\omega 2}} \end{bmatrix} \bigg|_{X_F(k),\, u(k+1)}   (66)

The discrete-time process noise covariance matrix is then given as

Q(k) = \int_0^{\Delta t} F(k)\, D(k) \begin{bmatrix} \sigma_{f1}^2 I & 0 & 0 & 0 \\ 0 & \sigma_{f2}^2 I & 0 & 0 \\ 0 & 0 & \sigma_{\omega 1}^2 I & 0 \\ 0 & 0 & 0 & \sigma_{\omega 2}^2 I \end{bmatrix} D^T(k)\, F^T(k)\, d\tau   (67)

Finally, the propagated covariance matrix is computed using the standard covariance propagation equation for the Kalman filter given as

P_F(k+1) = F(k)\, P_F(k)\, F^T(k) + Q(k)   (68)
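The propagation step of Eqs. 63-68 can be sketched as follows; dynamics and dynamics_jacobians are assumed helpers implementing Eqs. 62, 65, and 66, and the noise standard deviations are illustrative placeholders rather than tuned values.

import numpy as np

def propagate(X_F, P_F, f_B, omega_B, dt, dynamics, dynamics_jacobians,
              sigmas=(1e-2, 1e-4, 1e-3, 1e-5)):
    """Sketch of the filter propagation step (Eqs. 63-68).

    dynamics(X, u)           : assumed helper returning Xdot per Eq. 62
    dynamics_jacobians(X, u) : assumed helper returning (dXdot/dX, D), where D
                               stacks the four noise Jacobians of Eq. 66
    sigmas : illustrative placeholders for (sigma_f1, sigma_f2, sigma_w1, sigma_w2)
    """
    u = np.concatenate([f_B, omega_B])        # IMU measurement vector, Eq. 64

    # Euler integration of the state (Eq. 63); a full implementation would
    # fold the differential-attitude states back into the reference quaternion.
    X_next = X_F + dt * dynamics(X_F, u)

    # State transition and noise Jacobians (Eqs. 65-66).
    A, D = dynamics_jacobians(X_F, u)
    F = np.eye(len(X_F)) + dt * A

    # Process noise covariance (Eq. 67): the integrand is constant here, so
    # the integral reduces to a single multiplication by dt.
    Sigma = np.diag(np.repeat(np.square(sigmas), 3))
    Q = dt * F @ D @ Sigma @ D.T @ F.T

    # Covariance propagation (Eq. 68).
    P_next = F @ P_F @ F.T + Q
    return X_next, P_next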

VII. BUNDLE ADJUSTMENT SIMULATIONS

The bundle adjustment algorithm detailed in Sec. V was implemented and used in a series of simulations designed to probe the limits of its performance. First, the estimability of the bundle adjustment state was probed under a simplistic simulation scenario designed to demonstrate the behavior of the estimator. Next, a more realistic simulation scenario was created where a user carrying the system began outside a building, walked through a hallway in the building with a single turn, and came out of the building on the other side. All simulations assumed a 2 cm standard deviation for the CDGPS position estimates and a 1 pixel standard deviation for the image feature measurements. These are reasonable values for the errors on both types of measurements. Additionally, the vector between the camera and GPS antenna was set to $x_C^A = [\,0.1002\ \ {-0.1664}\ \ {-0.0267}\,]$ m, which is representative of a real prototype system, to make the simulations more representative of actual performance.


A. Estimability Simulations

This simplistic simulation scenario was designed to demonstrate the estimability of the bundle adjustment state. The scenario involves a single cloud of point features, uniformly distributed in a sphere of radius $r_p$, centered on a point a distance d away from the center of a cloud of camera positions, which are uniformly distributed in a sphere of radius $r_c$. The camera was pointed toward the center of the cloud of point features with some added random pointing dither.

These simulations were all performed using 200 point features and 25 keyframes. The addition of more point features and keyframes would slightly increase the accuracy of the bundle adjustment's solution (up to some geometric limit), but this is not as interesting as varying the other simulation parameters because there is only marginal benefit for such a simplistic scenario.

After fixing the number of point features and keyframes, there are three simulation parameters which could be varied. These parameters are (1) the scale of the simulation, preserving the ratio $r_p : d : r_c$, (2) the ratio $r_c : d$, and (3) the ratio $r_p : d$. A ratio of 1 : 4 was used in all simulations for the third parameter, $r_p : d$, because there exists a symmetry between this parameter and $r_c : d$, which was allowed to vary, that makes varying both of them unenlightening. The first set of simulations keeps the ratio $r_c : d$ constant at 1 : 2 and varies the scale of the simulation, while the second set of simulations holds the scale constant with d = 20 m and varies the motion of the camera represented by the ratio $r_c : d$.

1) Simulations Varying Scale: Simulations were performed for a set of values of d between 0.5 m and 200 m, while preserving the ratio $r_p : d : r_c$ as 1 : 4 : 2. Figure 2 shows a scatter plot of the camera pose errors from these simulations as a function of d. The norms of the position errors are shown as blue asterisks with the left axis denoting their value. The attitude errors are shown as red pluses with the right axis denoting their values. Figure 3 shows a scatter plot of the norms of the point feature position errors from these simulations as a function of d.

Above about d = 20 m, the camera position and attitude errors, from Fig. 2, are roughly invariant to scale, and the camera position errors are representative of the 2 cm standard deviation used for each dimension of the CDGPS position estimates. This means that, over this region, the camera position estimates are directly limited in accuracy by the accuracy of the CDGPS position estimates. It is important to note the remarkably small attitude error over this region, which is less than 0.1° in most cases. This demonstrates the power of this technique for precise navigation. Unlike the camera pose errors, the point feature position errors decrease linearly with the scale, even in the region above about d = 20 m. This is a result of the point feature position estimates being based purely on bearing measurements, which are invariant to scale. Therefore, the point feature position errors must decrease linearly with the scale.

Below about d = 10 m, the estimability of the problem begins to degrade. The attitude errors grow significantly (to over 1° for d ≤ 1 m) and become highly correlated between frames, as can be seen by how tightly packed the attitude errors are on the scatter plot. This is because, as the motion of the system decreases, the errors in the CDGPS position estimates start to become significant relative to the distance moved. Therefore, it becomes more difficult to tie the visual SLAM solution to the global coordinate system, resulting in a "bias" between the more accurate local solution normally provided by stand-alone visual SLAM and the global reference frame. This same phenomenon is evident in the point feature position estimates in Fig. 3 for d = 0.5 m and d = 2 m.

Fig. 2. A scatter plot of the camera pose errors from the simulations as a function of d, while preserving the ratio $r_p : d : r_c$ as 1 : 4 : 2. The norms of the position errors are shown as blue asterisks with the left axis denoting their value. The attitude errors are shown as red pluses with the right axis denoting their values.

Fig. 3. A scatter plot of the norms of the point feature position errors from the simulations as a function of d, while preserving the ratio $r_p : d : r_c$ as 1 : 4 : 2.

A clearer picture of these correlated errors can be seen in the left panel of Fig. 4, which shows the point feature position errors in each dimension for the scenario where d = 2 m, with the x errors represented by blue asterisks, the y errors represented by red pluses, and the z errors represented by green x's. This correlation is simply an artifact of this simplistic simulation scenario. In reality, one would typically walk by these close-up point features fairly quickly and obtain much better attitude estimates and uncorrelated estimation errors. This fact is partially demonstrated by the right panel of Fig. 4, which displays the point feature position errors under the modified simulation scenario where the camera and point feature positions are spread out over an extra 10 m laterally in one direction. Under this modified scenario, the errors are no longer highly correlated and, although not shown directly, the attitude estimates become much more precise as well.

Fig. 4. The left panel shows the point feature position errors in each dimension for the scenario where d = 2 m and the ratio $r_p : d : r_c$ is 1 : 4 : 2. The x errors are represented by blue asterisks, the y errors by red pluses, and the z errors by green x's. The right panel shows the point feature position errors in each dimension under the modified simulation scenario where the camera and point feature positions are spread out over an extra 10 m laterally in one direction.

2) Simulations Varying Camera Motion: Having studied the behavior of the estimator by varying the scale of the problem as a whole, it is informative to now fix the scale and vary the range of motion of the system. The scale of the problem was set by fixing d = 20 m. Then, the range of motion of the camera was varied from 1% to 75% of d while holding the size of the point cloud constant at $r_p$ = 5 m. Figure 5 shows a scatter plot of the camera pose errors from these simulations as a function of $r_c/d$. The norms of the position errors are shown as blue asterisks with the left axis denoting their value. The attitude errors are shown as red pluses with the right axis denoting their values. Figure 6 shows a scatter plot of the norms of the point feature position errors normalized by d from these simulations as a function of $r_c/d$.

It can be seen from Fig. 5 that below about $r_c/d$ = 0.2 the attitude accuracy begins to degrade while the positioning accuracy remains roughly constant. This is because the positioning accuracy is directly limited by the accuracy of the CDGPS position estimates for this value of d, as demonstrated in the previous section, while the attitude accuracy is heavily dependent upon the geometry of the problem as a whole. Below about $r_c/d$ = 0.2, the estimability of the orientation of the system about roughly the vector connecting the centers of the cloud of point features and the cloud of camera positions begins to degrade. This results in an attitude "bias" about this direction, which can be seen by how tightly packed the attitude errors are on the scatter plot for small values of $r_c/d$. It is interesting to note that the pose accuracy, shown in Fig. 5, is invariant to scale above about d = 20 m, which is the region where the positioning accuracy is directly limited by the accuracy of the CDGPS position estimates.

Fig. 5. A scatter plot of the camera pose errors from the simulations as a function of $r_c/d$ with d = 20 m and $r_p$ = 5 m. The norms of the position errors are shown as blue asterisks with the left axis denoting their value. The attitude errors are shown as red pluses with the right axis denoting their values.

Fig. 6. A scatter plot of the norms of the point feature position errors normalized by d from the simulations as a function of $r_c/d$ with d = 20 m and $r_p$ = 5 m.

It can be seen from Fig. 6 that the point feature position errors scale linearly with the decrease in camera motion. Additionally, it is interesting to note that the normalized point feature position errors are invariant over all scales. Therefore, as a rule of thumb, the range of camera motion should be about 40% of the depth or greater in order to obtain point feature position accuracy of 1% of the depth, as can be seen in Fig. 6. This rule of thumb could be useful in practice when collecting data for the purpose of accurate reconstruction of the environment.

Fig. 7. A diagram showing the layout and camera trajectory for the hallway simulation with the x-axis pointing right, the y-axis pointing into the page, and the z-axis pointing up. The camera trajectory is shown in blue (starting from the bottom left and ending at the top right), while a road and the walls of buildings are shown in black and red, respectively. Black dashed lines outline four regions of the environment which will be referenced later in the simulation results. These regions are (A) the area the camera is in before entering the hallway, (B) the first leg of the hallway up to and including the turn, (C) the second leg of the hallway, and (D) the area where the camera exits the hallway.

B. Hallway Simulation

Figure 7 shows the layout of this scenario. The camera trajectory, shown in blue in Fig. 7, begins in the bottom left corner of the figure, moves through a hallway, and comes out the other side in the top right. The simulated visual features in this environment included trees lining the road (outlined in black in Fig. 7), the center stripe in the road, windows on the sides of the buildings, entrances and exits to the hallway, doors and posters on the walls inside the hallway, and scattered features on the ceiling of the hallway. In total, there were 1310 features observed in at least 5 keyframes which were included in the bundle adjustment. Keyframes were taken every 0.25 m while moving and every 30° during a turn, resulting in 242 keyframes in total. Uniformly distributed random dither was added to the trajectory in each direction for the position and attitude to simulate human motion. This represents a challenging scenario for the system because there is little lateral motion to improve observability while in the hallway. For convenience, four regions of the simulation environment, shown in Fig. 7 outlined with black dashed lines, are defined to aid in discussion of the results. These regions are (A) the area the camera is in before entering the hallway, (B) the first leg of the hallway up to and including the turn, (C) the second leg of the hallway, and (D) the area where the camera exits the hallway.

Fig. 8. A plot of the norms of the camera position errors for the hallway simulation for each of the three cases with case 1 represented by green asterisks, case 2 represented by red pluses, and case 3 represented by blue x's. The periods when the camera is in the various regions of the simulation environment are delineated by vertical black lines and labeled.

Three different cases were run for this scenario regarding the availability of CDGPS position estimates. In the first case, CDGPS position estimates were available at every keyframe. This serves to demonstrate the best performance one could expect from the algorithm in this geometry and acts as a control for the two more interesting cases. The second case assumes that CDGPS position estimates are unavailable for the entire dataset after the camera enters the hallway. This serves to demonstrate the drift one would expect from the system as it moves around indoors. The third and final case assumes that CDGPS position estimates become available again as soon as the system exits the hallway. This serves to demonstrate how the bundle adjustment fixes up the estimates of the poses of previous keyframes and the positions of previously seen point features after GPS is reacquired.

Figures 8 and 9 show the norms of the camera position errors and the camera attitude errors, respectively, for each case with case 1 represented by green asterisks, case 2 represented by red pluses, and case 3 represented by blue x's. The periods when the camera is in the various regions of the simulation environment are delineated by vertical black lines and labeled.

For case 1, the norms of the position errors are mostly under 1 cm and the attitude errors are mostly under 0.1°, which demonstrates the pose accuracy one would expect from this system with an open view of the sky. There is one point of interest where the pose errors increase significantly near the end of region B. This point corresponds to the turn inside the hallway, when the camera can only see a few point features, which results in poor estimability of the pose. In fact, there is one keyframe where only 12 point features can be seen. This is partially an artifact of the simulation because, in actual operation, the feature identifier would identify more point features as the camera approached the turn.

In case 2, the attitude estimates are biased by about 0.2° over the entire simulation, as can be seen in Fig. 9, except at the turn, where the errors increase substantially. The position errors are close to those from case 1 over most of region A, since these errors are mostly attributed to the CDGPS position estimates, which both cases share. However, the position error increases roughly monotonically over the remainder of the simulation, due to the lack of new CDGPS position estimates, reaching a final error of about 0.2 m, which is rather small considering that the camera moved about 50 m after losing GPS. This is only 0.4% of the distance traveled. There are two terms to this growing error, which correspond to an attitude bias, as was noted previously, and a drift error. These two terms can clearly be seen in Fig. 10, which shows the absolute value of the position errors in each direction for this case with the x direction represented by green circles, the y direction represented by red squares, and the z direction represented by blue diamonds. Note how the error in the z direction suddenly begins to decrease just before the end of region B, which is where the turn occurs, and actually goes to zero in the middle of region C before changing sign and increasing again. This type of behavior can clearly be attributed to a bias in attitude. Over the interval during which the error in the z direction was increasing, the camera traveled about 27 m with an attitude error of about 0.2°, resulting in a lateral position error of about 8.7 cm, which is close to the maximum error in the z direction. On the other hand, the errors in the x direction continued to increase after the turn. This error is due to drift of the coordinate system.

Fig. 9. A plot of the camera attitude errors for the hallway simulation for each of the three cases with case 1 represented by green asterisks, case 2 represented by red pluses, and case 3 represented by blue x's. The periods when the camera is in the various regions of the simulation environment are delineated by vertical black lines and labeled.

For case 3, the camera position errors at the beginning and end of the simulation are comparable to those from case 1 because, as with case 2, the CDGPS position estimates, when present, are the primary limiting factor in the accuracy of the camera position estimate. In the middle of the simulation, when the camera is in the hallway, the position error has a parabolic shape with its maximum just before the end of region B (i.e., at the turn). This demonstrates how the information provided by the CDGPS position estimates is propagated backward by the bundle adjustment to fix up the errors in the camera position estimates. This improvement might have been more drastic if there had been more point features visible during the turn that were also visible in nearby keyframes both before and after the turn. The scarcity of such features effectively prevented most of the information gained through the addition of CDGPS position estimates at the end of the simulation from being propagated back through the turn. Therefore, little improvement in the camera position estimates from before the turn is achieved over case 2, as can be seen in Fig. 8. This behavior is also observed in the attitude estimates in Fig. 9, which have errors close to those of case 2 before the turn (i.e., near the end of region B) even though there is significant improvement in the attitude errors, over case 2, after the turn. In fact, the attitude errors from after the turn are of the same order of magnitude as case 1. For comparison with Fig. 10, Fig. 11 shows the absolute value of the position errors in each direction for this case with the x direction represented by green circles, the y direction represented by red squares, and the z direction represented by blue diamonds. Due to the addition of CDGPS position estimates at the end of the dataset, the drift of the position estimates while the camera is in the hallway is significantly reduced, as can easily be seen by comparing the errors in the x direction between Figs. 10 and 11.

Fig. 10. A plot of the absolute value of the camera position errors in each direction for case 2 of the hallway simulation with the x direction represented by green circles, the y direction represented by red squares, and the z direction represented by blue diamonds. The periods when the camera is in the various regions of the simulation environment are delineated by vertical black lines and labeled.

Figure 12 shows the norms of the point feature position errors for each case divided into subplots based on the region in which the point feature is located and ordered within each region based on the distance of closest approach of the camera to that point feature, from closest to furthest. Case 1 is represented by green asterisks, case 2 by red pluses, and case 3 by blue x's. Note that the point feature indices in Fig. 12 are only defined within each region and do not correspond across regions. For region A, cases 2 and 3 have errors that are nearly identical, and case 1 only shows significant improvement over the other two cases on some of the point features. The upward trend in the error with feature index is due to the ordering of the features based on the closest approach. For region B, cases 2 and 3 still have nearly identical errors, but case 1 now shows significant improvement over the other cases, with about an order of magnitude reduction in error. This is because case 1 is the only case with CDGPS position estimates in this region. The fact that cases 2 and 3 have nearly identical errors in region B demonstrates that the additional information provided by the addition of CDGPS position estimates at the end of the simulation in case 3 was not propagated back through the turn. For region C, there is a clear distinction in accuracy between all three cases, with case 1 being the most accurate and case 2 being the least accurate. This shows that the additional information provided by the addition of CDGPS position estimates at the end of the simulation did help case 3 significantly in this region. For region D, cases 1 and 3 have errors that are of the same order, which is to be expected, and case 2 has errors that are significantly larger than the other cases for most of the point features.

Fig. 11. A plot of the absolute value of the camera position errors in each direction for case 3 of the hallway simulation with the x direction represented by green circles, the y direction represented by red squares, and the z direction represented by blue diamonds. The periods when the camera is in the various regions of the simulation environment are delineated by vertical black lines and labeled.

Fig. 12. Plots of the norms of the point feature position errors for each case divided into subplots based on the region in which the point feature is located and ordered within each region based on the distance of closest approach to the point feature, from closest to furthest. Case 1 is represented by green asterisks, case 2 by red pluses, and case 3 by blue x's. Note that the point feature indices are only defined within each region and do not correspond across regions.

VIII. CONCLUSION

A novel estimation architecture for combined visual SLAM, CDGPS, and inertial navigation was presented that is capable of delivering high-accuracy pose estimates (∼1 cm positioning accuracy and ∼0.1° attitude accuracy) and has the potential for real-time operation. The system is centered around a bundle-adjustment-based visual SLAM algorithm that incorporates CDGPS-based position estimates into the bundle adjustment and is responsible for maintaining a highly-accurate, globally-referenced map of the environment. To provide real-time camera pose estimates, a navigation filter is also employed which leverages the map created by bundle adjustment during measurement updates and uses inertial measurements for its propagation step. The system is capable of maintaining highly-accurate, globally-referenced camera pose estimates even without GPS availability for a limited distance of travel.

The globally-referenced bundle adjustment algorithm was implemented and used in a series of simulations. A first set of simulations demonstrated the estimability of the globally-referenced bundle adjustment problem under a simplistic simulation scenario. The second set of simulations represented a fairly realistic scenario where a user carried the system into and through a hallway that blocked reception of the GPS signals. These simulations demonstrated the performance of the bundle adjustment under a challenging scenario. The result was that the bundle adjustment algorithm exhibited only 0.4% drift in position over 50 m and still attained absolute attitude accuracies of about 0.2°. The addition of CDGPS position estimates at the end of the simulation, once the system had exited the hallway, also resulted in significant improvement in the accuracy of the solution over most of the simulation.

REFERENCES

[1] J. Rydell and E. Emilsson, "CHAMELEON: Visual-inertial indoor navigation," in Proceedings of the IEEE/ION PLANS Meeting. Myrtle Beach, SC: IEEE/Institute of Navigation, April 2012.

[2] G. Nuetzi, S. Weiss, D. Scaramuzza, and R. Siegwart, "Fusion of IMU and vision for absolute scale estimation in monocular SLAM," Journal of Intelligent & Robotic Systems, vol. 61, no. 1, pp. 287–299, Jan. 2011.

[3] A. Soloviev and D. Venable, "Integration of GPS and vision measurements for navigation in GPS challenged environments," in Proceedings of the IEEE/ION PLANS Meeting. IEEE/Institute of Navigation, May 2010, pp. 826–833.

[4] D. Zachariah and M. Jansson, "Fusing visual tags and inertial information for indoor navigation," in Proceedings of the IEEE/ION PLANS Meeting. Myrtle Beach, SC: IEEE/Institute of Navigation, April 2012.

[5] G. Schall, S. Zollmann, and G. Reitmayr, "Smart vidente: Advances in mobile augmented reality for interactive visualization of underground infrastructure," Pers Ubiquit Comput, July 2012.

[6] J. J. Wang, S. Kodagoda, and G. Dissanayake, "Vision aided GPS/INS system for robust land vehicle navigation," in Proceedings of the ION ITM. Savannah, GA: Institute of Navigation, Sept. 2009, pp. 600–609.

[7] M. D. Agostino, A. Lingua, D. Marenchino, F. Nex, and M. Piras, "GIMPHI: a new integration approach for early impact assessment," Applied Geomatics, vol. 3, no. 4, pp. 241–249, Dec. 2011.

[8] T. Oskiper, S. Samarasekera, and R. Kumar, "Multi-sensor navigation algorithm using monocular camera, IMU and GPS for large scale augmented reality," in IEEE International Symposium on Mixed and Augmented Reality. Atlanta, GA: IEEE, Nov. 2012.

[9] J. Wang, M. Garratt, A. Lambert, J. J. Wang, S. Han, and D. Sinclair, "Integration of GPS/INS/vision sensors to navigate unmanned aerial vehicles," The International Archives of the Photogrammetry, Remote Sensing, and Spatial Information Sciences, vol. 37, no. B1, pp. 963–969, 2008.

[10] M. Bryson and S. Sukkarieh, "A comparison of feature and pose-based mapping using vision, inertial and GPS on a UAV," in IEEE/RSJ International Conference on Intelligent Robots and Systems. San Francisco, CA: IEEE, Sept. 2011.

[11] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in 6th IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007, pp. 225–234.

[12] D. Shepard, "Fusion of carrier-phase differential GPS, bundle-adjustment-based visual SLAM, and inertial navigation for precisely and globally-registered augmented reality," Master's thesis, The University of Texas at Austin, May 2013.

[13] S. Mohiuddin and M. Psiaki, "Carrier-phase differential Global Positioning System navigation filter for high-altitude spacecraft," Journal of Guidance, Control, and Dynamics, vol. 31, no. 4, pp. 801–814, 2008.

[14] H. Strasdat, J. Montiel, and A. J. Davison, "Visual SLAM: Why filter?" Image and Vision Computing, 2012.

[15] T. E. Humphreys, "Attitude determination for small satellites with modest pointing constraints," in Proc. 2002 AIAA/USU Small Satellite Conference, Logan, Utah, 2002.

[16] F. Devernay and O. Faugeras, "Straight lines have to be straight," Machine Vision and Applications, vol. 13, no. 1, pp. 14–24, Aug. 2001.

[17] P. Huber, Robust Statistics. New York, NY: John Wiley & Sons, 1981.

[18] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2000, vol. 2.

[19] B. K. Horn, "Closed-form solution of absolute orientation using unit quaternions," JOSA A, vol. 4, no. 4, pp. 629–642, 1987.

[20] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation. New York, NY: John Wiley & Sons, 2001.