

Data Association in Temporal Integration of Stereo Depth

Chris Debrunner and Mark Whitehorn, PercepTek, Incorporated

12395 N. Mead Way, Littleton, Colorado 80125, USA

[email protected]

[email protected]

Tyrone Vincent and John Steele, Colorado School of Mines

1610 Illinois Street, Engineering Division, Golden, Colorado, USA

[email protected]

[email protected]

Abstract— We describe the construction and integration of 3D models using two calibrated stereo vision rigs mounted on a Load-Haul-Dump vehicle (LHD) operating in an experimental mine. The system operates in unstructured environments, and does not assume it is operating on a smooth planar surface. We compare the 3D models with ground truth data obtained using a laser scanner over 50 meters of tunnel including an intersection and a muck bay.

Our implementation is partially recursive, in that the 3D model is updated recursively, but image data is saved for later use in associating image features to map locations. This kind of implementation is a good compromise between fully batch and fully recursive, as the growth in both storage requirements and computational complexity with time is very low, while retaining the stability of a batch method.

I. INTRODUCTION

The objective of this work is to produce 3D models of underground mines using techniques feasible for real-time implementation in semi-autonomous vehicle applications. The models should provide sufficient detail, accuracy, and precision to enable planning and control of LHD (Load-Haul-Dump vehicle) loading and tramming operations. Reliable localization of the LHD (full 6 DOF) using only stereo vision allows elimination of more expensive and mechanically complicated sensors such as laser line scanners. In some applications, such as mapping an abandoned mine, it is desirable to record a complete 3D model of a tunnel on a single pass. A large field of view increases the likelihood of visibility of good features; this improves the robustness of pose tracking, which is critical for generating a complete, connected model. With these goals in mind, we tested the performance of a pose tracking and modeling system using two stereo rigs with a combined horizontal field of view spanning approximately 70 degrees. We chose to use multiple stereo rigs (instead of a single rig with wide-angle cameras) in order to avoid potential loss of stereo quality due to reduced image resolution. Our pose tracker maintains an estimate of relative and cumulative pose uncertainty for every stereo pair in order to accurately determine search regions for feature matching. Pose tracking accuracy is verified by comparison with ground truth; actual error is shown to be less than 10% of distance traveled.

A unique feature is our method of data association. Integration of stereo disparity information into the current map, as well as localization of the LHD, depends on matching the locations of “reliable” features in the stereo imagery to corresponding locations on the existing map. We demonstrate how this data association problem can be effectively solved using a “bank” of image data, indexed using the 3D model itself.

II. PRIOR WORK

Much work has been directed toward “realistic” reconstructions of 3D scenes using image sequences [1]. Our approach focuses less on realism than on utility of the resulting 3D model. Our system is most similar to that described by Koch in [2]; we differ mainly in the chosen application and the method of model representation. Since our orientation is toward robotic LHDs operating in underground mines, we chose to use a volumetric model to facilitate implementation of real-time control algorithms which must be model driven. Our octree-based volumetric model also inherently supports examination at varying levels of detail and requires memory proportional to the number of occupied voxels. In a task-specific approach oriented toward navigation of an off-road vehicle, Baten et al. [3] apply correlation-based stereo to compute a terrain model. They employ a single stereo rig mounted on an agile pan-tilt unit to provide a flexible field of view with adequate resolution. Moyung [4] reports development of a system oriented toward applications in space, such as docking and satellite retrieval. His algorithm builds an incrementally accurate and dense representation of the reconstructed object using 3D feature points obtained from stereo image sequences. Moyung uses a Kalman filter approach over multiple stereo pairs to aid in both temporal feature tracking and stereo correspondence. His approach is to use a sparse set of features to measure the location and orientation of a known object, and he does not address modeling unknown structure. Se [5] uses trinocular stereo to build 3D maps of scenes. His approach uses a multi-level representation of the imagery to identify scale-invariant features which are temporally tracked. This is a similar concept to our temporal identification of features which track well. Although Se’s pose estimation is a full 6 DOF algorithm, he tests the system only using data obtained indoors. The resulting map is a database of 3D landmarks rather than a volumetric model. Thrun et al. [6] and [7] present an incremental method for concurrent mapping and localization for mobile robots equipped with multiple 2D laser rangefinders.



Their approach uses scan-matching for mapping, and a sample-based probabilistic method for localization. While Thrun constructs volumetric models of large sections of underground mines, the scan-matching pose determination method is two-dimensional and solves only for three degrees of freedom. Molton [8] describes a stereo vision system which robustly estimates motion and 3D structure of a rigid environment as the system moves through it. The system uses temporal feature matching along with multiple stereo hypotheses and Kalman filters to track 6 DOF camera pose and model local 3D structure. Molton’s application is obstacle avoidance for a partially sighted person, and he does not attempt to build an integrated volumetric model of the scene. Singh [9] describes a cross-country navigation system using wide-angle stereo as the primary sensor. Singh builds 2.5D terrain maps, but does not discuss volumetric models or determination of pose from stereo.

III. DESIGN APPROACH

Integration of multiple range data sets has several potential advantages. One is reduction of the errors in the model estimates through computation of weighted averages and rejection of outliers. Another is filling gaps in the model constructed from individual measurements. These gaps in 3D coverage are often quite large in a single range image due to occlusions or lack of scene texture. We refer to the set of measurements from a single stereo pair as either a disparity image of pixel dimensions width × height (for which many disparity values may be undefined), or as a 3D point cloud comprising only the 3D coordinates derived from valid disparities through use of the stereo rig calibration data. The point cloud is represented as a list of 3D points with associated uncertainties (3×3 covariance matrices).
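As a concrete illustration, the sketch below shows one way such a measurement could be represented and produced from a single valid disparity under a standard rectified-stereo model. The triangulation formula, the isotropic pixel-error assumption, and all names (StereoPoint, disparity_to_point, sigma_px) are ours for illustration, not taken from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class StereoPoint:
    """One point-cloud sample: a 3D point with a 3x3 covariance."""
    xyz: np.ndarray   # shape (3,)
    cov: np.ndarray   # shape (3, 3)

def disparity_to_point(u, v, d, f, B, cx, cy, sigma_px=0.8):
    """Triangulate one valid disparity into camera coordinates.

    Standard rectified-stereo model: Z = f*B/d, with focal length f
    (pixels) and baseline B (meters). The covariance is a first-order
    propagation of an isotropic pixel error sigma_px (0.8 px is the
    feature accuracy reported later in the paper) through the
    triangulation; this is a common approximation, not the paper's
    stated procedure.
    """
    Z = f * B / d
    X = (u - cx) * Z / f
    Y = (v - cy) * Z / f
    # Jacobian of (X, Y, Z) with respect to the measurements (u, v, d)
    J = np.array([
        [Z / f, 0.0,   -X / d],
        [0.0,   Z / f, -Y / d],
        [0.0,   0.0,   -Z / d],
    ])
    C = (sigma_px ** 2) * J @ J.T
    return StereoPoint(np.array([X, Y, Z]), C)
```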

Figure 1 gives an overview of our approach. To perform integration, it is necessary to measure the stereo rig pose relative to the octree model coordinate frame; we accomplish this using a RANSAC technique [10] to find image feature correspondences that allow consistent 3D matches between the new stereo data and the model. The best pose is then used to integrate the new 3D data and the image data (needed for matching to future frames) into the model.

Fig. 1. The structure of the proposed algorithm.

The result of temporal integration is a 3D model accumulated over a vehicle motion through which we have tracked camera pose relative to the model frame. The uncertainties of the pose computed for each new stereo frame are accumulated to estimate the cumulative pose uncertainty, and these uncertainties together with the 3D structure can be used to determine search ranges for image feature matching in new frames. The uncertainty in the 3D model structure (due to pose and stereo depth uncertainty) is captured by the scatter of the point clouds stored in the octree voxels. Since the average stereo depth error over the working volume ranges from 1 cm to 3 cm, a voxel size of approximately 6.25 cm was chosen for adequate representation of surface detail. To produce surfaces from the model data, we can approximate the surface in each voxel as a plane. We use the moments of the point cloud points in each voxel to estimate the position and orientation of this plane as well as the noisiness of the points (the deviation from planarity).
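A minimal sketch of such a moment-based plane fit follows, assuming the voxel's accumulated points are available as an N×3 array. The eigen-decomposition of the scatter matrix is a standard way to realize the fit described above; the function name is hypothetical.

```python
import numpy as np

def fit_voxel_plane(points):
    """Estimate a plane from the points accumulated in one voxel.

    Uses the first and second moments of the point set: the centroid
    fixes the plane position, and the eigenvector of the scatter
    matrix with the smallest eigenvalue gives the normal. The smallest
    eigenvalue itself measures deviation from planarity (noisiness).
    """
    pts = np.asarray(points)             # shape (N, 3), N >= 3
    centroid = pts.mean(axis=0)
    scatter = np.cov(pts, rowvar=False)  # 3x3 second central moment
    eigvals, eigvecs = np.linalg.eigh(scatter)  # ascending eigenvalues
    normal = eigvecs[:, 0]
    planarity_noise = np.sqrt(eigvals[0])  # RMS distance from the plane
    return centroid, normal, planarity_noise
```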

IV. POSE ESTIMATION

Assuming that the scene is static, a rigid-body transformation of the 3D locations of corresponding features will describe the observed motion. This is referred to as determining the absolute orientation of the system [11]. The rigid-body transformation which takes points from frame 0 to frame i is here represented using a variant of Craig’s notation [12]:

${}^{i}x_j = {}^{i}D\,{}^{0}x_j$ (1)

where ${}^{0}x_j$ is the j-th 3D point represented using reference frame coordinates, ${}^{i}x_j$ is the same point measured in frame i, and ${}^{i}D$ is the rigid-body transformation which describes the motion between frames 0 and i. (We have dropped the leading subscript of ${}^{i}_{0}D$ which Craig’s notation requires, since $D$ always operates on points ${}^{0}x_j$ which are measured in the reference frame.)

We decompose the pose estimation problem into two steps: feature generation and matching, and pose calculation and uncertainty. These two steps are discussed in the following two sub-sections.

A. Feature generation and matching

The goal here is to solve the absolute orientation problem for each stereo pair relative to a reference frame, in order to register the measurements obtained in frame k to the reference frame and integrate the measurements into a single 3D model. To reduce accumulation of errors, it is desirable to register each new stereo pair directly to the 3D scene model where there is overlap between the current view and the existing model. We register each new stereo pair to the oldest pair which has sufficient overlap with the new pair, and we use this approach for the tunnel modeling application. To find prior views which overlap the current view, we examine occupied model voxels which should be visible in the current view (according to the current pose and pose uncertainty estimates). Recorded in each occupied voxel is the list of views which have contributed to the voxel’s occupancy count; the contributing view with pose closest to the current pose is the best candidate for use as a reference. Temporal feature matching (2D correlation) is used to match features between the new image and the selected reference image (which has already been registered to the 3D model).
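The sketch below illustrates this reference-view selection under assumed data structures (a voxel object carrying contributing_views ids, and a dict of view poses); the paper does not specify its pose-distance metric, so translation distance is used here as a stand-in.

```python
import numpy as np

def select_reference_view(visible_voxels, view_poses, current_pose):
    """Pick the reference view for matching the new stereo pair.

    visible_voxels: occupied octree voxels predicted to be visible from
        the current pose estimate; each carries the ids of the views
        that contributed to its occupancy count (hypothetical attribute).
    view_poses: dict mapping view id -> 4x4 camera-to-model transform.
    Returns the contributing view whose pose is closest to the current
    pose. "Closest" here is translation distance only; a fuller
    implementation would weigh rotation as well.
    """
    candidates = set()
    for voxel in visible_voxels:
        candidates.update(voxel.contributing_views)
    if not candidates:
        return None  # no overlap with the existing model
    cur_t = current_pose[:3, 3]
    return min(candidates,
               key=lambda vid: np.linalg.norm(view_poses[vid][:3, 3] - cur_t))
```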



We use the computed stereo depth in the new and reference images to project the image feature locations into 3D camera coordinates for each image. Given these corresponding 3D point sets, we can compute the absolute orientation (and its uncertainty, as described in the next sub-section) of the new camera pose relative to the reference frame and the model. Using RANSAC, we compute this transformation from many subsets of three corresponding features and select the solution that matches the largest number of features. The set of 3D feature locations identified in the current view is then transformed to the reference frame using ${}^{i}D^{-1}$.
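A sketch of this RANSAC loop is given below, with the minimal-subset absolute orientation solved by a Horn-style quaternion method in the spirit of [11] and [13]. The iteration count and inlier threshold are illustrative values, not the paper's settings.

```python
import numpy as np

def absolute_orientation(P, Q):
    """Quaternion solution of absolute orientation (cf. [11], [13]).

    Finds R, t such that Q ~ R P + t over corresponding rows of
    P (reference-frame points) and Q (current-frame points), matching
    the direction of the transform iD in Equation (1).
    """
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)
    S = (P - p_bar).T @ (Q - q_bar)          # 3x3 cross-covariance
    Sxx, Sxy, Sxz = S[0]; Syx, Syy, Syz = S[1]; Szx, Szy, Szz = S[2]
    N = np.array([
        [Sxx+Syy+Szz, Syz-Szy,      Szx-Sxz,      Sxy-Syx],
        [Syz-Szy,     Sxx-Syy-Szz,  Sxy+Syx,      Szx+Sxz],
        [Szx-Sxz,     Sxy+Syx,     -Sxx+Syy-Szz,  Syz+Szy],
        [Sxy-Syx,     Szx+Sxz,      Syz+Szy,     -Sxx-Syy+Szz]])
    w, x, y, z = np.linalg.eigh(N)[1][:, -1]  # eigvec of largest eigval
    R = np.array([
        [1-2*(y*y+z*z), 2*(x*y-w*z),   2*(x*z+w*y)],
        [2*(x*y+w*z),   1-2*(x*x+z*z), 2*(y*z-w*x)],
        [2*(x*z-w*y),   2*(y*z+w*x),   1-2*(x*x+y*y)]])
    t = q_bar - R @ p_bar
    return R, t

def ransac_pose(P, Q, n_iter=500, inlier_thresh=0.05, rng=None):
    """RANSAC over minimal 3-point subsets; the threshold (meters)
    and iteration count are assumed values, not the paper's."""
    rng = rng or np.random.default_rng()
    best = (None, None, -1)
    for _ in range(n_iter):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = absolute_orientation(P[idx], Q[idx])
        errs = np.linalg.norm(Q - (P @ R.T + t), axis=1)
        n_in = int((errs < inlier_thresh).sum())
        if n_in > best[2]:
            best = (R, t, n_in)
    R, t, _ = best
    # refit on all inliers of the best hypothesis
    inliers = np.linalg.norm(Q - (P @ R.T + t), axis=1) < inlier_thresh
    return (*absolute_orientation(P[inliers], Q[inliers]), inliers)
```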

B. Pose calculation and uncertainty

For the absolute orientation computations of the previous sub-section, we use the quaternion based method of [13] to obtain the transform ${}^{i}D$ from reference camera coordinates to camera i coordinates. However, for reliable feature matching and pose estimation, we need to keep track of and accumulate the uncertainties in this pose estimate. Let the vector $d$ be the six parameter pose vector $d = [d_\gamma, d_\beta, d_\alpha, d_{tx}, d_{ty}, d_{tz}]$ specifying the rigid transformation from the current frame to the reference frame. Then we can define the transformation of the j-th 3D point $x_j$ in camera i coordinates (current view) to points $y_j$ in the reference camera coordinates:

$y_j = f(x_j, d)$

$y_j + \Delta y_j = f(x_j, d + \Delta d)$

where $\Delta y_j$ is the measurement error in $y_j$, $\Delta d$ is the estimation error in $d$, and $f(x, d)$ is the function which transforms the feature coordinates $x_j$ from the current frame to the model frame.

We use the method presented in [14], where

$f(x_j, d) = D(d)\,x_j$ (2)

and $D(d)$ is the 4×4 homogeneous transform matrix

$D(d) = \begin{bmatrix} R(d_\gamma, d_\beta, d_\alpha) & [d_{tx}\; d_{ty}\; d_{tz}]^T \\ 0\;\;0\;\;0 & 1 \end{bmatrix}$ (3)

and $R(d_\gamma, d_\beta, d_\alpha)$ is a rotation matrix specified by XYZ fixed angles $d_\gamma, d_\beta, d_\alpha$. A first order approximation of the error $\Delta y_j$ is

$\Delta y_j = \left[\left.\dfrac{\partial f}{\partial d}\right|_{d}\right]^{T} \Delta d = J_j\,\Delta d$ (4)

Stacking all measurement equations gives $\Delta Y = J\,\Delta d$ and $\Delta d = M^\dagger \Delta Y$, where $M^\dagger = (J^T J)^{-1} J^T$ is the Moore-Penrose pseudo-inverse of $J$. The pose covariance matrix is then

$C_d = M^\dagger C_y M^{\dagger T}$ (5)

where $C_y$ is a block diagonal matrix formed from the individual covariance matrices $E(\Delta y_i\,\Delta y_j^T)$. As pointed out in [14], we are assuming that the errors in the 3D points are independent, since we have set the off-diagonal terms of $C_y$ to zero. Since the errors in x and y are spatially dependent, this assumption is incorrect for non-random point distributions, and it is likely that the covariance matrix $C_d$ will be underestimated. We therefore compare the results obtained using Equation (5) with the results obtained from a Monte Carlo simulation.

To use Equation (5), we need estimates of the covariance for the error in each feature point location measurement, $E(\Delta y_j\,\Delta y_j^T)$. Assuming that the errors in 3D points are independent of depth and letting $\bar{y}_j = D(d)\,x_j$, the sample covariance of the matched features is

$C_y = E[\Delta y^T \Delta y] = \dfrac{1}{N-1} \sum_j (y_j - \bar{y}_j)^T (y_j - \bar{y}_j).$
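The following sketch assembles Equations (2) through (5) numerically, forming the stacked Jacobian by central differences rather than analytically. The function names and step size are ours; the XYZ fixed-angle convention follows Equation (3).

```python
import numpy as np

def D(d):
    """4x4 homogeneous transform of Equation (3): XYZ fixed angles,
    so R = Rz(alpha) @ Ry(beta) @ Rx(gamma), plus translation."""
    g, b, a, tx, ty, tz = d
    Rx = np.array([[1, 0, 0], [0, np.cos(g), -np.sin(g)], [0, np.sin(g), np.cos(g)]])
    Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
    Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [tx, ty, tz]
    return T

def pose_covariance(points, d, point_covs, eps=1e-6):
    """First-order pose covariance C_d = M+ C_y M+^T of Equation (5).

    points: (N, 3) feature locations x_j in the current frame.
    d: 6-vector pose estimate (numpy array, layout as in Equation (3)).
    point_covs: list of N 3x3 covariances E[dy_j dy_j^T].
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])  # homogeneous
    def f(dv):  # stacked transformed points, shape (3N,)
        return (pts_h @ D(dv).T)[:, :3].ravel()
    J = np.empty((3 * len(points), 6))
    for k in range(6):  # central-difference Jacobian w.r.t. d
        dp, dm = d.copy(), d.copy()
        dp[k] += eps; dm[k] -= eps
        J[:, k] = (f(dp) - f(dm)) / (2 * eps)
    M_pinv = np.linalg.solve(J.T @ J, J.T)   # (J^T J)^{-1} J^T
    Cy = np.zeros((3 * len(points),) * 2)    # block-diagonal C_y
    for j, C in enumerate(point_covs):
        Cy[3*j:3*j+3, 3*j:3*j+3] = C
    return M_pinv @ Cy @ M_pinv.T            # 6x6 pose covariance
```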

Over longer sequences it is not possible to use the same reference frame throughout, so our algorithm changes reference views when the overlap between the current view and the reference view gets too small. As a result, in order to estimate the uncertainty of a certain camera pose relative to the model, it may be necessary to accumulate the uncertainties through a sequence of reference poses. Say for example that we want to estimate the cumulative uncertainty resulting from uncertainty in two combined rigid transformations ${}^{m}_{c}d = g({}^{m}_{r}d,\, {}^{r}_{c}d)$, where

$g({}^{m}_{r}d,\, {}^{r}_{c}d) = Q\left(D({}^{m}_{r}d)\,D({}^{r}_{c}d)\right)$ (6)

and $D(d)$ is a 4×4 homogeneous transform matrix. The vector ${}^{b}_{a}d$ is a six-element pose vector specifying the rigid transformation from coordinate frame a to coordinate frame b. Each pose vector has an associated covariance matrix ${}^{b}_{a}C$, and $Q(\cdot)$ represents conversion of a 4×4 matrix to its 6 parameter form. The covariance matrix ${}^{m}_{c}C$ is given by [14, Equation 7]:

${}^{m}_{c}C = {}^{m}_{r}J\,{}^{m}_{r}C\,{}^{m}_{r}J^T + {}^{r}_{c}J\,{}^{r}_{c}C\,{}^{r}_{c}J^T$ (7)

where ${}^{b}_{a}J$ is the Jacobian $\left.\dfrac{\partial g}{\partial\,{}^{b}_{a}d}\right|_{{}^{b}_{a}d}$.

This formula is applied to maintain an estimate of cumulative pose uncertainty, and the uncertainty estimate is used in setting appropriate search ranges when matching features in new images to those in older images.
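A sketch of this composition follows, reusing D from the previous sketch and forming both Jacobians of Equation (7) by central differences. The conversion Q is written here for the XYZ fixed-angle convention of Equation (3); the gimbal-lock case is not handled.

```python
import numpy as np

def Q(T):
    """Convert a 4x4 homogeneous transform back to the 6-parameter
    pose vector [gamma, beta, alpha, tx, ty, tz] (inverse of D).
    Assumes cos(beta) > 0; no gimbal-lock handling in this sketch."""
    R = T[:3, :3]
    beta = np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0]))
    alpha = np.arctan2(R[1, 0], R[0, 0])
    gamma = np.arctan2(R[2, 1], R[2, 2])
    return np.array([gamma, beta, alpha, *T[:3, 3]])

def compose_pose_uncertainty(d_mr, C_mr, d_rc, C_rc, eps=1e-6):
    """Equation (7): covariance of the composed pose
    g(d_mr, d_rc) = Q(D(d_mr) @ D(d_rc))."""
    def jac(fn, d0):  # 6x6 central-difference Jacobian of fn at d0
        J = np.empty((6, 6))
        for k in range(6):
            dp, dm = d0.copy(), d0.copy()
            dp[k] += eps; dm[k] -= eps
            J[:, k] = (fn(dp) - fn(dm)) / (2 * eps)
        return J
    J_mr = jac(lambda d: Q(D(d) @ D(d_rc)), d_mr)
    J_rc = jac(lambda d: Q(D(d_mr) @ D(d)), d_rc)
    d_mc = Q(D(d_mr) @ D(d_rc))
    C_mc = J_mr @ C_mr @ J_mr.T + J_rc @ C_rc @ J_rc.T
    return d_mc, C_mc
```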

V. TUNNEL MODELING EXPERIMENTS

When driving down a tunnel which has not already been mapped, pose tracking must be performed by matching the currently visible image features to the same features in earlier views (or to the model under construction) and, as described above, this process necessarily involves accumulation of errors as older features leave the field of view. If the field of view is forward, and the only lighting is provided by the LHD, as in our experimental configuration, then features will persist for, at most, about 8 meters of travel. One must therefore expect “drift” in both pose and modeled structure. The amount of drift will depend both upon the accuracy of feature location estimates and the distance over which features can be tracked.

To evaluate the performance of the pose tracker/modeler in tramming operations, we compared our results with ground truth data for an 80 meter section of the Army tunnel at CSM’s Edgar Experimental Mine. Scott Schiele of I-SiTE Pty. Ltd. provided the ground truth data, formed by combining four separate 80°×320° scans made with a tripod-mounted Riegl model LMS-Z210 3D laser scanner. Range accuracy for this scanner is specified as 25 mm (1 sigma), and the accuracy for the combined dataset is also reported as 25 mm.

Figure 2 shows a 50 meter section of the ground truth dataset with a superimposed blue band indicating the trajectory of the LHD for these experiments. The LHD begins near the bottom of the figure and drives straight through the first intersection, then turns right into the muck bay near the top of the figure. The world coordinate system has its z axis in the direction of LHD forward travel, x to the right, and y down.

Fig. 2. Ground truth data with trajectory

The four synchronized video streams from the two stereo rigs were recorded for each of three 4-minute runs. The first run is at the slowest speed, about 0.3 meter/second, and ends with several left/right articulations in the muck bay. The second run averages about 0.4 meter/second, and the LHD reverses immediately after entering the muck bay. The third run averages 0.5 meter/second and also reverses from the muck bay back to the starting point. Since the frame rate averages only 7 frames/second, the video collected in the fastest run would be similar (except for motion and vibration blur) to a velocity of 2.1 meters/second (4.8 miles/hour) recorded at 30 frames/second.

Comparison of tracking and modeling accuracy/robustness for these 3 sequences provides information on the feasibility of operating at higher LHD velocity and indicates which components of the process need improvement. Since the LHD reverses back from the muck bay toward the starting point in the second and third runs, we process the video from the reverse segments into separate models for comparison with the models (and poses) generated on the outbound segment.

Key parameter settings were determined from trials performed on the video from the second run. The two most critical parameters affecting the frequency of reference pair updates are the inlier error threshold and the minimum inlier ratio of RANSAC.

Another critical parameter in the current implementation is the search range for the temporal tracker. The assumption is that features move less than the search range between successive frames. If this assumption is violated (for a particular feature), tracking results are guaranteed to be incorrect for that feature. On the other hand, as the search range is increased, the likelihood of incorrect matches increases with the search area, i.e., with the square of the search range. The required search range can be significantly reduced if the vehicle motion is modeled, for instance in a Kalman filter. Our system does not include such a Kalman filter and therefore required an adjustment to the search range at the point where the vehicle turned into the muck bay in the fastest run (run 3).
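One plausible way to derive such a search range from the accumulated pose uncertainty is sketched below; this heuristic (project a k-sigma positional uncertainty into the image through a pinhole model) is entirely our illustration, not the paper's rule, and the helper name is hypothetical.

```python
import numpy as np

def search_radius(C_point, depth, f, k=3.0):
    """Rough search-range heuristic for temporal matching.

    C_point: 3x3 positional covariance of a predicted feature location
        (pose uncertainty propagated to the 3D point, in meters^2).
    depth: feature depth in meters; f: focal length in pixels.
    Returns a k-sigma half-width in pixels, using the pinhole
    approximation pixels ~ f * meters / depth.
    """
    sigma_m = np.sqrt(np.max(np.linalg.eigvalsh(C_point)))  # worst-axis 1-sigma
    return k * f * sigma_m / depth
```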

A. Registering model to ground truth

To compare our estimated models to the ground truth, it was first necessary to register them together. To register them, we first manually extract a portion of the ground truth data in the neighborhood of the point clouds obtained from the first few frames of video data (after the LHD began to move) and then align them approximately by hand. Then we clip the points of the estimated model to eliminate outliers and use the iterated closest point (ICP) algorithm to more precisely register them to the ground truth points. There were 10,939 points in the resulting subset of ground truth data.

Since the estimated model point clouds are much denser (76,271 points were computed from frames 170 and 171) and contain outliers which would reduce registration accuracy, they are thinned to reduce density prior to running the ICP. Thinning is accomplished by first sorting on depth and binning into depth intervals of 0.1 meter. The points in each depth bin are then sorted on y (height) and decimated to a maximum of 200 points per bin. After thinning and outlier removal, 6,182 model points remained. Figure 3 shows the result of using ICP to register the model cloud to ground truth.
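A minimal sketch of this thinning step follows, assuming depth is the z coordinate of each point; the binning constants match those stated above, and the function name is ours.

```python
import numpy as np

def thin_point_cloud(points, depth_bin=0.1, max_per_bin=200):
    """Thin a dense model cloud before ICP, as described above:
    bin points into 0.1 m depth intervals, then within each bin sort
    on y (height) and decimate to at most max_per_bin points."""
    pts = np.asarray(points)                            # shape (N, 3)
    bins = np.floor(pts[:, 2] / depth_bin).astype(int)  # depth = z here
    kept = []
    for b in np.unique(bins):
        in_bin = pts[bins == b]
        in_bin = in_bin[np.argsort(in_bin[:, 1])]       # sort on y
        if len(in_bin) > max_per_bin:
            step = len(in_bin) / max_per_bin            # uniform decimation
            in_bin = in_bin[(np.arange(max_per_bin) * step).astype(int)]
        kept.append(in_bin)
    return np.vstack(kept)
```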

Fig. 3. Portion of model registered to ground truth using ICP

After registering the model point cloud to the ground truth point cloud, the RMS distance between model and ground truth points is 2 cm. The accuracy (1-sigma) of the ground truth data is approximately 3 cm, and the stereo rig calibration results indicate an accuracy of approximately 1 cm for the model data (over a similar volume). The average distance between closest points in the ground truth model is 3.9 cm; for the stereo data it is 1.2 cm. These results indicate that the model point cloud is closely aligned to the ground truth cloud, with translation uncertainty (assumed isotropic) of less than 1 cm. Shape complexity is sufficient to provide accurate orientation; the rotation estimated by ICP is 5.699° with estimated sigma of 0.024°. Visual inspection of the registered point clouds confirms that the ICP algorithm has converged to the correct pose for the model data.

B. Estimates of tracking drift

To estimate the total drift (cumulative error) in the translation component of the measured LHD pose on the trip out to the muck bay, we register one end of the 3D model to ground truth using the ICP algorithm as described in the previous section. Aligning to ground truth in this way highlights translational pose error at the end of the LHD’s trajectory in the muck bay. Cumulative translational drift for the three runs was 8, 4, and 7%, respectively, of distance traveled. Figure 4 shows the entire estimated vehicle trajectory and estimated 3D model in the muck bay (both in red) against a cross section of the ground truth data in green. In this figure, the stereo model has been registered to ground truth at the start of the run (at the left edge of the figure) as described above. The grid spacing is 1 meter, and one can see the magnitude of the cumulative mis-registration.

Fig. 4. Ground truth cross-section with trajectory and muck bay model in red

When driving down a tunnel, the only visible scene structure is the floor and walls a few meters ahead of the LHD. We use the RANSAC inlier ratio as a measure of the quality of the pose based on the current reference view, since it is an accurate measure of the structure overlap between current and reference views. There is a correlation between the actual inlier ratio and the number of successfully tracked features for a given stereo rig. This is due to the fact that stereo match quality depends on the same image characteristics as does match quality in temporal tracking. In video segments with poor stereo performance, this correlation suggests a violation of our assumption that the inlier ratio exceeds 50%.

Figure 5 shows a subset of tracking data with changes of reference view annotated. The reference view is changed to the most recent view when the inlier ratio drops below 70%. Groups of frames matched to the same reference view are separated by the sloping black lines. Frames 250-300 of run 3 were taken at z coordinates ranging from 8.4 to 12.4 meters. In run 1, this z range corresponds to frames 450-564; in run 2 it is frames 353-417. The average z component of velocity is 80 mm/frame in run 3, 63 mm/frame in run 2, and only 35 mm/frame in run 1.

Fig. 5. Run 2 tracking detail showing changes of reference view (inlier ratio vs. frame number)

Each reference frame in run 2 is used over a larger distance than in runs 1 and 3. Observed pose drift is also lowest in run 2, and this may indicate that pose tracking is also more accurate in run 2, although it would be necessary to collect and process additional runs at this speed to verify this.

C. Detailed Run 2 Results Including Reverse Trip

Comparison of the surface model for the vicinity of the muck bay with the full ground truth model (Figure 4) indicates that translational drift is on the order of 2 meters (4% of the distance traveled). The yellow grid is drawn in the world x,z plane with a spacing of 1 meter, and the green dots represent the ground truth data. Near the right side of the figure is the pillar at the left edge of the muck bay, and the integrated model of the pillar is about 1.5 meters to the left and 1 meter below (in red). This translation error is about 4% of the distance traveled and represents the cumulative pose error from tracking through 800 frames of video. The fact that the actual translational drift is small relative to the distance traveled indicates that precision of estimated LHD translation and orientation is good (and biases are small) over the entire trajectory from start to muck bay. The actual pose drift observed in runs 1, 2, and 3 correlates well with an image plane feature location accuracy of 0.8 pixels (the accuracy obtained with calibration targets is 0.2 pixels).

The LHD reverses direction after reaching the muck bay. Since pose drift is significant, 3D structure obtained from subsequent frames was integrated into a separate model to allow independent comparison of drift on the return path. When treated independently of the outgoing trip, the cumulative translational drift on the reverse trip is about 0.25 meter in the x-z plane (plan view) and about 6 cm in the y direction. This is actually a decrease in the drift observed at the muck bay, suggesting the presence of a systematic pose error component (bias).

Figure 7 displays a section of both outbound and reverse surface models, outbound in blue and reverse in red. Cumulative error of approximately 0.25 meters is apparent in the displacement of the reverse model relative to the outbound model.



Fig. 6. Trajectory and structure near end of reverse segment, run 2: plan view


Fig. 7. Outbound surface model (blue) overlaid with reverse model (red): plan view

VI. CONCLUSION

Our work is the first use of stereo vision for combined pose tracking and integrated volumetric modeling in an underground mining environment. Our method is a compromise between a fully recursive and a batch method, in that it recursively updates the structure, but references image data stored earlier for association of new data. The growth in both storage requirements and computational complexity of this method with time is very low, while it retains the stability of a batch method. We have demonstrated the utility of our multi-rig stereo configuration for robust and accurate vehicle pose tracking without the use of a kinematic model. Pose uncertainty is estimated both from sensor data and comparison with ground truth; translational drift is shown to be less than 10% of distance traveled. The system can also construct highly detailed local scene models while in motion, a capability necessary for LHD loading automation. Temporal integration of 3D data increases the density and reduces noise in the 3D scene models. 3D model results are compared with ground truth data and show model accuracy on the order of a few centimeters over an 8 meter distance. Pose uncertainty (drift) is estimated using both analytical and Monte Carlo techniques and compared with the actual drift by comparison with ground truth.

REFERENCES

[1] R. Koch, M. Pollefeys, and L. J. V. Gool, “Realistic surface reconstruction of 3D scenes from uncalibrated image sequences,” Journal of Visualization and Computer Animation, vol. 11, no. 3, pp. 115–127, 2000. [Online]. Available: citeseer.nj.nec.com/koch00realistic.html

[2] R. Koch, “3D surface reconstruction from stereoscopic image sequences,” in Proceedings of the 5th International Conference on Computer Vision, Cambridge, Massachusetts, USA, June 1995, pp. 109–114.

[3] S. Baten, R. Mandelbaum, M. Luetzeler, P. Burt, and E. Dickmanns, “Techniques for autonomous, off-road navigation,” IEEE Intelligent Systems magazine special issue on Vision-Based Driving Assistance, pp. 57–65, Nov.–Dec. 1998.

[4] T. J. Moyung, “Incremental 3D reconstruction using stereo image sequences,” Master’s thesis, University of Waterloo, Waterloo, Ontario, Canada, 2000.

[5] S. Se, D. Lowe, and J. Little, “Vision-based mobile robot localization and mapping using scale-invariant features,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Seoul, Korea, May 2001, pp. 2051–2058. [Online]. Available: citeseer.nj.nec.com/se01visionbased.html

[6] S. Thrun, W. Burgard, and D. Fox, “A real-time algorithm for mobile robot mapping with applications to multi-robot and 3D mapping,” in IEEE International Conference on Robotics and Automation, April 2000.

[7] S. Thrun, D. Hahnel, D. Ferguson, M. Montemerlo, R. Triebel, W. Burgard, C. Baker, Z. Omohundro, S. Thayer, and W. Whittaker, “A system for volumetric robotic mapping of abandoned mines,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2003.

[8] N. Molton and M. Brady, “Practical structure and motion from stereo when motion is unconstrained,” International Journal of Computer Vision, vol. 39, no. 1, pp. 5–23, 2000. [Online]. Available: citeseer.ist.psu.edu/molton00practical.html

[9] S. Singh and B. Digney, “Autonomous cross-country navigation using stereo vision,” 1999. [Online]. Available: citeseer.ist.psu.edu/singh99autonomous.html

[10] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the Association for Computing Machinery, vol. 24, no. 6, pp. 381–395, 1981.

[11] B. K. Horn, “Closed-form solution of absolute orientation using unit quaternions,” J. Opt. Soc. Am. A, vol. 4, no. 4, pp. 629–642, April 1987.

[12] J. J. Craig, Introduction to Robotics: Mechanics and Control. Addison-Wesley Publishing Company, Inc., 1989.

[13] P. J. Besl and N. D. McKay, “A method for registration of 3-D shapes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 239–256, February 1992.

[14] W. Hoff and T. Vincent, “Analysis of head pose accuracy in augmented reality,” IEEE Transactions on Visualization and Computer Graphics, vol. 6, no. 4, pp. 1–16, 2000.
