
Department of Informatics
Robotics and Perception Group

Tim Oberhauser

Camera Localization in a 3D Map

Semester Thesis

Robotics and Perception Lab
University of Zurich

Supervision:
Christian Forster
Davide Scaramuzza

February 2013

Contents

Abstract

1 Introduction
  1.1 Related Work

2 Approach
  2.1 Map Building
    2.1.1 RGBDSLAM
    2.1.2 Filtering of the Map
  2.2 Camera Localization
    2.2.1 Feature Extraction
    2.2.2 Matching
    2.2.3 Pose Estimation

3 Results
  3.1 Intermediate Results
    3.1.1 Kinect and AR.Drone Camera Compared
    3.1.2 Conclusion from the Intermediate Results
  3.2 Final Results
    3.2.1 Feature Map
    3.2.2 Kinect Camera
    3.2.3 AR.Drone Camera
    3.2.4 Performance Analysis

4 Discussion
  4.1 Future Work

Abstract

An approach has been implemented to estimate the pose of a camera with respect to a 3D map. RGBDSLAM has been used to build the 3D map and has been modified to export the information needed to localize the camera: the 3D locations and descriptors of all features. Features are extracted from the camera image and matched with the features of the 3D map using their descriptors. Finally, the pose of the camera is computed with a RanSaC implementation of the perspective-three-point (P3P) algorithm. If a current pose estimate is available, the camera model can be used to select the features of the map that are currently believed to be visible to the camera. This substantially reduces the number of features in the matching step, which is computationally the most expensive part of the algorithm. Because of the way RGBDSLAM stores key frames, the same feature can appear multiple times in the feature map; to speed up the matching step, an option has been added to filter out these duplicates. The approach has been evaluated in practical experiments using the RGB camera of the Kinect without its depth measurement and, finally, the front camera of the AR.Drone.


Chapter 1

Introduction

The broader scope of this project is to use a ground robot paired with a micro unmanned aerial vehicle (MAV) for indoor exploration. The ground robot could be equipped with an RGB+depth camera such as the Kinect, which can be used to build a 3D model of the environment. To prevent the MAV from colliding with walls, it needs to be localized with respect to this map. This could be done using the onboard camera of the MAV. The problem has been simplified to localizing a camera in a 3D map that is assumed to be built beforehand. An approach has been implemented and is presented and evaluated in this report.

1.1 Related Work

For a similar application, autonomous indoor exploration with a ground robot and a MAV, Rudol et al. [1] proposed a method to estimate the pose of the MAV with respect to the ground robot. Light emitting diodes (LEDs) of different colors are mounted on the MAV in a known pattern. A monocular camera on the ground robot detects the LEDs, and the known pattern allows the MAV's relative pose to be recovered. Many have tackled the simultaneous localization and mapping (SLAM) problem with monocular cameras; two well-known systems are MonoSLAM [2] and PTAM [3]. The difference to this project is that here the map can be built with an RGB+depth camera, which, thanks to the additional depth measurement, can be more precise; the monocular camera is only used for localization. However, the presented approach should be seen as a pose measurement technique and still needs to be combined with a motion model to compete with the mentioned approaches in terms of precision.


Chapter 2

Approach

The goal of this work is to find the camera pose with respect to a 3D map using only the camera's visual information. The approach described in this report matches image features from the camera image against the 3D map. Therefore, only the feature descriptors and their 3D locations need to be passed to the routine; from here on, such a structure will be called a feature map.

2.1 Map Building

There are different approaches to building a 3D map. RGBDSLAM [4] has been chosen, which uses an RGB-D (RGB + depth) camera such as the Microsoft Kinect. Intentionally, this algorithm was only minimally modified, as it might later be replaced by another routine that better suits the problem at hand. Since it was not fundamentally changed, the algorithm needs to be well understood to identify potential drawbacks; it is summarized briefly in the following section. The only modification that was necessary was to efficiently pass the feature map to the proposed routine. This was achieved with a custom ROS message type [5], which contains all descriptors and feature locations in a single message and is sent on request by the user (by pressing CTRL+K in the RGBDSLAM window).
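The feature map passed to the localization routine can be thought of as two parallel arrays. The following sketch shows such a container; the class name and fields are illustrative assumptions, and in the actual system the same data is carried by the custom ROS message mentioned above.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class FeatureMap:
        """Illustrative container for the data exported from RGBDSLAM."""
        positions: np.ndarray    # N x 3 feature locations in the map frame [m]
        descriptors: np.ndarray  # N x D descriptors (D = 64 for SURF, 128 for SIFT)

    # Example with empty SURF-sized arrays; a real map holds thousands of entries.
    feature_map = FeatureMap(
        positions=np.zeros((0, 3)),
        descriptors=np.zeros((0, 64), dtype=np.float32),
    )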

    2.1.1 RGBDSLAM

RGBDSLAM consists of a front-end and a back-end. The front-end tries to find relative poses between pairs of key frames, and the back-end optimizes these poses to form a globally consistent pose graph. For every new key frame, features are extracted and matched to the features of a subset of previous key frames using the feature descriptors. From these correspondences, and taking into account the depth information at the key points, the relative transformation between two key frames can be found. However, there can be false matches, and the depth can be incorrect due to interpolation errors and the unsynchronized shutters of the color and infrared cameras. Therefore, a Random Sample Consensus (RanSaC) step is executed, which randomly chooses three matched feature pairs. At this stage, outliers can already be identified by comparing the pairwise Euclidean distances. If the chosen samples fulfill this criterion, the rigid transformation can be calculated and applied to all matched features. A matched pair is considered an inlier if the distance between the two features after applying the newly found transformation is smaller than 3 cm. The transformation can be refined using all inliers. This process is repeated, and the transformation with the most inliers is finally used. The key frame is added if it could be matched with one of the previous key frames. If a valid transformation could be found between two key frames, an edge is added to the pose graph. It is the task of the back-end to optimize the pose graph to make it globally consistent; this is of special importance in the case of loop closures.
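As an illustration of this front-end step, the following sketch estimates a rigid transformation from three matched 3D feature pairs and counts inliers with the 3 cm threshold mentioned above. It is a minimal reimplementation for illustration only, not RGBDSLAM's actual code; the SVD-based alignment, the iteration count, and all variable names are assumptions.

    import numpy as np

    def rigid_transform(P, Q):
        """Least-squares rigid transform (R, t) with R @ P[i] + t ~ Q[i]; P, Q are N x 3."""
        cP, cQ = P.mean(axis=0), Q.mean(axis=0)
        H = (P - cP).T @ (Q - cQ)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:        # avoid a reflection
            Vt[-1] *= -1
            R = Vt.T @ U.T
        return R, cQ - R @ cP

    def ransac_rigid(src, dst, iterations=200, inlier_thresh=0.03):
        """RanSaC over matched 3D points src -> dst (both N x 3), 3 cm inlier threshold."""
        rng = np.random.default_rng(0)
        best_R, best_t, best_inliers = None, None, np.zeros(len(src), dtype=bool)
        for _ in range(iterations):
            idx = rng.choice(len(src), 3, replace=False)
            # consistency check: pairwise distances must agree between the two frames
            dp = np.linalg.norm(src[idx] - np.roll(src[idx], 1, axis=0), axis=1)
            dq = np.linalg.norm(dst[idx] - np.roll(dst[idx], 1, axis=0), axis=1)
            if np.any(np.abs(dp - dq) > inlier_thresh):
                continue
            R, t = rigid_transform(src[idx], dst[idx])
            err = np.linalg.norm(src @ R.T + t - dst, axis=1)
            inliers = err < inlier_thresh
            if inliers.sum() > best_inliers.sum():
                # refine with all inliers of this hypothesis
                best_R, best_t = rigid_transform(src[inliers], dst[inliers])
                best_inliers = inliers
        return best_R, best_t, best_inliers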

    Figure 2.1: 3D map built by RGBDSLAM

For every key frame, among other things, the feature descriptors and the 3D feature locations are stored. However, it is not saved which features match which features in other frames. If all feature descriptors and their locations were exported, every feature that has been observed in multiple key frames would also exist multiple times in the exported feature map. This can slow down the matching process, which is why filtering the feature map can be advantageous. One approach to filtering the feature map has been implemented and is described in the next section.

    2.1.2 Filtering of the Map

For every feature of the feature map, the best k matches are found by OpenCV's knnMatch from the DescriptorMatcher class [6]. If the matches have a distance smaller than a certain threshold, they are considered to represent the exact same feature. The 3D locations of the features that are considered to belong together are averaged to reduce the influence of noise in the depth measurements and of errors from the map building process. Figure 2.2 shows the feature map of figure 2.1 before and after filtering, using a distance threshold of 0.5 m.
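A minimal sketch of such a filtering step is shown below, using knnMatch to group near-identical descriptors and averaging the 3D locations of each group. The union-find grouping and the threshold values in the toy example are illustrative assumptions, not the exact implementation.

    import numpy as np
    import cv2

    def filter_feature_map(positions, descriptors, k=5, dist_thresh=0.1):
        """Merge features whose descriptor distance is below dist_thresh; average positions."""
        n = len(positions)
        parent = list(range(n))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        matcher = cv2.BFMatcher(cv2.NORM_L2)
        # match the map against itself; each feature's best matches include itself
        for matches in matcher.knnMatch(descriptors, descriptors, k=k):
            for m in matches:
                if m.queryIdx != m.trainIdx and m.distance < dist_thresh:
                    parent[find(m.queryIdx)] = find(m.trainIdx)

        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)

        filt_pos = np.array([positions[idx].mean(axis=0) for idx in groups.values()])
        filt_desc = np.array([descriptors[idx[0]] for idx in groups.values()])
        return filt_pos, filt_desc

    # toy example: duplicated features collapse back to single entries
    desc = np.random.rand(20, 64).astype(np.float32)
    descriptors = np.vstack([desc, desc + 1e-4])      # each feature appears twice
    positions = np.random.rand(40, 3)
    filt_pos, filt_desc = filter_feature_map(positions, descriptors)
    print(len(positions), "->", len(filt_pos))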

    2.2 Camera Localization

The working principle of the proposed algorithm is shown in figure 2.3. In order to compute the camera pose with respect to the feature map, points of the current image need to be registered to points with known 3D position, i.e. the points of the feature map. This step is done by matching the feature descriptors of the current image with the descriptors of the feature map. Clearly, the descriptors of the feature map have to be of the same kind as the descriptors of the features in the current image.

Figure 2.2: Feature map (a) before and (b) after filtering, displayed by rviz [7]

    Figure 2.3: Working principle of the proposed algorithm

    2.2.1 Feature Extraction

RGBDSLAM works with SIFT [8], SURF [9] and ORB [10] features. In this project, only SIFT and SURF have been tested because they lead to more accurate results. However, the approach presented here is not intended to work exclusively with certain descriptors; it should be adaptable to other map building approaches that use different descriptors as well.
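As a concrete illustration, SURF features can be extracted with OpenCV roughly as follows. This sketch assumes an OpenCV build with the contrib (xfeatures2d) module, uses a random stand-in image, and simply mirrors the detector threshold and octave count of the standard setup used later in table 3.3.

    import numpy as np
    import cv2

    # stand-in for a real grayscale camera frame (480 x 640, like the Kinect image)
    gray = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=200, nOctaves=4)
    keypoints, descriptors = surf.detectAndCompute(gray, None)   # descriptors: N x 64 float32

    # SIFT would be used analogously, e.g. via cv2.xfeatures2d.SIFT_create()
    print(len(keypoints), "features")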

    2.2.2 Matching

OpenCV's BFMatcher [6] has been used to match the descriptors of the current image to the descriptors of the feature map. For SURF and SIFT features, the L2-norm can be used as an error measure between two descriptors. The BFMatcher also has an option for cross-checking: with this turned on, feature i from set A and feature j from set B are only considered a match if j is the best match for i and vice versa. This method tends to produce the best matches with the fewest outliers. However, if a feature exists multiple times in one set, as in a feature map built from RGBDSLAM, the probability of satisfying the cross-checking criterion is decreased substantially. I therefore recommend not using cross-checking if the map contains the same features multiple times. Note that with this option turned off, BFMatcher will find a match for every feature.
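The two matching modes can be set up as in the following sketch; the descriptor arrays are random stand-ins and the sizes are only illustrative.

    import numpy as np
    import cv2

    image_desc = np.random.rand(500, 64).astype(np.float32)    # descriptors of the current image
    map_desc   = np.random.rand(5000, 64).astype(np.float32)   # descriptors of the feature map

    # cross-checking on: only mutual best matches are returned (fewer, cleaner matches)
    matches_cc = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(image_desc, map_desc)

    # cross-checking off: every image feature gets its nearest map feature,
    # so outlier rejection is left entirely to the P3P RanSaC step
    matches_all = cv2.BFMatcher(cv2.NORM_L2, crossCheck=False).match(image_desc, map_desc)

    print(len(matches_cc), "cross-checked matches,", len(matches_all), "plain matches")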

Unfortunately, matching feature descriptors is computationally quite expensive, as it involves calculating the L2 norm of every possible feature pair from sets A and B. One way to accelerate this step is to match the features of the current image with only a subset of the features of the map. If an estimate of the current camera pose is available, the features of the map can be projected onto the image plane using the camera model. If a projected feature lies within the boundaries of the image sensor, it is considered to be visible. Matching the features of the current frame only against the visible features of the feature map suggests itself; this can substantially reduce the number of features to match and, accordingly, the computational cost. I distinguish between two cases: global pose estimation, i.e. matching the features of the current image with all features of the map, and tracking, i.e. matching the features of the current image with only the visible features of the map. Tracking is discussed in the next section.

    Tracking

Obviously, the pose must be known in order to project world points onto the image plane. This is why the first pose always has to be computed by global localization. After that, the pose computed from the previous frame is used to project the features for matching with the current frame. If the proposed approach is fused with a more advanced velocity model (and other measurement techniques), the current estimate of the pose should of course be used for the projection. Matching the features of the current image with only the visible features of the feature map clearly speeds up the matching step of the algorithm, and the probability of wrong matches should decrease as well. If the pose estimation has produced a wrong pose, however, the wrong features are reprojected and matched. To alleviate the consequences of this, a rule has been introduced: if for two subsequent frames the resulting pose is not trusted, the following frames are processed by global pose estimation. As soon as a valid pose has been found again, the routine switches back to tracking. The cases in which a pose is considered valid or invalid are discussed in section 2.2.3. A sketch of the visibility selection and of this fallback rule is given below.
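In the sketch, the intrinsics, image size, and variable names are illustrative assumptions, and OpenCV's projectPoints stands in for the camera model.

    import numpy as np
    import cv2

    K = np.array([[525.0, 0.0, 319.5],     # assumed pinhole intrinsics
                  [0.0, 525.0, 239.5],
                  [0.0, 0.0, 1.0]])
    dist = np.zeros(5)                      # assume an undistorted image
    WIDTH, HEIGHT = 640, 480

    def visible_subset(map_positions, rvec, tvec):
        """Indices of map features that project inside the image and lie in front of the camera."""
        px, _ = cv2.projectPoints(map_positions, rvec, tvec, K, dist)
        px = px.reshape(-1, 2)
        R, _ = cv2.Rodrigues(rvec)
        z_cam = (map_positions @ R.T + tvec.reshape(1, 3))[:, 2]
        ok = (px[:, 0] >= 0) & (px[:, 0] < WIDTH) & \
             (px[:, 1] >= 0) & (px[:, 1] < HEIGHT) & (z_cam > 0)
        return np.flatnonzero(ok)

    # Fallback rule: after two consecutive untrusted poses, fall back to global
    # pose estimation until a valid pose has been found again.
    failed_frames, use_tracking = 0, False

    def update_mode(pose_valid):
        global failed_frames, use_tracking
        if pose_valid:
            failed_frames, use_tracking = 0, True
        else:
            failed_frames += 1
            if failed_frames >= 2:
                use_tracking = False

    # toy usage: map points roughly 3 m in front of an identity-pose camera
    map_positions = np.random.uniform(-2, 2, size=(1000, 3)) + np.array([0.0, 0.0, 3.0])
    print(len(visible_subset(map_positions, np.zeros(3), np.zeros(3))), "visible features")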

    2.2.3 Pose Estimation

After finding point correspondences, the camera pose can be calculated using the perspective-three-point (P3P) algorithm [11], more specifically a RanSaC [12] implementation thereof. The algorithm randomly chooses a set of 3 matched feature pairs. First, the set has to be checked for the degenerate case in which the three points lie close to a line. In addition, another rule has been implemented: the pairwise distances of the 3 points in the image must be greater than 10 pixels. Since the 3D locations of these points are known, the 3 pairs generally restrict the possible camera poses to four candidates. Using these four candidate poses, a fourth world point can be projected into the camera's image plane, and the ambiguity can be resolved by comparing the location of the projected point with the location of its match. Similarly, the remaining matched world points are projected onto the image plane; if the distance between a projected world point and its match is lower than a certain reprojection threshold, it is considered an inlier. Finally, the pose with the most inliers and the list of these inliers are returned. The resulting pose will always have at least 4 inliers, namely the ones used for computing the pose. If no further inliers are found, not a single additional feature pair supports the model found by the algorithm; such a result is very likely to be wrong and should not be trusted. A minimum inlier threshold therefore suggests itself: if the number of inliers does not satisfy this condition, the pose is not considered valid. The choice of the threshold and the effect of such a measure are discussed in section 3.1.
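The following sketch reproduces this step with OpenCV's PnP RanSaC on synthetic data; it stands in for the Kneip P3P implementation [11] used in the actual system, and the intrinsics, the synthetic scene, and the parameter values (which mirror table 3.3) are assumptions.

    import numpy as np
    import cv2

    K = np.array([[525.0, 0.0, 319.5],
                  [0.0, 525.0, 239.5],
                  [0.0, 0.0, 1.0]])
    dist = np.zeros(5)                      # assume an already undistorted image

    # Synthetic "map" points and a ground-truth pose, used here only to create matches.
    rng = np.random.default_rng(0)
    world_pts = rng.uniform([-2, -2, 2], [2, 2, 6], size=(200, 3))
    rvec_true = np.array([0.1, -0.2, 0.05])
    tvec_true = np.array([0.3, -0.1, 0.5])
    img_pts, _ = cv2.projectPoints(world_pts, rvec_true, tvec_true, K, dist)
    img_pts = img_pts.reshape(-1, 2)
    img_pts[:20] += rng.uniform(-40, 40, size=(20, 2))   # simulate some wrong matches

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        world_pts, img_pts, K, dist,
        iterationsCount=1000,
        reprojectionError=1.5,
        confidence=0.9999,
        flags=cv2.SOLVEPNP_P3P)

    # Minimum-inlier rule from section 3.1: reject poses supported by too few matches.
    MIN_INLIERS = 8
    pose_valid = ok and inliers is not None and len(inliers) >= MIN_INLIERS
    print(pose_valid, rvec.ravel(), tvec.ravel())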

Chapter 3

    Results

After building a map using RGBDSLAM with the RGB and depth data of the Microsoft Kinect, every calibrated camera should be localizable using only its grayscale images. The proposed approach has been tested with two different devices: the Microsoft Kinect, using its grayscale image without the depth information, and the AR.Drone 2 [13], using its front camera.

Figure 3.1: The flying room of the AI Lab of the University of Zurich, also showing 6 of the 10 cameras of the Optitrack tracking system [14]

Matching the features of a Kinect image with a map that has also been built using the Kinect worked well without much parameter tuning. The AR.Drone brought about some issues, which could only be partly solved. For this camera, it made sense to first consider a simplified case instead of directly trying to match the image to the map. This experiment is introduced and presented in the next section. Its goal was to explore some of the problems of the AR.Drone camera and to come up with a reasonable choice for some of the parameters. These were then used for localizing both cameras in the 3D map; the results are presented in section 3.2.

    3.1 Intermediate Results

The localization procedure involves matching features of the map with features extracted from the to-be-localized camera. The map in our case will always be built by the Kinect. Hence, it makes sense to investigate how matching features from a Kinect frame to a Kinect frame compares to matching an AR.Drone frame to a Kinect frame. This has been investigated with a very simple experiment, which is described in the following. A single frame with depth information captured by the Kinect serves as a replacement for the map. The AR.Drone has been placed such that it had the same viewpoint as the Kinect when the single frame was taken. The pose estimation process described in chapter 2 is run on each frame of the video. Since the camera has not been moved, the correct solution is close to identity for the rotational part and close to zero for the translational part. To have a comparison to the results from the AR.Drone, the same experiment has also been conducted using the Kinect.

This experiment leaves aside many issues that arise during localization with respect to the feature map, for instance duplicate features, noisy feature positions due to the map building algorithm, and very different viewing angles and scales. There is no guarantee that the routine with the parameters chosen from this simple experiment will behave well in a much more complicated environment, but tuning all the parameters by rerunning the actual localization over and over again would have been very tiresome, if not impossible. Another simplification made for these experiments is that only SURF features have been used, with the detector threshold held constant; the detector threshold essentially determines the number of features. In the following, the two cameras are compared using the described simple experiment.

    3.1.1 Kinect and AR.Drone Camera Compared

Figure 3.2: A frame from (a) the AR.Drone at its higher resolution and (b) the Kinect from the test experiment

    Feature Extraction

The AR.Drone camera has a resolution of 720 by 1280 pixels at a frame rate of 30 frames per second; the resolution can also be switched to 360 by 640. First, the higher resolution is considered.


The features of the AR.Drone image should be matched with the features of the map, or in this case of a single frame coming from the Kinect. The features of the map have also been extracted from a Kinect image, which has a resolution of 480 by 640. The feature detection process involves downsampling the image a certain number of times; this number is commonly called the number of octaves. If features are to be matched between cameras with different resolutions, the number of octaves has to be adjusted accordingly.

By default, the number of octaves is set to 4; this value is used to build the map and, in this experiment, to extract features from the Kinect image. If the same features are to be detected in an image with double the resolution, i.e. the higher resolution AR.Drone image, one more downsampling step has to be added, so the number of octaves for this image should be set to 5. A small sketch of this adjustment is given below.
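In OpenCV this amounts to nothing more than changing the nOctaves parameter of the detector, as sketched here (assuming the contrib SURF implementation):

    import cv2

    surf_map   = cv2.xfeatures2d.SURF_create(hessianThreshold=200, nOctaves=4)  # 480 x 640 Kinect image
    surf_hires = cv2.xfeatures2d.SURF_create(hessianThreshold=200, nOctaves=5)  # 720 x 1280 AR.Drone image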

Whether this leads to better matches is difficult to state, but what matters in the end is whether or not it leads to a more accurate pose estimate. Hence, the precision resulting from 4 and 5 octaves for the higher resolution image, and for the lower resolution image, is shown in table 3.1. The table also lists the precision resulting from matching a Kinect image to a single Kinect frame. The reprojection threshold has been set to 4.0 for the higher resolution test and to 2.0 for the lower resolution. Since all images have been recorded from the same position, the translational error is simply the distance from zero. The error caused by imprecise placement of the AR.Drone camera is constant and, compared to the precision of the algorithm, small. Using the lower resolution yields the best results among the AR.Drone images and also brings a computational advantage.

Table 3.1: Comparison of the localization with respect to a single Kinect frame, using an image of the AR.Drone at different resolutions and different numbers of octaves, and using the Kinect

                                  number of   translational error      number of inliers   number of
                                  octaves     (mean ± std.dev.) [m]    (mean ± std.dev.)   measurements
Kinect to Kinect                  4           0.0110 ± 0.0083          369.18 ± 13.59      68
AR.Drone to Kinect, 720×1280      4           0.5113 ± 0.4412          39.79 ± 21.41       110
AR.Drone to Kinect, 720×1280      5           0.4834 ± 0.3809          44.25 ± 24.06       109
AR.Drone to Kinect, 360×640       4           0.4564 ± 0.1541          49.57 ± 10.99       95

    Feature Matching

In figure 3.3, typical results of matching each camera's image with a Kinect frame are shown. Obviously, the matches from the AR.Drone to the Kinect frame contain many more false matches than the ones from Kinect to Kinect. Reasons for this will be discussed in chapter 4.


Figure 3.3: Typical matching result of (a) two Kinect frames and an AR.Drone frame with a Kinect frame, using (b) the higher resolution image and (c) the lower resolution image

    P3P RanSaC

Having seen the quality of the matches from the AR.Drone to the Kinect frame, it is not surprising that the output of P3P is also much worse for this case; figure 3.4 confirms this.

A measure that has been mentioned before is the introduction of a minimum number of inliers to decide whether or not a measurement should be trusted. This has a notable impact only on very noisy data. Such a situation has been simulated by using a low feature detector threshold of 50 instead of 200. For this result, the higher resolution image has been used. The introduction of an inlier threshold entails that some of the measurements are lost. Figure 3.5 illustrates what impact an inlier threshold has on the precision of the pose estimate.

Figure 3.4: Inliers after the P3P RanSaC step of (a) two Kinect frames and an AR.Drone frame with a Kinect frame, using (b) the higher resolution and (c) the lower resolution image

    3.1.2 Conclusion from the Intermediate Results

Since the algorithm works with the lower resolution as well as with the higher resolution, and taking into account the computational advantage, only the lower resolution images will be used from here on.

    Also, a minimum inlier threshold has been introduced. Through many experi-ments, 8 proved to be a reasonable choice.

    3.2 Final Results

In this section, the performance of the algorithm is tested in the actual application: localizing a camera in a feature map. The two cameras are considered as two separate cases. In both cases, the ground truth trajectory has been recorded using the Optitrack motion capture system [14]. The motion of an object


Figure 3.5: Error [m] of the position estimate over the iterations, comparing the original output of P3P, the average error without the minimum inlier threshold, and the measurements rejected as outliers by the threshold


    Figure 3.6: Trackable Kinect

based on SURF features to the one on SIFT features. To achieve more precise map building and to simulate the slow movement of the ground robot, the dataset was actually played back at a fifth of the original speed. This way, more key frames were added.

Table 3.2: RGBDSLAM: standard setup

option                        standard choice
use SIFTGPU                   off
feature type                  SURF
minimum number of features    600
maximum number of features    1600
feature matching              FLANN [16]

    3.2.2 Kinect Camera

To examine the influence of certain parameters on the precision of the proposed approach, a standard setup has been chosen. In each of the following sections, one option of the standard setup is varied to investigate the impact on precision and, where meaningful, computational time. The chosen standard setup is shown in table 3.3.

Table 3.3: Kinect localization: standard setup

part                 option                   standard choice
map building         filter map               on
                     distance threshold       0.5 m
feature extraction   feature type             SURF
                     detector threshold       200
                     number of octaves        4
feature matching     tracking                 on
P3P                  minimum inliers          8
                     reprojection threshold   1.5 pixels
                     maximum iterations       1000
                     maximum data trials      40
                     p-value                  0.9999

Reproducing the RGBDSLAM Trajectory

The dataset has been recorded by moving the trackable Kinect by hand at a relatively low height to simulate a possible future application with the Kinect mounted on a ground robot. Figure 3.7 shows the trajectory estimated by RGBDSLAM and by the proposed approach, as well as the ground truth from Optitrack. The precision of the two trajectory estimates can be found in table 3.4. Note that the ground truth trajectory exhibits one clear disturbance, most probably because the person holding the camera blocked the view of the tracking system.

Figure 3.7: Resulting trajectory estimate when trying to reproduce the RGBDSLAM trajectory with only the visual data, using the proposed approach in the standard configuration (x [m] vs. y [m]; showing the proposed approach, RGBDSLAM, and the ground truth)

Table 3.4: Comparison between RGBDSLAM and the proposed approach

                      translational error (mean ± std.dev.) [m]
RGBDSLAM              0.1281 ± 0.0800
proposed approach     0.1856 ± 0.2915

    Reproducing a Free Trajectory

The same feature map is now used to estimate another trajectory. Figure 3.8 shows this trajectory and the trajectory estimated with the standard configuration. The trajectory that was used for building the map with RGBDSLAM is also shown to illustrate their spatial relation; it will not be shown in the following plots.

Figure 3.8: Resulting pose estimation of the Kinect with the standard parameter set (x [m] vs. y [m])

Filtering of the Map. As already explained in section 2.1.1, the feature map exported from RGBDSLAM contains all features of all key frames. In this way, a huge number of features, including much redundant information, is accumulated during the map building process. In section 2.1.2, an approach has been introduced to reduce the number of identical features existing multiple times in the feature map. This step is computationally quite expensive, but it has to be executed only once before localization starts, and it can reduce the computational time of the localization itself.

The trajectory estimate obtained without filtering the feature map beforehand is shown in figure 3.9. The precision of the localization and the computational time of the matching step, the step that is mainly affected by a larger number of features, are compared in table 3.5. A detailed discussion of the processing times of all parts of the algorithm follows in section 3.2.4. Interestingly, filtering the map beforehand not only reduces the computational time of the matching step but also yields more accurate localization results.

Figure 3.9: Resulting pose estimation of the Kinect without filtering the map beforehand (x [m] vs. y [m])

Table 3.5: Kinect: comparison of precision and processing time of the matching step between localization with and without filtering the map

filter map   translational error (mean ± std.dev.) [m]   average computation time of matching [s]
on           0.3440 ± 0.3091                             0.3311
off          0.3697 ± 0.5851                             1.1111

P3P: Reprojection Threshold. Tuning the reprojection threshold of the P3P RanSaC trades precision of the pose estimate against robustness to noise; both are reflected in the average error of the localization. Values from 1.0 to 3.0 pixels have been tested, and the resulting translational errors are shown in table 3.6 and plotted in figure 3.10. A reprojection threshold of 1.5 results in the error with the lowest mean value and standard deviation.

Table 3.6: Kinect: influence of P3P's reprojection threshold on the precision

reprojection threshold   translational error (mean ± std.dev.) [m]
1.0                      0.3781 ± 0.5130
1.5                      0.3440 ± 0.3091
2.0                      0.4370 ± 0.4593
2.5                      0.5014 ± 0.5218
3.0                      0.4544 ± 0.4631

Figure 3.10: Influence of the reprojection threshold [pixels] on the mean and standard deviation of the position error [m]

Tracking vs. Global Localization. As discussed in section 2.2.2, matching the features of the current image only with those that are believed to be visible can speed up the matching step, but it depends on a good pose estimate from the previous frame. The precision and the computational time of the matching step with and without tracking are compared in table 3.7. Not only is the tracking approach faster, it is also more precise. This could be caused by the fact that the probability of wrong matches is higher when matching the features of the current image against all features of the feature map.

Figure 3.11: Resulting pose estimation of the Kinect without tracking (x [m] vs. y [m])


Table 3.7: Kinect: comparison of precision and processing time of the matching step between tracking and global localization

tracking   translational error (mean ± std.dev.) [m]   average computation time of matching [s]
on         0.3440 ± 0.3091                             0.3311
off        0.4020 ± 0.4067                             0.5890

SURF vs. SIFT. SURF features have been used because of their advantage in computational time. However, it is interesting to investigate whether SIFT features lead to a more precise localization. For this case, the map has of course been built using SIFT features. The precision of the RGBDSLAM trajectory estimate with SIFT features is comparable to the one with SURF, with a mean error of 0.1244 m and a standard deviation of 0.0764 m. The trajectory estimate from the proposed algorithm is shown in figure 3.12, and its precision is compared to the one using SURF features in table 3.8. From the plot, the trajectory estimate with SIFT features seems more precise, but two outliers raise the mean and standard deviation of the error.

Figure 3.12: Trajectory estimate using SIFT features (x [m] vs. y [m])

Table 3.8: Kinect: precision of the localization using SURF and SIFT features

feature type   translational error (mean ± std.dev.) [m]
SURF           0.3440 ± 0.3091
SIFT           0.3252 ± 0.6625


    3.2.3 AR.Drone Camera

In section 3.1.1, it has been shown that the matching part of the algorithm does not work as well for the AR.Drone camera as for the Kinect. Because of this, the pose estimation is more ambiguous with this camera. The lower image resolution of 360 by 640 has been used. To account for imprecise camera calibration, the reprojection threshold has been raised from 1.5 to 2.0 pixels. The ground truth and the estimated trajectory of the AR.Drone are depicted in figure 3.13; the trajectory of the Kinect that was used to build the map is again also shown. The quality of the trajectory estimate is not as good as that of the Kinect, most probably due to wrong matches. The mean and standard deviation of the position error are given in table 3.9. Worth noting is that out of 497 frames, 244 led to a valid pose with SIFT but only 48 with SURF.

Figure 3.13: AR.Drone trajectory estimate using (a) SURF and (b) SIFT features (x [m] vs. y [m]; showing the proposed approach, the ground truth, and the ground truth of the RGBDSLAM map-building run)

Table 3.9: AR.Drone: precision of the localization using SURF and SIFT features

feature type   translational error (mean ± std.dev.) [m]
SURF           0.7923 ± 0.5604
SIFT           0.5363 ± 0.4607

    3.2.4 Performance Analysis

The average processing times of the different steps for a Kinect and for an AR.Drone frame are shown in figure 3.14. In order to make the results comparable, all results shown have been measured with SIFT features. All experiments have been run on a virtual machine with 2 GB of memory and 4 CPUs at 2.6 GHz. The most expensive parts are feature extraction and matching, leading to a total processing time of about 218 ms for a single Kinect frame. This means that in a real-time application, only about 4 frames per second can be processed. For an image from the AR.Drone, this rate increases to almost 7 frames per second. Note that the difference is this large mainly because of the difference in resolution.

Figure 3.14: Average processing times of (a) a Kinect frame and (b) an AR.Drone frame, both using SIFT features and tracking

                     (a) Kinect frame    (b) AR.Drone frame
Extract Features     74.9 ms             49.5 ms
Reproject            0.6 ms              0.5 ms
Matching             105.4 ms            75.5 ms
P3P RanSaC           12.9 ms             8.2 ms
Displaying           24.1 ms             13.7 ms
Total                217.8 ms            147.4 ms

Chapter 4

    Discussion

The approach presented in this report has been tested with two cameras, the Kinect and the front camera of the AR.Drone. With the Kinect, it could be shown that the algorithm works in principle. Precision can be improved by fusing the measurements with a physical model. However, using just the visual data of the Kinect is not an application that would make sense in practice; the approach only makes sense for cameras without depth information, such as the front camera of the AR.Drone. For this camera, the algorithm did not work as well, especially with SURF features. An important reason for this is certainly that matching features from the AR.Drone image to features from the Kinect image works poorly (see section 3.1.1). In order to make the algorithm work with this camera (and hopefully all other cameras too), the source of the wrong matches has to be found. Therefore, the differences between the two cameras have to be investigated:

- radial distortion
- different contrast and illumination due to the different sensor
- lens quality?

These are factors that could be responsible for the poor matching in the frame-to-frame experiment (see section 3.1.1). In the actual localization, some further difficulties are added:

- motion blur
- different viewing angles and scales

Nothing can be done about the sensor differences, lens quality, motion blur and viewing angles. Something can, however, be done about radial distortion: in [17], the influence of radial distortion on SIFT features has been studied, and it is stated that the repeatability of SIFT features can be improved by rectifying the image before extracting the features. Since rectifying the image is a computationally expensive step, a modified SIFT algorithm that takes the radial distortion into account has been proposed and further improved in [18]. Unfortunately, an implementation of sRD-SIFT is currently only available for MATLAB. Since SURF features are detected and extracted in a similar manner as SIFT features, these statements probably apply to SURF features as well. However, the influence of radial distortion in this case is not expected to be large. A sketch of simply rectifying the image before feature extraction is given below.
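In the sketch, the AR.Drone intrinsics and distortion coefficients are placeholder values, not the actual calibration, and the input frame is a synthetic stand-in.

    import numpy as np
    import cv2

    K = np.array([[560.0, 0.0, 320.0],      # placeholder intrinsics for the 360 x 640 image
                  [0.0, 560.0, 180.0],
                  [0.0, 0.0, 1.0]])
    dist = np.array([-0.25, 0.07, 0.0, 0.0, 0.0])   # mostly radial distortion

    frame = np.random.randint(0, 256, (360, 640), dtype=np.uint8)  # stand-in for a camera frame
    rectified = cv2.undistort(frame, K, dist)

    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=200)
    keypoints, descriptors = surf.detectAndCompute(rectified, None)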



    4.1 Future Work

Most importantly, the presented approach should be combined with a motion model to improve precision. For the results of this report, timing issues have been ignored completely: frames that were recorded at 30 Hz have been processed at 5 to 7 Hz without skipping frames. If this approach were to run in a real-time application, frames would consequently be skipped. To avoid skipping frames, i.e. to not lose track during quick turns, the algorithm needs to be extended, for example by tracking features with optical flow using the pyramidal implementation of the Lucas-Kanade feature tracker [19]: the features found as inliers in the P3P RanSaC step could be tracked. Since they are inliers, they have a corresponding 3D location in the map and can therefore be fed to the P3P RanSaC again. This way, the computation can be distributed to two threads running in parallel. Note that by tracking the P3P inliers of one frame, feature extraction and matching can be left out, which enables the algorithm to run in real time. The principle is depicted in figure 4.1.

    Figure 4.1: Proposition for extension of algorithm

A function that uses a previous image and the feature positions of its inliers to estimate the pose using optical flow has already been implemented, and initial tests seemed promising.
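A minimal sketch of this tracking step follows: the P3P inliers of the previous frame are tracked with the pyramidal Lucas-Kanade tracker, and the surviving correspondences, together with their known 3D map locations, can be fed back to the P3P RanSaC. The images and the inlier arrays are synthetic stand-ins, not the actual implementation.

    import numpy as np
    import cv2

    prev_gray = np.random.randint(0, 256, (360, 640), dtype=np.uint8)  # previous frame
    curr_gray = np.roll(prev_gray, 2, axis=1)                          # next frame, shifted by 2 px

    # P3P inliers of the previous frame: pixel locations and their 3D map points
    inlier_px  = np.random.uniform([20, 20], [620, 340], size=(50, 2)).astype(np.float32)
    inlier_xyz = np.random.uniform(-2, 2, size=(50, 3))

    tracked_px, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, inlier_px.reshape(-1, 1, 2), None,
        winSize=(21, 21), maxLevel=3)

    good = status.ravel() == 1
    object_points = inlier_xyz[good]                 # known 3D locations in the map
    image_points  = tracked_px.reshape(-1, 2)[good]  # tracked pixel locations in the new frame
    # object_points / image_points can now go straight into the P3P RanSaC step,
    # skipping feature extraction and matching for this frame.
    print(good.sum(), "of", len(inlier_px), "inliers tracked")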

Bibliography

[1] P. Rudol, M. Wzorek, G. Conte, and P. Doherty, "Micro unmanned aerial vehicle visual servoing for cooperative indoor exploration," in Aerospace Conference, 2008 IEEE, pp. 1-10, March 2008.

[2] A. Davison, I. Reid, N. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, pp. 1052-1067, June 2007.

[3] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pp. 225-234, Nov. 2007.

[4] F. Endres, J. Hess, N. Engelhard, J. Sturm, D. Cremers, and W. Burgard, "An evaluation of the RGB-D SLAM system," in Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 1691-1696, May 2012.

[5] http://www.ros.org/wiki/rosmsg. Accessed: 29/01/2013.

[6] http://opencv.willowgarage.com/documentation/cpp/features2d_common_interfaces_of_descriptor_matchers.html. Accessed: 29/01/2013.

[7] http://www.ros.org/wiki/rviz. Accessed: 23/02/2013.

[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, pp. 91-110, Nov. 2004.

[9] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.

[10] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2564-2571, Nov. 2011.

[11] L. Kneip, D. Scaramuzza, and R. Siegwart, "A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 2969-2976, June 2011.


[12] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, pp. 381-395, June 1981.

[13] http://ardrone2.parrot.com/usa/. Accessed: 11/02/2013.

[14] http://www.naturalpoint.com/optitrack/. Accessed: 20/02/2013.

[15] D. Eggert, A. Lorusso, and R. Fisher, "Estimating 3-D rigid body transformations: a comparison of four major algorithms," Machine Vision and Applications, vol. 9, pp. 272-290, 1997.

[16] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in VISAPP International Conference on Computer Vision Theory and Applications, pp. 331-340, 2009.

[17] M. Lourenco, J. Barreto, and A. Malti, "Feature detection and matching in images with radial distortion," in Robotics and Automation (ICRA), 2010 IEEE International Conference on, pp. 1028-1034, May 2010.

[18] M. Lourenco, J. Barreto, and F. Vasconcelos, "sRD-SIFT: Keypoint detection and matching in images with radial distortion," Robotics, IEEE Transactions on, vol. 28, pp. 752-760, June 2012.

[19] J.-Y. Bouguet, "Pyramidal implementation of the Lucas Kanade feature tracker: Description of the algorithm," 2000.

Robotics and Perception Group

Title of work:
Camera Localization in a 3D Map

Thesis type and date:
Semester Thesis, February 2013

Supervision:
Christian Forster
Davide Scaramuzza

Student:
Name: Tim Oberhauser
E-mail: [email protected]
Legi-Nr.: 07-919-806

Statement regarding plagiarism:
By signing this statement, I affirm that I have read the information notice on plagiarism, independently produced this paper, and adhered to the general practice of source citation in this subject-area.

Information notice on plagiarism:
http://www.ethz.ch/students/semester/plagiarism_s_en.pdf

Zurich, 6. 3. 2013
