
Visual Localisation of a Robot with an external RGBD Sensor

Winston Yii, Nalika Damayanthi, Tom Drummond, Wai Ho Li
Monash University, Australia

{winston.yii, nalika.dona, tom.drummond, wai.ho.li}@monash.edu

Abstract

This paper presents a novel approach to visual localisation that uses a camera on the robot coupled wirelessly to an external RGB-D sensor. Unlike systems where an external sensor observes the robot, our approach merely assumes the robot's camera and external sensor share a portion of their field of view. Experiments were performed using a Microsoft Kinect as the external sensor and a small mobile robot. The robot carries a smartphone, which acts as its camera, sensor processor, control platform and wireless link. Computational effort is distributed between the smartphone and a host PC connected to the Kinect. Experimental results show that the approach is accurate and robust in dynamic environments with substantial object movement and occlusions.

1 Introduction

Localisation is an important problem in robotics, especially for mobile robots working within domestic and office environments [Cox et al., 1990]. This paper presents a novel approach to robot localisation that uses a conventional camera on a robot, coupled wirelessly to an external RGB-D sensor mounted rigidly in the environment. Our approach (illustrated in Figure 1) sits between systems which use an external sensor mounted in the environment to sense the robot and those which use an on-board system which senses the environment.

A static Microsoft Kinect acts as the external sensor observing a scene and provides both colour and 2.5D depth information of an indoor environment. Parts of the same scene are also visible to a mobile robot. The robot carries a smartphone, which acts as its camera, sensor processor, control platform and wireless link. By matching features between the external sensor and the smartphone camera, the robot's location is estimated in real time in 6 degrees of freedom. Further details of the system are given in Section 3.

The proposed robot localisation approach offers a number of advantages:

• By contrast to external localisation systems [Rekleitis et al., 2006; Rawlinson et al., 2004], we do not require the robot to be in the field of view of the external sensor. Our system merely requires that the external sensor and the robot's camera share a portion of their fields of view. This greatly extends the workspace of mobile robots being localised by external sensors by removing out-of-sensor-range dead spots.

• By contrast to purely on-board vision-based systems [Se et al., 2001; Dellaert et al., 1999], our approach does not require the environment to be static. Our approach is robust against moving objects in the scene because these are dynamically observed by the external sensor and are used to generate useful information for localisation even while they are moving. We only require sufficient visual feature matches between the external sensor view and the robot's-eye view.

• In addition, our approach operates without the need for artificial landmarks such as those employed in [Betke and Gurvits, 1997]. This is possible because we make use of observed natural features that are matched in real time between the external sensor and robot camera. These features are robust to scale and orientation changes.

These are important improvements as they facilitate robot localisation in real world environments, including those with dynamic human users. However, in order to achieve real time localisation, we had to overcome two challenges:

• The limited computing available on the robot restricts the type of visual processing and robotic control that can be performed in real time. In particular, the on-board computer is not powerful enough to run the kind of model-based visual localisation we use in this work.

• The wireless link between robot and external sensor means that the transmission of large quantities of data, such as raw video sensor data, is prohibitively slow (and also consumes unnecessary power).

Both challenges are solved simultaneously by distributing computational processing between the relatively powerful host PC attached to the external sensor and the smartphone CPU on the robot. This asymmetric computing arrangement lies conceptually in between a remote-brain configuration [Inaba et al., 2000] and having all processing occur locally on the robot. Our robot processes camera video to extract image features, which are then sent wirelessly to the host PC. This helps minimise the quantity of data being transmitted over the wireless link. The robot's pose is then estimated by the host PC, which compares features extracted from the external sensor with those from the robot's camera. The resulting pose estimate can be used by the host PC for robot path planning or sent back to the robot for local planning.

Figure 1: Overview of the system: visual localisation between a mobile phone camera and an external RGB-D sensor

The rest of the paper is structured as follows:

• Section 2 outlines the hardware and robotic platform used for the experiments.

• Section 3 details the localisation system in a number of subsections: 3.2 corner detection, 3.3 feature descriptor building, 3.4 feature matching and 3.5 pose estimation.

• Section 4 outlines the experiments conducted to measure the accuracy of the system and demonstrate the system's capabilities for dynamic environments and when the robot lies outside the external sensor's view.

2 Hardware

The setup for this project is a mobile robot operating in an indoor environment surveilled by a stationary Kinect.

The Microsoft Kinect provides a low-cost source of RGB-D information. The recent release of the open source OpenNI drivers [OpenNI, 2010] made it possible to extract factory-calibrated RGB-D data.

The robotic platform used as the experimental test bed was the eBug, a small mobile robot developed by Monash University (Figure 2). The hardware details of the eBug are the subject of another paper submitted to ACRA 2011. For these experiments the eBug was connected to an Android-based mobile phone. All of the sensor processing (vision) and WiFi communication is handled by the mobile phone. Communication between the mobile phone and the eBug is done through an IOIO board.¹ The IOIO device acts as a communication bridge that transfers motion commands to the robot, with one side connected to the Android phone as a USB host (through the Android accessory / ADB protocols) and the other side connected to the eBug via a serial link. The mobile phone used was an HTC Desire, an Android 2.2 (Froyo) based smartphone with a 1 GHz CPU and 512 MB of internal memory. The images obtained from the phone have a resolution of 800x480 pixels.

Figure 2: Hardware platform - HTC Desire Android phone on an eBug

¹ http://www.sparkfun.com/products/10585


3 Robot Localisation

3.1 System Overview

An overview of the robot localisation process is illustrated in Figure 1. The core concept of the localisation presented is to estimate a pose from a set of appearance-based feature correspondences between the mobile phone camera and the Kinect. The process begins by detecting salient features and building feature descriptors in greyscale images from both the Kinect and smartphone cameras. The feature locations and descriptors built on the mobile phone are sent to a Kinect-CPU server via a WiFi link. The server performs robust feature matching to provide a set of correspondences between 2D points from the robot camera and 3D points from the Kinect. The Lie algebra of rigid body motions is used to linearise the displacement of these salient features and to calculate a pose estimate from the correspondences. This localises the robot in 3D space from the Kinect's perspective, and the location can be projected onto the ground plane to obtain the robot's 2D position.

3.2 Corner Detection

FAST features [Rosten and Drummond, 2006] are computed from the greyscale images on both the Kinect-CPU server and the robot. FAST was chosen for its computational efficiency to accommodate the limited computational power of the robot's (smartphone) processor. To allow for differences of scale between the two frames, FAST features are detected across a pyramid: 5 layers on the server (downsampled by factors of 1, 1.5, 2, 3 and 4) and 2 layers on the robot (downsampled by factors of 1 and 2).
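As a sketch of this detection step, the following uses OpenCV's FAST detector over a downsampled pyramid with the scale factors listed above; the helper name, threshold value and file name are illustrative assumptions, not the authors' code.

```python
import cv2

def detect_fast_pyramid(grey, scales=(1.0, 1.5, 2.0, 3.0, 4.0), fast_threshold=20):
    """Detect FAST corners across a downsampled pyramid.

    Returns a list of (x, y, scale) tuples with coordinates mapped back to the
    full-resolution image. The robot side would use scales (1.0, 2.0).
    """
    detector = cv2.FastFeatureDetector_create(threshold=fast_threshold)
    corners = []
    for s in scales:
        if s == 1.0:
            layer = grey
        else:
            layer = cv2.resize(grey, None, fx=1.0 / s, fy=1.0 / s,
                               interpolation=cv2.INTER_AREA)
        for kp in detector.detect(layer, None):
            # Map the corner back to full-resolution coordinates.
            corners.append((kp.pt[0] * s, kp.pt[1] * s, s))
    return corners

# Example usage (image name is hypothetical):
# grey = cv2.cvtColor(cv2.imread("kinect_rgb.png"), cv2.COLOR_BGR2GRAY)
# corners = detect_fast_pyramid(grey)
```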

3.3 Feature Descriptor Building

For each FAST corner, a Histogrammed Intensity Patches (HIPS) feature descriptor is calculated as described in [Taylor and Drummond, 2009]. This descriptor uses 64 evenly spaced pixels from a 15x15 image patch centred on the FAST corner. These pixels are normalised for mean and variance and then quantised into 4 intensity bins. This produces a descriptor that only requires 4 arrays of 64 bits, where each array corresponds to a quantised intensity bin and each bit position is a 1 if the pixel is in that intensity range. Each corner k(x,y) in the Kinect image and c(x,y) in the mobile camera image has a descriptor:

d_{b,i}(x,y) =
\begin{cases}
1 & Q_b < I_m\big((x,y) + \mathrm{offset}_i\big) < Q_{b+1} \\
0 & \text{otherwise}
\end{cases}
\qquad (1)

where

Q = \big(-\infty,\; \mu - 0.675\sigma,\; \mu,\; \mu + 0.675\sigma,\; \infty\big),

b is the index of the intensity bin and i is the index of the sampled pixel.

The quantisation bins are selected to distribute the samples into each range approximately evenly and assume a normal distribution of intensities among the 64 sampled pixels in the patch, for which the mean intensity µ and standard deviation σ are computed.

To optimise the build time for the descriptors on the mobile phone, we applied ARM NEON SIMD instructions to the feature patch selection and quantisation process. The 64-bit quantised bins were computed using data parallelism, treating each row of the feature patch intensities as a vector of eight 8-bit elements.

The descriptors for each corner in the Kinect frame are computed for a 3x3 patch of centre locations and aggregated by being ORed together to create a descriptor that will tolerate small deviations in translation, rotation, scale or shear.

K(x,y) = \bigvee_{i,j \in \{-1,0,1\}} k(x+i,\, y+j) \qquad (2)

This ORing process creates a descriptor which is the union of all nine appearances of the simple descriptor.
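The following is a minimal numpy sketch of the quantisation in equation (1) and the 3x3 OR aggregation in equation (2). The sample_offsets argument (the pattern of 64 sampled pixel offsets within the 15x15 patch) and the use of boolean arrays rather than packed 64-bit words are simplifications assumed for illustration, not taken from the authors' implementation.

```python
import numpy as np

def hips_descriptor(grey, x, y, sample_offsets):
    """Quantise 64 sampled patch pixels into 4 intensity bins (equation 1).

    Returns a (4, 64) boolean array: row b has bit i set when sample i falls
    in intensity bin b; in practice each row is packed into one 64-bit word.
    """
    samples = np.array([float(grey[y + dy, x + dx]) for dx, dy in sample_offsets])
    mu, sigma = samples.mean(), samples.std()
    # Quartile boundaries of a normal distribution give 4 roughly equal bins.
    q = [-np.inf, mu - 0.675 * sigma, mu, mu + 0.675 * sigma, np.inf]
    d = np.zeros((4, len(samples)), dtype=bool)
    for b in range(4):
        # Boundary-equal samples are assigned to the upper bin here.
        d[b] = (samples >= q[b]) & (samples < q[b + 1])
    return d

def aggregated_descriptor(grey, x, y, sample_offsets):
    """OR the descriptors of a 3x3 neighbourhood of centres (equation 2)."""
    K = np.zeros((4, len(sample_offsets)), dtype=bool)
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            K |= hips_descriptor(grey, x + i, y + j, sample_offsets)
    return K
```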

The mobile camera's feature locations and descriptors are sent to the Kinect-CPU server via wireless LAN, where they are matched against the Kinect's current database of features. Sparse features are sent over the wireless link rather than full mobile phone camera images to reduce the amount of transmitted data. For each feature, a 32-byte descriptor and 8 bytes of coordinates are sent. There are typically 800 features in an image from our smartphone camera, which gives approximately 32 KB of data to be sent to the Kinect-CPU server. Sending data over the WiFi link causes the largest latency in the system.

3.4 Feature Matching

A computationally efficient comparison between a test descriptor and a descriptor in the database can be performed by counting bit differences. The error function E between a test descriptor from the mobile camera, c, and a database descriptor from the Kinect image, K, is expressed as:

E = \mathrm{Error}(K, c) = \mathrm{bitcount}(c \wedge \neg K) \qquad (3)

This is the number of bits set in c and not set in K, and thus counts how many pixels in c have an intensity that was never observed in any of the nine k descriptors that were ORed together to form K. Whilst the ORing operation reduces the error function between correct matches where the viewpoint differs slightly, some descriptors are rather unstable in appearance and end up with a large number of set bits, which causes them to acquire too many correspondences at matching time. To prevent this, a threshold on the range of appearance is used to exclude unstable descriptors from the matching stage:


\text{accept } K \text{ if } \mathrm{bitcount}(K) < \text{threshold} \qquad (4)
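A sketch of the matching test in equations (3) and (4), operating on descriptors packed into four 64-bit words; the max_bits threshold is illustrative and not the value used in the paper.

```python
import numpy as np

def pack_bits(descriptor_bool):
    """Pack a (4, 64) boolean descriptor into four 64-bit words."""
    return np.array([np.packbits(row).view(">u8")[0] for row in descriptor_bool],
                    dtype=np.uint64)

def bitcount(words):
    """Total number of set bits across the four 64-bit words."""
    return sum(bin(int(w)).count("1") for w in words)

def match_error(c_words, K_words):
    """Equation (3): bits set in the camera descriptor c but never seen in K."""
    return bitcount([c & ~K for c, K in zip(c_words, K_words)])

def accept_database_descriptor(K_words, max_bits=160):
    """Equation (4): reject descriptors whose appearance range is too wide.

    max_bits is an illustrative threshold, not the paper's value.
    """
    return bitcount(K_words) < max_bits
```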

In order to perform rapid matching, a fast tree-based index of the Kinect HIPS descriptors is formed using the or-tree method described in [Taylor and Drummond, 2009]. The tree is built once per frame, allowing the system to deal with dynamic scenes. Building the tree takes approximately 20 ms on our system.

Once the tree index has been built, it can be used to rapidly find matches for the mobile camera corners. On our system, this typically takes 10 ms to find matches for 2000 corners. To tune the system for speed, only the match that minimises the error function for each mobile camera corner (provided it is below an error threshold) is retained, rather than admitting all matches below the threshold. This results in a set of correspondences between corners in the mobile camera image and the Kinect image, as shown in Figure 3.

Figure 3: All matches between the Kinect RGB image (left) and the mobile camera RGB image (right) produced by HIPS

3.5 Pose Estimation with Lie Algebra

The pose of the mobile camera relative to the coordinate frame of the Kinect can be expressed as a rigid body transformation in SE(3):

T_{ck} =
\begin{bmatrix}
R & t \\
0\;0\;0 & 1
\end{bmatrix}
\quad \text{where } R \in SO(3) \text{ and } t \in \mathbb{R}^3 \qquad (5)

The subscript ck indicates the camera coordinate frame relative to the Kinect frame.

A point at a known position in the Kinect frame, p_k = (x, y, z, 1)^T, can be expressed in the camera frame by applying the rigid transformation to obtain p_c = T_{ck} p_k.

p_k is projectively equivalent to (x/z,\; y/z,\; 1,\; 1/z)^T = (u, v, 1, q)^T, and it is more convenient to solve for pose using this formulation as (u, v) are a linear function of pixel position. Our cameras exhibit almost no radial distortion, so we assume a linear camera model for both cameras, and the value reported in the depth image from the Kinect is linearly related to q_k. This gives:

\begin{pmatrix} u_c \\ v_c \\ 1 \\ q_c \end{pmatrix}
\equiv
T_{ck}
\begin{pmatrix} u_k \\ v_k \\ 1 \\ q_k \end{pmatrix}
=
\begin{pmatrix} R\,(u_k, v_k, 1)^T + q_k\, t \\ q_k \end{pmatrix}
\qquad (6)

q_c is unobserved, so only two constraints are provided per correspondence and thus three correspondences are needed to compute T_{ck}. When T_{ck} has been computed, the same equation can be used to calculate q_c, making it possible to transform the Kinect depth map into the mobile camera frame should that be required (e.g. to build an obstacle map from the robot's perspective).
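As a concrete illustration of equation (6), the following sketch applies a 4x4 rigid transform T_ck to a point given in (u, v, q) coordinates; the function name and interface are ours, assumed for illustration.

```python
import numpy as np

def transform_uvq(T_ck, u_k, v_k, q_k):
    """Apply equation (6): map a Kinect (u, v, q) point into the camera frame.

    Returns (u_c, v_c, q_c). q_c is normally unobserved but is recovered here,
    e.g. to reproject the Kinect depth map into the robot camera's view.
    """
    R, t = T_ck[:3, :3], T_ck[:3, 3]
    top = R @ np.array([u_k, v_k, 1.0]) + q_k * t   # first three rows of (6)
    u_c, v_c = top[0] / top[2], top[1] / top[2]     # renormalise third entry to 1
    q_c = q_k / top[2]
    return u_c, v_c, q_c
```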

Pose is solved iteratively from an initial assumption T^0_{ck} = I, i.e. that the mobile and Kinect coordinate frames are coincident. At each iteration, the error in T_{ck} is parameterised using the exponential map and T_{ck} is updated by:

T^{j+1}_{ck} = \exp\Big(\sum_i \alpha_i G_i\Big)\, T^j_{ck} \qquad (7)

where G_i are the generators of SE(3). The six coefficients \alpha_i are then estimated by considering the projection of the Kinect coordinates into the mobile camera frame:

\begin{pmatrix} u_c \\ v_c \\ 1 \\ q_c \end{pmatrix}
\equiv
\exp\Big(\sum_i \alpha_i G_i\Big)\, T^j_{ck}
\begin{pmatrix} u_k \\ v_k \\ 1 \\ q_k \end{pmatrix}
\qquad (8)

Writing

\begin{pmatrix} u'_k \\ v'_k \\ 1 \\ q'_k \end{pmatrix}
\equiv
T^j_{ck}
\begin{pmatrix} u_k \\ v_k \\ 1 \\ q_k \end{pmatrix}
\qquad (9)

then gives the Jacobian J of partial derivatives of (u_c, v_c)^T with respect to the six \alpha_i as:

J =
\begin{bmatrix}
q'_k & 0 & -u'_k q'_k & -u'_k v'_k & 1 + u'^2_k & -v'_k \\
0 & q'_k & -v'_k q'_k & -1 - v'^2_k & u'_k v'_k & u'_k
\end{bmatrix}
\qquad (10)

where the first three columns of J correspond to derivatives with respect to translations in x, y and z (and thus depend on q) and the last three columns correspond to rotations about the x, y and z axes respectively. This simple form of J is a convenient consequence of working in (u, v, q) coordinates rather than (x, y, z).

From three correspondences between (u_k, v_k, q_k) and (u_c, v_c) it is possible to stack the three Jacobians to obtain a 6x6 matrix, which can be inverted to find the linearised solution for the \alpha_i from the error between the observed coordinates in the mobile camera frame and the transformed Kinect coordinates, (u_c, v_c) - (u'_k, v'_k). With more than three correspondences, it is possible to compute a least squares solution for the \alpha_i.

When the \alpha_i have been computed, iteration proceeds by setting T^{j+1}_{ck} = \exp(\sum_i \alpha_i G_i)\, T^j_{ck} and recomputing the \alpha_i. In practice, we found that 10 iterations are enough for this to converge to acceptable accuracy.
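The following is a condensed numpy/scipy sketch of the iterative estimation in equations (7)-(10). scipy.linalg.expm is used for the exponential map, and the interface (function name, array layouts) is an assumption for illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import expm

# Generators of SE(3): translations in x, y, z, then rotations about x, y, z.
G = np.zeros((6, 4, 4))
G[0, 0, 3] = G[1, 1, 3] = G[2, 2, 3] = 1.0     # translations
G[3, 1, 2], G[3, 2, 1] = -1.0, 1.0             # rotation about x
G[4, 0, 2], G[4, 2, 0] = 1.0, -1.0             # rotation about y
G[5, 0, 1], G[5, 1, 0] = -1.0, 1.0             # rotation about z

def estimate_pose(uvq_k, uv_c, iterations=10):
    """Gauss-Newton pose estimation in (u, v, q) coordinates (equations 7-10).

    uvq_k: (N, 3) array of (u_k, v_k, q_k) from the Kinect.
    uv_c:  (N, 2) array of observed (u_c, v_c) in the mobile camera.
    """
    T = np.eye(4)
    for _ in range(iterations):
        J_rows, residuals = [], []
        for (u_k, v_k, q_k), (u_c, v_c) in zip(uvq_k, uv_c):
            p = T @ np.array([u_k, v_k, 1.0, q_k])
            u, v, q = p[0] / p[2], p[1] / p[2], p[3] / p[2]   # renormalised (9)
            # Jacobian of (u, v) with respect to the six alpha_i (equation 10).
            J_rows.append([q, 0, -u * q, -u * v, 1 + u * u, -v])
            J_rows.append([0, q, -v * q, -1 - v * v, u * v, u])
            residuals.extend([u_c - u, v_c - v])
        alpha, *_ = np.linalg.lstsq(np.array(J_rows), np.array(residuals),
                                    rcond=None)               # least squares step
        T = expm(np.tensordot(alpha, G, axes=1)) @ T          # equation (7)
    return T
```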

Because the correspondences obtained using HIPS contain many outliers, RANSAC was used (sampling on triplets of correspondences) to obtain the set of inliers. The final pose was then calculated through an optimisation over all inliers. This process typically takes only about 0.5 ms. Figure 4 shows all the inlier correspondences between the Kinect and mobile camera images computed in this way.
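A sketch of the RANSAC stage, reusing the hypothetical estimate_pose and transform_uvq helpers from the sketches above; n = 200 hypotheses matches Section 4.1, while the inlier threshold pixel_tol is illustrative.

```python
import numpy as np

def ransac_pose(uvq_k, uv_c, iterations=200, pixel_tol=0.01, rng=None):
    """Sample triplets of correspondences and keep the pose with most inliers."""
    rng = rng or np.random.default_rng()
    n = len(uvq_k)
    best_inliers = np.array([], dtype=int)
    for _ in range(iterations):
        idx = rng.choice(n, size=3, replace=False)
        T = estimate_pose(uvq_k[idx], uv_c[idx])            # minimal 3-point solve
        pred = np.array([transform_uvq(T, *p)[:2] for p in uvq_k])
        err = np.linalg.norm(pred - uv_c, axis=1)
        inliers = np.flatnonzero(err < pixel_tol)
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Final pose: optimise over all inliers of the best hypothesis.
    return estimate_pose(uvq_k[best_inliers], uv_c[best_inliers]), best_inliers
```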

Figure 4: The inlier matches between the Kinect RGB image and the mobile camera RGB image as determined by RANSAC

3.6 Robot Localisation on a 2D Planar Map

Since the robot is restricted to movement on the ground surface, we define the ground plane using the Kinect's depth data to give the robot a 2D map of its operating environment. This is done through a calibration process in which a user manually selects three points on the ground plane in the Kinect image. The depth values at these three points are then used to generate a ground plane of the form ax + by + cz = d, where the constants a, b and c are normalised to produce a unit vector for the plane normal.
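A minimal sketch of this calibration, assuming the three user-selected pixels have already been back-projected to 3D points in the Kinect frame; the helper name is ours.

```python
import numpy as np

def ground_plane_from_points(p0, p1, p2):
    """Fit ax + by + cz = d through three 3D points, with (a, b, c) a unit normal.

    The sign of the normal depends on the ordering of the clicked points; it may
    need flipping so that it points from the floor towards the Kinect.
    """
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    n = np.cross(p1 - p0, p2 - p0)
    n /= np.linalg.norm(n)
    d = float(n @ p0)
    return n, d
```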

Points in the Kinect depth image are classified into three categories (a minimal sketch follows this list):

• Points within a noise threshold of the ground plane are considered to be part of the ground plane.

• Points above the ground but below the height of the robot are marked as obstacles.

• Other points are disregarded.
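A sketch of the classification above; noise_tol and robot_height are illustrative values (the paper does not state its thresholds), and the plane normal is assumed to point from the floor towards the Kinect.

```python
import numpy as np

def classify_point(p, n, d, noise_tol=0.02, robot_height=0.15):
    """Label a depth point as ground, obstacle, or ignored (Section 3.6)."""
    h = float(n @ np.asarray(p, dtype=float)) - d   # signed height above the plane
    if abs(h) < noise_tol:
        return "ground"
    if 0 < h < robot_height:
        return "obstacle"
    return "ignored"
```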

Projection of a point (x, y, z)^T onto the ground plane with plane normal n = (a, b, c)^T is computed by:

\text{Projected point} = \big[I_{3\times 3} - n\,n^T\big]
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
- d\,n
\qquad (11)

The projected ground plane points (x', y', z')^T are transformed onto an xy plane to give an overhead 2D map. The transformation required is a rotation to align the plane normal, n, and two other perpendicular vectors lying on the plane, x_plane and y_plane, with the Kinect's camera coordinate frame. This is followed by a translation in the plane normal direction by an arbitrary amount λ. The transformation can be expressed as:

\begin{pmatrix} x_{map} \\ y_{map} \\ z_{map} \end{pmatrix}
=
\begin{bmatrix} x_{plane}^T \\ y_{plane}^T \\ n^T \end{bmatrix}
\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix}
- \lambda\, n
\qquad (12)

The z_map coordinate is constant over all points in the xy plane and is discarded.
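A numpy sketch of equations (11) and (12); the choice of x_plane and y_plane (any orthonormal pair spanning the plane), the default λ = 0, and the helper names are assumptions for illustration.

```python
import numpy as np

def plane_basis(n):
    """Pick two orthonormal vectors x_plane, y_plane spanning the ground plane."""
    a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    x_plane = np.cross(n, a)
    x_plane /= np.linalg.norm(x_plane)
    y_plane = np.cross(n, x_plane)
    return x_plane, y_plane

def to_map_coords(p, n, d, x_plane, y_plane, lam=0.0):
    """Project a 3D Kinect point onto the ground plane (equation 11) and express
    it in 2D map coordinates (equation 12); z_map is constant and discarded."""
    p = np.asarray(p, dtype=float)
    projected = (np.eye(3) - np.outer(n, n)) @ p - d * n     # equation (11)
    R = np.vstack([x_plane, y_plane, n])                     # rotation rows of (12)
    x_map, y_map, _ = R @ projected - lam * n                # equation (12)
    return x_map, y_map
```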

To obtain the final map localisation, the pose estimate is projected from the Kinect's frame onto the 2D map. The translation component of the pose estimate is projected in a similar manner to a 3D point as described above. The orientation requires multiplying the rotation component by a unit vector along the Kinect's optical axis before projection. The result is the localisation of the robot's position as well as its orientation. Figure 5 displays the position and orientation of the robot in a map filled with obstacles.

Figure 5: Dynamic map generated using the Kinect. Left: the Kinect's point of view showing obstacles in red. Right: Kinect data projected as a 2D planar map.

4 Experiments

4.1 Localisation Accuracy for a Static Scene

Figure 6: Experimental setup used to measure the accuracy of the localisation system. The masking tape on the floor is used for ground truth measurements.


To measure the precision and repeatability of the localisation system, the robot was localised at a number of marked positions and orientations as shown in Figure 6. This is done by projecting the 6 DoF robot pose onto the 2D planar map. The distance between the marked positions was measured to ±2 mm accuracy with a measuring tape to provide ground truth data. 10 data points were measured at each location to provide a mean and standard deviation for each position. The obtained results are shown in Figure 7. Each cluster of points represents a set of measurements for a single location.

Figure 7: Measured Position Localisation

This data was analysed in a number of ways:

• Because the full 3D 6 DoF pose is calculated, the constraint that the robot sits on the ground plane is not respected. In order to provide a measure of the global accuracy of the system, a plane was fitted to all the computed mobile camera positions (a sketch of this analysis follows the list). The residual sum-squared error between all computed positions and this plane was then calculated and divided by (n−3), where n is the number of measured positions, to obtain an unbiased estimate of the variance of position error in the direction of the plane normal. This analysis gave a standard deviation in this direction of 0.012 m (1.2 cm).

• A second analysis was performed on each line of data using a similar method. A line was fitted to each of the three subsets of the data, and the residual squared distance to the line was computed and divided by (n−2) to give unbiased estimates of the standard deviation: 0.010 m for the right line, 0.16 m for the centre line and 0.021 m for the left line. The reason for this variation in accuracy from left to right is unknown, but it is likely to be due to the specific geometry of the landmark data used in the experiments. The standard deviation of the orientation was also computed to be 0.03 radians (all points on the same line were assumed to have the same ground truth orientation).
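A sketch of the plane-fit analysis in the first bullet, using a total-least-squares (SVD) plane fit as one reasonable choice; positions is a hypothetical (n, 3) array of estimated camera positions.

```python
import numpy as np

def plane_residual_std(positions):
    """Fit a plane to the estimated positions and return the unbiased standard
    deviation of the residuals along the plane normal (divisor n - 3)."""
    positions = np.asarray(positions, dtype=float)
    centred = positions - positions.mean(axis=0)
    # The plane normal is the direction of least variance: last right singular vector.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    residuals = centred @ vt[-1]
    n = len(positions)
    return np.sqrt((residuals ** 2).sum() / (n - 3))
```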

The accuracy of the localisation is largely dependent on the number of correct feature matches between the Kinect and the robot's camera. With a cluttered scene containing numerous texture features, the system demonstrates results consistent to within 16 mm of the ground truth. As the proportion, p, of correct matches drops, the probability of selecting a correct hypothesis for a given number of RANSAC iterations, n, decreases. For the three-point pose estimation, we require three correct correspondences. The probability of testing n randomly sampled hypotheses and not selecting a correct inlier set falls below 0.01 when:

(1 - p^3)^n < 0.01 \qquad (13)

Our system samples n = 200 hypotheses, requiring an inlier rate of 0.283 for a 0.99 chance of selecting the correct pose. This places a limit on the field-of-view overlap required between the Kinect and the robot's camera for correct localisation: a smaller view overlap, and the likely increase in false-positive correspondences, may diminish the chance of obtaining the correct pose.
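As a quick check of equation (13) for n = 200:

```python
# Minimum inlier rate p such that (1 - p**3)**n < 0.01, for n = 200 RANSAC samples.
n = 200
p = (1 - 0.01 ** (1 / n)) ** (1 / 3)
print(round(p, 3))   # ~0.283, matching the rate quoted above
```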

Another source of mismatches is large viewpoint differences between the Kinect and the robot. Despite our HIPS features having slight affine invariance, sufficiently large viewpoint deviations will cause structural corners to change appearance significantly because of their changing background.

4.2 Localisation in Dynamic Scenes

It is difficult to perform visual localisation in dynamic scenes, as visual features may move unexpectedly or become occluded on a regular basis. Our system operates robustly in dynamic scenes because moving objects still contribute visual features to pose estimation. Occlusions have little effect because they only matter when features visible in one view are occluded in the other, which is typically a small portion of features.

The video submission accompanying this paper² shows the robot localising repeatedly in a dynamic environment. Still frames of the robotic experiment taken from a handycam, and screen captures of the 2D planar map showing the robot's pose, are provided at the end of this paper.

² http://www.youtube.com/watch?v=N5PXHWk8nCE

5 Conclusion and Future Work

We have presented a robot and external RGB-D sensor system that performs 6 DoF pose estimation in real time. The system achieves up to 10 fps despite the processing limitation of the smartphone on the robot and the bandwidth limitation of the WiFi-based communication link. Localisation is also robust against dynamic environments. Future work includes extending the system to include input from multiple Kinects so that universal coverage can be provided, and the use of multiple robots, including Kinect-carrying robots that can act as the external sensor. We also plan to explore the computational asymmetry further by investigating different divisions of labour between robots and external sensors.

References

[Betke and Gurvits, 1997] M. Betke and L. Gurvits. Mobile robot localization using landmarks. IEEE Transactions on Robotics and Automation, 13(2):251–263, 1997.

[Cox et al., 1990] I.J. Cox, G.T. Wilfong, and T.L. Perez. Autonomous Robot Vehicles, volume 447. Springer-Verlag, 1990.

[Dellaert et al., 1999] F. Dellaert, W. Burgard, D. Fox, and S. Thrun. Using the condensation algorithm for robust, vision-based mobile robot localization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 1999.

[Inaba et al., 2000] Masayuki Inaba, Satoshi Kagami, Fumio Kanehiro, Yukiko Hoshino, and Hirochika Inoue. A platform for robotics research based on the remote-brained robot approach. The International Journal of Robotics Research, 19(10):933–954, 2000.

[OpenNI, 2010] OpenNI. Kinect drivers. http://www.openni.org/, 2010.

[Rawlinson et al., 2004] D. Rawlinson, P. Chakravarty, and R. Jarvis. Distributed visual servoing of a mobile robot for surveillance applications. In Australasian Conference on Robotics and Automation (ACRA), 2004.

[Rekleitis et al., 2006] Ioannis Rekleitis, David Meger, and Gregory Dudek. Simultaneous planning, localization, and mapping in a camera sensor network. Robotics and Autonomous Systems, 54(11):921–932, 2006.

[Rosten and Drummond, 2006] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Computer Vision – ECCV 2006, pages 430–443, 2006.

[Se et al., 2001] S. Se, D. Lowe, and J. Little. Vision-based mobile robot localization and mapping using scale-invariant features. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), volume 2, pages 2051–2058. IEEE, 2001.

[Taylor and Drummond, 2009] S. Taylor and T. Drummond. Multiple target localisation at over 100 fps. In British Machine Vision Conference, volume 4, 2009.


Figure 8: Initial movement of robot away from Kinect

Figure 9: No visual localisation when the robot's camera and the external sensor do not share a field of view

Figure 10: Example of localisation in dynamic scene with moving objects

Page 9: Visual Localisation of a Robot with an external RGBD Sensor

Figure 11: Example of localisation in dynamic scene with occlusions

Figure 12: Example of localisation under significant scale and viewpoint change