
Image and Vision Computing 26 (2008) 673–689

Real-time camera tracking for marker-less and unprepared augmented reality environments

Ke Xu, Kar Wee Chia, Adrian David Cheok

Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576, Singapore

Received 19 July 2004; received in revised form 18 November 2006; accepted 2 August 2007

Abstract

For three-dimensional video-based augmented reality applications, accurate measurements of the 6DOF camera pose relative to the real world are required for proper registration of the virtual objects. This paper presents an accurate and robust system for real-time 6DOF camera pose tracking based on natural features in an arbitrary scene. Crucially, the calculation is based on pre-captured reference images. This prevents a gradual increase in the camera position error. Point features in the current image frame are first matched to two spatially separated reference images. This wide baseline correspondence problem is overcome by constructing (1) a global homography between the current and previous image frame and (2) local affine transforms derived from known matches between the previous frame and the reference images. Chaining these two mappings constrains the search for potential matches in the reference images and allows the warping of corner intensity neighborhoods, so that a viewpoint invariant similarity measure for assessing potential point matches can be defined. We then minimize deviations from the two-view and three-view constraints between the reference images and the current frame as a function of the camera motion parameters to obtain an estimate of the current camera pose relative to the reference images. This calculation is stabilized using a recursive form of temporal regularization similar in spirit to the Kalman filter. We can track camera pose reliably over hundreds of image frames and realistically integrate three-dimensional virtual objects with only slight jitter. This paper also simplifies the above algorithm and presents a real-time, robust tracking system based on computing homographies. A homography can exactly describe the image motion between two frames when the camera motion is pure rotation or when the camera is viewing a planar scene. For outdoor registration applications, the system is robust under small translations as long as the majority of the scene contents are distant.

© 2008 Published by Elsevier B.V.

Keywords: Vision based tracking; Optical flow; Fundamental matrix; Homography; Augmented reality

1. Introduction

In a video-based augmented reality (AR) system, a user views the real world through a video camera attached to a head mounted display (HMD). The video stream from the camera is combined with graphic images of virtual objects, blending the real and virtual. To generate a consistent view of these virtual objects from all views of the real scene, so that the illusion that the real and virtual worlds coexist is not compromised, accurate measurements of the six-degree-of-freedom (6DOF) camera pose (three degrees of freedom for position and three for orientation relative to the world coordinate system) are required.

Camera tracking systems based on placing fiducial markers in the scene have been highly successful [1,2]. Markers are constructed so that they are easily detected in each image frame and, given some a priori information about the shapes or positions of the markers, the relative pose of the camera can be easily determined [3,4]. However, camera tracking can be easily lost as it is only based on a few features and there is a limited range of camera viewpoints from which the markers are visible. In comparison, systems based on natural features, e.g., corner points in the scene, extend the tracking range and are typically more stable as there are more features available to track the camera pose from.

The natural feature based camera tracking problem is greatly simplified when there are planar structures visible in the scene. A method introduced in [5,6] chains the homographies relating coplanar corner points in previous adjacent image frames to compute the homography which transforms points on the world plane to the current image frame. The camera pose is then estimated from this homography. Although planar surfaces are common in the environment, for some scenes, such as that in Fig. 1, it is difficult to find a plane that is sufficiently large and textured for corner points to be reliably tracked. Despite these limitations, the homography approach is at present the only algorithm suitable for real-time applications running on most current desktop PCs (Pentium III and Pentium IV).

This paper is concerned both with the general problem of tracking the camera pose from corner points lying in arbitrary locations in the scene, and with real-time applications achievable at current PC processing speeds.

The optimal way to solve the general tracking problem is by offline techniques which use the whole image sequence, e.g., [7]. Global bundle adjustment techniques attempt to simultaneously estimate the structure of the scene and the camera motion. At the core of these methods is a non-linear minimization of a cost function based on reprojection errors of estimated three-dimensional (3D) points across the whole video sequence. An excellent review of the theory of bundle adjustment is found in [8]. Despite their excellent performance, such batch processing methods are clearly not suitable for real-time online AR applications in which "future" data are not available for processing.

On the other hand, online methods like [9], which use five corresponding points to directly calculate the camera pose, do not utilize the temporal information or the estimated results of the previous frames. Various incremental motion estimation approaches have also been proposed for time-critical applications. A key issue here is to ensure that incremental motion estimates at different points in the time series are compatible with a single 3D structure. Unfortunately, recursive online estimation of this structure as in [10] is not possible as it is impractically slow. Avidan and Shashua [11] introduced the operation of "threading" adjacent two-view motion estimates. This technique is repeatedly applied to a sliding window of view triplets to recover the camera trajectory. The Bayesian, top-down approach with Kalman filter based algorithms proposed by Davison [12] can yield a real-time and robust result, but it is limited to desk-like environments and the proposed motion modeling requires the camera motion to be very smooth. Fitzgibbon and Zisserman [13] applied bundle adjustment techniques to a sliding window of triplets of images using point matches across all three images. Zhang and Shan [14] presented a similar scheme in which both two-view and three-view matches were used in the bundle adjustment.

Although the above methods are suitable for online implementation, their incremental nature means that errors in the camera motion estimates will inevitably accrue over time, since the current camera position is calculated by concatenating transformations between adjacent frames in the sequence. This is unacceptable in AR applications, where small registration errors are very noticeable and very accurate estimates must be maintained over thousands of frames. Also, these online algorithms do not cater to the fact that the incoming data may sometimes be impoverished, e.g., in the presence of mismatches. Noisy and erroneous motion estimates will be produced as a result. It is essential that the camera tracking system is robust to such situations, so that errors in the motion estimates due to impoverished data are smoothed or reduced and a stable perception that the virtual and real worlds coexist can be maintained.

Our approach is based on always calculating camera motion relative to two or more pre-captured reference images of the scene. This has the advantage of preventing a gradual increase in the camera position error. The current camera pose relative to the reference images is computed through the minimization of a simple cost function based on two-view epipolar and three-view constraints on feature position. We use previous camera pose estimates to provide the starting point for this minimization and to regularize the error surface when the incoming data is impoverished.

Fig. 1. The objective of this work is to estimate camera pose from frame to frame by tracking natural features in an arbitrary scene, so that virtual content can be realistically introduced into the scene. In this video sequence, the 3D model remains registered to the center of the room as the camera moves forward.


2. Overview of approach

Fig. 2 shows a schematic diagram of the proposed camera pose tracking system. Two spatially separated photos of the work area, in which virtual content is to be introduced, are taken. We call these photos reference images VA and VB. The two images are calibrated, meaning that their camera poses, TA and TB, relative to the virtual objects are known. These can be computed by placing a fiducial marker in the scene for these two frames only. The ARToolkit software [15] is used for this purpose. A set of corner [16] matches and their corresponding 3D points are obtained for these two images. The 3D reconstruction problem will be discussed in Section 3.3. Hence, upon entering the system, we have two reference images, an accurate estimate of the positions of the virtual contents relative to the two images, a set of corner matches and their reconstructed 3D coordinates. All of these can be calculated very accurately and reliably using offline processing.

As the camera moves in the scene, two-view corner matches are computed between each incoming image frame Vk and the reference images. The problem of establishing which features correspond to which in two images is commonly known as the correspondence problem. Conventional techniques that solve the correspondence problem involve the use of intensity cross-correlation of corner neighborhoods of similar orientation as a similarity measure for potential matches. The usual assumption is that the camera movement between the two camera views is small (narrow baseline). For the proposed camera tracking system, the current camera position is most likely some distance away and rotated from the reference camera positions. For this wide baseline case, conventional matching techniques will perform poorly as the standard cross-correlation method fails to provide a veridical similarity score for matching corners. Several algorithms have been proposed for this wide baseline matching problem [17–19]. However, they are too computationally expensive for real-time implementation. Section 6 describes our proposed solution, which exploits the fact that each image frame is part of a time series and makes use of corner matches obtained for the previous frame Vk-1 to constrain the search for potential matches between the current frame Vk and the reference images. A more reliable similarity measure for assessing potential corner matches is also defined.

From the two sets of two-view matches, a set of corner matches present in all three views, i.e., VA, Vk and VB, is also identified. The camera pose of Vk relative to, say, VA, denoted TAk, is estimated by minimizing a cost function consisting of (1) deviations from the two-view epipolar constraints between VA and Vk, (2) deviations from the two-view epipolar constraints between Vk and VB, and (3) deviations from the three-view constraints on feature position in Vk. These image constraints are described in Section 3. Motion estimates based upon the previous frames ..., TA(k-2), TA(k-1) provide the starting point for the non-linear minimization and also serve to regularize the error surface by imposing some prior knowledge of the likely solution. This temporal regularization technique is implemented in a recursive manner similar to the Kalman filter and will be described in Sections 4 and 5.

Fig. 2. The problem is to estimate the transformation matrix Tk between the camera and the scene for the current frame Vk. The system matches the current frame to two stored reference images, VA and VB, to determine the camera position. This estimation problem is regularized using data from previous frames in the time series.

Fig. 3. Two-view epipolar constraint. The 3D point P, which projects to point pA in reference image VA, must project to somewhere along a fixed line lk in current image frame Vk. lk is known as an epipolar line and is determined by the relative camera pose TAk between the two camera views.

Since the camera pose TA between VA and the virtual object is known, the camera pose Tk between the current frame and the virtual object is easily computed using the estimated camera motion TAk between the current frame and VA. We can then proceed to introduce the virtual object realistically into the current frame.

Under certain special conditions, namely when the camera motion is pure rotation or the camera is viewing a planar scene, a one-to-one mapping called a homography can exactly describe the image motion between two frames of a video sequence. A homography is defined by only eight parameters.

Usually, feature displacement between two images depends on both the camera movement and the camera's distance from the feature, so a simple parameterized mapping is not possible. However, in many circumstances the homography represents a good approximation of the true image flow, particularly when the image structure is near planar, or the camera movement is small and the scene structure is mostly distant. A number of previous systems have exploited this property. For example, Uenohara and Kanade's [20] system achieves medical-image registration based on five coplanar points. However, their system's disadvantage is that it is not robust if any points are obscured or not detected. Another example is Simon and Fitzgibbon's [5] offline system for introducing virtual content into stored video sequences.

3. Image constraints

This section describes the two-view and three-view constraints used to construct the cost function which is minimized to obtain an estimate of the relative camera motion TAk between the current image frame Vk and reference image VA.

3.1. Two-view epipolar constraint

Fig. 3 shows the two-view geometry between reference image VA and the current image frame Vk. This part only summarizes the fundamental results of epipolar geometry; the complete explanation can be found in [21].

Consider a 2D point pA in the reference image and its corresponding point pk in the current image frame. The epipolar constraint states that their homogeneous coordinates ~pA and ~pk must satisfy the following equation:

\tilde{p}_k^{T} F_{Ak} \tilde{p}_A = 0 \quad (1)

where FAk is the 3 × 3 fundamental matrix [22] which encapsulates all the geometric information on camera calibration and motion between two images of the same rigid and static scene. When camera calibration and motion information are available, the fundamental matrix FAk has the following form:

F_{Ak} = K^{-T} [t_{Ak}]_{\times} R_{Ak} K^{-1} \quad (2)

where RAk is the 3 × 3 rotation matrix relating VA and Vk, [tAk]x is the skew-symmetric matrix of the 3 × 1 translation vector tAk, and K is the intrinsic parameter matrix of the calibrated camera, which can be estimated using the camera calibration software in [15].

A consequence of the constraint of Eq. (1) is that, given a corner point pAj in VA, the corresponding corner pkj must lie along the epipolar line in Vk given by FAk~pAj. Similarly, given a corner point pkj in Vk, its corresponding corner pAj must lie along the epipolar line FAk^T ~pkj in VA. If the motion parameters given by TAk = [RAk tAk] are correct, then these lines should pass through the corresponding corner points in the other image (Fig. 4).

Consequently, given a set of corner matches between VA and Vk, {pAj ↔ pkj}, j = 1, ..., NA, we can construct a non-linear criterion function which is the sum of the squared perpendicular distances of each corner match from their respective predicted epipolar lines:

h_A = \sum_{j=1}^{N_A} \left[ d(\tilde{p}_{kj}, F_{Ak}\tilde{p}_{Aj})^2 + d(\tilde{p}_{Aj}, F_{Ak}^{T}\tilde{p}_{kj})^2 \right] \quad (3)

where FAk is given by Eq. (2) and the distance terms are given by:

d(\tilde{p}_{kj}, F_{Ak}\tilde{p}_{Aj}) = \frac{\tilde{p}_{kj}^{T} F_{Ak}\tilde{p}_{Aj}}{\sqrt{(F_{Ak}\tilde{p}_{Aj})_1^2 + (F_{Ak}\tilde{p}_{Aj})_2^2}}, \qquad d(\tilde{p}_{Aj}, F_{Ak}^{T}\tilde{p}_{kj}) = \frac{\tilde{p}_{Aj}^{T} F_{Ak}^{T}\tilde{p}_{kj}}{\sqrt{(F_{Ak}^{T}\tilde{p}_{kj})_1^2 + (F_{Ak}^{T}\tilde{p}_{kj})_2^2}} \quad (4)

where, for example, (FAk~pAj)_i is the i-th component of the vector FAk~pAj. The two-view cost hA can be minimized as a function of the motion parameters to generate the optimal motion estimate TAk for the current image frame.

Fig. 4. Two-view constraint between reference image VA and the current frame Vk. The current estimate of the translation and rotation between the frames defines the fundamental matrix FAk. This maps a given corner pAj in VA to a line in Vk. The transpose of the fundamental matrix maps each corner point in Vk to a line in VA. If the motion parameters are correct, then these lines should pass through the corresponding points in the other image. We minimize the perpendicular distance from the lines to their corresponding corner points as a function of the motion parameters.

In fact, it transpires that minimizing Eq. (3) can determine the camera rotation correctly, but only estimate the camera translation up to an unknown scaling factor. When the translation vector tAk in Eq. (2), and hence the fundamental matrix FAk, is multiplied by an arbitrary scalar, the epipolar constraint in Eq. (1) still holds. We resolve this situation by also minimizing deviations from the epipolar constraints relating the corner matches between Vk and reference image VB. This second cost function hB creates a second, independent family of potential solutions.
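The following sketch shows how Eqs. (2)-(4) can be evaluated in practice. It is an illustrative Python/NumPy fragment rather than the authors' implementation; the function names and the (N, 2) array layout for matched corners are our own assumptions.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x of a 3-vector t."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_motion(K, R_Ak, t_Ak):
    """Eq. (2): F_Ak = K^-T [t_Ak]_x R_Ak K^-1 for a calibrated camera."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t_Ak) @ R_Ak @ K_inv

def epipolar_cost(F, pts_A, pts_k):
    """Eqs. (3)-(4): sum of squared point-to-epipolar-line distances, in both
    images, for matched corners pts_A <-> pts_k given as (N, 2) arrays."""
    N = len(pts_A)
    pA = np.hstack([pts_A, np.ones((N, 1))])   # homogeneous coordinates
    pk = np.hstack([pts_k, np.ones((N, 1))])
    lines_k = (F @ pA.T).T        # epipolar lines F p_A in V_k
    lines_A = (F.T @ pk.T).T      # epipolar lines F^T p_k in V_A
    algebraic = np.sum(pk * lines_k, axis=1)   # p_k^T F p_A for each match
    d_k = algebraic / np.hypot(lines_k[:, 0], lines_k[:, 1])
    d_A = algebraic / np.hypot(lines_A[:, 0], lines_A[:, 1])
    return np.sum(d_k ** 2 + d_A ** 2)
```

Minimizing this quantity over the motion parameters fixes the rotation but leaves the scale of tAk undetermined, which is exactly the ambiguity that the second reference image and the three-view term of Section 3.3 are introduced to resolve.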

3.2. Two-view homography constraint

A homography is a one-to-one mapping between two images, defined by only eight parameters. Consider a set of points in the first image with homogeneous coordinates x = [x, y, z]^T, and the set of corresponding points in the second image, x' = [x', y', z']^T. The relationship between the two images is a homography if the following equation holds:

x' = Hx \quad (5)

This model exactly describes the image motion between two frames of a video sequence in the special cases when the camera motion is pure rotation, or the camera is viewing a planar scene. In general, feature displacement between two images depends on both the camera movement and the camera's distance from the feature, so a simple parameterized mapping is not possible. However, the homography represents a good approximation of the true image flow in many circumstances, particularly when the image structure is near planar, or the camera movement is small and the scene structure is mostly distant.

To solve for the elements of H, we note that x' and Hx are, by definition, rays pointing in the same direction, so their cross product is equal to zero:

x' \times Hx =
\begin{bmatrix}
y' h_3^{T} x - z' h_2^{T} x \\
z' h_1^{T} x - x' h_3^{T} x \\
x' h_2^{T} x - y' h_1^{T} x
\end{bmatrix} = 0 \quad (6)

where h_i, i = 1, 2, 3, is the i-th row of the homography matrix H.

This set of equations provides two independent linear constraints on the components of H. Although the homography matrix contains nine elements, it is ambiguous up to scale: the use of homogeneous coordinates means that any scalar multiple of the homography has the same effect. Consequently, four general point correspondences provide eight equations, enough to solve for the eight unknowns of the homography. Generally, we aim to find more point correspondences and calculate an overdetermined least-squares solution. The noise in the point positions is linear, but it enters these equations quadratically. Ideally, this solution should be used as an initial estimate in a subsequent non-linear minimization of the Euclidean projection error.
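As an illustration of the overdetermined linear solution described above, the following sketch stacks the two rows of Eq. (6) for each correspondence and recovers H as the null-space direction of the resulting system via the SVD. It is a minimal direct linear transform (DLT) with no coordinate normalization; the function name is ours.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H such that dst ~ H src, from N >= 4 correspondences.
    src, dst are (N, 2) arrays of image points; returns H with H[2, 2] = 1."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        X = np.array([x, y, 1.0])
        # Two independent rows of the cross-product constraint x' x (Hx) = 0
        rows.append(np.concatenate([np.zeros(3), -X, yp * X]))
        rows.append(np.concatenate([X, np.zeros(3), -xp * X]))
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)      # singular vector of the smallest singular value
    return H / H[2, 2]
```

In practice the point coordinates are usually normalized (translated and scaled) before building the system, and the linear estimate is then refined by a non-linear minimization of the projection error, as suggested above.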

The fact that a homography can exactly describe pure camera rotation allows us to apply natural feature tracking, based on homographies, to the problem of registering information in outdoor scenes. Such a problem has previously been attacked through various methods, including inertial trackers, 2D computer vision, and hybrid approaches. We argue that outdoor tracking requirements are frequently satisfied by calculating a homography to a stored reference frame. Because most objects in this setting are distant, we can consider the camera movement to be approximately a pure rotation. Some experimental results are shown in Section 7.3.

3.3. Three-view constraint

A second way to resolve the ambiguity in the camera translation is to make use of the constraints relating a set of corner matches present in reference image VA, current image frame Vk and reference image VB. Consider a three-view corner match pAi ↔ pki ↔ pBi, i.e., the perspective projections of a 3D point Pi into VA, Vk and VB, where Pi is expressed in the camera coordinate frame of VA. Since the camera pose TAB of VB relative to VA is known (from TA and TB), the 3D coordinates Pi can be reconstructed from pAi ↔ pBi by solving the following two equations:

s_1 \tilde{p}_{Ai} = K [\, I_{3\times3} \;\; 0_{3\times1} \,] \tilde{P}_i \quad (7)
s_2 \tilde{p}_{Bi} = K [\, R_{AB} \;\; t_{AB} \,] \tilde{P}_i \quad (8)

Fig. 5. Three-view constraint. For a corner point pki in the current image frame Vk matched to both corner pAi in reference image VA and corner pBi in reference image VB, a further constraint is imposed. Since the relative camera pose between VA and VB is known, the 3D coordinates Pi can be reconstructed from the corner match pAi ↔ pBi. Given the motion estimate TAk of Vk relative to VA, a prediction of where the corresponding corner point will lie in Vk is obtained. The three-view cost refers to the Euclidean distance from the measured position pki to the prediction p'ki.

where I3x3 is the 3 × 3 identity matrix and TAB = [RAB tAB]. Letting pAi = [xAi, yAi]^T, pBi = [xBi, yBi]^T and C = KRAB, and eliminating s1 and s2 from Eqs. (7) and (8), the linear solution for Pi is obtained as:

\begin{bmatrix}
k_1^{T} - x_{Ai} k_3^{T} \\
k_2^{T} - y_{Ai} k_3^{T} \\
c_1^{T} - x_{Bi} c_3^{T} \\
c_2^{T} - y_{Bi} c_3^{T}
\end{bmatrix} P_i =
\begin{bmatrix}
0 \\
0 \\
(x_{Bi} k_3 - k_1)^{T} t_{AB} \\
(y_{Bi} k_3 - k_2)^{T} t_{AB}
\end{bmatrix}, \qquad Z P_i = z, \qquad P_i = (Z^{T} Z)^{-1} Z^{T} z \quad (9)

where kj^T and cj^T are the j-th rows of the matrices K and C, respectively. This linear estimate can be refined by minimizing the Euclidean distance between the back-projection of the 3D reconstruction and the measured corner point:

\min_{P_i} \left( \| p_{Ai} - p'_{Ai} \|^2 + \| p_{Bi} - p'_{Bi} \|^2 \right) \quad (10)

where ||.|| denotes the vector norm and

\tilde{p}'_{Ai} = K [\, I_{3\times3} \;\; 0_{3\times1} \,] \tilde{P}_i, \qquad \tilde{p}'_{Bi} = K [\, R_{AB} \;\; t_{AB} \,] \tilde{P}_i

Note that the 3D reconstruction procedure described above is done offline for the set of corner matches obtained between the two reference images during the initial system preparation (Section 2).
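A compact sketch of the linear reconstruction of Eq. (9), assuming NumPy arrays for K, RAB, tAB and the two matched corner positions; the naming is illustrative.

```python
import numpy as np

def triangulate_linear(K, R_AB, t_AB, p_A, p_B):
    """Eq. (9): linear least-squares reconstruction of the 3D point P_i,
    expressed in the camera frame of V_A, from the corner match p_A <-> p_B."""
    C = K @ R_AB
    k1, k2, k3 = K                 # rows of the intrinsic matrix K
    c1, c2, c3 = C                 # rows of C = K R_AB
    xA, yA = p_A
    xB, yB = p_B
    Z = np.stack([k1 - xA * k3,
                  k2 - yA * k3,
                  c1 - xB * c3,
                  c2 - yB * c3])
    z = np.array([0.0,
                  0.0,
                  (xB * k3 - k1) @ t_AB,
                  (yB * k3 - k2) @ t_AB])
    P, *_ = np.linalg.lstsq(Z, z, rcond=None)   # P = (Z^T Z)^-1 Z^T z
    return P
```

The result can then be polished by the non-linear refinement of Eq. (10); as noted above, this reconstruction is performed once, offline, for the matches between the two reference images.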

Hence, from the 3D reconstruction Pi and an estimate of the relative camera motion TAk between Vk and VA, a prediction of where the corner point corresponding to pAi and pBi will lie in the current image frame can be obtained. If the motion parameters are correct, the predicted corner position p'ki will coincide with the actual measured position pki (Fig. 5). Given a set of three-view corner matches {pAi ↔ pki ↔ pBi}, i = 1, ..., N, and their reconstructed 3D coordinates {Pi}, the following non-linear criterion function can be set up:

w = \sum_{i=1}^{N} \| p_{ki} - p'_{ki} \|^2, \qquad \tilde{p}'_{ki} = K \, T_{Ak} \tilde{P}_i \quad (11)

Minimizing the three-view cost w in Eq. (11) as a function of the motion parameters of TAk recovers the actual camera translation.

4. Temporal regularization technique

We now create a measurement vector zk which incorporates all the deviations from the two-view and three-view constraints discussed in Section 3. Specifically, zk is a column vector whose elements are the two-view costs between reference image VA and the current image frame, the two-view costs between reference image VB and the current image frame, and the three-view costs. The simplest method to obtain an estimate of the relative camera motion TAk between the current image frame and VA is to minimize the non-linear quantity

e_s(x_{Ak}) = z_k^{T} z_k \quad (12)

over the 6 motion parameters, where TAk is parameterized by a 6 × 1 motion vector xAk consisting of 3 parameters for the translation vector and 3 parameters for the rotation vector. Conversions between the matrix and vector representations of rotation are described in [23]. A linear algorithm is used to provide the initial estimate of xAk for the non-linear minimization: from the three-view corner matches {pAi ↔ pki ↔ pBi} and their reconstructed 3D coordinates {Pi}, a set of 3D-2D correspondences {Pi ↔ pki} between the 3D points and their projections in Vk can be set up, and a linear estimate of xAk can be obtained from these correspondences.

The standard least-squares formulation of Eq. (12) is unstable in the presence of corner mismatches [24]. These mismatches can have an effect so strong in the minimization that the motion parameters thus estimated are erroneous and distorted. Moreover, if the number of corner matches becomes small or the geometrical distribution of the 3D points is not sufficiently general, the standard least-squares cost function es may not have a unique global minimum. Even if none of these things occur, the error surface may be extremely flat near the minimum, as the rotation and translation parameters may trade off against one another. A very small amount of noise in the measurements may hence result in quite large variations in the solution for the motion estimates. Thus, the standard least-squares minimization will in general produce very noisy motion estimates.


In order to guard against this ill-conditioning, we regularize the solution by imposing some prior knowledge to ensure that the error surface has a well-defined global minimum in a "likely" area of the motion parameter space. We use the motion parameter estimates from previous frames to define what is "likely". It transpires that the camera motion parameters are highly predictable from frame to frame. Fig. 6 shows part of a real sequence of camera motion data. The majority of the variation in the data can be predicted by a simple time-series model.

We now minimize the following regularized criterion function:

e_r(x_{Ak}) = \| z_k \|_{M_{\mathrm{meas}}} + a \, \| x_{Ak} - \bar{x}_{Ak} \|_{M_{\mathrm{prior}}} \quad (13)

where the notation ||a||_B denotes the magnitude of vector a measured using the distance metric B, i.e., a^T B a. The first term is the weighted sum of squared deviations from the two-view and three-view constraints, similar to the function of Eq. (12). The second term is the regularization term that favors a solution xAk which is close to a prior prediction x̄Ak. The distance metrics Mmeas and Mprior depend on the covariances of the measurement errors and predictions, respectively. The constant a controls the relative contribution of the prior knowledge about the solution and the measured data.

Hence, even when the error surface due to the first term is flat near the minimum, the component due to the second term will ensure there is a distinct solution. As opposed to the standard least-squares method, the influence of any single erroneous measurement due to a corner mismatch is insufficient to yield a significant offset in the presence of the regularization term. The regularized criterion function er is minimized over the 6 translation and rotation parameters to obtain the optimal estimate of xAk. The predicted motion vector x̄Ak provides the starting point for this non-linear minimization.

Fig. 6. Camera pose was measured using ARToolkit for a 30 s sequence of an observer wearing a head-mounted camera and viewing a small VRML model on the desk in front of him. We fit a second order AutoRegressive model to the first half of the dataset. The plot above shows the values for one of the translation components from the second half of the dataset (solid line) together with predictions generated from this model (dotted line). The model successfully predicts 84% of the actual data variance. Similar results were found for the rotational motion components.
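The regularized criterion of Eq. (13) is conveniently written as a stacked residual vector, so that a standard least-squares solver can be applied directly. The sketch below is illustrative: `z_fn` stands for a routine that evaluates the measurement vector zk of the two-view and three-view deviations at a candidate motion vector, and the metrics are passed as matrix factors L with L^T L = M; none of these names come from the paper.

```python
import numpy as np

def regularized_residuals(x_Ak, z_fn, x_prior, L_meas, L_prior, a=1.0):
    """Residual form of Eq. (13).

    Squaring and summing this vector gives
        ||z_k||_Mmeas + a * ||x_Ak - x_prior||_Mprior,
    provided L_meas^T L_meas = M_meas and L_prior^T L_prior = M_prior
    (e.g. transposed Cholesky factors). z_fn and both factors are
    placeholders for quantities defined elsewhere in the paper."""
    r_meas = L_meas @ z_fn(x_Ak)
    r_prior = np.sqrt(a) * (L_prior @ (x_Ak - x_prior))
    return np.concatenate([r_meas, r_prior])
```

Even when the data term alone is nearly flat around the minimum, the prior term adds a well-conditioned quadratic bowl around the prediction, which is the stabilizing effect described above.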

5. Recursive motion estimation

This section describes the recursive implementation of the temporal regularization technique, which is similar to the Kalman filter formulation.

5.1. Time series modeling of camera motion

As the camera moves in the scene relative to the reference camera positions, the relative camera motion xAk is described by a discrete time series, {xA1, xA2, ...}.

A common model for time series modeling is the AR(p) or AutoRegressive model of order p:

x_{Ak} = Q_1 x_{A(k-1)} + \cdots + Q_p x_{A(k-p)} + e_{Ak}, \qquad
Q_p = \begin{bmatrix}
\phi_p^{1} & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \phi_p^{6}
\end{bmatrix} \quad (14)

where eAk is a 6 × 1 vector of random noise variables of zero mean and variance σ², and {φp^i}, i = 1, ..., 6, are weighting factors. Thus, given past values of the camera motion up to image frame Vk-p and estimates of the model parameters, the camera motion for the current image frame can be predicted:

\bar{x}_{Ak} = Q_1 x_{A(k-1)} + \cdots + Q_p x_{A(k-p)} \quad (15)

5.2. Algorithm

For simplicity, an AR(1) model with φ1^i = 1, i = 1, ..., 6, is used to illustrate the implementation of the recursive temporal regularization technique:

x_{Ak} = x_{A(k-1)} + e_{Ak}, \qquad \bar{x}_{Ak} = x_{A(k-1)} \quad (16)

Despite its simplicity, this prediction model works well for a real-time image sequence in which the change in camera position is very small between consecutive image frames. Deviations from the prediction are assumed to take the form of noise with covariance Q. Given the motion estimate xA(k-1) for the previous image frame and its uncertainty, described by the 6 × 6 covariance matrix XA(k-1), this model is used to predict the relative camera motion x̄Ak for the current image frame and to estimate the uncertainty of this prediction, X̄Ak. x̄Ak is then used to regularize the solution for the optimal motion estimate xAk, as in Eq. (13). Finally, the gradient of the criterion function er around the minimum at xAk is assessed and used to evaluate the uncertainty of this optimal solution. For image frame Vk, k > 1, the algorithm is summarized as follows:

1. Motion prediction:
   \bar{x}_{Ak} = x_{A(k-1)}

2. Covariance prediction:
   \bar{X}_{Ak} = X_{A(k-1)} + Q

3. Motion update:
   \min_{x_{Ak}} \; e_r(x_{Ak}) = \| z_k \|_{M_{\mathrm{meas}}} + a \, \| x_{Ak} - \bar{x}_{Ak} \|_{\bar{X}_{Ak}^{-1}}

4. Covariance update:
   X_{Ak} = (J J^{T})^{-1}

J = de_r/dx_{Ak} is the Jacobian, or first-order derivative, of the minimization expression er assessed at the optimal solution for xAk, and can be estimated using the finite difference method. For the first image frame V1, the relative camera motion xA1 is estimated using the standard least-squares method described in Section 4. The corresponding uncertainty XA1 is obtained as in step 4 of the algorithm above with J = de_s/dx_{A1}. The non-linear minimizations of the criterion functions es and er are performed by the iterative Levenberg-Marquardt procedure.
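One prediction/update cycle of this algorithm could be organized as below. The sketch is illustrative: it assumes the AR(1) model of Eq. (16), reuses the hypothetical `regularized_residuals` helper from Section 4, and relies on SciPy's Levenberg-Marquardt solver; the covariance update uses the usual Gauss-Newton approximation computed from the residual Jacobian, which matches the paper's expression up to the orientation of J.

```python
import numpy as np
from scipy.optimize import least_squares

def track_frame(x_prev, X_prev, Q, z_fn, L_meas, a=1.0):
    """One recursive cycle (Section 5.2): predict, regularize, update.

    x_prev, X_prev : previous motion estimate (6-vector) and its 6x6 covariance
    Q              : process noise covariance of the AR(1) model
    z_fn           : callable returning the measurement vector z_k at a motion x
    L_meas         : factor of the measurement metric, L_meas^T L_meas = M_meas
    """
    # 1. Motion prediction (AR(1)): x_bar = x_prev
    x_bar = np.asarray(x_prev, dtype=float)
    # 2. Covariance prediction: X_bar = X_prev + Q
    X_bar = X_prev + Q
    # Prior metric is the inverse predicted covariance; use a factor L^T L = X_bar^-1
    L_prior = np.linalg.cholesky(np.linalg.inv(X_bar)).T
    # 3. Motion update: Levenberg-Marquardt on the stacked residual of Eq. (13)
    sol = least_squares(regularized_residuals, x_bar, method='lm',
                        args=(z_fn, x_bar, L_meas, L_prior, a))
    # 4. Covariance update from the residual Jacobian at the optimum
    J = sol.jac
    X_new = np.linalg.inv(J.T @ J)
    return sol.x, X_new
```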

The temporal regularization is very closely related to the Kalman filter algorithm. Indeed, it can be shown that the Kalman filter is an exact solution of the error metric in Eq. (13) for a linear system. The extended Kalman filter can be thought of as one iteration of a Gauss-Newton optimization of this error metric. Hence, the iterated Kalman filter also seeks the minimum of this function.

5.3. Parameter estimation

In order to obtain optimal motion estimates, it is important to choose the parameters carefully. We estimate the measurement metric Mmeas by considering the accuracy of our corner detection routine under noisy conditions (Fig. 7). We parameterize corner accuracy in terms of the RMS contrast of the local region, since this must be computed anyway in the corner matching stage. We assume that the noise on each corner position is independent, so the matrix Mmeas is a diagonal matrix; each entry is the reciprocal of the estimated corner position variance.

Fig. 7. Plot of corner accuracy as a function of RMS contrast of a 15 × 15 pixel region around the corner. Accuracy estimated from an ideal square corner contaminated with independent Gaussian noise (SD = 5) at each pixel. The variation in corner estimation asymptotes at approximately 0.16 pixels. We use these data as estimates of our measurement accuracy.

The covariance Q is estimated from pre-captured camera movement records similar to the experiment conducted for Fig. 6. Let the camera pose of an n-frame video sequence relative to the ARToolkit marker be {xk}, k = 1, ..., n. The motion model of Eq. (16) assumes that the current camera motion is the same as in the previous image frame with some added jitter. Thus, Q measures the covariance of the change in camera motion between consecutive image frames and is calculated as follows:

Q = \frac{\sum_{k=2}^{n} (\Delta x_{k|k-1} - \overline{\Delta x}_{k|k-1})(\Delta x_{k|k-1} - \overline{\Delta x}_{k|k-1})^{T}}{n-1}, \qquad
\Delta x_{k|k-1} = x_k - x_{k-1}, \qquad
\overline{\Delta x}_{k|k-1} = \frac{\sum_{k=2}^{n} \Delta x_{k|k-1}}{n-1}
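A direct transcription of this estimate, assuming the recorded poses are available as an (n, 6) NumPy array; the function name is illustrative.

```python
import numpy as np

def estimate_process_noise(x_seq):
    """Estimate Q as the covariance of the frame-to-frame change in camera
    motion over an n-frame recorded pose sequence x_seq of shape (n, 6)."""
    dx = np.diff(x_seq, axis=0)             # delta x_{k|k-1} = x_k - x_{k-1}
    centered = dx - dx.mean(axis=0)
    return (centered.T @ centered) / (len(x_seq) - 1)   # normalized by n - 1
```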

The parameter a controls the relative importance of the data and the prior model. For small values, the motion estimates are noisy, but for large values they become overly smooth and fail to capture real variation in the camera position. In practice, a is set to one.

6. Correspondence problem

We now turn to the problem of matching corners between two images. For each incoming image frame Vk, we must identify which corners correspond to which in each of the reference images. Section 6.1 describes a conventional method for corner matching between two images with a narrow baseline and its deficiencies in a wide baseline configuration. Section 6.2 describes our solution to the wide baseline matching problem for the proposed camera pose tracking system.

6.1. Narrow baseline matching

In a narrow baseline situation, the relative camera displacement and orientation between two images is small. A common procedure for obtaining an initial set of corner matches is summarized as follows:

1. For a corner point p1 = (x1, y1) in image 1, define a small rectangular search area centered about this point in image 2.

2. Perform cross-correlation of corresponding pixels between a window of size (2n + 1) × (2n + 1) centered about p1 and similar windows centered about all detected corner points p2 = (x2, y2) lying within the search window in image 2. The standard correlation score f(p1, p2) is defined as:

f(p_1, p_2) = \frac{\sum_{i=-n}^{n} \sum_{j=-n}^{n} [I_1(x_1+i, y_1+j) - \bar{I}_1(x_1, y_1)] \, [I_2(x_2+i, y_2+j) - \bar{I}_2(x_2, y_2)]}{(2n+1)^2 \, \sigma(I_1) \, \sigma(I_2)} \quad (17)

where \bar{I}_k(x_k, y_k) is the average pixel intensity of a (2n + 1) × (2n + 1) neighborhood region about corner point (xk, yk) in image Ik (k = 1, 2), and σ(Ik) is the standard deviation of the pixel intensities in the square neighborhood of corner point (xk, yk) in Ik. Retain the pair of corner points with the highest correlation score. A constraint on the correlation score can be applied to select the most consistent matches.

3. Repeat steps 1 and 2 but with the roles of the two images reversed. Specifically, for the match candidate p2 found in step 2, find the corner point in image 1 which gives the highest correlation score with p2. If the match candidate for p2 is again the initial p1, then this match is validated; otherwise it is rejected. This symmetric matching helps to reduce the probability of false matches. (A code sketch of this matching procedure is given below.)
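The sketch below illustrates steps 1-3 on grayscale images stored as 2-D NumPy arrays, with corners given as integer (x, y) tuples. It is a simplified, illustrative version: window handling at the image borders and the score threshold mentioned in step 2 are omitted.

```python
import numpy as np

def ncc(I1, p1, I2, p2, n=7):
    """Standard correlation score of Eq. (17) over (2n+1) x (2n+1) windows."""
    w1 = I1[p1[1]-n:p1[1]+n+1, p1[0]-n:p1[0]+n+1].astype(float)
    w2 = I2[p2[1]-n:p2[1]+n+1, p2[0]-n:p2[0]+n+1].astype(float)
    w1 -= w1.mean()
    w2 -= w2.mean()
    denom = w1.std() * w2.std() * w1.size
    return (w1 * w2).sum() / denom if denom > 0 else -1.0

def best_match(p, corners_other, I, I_other, radius=20, n=7):
    """Step 2: highest-scoring corner inside the square search window."""
    cands = [q for q in corners_other
             if abs(q[0] - p[0]) <= radius and abs(q[1] - p[1]) <= radius]
    return max(cands, key=lambda q: ncc(I, p, I_other, q, n)) if cands else None

def narrow_baseline_matches(I1, corners1, I2, corners2, radius=20, n=7):
    """Steps 1-3: forward match, then keep only mutually best (symmetric) pairs."""
    matches = []
    for p1 in corners1:
        p2 = best_match(p1, corners2, I1, I2, radius, n)
        if p2 is not None and best_match(p2, corners1, I2, I1, radius, n) == p1:
            matches.append((p1, p2))
    return matches
```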

Fig. 8. Robust calculation of a proposed geometry (RANSAC). Corner points are identified in the two images (yellow dots). We choose initial matches based on the similarity of the areas around these corners and on prior knowledge about the likely match direction (pink lines indicate the corner vector to the matched corner in the other image). This initial set contains many incorrect matches. We pick N matches (blue lines) and calculate the associated geometry. We then count the number of other matches that are in agreement (inliers are pink lines) and repeat this procedure. We choose the estimate with the most support and recalculate the geometry using all of the inliers.

The robust statistical procedure RANSAC [4] is then used to determine a set of inliers from the initial set of matches: a minimal subset of this initial set is used to calculate the proposed geometry, e.g., the fundamental matrix which describes the epipolar geometry between the two images, or the homography matrix if the camera motion is close to pure rotation or the camera is viewing a nearly planar scene (Fig. 8). This proposed geometry is assessed by considering how many of the other matches are in agreement (inliers). We repeat this procedure until a set with the largest number of inliers is found. This will be the final set of corner matches. Hartley and Zisserman [21] provide a detailed description of using image geometrical constraints with RANSAC for stereo matching.
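An illustrative RANSAC loop for the homography case, reusing the `homography_dlt` sketch from Section 3.2; the inlier threshold and iteration count are assumptions (the real-time system of Section 7.3 uses up to 70 iterations with an early-exit rule).

```python
import numpy as np

def ransac_homography(src, dst, n_iters=70, thresh=3.0, rng=None):
    """Fit H to minimal 4-point samples, score by transfer error, keep the
    sample with most inliers, then refit H on all of those inliers.
    src, dst are (N, 2) arrays of the initial (noisy) matches."""
    rng = np.random.default_rng() if rng is None else rng
    src_h = np.hstack([src, np.ones((len(src), 1))])
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        sample = rng.choice(len(src), size=4, replace=False)
        H = homography_dlt(src[sample], dst[sample])
        proj = (H @ src_h.T).T
        proj = proj[:, :2] / proj[:, 2:3]            # dehomogenize
        err = np.linalg.norm(proj - dst, axis=1)     # transfer error in pixels
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    H_final = homography_dlt(src[best_inliers], dst[best_inliers])
    return H_final, best_inliers
```

The same loop applies to the fundamental matrix case by swapping the model estimator and replacing the transfer error with the epipolar distances of Eq. (4).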

The success of this approach depends largely on how many correct inliers there are in the initial set of matches. For the proposed camera tracking system, the current camera position is usually some distance away from, and/or rotated significantly relative to, the reference camera positions. Under these wide baseline conditions, the initial set of matches may be very poor if the algorithm above is used, hence causing the method to fail. The dominant reason for this failure is that the standard cross-correlation method of Eq. (17) fails to provide a veridical similarity score for assessing potential corner matches. Photometrically, f(p1, p2) is invariant to a linear transformation of image intensities: a constant addition and an arbitrary scaling. Geometrically, f(p1, p2) is not even invariant to a simple rotation of the corner intensity neighborhood.

Consider the pair of wide baseline images in Fig. 9 with correlation windows of similar orientation defined about a corner match. The standard cross-correlation technique assumes that corresponding pixels in the corner neighborhoods are related by a simple global 2D translation. This transformation model does not fully describe the local 2D motion field of the corner neighborhood points: the neighborhood pixels in image 1 are not mapped to the corresponding pixels in image 2. Consequently, cross-correlation of these corner intensity neighborhood points based on Eq. (17) results in an unreliable similarity score. This in turn increases the possibility of incorrect matches.



Fig. 9. Standard correlation neighborhoods for a corner match in a pair of wide baseline images.


6.2. Proposed matching algorithm

Fig. 10. Wide baseline matching algorithm.

This section describes a robust approach to matching corner points in each incoming image frame Vk to those in each reference image. The proposed method dramatically increases the proportion of correct initial matches. Let the detected corner points in reference image VA and in Vk be {pA} and {pk}, respectively. It is assumed that the previous image frame Vk-1 is correctly matched to VA. The recursive algorithm for obtaining a set of corner matches {pk ↔ pA} between VA and Vk, k > 1, is as follows (Fig. 10):

1. Match Vk to the previous image frame Vk-1 to obtain a set of corner matches {pk ↔ pk-1}. Considering the narrow baseline between the two images, the matching method described in Section 6.1 is suitable. Compute the 3 × 3 homography matrix H corresponding to {pk ↔ pk-1}; H approximates the mapping between points in Vk and Vk-1.

2. For a corner point pk in Vk, predict the corresponding point p'k-1 in the previous image frame Vk-1:

   \tilde{p}'_{k-1} = H \tilde{p}_k

3. From the previous set of corner matches {pk-1 ↔ pA} obtained between Vk-1 and the reference image VA, find the 3 corners pk-1 which are closest, in terms of image Euclidean distance, to the predicted point p'k-1. These 3 corners, together with their corresponding corners pA in VA, define a pair of local image regions. Compute an affine transform L corresponding to these 3 corner matches; L models the mapping between points in the pair of local image regions.

4. Use L and p'k-1 to estimate where the corner pk will lie in the reference image VA:

   \tilde{p}'_A = L \tilde{p}'_{k-1} = L H \tilde{p}_k = W \tilde{p}_k \quad (18)

   where p'A = (x'A, y'A) is the predicted corner position in VA.


5. Define a search window of size (2dx + 1) × (2dy + 1) centered about p'A in VA. Only detected corners in VA that lie in this window are considered as potential matches. For a potential corner match pk ↔ pA, where pk = (xk, yk) and pA = (xA, yA), perform intensity correlation of the pixels in a (2n + 1) × (2n + 1) square neighborhood of pk with the corresponding pixels in the warped neighborhood of pA. The warped correlation score is defined as:

   \omega(p_k, p_A) = \frac{\sum_{i=-n}^{n} \sum_{j=-n}^{n} [V_k(m_{ij}) - \bar{V}_k(p_k)] \, [V_A(m_A) - \bar{V}_A(p_A)]}{(2n+1)^2 \, \sigma(V_k) \, \sigma(V_A)} \quad (19)

   m_{ij} = (x_k + i, \; y_k + j), \qquad
   \tilde{m}_A = O W \tilde{m}_{ij}, \qquad
   O = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad
   (t_x, t_y) = p_A - p'_A

   where \bar{V}_k(p_k) and \bar{V}_A(p_A) are the average pixel intensities of the square and warped neighborhood regions about pk in Vk and pA in VA, respectively, and σ(Vk) and σ(VA) are the standard deviations of the pixel intensities in the square and warped neighborhoods of pk and pA, respectively. The potential corner match pk ↔ pA which maximizes the warped correlation score is retained.

6. Repeat steps 2-5 for the next detected corner point in Vk. In this way, a set of high quality initial matches between Vk and VA is obtained.

7. Use the RANSAC procedure to obtain a set of inliers {pk ↔ pA} from the initial set of matches obtained in step 6. {pk ↔ pA} can then be used to match the next image frame Vk+1 to VA (step 3).

The global homography H models the 2D differential motion of points between Vk and Vk-1. This is reasonable since the camera movement is very small over the time period of a single frame. Assuming that the local depth in the image varies smoothly, the local affine transform L can be used to model the 2D motion field between the neighborhood region of p'k-1 in Vk-1 and the corresponding image region in VA.

By chaining L and H, i.e., W in Eq. (18), any corner pk in Vk can be transferred to an approximate position p'A in VA. In general, the predicted position p'A will not coincide with the actual measured corner position pA, but this disparity is quite small in practice. Thus, the search for potential matches can be confined to a small window centered about p'A. Taking into account the translational offset O between the predicted and measured corner positions, the 2D transformation OW approximates the unknown mapping between the neighborhoods of a corner match pk ↔ pA in Vk and VA. Consequently, a viewpoint invariant similarity measure ω(pk, pA) can be defined by the intensity cross-correlation of image points mij in a square neighborhood of pk with the warped image points mA in reference image VA, where mij is mapped to mA by OW.

In this way, initial sets of corner matches that are already about 80% correct are typically generated. This decreases the number of trials required in the RANSAC procedure and reduces the possibility of any remaining false matches. A similar procedure is used to match reference image VB to the current image frame Vk.
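To make the chained transfer and warped scoring concrete, the sketch below builds W of Eq. (18) from the frame-to-frame homography H and the three nearest previous-frame matches, and evaluates the warped correlation of Eq. (19) with nearest-neighbour sampling. It is illustrative only: bounds checking, sub-pixel interpolation and the search-window loop over candidate corners are omitted, and all names are ours.

```python
import numpy as np

def chained_transform(H, p_k, prev_to_ref_matches):
    """Eq. (18): W = L H for a corner p_k of V_k, where H maps V_k to V_{k-1}
    and L is an affine transform fitted to the 3 previous-frame matches
    (p_{k-1}, p_A) nearest to the predicted point p'_{k-1} = H p_k."""
    p_prev = H @ np.array([p_k[0], p_k[1], 1.0])
    p_prev = p_prev[:2] / p_prev[2]
    nearest = sorted(prev_to_ref_matches,
                     key=lambda m: np.linalg.norm(np.asarray(m[0]) - p_prev))[:3]
    prev_pts = np.array([m[0] for m in nearest], dtype=float)
    ref_pts = np.array([m[1] for m in nearest], dtype=float)
    A = np.hstack([prev_pts, np.ones((3, 1))])
    L_aff, *_ = np.linalg.lstsq(A, ref_pts, rcond=None)   # exact affine fit (3 points)
    L = np.vstack([L_aff.T, [0.0, 0.0, 1.0]])             # homogeneous 3x3 affine
    return L @ H                                           # W maps V_k into V_A

def warped_ncc(V_k, p_k, V_A, p_A, W, n=7):
    """Eq. (19): correlate the square window about p_k with the window about
    the candidate p_A warped through OW (nearest-neighbour sampling)."""
    pred = W @ np.array([p_k[0], p_k[1], 1.0])
    t = np.asarray(p_A, dtype=float) - pred[:2] / pred[2]   # offset (t_x, t_y)
    a, b = [], []
    for j in range(-n, n + 1):
        for i in range(-n, n + 1):
            m = W @ np.array([p_k[0] + i, p_k[1] + j, 1.0])
            u, v = m[:2] / m[2] + t
            a.append(float(V_k[p_k[1] + j, p_k[0] + i]))
            b.append(float(V_A[int(round(v)), int(round(u))]))
    a = np.asarray(a) - np.mean(a)
    b = np.asarray(b) - np.mean(b)
    denom = a.std() * b.std() * a.size
    return (a * b).sum() / denom if denom > 0 else -1.0
```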

The success of the proposed wide baseline matching algorithm is heavily dependent on the previous image frame being correctly matched to each reference image, so as to ensure the accuracy of the point transfer and warping via the chained mapping W. It follows that a reliable set of corner matches between the first image frame V1 and each reference image must be obtained. One constraint of the proposed method is that the initial camera position must be close to one of the reference camera positions, e.g., VA, so that the narrow baseline assumption between V1 and VA is valid. The method discussed in Section 6.1 can then be used to compute a reliable set of matches {p1 ↔ pA} to initialize the proposed algorithm. However, this also means that V1 and the second reference image VB will most likely belong to a wide baseline configuration. This problem is overcome by making use of the corner matches {pA ↔ pB} between the two reference images that are obtained during the initial system preparation (Section 2). Since V1 is already matched to VA, a reliable set of corner matches {p1 ↔ pB} between V1 and VB can be identified from {p1 ↔ pA} and {pA ↔ pB}.

7. Experimental results

Section 7.1 compares the standard and warped correlation methods. Section 7.2 compares the camera motion estimates obtained by the standard least-squares method and the proposed temporal regularization technique.

7.1. Comparison between standard and warped correlation

The proposed matching algorithm described in Section 6.2 is used to track a corner point p0 in a reference image across a sequence of 65 image frames captured by a camera translating and rotating away from the reference camera position (Fig. 11).

For the corner match pk ↔ p0, where k = 2, ..., 65, the standard correlation score f(pk, p0) given by Eq. (17) and the warped correlation score ω(pk, p0) given by Eq. (19) are computed. The results are shown in Fig. 11. Both similarity measures use a square neighborhood defined about pk in image k. For the standard cross-correlation method, a similar correlation neighborhood of the same orientation about p0 is defined. For the warped cross-correlation method, the corresponding neighborhood pixel in the reference image is calculated via the transformation OW.

Comparing the results in Fig. 11, as the baseline between image k and the reference image increases, the warped correlation score (solid line) is far more stable (approximately 0.95) and remains relatively invariant to changes in camera viewpoint, compared to the standard correlation score (dashed line), which deteriorates to less than 0.5. The transformation OW provides a more accurate representation of the mapping between neighborhood points of a corner match, as seen in Fig. 12. The resulting warped neighborhood in the reference image is more similar to the original neighborhood defined in image k. Thus, the warped correlation score is a more reliable similarity measure for assessing corner matches in a wide baseline scenario. This reduces the possibility of ignoring actual corner matches.

Fig. 11. Comparison between standard and warped correlation score.

Fig. 12. Comparison between standard and warped correlation neighborhoods about a corner match.

7.2. Comparison between standard least-squares and temporal regularization technique for motion estimation

We filmed a video sequence of a desktop scene. The initial system preparation, i.e., taking reference images, etc., was performed as described in Section 2. The current image frame was matched to each reference image using the algorithm described in Section 6.2. The camera motion TAk, or xAk, of the current image frame relative to reference image VA was estimated using the standard least-squares method (Section 4) and the proposed recursive temporal regularization technique (Section 5.2). Fig. 13 shows the resulting motion estimates. The motion estimates were used to place a virtual wire-frame cube along the edge of a book (Fig. 14), and the augmented video sequences can be downloaded from [25].

Fig. 13. Estimates of relative camera motion xAk by the standard least-squares method and the temporal regularization technique.

From Fig. 13, we can see that the motion estimates obtained by the standard least-squares method are indeed quite noisy, as explained in Section 4. This leads to some serious jittering of the virtual cube, which can be observed in the augmented sequence. In contrast, the noise is substantially reduced by the temporal regularization technique and the motion estimates are also smoother, as previous motion estimates are taken into account. Jittering is still noticeable but is now very small. This enhances the user's perception that the real and virtual worlds coexist.

Fig. 14. The camera pose for each incoming image frame is computed so as to generate the appropriate view of the virtual cube as the camera moves in the scene.

Fig. 15. Placing of annotation is demonstrated to be accurate to below one pixel over 25° rotations for homography calculation. Performance degrades as the image overlap becomes negligible (camera field of view measured at 33°).

7.3. Real-time homography system implementation details and performance

This system lets us calculate homographies at 25 Hz, the capture rate of the PAL video stream, between 320 × 240 pixel images on a standard desktop PC (Dell 2.4 GHz Pentium IV). Typically 50-60 corners per image are identified, of which more than 80% are inliers to the final solution, depending on the amount of overlap of the two images. If fewer than 35% of corners are inliers we consider the calculation to have failed. We apply a number of heuristics to increase the system speed and accuracy: we repeat the RANSAC sampling up to 70 times, but immediately accept the solution if it has greater than 75% support and 15 iterations have already been computed. Using this criterion, the majority of frames are completed early. In order to increase the average quality of the solutions, we ensure that the initial four points in the first image are spatially separated by at least 60 pixels. We also reject homographies that are near singular by testing the determinant, since singular matrices map the entire first image to a line or point in the second image.
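The degeneracy test mentioned above can be expressed very compactly; the sketch below normalizes H for scale before checking the determinant, and the threshold value is an assumption rather than the value used by the authors.

```python
import numpy as np

def is_degenerate(H, det_thresh=1e-3):
    """Reject near-singular homographies, which collapse the first image onto
    a line or point in the second image. H is scaled to unit Frobenius norm so
    the determinant is comparable across frames; the threshold is illustrative."""
    Hn = H / np.linalg.norm(H)
    return abs(np.linalg.det(Hn)) < det_thresh
```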

For a static camera in an indoor environment, the homography was successfully calculated for 500/500 test frames in a 20 s sequence. For each frame, the center point of the image was transformed by the incoming homography. Since the camera is static, we expect it to remain in the same place; the mean deviation was less than 1 pixel per frame. Performance as a function of image rotation is assessed in Figs. 15 and 16. In each case we matched 100 images from two static video streams of the same scene, where the camera had been rotated between captures. Two example frames are shown in Fig. 17. No prior information was given about the direction or magnitude of the camera movement. Performance is at or close to pixel accuracy across a wide range of distances.

Fig. 16. Percentage of successful trials for the homography algorithm as a function of rotation angle. Performance extends to larger angles as the number of iterations of the robust estimation algorithm increases.

To test the calculation speed and accuracy of the homography estimation, we developed a simple application on a Dell 2.4 GHz Pentium IV laptop computer. The application was a wide-scale geographical labeling system. The system attempts to find the velocity map between the current frame and the stored frame. The positions of the geographical annotations are then mapped from the stored frame to the current frame and drawn into the field. Although the scene is not planar and the camera is not under pure rotation either, the results still turn out to be very good, as shown in Fig. 18.

7.4. Complete system test

By considering both the tracking accuracy and the calculation speed, and fine-tuning for the best balance between these two criteria, the final 6DOF camera pose tracking system was successfully run on a normal Dell 2.4 GHz Pentium IV desktop computer at nearly 20 frames per second. In this final system, a 3D virtual character was placed in an arbitrary scene in a lab room, and the camera was then moved toward the virtual character. As shown in Fig. 1, despite the many irregular surfaces in the scene, the system manages to achieve a very promising result with only a little jittering.

Fig. 17. Two example frames from the video streams which were used to assess the system performance. For each pair of frames, we attempt to calculate the motion or flow between the left and right images. The result shows that the image flow is well described by a homography, which maps a point in the first image to a unique point in the second image.


8. Discussion

To summarize, we have presented a system for natural feature tracking in augmented reality. We estimate the current camera position relative to pre-captured reference images by matching corners across the frames and minimizing a cost function based on two- and three-view relations. We apply a time-series-based method to stabilize these estimates. The proposed system currently runs at approximately 10 Hz on a fast desktop PC with 320 × 240 pixel images. However, real-time implementation, i.e., >25 Hz, is definitely a possibility in the near future given the fast pace at which computer hardware is advancing. More importantly, the accuracy and robustness of the real-time camera motion estimation algorithm are the main issues of concern addressed in this paper. We also showed another approach to implementing real-time applications on current computer hardware: simplifying the algorithm by assuming the camera undergoes small translation and near-pure rotation, and estimating the homography between the pre-stored frame and the current frame.

Only a few previous attempts have been made to implement real-time 6DOF motion tracking based on natural features alone in a general environment, e.g., [14]. To the best of our knowledge, all of these methods suffer from the inevitable drift that results from chaining together camera transformations along the time series. We remove this problem by always estimating the camera pose relative to fixed reference images. This aggravates the correspondence problem, but we resolve this difficulty by using information derived from previous correspondences in the time series. As seen from the experimental results, the temporal regularization technique is robust and maintains accurate estimates of the camera motion even when the incoming data is impoverished.



Fig. 18. The geographical labeling of different buildings in the National University of Singapore Kent Ridge campus.



A strong advantage of this system is scalability. It is quite possible to store more than two reference images. By accurately pre-calculating the geometric relationship between a large number of such images, we could potentially perform natural feature tracking over wide areas. The current camera position could be used to determine which of the many possible reference images to match to.
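One plausible way to realize this selection, not spelled out in the paper, is to keep the pre-calibrated camera centre of each reference image and simply pick the one nearest to the current pose estimate. The sketch below assumes such centres are available and uses invented names and positions.

```python
import numpy as np

def nearest_reference(camera_position, reference_centres):
    """Select the reference image whose pre-calibrated camera centre lies
    closest to the current estimated camera position.
    `reference_centres` maps an image id to a 3-vector in world coordinates."""
    ids = list(reference_centres)
    centres = np.array([reference_centres[i] for i in ids])
    distances = np.linalg.norm(centres - np.asarray(camera_position), axis=1)
    return ids[int(np.argmin(distances))]

# Sketch: three pre-captured reference views at assumed positions (metres).
refs = {"ref_A": [0.0, 0.0, 0.0], "ref_B": [1.5, 0.0, 0.2], "ref_C": [3.0, 0.1, 0.0]}
active = nearest_reference([1.2, 0.05, 0.1], refs)
```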

Moreover, the system works with completely general scenes and does not fail if the scene structure becomes degenerate, e.g., if all points fall on a plane. As the corner matching is based on the RANSAC method, which is tolerant to changes or movement in parts of the image, the only requirements of the system are that the scene be mostly static and rigid and contain enough texture to reliably identify corners across images.

References

[1] U. Neumann, Y. Cho, A self-tracking augmented reality system, in: Proceedings of VRST, July 1996, pp. 109–115.

[2] H. Kato, M. Billinghurst, Marker tracking and HMD calibration for a video-based augmented reality conferencing system, in: Proceedings of IWAR, October 1999, pp. 85–94.

[3] D. Dementhon, L. Davis, Model-based object pose in 25 lines of code, IJCV 15 (1995) 123–141.

[4] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (6) (1981) 381–395.

[5] G. Simon, A. Fitzgibbon, A. Zisserman, Markerless tracking using planar structures in the scene, in: Proceedings of ISAR, 2000, pp. 120–128.

[6] S.J.D. Prince, K. Xu, A.D. Cheok, Augmented reality camera tracking with homographies, IEEE Comput. Graph. Appl. 22 (2002) 39–45.

[7] K. Cornelis, M. Pollefeys, M. Vergauwen, L.V. Gool, Augmented Reality using Uncalibrated Video Sequences, Technical Report No. KUL/ESAT/PSI/0002, PSI-ESAT, K.U. Leuven, Belgium, 2000.

[8] B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle adjustment – a modern synthesis, in: Proceedings of the International Workshop on Vision Algorithms: Theory and Practice, September 1999, pp. 298–372.

[9] D. Nister, An efficient solution to the five-point relative pose problem, IEEE Trans. Pattern Anal. Mach. Intell. 26 (6) (2004) 756–777.

[10] T. Broida, S. Chandrashekhar, R. Chellappa, Recursive 3-D motion estimation from a monocular image sequence, IEEE Trans. Aerosp. Electron. Syst. 26 (1990) 639–656.

[11] S. Avidan, A. Shashua, Threading fundamental matrices, in: Proceedings of the Fifth European Conference on Computer Vision, 1998, pp. 124–140.

[12] A.J. Davison, Real-time simultaneous localisation and mapping with a single camera, in: ICCV ’03: Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003, p. 1403.

[13] A. Fitzgibbon, A. Zisserman, Automatic camera recovery for closed or open image sequences, in: Proceedings of the Fifth European Conference on Computer Vision, June 1998, pp. 311–326.

[14] Z. Zhang, Y. Shan, Incremental Motion Estimation through Local Bundle Adjustment, Technical Report MSR-TR-01-54, Microsoft Research, May 2001.

[15] Available from: <http://www.hitl.washington.edu/artoolkit/>.

[16] C.J. Harris, M. Stephens, A combined corner and edge detector, in: Proceedings of the Fourth Alvey Vision Conference, 1988, pp. 147–151.

[17] Z. Zhang, O. Faugeras, Q.T. Luong, A Robust Technique for Matching two Uncalibrated Images through the Recovery of the Unknown Epipolar Geometry, Technical Report, INRIA, May 1994.



[18] P. Pritchett, A. Zisserman, Wide baseline stereo matching, ICCV (1998) 754–760.

[19] A. Baumberg, Reliable feature matching across widely separated views, CVPR, June 2000.

[20] M. Uenohara, T. Kanade, Vision-based object registration for real-time image overlay, Int. J. Comput. Biol. Med. 25 (1995) 249–260.

[21] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.

[22] O.D. Faugeras, What can be seen in three dimensions with an uncalibrated stereo rig?, in: Proceedings of the Second European Conference on Computer Vision, 1992, pp. 563–578.

[23] D. Eberly, Rotation Representations and Performance Issues, Magic Software, Chapel Hill, NC.

[24] Z. Zhang, Parameter Estimation Techniques: A Tutorial with Application to Conic Fitting, Technical Report, INRIA, October 1995.

[25] Available from: <http://mixedreality.nus.edu.sg/>.