


Adaptive View-Based Appearance Models

Louis-Philippe Morency Ali Rahimi Trevor Darrell

MIT Artificial Intelligence Laboratory, Cambridge, MA 02139

Abstract

We present a method for online rigid object tracking using an adaptive view-based appearance model. When the object's pose trajectory crosses itself, our tracker has bounded drift and can track objects undergoing large motion for long periods of time. Our tracker registers each incoming frame against the views of the appearance model using a two-frame registration algorithm. Using a linear Gaussian filter, we simultaneously estimate the pose of the object and adjust the view-based model as pose-changes are recovered from the registration algorithm. The adaptive view-based model is populated online with views of the object as it undergoes different orientations in pose space, allowing us to capture non-Lambertian effects. We tested our approach on a real-time rigid object tracking task using stereo cameras and observed an RMS error within the accuracy limit of an attached inertial sensor.

1. Introduction

Accurate drift-free tracking is an important goal of many computer vision applications. Traditional models for frame-to-frame tracking accumulate drift even when viewing a previous pose. In this paper we show how to use view-based appearance models to allow existing two-frame registration algorithms to track objects over long distances with bounded drift. We use a 6 degree-of-freedom (DOF) rigid body registration algorithm to track against a view-based model that is acquired and refined concurrently with tracking.

The appearance model used in this paper maintains views (key frames) of the object under various poses. These views are annotated with the pose of the object, as estimated by the tracker. Tracking against the appearance model entails registering the current frame against previous frames and all relevant key frames. The adaptive view-based appearance model can be updated by adjusting the pose parameters of the key frames, or by adding or removing key frames. These online updates are non-committal so that further tracking can correct earlier mistakes introduced into the model.

View-based models can capture non-Lambertian reflectance in a way that makes them well suited for tracking rigid bodies. We show that our appearance model has bounded drift when the object's pose trajectory crosses itself. We compare our pose tracking results with the orientation estimate of an Inertia Cube2 inertial sensor [11]. On a Pentium 4 1.7GHz, our tracker implementation runs at 7Hz.

The following section discusses related previous work for tracking. Subsequent sections describe the data structure we use to maintain the appearance model and our algorithm for recovering the pose of the current frame and for populating and adjusting the appearance model. We then report experiments with our approach and a 3D view registration algorithm that obtains range and intensity frames using a commercial stereo system [7]. Finally we show the generality of our approach by tracking the 6-DOF pose of a general object using the same stereo system.

2. Previous Work

Representations used for tracking range from aggregate statistics about the subject's appearance to generative rendering models of that appearance. Trackers which model the appearance of the subject using aggregate statistics include [2] and [16]. These use the distribution of skin-color pixels to localize the head; the distribution can be adapted to fit the subject as tracking goes on. To recover pose, these techniques rely on characteristics of the aggregate distribution, which is influenced by many factors, only one of which is pose. Thus the tracking does not lock on to the target tightly.

Graphics-based representations model the appearance of the target more closely, and thus tracking can lock onto the subject more tightly. Textured geometric 3D models [13, 1] can represent the target under different poses. Because the prior 3D shape models for these systems do not adapt to the user, they tend to have limited tracking range.

Deformable 3D models fix this problem by adapting the shape of the model to the subject [12, 14, 3, 6]. These approaches maintain the 3D structure of the subject in a state vector which is updated recursively as images are observed.



These updates require that correspondences between features in the model and features in the image be known. Computing these correspondences reliably is difficult, and the complexity of the update grows quadratically with the number of 3D features, making the updates expensive [12].

Linear subspace methods have been used in several face tracking systems. [8] models the change in appearance due to lighting variations and affine motion with a subspace representation. The representation is acquired during a separate training phase, with the subject fronto-parallel to the camera, and viewed under various lighting conditions. Cootes and Taylor track using a linear subspace for shape and texture [5]. The manifold underlying the appearance of an object under varying poses is highly non-linear, so these methods work well with relatively small pose changes only.

This paper augments our previous work [18], which used an appearance model made up of all frames available in the input video sequence. In this previous paper, frames are registered against several base frames off-line, and are assigned poses which are most consistent with these registrations. This paper uses a similar appearance model by representing the subject with a subset of the frames seen so far in the input sequence. These key frames are annotated with their estimated pose, and collectively represent the appearance of the subject as viewed from these estimated poses. Unlike [18], our algorithm runs online. It operates without prior training, and does not use an approximate shape model for the subject.

3. Adaptive View-Based Model

Our adaptive view-based model consists of a collection of pose-annotated key frames acquired using a stereo camera during tracking (Figure 1). For each key frame, the view-based model maintains the following information:

$$M_s = \{ I_s, Z_s, x_s \}$$

where $I_s$ and $Z_s$ are the intensity and depth images associated with the key frame. The adaptive view-based model is defined by the set $\{M_1 \ldots M_k\}$, where $k$ is the number of key frames. We think of the pose of each key frame as a Gaussian random variable whose distribution is to be refined during the course of tracking. $x_s = [\,T^x\; T^y\; T^z\; \Omega^x\; \Omega^y\; \Omega^z\,]$ is a 6-dimensional vector consisting of the translation and the three Euler angles, representing the mean of each random variable. Although in this paper we use a rigid body motion representation for the pose of each frame, any representation, such as affine or translational, could be used. The view-based model also maintains the correlation between these random variables in a matrix $\Lambda_M$, which is the covariance of these poses when they are stacked up in a column vector.
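To make this bookkeeping concrete, the following is a minimal sketch of such a model in Python. It is an illustration only, not the authors' implementation: the class and method names are invented, poses are stored as 6-vectors, and the joint covariance $\Lambda_M$ is grown block by block as key frames are added.

```python
import numpy as np

class KeyFrame:
    """One view M_s = {I_s, Z_s, x_s}: intensity image, depth image, and mean pose."""
    def __init__(self, intensity, depth, pose_mean):
        self.I = intensity                 # intensity image
        self.Z = depth                     # depth (range) image
        self.x = np.asarray(pose_mean)     # [Tx, Ty, Tz, wx, wy, wz]

class ViewBasedModel:
    """Collection of pose-annotated key frames plus their joint pose covariance Lambda_M."""
    def __init__(self):
        self.keyframes = []                  # [M_1, ..., M_k]
        self.Lambda_M = np.zeros((0, 0))     # covariance of the stacked poses (6k x 6k)

    def add_keyframe(self, intensity, depth, pose_mean, pose_cov):
        """Append a key frame and grow the joint covariance with a new 6x6 block."""
        self.keyframes.append(KeyFrame(intensity, depth, pose_mean))
        k = len(self.keyframes)
        new_cov = np.zeros((6 * k, 6 * k))
        new_cov[:6 * (k - 1), :6 * (k - 1)] = self.Lambda_M
        new_cov[6 * (k - 1):, 6 * (k - 1):] = pose_cov   # initially uncorrelated with the others
        self.Lambda_M = new_cov
```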

While tracking, three adjustments can be made to the adaptive view-based model: the tracker can correct the pose of each key frame, insert a new key frame, or remove an existing key frame.

Adding new frames into this appearance model entails inserting a new $M_s$ and upgrading the covariance matrix. Traditional 3D representations, such as global mesh models, may require expensive stitching and meshing operations to introduce new frames.

Adaptive view-based models provide a compact representation of objects in terms of the pose of the key frames. The appearance of the object can be tuned by updating a few parameters. In Section 4.3, we show that our model can be updated by solving a linear system of the order of the number of key frames in the model. On the other hand, 3D mesh models require that many vertices be modified in order to effect a significant change in the object representation. When this level of control is not necessary, a 3D mesh model can be an expensive representation.

View-based appearance models can provide robustness to variation due to non-Lambertian reflectance. Each point on the subject is exposed to varying reflectance conditions as the subject moves around (see Figure 1). The set of key frames containing these points captures these non-Lambertian reflectances. Representing similar non-Lambertian reflectance with a 3D model is more difficult, requiring that an albedo model be recovered, or that the texture be represented using view-based textures.

The following section discusses how to track rigid objects using our adaptive view-based appearance model.

4. Tracking and View-Based Model Adjustments

In our framework, tracking and pose adjustments to the adaptive view-based model are performed simultaneously. As the tracker acquires each frame, it seeks to estimate the new frame's pose as well as that of the key frames, using all data seen so far. That is, we want to approximate the posterior density:

$$p(x_t, x_M \mid y_{1..t}), \qquad (1)$$

where $x_t$ is the pose of the current frame, $y_{1..t}$ is the set of all observations from the registration algorithm made so far, and $x_M$ contains the poses of the key frames in the view-based model, $x_M = \{x_1 \ldots x_k\}$.

Each incoming frame $(I_t, Z_t)$ is registered against several base frames. These base frames consist of key-frames chosen from the appearance model, and the previous frame $(I_{t-1}, Z_{t-1})$. The registration is performed only against key-frames that are likely to yield sensible pose-change measurements. The next section discusses how these key-frames are chosen. These pose-change estimates are modelled as Gaussians and combined using a Gauss-Markov update (Sections 4.2 and 4.3).

Figure 1: The view-based model represents the subject under varying poses. It also implicitly captures non-Lambertian reflectance as a function of pose. Observe the reflection in the glasses and the lighting difference when facing up and down.

4.1. Selecting Base Frames

Occlusions, lighting effects, and other unmodeled effects limit the range of many registration algorithms. For example, when tracking heads, our 6 DOF registration algorithm returns a reliable pose estimate if the head has undergone a rotation of at most 10 degrees along any axis. Thus to obtain reasonable tracking against the appearance model, the algorithm must select key-frames whose true poses are within tracking range of the pose of the current frame.

To find key-frames whose pose is similar to the current frame, we look for key-frames whose appearance is similar to that of the current frame. This assumes that the primary factor governing appearance is pose. We compute the change in appearance between those two images, and tentatively accept a key-frame as a base frame if the change in their appearances is within a threshold. To compute the appearance distance, the intensity images of the frames are aligned with respect to translation. The L2 distance between the resulting images is used as the final appearance distance.
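A minimal sketch of this appearance distance is given below. It assumes grayscale images of equal size and searches a small window of integer translations; the function name and the search radius are illustrative choices, not details from the paper.

```python
import numpy as np

def appearance_distance(current, keyframe, max_shift=8):
    """L2-style distance after aligning the two intensity images over integer translations."""
    best = np.inf
    h, w = current.shape
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Overlapping regions of the two images under the shift (dy, dx).
            a = current[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            b = keyframe[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
            # Mean squared difference: an overlap-normalized L2 distance.
            d = np.mean((a.astype(float) - b.astype(float)) ** 2)
            best = min(best, d)
    return best  # threshold this value to tentatively accept the key-frame as a base frame
```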

This scheme works well if there is a one-to-one mapping between appearance and pose. But in some situations, different poses may yield the same appearance. This happens with objects with repetitive textures, such as floor tiles [19] or a calibration cube all of whose sides look identical. To disambiguate these situations, key-frames that sufficiently resemble the current frame are chosen as base frames only if their pose is likely to be within tracking range of the current frame.

We assess the probability that the pose of a key frame is within tracking range by first estimating the pose of the current frame. This estimate is obtained by registering the current frame against the previous frame and applying the linear Gaussian filter equations described later in Section 4.3.

Suppose this estimated pose of the current frame follows a Gaussian with mean $x_t$ and covariance $\Lambda_t$, and the pose of a key-frame under consideration has mean $x_s$ and covariance $\Lambda_s$. Then the probability that the rotations of these poses are all within $\theta_0$ degrees of each other is the integral of the joint distribution of the poses over the region where this constraint holds. For rotation about the X axis, this probability is

$$\int_{(\Omega^X_s,\, \Omega^X_t)\ \mathrm{s.t.}\ |\Omega^X_s - \Omega^X_t| < \theta_0} p(\Omega^X_s, \Omega^X_t)\, d\Omega^X_s\, d\Omega^X_t,$$

where $\Omega^X$ is the rotation component of a pose vector $x$ about the X axis. To evaluate this integral, define the Gaussian random variable $\Delta_X = \Omega^X_s - \Omega^X_t$. By linearity of expectation we can compute the mean $E[\Delta_X]$ from $x_s$ and $x_t$, and its variance $\mathrm{var}[\Delta_X]$ from $\Lambda_M$. The above probability can be expressed using a one-dimensional integral:

$$\Pr[|\Omega^X_s - \Omega^X_t| < \theta_0] = \Pr[|\Delta_X| < \theta_0] \qquad (2)$$
$$= G(\theta_0 \mid E[\Delta_X], \mathrm{var}[\Delta_X]) - G(-\theta_0 \mid E[\Delta_X], \mathrm{var}[\Delta_X]),$$

where $G$ is the cumulative distribution function of the Gaussian. This probability is evaluated for rotations about the Y and Z axes as well. If this probability is sufficiently large for all three rotational components, the frames which are similar in appearance are declared to be within tracking range of each other, and the key-frame can be safely used as a base frame.
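As an illustration of equation (2), the sketch below evaluates this probability for one rotation axis using SciPy's normal CDF. The variance of $\Delta_X$ is formed from the two marginal variances and their cross-covariance; the function name and arguments are ours, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def prob_within_range(mean_s, mean_t, var_s, var_t, cov_st, theta0):
    """Pr[|Omega_s - Omega_t| < theta0] for one rotation axis, as in eq. (2)."""
    mean_delta = mean_s - mean_t
    var_delta = var_s + var_t - 2.0 * cov_st   # variance of the difference of jointly Gaussian variables
    std = np.sqrt(max(var_delta, 1e-12))
    return (norm.cdf(theta0, loc=mean_delta, scale=std)
            - norm.cdf(-theta0, loc=mean_delta, scale=std))

# Example: key-frame and current-frame X rotations about 6 degrees apart, fairly certain poses.
p = prob_within_range(20.0, 14.0, 4.0, 2.0, 0.5, theta0=10.0)
print(p)   # high, so the key-frame may serve as a base frame for this axis
```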

4.2. Pairwise Registration Algorithm

Once suitable base frames have been chosen from the view model, a registration algorithm computes their pose difference with respect to the current frame. In this section, we model the observation as the true pose difference between two frames, corrupted by Gaussian noise. This Gaussian approximation is used in the following section to combine pose-change estimates to update the distribution of (1).

The registration algorithm operates on the current frame $(I_t, Z_t)$, which has unknown pose $x_t$, and a base frame $(I_s, Z_s)$ acquired at time $s$, with pose $x_s$. It returns an observed pose-change estimate $y^t_s$. We presume that this pose-change is probabilistically drawn from a Gaussian distribution $\mathcal{N}(y^t_s \mid x_t - x_s, \Lambda_{y|xx})$. Thus pose-changes are assumed to be additive and corrupted by Gaussian noise.

We further assume that the current frame was generated by warping a base frame and adding white Gaussian noise. Under these circumstances, if the registration algorithm reports the mode of

$$\epsilon(y^t_s) = \sum_{x \in R} \left\| I_t\big(x + u(x; y^t_s)\big) - I_s(x) \right\|^2,$$

where $u(x; y^t_s)$ is the image-based flow, then the result of [18] can be used to fit a Gaussian noise model to the reported pose-change estimate. Using Laplace's approximation, it can be shown that the likelihood model for the pose-change $x_t - x_s$ can be written as $\mathcal{N}(y^t_s \mid x_t - x_s, \Lambda_{y|xx})$, where

$$\Lambda_{y|xx} = \frac{1}{\epsilon(y^t_s)}\, \frac{\partial^2}{\partial y^2}\, \epsilon(y^t_s). \qquad (3)$$

4.3. Updating Poses

This section shows how to incorporate a set of observed pose-changes into the posterior distribution of (1). By assuming that these observations are the true pose-change corrupted by Gaussian noise, we can employ the Gauss-Markov equation.

Suppose that at time $t$, there is an up-to-date estimate of the pose $x_{t-1}$ and of the frames in the model, so that $p(x_{t-1}, x_M \mid y_{1..t-1})$ is known. Denote the new pose-change measurements as $y_{1..t} = \{y_{1..t-1}, y^t_{t-1}, y^t_{M_1}, y^t_{M_2}, \ldots\}$, where $M_1, M_2, \ldots$ are the indices of key frames selected as base frames. We would like to compute $p(x_t, x_M \mid y_{1..t})$.

The update first computes a prior for $p(x_t \mid y_{1..t-1})$ by propagating the marginal distribution of $p(x_{t-1} \mid y_{1..t-1})$ one step forward using a dynamical model. This is similar to the prediction step of the Kalman filter.
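The prediction step can be sketched as follows, assuming a constant-pose dynamical model with additive process noise. The 6x6 process-noise covariance Q is an assumption of this example; the paper does not specify the parameters of its dynamical model.

```python
import numpy as np

def predict_current_pose(x_prev, P_prev, Q):
    """Prior for the new frame's pose under a constant-pose model x_t = x_{t-1} + noise."""
    x_prior = x_prev.copy()    # predicted mean: pose carried over from the previous frame
    P_prior = P_prev + Q       # predicted covariance grows by the process noise
    return x_prior, P_prior

Q = np.diag([1e-4] * 3 + [1e-3] * 3)   # hypothetical translation/rotation process noise
```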

The variables involved in the update are $x_t$, the previous frame pose $x_{t-1}$, and the key-frames chosen as base frames $x_{M_1}, x_{M_2}$, etc. These are stacked together in a variable $X$:

$$X = \left[\, x_t \;\; x_{t-1} \;\; x_{M_1} \;\; x_{M_2} \;\; \cdots \,\right]^\top.$$

The covariance between the components of $X$ is denoted by $\Lambda_X$. The rows and columns of $\Lambda^{old}_X$ corresponding to the poses of the key-frames are mirrored in $\Lambda_M$. Together, $X$ and $\Lambda_X$ completely determine the posterior distribution over the pose of the key-frames, the current frame, and the previous frame.

Following the result of Section 4.2, a pose-change measurement $y^t_s$ between the current frame and a base frame in $X$ is modeled as having come from:

$$y^t_s = CX + \omega, \qquad C = \left[\, I \;\; 0 \;\; \cdots \;\; -I \;\; \cdots \;\; 0 \,\right],$$

where $\omega$ is Gaussian with covariance $\Lambda_{y|xx}$. Each pose-change measurement $y^t_s$ is used to update all poses using the Kalman filter update equations:

$$[\Lambda^{new}_X]^{-1} = [\Lambda^{old}_X]^{-1} + C^\top \Lambda^{-1}_{y|xx} C \qquad (4)$$
$$X^{new} = \Lambda^{new}_X \left( [\Lambda^{old}_X]^{-1} X^{old} + C^\top \Lambda^{-1}_{y|xx}\, y^t_s \right) \qquad (5)$$

After individually incorporating the pose-changes $y^t_s$ using this update, $X^{new}$ is the mean of the posterior distribution $p(x_t, x_{t-1}, x_M \mid y_{1..t})$ and $\Lambda^{new}_X$ is its covariance. This distribution can be marginalized by picking out the appropriate elements of $X^{new}$ and $\Lambda^{new}_X$.

4.4. Convergence

In the context of our tracker, we define drift to mean growing uncertainty in the estimate of the pose of the current frame and key-frames. We show here that the updates of the previous section can only lower the uncertainty in these pose estimates.

The elements along the diagonal of $\Lambda_M$ are the marginal variances for the pose of each frame, i.e. their pose uncertainty. We therefore prove that the updates of the previous section shrink these elements. Using the matrix inversion lemma, we can write equation (4) as:

$$\Lambda^{new}_X = \Lambda^{old}_X - \Lambda^{old}_X C^\top \left( \Lambda_{y|xx} + C \Lambda^{old}_X C^\top \right)^{-1} C \Lambda^{old}_X. \qquad (6)$$

Because positive definite matrices have diagonal entries greater than zero, we can show that the diagonal entries of $\Lambda^{new}_X$ are no larger than those of $\Lambda^{old}_X$ by proving that the second term in equation (6) is positive semi-definite.

This term is an outer product, and is positive semi-definite if $\left( \Lambda_{y|xx} + C \Lambda^{old}_X C^\top \right)^{-1}$ is positive definite. This is the case because the sum of a positive definite matrix ($\Lambda_{y|xx}$) and a positive semi-definite matrix ($C \Lambda^{old}_X C^\top$) is positive definite, and so is its inverse. This shows that the uncertainty in the pose of the key frames is non-increasing.

Because the uncertainty in the key-frame poses shrinks, the uncertainty in the pose estimate $x_t$ of any frame registered against a key-frame must also be bounded: initially, $x_t$ is the sum of the pose of a key frame and a pose change. Hence its variance is the sum of the variances of these. Subsequent registrations reduce this variance as per the argument above. Thus a loose upper bound on the variance of $x_t$ is the variance of any of the key-frames, plus the variance $\Lambda_{y|xx}$ of the pose-change estimate.

We have shown that the marginal variance of the pose of the key frames shrinks. Using this argument, we proved that the marginal variance of any frame registered against these is also bounded.

5. Acquisition of View-Based Model

This section describes an online algorithm for populating the view-based model with frames and poses. After estimating the pose $x_t$ of each frame as per the updates of Section 4.3, the tracker decides whether the frame should be inserted into the appearance model as a key-frame.

A key-frame should be available whenever the object returns to a previously visited pose. To identify poses that the object is likely to revisit, the 3-dimensional space of rotations is tessellated into adjacent regions, each maintaining a representative key-frame. Throughout tracking, each region is assigned the frame that most likely belongs to it. This ensures that key-frames provide good coverage of the pose space, while retaining only those key-frames whose pose can be determined with high certainty.

The probability that a particular frame belongs to a region centered at $x_r$ is:

$$\Pr[x_t \in B(x_r)] = \int_{x \in B(x_r)} \mathcal{N}(x \mid x_t, \Lambda_t)\, dx,$$

where $B(x)$ is the region centered around a location $x$, and $\Lambda_t$ is the pose covariance of the current frame, which can be read from $\Lambda_M$.

If this frame belongs to a region with higher probability than any other frame so far, it is the best representative for that region, and so the tracker assigns it there. If the frame does not belong to any region with sufficiently high probability, or all regions already maintain key-frames with higher probability, the frame is discarded.
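The sketch below illustrates this decision for the rotational part of the pose. It approximates the integral by Monte Carlo sampling from the frame's pose Gaussian, which is a substitution for the exact integral; the region size, sample count, and function names are our own choices for the example.

```python
import numpy as np

def region_membership_prob(x_t, Lambda_t, region_center, half_width, n_samples=2000):
    """Monte Carlo estimate of Pr[x_t in B(x_r)] for a box-shaped rotation region."""
    samples = np.random.multivariate_normal(x_t, Lambda_t, size=n_samples)
    inside = np.all(np.abs(samples - region_center) < half_width, axis=1)
    return inside.mean()

def maybe_insert_keyframe(frame, x_t, Lambda_t, regions, min_prob=0.5):
    """Assign the frame to the region it most likely belongs to, if it beats that region's key-frame."""
    rot_mean, rot_cov = x_t[3:], Lambda_t[3:, 3:]        # rotation block of the pose
    probs = [region_membership_prob(rot_mean, rot_cov, r["center"], r["half_width"])
             for r in regions]                           # regions: dicts with center, half_width, best_prob, keyframe
    best = int(np.argmax(probs))
    if probs[best] >= min_prob and probs[best] > regions[best]["best_prob"]:
        regions[best]["best_prob"], regions[best]["keyframe"] = probs[best], frame
        return True
    return False                                         # otherwise the frame is discarded
```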

The above criteria exhibit three desirable properties: 1) frames are assigned to regions near their estimated pose; 2) frames with low certainty in their pose are penalized, because the integral of a Gaussian over a fixed volume decreases as the variance of the Gaussian grows; 3) key-frames are replaced when better key-frames are found for a given region.

When a frame is added to the appearance model, $X$ and $\Lambda_M$ must be upgraded. This involves creating a new slot as the last element of $X$ and moving the first component of $X$ (which corresponds to $x_t$) to that slot. These changes are similarly reflected in $\Lambda_M$. The slot for $x_t$ in $X$ is initialized to zero. This new $x_t$ is initially assumed to be very uncertain and independent of all frames observed so far, so its corresponding rows and columns in $\Lambda_M$ are set to zero. Following these operations, the updates from Section 4.3 can be applied to subsequent frames.
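In matrix terms, this bookkeeping might look like the following sketch, where the first 6-dimensional block of $X$ holds the current frame; the function name and the exact handling of the cross-covariance blocks are our interpretation of the text, not the authors' code.

```python
import numpy as np

def promote_current_frame_to_keyframe(X, Lambda_X):
    """Move the current frame's block to a new key-frame slot and reset the x_t slot."""
    n = X.size                                  # 6 * (number of frames currently in X)
    X_new = np.concatenate([X, X[:6]])          # copy x_t into a new trailing key-frame slot
    X_new[:6] = 0.0                             # the slot for the next x_t starts at zero

    L_new = np.zeros((n + 6, n + 6))
    L_new[:n, :n] = Lambda_X
    L_new[n:, n:] = Lambda_X[:6, :6]            # new key-frame keeps x_t's covariance block
    L_new[n:, :n] = Lambda_X[:6, :]             # and its correlations with the other poses
    L_new[:n, n:] = Lambda_X[:, :6]
    L_new[:6, :] = 0.0                          # next x_t: rows and columns zeroed, as in the text
    L_new[:, :6] = 0.0
    return X_new, L_new
```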

6. Experiments

This section presents three experiments where the view-based appearance model is applied to tracking objects undergoing large movements in the near-field (~1m) for several minutes. All three experiments use a 6 DOF registration algorithm (described in the following subsection) to track the object and create an appearance model. In the first experiment, we qualitatively compare 3 approaches to head pose tracking: differential tracking, tracking with the first frame as the only key-frame, and our adaptive view-based model. In the second experiment, we present a quantitative analysis of our view-based tracking approach by comparing with an Inertia Cube2 inertial sensor. Finally, we show that the view-based appearance model can track general objects, including a hand-held puppet. All the experiments were done using a Videre Design stereo camera [7].

6.1. 6 DOF Registration Algorithm

Given frames $(I_t, Z_t)$ and $(I_s, Z_s)$, the registration algorithm estimates the pose change $y^t_s$ between these frames. It first identifies the object of interest by assuming that it is the front-most object in the scene, as determined by the range images $Z_t$ and $Z_s$. These pixels are grouped in regions of interest $R_t$ and $R_s$. The registration parameters are computed in several steps: first the centers of mass of the regions of interest are aligned in 3D translation. This translational component is then refined using 2D cross-correlation in the image plane. Finally, a finer registration algorithm [15] based on Iterative Closest Point (ICP) and the Brightness Constancy Constraint Equation (BCCE) is applied.
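A rough sketch of the first two initialization steps is given below, assuming SciPy is available and that the region of interest has already been segmented from the range image; the function names are our own, and the final ICP/BCCE refinement of [15] is not reproduced here.

```python
import numpy as np
from scipy.signal import correlate2d

def initial_translation_3d(points_t, points_s):
    """Step 1: align the 3D centers of mass of the two regions of interest."""
    return points_t.mean(axis=0) - points_s.mean(axis=0)   # translation taking frame s onto frame t

def refine_translation_2d(I_t, I_s):
    """Step 2: refine the in-plane translation by 2D cross-correlation of the intensity images."""
    a = I_t.astype(float) - I_t.mean()
    b = I_s.astype(float) - I_s.mean()
    corr = correlate2d(a, b, mode="same", boundary="fill")
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    center = np.array(corr.shape) // 2
    # Peak offset from the center gives the pixel shift (sign convention per correlate2d),
    # which seeds the iterative ICP + BCCE registration.
    return dy - center[0], dx - center[1]
```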

The correlation step provides a good initialization point for the iterative ICP and BCCE registration algorithm. Centering the regions of interest reduces the search window of the correlation tracker, making it more efficient.

The ICP algorithm iteratively computes correspondences between points in the depth images and finds the transformation parameters which minimize the distance between these pixels. By using depth values obtained from the range images, the BCCE can also be used to recover 3D pose-change estimates [9]. Combining these approaches is advantageous because BCCE registers intensity images whereas ICP is limited to registering range imagery. In addition, we have empirically found that BCCE provides superior performance in estimating rotations, whereas ICP provides more accurate translation estimates.

To combine these registration algorithms, their objective functions are summed and minimized iteratively. At each step of the minimization, correspondences are computed for building the ICP cost function. Then the ICP and BCCE cost functions are linearized, and the locally optimal solution is found using a robust least-squares solver [10]. This process usually converges within 3 to 4 iterations. For more details, see [15].

6.2. Head Pose Tracking

We tested our view-based approach with sequences obtained from a stereo camera [7] recording at 5Hz. The tracking was initialized automatically using a face detector [20]. The pose space used for acquiring the view-based model was evenly tessellated in rotation. The registration algorithm used about 2500 points per frame. On a Pentium 4 1.7GHz, our C++ implementation of the complete rigid object tracking framework, including frame grabbing, 3D view registration and pose updates, runs at 7Hz.

Figure 2 shows tracking results from a 2 minute test sequence. The subject underwent rotations of about 110 degrees and translations of about 80cm, including some translation along the Z axis. We compared our view-based approach with a differential tracking approach which registers each frame with its previous frame, concatenating the pose changes. To gauge the utility of multiple key-frames, we show results when the first frame in the sequence is the only key-frame.

The left column of Figure 2 shows that the differential tracker drifts after a short while. When tracking with only the first frame and the previous frame (center column), the pose estimate is accurate when the subject is near-frontal but drifts when the subject moves outside this region. The view-based approach (right column) gives accurate poses during the entire sequence for both large and small movements. Usually, the tracker used 2 or 3 base frames (including the previous frame) to estimate pose.

6.3. Ground Truth Experiment

To analyze our algorithm quantitatively, we compared our results to an Inertia Cube2 sensor from InterSense [11]. Inertia Cube2 is an inertial 3-DOF (degree of freedom) orientation tracking system. The sensor was mounted on the inside structure of a construction hat. By sensing gravity and the earth's magnetic field, the Inertia Cube2 estimates for the X and Z axes (where Z points outside the camera and Y points up) are mostly driftless, but the Y axis estimate can suffer from drift. InterSense reports an absolute pose accuracy of 3° RMS when the sensor is moving.

We recorded 4 sequences with ground truth poses using the Inertia Cube2 sensor. The sequences were recorded at 6 Hz and the average length is 801 frames (~133 sec). During recording, subjects underwent rotations of about 125 degrees and translations of about 90cm, including translation along the Z axis.

Figure 2: Comparison of face tracking results using a 6 DOF registration algorithm. Columns, left to right: previous frame only; first and previous frames; adaptive view-based model. Rows represent results at 31.4s, 52.2s, 65s, 72.6s, 80s, 88.4s, 113s and 127s. The thickness of the box around the face is inversely proportional to the uncertainty in the pose estimate (the determinant of the covariance of $x_t$). The number of indicator squares below the box indicates the number of base frames used during tracking.




Figure 3: Head pose estimation using an adaptive view-based appearance model.

             Pitch   Yaw    Roll   Total
Sequence 1   2.88    3.19   2.81   2.97
Sequence 2   1.73    3.86   2.32   2.78
Sequence 3   2.56    3.33   2.80   2.92
Sequence 4   2.26    3.62   2.39   2.82

Table 1: RMS error (degrees) for each sequence. Pitch, yaw and roll represent rotation around the X, Y and Z axes, respectively.

Figure 3 shows the pose estimates of our adaptive view-based tracker for sequence 1. Figure 4 compares the tracking results of this sequence with the inertial sensor. The RMS errors for all 4 sequences are shown in Table 1. Our results suggest that our tracker is accurate to within the resolution of the Inertia Cube2 sensor.

6.4. General Object Tracking

Since our tracking approach doesn't use any prior information about the object, our algorithm works on different classes of objects without changing any parameters. Our last experiment uses the same tracking technique described in this paper to track a puppet. The position of the puppet in the first frame was defined manually. Figure 5 presents the tracking results.

7. Conclusion

We presented a method for online rigid object tracking using adaptive view-based appearance models. The tracker registers each incoming frame against the key-frames of the view-based model using a two-frame 3D registration algorithm. Pose-changes recovered from registration are used to simultaneously update the model and track the subject. We tested our approach on a real-time 6-DOF head tracking task using stereo cameras and observed an RMS error within the accuracy limit of an attached inertial sensor. During all our experiments, the tracker had bounded drift, could model non-Lambertian reflectance, and could be used to track objects undergoing large motion for a long time.

References

[1] S. Basu, I.A. Essa, and A.P. Pentland. Motion regularization for model-based head tracking. In ICPR96, 1996.

[2] S. Birchfield. Elliptical head tracking using intensity gradients and color histograms. In CVPR, pages 232-237, 1998.

[3] A. Chiuso, P. Favaro, H. Jin, and S. Soatto. Structure from motion causally integrated over time. PAMI, 24(4):523-535, April 2002.

[4] T.F. Cootes, K.N. Walker, and C.J. Taylor. View-based active appearance models. In 4th International Conference on Automatic Face and Gesture Recognition, pages 227-232, 2000.

[5] T.F. Cootes, G.J. Edwards, and C.J. Taylor. Active appearance models. PAMI, 23(6):681-684, June 2001.



Figure 4: Comparison of the head pose estimation from our adaptive view-based approach with the measurements from the Inertia Cube2 sensor (rotation around the X, Y and Z axes, in degrees).

Figure 5: 6-DOF puppet tracking using the adaptive view-based appearance model.

[6] D. DeCarlo and D. Metaxas. Adjusting shape parameters using model-based optical flow residuals. PAMI, 24(6):814-823, June 2002.

[7] Videre Design. MEGA-D Megapixel Digital Stereo Head. http://www.ai.sri.com/~konolige/svs/, 2000.

[8] G.D. Hager and P.N. Belhumeur. Efficient region tracking with parametric models of geometry and illumination. PAMI, 20(10):1025-1039, October 1998.

[9] M. Harville, A. Rahimi, T. Darrell, G.G. Gordon, and J. Woodfill. 3D pose tracking with linear depth and brightness constraints. In ICCV99, pages 206-213, 1999.

[10] P.J. Huber. Robust Statistics. Addison-Wesley, New York, 1981.

[11] InterSense Inc. Inertia Cube2 Manual. http://www.intersense.com.

[12] T. Jebara and A. Pentland. Parametrized structure from motion for 3D adaptive feedback tracking of faces. In CVPR, 1997.

[13] M. LaCascia, S. Sclaroff, and V. Athitsos. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. PAMI, 22(4):322-336, April 2000.

[14] Philip F. McLauchlan. A batch/recursive algorithm for 3D scene reconstruction. Conf. Computer Vision and Pattern Recognition, 2:738-743, 2000.

[15] L.-P. Morency and T. Darrell. Stereo tracking using ICP and normal flow constraint. In Proceedings of International Conference on Pattern Recognition, 2002.

[16] N. Oliver, A. Pentland, and F. Berard. Lafter: Lips and face real time tracker. In Computer Vision and Pattern Recognition, 1997.

[17] A. Pentland, B. Moghaddam, and T. Starner. View-based and modular eigenspaces for face recognition. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, Seattle, WA, June 1994.

[18] A. Rahimi, L.-P. Morency, and T. Darrell. Reducing drift in parametric motion tracking. In ICCV, volume 1, pages 315-322, June 2001.

[19] P. Rowe and A. Kelly. Map construction for mosaic-based vehicle position estimation. In International Conference on Intelligent Autonomous Systems (IAS6), July 2000.

[20] P. Viola and M. Jones. Robust real-time face detection. In ICCV, volume II, page 747, 2001.
