Visual-Model Based Spatial Tracking in the Presence of Occlusions

Hans de Ruiter and Beno Benhabib
Computer Integrated Manufacturing Laboratory,
Department of Mechanical and Industrial Engineering, University of Toronto,
5 King’s College Road, Toronto, Ontario, M5S 3G8, Canada.
{deruiter & beno}@mie.utoronto.ca

Abstract A key criticism of template/visual-model based object trackers is that they often lack robustness to partial occlusions. This results from the global nature of the algorithm, as opposed to operating on local features as in feature-based methods. Nevertheless, visual-model based methods could have a significant advantage, as they can model more complex objects and use more of the available data than their feature-based counterparts.

This paper presents a novel per-pixel occlusion-rejection algorithm for a spatial (3D) visual-model based object tracker. It detects occlusions based on violations of the brightness-constancy assumption of the optical-flow algorithm, which is an essential part of the tracker. Experimental results have clearly shown that using such an occlusion-rejection algorithm facilitates accurate tracking in the presence of partial occlusions that typically cause the object tracker to fail. This is achieved with no noticeable loss of performance; the frame-rate at which the object-tracker operates remains the same.

1. Introduction Model-based object-tracking could potentially be used for real-time robotic control. Such object trackers have a 3D model of the target object, enabling spatial tracking of the full 6-dof (degree-of-freedom) pose (position and orientation). Knowing the full 6-dof pose could, for example, enable mobile robots to interact with their environment more effectively.

Computer-vision object trackers can be divided broadly into two categories: feature-based and template/visual-model based. Feature-based methods track an object by tracking a group of local features such as corners (e.g., [1]-[6]). Template/visual-model based techniques, on the other hand, track by matching a complete template/model to the input image (e.g., [7] and [8]). Whilst feature-based techniques are typically faster, template/visual-model based methods can model the target object more completely. Hence, they use more of the available data and cope with more complex objects.

Despite their potential advantages, visual-model based techniques have a number of barriers to being practical. A common criticism of template/visual-model based object trackers is the general lack of robustness to occlusions. Coping with occlusions is difficult for these algorithms because they operate globally on the entire model. Thus, proponents of feature-based tracking algorithms point to this problem as a reason to use feature-based techniques instead, which can often cope with partial occlusions (e.g., [2], [3], [6], and [9]-[13]). Recently, several techniques have been developed for occlusion-robust template-based tracking (e.g., [14]-[17]). However, these are for 2D tracking, rendering them unsuitable for many robotic tasks.

A few 6-dof visual-model based trackers have been reported in the literature that can cope with partial occlusions. Mittrapiyanuruk et al. [18] presented a system that uses robust M-estimation to achieve occlusion robustness. Liebelt and Schertler [19] take a different tack; they use swarming particles, an optimization technique. Unfortunately, neither is real-time. For example, despite offloading processing to a GPU (Graphics Processing Unit), Liebelt and Schertler’s method [19] operates at 2 seconds per frame (0.5 fps). For object-tracking to be used in robotic control, it has to operate at real-time rates; 10 fps can be considered a minimum tracking rate for obtaining reasonable responsiveness, depending on the application.

This paper presents a novel real-time per-pixel occlusion-rejection algorithm that builds on our existing 3D visual-model based object tracker ([20] and [21]). It stems from the simple yet powerful observation that the target object and occluding objects are typically visually different. This appearance difference can be used to reject occluded regions on a per-pixel basis, resulting in a real-time, occlusion-robust 6-dof tracking algorithm.

With the proposed occlusion-rejection algorithm in place, the advantages of a visual 3D model can be exploited. Experiments have shown that this algorithm provides high accuracy in situations where the original tracker failed due to partial occlusions. This is achieved with no noticeable loss of speed; 40-70 fps tracking was achieved, which is identical to the original algorithm’s performance.

2. Background – Object Tracker Overview Prior to delving into the actual occlusion-rejection algorithm, it would be beneficial to review the basic operation of the spatial object tracker. The object tracker first projects a visual 3D model of the target onto the image plane at its current predicted pose. Next, optical-flow is used to estimate the disparity (i.e., the “motion”) between the predicted and actual poses. This disparity is used to correct the predicted pose, resulting in the pose estimate for the current frame.

Real-time performance is achieved by using 3D graphics hardware via OpenGL [22] to perform projection and other image-processing operations. This relieves the main processor from this task and, hence, significantly improves the performance of the system. The algorithm also exploits the depth-map and mask produced by OpenGL for motion calculation and segmentation, respectively. The depth-map provides the depth information required to calculate 6-dof motion from one camera whilst the mask segments the target object from the background.
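The paper relies on the OpenGL depth-map for per-pixel depth but does not spell out the conversion to metric depth. As an illustration only, assuming a standard perspective projection and the default depth range of [0, 1], a depth-buffer value could be converted to eye-space depth as follows (the function name and variables are ours, not the original implementation’s):

def depth_buffer_to_eye_z(d, z_near, z_far):
    # d: value(s) in [0, 1] read back from the OpenGL depth-buffer.
    # Assumes a standard perspective projection and glDepthRange(0, 1).
    z_ndc = 2.0 * d - 1.0  # map to normalized device coordinates
    return (2.0 * z_near * z_far) / (z_far + z_near - z_ndc * (z_far - z_near))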

The basic 2D optical-flow equation is:

$\begin{bmatrix} \nabla_x I & \nabla_y I & \nabla_t I \end{bmatrix} \begin{bmatrix} v'_x \\ v'_y \\ 1 \end{bmatrix} = 0$, (1)

where $\nabla_x$ and $\nabla_y$ are the $x$ and $y$ derivatives, respectively, and $\nabla_t$ is the time derivative. $I$ represents $I(x, y, t)$, which is the intensity of pixel $[x \; y]^T$ at time $t$. Solving (1) for a set of pixels yields the mean 2D velocity $[v'_x \; v'_y]^T$ for that set of pixels.
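To make the use of (1) concrete, the following minimal NumPy sketch solves it in a least-squares sense over a set of pixels. The array names are illustrative and are not taken from the actual implementation:

import numpy as np

def mean_2d_velocity(Ix, Iy, It):
    # Ix, Iy, It: 1D arrays of spatial and temporal derivatives, one entry
    # per pixel used in the motion calculation.
    # Each pixel contributes one constraint: Ix*vx + Iy*vy = -It.
    A = np.column_stack((Ix, Iy))
    b = -np.asarray(It)
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v  # mean 2D velocity [v'_x, v'_y]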

Next, the 2D motion in (1) needs to be related to 3D motion. 2D image coordinates are related to 3D coordinates via the (pinhole) camera projection equation:

$\mathbf{p}' = M_{int} M_{ext}\,\mathbf{p}$, with $\mathbf{p}' = \begin{bmatrix} p'_x \\ p'_y \\ p'_z \end{bmatrix}$ and $\mathbf{p} = \begin{bmatrix} p_x \\ p_y \\ p_z \\ 1 \end{bmatrix}$, (2)

where $\mathbf{p}'$ is a (3×1) vector denoting the projected 2D position of $\mathbf{p}$ in homogeneous coordinates.¹ The (3×4) matrix $M_{int}$ is the camera’s internal parameters (focal-length, scaling-factors, etc.). $M_{ext}$ is a (4×4) matrix that transforms 3D coordinates between the world reference-frame and the camera’s reference-frame. The $x$ and $y$ coordinates for a point can be found by normalizing $\mathbf{p}'$ such that $z$ is equal to one (i.e., dividing $\mathbf{p}'$ by $p'_z$). Differentiating (2) with respect to time gives the desired relationship between 2D and 3D motion:

$\mathbf{v}' = \dfrac{d}{dt}\left(\dfrac{1}{p'_z}\,\mathbf{p}'\right) = \dfrac{1}{p'_z}\, M_{int} M_{ext} \left(\mathbf{v} + \boldsymbol{\omega} \times \mathbf{p}\right)$, (3)

where $\mathbf{v}'$ is the projected 2D velocity corresponding to the linear 3D velocity $\mathbf{v}$ and the angular velocity $\boldsymbol{\omega}$.
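As an illustrative sketch of Equations (2) and (3) (our own notation, not the paper’s code), the projection and projected velocity of a single point can be computed as follows; the 3D velocity is appended with a zero fourth component so that the (3×4) matrix $M_{int} M_{ext}$ can be applied to it:

import numpy as np

def project(p, M_int, M_ext):
    # Eq. (2): p' = M_int M_ext p, with p expressed in homogeneous coordinates.
    return M_int @ (M_ext @ np.append(p, 1.0))

def projected_velocity(p, v, w, M_int, M_ext):
    # Eq. (3): projected 2D velocity of a 3D point p moving with linear
    # velocity v and angular velocity w.
    p_proj = project(p, M_int, M_ext)
    d3 = v + np.cross(w, p)          # 3D velocity of the point
    d3_h = np.append(d3, 0.0)        # treated as a direction (zero fourth component)
    return (M_int @ (M_ext @ d3_h)) / p_proj[2]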

Substituting Equation (3) into Equation (1) yields

$\begin{bmatrix} \nabla_x I & \nabla_y I & \nabla_t I \end{bmatrix} \left( \dfrac{1}{p'_z}\, M_{int} M_{ext} \left(\mathbf{v} + \boldsymbol{\omega} \times \mathbf{p}\right) + \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right) = 0$. (4)

This is the optical-flow equation for 6-dof motion.

The tracking algorithm calculates motion between the projected (or virtual) image and the actual input image for those pixels marked as belonging to the target object’s projection. The z-axis value ($p'_z$) is extracted from the depth-map produced by OpenGL as part of the model projection (rendering) operation. Thus, the spatial derivatives and depth-values are all with respect to the virtual image – an important issue for the occlusion-rejection algorithm. More detail can be found in our previous papers, [20] and [21].
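For illustration, a CPU-side sketch of how the per-pixel constraints of Equation (4) could be assembled into a least-squares system for the 6-dof motion is given below. All names are ours, and the actual tracker performs the equivalent steps with GPU assistance:

import numpy as np

def skew(p):
    # Skew-symmetric matrix such that skew(p) @ w == np.cross(p, w).
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def solve_6dof_motion(grads, points, depths, M_int, M_ext):
    # grads  : (N, 3) per-pixel gradients [I_x, I_y, I_t]
    # points : (N, 3) 3D points p corresponding to each pixel (from the depth-map)
    # depths : (N,)   projected z-values p'_z for each pixel
    B = (M_int @ M_ext)[:, :3]       # part of the projection acting on directions
    rows, rhs = [], []
    for g, p, pz in zip(grads, points, depths):
        gB = (g @ B) / pz            # 1x3
        # w x p = -[p]_x w, so the angular part of the row is -gB [p]_x.
        rows.append(np.hstack((gB, -gB @ skew(p))))
        rhs.append(-g[2])            # temporal derivative moved to the right-hand side
    motion, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    return motion[:3], motion[3:]    # linear velocity v, angular velocity w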

1 Homogeneous coordinates are used in projective geometry and are defined up to a scale-factor. Thus, a 2D point is given in homogeneous coordinates as a three-element vector, in which the third element is set to one.

3. Methodology While the tracking algorithm described in Section 2 can cope with background clutter, it cannot cope with occlusions. The mask generated by OpenGL masks off only objects behind the target (e.g., the background), not those in front of it. Some other method of filtering out occluded regions is required.

Pan and Hu [23] suggest that occlusion is a cyclic problem: the occlusion must be obtained before the target can be located, but the occluded part can only be determined after knowing the target location. Whilst this may be true to a certain extent, our motion calculations are referenced to the virtual image, and not the actual input image. Namely, one needs to determine which pixels in the virtual image are occluded in the input image, rather than which parts of the input image are occluded relative to the target object’s actual pose.

The fundamental problem is caused by attempting to calculate optical-flow between the target object’s projection and occluding objects, Figure 2 in Section 5. In essence, the pixels at which this occlusion occurs violate the fundamental assumption of differential optical-flow: brightness-constancy. Brightness constancy assumes that any intensity changes from image to image are due to motion; the total brightness remains constant. Thus, one can conclude that most occluded pixels could be detected by checking if they violate brightness constancy.

Unfortunately, no simple test exists for brightness constancy. Nevertheless, if the local brightness variation is due to motion, and not occlusion, then the image intensity gradients in the virtual and input images ($\nabla I_v$ and $\nabla I_i$, respectively) should be of similar direction and magnitude. If the current pixel is part of a white-to-black edge in one image, but a red-to-black edge in the other, then brightness-constancy is violated, and it is highly unlikely that the pixels belong to the same physical edge.

A series of tests is used herein to discard occluded pixels. As these tests work on colour channels separately, we will denote colour channel $j$ in image $I_i$ as $I_{i,j}$, where $j \in \{R, G, B\}$, the Red, Green and Blue colour channels, respectively.

The first and simplest test is to check whether there is sufficient image intensity gradient in the input image, i.e.,

$\left\| \nabla I_{i,j} \right\| > \tau_{gm}$, (5)

for all colour channels ($\tau_{gm}$ is the minimum gradient magnitude). This is the same test performed by the original algorithm on the virtual image, $I_v$. Essentially, if there is no gradient in $I_v$ and/or $I_i$, then motion cannot be estimated using that pixel; if one has a gradient, but not the other, then they probably belong to different objects (or different parts of the same object).

If a pixel passes the above first test, the next check is to compare the gradient directions. Denoting the angle between the gradients of $I_v$ and $I_i$ by $\angle\left(\nabla I_v, \nabla I_i\right)$, this test is defined as:

$\angle\left(\nabla I_{v,j}, \nabla I_{i,j}\right) < \tau_\theta$, $\quad \forall j : \left\| \nabla I_{i,j} \right\| > \tau_{gm} \wedge \left\| \nabla I_{v,j} \right\| > \tau_{gm}$. (6)

Equation (6) implies that the difference between gradient directions should be less than the threshold $\tau_\theta$ for all colour channels in which the gradient magnitudes in both images (i.e., both $I_v$ and $I_i$) are above threshold $\tau_{gm}$. $\tau_\theta$ should be large enough to allow for object rotation, but not so large that optical-flow could produce large errors. The angle can only be measured in colour channels in which there is a gradient; ignoring all other channels prevents random angles, due to noise, from causing the algorithm to reject valid pixels.

The final test is to verify whether the gradient magnitudes are similar. Gradient-based optical-flow assumes local linearity. Hence, the gradient magnitudes should be similar, but not necessarily equal. This results in our final test:

$\left| \left\| \nabla I_v \right\| - \left\| \nabla I_i \right\| \right| < \tau_{md}$, (7)

where $\tau_{md}$ is the image gradient magnitude-difference threshold. $\tau_{md}$ should be large enough to allow for colour variations due to shading effects, or camera properties.

If a pixel passes all the above tests, then, it is passed on to the optical-flow solver; otherwise, it is rejected. As a result, occluded regions would be rejected along with any other region that violates the brightness-constancy assumption.
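A minimal NumPy sketch of these three tests, operating on per-channel image gradients, is given below. The array layout and function name are illustrative; in particular, Equation (7) is applied per colour channel here, which is one possible reading of the magnitude-difference test:

import numpy as np

def occlusion_mask(grad_v, grad_i, tau_gm, tau_theta, tau_md):
    # grad_v, grad_i: (H, W, 3, 2) arrays holding the (d/dx, d/dy) gradients of
    # the virtual and input images for each of the R, G, B colour channels.
    # Returns a boolean (H, W) mask of pixels accepted for the optical-flow solver.
    mag_v = np.linalg.norm(grad_v, axis=-1)              # (H, W, 3)
    mag_i = np.linalg.norm(grad_i, axis=-1)

    # Eq. (5): sufficient gradient in the input image, for all colour channels.
    test5 = np.all(mag_i > tau_gm, axis=-1)

    # Eq. (6): gradient directions agree in every channel where both magnitudes
    # exceed tau_gm; other channels are ignored.
    both = (mag_v > tau_gm) & (mag_i > tau_gm)
    cos_angle = np.sum(grad_v * grad_i, axis=-1) / (mag_v * mag_i + 1e-12)
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    test6 = np.all(np.where(both, angle < np.deg2rad(tau_theta), True), axis=-1)

    # Eq. (7): gradient magnitudes are similar (applied per channel here).
    test7 = np.all(np.abs(mag_v - mag_i) < tau_md, axis=-1)

    return test5 & test6 & test7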

4. Implementation The sequence of tests described in Section 3 is implemented during the pre-processing stage, before the optical-flow constraints are assembled and solved. This is the same point at which the colour-gradient redundancy algorithm operates in [21]. Like the colour-gradient redundancy algorithm, it is implemented on the GPU for performance. Thus, the overall procedure can be summarized as follows (a structural sketch in code is given after the list):
• Predict the pose in the current time-step,
• Project the target object’s model (i.e., generate the virtual image),
• Perform all filtering operations (e.g., blurring and derivative calculations),
• Perform occlusion rejection as per Equations (5) to (7), and colour-gradient redundancy,
• Calculate the motion between the virtual and input images, and use this to estimate the actual pose of the target, and
• Repeat this procedure for the next time-step.
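The structural sketch below mirrors this procedure. Every callable is a placeholder for the corresponding stage (the real system executes these stages on the GPU), and occlusion_mask refers to the sketch given in Section 3:

def track_frame(frame, predict_pose, render_model, compute_gradients,
                solve_motion, correct_pose, tau_gm, tau_theta, tau_md):
    # One iteration of the tracking loop; all callables are placeholders.
    pose_pred = predict_pose()                              # predict the pose
    virtual, depth_map, mask = render_model(pose_pred)      # project the 3D model
    grad_v, grad_i = compute_gradients(virtual, frame)      # blurring + derivatives
    valid = mask & occlusion_mask(grad_v, grad_i, tau_gm, tau_theta, tau_md)
    v, w = solve_motion(grad_v, grad_i, depth_map, valid)   # Eq. (4)
    return correct_pose(pose_pred, v, w)                    # pose estimate for this frame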

5. Experiments The occlusion-rejection algorithm was tested with a number of motion sequences in which the target object is partially occluded. The results given below are based on a single sequence of images containing a cardboard box as the target. The target object was moved along a trajectory that includes sharp changes in direction. Sequences containing partial occlusions were generated by compositing other objects over the original sequence. This provides a reference (control) sequence and a range of different types of occluding objects with identical target motion.

The camera had a viewing angle of about 41° and a resolution of 512×256. Throughout the sequences, the target was kept approximately 600 to 1200 mm from the camera. The thresholds $\tau_{gm} = 4.98 \times 10^{-6}$, $\tau_\theta = 8.25°$, and $\tau_{md} = 2.39 \times 10^{-3}$ were used.

Figure 1 shows a sequence in which the target box is occluded by a vertical white bar. Pixels used in optical-flow motion calculations are marked in green (or gray in non-colour copies) for both the basic tracking algorithm, Figure 2(a), and with the occlusion-rejection algorithm, Figure 2(b). It can be noted in Figure 2(a) that the original algorithm incorrectly uses pixels within the white bar as part of motion calculations, resulting in tracking failure, Figure 3. Using the occlusion-rejection algorithm, motion calculations no longer include the white bar, Figure 2(b), resulting in successful tracking, Figure 4, with errors below ±10 mm positionally and 2.5° orientationally.

In another test sequence, the occluding object was a cube with letters on the side, Figure 5. Once again, the original algorithm fails to track in the presence of partial occlusion, Figure 6, but the new algorithm tracks it accurately until close to the end of the sequence, Figure 7. At the end of the sequence, one of the cube’s edges is similar enough to the target that it is not perfectly removed by the occlusion-rejection algorithm, causing a deviation. However, it should also be noted that the target object is mostly occluded at this point, leaving a small number of pixels available for motion calculations. Thus, a small number of erroneous pixels have a larger impact on the motion estimate.

For reference, the occlusion-rejection algorithm was also tested on the original, unoccluded sequence, Figure 8. Close inspection of Figure 8 shows that the occlusion-rejection algorithm has removed pixels in regions with unreliable data. As a result, tracking errors have dropped from ±8 mm to ±6 mm positionally, and from 3.5° to 2.5° orientationally.

Adding the occlusion-rejection algorithm did not noticeably lower the tracking speed. Both the original algorithm and the new one operated at 40-70 fps on a 2.0 GHz AMD Athlon 64 3000+ running Windows XP with a Radeon X800 Platinum Edition GPU.

Figure 1: A target object with a white bar causing partial occlusion.

Figure 2: The original tracking algorithm (a) incorrectly uses pixels belonging to the white bar for tracking; the occlusion-rejection algorithm (b) filters out the white bar.

Figure 3: The positional (a) and orientational (b) tracking errors for the sequence in Figure 2, when tracking without the occlusion-rejection algorithm (positional error in mm and orientational error in degrees, per axis, versus time in frames).

Figure 4: The positional (a) and orientational (b) tracking errors for the sequence in Figure 2, when tracking with the occlusion-rejection algorithm (same axes as Figure 3).

Figure 5: The original tracking algorithm (a) incorrectly uses pixels belonging to the cube for tracking; the occlusion-rejection algorithm (b) filters out the cube.

Figure 6: The positional (a) and orientational (b) tracking errors for the sequence in Figure 5, when tracking without the occlusion-rejection algorithm (same axes as Figure 3).

Figure 7: The positional (a) and orientational (b) tracking errors for the sequence in Figure 5, when tracking with the occlusion-rejection algorithm (same axes as Figure 3).

Figure 8: Unreliable pixels used by the original algorithm (a) are filtered out by the occlusion-rejection algorithm (b).

6. Discussion The proposed occlusion-rejection algorithm not only enables tracking under partial occlusions, but it also aids overall robustness. The algorithm rejects any pixel that violates brightness-constancy, regardless of whether the violation is caused by an occlusion or not. For example, if the target’s predicted pose is significantly off, parts of the projected image may cover background regions instead of the target object. Previously, optical-flow would have been calculated for these regions too, even though they come from different objects. The current algorithm filters out those regions as well. This was demonstrated by the final experiment, in which no occlusions were present, but the occlusion-rejection algorithm improved tracking accuracy.

As with any algorithm, there exist some limitations. Objects with edge characteristics or surface features similar to the target’s could pass the occlusion-rejection tests and cause a corresponding decrease in accuracy. Fortunately, this algorithm does not work to the exclusion of other techniques. It serves to rapidly remove many occluded, or otherwise invalid, pixels. Other algorithms could operate on the remaining pixels to improve robustness further. For example, robust estimation methods could be used to filter out outliers. Up to now, iterative methods have been avoided due to their computational expense and the need to maintain real-time performance. However, with advances in computational power and suitable acceleration structures (e.g., a block-based technique), this may become a viable option.

7. Conclusions This paper has presented a novel occlusion-rejection algorithm for visual-model-based object tracking that is suitable for any algorithm that uses gradient-based optical-flow. It is based on discarding pixels that violate brightness-constancy. By discarding these pixels, only pixels belonging to the target object are used in motion calculations, resulting in occlusion-robust tracking.

Experimental results clearly showed that adding the occlusion-rejection algorithm results in accurate tracking in situations that previously caused tracking failure. It even increased tracking accuracy in situations with no occlusions. The algorithm results in no noticeable slowdown; the original algorithm’s 40-70 fps was maintained.

The algorithm may have trouble with objects that have similar surface features to the target object, particularly if the target is almost completely occluded. Nevertheless, it is highly effective in most cases, and its low computational overhead makes it a worthy addition to any object tracker.

References [1] E. Marchand, P. Bouthemy, and F. Chaumette, “A 2d-3d model-based approach to real-time visual tracking,” Image and Vision Computing, vol. 19, no. 7, pp. 941–955, November 2001.

[2] T. Drummond and R. Cipolla, “Real-time visual tracking of complex scenes,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 932–946, July 2002.

[3] A. Comport, E. Marchand, and F. Chaumette, “A real-time tracker for markerless augmented reality,” IEEE and ACM Int. Symposium on Mixed and Augmented Reality, Tokyo, Japan, October 2003, pp. 36–45.


[4] S. Kim and I. Kweon, “Robust model-based 3d object recognition by combining feature matching with tracking,” Int. Conf. on Robotics and Automation, vol. 2, Taipei, Taiwan, September 2003, pp. 2123–2128.

[5] V. Kyrki and D. Kragic, “Integration of model-based and model-free cues for visual object tracking in 3d,” Int. Conf. on Robotics and Automation, Barcelona, Spain, April 2005, pp. 1554–1560.

[6] M. Vincze, M. Schlemmer, P. Gemeiner, and M. Ayromlou, “Vision for robotics: A tool for model-based object tracking,” IEEE Robotics and Automation Magazine, vol. 12, no. 4, pp. 53–64, December 2005.

[7] F. Jurie and M. Dhome, “Real time robust template matching,” 13th British Machine Vision Conf., Cardiff, Wales, 2002, pp. 123–132.

[8] M. La Cascia, S. Sclaroff, and V. Athitsos, “Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3d models,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 4, pp. 322–336, April 2000.

[9] E. Loutas, K. Diamantaras, and I. Pitas, “Occlusion resistant object tracking,” IEEE Int. Conf. on Image Processing, vol. 2, Thessaloniki, Greece, 2001, pp. 65–68.

[10] S. Kamijo, Y. Matsushita, K. Ikeuchi, and M. Sakauchi, “Occlusion robust tracking utilizing spatio-temporal markov random field model,” 15th IEEE Int. Conf. on Pattern Recognition, vol. 1, Barcelona, Spain, September 2000, pp. 140–144.

[11] C. Gentile, O. Camps, and M. Sznaier, “Segmentation for robust tracking in the presence of severe occlusion,” IEEE Trans. on Image Processing, vol. 13, no. 2, pp. 166–178, February 2004.

[12] Y. Wu, G. Hua, and T. Yu, “Switching observation models for contour tracking in clutter,” IEEE Conf. on Computer Vision and Pattern Recognition, Madison, WI, June 2003, pp. 295–302.

[13] W. Chang, C. Chen, and Y. Hung, “Discriminative descriptor-based observation model for visual tracking,” Int. Conf. on Pattern Recognition, vol. 3, Hong Kong, September 2006, pp. 83–86.

[14] A. Jepson, D. Fleet, and T. El-Maraghi, “Robust online appearance models for visual tracking,” IEEE Conf. on Computer Vision and Pattern Recognition, vol. I, Kauai, HI, 2001, pp. 415–422.

[15] H. Nguyen and A. Smeulders, “Fast occluded object tracking by a robust appearance filter,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, no. 8, pp. 1099–1104, August 2004.

[16] L. Xu and P. Puig, “A hybrid blob- and appearance-based framework for multi-object tracking through complex occlusions,” Proc. 2nd Joint IEEE Int. Workshop on VS-PETS, Beijing, October 2005, pp. 73–88.

[17] L. Latecki and R. Miezianko, “Object tracking with dynamic template update and occlusion detection,” Int. Conf. on Pattern Recognition, vol. 1, Hong Kong, September 2006, pp. 556–560.

[18] P. Mittrapiyanuruk, G. DeSouza, and A. Kak, “Accurate 3d tracking of rigid objects with occlusion using active appearance models,” IEEE Workshop on Motion and Video Computing, vol. 2, Breckenridge, CO, January 2005, pp. 90–95.

[19] J. Liebelt and K. Schertler, “Precise registration of 3d models to images by swarming particles,” Conf. on Computer Vision and Pattern Recognition, Minneapolis, MN, June 2007, pp. 1–8.

[20] H. de Ruiter and B. Benhabib, “Tracking of rigid bodies for autonomous surveillance,” IEEE Int. Conf. on Mechatronics and Automation, vol. 2, Niagara Falls, Canada, July 2005, pp. 928–933.

[21] H. de Ruiter and B. Benhabib, “Colour-gradient redundancy for real-time spatial pose tracking in autonomous robot navigation,” Canadian Conf. on Computer and Robotic Vision, Québec City, Canada, June 2006, pp. 20–28.

[22] D. Shreiner, Ed., OpenGL Reference Manual, 4th ed. Boston, MA: Addison-Wesley, 2004.

[23] J. Pan and B. Hu, “Robust occlusion handling in object tracking,” IEEE Conf. on Computer Vision and Pattern Recognition, Minneapolis, MN, June 2007, pp. 1–8.
