Motion and Tracking

Eng-Jon Ong, University of Surrey

Introduction

What objects have been tracked? Many kinds of objects have been tracked in the past:

Whole objects: cars, bicycles, human bodies.

Source: YouTube: Intelligent Traffic Surveillance

Medium-level features: heads, hands, small objects, etc.

Fine-level features: facial feature points, finger positions, etc.

Overview

The task of visual tracking involves locating the position of a tracked target by a combination of features and motion models.

There is a strong relationship between the task of object detection and tracking.

Visual model + Detector

Motion Model

One can think of tracking as detection constrained by a motion model. Detection over the whole image tends to be expensive, so the motion model is used to restrict where the detector has to search.

Overview

– Introduction
– Object models
– Simple search strategies
– Using linear dynamics
– Optimisation search strategies
– Summary

Object Models and Evaluation

Representation of Tracked Objects

The first question: how do we computationally represent an object we want to track?

– Image template
– Combination of low-level information (e.g. lines)
– Contour information

Evaluation of model “fitness”

We need a measure of model fitness on an image given a set of parameters (e.g. position + scale).

For images, we have template matching using different scores. The most basic are the sum of squared pixel differences (SSD) and normalised cross-correlation (NCC).
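As a minimal sketch of these two template-matching scores, the following scores every position of a small synthetic image. The function names and the random test image are assumptions made for illustration; they are not taken from the lecture's demos.

```python
import numpy as np

def ssd(patch, template):
    """Sum of squared pixel differences: 0 is a perfect match, lower is better."""
    diff = patch.astype(float) - template.astype(float)
    return float(np.sum(diff ** 2))

def ncc(patch, template):
    """Normalised cross-correlation in [-1, 1]: 1 is a perfect match."""
    p = patch.astype(float) - patch.mean()
    t = template.astype(float) - template.mean()
    denom = np.sqrt(np.sum(p ** 2) * np.sum(t ** 2))
    return float(np.sum(p * t) / denom) if denom > 0 else 0.0

def match_template(image, template, score=ssd, maximise=False):
    """Exhaustively score every position and return the best (row, col) and score."""
    th, tw = template.shape
    best_pos, best_val = None, None
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            val = score(image[r:r + th, c:c + tw], template)
            if best_val is None or (val > best_val if maximise else val < best_val):
                best_pos, best_val = (r, c), val
    return best_pos, best_val

rng = np.random.default_rng(0)
image = rng.random((40, 50))               # hypothetical test image
template = image[12:20, 30:38].copy()      # template cut out at row 12, column 30
brighter = 2.0 * image + 0.5               # same scene after a brightness/contrast change

print(match_template(image, template, ssd))
print(match_template(brighter, template, ncc, maximise=True))
```

SSD is minimised while NCC is maximised. NCC subtracts the mean and normalises by the patch energy, so it still finds the template after the brightness change, where raw SSD would no longer be zero at the true position.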

Evaluation of model “fitness”

There are more sophisticated methods for matching a template to an image:

– Boosted detectors are a popular choice.
– Boosting is a method that combines a set of very simple object detectors to yield a strong detector.

Boosted Cascade

[Figure: a cascade of layers 1 to n. Each layer rejects about 90% of the candidates it receives and passes about 10% on to the next layer; candidates that pass layer n are reported as "Face detected".]

Boosted Cascade

Layer 1: 2 classifiers
Layer 2: 5 classifiers
Layer 4: 20 classifiers
Layer 5: 50 classifiers
Layer 6: 50 classifiers
Layer 7: 128 classifiers
Layer 8: 132 classifiers
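The early-rejection idea behind a cascade can be sketched as follows. This is only an illustration under stated assumptions: the weak classifiers are hypothetical threshold tests on a list of feature values, and the weights and thresholds are made up rather than learned by boosting.

```python
def stump(feature_index, threshold):
    """Hypothetical weak classifier: votes 1 if one feature exceeds a threshold."""
    def classify(window):
        return 1.0 if window[feature_index] > threshold else 0.0
    return classify

def run_cascade(window, layers):
    """Evaluate layers in order and stop at the first one that rejects the window.

    Returns (accepted, number_of_layers_passed).
    """
    for layer_index, (classifiers, layer_threshold) in enumerate(layers):
        # Each layer is a weighted vote of its weak classifiers.
        score = sum(weight * classify(window) for weight, classify in classifiers)
        if score < layer_threshold:
            return False, layer_index
    return True, len(layers)

# Two small layers: cheap layers come first so most windows are rejected early.
layers = [
    ([(1.0, stump(0, 0.5))], 0.5),
    ([(0.6, stump(1, 0.2)), (0.4, stump(2, 0.7))], 0.5),
]

print(run_cascade([0.9, 0.3, 0.1], layers))   # passes both layers
print(run_cascade([0.1, 0.9, 0.9], layers))   # rejected by the first layer
```

Because most windows are rejected by the first, smallest layers, the larger later layers only run on a small fraction of the image.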


Detecting and Tracking Humans in Images

Constrained Detection: Simple Search Strategies

Simple Tracking Strategies

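Detection/Global Search

Goal: where to place the contour on the image?

[Figure: a contour with control points (x1,y1) to (x4,y4) and unit normals n1 to n4. The image intensity I and its derivative dI/dn are sampled along each normal n.]

Contours and Costs

– Search along the contour normals for edges
– Move the contour in x, y, scale and rotation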

Evaluation of model “fitness”

For lines and contours, we can use distances to the nearest edges. However, different configurations of the contour search can give different results.

Run demos: 3tracescanline.exe, 4tracescanlinelong.exe


Global Search

– If the parameter space of the search is low in dimensionality, then a simple global search of the image is sufficient
– Not practical for most applications

Detecting and Tracking Humans in Images

We can track using global search alone if the detectors are fast enough.

Iterative Tracking

Most tracking schemes work on the assumption that an object will make small, incremental movements between frames.

Using this assumption, only a local search is required to update the model parameters.

Tracking is typically posed as a two-step process:

– Initialisation (Global/Detection)
– Iteration (Local)

Iterative Tracking Example 1

– Assume the initial position is known
– Assume the object won't move far
– Search locally to find the movement that maximises some fitness function

Again, this:

– requires good initialisation
– relies on small inter-frame movements
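The local search step can be sketched with a template and an SSD cost. The synthetic frames, the search radius and the function names below are assumptions chosen for illustration, not the lecture's contour demos.

```python
import numpy as np

def local_search(frame, template, prev_pos, radius=3):
    """Return the best SSD match within `radius` pixels of the previous position."""
    th, tw = template.shape
    best_pos, best_cost = prev_pos, np.inf
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = prev_pos[0] + dr, prev_pos[1] + dc
            if r < 0 or c < 0 or r + th > frame.shape[0] or c + tw > frame.shape[1]:
                continue  # candidate window falls outside the image
            cost = np.sum((frame[r:r + th, c:c + tw] - template) ** 2)
            if cost < best_cost:
                best_pos, best_cost = (r, c), cost
    return best_pos

rng = np.random.default_rng(0)
background = rng.random((60, 60))
target = rng.random((8, 8)) + 1.0      # textured patch, brighter than the background

def make_frame(pos):
    """Hypothetical frame: the target pasted onto a fixed background at `pos`."""
    frame = background.copy()
    frame[pos[0]:pos[0] + 8, pos[1]:pos[1] + 8] = target
    return frame

# Small inter-frame movements: the tracker follows the target.
path = [(20, 20), (22, 21), (24, 22), (25, 24)]
estimate = path[0]                      # initial position assumed known
for true_pos in path[1:]:
    estimate = local_search(make_frame(true_pos), target, estimate)
print(estimate)

# A jump larger than the search radius: the target is outside the local window.
lost = local_search(make_frame((40, 40)), target, (20, 20))
print(lost)
```

The second call illustrates the two requirements listed above: once the target moves further than the search radius between frames, the local search can only return some background position.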


Example of contour tracking failing due to indistinct edges

A better example of tracking but highly susceptible to initialisation

Increasing the local search provides better initialisation but decreases tracking performance

1BadContour.exe

2BetterContour.exe

4TraceScanLineLong.exe

Constrained Detection: Optimisation Search Strategies

Tracking as an Optimisation Problem

Tracking can be thought of as an optimisation where some cost function represents how well a model fits an image.

Model fitting is done by attempting to find the model parameters that minimise (or maximise) this cost function.

This can be done at each frame to track objects through a video sequence

Using Gradient Descent

The previous approaches, which iteratively refine a model using a local search, are effectively a gradient descent optimisation.

This will only work if the initial pose of the model is very close to the ideal position, as energy surfaces typically have many local minima.

[Figure: cost plotted against a single parameter.]

Energy surfaces are typically very complex and impossible to visualise due to their high dimensionality.

In the figure there is one global minimum but many local minima that are almost as good.

Unless our model is very close to the ideal location, a gradient descent approach will converge on a local minimum and get trapped.

We've already seen this in action on the contour tracker.
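The trapping behaviour can be reproduced on a one-parameter toy cost. The cost function below is an assumption invented for this sketch (a shallow bowl with ripples); it is not one of the lecture's energy surfaces.

```python
import math

def cost(v):
    """Hypothetical cost: global minimum at v = 0, local minima near v = +/-2.05, ..."""
    return 0.1 * v ** 2 + 1.0 - math.cos(3.0 * v)

def gradient(v):
    """Analytic derivative of the cost above."""
    return 0.2 * v + 3.0 * math.sin(3.0 * v)

def gradient_descent(v, step=0.05, iterations=500):
    """Plain fixed-step gradient descent from the starting value v."""
    for _ in range(iterations):
        v -= step * gradient(v)
    return v

near = gradient_descent(0.4)   # initialised close to the ideal position
far = gradient_descent(2.5)    # initialised in a neighbouring basin

print(near, cost(near))
print(far, cost(far))
```

Both runs converge, but only the run that starts near the ideal position reaches the global minimum; the other settles in the nearest local minimum with a higher cost.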

Choosing a cost function

Returning to the contour example, let's formulate a cost function as the Euclidean distance between a model and the strongest features in the image.

We can visualise the cost surface across a single parameter

Notice the surface has a global minimum but it is not distinct

3TraceScanLine.exe


We can do the same after increasing the local search (by extending our search along normals) to see how this affects the cost surface

Note it makes the minima more distinct but this image has no background clutter. Additional clutter would result in further complicating the surface

4TraceScanLineLong.exe


Let's choose a different cost function.

This time we will take the edge strength supporting the model pose

Notice the surface has inverted and we now seek to find the maximum

It has a very clear maximum which corresponds to the global solution which SHOULD be easy to find!!!

5cost2TraceScanLine.exe

Lucas-Kanade Tracking

Remember Gradient Descent

[Figure: cost plotted against a single parameter.]

If we know more about the surface we can speed things up:

– If we assume the cost surface is a parabola, then given a position, a gradient and the curvature we can move to the minimum in one step
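The one-step claim for a parabola can be checked numerically. The two test functions below are assumptions picked for this sketch: an exact parabola, and cosh(v), which is bowl-shaped but not a parabola.

```python
import math

def newton_step(v, first_derivative, second_derivative):
    """Jump to the minimum of the parabola fitted at v."""
    return v - first_derivative / second_derivative

# Exact parabola f(v) = 2 (v - 3)^2 + 1, so f'(v) = 4 (v - 3) and f''(v) = 4.
v0 = 10.0
v1 = newton_step(v0, 4.0 * (v0 - 3.0), 4.0)
print(v1)   # the minimum at v = 3 is reached in a single step

# Non-parabolic cost f(v) = cosh(v): f' = sinh(v), f'' = cosh(v), minimum at v = 0.
v = 1.0
history = []
for _ in range(5):
    v = newton_step(v, math.sinh(v), math.cosh(v))
    history.append(v)
print(history)   # needs a few steps, but converges very quickly
```

When the surface is only approximately parabolic, the same update is simply repeated, which is the Newton-Raphson iteration.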


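Newton-Raphson convergence

$$ v_{n+1} = v_n - \frac{f'(v_n)}{f''(v_n)} $$

Here $f'$ is the Jacobian and $f''$ is the Hessian.

Two differences in Lucas-Kanade (LK):

• LK uses the sum of squared differences (SSD) across the entire image.
• v is a multi-dimensional warp parameter.

[Figure: f(v) plotted against v.]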


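The SSD cost between the warped image $I(w(x, v))$ and the template $T(x)$, summed over pixels $x$:

$$ d_{ssd}(v) = \sum_x \left[ I(w(x, v)) - T(x) \right]^2 $$

Jacobian:

$$ \frac{\partial d_{ssd}}{\partial v} = 2 \sum_x \left[ I(w(x, v)) - T(x) \right] \frac{\partial I(w)}{\partial v} $$

Hessian:

$$ \frac{\partial^2 d_{ssd}}{\partial v^2} = 2 \sum_x \left( \frac{\partial I(w)}{\partial v} \right)^T \frac{\partial I(w)}{\partial v} + O(\partial^2) $$

Here $\partial I(w) / \partial v$ is built from the image gradients $\partial I(w) / \partial x$ and $\partial I(w) / \partial y$, and the $O(\partial^2)$ term collects the second-derivative terms.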
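A one-dimensional, translation-only version of this update can be sketched as follows. Everything concrete here is an assumption made for illustration: the "image" is an analytic Gaussian bump, the warp is w(x, v) = x + v, and the second-derivative term of the Hessian is dropped (the Gauss-Newton approximation).

```python
import numpy as np

def signal(x):
    """Hypothetical smooth 1-D 'image': a Gaussian bump centred at x = 50."""
    return np.exp(-((x - 50.0) ** 2) / (2.0 * 8.0 ** 2))

def signal_gradient(x):
    """Analytic derivative of `signal`."""
    return -(x - 50.0) / 8.0 ** 2 * signal(x)

true_shift = 3.7

def image(points):
    """Current frame: the template content moved by `true_shift`."""
    return signal(points - true_shift)

def image_gradient(points):
    return signal_gradient(points - true_shift)

x = np.arange(30.0, 71.0)   # template pixel coordinates
T = signal(x)               # template T(x)

v = 0.0                     # warp parameter, initialised at the previous position
for _ in range(20):
    residual = image(x + v) - T          # I(w(x, v)) - T(x)
    J = image_gradient(x + v)            # dI(w)/dv for a pure translation
    jacobian = 2.0 * np.sum(residual * J)
    hessian = 2.0 * np.sum(J * J)        # second-derivative term dropped
    v -= jacobian / hessian              # Newton-Raphson style update

print(v)   # converges towards the true shift
```

The starting offset here is small compared with the width of the bump, which matches the assumption that the initial estimate is already close to the solution.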



Youtube: vision: optical flow detection

Mean-shift

We can look for local maxima in object detector outputs using mean-shift.

Example of simple mean-shift tracking: the object “detector” is the distance to an RGB histogram.

Youtube: Mean shift tracking of red bal, normalised RGB and 64 bin histogram
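A minimal mean-shift sketch on a detector response map is shown below. The response map is a synthetic Gaussian peak and the flat circular window is one possible kernel choice; both are assumptions for illustration rather than the histogram-based detector used in the video.

```python
import numpy as np

def mean_shift(weights, start, radius=5, max_iter=50, tol=0.01):
    """Move a circular window to the weighted centroid of the responses inside it."""
    rows, cols = np.indices(weights.shape)
    pos = np.array(start, dtype=float)
    for _ in range(max_iter):
        inside = (rows - pos[0]) ** 2 + (cols - pos[1]) ** 2 <= radius ** 2
        w = weights * inside
        total = w.sum()
        if total == 0:
            break  # no detector response inside the window
        new_pos = np.array([(w * rows).sum(), (w * cols).sum()]) / total
        shift = np.linalg.norm(new_pos - pos)
        pos = new_pos
        if shift < tol:
            break
    return pos

# Hypothetical detector output with a single peak at row 25, column 12.
rows, cols = np.indices((40, 40))
response = np.exp(-((rows - 25) ** 2 + (cols - 12) ** 2) / (2 * 4.0 ** 2))

mode = mean_shift(response, start=(20, 16))
print(mode)   # close to the peak of the response map
```

For tracking, the converged position from one frame is used as the starting position in the next frame.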

Regression-based Tracking

Up to now, tracking has been treated as a constrained detection problem: essentially template matching, searching a parameter space to optimise a matching fitness function.

Another approach is to pose tracking as a regression problem: given the template difference, predict the translational offset to the correct position (no explicit search needed!).

Linear Predictors (Robust Facial Feature Tracking using Shape Constrained Multi Resolution Selected Linear Predictors, Ong et al)

[Figure: reference point Y with support pixels a, b and c.]

P = [ Ia – I'a, Ib – I'b, Ic – I'c ]

X = HP

– Reference point + support pixels (a, b, c)
– Linear mapping (H) from support pixel intensity differences (P) to a translation vector (X)
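One way to obtain H is least squares over synthetic displacements, sketched below. This is an illustration under stated assumptions, not the training procedure of the cited paper: the image is a hypothetical analytic function, the support pixels are random, and X is taken as the correction that moves the tracker back to the reference point.

```python
import numpy as np

def intensity(r, c):
    """Hypothetical smooth synthetic image, defined at any real coordinate."""
    return np.sin(r / 15.0) + np.cos(c / 12.0) + 0.5 * np.sin((r + c) / 20.0)

def support_difference(ref, support, displacement):
    """P: reference support intensities minus those seen from the displaced position."""
    pts = ref + support
    moved = pts + displacement
    return intensity(pts[:, 0], pts[:, 1]) - intensity(moved[:, 0], moved[:, 1])

rng = np.random.default_rng(1)
ref = np.array([40.0, 40.0])                      # reference point
support = rng.uniform(-10, 10, size=(30, 2))      # 30 support pixel offsets

# Training set: displace the tracker, record P and the correction X = -displacement.
train_d = rng.uniform(-3, 3, size=(200, 2))
P = np.stack([support_difference(ref, support, d) for d in train_d])   # (200, 30)
X = -train_d                                                           # (200, 2)

# Solve P @ H.T ~= X in the least-squares sense, i.e. X = H P per sample.
H_T, *_ = np.linalg.lstsq(P, X, rcond=1e-6)
H = H_T.T                                                              # (2, 30)

# Prediction for an unseen displacement is a single matrix-vector product.
test_d = np.array([2.0, -1.5])
predicted = H @ support_difference(ref, support, test_d)
print(predicted)   # approximately the correction -test_d
```

At run time no search is performed: the support pixel differences are measured once and multiplied by H.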

Linear Predictor “Bunches”

– Single LPs are not stable enough for tracking image features
– Use a set (“bunch”) of LPs instead
– Final prediction = consensus of the most common predicted translation

Linear Predictors

“Tracking context” is very important.

We only want to use surrounding visual information if it helps the tracking.

[Figure: the point we want to track, and the nearby region whose visual information should be used for tracking it. Other regions have too much variation.]

We can find the tracking context by evaluating the accuracy of trackers using local patches, and gradually removing the bad ones.

Linear Predictors

Cascaded linear predictors:

– LPs trained to overcome large offsets are robust but not accurate
– LPs trained to overcome small offsets are accurate but not robust
– Solution: cascade them. Use big-offset LPs, then pass the results to smaller ones for refinement.

[Figure: errors of a “large” LP predicting from an offset position (blue is medium prediction error).]

[Figure: errors of a “small” LP predicting from an offset position (white is small prediction error).]

Non-Linear Predictors (Non-linear Predictors for Facial feature Tracking, FG2013, Sheerman-Chase et al.)

[Figure: reference point Y with support pixels a, b and c.]

P = [ Ia – I'a, Ib – I'b, Ic – I'c ]

X = H( P )

– Replace the linear mapping with the non-linear mapping of regression trees
– The input is still support pixel differences; the output is still offsets

[Figure: example regression tree with split tests S1 < 0.4 and S50 < 0.1 and leaf predictions dy = 23, dy = 32 and dy = -10.]


Results: more robust tracking, able to handle larger amounts of pose and expression variation.


Allows us to do freaky things like this:

Background to template update problem

– No update: misrepresentation error, catastrophic
– Naïve update: drift error, slow accumulation

[Figure: frames 1 to 5 showing the true feature (old appearance), the true feature (new appearance) and a false feature, with plots of error against time.]

Background template update (Mutual information for Lucas Kanade tracking (MILK): An inverse compositional formulation, Dowson et al, PAMI 08)

Building a Model of Templates

Appearance space

LP SMAT

SMAT

Incorporating Motion Models for Tracking

Temporal Consistency

This sequence shows a surveillance application tracking subjects as they move.

The technique uses a per pixel mixture of Gaussians to model background colour distributions and perform dynamic background subtraction.

Tracking with Motion Models

The task of visual tracking involves locating the position of a tracked target by a combination of features and motion models.

There is a strong relationship between the task of object detection and tracking.



Using Motion

Objects often exhibit consistent motion

Kalman Filter

To exploit this motion consistency, many authors model it with simple dynamics in what is called the Kalman filter.

A Kalman filter is simply an optimal recursive data processing algorithm. It makes predictions based on previous estimates and current observations.

Kalman Filter

Suppose we have some hidden information to recover (i.e. not directly observable) that takes the form of a state vector, e.g. X = [x,y,v]: the position and velocity of a tracked object.

This object has a true state at time t, $X_t$, which we do not know.

But suppose we think this object's dynamics work in a linear fashion: $X_t = F X_{t-1}$.

BUT this may not be exactly the case; it might be slightly off, thus we have

$$ X_t = F X_{t-1} + w_t, \quad w_t \sim N(0, Q) $$

Kalman Filter

Suppose we have some sensors that can provide measurements about the tracked object in the form of a measurement vector: Z = [a,b]

These sensor measurements originate from the hidden state vector X with the form: $Z_t = H X_t$

BUT, in reality this sensor can be imperfect, noisy, etc. We deal with this by saying $Z_t = H X_t + v$, where $v \sim N(0, R)$. R is called the sensor's error covariance.

Kalman Filter

We want to recover some hidden information about a tracked object: X = [x,y,v]

We can predict its movements “blindly” using: $X'_{t|t-1} = F X'_{t-1|t-1}$

But this model is inaccurate in a Gaussian sense: the true state also includes the noise $w_t \sim N(0, Q)$.

We have some sensors that provide observations to indirectly tell us how accurate our predictions are: $Z_t - H X'_{t|t-1}$

BUT we need to take this with a pinch of salt, since our sensors are inaccurate as well ($Z_t$ has Gaussian noise with covariance R).


Kalman Filter

So, the task at hand: how do we best combine our prediction of a tracked object's state with the sensor observations, given that both have Gaussian noise?

That is what a Kalman filter does in an optimal sense (provided your noise IS Gaussian and your dynamics ARE linear):

$$ X'_{t|t} = X'_{t|t-1} + K ( Z_t - H X'_{t|t-1} ) $$

K is called the “Kalman gain”. Essentially, if sensor noise is small and prediction noise is large, K becomes $H^{-1}$, meaning trust the observations. Conversely, if sensor noise is large, K becomes 0, meaning trust the prediction.
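A compact predict/update cycle can be sketched with NumPy. The gain and covariance formulas used here are the standard Kalman filter equations, which the slides do not spell out; the 1-D constant-velocity model, the noise levels and the variable names are assumptions chosen for the example.

```python
import numpy as np

def kalman_gain(P, H, R):
    """K = P H^T (H P H^T + R)^-1 for prediction covariance P and sensor covariance R."""
    S = H @ P @ H.T + R
    return P @ H.T @ np.linalg.inv(S)

def kalman_step(x, P, z, F, H, Q, R):
    """One cycle: predict with the dynamics F, then correct with the measurement z."""
    x_pred = F @ x                      # X'(t|t-1) = F X'(t-1|t-1)
    P_pred = F @ P @ F.T + Q            # prediction uncertainty grows by Q
    K = kalman_gain(P_pred, H, R)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Assumed model: state = [position, velocity], only the position is measured.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 1e-6 * np.eye(2)
R = np.array([[4.0]])                   # sensor noise with standard deviation 2

rng = np.random.default_rng(0)
x_est, P_est = np.zeros(2), 100.0 * np.eye(2)
for t in range(1, 51):
    true_position = 1.0 * t             # object moving at one unit per frame
    z = np.array([true_position + rng.normal(0.0, 2.0)])
    x_est, P_est = kalman_step(x_est, P_est, z, F, H, Q, R)
print(x_est)                            # estimated [position, velocity]

# Limiting behaviour of the gain when H is the identity.
I2 = np.eye(2)
print(kalman_gain(I2, I2, 1e-9 * I2))   # tiny sensor noise: K approaches H^-1
print(kalman_gain(I2, I2, 1e9 * I2))    # huge sensor noise: K approaches 0
```

The velocity is never measured directly, yet the filter recovers it from the sequence of noisy positions through the dynamics model.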

Kalman Filter Operation

From: Kalman filter for dummies

Using a Kalman Filter to Track

How prediction overcomes occlusion issues

Youtube: kalman Filter result on real aircraft & Result of Kalman Filter on a Moving Aircraft







Extended Kalman Filter-EKF

The Kalman filter addresses the problem of estimating dynamics described by linear equations.

Most problems are non-linear.

The EKF attempts to address this by making the state prediction $X_t = F(X_{t-1}) + w$, where F can be any non-linear function.

See www.cs.unc.edu/~welch for introductory tutorials and sample code.

Exploring a parameter space for the global solution

We could try every single model configuration to find the lowest cost solution, but this can be infeasible (640x480x100x360 = 11,059,200,000 configurations).

We could just randomly pick model configurations in the hope that we find a low cost solution but this does not guarantee that we will find it and as the dimensionality and complexity increase so must the number of random samples

These are common problems and hence standard optimisation techniques can be employed– e.g. Simulated Annealing, Genetic Algorithms

7RandomSample.exe

Tracking as an Optimisation Problem

In simulated annealing we try to use a simple heuristic to reduce the number of samples we need to test.

In genetic algorithms we try to guide our random search through observation, again to reduce the complexity of the search.

However, these are blind optimisations, and we often know much more about the problem we are trying to solve, such as the nature of the observations or the dynamics we are expecting (remember the Kalman filter).
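A minimal simulated annealing sketch on a one-parameter toy cost is shown below. The cost function, the geometric cooling schedule and all parameter values are assumptions made for illustration; they are not taken from the lecture's demos.

```python
import math
import random

def cost(v):
    """Hypothetical multi-minimum cost: global minimum at v = 0."""
    return 0.1 * v ** 2 + 1.0 - math.cos(3.0 * v)

def simulated_annealing(v, iterations=20000, t_start=2.0, t_end=1e-3, step=0.5, seed=0):
    """Random local moves; worse moves are accepted with a probability that cools over time."""
    rng = random.Random(seed)
    cooling = (t_end / t_start) ** (1.0 / iterations)
    temperature = t_start
    current_cost = cost(v)
    best_v, best_cost = v, current_cost
    for _ in range(iterations):
        candidate = v + rng.gauss(0.0, step)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            v, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best_v, best_cost = v, current_cost
        temperature *= cooling
    return best_v, best_cost

best_v, best_cost = simulated_annealing(6.0)   # start far from the global minimum
print(best_v, best_cost)
```

Early on, the high temperature lets the search climb out of local minima; as the temperature falls, it behaves more and more like a greedy local search.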

Tracking as an Optimisation Problem

Example of using simulated annealing for tracking the body pose:

N. Lehment, M. Kaiser, D. Arsic, and G. Rigoll. Cue-Independent Extending Inverse Kinematics For Robust Pose Estimation in 3D Point Clouds. Proc. IEEE Intern. Conf. on Image Processing (ICIP 2010)

Factored Sampling

We have seen how the KF uses a simple Gaussian to model observations but what happens if observations are non-Gaussian?

Factored Sampling can be used to search a static image in these cases

We want to calculate the posterior probability that an object X exists in an image given the observed data obj – P(X |obj)

Factored Sampling

This is difficult to achieve for continuous, complex, non-Gaussian distributions.

Luckily, Bayes' formula says that the posterior density is proportional to the product of a prior density P0(X) and an observation density P(obj|X):

– P(X|obj) ∝ P(obj|X) P0(X)

Factored sampling estimates the posterior by generating samples from the prior and weighting them according to the observation density.

Factored Sampling

A set of n points s(n), the centres of the blobs in the figure, is sampled randomly from the prior density P0(X).

Each sample is then assigned a weight (depicted by blob area) based upon the observation density P(obj|X = s(n)).

If n is sufficiently large then the weighted set represents the posterior density P(X|obj).
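These three steps can be sketched for a one-dimensional state. The uniform prior and the two-peaked observation density are assumptions invented for this example, chosen so that the posterior is clearly non-Gaussian.

```python
import numpy as np

def observation_density(x):
    """Hypothetical P(obj | X = x): a strong peak at x = 7 and a weaker one at x = 2."""
    return (np.exp(-((x - 7.0) ** 2) / (2 * 0.5 ** 2))
            + 0.5 * np.exp(-((x - 2.0) ** 2) / (2 * 0.5 ** 2)))

rng = np.random.default_rng(0)
n = 20000

# 1. Sample n points from the prior P0(X), here uniform on [0, 10].
samples = rng.uniform(0.0, 10.0, size=n)

# 2. Weight each sample by the observation density.
weights = observation_density(samples)
weights /= weights.sum()

# 3. The weighted set represents the posterior, so expectations are weighted sums.
posterior_mean = float(np.sum(weights * samples))
mass_near_7 = float(np.sum(weights[np.abs(samples - 7.0) < 2.0]))
print(posterior_mean, mass_near_7)
```

Both peaks survive in the weighted sample set, which is something a single Gaussian cannot represent. With the peaks weighted 1 to 0.5, the exact posterior mean is 16/3 and about two thirds of the posterior mass lies around x = 7.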

[Figure: probability against state X, showing the posterior density and the weighted samples.]

CONDENSATION and Particle Filtering

CONDitional DENSity propagATION (CONDENSATION), also known as particle filtering, is the natural extension of the KF to factored sampling.

Basically:

– Randomly generate a set of samples from the prior pdf and apply a model of dynamics (i.e. predict)
– Fit each sample to the image (i.e. measure)
– Weight the samples accordingly to generate a new posterior pdf that will serve as the prior for the next iteration

[Figure: the predict and measure steps.]
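The predict, measure and reweight cycle can be sketched for a one-dimensional state. This is a simplified illustration, not the CONDENSATION implementation: the dynamics are pure Gaussian drift, the measurement is a noisy position rather than an image fit, and all parameter values are assumptions.

```python
import numpy as np

def particle_filter_step(particles, z, rng, drift_std=2.0, obs_std=0.5):
    """One predict / measure / resample cycle for a 1-D state."""
    # Predict: apply the dynamics model, here just Gaussian noise (drift).
    particles = particles + rng.normal(0.0, drift_std, size=particles.shape)
    # Measure: weight each particle by how well it explains the observation z.
    weights = np.exp(-0.5 * ((z - particles) / obs_std) ** 2) + 1e-300
    weights /= weights.sum()
    estimate = float(np.sum(weights * particles))
    # Resample: the weighted set becomes the prior for the next iteration.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], estimate

rng = np.random.default_rng(0)
particles = rng.uniform(0.0, 100.0, size=500)   # no initialisation: spread over the space

true_position = 20.0
for _ in range(30):
    true_position += 1.0                         # target moves one unit per frame
    z = true_position + rng.normal(0.0, 0.5)     # noisy observation
    particles, estimate = particle_filter_step(particles, z, rng)

print(true_position, estimate)
```

The small constant added to the weights only guards against every weight underflowing to zero; it does not change the result when at least one particle is near the observation.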


The animation shows a few cycles of the algorithm applied to a one-dimensional system. The green spheres correspond to the members of the sample set, where the size of the sphere is an indication of the sample weight. The red line is the measurement density function.

This animation shows a short sequence of the CONDENSATION filter tracking a leaf exhibiting non-linear motion with occlusion and clutter.

Movie sequences taken from http://www.dai.ed.ac.uk/CVonline/LOCAL_COPIES/ISARD1/condensation.html


We can extend our random sampler to a simple PF using gaussian noise as our dynamics/drift term

Notice how the population quickly homes in on the area of highest probability as we saw in the random sampling

It quickly converges on incorrect local solutions. Increasing the noise term helps explore the space further, but the global maximum is at the bottom of the image.

8ParticleFilter.exe


We can further try to change the model to better fit the head and ensure the global maximum is at the correct position.

Tracking is better but easily lost to other maxima.

As the population size is increased we start to see multiple hypothesis tracking.

By combining both the PF and a gradient descent method we can get the best results for the lowest population, but our cost function is still flawed.

9Particle filter.exe

10ParticleFilter.exe


Advantages

– Allows complex non-Gaussian systems
– Easy to add non-linear dynamics
– Provides support for multiple hypotheses (!!!)

Disadvantages

– Large numbers of samples make the techniques extremely slow for high-dimensional parameter spaces
– Not a global optimisation, so it has a tendency to converge upon good observations at the cost of other observations

There are many schemes for overcoming these problems, but they are beyond the scope of this lecture.

Interesting Applications of Motion Tracking

Lip-Reading

Facial features of a subject are tracked, specifically the mouth regions.

Mouth texture and shape are extracted and used to build discriminative patterns called sequential patterns

Lip-Reading

Results:

Sign Language Recognition

Tracking required for extracting the motions of the hands and head.

Movement features of the hands and hand shapes are extracted

Again, discriminative movement patterns that uniquely identify a sign are extracted.

These patterns will be used to detect whether a sign is present in a video sequence or not

Sign Language Recognition

Results:

Group Behaviour Profiling

Even when tracking is not very accurate or robust, it can still be used to do useful things!

Example: Use simple trackers (e.g. Lucas Kanade trackers) to “track” people in a crowd

These will only last a short while, but can form short trajectories.

The analysis of these trajectories can be used to profile crowd behaviours.

Group Behaviour Profiling

Results:

Summary

We have looked at a variety of tracking strategies from very simple schemes to those which can learn and predict complex non-linear motion in cluttered environments. This talk is not exhaustive but should give you a basic understanding of the types of techniques used in modern computer vision systems.

For more details on many of the examples see my website http://www.surrey.ac.uk/personal/e.ong

For a good introduction on the temporal mechanics of tracking I would recommend reading “Active Contours” by Isard and Blake

Things to remember!!!

When tracking:

– Tracking is only as good as your model and data
  – A bad metric will give bad results
  – The larger the parameter space, the more difficult things become
– Make things as simple as possible
  – Constrain your environment
  – Use appropriate techniques and dynamics, e.g. if you're tracking someone jumping up and down, don't use a Kalman filter
– Don't try to reinvent the wheel
  – But if you're going to use black box techniques, ensure you know what they will and won't do for you
