
Visual Detection and Tracking of Flying Objects in Unmanned Aerial Vehicles

Artem Rozantsev
CVLAB, I&C, EPFL

Abstract—Visual detection and tracking of flying objects in unmanned aerial vehicles (UAVs) plays a central role in automated flight and navigation. It is a subproblem of a more general one, which can be defined as moving object detection by a moving camera. In this proposal, we first present three relevant papers [1]–[3] that provide solutions to this problem and outline their advantages and disadvantages, together with their application areas. We then discuss our current work, proposing a way to use statistical learning methods for flying object detection. Inspired by [2], we employ temporal information from the videos as an additional feature in the learning process. This approach allows us to significantly increase the detection rate compared to conventional single-frame classification methods.

Index Terms—UAV, drone, moving object detection, spatio-temporal descriptor, collision avoidance, optical flow.

I. INTRODUCTION

THE ability to detect and estimate the relative distance and bearing of neighboring aircraft plays a crucial role in automated flight and navigation. Vision-based relative positioning is of particular interest, as cameras generally consume less power and are more lightweight than active sensor alternatives such as radar and laser. It also has potential applications in non-collaborative flight scenarios, or in situations where GPS-based collision-warning systems are either unreliable or not commonly available on all aircraft.

Proposal submitted to committee: June 28th, 2013; Candidacy exam date: July 5th, 2013; Candidacy exam committee: Prof. Alcherio Martinoli, Prof. Pascal Fua, Dr. Vincent Lepetit, Dr. Michael Thémans.

This research plan has been approved:

Date: ————————————

Doctoral candidate: ————————————(name and signature)

Thesis director: ————————————(name and signature)

Thesis co-director: ————————————(if applicable) (name and signature)

Doct. prog. director:————————————(R. Urbanke) (signature)

EDIC-ru/05.05.2009


Vision-based relative positioning poses several challenges:
• a high aircraft detection rate;
• invariance to scale and illumination changes;
• high aircraft detection speed.
Unlike other detection tasks, in our case a missed detection can be quite costly. A high detection accuracy is therefore required across a variety of operating conditions. Aircraft travel at high speeds and must be detected quickly, particularly in collision avoidance scenarios. Also, neighboring aircraft can appear across a wide range of distances and can be difficult to detect when far away.

This problem belongs to the category of moving object detection and segmentation from a single moving camera. Many algorithms have been developed to address this task. Some of the popular and commonly used ones are described in [1]. These algorithms are based on estimating the optical flow between two images and analysing it further, which allows moving objects to be detected in the video sequence.

Optical flow methods are, however, sensitive to image noise and motion discontinuities. Consequently, Laptev [2] introduced the concept of spatio-temporal interest points, which prove to be more robust to noise and complex backgrounds, as well as to motion discontinuities and fast motion changes.

The detection of aircraft on a collision course is of particular importance in our context. In such a case, the aircraft remains static in the video while increasing in apparent size. This increase in size can be quite slow if the relative speed of the aircraft is not very high, or if the distance between the aircraft is much larger than the size of the aircraft. This can significantly decrease the accuracy of the approaches mentioned above, as they rely heavily on the motion of the observed aircraft. To address this issue, another method was introduced by Liu and Gleicher [3]. Their approach detects moving objects even if they remain static in some parts of the video sequence.

In our approach, we use the idea of spatio-temporal interest points introduced in [2] and strengthen it with other features, as was done in [3]. We then use statistical learning methods to achieve better accuracy in aircraft detection. This approach is designed to add prior knowledge and obtain better results compared to the methods described in [1]–[3], which are purely automatic and rely only on the video itself.

The use of statistical learning methods requires significant amounts of training data, which is a major limitation in our case, as we need the whole system to be able to detect different types of aircraft in various environments and weather conditions. In our experiments we can film videos of only two different types of UAV (a quadrotor UAV and a fixed-wing UAV); thus another important contribution of our future work will be the creation of a synthetic database of video sequences for training and evaluation purposes. The dataset should be highly representative, in the sense that the videos need to be very close to what a camera would capture when mounted on a real aircraft.

The remainder of the report is organised as follows. In Sections II–IV we discuss popular approaches that deal with the problem of moving object detection and segmentation from a moving camera. We conclude with a discussion of our current work in Section V.

II. OPTICAL FLOW

We start by introducing a commonly used technique for moving object detection and motion segmentation, namely optical flow. Baker et al. [1] give a good overview of successful methods designed to solve this task.

Following [1], we define optical flow as the apparent motion of brightness patterns in the image. Estimating this motion makes it possible to segment an image sequence into areas of pixels that have similar motion vectors and to consider them as moving objects. Most existing algorithms dealing with optical flow estimation view it as an energy minimisation task, where the energy is represented by the following equation:

E_{Gl} = E_D + \lambda E_P,   (1)

where E_D reflects the consistency of the flow with the input images, E_P imposes different constraints on the resulting flow field, and λ is a weighting parameter. According to [1], various methods represent each of these terms differently.

A. Data Term

A common way to define E_D is to assume that the brightness or color of an object's pixels does not change much from one frame to another, which results in the following equality:

I(x, y, t) = I(x + u, y + v, t + 1).   (2)

Applying a first-order Taylor expansion to the right-hand side of Equation (2), we can rewrite it in the following way:

u \frac{\partial I}{\partial x} + v \frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} = 0.   (3)

The brightness constancy assumption represented in Equation (2) imposes several strong constraints on the scene (e.g., all objects are Lambertian and the illumination of the scene is uniform) which are not always true in real-world scenarios.

Baker discusses some ways that were developed to weaken these assumptions. One approach is to analyse not the intensity but the gradients of the image:

\nabla I(x, y, t) = \nabla I(x + u, y + v, t + 1),   (4)

as gradient information is often more robust to illumination changes. However, Equation (4) adds another constraint, which states that the optical flow is locally translational. This restriction is quite strong and can easily be violated when the object rotates or its scale changes from one frame of the video sequence to another.

Instead of image gradients it is also possible to use more complex features like SIFT [4], or to explicitly model the illumination and appearance change from one frame to another.

Employing such advanced features makes the optical flow approach more robust to image noise and illumination changes; however, it also increases computational complexity and memory consumption, which is critical in our case.

Equations (2) and (4) both provide one error per pixel. These errors need to be aggregated over the whole image. The baseline approach is to use the L2 norm to penalise errors:

E_D = \sum_{x,y} \left( u \frac{\partial I}{\partial x} + v \frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} \right)^2.   (5)

A number of other penalty functions, such as the L1 norm and the Lorentzian function, have been used in some modern approaches.
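As an illustration, the linearised data term of Equation (5) can be evaluated directly from image derivatives. The following Python sketch (a minimal illustration of the formula, not code from [1]; the function and variable names are our own) computes the per-pixel residual and its squared sum for a given flow field:

```python
import numpy as np

def data_term(I0, I1, u, v):
    """Evaluate the linearised brightness-constancy data term of Eq. (5).

    I0, I1 : consecutive grayscale frames as float arrays.
    u, v   : horizontal and vertical flow components, same shape as the frames.
    """
    # Spatial derivatives of the first frame and the temporal derivative.
    Iy, Ix = np.gradient(I0)          # np.gradient returns derivatives along (rows, cols)
    It = I1 - I0                      # forward temporal difference

    # Per-pixel linearised residual u*Ix + v*Iy + It, cf. Eq. (3).
    residual = u * Ix + v * Iy + It

    # Quadratic (L2) aggregation over the whole image, Eq. (5).
    return np.sum(residual ** 2)
```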

B. Prior Term

The data term alone is ill-posed, with more unknowns than equations, which leads to the necessity of imposing prior constraints on the optical flow field. Following [1], we outline different approaches to defining E_P in the most effective way. The simplest solution is to use first-order derivatives:

E_P = \sum_{x,y} \left[ \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \right].   (6)

Choosing the prior term as described in Equation (6) favors a smooth optical flow field, which suppresses errors but also smooths the boundaries of objects, which is undesirable for most applications.

Based on the survey by Baker et al. [1], one of the most popular ways to deal with this problem is to weight the penalty function with a spatially varying function. One example of this approach is to add gradient-dependent weights to the right-hand side of Equation (6):

E_P = \sum_{x,y} \omega(\nabla I) \left[ \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \right].   (7)

In the method described by Equation (7), the weighting function ω(∇I) is isotropic, i.e., equal in all directions; however, there is a whole family of approaches that use anisotropic weighting functions (e.g., weighting the direction along the image gradient less than the direction orthogonal to it).

It is also possible to use higher-order priors (e.g., to encourage the second derivatives of the flow vectors to be small).

Following [1], another approach is to use affine priors, for example by over-parametrizing the flow in the following way:

u(P) = a_1(P) + \frac{x - x_0}{x_0} a_3(P) + \frac{y - y_0}{y_0} a_5(P),
v(P) = a_2(P) + \frac{x - x_0}{x_0} a_4(P) + \frac{y - y_0}{y_0} a_6(P),   (8)

where P = (x, y, t) is the location of the point in the video sequence and (x_0, y_0) are the coordinates of the centre point of the image. As we can see from Equations (8), this approach is flexible in that each pixel has the freedom to choose its own set of model parameters a_i, i = 1...6, which leads to a more accurate estimation of the flow field but also increases the computational complexity of the algorithm, as for each location we need to solve six equations instead of two.

The design of E_{Gl} involves a variety of choices, each with a number of free parameters, which can be tuned manually or with the help of statistical learning techniques.

C. Optimization Algorithms

Having set both E_D and E_P in Equation (1), we need to define an optimisation method that will solve it effectively. Following [1], there are different ways to do this.

One of the simplest algorithms is steepest descent, for which convergence to the global minimum is guaranteed but possibly quite slow, as this method fails to model the coupling between the unknowns. Several other approaches, such as the Gauss-Newton method or the Levenberg-Marquardt algorithm, were introduced to improve the convergence speed. However, these algorithms are inapplicable to general optical flow problems, as they require estimating and inverting a Hessian matrix of size 2n × 2n, where n is the number of pixels in the image.

The next class of algorithms, according to [1], assumes that the global energy E_{Gl} in Equation (1) can be represented as:

E_{Gl} = \iint E(u(x, y), v(x, y), x, y, u_x, u_y, v_x, v_y)\, dx\, dy,   (9)

where u_x = ∂u/∂x, u_y = ∂u/∂y, v_x = ∂v/∂x, v_y = ∂v/∂y. Here u(x, y) and v(x, y) are treated as unknown functions instead of a set of unknown parameters. The difference from the previous methods is that we seek the minimising continuous functions u(x, y) and v(x, y) instead of analysing their discrete equivalents.

Equation (9) can then be solved using the calculus of variations, with the Euler-Lagrange equations:

\frac{\partial E_{Gl}}{\partial u} - \frac{\partial}{\partial x}\frac{\partial E_{Gl}}{\partial u_x} - \frac{\partial}{\partial y}\frac{\partial E_{Gl}}{\partial u_y} = 0,
\frac{\partial E_{Gl}}{\partial v} - \frac{\partial}{\partial x}\frac{\partial E_{Gl}}{\partial v_x} - \frac{\partial}{\partial y}\frac{\partial E_{Gl}}{\partial v_y} = 0.   (10)

Depending on how E_{Gl} depends on the functions u and v, different methods can be chosen to solve Equations (10). For example, in the case of linear dependency we can simply parametrize (10) with two unknowns per pixel and solve the result as a sparse linear system. If the dependency is more complicated, Equations (10) are nonlinear, which requires iterative methods, with all the advantages and disadvantages of such methods discussed above.

All the described optimisation methods solve Equation (1) as one huge nonlinear minimisation task, which usually takes a lot of time and depends strongly on the initialisation. Following [1], a coarse-to-fine approach was developed to overcome this problem. This method first builds image pyramids by repeated blurring and downsampling. Optical flow is then computed at each level, from the top (fewest pixels) to the bottom, and the resulting flow at each level is taken as the initial estimate for the next level.

Following [1], this method is much faster than estimating a single solution at the bottom level; however, it has a tendency to over-smooth fine structures and to miss small, fast-moving objects, which are of main interest for our problem.
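To make the coarse-to-fine scheme concrete, the following Python sketch (our own illustrative code, not taken from [1]; a simple Horn-Schunck-style update stands in for whichever single-level solver a real system would use) builds an image pyramid by repeated downsampling and propagates the flow from coarse to fine levels:

```python
import numpy as np
from scipy.ndimage import zoom, uniform_filter

def flow_refine(I0, I1, u, v, alpha=10.0, iters=50):
    """One-level flow refinement (Horn-Schunck-style), used as a stand-in solver."""
    Iy, Ix = np.gradient(I0)
    It = I1 - I0
    for _ in range(iters):
        u_avg = uniform_filter(u, size=3)           # neighbourhood averages act as the smoothness prior
        v_avg = uniform_filter(v, size=3)
        num = Ix * u_avg + Iy * v_avg + It
        den = alpha ** 2 + Ix ** 2 + Iy ** 2
        u = u_avg - Ix * num / den
        v = v_avg - Iy * num / den
    return u, v

def coarse_to_fine_flow(I0, I1, levels=4):
    """Estimate flow on a pyramid: solve at the coarsest level, upsample, refine."""
    pyr0, pyr1 = [I0.astype(float)], [I1.astype(float)]
    for _ in range(1, levels):
        pyr0.append(zoom(pyr0[-1], 0.5))            # repeated downsampling builds the pyramid
        pyr1.append(zoom(pyr1[-1], 0.5))

    u = np.zeros_like(pyr0[-1])
    v = np.zeros_like(pyr0[-1])
    for k in reversed(range(levels)):               # from the top (coarsest) to the bottom (finest)
        if u.shape != pyr0[k].shape:
            fy = pyr0[k].shape[0] / u.shape[0]
            fx = pyr0[k].shape[1] / u.shape[1]
            u = 2.0 * zoom(u, (fy, fx))             # upsample the flow and scale it to the new level
            v = 2.0 * zoom(v, (fy, fx))
        u, v = flow_refine(pyr0[k], pyr1[k], u, v)  # previous level's flow is the initial estimate
    return u, v
```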

As described above, almost all solutions that focus on computing optical flow follow the structure shown in Equation (1): they have a specific data term, a prior term, and an optimisation algorithm to compute the flow field. Regardless of the choices made, in real-world scenarios all of these algorithms have to deal with several issues that make optical flow inherently difficult. Following [1], these include:
• the aperture problem and textureless regions, which emphasise the fact that optical flow is intrinsically ill-posed;
• camera noise, nonrigid motion, motion discontinuities and occlusions, which make the correct choice of the prior term E_P important;
• large motions and small objects, which can force optimisation algorithms to converge to local minima;
• mixed pixels, specularities, illumination changes and motion blur, which make the correct choice of the data term E_D, along with its assumptions (e.g., brightness constancy), important.

Fulfilling all of the constraints outlined above requires computationally more complex methods, which makes them unsuitable for our task and for real-time applications in general.

III. SPACE-TIME INTEREST POINTS

Another way to deal with motion detection is introduced in [2]. In this work, Laptev extended the notion of spatial interest points [5] into the spatio-temporal domain and showed that the resulting features are highly representative and can be used for the interpretation of spatio-temporal events.

A. Method

Similar to [5], Laptev seeks an operator that responds to events in video sequences at specific locations and extents in space and time. The general idea is to find points in video sequences such that their image values in the corresponding local spatio-temporal volumes have large variations along both the spatial and temporal directions.

Laptev chooses to calculate a scale-space representation L(·; σ_l², τ_l²) of the video sequence f(·) as a preliminary step of the algorithm. This involves convolution with a separable spatio-temporal Gaussian kernel:

g(x, y, t; \sigma_l^2, \tau_l^2) = \frac{1}{\sqrt{(2\pi)^3}\,\sigma_l^2\,\tau_l} \exp\!\left(-\frac{x^2 + y^2}{2\sigma_l^2} - \frac{t^2}{2\tau_l^2}\right)   (11)

and can be written as:

L(\cdot; \sigma_l^2, \tau_l^2) = g(\cdot; \sigma_l^2, \tau_l^2) * f(\cdot),   (12)

where * denotes convolution and the parameters (σ_l², τ_l²) represent the spatial and temporal variances, respectively. Laptev states that using a separate parameter for the temporal variance is essential, because the spatial and temporal extents of events are generally independent.
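For intuition, the separable smoothing of Equation (12) can be approximated on a discrete video volume with an anisotropic Gaussian filter; the sketch below (our own illustration, not Laptev's implementation) uses different standard deviations for the temporal and spatial axes:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(video, sigma_l, tau_l):
    """Approximate L(.; sigma_l^2, tau_l^2) of Eq. (12) for a video volume.

    video  : array of shape (T, H, W), time axis first.
    sigma_l: spatial standard deviation (pixels).
    tau_l  : temporal standard deviation (frames).
    """
    # Separable Gaussian smoothing with distinct temporal and spatial scales.
    return gaussian_filter(video.astype(float), sigma=(tau_l, sigma_l, sigma_l))
```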

The next step of the algorithm is similar to [5]. Here we consider the spatio-temporal second-moment matrix μ, composed of first-order spatial and temporal derivatives averaged with a Gaussian weighting function g(·; σ_i², τ_i²):

\mu = g(\cdot; \sigma_i^2, \tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix},   (13)

where (σ_i², τ_i²) = s(σ_l², τ_l²) and L_x, L_y, L_t are the first-order derivatives of L(x, y, t; σ_l², τ_l²) defined in Equation (12).

The final step of the algorithm is interest point detection. Inspired by [5], Laptev investigated the function H, which depends on the eigenvalues λ_1, λ_2, λ_3 of the matrix μ and has the following form:

H = \det(\mu) - k\,(\operatorname{trace}(\mu))^3 = \lambda_1 \lambda_2 \lambda_3 - k \left(\sum_{i=1}^{3} \lambda_i\right)^3.   (14)

As the matrix μ is closely related to an auto-correlation matrix, all three of its eigenvalues λ_1, λ_2, λ_3 need to be large to indicate an interest point in the spatio-temporal domain. Laptev [2] showed that, similarly to [5], this means that spatio-temporal interest points in the video sequence correspond to positive local maxima of the function H defined in Equation (14).
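The cornerness response of Equations (13)-(14) can be computed densely with a few array operations. The sketch below is illustrative code under our own naming (it assumes the scale-space volume L has already been computed, e.g. as in the scale_space helper above; k is a free parameter of the detector), not Laptev's implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def cornerness_H(L, sigma_i, tau_i, k=0.005):
    """Evaluate H of Eq. (14) from the second-moment matrix of Eq. (13).

    L : scale-space representation of the video, shape (T, H, W).
    """
    Lt, Ly, Lx = np.gradient(L)                      # first-order derivatives along (t, y, x)

    def avg(a):                                      # Gaussian averaging at the integration scales
        return gaussian_filter(a, sigma=(tau_i, sigma_i, sigma_i))

    # Entries of the 3x3 second-moment matrix mu, one value per voxel.
    m = {
        "xx": avg(Lx * Lx), "yy": avg(Ly * Ly), "tt": avg(Lt * Lt),
        "xy": avg(Lx * Ly), "xt": avg(Lx * Lt), "yt": avg(Ly * Lt),
    }

    # Determinant and trace of mu computed element-wise over the volume.
    det = (m["xx"] * (m["yy"] * m["tt"] - m["yt"] ** 2)
           - m["xy"] * (m["xy"] * m["tt"] - m["yt"] * m["xt"])
           + m["xt"] * (m["xy"] * m["yt"] - m["yy"] * m["xt"]))
    trace = m["xx"] + m["yy"] + m["tt"]

    return det - k * trace ** 3                      # positive local maxima indicate interest points
```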

The proposed algorithm is an extension of the Harris corner detector [5] into the spatio-temporal domain. Consequently, it inherits all of its benefits, such as rotation invariance, and its drawbacks, such as not being invariant to scale changes. Considering this, Laptev [2] developed a spatio-temporal scale adaptation mechanism to overcome this problem.

B. Spatio-Temporal Scale Adaptation

The general idea of the approach is to define a differential operator that assumes simultaneous extrema over the spatial and temporal scales that are characteristic of an event at a particular spatio-temporal location.

For analytical purposes, Laptev first studies the prototype function:

f(x, y, t; \sigma_0^2, \tau_0^2) = \frac{1}{\sqrt{(2\pi)^3}\,\sigma_0^2\,\tau_0} \exp\!\left(-\frac{x^2 + y^2}{2\sigma_0^2} - \frac{t^2}{2\tau_0^2}\right).   (15)

Using the properties of the Gaussian kernel, it follows that the scale-space representation of f is:

L(\cdot; \sigma^2, \tau^2) = g(\cdot; \sigma^2, \tau^2) * f(\cdot; \sigma_0^2, \tau_0^2) = g(\cdot; \sigma_0^2 + \sigma^2, \tau_0^2 + \tau^2).   (16)

In order to recover (σ_0, τ_0), we consider second-order derivatives normalised by the scale factors:

L_{xx,\mathrm{norm}} = \sigma^{2a} \tau^{2b} L_{xx},
L_{yy,\mathrm{norm}} = \sigma^{2a} \tau^{2b} L_{yy},
L_{tt,\mathrm{norm}} = \sigma^{2c} \tau^{2d} L_{tt}.   (17)

Here it is assumed that the local maximum of f over space and time is at the centre of f.

According to [2], the next step is to estimate the parameters a, b, c, d in Equation (17). To do so, we differentiate the expressions in Equation (17) with respect to the scales and set the result to zero. Finally, we have the following system of equations:

a\sigma^2 - 2\sigma^2 + a\sigma_0^2 = 0,
2b\tau_0^2 + 2b\tau^2 - \tau^2 = 0,
c\sigma^2 - \sigma^2 + c\sigma_0^2 = 0,
2d\tau_0^2 + 2d\tau^2 - 3\tau^2 = 0,   (18)

which, after substituting σ = σ_0 and τ = τ_0, results in a = 1, b = 1/4, c = 1/2 and d = 3/4. The resulting operator has the following form:

\nabla^2_{\mathrm{norm}} L = \sigma^2 \tau^{1/2} (L_{xx} + L_{yy}) + \sigma \tau^{3/2} L_{tt},   (19)

and reaches a maximal response at the pair of scales (σ², τ²) that provides the highest variation of the image intensity in space and time simultaneously for a given spatio-temporal location (x, y, t) in the video sequence.
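In practice, Equation (19) is evaluated over a grid of candidate scales and the pair giving the strongest response is kept. The following sketch (our own illustration of this scale selection, not Laptev's code) does this for a single spatio-temporal location:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def best_scales(video, point, sigmas, taus):
    """Pick (sigma^2, tau^2) maximising the normalised Laplacian of Eq. (19) at `point`.

    video  : array of shape (T, H, W).
    point  : (t, y, x) voxel index.
    sigmas : candidate spatial standard deviations.
    taus   : candidate temporal standard deviations.
    """
    t, y, x = point
    best, best_pair = -np.inf, None
    for sigma in sigmas:
        for tau in taus:
            L = gaussian_filter(video.astype(float), sigma=(tau, sigma, sigma))
            # Second-order derivatives along x, y and t at the chosen voxel.
            Lxx = np.gradient(np.gradient(L, axis=2), axis=2)[t, y, x]
            Lyy = np.gradient(np.gradient(L, axis=1), axis=1)[t, y, x]
            Ltt = np.gradient(np.gradient(L, axis=0), axis=0)[t, y, x]
            # Scale-normalised spatio-temporal Laplacian, Eq. (19).
            response = sigma**2 * tau**0.5 * (Lxx + Lyy) + sigma * tau**1.5 * Ltt
            if abs(response) > best:                 # keep the strongest (absolute) response
                best, best_pair = abs(response), (sigma**2, tau**2)
    return best_pair
```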

C. Algorithm

Having defined both the spatio-temporal interest point detector and the mechanism for spatio-temporal scale adaptation, Laptev employed the approach described in [6], extended to the spatio-temporal domain. The resulting algorithm finds points in video sequences that simultaneously correspond to local maxima of the corner function H and of the spatio-temporal Laplace operator ∇²_norm L over the scales (σ², τ²).

There are two ways to solve this optimisation task. The first is to compute the space-time maxima of H for each spatio-temporal scale and then select the points that maximise ∇²_norm L. This approach guarantees that the found solution is optimal, but it requires dense sampling and is therefore computationally expensive. An alternative is to detect interest points at a sparsely distributed set of scale values and then track these points in the spatio-temporal scale-time space towards the maxima of ∇²_norm L. This means that, instead of simultaneously optimising H and ∇²_norm L over five dimensions (x, y, t, σ², τ²), we can split the parameter space into two subspaces, (x, y, t) and (σ², τ²), and optimise iteratively over each of them until convergence is reached. Algorithm 1 briefly describes the whole detection framework implemented by Laptev [2].

D. Applications

In [2], Laptev illustrates the applicability of the developed approach to video interpretation and pose estimation, using videos of walking people.


Algorithm 1 Spatio-temporal interest point detection.
1: Detect interest points p_j, j = 1..N, as maxima of H (14) over space and time, using a sparsely selected set of spatio-temporal scales σ_l² = σ_{l,1}², ..., σ_{l,n}² and τ_l² = τ_{l,1}², ..., τ_{l,m}²;
2: for each interest point p_j found at step 1 do
3:   Compute ∇²_norm L at (x_j, y_j, t_j);
4:   Find the combination of scales (σ̃_{i,j}, τ̃_{i,j}) that maximises ∇²_norm L;
5:   if (σ̃_{i,j} ≠ σ_{i,j} or τ̃_{i,j} ≠ τ_{i,j}) then
6:     Re-detect the interest point using (σ̃_{i,j}, τ̃_{i,j});
7:     Go to 3;
8:   end if
9: end for

Using Algorithm 1, Laptev extracts spatio-temporal interest points from training video sequences. He then associates a descriptor with each interest point, which can be written in the following way:

j = (L_x, L_y, L_t, L_{xx}, \ldots, L_{ttt})\big|_{\sigma^2 = \sigma_i^2,\; \tau^2 = \tau_i^2},
L_{x^m y^n t^k} = \sigma^{m+n} \tau^k \,(\partial_{x^m y^n t^k}\, g * f),   (20)

where the derivatives are computed at the spatio-temporal scales of the corresponding interest point. These descriptors are then combined into a database with labels corresponding to different events (spatio-temporal interest points). Using this database, it is then possible to classify events extracted from evaluation videos by computing the minimum distance between their descriptors and the descriptors from the training data.
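The final classification step is a simple nearest-neighbour lookup in descriptor space; a minimal sketch (our own illustration; `train_desc` and `train_labels` stand for an assumed pre-built database of Eq. (20) descriptors and their event labels) could look like this:

```python
import numpy as np

def classify_event(query_desc, train_desc, train_labels):
    """Label a query descriptor by its nearest neighbour in the training database.

    query_desc  : 1-D descriptor of Eq. (20) for a detected interest point.
    train_desc  : 2-D array, one training descriptor per row.
    train_labels: event label of each training descriptor.
    """
    distances = np.linalg.norm(train_desc - query_desc, axis=1)  # Euclidean distance to every entry
    return train_labels[np.argmin(distances)]                    # label of the closest descriptor
```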

In [2], Laptev shows that the developed method can reliably detect walking people across a variety of scales and is able to handle occlusions and complicated background motions. He illustrates that, unlike previous methods, his approach does not require careful initialisation and/or a clear or stationary background. However, the method has some limitations:
• The algorithm is not online, as it requires a pre-recorded video sequence to reliably detect interest points. This comes from the fact that in the online case not all temporal scales are available. In this scenario it is still possible to approximately estimate the proper temporal scale of an interest point, but this estimate might be inaccurate for slow or constant motions.
• The descriptors of spatio-temporal interest points are not invariant to planar image rotations. Such invariance could be added by using steerable derivatives or rotationally invariant operators in space.
• The spatio-temporal interest point detection algorithm depends on the relative camera motion. This limitation is critical in our case, as the appearance of flying objects generally differs across viewpoint positions.
• The developed algorithm is not robust to fast contrast and illumination changes. To overcome this problem it is possible to use sequences of gradient images, as they are more invariant to illumination and contrast changes; however, as discussed in Section II, this imposes other constraints on the problem.

IV. MOVING OBJECT DETECTION AND SEGMENTATION USING COLOR AND LOCALITY CUES

The methods discussed in Sections II and III rely heavily on motion in order to detect moving objects, and thus cannot work well when motion is sparse or insufficient. In our context, this may lead to an inability to detect flying objects that are on a collision course, since in this case the observed aircraft barely moves with respect to the camera of the observing aircraft.

To overcome this limitation, Liu develops an unsupervised algorithm [3] that learns color and locality cues from sparse motion information for moving object segmentation. Here we first outline the method and then discuss its advantages and disadvantages.

A. Method

Liu proposes a detection and segmentation framework that is based on learning a moving object model by collecting sparse and insufficient motion information throughout the video.

The first step of the approach is to identify key frames of the video sequence that have motion cues strong enough to reliably identify at least some parts of the moving objects in the scene. Liu assumes that the background is dominant in the scene, so local motion can be defined as the discrepancy between local and global motion.

The proposed approach uses a homography, estimated from SIFT [7] feature correspondences, to model the global motion m_g(x, y) from one frame to the next in the video sequence. In order to estimate the motion cues m_c(x, y) for the moving objects, Liu computes the optical flow m_o(x, y) for each pixel in the image. Using the homography and the optical flow, we can then calculate the motion cue of every pixel in the following way:

m_c(x, y) = \|m_o(x, y) - m_g(x, y)\|_2^2.   (21)

If an object or a part of it can be reliably inferred from the motion cues of the current frame, that frame is considered a key frame. Liu uses the following criteria to extract key frames from the video sequence:

\sum_{x,y} [m_c(x, y) \ge \delta] > minArea,
\mathrm{Var}\big((x, y) \mid m_c(x, y) \ge \delta\big) < maxSpan,   (22)

where δ is a motion threshold, which discards weak motion cues; minArea defines the minimum number of pixels whose motion differs significantly from the background; and maxSpan defines the degree of sparsity of these pixels in the image.
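A direct rendering of Equations (21)-(22) in code: the sketch below (our own illustration; the homography H and the per-pixel flow field are assumed to come from a SIFT-based estimator and an optical flow routine, which are not shown) computes the motion cues and applies the key-frame test:

```python
import numpy as np

def motion_cues(flow, H):
    """Per-pixel motion cue of Eq. (21): squared difference between the optical flow
    and the global (homography-induced) motion.

    flow: array (H, W, 2) with per-pixel optical flow (dx, dy).
    H   : 3x3 homography mapping frame t to frame t+1.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)   # homogeneous pixel coords
    warped = pts @ H.T
    warped = warped[..., :2] / warped[..., 2:3]                          # where the homography sends each pixel
    mg = warped - np.stack([xs, ys], axis=-1)                            # homography-induced flow m_g
    return np.sum((flow - mg) ** 2, axis=-1)

def is_key_frame(mc, delta, min_area, max_span):
    """Key-frame criteria of Eq. (22)."""
    ys, xs = np.nonzero(mc >= delta)          # pixels with significant motion cues
    if ys.size <= min_area:
        return False
    spread = np.var(xs) + np.var(ys)          # spatial spread of those pixels
    return spread < max_span
```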

The second step is to segment moving objects or parts of objects (sub-objects) in the key frames. To do so, Liu suggests using the motion cues defined in Equation (21). This approach, however, may suffer from the aperture problem, as not all pixels of an object have significant motion cues. This issue can be addressed by considering the interaction between neighboring pixels:
1) neighboring pixels are likely to have similar labels;
2) neighboring pixels with similar colors are more likely to have similar labels.


To model these interactions, Liu suggests using a Markov Random Field prior [8] on the labels of the pixels, as follows:

p_n(\{l_i \mid i = 1...M\}) \propto \prod_{i \in \{1...M\}} \prod_{j \in N_i} \exp\!\left(\frac{\lambda\, l_i l_j}{\alpha + d(i, j)}\right),   (23)

where l_i is the label of pixel i, M is the number of pixels in the image, N_i is the 8-connected neighborhood of pixel i, d(i, j) measures the color difference between pixels i and j in a perceptually uniform color space (e.g., Lab), and (α, λ) are the weighting parameters of the method.

Since the motion cues in key frames are reliable, we can model the moving parts of the image as foreground in the following way:

p(I \mid \{l_i \mid i = 1...M\}) = \prod_{i \in \{1...M\}} \exp\big(l_i (mc_i - \delta)\big).   (24)

With Equations (23) and (24), moving object segmentation for a key frame can be achieved by maximizing the posterior:

P(L \mid I) = \frac{1}{Z}\, p(I \mid \{l_i \mid i = 1...M\})\, p_n(\{l_i \mid i = 1...M\}).   (25)

Following [3], this optimization task can be solved using the graph cuts algorithm.
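To illustrate what the MAP problem of Equation (25) looks like in code, the sketch below (our own simplified formulation, not the implementation of [3]) assembles per-pixel unary costs from the motion cues and pairwise weights in the spirit of Equation (23); an off-the-shelf max-flow/min-cut solver would then be applied to these terms:

```python
import numpy as np

def segmentation_terms(mc, image_lab, delta, lam=1.0, alpha=1.0):
    """Build unary and pairwise terms for graph-cut segmentation of a key frame.

    mc       : per-pixel motion cues (Eq. (21)).
    image_lab: image in a perceptually uniform color space, shape (H, W, 3).
    Returns per-pixel (background, foreground) unary costs and right/down pairwise
    weights; a max-flow solver would minimise the resulting energy.
    """
    # Unary terms following Eq. (24): strong motion cues make the foreground label cheap.
    fg_cost = -(mc - delta)
    bg_cost = (mc - delta)

    # Pairwise weights in the spirit of Eq. (23), here only for 4-connected right/down
    # neighbours for brevity ([3] uses the 8-connected neighborhood): similar colors
    # (small d) give large weights, encouraging identical labels across the edge.
    d_right = np.linalg.norm(image_lab[:, 1:] - image_lab[:, :-1], axis=-1)
    d_down = np.linalg.norm(image_lab[1:, :] - image_lab[:-1, :], axis=-1)
    w_right = lam / (alpha + d_right)
    w_down = lam / (alpha + d_down)

    return (bg_cost, fg_cost), (w_right, w_down)
```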

After all sub-objects are extracted from the key frames, Liu proposes a way to learn the color and locality cues of the moving objects. He suggests representing the moving sub-objects by a Gaussian Mixture Model (GMM) G_f = \bigcup_{j=1...n} g_j(p_j, \mu_j, \Sigma_j), where n is the number of components and p_j, μ_j, Σ_j are the prior of the component, the mean color vector and the covariance matrix, respectively. The model learning process is based on the expectation-maximisation algorithm, together with a clustering strategy to estimate the number of GMM components. Liu uses the following likelihood model to estimate whether a pixel with color c belongs to the moving object or not:

p_c(c \mid l_i) \equiv \exp\big(l_i (\log(\psi_c(c)) - \delta_c)\big),
\psi_c(c) = \max_{g_j \in G_f} \left( \frac{p_j \exp\!\big(-\frac{1}{2}(c - \mu_j)^T \Sigma_j^{-1} (c - \mu_j)\big)}{\sqrt{|\Sigma_j|}} \right),   (26)

where ψ_c(c) is the affinity of a pixel with color c to the model G_f, and δ_c is a color threshold, which discards weak color cues. Since some of the detected moving parts of the extracted sub-objects might contain background pixels, G_f may end up with some false components. These false components can be detected and removed by checking the affinity of all GMM components to the background pixels: if a component is too close to the background, it is most likely a false component.
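As a rough illustration of this color model (our own sketch using scikit-learn, not the implementation of [3]; in practice the component count would come from the clustering strategy mentioned above), a GMM can be fitted to the foreground colors and the affinity of Equation (26) evaluated per pixel:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_model(fg_colors, n_components=5):
    """Fit the foreground color GMM G_f on colors of reliably detected moving pixels."""
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(fg_colors)

def color_affinity(gmm, colors):
    """Evaluate psi_c of Eq. (26): the best per-component weighted Gaussian response."""
    affinities = []
    for p_j, mu_j, cov_j in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        diff = colors - mu_j
        inv = np.linalg.inv(cov_j)
        mahal = np.einsum("nd,dk,nk->n", diff, inv, diff)        # (c - mu)^T Sigma^{-1} (c - mu)
        affinities.append(p_j * np.exp(-0.5 * mahal) / np.sqrt(np.linalg.det(cov_j)))
    return np.max(affinities, axis=0)                             # max over GMM components
```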

Color cues, however, are insufficient when the background of the image has a similar color to the moving object, which leads to the necessity of using locality cues. To address this issue, Liu adds a spatial affinity component to the model. It basically states that the closer a pixel is to the moving-object pixels that were reliably detected with motion cues, the higher the probability that it also belongs to the moving object. Consequently, locality cues defined in this way can only be provided for key frames, as they require reliable knowledge of which pixels correspond to moving objects.

Algorithm 2 Locality cue propagation.
1: for each key frame k do
2:   for t ← k, k + 1, k + 2, ... do
3:     Initialise the locality cue at frame t + 1: p_s^{t+1}(x_i | l_i) = p_s^t(x_i | l_i);
4:     Estimate moving objects at frame t + 1 using the probability distribution defined in Equation (28);
5:     Refine the locality cue of frame t + 1 using the estimated objects;
6:     if (frame t + 1 is a key frame) & (the objects estimated at step 4 cover the objects detected in this frame using motion cues) then
7:       Remove this frame from the list of key frames;
8:     end if
9:   end for
10:  Propagate the locality cue at frame k to frames k − 1, k − 2, ... using the same algorithm;
11: end for

The locality cue can be represented in the following way:

p_s^t(x_i \mid l_i) \equiv \exp\big(l_i (\log(\psi_s^t(x_i)) - \delta_s)\big),
\psi_s^t(x_i) = \frac{1}{2\pi\sigma^2} \max_{j \in F} \exp\!\left(-\frac{(x_i - x_j)^T (x_i - x_j)}{2\sigma^2}\right),   (27)

where ψ_s^t(x_i) is the spatial affinity of pixel i to the moving objects F in key frame t, x_i is the position of pixel i, σ is a weighting parameter, and δ_s is a spatial threshold, which discards weak locality cues.
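Since the maximum in Equation (27) depends only on the distance to the nearest foreground pixel, the locality cue can be computed efficiently with a distance transform. The following sketch is our own shortcut for this computation, not the original implementation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def locality_affinity(fg_mask, sigma):
    """Spatial affinity psi_s of Eq. (27) for every pixel of a key frame.

    fg_mask: boolean array, True for pixels reliably detected as moving objects (the set F).
    sigma  : spatial weighting parameter of Eq. (27).
    """
    # Distance from every pixel to its nearest foreground pixel; the maximum over j in F
    # of the Gaussian in Eq. (27) is attained at that nearest pixel.
    dist = distance_transform_edt(~fg_mask)
    return np.exp(-dist ** 2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
```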

B. Framework

Given the learned color and locality cues for moving objects, we extend the likelihood defined in Equation (24). For key frames this can be done in the following way:

p^t(f_i \mid l_i) = p_m^t(mc_i \mid l_i) \cdot p_c^t(c_i \mid l_i)^{\lambda_c} \cdot p_s^t(x_i \mid l_i)^{\lambda_s},   (28)

where (λ_c, λ_s) are weighting parameters. Since locality cues are available only for key frames, and color information is not always sufficient for correct object segmentation, Liu proposes a way to propagate locality cues from each key frame to the whole video sequence, assuming that the positions of the moving objects do not change much between two consecutive frames; this is outlined in Algorithm 2. Having defined locality cues for all frames, we can use Equation (28) to re-estimate the moving objects in each frame.

The proposed framework [3] was evaluated on handmade video sequences, which show:
• robustness of the approach to occlusions and motion discontinuities;
• successful segmentation of moving objects even when they remain static for some frames, which makes the algorithm suitable for our problem, as it will remain robust in collision situations;
• dependence only on the video itself to infer the objects of interest.


This algorithm, however, has some disadvantages that may turn out to be critical in our case:
• It needs the whole video sequence to robustly estimate moving objects in all frames. However, the idea can be extended to an online version with gradual model formation and refinement.
• The algorithm may fail to detect some parts of the moving objects if they remain static throughout the whole video sequence, or if their motion cues are not reliable enough to satisfy the criteria in Equation (22). This can be avoided by incorporating external information about the object of interest into the algorithm and by using more efficient techniques for motion-cue extraction. This can also speed up the algorithm, as the authors claim that 80% of the processing time is spent on motion estimation.
• The proposed method depends on a number of free parameters, which in [3] were manually tuned to achieve satisfactory results. To overcome this problem, statistical learning methods can also be incorporated into the current algorithm.

V. DISCUSSION AND THESIS PLAN

In this section, we first summarize our current work and propose a way to use machine learning techniques for flying object detection. We then discuss some future research directions for improving the detection accuracy and reducing the computational complexity of the framework.

A. Current Work

All the methods discussed in Sections II-IV are completely automatic. They depend only on the video itself and do not require any prior knowledge about the object of interest. Their goal is to detect any kind of moving object that occurs in the video sequence. Our objective, however, is slightly different: we need to robustly detect and track flying objects. This restricts us to a limited set of shapes and allows us to use statistical learning methods to achieve better performance than the approaches described in [1]–[3].

We start by training a classifier based on videos of one type of aircraft (a quadrotor UAV). Sample images from the videos of our dataset can be seen in Figure 1. Inspired by the idea of spatio-temporal interest points [2], we extract features for the classifier not from a single frame, but from a sequence of frames of the video. This approach allows us to incorporate temporal information into the descriptor and to reduce the influence of factors such as motion blur or image noise on the performance of the framework. We show the advantage of this spatio-temporal approach by comparing it to conventional single-frame approaches in Figure 3; temporal information significantly increases the classification performance.

Extracting information from several frames for detection purposes is challenging, because the features obtained when the aircraft moves from the top to the bottom of the frame are different from those obtained when it moves from left to right. In order not to have to train the classifier for all possible motions that an aircraft could make, we implement a motion compensation mechanism.

Fig. 1. Example images from the quadrotor dataset that we use for training and evaluation of our framework.

Fig. 2. (upper) Frames of a video sequence from the evaluation dataset before alignment. (lower) Frames of the same video sequence after alignment.

Algorithm 3 Detection framework.
1: Divide the video sequence into slices of images;
2: Compute gradients at all spatio-temporal locations in the slice of the video;
3: Compensate for aircraft motion;
4: Use a sliding-window approach in combination with a classifier to compute a response image;
5: Use non-maximal suppression to identify positive local maxima in the response image;
6: Use the coordinates of the local maxima as the spatial coordinates of the aircraft in the video sequence.

This mechanism aligns the images so that the aircraft stays roughly static throughout the image sequence. It is essentially a learned function that, given two consecutive images, determines how the position of the aircraft changes from one frame to the next. More specifically, we employ two preliminarily learned regressors, for the horizontal and vertical movement of the aircraft, and combine their responses into a single motion vector. Given this vector, it is then possible to align the images so that the aircraft is always close to their centres, which is illustrated in Figure 2.
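A rough sketch of this alignment step is shown below (our own illustrative code; the two regressors `reg_dx` and `reg_dy` and the feature extraction `make_features` are placeholders for the preliminarily learned models, whose exact form is not specified here). It accumulates the predicted per-frame shifts and re-centres every frame:

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def compensate_motion(frames, reg_dx, reg_dy, make_features):
    """Align a sequence of frames so the aircraft stays near a fixed position.

    frames       : list of grayscale frames (H, W).
    reg_dx/reg_dy: pre-trained regressors predicting the horizontal/vertical aircraft
                   displacement from a pair of consecutive frames (assumed interface).
    make_features: function turning two frames into the regressors' feature vector.
    """
    aligned = [frames[0]]
    total_dx, total_dy = 0.0, 0.0
    for prev, cur in zip(frames[:-1], frames[1:]):
        feat = make_features(prev, cur).reshape(1, -1)
        total_dx += float(reg_dx.predict(feat)[0])     # accumulate the predicted motion vector
        total_dy += float(reg_dy.predict(feat)[0])
        # Shift the current frame so the aircraft returns to its initial position.
        aligned.append(nd_shift(cur, (-total_dy, -total_dx)))
    return aligned
```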

The complete detection framework that is currently implemented is outlined in Algorithm 3.

One of the main applications of vision-based aerial positioning and flying object detection is collision avoidance. This particular scenario has two important properties: the observed aircraft remains nearly static with respect to the camera of the observing aircraft, and its apparent size keeps increasing. These cues are very strong and can be used for our purposes; Algorithm 4 briefly outlines the main steps that we take to predict collision situations.


Fig. 3. Comparison between our spatio-temporal approach and single-frame detection methods. (left) Accuracy of our spatio-temporal method on our evaluation dataset. (right) Accuracy of conventional single-frame approaches on the same dataset.

Algorithm 4 Prediction of collision situations.
1: for every detection obtained from Algorithm 3 do
2:   Extract a spatio-temporal cube of data surrounding the detection;
3:   Build a spatio-temporal histogram of gradients [9] based on the intensity values of this cube of data;
4:   Use the acquired histogram as a descriptor for the detection;
5:   Employ a preliminarily learned SVM to check whether the descriptor corresponds to a collision situation or not;
6: end for

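As an illustration of steps 2-5 of Algorithm 4, a spatio-temporal gradient histogram can be built for a detection cube and fed to an SVM. The sketch below is a simplified stand-in for the 3D-gradient histograms of [9]; the orientation binning and the pre-trained classifier `svm` are our own assumptions:

```python
import numpy as np

def collision_descriptor(cube, n_bins=8):
    """Simplified spatio-temporal histogram of gradients for a detection cube (T, H, W)."""
    gt, gy, gx = np.gradient(cube.astype(float))
    magnitude = np.sqrt(gx ** 2 + gy ** 2 + gt ** 2)
    azimuth = np.arctan2(gy, gx)                              # in-plane gradient orientation
    elevation = np.arctan2(gt, np.sqrt(gx ** 2 + gy ** 2))    # temporal gradient orientation
    hist, _, _ = np.histogram2d(
        azimuth.ravel(), elevation.ravel(),
        bins=n_bins,
        range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]],
        weights=magnitude.ravel(),                            # magnitude-weighted voting
    )
    hist = hist.ravel()
    return hist / (np.linalg.norm(hist) + 1e-9)               # normalised descriptor

def is_collision(cube, svm):
    """Check whether a detection corresponds to a collision situation (step 5 of Algorithm 4)."""
    return svm.predict(collision_descriptor(cube).reshape(1, -1))[0] == 1
```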

B. Future research

The use of machine learning techniques requires a considerable amount of training data, which is hard to obtain in our case. We can film this data ourselves; however, we are able to carry out experiments with only a restricted number of different aircraft (a quadrotor UAV and a fixed-wing UAV), which creates the need to synthetically generate the required amount of video sequences. These videos should be highly representative and allow us to robustly detect flying objects. To address this issue, one goal of our future research is to build a system that, given a model of an aircraft and a number of its real images, is able to generate synthetic video sequences for the classifier. Videos obtained in this way should have the same statistics as real videos and fully represent different types of aircraft at various rotation angles, in different environments and weather conditions.

One of the objectives of our research is to make the whole system work autonomously online, onboard the aircraft.

Consequently, we need to reduce the computational complexity and memory consumption of our algorithm. Currently, we have implemented a sliding-window approach for detection, which is slow, as we need to run the classifier across all possible spatio-temporal positions and at different scales. In order to reduce the computational complexity and increase the speed of the framework, we will use background subtraction techniques. This approach will significantly decrease the number of spatio-temporal positions that need to be checked by the classifier. Moreover, we will implement object tracking approaches, so that the system does not lose the observed aircraft in areas of the video that are hard for the detection algorithm.
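As a sketch of how background subtraction could prune the search space (our own illustration using OpenCV's standard MOG2 subtractor; the area threshold is an arbitrary placeholder), candidate regions can be extracted per frame and only those regions would be passed to the sliding-window classifier:

```python
import cv2
import numpy as np

def candidate_regions(frames, min_area=20):
    """Yield per-frame bounding boxes of moving regions to restrict the classifier's search."""
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    for frame in frames:
        mask = subtractor.apply(frame)                               # foreground mask for this frame
        mask = (mask > 0).astype(np.uint8)
        n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
        boxes = []
        for i in range(1, n):                                        # label 0 is the background
            x, y, w, h, area = stats[i]
            if area >= min_area:                                     # keep only sufficiently large blobs
                boxes.append((x, y, w, h))
        yield boxes
```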

REFERENCES

[1] S. Baker, S. Roth, D. Scharstein, M. Black, J. P. Lewis, and R. Szeliski, “A database and evaluation methodology for optical flow,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, 2007, pp. 1–8.

[2] I. Laptev, “On space-time interest points,” Int. J. Comput. Vision, vol. 64, no. 2-3, pp. 107–123, Sep. 2005.

[3] F. Liu and M. Gleicher, “Learning color and locality cues for moving object detection and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[4] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman, “SIFT flow: Dense correspondence across different scenes,” in Proceedings of the 10th European Conference on Computer Vision: Part III, ser. ECCV ’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 28–42.

[5] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proc. of Fourth Alvey Vision Conference, 1988, pp. 147–151.

[6] K. Mikolajczyk and C. Schmid, “Indexing based on scale invariant interest points,” in Proceedings of the 8th International Conference on Computer Vision, Vancouver, Canada, 2001, pp. 525–531.

[7] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.

[8] Y. Sheikh and M. Shah, “Bayesian modeling of dynamic scenes for object detection,” PAMI, vol. 27, pp. 1778–1792, 2005.

[9] A. Kläser, M. Marszałek, and C. Schmid, “A spatio-temporal descriptor based on 3D-gradients,” in BMVC’08.