
Multi-view Vehicle Detection and Tracking in Crossroads

Liwei Liu, Junliang Xing, Haizhou Ai
Computer Science and Technology Department,

Tsinghua University, Beijing 100084, China
Email: [email protected]

Abstract—Multi-view vehicle detection and tracking in crossroads is of fundamental importance in traffic surveillance, yet it remains a very challenging task. The view changes of different vehicles and their occlusions in crossroads are two main difficulties that often defeat existing methods. To handle these difficulties, we propose a new method for multi-view vehicle detection and tracking that innovates mainly in two aspects: two-stage view selection and dual-layer occlusion handling. For the two-stage view selection, a Multi-Modal Particle Filter (MMPF) is proposed to track vehicles in explicit views, i.e., the frontal (rear) view or the side view. In the second stage, for vehicles in inexplicit views, i.e., intermediate views between frontal and side, spatial-temporal analysis is employed to further decide their views so as to maintain the consistency of view transitions. For the dual-layer occlusion handling, a cluster-based dedicated vehicle model for partial occlusion and a backward retracking procedure for full occlusion are integrated complementarily. The two-stage view selection efficiently fuses multiple detectors, while the dual-layer occlusion handling effectively improves tracking performance. Extensive experiments under different weather conditions, including snowy, sunny and cloudy, demonstrate the effectiveness and efficiency of our method.

I. INTRODUCTION

Detection and tracking of vehicles in traffic scenes is of fundamental importance for surveillance systems and has apparent commercial value, providing great potential for many high-level computer vision applications such as traffic analysis, intelligent scheduling and abnormal activity detection. The difficulties behind this problem, however, are considerable: vehicle view and type changes, partial and full vehicle occlusions, and gradual and sudden illumination changes. These difficulties are inevitable in practical applications and thus noticeably aggravate the problem.

Vehicle detection and tracking has been researched for many years, and significant advances have been achieved. Traditional methods detect vehicles based on background subtraction [1][2][3] and track them using techniques like the Kalman Filter [3] and the Spatial-Temporal Markov Random Field [2] with different observations such as contour [1] and appearance [3]. Since these methods are sensitive to foreground noise, particular cases such as camera adjustment, rain, snow and shadow cause them to fail. Moreover, they all require that vehicles be identified separately before occlusion happens, which is a strong constraint that limits their applicability in crowded scenarios.

Fig. 1. The flow chart of our approach.

In the last decade, the fast development of object detection techniques has resulted in many promising methods for detecting particular object classes, e.g., faces [4][5], pedestrians [6][7], and vehicles [8]. These object detectors provide good observation models for detection-based tracking algorithms. All the detection-based methods can be categorized into three classes according to the types of detectors: a single view detector [4][7], an integration of multiple view detectors [6], and a single multi-view detector [8]. Obviously, a single view detector is unsuitable for scenarios which contain multi-view targets, e.g., crossroads. In consideration of the connections and distinctions among multiple view detectors, a tracking algorithm based on multiple detectors must have a sophisticated integration strategy. A single multi-view detector (typically used in onboard systems) requires high affinity of targets in each view and a uniform aspect ratio of vehicles, so this approach does not work for our problem. In addition, Data-Driven MCMC [9] has been used to recover trajectories of targets of interest over time, but this method requires all the video in advance and uses optimization algorithms to solve the problem, which conflicts with the requirements of online and real-time processing in our problem. As far as we know, there are very few works on multi-view vehicle detection and tracking in crossroads based on detection techniques that can process online and in real time. Our approach is motivated by this practical requirement.

In this work, we focus on videos taken by a single camera mounted at a height above the ground, as is common in surveillance applications. The videos are acquired in crossroads where occlusions among vehicles and viewpoint changes are severe. Owing to detection-based techniques, our approach is much more robust to shadow and illumination changes than background subtraction based methods.

Fig. 2. Results of view confidence weight (from left to right: side view (red), intermediate view, and frontal view (green); the histogram in the bottom-right corner of each figure gives the quantitative comparison of the weights).

The main contributions of our approach are: (1) a real-time and online processing system that can deal with view changes and occlusions effectively; (2) a two-stage view selection technique that can efficiently fuse multiple detectors; (3) a dual-layer occlusion handling technique that can deal with partial and full occlusions integrally.

The rest of this paper is organized as follows. The details of the proposed method are elaborated in Section II. Experimental results are presented in Section III. Conclusions and discussions are given in Section IV.

II. THE PROPOSED APPROACH

The flow chart of our multi-view vehicle detection and tracking system is shown in Fig. 1. Multiple view detectors are not only employed to search for new targets but are also coupled together in the MMPF to guide the tracking process and perform view selection for targets in explicit views. For targets in inexplicit views, spatial-temporal analysis is employed to smooth their view transitions and maintain the consistency of traffic flow. To handle occlusion, we devise a cluster-based dedicated vehicle model for partial occlusion and a backward retracking procedure for full occlusion. In the following, after briefly introducing the multiple view detectors, we focus on the two-stage view selection and the dual-layer occlusion handling, which mainly differentiate our approach from previous methods.

A. Multiple View Detectors

For vehicle surveillance videos in crossroads, it is very difficult to train one detector that covers all views due to the large variance of vehicle appearance. So we train detectors that cover typical views, i.e., the frontal (rear) view and the side view. The two detectors are trained offline in the boosting framework with Joint Sparse Granular Features (JSGF)†, which have been proven to be effective for object detection and robust to illumination variation. They provide very discriminative and steady observation models for multi-view vehicle tracking.
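Although the JSGF detector itself is patented and not reproduced here, its role as an observation model can be sketched abstractly: a boosted cascade evaluates an image window layer by layer and reports how far it got, which is exactly what the particle confidence redefinition in Section II-B.1 consumes. The sketch below is an illustrative assumption about that interface, not the authors' implementation.

```python
from typing import Callable, List, Tuple

def evaluate_cascade(window, layers: List[Callable], thresholds: List[float]) -> Tuple[int, float]:
    """Run a boosted cascade on an image window.

    layers[k](window) returns the stage-k boosted score (hypothetical
    interface); thresholds[k] is that stage's rejection threshold.
    Returns (number of layers passed, last stage score), the two
    quantities later used to predefine particle confidences.
    """
    score = 0.0
    for k, (stage, thr) in enumerate(zip(layers, thresholds)):
        score = stage(window)
        if score < thr:
            return k, score          # rejected at layer k: passed k layers
    return len(layers), score        # passed all layers
```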

B. Two-Stage View Selection

Having frontal and side view detectors is far from enough for multi-view vehicle tracking due to response conflicts between the two detectors. In other words, if an unreliable observation is chosen to track a response-conflict target, the target may be lost when it cannot obtain enough supporting observations.

†Specified object detection apparatus, Chinese Patent 200710305499.7, inventors: Haizhou Ai, Chang Huang, Shihong Lao, Takayoshi Yamashita.

TABLE I
THE FRAMEWORK OF TWO-STAGE VIEW SELECTION

Given: Each object $s_{t-1}$ has its supporting multi-view particle set $\{s^n_{t-1,v}, \pi^n_{t-1,v}\}_{n=1,v=1}^{N,V}$, where $N$ is the number of particles of each view, $V$ is the number of views, $t-1$ is the frame number and $\pi^n_{t-1,v}$ is the weight of particle $s^n_{t-1,v}$:
• For the particles of the dominant view $dv_{t-1} \in V$:
  + Predict, resample and update as in a traditional particle filter;
  + Obtain the weighted mean state $s_{t,dv_{t-1}}$ of the dominant view;
• For each other view $\{v' \mid v' \in V,\, v' \neq dv_{t-1}\}$:
  + If a new target matches this view:
    - Reinitialize the particles with the new target;
    - Use the detector to evaluate the particles;
  + Else if $\sum_{n=1}^{N} \pi^n_{t-1,v'} < T_S$ or the distance between the centers of $s_{t-1,v'}$ and $s_{t-1,dv_{t-1}}$ satisfies $Dis(v', dv_{t-1}) > T_{Dis}$:
    - Reset all the particles according to $s_{t,dv_{t-1}}$;
    - Update with the reset particles;
  + Else:
    - Predict, resample and update as in a traditional particle filter;
  + Obtain the weighted mean state $s_{t,v'}$ of $v'$;
• If $\forall v' \in V,\, v' \neq v:\ \sum_{n=1}^{N} \pi^n_{t,v} - \sum_{n=1}^{N} \pi^n_{t,v'} > T_W$:
  + $dv_t = v$;
• Else:
  + Spatial-temporal analysis;

So we propose the two-stage view selection to integrate the two independent detectors for multi-view vehicle tracking. The two-stage view selection consists of the multi-modal particle filter and the spatial-temporal analysis, which are introduced below.

1) Multi-Modal Particle Filter: The Multi-Modal Particle Filter (MMPF) is devised to track multi-view targets. As the name suggests, a target has two possible views (frontal and side) as its two modes but reveals only one view at a time; the MMPF integrates the two view detectors to track the target and perform the first-stage view selection.

Different from the traditional particle filter or CONDENSATION [10], the MMPF maintains two groups of particles for a target, one for the frontal view and the other for the side view, not only to track the target but also to capture its view transition. In the MMPF framework (Table I), each particle is evaluated by a confidence reflecting the likelihood of the target belonging to the corresponding view. To select the dominant view, the total confidence of its particles is calculated for each view. If the difference between the two views' total confidences is bigger than a threshold $T_W$ (equation (1)), the bigger one (as in Fig. 2(a) and Fig. 2(c)) is treated as the dominant view; otherwise (as in Fig. 2(b)) the second-stage view selection is adopted. Denoting $N$ as the number of particles, and $\pi^n_{t,v}$ as the $n$-th particle's confidence for view $v$ in frame $t$:

$$\sum_{n=1}^{N} \pi^n_{t,v} - \sum_{n=1}^{N} \pi^n_{t,v'} > T_W \quad (1)$$

Since the two groups of particles are not independent, the traditional predict, resample and update procedures [10] for particles are unsuitable for our framework.

Fig. 3. (a) Tracking result (green box represents frontal view and red box denotes side view). (b) The predefined confidences of particles (brightness indicates confidence and the position is the center of a particle).

So the MMPF needs redesigned procedures to deal with the special cases when the observation of the minor view (the view other than the dominant view) becomes unreliable or drifts. The redesigned predict, resample and update procedures can be formalized as equation (2) (following the framework in Table I):

$$
\begin{aligned}
&\text{Predict by } p(s_{t,dv} \mid s_{t-1,dv}): \{s'^{(i)}_{t,v}, \pi^{(i)}_{t-1,v}\} \sim p(s_{t,v} \mid O_{t-1,v}) \\
&\text{Resample:} \\
&\quad \{s^{(i)}_{t,dv}, 1/N_{dv}\} \sim p(s_{t,dv} \mid O_{t-1,dv}) \\
&\quad \{N(s_{new}, \delta^2), 1\} \sim p(s_{t,mv} \mid O_{t-1,mv}) \\
&\quad \{T(s^{(i)}_{t,dv}), 1\} \sim p(s_{t,mv} \mid O_{t-1,mv}) \\
&\quad \{s^{(i)}_{t,mv}, 1/N_{mv}\} \sim p(s_{t,mv} \mid O_{t-1,mv}) \\
&\text{Update: } \pi^{(n)}_{t,v} \propto p(o_{t,v} \mid s_{t,v}),\ \{s^{(i)}_{t,v}, \pi^{(i)}_{t,v}\} \sim p(s_{t,v} \mid O_{t,v})
\end{aligned} \quad (2)
$$

where $dv$ is the dominant view and $mv$ is the minor view, $dv \cup mv = v$. The tracking algorithm first predicts all the particles according to a motion model $p(s_{t,dv} \mid s_{t-1,dv})$ of $dv$. In the resample stage, the particles of $dv$ are resampled according to their weights. But for the minor view $mv$, different measures are adopted depending on the circumstances: when a new target ($s_{new}$) matches the minor view, $N(s_{new}, \delta^2)$ is used to generate new particles through Gaussian sampling. When the observation becomes unreliable (the total confidence is too small) or the particles drift to another target, $T(s^{(i)}_{t,dv})$ resets the particles according to the particles of the dominant view (with the same center and scale). Except for these two situations, the particles of the minor view are resampled like the dominant view's. Finally, the tracking algorithm updates the states of both views by the weighted mean of all the resampled particles.
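To make these procedures concrete, here is a minimal Python sketch of one MMPF time step following Table I and equation (2); the container layout, the random-walk motion model, and the `detector_conf` interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def _resample(particles, weights, rng):
    """Multinomial resampling by normalized particle weights."""
    w = weights + 1e-9
    idx = rng.choice(len(w), size=len(w), p=w / w.sum())
    return particles[idx]

def mmpf_step(particles, weights, detector_conf, dominant, minor,
              rng, s_new=None, noise=2.0, T_S=5.0, T_Dis=50.0, T_W=5.0):
    """One MMPF time step over the two particle groups (views).

    particles[v]  : (N, 4) array of [x, y, w, h] states for view v.
    weights[v]    : (N,) particle confidences from the previous frame.
    detector_conf : dict of callables, view -> confidences of an (N, 4) array.
    """
    # Predict: diffuse all particles with a simple random-walk motion model.
    for v in (dominant, minor):
        particles[v] = particles[v] + rng.normal(0.0, noise, particles[v].shape)

    # Resample the dominant view as in a traditional particle filter.
    particles[dominant] = _resample(particles[dominant], weights[dominant], rng)

    # Minor view: the three cases of Table I.
    dist = np.linalg.norm(particles[minor][:, :2].mean(0)
                          - particles[dominant][:, :2].mean(0))
    if s_new is not None:                              # new detection matched this view
        particles[minor] = rng.normal(s_new, noise, particles[minor].shape)
    elif weights[minor].sum() < T_S or dist > T_Dis:   # unreliable or drifted
        particles[minor] = particles[dominant].copy()  # reset from the dominant view
    else:
        particles[minor] = _resample(particles[minor], weights[minor], rng)

    # Update: re-weight by the view detectors and take weighted mean states.
    states = {}
    for v in (dominant, minor):
        weights[v] = detector_conf[v](particles[v])
        states[v] = np.average(particles[v], axis=0, weights=weights[v] + 1e-9)

    # First-stage view selection (equation (1)); otherwise defer to stage two.
    if abs(weights[dominant].sum() - weights[minor].sum()) > T_W:
        return states, max((dominant, minor), key=lambda v: weights[v].sum())
    return states, None   # inexplicit view: run spatial-temporal analysis
```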

In the MMPF framework, the observation models need to give a confidence reflecting the likelihood of the target belonging to the corresponding view. The outputs of each view detector can potentially give the confidences of particles and yield the corresponding view confidence. But since they are inaccurate and differ from view to view, they cannot be used directly without post-processing. So we utilize the number of layers $l$ that a particle passes and the output of the last layer $conf_{det}$ to predefine the confidence of a particle.

$$x_l = \exp\big(a \times (conf_{det} - T^l_{det})\big) \quad (3)$$

$$p'_l = c^{\,l - l_{max}} \quad (4)$$

$$p_l = p'_{l-1} + \frac{x_l}{1 + x_l} \times (p'_l - p'_{l-1}) \quad (5)$$

Fig. 4. (a) Vehicles in the transition view have different view responses. (b) Some kinds of vehicles yield appearances similar to vehicles in other views. (c) Different positions cause appearances similar to vehicles in other views.

where $x_l$ is the exponential amplification of the difference between $conf_{det}$ and $T^l_{det}$, $T^l_{det}$ is the confidence threshold of the detector at layer $l$, and $a$ is a constant (set to 5 in our experiments). In (4), $p'_l$ is the basis confidence of layer $l$, $c$ is also a constant (set to 1.1), and $l_{max}$ is the total number of layers of the corresponding detector. $p_l$ in (5) is the redefined confidence.

The frontal view and side view detectors are trained in the same way with the same number of layers and the same detection rate, so their pass rates of positive samples in each layer are the same, from which we can see that the number of layers a particle passes is important for evaluating the particle. Since the layer numbers are discrete and the outputs of the detectors are inaccurate, integrating the two metrics to redefine the confidence is more appropriate than using either of them separately. After our redefinition, the confidence is normalized to [0, 1); the higher the layer a particle passes, the bigger its confidence. Fig. 3(b) shows the predefined confidences.
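As a worked illustration of equations (3)-(5), the following sketch computes the predefined confidence from a cascade response; the constants follow the text (a = 5, c = 1.1), while the `T_det` container is a hypothetical stand-in for the detector's per-layer thresholds.

```python
import math

def particle_confidence(l, conf_det, T_det, l_max, a=5.0, c=1.1):
    """Predefined particle confidence from equations (3)-(5).

    l        : number of cascade layers the particle passed (1..l_max)
    conf_det : output confidence of the last layer it reached
    T_det[l] : per-layer confidence thresholds of the detector
    """
    x_l = math.exp(a * (conf_det - T_det[l]))        # eq. (3)
    p_basis = lambda k: c ** (k - l_max)             # eq. (4): basis confidence of layer k
    # eq. (5): interpolate between the basis confidences of layers l-1 and l.
    return p_basis(l - 1) + x_l / (1.0 + x_l) * (p_basis(l) - p_basis(l - 1))
```

A particle that passes every layer with a strong final response has a large $x_l$ and a confidence approaching 1, while a particle rejected at an early layer stays near its basis confidence $c^{\,l - l_{max}}$.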

2) Spatial-Temporal Analysis: Although the MMPF is effective in most cases, it is likely to fail when targets reveal inexplicit views (Fig. 4(a)), which may lead to frequent view switches. What is more, some targets may confuse the MMPF when their appearances are ambiguous due to their types (Fig. 4(b)) or distance to the camera (Fig. 4(c)). To address these problems, spatial-temporal analysis is employed to perform the second-stage view selection, which smooths the view switching procedure so that the selected view coincides not only with the traffic flow but also with the view variation tendency.

During the spatial-temporal analysis, four different types of energy terms are explored to vote for the correct view: primary particles, velocity difference, historical views and neighboring targets.

Primary Particles. This is the number of confident particles, which reflects the likelihood of a target belonging to a view from another perspective. The energy term can be denoted as:

$$|P| \quad \big(P = \{p \mid Conf_p > T_c\}\big) \quad (6)$$

Velocity Difference. Since vehicles in different views have different moving directions in crossroads, the velocity difference can be used as an energy term. Take the side view as an example: the velocity along the x-direction is bigger than along the y-direction. We adopt the mean velocity over the most recent 10 frames as a target's velocity, because the velocity between two contiguous frames is inaccurate.

$$V_{Side} = |V_x| - |V_y| \quad (7)$$

$$V_{Frontal} = |V_y| - |V_x| \quad (8)$$

TABLE II
COEFFICIENTS OF THE ENERGY TERMS

Energy Term    Primary Particles   Velocity Difference   Historical Views   Neighboring Targets
Coefficient    α = 1/200           β = 1                 γ = 1/10           δ = 1/4

Historical Views. As temporal information, historical views are utilized to smooth the view variation tendency. In our experiments, we record each target's views over the last n frames (set to 10 in experiments), and use the number of side views $H_{Side}$ and the number of frontal views $H_{Frontal}$ as the temporal energy terms.

Neighboring Targets. Since the traffic flow is consistent at a certain time, a target's view is usually the same as its neighbors'. So the numbers of nearby targets with the same view, $N_{Side}$ and $N_{Frontal}$, are introduced into the spatial-temporal analysis as spatial information.

The composite energy function is defined as equation (9), whose coefficients are shown in Table II. Maximum likelihood estimation is used to select the dominant view.

$$U_v = \alpha \times P_v + \beta \times V_v + \gamma \times H_v + \delta \times N_v \quad (9)$$

As the second stage of view selection, the spatial-temporal analysis uses spatial and temporal information to help targets in inexplicit views obtain reliable views. The efficient fusion of the MMPF and the spatial-temporal analysis is capable of tracking multi-view vehicles by seizing their primary observations.
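A minimal sketch of this second-stage vote follows, assuming the per-view quantities defined above are available on a target object; the `Target` fields and the confident-particle threshold `T_c` are illustrative names, while the coefficients come from Table II.

```python
import numpy as np

ALPHA, BETA, GAMMA, DELTA = 1/200, 1.0, 1/10, 1/4   # Table II coefficients

def select_view(target, neighbors, T_c=0.5):
    """Second-stage view selection via the composite energy of equation (9)."""
    energies = {}
    for view in ("Side", "Frontal"):
        P = sum(c > T_c for c in target.particle_conf[view])    # eq. (6): confident particles
        vx, vy = np.mean(target.velocities[-10:], axis=0)       # mean velocity of last 10 frames
        V = abs(vx) - abs(vy) if view == "Side" else abs(vy) - abs(vx)   # eqs. (7)-(8)
        H = target.view_history[-10:].count(view)               # temporal term
        N = sum(nb.view == view for nb in neighbors)            # spatial term
        energies[view] = ALPHA * P + BETA * V + GAMMA * H + DELTA * N    # eq. (9)
    return max(energies, key=energies.get)
```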

C. Dual-layer Complementary Occlusion Handling

Besides the difficulty of selecting the corresponding vehicle view, occlusion between multiple vehicles is another tough problem. Occlusion can be divided into two different types: partial occlusion and full occlusion. Under partial occlusion, the detectors tend to drift due to their congenital deficiency at distinguishing different targets. To solve this problem, a dedicated vehicle model based on clustering is proposed to prevent responses from drifting. As for full occlusion and particular partial occlusions whose observation becomes unreliable or lost, a backward smoothing [10] process is adopted to handle them.

Taking advantage of the traffic scene, we propose a dedicated vehicle model based on clustering to solve partial occlusion effectively. The model fuses multiple cues, including position, size and moving trend, to label particles in order to prevent them from drifting. When one target is partially occluded by another, the particle filter of this target may fail because of response drifting: in the resample stage, some randomly resampled particles cover the other target and have high confidences, so the merged result drifts to the other target gradually and ultimately makes the particle filter fail. It is therefore necessary to label the high-confidence particles in one occlusion cluster before merging, to prevent responses from drifting. For this purpose, we adopt K-Means to cluster the confident particles, exploring the features of position, size and the difference of moving trend. We denote the feature vector of a particle as $(x_n, y_n, w_n, h_n, dv^i_{n,x}, dv^i_{n,y})$, where $x_n, y_n, w_n, h_n$ indicate its location and size, $dv^i_{n,x}, dv^i_{n,y}$ are the differences of moving trend in the x and y directions, and $i$ is the target id in the occlusion cluster. For example, for a particle belonging to object 1, $dv^1_{n,x}, dv^1_{n,y}$ represent the differences between the velocity from the target position in the last frame ($t-1$) to the position of the particle, and the velocity of the target. The differences can be formalized as equations (10) and (11); the smaller the differences are, the more likely the particle belongs to the target.

$$dv^i_{n,x} = \left| (x_n - x^i_{t-1}) - v_{t-1,x} \right| \quad (10)$$

$$dv^i_{n,y} = \left| (y_n - y^i_{t-1}) - v_{t-1,y} \right| \quad (11)$$

To accelerate convergence and increase accuracy, it is desirable to use the states of the targets in the occlusion cluster in the last frame ($t-1$) as the clustering centers ($dv^i_{n,x} = 0$, $dv^i_{n,y} = 0$) in the initialization stage. After clustering by K-Means, the obtained cluster centers are deemed the states of the targets and used to reinitialize the corresponding particle sets. If the overlap ratio of two merged targets is bigger than a threshold $T_{overlap}$, which indicates that one target tends to be fully occluded by the other, the second layer of occlusion handling is performed.
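The clustering step could be sketched as follows with scikit-learn's KMeans, under the assumption that each particle carries the id of the target it currently supports; the array layout and function name are illustrative, not the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def recluster_occluded(particles, owner, prev_states, prev_vels):
    """Relabel confident particles in an occlusion cluster via K-Means.

    particles   : (M, 4) array of [x, y, w, h], pooled over the occluded targets.
    owner       : (M,) int array, current target id i of each particle.
    prev_states : (K, 4) target states [x, y, w, h] at frame t-1.
    prev_vels   : (K, 2) target velocities [vx, vy] at frame t-1.
    Returns the K new cluster centers, taken as the targets' updated states.
    """
    # Moving-trend differences, equations (10)-(11), w.r.t. each particle's own target.
    dvx = np.abs((particles[:, 0] - prev_states[owner, 0]) - prev_vels[owner, 0])
    dvy = np.abs((particles[:, 1] - prev_states[owner, 1]) - prev_vels[owner, 1])
    feats = np.column_stack([particles, dvx, dvy])
    # Initialize centers at the previous target states with zero trend difference.
    init = np.column_stack([prev_states, np.zeros((len(prev_states), 2))])
    km = KMeans(n_clusters=len(prev_states), init=init, n_init=1).fit(feats)
    return km.cluster_centers_[:, :4]   # reinitialize each particle set from these
```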

To surmount full occlusion, a backward smoothing [10] process is adopted to retrack lost targets. When a target cannot get enough supporting particles, its track is buffered for future backward smoothing with newly collected observations. The procedure first finds matches between new targets and buffered targets based on their affinity (the rate of overlap), and then the Hungarian algorithm is employed to obtain the optimal match between the two sets.
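A minimal sketch of this matching step using SciPy's Hungarian solver; treating the "rate of overlap" as intersection-over-union of axis-aligned boxes is our assumption, and the box containers are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_buffered_tracks(new_boxes, buffered_boxes, min_affinity=0.3):
    """Optimally match new targets to buffered (occluded) tracks."""
    cost = np.array([[1.0 - iou(n, b) for b in buffered_boxes] for n in new_boxes])
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    # Keep only sufficiently overlapping pairs; the rest stay unmatched.
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_affinity]
```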

III. EXPERIMENTS

Experiments are carried out on videos collected from traffic surveillance cameras in crossroads (with camera adjustment, rain, snow and shadow) and on some real-world video data collected with a hand-held camera. The system runs at more than 22 fps on VGA size (640×480) video on an Intel Core Quad 2.66 GHz CPU with 4 GB RAM.

A. Experiment Settings

In our experiments, the offline frontal view detector is trained from 18567 samples with a normalized size of 24×24, while the side view detector is trained from 9814 samples with a normalized size of 48×24. The total confidence threshold ($T_W$) used to check whether a target's view is explicit is set to 5. For the dual-layer occlusion handling, if the overlap between two merged targets after particle clustering is bigger than 0.9 ($T_{overlap}$), the backward smoothing procedure is adopted.

B. Detection Performance Evaluation

The evaluation set contains 3334 images with manually labeled ground truth, covering 24800 multi-view vehicles captured from traffic surveillance videos under different weather conditions. We evaluate the detection performance with the tracking results, which reflect the precision of the tracking algorithm (detectors are used to search for new targets in part of the image pyramid and to offer confidences for particles).

Fig. 5. ROC curve of multi-view vehicle tracking results.

We compare our work (red curve) with two baseline methods: 1) Simple Integration: detect (same detectors) and track (particle filter) targets in the frontal view and the side view respectively; when a target reveals both views (overlap), the one with the bigger confidence is selected and the corresponding detector is used to track it afterwards. 2) Frontal + Side View: detect and track targets in the frontal view and the side view respectively with no post-processing. In Fig. 5, we can see that our method achieves a relatively higher detection rate than the other two methods at the same false alarm level while using the same detectors. We attribute this to the two-stage view selection, since it makes the MMPF seize the primary observations of targets and track them effectively.

C. Tracking Performance Evaluation

We adopt the same metrics for evaluating tracking performance as in [6][10]. These metrics are defined as follows. MT: number of Mostly Tracked trajectories; ML: number of Mostly Lost trajectories; Frmt: number of Fragmented trajectories; FAT: number of False trajectories; IDS: frequency of Identity Switches.

The video we use to evaluate tracking performance consists of 10002 frames at 640×480 resolution, containing frequent partial occlusions and intensive full occlusions. To evaluate the performance of the dual-layer occlusion handling, we compare our algorithm with a version without occlusion handling. From Table III, which gives the comparison results, we can see that the dual-layer occlusion handling achieves an improvement on almost all the metrics, especially on Frmt. We attribute this significant improvement to the dual-layer occlusion handling, since it provides progressive association for tracking occluded targets, which overcomes most of the fragments. The improvement in Frmt further increases the MT of our method. Fig. 6 gives some typical tracking results.

TABLE III
TRACKING COMPARISON

Algorithm                    GT    MT    ML   Frmt   FAT   IDS
Our method                   215   187   6    41     3     5
Without occlusion handling   215   167   6    129    5     7

Fig. 6. Typical tracking results (first row: complex background; second row: shadow and occlusion; third row: pedestrian disturbance).

IV. CONCLUSION

In this paper, we presented a robust multi-view vehicle detection and tracking algorithm for crossroads. It is a real-time and online processing system that deals with view changes and occlusions effectively. The two-stage view selection is efficient in fusing multiple detectors, while the dual-layer occlusion handling technique tackles both partial and full occlusions. Experiments under different weather conditions (snowy, sunny and cloudy) demonstrate the effectiveness and efficiency of our method.

ACKNOWLEDGMENT

This work is supported by the Beijing Educational Committee Program (YB20081000303).

REFERENCES

[1] D. Koller, J. Weber, and J. Malik, "Robust multiple car tracking with occlusion reasoning," in Eur. Conf. Comput. Vis., 1994.
[2] S. Kamijo, Y. Matsushita, K. Ikeuchi, and M. Sakauchi, "Occlusion robust tracking utilizing spatio-temporal markov random field model," in IEEE Int. Conf. Pattern Recognition, 2000.
[3] B. T. Morris and M. M. Trivedi, "Learning, modeling, and classification of vehicle track patterns from live video," IEEE Trans. Intell. Transp. Syst., vol. 9, pp. 425–437, 2008.
[4] P. Viola and M. J. Jones, "Robust real-time face detection," Int. J. Comput. Vis., vol. 57, pp. 137–154, 2004.
[5] C. Huang, H. Ai, Y. Li, and S. Lao, "High performance rotation invariant multiview face detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, pp. 671–686, 2007.
[6] B. Wu and R. Nevatia, "Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors," Int. J. Comput. Vis., vol. 75, pp. 247–266, 2007.
[7] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Int. Conf. Comput. Vis. Pattern Recognition, 2005.
[8] C.-H. Kuo and R. Nevatia, "Robust multi-view car detection using unsupervised sub-categorization," in Appl. of Comput. Vis., 2009.
[9] Q. Yu and G. Medioni, "Integrated detection and tracking for multiple moving objects using data-driven mcmc data association," in IEEE Motion and Video Computing, 2008.
[10] J. Xing, H. Ai, L. Liu, and S. Lao, "Multiple player tracking in sports video: a dual-mode two-way bayesian inference approach with progressive observation modeling," IEEE Trans. Image Processing, vol. 20, pp. 1652–1667, 2011.
