Computer Vision System

A computer vision system for the detection and

classification of vehicles at urban road intersections

Stefano Messelodi, Carla Maria Modena, Michele Zanin

ITC-irst

Via Sommarive 18, I-38050 Povo, Trento, ITALY

{messelod,modena,mizanin}@itc.it

ITC-irst Technical Report T04-02-07February 2004

Abstract

This paper presents a real-time vision system to compute trafficparameters by analyzing monocular image sequences from pole-mounted video-cameras at urban crossroads.The system is flexible with respect to road geometry and camera

position, permitting data collection from several monitored crosses. Ituses a combination of segmentation and motion information to localizeand track moving objects on the road plane, utilizing backgroundsubtraction and a feature-based tracking methodology. For eachdetected vehicle, the system is able to describe its path, to estimate itsspeed and to classify it into seven categories. The classification taskrelies on a model-based matching technique refined by a feature-basedone.Experimental results demonstrate robust, real-time vehicle

detection, tracking and classification over several hours of videos takenunder different illumination conditions. The system is presently undertrial in Trento, a one hundred thousand people town in northern Italy.

Keywords: Traffic Scene Analysis, Multiple Object Tracking,Background Estimation, Model Based Classification.

1 Introduction

Nowadays, surveillance video cameras are installed at many locations. Thosemounted along highways provide live monitoring of traffic situation to traffic

1

controllers. In the urban environment, cameras are primarily located atmajor street junctions: the operator at the traffic management center canusually connect to the camera network, select crossroads and scene, andobserve it to detect congestions and modify the signal timing, if necessary.In both cases, images are often published on the Internet to report trafficsituation to citizens.

The traffic control market has recently made available products thatcapture and process the video signal coming from cameras. The target is toautomatize tasks such as the detection of potentially dangerous situations,like vehicle stopped in a tunnel, the recognition of license plates in case oflaw violations, or the extraction of statistical traffic parameters.

The rapidly falling cost of image acquisition devices and the availabilityof cheap, yet powerful CPUs, has spawn new interest in the research ofcomputer vision techniques for the purposes of traffic monitoring and control([1][15]). The majority of the works are devoted to highway or suburbanenvironment, with few exceptions to intersection monitoring [13].

The system presented in this paper, called scoca - System for Countingand Classifying Automatically vehicles, aims at collecting traffic data forstatistical purposes. It constitutes an extension of an existing urban trafficmonitoring system endowed with standard, remote controllable, surveillancecameras, mounted on road-side poles located in several city junctions [16].Each camera (Figure 1) is connected to a local processing unit that capturesthe images and codes the video stream into mpeg2 format. The local unitsare connected through an optical fiber network to a central processor at theTraffic Management Center, where the compressed flows can be received,decompressed and visualized on monitors for remote control purposes.scoca resides on the central processor.

The system is able to detect vehicles as they move through the camerasfield of view, to track them and to classify each individual object in real-time into several categories: bicycle, motorcycle, car, van, lorry, urban bus,extra-urban bus. In particular, the discrimination of bicycles is one of themost powerful features of scoca with respect to existing systems, that aremainly devoted to motorized vehicles.

scoca is presently undergoing a field trial in the city of Trento, Italy,where the traffic managers are interested in the number of vehicles crossingthe intersections, distinguished into seven classes, their average speed andthe local Origin-Destination map through the road intersection (also calledTurning Movement Table). The latter, in particular, is not computable withstandard loop-based systems, where loops are electromagnetic loops as wellas virtual loops in some vision systems. Only area vision-based systems, like

2

LocalProcessingUnit

MPEG coding transmission

acquisition

optical fiber

Traffic Monitoring

TC pilot SCOCA

Figure 1: The extended traffic monitoring system (surveillance camera:courtesy of Comune di Trento).

scoca, can measure turning movements, i.e. the number of vehicles turningright, left or no-turning, measured separately from any approaching lane.

Several are the challenging problems: they range from low middle-level vision tasks, such as the detection of multiple moving objects in ascene and their tracking, to higher level analysis, like object classification.Difficulties stem from variable light conditions of the outside environment(from sunny to cloudy days, different weather, shadows presence), from themany, partially occluded, moving objects, from the plethora of vehicle makesand from pedestrians populating the traffic scenes.

This paper presents an overview of scoca along with the adoptedtechnical solutions. The paper is organized as follows. Section 2outlines scoca architecture and its role in the existing traffic monitoringsystem. Section 3 describes the off line operations necessary for systemworking. Section 4 presents the manager of the data collection sessions.Section 5 details some aspects of the core module of the system: movingobject detection and tracking through the scene, and their classification.Experiments and performance considerations are provided in Section 6.Finally, Section 7 presents conclusions and discusses future improvements ofthe system.

3

2 System outline

The structure of scoca as an extension of the Trento existing trafficmonitoring system [16] is illustrated by the diagram reported in Figure 2where scoca modules are:

the graphical interface scoca gui,

the off-line operation modules config / init,

the manager of the data collection booking session scheduler,

the vision based traffic data extractor,

the statistical descriptor that provides a concise description ofthe collected data.

scoca gui (Figure 3) supports an operator at the Traffic ManagementCenter (TMC) in the selection of a list of monitored road intersections andviews for traffic data collection. scoca has been designed in order to bemodular and flexible with respect to the junction topology and cameraparameters. For each selected view, a configuration and an initializationphase are needed to prepare the system for working. The former operationrequires as input some data about the acquisition device, while the latter adefinition of the area of interest. Both operations, explained in more detailsin Section 3, are launched once to compute and store information used ineach subsequent data collection session.

After the previous off-line operations, the TMC operator is able toschedule data collection sessions from the various cameras. For each cameraview, the operator inputs the starting time and the duration of the desireddata acquisition sessions. The sessions are organized as a queue of eventsto be served. The session scheduler controls the compatibility of theselected time intervals by managing an agenda of events.

The traffic data extractor loads the files generated by the configand init modules and receives the frames coming from the camera througha decompression module. The traffic data extractor is the core ofscoca and analyzes the video producing micro traffic parameters, i.e. dataregarding the single object moving through the scene. For each objectdetected and recognized as a vehicle, it appends a record of informationto a result file. Each record includes:

the enter and leave time in the camera field of view,

4

filesresult

SCOCA GUI

filesauxiliary

EXTRACTORTRAFFIC DATAMPEG

decoder

TC remotecontrol

GUI

pilotTC

TRAFFIC MANAGEMENT CENTER

DESCRIPTORSTATISTICAL

CONFIGINIT

SCHEDULERSESSION

Camera 1

Camera N

.

.

.

Figure 2: The scoca architecture as an extension of a pre-existent trafficmonitoring system. scoca modules are painted with a colored background.They are: scoca gui, config, init, traffic data extractor,statistical descriptor, session scheduler.

5

Figure 3: The scoca gui supporting the operator in system set-up and dataacquisition session scheduling.

the entrance and exit lanes,

the average speed,

the class the vehicle belongs to.

The resulting files can be post-processed off-line in order to obtain trafficdescription parameters, like volume or vehicle class percentages. scoca isendowed with a statistical descriptor module (Figure 4), that loadsthe result file of a given analysis session and provides the user with aconcise description of traffic parameters with graphics and tables. In thenext sections the most relevant modules of the system are described.

6

Figure 4: The statistical descriptor provides a concise description oftraffic for each analysis session.

7

3 System initialization

The configuration and initialization operations, performed off-line by theconfig and init modules, respectively, make the system flexible withrespect to junction geometry and camera set-up and enable it working inreal-time.

The configuration phase requires information about each selected cameraview from which the traffic data have to be computed. A camera view isdescribed in terms of intrinsic and extrinsic camera parameters. They areused to recover world coordinates of objects, which are necessary to computereal paths, sizes and speed of the moving objects. For each view (refer toFigure 5) the operator has to provide the approximate height H of thecamera from the ground plane, the approximate vertical angle of theoptical axis, the focal length of the lens, and the CCD sensor dimensions.Proper functioning of the system requires that:

H

AREA

XY

Figure 5: Requirements of the camera set-up for a proper functioning ofthe system: H greater that 7 meters, look down angle greater than 35,observed area at least 12 12 meters.

the height H of the camera be greater than about 7 meters;

the vertical angle of the optical axis be at least 35 with respectto the horizontal direction: cameras looking down provide a clearerdistinction among close vehicles, reducing the occlusion problem;

the focal length, together with the camera height, must guarantee that

8

the area observed by the camera is large enough to capture vehiclepaths (at least 12 12 meters).

These data are usually available to the TMC operator, or, if unknown,they can be estimated by means of a landmark based camera calibrationprocedure [17], in which the user has to identify some markers on the groundplane and provide their real measure in order to recover the perspectiveprojection matrix. Interactive calibration tools have been proposed inliterature (see [18]). The config module stores camera data and computesa preliminary background image for the view at hand. It logs in to thecamera local unit and requests a 20 seconds video sequence. The preliminarybackground image is computed as the median image over the sequence, i.e.each pixel takes the median of the values, in the correspondent location, ofevery frame of the sequence.

The purpose of the initialization phase is to pre-compute some view-dependent data in order to speed-up subsequent phases, most notably vehicleclassification. In this phase, the TMC operator specifies the portion of theimage to be monitored (typically only the region corresponding to the road ischosen), and selects, along the boundary of the region, a set of points whichdefine the lanes, i.e. the entry and exit gates of interest for the turningmovements computation of the road intersection at hand.

This can be easily done using a tool provided by the scoca gui, whichshows the early background computed by the config module and allowsthe operator to draw a closed polygonal line on it, by placing small bulletsalong the boundary (Figure 6). These data, the region of interest map andthe virtual gates map, are stored in files that are loaded and used during thedata extraction phase. Moreover, the init module computes two additionalmaps, called threshold map and model position map. The former is a mapof thresholds that depend on the geometric perspective, the latter is used inthe classification task, and its role will be explained in Section 5.2.1. Themodel position map is very large (up to 100 MByte) and requires a quite longtime to be computed (up to 1 hour on a 800Mhz Pentium r III processor),but it is crucial to allow the system to perform traffic analysis at frame rate.

4 The Session Scheduler

The session scheduler communicates with the scoca GUI and thecontroller of the camera monitoring system (TC pilot) by means of sockets.The role of the session scheduler is depicted in Figure 7. Through theinterface the operator books the desired session of analysis. Each session is

9

Figure 6: The operator can draw the region of interest by means of a userinterface and indicate on its border the extrema of the lanes.

defined by its starting time, duration, crossroads number and a predefinedcamera set-up (orientation and zoom). Traffic data may be collected onlyfrom configured and initialized views.

The scheduler controls the booking compatibility (in between time) ofeach added session with those already booked.

As the session starting time arrives, the session scheduler activatesthe connection to the selected camera and, through the TC pilot, commandsthe local processing unit to configure the camera according to the set-up parameters. Afterwards it activates the vision-based Traffic DataExtractor. As the session is completed, the session schedulercommunicates to the TC pilot to stop the sending of the video stream and torelease the camera, so that it can be controlled again by the traffic operators.The state of the analysis (success/failure) is reported in the agenda.

5 The Traffic Data Extractor

The traffic data extractor is the core of the scoca system. The aimof the module is to detect and track each object moving through the scene,and to compute its speed, path and class. It contains two main submoduleswhich work in parallel, as outlined in Figure 8.

The first one, called Detector and Tracker, analyzes the frames of thevideo sequence in order to locate objects passing through the field of viewof the camera, and to track them until they leave the observed scene. The

10

resultfiles

SESSION SCHEDULER

DATA EXTRACTORTRAFFIC

TC pilot

surveillance camera network

Figure 7: Role of the session scheduler in the scoca system. It managesthe sequence of data collection from the various cameras and acts as interfacebetween the traffic data extractor and the video-surveillance system.

output is a list of passed objects which contains, for each object, a richdescription of its transition in the scene.

The second process, called Object Parameter Extractor, analyses eachobject as soon as it is available in the passed object list, in order to estimateits real world path through the road intersection, in particular to determinethe entry and exit lanes, its average speed, and its category among thoseforeseen by the scoca system (bicycle, motorcycle, car, van, lorry, urbanbus, extra-urban bus) or reject it. The vehicle parameters are stored in aresult file which will contain, for each object recognized as passed vehicle,the following data:

scene entering time;

permanence time inside the scene;

vehicle category;

average speed;

11

filesauxiliary

class speed path

ObjectParameter Extractor

resultfile

Detector andTracker

moving objectsof

sequencevideo

bufferimage

circularobjectspassed

.

.

.

Figure 8: The traffic data extractor comprises two processes workingin parallel: Detector and Tracker and Object Parameter Extractor.

12

entering and exiting gates.

The Detector and Tracker and the Object Parameter Extractor blocksare detailed in Subsection 5.1 and 5.2, respectively.

5.1 Detection and tracking of moving objects

The goal of the tracking process is to maintain the identity of a movingobject over the sequence of frames. Multi-object tracking is a verychallenging research topic: the problems of single object tracking, i.e. objectdeformation, occlusion, illumination changes, and background changes, areincreased by the fact that multiple objects can touch and occlude eachother, enter and leave at the same time the camera field of view. Trackingis a necessary step in many applications (robotics, surveillance, human-machine interaction) and the literature presents many related algorithms.Object tracking methods are usually classified into, at least, four groups:model-based, region based, contour-based, and feature-based. Model-basedmethods [19] exploit a priori knowledge of the expected object shape.Region-based methods [10] track connected regions that roughly correspondto the 2D shape of objects using information extracted from the entireregion, such as motion, color, texture. Contour-based methods [20] trackthe contour of the object using deformable object models (active contours).Feature-based methods [21] track parts of the object (e.g. corners). Forscoca a hybrid method is proposed that relies on a region-based methodand a feature-based one.

Inputs of the detection and tracking module of scoca are the videosequence, at a rate of 25 frames per second, and some auxiliary filescomputed off-line: the early background image, the threshold map and theregion of interest map. The main steps of the object detection and trackingalgorithm are illustrated by the diagram in Figure 9.

The algorithm is a loop which, at each iteration, processes a new inputframe. It is analyzed alternately following two different branches at twodifferent temporal resolution: (a) every T iterations (in the system 5-7frames), the frame is processed to detect the presence of new objects in thescene and to accurately track objects detected in previous frames, calledactive objects. This process relies on the computation of the moving map.(b) In the other iterations, the frame is processed to perform a fast feature-based tracking and position updating of the active objects.

At 5 Hz an input frame is used to maintain an updated backgroundimage which is used for the computation of the moving map.

13

active objects

bufferimagecircular

passedobjects

moving mapextractionfeatures

object detection

leaving test

read nextframe (t)

backgroundupdating

trackingactive objects

(a) (b)test on t

Figure 9: Detector and tracker of moving objects in the scoca system.Following (a) the frame is processed to detect new objects entering in thescene and to track them with a region-based paradigm; following (b) theframe is processed to perform a fast feature-based tracking.

14

Once the active object list has been updated, the leaving test modulechecks it in order to determine if some objects have left the imaged scene,moving them to the passed object list.

Each object in the lists is represented by a sequence of object states,characterized by two different structures depending on the branch ofprocessing the frame undergoes.

(a) In case of moving map computation, the object state stores a framenumber identifier, the center of the object in the image, the convexhull of the object, the regions containing the object extracted from thecurrent frame and from the moving map;

(b) in case of feature tracking, the object state stores the centercoordinates and the displacement vector with respect to the previousframe.

In the following, each submodule is described in more detail.

Moving map computation - There are several methods for detectingmoving objects in sequences taken from a stationary camera [22]. They areusually categorized into frame by frame differencing, background subtraction,and optical flow methods. The last one has not been considered because ofthe real-time requirement of the overall system. Frame by frame differencingmethods rely on the subtraction of subsequent frames to identify themoving regions of the image [23]. Background subtraction methods performthe analysis of the difference between the current frame and a referencebackground image. This difference enhances static and moving objects withrespect to a constant background. This second approach, used in scoca,requires the estimation of a starting background image and its updating.If an accurate approximation of the background is available, this techniquepermits a better localization of objects with respect to the frame-by-framedifferencing. scoca makes use of the adaptive background updating schemedescribed in [24]. It offers the advantage of taking into account slow changesin environmental light conditions. The initial estimate of the backgroundimage is performed as in the configuration step, using the median imageof a short frame sequence. Background updating is performed at 5 Hz byKalman filtering the pixels in the same location through the time. Theabsolute difference between the current frame and the updated backgroundimage is the moving map. If no moving objects are detected with highconfidence, the background image is replaced by the current frame.

15

Object detection - Straightforward thresholding of the moving mapis a fragile operation, because it may produce many spurious pixels, dueto camera noise and to the compression process that heavily influences themoving map. In scoca, objects are detected by binarizing the gradient ofthe moving map, which makes the process more robust to minor shadowsand noise and does not require the choice of an adaptive threshold. Thedetection process follows these steps:

1. gradient computation of the moving map;

2. binarization with a fixed threshold;

3. noise filtering;

4. connection of nearby components, using morphological operations;

5. filtering of the components.

The result is a list of moving objects, each one described by the followingfeatures: support (represented by its run-length coding), center of mass, andconvex hull.

Each object m in the moving object list is then compared with the activeobject list in order to determine if it corresponds to a previously trackedobject, or if it represents a new object that has just entered the scene. Theconvex hull of m is matched with those of the active objects shifted in thepositions estimated from the previous tracking steps. If m does not overlapany active object, a new object descriptor is created containing a singleobject state, and it is added to the active object list. Otherwise, m is markedas belonging to the matched active object. Finally, a new object state isappended to the descriptor of each active object composed of its movingobjects. Figures 10 and 11 illustrate an example of this process.

At this stage a detected object may correspond to:

a vehicle;

noise: it can be detected and removed if in subsequent frames theobject disappears, or appears in locations which are not compatiblewith the previous one. If not removed, it is likely to be detected in theclassification step;

part of a vehicle: it is possible that, in the subsequent frames, it will beconnected to rest of the proper vehicle. Otherwise, it will be probablydiscarded in the classification step;

16

(c)

(a) (b)

(d)

Figure 10: Object detection: (a) current frame; (b) updated background;(c) moving map; (d) gradient of the moving map.

17

(e) (f)

(g) (h)

Figure 11: (e) edge of the moving map; (f) blob of active objects (inlight) and their expected positions (in black); (g) expected blobs and movedcomponents; (h) detected objects: two active objects and one new enteringobject.

18

a group of vehicles: it is considered as a single item and the detectionof each single vehicle is left to the classification step. This case, whichis typically caused by occlusions or the presence of strong shadows,represents the most critical situation.

Feature extraction and objects tracking - As mentioned above,tracking is performed in two alternative ways. The object detection moduleperforms a region-based tracking checking the overlap between the currentobjects and the active objects, in the expected positions. This task isquite expensive and for this reason it is performed only every T frames.Considering that vehicles have a rigid shape and that their movement ispredictable, we adopt a feature-based method that provides less accuratebut much faster tracking between two successive applications of the region-based method.

To implement this method, the system maintains a copy Fp of the gray-level frame that preceded the current one, Fc. For each active object O, aset of small rectangular windows is selected in the subregion of the imageFp corresponding to the object. The feature selection criteria are:

maximize the presence of both horizontal and vertical edges in thewindow;

forbid windows to be too close;

select at most Nw windows (5 in the working system).

Such windows are then searched, in Fc, near their expected position asderived from the last displacement stored in the descriptor of O. The resultis the relative displacement between the position of the windows in Fp andthose in Fc that produce the minimum L1-distance. This result is stored in anew object state which is appended to the object descriptor of O (Figure 12).

The size of the searching area depends on the maximum expecteddisplacement, i.e. on the vehicle velocity, the acquisition set-up geometryand the processing rate. The application of this tracking step at every frameallows the system to properly work with a small searching area.

This approach avoids tracking the entire object region and permits torecover the object in case of partial occlusion. It showed to be robust to smallchanges in object size, orientation and even to changing lighting conditions.A matching score is also provided for each feature, which can be useful toidentify situations like occlusions or the exiting of vehicles from the scene.

Leaving scene test - This module examines all active objects toestablish if they have been found in the current frame; if it is not the case,

19

(a) (b)

Figure 12: Feature tracking of active objects: (a) windows of selectedfeatures at time t 1; (b) searching areas - greater squares - and matchedfeatures at time t.

the corresponding object is moved into the passed object list. Passed objectsare the output of the Detector and Tracker module and are at disposal forthe Object Parameter Extractor module.

5.2 Object parameter computation: class, speed, path

The Object Parameter Extractor module works in parallel with the Detectorand Tracker one. It picks each object from the passed objects list as soon asit is available. A target of the module is to determine if the consideredobject is a vehicle or not and, consequently, to update the counter ofvehicles traveled across the intersection. Furthermore, the module aimsat determining the class the vehicle belongs to, its average speed in crossingthe road intersection and, finally, the covered path, in order to populate theTurning Movement Table of the intersection. The main task is classification,upon which the computation of the remaining parameters relies.

5.2.1 Classification

The aim of the classification phase is to determine which class a certainobject belongs to, if any, or to discard it.

The classifier takes as input a passed object, that is represented by itsobject descriptor. The list of states of the object descriptor is firstly analyzedin order to check if the positions of the object are consistent with a single

20

vehicle motion. If not, the object is labelled as noise and discarded.Otherwise, the system classifies the object by assigning it to one of the

following categories: bicycle, motorcycle, car, van, lorry, urban bus, extra-urban bus, unknown. The classification returns a confidence factor in theinterval [0,1]. The scoca classifier works at two levels: a model-basedclassification level and a post-classification level, as described in the followingparagraphs.

Model-based classification. This classification relies on a set of threedimensional models, each one providing a coarse description of the shapeof different groups of vehicles. The main difficulty faced by a model-basedclassification system is the need to generalize the appearances of objectsbelonging to the same class. Supplying the system with an exact model foreach different kind of vehicle is not feasible. On the other side, a single modelcan not capture the variability of what we classify as car. For this reasonwe have introduced three different models to describe the car category. Asingle model is used for both bicycle and motorcycle categories. The 3Dmodels are shown in Figure 13, while Table 1 lists the association of thevehicle classes with the utilized models.

class model

bicycle cyclemotorcycle cyclecar small-car, car, minibusvan sm-truck-o, sm-truck-curban bus busextra-urban bus buslorry buspedestrian pedestrian

Table 1: Classes of objects and their describing models.

The pedestrian model has been introduced to capture moving people,enabling the system to discriminate them from other vehicles (typicallybicycles).

The descriptor of the object to be classified contains informationabout the history of its movement through the scene: the image portionof the object every T frames, called object view, its difference fromthe background, its position expressed in image coordinates, and its

21

Figure 13: The 3D models used for the first classification level.

displacement vector with respect to the previous frame.The goal is to consider the projection onto the image plane of each

3D model, each one placed on the monitored area in all possible locationsand orientations, and to compare them with the convex hull of theobject expressed in image coordinates. However, this direct comparisonis impracticable under the real-time constraint. To deal with this problem,the number of comparisons is reduced by pre-computing the projection ofeach model throughout the scene, accordingly with the camera configurationparameters. These data are computed in the off-line initialization phase andare stored in the model position map relative to the scene under analysis.The map is accessed using as indexes the position of the convex hull and theorientation of the object in the image plane according to its displacement;the output is a list of possible 3D models complete with position andorientation on the road plane.

The convex hull of the object is matched against the projection of all themodels present in the candidate list and the best score for each 3D model isstored. The matching score S is a number in [0, 1] computed by measuringthe overlap area between the convex hull of the object Och and that of the

22

0.310.860.770.820.780.820.380.13PEDESTRIAN

CYCLE

CAR

BUS

MINIBUS

SMALLCARSMTRUCKOSMTRUCKC

0.250.740.71

0.710.720.700.02

0.75

0.32

0.920.930.830.940.870.08

0.950.620.79

0.720.760.770.310.46

0.83

0.17

0.680.840.76

0.000.76

0.03

0.850.18

0.690.880.840.940.000.03

0.96

score

Overall

Figure 14: Model-based classification: each view of the object has aclassification score for each model. The model with the best overall score isthe output of the first classification level: in this case minibus.

projected model Mb with respect to their union:

S = Area(Och Mb)/Area(Och Mb).

If the best matching score is too low, the object is further analyzed in orderto determine if it corresponds to two or more vehicles, partially occludingeach other. The result of this analysis is a list of hypotheses, each onecontaining a set of models along with their position and orientation, withan associated multiple matching score.

Once all the object views in the object descriptor are classified, an overallclassification score is computed for each 3D model and for each possiblemultiple vehicle hypothesis: this step takes into account if the vehicle iscompletely, or only partially, contained in the visual field.

If the best overall score is below a given threshold, then the object isclassified as unknown; otherwise the 3D model, or the set of 3D models, thatgives rise to the best score is taken as result of the first classification level.An example is reported in the table of Figure 14.

Post-classification. The second classification level aims at assigning avehicle class to the object, rather than a model. If the best matched3D model corresponds to a single vehicle class (see Table 1), then theobject is simply assigned to that class, otherwise a specialized classificationstep is executed. In the scoca environment, only two of the 3D modelscorrespond to multiple vehicle classes: the cycle model, that covers bicycleandmotorcycle classes, and the bus model that covers urban bus, extra-urban

23

Figure 15: Motorcycles and bicycles with orthogonal movement directions:the vehicle appearance changes greatly in dependence on the relative viewangle.

bus and lorry classes. Two features based classifiers are then introduced,devoted to the discrimination between bicycle and motorcycle, and amongurban bus, extra-urban bus, and lorry. The used features are extracted fromthe object views stored in their descriptors.

The motor-bicycle classifier takes into account some visual featuresextracted from each object view and from the difference with thebackground, and the position and orientation on the road plane of the cyclemodel that matched the object. In order to cope with the great differencein the apparent shape of the vehicle depending on the movement direction(Figure 15), the classification algorithm considers two cases: the movementdirection is closest to the direction of the optical axis of the camera, or toits normal direction

In the first case, the vehicle presents a (near) frontal or back viewand the basic idea is to exploit the fact that, typically, bicycle tyres havethickness sensibly lower than those of motorcycles. This feature is takeninto account by analyzing the shape of a profile computed on the imageregion corresponding to the wheel of the vehicle closest to the camera. In

24

the second case (lateral view), the underlying assumption is that the imageregions corresponding to the wheels of a bicycle have brightness levels mostsimilar to the background with respect to the same regions of a motorcycle.The algorithm computes the average value in the background differenceimage over the wheels regions and compares it to a threshold to distinguishbetween the two classes.

The bus-lorry classifier is mainly based on the fact that urban and extra-urban bus, in the scoca application, have characteristic colors: orangeand azure. The algorithm extracts the color information from the objectviews. The color histogram is analyzed and compared with a set of colorpattern extracted from examples of urban and extra-urban bus. If there isa dominant color that matches one of these reference colors the vehicle isclassified accordingly, otherwise it is considered as a lorry.

5.2.2 Other parameters

For each detected and classified vehicle, the system computes its path andspeed. Data computed in the configuration and initialization phases areused to perform this task.

Vehicle count For a given session, it is the number of moving objectsclassified as vehicles.

Path computation The object descriptor contains the position of thevehicle in the image for every object state. This set of positions areback projected onto the road plane to give rise to the vehicle real worldpath, represented by a piecewise linear trajectory. Information storedin the virtual gates map during the initialization phase (Section 3) isnow used to compute a polygonal line on the road surfaces whose sidesare associated to a corresponding virtual gate. The sides of the polygonthat intersect the vehicle path determine the input and output gates;this contributes to the population of the Turning Movement Table ofthe crossroads.

Speed estimation It takes into account the so called parallax effect, i.e.for the same displacement measured on the image, objects close tothe camera undergo a larger translation than objects far from it.The object displacement vectors, stored in the object descriptor, areconverted to real world coordinates and added up to compute theglobal displacement. Knowing the number of frames the object stayedin the scene and the input frame rate, the average vehicle speed isstraightforwardly estimated.

25

6 Performance evaluation

The system works in real-time at 25 frames per second (possibly, with ashort delay due to the classification task, which is performed after a vehiclehas completed its path), including the video decoding process.

The detection, tracking and classification algorithms implemented in thesystem have been applied to a variety of video sequences, taken from differentcrossroads and with different field of views.

A thorough testing of the system has been realized on sequences acquiredfrom cameras located at two high traffic urban junction. Sequences wereacquired at different daytime hours and under different weather conditions.

Data test. The evaluation of the single modules and of the overall systemwas performed on three sequences:

Sequence1 and Sequence2

1. Sequence1 is an mpeg coded sequence with 21 000 frames (14 minutes)having dimension W H = 352 288 pixels (acquisition date/time:May 17, 1999 - 7.40 am).

2. Sequence2 is an mpeg coded sequence of the same cross and view with20 125 frames (13 minutes and 25 secs) (acquisition date/time: May17, 1999 - 7.55 am).

3. operator information provided during the configuration andinitialization phases of the view:

height of the camera: 7.8 mt

look vertical angle of the optical axis: 40 degrees;

focal length: 4 mm;

a region of interest with 4 interesting gates (usual traffic flow: 2in and 2 out gates). Vehicles entering from gate g4 may travelout through gate g1 or g2, while those entering from gate g5 maytravel out through gate g1.

Sequence3

1. a jpeg coded sequence with 30 000 frames (20 minutes) havingdimension W H = 352 288 pixels (acquisition date/time: May22, 2003 - 11.00 am).

26

Figure 16: Example of two frames in Sequence3.

2. operator information provided during the configuration andinitialization phases of the view:

height of the camera: 7.8 mt

look vertical angle of the optical axis: 53 degrees;

focal length: 4.9 mm;

a region of interest with 6 interesting gates (usual traffic flow: 2in gates g3, g4, and 2 out gates g1, g6). Pedestrian across thestreet through gates g2 and g5

Figures 10 (a), 12 and 15 are extracted from Sequence1 or Sequence2.Figure 16 are two frames of Sequence3. Gates are illustrated in Figure 17.

Ground-truth. A labelling of the sequences obtained by visual inspectionhas been performed by means of a Tcl/Tk graphical tool, dividing thevehicles in the considered classes and indicating their paths. Pedestrian andgroups have been labelled as other objects. The manual labelling providedthe data is reported in Table 2.

Performance on vehicle counting. In the first sequence, the systemdetermines 595 as the number of passed objects and classifies 573 of themas vehicles and 22 as non-vehicles. This corresponds to a global error of-3.1% (573-591 = -18 vehicles). The output and the error analysis, reportedin Table 3 shows that there are but 10 false alarms, due to noise detected asvehicles, 20 non-vehicles confused as vehicles, and 48 miss vehicle detectionswith respect to the 591 passed vehicles.

27

Figure 17: Gates of the initialization of Sequence1 and Sequence2 (up-left) and Sequence3 (up-right) and below their Turning Movement Tablesproduced by the scoca system.

In the second sequence, the system determines 459 vehicles and 21 otherobjects (pedestrian and unknown objects). This corresponds to a globalerror of -2.8% (-13 vehicles counted). Table 3 shows that there are but 6false alarms due to noise, 9 non-vehicles confused as vehicles, and 26 missvehicle detections with respect to the 472 passed vehicles.

In the third test sequence, the system determines 224 vehicles and 33other objects. This corresponds to a global error of -14.5% (-38 vehiclescounted). Table 3 shows that there are 6 false alarms due to noise, 10 non-vehicles confused as vehicles, and 52 miss vehicle detections with respect tothe 262 passed vehicles.

Globally on the three sequences, the system determines 1256 vehicles and76 other objects. This corresponds to a global error of -5.2% (13251256 =69 vehicles counted). Table 3 shows that there are 22 false alarms due tonoise, 39 non-vehicles confused as vehicles, and 126 miss vehicle detectionswith respect to the 1325 passed vehicles. The global reliability of the vehiclecounter is 95.1% in scenes where moving objects other than vehicles arepresent.

In the Table, False detections are noise detected as vehicle, whileconfusions refers to objects moving through the scene other than vehicles

28

class Sequence1 Sequence2 Sequence3 Total

bicycle 23 14 24 61motorcycle 111 49 10 170

car 435 375 200 1010van 6 25 22 53

urban bus 5 5 0 10extra-urban bus 7 1 0 8

lorry 4 3 6 13

total vehicles 591 472 262 1325

other objects 30 14 44 88

total objects 621 486 306 1413

Table 2: Ground-truth of the three test sequences; passed vehicles have beencounted and classified. In the other objects category are included movingobjects such as single pedestrian and pedestrian group.

but classified by the system as vehicles. Miss detection indicates the numberof missed vehicle detections with respect to the ground-truth. Countingerror is the difference in percentage between the output of the system andthe ground-truth data. False alarm rate is the percentage of false alarmsin vehicle detection produced by the system with respect to the numberof vehicles of the output. Reliability refers to the number of correctlydetected vehicles with respect to the system output. Miss detection rateis the percentage of missed vehicle detections with respect to the ground-truth.

By analysing the error in counting output, we observe that missingdetections are mainly due to occlusions, while false alarms are due to noise.

Performance on vehicle classification. The moving objects detectedby the system are classified into seven vehicular classes, or labelled as non-vehicles (single pedestrian or unknown). For Sequence1 the classificationoutput of the system compared with the ground-truth data is reportedin Table 4, while the classification performances are listed in Table 5.Analogously, for Sequence2 and Sequence3, outputs of the system arereported in Tables 6 and 8, while classification performances are in Tables 7and 9, respectively.

The output and the analysis on the whole video data are sumarized inTables 10 and 11.

29

Sequence1 Sequence2 Sequence3 Total

system output 573 459 224 1256ground-truth 591 472 262 1325

false detections 10 6 6 22confusion 20 9 10 39correct 543 444 208 1195

miss detections 48 26 52 126

counting error -3.1% -2.8% -14.5% -5.2%false alarm rate 5.2% 3.3% 7.1 % 4.9 %

reliability 94.8% 96.7% 92.9% 95.1%miss det. rate 8.1% 5.5% 19.8% 9.5%

Table 3: Output of the vehicle counter on the three sequences and dataanalysis.

Classification errors are due to: partial occlusions, vehicles with theircast shadows, and presence of pedestrian groups. Partial occlusions andcast shadows cause a greater than real object silhouette detection, thereforeproviding an overestimation in the lorry detection.

Test on specialized classifier. A separate test has been conducted tomeasure the performances of the second level bicycle/motorcycle classifier.Using a test set of 189 vehicles (45 bicycles and 144 motorcycles) extractedfrom two video sequences acquired at two different junctions, the classifiersprovides an average error rate of 12.2%. Further experiments are in progressusing a classification technique based on Support Vector Machines which usessimilar features to distinguish between the two classes (see [25]). Preliminaryresults provide error rates below 6.7%.

7 Conclusions and future work

We have presented a video-based crossroads monitoring system that is ableto extract and collect traffic data in real-time, by analysing sequencesacquired from a monocular camera. The system works in real-time,counts vehicles, distinguishes them into seven categories, bicycles included,determines entering and exiting lane and estimates the average speed.

Preliminary tests show that rain is not a problem for the system.The system cannot work under night illumination, because of vehicular

30

bike motor car van ubus ebus lorry non-v det. GT

bike 14 6 2 0 0 0 0 0 22 23motor 2 91 2 1 0 0 1 0 97 111car 0 0 388 13 0 0 5 0 406 435van 0 0 0 4 0 0 2 0 6 6ubus 0 0 0 0 5 0 0 0 5 5ebus 0 0 0 0 0 3 0 0 3 7lorry 0 0 0 0 0 0 4 0 4 4other 6 2 8 2 0 0 2 3 23 30

FD 0 1 1 3 0 2 3 19 29Total 22 100 401 23 5 5 17 22 595

Table 4: Output of the classifier for Sequence1. Each column reports theoutput of the classifier with respect to each input. The GT column containsthe ground-truth data. FD (False Detections) indicates noise detected asobjects and passed to the classifier. In bold font are the correct classificationsfor each vehicular class.

vehicle class. global false al. miss al. reliabilityclass rate rate rate rate

bicycle 63.6 60.9 0.0 4.3 63.6motorcycle 93.8 82.0 1.0 12.6 91.0

car 95.6 89.2 0.3 6.7 96.8van 66.6 66.6 13.0 0.0 17.4

urban bus 100.0 100.0 0.0 0.0 55.6extra-urban bus 100.0 42.9 40.0 57.1 60.0

lorry 100.0 100.0 17.6 0.0 23.5

total 93.7 86.1 1.8 8.1 88.3

Table 5: Classification performances for each vehicle class of Sequence1.The columns contain respectively: classification rate (number of correctclassifications wrt number of detection), global rate (number of correctclassifications wrt ground truth), false alarm rate (number of false alarmswrt output), miss alarm rate (miss objects wrt ground truth), reliability(number of correct classifications wrt output).

31



FD 0 1 1 2 0 0 2 16 22Total 13 45 352 26 5 1 17 21 480

Table 6: Output of the classifier for Sequence2. Each column reports theoutput of the classifier with respect to each input. The GT column containsthe ground-truth data. FD (False Detections) indicates noise detected asobjects and passed to the classifier. In bold font are the correct classificationsfor each vehicular class.



car 96.4 91.5 0.3 5.1 97.4van 71.4 60.0 7.7 16.0 57.7


lorry 100.0 100.0 11.8 0.0 17.7

total 92.6 87.5 1.3 5.5 90.0

Table 7: Classification performances for each vehicle class of Sequence2.

32



FD 0 0 2 2 0 0 2 13 19Tot 11 19 162 18 1 0 13 33 257

Table 8: Output of the classifier for Sequence3. Each column reports theoutput of the classifier with respect to each input. The GT column containthe ground-truth data. FD (False Detections) indicates noise detected asobjects and passed to the classifier. In bold font are the correct classificationsfor each vehicular class.



car 95.6 76.5 1.2 20.0 94.4van 42.1 36.4 11.1 13.6 44.4

urban bus - - 0.0 - 0.0extra-urban bus - - - - -

lorry 40.0 33.3 15.4 16.7 15.4

total 83.3 66.8 2.7 19.9 78.1

Table 9: Classification performances for each vehicle class of Sequence3.

33



FD 0 2 4 7 0 2 7 48 70Tot 46 164 915 67 11 6 47 76 1332

Table 10: Output of the classifier for the three sequences. Each columnreports the output of the classifier with respect to each input. The GTcolumn contain the ground-truth data. FD (False Detections) indicatesnoise detected as objects and passed to the classifier. In bold font are thecorrect classifications for each vehicular class.



car 95.9 87.5 0.4 8.7 96.6van 58.7 50.9 10.5 13.2 40.3


lorry 75.0 69.2 14.9 7.7 19.1

total 91.5 82.8 1.7 9.5 87.3

Table 11: Classification performances for each vehicle class of the threesequences.

34

lights on the road surface, having a major negative impact on the currentalgorithm for object detection based on background subtraction technique.However, by analyzing the intensity of the background image, the systemrecognizes that it is not in working conditions and autonomously suspendsdata extraction.

Shadows, reflections, occlusions, and pedestrian groups are the mainsources of error impacting on counting and classification performance.Despite these disadvantages, the system has proved to be useful in theestimation of traffic parameters at urban road intersections and it is used ina real workflow on a daily basis.

First results in object-detection and object-classification areencouraging, albeit further efforts appear to be necessary.

Future work

Separating a vehicle from its shadow remain a challenging problem [26].We are currently investigating alternative techniques for shadow elimination.

Groups of pedestrian and uninteresting moving objects have to beidentified and excluded from the traffic statistics.

Further investigations are needed to improve the robustness of the systemin particular when dealing with occluded vehicles.

Acknowledgments

This work was performed in cooperation with the Comune di Trento,Servizio Reti. We thank ing. Luca Leonelli for testing the system, reportingbugs, and suggesting improvements.

References

[1] R. Cucchiara, M. Piccardi, A. Prati, and N. Scarabottolo. Real-timeDetection of Moving Vehicles. In 10th International Conference onImage Analysis and Processing, pages 618623, Venice, Italy, September1999.

[2] M.Y. Siyal and M. Fathy. A neural-vision based approach to measuretraffic queue parameters in real-time. Pattern Recognition Letters,20:761770, 1999.

[3] J.M. Ferryman, S.J. Maybank, and A.D. Worrall. Visual surveillance formoving vehicles. International Journal on Computer Vision, 37(2):187197, June 2000.

35

[4] M. Haag and H.H. Nagel. Incremental recognition of traffic situationsfrom video image sequences. Image and Vision Computing, 18(2):137153, January 2000.

[5] R.J. Morris and D.C. Hogg. Statistical models of object interaction.International Journal of Computer Vision, 37(2):209215, June 2000.

[6] J. Badenas, M. Bober, and F. Pla. Segmenting traffic scenes fromgray level and motion information. Pattern Analysis and Applications,4(1):2838, 2001.

[7] C. Setchell and E.L. Dagless. Vision-Based Road-Traffic MonitoringSensor. IEE Proceedings-Vision Image and Signal Processing,148(1):7884, February 2001.

[8] A. Cavallaro, O. Steiger, and T. Ebrahimi. Multiple Video ObjectTracking in Complex Scenes. In ACM Multimedia Conference, pages523532, Juan Les Pins, France, 2002.

[9] J. Kato, T. Watanabe, S. Joga, J. Rittscher, and A. Blake.HMM-Based Segmentation Method for Traffic Monitoring Movies.IEEE Transactions on Pattern Analysis and Machine Intelligence,24(9):12911296, September 2002.

[10] S. Gupte, O. Masoud, R.F.K. Martin, and N.P. Papanikolopoulos.Detection and Classification of Vehicles. IEEE Trans. on IntelligentTransportation Systems, 3(1):3747, March 2002.

[11] A. Yoneyama, C. Yeh, and C.J.J. Kuo. Highway traffic analysiswith vision-based surveillance systems. In SPIE Symposium on VisualInformation Processing XII, Orlando, Florida, USA, April 2003.

[12] K.Stubbs, H. Arumugam, O. Masoud, C. McMillen, H. Veeraraghavan,R. Janardan, and N. Papanikolopoulos. A Real-Time Collision WarningSystem for Intersections. In 13th Annual Meeting on IntelligentTransportation Systems America, Minneapolis, MN, USA, May 2003.

[13] H. Veeraraghavan, O. Masoud, and N. Papanikolopoulos. Computervision algorithms for intersection monitoring. IEEE Trans. onIntelligent Transportation Systems, 4(2):7889, June 2003.

[14] M. Zanin, S. Messelodi, and C.M. Modena. An Efficient Vehicle QueueDetection System Based on Image Processing. In 12th International

36

Conference on Image Analysis and Processing, pages 232237, Mantova,Italy, September 2003.

[15] V. Kastrinaki, M. Zervakis, and K. Kalaitzakis. A survey of videoprocessing techniques for traffic applications. Image and VisionComputing, 21(4):359381, April 2003.

[16] L. Leonelli. Sistema di regolazione centralizzata del traffico a Trento.In P. Simonetti A. Peretti, editor, Convegno Nazionale Traffico eAmbiente, pages 10651076, Trento, Italia, Febbraio 2000.

[17] R. M. Haralick. Determining camera paramaters from the perspectiveprojection of a rectangle. Pattern Recognition, 22(3):225230, 1989.

[18] A.D. Worrall, G.D. Sullivan, and K.D. Baker. A simple, intuitivecamera calibration tool for natural images. In 5th British MachineVision Conference, pages 781790, 1994.

[19] D. Koller, K. Daniilidis, and H.-H. Nagel. Model-Based Object Trackingin Monocular Image Sequences of Road Traffic Scenes. InternationalJournal of Computer Vision, 10(3):257281, 1993.

[20] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, andS. Russell. Towards robust automatic traffic scene analysis in real-time. In 12th Int. Conference on Pattern Recognition, pages 126131,Jerusalem, Israel, October 1994.

[21] D. Beymer, P. McLauchlan, B. Coifman, and J. Malik. A Real-time Computer Vision System for Measuring Traffic Parameters. InComputer Vision and Pattern Recognition Conference, pages 495501,Puerto Rico, June 1997.

[22] S. Andra, O. Al-Kofahi, R.J. Radke, and B. Roysam. Image ChangeDetection Algorithms: A Systematic Survey. (submitted to) IEEETransaction on Image Processing, 2003.

[23] P.L. Rosin. Thresholding for Change Detection. In 6th InternationalConference on Computer Vision, pages 274279. IEEE ComputerSociety Press, 1998.

[24] M. Boninsegna and A. Bozzoli. A Tunable Algorithm to Update aReference Image. Signal Processing: Image Communication, 16(4):353365, November 2000.

37

[25] G. Cattoni. Utilizzo di Support Vector Machine per la classificazionedi veicoli basata su caratteristiche visive: sviluppo di classificatoredi motociclette/biciclette. Masters thesis, Universita` degli Studi diTrento, Corso di Laurea in Ingegneria delle Telecomunicazioni, AA2002-2003.

[26] A. Prati, I. Mikic, M.M. Trivedi, and R. Cucchiara. Detecting MovingShadows: Algorithms and Evaluation. IEEE Transaction on PatternAnalysis and Machine Intelligence, 25(7):918923, 2003.

38

Documents

Computer Vision System