

J. Vis. Commun. Image R. 20 (2009) 428–437


Moving object detection in the H.264/AVC compressed domain for video surveillance applications

Chris Poppe *, Sarah De Bruyne, Tom Paridaens, Peter Lambert, Rik Van de Walle
Department of Electronics and Information Systems, Multimedia Lab, Ghent University – IBBT, Gaston Crommenlaan 8, B-9050 Ledeberg-Ghent, Belgium


Article history: Received 8 October 2008; Accepted 20 May 2009; Available online 28 May 2009

Keywords: Moving object detection; Compressed domain analysis; Video surveillance; Object segmentation; MPEG video; Block-based video coding; H.264/AVC; Signal processing

1047-3203/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.jvcir.2009.05.001

* Corresponding author. E-mail address: [email protected] (C. Poppe).

In this paper a novel method is presented to detect moving objects in H.264/AVC [T. Wiegand, G. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology 13 (7) (2003) 560–576] compressed video surveillance sequences. Related work within the H.264/AVC compressed domain analyses the motion vector field to find moving objects. However, motion vectors are created from a coding perspective, and additional complexity is needed to clean the noisy field. Hence, an alternative approach is presented here, based on the size (in bits) of the blocks and transform coefficients used within the video stream. The system is restricted to the syntax level and achieves high execution speeds, up to 20 times faster than the related work. To demonstrate the detection quality, a detailed comparison with related work is presented for different challenging video sequences. Finally, the influence of different encoder settings is investigated to show the robustness of our system.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

Video surveillance is proliferating worldwide, and the need for smart autonomic video analysis systems that provide fast and accurate solutions increases. Implementations of such surveillance systems start with the detection and segmentation of objects of interest in image sequences. The outcome of this step is used in other processing modules, e.g., to track objects and to perform behavior analysis. Therefore, it is important to achieve very high accuracy in the detection, with the lowest possible false alarm rates and detection misses. A practical video surveillance scenario includes video compression to reduce the used bandwidth and storage. Consequently, if one wants to analyze the captured images to find moving objects, a decoding step is needed. To avoid this decoding step and to reuse the work done during the encoding, several efforts have been made to detect moving objects directly on the compressed video stream. Several algorithms with good performance have been proposed to analyse video content in the MPEG compressed domain [2]. With H.264/AVC [1], a new and effective video coding standard was introduced by the Moving Picture Experts Group (MPEG) and the Video Coding Experts Group (VCEG), containing several new features which make previous object detection techniques in the MPEG compressed domain not directly reusable. When looking into object detection techniques that work in the H.264/AVC compressed domain, few approaches


exist, and they generally rely on the motion vector (MV) field. However, MVs are created to optimally compress the video, not to optimally represent the real motion in the sequence. As such, due to its coding-oriented nature, the MV field is noisy and requires additional complexity to be processed. Hence, in this paper a novel alternative approach is presented that works on a higher level. Whereas the related work assumes that MVs correspond to real motion, we assume that an encoder can compress parts of the background better than those parts containing moving objects. When analyzing an H.264/AVC compressed bitstream, the size (in bits) that a macroblock (MB) occupies is recorded. Based on these sizes, a background model is created during a training period. New images are consequently compared with this model to yield MBs that correspond to moving objects. Subsequently, these MBs are spatially and temporally filtered to remove noise. Finally, the sizes of the sixteen 4 × 4 transform coefficient blocks within such a boundary MB are evaluated to make a more fine-grained segmentation.

As such, our system disregards the MV field and the problems that arise from it. This leads to two main benefits. Firstly, our system suffers less from noise or small illumination changes. This paper will present an extensive comparison of our method with a MV-based approach, showing that we obtain better detection results on different sequences. Secondly, our system can obtain much higher speeds than the related work. The proposed system first analyzes the stream on a MB level and then refines only those MBs that are assumed to be at the edge of a moving object. We will show that we can process video sequences up to 20 times faster than the MV-based approaches.



Section 2 presents related work on moving object detection in the MPEG compressed domain, with a strong focus on techniques that work in the H.264/AVC compressed domain. Subsequently, Section 3 elaborates on the H.264/AVC standard and the problem that we are facing. Section 4 presents the proposed system to find moving objects in the H.264/AVC compressed domain. Experimental results of our system are shown in Section 5 and concluding remarks are made in Section 6.

Fig. 1. Syntax of a slice: a slice header followed by slice data consisting of coded MBs (each with mb_type, mb_pred, and coded residual), interleaved with skip_run elements.

2. Related work

Several techniques exist that deal with moving object detection in the MPEG-2 compressed domain. Zen et al. used the MV magnitudes to determine whether a block corresponds to a moving object [3]. According to the MV angle similarity, the blocks are spatially merged to reduce the effect of noisy MVs. Jamrozik and Hayes used a levelled watershed technique on a MV field that was accumulated over time [4]. Long et al. created a MV field by accumulation over time and used a median filter to clean the field [5]. Additionally, they created a feature vector based on the DCT coefficients to refine the MV field even further.

Most techniques that work in the MPEG-2 compressed domain are based on the MV field, hence it is no surprise that this is adopted in the literature by H.264/AVC compressed domain techniques. Thilak and Creusere presented a system to track targets in H.264/AVC video [6]. They use the MV magnitudes to detect objects of interest. As a consequence, MVs that point in different directions are still considered as belonging to the same object. Moreover, their system relies on prior knowledge of the size of the target.

Zeng et al. classified MVs into edge, foreground, background and noise MVs, to create a moving object detection system in the H.264/AVC compressed domain [7]. The classified MV field is then submitted to Markovian labeling, to yield 4 × 4 blocks that correspond to moving objects. Finally, backtracking is used to make a decision for the intra-predicted frames, called I frames. They presented good results; however, the system utilizes several parameters for the thresholds in the classification and the weights used in the labeling, and these need to be fine-tuned for different sequences.

The classification system proposed by Zeng et al. was reused by Yang et al. to perform moving object segmentation [8]. The labeling result is linked with an image consisting of the DC part of the DCT coefficients, which are obtained from partly decoded I frames. Finally, an extra decoding step is performed on the regions around the edges of the found objects, and edge information in the pixel domain is extracted for refinement. The execution speed of this method highly depends on the number of edges found in the present image, since each edge can result in several different blocks that need to be decoded. Several possible combinations of the thresholds can result in different classifications, as such influencing the segmentation accuracy.

Liu et al. created a normalized and median filtered MV field to perform moving object segmentation [9]. Subsequently, a complex binary partition tree filtering is used to segment the MV field. The complexity of the partition tree increases drastically with a noisy MV field.

As was the case in the MPEG-2 compressed domain analysis techniques, most techniques that work on the H.264/AVC compressed domain are based on the MV field. However, as MVs are created from a coding point of view, they are created to optimally compress the video, not to optimally represent the real motion in the sequence. Consequently, MV fields can be very noisy and it is difficult to find real moving objects based solely on this field. Hence, we present an alternative technique that solely relies on the number of bits that a MB uses in an H.264/AVC compressed bitstream to perform moving object detection.

3. Context

With the standardization of H.264/AVC, a new video codec was introduced to the video surveillance market. Since this new codec outperforms other ones in coding efficiency, it is assumed that more and more video surveillance data will be encoded in this new format [10]. H.264/AVC contains several new features that increase the coding performance, and several profiles are created targeting specific classes of applications. In this work we restrict ourselves to the Baseline Profile of H.264/AVC, which is suitable for video surveillance applications thanks to its low coding complexity. This restriction is also made in related work [7–9], and this profile is the most likely to be supported by surveillance cameras in the near future. The H.264/AVC video streams are generated using the Joint Model reference software (version JM 12.4). In the rest of the paper, we assume a fixed Group Of Pictures (GOP) structure of 16 frames, composed of one I frame and 15 inter-predicted frames (P frames). Each of these frames is subdivided into macroblocks (MBs), which represent a 16 by 16 sample region of the video frame. An I frame consists of MBs which are intra-coded, meaning the encoder makes a prediction of the current MB based on previously encoded MBs of the same frame. Subsequently, this prediction is subtracted from the original MB to yield residual data, which is transform coded. P frames can contain intra-coded MBs, but also inter-coded MBs. In this case the MBs are predicted based on regions in previously encoded frames. A MV is used to denote the prediction area and again the residual data is transform coded. Thanks to this temporal prediction, P frames generally result in better compression than I frames.

Within H.264/AVC, the MBs in a frame are grouped in so-called slices. In the rest of the paper, we assume that a frame consists of only one slice, containing the entire frame. Fig. 1 shows a simplified illustration of the syntax of a slice within an H.264/AVC encoded bitstream. The slice consists of a header and a series of coded MBs. These MBs partly consist of syntax elements (representing, e.g., the type, partitioning, prediction modes, and information regarding the MVs). The rest of the MB contains the coded transform coefficients after prediction and compensation.

The goal is to find moving objects, or foreground (FG) objects, with a fixed camera, so the actual background (BG) is assumed to be more or less static and visible over several different frames. Even when using a moving camera (e.g., a pan-tilt-zoom camera), different techniques exist to compensate for the camera movement and to create a stabilized image sequence [11,12]. Note that the assumption of a static camera does not imply that the BG is static. Clutter, moving bushes and trees, noise, etc. can make a highly dynamic BG.

In general it is observed that, during encoding, most of the BG will be predicted very accurately, resulting in MBs that use a low amount of bits within the bitstream. This is visualized in Fig. 2, which shows the data size (in bits) of two different MBs over several consecutive P frames of the Etri_od_A sequence (a middle focal



length sequence selected from the 30th CD in the MPEG-7 content set [13]). Fig. 2a shows a MB that contains homogenous background values. As such, this MB can achieve high compression. In contrast, Fig. 2c shows a MB that contains a piece of the background with much detail, so it is harder to compress and results in larger bit sizes. Fig. 2b and d show the influence of a moving object upon the size of these MBs; a person walks through the scene (frames 92–120 for MB 190 (a) and frames 71–86 for MB 151 (b)). When parts of the person occupy the analysed MBs, we see in both graphs a sudden rise in size. Typically, two peaks appear, corresponding to the edges of the moving object. The MBs that contain an edge of a moving object are more difficult to compress since it is hard to find a good match for such a MB. In between the peaks, a lower MB size is seen, due to the internal region of the object that passes by the MB. This internal part can be compressed better since it does not change much over consecutive frames.

If parts of the BG are dynamic (e.g., waving trees), they can still be predicted well if the motion is repetitive. If a foreground object appears, these parts will typically be predicted with less accuracy, resulting in larger amounts of bits. For high bitrates we also see that the MBs corresponding to edges of these moving objects will have small sub-partitions. A general conclusion is that MBs corresponding to (the edges of) moving objects will typically contain more bits in the bitstream than those representing BG.

4. Proposed algorithm

Our algorithm is based on the observations above and tries to detect moving objects. P frames are analysed using a two-step approach. In the first step, a BG model is learned from the scene and subsequently used to find MBs which correspond with moving

Fig. 2. Influence of a passing object on the size of a MB for the Etri_od_A sequence; (a) shows MB 190 with a homogenous background, (c) shows MB 151 with a detailed background; (b) and (d) show for each P frame the size of MB 190 and 151, respectively.

objects. This is explained in the next section. The second step consists of refining the found 16 × 16 MBs to the 4 × 4 subMB level by analysing the size of the transform coefficients within boundary MBs. This is discussed in Section 4.2. Finally, a detection for I frames is generated based on the found objects within the surrounding P frames, as will be discussed in Section 4.3. Fig. 3 depicts the proposed algorithm (for analyzing P frames) and will be referenced in the next sections.

4.1. MB level

From the observations of the previous section, it is clear that by using the size of a MB over consecutive frames a distinction can be made between BG and FG MBs. Additionally, it is shown that different MBs have different behaviors, so a BG model is used that consists of different values for each MB.

During a training phase, in which no moving objects are present, a BG model is constructed by recording, for each MB, the maximum size it takes within the bitstream over several consecutive frames (denoted as MB_model,i in Fig. 3, with i the number of the MB). Although it might be hard to obtain training images without moving objects, this is a common technique even in pixel domain object detection [14]. Furthermore, in environments that consistently show moving objects, the BG model could be created by comparing the MB sizes within one frame to each other. The lowest ones are then assumed to be BG. Eventually, enough data can be accumulated over time to create a reliable BG model that covers the entire image. However, we foresee such a learning phase to be time-consuming and error-prone. Dealing with sequences that consistently show moving objects is part of our future work.
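The training step above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the representation of each frame as a list of per-MB bit counts are our own assumptions:

```python
def train_bg_model(training_frames):
    """Build the BG model: for each MB index, keep the maximum size
    (in bits) that the MB takes over all training frames.
    `training_frames` is a list of frames, each a list of MB bit counts."""
    n_mbs = len(training_frames[0])
    model = [0] * n_mbs
    for frame in training_frames:
        for i, size in enumerate(frame):
            model[i] = max(model[i], size)
    return model
```

The resulting list plays the role of MB_model,i: one maximum-size value per macroblock position.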

After the initial training phase we compare, for each new frame, the size of the current MB (denoted as MB_i) with the corresponding value in the BG model. If the difference between the size of the



Fig. 3. Flowchart of the proposed algorithm working on P frames: the frame is parsed to retrieve the MB sizes; each MB is tested against the BG model (MB > MB_model + Tmb); a Skipped MB is FG only if all surrounding MBs are detected as FG; spatial filtering marks a MB as FG if more than four of its neighbours are FG; temporal filtering checks MB_prevframe and MB_nextframe; finally, boundary MBs are refined per 4 × 4 block using the threshold Tsubmb and the average 4 × 4 block size, while for non-boundary MBs the MB decision is applied to all subMBs.

Fig. 4. Skipped MB and the surrounding MBs (A, B and C) that are used to predict a MV.


current MB and the value of the corresponding MB in the BG model is larger than a threshold Tmb, it is considered a FG MB. Tmb allows the precision and recall of the system to be varied. Note that, once the BG model is constructed, only simple subtractions are needed to detect FG MBs, resulting in high speed processing.
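The per-frame comparison reduces to one subtraction and one comparison per MB, which can be sketched as follows (a hedged illustration; the function name and list representation are assumptions, not the paper's code):

```python
def detect_fg_mbs(mb_sizes, model, t_mb):
    """Classify each MB of the current frame as FG (True) when its
    size exceeds the BG model value by more than the threshold T_mb."""
    return [size > model[i] + t_mb for i, size in enumerate(mb_sizes)]
```

Lowering `t_mb` flags more MBs as foreground (higher recall, lower precision), matching the threshold sweep discussed in Section 5.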

Our approach is in fact a background subtraction technique, in which new images are compared to a BG model to yield FG regions. The model tries to incorporate the maximum size of a MB that should be regarded as a BG MB. In this sense, it can be compared with pixel domain background subtraction techniques. However, in the pixel domain, the BG models are used to accurately represent and follow the distributions of the pixel values, whereas we try to model the maximum size of a BG MB. In the pixel domain, numerous approaches have been presented using different kinds of BG models [15–17]. In this paper we restrict the BG model to hold the maximum size of the MBs.

The combination of the BG model and threshold Tmb allows us to detect those MBs that have an unusually large size, and these have a high probability to correspond with a moving object. However, as shown in Fig. 2, the MBs corresponding with internal parts of a moving object can have lower sizes, so additional processing is needed to detect these.

If a moving object is large enough, holes can appear in the detection, which are caused by Skipped MBs [1]. H.264/AVC defines these Skipped MBs as special MBs for which, during encoding, no motion vector or residual data is constructed. In the case of a Skipped MB, the decoder calculates a MV based on the surrounding MBs (MB A–C in Fig. 4). This MV is consequently used to reconstruct the current MB. These types of MBs are very useful when large areas of the picture do not change between frames, since such areas can be coded with very few bits. The lack of residual data of such MBs makes it impossible to compare them with the generated BG model. However, since a MV is created during the decoding

process based on other MBs, we can assume that the Skipped MB has the same behaviour as these surrounding ones. Therefore, if all the surrounding MBs were denoted as FG MBs in the previous step, the Skipped MB is also regarded as a FG MB; otherwise the MB is classified as a BG MB.
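This rule can be sketched as follows. The text does not pin down which MBs count as "surrounding"; the sketch assumes the MV-prediction neighbours of Fig. 4 (left, above, above-right), which is our interpretation:

```python
def classify_skipped_mb(fg, x, y):
    """A Skipped MB at grid position (x, y) is FG only if all of the
    neighbours used for MV prediction (MB A-C in Fig. 4: left, above,
    above-right -- an assumption here) were detected as FG.
    `fg` maps (x, y) -> bool; positions outside the frame count as BG."""
    neighbours = [(x - 1, y), (x, y - 1), (x + 1, y - 1)]
    return all(fg.get(pos, False) for pos in neighbours)
```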

These BG MBs form the input of a spatial filtering phase, as shown in Fig. 3, to deal with other MBs (not Skipped MBs) that lie within a moving object. The spatial filtering consists of a median filter which is applied on an eight-connected neighborhood of MBs that are detected as BG. So if more than four of the neighboring MBs are FG, the MB is also considered to be a FG MB. This spatial filtering is iteratively repeated until no more changes occur. As such, we can deal with MBs that have low residual data because they lie within an object and generally can be predicted well during the encoding process.
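The iterated median filter on the MB grid can be sketched as follows (a minimal version under the stated rule; function name and grid representation are assumptions):

```python
def spatial_filter(fg):
    """Iterated median filtering on the MB grid: a BG MB whose
    eight-connected neighbourhood contains more than four FG MBs is
    relabelled FG; repeat until no MB changes. `fg` is a row-major
    grid of booleans; the input grid is not mutated."""
    fg = [row[:] for row in fg]
    h, w = len(fg), len(fg[0])
    changed = True
    while changed:
        changed = False
        for y in range(h):
            for x in range(w):
                if fg[y][x]:
                    continue
                # Count FG MBs among the (up to eight) neighbours.
                n = sum(fg[ny][nx]
                        for ny in range(max(0, y - 1), min(h, y + 2))
                        for nx in range(max(0, x - 1), min(w, x + 2))
                        if (ny, nx) != (y, x))
                if n > 4:
                    fg[y][x] = True
                    changed = True
    return fg
```

Because a flipped MB can push its own neighbours over the majority, the loop repeats until a fixed point is reached, filling holes inside large objects.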

Using the BG model and the spatial filtering allows the moving objects to be detected, but noise in the video that causes the size of a MB to rise temporarily will also be regarded as FG. Consequently, temporal information is needed to deal with these situations. Fig. 2 also shows the temporal consistency of an object that passes by: the size of the MB is increased over several consecutive frames. Since the MBs occupy regions of 16 by 16 pixels, an object that passes by a MB will occupy that region over several consecutive frames, so the MB will consistently be detected as FG. Based on this observation, a temporal filter is used to remove noisy MBs that only appear shortly. This filter is kept very simple for fast processing; if a FG MB is not detected as FG in the previous (MB_prevframe,i) or next (MB_nextframe,i) frame, it is rejected and treated as a BG MB.
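The temporal filter is a one-liner over three consecutive frames (sketch only; per-MB boolean lists are an assumed representation):

```python
def temporal_filter(fg_prev, fg_cur, fg_next):
    """Keep a FG MB only if it was also FG in the previous or the
    next frame; isolated single-frame detections are rejected as noise."""
    return [cur and (prev or nxt)
            for prev, cur, nxt in zip(fg_prev, fg_cur, fg_next)]
```

Note that this filter introduces a one-frame latency, since the decision for the current frame needs the detection result of the next frame.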

The system presented above allows moving objects to be detected up to the 16 × 16 MB level: each MB of a frame is classified as FG MB or BG MB. Fig. 5 shows typical detection results of the system on different sequences containing a variety of objects. From these figures it is clear that the detected FG MBs have a high chance to be correct, but that the 16 × 16 detection is rather coarse to accurately detect the arbitrarily shaped objects.

To elaborate on this, we present precision–recall values for the MB level approach on the PetsD2TeC2 and Indoor sequences (both from the PETS 2005 workshop [18]) in Fig. 6a and b, respectively. These are commonly used test sequences which contain slow and fast moving objects of small and large sizes. For both sequences we compare the output of the proposed system against two different ground truth annotations, resulting in two graphs. The first, GT_pix, represents the comparison with a pixel-based ground truth that was manually made for every 50th frame of the sequence. The second graph, denoted as GT_MB, uses a MB-based ground truth annotation. This MB-based ground truth is automatically generated from the original pixel-based ground truth: if one pixel of a MB is considered FG in the pixel-based ground truth, all the 16 × 16 pixels of that MB are denoted as FG in the MB-based ground truth.
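The pixel-to-MB ground truth conversion described above can be sketched as follows (an illustrative helper, not the authors' tooling; names and the boolean-grid representation are assumptions):

```python
def pixel_gt_to_mb_gt(pixel_gt, mb_size=16):
    """Derive a MB-based ground truth from a pixel-based one: a MB is
    FG if any pixel inside its 16x16 region is FG in the pixel mask."""
    h, w = len(pixel_gt), len(pixel_gt[0])
    return [[any(pixel_gt[y][x]
                 for y in range(by, min(by + mb_size, h))
                 for x in range(bx, min(bx + mb_size, w)))
             for bx in range(0, w, mb_size)]
            for by in range(0, h, mb_size)]
```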

The actual graphs are constructed as follows. If a real background pixel is misclassified as foreground, it is called a false


Fig. 5. Output of the proposed algorithm for (a) the Etri_od_A sequence (frame 680), (b) Speedway2 (frame 1705), (c) Hallmonitor (frame 255), (d) PetsD2TeC2 (frame 700), and (e) Indoor (frame 1530).

Fig. 6. Results of the proposed MB level algorithm using a pixel-based and MB-based ground truth for (a) the PetsD2TeC2 sequence and (b) the Indoor sequence.


positive. If a foreground pixel is not detected, it is called a false negative. Accordingly, a real foreground pixel or background pixel that is correctly classified is called a true positive or true negative, respectively. The X-axis shows the recall, which defines how many positive samples have been detected among all positive samples available during the test. In this case, it represents the ratio of pixels which were correctly considered as foreground, among all the real foreground pixels:

recall = TruePositives / (TruePositives + FalseNegatives).   (1)

The precision, shown on the Y-axis, denotes the percentage of the pixels that are classified as foreground which actually are real foreground pixels:

precision = TruePositives / (TruePositives + FalsePositives).   (2)

To calculate the precision and recall values, the output of the algorithm is compared to the ground truth annotations. Subsequently, these values are summed for the entire sequence to create one point of a graph. Finally, an entire graph is constructed by varying the threshold Tmb (0–140), and plotting the average precision and recall value of the entire sequence.
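Eqs. (1) and (2) translate directly into code; the sketch below computes one precision–recall point from flat boolean detection and ground-truth masks (helper names are our own):

```python
def precision_recall(detected, ground_truth):
    """Compute (precision, recall) per Eqs. (2) and (1) from two
    equal-length boolean masks. Returns 0.0 when a denominator is zero."""
    tp = sum(d and g for d, g in zip(detected, ground_truth))
    fp = sum(d and not g for d, g in zip(detected, ground_truth))
    fn = sum(g and not d for d, g in zip(detected, ground_truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping Tmb and calling this once per threshold value yields one point of the curves in Fig. 6.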

Good systems obtain high precision and recall values. Low values for Tmb tend to result in higher recall and lower precision values, while higher values are situated more to the left of the graph. Indeed, if Tmb is set low, many MBs are detected as foreground, even if they only differ slightly from the BG model. Hence, noise creates many false positives, which decreases the precision.

The graphs show that, when using a pixel-based ground truth, the precision is rather low, meaning that many of the pixels that are detected as FG by the system actually are BG. However, the precision of GT_MB is much higher than that of GT_pix, meaning that the MBs which are detected have a high probability to be correct, but the coarse MB-based detection causes many pixels to be misdetected. Indeed, within the MBs that are located at the edges of the objects, there are many pixels that do not correspond to real foreground, which was also noticeable in the visual results shown in Fig. 5. Therefore, an extension of the algorithm to the sub-macroblock (subMB) level for these boundary MBs, which is the topic of the next section, promises to increase the detection performance.

4.2. SubMB level

The subMB level takes the decisions of the MB level as input and tries to make a decision for the 4 × 4 blocks (denoted as subMB_j in Fig. 3) within the MBs. All the 4 × 4 blocks corresponding to the 16 × 16 BG MBs are directly regarded as BG (denoted as BG subMBs). Since the MBs that are detected as FG on the MB level have a high probability to be correct, we restrict ourselves to these FG MBs for further investigation. Moreover, as can be seen from Fig. 5, the coarseness of the detection especially causes problems within the boundary MBs. Therefore, the first step is to check if a FG MB is a boundary MB and only use these MBs during the subMB level analysis. In this sense, a MB is considered to be a boundary

Page 6: Moving object detection in the H.264/AVC compressed domain for video surveillance applications

10

20

30

40

50

60

70

80

50 60 70 80 90 100Recall (%)

Prec

isio

n (%

)

ProposedMV-based

0

10

20

30

40

50

60

70

80

90

45 50 55 60 65 70 75 80 85 90 95Recall (%)

Prec

isio

n (%

)

ProposedMV-based

50

60

70

80

n (%

)

ProposedMV-based

a

b

c

C. Poppe et al. / J. Vis. Commun. Image R. 20 (2009) 428–437 433

MB if it is four-connected to a MB that was detected as BG duringthe MB level analysis. For all other FG MBs the decision is appliedto all the containing 4� 4 blocks (resulting in FG subMBs).
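The four-connectivity test described above can be sketched as follows. Representing the MB-level decisions as a 2D boolean grid is an assumption of this example, not part of the paper's implementation.

```python
# A FG MB is a boundary MB if it is four-connected to at least one BG MB.
def boundary_fg_macroblocks(fg_mask):
    rows, cols = len(fg_mask), len(fg_mask[0])
    boundary = set()
    for r in range(rows):
        for c in range(cols):
            if not fg_mask[r][c]:          # only FG MBs can be boundary MBs
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and not fg_mask[nr][nc]:
                    boundary.add((r, c))   # four-connected to a BG MB
                    break
    return boundary
```

FG MBs entirely surrounded by other FG MBs are left out of the subMB analysis, exactly as the text prescribes.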

Within a MB, the transformation of the residual data occurs on 4×4 blocks [19]. Since this is the transformed residual data, the assumption is that the 4×4 blocks with a high amount of bits are those that were hardest to compress. Therefore, the proposed algorithm is extended as follows. For each boundary MB, again by simple parsing, the sizes in bits that these transform coefficients occupy within the bitstream are gathered. These sizes are compared with the average size of the 4×4 blocks within the current MB. The 4×4 blocks that have a lower amount of bits than the average are considered to be BG; those with a higher amount are considered to be FG.

However, this check can be too strict in certain cases. If a MB is totally covered by a moving object, the proposed step will still consider the smallest 4×4 blocks within that MB as background, even though they can have a large size. Hence, a new threshold, Tsubmb, is introduced to control the sensitivity on the subMB level. The algorithm thus consists of the following (see Fig. 3): if a 4×4 block is larger than this threshold, it is immediately regarded as a FG subMB. This way we prevent 4×4 blocks of large sizes from being falsely regarded as background. If the 4×4 block is smaller, it is compared with the average size of the 4×4 blocks in the current MB. Note that, according to our experiments, using the median instead of the average does not influence the results much.

If Tsubmb is set to 0, all the 4×4 blocks within the boundary MBs are considered foreground. In that case, the algorithm detects objects up to the MB level, as discussed above. When using higher values for Tsubmb, more 4×4 blocks are compared to the average, so a more fine-grained detection is possible. This threshold should be adjusted according to the noise level of the sequence: high noise levels require higher values for Tsubmb. In the experiments in the rest of the paper we set Tsubmb to 10, a value which was experimentally determined on a test sequence.
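Putting the two rules together, the per-block decision inside one boundary MB might look like this. This is a sketch: the list-of-sizes representation is an assumption, and only the Tsubmb default of 10 comes from the text.

```python
# Classify the 4x4 blocks of one boundary MB from their coded sizes in
# bits: a block larger than Tsubmb is FG outright; otherwise it is FG
# only if it is larger than the MB's average 4x4 block size.
def classify_submbs(sizes, t_submb=10):
    avg = sum(sizes) / len(sizes)
    labels = []
    for s in sizes:
        if s > t_submb:
            labels.append('FG')   # large block: foreground immediately
        elif s > avg:
            labels.append('FG')   # above the MB average: foreground
        else:
            labels.append('BG')   # at or below the average: background
    return labels
```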

4.3. I frames

Until now, only P frames were considered in the system. However, next to P frames, an H.264/AVC bitstream also contains I frames. These frames are coded by applying intra prediction, so for these frames the temporal redundancy in the video cannot be exploited. Therefore, the sizes of the MBs within an I frame are typically much larger than those within P frames, and these MBs cannot be evaluated in the same manner as the MBs in a P frame. To allow high-speed processing of an I frame, a simple interpolation of the detection results of the previous and following P frame is used. The interpolation consists of a binary AND-operation: if a MB at a certain position in the frame was detected as FG in both the previous and the following P frame, it is detected as FG in the current I frame. Although this interpolation is fast and simple, it does not have a large influence on the accuracy of the system, since, typically, only one I frame is used in each GOP structure, compared to several P frames. We have tested the above algorithm for different GOP sizes, yielding similar results for the different structures.
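The I-frame interpolation is a binary AND of the two neighbouring P-frame detection masks; a minimal sketch, where masks as 2D boolean lists are an assumption of the example:

```python
# A MB in the I frame is FG only if it was FG in both the previous and
# the following P frame (element-wise logical AND of the two masks).
def interpolate_i_frame(prev_p_fg, next_p_fg):
    return [[a and b for a, b in zip(row_prev, row_next)]
            for row_prev, row_next in zip(prev_p_fg, next_p_fg)]
```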


Fig. 7. Precision–recall graph of MV-based and proposed for (a) the Etri_od_A sequence, (b) the Speedway2 sequence, and (c) the PetsD2TeC2 sequence.

5. Experimental results

The algorithm explained above allows detecting moving objects up to the 4×4 level. In this section we present an extensive evaluation of the proposed system. Next, we compare the detection performance with the related work and give a speed and visual comparison. Last, we show how different encoder configurations influence our algorithm. Although the syntax of an H.264/AVC bitstream is standardized, the configuration of the encoder can result in completely different bitstreams. As such, we believe this is an important point of evaluation, which is missing in most of the related work.

5.1. Comparison between the proposed algorithm and the related work

In this section we present a comparison of our proposed algorithm with the work of Zeng et al., a MV-based approach. Since they rely on MVs, they can also detect moving objects up to the 4×4 level. As such, to give a fair comparison, we compare the output of our system and theirs with a subMB-based ground truth. To create this ground truth, a 4×4 block of pixels is regarded as foreground if one of its pixels is considered foreground in the pixel-based ground truth. A comparison with such a subMB-based ground truth is commonly used in the related work.
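The construction of this subMB-based ground truth amounts to a logical-OR pooling of the pixel mask over 4×4 blocks. A minimal sketch, where the nested-list mask representation and divisibility of the frame dimensions by 4 are assumptions:

```python
# A 4x4 block is labeled foreground if any pixel inside it is foreground
# in the pixel-based ground truth.
def submb_ground_truth(pixel_mask):
    h, w = len(pixel_mask), len(pixel_mask[0])
    return [[any(pixel_mask[y][x]
                 for y in range(by * 4, by * 4 + 4)
                 for x in range(bx * 4, bx * 4 + 4))
             for bx in range(w // 4)]
            for by in range(h // 4)]
```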

Figs. 7 and 8 show the precision–recall graphs for the related and proposed approaches on outdoor and indoor sequences, respectively. To create these graphs for the MV-based approach we vary one of the thresholds (The, a threshold used to find edge blocks), while the others are set to fixed values for each sequence. These values are shown in Table 1; for an in-depth explanation of the thresholds we refer to [7]. To find these values we have experimentally determined the optimal threshold settings for each sequence, since different sequences require a different configuration. Note that, for the proposed algorithm, Tmb is again varied (0–140) and Tsubmb is set to a fixed value of 10 for all sequences.

Fig. 8. Precision–recall graph of MV-based and proposed for (a) the Indoor sequence, (b) the Laboratory sequence.

The first graph (Fig. 7a) shows the detection results on the Etri_od_A sequence. As can be seen, the proposed system has higher precision values, whereas the related work has higher recall values. The high precision of our system is due to the two-step approach: the analysis on the MB level succeeds in removing noisy MBs, and the found FG MBs are then refined in the subMB step. Note that Zeng et al. published slightly different precision and recall values for this sequence in their paper (they obtain an average precision of 71.3% for a recall of 87.2%). However, to create those values they used a ground truth that was based on the blocks present within the specific encoded sequence. As such, different encoder configurations yield different versions of the ground truth. In contrast, our ground truth is fixed and created from the pixel-based ground truth, so it can be used for different encoders and configurations. Moreover, Zeng et al. used only 105 frames of the Etri_od_A sequence, whereas we make an evaluation upon the entire sequence. Lastly, no information has been given about the specific settings of the encoder or the parameters of their system. This explains the difference with their presented results.

Table 1
Parameter settings for the MV-based approach for different test sequences.

Sequence     Thb  Thf  a  b  c  g
Etri_od_A     1    3   4  5  7  6
PetsD2TeC2    1    2   3  3  5  6
Speedway      3    8   4  4  6  4
Indoor        2    8   4  3  5  2
Lab           1    7   5  3  5  2

Fig. 7b and c show the detection results of the algorithms on the Speedway2 and PetsD2TeC2 sequences, respectively. These sequences are challenging examples of real outdoor video surveillance scenarios which contain noise, shadows, and changing lighting conditions [13,18]. As the figure shows, the MV-based approach, which directly works on a 4×4 level for the entire image, suffers from the noisy MV field in these sequences. Since MVs are created from a coding perspective, the noise, shadows, and lighting changes result in several MVs that do not correspond to real moving objects but which are wrongly classified by the MV-based approach. Using different parameter settings to filter out the noise results in too many false negatives, since then small or slowly moving objects are not detected. Our approach, on the other hand, achieves higher precision and recall values. This is especially due to the low number of false positives in our system. The initial MB-level analysis succeeds in filtering out much of the noise in the data, while the subMB-level analysis refines the detection. As our system only uses the residual information of the bitstream, it is not affected by the MVs chosen by the encoder. Although noise and shadows do increase the size of the affected MBs, the increase is not enough for these MBs to be regarded as FG; hence our system is more resistant to these situations.

The same behavior is visible in the detection results for indoor situations. Typically, in indoor video sequences, objects move faster due to a shorter distance to the camera. Nevertheless, Fig. 8a and b show that the proposed algorithm still outperforms the related approach. Note that, for the Laboratory sequence (an indoor surveillance sequence from the ATON project [20]), the precision values are low for both approaches. The main reason is that the sequence contains many shadows, cast on the ground and wall, which are wrongly considered to be foreground. Moreover, during the sequence a closet door is opened. This change is detected by both algorithms, but it is not regarded as a moving object in the ground truth.

Note that the related work uses "hard" thresholds to determine whether a MV corresponds with moving objects or not. In that sense, our threshold (Tmb) is less restrictive, since it compares the MB sizes with a model learned from the actual sequence. The analysed sequences contain slow and fast moving objects (for example, the PetsD2TeC2 sequence has very slowly moving people, while the Indoor sequence shows fast moving objects due to a shorter distance from the camera to the actual objects). The results show that in both cases we outperform the related work. Finally, the threshold for the subMBs is kept constant for all tests. The influence of this threshold is weaker than that of the threshold on the MB level. Firstly, it is only used for the detected boundary MBs. Secondly, if the threshold is too low, the results will approximate those of the MB level; if it is too high, all subMBs are compared with the average sizes of the subMBs in a MB.

5.2. Speed

Table 2 shows the execution performance for different sequences in frames per second. All measurements were done on an Intel Core 2 Duo 2.13 GHz processor with 2 GB RAM. These values include the parsing of the H.264/AVC compressed bitstream and the actual analysis to detect the moving objects. As shown, our system achieves very high execution speeds, both when working on the MB level (425–782 fps) and on the subMB level (389–702 fps).


Table 2
Average execution speeds in frames per second.

Sequence                  MV           Proposed MB    Proposed subMB
                          Avg   Stdv   Avg    Stdv    Avg    Stdv
Etri_od_A (352×240)       28    0.4    662    5.8     613    6.2
Speedway2 (352×288)       31    0.8    548    7.1     465    7.8
PetsD2TeC2 (384×288)      22    0.4    448    7.7     403    7.2
Indoor (320×240)          31    0.6    751    11.1    648    9.8
Laboratory (320×240)      45    0.5    782    10.5    702    8.8
Indoor QP8 (320×240)      6.5   0.4    425    9.5     389    8.1


In contrast, the related work, based on MVs, achieves processing speeds of 30 frames per second for CIF resolution [7,9]. In the table we have added a version of the Indoor sequence coded with a fixed Quantization Parameter (QP) of 8, to show the influence this has on the execution speed. We noticed that sequences encoded by the reference software with a very small QP tend to have more MVs of different sizes. As a result, the MV-based approach needs to analyze more blocks, resulting in a rise in execution time. This behaviour is similar to the fact that many moving objects in a scene will slow down the system, as reported in their paper. The table shows that our system is also affected by this different QP setting; however, the loss in speed is mostly due to the fact that parsing the bitstream takes more time. Indeed, at low QPs (or high bitrates) fewer Skipped MBs are used and the MBs are more partitioned, which makes the parsing slower.

5.3. Visual comparison

Fig. 9 shows typical results of the MV-based approach and the proposed system, where pixels detected as FG or BG are colored white or black, respectively. As can be seen, the MV-based approach is more sensitive to noise. The MB-based approach succeeds in finding the MBs that correspond with moving objects; however, the 16×16 size is coarse. The subMB-level approach is able to refine the detected MBs. Note that, in some cases, the refinement leads to additional false negatives, since some parts of the boundary MBs are wrongly considered as BG.

Fig. 9. First column: current frame, second column: ground truth, third column: output by MV-based approach, fourth column: output by proposed on MB level, fifth column: output by proposed on subMB level. The rows show the results for the Etri_od_A sequence (frame 300), the PetsD2TeC2 sequence (frame 2300) and the Indoor sequence (frame 50).

5.4. Influence of encoder configuration

The configuration of the encoder has a large influence on the resulting bitstream. Different settings result in different decisions being made during the encoding process. To make a detailed analysis of our system, we tested the proposed algorithm and the related work to see the influence of the QP, the bitrate, and the used motion estimation method.

The QP has a strong influence on the amount of compression that is achieved: the higher the QP, the more the data is compressed. When using a fixed QP, the visual quality of the video is more or less fixed, but the bitrate used for each frame can differ a lot. Fig. 10 shows the precision and recall values when varying the QP for different sequences. Again, we vary the threshold Tmb and apply our algorithm to the same sequence encoded at different QP values, resulting in different graphs.

High QPs result in high compression of the data, so the size of the resulting MBs decreases. As can be seen, the detection performance of the algorithm drops in that case. More specifically, the recall decreases, since many MBs are wrongly considered to be BG due to their small sizes. Vice versa, low QPs lead to low compression, so MBs containing data that is harder to compress (like FG objects) will have a larger size than MBs that are easy to compress (like BG objects). The graphs show that for different QPs a different threshold gives the best performance (high precision and high recall). A possible extension to our algorithm could consist of making the threshold dependent on the used QP, so that for low QPs the threshold could be automatically raised and vice versa.

Note that for high thresholds (resulting in low recall values), fluctuations can arise in the precision (see, for example, Fig. 10a). As shown in Eq. (2), both the numerator and denominator of the precision depend on the output of the algorithm. As such, varying the thresholds can increase or decrease the precision. Moreover, this effect is especially visible when the algorithm detects only few foreground pixels, i.e., when a high threshold is used. This peculiar behavior of precision–recall graphs has also been shown by Davis and Goadrich [21].
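A toy numeric example (the counts are invented purely for illustration, not measured data) shows how such a fluctuation can happen:

```python
# Because both the numerator and the denominator of precision depend on
# the detector output, raising the threshold does not have to raise
# precision: here the stricter threshold happens to remove only TPs.
def precision(tp, fp):
    return tp / (tp + fp)

loose = precision(tp=3, fp=1)    # looser threshold: 3/(3+1) = 0.75
strict = precision(tp=1, fp=1)   # stricter threshold: 1/(1+1) = 0.50
```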

In many cases an encoder needs to achieve a consistent fixed bitrate for transport purposes. In that case, each GOP structure contains about the same amount of bits. A rate control mechanism within the encoder is responsible for adapting the QP for each MB to match a given bitrate. Fig. 11 shows the graphs when varying the bitrate for different sequences. Note that different bitrates influence the performance of the system. Using a high fixed bitrate generally results in small QPs being used when encoding the MBs. Hence, the same global behavior as in Fig. 10 is visible.

Fig. 10. Precision–recall graph for fixed QP versions of (a) the PetsD2TeC2 sequence and (b) the Indoor sequence.

Fig. 11. Precision–recall graph for fixed bitrate versions of (a) the PetsD2TeC2 sequence and (b) the Indoor sequence.

Fig. 12. Precision–recall graph for different motion estimation methods for (a) the PetsD2TeC2 sequence and (b) the Indoor sequence.

An encoder is free to implement its own method for estimating the motion, and the H.264/AVC reference software contains several methods for this. We created sequences using Uneven Multi-Hexagon Search (UMHex), Simplified Hexagon Search (SHex), and Enhanced Predictive Zonal Search (EPZS); the last method uses the default pattern (Extended Diamond). The chosen methods apply different search patterns to find the best match for a specific MB, and as such each of these methods influences the speed of encoding and the resulting MVs. Fig. 12 shows the graphs for the MV-based and the proposed approach when using these different motion estimation methods. It can be seen that the MV-based approach is more dependent on the chosen motion estimation method than the proposed approach. A different motion estimation method can result in totally different MVs, resulting in different behaviour of the MV-based object detection. In contrast, for high thresholds, our system is more or less independent of this parameter. Although the different MVs result in different residual data, the system is still able to accurately detect the MBs that correspond to moving objects.

6. Conclusions

This paper presents a novel high-speed method to accurately detect moving objects in H.264/AVC compressed video surveillance sequences. In contrast to the related work, which is based on MVs, our system purely relies on the structure of the compressed bitstream. During a training phase, the numbers of bits that MBs use within a frame are used to create an effective background model. Subsequently, the MB sizes of new images are compared to this model to yield regions of interest, which are then spatially and temporally filtered. Additionally, the sizes of the 4×4 transform coefficients within boundary MBs are incorporated to refine the detection results. A comparison on challenging sequences shows that the system achieves better precision and recall values than the related MV-based approaches. Additionally, since we restrict ourselves to the syntax level (our system does not need any decoding), very high execution speeds are achieved (up to 20 times faster than the MV-based approaches). Future work consists of extending the algorithm to other profiles of H.264/AVC. Moreover, the assumption of a static camera is currently made, so further work is needed to see how stabilization techniques existing in the literature can successfully be combined with our system.

Acknowledgments

The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.

References

[1] T. Wiegand, G. Sullivan, G. Bjontegaard, G. Luthra, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology 13 (7) (2003) 560–576.

[2] H. Wang, A. Divakaran, A. Vetro, S. Chang, H. Sun, Survey on compressed-domain features used in video/audio indexing and analysis, Journal of Visual Communication and Image Representation 14 (2003) 150–183.

[3] H. Zen, T. Hasegawa, S. Ozawa, Moving object detection from MPEG coded picture, Proceedings of the International Conference on Image Processing (1999) 25–29.

[4] M.L. Jamrozik, M.H. Hayes, A compressed domain video object segmentation system, Proceedings of the International Conference on Image Processing (2002) 113–116.

[5] L. Long, F. Xingle, J. Ruirui, D. Yi, A moving object segmentation in MPEG compressed domain based on motion vectors and DCT coefficients, Proceedings of the Congress on Image and Signal Processing (2008) 605–609.

[6] V. Thilak, C. Creusere, Tracking of extended size targets in H.264 compressed video using the probabilistic data association filter, Proceedings of the European Signal Processing Conference (2004) 281–284.

[7] W. Zeng, J. Du, W. Gao, Q. Huang, Robust moving object segmentation on H.264 compressed video using the block-based MRF model, Real-Time Imaging 11 (2003) 290–299.

[8] G. Yang, S. Yu, Z. Zhang, Robust moving object segmentation in the compressed domain for H.264 video stream, in: Proceedings of the Picture Coding Symposium, 2006.

[9] Z. Liu, Z. Zhang, L. Shen, Moving object segmentation in the H.264 compressed domain, Optical Engineering 46 (1) (2007) 017003.

[10] S. Kwon, A. Tamhamkar, K.R. Rao, Overview of H.264/MPEG-4 part 10, Journal of Visual Communication and Image Representation 17 (2006) 186–216.

[11] A. Mittal, D. Huttenlocher, Scene modelling for wide area surveillance and image synthesis, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2000) 160–167.

[12] F. Tiburzi, J. Bescos, Camera motion analysis in on-line MPEG sequences, Proceedings of the International Workshop on Image Analysis for Multimedia Interactive Services (2007) 42–46.

[13] MPEG-7 Overview, International Organization for Standardisation, Klagenfurt, ISO/IEC JTC1/SC29/WG11, July 2002, <http://mpeg.telecomitalialab.com/standards/mpeg-7/mpeg-7.htm>.

[14] S. Cheung, C. Kamath, Robust techniques for background subtraction in urban traffic video, Proceedings of Visual Communications and Image Processing (2004) 881–892.

[15] T. Celik, H. Demirel, H. Ozkaramanli, M. Uyguroglu, Fire detection using statistical color model in video sequences, Journal of Visual Communication and Image Representation 18 (2) (2007) 176–185.

[16] H. Yang, Y. Tan, J. Tian, J. Liu, Accurate dynamic scene model for moving object detection, Proceedings of the IEEE International Conference on Image Processing (2007) 157–160.

[17] C. Poppe, G. Martens, S. De Bruyne, P. Lambert, R. Van de Walle, Robust spatio-temporal multimodal background subtraction for video surveillance, Optical Engineering 47 (2008) 107203.

[18] L.M. Brown, A.W. Senior, Y. Tian, J. Connell, A. Hampapur, C. Shu, H. Merkl, M. Lu, Performance evaluation of surveillance systems under varying conditions, in: Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2005, <http://www.research.ibm.com/peoplevision/performanceevaluation.html>.

[19] I.E.G. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia, 2003.

[20] Autonomous Agents for On-Scene Networked Incident Management (ATON), <http://cvrr.ucsd.edu/aton/>.

[21] J. Davis, M. Goadrich, The Relationship between Precision–Recall and ROC Curves, University of Wisconsin–Madison Computer Science Department, Technical Report, 2006.