


* Corresponding author. Tel.: +39 040-676-7147; fax: +39 040-676-3460.

E-mail addresses: irena@iinf.bas.bg (I. Koprinska), carrato@ipl.univ.trieste.it (S. Carrato).

Signal Processing: Image Communication 16 (2001) 477-500

    Temporal video segmentation: A survey

Irena Koprinska (a), Sergio Carrato (b),*

(a) Institute for Information Technologies, Acad. G. Bonchev Str., Bl. 29A, 1113 Sofia, Bulgaria

    "Department of Electrical Engineering and Computer Science (D.E.E.I), Image Processing Laboratory, University of Trieste,via Valerio 10, 34127 Trieste, Italy

    Received 27 July 1999; received in revised form 8 February 2000; accepted 15 February 2000

    Abstract

Temporal video segmentation is the first step towards automatic annotation of digital video for browsing and retrieval. This article gives an overview of existing techniques for video segmentation that operate on both uncompressed and compressed video streams. The performance, relative merits and limitations of each approach are comprehensively discussed and contrasted. The gradual development of the techniques, and how the uncompressed-domain methods were tailored and applied to the compressed domain, are also considered. In addition to the algorithms for shot boundary detection, the related topic of camera operation recognition is reviewed. © 2001 Elsevier Science B.V. All rights reserved.

    Keywords: Temporal video segmentation; Shot boundaries detection; Camera operations; Video databases

    1. Introduction

Recent advances in multimedia compression technology, coupled with the significant increase in computer performance and the growth of the Internet, have led to the widespread use and availability of digital video. Applications such as digital libraries, distance learning, video-on-demand, digital video broadcast, interactive TV and multimedia information systems generate and use large collections of video data. This has created a need for tools that can efficiently index, search, browse and retrieve relevant material. Consequently, several content-based retrieval systems for organizing and managing video databases have recently been proposed [8,26,34].

As shown in Fig. 1, temporal video segmentation is the first step towards automatic annotation of digital video sequences. Its goal is to divide the video stream into a set of meaningful and manageable segments (shots) that are used as basic elements for indexing. Each shot is then represented by selecting key frames and indexed by extracting spatial and temporal features. Retrieval is based on the similarity between the feature vector of the query and the already stored video features.

A shot is defined as an unbroken sequence of frames taken from one camera. There are two basic types of shot transitions: abrupt and gradual. Abrupt transitions (cuts) are simpler; they occur in a single frame, when stopping and restarting the


Fig. 2. Dissolve, cut.

    Fig. 1. Content-based retrieval of video databases.

camera. Although many kinds of cinematic effects can be applied to artificially combine two shots, and thus to create gradual transitions, fades and dissolves are used most often. A fade out is a slow decrease in brightness resulting in a black frame; a fade in is a gradual increase in intensity starting from a black image. A dissolve shows one image superimposed on the other as the frames of the first shot get dimmer and those of the second one get brighter. Fig. 2 shows an example of a dissolve and a cut. A fade out followed by a fade in is presented in Fig. 3.

Gradual transitions are more difficult to detect than cuts. They must be distinguished from camera operations (Fig. 4) and object movement, which exhibit temporal variations of the same order and cause false positives. It is particularly difficult to detect dissolves between sequences involving intensive motion [14,44,47].

Camera operation recognition is an important issue for another reason as well. As camera operations usually reflect explicitly how the attention of the viewer should be directed, the clues obtained are useful for key frame selection. For example, when a camera pans over a scene, the entire video sequence belongs to one shot, but the content of the scene can change substantially, suggesting the use of more than one key frame. Also, when the camera zooms, the images at the beginning and end of the zoom may be considered representative of the entire shot. Furthermore, recognizing camera operations allows the construction of salient video stills [38]: static images that efficiently represent video content.

Algorithms for shot boundary detection have already been discussed in several review papers. Ahanger and Little [4] presented a survey on video indexing, including some techniques for temporal video segmentation, mainly in the uncompressed domain. Idris and Panchanathan [15] surveyed methods for content-based indexing in image and video databases, focusing on feature extraction. A review of video parsing is presented, but it mainly includes methods that operate in the uncompressed domain and detect cuts. The goal of this paper is to provide a comprehensive taxonomy and critical survey of the existing approaches for temporal video segmentation in both uncompressed and compressed video. The performance, relative merits and shortcomings of each


Fig. 3. Fade out followed by fade in.

Fig. 4. Basic camera operations: fixed, zooming (focal length change of a stationary camera), panning/tilting (camera rotation around its horizontal/vertical axis), tracking/booming (horizontal/vertical transverse movement) and dollying (horizontal lateral movement).

approach are discussed in detail. Special attention is given to the gradual development and improvement of the techniques, their relationships and similarities, and in particular to how the uncompressed-domain methods were tailored and imported into the compressed domain. In addition to the algorithms for

shot boundary detection, the related topic of camera operation recognition is also discussed.

The paper is organized as follows. In the next section we review shot boundary detection techniques, starting with approaches in the uncompressed domain and then moving to the compressed domain via an introduction to MPEG fundamentals. An overview of methods for camera operation recognition is presented in Section 3. Finally, a summary with future directions concludes the paper.

    2. Temporal video segmentation

More than eight years of temporal video segmentation research have resulted in a great variety of algorithms. Early work focused on cut detection, while more recent techniques deal with the harder problem of gradual transition detection.

2.1. Temporal video segmentation in the uncompressed domain

The majority of algorithms process uncompressed video. Usually, a similarity measure


between successive images is defined. When two images are sufficiently dissimilar, there may be a cut. Gradual transitions are found by using cumulative difference measures and more sophisticated thresholding schemes.

Based on the metrics used to detect the difference between successive frames, the algorithms can be broadly divided into three categories: pixel-based, block-based and histogram comparisons.

2.1.1. Pixel comparison

Pair-wise pixel comparison (also called template matching) evaluates the differences in intensity or color values of corresponding pixels in two successive frames.

The simplest way is to calculate the absolute sum of pixel differences and compare it against a threshold [18]:

    D(i, i#1)"+X

    x/1+Y

    y/1DP

    i(x, y)!P

    i`1(x, y)D

    X>

    for gray level images,

    D(i, i#1)"+X

    x/1+Y

    y/1+

    cDP

    i(x, y, c)!P

    i`1(x, y, c)D

    X>

    for color images, (1)

where i and i+1 are two successive frames with dimensions X x Y, P_i(x, y) is the intensity value of the pixel at coordinates (x, y) in frame i, c is an index over the color components (e.g. c in {R, G, B} in the case of the RGB color system) and P_i(x, y, c) is the color component of the pixel at (x, y) in frame i.

A cut is detected if the difference D(i, i+1) is above a prespecified threshold T. The main disadvantage of this method is that it is not able to distinguish between a large change in a small area and a small change in a large area. For example, cuts are misdetected when a small part of the frame undergoes a large, rapid change. Therefore, methods based on simple pixel comparison are sensitive to object and camera movements.

A possible improvement is to count the number of pixels whose value changes by more than some threshold, and to compare the total against a second threshold [25,45]:

$$DP(i, i+1, x, y) = \begin{cases} 1 & \text{if } |P_i(x, y) - P_{i+1}(x, y)| > T_1, \\ 0 & \text{otherwise}, \end{cases}$$

$$D(i, i+1) = \frac{\sum_{x=1}^{X} \sum_{y=1}^{Y} DP(i, i+1, x, y)}{XY}. \qquad (2)$$

If the percentage of changed pixels D(i, i+1) is greater than a threshold T_2, a cut is detected.

Although some irrelevant frame differences are filtered out, these approaches are still sensitive to object and camera movements. For example, if the camera pans, a large number of pixels may be judged as changed even though the frame content has only shifted by a few pixels. It is possible to reduce this effect to a certain extent by applying a smoothing filter: before the comparison, each pixel is replaced by the mean value of its neighbors.
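To make the two pixel-based metrics above concrete, here is a minimal sketch, assuming gray-level frames are available as NumPy arrays; the function names and threshold values are illustrative, not part of the surveyed methods.

```python
import numpy as np

def pixel_difference(frame_a, frame_b):
    """Eq. (1): mean absolute pixel difference between two gray-level frames."""
    diff = np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32))
    return diff.mean()

def changed_pixel_ratio(frame_a, frame_b, t1):
    """Eq. (2): fraction of pixels whose value changes by more than t1."""
    diff = np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32))
    return (diff > t1).mean()

def is_cut(frame_a, frame_b, t1=20, t2=0.4):
    # A cut is declared when the ratio of changed pixels exceeds t2
    # (both thresholds are illustrative and in practice are tuned per video).
    return changed_pixel_ratio(frame_a, frame_b, t1) > t2
```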

2.1.2. Block-based comparison

In contrast to template matching, which is based on global image characteristics (pixel-by-pixel differences), block-based approaches use local characteristics to increase robustness to camera and object movement. Each frame i is divided into b blocks that are compared with their corresponding blocks in frame i+1. Typically, the difference between i and i+1 is measured by

    D(i, i#1)"b+k/1

    ckDP(i, i#1, k), (3)

    where ckis a predetermined coe$cient for the block

    k and DP(i, i#1, k) is a partial match value be-tween the kth blocks in i and i#1 frames.

In [17] corresponding blocks are compared using a likelihood ratio

$$\lambda_k = \frac{\left[ \dfrac{\sigma_{k,i} + \sigma_{k,i+1}}{2} + \left( \dfrac{\mu_{k,i} - \mu_{k,i+1}}{2} \right)^2 \right]^2}{\sigma_{k,i} \cdot \sigma_{k,i+1}}, \qquad (4)$$

where mu_{k,i}, mu_{k,i+1} are the mean intensity values of the two corresponding blocks k in the consecutive frames i and i+1, and sigma_{k,i}, sigma_{k,i+1} are their variances, respectively. Then the number of blocks for which the likelihood ratio is greater than


Fig. 5. Net comparison algorithm: base windows B_ij.

a threshold T_1 is counted:

$$DP(i, i+1, k) = \begin{cases} 1 & \text{if } \lambda_k > T_1, \\ 0 & \text{otherwise}. \end{cases} \qquad (5)$$

A cut is declared when the number of changed blocks is large enough, i.e. D(i, i+1) is greater than a given threshold T_2, with c_k = 1 for all k.

Compared to template matching, this method is more tolerant to slow and small object motion from frame to frame. On the other hand, it is slower due to the complexity of the statistical formulas. An additional potential disadvantage is that no change will be detected in the case of two corresponding blocks that are different but have the same density function. Such situations, however, are very unlikely.
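The following is a rough sketch of the block likelihood-ratio test of Eqs. (4) and (5), assuming gray-level frames divided into a regular grid of square blocks; the block size, threshold values and function names are illustrative choices, not taken from [17].

```python
import numpy as np

def likelihood_ratio(block_a, block_b):
    """Eq. (4): likelihood ratio computed from block means and variances."""
    mu_a, mu_b = block_a.mean(), block_b.mean()
    var_a, var_b = block_a.var(), block_b.var()
    num = ((var_a + var_b) / 2.0 + ((mu_a - mu_b) / 2.0) ** 2) ** 2
    return num / (var_a * var_b + 1e-9)   # small epsilon guards flat blocks

def changed_blocks(frame_a, frame_b, block=16, t1=5.0):
    """Eq. (5): count blocks whose likelihood ratio exceeds t1."""
    h, w = frame_a.shape
    count = 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = frame_a[y:y + block, x:x + block].astype(np.float64)
            b = frame_b[y:y + block, x:x + block].astype(np.float64)
            if likelihood_ratio(a, b) > t1:
                count += 1
    return count  # declare a cut when this count exceeds a second threshold T_2
```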

Another block-based technique is proposed by Shahraray [32]. The frame is divided into 12 non-overlapping blocks. For each of them, the best match is found in a neighborhood of the corresponding block in the previous image, based on image intensity values. A non-linear order-statistics filter is used to combine the match values, i.e. the weight of a match value in Eq. (3) depends on its rank in the ordered list of match values. Thus, the effect of camera and object movements is further suppressed. The author claims that such a similarity measure between two images is more consistent with human judgement. Both cuts and gradual transitions are detected. Cuts are found using thresholds, as in the other approaches discussed, while gradual transitions are detected by identifying a sustained low-level increase in match values.

Xiong et al. [41] describe a method they call net comparison, which attempts to detect cuts by inspecting only part of the image. It is shown that the error will be low enough if fewer than half of the so-called base windows (non-overlapping square blocks, Fig. 5) are checked. Under an assumption about the largest movement between two images, the size of the windows can be chosen large enough to be indifferent to a non-break change and small enough to retain as much spatial information as possible. Base windows are compared using the difference between the means of their gray-level or color values. If this difference is larger than a threshold, the region is considered changed. When the number of changed windows is greater than another threshold, a cut is declared. The experiments demonstrated that the approach is faster and more accurate than pixel pair-wise, likelihood and local histogram methods. In their subsequent paper [40], the idea of subsampling the video in space is further extended to subsampling in both space and time. The new step-variable algorithm detects both abrupt and gradual transitions by comparing frames i and j, where j = i + myStep. If no significant change is found between them, the window moves half a step forward and the next comparison is between i + myStep/2 and j + myStep/2. Otherwise, binary search is used to locate the change. If i and j are successive frames and their difference is bigger than a threshold, a cut is declared. Otherwise, edge differences between the two frames are compared against another threshold to check for a gradual transition. Obviously, the performance depends on the proper setting of myStep: large steps are efficient but increase the number of false alarms, while too small steps may result in missing gradual transitions. In addition, the approach is very sensitive to object and camera motion.

2.1.3. Histogram comparison

A step further towards reducing sensitivity to camera and object movements can be made by comparing the histograms of successive images. The idea behind histogram-based approaches is that two frames with an unchanging background and unchanging (although moving) objects will have little difference in their histograms. In addition, histograms are invariant to image rotation and change slowly under variations of viewing angle and scale [35]. As a disadvantage, one can note that two images with similar histograms may have completely different content. However, the probability of such events is low enough, and techniques for dealing with this problem have already been proposed [28].


A gray-level (color) histogram of a frame i is an n-dimensional vector H_i(j), j = 1, ..., n, where n is the number of gray levels (colors) and H_i(j) is the number of pixels of frame i with gray level (color) j.

2.1.3.1. Global histogram comparison. The simplest approach uses an adaptation of the metric from Eq. (1): instead of intensity values, gray-level histograms are compared [25,39,45]. A cut is declared if the absolute sum of histogram differences between two successive frames, D(i, i+1), is greater than a threshold T:

$$D(i, i+1) = \sum_{j=1}^{n} |H_i(j) - H_{i+1}(j)|, \qquad (6)$$

where H_i(j) is the histogram value for gray level j in frame i and n is the total number of gray levels.

Another simple and very effective approach is to compare color histograms. Zhang et al. [45] apply Eq. (6) where j, instead of a gray level, denotes a code value derived from the three color intensities of a pixel. In order to reduce the number of bins (3 colors x 8 bits would create histograms with 2^24 bins), only the upper two bits of each color intensity are used to compose the color code. The comparison of the resulting 64 bins has been shown to give sufficient accuracy.
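As a concrete illustration of this color-code scheme, the sketch below builds the 64-bin histogram from the upper two bits of each RGB component and compares two histograms with Eq. (6); the array layout and function names are assumptions made for the example.

```python
import numpy as np

def color_code_histogram(frame_rgb):
    """64-bin histogram built from the upper two bits of each R, G, B component."""
    top2 = (frame_rgb.astype(np.uint8) >> 6).reshape(-1, 3)
    codes = (top2[:, 0] << 4) | (top2[:, 1] << 2) | top2[:, 2]   # 6-bit color code
    return np.bincount(codes, minlength=64)

def histogram_difference(hist_a, hist_b):
    """Eq. (6): sum of absolute bin-to-bin differences; compare against a threshold T."""
    return np.abs(hist_a.astype(np.int64) - hist_b.astype(np.int64)).sum()
```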

To enhance the difference between two frames across a cut, several authors [25] propose the use of the chi-squared test to compare the (color) histograms H_i(j) and H_{i+1}(j) of two successive frames i and i+1:

$$D(i, i+1) = \sum_{j=1}^{n} \frac{|H_i(j) - H_{i+1}(j)|^2}{H_{i+1}(j)}. \qquad (7)$$

When the difference is larger than a given threshold T, a cut is declared. However, experimental results reported in [45] show that the chi-squared test not only enhances the difference between two frames across a cut but also increases the difference due to camera and object movements. Hence, the overall performance is not necessarily better than the linear histogram comparison of Eq. (6). In addition, the chi-squared statistic requires more computation time.

Gargi et al. [12] evaluate the performance of three histogram-based methods using six different color coordinate systems: RGB, HSV, YIQ, L*a*b*, L*u*v* and Munsell. The RGB histogram of a frame is computed as three sets of 256 bins. The other five histograms are represented as a two-dimensional distribution over the two non-intensity dimensions of the color spaces, namely: H and S for HSV, I and Q for YIQ, a* and b* for L*a*b*, u* and v* for L*u*v*, and the hue and chroma components for the Munsell space. The number of bins is 1600 (40 x 40) for the L*a*b*, L*u*v* and YIQ histograms, and 1800 (60 hues x 30 saturations/chromas) for the HSV and Munsell space histograms. The difference functions used to compare the histograms of two consecutive frames are defined as follows:

- bin-to-bin differences, as in Eq. (6);
- histogram intersection:

$$D(i, i+1) = 1 - \mathrm{Intersection}(H_i, H_{i+1}) = 1 - \frac{\sum_{j=1}^{n} \min(H_i(j), H_{i+1}(j))}{\sum_{j=1}^{n} \max(H_i(j), H_{i+1}(j))}; \qquad (8)$$

  Note that for two identical histograms the intersection is 1 and the difference is 0, while for two frames which do not share even a single pixel of the same color (bin), the difference is 1.

- weighted bin differences:

$$D(i, i+1) = \sum_{j=1}^{n} \sum_{k \in N(j)} W(k) \, (H_i(j) - H_{i+1}(k)), \qquad (9)$$

where N(j) is a neighborhood of bin j and W(k) is the weight assigned to that neighbor. 3 x 3 or 3-bin neighborhoods are used in the case of two-dimensional and one-dimensional histograms, respectively.

It is found that, in terms of overall classification accuracy, the YIQ, L*a*b* and Munsell color coordinate spaces perform well, followed by HSV, L*u*v* and RGB. In terms of the computational cost of conversion from RGB, HSV and YIQ are the least expensive, followed by L*a*b*, L*u*v* and the Munsell space.

So far, only histogram comparison techniques for cut detection have been presented. They are based on the fact that there is a big difference between the


Fig. 6. Twin comparison: (a) consecutive and (b) accumulated histogram differences.

frames across a cut, which results in a high peak in the histogram comparison and can easily be detected using one threshold. However, such one-threshold approaches are not suitable for detecting gradual transitions. Although during a gradual transition the frame-to-frame differences are usually higher than those within a shot, they are much smaller than the differences in the case of a cut and cannot be detected with the same threshold. On the other hand, object and camera motion might entail bigger differences than the gradual transition itself, so lowering the threshold would increase the number of false positives. Below we review a simple and effective two-threshold technique for gradual transition recognition.

The twin-comparison method [45] takes into account the cumulative differences between the frames of a gradual transition. In the first pass a high threshold T_h is used to detect cuts, as shown in Fig. 6(a). In the second pass a lower threshold T_l is employed to detect the potential starting frame F_s of a gradual transition. F_s is then compared to subsequent frames (Fig. 6(b)). This is called an accumulated comparison, as this difference value increases during a gradual transition. The end frame F_e of the transition is detected when the difference between consecutive frames decreases to less than T_l, while the accumulated comparison has increased to a value higher than T_h. If the consecutive difference falls below T_l before the accumulated difference exceeds T_h, then the potential start frame F_s is dropped and the search continues for other gradual transitions. It was found, however, that there are some gradual transitions during which the consecutive difference falls below the lower threshold. This problem can easily be solved by setting a tolerance value that allows a certain number of consecutive frames with low difference values before rejecting the transition candidate. As can be seen, twin comparison detects both abrupt and gradual transitions at the same time. Boreczky and Rowe [6] compared several temporal video segmentation techniques on real video sequences and found that twin comparison is a simple algorithm that works very well.
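One possible reading of the twin-comparison pass is sketched below, with the end-of-transition test slightly simplified; the diff function, thresholds and tolerance value are illustrative assumptions rather than the exact procedure of [45].

```python
def twin_comparison(frames, diff, t_high, t_low, tolerance=2):
    """Detect cuts and gradual transitions with two thresholds.

    diff(a, b) is a frame dissimilarity, e.g. the histogram difference of Eq. (6).
    Returns cut frame indices and (start, end) ranges of gradual transitions.
    """
    cuts, graduals = [], []
    i = 0
    while i < len(frames) - 1:
        d = diff(frames[i], frames[i + 1])
        if d >= t_high:                          # abrupt transition (cut)
            cuts.append(i)
            i += 1
            continue
        if d >= t_low:                           # potential start of a gradual transition
            start, j, low_run = i, i + 1, 0
            while j < len(frames) - 1 and low_run <= tolerance:
                if diff(frames[j], frames[j + 1]) < t_low:
                    low_run += 1                 # tolerate a few quiet frames
                else:
                    low_run = 0
                j += 1
            if diff(frames[start], frames[j]) >= t_high:   # accumulated comparison
                graduals.append((start, j))
            i = j
            continue
        i += 1
    return cuts, graduals
```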

2.1.3.2. Local histogram comparison. As already discussed, histogram-based approaches are simple and more robust to object and camera movements, but they ignore the spatial information and therefore fail when two different images have similar histograms. On the other hand, block-based comparison methods make use of spatial information. They typically perform better than pair-wise pixel comparison, but are still sensitive to camera and object motion and are also computationally expensive. By integrating the two paradigms, false alarms due to camera and object movement can be reduced, while enough spatial information is retained to produce more accurate results.

The frame-to-frame difference between frame i and frame i+1 is computed as

$$D(i, i+1) = \sum_{k=1}^{b} DP(i, i+1, k), \qquad DP(i, i+1, k) = \sum_{j=1}^{n} |H_i(j, k) - H_{i+1}(j, k)|, \qquad (10)$$

where H_i(j, k) denotes the histogram value at gray level j for region (block) k and b is the total number of blocks.

For example, Nagasaka and Tanaka [25] compare several statistics based on gray-level and color pixel differences and on histogram comparisons. The best results were obtained by breaking the image into 16 equal-sized regions, using a chi-squared test on the color histograms of these regions, and discarding the largest differences to reduce the effects of noise and of object and camera movements.

Another approach based on local histogram comparison is proposed by Swanberg et al. [36]. The partial difference DP(i, i+1, k) is measured by comparing the color RGB histograms of the blocks


using the following equation:

$$DP(i, i+1, k) = \sum_{c \in \{R, G, B\}} \sum_{l=1}^{n} \frac{(H^c_i(l) - H^c_{i+1}(l))^2}{H^c_i(l) + H^c_{i+1}(l)}. \qquad (11)$$

Then Eq. (3) is applied with c_k = 1/b for all k.

Lee and Ip [22] introduce a selective HSV histogram comparison algorithm. In order to reduce the frame-to-frame differences caused by changes in intensity or shade, image blocks are compared in the HSV (hue, saturation, value) color space. It is the use of hue that makes the algorithm insensitive to such changes, since hue is independent of saturation and intensity. However, as hue is unstable when the saturation or the value is very low, selective comparison is proposed. If a pixel contains rich color information (i.e. high V and high S), it is classified into a discrete color based on its hue (Hue); otherwise it is classified based on its intensity value (Gray). The selective histograms H^hue_i(h, k) and H^gray_i(g, k) and the frame-to-frame difference for block k with dimensions X x Y are formulated as follows:

$$H^{hue}_i(h, k) = \sum_{x=1}^{X} \sum_{y=1}^{Y} I^{hue}_i(x, y, h), \qquad H^{gray}_i(g, k) = \sum_{x=1}^{X} \sum_{y=1}^{Y} I^{gray}_i(x, y, g),$$

$$I^{hue}_i(x, y, h) = \begin{cases} 1 & \text{if } S_i(x, y, h) > T_s \text{ and } V_i(x, y, h) > T_v, \\ 0 & \text{otherwise}, \end{cases}$$

$$I^{gray}_i(x, y, g) = \begin{cases} 1 & \text{if } S_i(x, y, g) \le T_s \text{ or } V_i(x, y, g) \le T_v, \\ 0 & \text{otherwise}, \end{cases}$$

$$D(i, i+1, k) = \sum_{h=1}^{N} |H^{hue}_i(h, k) - H^{hue}_{i+1}(h, k)| + \sum_{g=1}^{M} |H^{gray}_i(g, k) - H^{gray}_{i+1}(g, k)|, \qquad (12)$$

where h and g are indexes over the hue and gray levels, respectively, T_s and T_v are thresholds, and x, y are pixel coordinates.

To further improve the algorithm by increasing the differences across a cut, local histogram comparison is performed. It is shown that the algorithm outperforms both histogram-based (gray-level global and local) and pixel-difference-based approaches. However, none of the algorithms gives satisfactory performance on very dark video images.
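A rough sketch of the selective binning idea behind Eq. (12) is given below, assuming frames already converted to HSV with hue in [0, 360) and S, V in [0, 1]; the bin counts and the thresholds T_s, T_v are illustrative values, not those of [22].

```python
import numpy as np

def selective_histograms(hsv, n_hue=24, n_gray=8, t_s=0.2, t_v=0.2):
    """Split pixels into hue bins (rich color) or gray bins (low S or low V)."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    rich = (s > t_s) & (v > t_v)                        # pixels with reliable hue
    hue_bins = np.clip((h[rich] / 360.0 * n_hue).astype(int), 0, n_hue - 1)
    gray_bins = np.clip((v[~rich] * n_gray).astype(int), 0, n_gray - 1)
    return (np.bincount(hue_bins, minlength=n_hue),
            np.bincount(gray_bins, minlength=n_gray))

def selective_difference(hsv_a, hsv_b):
    """Eq. (12)-style difference: hue-histogram plus gray-histogram distances."""
    hue_a, gray_a = selective_histograms(hsv_a)
    hue_b, gray_b = selective_histograms(hsv_b)
    return np.abs(hue_a - hue_b).sum() + np.abs(gray_a - gray_b).sum()
```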

2.1.4. Clustering-based temporal video segmentation

The approaches discussed so far rely on suitable thresholding of similarities between successive frames. However, the thresholds are typically highly sensitive to the type of input video. This drawback is overcome in [13] by applying an unsupervised clustering algorithm. More specifically, temporal video segmentation is viewed as a two-class clustering problem ("scene change" and "no scene change") and the well-known K-means algorithm [27] is used to cluster frame dissimilarities. Then the frames from the "scene change" cluster which are temporally adjacent are labelled as belonging to a gradual transition, and the other frames from this cluster are considered as cuts. Two similarity measures based on color histograms were used: the chi-squared statistic and the histogram difference defined in Eq. (6), both in the RGB and YUV color spaces. The experiments show that chi-squared in YUV detects the largest number of correct transitions, but the histogram difference in YUV is the best choice in terms of overall performance (i.e. number of false alarms and correct detections). As a limitation, we can note that the approach is not able to recognize the type of gradual transition. The main advantage of clustering-based segmentation is that it is a generic technique that not only eliminates the need for threshold setting but also allows multiple features to be used simultaneously to improve performance. For example, in their subsequent work, Ferman and Tekalp [10] incorporate two features in the clustering method: histogram difference and pair-wise pixel comparison. It was found that, when filtered, these features complement one another, which results in both high recall and precision. A technique for clustering-based temporal segmentation on-the-fly was introduced as well.
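The sketch below illustrates the two-class K-means idea on a one-dimensional sequence of frame dissimilarities; the initialization and iteration count are illustrative choices, not the exact procedure of [13].

```python
import numpy as np

def kmeans_two_class(dissimilarities, iters=50):
    """Cluster frame dissimilarities into 'no scene change' vs 'scene change'."""
    x = np.asarray(dissimilarities, dtype=float)
    centers = np.array([x.min(), x.max()], dtype=float)   # low and high initial centers
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean()
    # The cluster with the larger center is interpreted as 'scene change'.
    return labels == centers.argmax()

# Temporally adjacent 'scene change' frames form gradual transitions;
# isolated ones are treated as cuts.
```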

2.1.5. Feature-based temporal video segmentation

An interesting approach for temporal video segmentation based on features is described by Zabih et al. [44]. It involves analyzing intensity edges


between consecutive frames. During a cut or a dissolve, new intensity edges appear far from the locations of the old edges. Similarly, old edges disappear far from the locations of the new edges. Thus, by counting the entering and exiting edge pixels, cuts, fades and dissolves are detected and classified. To obtain better results in the case of object and camera movements, an algorithm for motion compensation is also included. It first estimates the global motion between frames, which is then used to align the frames before detecting entering and exiting edge pixels. However, this technique is not able to handle multiple rapidly moving objects. As the authors point out, another weakness of the approach is the false positives due to the limitations of the edge detection method. In particular, rapid changes in the overall shot brightness, and very dark or very light frames, may cause false positives.

2.1.6. Model-driven temporal video segmentation

The video segmentation techniques presented so far are sometimes referred to as data-driven, bottom-up approaches [14]. They address the problem from a data analysis point of view. It is also possible to apply top-down algorithms that are based on mathematical models of the video data. Such approaches allow a systematic analysis of the problem and the use of several domain-specific constraints that may improve efficiency.

Hampapur et al. [14] present a shot boundary identification approach based on a mathematical model of the video production process. This model was used as a basis for the classification of the video edit types (cuts, fades, dissolves).

For example, fades and dissolves are chromatic edits and can be modeled as

$$S(x, y, t) = S_1(x, y, t)\left(1 - \frac{t}{l_1}\right) + S_2(x, y, t)\,\frac{t}{l_2}, \qquad (13)$$

where S_1(x, y, t) and S_2(x, y, t) are the two shots that are being edited, S(x, y, t) is the edited shot, and l_1, l_2 are the numbers of frames of each shot involved in the edit.

The taxonomy, along with the models, is then

used to identify features that correspond to the different classes of shot boundaries. Finally, the feature

vectors are fed into a system for frame classification and temporal video segmentation. The approach is sensitive to camera and object motion.

Another model-based technique, called the differential model of motion pictures, is proposed by Aigrain and Joly [1]. It is based on the probabilistic distribution of differences in pixel values between two successive frames and combines the following factors: (1) a small-amplitude additive zero-centered Gaussian noise that models camera, film, digitizer and other noise; (2) an intra-shot change model for the pixel change probability distribution resulting from object and camera motion, angle, focus and light changes; (3) a shot transition model for the different types of abrupt and gradual transitions. The histogram of the absolute values of pixel differences is computed, and the number of pixels whose values change within a certain range determined by the models is counted. Shot transitions are then detected by examining the resulting integer sequences. Experiments show 94-100% accuracy for cuts and 80% for gradual transition detection.

Yu et al. [43] present an approach for gradual transition detection based on a model of the intensity changes during fade out, fade in and dissolve. In the first pass, cuts are detected using histogram comparison. The gradual transitions are then detected by examining the frames between the cuts using the proposed model of their characteristics. For example, it was found that the number of edge pixels has a local minimum during a gradual transition. However, as this feature exhibits the same behavior in the case of zooms and pans, additional characteristics of fades and dissolves need to be used for their detection. During a fade, the beginning or end image is a constant image, hence the number of edge pixels will be close to zero; furthermore, the number of edge pixels gradually increases moving away from the minimum in either direction. In order to distinguish dissolves, the so-called double chromatic difference curve is examined. It is based on the idea that the frames of a dissolve can be recovered from the beginning and end frames. The approach has low computational requirements but works under the assumption of small object movement.

Boreczky and Wilcox [7] use hidden Markov models (HMMs) for temporal video segmentation.


Table 1. Six groups of approaches for temporal video segmentation in the compressed domain, based on the information used

Information used  | 1 | 2 | 3 | 4 | 5 | 6
DCT coefficients  | x |   |   | x |   |
DC terms          |   | x | x |   |   |
MB coding mode    |   |   | x | x | x | x
MVs               |   |   | x | x | x |
Bit-rate          |   |   |   |   |   | x

Separate states are used to model shot, cut, fade, dissolve, pan and zoom. The arcs between states model the allowable progressions of states. For example, from the shot state it is possible to go to any of the transition states, but from a transition state it is only possible to return to a shot state. Similarly, the pan and zoom states can only be reached from the shot state, since they are subsets of the shot. The arcs from a state to itself model the length of time the video stays in that particular state. Three different types of features (image, audio and motion) are used: (1) a standard gray-level histogram distance between two adjacent frames; (2) an audio distance based on the acoustic difference in intervals just before and just after the frames; and (3) an estimate of object motion between the two frames. The parameters of the HMM, namely the transition probabilities associated with the arcs and the probability distributions of the features associated with the states, are learned by training with the Baum-Welch algorithm. The training data consist of feature vectors computed for a collection of videos and labelled as one of the following classes: shot, cut, fade, dissolve, pan and zoom. Once the parameters are trained, segmentation of the video is performed using the Viterbi algorithm, a standard technique for recognition with HMMs.

Thus, thresholds are not required, as the parameters are learned automatically. Another advantage of the approach is that the HMM framework allows any number of features to be included in the feature vector. The algorithm was tested on different video databases and has been shown to improve the accuracy of temporal video segmentation in comparison to the standard threshold-based approaches.

2.2. Temporal video segmentation in the MPEG compressed domain

The previous approaches for video segmentation process uncompressed video. As video is nowadays increasingly stored and transmitted in compressed form (e.g. MPEG), it is highly desirable to develop methods that can operate directly on the encoded stream. Working in the compressed domain offers the following advantages. First, by not having to perform decoding/re-encoding, computational

complexity is reduced and savings on decompression time and storage are obtained. Second, operations are faster due to the lower data rate of compressed video. Last but not least, the encoded video stream already contains a rich set of pre-computed features, such as motion vectors (MVs) and block averages, that are suitable for temporal video segmentation.

Several algorithms for temporal video segmentation in the compressed domain have been reported. According to the type of information used (see Table 1), they can be divided into six non-overlapping groups: segmentation based on (1) DCT coefficients; (2) DC terms; (3) DC terms, macroblock (MB) coding mode and MVs; (4) DCT coefficients, MB coding mode and MVs; (5) MB coding mode and MVs; and (6) MB coding mode and bit-rate information. Before reviewing each of them, we present a brief description of the fundamentals of the MPEG compression standard.

2.2.1. MPEG stream

The Moving Picture Experts Group (MPEG) standard is the most widely accepted international standard for digital video compression. It uses two basic techniques: MB-based motion compensation to reduce temporal redundancy, and transform-domain block-based compression to capture spatial redundancy. An MPEG stream consists of three types of pictures (I, P and B) which are combined in a repetitive pattern called a group of pictures (GOP). Fig. 7 shows a typical GOP and the predictive relationships between the different types of frames.


Fig. 8. Intra coding.

Fig. 7. Typical GOP and predictive relationships between I, P and B pictures.

Intra (I) frames provide random access points into the compressed data and are coded using only information present in the picture itself, by means of the Discrete Cosine Transform (DCT), quantization (Q), run-length encoding (RLE) and Huffman entropy coding; see Fig. 8. The first DCT coefficient is called the DC term and is 8 times the average intensity of the respective block.

P (predicted) frames are coded with forward motion compensation using the nearest previous reference (I or P) pictures. Bi-directional (B) pictures are also motion compensated, this time with respect to both past and future reference frames. In the case of motion compensation, for each 16 x 16 MB of the current frame the encoder finds the best matching block in the respective reference frame(s), calculates and DCT-encodes the residual error, and also transmits one or two MVs; see Figs. 9 and 10. During the encoding process a test is made on each MB of a P or B frame to see whether it is more expensive to use motion compensation or intra coding. The latter occurs when the current frame does not have much in common with the reference frame(s). As a result, each MB of a P frame can be coded either intra or forward, while for each MB of a B frame there are four possibilities: intra, forward, backward or interpolated. For more information about MPEG see [16].

2.2.2. Temporal video segmentation based on DCT coefficients

The pioneering work on video parsing directly in the compressed domain was conducted by Arman et al. [5], who proposed a technique for cut detection based on the DCT coefficients of I frames. For each frame, a subset of the DCT coefficients of a subset of the blocks is selected in order to construct a vector V_i = {c_1, c_2, c_3, ...}.

Fig. 9. Forward prediction for P frames.

    Fig. 10. Interpolated prediction for B frames.

In the pair-wise comparison approach, the difference of block l between two frames that are u frames apart is given by

$$DP(i, i+u, l) = \frac{1}{64} \sum_{k=1}^{64} \frac{|c_{l,k}(i) - c_{l,k}(i+u)|}{\max[c_{l,k}(i), c_{l,k}(i+u)]} > T_1, \qquad (15)$$

where c_{l,k}(i) is the k-th DCT coefficient of block l in frame i, k = 1, ..., 64, and l depends on the size of the frame.

If the difference exceeds a given threshold T_1, block l is considered to be changed. If the number of changed blocks is larger than another threshold T_2, a transition between the two frames is declared. The pair-wise comparison requires far less computation than the difference metric used by Arman. The processing time can be further reduced by applying Arman's method of using only a subset of coefficients and blocks.

It should be noted that both of the above algorithms can be applied only to the I frames of the MPEG compressed video, as these are the frames fully encoded with DCT coefficients. As a result, the processing time is significantly reduced, but the temporal resolution is low. In addition, due to the loss of resolution between the I frames, false positives are introduced and, hence, the classification accuracy decreases. Also, neither of the two algorithms can handle gradual transitions or the false positives introduced by camera operations and object motion.

2.2.3. Temporal video segmentation based on DC terms

For temporal video segmentation in the MPEG compressed domain, the most natural solution is to use the DC terms, as they are directly related to the pixel domain, possibly reconstructing them for P and B frames, for which only the DC terms of the residual errors are available. Then, analogously to the uncompressed-domain methods, the changes between successive frames are evaluated by difference metrics and the decision is taken by complex thresholding.

For example, Yeo and Liu [42] propose a method in which so-called DC-images are created and compared. DC-images are spatially reduced versions of the original images: the (i, j) pixel of the DC-image is the average value of the (i, j) block of the image (Fig. 11).

As each DC term is a scaled version of the block's average value, DC-images can be constructed from the DC terms. The DC terms of I frames are directly available in the MPEG stream, while those of B and P frames are estimated using the MVs and the DCT coefficients of previous I frames. It should be noted that this reconstruction technique is computationally very expensive: in order to compute the DC term of a reference frame (DC_ref), eight 8 x 8 matrix multiplications and four matrix summations are required for each block. The pixel differences of DC-images are then compared, and a sliding window is used to set the thresholds, because a shot transition is a local activity.

In order to find a suitable similarity measure, the authors compare metrics based on pixel differences and on color histograms. They confirm that, when full images are compared, the first group of metrics is more sensitive to camera and object movements but computationally less expensive than the second one. However, when DC-images are compared, pixel-difference-based metrics give satisfactory results, as DC-images are already smoothed versions of the corresponding full images. Hence, as in the


Fig. 11. A full image (352 x 288 pixels) and its DC-image (44 x 36 pixels).

Fig. 12. g_n and D_{g_n}(l, l+k) in the dissolve detection algorithm of Yeo and Liu.

pixel-domain approaches (e.g. Eq. (1)), abrupt transitions are detected using a similarity measure based on the sum of absolute pixel differences of two consecutive frames (DC-images in this case):

$$D(l, l+1) = \sum_{i, j} |P_l(i, j) - P_{l+1}(i, j)|, \qquad (16)$$

where l and l+1 are two consecutive DC-images and P_l(i, j) is the intensity value of the pixel at coordinates (i, j) in the l-th DC-image.

In contrast to the previous methods for cut detection, which apply global thresholds to the difference metrics, Yeo and Liu propose to use local thresholds, since scene changes are local activities in the temporal domain. In this way, false positives due to significant camera and object motion are reduced. More specifically, a sliding window is used to examine m successive frame differences. A cut between frames l and l+1 is declared if the following two conditions are satisfied: (1) D(l, l+1) is the maximum within a symmetric sliding window of size 2m-1, and (2) D(l, l+1) is n times the second largest maximum in the window. The second condition guards against false positives due to fast panning or zooming and to camera flashes, which typically manifest themselves as sequences of large differences or as two consecutive peaks, respectively. The size of the sliding window m is set to be smaller than the minimum duration between two transitions, while the value of n typically ranges from 2 to 3.
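A small sketch of this sliding-window decision rule, operating on a precomputed list of DC-image differences, is given below; the defaults for m and n are only indicative.

```python
def sliding_window_cuts(diffs, m=10, n=3.0):
    """Declare a cut at l when D(l, l+1) dominates a symmetric window of size 2m-1.

    diffs[l] is the DC-image difference between frames l and l+1 (Eq. (16)).
    """
    cuts = []
    for l in range(len(diffs)):
        lo, hi = max(0, l - m + 1), min(len(diffs), l + m)
        window = diffs[lo:hi]
        largest = max(window)
        others = [x for k, x in enumerate(window, start=lo) if k != l]
        second = max(others) if others else 0.0
        if diffs[l] == largest and diffs[l] >= n * second:
            cuts.append(l)
    return cuts
```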

Gradual transitions are detected by comparing each frame with the k-th following frame, where k is larger than the number of frames in the gradual transition. A gradual transition g_n in the form of a linear transition from c_1 to c_2 in the time interval (a_1, a_2) is modeled as

$$g_n = \begin{cases} c_1, & n < a_1, \\ \dfrac{c_2 - c_1}{a_2 - a_1}(n - a_2) + c_2, & a_1 \le n < a_2, \\ c_2, & n \ge a_2. \end{cases} \qquad (17)$$

Then, if k > a_2 - a_1, the difference between frames l and l+k of the transition g_n will be

$$D_{g_n}(l, l+k) = \begin{cases} 0, & n < a_1 - k, \\ \dfrac{|c_2 - c_1|}{|a_2 - a_1|}\,[n - (a_1 - k)], & a_1 - k \le n < a_2 - k, \\ |c_2 - c_1|, & a_2 - k \le n < a_1, \\ -\dfrac{|c_2 - c_1|}{|a_2 - a_1|}\,(n - a_2), & a_1 \le n < a_2, \\ 0, & n \ge a_2. \end{cases} \qquad (18)$$

As D_{g_n}(l, l+k) corresponds to a symmetric plateau with sloping sides (see Fig. 12), the goal of the gradual transition detection algorithm is to identify such plateau patterns. The algorithm of Yeo and Liu requires 11 parameters to be specified.

In [33] shots are detected by color histogram comparison of the DC term images of consecutive frames. Such images are formed by the DC terms of the DCT coefficients of a frame. The DC terms of I pictures are taken directly from the MPEG stream, while those of P and B frames are reconstructed by the following fast algorithm. First, the DC term of the reference image (DC_ref) is approximated using the weighted average of the DC terms


Fig. 13. DC term estimation in the method of Shen and Delp.

Fig. 14. Histogram difference diagram ((*) cut; (- - - -) dissolve).

of the blocks pointed to by the MVs (Fig. 13):

$$DC_{ref} = \frac{1}{64} \sum_{a \in E} N_a \, DC_a, \qquad (19)$$

where DC_a is the DC term of block a, E is the collection of all blocks that are overlapped by the reference block, and N_a is the number of pixels of block a that are overlapped by the reference block.

Then, the approximated DC terms of the predicted pictures are added to the encoded DC terms of the difference images in order to form the DC terms of P and B pictures:

$$DC = DC_{diff} + DC_{ref} \quad \text{(only forward or only backward prediction)},$$

$$DC = DC_{diff} + \tfrac{1}{2}\,(DC_{ref1} + DC_{ref2}) \quad \text{(interpolated prediction)}. \qquad (20)$$

In this way the computations are reduced to at most 4 scalar multiplications and 3 scalar summations per block to determine DC_ref.

The histogram difference diagram is generated using the measure from Eq. (6), comparing DC term images. As can be seen from Fig. 14, a break is represented by a single sharp pulse, while a dissolve entails a number of consecutive medium-height pulses. Cuts are detected using a static threshold. For the recognition of gradual transitions, the histogram difference of the current frame is compared with the average of the histogram differences of the previous frames within a window. If this difference is n times larger than the average value, a possible start of a gradual transition is marked. The same value of n is used as a soft threshold for the following frames. The end of the transition is declared when the histogram difference falls below the threshold. Since during a gradual transition not all of the histogram differences may be higher than the soft threshold, several frames are allowed to have a lower difference (similarly to the twin comparison), as long as the majority of frames in the transition have magnitudes above the soft threshold.

As only the DC terms are used, the computation of the histograms is 64 times faster than with the original pixel values. The approach is not able to distinguish rapid object movement from a gradual transition. As a partial solution, a median filter (of size 3) is applied to smooth the histogram differences when detecting gradual transitions. There are 7 parameters that need to be specified.

An interesting extension of the previous approach is proposed by Taskiran and Delp [37]. After the DC term image sequence and the luminance histogram of each image are obtained, a two-dimensional feature vector is extracted from each pair of images. The first component is a dissimilarity measure based on the histogram intersection of the consecutive DC term images,

$$x_{i1} = 1 - \mathrm{Intersection}(H_i, H_{i+1}) = 1 - \frac{\sum_{j=1}^{n} \min(H_i(j), H_{i+1}(j))}{\sum_{j=1}^{n} H_{i+1}(j)}, \qquad (21)$$

where H_i(j) is the luminance histogram value for bin j in frame i and n is the number of bins used. Note that this definition of the histogram intersection is slightly different from that used in [12] (Section 2.1.3.1).

The second feature is the absolute value of the difference of the standard deviations sigma of the luminance component of the DC term images, i.e. x_{i2} = |sigma_i - sigma_{i+1}|. The so-called generalized sequence trace d for a video stream composed of n frames is defined as d_i = ||x_i - x_{i+1}||, i = 1, ..., n.


Fig. 15. Video shot detection scheme of Patel and Sethi.

These features are chosen not only because they are easy to extract. Combining histogram-based and pixel-based parameters makes sense, as they compensate for some of each other's disadvantages. As already discussed, pixel-based techniques give false alarms in the case of camera and object movements. On the other hand, histogram-based techniques are less sensitive to these effects but may miss a shot transition if the luminance distribution of the frames does not change significantly. It is shown that there are different types of peaks in the generalized trace plot: wide, narrow and intermediate, corresponding respectively to a fade out followed by a fade in, to cuts, and to dissolves. Then, in contrast to the other approaches that apply global or local thresholds to detect the shot boundaries, Taskiran and Delp pose the problem as one-dimensional edge detection and apply a method based on mathematical morphology.

Patel and Sethi [29,30] use only the DC components of I frames. In [30] they compute the intensity histograms of the DC term images and compare them using three different statistics: the Yakimovski likelihood ratio, the chi-squared test and the Kolmogorov-Smirnov statistic. The experiments show that the chi-squared test gives satisfactory results and outperforms the other techniques. In their subsequent paper [29], Patel and Sethi compare local and global histograms of consecutive DC term images using the chi-squared test (Fig. 15).

The local row and column histograms X_i and Y_j are defined as follows:

$$X_i = \frac{1}{M} \sum_{j=1}^{M} b_{0,0}(i, j), \qquad Y_j = \frac{1}{N} \sum_{i=1}^{N} b_{0,0}(i, j), \qquad (22)$$

where b_{0,0}(i, j) is the DC term of block (i, j), i = 1, ..., N, j = 1, ..., M. The outputs of the chi-squared test are combined using majority and average comparisons in order to detect abrupt and gradual transitions.

As only I frames are used, the DC recovery step is eliminated. However, the temporal resolution is low, since in a typical GOP only every 12th frame is an I frame and, hence, the exact shot boundaries cannot be labelled.

2.2.4. Temporal video segmentation based on DC terms and MB coding mode

Meng et al. [24] propose a shot boundary detection algorithm based on the DC terms and the type of MB coding (Fig. 16). DC components are reconstructed only for P frames. Gradual transitions are detected by calculating the variance sigma^2 of the DC term sequence for I and P frames and looking for parabolic shapes in this curve. This is based on the fact that if a gradual transition is a linear mixture of two video sequences f_1 and f_2, with intensity variances sigma_1^2 and sigma_2^2 respectively, characterized by f(t) = f_1(t)[1 - a(t)] + f_2(t) a(t), where a(t) is a linear parameter, then the shape of the variance curve is parabolic: sigma^2(t) = (sigma_1^2 + sigma_2^2) a^2(t) - 2 sigma_1^2 a(t) + sigma_1^2. Cuts are detected by computing the following three ratios:

    Rp"

    intra

    forw, R

    b"

    back

    forw, R

    f"

    forw

    back, (23)

where intra, forw and back are the numbers of MBs in the current frame that are intra, forward and backward coded, respectively.

If there is a cut on a P frame, the encoder cannot use many MBs from the previous anchor frame for motion compensation and, as a result, many MBs will be coded intra. Hence, a suspected cut on a P frame is declared if R_p peaks. On the other hand,


Fig. 16. Shot detection algorithm of Meng et al.

if there is a cut on a B frame, the encoding will be mainly backward. Therefore, a suspected cut on a B frame is declared if there is a peak in R_b. An I frame is a suspected cut frame if two conditions are satisfied: (1) there is a peak in |Delta sigma^2| for this frame, and (2) the B frames before the I frame have peaks in R_f. The first condition is based on the observation that the intensity variance of the frames within a shot is stable, while the second condition prevents false positives due to motion.

This technique is relatively simple, requires minimal decoding and produces good accuracy. The total number of parameters needed to implement this algorithm is 7.
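A minimal sketch of the macroblock-ratio test is given below, assuming the per-frame counts of intra, forward and backward coded MBs have been parsed from the stream; the fixed peak threshold stands in for the peak detection of [24] and is an illustrative simplification.

```python
def mb_ratios(n_intra, n_forw, n_back):
    """Eq. (23): ratios of intra, forward and backward coded MBs (with guards)."""
    eps = 1e-9
    r_p = n_intra / (n_forw + eps)   # peaks on P frames at cuts
    r_b = n_back / (n_forw + eps)    # peaks on B frames at cuts
    r_f = n_forw / (n_back + eps)    # peaks on B frames just before a cut on an I frame
    return r_p, r_b, r_f

def suspected_cut(frame_type, n_intra, n_forw, n_back, peak=3.0):
    """Flag a suspected cut on a P or B frame when the relevant ratio is unusually high."""
    r_p, r_b, _ = mb_ratios(n_intra, n_forw, n_back)
    if frame_type == "P":
        return r_p > peak
    if frame_type == "B":
        return r_b > peak
    return False   # I frames additionally need the variance-peak test described above
```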

2.2.5. Temporal video segmentation based on DCT coefficients, MB coding mode and MVs

A very interesting two-pass approach is taken by Zhang et al. [47]. They first locate the regions of potential transitions, camera operations and object motion by applying the pair-wise DCT coefficient comparison of I frames (Eq. (15)), as in their previous approach (see Section 2.2.2). The goal of the second pass is to refine and confirm the break points detected in the first pass. By checking the number of MVs M in the selected areas, the exact cut locations are detected. If M denotes the number of MVs in a P frame, or the smaller of the numbers of forward and backward nonzero MVs in a B frame, then M < T (where T is a threshold close to zero) is an effective indicator of a cut before or after the B or P frame. Gradual transitions are found by an adaptation of the twin-comparison algorithm utilizing the DCT differences of I frames. By MV analysis (see Section 3.1 for more details),

though using thresholds, false positives due to pans and zooms are detected and discriminated from gradual transitions.

Thus, the algorithm only uses information directly available in the MPEG stream. It offers high processing speed due to the multipass strategy, good accuracy, and it also detects false positives due to pans and zooms. However, the metric for cut detection yields false positives in the case of static frames. Also, the problem of how to distinguish object movements from gradual transitions is not addressed.

2.2.6. Temporal video segmentation based on MB coding mode and MVs

In [21] cuts, fades and dissolves are detected using only the MVs from P and B frames and information about the MB coding mode. The system follows a two-pass scheme and has a hybrid rule-based/neural structure. During the rough scan, peaks in the number of intra-coded MBs in P frames are detected. They can be sharp (Fig. 17) or gradual with a specific shape (Fig. 18), and are good indicators of abrupt and gradual transitions, respectively.

The solution is then refined by a precise scan over the frames of the respective neighborhoods. The 'simpler' boundaries (cuts and black fade edges) are recognized by the rule-based module, while the decisions for the 'complex' ones (dissolves and non-black fade edges) are taken by the neural part. The precise scan also reveals cuts that remain hidden for the rough scan, e.g. B_24, I_49, B_71 and B_96 in Fig. 17. The rules for the exact cut location are based on the number of backward and forward MBs, while those for the detection of the black edges of fades use the number of interpolated and


Fig. 17. Cuts: (a) video structure, (b) number of intra-coded MBs for P frames.

Fig. 18. Fade out, fade in, dissolve: (a) video structure, (b) number of intra-coded MBs for P frames.

backward coded MBs. There is only one threshold in the rules that is easy to set and not sensitive to the type of video. The neural network module learns from pre-classified examples in the form of MV patterns corresponding to the following 6 classes: stationary, pan, zoom, object motion, tracking and dissolve. It is used to distinguish dissolves from object and camera movements, find the exact location of the 'complex' boundaries of the gradual transition and further divide shots into sub-shots. For more details about the neural network see Section 3.3.

The approach is simple, fast, robust to camera operations and very accurate when detecting the exact locations of cuts, fades and simple dissolves. However, sometimes dissolves between busy sequences are recognized as object movement or their boundaries are not exactly determined.
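The rough scan can be illustrated with the minimal sketch below, which finds P frames whose intra-coded MB count stands out sharply from its neighbourhood, marking regions for the precise scan to examine. The window size and the peak factor are illustrative, not the settings of [21].

```python
# Rough-scan sketch: local peak detection on the intra-coded MB counts of P frames.
import numpy as np

def rough_scan(intra_counts, window=5, factor=3.0):
    """intra_counts: number of intra-coded MBs for each P frame, in display order."""
    x = np.asarray(intra_counts, dtype=float)
    peaks = []
    for i in range(len(x)):
        lo, hi = max(0, i - window), min(len(x), i + window + 1)
        neighbours = np.concatenate([x[lo:i], x[i + 1:hi]])
        # +1 avoids flagging frames sitting in an all-zero neighbourhood by accident
        if neighbours.size and x[i] > factor * (np.median(neighbours) + 1.0):
            peaks.append(i)
    return peaks
```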

2.2.7. Temporal video segmentation based on MB coding mode and bit-rate information

Although limited only to cut detection, a simple and effective approach is proposed in [9]. It only

uses the bit-rate information at MB level and the number of various motion predicted MBs. A large change in bit-rate between two consecutive I or P frames indicates a cut between them. Similarly to [24], the number of backward predicted MBs is used for detecting cuts on B frames. Here, the ratio is calculated as R_b = back/mc, where back and mc are the number of backward and all motion compensated MBs in a B frame, respectively. The algorithm is able to locate the exact cut locations. It operates hierarchically by first locating a suspected cut between two I frames, then between the P frames of the GOP and finally (if necessary) by checking the B frames.
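The hierarchical idea can be sketched as follows; the field names, thresholds and the exact refinement logic are assumptions, not the procedure of [9]: a large bit-rate change between consecutive anchor frames marks a suspected interval, which is then narrowed down with the B-frame ratio R_b = back/mc.

```python
# Hedged sketch of a hierarchical, bit-rate-driven cut search.
def bitrate_change(f1, f2):
    # Relative change in the number of bits spent on two anchor (I or P) frames.
    return abs(f1['bits'] - f2['bits']) / max(f1['bits'], f2['bits'], 1)

def locate_cut(frames, rate_thr=0.5, rb_thr=0.6):
    """frames: per-frame info with 'type', 'bits', and for B frames 'back' and
    'mc' (backward-predicted and all motion-compensated MB counts)."""
    anchors = [i for i, f in enumerate(frames) if f['type'] in ('I', 'P')]
    for a, b in zip(anchors, anchors[1:]):
        if bitrate_change(frames[a], frames[b]) > rate_thr:   # suspected interval
            for j in range(a + 1, b):                         # refine on the B frames
                f = frames[j]
                if f['type'] == 'B' and f['mc'] > 0 and f['back'] / f['mc'] > rb_thr:
                    return j                                  # cut located at this B frame
            return b                                          # otherwise at the second anchor
    return None
```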

2.2.8. Comparison of algorithms for temporal video segmentation in compressed domain

In [11] the approaches of Arman et al. [5], Patel and Sethi [29], Meng et al. [24], Yeo and Liu [42] and Shen and Delp [33] are compared along several parameters: classification performance (recall and precision), full data use, ease of implementation, source effects. Ten MPEG video sequences containing more than 30 000 frames connected


Fig. 19. MV patterns resulting from various camera operations.

with 172 cuts and 38 gradual transitions are used as an evaluation database. It is found that the algorithms of Yeo and Liu and of Shen and Delp perform best when detecting cuts. Although none of the approaches recognizes gradual transitions particularly well, the best performance is achieved by the last one. As the authors point out, the reason for the poor gradual transition detection is that the algorithms expect some sort of ideal curve (a plateau or a parabola) but the actual frame differences are noisy and either do not follow this ideal pattern or do not do so smoothly for the entire transition. Another interesting conclusion is that not processing all frame types (e.g., as in the first two methods) does decrease performance significantly. The algorithm of Yeo and Liu is found to be the easiest to implement as it specifies the parameter values and even some performance analysis is already carried out by the authors. The dependence of the two best performing algorithms on bit-rate variations is investigated, and it is shown that they are robust to bit-rate changes except at very low rates. Finally, the dependence of the algorithm of Yeo and Liu on two different software encoder implementations is studied and significant performance differences are reported.

    3. Camera operation recognition

As stated previously, at the stage of temporal video segmentation gradual transitions have to be distinguished from false positives due to camera motion. In the context of content-based retrieval systems, camera operation recognition is also important for key frame selection, index extraction, construction of salient video stills and search narrowing. Historically, motion estimation was extensively studied in the fields of computer vision and image coding and used for tracking moving objects, recovering object movement, and motion compensation coding. Below we review approaches for camera operation recognition related to shot partitioning and characterization.

    3.1. Analysis of MVs

Camera movements exhibit specific patterns in the field of MVs, as shown in Fig. 19. Therefore, many approaches for camera operation recognition are based on the analysis of MV fields.

Zhang et al. [46] apply rules to detect pan/tilt and zoom in/zoom out. During a pan most of the MVs will be parallel to a modal vector that corresponds to the movement of the camera. This is expressed by the following inequality:

Σ_{b=1}^{N} |θ_b − θ_m| ≤ T,   (24)

where θ_b is the direction of the MV for block b, θ_m is the direction of the modal vector, N is the total number of blocks into which the frame is partitioned and T is a threshold near zero.
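A minimal check of inequality (24) is sketched below, assuming the MV field is given as a list of per-block directions in radians; the modal direction is approximated by the circular mean and the sum is normalised by N so that the threshold does not depend on the number of blocks.

```python
# Illustrative pan test based on Eq. (24); the normalisation by N is an assumption.
import numpy as np

def is_pan(angles, T=0.2):
    """angles: MV direction theta_b for every block b (radians)."""
    angles = np.asarray(angles, dtype=float)
    theta_m = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())  # modal direction
    dev = np.angle(np.exp(1j * (angles - theta_m)))   # wrapped difference in (-pi, pi]
    return float(np.mean(np.abs(dev))) <= T
```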

In the case of zooming, the field of MVs has a focus of expansion (zoom in) or a focus of contraction (zoom out). Zooming is determined on the basis of 'peripheral vision', i.e. by comparing the vertical components v_k of the MVs for the top and bottom rows of a frame, since during a zoom they have opposite signs. In addition, the horizontal components u_k of the MVs for the left-most and right-most columns are analyzed in the same way. Mathematically these two conditions can be expressed in


Table 2
MV patterns characterization

Camera operation    MV origin       MV magnitude
Still               No              Zero
Panning             Infinity        Constant
Tracking            Infinity        Changeable
Tilting             Infinity        Constant
Booming             Infinity        Changeable
Zooming             Image center    Constant
Dollying            Image center    Changeable

    Fig. 20. Decision tree.

    the following way:

|v_k^top − v_k^bottom| ≥ max(|v_k^top|, |v_k^bottom|),
|u_k^left − u_k^right| ≥ max(|u_k^left|, |u_k^right|).   (25)

When both conditions are satisfied, a zoom is declared.
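The 'peripheral vision' test of Eq. (25) is easy to sketch; the array layout below and the requirement that the condition hold for every column/row index k are assumptions.

```python
# Illustrative zoom test based on Eq. (25).
import numpy as np

def is_zoom(u, v):
    """u, v: horizontal and vertical MV components on the block grid (rows x cols)."""
    v_top, v_bottom = v[0, :], v[-1, :]          # top and bottom block rows
    u_left, u_right = u[:, 0], u[:, -1]          # left-most and right-most columns
    cond_v = np.all(np.abs(v_top - v_bottom) >= np.maximum(np.abs(v_top), np.abs(v_bottom)))
    cond_u = np.all(np.abs(u_left - u_right) >= np.maximum(np.abs(u_left), np.abs(u_right)))
    # Note: a completely static MV field also satisfies both inequalities, so in
    # practice a minimum-motion check would be added before declaring a zoom.
    return bool(cond_v and cond_u)
```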

    3.2. Hough transform

Akutsu et al. [3] characterize the MV patterns corresponding to the different types of camera operations by two parameters: (1) the magnitude of MVs and (2) the divergence/convergence point, see Table 2.

The algorithm has three stages. During the first one, a block matching algorithm is applied to determine the MVs between successive frames. Then, the spatial and temporal characteristics of MVs are determined. MVs are mapped to a polar coordinate space by the Hough transform. A Hough transform of a line is a point. A group of lines with point of convergence/divergence (x_0, y_0) is represented by a curve ρ = x_0 sin θ + y_0 cos θ in the Hough space. The least-squares method is used to fit the transformed MVs to the curve represented by the above equation. There are specific curves that correspond to the different camera operations, e.g. zoom is characterized by a sinusoidal pattern, pan by a straight line. During the third stage these patterns are recognized and the respective camera operations are identified. The approach is effective but also noise sensitive and with high computational complexity.
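The fitting step alone can be sketched as follows: once every MV has been mapped to a Hough-space point (θ_i, ρ_i), the convergence/divergence point (x_0, y_0) is the least-squares solution of ρ_i = x_0 sin θ_i + y_0 cos θ_i. This is an illustration of the idea, not the pipeline of [3]; the size of the residual indicates how well a single point explains the whole field.

```python
# Least-squares fit of Hough-space points to the curve rho = x0*sin(theta) + y0*cos(theta).
import numpy as np

def fit_convergence_point(theta, rho):
    """theta, rho: Hough-space coordinates of the lines supporting the MVs."""
    theta = np.asarray(theta, dtype=float)
    rho = np.asarray(rho, dtype=float)
    A = np.column_stack([np.sin(theta), np.cos(theta)])      # design matrix
    sol, residuals, *_ = np.linalg.lstsq(A, rho, rcond=None)
    x0, y0 = sol
    return x0, y0, residuals
```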

    3.3. Supervised learning by examples

An alternative approach for detecting camera operations is proposed by Patel and Sethi [29]. They apply induction of decision trees (DTs) [31] to distinguish among the MV patterns of the following six classes: stationary, object motion, pan, zoom, track and ambiguous. DTs are a simple, popular and highly developed technique for supervised learning. In each internal node a test of a single feature determines the path down the tree towards a leaf containing a class label, see Fig. 20. To build a decision tree, a recursive splitting procedure is applied to the set of training examples so that the classification error is reduced. To classify an example that has not been seen during the learning phase, the system starts at the root of the tree and propagates the example down to a leaf.

After the extraction of the MVs from the MPEG stream, Patel and Sethi generate a 10-dimensional feature vector for each P frame. Its first component is the fraction of zero MVs and the remaining components are obtained by averaging the column projection of MV directions. In order to develop a decision tree classifier, the MV patterns of 1320 frames have been manually labeled. The results have shown high classification accuracy at a low computational price. We note that as only MVs of P frames are used, the classification resolution is low. In addition, there are problems with the calculation of the MV direction due to the discontinuity at 0/360°.
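A 10-dimensional feature vector of this kind can be built along the following lines; the number of column bins, the zero-MV test and the simple averaging are assumptions rather than the exact procedure of [29].

```python
# Illustrative construction of a per-P-frame feature vector from the MV field.
import numpy as np

def p_frame_features(u, v, n_cols=9):
    """u, v: horizontal and vertical MV components on the block grid (rows x cols)."""
    zero_frac = float(np.mean((u == 0) & (v == 0)))                  # component 1
    direction = np.degrees(np.arctan2(v, u)) % 360.0                 # per-block MV direction
    col_groups = np.array_split(np.arange(u.shape[1]), n_cols)
    col_means = [float(direction[:, g].mean()) for g in col_groups]  # components 2..10
    return np.array([zero_frac, *col_means])
```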

The above limitations are addressed in [21] where a neural supervised algorithm is applied. Given a set of pre-classified feature vectors (training examples), Learning Vector Quantization (LVQ) [19] creates a few prototypes for each class, adjusts their positions by learning and then


Fig. 21. MV patterns corresponding to different classes.

    classi"es the unseen examples by means of thenearest-neighbor principle. While LVQ can formarbitrary borders, DTs delineate the concept bya set of axis-parallel hyperplanes which constrainstheir accuracy in realistic domains. In comparisonto the approach of Patel and Sethi, one more class(dissolve) is added (Fig. 21) and the MVs fromboth P and B frames are used to generate a22-dimensional feature vector for each frame. The"rst component is calculated using the number ofzero MVs in forward, backward and interpolatedareas. Then, the forward MV pattern is sub-dividedin 7 vertical strips for which the following 3 para-meters are computed: the average of the MV direc-tion, the standard deviation of the MV directionand the average of MV modulus. A technique thatdeals with the discontinuity of angles at 0/3603 isproposed for the calculation of the MV direction.Although high classi"cation accuracy is reported, itwas found that the most di$cult case is to distin-guish dissolve from object motion. In [20] MVpatterns are classi"ed by an integration betweenDTs and LVQ. More speci"cally, DTs are viewedas a feature selection mechanism and only thoseparameters that appear in the tree are considered as

    informative and used as inputs in LVQ. The resultis faster learning at the cost of a slightly worseclassi"cation accuracy.
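The survey does not spell out the 0/360° fix used in [21]; a common remedy, shown here purely as an assumption, is to treat directions as unit vectors and use circular statistics, so that, e.g., 359° and 1° average to 0° rather than 180°.

```python
# Circular mean and standard deviation of MV directions (a generic technique,
# not necessarily the one proposed in [21]).
import numpy as np

def circular_mean_and_std(angles_deg):
    a = np.radians(np.asarray(angles_deg, dtype=float))
    s, c = np.sin(a).mean(), np.cos(a).mean()
    mean = np.degrees(np.arctan2(s, c)) % 360.0        # circular mean direction
    R = max(np.hypot(s, c), 1e-12)                     # mean resultant length
    std = np.degrees(np.sqrt(-2.0 * np.log(R)))        # circular standard deviation
    return mean, std
```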

    3.4. Spatiotemporal analysis

Another way to detect camera operations is to examine the so-called spatiotemporal image sequence. The latter is constructed by arranging the frames one after the other, forming a parallelepiped where the first two dimensions are determined by the frame size and the third one is time. Camera operations are recognized by texture analysis of the different faces.

In [2] video X-ray images are created from the spatiotemporal image sequence, as shown in Fig. 22. Sliced x–t and y–t images are first extracted from the spatiotemporal sequence and are then subject to edge detection. The process is repeated for all x and y values, and the slices are summed in the vertical and horizontal directions to produce gray-scale x–t and y–t video X-ray images. There are typical X-ray images corresponding to the following camera operations: still, pan, tilt and zoom. For example, when the camera is still, the video X-ray shows lines parallel


Fig. 22. Creating video X-ray image.

Fig. 23. Producing 2DST image using 25 horizontal and vertical segments.

to the time line for the background and unmoving objects. When the camera pans, the lines become slanted; in the case of zooming, they are spread.

We should note that performing edge detection on all frames in the video sequence is time consuming and computationally expensive.
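A rough sketch of the x–t X-ray construction follows; the array layout and the Sobel edge operator are assumptions rather than the exact pipeline of [2].

```python
# Build an x-t video X-ray image by edge-detecting each horizontal slice of the
# spatiotemporal volume and summing the slices over y.
import numpy as np
from scipy import ndimage

def xray_xt(volume):
    """volume: spatiotemporal sequence as a grayscale array of shape (t, y, x)."""
    acc = np.zeros((volume.shape[0], volume.shape[2]), dtype=float)
    for y in range(volume.shape[1]):
        slice_xt = volume[:, y, :].astype(float)     # one x-t slice
        grad_t = ndimage.sobel(slice_xt, axis=0)     # edges along time
        grad_x = ndimage.sobel(slice_xt, axis=1)     # edges along x
        acc += np.hypot(grad_t, grad_x)
    return acc / volume.shape[1]                     # gray-scale x-t X-ray image
```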

In [23] the texture of 2-dimensional spatiotemporal (2DST) images is analyzed and the shots are divided into sub-shots described in terms of still scene, zoom and pan. The 2DST images are constructed by stacking up the corresponding segments of the images (Fig. 23). The directivity of the textures is calculated from the power spectrum obtained by applying the 2-dimensional discrete Fourier transform.
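As an illustration of measuring directivity, the 2-D power spectrum of a 2DST image concentrates along a direction orthogonal to its stripes, so the dominant spectral orientation separates still-scene, pan and zoom patterns. The angle-histogram summary below is an assumption for illustration, not the exact procedure of [23].

```python
# Dominant orientation of the 2-D power spectrum of a 2DST image.
import numpy as np

def dominant_spectral_angle(img_2dst, n_bins=36):
    img = np.asarray(img_2dst, dtype=float)
    f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    power = np.abs(f) ** 2
    h, w = power.shape
    yy, xx = np.mgrid[0:h, 0:w]
    angles = np.degrees(np.arctan2(yy - h / 2.0, xx - w / 2.0)) % 180.0
    hist, edges = np.histogram(angles, bins=n_bins, range=(0.0, 180.0), weights=power)
    return float(edges[np.argmax(hist)])   # dominant texture orientation (degrees)
```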

    4. Conclusions

Temporal video segmentation is the first step towards automatic annotation of digital video for browsing and retrieval. It is an active area of research gaining attention from several research communities including image processing, computer vision, pattern recognition and artificial intelligence.

In this paper we have classified and reviewed existing approaches for temporal video segmentation and camera operation recognition, discussing their relative advantages and disadvantages. More than eight years of video segmentation research have resulted in a great variety of approaches. Early work focused on cut detection, while more recent techniques deal with gradual transition detection. The majority of algorithms process uncompressed video. They can be broadly classified into five categories, Fig. 24(a). Since the video is likely to be stored in compressed format, several algorithms which operate directly on the compressed video stream were reported. Based on the type of information used they can be divided into six groups, Fig. 24(b). Their limitations, which highlight the directions for further development, can be summarized as follows. Most of the algorithms (1)


Fig. 24. Taxonomy of techniques for temporal video segmentation that process (a) uncompressed and (b) compressed video: ( ) detect cuts, ( ) detect gradual transitions.

    Fig. 25. Taxonomy of techniques for camera operation recognition.

require reconstruction of DC terms of P or P&B frames, or sacrifice temporal resolution and classification accuracy; (2) process unrealistically short gradual transitions and are unable to recognize the different types of gradual transitions; (3) involve many adjustable thresholds; (4) do not handle false positives due to camera operations. None of them is able to distinguish gradual transitions from object movement sufficiently well. Some of the ways to achieve further improvement include the use of additional information (e.g. audio features and text captions), integration of different temporal video segmentation techniques and development of

methods that can learn from experience how to adjust their parameters. Camera operation recognition is an important issue related to video segmentation. Fig. 25 presents the taxonomy of camera operation recognition techniques.

This research also confirms the need for benchmark video sequences and unified evaluation criteria that will allow consistent comparison and precise evaluation of the various techniques. The benchmark sequences should contain enough representative data for the possible types of camera operations and shot transitions, including complex gradual transitions (i.e. between sequences involving


motion). The evaluation should take into consideration the type of application, which may indeed require a different trade-off between recall and precision. In the case of gradual transition detection, an important evaluation criterion is the algorithm's ability to determine exactly between which frames the transition occurs and to classify the type of the transition (dissolve, fade, etc.). Other essential issues are the sensitivity to the encoder's type and the ease of implementation. Probably the best way to compare and test the different temporal video segmentation techniques is to build a repository that contains Web-executable versions of the algorithms, as suggested in [11]. It could be done either by providing a Web interface to the algorithms or by implementing them in a platform-independent language (e.g. Java).

    Acknowledgements

    The "rst author was supported by a fellowshipfrom the Consorzio per lo Sviluppo Internazionaledell'Università di Trieste. The work was also sup-ported in part by the European ESPRIT LTRproject 20229 `Noblessea and by the University ofTrieste, grant 60%.

    References

[1] P. Aigrain, P. Joly, The automatic real-time analysis of film editing and transition effects and its applications, Comput. Graphics 18 (1) (1994) 93–103.
[2] A. Akutsu, Y. Tonomura, Video tomography: an efficient method for camerawork extraction and motion analysis, in: Proceedings of ACM Multimedia'94, 1994, pp. 349–356.
[3] A. Akutsu, Y. Tonomura, H. Hashimoto, Y. Ohba, Video indexing using motion vectors, in: Proceedings of SPIE: Visual Communications and Image Processing, Vol. 1818, Boston, 1992, pp. 1522–1530.
[4] G. Ahanger, T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. Visual Commun. Image Representation 7 (1) (1996) 28–43.
[5] F. Arman, A. Hsu, M.-Y. Chiu, Image processing on compressed data for large video databases, in: Proceedings of First ACM International Conference on Multimedia, 1993, pp. 267–272.
[6] J.S. Boreczky, L.A. Rowe, Comparison of video shot boundary detection techniques, in: Proceedings of IS&T/SPIE International Symposium Electronic Imaging, San Jose, 1996.
[7] J.S. Boreczky, L.D. Wilcox, A hidden Markov model framework for video segmentation using audio and image features, in: Proceedings of International Conference on Acoustics, Speech, and Signal Processing, Vol. 6, Seattle, 1998, pp. 3741–3744.
[8] S.-F. Chang, W. Chen, H.J. Meng, H. Sundaram, D. Zhong, VideoQ: an automated content based video search system using visual cues, in: Proceedings of ACM Multimedia Conference, Seattle, 1997.
[9] J. Feng, K.-T. Lo, H. Mehrpour, Scene change detection algorithm for MPEG video sequence, in: Proceedings of International Conference on Image Processing (ICIP'96), Lausanne, 1996.
[10] A.M. Ferman, A.M. Tekalp, Efficient filtering and clustering for temporal video segmentation and visual summarization, J. Visual Commun. Image Representation 9 (4) (1998) 336–351.
[11] U. Gargi, R. Kasturi, S. Antani, Performance characterization and comparison of video indexing algorithms, in: Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 1998.
[12] U. Gargi, S. Oswald, D. Kosiba, S. Devadiga, R. Kasturi, Evaluation of video sequence indexing and hierarchical video indexing, in: Proceedings of SPIE Conference on Storage and Retrieval in Image and Video Databases, 1995, pp. 1522–1530.
[13] B. Günsel, A.M. Ferman, A.M. Tekalp, Temporal video segmentation using unsupervised clustering and semantic object tracking, J. Electron. Imaging 7 (3) (1998) 592–604.
[14] A. Hampapur, R. Jain, T.E. Weymouth, Production model based digital video segmentation, Multimedia Tools Appl. 1 (1) (1995) 9–46.
[15] F. Idris, S. Panchanathan, Review of image and video indexing techniques, J. Visual Commun. Image Representation 8 (2) (1997) 146–166.
[16] ISO/IEC 13818 Draft Int. Standard: Generic Coding of Moving Pictures and Associated Audio, Part 2: video.
[17] R. Kasturi, R. Jain, Dynamic vision, in: R. Kasturi, R. Jain (Eds.), Computer Vision: Principles, IEEE Computer Society Press, Washington DC, 1991, pp. 469–480.
[18] T. Kikukawa, S. Kawafuchi, Development of an automatic summary editing system for the audio-visual resources, Trans. Electron. Inform. J75-A (1992) 204–212.
[19] T. Kohonen, The self-organizing map, Proc. IEEE 78 (9) (1990) 1464–1480.
[20] I. Koprinska, S. Carrato, Camera operation detection in MPEG video data by means of neural networks, in: Proceedings of COST 254 Workshop, Toulouse, 1997, pp. 300–304.
[21] I. Koprinska, S. Carrato, Detecting and classifying video shot boundaries in MPEG compressed sequences, in: Proceedings of IX European Signal Processing Conference (EUSIPCO), Rhodes, 1998, pp. 1729–1732.
[22] C.-M. Lee, D. M.-C. Ip, A robust approach for camera break detection in color video sequences, in: Proceedings of IAPR Workshop on Machine Vision Applications, Kawasaki, Japan, 1994, pp. 502–505.
[23] J. Maeda, Method for extracting camera operations to describe sub-scenes in video sequences, in: Proceedings of IS&T/SPIE Conference on Digital Video Compression, San Jose, Vol. 2187, 1994, pp. 56–67.
[24] J. Meng, Y. Juan, S.-F. Chang, Scene change detection in a MPEG compressed video sequence, in: Proceedings of IS&T/SPIE International Symposium on Electronic Imaging, San Jose, Vol. 2417, 1995, pp. 14–25.
[25] A. Nagasaka, Y. Tanaka, Automatic video indexing and full-video search for object appearances, in: E. Knuth, L.M. Wegner (Eds.), Visual Database Systems II, Elsevier, Amsterdam, 1995, pp. 113–127.
[26] W. Niblack, X. Zhu, J.L. Hafner, T. Breuer, D.B. Ponceleon, D. Petkovic, M.D. Flickner, E. Upfal, S.I. Nin, S. Sull, B.E. Dom, B.-L. Yeo, S. Srinivasan, D. Zivkovic, M. Penner, Updates to the QBIC system, in: Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases VI, Vol. 3312, 1997, pp. 150–161.
[27] T.N. Pappas, An adaptive clustering algorithm for image segmentation, IEEE Trans. Signal Process. 40 (1992) 901–914.
[28] G. Pass, R. Zabih, Comparing images using joint histograms, Multimedia Systems (1999) in press.
[29] N.V. Patel, I.K. Sethi, Video shot detection and characterization for video databases, Pattern Recognition 30 (1997) 583–592.
[30] I.K. Sethi, N.V. Patel, A statistical approach to scene change detection, in: Proceedings of IS&T/SPIE Conference on Storage and Retrieval for Image and Video Databases III, San Jose, Vol. 2420, 1995, pp. 2–11.
[31] I.K. Sethi, G.P.R. Sarvarayuda, Hierarchical classifier design using mutual information, IEEE Trans. Pattern Anal. Mach. Intell. (1982) 441–445.
[32] B. Shahraray, Scene change detection and content-based sampling of video sequences, in: Proceedings of IS&T/SPIE, Vol. 2419, 1995, pp. 2–13.
[33] K. Shen, E. Delp, A fast algorithm for video parsing using MPEG compressed sequences, in: Proceedings of International Conference on Image Processing (ICIP'96), Lausanne, 1996.
[34] M. Smith, T. Kanade, Video skimming and characterization through the combination of image and language understanding, in: Proceedings of International Workshop on Content-Based Access of Image and Video Databases (ICCV'98), Bombay, 1998.
[35] M.J. Swain, Interactive indexing into image databases, in: Proceedings of SPIE Conference on Storage and Retrieval in Image and Video Databases, 1993, pp. 173–187.
[36] D. Swanberg, C.-F. Shu, R. Jain, Knowledge guided parsing in video databases, in: Proceedings of SPIE Conference, Vol. 1908, 1993, pp. 13–24.
[37] C. Taskiran, E. Delp, Video scene change detection using the generalized sequence trace, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, 1998, pp. 2961–2964.
[38] L. Teodosio, W. Bender, Salient video stills: content and context preserved, in: Proceedings of ACM Multimedia Conference, Anaheim, CA, 1993, pp. 39–46.
[39] Y. Tonomura, Video handling based on structured information for hypermedia systems, in: Proceedings of ACM International Conference on Multimedia Information Systems, 1991, pp. 333–344.
[40] W. Xiong, J.C.-M. Lee, Efficient scene change detection and camera motion annotation for video classification, Comput. Vision Image Understanding 71 (2) (1998) 166–181.
[41] W. Xiong, J.C.-M. Lee, M.C. Ip, Net comparison: a fast and effective method for classifying image sequences, in: Proceedings of SPIE Conference on Storage and Retrieval for Image and Video Databases III, San Jose, CA, Vol. 2420, 1995, pp. 318–328.
[42] B. Yeo, B. Liu, Rapid scene analysis on compressed video, IEEE Trans. Circuits Systems Video Technol. 5 (6) (1995) 533–544.
[43] H. Yu, G. Bozdagi, S. Harrington, Feature-based hierarchical video segmentation, in: Proceedings of International Conference on Image Processing (ICIP'97), Santa Barbara, 1997, pp. 498–501.
[44] R. Zabih, J. Miller, K. Mai, A feature-based algorithm for detecting and classifying production effects, Multimedia Systems 7 (1999) 119–128.
[45] H.J. Zhang, A. Kankanhalli, S.W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems 1 (1) (1993) 10–28.
[46] H.J. Zhang, C.Y. Low, Y.H. Gong, S.W. Smoliar, Video parsing using compressed data, in: Proceedings of SPIE Conference on Image and Video Processing II, 1994, pp. 142–149.
[47] H.J. Zhang, C.Y. Low, S.W. Smoliar, Video parsing and browsing using compressed data, Multimedia Tools Appl. 1 (1995) 89–111.
