IEEE SIGNAL PROCESSING LETTERS, VOL. 15, 2008

Scene Segmentation and Semantic Representation for High-Level Retrieval

Songhao Zhu and Yuncai Liu, Member, IEEE

Abstract—In this letter, a novel framework to segment video scenes and represent scene content is proposed. Firstly, video shots are detected using a rough-to-fine algorithm. Secondly, key frames are selected adaptively, and redundant key frames are removed using template matching. Then, spatio-temporally coherent shots are clustered into the same scene. Finally, based on a full analysis of the typical characteristics of continuously recorded videos, video scene content is semantically represented to satisfy human demands in video retrieval. Experimental results show that the proposed method enables efficient retrieval of video content of interest.

Index Terms—Semantic representation, video content analysis, video segmentation.

I. INTRODUCTION

With the rapid explosion of video data, efficient technologies for organizing, browsing, and retrieving them are urgently required by common users. Ideally, these video data should be indexed by semantic descriptions so that traditional information retrieval techniques may be adopted for retrieving video content of interest. However, manual segmentation and annotation is too expensive to be feasible. Therefore, automatic segmentation of video scenes and semantic representation of scene content might be a promising solution.

Many methods have been developed to partition video scenes. Yeung [1] proposed to segment video using a scene transition graph. Hanjalic [2] detected approximate scenes by finding logical story units. Zhao [3] employed the normalized cuts method to determine optimal scene boundary points. Tavanapong [4] extracted local visual feature vectors to perform scene detection.

From the view of methodology, the above algorithms can be classified as graph-theory based or editing-technique based. Next, we introduce another class: statistical learning-based approaches. Zhu [5] utilized support vector machines to parse video scenes. Zhai [6] exploited a Markov chain Monte Carlo algorithm to determine scene boundaries.

Besides the above three categories, there exist multi-feature-based methods. In [7], audio-visual features were exploited to implement movie scene partition. Adams [8] proposed to segment videos by measuring the movie tempo.

Semantic representation of the scene content is important for video retrieval. Zhao et al. [3] defined three scene types: a parallel scene with interacting events, a parallel scene with simultaneous serial events, and a serial scene. Tavanapong et al. [4] labeled another three scene classes: a serial-event scene, a parallel-event scene, and a traveling scene.

Manuscript received May 26, 2008; revised June 19, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Konstantinos N. Plataniotis.

The authors are with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/LSP.2008.2002718

In this letter, a novel scene segmentation and semantic representation framework is proposed for the efficient retrieval of video data using a large number of low-level and high-level features. First, to depict the properties of different shots well, both color and texture information are chosen as the features for shot detection and key frame selection. Second, redundant frames in the set of key frames are removed to accurately segment video content into different scenes. Then, a reasonable temporal constraint and visual activity similarity are used to analyze the coherence of shots within the same scene. Finally, to retrieve video content of interest in terms of human language, scene content is represented at the semantic level.

The rest of this letter is organized as follows. Section II details the proposed scheme. Section III presents the experimental results. Finally, Section IV concludes our work.

II. PROPOSED SCHEME

A. Shot Detection

Let the video clip be given as a sequence of images. The detection process is composed of two phases: a coarse and a fine searching phase.

In the coarse searching phase, rough positions of all possible shot boundaries are obtained using the following steps.

1) Pre-disposal. The original clip is first sampled at a fixed frame interval (set empirically in the experiment). Then the image space is divided into equal-sized, non-overlapping blocks of 8 × 8 pixels.

2) Inter-image correlation measurement. For each block of each sampled image, the variance of the intensity information and the variances of the approximation coefficients in three Haar wavelet sub-bands are computed and together used as histogram bin values. Then, the inter-image correlation between two consecutive sampled images is defined as the histogram intersection (a code sketch of this computation is given after this list)

(1)

where the integrated coefficients of each sampled image are given by

(2)

3) Shot boundary choice. The shot boundary threshold is set to a value that captures the significant difference between continuous images. Namely, a sampled image can be considered a coarse shot transition point in the sampling space if



Fig. 1. (a) Coarse searching process in the original sequence space. (b) Fine searching process in the sampling space. The circles in (a) show the precise transition locations. A pair of blue and green circles in (b) indicates the appearance of a rough shot boundary point.

(3)
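To make the coarse phase concrete, the following sketch (NumPy only) builds, for each sampled frame, a feature histogram from per-block intensity variances and the variances of the block's Haar sub-bands, and then thresholds the histogram intersection of consecutive sampled frames. The block size, sampling step `step`, the choice of sub-bands, and the threshold `tau` are illustrative assumptions rather than the exact settings of the letter.

```python
import numpy as np

def haar2d(block):
    """Single-level 2-D Haar transform of a square block with an even side."""
    a = (block[0::2, :] + block[1::2, :]) / 2.0   # row averages
    d = (block[0::2, :] - block[1::2, :]) / 2.0   # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0          # approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0          # horizontal detail
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0          # vertical detail
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0          # diagonal detail
    return ll, lh, hl, hh

def frame_histogram(gray, block=8):
    """Per-block intensity variance plus variances of three Haar sub-bands,
    concatenated into one histogram (the 'integrated coefficients')."""
    h, w = gray.shape
    bins = []
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            b = gray[y:y + block, x:x + block].astype(np.float64)
            _, lh, hl, hh = haar2d(b)
            bins.extend([b.var(), lh.var(), hl.var(), hh.var()])
    v = np.asarray(bins)
    return v / (v.sum() + 1e-12)                  # normalize so intersection lies in [0, 1]

def histogram_intersection(h1, h2):
    return float(np.minimum(h1, h2).sum())

def coarse_boundaries(frames, step=10, tau=0.6):
    """Indices (in the sampled space) where consecutive sampled frames differ strongly."""
    sampled = frames[::step]
    hists = [frame_histogram(f) for f in sampled]
    return [i for i in range(1, len(hists))
            if histogram_intersection(hists[i - 1], hists[i]) < tau]
```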

In the fine searching phase, the precise location of the shot transition is achieved using the following processes.

1) For a coarsely detected transition position in the sampling space, a sub-sequence centered at that position is extracted from the original clip.

2) The image with the local minimum of inter-image correlation in this sub-sequence is chosen as the precise transition location.

Fig. 1 plots the coarse and fine searching processes, respectively, for an interview scene. From Fig. 1, the relation between the rough and the precise shot transition positions is clear.
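A possible refinement step, reusing `frame_histogram` and `histogram_intersection` from the previous sketch: around each coarse boundary, every original frame in a small window is re-examined and the position with the lowest inter-image correlation is taken as the precise transition. The window of plus or minus `step` frames is an assumption, since the letter does not give the exact sub-sequence length.

```python
def refine_boundary(frames, coarse_idx, step=10):
    """Map a coarse boundary (index in the sampled space) to a precise frame index
    by locating the minimum inter-image correlation inside the original sub-sequence."""
    center = coarse_idx * step
    lo = max(1, center - step)
    hi = min(len(frames) - 1, center + step)
    hists = {i: frame_histogram(frames[i]) for i in range(lo - 1, hi + 1)}
    corr = [(histogram_intersection(hists[i - 1], hists[i]), i) for i in range(lo, hi + 1)]
    return min(corr)[1]   # frame index with the lowest correlation
```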

B. Key Frame Extraction

Representing the content of a video shot concisely is a necessary step for various video processing tasks. Our scheme uses an approach similar to that in [2] to select key frames.

C. Removing Redundant Frames

To accurately segment video data into different scenes and semantically represent scene contents, key frames without useful information (such as rainbow frames, or frames with a black or gray background) should be recognized and removed. We use a template matching approach to address this problem. Specifically, the descriptive information of a frame with a black or gray background is the mean and standard deviation of its color information. For a rainbow frame, a normalized RGB histogram is used to describe its character. If the difference between a query frame and the template frame is less than a given value, the query frame is considered redundant and removed.
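A minimal sketch of this pruning step, assuming the templates are a near-uniform dark/gray frame (characterized by mean and standard deviation) and a stored color-bar histogram; the tolerance values are illustrative, not the thresholds used in the letter.

```python
import numpy as np

def is_blank_frame(rgb, mean_tol=20.0, std_tol=10.0):
    """Flag (near-)uniform black or gray frames: low standard deviation and a
    mean close to a dark or mid-gray template value (thresholds assumed)."""
    gray = rgb.astype(np.float64).mean(axis=2)
    return gray.std() < std_tol and (gray.mean() < mean_tol or abs(gray.mean() - 128) < mean_tol)

def rgb_histogram(rgb, bins=16):
    """Normalized RGB histogram used to describe a color-bar ('rainbow') frame."""
    hist = [np.histogram(rgb[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(hist).astype(np.float64)
    return h / h.sum()

def is_rainbow_frame(rgb, template_hist, diff_tol=0.2):
    """Match the query frame's histogram against a stored color-bar template."""
    return np.abs(rgb_histogram(rgb) - template_hist).sum() < diff_tol

def prune_key_frames(key_frames, rainbow_template):
    return [f for f in key_frames
            if not is_blank_frame(f) and not is_rainbow_frame(f, rainbow_template)]
```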

D. Scene Segmentation

A scene is a collection of shots which present the same theme. From the view of coherence, shots within a scene share similar coherence in time and space. We exploit these two facts to semantically cluster related shots into the same scene.

1) Temporal Constraint: To avoid under-segmentation or over-segmentation, we introduce the concept of a temporal constraint, i.e., the maximum total number of shots a scene can contain according to the characteristics of continuously recorded video. Specifically, if the temporal interval between two shots exceeds this constraint, they may not be grouped into the same scene even if their visual contents are similar.

2) Spatio-Temporal Correlation Clustering: Since shots within the same scene share coherence in spatial correlation, similar shots are clustered into the same scene using spatial correlation while taking the temporal constraint into account. Different scenes are obtained by the following passes, which partially overlap with [4].

1) The similarity between two key frames from different shots is measured by the Euclidean distance

(4)

where the two arguments are the integrated coefficients of the two key frames, respectively.

2) The dissimilarity between one shot with its key frames and another shot with its key frames is measured as the average distance across all pairs of key frames from the two shots

(5)

3) Within one time window, compute the shot similarity between all pairs of shots

(6)

4) For the current shot Sc within one temporal constraint window, find all subsequent shots sharing similar visual content

(7)

where a threshold is used to detect the scene boundary. Furthermore, the matched shot is chosen as the current shot for subsequent processing.

5) For the shots lying in between, check whether there exist further visually similar shots satisfying

(8)

Furthermore, the matched shot is also chosen as the current shot for subsequent processing.

6) Repeat steps 4 and/or 5 until the end of the temporal constraint window. If the iterative process stops at step 4, the current shot is the current scene boundary and the following shot is the beginning shot for the next scene; if the iterative process stops at step 5, the corresponding shot is the current scene boundary and its successor is the beginning shot for the next scene. (A sketch of this clustering loop is given below.)
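The following greedy sketch mirrors steps 4 to 6: shot dissimilarity is the mean pairwise Euclidean distance between key-frame feature vectors (cf. (4) and (5)), and the current shot keeps absorbing later shots that remain similar within the temporal window. The window size `window` and the threshold `tau_scene` are assumed values, not those reported in the letter.

```python
import numpy as np

def shot_dissimilarity(kfs_a, kfs_b):
    """Average Euclidean distance between the integrated coefficients of all
    key-frame pairs drawn from two shots."""
    return float(np.mean([np.linalg.norm(a - b) for a in kfs_a for b in kfs_b]))

def segment_scenes(shots, window=8, tau_scene=0.5):
    """Greedy spatio-temporal clustering. `shots` is a list whose i-th entry is
    the list of key-frame feature vectors of shot i. Returns the indices of the
    shots that close each scene."""
    boundaries, current, n = [], 0, len(shots)
    while current < n - 1:
        last_similar = current
        for j in range(current + 1, min(current + 1 + window, n)):
            if shot_dissimilarity(shots[current], shots[j]) < tau_scene:
                last_similar = j              # steps 4/5: extend the scene
        if last_similar == current:           # step 6: no similar shot found,
            boundaries.append(current)        # the current shot closes the scene
            current += 1
        else:
            current = last_similar            # the matched shot becomes the current shot
    boundaries.append(n - 1)
    return boundaries
```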

Fig. 2 shows an example of the scene boundary detection procedure for one testing video. Firstly, an initial shot is chosen as the current shot. For this shot, the visually similar subsequent shots are found using inequality (7). Secondly, the last of these similar shots becomes the current shot. This time, further visually similar shots are found that make inequality (8) true. Then, the last of those is chosen as the current shot according to the idea of scene segmentation. Since no remaining shots satisfy inequality (8), that shot is the last shot of scene X. At this point, the boundary detection procedure for scene X has finished. That is, scene X is composed of all of the shots from the initial shot to that last shot.

Fig. 2. Chart of the process of shot clustering.

E. Semantic Representation of Scene Content

To retrieve video data on the Internet in terms of human knowledge, it is necessary to represent scene content at the semantic level. Based on the editing techniques used in continuously recorded video [9], scene content can be classified into one of the following typical genres: a conversational scene, a suspense scene, or an action scene; their respective characteristics are described below.

1) Conversational scene: faces with similar spatial position and similar size, a sequence of shots with low activity intensity, and strong similarity between shots.

2) Suspense scene: a sequence of shots with a low average intensity distribution, a long period of low audio energy, and low activity intensity followed by a sudden change in the sound track, in the activity intensity, or in both.

3) Action scene: a sequence of shots with short duration and intensive activity intensity or intensive audio energy.

1) Average Intensity Distribution: Reference [9] indicated that "the amount and distribution of light in relation to shadow and darkness and the relative tonal value of the scene is a primary visual means of setting mood." The amount of light in the scene is here described as the average intensity distribution

(9)

where, for each shot in the scene, the average intensity distribution over its key frames and its length in frames are used, and the total number of frames within the scene serves as the normalization.
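The body of (9) is not legible in this copy. Assuming that each shot's key-frame intensity average is weighted by the shot length and normalized by the scene's frame count, a plausible reconstruction is the following, where the symbols are introduced here purely for illustration:

```latex
% Plausible reconstruction of (9), not the letter's original notation.
\overline{\mathrm{AID}} \;=\; \frac{1}{N} \sum_{i \in \mathcal{S}} L_i \, \bar{I}_i
```

Here $\bar{I}_i$ is the average intensity over the key frames of the $i$-th shot, $L_i$ is the length of that shot in frames, $\mathcal{S}$ is the set of shots in the scene, and $N$ is the total number of frames in the scene.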

2) Face Detection: The occurrence of a face is a salient feature in video, as it indicates the presence of a human in the scene. The size of a face is also a hint about the role of the person, i.e., a large face denotes that this person is the center of attention.

As a result of the face feature extraction process, we obtain the position and size of each detected face, as well as the number of hits.

3) Audio Energy: Sound can also be considered an important cue to represent the atmosphere of a scene. For example, the climax of a suspense scene, fighting, explosions, etc., is mostly accompanied by an abrupt eruption in the audio level. Here, the audio energy is formulated as follows:

(10)

where the audio samples are indexed by time interval within each shot. In the experiment, the time interval is set to 20 ms.
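Since the body of (10) is missing here, the following is only a sketch of a short-time energy of the kind described: squared amplitude accumulated over non-overlapping 20 ms windows and averaged across the shot; the exact normalization is an assumption.

```python
import numpy as np

def shot_audio_energy(samples, sample_rate=22050, win_ms=20):
    """Short-time audio energy of one shot: energy per 20 ms window, averaged
    over the shot (assumed form of (10))."""
    win = int(sample_rate * win_ms / 1000)
    n = len(samples) // win
    if n == 0:
        return 0.0
    frames = np.asarray(samples[:n * win], dtype=np.float64).reshape(n, win)
    return float(np.mean(np.sum(frames ** 2, axis=1)))
```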

4) Activity Intensity: The activity intensity indicates the tempo of the video. For example, in a conversational scene the activity intensity is relatively low; in an action scene, on the other hand, it is relatively high. The activity intensity of a scene is

(11)

where the two terms are the local gray-information variance histograms of two consecutive frames, respectively, and the normalization is by the duration of the shot.
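A sketch of one possible reading of (11): accumulate the change of the per-block gray-level variance histogram between consecutive frames and normalize by the shot duration. The block size and the absolute-difference form are assumptions.

```python
import numpy as np

def gray_variance_histogram(gray, block=8):
    """Local gray-level variance per non-overlapping block, collected as one histogram."""
    h, w = gray.shape
    v = [gray[y:y + block, x:x + block].astype(np.float64).var()
         for y in range(0, h - h % block, block)
         for x in range(0, w - w % block, block)]
    return np.asarray(v)

def shot_activity_intensity(gray_frames):
    """Accumulated change of the local variance histogram between consecutive
    frames, normalized by the shot duration (assumed form of (11))."""
    hists = [gray_variance_histogram(f) for f in gray_frames]
    diffs = [np.abs(hists[i] - hists[i - 1]).sum() for i in range(1, len(hists))]
    return float(np.sum(diffs) / max(len(gray_frames), 1))
```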

5) Representing Video Scene Semantically: In this subsection, the conditions for representing each of the three categories of scene content are described below using the knowledge discussed in the above subsections.

1) A conversation scene can be declared when the following conditions are simultaneously satisfied. On the one hand, a shot of talker A and a shot of talker B occur alternately. On the other hand, the audio energy of such shots is lower than 10.

2) The following two conditions are used to determine an action scene. One is that the length of the sequence of shots is less than 25. The other is that the activity intensity of these shots is larger than 200 and/or the audio energy of these shots is larger than 100.

3) For a suspense scene, several criteria should hold. In the first place, for continuous shots, the ratio of the average intensity distribution from 0 to 50 to the whole distribution region between 0 and 255 is more than 90%. In the next place, the audio energy and the activity intensity of the first few shots are both close to 0. Then the change of the audio energy of the following shots is more than 40, and/or the change of the activity intensity of the following shots is more than 90. (A rule-based sketch combining these three criteria follows.)
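The sketch below combines the three rules with the thresholds quoted above. The alternating-talker evidence is reduced to a boolean flag and the dark-intensity ratio to a precomputed value, since their extraction is described elsewhere in the letter; both inputs are therefore assumptions of this illustration.

```python
def classify_scene(shots, faces_alternate, dark_ratio):
    """Rule-based scene typing following Section II-E. Each element of `shots`
    is a dict with 'audio_energy' and 'activity_intensity'; `faces_alternate`
    (two talkers appearing in turn) and `dark_ratio` (share of intensity mass
    in [0, 50]) are assumed to be computed elsewhere."""
    if not shots:
        return 'other'
    energies = [s['audio_energy'] for s in shots]
    activities = [s['activity_intensity'] for s in shots]

    # 1) Conversation: alternating talkers and uniformly low audio energy.
    if faces_alternate and all(e < 10 for e in energies):
        return 'conversation'

    # 2) Action: a short run of shots with high activity and/or loud audio.
    if len(shots) < 25 and (max(activities) > 200 or max(energies) > 100):
        return 'action'

    # 3) Suspense: dark, quiet opening followed by a jump in audio or activity.
    quiet_start = energies[0] < 1 and activities[0] < 1
    jump = any(energies[i] - energies[i - 1] > 40 or
               activities[i] - activities[i - 1] > 90
               for i in range(1, len(shots)))
    if dark_ratio > 0.9 and quiet_start and jump:
        return 'suspense'
    return 'other'
```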

III. EXPERIMENTS

A. Experiment Design

Six videos are tested to evaluate the performance of the proposed algorithm: one TV interview video and five full-length movies. Each video track is analyzed at 25 fps at a fixed frame size. The sound track is processed at a sampling rate of 22 kHz with 16-bit precision.

To obtain the ground truth for the testing videos, five graduate students were invited to watch the movies and then give their own scene boundaries. The ground truth used for the experiments is the intersection of their segmentations. Generally speaking, there is no clear boundary between two adjacent scenes in movies due to film editing effects. Therefore, the most commonly used criterion, Hanjalic's evaluation [2], is used here to match the ground truth with the detected boundaries: if a detected scene boundary is within four shots of a manually marked boundary, it is counted as a correct boundary.
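For reference, a small scoring routine under this tolerance might look as follows; boundaries are expressed as shot indices, and the greedy one-to-one matching is an assumption, since the letter only states the four-shot tolerance.

```python
def evaluate_boundaries(detected, ground_truth, tolerance=4):
    """Match detected scene boundaries to manual ones within `tolerance` shots,
    then report precision, recall, and F1."""
    unmatched = list(ground_truth)
    matched = 0
    for d in sorted(detected):
        hits = [g for g in unmatched if abs(g - d) <= tolerance]
        if hits:
            unmatched.remove(min(hits, key=lambda g: abs(g - d)))
            matched += 1
    precision = matched / len(detected) if detected else 0.0
    recall = matched / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```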

B. Experiment I: Experiment on Scene Segmentation

Table I summarizes the length of the testing videos (in minutes), the ground truth scenes (GTS), and the results of scene segmentation (DS) achieved using our proposed algorithm. Furthermore, the total numbers of matched scenes (MS), false positive scenes (FPS), and false negative scenes (FNS) for each testing video are listed to sufficiently evaluate the algorithm performance.

TABLE I
DETAIL OF THE EXPERIMENTAL RESULTS

TABLE II
COMPARISON WITH NORMALIZED CUTS METHOD

We compare the scene segmentation performance of the proposed algorithm with that of the Normalized Cuts method [3], as shown in Table II. From Table II, it can be observed that the average F1-scores of the proposed approach and the Normalized Cuts method are 0.818 and 0.712, respectively. The proposed approach gains a large improvement in all the evaluation measures because the temporal constraint is integrated into the spatial coherence clustering during scene segmentation.

Fig. 3. Six testing clips with the key frames of the shots in the scene, where each shot is represented by a representative frame.

C. Experiment II: Experiment on Semantic Representation

We test semantic representation performance on two movies, "X Man III (X M.III)" and "Mission Impossible III (M.I.III)", because these two movies contain all three of the scene categories discussed above: conversation scenes, suspense scenes, and action scenes. Some examples are shown in Fig. 3.

TABLE III
RESULTS OF SEMANTIC REPRESENTATION

The performance of the semantic representation of scene content is summarized in Table III. It shows that the average precision and recall for action scenes are 0.845 and 0.822, respectively, which clearly demonstrates that the proposed scheme can detect and classify video scenes into categories. That is, the selected features for describing scene content and the way of deciding the scene type are satisfactory.

IV. CONCLUSION

In this letter, we have proposed a framework for scene segmentation and semantic representation for different video types. Firstly, shot boundaries are detected by a rough searching pass and a fine searching pass. Secondly, key frames are selected adaptively using combined color and texture information. To improve the accuracy of scene segmentation and scene classification, redundant key frames are recognized by template matching. Then, similar shots are clustered into the same scene through analysis of their spatio-temporal coherence. Finally, to retrieve video content of interest, scene content is classified into the three most common scene types. Extensive experiments are conducted on different movie types and one TV program. The promising results demonstrate the good performance of the proposed approach.

Further work includes defining more scene types and testing more video types. Furthermore, presenting a higher-level description of video is another future research direction.

REFERENCES

[1] M. Yeung and B.-L. Yeo, “Segmentation of video by clustering and graph analysis,” IJCVIU, vol. 71, no. 1, pp. 94–109, 1998.

[2] A. Hanjalic, R. L. Lagendijk, and J. Biemond, “Automated high-level movie segmentation for advanced video-retrieval systems,” IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 4, pp. 580–588, Jun. 1999.

[3] Y. J. Zhao, T. Wang, et al., “Scene segmentation and categorization using NCuts,” in Proc. IEEE CVPR, 2007, pp. 343–348.

[4] W. Tavanapong and J. Zhou, “Shot clustering techniques for story browsing,” IEEE Trans. Multimedia, vol. 6, no. 4, pp. 517–526, Aug. 2004.

[5] Y. Zhu and Z. Ming, “SVM-based video scene classification and segmentation,” in Proc. ICMUE, 2008, pp. 407–412.

[6] Y. Zhai and M. Shah, “Video scene segmentation using Markov chain Monte Carlo,” IEEE Trans. Multimedia, vol. 8, no. 4, pp. 686–697, Aug. 2006.

[7] M. Kyperountas, C. Kotropoulos, and I. Pitas, “Enhanced eigen-audioframes for audiovisual scene change detection,” IEEE Trans. Multimedia, vol. 9, no. 4, pp. 785–797, Jun. 2007.

[8] B. Adams, C. Dorai, and S. Venkatesh, “Towards automatic extraction of expressive elements from motion pictures: Tempo,” in Proc. ICIP, 2000, pp. 641–644.

[9] A. Aner and J. R. Kender, “Video summaries through mosaic-based shot and scene clustering,” in Proc. ECCV, 2002, pp. 388–402.

[10] Y. S. N. Li and C.-C. Jay Kuo, “Video mining,” in Movie Content Analysis, Indexing, and Skimming. Norwell, MA: Kluwer, 2003, ch. 5.