Automatic creation and evaluation of MPEG-7 compliant summary descriptions for generic audiovisual content

Nuno Matos, Fernando Pereira
Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
Signal Processing: Image Communication 23 (2008) 581-598

Article history: Received 21 October 2007; received in revised form 6 May 2008; accepted 22 May 2008

Keywords: Automatic audiovisual summarization; Generic content; MPEG-7 summary description; Arousal modeling; Motion intensity; Shot cut density; Sound energy

Abstract

Today's world is characterized by the increasing availability of audiovisual content, created in many application domains and, thus, with rather different semantic characteristics. Audiovisual content is easily acquired, produced, processed, coded, stored and distributed, progressively making more important the ability for users to selectively consume the large amounts of content available. Automatic audiovisual summarization is a technology playing a major role in facilitating the user's effective consumption of large amounts of audiovisual data in a reduced amount of time, as time is getting more and more precious and scarce. This paper proposes an automatic summarization system for generic audiovisual content based on MPEG-7 compliant summary descriptions. To evaluate the quality of the created summaries, a user evaluation methodology was designed and applied with promising results, showing that the developed application is able to successfully summarize generic audiovisual content, especially high action content. The main novelty of this paper lies in the usage of the MPEG-7 summarization tools in combination with an arousal-based audiovisual summarization model, and in the novel user evaluation methodology and study.

1. Introduction

With the recent explosion of multimedia content availability, the selective and effective consumption of audiovisual content has become increasingly important. Audiovisual content is no longer available only through the television, as was the case for many decades, but it is also accessible through an endless number of systems, notably the Internet and mobile networks, among others. All of these systems store or stream audiovisual content using coded formats, still often rather large in size (in terms of the number of bits) and duration, which implies there are usually massive amounts of content available to the users. One of the novel distinctive features of audiovisual content is that there is no need for any specialized skills to acquire, produce, process, store, distribute or visualize audiovisual content. Most people are technically savvy enough to make videos of their vacations or special moments with their personal camcorders and store them in their personal computers, or use them on a daily basis on their VoD and Personal Video Recorder (PVR) systems; moreover, they can use the Internet to make their own videos available to the world, as there are a lot of people ready to browse the videos of others. One prominent example in recent years is the YouTube boom, with millions of new video additions per day. YouTube represents today almost 3% of all daily page views across the Internet, up from near 0% at the beginning of 2006 [1].
However, the huge amount of available audiovisual data is also a problem as each user's viewing time is limited and, thus, it is not only important to quickly find the audiovisual material one is looking for but also, many times, to filter from that material the relevant and exciting parts, especially if the content is long. For example, it would be very useful to be able to automatically produce a video summary of a vacation or holiday to distribute among family and friends, or movie trailers and teasers for a VoD service, or even the highlights of a football match or a Formula 1 race to include in the TV news. These examples highlight some critical usages of automatic audiovisual summarization, justifying the development of tools capable of successfully and automatically identifying and filtering the most exciting moments from a content asset and including them in a summary. Automatically summarizing audiovisual content may allow the user to spend much less time in viewing and browsing tasks, as summaries are smaller files, both in size and duration; therefore, they take less time to become available and also to be consumed by the users, who can also infer the relevance of the entire content, deciding afterwards if they wish to view more. The context above motivates the strong need for automatic audiovisual summarization tools; this need is rather wide across many application domains and, thus, content types.

According to Taskiran et al. [17], "The goal of video summarization is to process video programs, which usually contain semantic redundancy, and make them more interesting or useful for users. The properties of a video summary depends on the application domain, the characteristics of the sequences to be summarized, and the purpose of the summary". Many times, summarization has to go beyond semantic redundancy reduction as it may be necessary to filter the more relevant segments from those less relevant, even if not redundant, e.g. to produce a summary with the required duration. Currently, many approaches to the audiovisual summarization problem have been studied and proposed. A main distinction among them is their application scope, notably whether they can address generic content or whether they are content specific. Although important, the applicability of solutions addressing only specific content, for example football or basketball matches, is obviously limited. Another important distinction is the level of abstraction at which they operate; while some solutions work at a low level, dividing the content into segments and assigning each one a score based on some features, other solutions work at a higher level, detecting concepts and even semantic relationships. Finally, it is important to acknowledge that emotions and excitement play a central role for very important types of content, such as sports and movies.

In this context, this paper targets the development of an audiovisual summarization system for generic audiovisual content. The main objectives of this paper are, thus, to design, implement and evaluate an application capable of automatically creating different types of summaries for generic audiovisual content. These summaries should contain the most interesting and exciting events occurring in the content at hand, defined in a simple and intuitive way for the user. To do so, the design of the summarization application is based on modeling the excitement, also known as arousal, experienced by the viewer of the content; the summary is then created with the segments that provoke more excitement in the viewer. This arousal modeling approach allows any content to be summarized regardless of its type, origin, etc., assuring, in this manner, the generality of the application and, thus, a wide applicability. While the proposed model does not explicitly consider any semantics and thus does not address specific events, being thus generic, it adopts a low-level arousal model which works better for high action content such as sports and action movies. For more uniformly informative content [17] or content where important events are less associated with action, e.g. home videos and documentaries, the summarization performance may be lower.

Moreover, to ensure that the modeling task does not have to be performed for every summary created, the system produces an MPEG-7 compliant hierarchical summary description [10] which, once available, permits the generation of many different summaries as needed, notably based on length and relevance criteria. These summary descriptions also provide some degree of interoperability because they can be used in all players that are compliant with MPEG-7 summary descriptions. While the arousal modeling solution used in this paper is largely based on available literature, this paper is novel in the way the arousal model is used to create MPEG-7 compliant summary descriptions, which permits the subsequent creation of summaries according to the user's needs. Moreover, this paper designs and applies a solid performance evaluation of the summarization system developed.

This paper is organized into seven sections, including Section 1, which serves as an introduction. Section 2 reviews the literature on automatic audiovisual summarization, including a brief description of some relevant systems, and proposes some major classification dimensions for audiovisual summarization solutions, depending on the adopted technical approach. Section 3 introduces the solution developed in this paper, by presenting its architecture and a functional description of each of its modules. Section 4 presents a description of the processing algorithms in order to allow the reader to get a complete understanding of the entire modeling process proposed for the summarization. Section 5 is dedicated to a short description of the application's graphical user interface (GUI), while Section 6 presents and analyses the results of the subjective assessment designed and carried out to evaluate the performance of the proposed summarization solution. Section 7 summarizes the authors' conclusions and future work.

2. Background and classification

As for the majority of multimedia problems, the various ways to address the audiovisual summarization problem can be organized, clustered and classified depending on the technical approach, concepts and tools used. Based on the literature review made for the purpose of understanding and structuring the problem at hand—automatic audiovisual summarization—some classification dimensions emerged as more relevant.


While there is no single good classification approach, having some appropriate organization for audiovisual summarization solutions helps in understanding their relationships, notably similarities and differences between available and emerging solutions. In this context, the main dimensions proposed to organize and classify the technologies and solutions for automatic audiovisual summarization are:

1. Generic versus specific content solutions—the main difference between these two families of solutions, generic and specific, lies in the target type of audiovisual content. The generic approach is designed to have the ability to produce summaries for any type of audiovisual content, without explicitly modeling specific events, while the specific content approach addresses some explicit (and limited) type of content or events, e.g. news programs or football. Both for generic and specific solutions, there are approaches based on the selection of the most distinctive segments from the video, using predefined scores, thus eliminating similar or redundant segments. On the contrary, semantic modeling and filtering is more appropriate for specific solutions where the relevant set of concepts can be defined.

2. Affective-based versus non-affective-based solutions—both in the context of generic and specific content solutions, it is important to distinguish between affective and non-affective solutions since clearly affect, emotions and excitement play a key role in many audiovisual summaries. Audiovisual summarization is one of many possible applications of affective content analysis. These solutions are characterized by using as filtering criterion the presence in the viewer of a certain amount or type of feelings or emotions, or of a certain amount of attention or excitement. Therefore, affective audiovisual summarization is performed after a process of affect or attention modeling. One of the main appeals of affective summarization is its potential to create summaries for any type of audiovisual content.

3. Low-level versus high-level features based solutions—low-level features based summarization solutions are based on low-level information, and derived scores, automatically extracted from the audiovisual segments. The main difference among the various low-level features based models is the features selected and the usage made of the extracted low-level information. On the contrary, high-level features based solutions are mainly characterized by their usage of event modeling and concept detection tools, which means they mainly target specific application domains. In a generic content context, it is difficult to build an effective audiovisual summarization solution based only on high-level features since not all relevant events and concepts may be known in advance. High-level features based solutions are far more common in domain-specific solutions, based on semantic event and conceptual modeling, where the list of relevant events and concepts is typically predetermined. Combined low-level and high-level solutions also exist, sometimes gathering the best of the two worlds. Both the low and high-level based solutions may be monomodal or multimodal, depending on their usage of a single medium, e.g. only video, or several media, e.g. audio and video.

While there may be other ways to classify and organize summarization technologies, very likely as good as the one proposed here, what is most important, thus motivating this proposal, is the capability to get a view of this technical field, presented in a structured and organized way, and not just as a simple list of solutions. In the next sections, a generic, affective, multimodal and low-level features based summarization system will be proposed and evaluated.

From the many summarization systems available and reviewed, some have been considered more representative and, thus, will be briefly reviewed in the following. The systems are strongly dependent on the summary purposes they might serve; in Ref. [17], a list of summarization purposes is proposed, from 'intriguing the viewer to watch the whole video' to the more usual 'letting the viewer decide if the complete program is worth watching' or 'giving the viewer all the important information contained in the video'. Clearly, low-level, affective and user attention modeling, generic approaches and high-level, concept based, specific video understanding approaches are two of the major classes of summarization systems available in the literature.

2.1. Low-level, affective and user attention modeling, generic approaches

In Ref. [7], Jaimes et al. propose a framework for generating personal video summaries based on some extracted features. This system is intended to summarize only football matches and it is based on the extraction of high-level features from available events metadata; these features are subsequently used in a supervised learning context. After a training phase, where a user selects his/her personal highlights from a set of training videos, features are extracted from the training set. Those features are then used by a classifier that chooses, using the metadata from a new test video, which segments will be included in the digest, according to the user's preferences. A second interesting system, still addressing only football content, has been developed by Ekin et al. [3]. This system is based on both low-level and high-level features: low-level features are used in cinematic feature extraction algorithms while high-level features are used in the detection of goals, the referee and penalty-box events.

In Refs. [4–6], a system developed by Hanjalic's team is presented which focuses on the semantic summarization of multimedia content, mainly based on the extraction of moods from video and sounds. This system addresses generic content, making it a powerful approach to the summarization problem as it is able to produce summaries for any kind of audiovisual content.


Considering the scope of this paper, it reveals another very relevant summarization dimension, which is the exploitation of affect, notably through arousal. The generic and affective dimensions of this solution come through the modeling of the concept of arousal: the search for highlights is done by tracing the audiovisual segments where the arousal, "a physiological and psychological state of being awake", experienced by the viewer is expected to be high, instead of modeling each potential highlight event individually as it happens with systems addressing specific content, e.g. football goals. The system designed, implemented and evaluated in this paper uses an arousal model largely based on the arousal model presented in Ref. [6].

The system developed by Ma et al. [9] provides a generic framework for user attention modeling and its application to the problem at hand. This system is an interesting alternative to Hanjalic's solution mentioned above since both are generic content, affective-based solutions. Ma et al. propose a generic user attention model, considering multiple sensory perceptions, e.g. visual and aural stimuli, which estimates the attention viewers may pay to video content. As human attention is an effective and efficient mechanism for information prioritizing and filtering, the authors propose a user attention model based video summarization solution that requires neither full semantic understanding of the video content nor complex heuristic rules, and which has proven to be effective, robust, and generic. The results from the summarization user study performed indicate that the user attention model is a valid alternative to video understanding for audiovisual summarization.

In Ref. [13], Ngo et al. propose a unified approach for video summarization based on the analysis of video structures and video highlights. Two major components of the approach are scene modeling and highlight detection. While scene modeling is achieved using a normalized cut algorithm and temporal graph analysis, highlight detection is achieved by motion attention modeling. In this summarization system, a video is represented as a complete undirected graph and the normalized cut algorithm is carried out to globally and optimally partition the graph into video clusters. The resulting clusters form a directed temporal graph and a shortest path algorithm is proposed to efficiently detect video scenes. The attention values are then computed and attached to the scenes, clusters, shots, and subshots in a temporal graph. As a result, the temporal graph can inherently describe the evolution and perceptual importance of a video. The proposed system may generate video summaries emphasizing both content balance and perceptual quality, directly from a temporal graph that embeds both the structure and attention information.

In Ref. [17], Taskiran et al. propose an automatic multimodal summarization solution using transcripts obtained by automatic speech recognition. The authors divide the full program into segments based on pause detection and derive a score for each segment, based on the frequencies of the words and bigrams it contains. The summary is built by selecting the segments with the highest score to duration ratios while at the same time maximizing the coverage of the summary over the full program. The user studies performed have shown that the proposed algorithm produces more informative summaries than two other rather basic summarization algorithms.

Very recently, Choudary and Liu [2] proposed a specific summarization system for instructional videos of chalkboard presentations, where the visual content refers to the text and figures written on the boards. The authors claim that existing methods for video summarization are not effective for that video content because they are mainly based on rather simple low-level image features such as color and edges. Thus, the authors propose a novel approach based on middle-level features. This approach starts by extracting text and figures from the instructional videos using statistical modeling and clustering; the algorithm is able to deal with image noise, non-uniformity of the board regions, camera movements, occlusions, and other difficulties which are typical of instructional videos that are recorded in real classrooms. Then, using the extracted text and figures as the middle-level features, the system selects a set of key frames containing most of the visual content. Finally, to further reduce the content redundancy, the system builds a mosaic summary image by matching the extracted content based on the Kth Hausdorff distance and connected component decomposition. The authors claim that user studies have shown the system is highly effective in summarizing instructional video content.

2.2. High-level, concept based, specific video understanding approaches

On the semantic, conceptual and content understanding side, there are some early papers, such as Refs. [11,12], where first solutions for video semantics extraction have been proposed. In Ref. [15], the authors propose a challenge for the automated detection of 101 semantic concepts in multimedia. The efficient detection of semantic concepts is important for many multimedia fields such as indexing, retrieval, summarization, and recognition. More recently, some rather powerful solutions have been proposed for semantic detection.

In Ref. [16], Snoek et al. propose the so-called semantic pathfinder architecture for generic indexing of multimedia archives. The semantic pathfinder extracts semantic concepts from video by exploring different paths through three consecutive analysis steps; these steps resulted from the observation that produced video is the result of an authoring-driven process. The three analysis steps are: (i) analysis based on a data-driven approach of indexing semantics; (ii) style analysis to address the indexing problem by viewing a video from the perspective of production; and (iii) context analysis where semantics are viewed in context. The main asset of the semantic pathfinder is its ability to learn the best path of analysis steps on a per-concept basis. The generality of this indexing approach has been shown for a lexicon of 32 concepts; the performance of the system was evaluated against the 2004 TRECVID video retrieval benchmark, using a news archive of 64 h.


A top ranking performance in the semantic concept detection task has shown the merits of the semantic pathfinder for generic indexing of multimedia archives. This type of system may be used for high-level, specific, event-based summarization where events are associated with semantic concepts.

Even more recently, Liu et al. proposed a general post-filtering framework to enhance the robustness and accuracy of semantic concept detection using association and temporal analysis for concept knowledge discovery [8]. Since the co-occurrence of several semantic concepts may imply the presence of other concepts, the authors use association mining techniques to discover such inter-concept association relationships from annotations. Exploiting the discovered concept association rules, the authors propose a strategy to combine associated concept classifiers to improve detection accuracy. Moreover, since video is often visually smooth and semantically coherent, detection results from temporally adjacent shots are used for the detection of the current shot. With this purpose, the authors propose temporal filter designs for inter-shot temporal dependency mining to further improve detection accuracy. The authors claim that experiments on the TRECVID 2005 dataset have shown that the proposed post-filtering framework is both efficient and effective in improving the accuracy of semantic concept detection in video. The proposed tool may also be easily integrated into existing semantic classifiers to improve their performance.

Since performance evaluation plays a major role in the development of multimedia systems, notably summarization solutions, it is essential to mention at this stage the performance evaluation and benchmarking efforts developed in the context of TRECVID [18]. Since 2005, the TRECVID tasks/challenges have been considering video summarization: first indirectly through the rushes task, and, since 2007, more directly through an explicit summarization task. These tasks have been significantly contributing towards the coordinated development and performance evaluation of more powerful summarization solutions.

3. Summarization system architecture

This section presents the architecture of the generic, affective, multimodal and low-level features based summarization system designed, implemented and evaluated. Although personalization and user profiles are important tools to generate user adapted content, the scenario considered in this paper regards a summarization service where rather user-independent arousal-based summaries, driven by common user attention capturing features, may be provided to users at large. However, the proposed system architecture may be easily extended to consider user profiles in the selection of the summary segments; user profiles could be described using the MPEG-7 user preference description tools since these descriptions typically involve critical interoperability issues. Along the same lines, content adaptation has been a major issue addressed in the context of the MPEG-21 standard, notably its Digital Item Adaptation (DIA) part. The MPEG-21 DIA user environments description tools allow describing consumption dimensions such as the user, the network, the terminal, and the natural environment. Again, the proposed system architecture may be easily extended to consider MPEG-21 DIA descriptions, bringing further constraints to the summary segments selection process.

Fig. 1. Architecture of the automatic audiovisual summarization system.

The proposed summarization system architecture is presented in Fig. 1: it shows three core modules, as well as their inputs and outputs. The three core modules in the proposed system architecture are:

• Low-level features extraction—the first step in the summarization process is the extraction of the low-level features necessary to model the arousal for the input audiovisual content, e.g. available in MPEG-1 (coded) format, to be summarized. This module has a fundamental role as it provides the necessary information about the audiovisual content to effectively model the user arousal. Each of the feature extraction processes—one per feature—is performed independently; it is possible to produce summaries based on only one or more of the three selected features (as well as to easily add more features due to the modularity of the architecture). Due to their effectiveness in expressing the viewer's reactions when watching a video, the three low-level features selected are the motion intensity, the density of shot cuts and the sound energy. In fact, an increase in object motion, as well as in camera motion, typically implies an increase in the user arousal. Shot lengths, or their patterning, are also often used by movie directors to impose a desired pace of action. Normally, a higher density of shot cuts, and consequently shorter shot lengths, means action and stressful segments, while longer shot lengths are used to provoke more relaxing and calmer moments for the viewer. A change in shot lengths during a video is likely to cause significant changes in the viewer's arousal, similarly to motion intensity. The third feature chosen for arousal modeling is sound energy. As with motion intensity, the loudness or energy of the audio signal has a direct influence on the emotions that viewers may experience while watching a video. An increase in the sound energy in specific segments of the audiovisual content typically leads to a boost of the audience's arousal. In a football match, for example, when a goal event occurs or when a rough tackle takes place, normally the commentator shouts or starts to talk louder and the audience cheers or boos. In an action movie, gunfire or explosions are related to action sequences and, therefore, to segments where the arousal experienced significantly increases. All this justifies the choice of these three low-level features. The outputs of this first module serve as input for the arousal modeling module which comes next. In the implemented system, the low-level features metadata is stored in XML format so that it may be reused as many times as desired without the need for repeated processing.

• Arousal modeling—the information obtained by extracting the low-level features from the audiovisual content is used to model the user arousal scoring. As in the extraction module, arousal is modeled independently for each feature, producing an arousal scoring curve for each of them. A smoothing filter is subsequently applied in the process, for each of the features, with the objective of transforming the (sometimes) abrupt arousal changes, directly resulting from the feature extraction, into smoother arousal changes more likely to express the viewer's feelings when watching a video (which do not abruptly change from frame to frame). After the smoothing filtering process, scaling is applied to scale the resulting arousal curve to a [0,1] scale; this is fundamental to allow comparisons and combinations of the various arousal curves. When all features' arousals are modeled, a fusion function is applied to integrate them into one single final arousal curve. The final arousal curve illustrates the arousal evolution along the audiovisual content duration. In the fusion process, different weights can be assigned to the various features, producing a different final arousal curve and, therefore, different summaries. For the same reasons as for the individual features, the final arousal curve is also smoothed and scaled. In the implemented system, the arousal model is largely based on Hanjalic's model [6]. It is fair to say at this stage that the arousal-based model adopted works rather well for content such as sports and action movies but is more limited for placid content, i.e. more 'quiet' movies, documentaries or home videos. This means the summarization performance will not be as good for such content. However, the proposed approach is rather simple and may thus provide very adequate summaries for a lot of important types of content. Notably, this solution is rather efficient for content which has a high probability of having users prepared to pay for a summary, like sports, which is a rather important argument. The simplicity of the proposed solution may also be an important argument in this context since the timeliness of this type of summaries is critical.

• Summary description creation—after user arousal modeling, an MPEG-7 compliant hierarchical summary description is created [10], resulting in the final output of the system. This module constitutes a major difference with respect to Ref. [6], which basically stops after the arousal modeling. According to the frame arousal scores, frames are grouped together in segments which are labeled with one of four proposed classes: "Top Highlights", "Key Points", "Extended Summary", or "Remaining Content"; the relevant segments are then included in the summary description under those labels. The arousal labels range from the most exciting—"Top Highlights"—to the least exciting—"Remaining Content". The labels, their number and the way to associate content to each of them are not specified by the MPEG-7 standard. In the current application, the "Top Highlights" label corresponds to the 10% most relevant content, the "Key Points" label to the next 15%, and the "Extended Summary" label to a further 25% (a small sketch of this labeling step is given after this list). Afterwards, the user can create and view summaries for an audiovisual content item based on his/her needs using two criteria: (i) relevance, by choosing among the first three labels (as the "Remaining Content" represents the whole audiovisual content) and (ii) length, by stating the duration of the desired summary. If the user decides to input the summary length, the summary will include segments from "Top Highlights" to "Remaining Content", until the desired length is reached. The small number of user interaction parameters—relevance or length—is intentional because easy interaction was considered very important for this summarization application. At the end of the process, if the user wishes, he/she can produce an MPEG-1 file with the created audiovisual summary. The MPEG Summaries Creation module is not considered a core module of the summarization system as it may or may not be run immediately after the summarization process is performed. A main novelty of the summarization system proposed in this paper is precisely the MPEG-7 compliant hierarchical summary description, based on which it is possible to create many types of summaries, following the user's needs, without rerunning the analysis and modeling processes; moreover, the MPEG-7 summary descriptions provide this system with some degree of interoperability, as all MPEG-7 compliant systems will be able to interpret and process the created summary descriptions.

Although various metadata are produced in the proposed system architecture described above, it was decided that interoperability was relevant only for the final summary descriptions and not for the system internal low-level metadata created along the process. This difference motivated the decision to use MPEG-7 standard description tools for the parts of the system where interoperability is a requirement and not for the other parts. Although MPEG-7 description tools could also have been used for the system internal metadata, the decision not to use them highlights the important fact that the usage of standards is essential for some modules but not for others. The authors believe this is an important message, to avoid standards appearing as a constant imposition, as system designers are at times compelled to conclude. That is, there is a 'design freedom' zone in the system, and here the authors decided to use it without any impact on the final system interoperability, as standards are used where they are essential.

4. Processing algorithms and metrics

This section provides a detailed description of the algorithms and metrics adopted and implemented for the modules in the architecture presented in Fig. 1.

4.1. Low-level features extraction

Fig. 2. Architecture of (a) motion information extraction, (b) shot cut detection and (c) sound information extraction modules.

Three feature extraction tracks were introduced in the system architecture, one for each of the selected low-level features: motion intensity, shot cuts, and sound energy. Each of these extraction tracks has the capability of delivering the low-level data needed for elementary arousal modeling. The architectures of the low-level features extraction modules are presented in Fig. 2 and briefly described in the following:

• Motion intensity extraction—to model the user arousal from motion intensity, motion vectors for the video frames (with the exclusion of I frames, which do not have motion vectors) are required. The motion intensity extraction module directly extracts from the MPEG-1 Video coded bitstream the necessary motion vectors for P, B, or B and P frames, depending on the system setup.

• Shot cut detection—to model the user arousal from the density of shot cuts, the frame indices which represent shot boundaries are needed. To detect the shot boundaries, a simple algorithm based on luminance and saturation histogram differences is used. This algorithm has two control parameters, α and β, related to the shot detection sensitivity and the frame step (frame distance between processed frames). The algorithm is not detailed in this paper because the specific shot cut detection algorithm is not particularly relevant for the purpose of this paper; any algorithm with acceptable performance may be used (a possible histogram-difference variant is sketched after this list).

• Sound energy extraction—to model the user arousal from the sound energy, all audio samples from the audiovisual content are necessary. The solution used to obtain all audio samples is based on a conversion from MPEG to WAV format (in this case, using an MP3 decoder), followed by the extraction of the samples from the WAV format.
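Since the paper leaves the shot cut detector unspecified, the sketch below shows one plausible histogram-difference detector of the kind described, assuming OpenCV for decoding and HSV conversion; the threshold alpha and frame step beta are free parameters, and the exact histogram comparison used by the authors is not known.

import cv2
import numpy as np

def detect_shot_cuts(video_path, alpha=0.5, beta=5):
    """Return the frame indices of detected shot boundaries.

    alpha: sensitivity threshold on the normalized histogram difference.
    beta:  frame step, i.e. distance between the frames that are compared.
    """
    capture = cv2.VideoCapture(video_path)
    cuts, previous_hist, index = [], None, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % beta == 0:
            hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
            # luminance (V channel) and saturation (S channel) histograms
            hist = np.concatenate([
                cv2.calcHist([hsv], [2], None, [64], [0, 256]).ravel(),
                cv2.calcHist([hsv], [1], None, [64], [0, 256]).ravel()])
            hist /= hist.sum()                      # normalize so frames are comparable
            if previous_hist is not None:
                difference = 0.5 * np.abs(hist - previous_hist).sum()
                if difference > alpha:              # large change -> shot boundary
                    cuts.append(index)
            previous_hist = hist
        index += 1
    capture.release()
    return cuts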

Each of the low-level feature extraction modules produces, as output, an XML file with the information collected in the extraction process. For each feature, an adequate XML structure has been defined, see Fig. 3. This output will serve as input to the corresponding arousal modeling module described next. The storage of the low-level data in an XML file with a well defined structure permits the execution of the low-level processing only once, since these low-level data can later be reused as many times as needed.

As shown in Fig. 3(a), the document type definition (DTD) for the motion information XML file has the following structure:

• video element—the root element of the XML file; it contains only one child element, the filename element.

• filename element—contains as an attribute the path to the audiovisual content original file; it also has one child element, the motionextraction element.

• motionextraction element—has several child elements, one related to the last frame—the lastFrame element—and one for each P or B frame, depending on the user's choice of parameters; this choice indicates which type of frames is considered, notably only P, only B, or both P and B.

• lastFrame element—contains the index of the last frame that is relevant for the arousal metrics computation.

• frame element—includes the frame's type (P or B), its index, and all its motion vectors.

The DTDs for the shot cut and sound information are similar to the motion activity DTD, as shown in Fig. 3(b) and (c).
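As an illustration of the structure just described, the following sketch writes a motion information file with Python's standard xml.etree.ElementTree; the element names video, filename, motionextraction, lastFrame and frame follow the DTD above, while the vector child element and the attribute names (path, index, type, dx, dy) are assumptions made for the example only.

import xml.etree.ElementTree as ET

def write_motion_xml(video_path, last_frame, frames, output_path):
    """frames: iterable of (frame_index, frame_type, motion_vectors),
    where motion_vectors is a list of (dx, dy) pairs for that P or B frame."""
    video = ET.Element("video")
    filename = ET.SubElement(video, "filename", {"path": video_path})
    extraction = ET.SubElement(filename, "motionextraction")
    ET.SubElement(extraction, "lastFrame").text = str(last_frame)
    for frame_index, frame_type, motion_vectors in frames:
        frame = ET.SubElement(extraction, "frame",
                              {"index": str(frame_index), "type": frame_type})
        for dx, dy in motion_vectors:
            ET.SubElement(frame, "vector", {"dx": str(dx), "dy": str(dy)})
    ET.ElementTree(video).write(output_path, encoding="utf-8", xml_declaration=True)

# Example: one P frame with two motion vectors.
write_motion_xml("match.mpg", 2, [(1, "P", [(3, -1), (0, 2)])], "motion.xml")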

Fig. 3. DTDs for (a) the motion, (b) shot cut and (c) sound information XML files.

4.2. Arousal modeling

After extracting the selected low-level data and storing it in XML files, the next stage is arousal modeling. Arousal modeling is done in two main steps: (i) arousal metrics computation; and (ii) fused arousal computation. The final arousal curve A_final(k) can be seen as the fusion of the elementary arousal functions A_i(k), each representing the arousal corresponding to the information for feature i along the audiovisual content (frame k). The computation of the A_i(k) functions is made in the arousal metrics computation module, while A_final(k) is generated in the fused arousal computation module.

The arousal model used in this paper is largely based on the arousal model proposed in Ref. [6]. However, the model is used here not only to identify highlights but rather to create MPEG-7 compliant hierarchical summary descriptions which should permit the generation of a varied range of summaries depending on the user's needs. The processes leading to the final arousal curve, A_final(k), are detailed in the following sections.

4.2.1. Arousal metrics computation

As explained previously, arousal metrics computation is the immediate stage after low-level information extraction. In the process of determining an arousal curve for each feature, to be given to the fused arousal process, three main steps have to be taken: (i) computing the associated arousal metric; (ii) smoothing that same metric; and (iii) scaling the smoothed curve to a common ([0,1]) scale. These processes are presented in this section for the three features selected.

4.2.1.1. Motion intensity arousal. First, following the steps described above, a metric for computing the motion intensity at each video frame k has to be defined. Having all motion vector values for each (P and/or B) frame stored in the motion information XML file, one possible way to represent the motion intensity for each video frame k is to compute the average motion vector component magnitude for each frame and divide it by the maximum possible motion vector component magnitude for that frame; with this, the motion intensity for each video frame k, in relation to its maximum, in this case on a [0,1] scale, is obtained. In this context, the following function is proposed as a motion intensity metric mi(k):

\[ mi(k) = \frac{1}{\left| v_{k,\max} \right|} \cdot \frac{\sum_{i=1}^{TotalMV} \left| v_i(k) \right|}{TotalMV} \qquad (1) \]

where mi(k) corresponds to the average magnitude of all motion vector component values extracted, v_i(k), for frame k (regardless of their direction, horizontal or vertical), normalized by the maximum possible magnitude of a motion vector component for that frame, |v_k,max|. TotalMV corresponds to the number of motion vector components, x and y, read for frame k.
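A minimal sketch of Eq. (1), assuming the motion vector components of one frame are already available as a flat list of signed integers read from the motion XML file; the maximum component magnitude |v_k,max| is passed in, since it depends on the coding parameters.

def motion_intensity(components, v_max):
    """Eq. (1): average absolute motion vector component magnitude of a frame,
    normalized by the maximum possible component magnitude v_max."""
    if not components or v_max == 0:
        return 0.0
    return sum(abs(v) for v in components) / (len(components) * abs(v_max))

# Example: four components (two vectors), maximum component magnitude 64.
print(motion_intensity([3, -1, 0, 2], 64))   # small value -> low motion arousal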

In order to smooth this motion intensity metric, for the reasons explained in Section 3, a mathematical convolution with a Kaiser window, K(N,α), is performed, creating a smoothed mi_Kaiser(k) curve. The Kaiser window is a window function used in digital signal processing, defined by the formula:

\[ w_n = \begin{cases} \dfrac{I_0\!\left(\alpha\sqrt{1-(2n/N-1)^2}\right)}{I_0(\alpha)}, & 0 \le n \le N \\ 0, & \text{otherwise} \end{cases} \qquad (2) \]

where I_0 is the zeroth order modified Bessel function of the first kind, α is an arbitrary real number that determines the shape of the window, and the integer N gives the length of the window (N+1 points). In this paper, the value of N is, typically, the total duration of the video divided by 15, with α = 5, as these values proved, after intensive testing, to be the most suitable for the desired purposes. Note that the convolution of any metric with the Kaiser window will produce a curve in a different scale range; therefore, rescaling is needed thereafter, precisely to scale the curve back to [0,1] values. Thus, to complete the motion intensity arousal metric computation, the mi_Kaiser(k) curve is scaled back to [0,1] values, as is also done for the other metrics. The scaling results in a final motion intensity arousal metric, A_motion(k):

\[ A_{motion}(k) = \frac{\max\left(mi(k)\right)}{\max\left(mi_{Kaiser}(k)\right)} \cdot mi_{Kaiser}(k) \quad \text{with} \quad mi_{Kaiser}(k) = mi(k) \ast K(N,\alpha) \qquad (3) \]

Fig. 4 shows a motion intensity arousal curve (in %) before and after Kaiser window filtering and scaling.
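The smoothing and rescaling of Eqs. (2) and (3) can be sketched with NumPy, whose kaiser function implements the window of Eq. (2); the choice of a 'same'-length convolution is an assumption made so that the smoothed curve keeps one value per frame.

import numpy as np

def smooth_and_scale(metric, alpha=5.0, window_fraction=15):
    """Eqs. (2)-(3): convolve a per-frame metric with a Kaiser window and
    rescale the result so that its maximum matches the original maximum."""
    metric = np.asarray(metric, dtype=float)
    n = max(len(metric) // window_fraction, 1)        # window length ~ duration / 15
    window = np.kaiser(n + 1, alpha)                   # Eq. (2), N + 1 points
    smoothed = np.convolve(metric, window, mode="same")
    if smoothed.max() > 0:
        smoothed *= metric.max() / smoothed.max()      # scale back to the [0,1] range
    return smoothed

The same smoothing and scaling would apply unchanged to the shot cut density and sound energy metrics described below.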

4.2.1.2. Shot cut density arousal. The same type of reasoning made for the motion information is required for the shot cut information. Here, the goal is to relate the shot durations, also known as the shot rhythm, to the arousal experienced by the audience. Shorter consecutive shots are normally related to moments of fast action, while longer shots often mean calmer and more relaxing segments in the audiovisual content. With this in mind, the objective of the adopted shot cut density arousal metric is to compute arousal values which are able to express this relationship. In this way, the metric should result in higher values for shorter shots and lower values for longer shots. The metric adopted to fulfill these requirements, sc(k), is:

\[ sc(k) = e^{\left(1-\left(n(k)-p(k)\right)\right)/\delta} \qquad (4) \]

where n(k) and p(k) represent, respectively, the frame index of the next and previous shot boundaries in relation to the current frame k. These index values are obtained from the XML file created as output of the shot cut detection process. The difference between 1 and the shot duration (n(k) - p(k)) in e^x implies computing e^x only for x < 0; consequently, the smaller the difference between n(k) and p(k), and hence the duration of the shot, the closer the sc(k) value is to 1, and vice-versa. To constrain the x values to an adequate range, the parameter δ determines the shape of the curve. A high value of δ will result in a curve where all values are too close to 1, as all x values are close to 0; on the contrary, a value of δ that is too small will result in a curve always near 0, as shot durations are normally in the hundreds or thousands of frames (hence, the e^x value will be much smaller). A value of δ around 300 has proven to work well, as the resulting curve shows fluctuations adequate for an arousal modeling process.
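A small sketch of Eq. (4), assuming the shot boundary frame indices have already been read from the shot cut XML file; the lookup of the enclosing shot is an illustrative choice.

import math
import bisect

def shot_cut_density(frame, boundaries, delta=300.0):
    """Eq. (4): arousal value for a frame from the length of the shot it belongs to.

    boundaries: sorted list of shot boundary frame indices (including 0 and the
    last frame), so that the enclosing shot of `frame` can be looked up directly.
    """
    position = bisect.bisect_right(boundaries, frame)
    previous_cut = boundaries[position - 1]                      # p(k)
    next_cut = boundaries[min(position, len(boundaries) - 1)]    # n(k)
    return math.exp((1.0 - (next_cut - previous_cut)) / delta)

# Example: a short shot (frames 120 to 150) gives a higher value than a long one.
cuts = [0, 120, 150, 900]
print(shot_cut_density(130, cuts), shot_cut_density(500, cuts))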

For the same reasons, and in the same way as for the motion intensity feature, smoothing and scaling have to be performed to create the final arousal curve for the shot cut density feature, A_shotcut(k):

\[ A_{shotcut}(k) = \frac{\max\left(sc(k)\right)}{\max\left(sc_{Kaiser}(k)\right)} \cdot sc_{Kaiser}(k) \quad \text{with} \quad sc_{Kaiser}(k) = sc(k) \ast K(N,\alpha) \qquad (5) \]

4.2.1.3. Sound energy arousal. As for the previous low-level feature metrics, the goal of this metric is to relate the feature information, sound energy, to the degree of excitement experienced by the audience. For sound, this relation may be rather simple: the louder the sound of an audiovisual segment, the higher the arousal experienced by the viewer. Explosions or gunfire in action films, cheers of the audience in sports broadcasts, and screams in horror films are all segments with high values of sound energy, and all these segments typically provoke high arousal experiences in the viewer. On the other hand, silent segments are usually associated with calm moments for the viewer. Therefore, the objective is to use an arousal metric capable of providing higher values for higher sound energy values and lower values for lower sound energy values. To do so, se(k), representing for each frame k the sum of the squares of the audio samples, is computed:

\[ se(k) = \sum_{i=1}^{TotalSamples} \left( AudioSample_i(k) \right)^2 \qquad (6) \]

where AudioSample_i(k) represents the ith audio sample in frame k and TotalSamples is the number of audio samples corresponding to the duration of a video frame (because se(k) is computed for each video frame).

Smoothing is applied in the same way (and for the same reasons) as previously described for the other metrics, by convolving se(k) with the Kaiser window, with N and α valued as before, thus defining se_Kaiser(k) as:

\[ se_{Kaiser}(k) = se(k) \ast K(N,\alpha) \qquad (7) \]

However, for sound energy, a second scaling process must be performed. While the first scaling has the same function as the scaling used for the other elementary arousal metrics, i.e. to scale the arousal curve resulting from the smoothing filtering back to [0,1] values, the second scaling is related to the fact that the sound energy arousal metric is not initially computed on a [0,1] scale, as the other metrics are. Thus, its peak values may be very different from frame to frame and between different contents. Therefore, in order to ensure that the sound energy curves may be compared, the smoothed sound energy curve must be scaled in an appropriate way, notably:

\[ se_{Kaiser,n}(k) = \frac{se_{Kaiser}(k)}{\max\left(se_{Kaiser}(k)\right)} \qquad (8) \]

and

\[ \overline{se_{Kaiser,n}} = \frac{1}{NumFrames} \sum_{k=1}^{NumFrames} se_{Kaiser,n}(k) \qquad (9) \]


where NumFrames is the number of frames in the video sequence. To create the final sound energy arousal curve, A_soundenergy(k) is computed as:

\[ A_{soundenergy}(k) = se_{Kaiser,n}(k) \cdot \left( 1 - \overline{se_{Kaiser,n}} \right) \qquad (10) \]

While the first scaling function is performed by Eq. (8), the second scaling is performed by Eq. (10). If no second scaling were applied, a sound energy arousal curve with a value of 1 at the highest sound energy value would always result. Such a result is not desirable because, as for motion intensity and shot cut density, the purpose of the sound energy arousal curve is to model the arousal experienced by the viewer while watching an audiovisual content item, and not just to produce a curve expressing the sound energy values in relation to their maximum value. Therefore, Eq. (9) is necessary to transform the first scaled function into a curve capable of representing the level of arousal related to the sound energy felt by the viewer along the content. The solution adopted here is to compute the mean of se_Kaiser,n(k), i.e. the mean of the sound energy values relative to their peak (Eq. (9)) [6]. A high mean value indicates that the sound energy of most of the audiovisual content is close to the peak value; therefore, the arousal curve should be flattened and scaled down, as the audiovisual content is very constant in terms of sound energy. However, if the mean value is low, then the sound energy peak may be considerably more relevant than the rest of the content, implying that the variations are more significant. Therefore, the curve should highlight its peaks more intensely, as these peaks represent important and relevant changes in terms of arousal.
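Eqs. (6)-(10) can be sketched as follows, assuming the decoded audio samples are available as a NumPy array together with the sample rate and the video frame rate; the window length and alpha follow the earlier choices, and everything else is illustrative.

import numpy as np

def sound_energy_arousal(samples, sample_rate, frame_rate, alpha=5.0):
    """Eqs. (6)-(10): per-video-frame sound energy arousal curve."""
    samples = np.asarray(samples, dtype=float)
    per_frame = int(sample_rate / frame_rate)           # audio samples per video frame
    num_frames = len(samples) // per_frame
    # Eq. (6): sum of squared samples over the duration of each video frame
    energy = np.array([np.sum(samples[k * per_frame:(k + 1) * per_frame] ** 2)
                       for k in range(num_frames)])
    # Eq. (7): smoothing with a Kaiser window (same choices as for the other metrics)
    window = np.kaiser(max(num_frames // 15, 2), alpha)
    smoothed = np.convolve(energy, window, mode="same")
    normalized = smoothed / smoothed.max()               # Eq. (8): relate to the peak value
    mean_level = normalized.mean()                       # Eq. (9): mean relative level
    return normalized * (1.0 - mean_level)               # Eq. (10): flatten near-constant content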

4.2.2. Fused arousal computation

The last sub-module before creating the summary description is the fused arousal computation. The objective of fused arousal computation is to integrate all elementary arousal curves resulting from the low-level feature arousal metric computation processes. As the maximum of each A_i(k) can be located at different frame indices, a weighted average of the various feature metrics can be an appropriate fusion function, as it has been shown to be faithful enough to the variations of each individual A_i(k) function [6]. As the default solution, all three features are considered with the same weight, i.e. 1/3 of the weight in the creation of the final arousal curve. However, the user can change the weights assigned to each feature if he/she wishes to see how the final summary changes for a certain piece of content, or if different types of content require different weights to produce more meaningful summaries. The fused arousal metric, A'_final(k), is then computed as:

\[ A'_{final}(k) = \sum_{i=1}^{3} w_i A_i(k) \qquad (11) \]

where i regards motion intensity, shot cut density, and sound energy, and k is the video frame number.

In Eq. (11), w_i refers to the weight assigned to each feature; the weights must sum to 1. Smoothing is done to merge the neighboring maxima of each A_i(k) function and is applied in the same way as before. After smoothing, scaling is performed to scale the values back to [0,1]. The fused arousal curve, A_final(k), is the final result, computed after scaling:

\[ A_{final}(k) = \frac{\max\left(A'_{final}(k)\right)}{\max\left(A'_{finalKaiser}(k)\right)} \cdot A'_{finalKaiser}(k) \qquad (12) \]

with

\[ A'_{finalKaiser}(k) = A'_{final}(k) \ast K(N,\alpha) \qquad (13) \]

Fig. 5 shows an example with the three elementary arousal curves and the final fused arousal curve for a video sequence extracted from the 1998 Football World Cup Final (l is the length of the Kaiser window).
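A minimal sketch of the fusion step of Eqs. (11)-(13); the equal default weights follow the text, while the window length and the 'same'-length convolution are illustrative assumptions.

import numpy as np

def fuse_arousal(curves, weights=(1/3, 1/3, 1/3), alpha=5.0):
    """Eqs. (11)-(13): weighted average of the elementary arousal curves,
    Kaiser-smoothed and rescaled back to the [0,1] range.

    curves: the three per-frame arrays (motion, shot cut density, sound energy)."""
    curves = [np.asarray(c, dtype=float) for c in curves]
    fused = sum(w * c for w, c in zip(weights, curves))       # Eq. (11), weights sum to 1
    window = np.kaiser(max(len(fused) // 15, 2), alpha)
    smoothed = np.convolve(fused, window, mode="same")        # Eq. (13)
    return smoothed * fused.max() / smoothed.max()            # Eq. (12): scale back to [0,1]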

4.3. MPEG-7 compliant hierarchical summary description creation

The main output of the full summarization process, derived from the final fused arousal curve resulting from the previous modules, is an XML file with an MPEG-7 compliant hierarchical summary description of the audiovisual content. This hierarchical summary description provides the means to represent the audiovisual content as segments labeled according to their importance. The most important level contains the top of the hierarchy and, as the levels progress downward, less important segments are included in the summary.

The creation of the MPEG-7 hierarchical summary description has two main motivations: (i) flexibility, as this permits the user to view and create many different summaries fulfilling different needs, e.g. of different types or with different lengths, without having to repeat the entire summarization process; and (ii) interoperability, as this permits the creation of a summarization output capable of interoperating with other MPEG-7 compliant systems.

To achieve the first requirement, the obvious choice was to create a description capable of hierarchically representing the audiovisual content in terms of its summarization relevance. To fulfill the second requirement, the solution was to adopt a standard format. In this case, the MPEG-7 standard was chosen because it defines


precisely a description tool, HierarchicalSummary, for this purpose. From one MPEG-7 compliant hierarchical summary description, an infinite number of summaries can be created according to the user's needs, notably in terms of relevance and length.

MPEG-7 offers various types of audiovisual description tools that represent the metadata elements and their structure and relationships, named descriptors and description schemes, respectively [10]. The description tools are used to create descriptions, i.e. instantiations of description schemes with their associated descriptors, which can constitute the basis for applications providing efficient access, filtering, retrieval, summarization, etc., of multimedia content. Two of the many description schemes defined in the MPEG-7 standard focus on summarization capabilities: the SequentialSummary and HierarchicalSummary description schemes. The SequentialSummary description scheme is used to specify summaries of variable length with the objective of supporting sequential navigation. The HierarchicalSummary description scheme is used to specify summaries of variable length with the intention of supporting both sequential and hierarchical navigation. The description schemes are represented in XML format and, therefore, the HierarchicalSummary description scheme emerged as the natural solution to fulfill the requirements defined for the developed summarization application. Since the MPEG-7 summary description tools are very flexible and powerful, some decisions on their usage had to be taken,


again exploiting the freedom left to the system designer in terms of the creation of the MPEG-7 descriptions. Considering that sports and action movies are among the most relevant types of content for this kind of summarization system, it was decided to use the adopted HierarchicalSummary description scheme to provide deep summaries (only the top topics, but well covered) rather than shallow ones (most topics briefly covered), as this is typically the user preference in this context.

The DTD structure of the MPEG-7 HierarchicalSummary description scheme and, consequently, of the output XML summary file is shown in Fig. 6 (MPEG-7 HierarchicalSummary description scheme DTD).

From the DTD structure presented in Fig. 6, it is possible to see that the main elements to highlight are the SummarySegmentGroup and SummarySegment elements, which are described next:


• SummarySegmentGroup element—it contains one or more SummarySegments that correspond to the basic segments of the content. A SummarySegmentGroup element may also contain one or more children SummarySegmentGroup elements, which represent the summary at a finer level of detail. This construct may be used to build hierarchical summaries.

• SummarySegment element—it contains time stamps indicating the location of the audio and/or video associated with this segment. A SummarySegment also has an order attribute to specify the navigation or presentation order of the segments in a summary (a minimal construction sketch follows this list).
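The following Python sketch illustrates how a description with this structure could be assembled; only the element names discussed above are used, and the MPEG-7 namespaces, schema attributes and exact media time formats are deliberately simplified, so this is a structural illustration rather than a schema-valid MPEG-7 document.

import xml.etree.ElementTree as ET

def build_summary_description(segments):
    """Minimal sketch of an MPEG-7-style hierarchical summary description.

    `segments` is a list of (level, order, start_time, duration) tuples,
    where the time values are illustrative strings, not the exact MPEG-7
    media time syntax.
    """
    root = ET.Element("HierarchicalSummary")
    groups = {}
    for level, order, start, duration in segments:
        # One SummarySegmentGroup per hierarchical level (0 = Top Highlights).
        if level not in groups:
            groups[level] = ET.SubElement(
                root, "SummarySegmentGroup", level=str(level))
        seg = ET.SubElement(groups[level], "SummarySegment", order=str(order))
        ET.SubElement(seg, "MediaTimePoint").text = start
        ET.SubElement(seg, "MediaDuration").text = duration
    return ET.tostring(root, encoding="unicode")

# Example: two "Top Highlights" segments and one "Key Points" segment.
print(build_summary_description([
    (0, 1, "PT0M42S", "PT0M08S"),
    (0, 2, "PT5M10S", "PT0M05S"),
    (1, 1, "PT2M03S", "PT0M12S"),
]))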



The first step in creating the output MPEG-7 compliant hierarchical summary description is to decide which audiovisual segments should belong to which hierarchical level. It is proposed to use four hierarchical levels, with the objective of representing three different types of summaries in terms of additional content. The bottom of the hierarchy does not necessarily represent a summary type as it corresponds to the entire audiovisual content. However, because a user can decide to watch a summary by length, segments belonging to the bottom of the hierarchy may have to be included in the requested summary. The existence of three types of summaries strives to provide the user with a sufficiently wide range of summaries capable of representing what is differently important in the original content. The proposed hierarchical levels are, in a top-down approach:

• Top Highlights, level 0—intends to represent the most exciting moments of the audiovisual content, i.e. the segments that will provoke more arousal in the viewer and, consequently, have higher arousal values on the final fused arousal curve. In this paper, the segments with the top 10% arousal values are labeled as "Top Highlights".

• Key Points, level 1—the second level of the description hierarchy aims to provide some context for the "Top Highlights" segments. A "Key Points" summary would be presented to the user including the "Top Highlights" segments as well as the segments labeled as "Key Points", i.e. it would include the first two levels of the description hierarchy. In this paper, the frame segments in the top 10-25% arousal values are labeled as "Key Points"; the 25% value was selected because it is a middle value able to produce a relatively short summary while including most of the very interesting segments.

• Extended Summary, level 2—this level provides a summary with a longer duration, with the objective of offering the user a wider view of the audiovisual content. In this paper, an "Extended Summary" will present to the user 50% of the total audiovisual content, basically excluding only the dullest and least interesting segments of the audiovisual content.

• Remaining Content, level 3—the remaining segments are labeled as "Remaining Content" and should contain the less (arousal) relevant parts of the audiovisual content.

The process of labeling the frames is quite simple and relies on a function capable of retrieving the top x% of frames with maximum arousal values from the final arousal curve. No segments shorter than 3 s are included in the first three levels of summarization because these segments are considered too short to be meaningful from a summarization perspective. Therefore, first the 10% of frame indexes with top arousal values are retrieved and grouped together; those segments are labeled as "Top Highlights". The second step is related to "Key Points" segments; to reach the desired top 25% of frame indexes with top arousal values, 15% additional frame indexes are retrieved from the final arousal curve; these indexes are afterwards grouped into segments and labeled as "Key Points". The same process is followed for "Extended Summary": a further 25% of frame indexes are retrieved, grouped into segments, and labeled as "Extended Summary". Finally, the remaining 50% of frame indexes are also grouped together to form the segments labeled as "Remaining Content". Within each level, the segments are hierarchically organized by their arousal values, meaning that the first segments to appear, and thus to be used, are those with higher arousal values. Fig. 7 shows an example of an MPEG-7 compliant summary description.
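A possible realization of this labeling step is sketched below in Python (an illustration, not the authors' code); the frame rate and the handling of too-short segments, which are simply pushed down to the bottom level here, are assumptions.

import numpy as np

def label_levels(arousal, fps=25.0, min_seg_s=3.0):
    """Rank frames by fused arousal and assign hierarchical levels.

    Level 0 = Top Highlights (top 10%), level 1 = Key Points (next 15%),
    level 2 = Extended Summary (next 25%), level 3 = Remaining Content.
    Contiguous frames with the same level are grouped into segments; for
    levels 0-2, segments shorter than `min_seg_s` are demoted to level 3.
    """
    n = len(arousal)
    order = np.argsort(arousal)[::-1]          # frame indexes, highest arousal first
    levels = np.full(n, 3, dtype=int)          # default: "Remaining Content"
    bounds = [int(0.10 * n), int(0.25 * n), int(0.50 * n)]
    levels[order[:bounds[0]]] = 0
    levels[order[bounds[0]:bounds[1]]] = 1
    levels[order[bounds[1]:bounds[2]]] = 2

    # Group contiguous frames of the same level into (level, start, end) segments.
    segments, start = [], 0
    for k in range(1, n + 1):
        if k == n or levels[k] != levels[start]:
            segments.append((int(levels[start]), start, k - 1))
            start = k

    # Enforce the 3 s minimum duration for the first three levels.
    min_frames = int(min_seg_s * fps)
    return [(lvl if (end - begin + 1) >= min_frames or lvl == 3 else 3, begin, end)
            for lvl, begin, end in segments]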

4.4. MPEG-1 summaries creation

The MPEG-1 summaries creation process regards the extraction, from the full coded stream, of the video and audio segments identified by the MPEG-7 summary description for the summary criteria provided by the user. From the hierarchical summary description, and according to the user's choice of parameters, a summary is generated, formed by the adequate group of segments from the audiovisual content. In this way, the application will create a summary for visualization and, eventually, an MPEG-1 file for storage with the desired summary, by slicing the relevant segments from the audiovisual content and joining them together. The user may ask for the creation of two types of summaries:

• Summaries by relevance—three summary relevance levels are provided as described above, namely "Top Highlights", "Key Points" and "Extended Summary".

• Summaries by length—the user inputs the desired summary length; again, no segments shorter than 3 s are included in any summary since these segments are considered too short to be meaningful from a summarization perspective (a selection sketch for both summary types follows this list).
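The following Python sketch illustrates how segments could be selected for the two summary types; the segment tuple layout and the greedy filling strategy for summaries by length are assumptions made for illustration, not the paper's exact procedure.

def select_segments(segments, mode, value, fps=25.0):
    """Choose segments for a summary by relevance or by length.

    `segments` is a list of (level, order_within_level, start_frame, end_frame)
    tuples taken from the hierarchical description, where a lower `order`
    within a level means higher arousal. For a summary by relevance, all
    segments up to the requested level are kept; for a summary by length,
    segments are taken in relevance order until the requested duration
    (in seconds) is filled.
    """
    if mode == "relevance":                      # value = 0, 1 or 2 (level)
        chosen = [s for s in segments if s[0] <= value]
    else:                                        # mode == "length", value in seconds
        budget = value * fps
        chosen = []
        for seg in sorted(segments, key=lambda s: (s[0], s[1])):
            length = seg[3] - seg[2] + 1
            if length < 3 * fps:                 # never include segments shorter than 3 s
                continue
            if budget - length < 0:
                break
            chosen.append(seg)
            budget -= length
    # Present the selected segments in their original temporal order.
    return sorted(chosen, key=lambda s: s[2])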

From one MPEG-7 compliant hierarchical summary description, an infinite number of summaries can be created, as the user can choose to create summaries by length or by relevance. The user can either only play the summary or can also create and store an MPEG-1 compliant file with the audiovisual content corresponding to the requested summary.
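As for the actual cutting and joining of the MPEG-1 material, the paper does not detail the mechanism used; purely as an illustration, the sketch below shells out to ffmpeg (stream copy plus the concat demuxer) to extract each selected segment and join them into a single file.

import os
import subprocess
import tempfile

def export_summary(source_mpg, segments, output_mpg):
    """Cut the selected segments out of `source_mpg` and join them.

    `segments` is a list of (start_seconds, duration_seconds) pairs derived
    from the summary description. This is an illustrative sketch, not the
    authors' implementation.
    """
    pieces = []
    for i, (start, duration) in enumerate(segments):
        piece = f"segment_{i}.mpg"
        subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-t", str(duration),
                        "-i", source_mpg, "-c", "copy", piece], check=True)
        pieces.append(piece)

    # Write the list file expected by ffmpeg's concat demuxer, then join the pieces.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.writelines(f"file '{os.path.abspath(p)}'\n" for p in pieces)
        list_file = f.name
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_file, "-c", "copy", output_mpg], check=True)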

5. Graphical user interface

This section provides a brief overview of the developed application's GUI. The application is composed of a single Windows Form, which is divided into five main areas, as shown in Fig. 8. This figure shows the state of the application after extracting motion information. In the charts and tab colors, light blue is related to motion information, green to shot cut information and orange to sound information. The five areas highlighted in Fig. 8 have the following main functions:

1. Player—area intended to play the audiovisual content.

Fig. 8. Summarization application GUI.

Fig. 7. Example of an MPEG-7 compliant summary description.


2. Charts/Summary player tab control—area intended to present the arousal charts for each feature to the user and also to play the final summary. It has four tabs, three related to the arousal charts for each of the low-level features and a fourth one intended to play the final created summary.

3. Main tab control—area used to present information resulting from the low-level feature extraction to the user and also to parameterize the fused arousal computation process as well as the summary creation and viewing processes.

4. Side menu—it has three buttons which serve as the starting points for the corresponding low-level feature extraction processes.

5. Side menu options tab—area reserved for controlling the options and parameters related to each low-level feature extraction process.

The next section will present the user evaluation study developed to evaluate the summarization system performance.

6. Performance evaluation

Since the development of any multimedia application is not complete without a solid and meaningful evaluation of its performance against the user objectives, this section presents and analyzes the results of a subjective evaluation study carried out to evaluate the developed application's performance. Considering the type of system developed, a subjective approach was considered the adequate performance evaluation methodology.

A review of the literature reveals one particularly relevant solution for the subjective evaluation of video summaries, notably the summarization evaluation process proposed by TRECVID in the Call for Proposals for the 2008 campaign. The main features of the TRECVID 2008 summarization evaluation methodology are [18]:

1. Content—BBC video rushes. Rushes are the raw material used to produce a video; 20-40 times as much material may be shot as actually becomes part of the finished product.

2. Task—automatically create MPEG-1 summary clips no longer than 2% of the full video, showing the main objects and events in the video to be summarized, using the minimal number of frames and presenting the information in a way that maximizes usability and speed of object/event recognition.

3. Ground truth—lists of significant segments, identified in terms of a major object/event, created at NIST for each video to be summarized.

4. Evaluation—a user will view the summary using only the play and pause controls to check off the objects/events in the ground truth list that appear in the video summary and answer questions about summary quality. This evaluation process will be timed.

5. Measures—the following measures are to be used:
   • Fraction of ground truth segments found in the summary
   • Time needed to check the summary against the ground truth
   • Size of the summary in terms of number of frames
   • Elapsed system time to create the summary
   • Usability/quality of the summary

The TRECVID methodology uses a ground truth reference, which means that it requires the preliminary definition of the ground truth.

As deriving this definition is often not an obvious process for many content types, it was decided, in the context of this paper, to design a subjective evaluation methodology without reference, implying that the prior definition of a ground truth is not required. This approach leaves to the evaluation subjects the definition of what is more exciting in the content, avoiding the difficult task of predefining this.

Having defined an appropriate subjective evaluation methodology, a user evaluation study was performed to assess how the developed summarization application performs in view of the initially defined objectives, notably:

1. How good is the experience provided by the created summaries, according to each summary type, i.e. whether "Top Highlights" summaries capture only the indispensable segments of the content, whether "Key Points" summaries provide some context to a "Top Highlights" summary, capturing only interesting segments without being too extensive, and whether an "Extended Summary" excludes the most boring segments of the content.

2. Whether any relevant segments are excluded from any of the created summary types, notably "Top Highlights", "Key Points", and "Extended Summary".

The next section presents the test methodology and conditions designed to achieve these two objectives.

6.1. Test methodology

For the user study to be credible and meaningful, it has to be defined with enough precision to be reproducible by other experts, yielding statistically similar results and conclusions.

6.1.1. Test material

The test set comprised six audiovisual test sequences, divided into two classes with three pieces per class (see Fig. 9):

1. Sports content—clips from sports broadcasts: two clips from football matches and one clip from a basketball match.

2. Entertainment content—clips from TV series containing action events: one from "Lost", one from "Prison Break" and one from "Heroes".

Fig. 9. Frame samples of the six test sequences.

Table 1. Test material characteristics

Content                               Content duration   Content resolution
Sports content
  BASKETBALL.mpg                      10:02              320x240
  FOOTBALL1.mpg                       13:27              418x288
  FOOTBALL2.mpg                       10:10              418x288
Entertainment content
  ACTION1.mpg, from "Heroes"          14:14              624x352
  ACTION2.mpg, from "Lost"            12:48              624x352
  ACTION3.mpg, from "Prison Break"    14:38              624x352


As it was the intention that each evaluation subject view the original content as well as the three types of summaries created for each piece of content, the original contents were clipped in order to reduce the test's duration, so that the results can be considered meaningful without exhausting the subjects. This clipping was performed only in terms of duration, which means the sequences evaluated still corresponded to a continuous part of the initial content. The content durations as well as their luminance spatial resolutions are presented in Table 1.

Using the six test sequences, 18 summaries were produced and exported to MPEG-1 files using the developed summarization application. This means that for each audiovisual piece, three summaries were produced: (i) a "Top Highlights" summary; (ii) a "Key Points" summary; and (iii) an "Extended Summary".

6.1.2. Test questions

The test questions defined for this user study to address the objectives defined above are:

• Question 1—does the viewed summary satisfy its type definition, i.e. does it contain the most relevant/exciting 10%, 25%, and 50% of the original content, respectively, for the "Top Highlights", "Key Points", and "Extended Summary" summaries?

• Scores for Question 1—(a) not at all; (b) badly; (c) reasonably; (d) mostly; and (e) totally.

• Question 2—were any relevant segments left out of the viewed summary for each summary type, considering their definitions, i.e. "Top Highlights" (10%), "Key Points" (25%), and "Extended Summary" (50%)?

• Scores for Question 2—(a) all; (b) many; (c) some; (d) few; and (e) none.

6.1.3. Sequence of testing

A group of 15 volunteers, aged between 20 and 60 years, was asked to view the original audiovisual content and the various summaries created, and to give their subjective assessment for the questions above, following the sequence of steps defined next:

a. Open and visualize the original content, starting from the first piece of test material (in this test, "Basketball").

b. For the original content at hand, visualize a single time its three possible relevance summaries, i.e. "Top Highlights", "Key Points", and "Extended Summary".

c. Answer Questions 1 and 2, marking with a cross (X), in the evaluation tables, the desired assessment score for each of the three summaries just visualized, i.e. "Top Highlights", "Key Points", and "Extended Summary".

d. Go back to step (a) for all remaining content items until the whole test set is evaluated.

6.2. Results and analysis

This section presents the evaluation results obtained with the tests, and analyzes them so that some conclusions may be drawn.

Table 2 shows the results obtained in this user evaluation study for Question 1.


Table 2. Evaluation results for Question 1 (all values in percentage)

                                 (a) Not at all  (b) Badly  (c) Reasonably  (d) Mostly  (e) Totally
Sports content
  TopHighlights                        0.0          0.0         23.1           51.3        25.6
  KeyPoints                            0.0          0.0          7.7           48.7        43.6
  ExtendedSummary                      0.0          0.0          7.7           38.5        53.9
Entertainment content
  TopHighlights                        0.0         18.0         28.2           28.2        25.6
  KeyPoints                            0.0          2.6          7.7           59.0        30.8
  ExtendedSummary                      0.0          2.6          0.0           23.1        74.4
TopHighlights average                  0.0          9.0         25.6           39.7        25.6
KeyPoints average                      0.0          1.3          7.7           53.9        37.2
ExtendedSummary average                0.0          1.3          3.9           30.8        64.1
Sports content average                 0.0          0.0         12.8           46.2        41.0
Entertainment content average          0.0          7.7         12.0           36.8        43.6
Total average                          0.0          3.9         12.4           41.5        42.3

Table 3. Evaluation results for Question 2 (all values in percentage)

                                 (a) All  (b) Many  (c) Some  (d) Few  (e) None
Sports content
  TopHighlights                    0.0      10.3      30.8      41.0     18.0
  KeyPoints                        0.0       0.0      15.4      56.4     28.2
  ExtendedSummary                  0.0       5.1       7.7      41.0     46.2
Entertainment content
  TopHighlights                    0.0      18.0      43.6      23.1     15.4
  KeyPoints                        0.0       2.6      18.0      46.2     33.3
  ExtendedSummary                  0.0       2.6       7.7      33.3     56.4
TopHighlights average              0.0      14.1      37.2      32.1     16.7
KeyPoints average                  0.0       1.3      16.7      51.3     30.8
ExtendedSummary average            0.0       3.9       7.8      33.3     56.4
Sports content average             0.0       5.1      18.0      46.2     30.8
Entertainment content average      0.0       7.7      23.1      34.2     35.0
Total average                      0.0       6.4      20.5      40.2     32.9


Regarding Question 1, the average results show that 41.5% and 42.3% of the subjects considered that the viewed summaries 'mostly' or 'totally' satisfied their type definition. None of the subjects considered that the summaries did not satisfy their type definition at all, and only 3.9% considered that they satisfied it 'badly'. Analyzing the results by summary type ("Top Highlights", "Key Points" and "Extended Summary"), all summaries had positive results, with the scores 'mostly' and 'totally' always adding up to more than 65%; this poorest result was obtained for the "Top Highlights" summaries. The "Key Points" and "Extended Summary" summaries presented very good results, with the scores 'mostly' and 'totally' adding up to about 90%. In terms of content type, the results were more satisfactory for sports than for entertainment content. The authors believe this is related to the fact that exciting moments in sports broadcasts are easier to identify than in TV series, as they do not depend so much on the viewer's interpretation; the identification of an exciting event in sports is less ambiguous than in TV series since sports have better defined 'rules and goals'. This distinction implies that more precise event semantics are more easily identified.

Regarding Question 2, Table 3 shows that, on average, 40.2% and 32.9% of the subjects considered that only 'few' or 'none' of the relevant segments were excluded from the viewed summaries for the various types; this sums to nearly 73% of adequate summaries. Performing an analysis by type, the "Top Highlights" summaries are those that, as could be expected, show the poorest results, because they are the shortest summaries in duration and therefore the most likely to exclude a relevant segment. Even so, the "Top Highlights" summaries present a combined result for the 'few' and 'none' scores of near 50%, with the majority of the subjects considering that 'some' segments were excluded from the summary and only 14.1% considering that 'many' relevant segments were excluded. The "Key Points" and "Extended Summary" summaries performed quite well in Question 2, as was already the case for Question 1, with total results for the scores 'few' and 'none' near 82% and 90%, respectively.

Question 2 presented a greater discrepancy of results between sports and entertainment content, with the sports content summaries achieving better results than those for entertainment content. Once more, it is believed that this is related to the fact that entertainment content may have rather complex storylines and, therefore, the 'quality' of the summary is more dependent on each user's interpretation than for sports content, where the main events, such as goals in football, are more easily identifiable by any subject, i.e. they have a more 'obvious' semantics.

7. Conclusions and future work

This paper proposes and evaluates a fully automatic summarization application for generic audiovisual content based on MPEG-7 compliant hierarchical summary descriptions, which allows flexibility, low complexity, and


interoperability. The novelty of this paper lies in the exploitation of a three-feature, low-level arousal model to generate the summary metadata needed to instantiate MPEG-7 compliant summary descriptions, with all the advantages this brings. Moreover, a solid evaluation study has been performed. This paper also proposes some classification dimensions to cluster automatic audiovisual summarization solutions, so that a better and more structured understanding of this field can be obtained.

Despite the promising results obtained with the user evaluation study, the summarization system developed still has room for improvement, mainly regarding the issues highlighted next. Regarding the low-level information extraction processes, other motion and sound low-level metrics may be studied in the future. For example, it is possible to adopt a lighter metric for the sound energy based on the uncompressed subband scale factors used in MPEG Audio coding, as was done in Ref. [14]. Moreover, other low-level features may be introduced in the system to make the final fused arousal metric more representative and robust, for example the position of the motion within the frame (since users typically pay more attention to the center of the screen) and the type and position of colors (since users are typically more sensitive to certain colors, such as red). In terms of the developed application, some additional features can also be added in the future, namely a summarization wizard able to guide the user step by step through the entire summarization process and, possibly, a more complete summary player, e.g. allowing the user to leap from segment to segment inside the summary by clicking on the corresponding segment in the chart. Despite presenting good results and fulfilling its function in a rather satisfactory manner, the fused arousal metric can also be improved, or other fusion metrics can be studied, in order to enhance the system's performance. For example, the fusion weights may be defined according to a given task or a given type of content; in this context, it would be useful to propose a statistical or parametric model to define these weights. Finally, regarding the performance evaluation, more complete tests can be made in the future, notably exploiting the availability of some ground truth, using more content types and also longer pieces of content, to evaluate how the system performs under those different and more complete conditions. A more complete set of questions can also be posed to the users, in order to evaluate, in more detail, the meaningfulness of each of the created summaries. The tests can also be conducted on a larger scale, with a higher number of subjects, collecting many more scores and, therefore, obtaining more statistically significant results. An alternative approach to summarization evaluation may consider the predefinition of a model of the task the summary stands for, so that an objective evaluation process becomes possible; for example, in Ref. [19] a solution is proposed to create and evaluate video summaries using a performance measure based on a simulated user experiment whose results are easily interpretable.

While it is true that the system may be improved, it is also true that the experiments performed have shown the limitations of the proposed low-level modeling approach, notably for content where the action-driven arousal model is less appropriate. In fact, using low-level modeling for excitement and user interest will always be limited, since some very semantically relevant and exciting moments cannot be modeled with low-level features for some content types. These unavoidable limitations seem to indicate that the most promising direction in terms of content summarization is semantic modeling and concept detection, where much research work is still needed to reach a good performance for a large set of concepts. Low-level modeling may have an important role as a complementary approach.

References

[1] Alexa, The Web Information Company, http://www.alexa.com.
[2] C. Choudary, T. Liu, Summarization of visual content in instructional videos, IEEE Trans. Multimedia 9 (7) (November 2007) 1443-1455.
[3] A. Ekin, A. Murat Tekalp, R. Mehrotra, Automatic soccer video analysis and summarization, IEEE Trans. Image Process. 12 (7) (July 2003) 796-807.
[4] A. Hanjalic, Extracting moods from pictures and sounds, IEEE Signal Process. Mag. 23 (2) (March 2006) 90-100.
[5] A. Hanjalic, L.Q. Xu, User-oriented affective video content analysis, in: IEEE Workshop on Content-based Access to Image and Video Libraries, Kauai, HW, USA, December 2001, pp. 50-57.
[6] A. Hanjalic, L.Q. Xu, Affective video content representation and modeling, IEEE Trans. Multimedia 7 (1) (February 2005) 143-154.
[7] A. Jaimes, T. Echigo, M. Teraguchi, F. Satoh, Learning personalized video highlights from detailed MPEG-7 metadata, in: IEEE International Conference on Image Processing, Rochester, New York, USA, September 2002.
[8] K.-H. Liu, M.-F. Weng, C.-Y. Tseng, Y.-Y. Chuang, M.-S. Chen, Association and temporal rule mining for post-filtering of semantic concept detection in video, IEEE Trans. Multimedia 10 (2) (February 2008) 240-251.
[9] Y.-F. Ma, X.-S. Hua, L. Lu, H.-J. Zhang, A generic framework of user attention model and its application in video summarization, IEEE Trans. Multimedia 7 (5) (October 2005) 907-919.
[10] B.S. Manjunath, P. Salembier, T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, Wiley, 2002.
[11] M.R. Naphade, T.S. Huang, A probabilistic framework for semantic video indexing, filtering and retrieval, IEEE Trans. Multimedia 3 (1) (March 2001) 141-151.
[12] M.R. Naphade, T.S. Huang, Extracting semantics from audio-visual content: the final frontier in multimedia retrieval, IEEE Trans. Neural Netw. 13 (4) (July 2002) 793-810.
[13] C. Ngo, Y. Ma, H.-J. Zhang, Video summarization and scene detection by graph modeling, IEEE Trans. Circuits Syst. Video Technol. 15 (2) (February 2005) 296-305.
[14] D. Sadlier, N. O'Connor, Event detection in field sports video using audio-visual features and a support vector machine, IEEE Trans. Circuits Syst. Video Technol. 15 (10) (October 2005) 1225-1233.
[15] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, A.W.M. Smeulders, The challenge problem for automated detection of 101 semantic concepts in multimedia, ACM Multimedia (October 2006) 421-430.
[16] C.G.M. Snoek, M. Worring, J.-M. Geusebroek, D.C. Koelma, F.J. Seinstra, A.W.M. Smeulders, The semantic pathfinder: using an authoring metaphor for generic multimedia indexing, IEEE Trans. Pattern Anal. Mach. Intell. 28 (10) (October 2006) 1678-1689.
[17] C.M. Taskiran, Z. Pizlo, A. Amir, D. Ponceleon, E.J. Delp, Automated video program summarization using speech transcripts, IEEE Trans. Multimedia 8 (4) (August 2006) 775-791.
[18] TREC Video Retrieval Evaluation Home Page, http://www.itl.nist.gov/iaui/894.02/projects/trecvid/, 2008.
[19] I. Yahiaoui, B. Merialdo, B. Huet, Optimal video summaries for simulated evaluation, in: European Workshop on Content-Based Multimedia Indexing, Brescia, Italy, September 2001.