

CONTENT-BASED DIGITAL VIDEO RETRIEVAL

R M Bolle, B L Yeo and M M Yeung

IBM Thomas J Watson Research Center, USA

ABSTRACT

All video will eventually become fully digital - there seems to be no way around it. Consequently, digital video databases will become more and more pervasive, and finding video in large digital video databases will become a problem, just like it is a problem today to find video in analog video databases. The digital form of the video, however, opens up tremendous possibilities. Just like it is possible today to retrieve text documents from large text document databases by querying document content represented by indices, it will become possible to index digital video databases based on (semi-)automatically derived indices. In this paper, we address the problem of automatic video annotation - associating semantic meaning with video segments to aid in content-based video retrieval. We present a novel framework of structural video analysis which focuses on the processing of low-level visual data cues to obtain high-level (structural and semantic) video interpretations. Additionally, we propose a flexible framework for video query formulation to aid rapid retrieval of video. This framework is meant to accommodate the "depth-first searcher" - i.e., the power user - the "breadth-first searcher," and the casual browser.

1 INTRODUCTION

Digital video is the ultimate multi-media document. It contains both deterministic data, like text in the form of closed-caption or the script, and stochastic data - data that is obtained by measuring the world, such as imagery sensed by a camera and sound sensed by a microphone.

Eventually, all video storage and video transport mechanisms to television receivers and computer displays will be dominated by digital technology [1]. With the advent of fully digital television, many things will become possible. Many of these things can be envisioned today while others will only be imagined during the years to come. Digital televisions will become powerful desktop computers, allowing for tremendous interactivity. Viewers will be able to search on-line TV guides and schedule their viewing according to their needs and interests [2]. Moreover, the digital form of the video stream allows for performing computations directly on the video, without the analog-to-digital conversion that is common practice today.

One possibility opening up is rapid content-based video retrieval: the digital form allows processing of the video data to generate appropriate, possibly semantic, data abstractions that enable content-based archival and retrieval of video. That is, very much like today's large text databases can be searched with text queries, video databases will be able to be searched with combined text and visual queries.

It is this topic that we would like to discuss. Till now, most of the video in large existing legacy video databases has been annotated solely by hand by meticulously previewing the video, if it has been annotated at all. Semantic labels are determined visually by the annotator and added manually, often with the assistance of user interfaces. Ideally, the digital video would be automatically annotated through full machine semantic interpretation. In practice, given the state of the art in computer vision and speech recognition, such sophisticated semantic video interpretation is not feasible. Rather, the computer may offer intelligent assistance for manual annotation, or the computer performs automatic, limited semantic annotation. Video-on-demand is an area that has concerned itself with the above issues for quite a while. The consensus is that, to scale to larger video databases, automatic extraction of video content information is desired [3].

This paper proposes a model of video retrieval and a novel framework of structural video analysis for video annotation. In particular, the paper is about the process of video query, and about the automatic recognition and generation of syntactic and semantic video annotations. In Section 2, we formulate and discuss a flexible interactive framework for video retrieval. In Section 3, we specifically focus on the analysis of visual content. We briefly survey the state-of-the-art in automatic video annotation and introduce a processing paradigm called "between-shot" processing, illustrated with some examples.

This research is part of a NIST/ATP funded research consortium¹ to develop a video query station for the High Definition Television (HDTV) studio of the future. This consortium is charged with performing the breakthrough research and development needed to have the necessary components ready for a fully integrated HDTV studio.

¹ Program Manager: David K. Hermreck, [email protected]

International Broadcasting Convention, 12-16 September 1997. Conference Publication No. 447, © IEE, 1997.

2 INTERACTIVE QUERY

Associated with video are parametric data (date of shooting, type of footage) and textual data (bibliographic data, closed-captions, annotated keywords), but also audio and visual data. The former is deterministic data while the latter is stochastic data, in the sense that acquisition of this data is inherently a noisy process. Digital video should ideally be retrievable by querying on all these information modalities in a seamless fashion. The user interface of the query system should assist in a user-friendly way and the search results should be presented clearly, while this presentation may depend on the type of query that is posed and on the user who has posed the query. Video databases should be searchable so that intermediate candidate lists are brought down to a manageable size and duration as quickly as possible.

Different types of queries and users can be expected: (1) a query for a particular piece of video; (2) a query for a specific video, but the user has not seen it; and, (3) a query where the user only has a vague idea of the content. The query formulation process should ideally accommodate these types of queries plus the queries of a user who just wants to browse video. In other words, ideal video query systems should accommodate a wide range of users: the power user who is used to retrieval of video from legacy databases (such users can be called "depth-first searchers"), the journalist or professional editor who is searching for video based on its documentary value or artistic content (such users can be called "breadth-first searchers"), and the casual browser. Especially casual browser access to video databases opens tremendous opportunities for re-purposing video. Video can be viewed for a one-time viewing fee or (remotely) purchased for multiple use.

We model the formulation of a video inquiry as a sequence of steps - each step is itself an active filtering of information that reduces the size of the candidate data pools. The sequence of steps is as follows: query on the category and class of video (navigating), query on the text, audio and visual feature descriptions of video (searching), query on the semantic summary of the content (browsing), and query on the full-motion audio-visual content (viewing).
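Purely as an illustration of this filtering model - not a system described in the paper - the sketch below applies the navigating, searching and browsing steps as successive filters over a shrinking candidate pool. The record fields (genre, keywords, summary_labels) and the matching rules are hypothetical placeholders.

```python
# Minimal sketch of the navigate/search/browse/view filtering model.
# All field names and matching rules are hypothetical illustrations.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoRecord:
    title: str
    genre: str                                              # metadata used while navigating
    keywords: List[str] = field(default_factory=list)       # text used while searching
    summary_labels: List[str] = field(default_factory=list) # labels inspected while browsing

def navigate(pool, genre):
    """Constrain the pool to one category of video (metadata query)."""
    return [v for v in pool if v.genre == genre]

def search(pool, terms):
    """Keep candidates whose textual annotation matches all query terms."""
    return [v for v in pool if all(t in v.keywords for t in terms)]

def browse(pool, wanted_label):
    """Keep candidates whose visual summary contains a wanted element."""
    return [v for v in pool if wanted_label in v.summary_labels]

if __name__ == "__main__":
    database = [
        VideoRecord("Evening news 1997-03-01", "news", ["election", "weather"], ["anchor", "map"]),
        VideoRecord("Sitcom episode 12", "sitcom", ["kitchen"], ["dialogue"]),
        VideoRecord("Evening news 1997-03-02", "news", ["weather"], ["map"]),
    ]
    candidates = navigate(database, "news")        # navigating
    candidates = search(candidates, ["weather"])   # searching
    candidates = browse(candidates, "map")         # browsing
    for video in candidates:                       # viewing would then play these
        print(video.title)
```

Each stage only narrows the pool; the viewing stage at the end would play the surviving candidates rather than filter them further.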

Navigating: This is the stage at which the user decides which category of video is to be searched. It is the stage to navigate based on metadata: direct the search to a specific type of footage (sitcoms, documentaries, raw footage), direct the search to a specific interval of time, or even direct the search to a specific video server. Almost all video has some text associated with it, though it may be little - say, a few keywords. And the initial navigating through video databases will most certainly be based on textual queries. In theory, techniques from the text retrieval area called source selection (i.e., deduce from the query which body of data should be searched) can be used to reduce the search space.

Searching: Search is the center stage of query in video database systems. The result of the search is a list of candidate videos that satisfy, in some sense, the constraints of the query. The ultimate goal of the search is to make this list as short as possible without missing the video of interest. Much work has been done in the area of text document retrieval [4], and the technology that has been developed should be brought to bear in the video retrieval area, as is described in [5]. However, the amount of text associated with the individual video clips in the video database is potentially unbalanced. Some video may have only two or three keywords associated with it, stemming from traditional alpha-numeric database and annotation technology, while other video may have the script, closed-caption, or speech transcript available. Hence, at some point in the search it becomes necessary to look at the other (visual and audio) attributes, annotated or computed. The difficult question is to define the appropriate audio-visual properties that can be extracted from video that will bring down this candidate list to manageable size.

One would like to interactively constrain visual and/or audio attributes of the candidates and narrow down the candidate list. Static visual attributes (of keyframes) may be in the form of color, object shape and texture. There is also active ongoing research in audio indexing to extract special audio features (e.g., [6]). The dynamic nature of the audio-visual data opens up search beyond keyframes [5]; we address the attributes of video based on visual content in Section 3.

Browsing: The search result is a series of videos with a total duration that could be quite long. In the browsing phase, representations of the video should be displayed which are good high-level overviews of the content of the candidate videos (visual summaries). Just as with textual summaries of text documents, a user will be able to get a quick understanding of the candidate content. In addition, the user gets a quick idea whether she or he is asking the "right" question and how to redefine or refine the query. Ideally, the user should also have random access to any point of any video in the list.

Visual summaries are an important facet of the video query process - both for the problem of search query result representation, interpretation and refinement, and for low-bandwidth video browsing and transmission. The problem is to derive mappings of the three-dimensional video data to the two-dimensional plane for screen presentation.

Viewing: After a video clip (or a few video clips) is (are) selected as the most likely candidate(s), the user may decide to view the search results. In such cases, the traditional functionality of today's VCRs should be available. Further, capabilities like semantic fast-forward should be available. Such semantic random access is the ability to truly skip through video based on semantic content, rather than using timecodes.

The remainder of this paper is about automatic video content analysis. We discuss briefly what has been accomplished in this area and we discuss the research directions we feel need to be exploited.

3 VIDEO ANALYSIS

We survey some of the results in automatic video analysis that have been achieved to date and we indicate the research that needs to be done to arrive at flexible video retrieval systems as described above. We concentrate on edited video documents - as opposed to raw footage. These documents are structured in nature, conveying stories and messages. That is, the shots are combined in specific ways, and according to special orders, to form the montage that tells the story.

We first look at the most fundamental unit of video production - the shot. Techniques for the detection of shots and current work on processing individual shots are briefly discussed. We then concentrate on a new video analysis paradigm, called between-shot processing.

3.1 The video shot

The importance of the shot is widely realized and computational detection of video shot boundaries has received much attention. The act of segmenting a video into shots is commonly called scene-change detection, which should probably more properly be called shot detection. (This avoids confusion with the cinematographic term scene, which is a collection of shots.)

Many types of shot transitions are used; among these are abrupt and gradual transitions (dissolves). In Figure 1, we show these two types. The difficulty of automatic shot transition detection can be seen from, for example, Figure 1b.

Figure 1: Example shot boundaries from an IBM commercial: (a) abrupt boundary - three frames before and after the boundary; (b) gradual transition - ten frames showing the transition of one shot into the next.

Major efforts have focused on algorithms that operate on the full frames of the video (see references in [5]). There are also recent efforts in performing the segmentation on compressed video. Most works study shot boundary detection on MPEG compressed video (for example, [7] and references therein). These schemes are sufficiently accurate in segmenting most videos and they save both auxiliary storage and computation costs.
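For illustration only, a very common baseline - not the specific algorithms cited above - declares an abrupt cut wherever the color or intensity histogram difference between consecutive frames exceeds a threshold. The frame representation, bin count and threshold below are assumptions for the sketch, not tuned values.

```python
# Sketch of abrupt-cut detection by histogram differencing.
# Frames are assumed to be flat lists of 8-bit gray values.

def histogram(frame, bins=16):
    """Coarse intensity histogram, normalized to sum to 1."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // 256] += 1
    total = float(len(frame))
    return [c / total for c in counts]

def histogram_difference(h1, h2):
    """L1 distance between two normalized histograms (0 .. 2)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_cuts(frames, threshold=0.5):
    """Return indices i where a cut is declared between frame i-1 and frame i."""
    cuts = []
    previous = histogram(frames[0])
    for i in range(1, len(frames)):
        current = histogram(frames[i])
        if histogram_difference(previous, current) > threshold:
            cuts.append(i)
        previous = current
    return cuts

if __name__ == "__main__":
    dark = [20] * 64           # toy 8x8 "frames"
    bright = [230] * 64
    print(detect_cuts([dark, dark, bright, bright]))   # -> [2]
```

A pairwise difference like this misses gradual transitions such as the dissolve of Figure 1b; detecting those requires examining longer windows of frames.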

Various researchers have made the observation that representing a video shot as a single still image could be a significant step toward solving the video indexing problem. A simple way to obtain a single image to represent a shot is the selection of a keyframe or, if there is much motion and action in the shot, the selection of multiple keyframes (for example, [8]). Alternatively, subsequent frames can be "computed together" into a larger image (this is within-shot processing). Techniques such as salient stills (also known under other names) have been proposed to obtain such still images ([9] and references). One of the rationales is that if shots are represented by images, image search techniques such as [10] can be applied.
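As a hedged illustration of the simplest strategy mentioned above - and not the methods of [8] or [9] - one could pick as keyframe the frame whose coarse histogram is closest to the average histogram of the shot. Everything in the sketch, including the frame representation, is an assumption.

```python
# Sketch: choose one keyframe per shot as the frame whose coarse histogram
# is closest to the average histogram of the shot.  Purely illustrative.

def histogram(frame, bins=16):
    counts = [0] * bins
    for value in frame:                      # frame: flat list of 8-bit gray values
        counts[value * bins // 256] += 1
    n = float(len(frame))
    return [c / n for c in counts]

def pick_keyframe(shot_frames):
    """Return the index of the most 'representative' frame of a shot."""
    hists = [histogram(f) for f in shot_frames]
    avg = [sum(col) / len(hists) for col in zip(*hists)]
    def dist(h):
        return sum(abs(a - b) for a, b in zip(h, avg))
    return min(range(len(hists)), key=lambda i: dist(hists[i]))

if __name__ == "__main__":
    shot = [[20] * 64, [25] * 64, [200] * 64]   # last frame is an outlier (e.g., a flash)
    print(pick_keyframe(shot))                  # -> 0 (the outlier frame 2 is not chosen)
```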

Besides the fact that dynamic video information is lost when selecting keyframes, an hour of video is typically composed of a few hundred shots. Thus, searching large video databases amounts to searching very large image databases, and searching even a small number of hours of video may already push the limits of image search engines in terms of computational requirements.

3.2 Beyond shots

In video and film, a story is told by the presentation of images in time. Actions have to develop sequentially; simultaneous or parallel processes have to be shown one after the other in a concatenation of shots. As observed by Miller [11], video has a continuity of meaning that overwhelms the inherent discontinuity of presentation. There is something that ensures that the continuity of meaning is preserved when viewing the program. The continuity that obtains from shot to shot - from wide shot to close-up, from one speaker's face to another and so on - is achieved by the viewer overlooking the interruption, using the more or less conscious knowledge or understanding that the situation is identifiable from one shot to the next and that what is shown is nothing more than various aspects of the same scene, as noted in [11].

A video is built up from shots to form a story. Groups of shots are concatenated in sequences to form a depiction of the story which may be continuous in time - such a concatenation of shots is called a scene or a story segment. A video usually consists of multiple segments, where beyond the segment discontinuity there may or may not be continuity of time. Either way, the continuity of time is not what is really important; what is important is whether there is continuity of meaning. This is perhaps the most challenging aspect of automatic video annotation: finding the underlying discontinuities of meaning. Once video segmentation based on meaning or subject has been achieved, one is left with video data that deals with a particular subject and is uninterrupted by video segments about different subjects, such as commercials. The task of automatic video analysis, and the subsequent automatic annotation, then is one of finding high-level interpretations, not of the individual shots but of collections of shots and possibly of the video as a whole. It seems that to find these higher-level interpretations, between-shot analyses are at least as important as within-shot analyses via image computations.

The goals of between-shot processing are to derive high-level video structure and semantics for automatic an- notation of video segments and for visual presentation of the segment. Some research efforts in exploiting between- shot relationships are surveyed in the following sections to further explain the ideas presented in this section.

3.3 News story indexing

Closed-caption is textual information which essentially comes for free and conventional text search engines can be used to index the video. It should be noted, however, that closed-caption does not always contain information about discontinuities in meaning. Processing of the visual data may have to be done to find these discontinuities.

We mention two approaches to closed-caption indexing. Mohan [12] uses a unique combination of sources of information to segment news items. Typically, for the closed-captioning, the beginning of a new news item is indicated by the symbols >>>. Because of the real-time nature of captioning, the closed-caption may be lagging behind the actual spoken words. A shot-boundary detection algorithm, in combination with the detection of audio silences plus the new-item indicator of the closed-captioning, is used to segment and synchronize the individual news items. A different approach is found in [13]: The Pictorial Transcript system transcribes news programs into HTML-based summaries. Each HTML page has several news stories and each story is represented by a few keyframes with detailed text derived from closed-captions. Furthermore, linguistic processing and closed-caption control information are used to align sentences to keyframes and to capitalize the text.
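A hedged sketch of the kind of alignment described for [12]: story boundaries suggested by the ">>>" marker in the closed-caption stream are snapped to the nearest detected shot boundary to compensate for caption lag. The data structures, the tolerance value and the snapping rule are assumptions for illustration; the audio-silence cue of [12] is omitted.

```python
# Sketch: align closed-caption story markers (">>>") with shot boundaries.
# Captions are (time_in_seconds, text) pairs; shot_boundaries are cut times.

def caption_story_markers(captions):
    """Times at which the closed-caption indicates a new news item."""
    return [t for t, text in captions if text.lstrip().startswith(">>>")]

def snap_to_shot_boundary(marker_time, shot_boundaries, max_lag=5.0):
    """Map a (possibly lagging) caption marker to the closest nearby cut."""
    candidates = [b for b in shot_boundaries if abs(b - marker_time) <= max_lag]
    if not candidates:
        return marker_time                 # no nearby cut: keep the caption time
    return min(candidates, key=lambda b: abs(b - marker_time))

def segment_news(captions, shot_boundaries):
    """Return the start times of the individual news items."""
    return [snap_to_shot_boundary(t, shot_boundaries)
            for t in caption_story_markers(captions)]

if __name__ == "__main__":
    captions = [(2.0, ">>> Good evening."), (63.5, ">>> In other news ...")]
    cuts = [0.0, 30.2, 61.0, 90.4]
    print(segment_news(captions, cuts))    # -> [0.0, 61.0]
```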

More domain-specific a priori models to recover news segments are used in [14, 15]. The video can be in different states, such as "anchor," "reporter," "weather forecast," etc., and model-based parsing recovers the basic news story or episode guided by domain knowledge of spatial frame layouts. Knowledge of station logos is used to identify the return from commercial breaks. News segment retrieval is achieved through an interface which offers a high level of random access into different news items. Thus, if a user is only interested in the weather forecast, content-based fast-forward allows for this.

3.4 Hierarchical video decomposition

The above techniques find news stories by labeling the video shots with relatively high-level semantic interpretations. The labeling is based on the spatial layout of keyframes and uses very domain-specific models.

More generic models can be used to parse video that tells a dramatic story. Typically, such video, like sitcoms, is composed as a sequence of story units, and each of these story units, in turn, is a sequence of shots. Time-constrained clustering [16] uses symbolic labels based on the data content (e.g., color distribution of keyframes) to compute this hierarchical structure. Typically, the story takes place in a small number of locales and, in each locale, a small number of cameras is used. The labels that are associated with shots taken in the same locale tend to cluster together, while labels of shots from different locales tend to be dissimilar. The clustering also takes into account the temporal location of shots within the video, i.e., it prevents two shots that are far apart in time but similar in data content from being clustered together. Hence, a high-level syntactic model of video editing is defined. Time-constrained clustering computes video structure by fitting this model to the data derived from shots.

The data content of a shot that is used for clustering is a color histogram (in simple terms, the amount of red, green, and blue) of a representative frame in the shot. The metric used is a distance between color histograms, hence shots that are filmed in the same locale cluster together. The result is that one can group the shots into several clusters and these clusters correspond to the different story units. Other features (than color) extracted from the shots can, of course, also be used. Such features include shot duration, dominant motion characteristics, dominant texture patterns, spatial moments, audio features, etc. The same approach can be taken by clustering these features derived from the shots and different groupings of the shots will be found.
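The sketch below is a simplified stand-in for time-constrained clustering, not a reimplementation of [16]: a shot may join an existing cluster only if its keyframe histogram is close to that of some member and the two shots are not too far apart in time. The distance threshold and time window are hypothetical parameters.

```python
# Simplified sketch in the spirit of time-constrained clustering.
# Thresholds are illustrative only; [16] uses a more principled procedure.

def l1(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def time_constrained_clusters(shots, max_dist=0.3, max_gap=5):
    """shots: list of (shot_index, keyframe_histogram). Returns one cluster label per shot."""
    clusters = []                      # each cluster: list of (index, histogram)
    labels = []
    for idx, hist in shots:
        assigned = None
        for c, members in enumerate(clusters):
            if any(l1(hist, h) <= max_dist and abs(idx - i) <= max_gap
                   for i, h in members):
                assigned = c
                break
        if assigned is None:
            clusters.append([])
            assigned = len(clusters) - 1
        clusters[assigned].append((idx, hist))
        labels.append(assigned)
    return labels

if __name__ == "__main__":
    # Shots 0-3 alternate between two similar histograms (one locale, two cameras);
    # shot 9 looks like shot 0 but is too far away in time to be merged with it.
    a, b = [1.0, 0.0], [0.0, 1.0]
    shots = [(0, a), (1, b), (2, a), (3, b), (9, a)]
    print(time_constrained_clusters(shots))   # -> [0, 1, 0, 1, 2]
```

Story units would then correspond to temporally contiguous runs over these cluster labels; the greedy assignment above is only meant to make the time constraint concrete.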

It is reported in [16] that there is one order of magnitude of reduction from the number of shots to the number of story units. For example, for a typical sitcom, there are about 300 shots in a half-hour of program and about 20 story units are identified. This implies that, for each half-hour program, a user needs to examine 20 units instead of 300 shots to get a first glance of the program. Furthermore, it means that a user can index into one of 20 units, instead of into one of 300 shots.

Algorithms that operate with a time complexity that depends only on the number of shots in a video are also reported in [16]. Moreover, these algorithms can be applied to partially decoded MPEG streams, similar to the shot detection described in [7].

3.5 Label sequence semantics

In the above section, a symbolic label sequence is used to compute the hierarchical structure of dramatic video. Such label sequences can be used to recognize concatenations of shots that correspond to dialogues and action events [17]. By looking at the degree of repetition, or the lack of repetition, in a subsequence of (shot) labels, subsequences can be classified into one of three categories: "dialogues," "actions," and "other."

A dialogue refers to actual conversation or a conversation-like montage presentation of two or more concurrent processes, which have to be shown sequentially, one after the other. The parallel events are possibly interspersed by so-called establishing shots or shots of other parties ("noise" shots or labels). A model of such repetition can be constructed and parsed against the shot labels. Consider the following example of a video sequence of shots with the label sequence:

A, B, A, X, Y, Z, A, B, A, B, A, B, C, D, E, F, E, D, E, G, H, I

Again, here the labels are derived from the visual data content of the shots and hence equally labeled shots are likely to be of the same object and background - for a dialogue, the object would be a person. The label subsequence A, B, A at the start of the sequence is not considered a valid label sequence of a dialogue event. The label subsequence A, B, A, B, A, B characterizes a dialogue; here, there is no "noise" label. In addition, the subsequence D, E, F, E, D, E also characterizes a dialogue; here label F is a "noise" label.

An action sequence in video, on the other hand, is characterized by a progressive presentation of shots with contrasting visual data content to express the sense of fast movement and achieve strong impact. In such a sequence, there is typically little or no recurrence of shots taken from the same camera, of the same person or background locale. Such a sequence of shots constitutes an action event. A model that captures the lack of repetition can be constructed and parsed against label sequences. In addition, further classifications can consider shot durations. For example, one could characterize an action sequence with many shots of short duration as a "fast action" sequence.
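To make the repetition idea concrete, a naive sketch - not the parsing model of [17] - can call a window of shot labels a "dialogue" when two labels dominate it (allowing at most one "noise" shot) and an "action" when hardly any label recurs. The window lengths, thresholds and the dominance test are invented for illustration.

```python
# Naive sketch of classifying a subsequence of shot labels by repetition.
# Unlike the model of [17], this ignores minimum length and alternation order
# (e.g., "AAABBB" would also pass the dominance test used here).

def classify_window(labels, action_ratio=0.9):
    """Classify a short run of shot labels as 'dialogue', 'action' or 'other'."""
    distinct = len(set(labels))
    # Nearly every shot has a fresh label: contrasting content -> action.
    if distinct >= action_ratio * len(labels):
        return "action"
    # Two labels dominate the window, with at most one leftover "noise" shot.
    counts = sorted((labels.count(l) for l in set(labels)), reverse=True)
    if len(counts) >= 2 and counts[0] + counts[1] >= len(labels) - 1:
        return "dialogue"
    return "other"

if __name__ == "__main__":
    print(classify_window(list("ABABAB")))      # -> dialogue (no noise label)
    print(classify_window(list("DEFEDE")))      # -> dialogue (F is a noise label)
    print(classify_window(list("PQRSTU")))      # -> action (no repetition)
```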

In [17] it is reported that dialogue and action events constitute an average of about 50 to 70% of each video program. In programs such as sitcoms and talk shows, there is a high percentage of dialogue, while the percentage of action events is significantly higher in action movies.

In speech processing, Hidden Markov Models [18] have been successfully used for word recognition and speaker identification, both of which are classification problems. Such techniques could also be applied to video classification. Here, the challenges are in the use of appropriate features, the judicious choice of model and the assignment of probability distributions. An example of such work can be found in [19]. Here, Hidden Markov Models on the one-dimensional sequence of shot durations are used to detect inserted commercial material and action-like sequences within long video sequences.
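As a toy illustration of that idea, and much simpler than the work in [19], the sketch below decodes a two-state hidden Markov model over quantized shot durations with the Viterbi algorithm. Every probability, the state names and the duration threshold are invented for the example.

```python
# Toy two-state HMM over shot durations, decoded with Viterbi.
# States: "program" vs "commercial"; observations: "short" or "long" shots.
# All numbers below are invented illustrations, not a trained model.

import math

STATES = ["program", "commercial"]
START = {"program": 0.9, "commercial": 0.1}
TRANS = {"program":    {"program": 0.95, "commercial": 0.05},
         "commercial": {"program": 0.10, "commercial": 0.90}}
EMIT  = {"program":    {"short": 0.3, "long": 0.7},
         "commercial": {"short": 0.8, "long": 0.2}}

def viterbi(observations):
    """Most likely state sequence for a list of 'short'/'long' symbols."""
    trellis = [{s: (math.log(START[s]) + math.log(EMIT[s][observations[0]]), [s])
                for s in STATES}]
    for obs in observations[1:]:
        previous = trellis[-1]
        column = {}
        for s in STATES:
            best_prev = max(STATES,
                            key=lambda p: previous[p][0] + math.log(TRANS[p][s]))
            score = (previous[best_prev][0] + math.log(TRANS[best_prev][s])
                     + math.log(EMIT[s][obs]))
            column[s] = (score, previous[best_prev][1] + [s])
        trellis.append(column)
    return max(trellis[-1].values(), key=lambda v: v[0])[1]

if __name__ == "__main__":
    durations = [30, 25, 4, 3, 5, 2, 4, 3, 40, 35]        # seconds per shot
    symbols = ["short" if d < 10 else "long" for d in durations]
    print(viterbi(symbols))   # the run of short shots is decoded as "commercial"
```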

3.6 Visual summaries

A simple form of visual summary is a storyboard of depictive keyframes (though automatic determination of "depictive" frames is a non-trivial task). Visual summaries give the end-user the ability to quickly get an idea of the content of a hit and the ability to flexibly refine the query. They should also allow for non-linear viewing of the video. It is important that such summaries are compact so that minimal real estate is taken up on the user interface.

A compact representation of both video content and structure called the Scene Transition Graph [20] can be built by clustering the data content of video shots. It maps a sequence of video shots into a two-dimensional graph. The graph has nodes which display individual shot content and directed edges which depict time. Clusters of nodes (shots) represent the progression of the story, with each story unit being represented by a connected subgraph and connected to the next. A simple example would be a video clip of a conversation between two persons where the camera alternates between shots of each person. Here, the graph consists of two nodes; "establishing shots" or other types of interspersed shots would result in additional clusters, possibly with only one member shot. Graphs depicting the story units and the flow of the story for a half-hour video story can be displayed on a single computer screen.

Figure 2 shows a four-node graph of the IBM commercial "France." The bottom window shows the linear view of the commercial, where each image represents a shot. The content of a node consisting of four different shots of the two characters in the cast is also shown.

Figure 2: The Scene Transition Graph for the IBM commercial "France."
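A sketch of the underlying idea - not the algorithm of [20] - follows: given one cluster label per shot, the nodes of the graph are the clusters and a directed edge is added for every temporal transition between consecutive shots. The label sequence used below is a made-up example.

```python
# Sketch: build a scene-transition-style graph from a sequence of shot cluster
# labels.  Nodes are clusters; a directed edge A -> B records that some shot in
# cluster A is immediately followed by a shot in cluster B.

from collections import defaultdict

def build_transition_graph(shot_labels):
    """Return {node: set of successor nodes} for a list of per-shot cluster labels."""
    edges = defaultdict(set)
    for current, nxt in zip(shot_labels, shot_labels[1:]):
        if current != nxt:                 # consecutive shots in the same cluster add no edge
            edges[current].add(nxt)
    for label in shot_labels:              # make sure terminal nodes appear too
        edges.setdefault(label, set())
    return dict(edges)

if __name__ == "__main__":
    # Alternation between two clusters (a two-person conversation), then a cut
    # to a new locale represented by cluster "C".
    labels = ["A", "B", "A", "B", "A", "C"]
    graph = build_transition_graph(labels)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
    # A -> ['B', 'C'];  B -> ['A'];  C -> []
```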

Another form of visual summary, called the pictorial summary, is introduced in [21]. A pictorial summary is a sequence of representative images, arranged in temporal order. Each representative image is called a video poster and is comprised of at least one sub-image, each of which represents a unique dramatic element of the underlying story, and such that the layout of the sub-images differentiates the dominance of individual elements.

4 DISCUSSION

We have proposed a framework for video retrieval and inquiry formulation which is based on an iterated sequence of navigating, searching, browsing, and viewing. This framework calls for certain capabilities of video query systems, in particular, search on dynamic visual properties of video and the ability of rapid non-linear viewing of video. To achieve these capabilities, algorithms have to be developed for automatic extraction of dynamic visual video properties and for processing of video to extract visual features for compact presentation of the video content. Such representations provide non-linear access into video and give quick views of the visual content of video. The algorithms need to involve a type of processing which we have called between-shot video analysis.

As noted, video is a multi-media document that contains both deterministic and stochastic data. The essence of video query is that a user-formulated query represents a combined textual and visual pattern. The retrieval process is then concerned with matching this pattern to patterns that are derived from video and finding the "closest" matches. The challenges we face in video retrieval are plenty; to name just a few: (1) the definition of visual patterns that both can be computed from video and are interesting patterns that users want to search for; (2) the definition of what constitutes a match for a combined textual and visual pattern (query); and, (3) determining the relative importance of combined deterministic matches and stochastic matches.
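For challenge (2), one naive illustration - purely an assumption, not a proposal from this paper - is to combine a text match score and a visual similarity score with a weight expressing their relative importance, which is exactly the quantity challenge (3) asks how to choose. Both similarity measures and the weighting below are invented.

```python
# Naive sketch of scoring a combined textual + visual query.
# The weighting scheme and both similarity measures are illustrative assumptions.

def text_score(query_terms, keywords):
    """Fraction of query terms found in the candidate's keywords (0 .. 1)."""
    if not query_terms:
        return 0.0
    return sum(t in keywords for t in query_terms) / len(query_terms)

def visual_score(query_hist, candidate_hist):
    """1 minus half the L1 distance between normalized histograms (0 .. 1)."""
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(query_hist, candidate_hist))

def combined_score(query, candidate, alpha=0.5):
    """alpha weighs the deterministic (text) match against the stochastic (visual) match."""
    return (alpha * text_score(query["terms"], candidate["keywords"])
            + (1.0 - alpha) * visual_score(query["hist"], candidate["hist"]))

if __name__ == "__main__":
    query = {"terms": ["beach", "sunset"], "hist": [0.7, 0.2, 0.1]}
    candidates = [
        {"name": "clip1", "keywords": ["beach"], "hist": [0.6, 0.3, 0.1]},
        {"name": "clip2", "keywords": ["office"], "hist": [0.1, 0.1, 0.8]},
    ]
    ranked = sorted(candidates, key=lambda c: combined_score(query, c), reverse=True)
    print([c["name"] for c in ranked])     # -> ['clip1', 'clip2']
```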

Acknowledgments

The work reported is funded in part by NIST/ATP under Contract Number 70NANB5H1174. The support of the NIST program and of the NIST Technical Program Manager David Hermreck and Barbara Cuthil is gratefully acknowledged.

References

[1] D. Anastassiou, "Digital television", Proc. IEEE, vol. 82, pp. 510-519, Apr. 1994.

[2] J. Brinkley, Defining Vision, Harcourt Brace & Company, New York, NY, 1997.

[3] L.A. Rowe, J.S. Boreczky, and C.A. Eads, "Indices for user access to large video database", in Storage and Retrieval for Image and Video Databases II, IS&T/SPIE Symposium on Elec. Imaging Sci. & Tech., pp. 150-161, Feb. 1994.

[4] I. Witten, A. Moffat, and T. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand Reinhold, New York, NY, 1994.

[5] R.M. Bolle, M.M. Yeung, and B.L. Yeo, "Video query: Beyond the keyframes", Technical Report RC 20586, IBM T.J. Watson Research Center, Oct. 1996.

[6] E. Chan, S. Garcia, and S. Roukos, "KNN nearest neighbor information retrieval", 1997.

[7] B.L. Yeo and B. Liu, "Rapid scene analysis on compressed videos", IEEE Trans. on Circuits and Systems for Video Technology, vol. 5, pp. 533-544, Dec. 1995.

[8] M.M. Yeung and B. Liu, "Efficient matching and clustering of video shots", in International Conference on Image Processing, vol. I, pp. 338-341, 1995.

[9] S. Mann and R.W. Picard, "Virtual bellows: Constructing high quality stills from video", in Int. Conf. Image Processing, vol. 1, pp. 363-367, 1994.

[10] W. Niblack et al., "The QBIC project: Querying images by content using color, texture and shape", in Storage and Retrieval for Image and Video Databases, vol. SPIE 1908, pp. 13-25, 1993.

[11] J. Miller, "Moving pictures", in H. Barlow, C. Blakemore, and M. Weston-Smith, editors, Images and Understanding, pp. 180-194, Cambridge University Press, Oct. 1986.

[12] R. Mohan, "Text based indexing of TV news stories", in Proceedings, SPIE Multimedia Storage and Archiving Systems, vol. SPIE 2916, pp. 2-13, Nov. 1996.

[13] B. Shahraray and D. Gibbon, "Automatic generation of pictorial transcripts of video programs", in Multimedia Computing and Networking 1995, vol. SPIE 2417, pp. 512-528, Feb. 1995.

[14] H.J. Zhang, Y.H. Gong, S.W. Smoliar, and S.Y. Yan, "Automatic parsing of news video", in Int. Conf. Multimedia Computing and Sys., pp. 45-54, 1994.

[15] D. Swanberg, C.F. Shu, and R. Jain, "Knowledge guided parsing in video databases", in Storage and Retrieval for Image and Video Databases, vol. SPIE 1908, pp. 13-25, 1993.

[16] M.M. Yeung and B.L. Yeo, "Time-constrained clustering for segmentation of video into story units", in Int. Conf. on Pattern Recognition, pp. 375-380, Aug. 1996.

[17] M.M. Yeung and B.L. Yeo, "Video content characterization and compaction for digital library applications", in SPIE Storage and Retrieval for Image & Video Databases, vol. SPIE 3022, pp. 45-58, Feb. 1997.

[18] L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, pp. 257-286, Feb. 1989.


[19] Y.-P. Tan and R.M. Bolle, "Binary video classification", Technical Report TBD, IBM T.J. Watson Research Center, 1997.

[20] M.M. Yeung, B.L. Yeo, W. Wolf, and B. Liu, "Video browsing using clustering and scene transitions on compressed sequences", in Multimedia Computing and Networking 1995, vol. SPIE 2417, pp. 399-413, Feb. 1995.

[21] M.M. Yeung and B.L. Yeo, "Video visualization for compact presentation and fast browsing of pictorial content", to appear in IEEE Transactions on Circuits and Systems for Video Technology, August 1997 (also IBM Research Report RC 20615, 1996).