
On the Significance of Cluster-Temporal Browsing for Generic Video Retrieval - A Statistical Analysis

Mika Rautiainen
MediaTeam / Univ. of Oulu
Erkki Koiso-Kanttilan katu 3, FIN-90570 Oulu, Finland
+358 8 553 1011
[email protected]

Tapio Seppänen
MediaTeam / Univ. of Oulu
Erkki Koiso-Kanttilan katu 3, FIN-90570 Oulu, Finland
+358 8 553 1011
[email protected]

Timo Ojala
MediaTeam Oulu / Univ. of Oulu
Erkki Koiso-Kanttilan katu 3, FIN-90570 Oulu, Finland
+358 8 553 1011
[email protected]

ABSTRACT
In this paper, we statistically test the effect of content-based browsing in generic video retrieval. Using the TRECVID 2004 and 2005 experiments, we demonstrate that content-based browsing improves retrieval over sequential queries and relevance feedback. Two user groups, novices and system developers, took part in the experiments on large and multilingual video collections. Novice users achieved a statistically significant improvement in search effectiveness with cluster-temporal browsing. For system developers, the difference between the system configurations was not statistically significant.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – query formulation, retrieval models, search process

H.3.4 [Information Storage and Retrieval]: Systems and Software – Performance evaluation (efficiency and effectiveness)

General Terms
Algorithms, Performance, Experimentation, Verification

Keywords
Content-based video retrieval, cluster-temporal browsing, Wilcoxon’s signed-rank test, non-parametric test

1. INTRODUCTION
The prevailing paradigm in content-based retrieval is based on a query with attributes and examples of the desired content and a mechanism that computes ranked relevance from the actual content data. Currently, baseline video search performance is obtained with text queries on automatic speech recognition (ASR) and machine translation (MT) transcripts. However, transcripts do not capture all of the semantic information in a video. Additional content-based cues, such as detected concept terms and low-level audiovisual features, have been found to improve retrieval [7]. Relevance feedback also improves retrieval performance by incorporating information from prior relevance judgments [2].

Recently, successful techniques based on browsing and filtering have been developed for interactive video retrieval [3]. In [8], a video retrieval interface used several visualization techniques, such as story key frame montages, to emphasize relevant results to the user. The authors reported good interactive search performance in the TRECVID evaluation. However, they did not provide any performance data using only novice users.

We propose a technique for browsing by content-based features. Cluster-temporal browsing is a content navigation technique for video databases. The proposed technique merges video structure and parallel content-based clusters into a dynamically updated view. Our previous experiments found that the cluster-temporal browser improves the retrieval effectiveness of novice users by 12-22% over sequential search with relevance feedback [5]. One of the open questions is whether the measured improvement in search effectiveness is statistically significant.

A commonly used retrieval performance measure is average precision, which approximates the area under a precision-recall curve. Zobel [1] demonstrated that the non-parametric Wilcoxon’s signed-rank test is a reliable and discriminative way to evaluate statistical differences between the average precision of two retrieval systems in the TREC framework. In TRECVID video retrieval experiments, statistical tests have not been widely used to validate the obtained results. In this paper, Wilcoxon’s test is employed to validate the content-based browsing results in the TRECVID 2004 and 2005 experiments.

Section 2 introduces the experimental retrieval system and the cluster-temporal browsing technique. Section 3 describes the experiments and the results of the statistical tests. Section 4 provides final conclusions.
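Since average precision is the measure on which all later comparisons rest, a minimal Python sketch of the usual non-interpolated definition may help. It is an illustration only, not the official NIST evaluation code; the function name, the shot identifiers, and the assumption that a ranked list contains no duplicate shots are ours.

```python
from typing import Hashable, Iterable, Sequence


def average_precision(ranked: Sequence[Hashable],
                      relevant: Iterable[Hashable]) -> float:
    """Non-interpolated average precision of one ranked result list.

    Precision is sampled at every rank that holds a relevant item and
    the samples are averaged over ALL relevant items, so relevant items
    that were never retrieved contribute zero.
    """
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant_set:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant_set)


# Hypothetical example: three relevant shots, two of them retrieved.
example_ap = average_precision(["s3", "s7", "s1", "s9"], {"s3", "s1", "s5"})
print(f"average precision = {example_ap:.4f}")
```

The mean of this value over all search topics gives the mean average precision used to summarize a system configuration.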

2. EXPERIMENTAL VIDEO BROWSING AND RETRIEVAL SYSTEM - VIRE
Our content-based video retrieval system (VIRE) is based on visual, concept, and text transcript indexes. VIRE supports both user-defined and example-based queries. The search interface enables manual query definition. The result container collects the selected relevant shots and uses them for relevance feedback, dynamically creating new content-based queries from the selected examples. The cluster-temporal browser acts as a navigational tool in which relevant shots are collected via direct interaction with the system [6].

Copyright is held by the author/owner(s). MM'06, October 23–27, 2006, Santa Barbara, California, USA. ACM 1-59593-447-2/06/0010.

2.1 Content-based Feature Indexes
Our visual feature indexes are based on the color and structure of a video shot [7]. The Color Correlogram (CC) and Gradient Correlogram (GC) features describe statistical co-occurrences of colors and edges in a single key frame. Dissimilarities of the individual features are based on the city-block distance. CC and GC queries generate two separate search results, which are combined using the sum of ranks [7], a method that has been found effective when the dimensions of the features vary.
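The two operations named above, city-block distance and sum-of-ranks fusion, can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions (toy feature vectors, every shot present in both rankings, ties broken by shot identifier); it is not the VIRE implementation.

```python
from typing import Dict, List, Sequence


def city_block(a: Sequence[float], b: Sequence[float]) -> float:
    """City-block (L1) distance between two equally long feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_distance(query: Sequence[float],
                     index: Dict[str, Sequence[float]]) -> List[str]:
    """Shot ids ordered from most to least similar to the query vector."""
    return sorted(index, key=lambda sid: (city_block(query, index[sid]), sid))


def sum_of_ranks(rankings: Sequence[List[str]]) -> List[str]:
    """Fuse several rankings by adding each shot's 1-based rank positions.

    Only rank positions are combined, so the scale and dimensionality of
    the underlying features do not matter. Assumes every ranking holds
    the same shot ids.
    """
    totals: Dict[str, int] = {}
    for ranking in rankings:
        for position, shot_id in enumerate(ranking, start=1):
            totals[shot_id] = totals.get(shot_id, 0) + position
    return sorted(totals, key=lambda sid: (totals[sid], sid))


# Hypothetical toy index standing in for one correlogram feature.
cc_index = {"a": [0.0, 1.0], "b": [0.5, 0.5], "c": [1.0, 0.0]}
cc_ranking = rank_by_distance([0.0, 1.0], cc_index)
gc_ranking = ["c", "a", "b"]  # pretend result of a second feature query
print(sum_of_ranks([cc_ranking, gc_ranking]))
```

Working on ranks rather than raw distances is what makes the fusion insensitive to the differing dimensions of the two features.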

The semantic concept index is based on concept confidences. We have two types of detectors: SVM classifiers and propagated labeling by positive examples. The following concepts are currently implemented using SVM: entertainment, faces, newsroom, outdoor, desert, natural-disaster, and snow. The following are implemented using propagated labeling: fire-explosion-smoke, maps-charts, meeting-footage, nature-footage, sports, water, and weather [6].

A text index is constructed from the automatic speech recognition (ASR) and machine translation (MT) transcripts. The words are first pre-processed with stop word removal and stemming and then indexed into a database. Grouping words into speaker segments expands the text search temporally. The textual similarity of shots is computed using prioritized ranking combined with a weighted term frequency score. The system can also compute an example-based text search using text from a query shot [6].

The results from the different features are combined using a variant of Borda voting [7].

2.2 Cluster-temporal Browser Interface
Cluster-temporal browsing enables navigation by video content and logical structure. The arrows in Figure 1 illustrate how cluster-temporal browsing combines content-based clusters and video structure to maximize information in a single view. The row beneath the horizontal arrow displays the current video of interest as a chronological key frame sequence. The vertical arrow is in a panel that contains similar shots retrieved from the database. Each vertical column represents a retrieved cluster in which the results are ordered by similarity, decreasing downward. The columns form a matrix of similar shots, from which the user can open a new relevant video structure in the top row, and the similarity view is immediately updated. The user can also navigate in the selected video timeline. The similarity matrix is always recomputed using the currently visible video segment.

Figure 1. Cluster-temporal browser combines chronological structure and content-based browsing together dynamically

2.3 Interfaces for Content-based Search and Relevance Feedback
The search interface (see Fig. 2(a)) allows users to define manual queries. First, the user can select examples for visual search. Second, the user can enable semantic concepts from the list of detectors. Third, a text query can be created using a text box. The user can pick relevant shots from the results or select a video as a starting point for browsing in the cluster-temporal browser.

The result container collects the selected relevant shots. These shots are used in a relevance feedback query, which is based on visual and textual similarity. Each time a new relevant shot is added, the query is regenerated. Fig. 2(b) shows example results for a relevance feedback query based on shots about Bill Clinton.

Figure 2. (a) Search interface allows users to create content-based queries with keywords for ASR transcripts, detected semantic concepts, and visual examples. (b) Relevance feedback is dynamically computed from the positive shots selected by the user.

[Figure 1 labels: CONTENT-BASED SIMILARITY CLUSTERS; VIDEO TIMELINE]


3. EXPERIMENTS AND STATISTICAL VALIDATION
In order to understand the significance of cluster-temporal browsing in generic video retrieval, we statistically validate our two large-scale TRECVID retrieval experiments from 2004 and 2005. The TRECVID video retrieval evaluation is an international retrieval benchmark led by the U.S. National Institute of Standards and Technology (NIST) that provides a common framework for research groups to test their content-based video retrieval systems. The focus of the experiments is to compare semantic search effectiveness between two distinct system configurations. The first configuration employs the conventional content-based model: the search interface and the relevance feedback view. The second configuration incorporates the search interface, relevance feedback, and the cluster-temporal browser. Test users were not limited to using any particular interface within either configuration.

For the TRECVID 2004 experiments [4], NIST supplied an English test video database and a common shot segmentation (by CLIPS-IMAG). The test database consists of 70 hours of ABC and CNN news and commercials from the year 1998, segmented into 33367 shots. NIST also provided 24 semantic search topics, of which 23 were used in the experiments (one topic did not yield any correct results and was removed). The search results were pooled by NIST for evaluation. A single search topic contained one or more example video clips or images and a textual topic description to aid the search process.

The TRECVID 2005 experiments were based on 80 hours of English, Arabic, and Chinese video data from TRECVID 2005. The collection contained material from the genres of news, commercials, and entertainment. The video database was initially segmented into shots (by Fraunhofer HHI). NIST provided 24 search topics that were used in the experiment, and the results were pooled by NIST for evaluation. ASR and MT transcripts were provided by NIST and Carnegie Mellon University [6].

The interactive experiments followed a similar protocol in both 2004 and 2005. Four novice users participated in the experiments each year, for a total of eight novice users. Four system developers were selected as the expert test users for the 2005 experiments. The novice users were mainly information engineering students and had little experience in content-based video search. Novices received 30 minutes of training with the system before the actual experiments. Each user searched for results on 12 topics (from NIST), spending 12 minutes on each topic. One system configuration was used for six topics, then a break was taken to reduce the effect of fatigue, and finally the remaining topics were completed with the other configuration. The effects of learning and preference were minimized by arranging the topics, users, and system configurations in a Latin square design:

Table 1. An example of the four person interactive test configuration

Run ID   Searcher ID [topic set IDs]
I1Q      S1[TG1]   S3[TG2]   S2[TG3]   S4[TG4]
I2B      S2[TG1]   S4[TG2]   S1[TG3]   S3[TG4]
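The balance properties of the arrangement in Table 1 can be checked mechanically. The Python sketch below transcribes the table and assumes, as our own reading, that each run ID stands for one system configuration; the function name and the specific checks are illustrative and not part of the original experiment software.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Table 1 transcribed: run ID -> (searcher, topic group) assignments.
DESIGN: Dict[str, List[Tuple[str, str]]] = {
    "I1Q": [("S1", "TG1"), ("S3", "TG2"), ("S2", "TG3"), ("S4", "TG4")],
    "I2B": [("S2", "TG1"), ("S4", "TG2"), ("S1", "TG3"), ("S3", "TG4")],
}


def is_balanced(design: Dict[str, List[Tuple[str, str]]]) -> bool:
    """Check that the design counterbalances searchers and topic groups.

    Within every run each searcher and each topic group must appear
    exactly once, and no searcher may meet the same topic group twice.
    """
    all_searchers = sorted({s for cells in design.values() for s, _ in cells})
    all_groups = sorted({g for cells in design.values() for _, g in cells})
    seen = defaultdict(set)  # searcher -> topic groups already searched
    for cells in design.values():
        if sorted(s for s, _ in cells) != all_searchers:
            return False
        if sorted(g for _, g in cells) != all_groups:
            return False
        for searcher, group in cells:
            if group in seen[searcher]:
                return False  # same topics searched twice by one person
            seen[searcher].add(group)
    return True


print("Table 1 design balanced:", is_balanced(DESIGN))
```

With 24 topics split into four groups, this arrangement gives each searcher six topics per configuration, which matches the 12 topics per user described above.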

Table 2 shows the overall distribution of the collected shots across the different search interfaces. When the browser is disabled, most shots are selected from the results in the search interface, but as many as three out of ten shots are picked from the relevance feedback tool. When the cluster-temporal browser is available, about half of the shots are selected from the browser view. This indicates that the searchers adopted the browser as a valid interaction technique.

Table 2. Distribution of the collected shots by the search types, TRECVID 2005 experiments

Experimental Setup   Search Interface   Browser   Relevance Feedback
With Browser         29%                52%       19%
Without Browser      68%                -         32%

Table 3 lists the mean average precisions for the different experiments. An important observation is that search precision consistently improves when novice users are given the opportunity to use the cluster-temporal browser.

Table 3. Mean average precisions from the TRECVID interactive search tasks 2004 and 2005

Search Experiment                  With Cluster-Temporal Browser   Without Cluster-Temporal Browser
2004, Novice Users                 0.202                           0.165
2005, Novice Users                 0.226                           0.202
2005, Designers as Expert Users    0.242                           0.264
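As a worked check on Table 3, the relative change in mean average precision when the browser is added can be computed directly from the table entries. We assume here that the change is taken relative to the configuration without the browser; the symbol $\Delta$ is introduced only for this illustration.

```latex
\begin{align*}
\Delta_{2004}^{\text{novice}} &= \frac{0.202 - 0.165}{0.165} \approx 0.224 \quad (\approx 22\%) \\
\Delta_{2005}^{\text{novice}} &= \frac{0.226 - 0.202}{0.202} \approx 0.119 \quad (\approx 12\%) \\
\Delta_{2005}^{\text{expert}} &= \frac{0.242 - 0.264}{0.264} \approx -0.083 \quad (\approx -8\%)
\end{align*}
```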

The relative difference in precision is 22 and 12 percent for the 2004 and 2005 benchmarks, respectively. In contrast, the system designers as expert users do not benefit from the cluster-temporal browser. For expert users, conventional queries turn out to be more effective than browsing. A more detailed investigation revealed that proper selection of concept detectors was useful for the experts. For example, the ‘meeting footage’ and ‘faces’ concepts were found helpful in finding politicians by name in the 2005 experiments. The knowledge of detector performance was obtained during the system development and training phase without contamination from the actual test database and search topics. Since knowledge of the performance of the individual concept detectors benefits retrieval, search experiments with the system developers have poorer external validity than those with novice users, who have no specialized knowledge of any part of the system [5].

The most important question remains: is the improvement with novice users significant enough to conclude that the cluster-temporal browser improves retrieval? To confirm this, we need pair-wise statistical tests to determine whether the two system configurations are significantly different. We use the non-parametric Wilcoxon’s signed-rank test to validate the observation. Our alternative hypothesis is that the average precision with the cluster-temporal browser differs significantly from that of the baseline configuration (search interface with relevance feedback).

Table 4 shows the results of Wilcoxon’s tests with normalized and non-normalized average precisions. Wilcoxon’s signed-rank test is employed to compare the per-topic average precision pairs of the two system configurations. The first column (left) shows the test results for the 23 search topics of the 2004 set-up, the second column is computed from the results of the 24 topics in the 2005 experiments, the third column combines the novice user tests from the 2004 and 2005 experiments for a total of 47 search topics, and the fourth column displays the test values for the system developers.

Two types of input values were used for the tests. First, the average precisions were used directly from the original NIST results. Second, the average precisions were normalized for each query using the maximum average precision achieved within the TRECVID participant group (test values $W_{norm}$, $z_{norm}$, $p_{norm}$). The rationale for the normalization is to eliminate the variance in topic difficulty and thereby to create more reliable test results. As an example, the average precisions from 2005 were normalized using the maximum average precisions across all results evaluated by NIST for each query. The overall test was computed on the combination of the 2004 and 2005 results, where the normalization of the average precisions was based on the maximum performance in the respective year’s experiments.
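The per-topic normalization described above amounts to a single division per topic. The Python sketch below shows it under our own assumptions: the topic identifiers and the average precision values are invented for illustration, and a topic whose best result is zero is treated as an error because it cannot be normalized.

```python
from typing import Dict


def normalize_ap(ap_by_topic: Dict[str, float],
                 max_ap_by_topic: Dict[str, float]) -> Dict[str, float]:
    """Divide each topic's average precision by the best average
    precision any evaluated run achieved on that topic."""
    normalized = {}
    for topic, ap in ap_by_topic.items():
        best = max_ap_by_topic[topic]
        if best <= 0.0:
            raise ValueError(f"topic {topic} has no positive maximum")
        normalized[topic] = ap / best
    return normalized


# Hypothetical numbers: an easy topic and a hard topic.
system_ap = {"topic_A": 0.30, "topic_B": 0.03}
best_ap = {"topic_A": 0.60, "topic_B": 0.04}
print(normalize_ap(system_ap, best_ap))
```

In this toy case the hard topic ends up with the higher normalized score, which illustrates how the normalization keeps easy topics from dominating the paired differences.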

Table 4. Wilcoxon’s signed rank tests for TRECVID experiments with novice and expert users

Wilcoxon Test   2004 (Novices)   2005 (Novices)   2004/2005 (Novices)   2005 (Experts)
$W$             -134             -46              -358                  106
$W_{norm}$      -144             -46              -360                  104
n               23               24               47                    24
$z$             -2.03            -0.65            -1.89                 1.51
$z_{norm}$      -2.18            -0.65            -1.9                  1.48
$p$             0.0212           0.2578           0.0294                0.0655
$p_{norm}$      0.0146           0.2578           0.0287                0.0694

The first line of Table 4 shows the Wilcoxon’s signed-rank sums for the $X_a$ (without browser) and $X_b$ (with browser) query pairs. Negative values of $W$, $W_{norm}$, $z$, and $z_{norm}$ indicate that the test favors configuration $X_b$, whereas positive values favor the results in group $X_a$. The last two rows display the most interesting information: the one-tailed probability of obtaining a difference at least as large as the observed one if the null hypothesis holds for the pairs in that set. The following conclusions can be derived from Table 4:

1. The 2004 experiments with novice users demonstrate an improvement in search effectiveness using the cluster-temporal browser. The results are statistically significant at the 2.5% level;

2. For the 2005 experiments, the null hypothesis cannot be rejected at the 5% level. The browser configuration does not provide statistically significant benefits;

3. The overall combined experiments of 2004 and 2005 with 47 semantic query topics show that the cluster-temporal browser improves semantic search for novice users. This result is statistically significant at the 5% level;

4. The 2005 experiments with expert users favor the non-browser configuration. The statistical tests show that the 5% level is not reached, so the results are not statistically significant.

The most important values in Table 4 are those of the Wilcoxon’s test on the combined experiments with normalized average precision values ($W_{norm} = -360$, $z_{norm} = -1.9$, $p_{norm} = 0.0287$). This test gives the most extensive and reliable results about the retrieval effectiveness of cluster-temporal browsing with the novice user group: the cluster-temporal browser improves search effectiveness by a statistically significant amount.
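The test statistics in Table 4 can be sketched in Python as follows. The paper does not state which formulas were used, so the normal approximation with a 0.5 continuity correction below is our assumption; it reproduces the reported $z$ values to two decimals and the one-tailed $p$ values to within about 0.0005 (the table values match $p$ computed from the rounded $z$). Function names are illustrative, and zero differences are dropped, which is one common convention.

```python
import math
from typing import Sequence, Tuple


def signed_rank_sum(without_browser: Sequence[float],
                    with_browser: Sequence[float]) -> Tuple[float, int]:
    """Return (W, n) for paired per-topic average precisions.

    W is the sum of signed ranks of the differences a - b, so a
    negative W favors the with-browser configuration. Tied absolute
    differences share their average rank; n counts non-zero differences.
    """
    diffs = [a - b for a, b in zip(without_browser, with_browser) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w = sum(r if d > 0 else -r for r, d in zip(ranks, diffs))
    return w, len(diffs)


def z_and_p(w: float, n: int) -> Tuple[float, float]:
    """Normal approximation of W with continuity correction (assumed).

    Returns the z score and the one-tailed p value.
    """
    if n <= 0:
        raise ValueError("need at least one non-zero difference")
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    z = math.copysign(max(abs(w) - 0.5, 0.0), w) / sigma
    p_one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return z, p_one_tailed


# W and n taken from Table 4 (non-normalized and normalized novice results).
for label, w, n in [("2004", -134, 23), ("2004 norm", -144, 23),
                    ("2004/2005", -358, 47), ("2004/2005 norm", -360, 47)]:
    z, p = z_and_p(w, n)
    print(f"{label}: z = {z:.2f}, one-tailed p = {p:.4f}")
```

For small samples an exact test would be preferable to the normal approximation; with 23 to 47 topics the approximation is generally considered adequate.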

4. CONCLUSIONS
This paper presents a statistical validation of semantic retrieval experiments using content-based retrieval with and without cluster-temporal browsing. Due to the diverse video genres and semantic search topics, we assert that the experiments represent generic video retrieval. Two user groups took part in experiments with large and multilingual video collections. Novice users were found to benefit from cluster-temporal browsing. The improvement in search effectiveness was statistically significant for the eight novice users performing a total of 47 semantic search topics. System developers did not receive a statistically significant benefit from the use of the cluster-temporal browser. Experiments with system developers compromise internal validity, since the developers have prior knowledge about the performance of the individual content-based features, obtained during the system development and training phase. This knowledge assists searching even if the query topics and the contents of the test collection remain unknown until the actual experiments.

5. ACKNOWLEDGMENTS
We would like to thank the Academy of Finland for supporting this research.

6. REFERENCES
[1] Zobel, J. How reliable are the results of large-scale information retrieval experiments? In Proc. of 21st international ACM SIGIR conf. on Research and development in information retrieval, Melbourne, Australia, 1998, 307-314.

[2] Salton, G., and Buckley, C. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, vol. 41, 1990, pp. 288-297.

[3] Hauptmann, A.G., and Christel, M.G. Successful Approaches in the TREC Video Retrieval Evaluations. Proc. ACM Multimedia, 2004, 668-675.

[4] Kraaij, W., Smeaton, A.F., Over, P., and Arlandis, J. TRECVID 2004 - an overview. Online Proceedings of the TRECVID 2004, Gaithersburg, MD, United States, 2004, http://wwwnlpir.nist.gov/projects/tvpubs/tv.pubs.org.html.

[5] Rautiainen, M., Seppänen, T., and Ojala, T. Advancing content-based retrieval effectiveness with cluster-temporal browsing in multilingual video databases. IEEE Int. Conf. on Multimedia & Expo, Toronto, Canada, 2006, 377-380.

[6] Rautiainen, M., Varanka, M., Hanski, I., Hosio, M., Pramila, A., Liu, J., Ojala, T., and Seppänen, T. TRECVID 2005 Experiments at MediaTeam Oulu. TREC Video Retrieval Evaluation Online Proc., TRECVID ‘05, Gaithersburg, MD.

[7] Rautiainen, M., Ojala, T., and Seppänen, T. Analysing the performance of visual, concept and text features in content-based video retrieval. 6th ACM SIGMM International Workshop on Multimedia Information Retrieval MIR’04, New York, NY, 2004, 197-205.

[8] Girgensohn, A., Adcock, J., Cooper, M., Wilcox, L. A Synergistic Approach to Efficient Interactive Video Retrieval. INTERACT 2005, LNCS 3585, 2005, 781-794.

