
[IEEE 2009 11th IEEE International Symposium on Multimedia, San Diego, California, USA, December 14-16, 2009]




Content-based TV Stream Analysis Techniques Toward Building a Catch-Up TV Service

Xavier Naturel, Sid-Ahmed Berrani
Orange Labs - France Telecom

4 rue du clos courtel, 35510 Cesson-Sévigné, France
{xavier.naturel, sidahmed.berrani}@orange-ftgroup.com

Abstract

One of the promises of Digital Television is the possibility of creating interactive and innovative television services, like catch-up TV. However, these services need external resources, coming from the channels themselves or from manual annotation. In this paper, a system for automatically building a Catch-up TV service from the available EPG and the broadcast TV stream is presented. The system combines several content-based techniques for extracting exact program boundaries from the TV stream. Traditional commercial detection and recognition methods are used, as well as novel techniques to detect and classify repetitions. Identification of the TV program is then performed by matching the detected boundaries with the EPG. Extensive experiments on three weeks of TV assess the effectiveness of the proposed system.

1. Introduction

Telecom operators and television channels have recently shown a strong interest in novel TV services. Catch-Up TV, TV-on-Demand (TVoD) and search engines over TV content are becoming more and more popular. These added-value services are considered one of the key differentiation criteria between market players that make use of television content. Their development generally relies on an off-line processing step in which the TV content is prepared. The content has to be properly segmented and annotated in order to be repurposed in a convenient way. When performed manually, this step is highly time-consuming. It is one of the main obstacles toward a widespread usage of these services.

Two representative and very important TV services are Catch-Up TV and TVoD. They aim at making TV content available to viewers without any constraint of location and/or time. In order to build these services, the main underlying required processing is to extract individual programs from the linear TV stream. This can be done by the channels at the time of broadcasting. They can tag the start and end points of each broadcast program, as in the PDC/VPS standard for analog TV. In practice, only very few channels use this possibility. It is not always a straightforward and accurate procedure because of the complexity of the broadcast chain. Metadata like the Event Information Table (EIT) or the Electronic Program Guide (EPG) provide information on the structure of TV streams. They are unfortunately imprecise and incomplete [1]. Consequently, program extraction from the TV stream has to be performed after broadcast, considering only the linear audio-visual stream.

In this paper, a set of content-based techniques is proposed to automatically process the TV stream. It is also shown how these techniques can be assembled, integrated and used along with the EPG in a single, fully automatic system. This system is able to extract programs online from a TV stream, which allows making programs available a few minutes after the end of their broadcast.

2 Related work

Automatic program extraction from TV streams is close to TV commercial detection, but has received less attention. One main difference is that all types of Inter-Programs (IPs), e.g. commercials, sponsorships, trailers, self-advertisements and anything else broadcast between regular programs, have to be detected. Program boundaries are then given indirectly by detecting IPs and deducing programs as the rest of the stream. Popular features like monochrome frames and/or silence have been used by [8, 12] to detect unknown inter-programs. These are then stored in an IP database. Recognition of known IPs stored in the IP database is usually performed using audio/video hashing [6, 2]. The fact that IPs are generally broadcast several times in the stream is also used by [1], where IPs are detected as repeated sequences. Some methods have been proposed for automatically updating the IP database. First mentioned by Lienhart [6], an approach using repetition has been studied by [2], and a complete system using audio features has been proposed by Wang [11].

2009 11th IEEE International Symposium on Multimedia
978-0-7695-3890-7/09 $26.00 © 2009 IEEE
DOI 10.1109/ISM.2009.99

Another class of methods for program extraction tries to directly detect the program boundaries. In [5], Liang et al. proposed a method that builds models for program beginnings and endings. It is based on day-to-day repetition of programs. Models are essentially based on shot similarity and temporal rules. However, the strong assumption that programs repeat from day to day with the same opening/closing credits is not always true. Some authors have tried to provide more generic methods: El-Khoury et al. [3] proposed to detect TV program boundaries by detecting changes between homogeneous zones using the GLR/BIC criterion. This unsupervised method is, however, less effective than specific ones.

Most of these methods only recover the program boundaries. Program identification and labeling has been given little attention. In [8], program segments are matched with the EPG using a dynamic time warping algorithm with temporal constraints. A different approach based on metadata is proposed by Poli et al. [10]. A model based on HMMs is trained on a large amount of ground truth to predict both the program boundaries and their class. It is unfortunately difficult to use in a real-world application due to the required amount of training data for each channel.

In addition to the aforementioned drawbacks of each of these techniques, most of them do not provide the whole processing chain, from boundary detection to program labeling and extraction. The few systems like [8, 10] that provide the whole chain are not able to extract programs online.

3 System Overview

The architecture of the system is presented in Figure 1. It is composed of three main parts: low, mid, and high level. Low-level corresponds to feature extraction, mid-level to inter-program detection, and high-level to stream segmentation and labeling. Although the goal is to detect program boundaries, the main focus of the system is on detecting IPs. The reason for this is that they are easier to detect than the programs themselves. Programs do not share common features, and are therefore difficult to detect as such. On the contrary, IPs exhibit characteristics that can be automatically detected. Since the stream is only composed of programs and IPs, program boundaries are given indirectly by the IP boundaries, e.g. the end of an IP is the start of a program.

Two inputs are necessary: the TV stream itself (frames and audio), and a program guide, either in the form of an EIT extracted from a DVB transport stream, or an EPG collected from web sites. The term EPG is used for convenience in the paper, but should be understood as either EPG, EIT, or a mix of both. If no EPG is present, the system can still perform segmentation but not labeling. The TV stream can be a stored file or a live feed.

Figure 1. Overview of the system. Online processing steps are in yellow, offline ones in blue.

The low-level part performs shot segmentation and extracts the relevant features for the different mid-level processes. This is detailed in Section 4. The mid-level part of the system is IP detection. It relies on several methods: Separation detection, Repetition analysis, and Online recognition, explained in Section 5. Separation detection and Repetition analysis detect unknown IPs and store them in an IP database. Online recognition uses this database to recognize known IPs. The third and last part of the system combines the results of IP detection in order to perform a segmentation of the stream into program/inter-program segments, and labels the resulting segments with titles and related information coming from the EPG. This is explained in Section 6. These three parts are all performed online, faster than real time.

4 Feature extraction

This part of the system extracts several low-level features that are used by different mid-level methods. The first step is shot segmentation, performed according to the method described in [1], which is based on [4]. Perfect shot segmentation is not required; however, the segmentation should be the same for different broadcasts of the same video sequence. For each shot, a keyframe is chosen according to the Page-Hinkley test applied to the signal of cumulative inter-frame difference, as explained in [1].



Two levels of frame description are used. A basic visual descriptor (BVD) is extracted from each frame of the video stream. It is used to match almost identical frames and only needs to be invariant to small variations, due to compression for instance. The second level focuses on carefully chosen keyframes of the video stream. The descriptor associated with these keyframes is called the key visual descriptor (KVD). It is more sophisticated and must be more robust. KVDs are used during the clustering step to cluster similar shots and to create the set of repeated sequences. However, at this stage, the boundaries of detected repeated sequences cannot be determined. A KVD is associated with a frame of the repeated sequence but does not provide any information on the sequence boundaries. The BVD is thus used to precisely determine these boundaries by matching corresponding frames in all occurrences of the repeated sequence. Both BVDs and KVDs are DCT-based descriptors. For more details on these descriptors, the reader is referred to [1].

Next, basic indicators of frame monochromaticity and silence are computed according to [8]. The monochromaticity indicator is computed as the entropy of the luminance histogram, in the YUV color space. The silence indicator is the log-energy of the audio signal, computed on 10 ms audio frames with 50% overlap. Color histograms in the RGB space are also computed for each frame.
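The two indicators above can be sketched as follows. This is a minimal NumPy illustration; the function names, bin count, sample rate default and the small floor added before the logarithm are our own choices, not specified in the paper:

```python
import numpy as np

def monochromaticity(luma: np.ndarray, bins: int = 256) -> float:
    """Entropy (in bits) of the luminance histogram; near 0 for a monochrome frame."""
    hist, _ = np.histogram(luma, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def log_energy(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Log-energy per 10 ms audio frame with 50% overlap (5 ms hop)."""
    frame, hop = int(0.010 * sr), int(0.005 * sr)
    n_frames = 1 + max(0, (len(audio) - frame) // hop)
    return np.array([
        np.log10(np.sum(audio[i * hop : i * hop + frame] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
```

A perfectly uniform frame yields zero entropy, while a textured frame yields a high value; silent audio yields a very low log-energy compared to a tone.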

5 Inter-program detection

Inter-program detection is composed of several methods: Separation detection, Repetition analysis, and Online recognition, all linked with an IP database. These three IP detection methods are complementary: IPs that may be missed by separation detection can be recognized by online recognition, and vice versa. While the global system uses all three methods of IP detection, they can be used in different combinations, e.g. repetition and recognition, or separation and recognition.

5.1 Separations

A frequently used feature for detecting commercials is black frames [6]. However, black/monochrome frames alone yield a very high false-alarm rate, and are usually used in conjunction with another feature. A less used but quite powerful feature is silence, used by [12]. These features are used together in the proposed method, generalizing black frames to monochrome frames. The simultaneous occurrence of monochrome frames and silence is called a separation in the rest of the paper. Separations are used by channels to mark the beginning/end of an IP or a program, with different behaviours from one channel to another. Separations are not used worldwide, but they are very strong indicators of IP presence. They are not mandatory for our system to work, but they greatly facilitate the IP detection process. Note that in France, most channels use separations, with monochrome frames of color white, black, blue or even pink, depending on the channel. Separation detection is performed using the method described in [8]. A candidate separation location is first found by thresholding the audio log-energy. Next, we look for monochrome frames in a window around the candidate location, by thresholding the monochromaticity indicator defined in Section 4. If no monochrome frames are found at this location, the candidate is discarded.
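A sketch of this two-stage test, assuming the per-video-frame log-energy and monochromaticity series have already been computed and aligned. The thresholds and window size here are illustrative placeholders, not the paper's values:

```python
def detect_separations(energy, entropy, e_thresh=-8.0, h_thresh=1.0, window=12):
    """Two-stage separation detector.

    energy  : per-video-frame audio log-energy (stage 1: silence candidates)
    entropy : per-frame luminance-histogram entropy (stage 2: monochrome check)
    Returns frame indices where silence coincides with a nearby monochrome frame.
    """
    separations = []
    for i, e in enumerate(energy):
        if e >= e_thresh:                      # not silent: not a candidate
            continue
        lo, hi = max(0, i - window), min(len(entropy), i + window + 1)
        if min(entropy[lo:hi]) < h_thresh:     # monochrome frame in the window
            separations.append(i)              # keep; otherwise discard
    return separations
```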

In order to improve the reliability of separation detection, we propose a post-processing step on top of the separation detection to remove some false alarms. This is performed using a Support Vector Machine (SVM) classifier on the detected separations, to distinguish between correct matches and false alarms. A feature vector xF of dimension 18 is built for each separation, using mostly color, temporal and motion information. It is expressed as:

$$x_F = \left(N,\ \min_i d_i,\ \max_i d_i,\ \frac{1}{N-1}\sum_{i=2}^{N} d_i,\ H,\ I,\ L,\ \xi\right),$$

where $N$ is the length of the separation in frames, $d_i$ is the luminance histogram intersection between frame $i-1$ and frame $i$, $L$ is the distance to the previous separation, $\xi$ is a boolean indicating whether it is day or night (1 am to 5 am), $H = (h_r^M, h_g^M, h_b^M, h_r^m, h_g^m, h_b^m)$ is a vector of the min ($m$) and max ($M$) values of the RGB histograms, and $I$ is the vector of indexes of the $H$ values.

A soft-margin SVM with Gaussian kernel $K(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right)$ is chosen. Three datasets (training, validation and testing), each of 24 hours of video representing one full day of TV, are used to train and test the model. Each set has between 400 and 500 separations. The SVM parameters σ and regularisation cost C are computed by exhaustive search of the parameter space on the validation set, yielding σ = 0.01 and C = 10.
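The grid search over σ and C can be sketched with scikit-learn. The feature vectors below are synthetic stand-ins (the real 18-dimensional xF vectors are not available here), and the grid values other than σ = 0.01 and C = 10 are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for the 18-dimensional separation feature vectors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 18))
y_train = (X_train[:, 0] > 0).astype(int)
X_val = rng.normal(size=(100, 18))
y_val = (X_val[:, 0] > 0).astype(int)

best = None
for sigma in (0.01, 0.1, 1.0, 10.0):     # exhaustive grid over the parameters
    for C in (1, 10, 100):
        # scikit-learn's RBF kernel is exp(-gamma * ||x1 - x2||^2), so the
        # paper's exp(-||x1 - x2||^2 / (2 sigma^2)) maps to gamma = 1/(2 sigma^2).
        clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma**2), C=C)
        clf.fit(X_train, y_train)
        score = clf.score(X_val, y_val)
        if best is None or score > best[0]:
            best = (score, sigma, C)
```

Note the kernel parameterization: scikit-learn exposes gamma rather than σ, hence the conversion in the loop.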

Once separations are detected, IP detection is a simple procedure. For each new separation, the distance to the last detected separation is computed. If this distance is below a certain threshold, an IP is detected between the two separations. A threshold of 150 s is used in the experiments. The detected IPs are used both for stream segmentation and for feeding the IP database. IPs are inserted into the database shot by shot, as we have no knowledge of the boundaries of each individual IP.
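The pairing rule above amounts to a one-pass scan over separation timestamps; a minimal sketch:

```python
def ip_between_separations(separations, max_gap=150.0):
    """Pair consecutive separation timestamps (in seconds): the span between
    two separations closer than max_gap (150 s in the paper) is marked as an IP."""
    return [
        (prev, cur)
        for prev, cur in zip(separations, separations[1:])
        if cur - prev < max_gap
    ]
```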

5.2 Repetition Analysis

Repetition analysis consists in detecting repeated video sequences in a TV stream. This analysis is usually performed over a long duration, e.g. 1 to 10 days, so that the number of repetitions is not negligible. In our case, it is performed periodically every day, on an accumulated TV stream of five days. Repetition analysis is done as specified in [1]. Repetitions are identified by clustering DCT descriptors computed on keyframes, the so-called KVD descriptors described in Section 4. Clusters are then analyzed to provide a set of repeated sequences. Filtering rules based on the temporal distance between KVDs of the same cluster are used to remove some clusters, and inter-cluster similarity is computed to group clusters that belong to the same video sequence.

Once the repetitions are obtained, they are classified into programs or IPs using the method of [7]. It uses Inductive Logic Programming (ILP) to classify the repetitions based on their duration, number of repetitions, and contextual features such as their neighborhood. The repetitions classified as IPs are then inserted into the IP database as sequences of shots.

The contribution of this method to the system is twofold. First, it can detect IPs that cannot be detected with the separation method, especially if the TV channel does not use separations. Secondly, it helps to structure the database into sequences that have a meaning (e.g. a trailer). This reduces the possible number of false positives in the online recognition process, since recognition is not based on a single shot but on a whole sequence. It also naturally solves some shortcomings of the shot-by-shot approach, especially concerning trailers. In the shot-by-shot approach, each shot of the trailer is tagged as an IP, which is a problem when the announced program is eventually broadcast: some of its shots will be tagged as IP, possibly oversegmenting it. This does not happen in the sequence approach, because the trailer is not included in the film itself.

5.3 Recognition

Recognition aims at detecting in the TV stream IP sequences which are stored in the IP database. The rationale is that IPs are broadcast more than once, and recognition of known IPs is easier than the detection of unknown ones. The features, database construction and recognition process described in this section are very similar to those proposed by [9]. In the database, an IP is represented by a sequence of shots (an ordered set of shots). Note that a sequence can have only one shot, which is an important case in practice, since the method of Section 5.1 inserts only individual shots. To compare the TV stream to the database video sequences, basic visual descriptors (BVDs) are computed on each frame, as explained in Section 4. Each shot has a set of BVDs, one BVD for each frame of the shot.

Retrieval is based on perceptual hashing. BVDs are stored in a hash table which allows constant-time querying. The basic unit of recognition is the shot. Matching the TV stream with sequences from the database is done shot by shot, by testing BVDs from the query shot (from the TV stream) against the hash table. This provides candidate shots (from the database). A distance between query and candidate shots is computed as the mean of the Hamming distances between BVDs. The query and candidate shots are said to be identical if this distance is below a certain threshold. To recognize a sequence, individual shots from this sequence have to be recognized in the right order.
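A toy sketch of the hash-table lookup and mean-Hamming-distance test. Here 64-bit integers stand in for the DCT-based BVDs, and the class name and threshold are our own illustrative choices:

```python
from collections import defaultdict

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer descriptors."""
    return bin(a ^ b).count("1")

class ShotIndex:
    """Toy perceptual-hash index: BVDs hashed exactly for candidate retrieval;
    shots match when the mean Hamming distance over frames is below a threshold."""

    def __init__(self, threshold: float = 4.0):
        self.table = defaultdict(set)   # BVD value -> ids of shots containing it
        self.shots = {}                 # shot id -> list of per-frame BVDs
        self.threshold = threshold

    def add_shot(self, shot_id, bvds):
        self.shots[shot_id] = list(bvds)
        for h in bvds:
            self.table[h].add(shot_id)

    def match(self, query_bvds):
        """Return ids of stored shots considered identical to the query shot."""
        candidates = set().union(*(self.table.get(h, set()) for h in query_bvds))
        hits = []
        for sid in candidates:
            ref = self.shots[sid]
            n = min(len(ref), len(query_bvds))
            mean_d = sum(hamming(a, b) for a, b in zip(ref, query_bvds)) / n
            if mean_d < self.threshold:
                hits.append(sid)
        return hits
```

A query whose descriptors differ by a few bits from a stored shot still matches, while an unrelated query retrieves nothing.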

The database is dynamic, meaning that content is added and removed over time. Concerning removal, several update strategies can be defined. Adopting a day-by-day analysis, three update strategies are inspected in this paper. Infinite memory: all sequences are kept, except for those that have not been recognized for a certain time; a periodic scan of the database thus removes the sequences that are not used. No memory: only the sequences from the current day are kept; all sequences are removed from the database at the end of the day. One day memory: only the sequences from the previous and the current day are kept.
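The three strategies can be sketched as a single pruning function over a map from sequence id to the day it was last inserted or recognized. The function name and the 30-day cutoff for the infinite-memory scan are illustrative assumptions:

```python
def prune_database(db, today, strategy, max_unused_days=30):
    """db maps sequence id -> day the sequence was last inserted or recognized.
    Apply one of the three memory strategies (max_unused_days is illustrative)."""
    if strategy == "no_memory":          # keep only today's sequences
        return {s: d for s, d in db.items() if d == today}
    if strategy == "one_day":            # keep yesterday's and today's
        return {s: d for s, d in db.items() if d >= today - 1}
    if strategy == "infinite":           # periodic scan drops long-unused entries
        return {s: d for s, d in db.items() if today - d < max_unused_days}
    raise ValueError(strategy)
```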

6 Stream segmentation and labeling

The final boundaries of IPs are computed by a simple union of the different results of IP detection, as illustrated in Figure 2. Short segments located between IPs are also considered IPs and merged. The maximal length of a segment for this merge to happen is a parameter of the segmentation algorithm. This is performed online: IPs are detected as soon as they appear by the methods of Sections 5.1 or 5.3, and their results are merged. Program boundaries are deduced immediately from the IP boundaries. The method of Section 5.2 is not used directly for segmentation, since it is performed periodically offline. Repeated sequences detected by the method of Section 5.2 are, however, inserted offline into the IP database, and further occurrences of these sequences are thus detected immediately after broadcast by the recognition method of Section 5.3.

Figure 2. Segmentation steps leading to programs P1 to P5 (rows: Separations, Recognitions, Union, Segmentation).

Once segmentation has been performed, we seek to match each obtained program with a label from the EPG. A straightforward matching procedure based on maximal overlap is proposed in this paper. For a specific program p, we compute two measures based on the overlap between p and epg_i, for each program i of the EPG. The overlap is simply the temporal intersection ov_i = p ∩ epg_i. The first measure is the best ratio between the overlap and the program length, M_p = max_i ov_i/|p|. The second measure is the best ratio between the overlap and the EPG program length, M_e = max_i ov_i/|epg_i|. If the program labels given by the EPG programs that maximize M_e and M_p are identical, p is labeled with this label. If different, the EPG program closest in length to p is chosen. Contiguous programs that share the same label are merged. An exception to this merge is made for short programs (provided that they are not surrounded by programs with which they share the same label). This rule takes into account the fact that short programs (less than 5 min) are often wrongly labeled, and might be erroneously merged with a regular program.
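The two overlap measures and the tie-breaking rule can be sketched as follows; intervals are (start, end) pairs in seconds, and the function names are our own:

```python
def overlap(a, b):
    """Temporal intersection length of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def label_program(p, epg):
    """epg is a list of (start, end, label) entries. Pick the entries maximizing
    M_p = ov/|p| and M_e = ov/|epg_i|; if their labels disagree, prefer the
    entry whose length is closest to that of p."""
    best_p = max(epg, key=lambda e: overlap(p, e) / (p[1] - p[0]))
    best_e = max(epg, key=lambda e: overlap(p, e) / (e[1] - e[0]))
    if best_p[2] == best_e[2]:
        return best_p[2]
    p_len = p[1] - p[0]
    return min((best_p, best_e), key=lambda e: abs((e[1] - e[0]) - p_len))[2]
```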

7 Experiments

The proposed system has been evaluated using real TV broadcasts from a French channel. The dataset is composed of three weeks of TV, recorded from 07/09/2007 to 07/30/2007, with their program guide. When available, the EIT is preferred over the EPG, as it is more precise. This dataset has been manually annotated, indicating for each program its start time, end time, and label. In Section 7.1, the results of different configurations are presented and analyzed. Detailed results are then presented in Section 7.2 for a specific configuration. We have chosen to focus on long programs broadcast from 7 a.m. to midnight. These are the most interesting from an application point of view. Short programs (less than five minutes) are not taken into consideration in these experiments, but the system can handle them. Experiments are performed on a PC running Windows XP, with a 2.5 GHz Intel Xeon CPU and 3 GB of main memory. It takes 3 h 40 min for the system to process 24 h of TV.

7.1 Results

We first evaluate the global performance of the system on the three-week dataset. Four configurations are tested, all of them using separation detection and online recognition as defined in Section 3. The first three configurations are the three memory strategies defined in Section 5.3: no memory (No), one day memory (One), and infinite memory (Inf), in which the repetition analysis of Section 5.2 is not used. In the fourth configuration (Rep), repetitions are used with an infinite memory strategy. The results in Table 1 are presented for the four configurations (No, One, Inf, Rep) and evaluated using several measures: the first (resp. second) measure is the average absolute temporal distance from the actual start (resp. end) of the program. The display format is mm:ss. The four following measures are the percentages of programs for which the start (resp. end) time is less than 1 second (resp. 10 s) away from the actual start (resp. end) time given by the ground truth. The measure at 1 second indicates almost perfect segmentation. The last measure is the number of correctly retrieved and labeled programs, with respect to the ground truth. Note that only the start and end times of a program are evaluated here: if a program is broadcast in several parts, only its start and final end points are evaluated, not the possible intra-segmentation.

        Δstart   Δend   Δ<1s start (%)  Δ<1s end (%)  Δ<10s start (%)  Δ<10s end (%)  label (%)
No      01:27    06:11       56.4            64.7           79.6             71.6         90.2
One     01:19    06:27       67.7            65.8           83.5             73.8         89.7
Inf     01:12    06:19       47.4            58.8           81.4             67.8         89.5
Rep     01:14    06:22       47.1            61.3           80.3             70.8         89.5

Table 1. Evaluation of the four configurations of the system on the three-week dataset.

Table 1 shows a large difference between the start and end times for all configurations. This may be due to several factors: first, metadata are better at indicating the start time than the end; second, when a program is cut into several parts, the last part may be missed or wrongly labeled, resulting in a very large difference between the computed end time and the actual one. The percentages of almost perfect segmentation (<1 s) and acceptable segmentation (<10 s) are generally quite high, indicating also that most programs are correctly segmented, while the few others may have boundaries very far from the ground truth. As expected, different database strategies yield significantly different results. At the extremes, the no memory and the infinite memory strategies have the lowest percentages of perfect and acceptable segmentation. In the first case the database is not complete enough, so that some IPs are not recognized. In the second case, some non-IPs have been inserted into the database, e.g. opening logos, title sequences, or opening/closing credits, which disturb the segmentation. A good trade-off is to keep a sufficiently large database, and to periodically remove old entries, so that errors are removed. In the last configuration, the use of repetitions with infinite memory seems to correct some of the bad effects of the infinite memory strategy, thanks to the classification of sequences. Its effect on the database, and the best way to use the repetitions, have still to be further studied.

7.2 Detailed analysis

In this section, results for the One day memory configuration are analyzed in detail. Table 2 shows the results for each day of the week from 07/10/2007 to 07/16/2007. Evaluation measures are the same as in the previous section. It is interesting to note that results are not homogeneous from one day to another. The reason for this is that IPs change every day and are not broadcast at the same time. On three occasions out of seven, 84.6% of the programs have a perfect start time. The large Δend values are mainly due to a sport program broadcast every day, followed by programs discussing this event, which vary greatly in length. In that case, the EPG is very inaccurate and some programs are wrongly labeled. Large temporal differences between the segmentation and the ground truth are thus mainly due to labeling errors and not to segmentation errors.

Date     Δstart   Δend   Δ<1s start (%)  Δ<1s end (%)  Δ<10s start (%)  Δ<10s end (%)  label (%)
07/10    00:10    06:20       84.6            61.5           92.3             84.6         92.9
07/11    04:58    09:26       57.1            64.3           71.4             71.4         87.5
07/12    00:19    05:11       71.4            78.6           85.7             78.6         87.5
07/13    00:06    06:32       84.6            69.2           84.6             76.9         92.9
07/14    01:47    12:43       77.8            66.7           77.8             66.7         90
07/15    03:50    23:05       84.6            76.9           92.3             92.3         92.9
07/16    03:12    03:56       42.9            50             71.4             57.1         93.3

Table 2. Per-day results over one week.

Table 3 presents the results for the One day memory configuration over three weeks, by program genre. Three genres are studied here: news, games, and evening programs, which are generally feature films, series or documentaries. The frame-exact column represents the percentage of programs perfectly segmented, at the exact frame, both at start and end points. Δstart and Δend are the temporal differences with the ground truth, in seconds. News is the easiest genre, with 90% of newscasts perfectly segmented over three weeks. Concerning the evening programs and games, errors are generally small, and mainly due to trailers or sponsorships that are not detected. A few large temporal errors exist, due to bad labeling, followed by inappropriate merging by the method of Section 6.

Genre     frame-exact   Δstart   Δend
News         90%         0.2 s    0.03 s
Evening      38.9%       2.7 s    3.6 s
Games        44.9%       0.8 s    3.6 s

Table 3. Detailed results by genre over three weeks.

The system is also robust to exceptional or unexpected events. The death of a famous French actor on July 30th made the channel change its evening programs. In that case, the programs are perfectly segmented but wrongly labeled. The same thing happens in the middle of a children's program, interrupted by a news flash that is correctly segmented but wrongly labeled.

8 Conclusion

A system for automatically extracting programs from a television stream using content-based methods has been presented. It is able to perform online an accurate temporal segmentation of TV programs, six times faster than real time. The system does not need prior learning. It uses the information coming from the EPG to label the programs. Experiments have been performed on three weeks of television and show good results, with more than 60% of TV programs segmented at a temporal resolution of less than 1 second.

References

[1] S.-A. Berrani, G. Manson, and P. Lechat. A non-supervised approach for repeated sequence detection in TV broadcast streams. Signal Processing: Image Communication, 23(7):525–537, 2008.

[2] M. Covell, S. Baluja, and M. Fink. Advertisement detection and replacement using acoustic and visual repetition. In IEEE International Workshop on Multimedia Signal Processing, pages 461–466, Victoria, BC, Canada, 2006.

[3] E. El-Khoury, C. Senac, and P. Joly. Unsupervised TV program boundaries detection based on audiovisual features. In International Conference on Visual Information Engineering, pages 498–503, Xi'an, China, 2008.

[4] A. Hanjalic. Shot-boundary detection: unraveled and resolved? IEEE Transactions on Circuits and Systems for Video Technology, 12(2):90–105, 2002.

[5] L. Liang, H. Lu, X. Xue, and Y.-P. Tan. Program segmentation for TV videos. In International Symposium on Circuits and Systems, pages 1549–1552, Vol. 2, 2005.

[6] R. Lienhart, C. Kuhmünch, and W. Effelsberg. On the detection and recognition of television commercials. In International Conference on Multimedia Computing and Systems, page 509, Ottawa, Canada, 1997.

[7] G. Manson and S.-A. Berrani. An inductive logic programming-based approach for TV stream segment classification. In International Symposium on Multimedia, pages 130–135, Berkeley, USA, 2008.

[8] X. Naturel, G. Gravier, and P. Gros. Fast structuring of large television streams using program guides. In International Workshop on Adaptive Multimedia Retrieval, pages 222–231, Geneva, Switzerland, 2006.

[9] X. Naturel and P. Gros. Detecting repeats for video structuring. Multimedia Tools and Applications, 38(2):233–252, 2008.

[10] J.-P. Poli. An automatic television stream structuring system for television archives holders. Multimedia Systems, 14(5):255–275, 2008.

[11] X. Wang, X. Li, Y. Qian, Y. Yang, and S. Lin. Break-segment detection and recognition in broadcasting video/audio based on C/S architecture. In Studies in Computational Intelligence, volume 214, pages 45–51, 2009.

[12] Z. Zeng, S. Zhang, H. Zheng, and W. Yang. Program segmentation in a television stream using acoustic cues. In International Conference on Audio, Language and Image Processing, pages 748–752, Shanghai, China, 2008.
