

Temporal video segmentation and classification of edit effects

Sarah Porter*, Majid Mirmehdi, Barry Thomas

Department of Computer Science, University of Bristol, Bristol BS8 1UB, UK

* Corresponding author. E-mail address: [email protected] (S. Porter).

Accepted 13 August 2003

Abstract

The process of shot break detection is a fundamental component in automatic video indexing, editing and archiving. This paper introduces

a novel approach to the detection and classification of shot transitions in video sequences including cuts, fades and dissolves. It uses the

average inter-frame correlation coefficient and block-based motion estimation to track image blocks through the video sequence and to

distinguish changes caused by shot transitions from those caused by camera and object motion. We present a number of experiments in which

we achieve better results compared with two established techniques.

© 2003 Elsevier B.V. All rights reserved.

Keywords: Shot transitions; Shot break detection; Shot classification; Video segmentation

1. Introduction

Indexing and annotating large quantities of film and

video material is becoming an increasing problem throughout

the media industry, particularly where archived material

is concerned. Manually indexing video content is currently

the most accurate method but it is a very time consuming

process. Considerable amounts of archived data remain

unindexed which often leads to the production of new film

instead of the reutilisation of existing material. An efficient

video indexing technique is to temporally segment a

sequence into shots, where a shot is defined as a sequence

of frames captured from a single camera operation, and then

select representative key-frames to create an indexed

database. Hence, a small subset of frames can be used to

retrieve information from the video and enable content-based

video browsing.

There are two different types of transitions that can occur

between shots: abrupt (discontinuous) shot transitions also

referred to as cuts; or gradual (continuous) shot transitions

such as fades, dissolves and pushes or wipes. A cut is an

instantaneous change from one shot to another. During a

fade, a shot gradually appears from, or disappears to, a

constant image. A dissolve occurs when the first shot fades

out whilst the second shot fades in. There are hundreds of

different pushes or wipes, and all are considered to be

special transitional effects [1]. One example is when the new

shot pushes the last shot off the screen to the left, right, up or

down. In general during a wipe, the new shot is revealed by

a moving boundary in the form of a line or pattern.

Shots unified by a common locale or event are grouped

together into scenes. Although the cut is the simplest, most

common way of moving from one shot to the next, gradual

transitions are often used at scene boundaries to emphasise

the change in content of the sequence [2]. Hence, detecting

gradual transitions is particularly important for the identification

of key-frames. The most common edit effects used

in video sequences are cuts, fades and dissolves [1]. In fact,

the data set used in this paper contains 450 cuts, 79 fade-ins,

74 fade-outs, 114 dissolves and only 5 wipes over a total of

21580 frames.

In the case of shot cuts, the content change is usually

large and easier to detect than the content change during a

gradual transition [3,4]. Fig. 1 shows a sequence of four

consecutive frames with a shot cut occurring between the

second and third frame. The significant inter-frame

difference during the shot cut is clearly shown. In contrast,

Fig. 2 shows six frames during a dissolve and illustrates that

the inter-frame difference during a gradual transition is

small. Indeed, the content changes caused by camera

operations, such as pans, tilts or zooms, and object movement

can be of the same magnitude as those caused by

gradual transitions. This makes it difficult to differentiate


between changes caused by a continuous edit effect and

those caused by object and camera motion without also

incurring a large number of false positives. A comparison of

recent algorithms shows that the false positive rate when

detecting dissolves is usually unacceptably high, indicating

that reliable dissolve detection is still an unsolved problem

[3]. In this paper, we introduce a novel approach for the

detection and classification of the most commonly occurring

shot transitions: cuts, fades and dissolves.

Section 2 presents a brief overview of previous

approaches to video segmentation. In Section 3 we propose

a method designed explicitly to detect shot cuts using block-based

motion compensation. Normalised correlation

implemented in the frequency domain is used to estimate

the motion for each block. In Section 4, we extend the

algorithm for shot cut detection to detect fades and

dissolves. The proposed method uses block tracking to

differentiate between changes caused by gradual effects and

those caused by object and camera motion and it has been

designed to handle some of the shortcomings of previous

methods [5,6]. Experimental results confirming the validity

of the approach are presented and discussed in Section 5.

2. Previous work

Most of the existing methods for shot cut detection use

some inter-frame difference metric applied to various

features related to the visual content of a video. A frame

pair where this difference is greater than some predefined

threshold is considered to contain a shot cut. For each

selected feature, a number of suitable metrics can be

applied. In this section, we only outline the main

contributions but good summaries and comparisons of

features and metrics used for video segmentation with

respect to the quality of results obtained can be found in

Fig. 1. Four consecutive frames containing a shot cut between the second and third frame.

Fig. 2. Frames from a dissolve which occurs over 25 frames.


several references [2–4,7–9]. Arguably, the simplest of

these methods is pair-wise pixel comparison, or frame

differencing. This compares the corresponding pixels

between two frames to determine how many have changed

[6]. The drawback of frame differencing is that it is sensitive

to small camera and object motions which can lead to a high

number of false positives [9]. Zhang et al. reduced the

effects of motion and noise by firstly applying a 3 × 3

averaging filter [6]. They also suggested a more robust

method based on dividing a frame into regions and

comparing corresponding regions instead of individual

pixels. Regions were compared on the basis of second-order

statistical characteristics of their intensity values using

the likelihood ratio as the disparity metric [10]. This method

is less sensitive to local object motion but can still lead to

false positives in the presence of large object and camera

motions. A potential problem with the likelihood ratio is

that two regions may have the same mean and variance but

completely different probability functions. In such a case, a

shot cut would be missed.

Instead of comparing features based on individual pixels

or regions, features of the entire image have been used for

comparison to reduce further the sensitivity to object and

camera motion. For example, the average intensity measure

compares the average of the intensity values for each frame

pair [11]. Another global feature which has been used is a

histogram of intensity values [2,4,6,12]. An intensity

histogram of an image describes the distribution of the

intensities while ignoring the spatial distribution of pixels

within an image. The basic idea of using histograms is that

two frames with unchanging background and unchanging

objects will have similar intensity distributions even in the

presence of object and camera motion. A shot cut is detected

if the bin-wise difference between the histograms of two

consecutive frames exceeds some threshold. A disadvantage

of histogram-based (HB) methods is that a shot cut may be

missed between two shots with similar intensity distributions

but different content. To overcome this, Nagasaka

and Tanaka proposed dividing each frame into 16 blocks

and computing the difference between local histograms

[12]. They also removed the eight largest differences before

computing a single difference metric. This way, the method

is still robust to local motions within a region, and by

discarding the largest differences the method becomes less

sensitive to large object and camera motions. One drawback

is that this method may miss a cut between two shots with

similar backgrounds because the blocks that have changed

are the very ones being removed. They also compared

several different statistics on grey-level and colour histograms

and found the best performance was obtained by

using the χ² test to compare colour histograms [13]. Each

pixel is represented by a colour code obtained by merging

the two most significant bits of each RGB component. This

also helped reduce the effect of changes in the luminance.

HB methods are the most common approach to shot cut

detection in use today, since they offer a good trade-off

between accuracy and computational efficiency [2,4,6,12].

All of the previously mentioned algorithms have been

devised for shot cut detection only. The difference between

a frame pair during a gradual transition is much smaller than

the difference that occurs during a shot cut. Lowering the

threshold to detect such small differences may result in

many false detections due to the differences caused by

camera and object motion. Zhang et al. proposed a twin

comparison technique comparing the histogram difference

with two thresholds [6]. A lower threshold was used to

detect small differences that occur for the duration of the

gradual transition while a higher threshold was used in the

detection of shot cuts and gradual transitions. This method

can fail when camera operations such as pans generate a

change in the colour distribution similar to that caused by a

gradual transition. To overcome this, they suggested

analysing the motion between frames to identify camera

operations such as pans, tilts and zooms. Where this type of

motion is identified the gradual transition is assumed to be

false to reduce the number of false positives. However, this

means that gradual transitions containing object or camera

motions will not be detected.

Motion-based algorithms (MB) have been proposed to

distinguish between changes caused by motion and those

caused by an edit effect [2,4,14]. Shahraray noted that

block-based comparison is usually performed by superimposing

each block of the first image on exactly the same

location of the second image [6,12,14]. It was suggested that

a more robust measure can be obtained by motion

compensating blocks prior to calculating the block-wise

difference metrics. Shahraray used a weighted sum of the

motion-compensated pixel differences as the disparity

metric [14]. If this difference measure exceeds some

threshold a shot cut is detected. Gradual transitions are

detected by locating a sustained small increase in the

difference metric. Such methods may detect false positives

if there exist several blocks with poor matches as a result of

multiple motions within a block or in the presence of

motions that violate the translational model, while the

majority of blocks match well. To overcome the problem of

such outliers, Shahraray used an order statistic filter to

combine the disparity metrics of all the blocks [14]. In this

method, the values for each block are sorted in ascending

order and a weighted sum is computed where the

coefficients for each match value are assigned according

to the position of the value in the sorted list. While this may

reduce the chances of detecting shot transitions between

scenes with shared backgrounds, by choosing the coefficients

properly it reduces the number of false detections in

the presence of several extremely bad matches compared to

the use of a linear combination of the similarity metrics.
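
To make the idea concrete, the following is a minimal sketch of such an order-statistic combination, written in Python with NumPy. The weighting scheme, which simply zero-weights the worst quarter of the sorted per-block disparities, is an illustrative assumption rather than Shahraray's published coefficients.

import numpy as np

def order_statistic_disparity(block_disparities, discard_frac=0.25):
    # Sort the per-block disparity values in ascending order and combine
    # them with position-dependent weights; zeroing the weights of the
    # largest values discards outlier blocks (occlusion, motion violating
    # the translational model) so that a few very bad matches cannot
    # dominate the frame-level disparity measure.
    sorted_vals = np.sort(np.asarray(block_disparities, dtype=float))
    weights = np.ones_like(sorted_vals)
    n_discard = int(len(sorted_vals) * discard_frac)
    if n_discard:
        weights[-n_discard:] = 0.0
    weights /= weights.sum()
    return float(np.dot(weights, sorted_vals))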

Lupatini et al. also evaluated the motion-compensated

difference values of each block [4]. They noted that since

the difference values are obtained by a pixel–pixel

comparison the method can still be highly sensitive to


local motions. In order to overcome this problem, instead of

considering pixel differences, the average of the luminance

function was evaluated for each block. Instead of computing

a motion compensated disparity metric, Akutsu et al.

analysed the motion field measured between two frames

to detect shot cuts [15]. The disparity value between two

consecutive frames was then computed as the inverse of

motion smoothness.

Zabih et al. proposed another method to detect abrupt and

gradual transitions by checking the spatial distribution of

exiting and entering edge pixels [5]. Transitions were

identified by examining the relative values of the entering

and exiting edge percentages. To make the algorithm more

robust with respect to camera motions the two frames were

registered before comparison. However, despite the global

motion compensation this method is still sensitive to object

motion and camera motions such as zooms. Lienhart

proposed two algorithms, one to detect fades and the other

dissolves [3]. The first method detects fades by examining

the standard deviation of pixel intensities, which exhibits a

characteristic pattern during a fade. The second method is

based on the idea that there is a loss of contrast in an image

during a dissolve. The author described an edge-based

contrast feature, which emphasises the loss in contrast to

enable dissolve detection. One further approach for

detecting specifically dissolves is to monitor the temporal

quadratic behaviour of the variance of the pixel intensities

which was first proposed by Alattar, but has been modified

by other authors [16,17]. During a dissolve the intensity

variance starts to decrease at the beginning of a transition,

reaches its minimum in the middle and starts to increase

towards the end of the transition. Hence, the transition is

detected by locating this pattern in a series of variances.

However, a problem of this approach is that the pattern is

not sufficiently pronounced due to noise and motion in the

video [8].

The majority of the existing methods for detecting shot

transitions are weakened by camera and object motion and

sudden changes in the mean intensity within a shot. Yusoff

et al. proposed a method for shot cut detection which uses a

combination of multiple experts [18]. The experts themselves

are stand-alone methods to detect shot cuts like those

mentioned above. However, they suggested that because

each method performs well in different circumstances,

significantly better results can be obtained by combining

these methods, as opposed to using each on their own.

Several authors have also proposed methods to detect the

type of motions that cause a sustained increase in the

disparity metric similar to that caused by a gradual transition

to reduce the number of false detections [6,14]. However, as

mentioned earlier these methods can then fail to detect

gradual transitions in the presence of such motion before,

during or after the transition. The conclusion must be that a

shot transition detection method is still required that is

robust in the presence of camera and object motion and

changes in the global illumination.

3. Shot cut detection

We propose an MB method to identify shot cuts which deals

inherently with object and camera motion. It uses block-matching

motion compensation to generate an inter-frame

difference metric. For each block in frame $n$, the best match

in a neighbourhood around the corresponding block in

frame $n+1$ is sought. This is achieved by calculating the

normalised correlation between blocks and locating the

maximum correlation coefficient. Calculating the normalised

correlation in the spatial domain is, however,

prohibitively expensive unless the blocks are small.

Hence, we perform normalised correlations in the frequency

domain [19] defined by:

$$\rho(\xi) = \frac{\mathcal{F}^{-1}\{\hat{x}_1(\omega)\,\hat{x}_2^{*}(\omega)\}}{\sqrt{\int \lvert \hat{x}_1(\omega)\rvert^{2}\,d\omega \int \lvert \hat{x}_2(\omega)\rvert^{2}\,d\omega}} \qquad (1)$$

where $\xi$ and $\omega$ are the spatial and spatial frequency

coordinate vectors, respectively, $\hat{x}_i(\omega)$ denotes the Fourier

transform of block $x_i(\xi)$, $\mathcal{F}^{-1}$ denotes the inverse Fourier

operator and $*$ is the complex conjugate. A high-pass filter

is applied to each image before performing the correlations

to accentuate the contributions from higher spatial frequencies,

since a correlation field derived from high-pass regions

will contain more detectable peaks. Correlation fields

derived from low-pass regions will result in a flat correlation

field leading to inaccurate peak detection [20]. For this

reason, blocks with insufficient energy are not used. A

consequence of applying the high-pass filter is that the mean

of the image is removed. Hence, the correlation between

blocks is invariant to changes in the mean intensity. By

normalising the correlation, the method is insensitive to a

positive scaling of the image intensities. Most of the

previous methods for shot cut detection may falsely detect a

shot cut where sudden intensity changes occur within a shot,

for example, where there is a change in the lighting

conditions. By applying a high-pass filter and performing

normalised correlation our method is robust to changes in

the global illumination.
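
The following is a minimal Python/NumPy sketch of the frequency-domain normalised correlation of Eq. (1). Mean removal is used here as a crude stand-in for the high-pass filtering step, which is an illustrative simplification; the normalisation and scaling follow the discrete form of Eq. (1) so that correlating a block with itself yields a peak of exactly 1.

import numpy as np

def normalised_correlation_field(block1, block2):
    # Frequency-domain normalised correlation between two blocks (Eq. (1)).
    # Subtracting the mean makes the result invariant to changes in mean
    # intensity; the energy normalisation makes it insensitive to a
    # positive scaling of the intensities.
    x1 = block1.astype(float) - block1.mean()
    x2 = block2.astype(float) - block2.mean()
    X1, X2 = np.fft.fft2(x1), np.fft.fft2(x2)
    num = np.fft.ifft2(X1 * np.conj(X2)).real       # F^{-1}{x1_hat . x2_hat*}
    den = np.sqrt((np.abs(X1) ** 2).sum() * (np.abs(X2) ** 2).sum())
    if den == 0.0:                                  # blocks with no energy
        return np.zeros_like(num)
    # The factor x1.size compensates for the 1/N convention in ifft2.
    return num * x1.size / den

The goodness-of-fit for a block pair is then the maximum of this field, e.g. normalised_correlation_field(b1, b2).max().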

The location of the maximum correlation coefficient is

used to find the offset of each block in frame $n+1$ from its

position in frame $n$. Previous approaches use the estimated

motion vectors to calculate the motion-compensated frame

difference [14]. In contrast, our approach uses only the value

of the maximum correlation coefficient as a goodness-of-fit

measure for each block. The value of the goodness-of-fit

measure lies between 0 and 1, where a value of 0 indicates a

complete mismatch and a value of 1 indicates a perfect

match. Fig. 3(a) shows the correlation field which resulted

from the correlation between two blocks from a frame pair

within the same shot and Fig. 3(b) shows the correlation

field from a frame pair containing a shot cut. Between two

frames belonging to the same shot, the goodness-of-fit for

the majority of the blocks should be close to 1.0, indicating


a good match. A high number of poor matches should

suggest the presence of a shot cut.

A similarity metric for each frame pair is derived by

combining the goodness-of-fit measures of all the blocks.

Initially, the mean m of the goodness-of-fit measures was

computed, defined as

$$m = \frac{1}{B} \sum_{i=1}^{B} p_i \qquad (2)$$

where $p_i = \max(\rho(\xi))$ for block $i$, and $B$ is the total number

of blocks. However, the linear combination of the goodness-of-fit

measures has the clear disadvantage of averaging very

high match values with low ones to generate mediocre

values. This is not a good approach since mismatches can

occur during a shot due to occlusion, objects entering or

leaving the image or data that violates the 2D translational

model, while the majority of blocks match well. To prevent

these outliers negatively influencing the similarity metric

for a frame pair, a more satisfactory measure can be

obtained by using the median of the goodness-of-fit

measures. Therefore, a similarity metric $M_n$ for a frame

pair $n$ and $n+1$ is defined as

$$M_n = \mathrm{median}\{p_i\}. \qquad (3)$$

Given $\bar{M}$ as the average of the previous similarity measures

since the last shot cut, defined as

$$\bar{M} = \frac{1}{n-1} \sum_{i=1}^{n-1} M_i \qquad (4)$$

then a shot cut is detected if $\bar{M} - M_n > T_c$, i.e. if the rate of

change from the average similarity measure is greater than

some threshold $T_c$. Fig. 4 shows a plot of $M_n$ for three

different video sequences. It can be seen that during a shot

$M_n$ remains high (close to 1). On the other hand, a shot cut

manifests itself as a sudden decrease in $M_n$.
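
A sketch of the resulting cut test, combining Eqs. (2)-(4), is given below. The threshold value and the resetting of the similarity history after a detected cut are assumptions for illustration only.

import numpy as np

def update_cut_detector(block_peaks, history, t_c=0.4):
    # block_peaks: maximum correlation coefficient of each block between
    # frames n and n+1; history: the M_i values since the last cut;
    # t_c: cut threshold (an illustrative value, not taken from the paper).
    m_n = float(np.median(block_peaks))                   # Eq. (3)
    m_bar = float(np.mean(history)) if history else m_n   # Eq. (4)
    is_cut = (m_bar - m_n) > t_c                          # cut test
    if is_cut:
        history.clear()       # assumed: restart the average after a cut
    else:
        history.append(m_n)
    return is_cut, m_n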

The choice of optimal block size is an ill-defined

problem. A large block is more likely to invalidate the

model of a single translational motion per block whereas a

small block is less likely to contain enough intensity

variation, which makes it difficult to measure the motion

accurately. In this work a block size of 32 × 32 was chosen

Fig. 3. Correlation fields resulting from normalised correlation between

corresponding blocks within a frame pair. (a) Frame pair belonging to the same

shot; (b) frame pair containing a shot cut.

Fig. 4. Similarity metric $M_n$ for three different video sequences. (a) advert

(b) holiday video (c) film trailer.


empirically as it gives an acceptable trade-off between

accuracy and the resolution of the motion field (using

images with typical dimensions 256 × 256).

4. Detecting fades and dissolves

The method described above responds well to shot cuts.

In this section we describe the extension of this method to

also detect fades and dissolves. As in shot cut detection,

most of the previous approaches to detecting gradual

transitions are also sensitive to object and camera motions.

We propose a method that can differentiate between

changes caused by a gradual transition from those caused

by camera and object motion.

The shot cut detection method described in Section 3 can

be straightforwardly extended to detect fades. The end of a

fade-out and the start of a fade-in are marked by a constant

image. A constant image contains very little, if any, high-pass

energy. Therefore, correlation of an image with a

constant image results in $M_n = 0$, which can be used to

identify the end of a fade-out and the start of a fade-in.

However, gradual transitions occur over a number of frames

so knowledge of the boundaries of the edit effect is required.

A fade is a scaling of the pixel intensities over time which

can be observed in the standard deviation of the pixel

intensities [3]. If $M_n$ falls to 0 and the standard deviation of

the pixel intensities decreased prior to this, the frame where

the standard deviation started to decrease is marked as the

first frame of the fade-out. The decrease in the standard

deviation must have occurred over more than two frames to

distinguish fade-outs from a shot cut to a constant image.

Similarly, if the standard deviation of the pixel intensities

increases after the similarity metric increases from 0, the

frame where the standard deviation becomes constant is

marked as the end of the fade-in. Again, this must have

occurred over more than two frames. Initially, to compute

where the standard deviation becomes constant after a fade-in,

the standard deviation of frame $n$, denoted $\sigma_n$, was

compared to the standard deviation of frame $n-1$, $\sigma_{n-1}$. If

$\sigma_n \le \sigma_{n-1}$ then the end of the fade-in was marked.

However, we observed that often the scaling factor is not

altered for every frame but only every other frame.

Therefore, the end of a fade-in would be marked too

early. Hence, the end of the fade-in was marked when

$$\sigma_n \le \frac{\sigma_{n-1} + \sigma_{n-2}}{2}.$$

A similar comparison is used to detect the start of a fade-out.
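
As a sketch, the end-of-fade-in test can be written as follows, assuming a list of per-frame intensity standard deviations is available.

def fade_in_has_ended(sigma, n):
    # sigma: standard deviation of pixel intensities, indexed by frame.
    # The end of a fade-in is marked at frame n once the standard
    # deviation stops rising relative to the average of the previous two
    # frames; averaging over two frames tolerates fades whose scaling
    # factor only changes every other frame.
    return n >= 2 and sigma[n] <= (sigma[n - 1] + sigma[n - 2]) / 2.0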

Extending this method to detect dissolves is somewhat

more involved. The difference between each frame pair

during a dissolve is so small that $M_n$ does not indicate that a

dissolve has occurred. We divide the first frame of a

sequence into a regular grid of blocks of size 32 × 32.

A selection of these blocks is then used to represent

regions of interest (ROI) in the image. A block is selected as

a ROI if

$$\sigma_b^2 > \frac{\sigma_I^2}{\ln(\sigma_I^2)} \qquad (5)$$

where $\sigma_b^2$ is the variance of a block $b$ and $\sigma_I^2$ is the

variance of the image $I$. This is to prevent all of the blocks in

an image with low variance being selected as ROI. Fig. 5

shows the first frame of two shots and their selected ROI

highlighted in white.
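
A minimal sketch of the ROI selection of Eq. (5) is shown below; the guard against an image variance of at most 1, where the logarithm would be non-positive, is an added assumption.

import numpy as np

def select_roi(frame, block_size=32):
    # Mark a block as a region of interest if its variance exceeds
    # var(I) / ln(var(I)), as in Eq. (5).
    frame = frame.astype(float)
    var_i = frame.var()
    thresh = var_i / np.log(var_i) if var_i > 1.0 else var_i  # assumed guard
    rois = []
    h, w = frame.shape
    for y in range(0, h - block_size + 1, block_size):
        for x in range(0, w - block_size + 1, block_size):
            if frame[y:y + block_size, x:x + block_size].var() > thresh:
                rois.append((y, x))
    return rois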

In Section 3, the method for shot cut detection discarded

the motion vector estimated from block matching, using only

the correlation peak value. However, motion estimation

between frame pairs is now used to track blocks over time

in the video sequence. Between each frame pair n and

$n+1$, $M_n$ is still computed to detect shot cuts. In addition,

each ROI is correlated with its new location in frame

$n+1$, $n+2$, etc., as shown in Fig. 6(a–c), until the end of

the next edit effect or until the block is removed. The value

of the correlation peak, $\max(\rho(\xi))$, is used as a goodness-of-fit

measure for each ROI over time. A single similarity

metric $F_n$ for the set of ROI is calculated by taking the

median of the goodness-of-fit measures for all the ROI to

reduce the effect of outliers.

As mentioned earlier, blocks are tracked through the

shot, irrespective of how far they have moved until the next

edit effect or until they are removed. While tracking, object

or camera motion may cause blocks to become overlapped

as shown in Fig. 6(c). Once this occurs the block tracking is

no longer reliable because block-matching cannot resolve

occlusion. Therefore, blocks that are overlapping or have

begun to move outside the image are removed as shown in

Fig. 6(d). Any removed blocks that were ROI are

also removed from the current set of ROI. This will leave

Fig. 5. Blocks are selected to be regions of interest (ROI) in the first frame of each shot.


areas of the image uncovered, the contents of which will still

need to be tracked. For this reason we try to reintroduce new

blocks in the uncovered areas. This is achieved by

comparing the current positions of the remaining blocks to

a regular spatial grid. Any blocks in this regular grid that are

not covered by the current set of blocks are added as shown

in Fig. 6(e). If a new block satisfies Eq. (5) it is added to the

current set of ROI. Once this is complete it is possible to

continue to track the blocks into the next frame (Fig. 6(f)).

The addition and removal of blocks allows the set of ROI to

be updated for changes due to camera and object motion.

During a shot, $F_n$ should remain high, indicating that the

content of each ROI has not changed significantly. During

a dissolve, the content of each ROI gradually changes and

$F_n$ will decrease until it reaches its lowest value at the end of

the dissolve. During a shot $M_n$ and $F_n$ should be

approximately equivalent. Rather than compare the value

of $F_n$ to a threshold, we want to compare how much it has

changed with respect to $M_n$. Hence, we define the ratio $R_n$ as

$$R_n = \frac{M_n}{F_n}. \qquad (6)$$

If $R_n$ is greater than a threshold $T_D$ then the end of the

dissolve is marked once $R_n$ reaches its maximum. The start

of the dissolve is marked where $R_n$ started to increase.

Fig. 7 shows $F_n$ and $R_n$ during three consecutive

dissolves in a video sequence. It can be seen that $F_n$

decreases during a dissolve and reaches its minimum at

the end. During these three dissolves $M_n$ remained

approximately equal to 1, causing $R_n$ to increase during

each dissolve. The three dissolves are therefore easily

detected.
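
The dissolve test of Eq. (6) amounts to the following sketch; the threshold value is illustrative, not taken from the paper.

def dissolve_ratio(m_n, f_n):
    # R_n = M_n / F_n (Eq. (6)). During a shot the two metrics are roughly
    # equal (R_n near 1); during a dissolve F_n drops while M_n stays high,
    # so R_n rises above the dissolve threshold T_D.
    return m_n / f_n if f_n > 0.0 else float("inf")

# Sketch of the frame loop (T_D = 1.5 is an assumed value):
# for n, (m_n, f_n) in enumerate(zip(M, F)):
#     if dissolve_ratio(m_n, f_n) > 1.5:
#         pass  # mark the end of the dissolve at the local maximum of
#               # R_n, and the start where R_n began to increase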

After every detected shot transition, the first frame of the

next shot is divided into a regular grid and a new set of ROI

are selected to be tracked for the detection of the next edit

effect.

5. Comparative results

To evaluate the performance of this algorithm it was

compared with the performance of two other methods, one

HB and one feature-based. The HB method was chosen

since it is a well-established technique and has been shown

to perform well in detecting edit effects [4]. The feature-based

(FB) method was chosen for its ability to detect

gradual transitions, although its performance was reported

on limited test sequences [5,21].

The HB method used is similar to the method with the

best performance in the comparative investigation by

Lupatini et al. [4]. This approach uses the χ² value to

define the difference between two global colour histograms

which is compared against two thresholds, $T_H$ and $T_L$.

Whenever the histogram difference between two consecutive

frames is greater than $T_H$, a shot cut is detected. If the

difference lies between the two thresholds the frame is

marked as the potential start of a gradual transition.

Successive frames are then compared with the first frame

of the transition and if the difference exceeds $T_H$, a gradual

transition is detected. The end of the gradual transition is

marked once the difference between frame pairs drops

below $T_L$ for two frame pairs.
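
For reference, one common form of the χ² histogram difference used by such HB baselines can be sketched as follows; the resulting value is what would be compared against $T_H$ and $T_L$.

import numpy as np

def chi_square_difference(h1, h2):
    # Chi-square difference between two colour histograms; bins that are
    # empty in both histograms contribute nothing. Other variants of the
    # chi-square statistic exist, so this form is an assumption.
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    denom = h1 + h2
    mask = denom > 0
    return float((((h1 - h2)[mask] ** 2) / denom[mask]).sum())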

The FB method used was by Zabih et al. [5,21] and the

code for this algorithm has been made available allowing

exactly the same implementation to be used. The approach

is based on the idea that during a cut or a dissolve, new

edges appear far from the locations of disappearing, older

edges. By comparing the relative values of entering and

exiting edge pixels the method classifies cuts, fades and

Fig. 6. (a–c) Blocks are tracked over time and may become overlapped, (d)

overlapped blocks removed, (e) blocks added in uncovered area, (f) blocks

continue to be tracked.

Fig. 7. Feature similarity metric $F_n$ and ratio $R_n$ during three consecutive

dissolves.


dissolves. A registration technique is used to compensate for

global motion between two frames. To compensate for

small object motions edge pixels in one image within a

small distance of edge pixels in the other image are not

considered to be entering or exiting edges.

To test these methods 10 different movie trailers were

used. These were found to be a good source of data since

they tend to contain many shot transitions over a short

sequence. The locations and types of these transitions were

hand-labelled for comparison. The distribution of shot

transitions in the complete set of test data can be seen in

Table 1.

For each algorithm we experimented with a single

training sequence and performed an exhaustive search to

select the parameter values which gave the best performance.

Two parameters often used to compare the performance

of shot boundary detection methods are recall and

precision [2], defined as

$$\mathrm{Recall} = \frac{N_C}{N_C + N_M}, \qquad \mathrm{Precision} = \frac{N_C}{N_C + N_F} \qquad (7)$$

where $N_C$, $N_M$, and $N_F$ are the number of correctly detected,

the number of missed, and the number of falsely detected

shot transitions, respectively. In other words, recall is the

percentage of true transitions detected, and precision is the

percentage of detected transitions that are actually correct.
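
As a worked check of Eq. (7):

def recall_precision(n_correct, n_missed, n_false):
    # Recall and precision as defined in Eq. (7).
    recall = n_correct / (n_correct + n_missed)
    precision = n_correct / (n_correct + n_false)
    return recall, precision

# With the MB shot-cut counts from Table 2:
# recall_precision(410, 40, 48) -> (0.911..., 0.895...),
# matching the 91% and 90% figures reported in Table 4.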

For each method, the recall and precision values were

computed for every parameter set tried. Assuming recall and

precision are equally important, the threshold values, which

gave the greatest linear combination of recall and precision

were chosen. Each method was then run over the complete

data set using its chosen parameter set. For a robust

algorithm the selected thresholds should generalise well to

other sequences particularly as the training sequence and the

test sequences are all of the same type (film trailers).

The proposed motion-based approach and the HB

approach both require two parameter values. The feature-based

approach requires five main parameters, which would

require searching a large set of possible parameter values.

However, the authors report that although the algorithm has

several parameters that control its performance they

achieved good performance from a single set of values for

three of these parameters across all the sequences they

tested (they do not report the values used for the remaining

two) [5]. They state that their algorithm’s performance does

not depend critically upon the precise values of these

parameters, but report the values they found to give the best

performance. Therefore, these values are used in

the experiments and a search was performed only for values

for the remaining two thresholds.

A novel aspect of our method is its ability to classify

transitions into cuts, fade-ins, fade-outs and dissolves.

Therefore, in our experiments we are not only concerned

with the detection of transitions but also their correct

classification. If a shot transition was detected, but not

classified correctly, it was considered a false detection and

the actual edit effect was labelled as undetected. What

‘classify correctly’ means is relative to each algorithm’s

ability to classify shot transitions. Our method must classify

each effect (cuts, fade-ins, fade-outs and dissolves)

correctly.

The FB method attempts to classify edit effects into cuts,

fades and dissolves. Therefore, if this algorithm classified a

fade-in or a fade-out as a fade, it was considered a correct

classification. The HB method only distinguishes between

cuts and gradual transitions. Thus, it must classify cuts

correctly, but if it classifies a fade-in, fade-out or dissolve as

a gradual transition this is also considered a correct

classification. Only if it detected, for example, several

shot cuts during a dissolve is the dissolve considered

undetected and each shot cut is considered a false detection.

The reason for not using a comparison based simply on

detection rather than correct classification is related to the

accuracy of the boundaries of the detected shot transitions.

A shot cut only occurs between two frames whereas a

gradual transition occurs over a number of frames. If an

algorithm declares a gradual transition where there exists a

shot cut, which sometimes happens due to the presence of

motion before and after a cut, or it declares several shot cuts

during a gradual transition, which can often be the result of a

rapid transition, then the precision of the detected transition

boundaries will be poor. In fact, in the comparative study by

Lupatini et al. they consider a transition (cut or gradual) to

be correctly detected if at least one of its frames has been

detected as a shot transition [4]. However, they also define

two more parameters to evaluate the precision of the

detected boundaries and note that they frequently assume

very low values. Also, if an algorithm detects several shot

cuts during a gradual transition it will obviously result in a

high number of false detections. In the comparative study by

Boreczky and Rowe a gradual transition was correctly

detected if any of the frames of the transition was marked as

a shot boundary [2]. To reduce the number of false positives

during gradual transitions they did not penalise an algorithm

for reporting multiple consecutive frames as shot cuts.

However, if, for example, an algorithm marked every other

frame of a gradual transition as a shot boundary, the first

would be a correct detection and the remainder would be

false positives. An algorithm must be able to distinguish

between cuts and gradual transitions to improve the

precision of the detected boundaries and to reduce the

number of false positives during gradual transitions.

The performance of our MB method compared with those of the

HB and FB methods for shot cut detection only can be seen

Table 1

Number and types of edit effects contained within the complete test data set

Cuts Fade-ins Fade-outs Dissolves

450 79 74 114


in Table 2. A comparison of the performance of the

algorithms for the detection of gradual transitions can be

seen in Table 3. In these two tables we report the total results

across all 10 sequences in the data set. It should be noted

again that while we classify gradual transitions into fade-ins,

fade-outs, and dissolves, FB only classifies into fades

and dissolves and HB does not make a distinction at all. For

these experiments, if an edit effect was detected but

classified incorrectly (according to each method), it was

considered a false detection and the actual edit effect was

labelled as undetected. From Tables 2 and 3 it is clear that

our proposed method performs better compared with the

other two techniques over our chosen data set.

Table 4 summarises the performance of the algorithms

by comparing the recall and precision of each one (after

combining the results for the gradual transitions for FB and

MB). It also shows the overall performance, which is a

linear combination of the recall and precision value

assuming they are both equally important. FB’s performance

was disappointing as it detected many false gradual

transitions and few of the actual gradual transitions,

reflected by the low precision and recall values (33 and

45%, respectively). There are several reasons for this. The

algorithm compensates only for translational motion. This

means that zooms are a cause of false detections. Also, the

registration technique used only computes the dominant

motion, meaning that multiple object motions within the

frame are another source of false detections. Furthermore, if

there are strong motions before or after a cut, the cut is

typically misclassified as a dissolve and cuts to or from a

constant image are misclassified as fades.

The results for the HB method were a considerable

improvement on the FB approach. The biggest drawback is

that many gradual transitions are misclassified as shot cuts,

resulting in a low recall value for gradual transitions (58%)

and a low precision value for shot cuts (61%). One reason

why the recall values for HB are low is that it misses edit

effects between shots with similar colour distributions.

Another reason is that if a gradual transition is closely

followed by another, then HB often detects this as a single

transition, meaning that the first one is detected and the

second is considered undetected. Finally, another source of

false detections was camera and object motion that created

changes similar to that caused by a gradual transition.

Our proposed motion-based algorithm gives the most

favourable results with high recall and precision values and

the best performance for both cuts and gradual transitions.

In addition, our algorithm is able to distinguish between

fade-ins, fade-outs, and dissolves. The main cause of false

detections of dissolves in our technique was due to the

contents of a ROI changing, not in the presence of a

dissolve, but for example due to motion blur or a light

source that saturates a large part of the image. Also, if a shot

cut is undetected, then the set of ROI are not updated and

they are tracked into the next shot resulting in a

misclassification as a dissolve.

6. Conclusions

We have presented a novel, unified approach that

classifies shot boundaries into cuts, fade-ins, fade-outs and

dissolves. The recall, precision and performance values

show either a significant improvement on other approaches

or are comparable given that all shot transitions are

separately resolved. This was shown experimentally in a

comparative study against two commonly used techniques.

A weakness of our method is that it will track the most

dominant motion if there are multiple motions within a

block. This can cause the contents of a ROI to change and

result in a decrease in Fn leading to a false detection of a

dissolve. Such problems might be improved by using a

multi-resolution model to estimate the motion [20].

Another drawback to our method is the computational

cost. Our approach takes around 2 s to complete a frame

pair. In this time, the FB approach completed four frame

pairs and the HB approach 30 frame pairs. However, we feel

Table 2

Detection and classification of shot cuts for each method over the complete

data set

Detected    MB     HB     FB

$N_C$       410    301    329

$N_M$        40    149    121

$N_F$        48    190    224

Table 3

Detection and classification of gradual transitions for each method over the

complete data set

Detected    MB                                  FB                     HB

            Fade-ins   Fade-outs   Dissolves   Fades    Dissolves    Gradual

$N_C$       64         71          103         66       55           155

$N_M$       15         3           15          87       59           112

$N_F$       1          6           63          86       164          27

Table 4

Recall, precision and performance for each method over all the cuts and

gradual transitions

Parameters (%)    MB                HB                FB

                  Cuts    Gradual   Cuts    Gradual   Cuts    Gradual

Recall            91      88        67      58        73      45

Precision         90      77        61      85        60      33

Performance       90.5    82.5      64      71.5      66.5    39


that the increase in processing time can be justified by the

significant improvement in the quality of results. In fact,

Zhang et al., who proposed the ‘twin comparison’ technique,

suggest using a block matching algorithm to try and

distinguish changes caused by camera movements from

those due to gradual transitions to reduce the number of

false positives [6]. They propose using motion vectors

obtained from a block matching algorithm to try and classify

certain camera operations (panning, tilting, zooming). If

such camera operations are detected during a potential

gradual transition then the transition is ignored. The authors

note the number of false positives is reduced at a cost of an

increase in computational time.

There are advantages in working with the block-matching

algorithm and in the future we plan to attempt

to make use of motion vectors contained in MPEG-

compressed video if present. Although MPEG encoders

optimise for compression and do not necessarily produce

accurate motion vectors, such estimates might be used as an

initial, rough approximation of the location of the best

matching block. If correlation is performed in the frequency

domain, this estimate will help centralise the correlation peak and

improve the goodness-of-fit measure. If correlation is implemented

in the spatial domain, the estimate can be used to reduce

the search space, thereby reducing the amount of

computation required for the correlation.

Although threshold values were chosen that gave the best

performance on a training sequence before applying the

algorithms to the test data, these thresholds might not have

been equally suitable for every sequence. If the performance

of an algorithm is very dependent on the thresholds selected

then we consider this to be a weakness of the algorithm.

However, future work will be carried out to test the

dependency of these algorithms on the threshold values used.

Acknowledgements

The authors would like to thank EPSRC and UBQT

Media Ltd, Bristol for sponsorship of this work.

References

[1] C. Jones, Transitions in video editing, in: B. Hoffman (Ed.), The

Encyclopedia of Educational Technology, San Diego State University,

1994–2003.

[2] J. Boreczky, L. Rowe, Comparison of video shot boundary detection

techniques, in: SPIE Conference on Storage and Retrieval for Image

and Video Databases IV, vol. 2670, 1996, pp. 170–179.

[3] R. Lienhart, Comparison of automatic shot boundary detection

algorithms, in: SPIE Conf. on Storage and Retrieval for Image and

Video Databases VII, vol. 3656, 1999, pp. 290–301.

[4] G. Lupatini, C. Saraceno, R. Leonardi, Scene break detection: a

comparison, in: 8th International Workshop on Research Issues in

Data Engineering, 1998, pp. 34–41.

[5] R. Zabih, J. Miller, K. Mai, A feature-based algorithm for detecting

and classifying scene breaks, in: ACM Multimedia ’95 Proceedings,

ACM Press, New York, 1995.

[6] H. Zhang, A. Kankanhalli, S.W. Smoliar, Automatic partitioning of

full-motion video, Multimedia Systems 1 (1) (1993) 10–28.

[7] R. Lienhart, Reliable transition detection in videos: A survey and a

practitioner’s guide, International Journal of Image and Graphics 1 (3)

(2001) 469–486.

[8] A. Hanjalic, Shot-boundary detection: Unraveled and resolved, IEEE

Transactions on Circuits and Systems for Video Technology 12 (2)

(2002) 90–105.

[9] Y. Yusoff, W. Christmas, J. Kittler, A study on automatic shot

change detection, in: Third European Conference on Multimedia

Applications, Services and Techniques, 1998, pp. 177–189.

[10] R. Kasturi, R. Jain, Computer Vision: Principles, IEEE Computer

Society, Silver Spring, 1991.

[11] A. Hampapur, R. Jain, T. Weymouth, Digital video segmentation, in:

ACM Multimedia ‘94 Proceedings, ACM Press, New York, 1994, pp.

357–364.

[12] A. Nagasaka, Y. Tanaka, Automatic video indexing and full-video

search for object appearances, in: Visual Database Systems, 2, 1992,

pp. 113–127.

[13] J.A. Rice, Mathematical statistics and data analysis, second ed.,

Duxbury Press, North Scituate, 1995.

[14] B. Shahraray, Scene change detection and content-based sampling of

video sequences, in: Digital Video Compression: Algorithms and

Technologies, 2419, 1995, pp. 2–13.

[15] A. Akutsu, Y. Tonomura, H. Hashimoto, Y. Ohba, Video indexing

using motion vectors, in: SPIE Visual Communication and Image

Processing, vol. 1818, 1992, pp. 1522–1530.

[16] A. Alattar, Detecting and compressing dissolve regions in video

sequences with a DVI multimedia image compression algorithm, in:

Proceedings of the IEEE International Symposium on Circuits and

Systems, 1993, pp. 13–16.

[17] W.A.C. Fernando, C.N. Canagarajah, D.R. Bull, Fade and dissolve

detection in uncompressed and compressed video sequences, in:

Proceedings of the IEEE International Conference on Image

Processing, 1999, pp. 299–303.

[18] Y. Yusoff, J. Kittler, W. Christmas, Combining multiple experts for

classifying shot changes in video sequences, in: Proceedings of the

IEEE International Conference on Multimedia Computing and

Systems, vol. 2, 1999, pp. 700–704.

[19] A.D. Calway, H. Knutsson, R. Wilson, Multiresolution estimation of

2-d disparity using a frequency domain approach, in: British Machine

Vision Conference, 1992, pp. 227–236.

[20] S. Kruger, Motion analysis and estimation using multiresolution affine

models, PhD Thesis, Department of Computer Science, University of

Bristol, October 1998.

[21] R. Zabih, J. Miller, K. Mai, A feature-based algorithm for detecting

and classifying production effects, Multimedia Systems 7 (1999)

119–128.
