[IEEE Multimedia Technology (IC-BNMT 2011) - Shenzhen, China (2011.10.28-2011.10.30)] 2011 4th IEEE International Conference on Broadband Network and Multimedia Technology - A modified

Proceedings of IEEE IC-BNMT2011

A MODIFIED SPORTS GENRE CATEGORIZATION FRAMEWORK BASED ON CLOSE-UP VIEW PRE-DETECTION

Jiwei Zhang1, Yuan Dong1, Kun Tao2, Xiaofu Chang2

1 Beijing University of Posts and Telecommunications, Beijing 100876, China

2 France Telecom Orange Labs (Beijing), Beijing, China


Abstract

In this paper, a modified sports genre categorization framework is presented. The close-up view type is detected as domain knowledge before categorization on a large-scale database. Close-up views occupy more than 1/3 of the duration of a sports match, depending on its genre, and appear almost the same across genres, which significantly degrades the performance of sports genre categorization. The presented framework has two levels. In the first level, a skin-tone based human detector is applied to all key-frames to identify the close-up views. The second level is based on the bag-of-words (BOW) model using the Scale Invariant Feature Transform (SIFT) and a Support Vector Machine (SVM), and makes use of the close-ups detected in the first level. In the training part, the codebook is generated without close-ups according to the annotation; in the testing part, the scores of the close-ups pre-detected in the first level are given low weights in the late fusion. Experiments on a dataset of 10 sports genres with 300 hours of video, collected from TV and the Internet to ensure diversity, show that the modified framework improves the robustness and efficiency of sports genre categorization in both TV and Internet applications.

Keywords: Skin-tone, SIFT, Close-up, sports genre categorization

1 Introduction

The rapid growth of multimedia data necessitates powerful indexing and retrieval methods. Sports videos, an important kind of video document, are widely studied because of their high commercial potential. One of the hot topics is sports genre classification, since manual annotation of the ever-increasing volume of sports video is no longer feasible. Many researchers have made meaningful progress on sports genre classification. Gibert et al. proposed a Hidden Markov Model (HMM) based classification using motion and color features [1]. Yuan et al. presented automatic video genre categorization based on a hierarchical SVM using two simple temporal and spatial features [2]. These and other methods [3][4][5][6] all use domain knowledge and have proven effective on small datasets with a limited number of genres. When facing a large dataset and many sports genres, however, choosing appropriate domain knowledge becomes important. Discriminating the view types (long view, middle view, close-up, other) is essential in sports categorization, and the close-up is especially important. The close-up view type can occupy more than 1/3 of a sports match's duration, depending on the genre, as Figure 1 shows. Moreover, it appears almost the same across genres, which makes it both an important cue in sports categorization and a drawback in codebook construction.

Figure 1 The ratio of 4 view-types in 10 sports genres of our database.

Although close-ups can serve as key domain knowledge in sports categorization, state-of-the-art view-type classification methods suffer from limited extensibility and flexibility. Duan et al. proposed a unified framework for semantic view-type classification to support further structural and temporal analysis [7]. Ekin et al. used cinematic and object-based features to analyze soccer view types [6]. Li et al. introduced speeded-up robust features (SURF) with an SVM to identify view types [8]. All of these methods require prior knowledge of the sports genre, which makes it impossible to use their view-type detection as domain knowledge for sports genre categorization.

To process large-scale datasets quickly and efficiently, we propose an automatic sports genre categorization framework in which close-ups are pre-detected as domain knowledge using skin-tone based human detection. First, video frames are converted into robust color spaces, with illumination compensation and morphological processing, to detect close-ups. In the next level, SIFT descriptors are extracted from all frames, and the codebook is constructed with fewer close-ups in the training part. In the late fusion part of the testing procedure, the scores of close-ups are given a low weight. Our framework proves to be computationally efficient and extensible to various genres and to large-scale data. On a dataset consisting of 10 sports genres and 300 hours of video in total, the average precision reaches 87%. Detecting close-ups as domain knowledge is therefore clearly beneficial to sports categorization.

The proposed framework is described in the next section, including robust color space conversion, illumination compensation, and morphological processing as a human detector to detect close-ups, as well as the sports categorization part. Section 3 presents experimental results and discussion. Finally, Section 4 concludes the paper.

___________________________________ 978-1-61284-159-5/11/$26.00 ©2011 IEEE

2 Framework of modified sports categorization

The proposed framework is illustrated in Figure 2. A sports video goes through three processes in sequence. First, in the pre-processing stage, the video is split into shots using shot boundary detection (SBD), and key-frames are extracted to represent the shots they belong to. Next, a close-up detector is applied to separate close-ups from the other view types. Finally, with close-up and non-close-up views distinguished as domain knowledge, codebook-based sports genre categorization is applied: the number of close-up key-frames is reduced when the bag-of-words model is constructed, and in the testing part the classification scores of detected close-ups are given a lower weight in score fusion. Sections 3.2 and 3.3 discuss why these approaches work.

2.1 Pre-processing

Color-based shot boundary detection (SBD) is applied to segment the video sequences, and key-frames are then extracted from each shot to represent the video.

Histogram-based methods are the most common way to detect shot boundaries in video sequences. They offer a good trade-off between accuracy and speed compared with pixel-based and statistical methods. The spatial distance moment (SDM) [13] is used to compute the distance between two images:

$$
D_s(t) = \left( \frac{1}{M \times N} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left| I_t(i,j) - I_{t+1}(i,j) \right|^{p} \right)^{1/p}
\tag{1}
$$

D_s(t) is the distance between frame t and the next frame t+1, where I_t(i,j) is the grey-level value of pixel (i,j) in frame t and all M×N pixels are included in the computation.
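The distance in Eq. (1) and its use for shot boundary detection can be sketched as follows. This is a minimal illustration under assumptions rather than the authors' implementation: frames are plain nested lists of grey-level values, and the exponent `p` and the decision `threshold` are hypothetical parameters that the paper does not specify.

```python
def spatial_distance(frame_a, frame_b, p=2):
    """Eq. (1): distance between two equal-sized grey-level frames."""
    rows, cols = len(frame_a), len(frame_a[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            total += abs(frame_a[i][j] - frame_b[i][j]) ** p
    # Normalise by the pixel count M x N, then take the p-th root.
    return (total / (rows * cols)) ** (1.0 / p)


def detect_shot_boundaries(frames, threshold, p=2):
    """Indices t where D_s(t) between frame t and frame t+1 exceeds the threshold.

    The threshold value is an assumption; it would be tuned on real data.
    """
    return [t for t in range(len(frames) - 1)
            if spatial_distance(frames[t], frames[t + 1], p) > threshold]
```

For example, two nearly identical dark frames followed by a bright frame would yield a single boundary between the second and third frames for any threshold between the two distances.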

Figure 2 The framework of our proposed modified sports genre categorization.

2.2 Close-up detector

Figure 3 gives an overview of the skin-tone based close-up detector. Cheddad et al. [9] and Chai and Ngan [10] both report encouraging results with their color spaces and thresholds. Our framework applies a combination of the two.

Figure 3 Framework of close-up detector. Close-ups will be detected using this skin-tone based human detector.

2.2.1 Image preprocessing

Before RGB is converted to a more robust color space, the original image is pre-processed to suppress compression noise and to compensate for illumination, since lighting conditions in sports footage are complex.

A Gaussian blur filter is used to smooth the image, and two illumination compensation methods are applied: grey-world [12] and the method in [10]. Section 3 analyzes the effect of illumination compensation on the performance of genre categorization.
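As an illustration of the grey-world idea mentioned above, the sketch below rescales each RGB channel so that all three channel means coincide. It assumes the common textbook formulation of grey-world (the paper does not give its exact variant), represents an image as nested lists of `(R, G, B)` tuples, and omits the Gaussian blur step.

```python
def grey_world(image):
    """Scale each RGB channel so its mean equals the average of the three channel means."""
    pixels = [px for row in image for px in row]
    means = [sum(px[c] for px in pixels) / len(pixels) for c in range(3)]
    grey = sum(means) / 3.0
    # A channel that is entirely zero is left unchanged to avoid division by zero.
    gains = [grey / m if m > 0 else 1.0 for m in means]
    # Values are clipped to 255 so the result stays in the 8-bit range.
    return [[tuple(min(255.0, px[c] * gains[c]) for c in range(3)) for px in row]
            for row in image]
```

After balancing, a frame with a strong color cast has equal channel means (unless clipping occurs), which is the color constancy property the later skin-tone thresholds rely on.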

2.2.2 Color space transformation

The YCbCr color space is used because it represents a color image more efficiently by separating the luminance from the color information [4]. The thresholds on the Cb and Cr values are obtained from the ECU face and skin database. The transformation is given in Eq. (2).

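$$
\begin{bmatrix} Y \\ Cb \\ Cr \end{bmatrix}
=
\begin{bmatrix}
0.299 & 0.587 & 0.114 \\
-0.169 & -0.331 & 0.500 \\
0.500 & -0.419 & -0.081
\end{bmatrix}
\cdot
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\tag{2}
$$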
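The transformation in Eq. (2) amounts to one matrix-vector product per pixel, as the following sketch shows. The sign pattern of the matrix is assumed to follow the usual YCbCr convention, and no offset is added to Cb and Cr because Eq. (2) contains none.

```python
# Rows give the weights of (R, G, B) for Y, Cb and Cr respectively.
YCBCR_MATRIX = (
    (0.299, 0.587, 0.114),
    (-0.169, -0.331, 0.500),
    (0.500, -0.419, -0.081),
)


def rgb_to_ycbcr(r, g, b):
    """Apply Eq. (2) to one pixel; Cb and Cr are centred on 0 (no offset term)."""
    return tuple(row[0] * r + row[1] * g + row[2] * b for row in YCBCR_MATRIX)
```

A neutral pixel (equal R, G and B) maps to Cb = Cr = 0, which shows that the chrominance components carry no luminance information.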

The other model is based on an error signal that takes the luminance map into account. The thresholds are derived by fitting a Gaussian curve to the error signals using the expectation maximization (EM) method [2].

$$
\begin{aligned}
e(x) &= T(x) - \max\big( G(x), B(x) \big) \\
f_{skin}(x) &= 1 \quad \text{if } 0.0251 \le e(x) \le 0.1177, \qquad \text{where } T = 0.2985 * R + 0.5870 * G + 0.1402 * B
\end{aligned}
\tag{3}
$$
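A per-pixel version of the rule in Eq. (3) might look like the sketch below. It assumes that R, G and B are normalised to the range [0, 1] before the thresholds are applied, which the text does not state explicitly.

```python
def is_skin_error_signal(r, g, b, low=0.0251, high=0.1177):
    """Eq. (3) for one pixel; r, g, b are assumed to be normalised to [0, 1]."""
    t = 0.2985 * r + 0.5870 * g + 0.1402 * b   # luminance-like term T
    e = t - max(g, b)                          # error signal e(x)
    return low <= e <= high
```

With this rule a reddish, moderately bright pixel such as (0.8, 0.6, 0.5) is accepted, while a neutral grey or a saturated green pixel is rejected.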

2.2.3 Skin segmentation

After pre-processing and color space transformation, each tested key-frame is processed pixel by pixel: the Cb and Cr components, or the T-based error value, are checked against the thresholds.

The resulting mask, called the p-mask, is generated using Eq. (4).

$$
pmask(i,j) =
\begin{cases}
255, & value(i,j) \in thresholds \\
0, & \text{otherwise}
\end{cases}
\tag{4}
$$

To ensure precision, the results of the two skin detectors are added together.
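The mask generation of Eq. (4) and the combination of the two detectors can be sketched as below. The function names are hypothetical, and reading "added together" as a pixel-wise union of the two masks is an assumption.

```python
def p_mask(values, in_threshold):
    """Eq. (4): 255 where the per-pixel value passes the threshold test, 0 elsewhere."""
    return [[255 if in_threshold(v) else 0 for v in row] for row in values]


def combine_masks(mask_a, mask_b):
    """Pixel-wise union of two p-masks (one possible reading of 'added together')."""
    return [[255 if a == 255 or b == 255 else 0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]
```

Here `in_threshold` is any predicate on the per-pixel value, for example the range test on the error signal or a range test on (Cb, Cr) pairs.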

2.2.4 Morphological Operations

To deal with isolated pixels and the small black regions created by the nose and eyes, morphological operations based on erosion and dilation are introduced. There are two steps.

Step 1, closing: used to eliminate the black regions created by the eyes and nose. The p-mask image src[p] is first dilated and then eroded with a specified template temp[p]:

$$ res[p] = \left( src[p] \oplus temp[p] \right) \ominus temp[p] \tag{5} $$

Step 2, opening: used to eliminate white regions too small to be identified as a face. The p-mask image src[p] is first eroded and then dilated with a specified template temp[p]:

$$ res[p] = \left( src[p] \ominus temp[p] \right) \oplus temp[p] \tag{6} $$

Figure 4 illustrates the results of the close-up detector on different views. The mask image is shown below each key-frame of a given view type, and the ratio of the white area, which represents the human part, is listed below it. Close-ups are clearly identified by this ratio.
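The closing and opening steps of Eqs. (5) and (6) can be sketched on a binary p-mask as follows. The 3×3 square template is an assumption, since the paper does not specify the shape or size of temp[p], and the mask is a plain nested list with values 0 and 255.

```python
def _morph(mask, rule):
    """Apply a 3x3 neighbourhood rule; pixels outside the image are ignored."""
    rows, cols = len(mask), len(mask[0])
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            window = [mask[y][x]
                      for y in range(max(0, i - 1), min(rows, i + 2))
                      for x in range(max(0, j - 1), min(cols, j + 2))]
            out_row.append(255 if rule(v == 255 for v in window) else 0)
        out.append(out_row)
    return out


def dilate(mask):
    return _morph(mask, any)   # white if any neighbour is white


def erode(mask):
    return _morph(mask, all)   # white only if all neighbours are white


def closing(mask):
    """Eq. (5): dilation followed by erosion; fills small black holes."""
    return erode(dilate(mask))


def opening(mask):
    """Eq. (6): erosion followed by dilation; removes small white specks."""
    return dilate(erode(mask))
```

On a white region with a single black pixel in the middle, closing fills the hole; on a black mask with a single white pixel, opening removes it.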

Figure 4 Mask image and the ratio of humans in the four view-types.

2.3 Sports genre categorization

2.3.1 Codebook construction

The close-up view type occupies more than 1/3 of the sports key-frames and differs little across genres. In the training part, the Scale Invariant Feature Transform (SIFT) based visual words are therefore constructed from an unbalanced number of samples per view type: fewer close-ups are included when the codebook is built from all view types, which improves the discriminative power of the features. Tian et al. have shown the effectiveness of codebooks in large-scale image applications [11]. After close-up detection, SIFT extraction is performed on all key-frames. The key-frames are then converted into histogram features by the BOW model above.
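The conversion of a key-frame into a BOW histogram can be sketched as below. The codebook is taken as given (its construction is not shown), descriptors are assigned to the nearest codeword by Euclidean distance, and the L1 normalisation is an assumption. Real SIFT descriptors are 128-dimensional; any dimension works here.

```python
def nearest_word(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(word):
        return sum((d - w) ** 2 for d, w in zip(descriptor, word))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def bow_histogram(descriptors, codebook):
    """L1-normalised histogram of visual-word occurrences for one key-frame."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)
```

Because close-ups look alike across genres, codewords learned mostly from close-ups would produce similar histograms for every genre, which is the motivation for reducing them during codebook construction.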

2.3.2 SVM based categorization

After feature extraction and codebook construction, supervised learning is used for classification. An SVM is applied in our framework because of its good performance on unbalanced and small training datasets. The chi-square kernel, a typical choice for histogram features, is adopted to improve the results with the BOW-based features. The exponential chi-square kernel for two images x and y is defined as

$$ K(x, y) = \exp\left( -\chi^2(H_x, H_y) / A \right) \tag{7} $$

where H_x is the histogram of image x, A is a scaling parameter, and the χ² distance is evaluated as

$$ \chi^2(H_x, H_y) = \frac{1}{2} \sum_{i=0}^{n} \frac{\left( H_x(i) - H_y(i) \right)^2}{H_x(i) + H_y(i)} \tag{8} $$

Table 1 lists a hierarchical taxonomy of the sports that takes into account the number of athletes, which is connected with the form of the close-ups. In the score fusion stage, the scores of the key-frames are fused to label the video. The scores of close-ups are given a low weight to improve precision.
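The kernel of Eqs. (7) and (8) can be written directly from its definition, as in the sketch below. The parameter `scale` stands in for the constant A, whose value the paper does not give, and skipping bins that are empty in both histograms is an assumption made to avoid division by zero.

```python
import math


def chi_square_distance(hx, hy):
    """Eq. (8); bins where both histograms are zero contribute nothing."""
    total = 0.0
    for a, b in zip(hx, hy):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def chi_square_kernel(hx, hy, scale):
    """Eq. (7), with `scale` playing the role of the constant A."""
    return math.exp(-chi_square_distance(hx, hy) / scale)
```

Identical histograms give a kernel value of 1, and the value decays toward 0 as the histograms diverge.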

Table 1 Taxonomy of the 10 sports genres according to the ratio of close-ups, with the data size of each genre in our database

Sports group             Sport        Data size (h)
Rich close-up games      handball     19
Rich close-up games      billiards    14
Rich close-up games      motor        8
Rich close-up games      basketball   65
Middle close-up games    bowling      20
Middle close-up games    tennis       19
Middle close-up games    soccer       104
Middle close-up games    skating      22
Poor close-up games      bike         4
Poor close-up games      rugby        24

3 Experiments and analysis

To evaluate the performance of the proposed framework, we collected a database of 10 genres with 300 MPEG-4 video sequences from the Internet and TV to ensure diversity; it lasts more than 300 hours in total. The structure and duration of our dataset are illustrated in Figure 5. 20% of the Orange sports videos, together with all videos from the other sources, form the training set, while the rest of the Orange sports videos are used to test the performance of our framework.

Figure 5 The structure of our database.

3.1 Close-up detector

The performance of the close-up detector is very important to the proposed framework, so it is examined in more detail here. The ratio of white pixels to all pixels in the mask image is used as the threshold quantity (thres). To reduce the computational cost, it is obtained from the average value of the mask: V(i,j) is the value of pixel P(i,j) in the mask, and n_w and n_b are the numbers of white and black pixels. The average value can therefore represent the threshold.

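$$
\begin{aligned}
aver &= \frac{\sum_{0 \le i \le m,\ 0 \le j \le n} P(i,j) \ast V(i,j)}{\sum_{0 \le i \le m,\ 0 \le j \le n} P(i,j)}
      = \frac{n_w \cdot V_w + n_b \cdot V_b}{n_w + n_b} \\
thres &= \frac{n_w}{n_w + n_b}, \qquad V_w = 255,\ V_b = 0 \;\Rightarrow\; aver = thres \cdot 255
\end{aligned}
\tag{9}
$$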
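Equation (9) says that the white-pixel ratio can be read off the mean of the mask, which the following sketch makes concrete. The decision value `min_ratio` is a hypothetical parameter, since the paper does not state the ratio above which a key-frame is declared a close-up.

```python
def white_ratio(mask):
    """thres = n_w / (n_w + n_b), computed as aver / 255 as in Eq. (9)."""
    pixels = [v for row in mask for v in row]
    aver = sum(pixels) / len(pixels)
    return aver / 255.0


def is_close_up(mask, min_ratio):
    """Declare a close-up when the white ratio reaches min_ratio (assumed value)."""
    return white_ratio(mask) >= min_ratio
```

For a 2×2 mask with three white pixels the average is 191.25, so the ratio is 0.75 without any explicit counting of white pixels.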

Table 2 The performance of the skin-tone based close-up detector in different stages.

Color space    Procedure          Accuracy
YCbCr                             91%
Cheddad                           86%
Combine        All procedures     95%
Combine        No illumination    90%
Combine        No morphology      79%

Table 2 gives the results of the close-up detector at different stages. Illumination compensation is able to eliminate the negative influence of varying illumination conditions by providing color constancy. Moreover, the morphological closing and opening operations fill the holes created by the nose and eyes and correct false-positive pixels by eliminating small white regions.

3.2 Codebook analysis

A BOW model with unbalanced view-type weights is applied in the proposed framework, since close-up views are pre-detected as domain knowledge. The size of the codebook also plays an important role in the performance of genre categorization. Figure 6 shows the performance for various combinations of codebook size and close-up weight. Two conclusions can be drawn: 1) performance improves as the codebook grows, but plateaus once the size reaches 1200; 2) eliminating the close-ups reduces the structural risk and improves the discriminative ability of the features.

Figure 6 Trade off between bag-of-word sizes and the ratio of close-up weight. Precision of key-frames classification is described by the bars’ length

3.3 Sports genre categorization

Our modified sports genre categorization framework detects close-ups as domain knowledge and uses this in codebook construction before SVM learning. Based on the conclusions of Sections 3.1 and 3.2, the proposed framework combines both color space transformation rules with the grey-world illumination algorithm and the morphological operations. In the feature extraction stage, a codebook of 600 visual words is generated with no close-ups participating.

Since this is a multi-class classification task, the one-against-all method is used to ensure robustness as well as efficiency. The results are shown in Figure 7. In the late fusion stage, close-ups are given a weight of 1/5 when the scores of a video are computed. For comparison, the performance of genre categorization without view-type detection is also listed. To further improve performance, dense SIFT is applied to construct the bag-of-words model, which has a positive effect.
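The late fusion step can be sketched as a weighted average of per-key-frame scores in which detected close-ups receive the weight 1/5. The use of a weighted average and the dictionary layout of the scores are assumptions; the paper only states the weight.

```python
def fuse_scores(keyframe_scores, is_close_up, close_up_weight=0.2):
    """Weighted average of per-genre key-frame scores; close-ups get a lower weight."""
    genres = keyframe_scores[0].keys()
    weights = [close_up_weight if flag else 1.0 for flag in is_close_up]
    total_weight = sum(weights)
    return {g: sum(w * s[g] for w, s in zip(weights, keyframe_scores)) / total_weight
            for g in genres}


def label_video(keyframe_scores, is_close_up, close_up_weight=0.2):
    """Label the video with the genre that has the highest fused score."""
    fused = fuse_scores(keyframe_scores, is_close_up, close_up_weight)
    return max(fused, key=fused.get)
```

In a video with one confident long-view key-frame and two ambiguous close-ups, down-weighting the close-ups lets the long view decide the label, whereas equal weights would let the close-ups outvote it.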

Using close-ups as domain knowledge in the categorization benefits all of the sports genres. Of the ten sports, five are easy to classify, namely basketball, billiards, bowling, skating, and soccer, all of which achieve scores greater than 0.85. These five sports have a strong shot structure and are easy for SIFT to classify.

Method              Average precision
Ori-sift            80.5%
Ori-densift         83.8%
Modified-sift       85.1%
Modified-densift    87.4%

Figure 7 Accuracy of our proposed framework in 10 sports.

4 Conclusions

The performance of sports categorization on a large-scale database is clearly improved by detecting close-ups as domain knowledge, which demonstrates the efficiency and robustness of our framework. In the close-up detector, combining two color spaces ensures precision. The codebook is important in genre categorization: 600 visual words with no close-ups in the training part generate the best results. The average accuracy of the proposed framework is 87.4%. In the future, more effort will be devoted to improving the performance on the five low-performing sports genres, namely bike, handball, motor, rugby, and tennis. More features will be introduced, and other view types such as long view and middle view will be pre-detected to provide full domain knowledge.

Acknowledgements

This work is sponsored by the National Natural Science Foundation of China (90920001) and by the collaborative research project (SEV01100474) between Beijing University of Posts and Telecommunications and France Telecom R&D.

References

[1] X. Gibert et al., "Sports video classification using HMMs," Proceedings of the 2003 International Conference on Multimedia and Expo, pp. 345-348, July 6-9, 2003.

[2] X. Yuan et al., "Automatic video genre categorization using hierarchical SVM," IEEE International Conference on Image Processing, pp. 2905-2908, 2006.

[3] D. Brezeale and D. Cook, "Automatic video classification: A survey of the literature," IEEE Transactions on Systems, Man, and Cybernetics, 38(3):416-430, 2008.

[4] Jinjun Wang et al., "Automatic sports video genre classification using pseudo-2D-HMM," Proceedings of the 18th International Conference on Pattern Recognition, pp. 778-781, August 20-24, 2006.

[5] E. Jaser et al., "Hierarchical decision making scheme for sports video categorization with temporal post-processing," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 908-913, 2004.

[6] A. Ekin et al., "Automatic soccer video analysis and summarization," IEEE Transactions on Image Processing, 12(7), 2003.

[7] L. Duan et al., "A unified framework for semantic shot classification in sports video," IEEE Transactions on Multimedia, 7(6), 2005.

[8] L. Li et al., "Automatic sports genre categorization and view-type classification over large-scale dataset," Proceedings of the 17th ACM International Conference on Multimedia, Oct. 19-24, 2009, Beijing, China.

[9] A. Cheddad et al., "A skin tone detection algorithm for an adaptive approach to steganography," Signal Processing, vol. 89, no. 12, pp. 2465-2478, 2009.

[10] Gaurav Jain, "Skin-tone Detection: A Machine's Perspective," ECE1774 - Sensory Cybernetics Final Project Report, 2010.

[11] Qi Tian et al., "Building descriptive and discriminative visual codebook for large-scale image applications," Multimedia Tools and Applications, vol. 51, no. 2, pp. 441-477, 2011.

[12] R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain, "Face detection in color images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):696-706, 2002.

[13] Yuan et al., "THU and ICRC at TRECVID 2007," in Proceedings of the TRECVID Workshop, Gaithersburg, USA, 2007.
