
HIGHLIGHT SOUND EFFECTS DETECTION IN AUDIO STREAM

Rui Cai¹, Lie Lu², Hong-Jiang Zhang² and Lian-Hong Cai¹

¹Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
²Microsoft Research Asia, 3/F, Beijing Sigma Center, 49 Zhichun Road, Beijing, 100080, China

ABSTRACT
This paper addresses the problem of highlight sound effects detection in audio streams, which is very useful in the fields of video summarization and highlight extraction. Unlike research on audio segmentation and classification, in this domain the task is only to locate the highlight sound effects in the audio stream. An extensible framework is proposed, and in the current system three sound effects are considered: laughter, applause and cheer, which are closely tied to highlight events in entertainments, sports, meetings and home videos. HMMs are used to model these sound effects, and a method based on log-likelihood scores is used to make the final decision. A sound effect attention model is also proposed to extend the general audio attention model for highlight extraction and video summarization. Evaluations on a 2-hour audio database showed very encouraging results.

1. INTRODUCTION
Audio content analysis plays an important role in video content parsing. Besides visual features, audio features are widely considered in many works, such as highlight extraction [1] and video summarization [2]. In order to extract highlight shots more accurately, Rui [1] utilized the announcer's excited speech and baseball hits for TV baseball programs; Ma and Lu [2] proposed an audio attention model to measure the importance curve of an audio track. However, these works did not consider some general highlight sound effects, such as laughter, applause and cheer. These sounds are usually semantically related to highlight events in general video, such as entertainments, sports, meetings and home videos. The audience's laughter often indicates a humorous scene in TV shows, and applause in a meeting often implies a wonderful presentation. Detection of these highlight sound effects in an audio stream is therefore very helpful for highlight extraction and video summarization.

Most previous work on audio content analysis focused on general audio segmentation and classification [3][4][5], where an audio track is segmented and then each segment is classified into one of several predefined classes. In comparison with these previous works, sound effects detection in an audio stream must handle the following cases: (i) model more particular sound classes and (ii) recall only the expected sound effects and ignore audio segments not belonging to any predefined effect.

This work was performed while the first author was a visiting student at Microsoft Research Asia.



An ideal framework for sound effects detection should possess the following characteristics: (i) high recall and precision: it should exactly locate the sound effects of interest and ignore others; (ii) extensibility: it should be easy to add or remove sound effect models for new requirements. In this paper, an extensible framework and an efficient algorithm for highlight sound effects detection in audio streams are presented. HMMs are used to model the sound effects, as suggested in Casey's sound recognition tools [7]. Based on the log-likelihood scores obtained from each model, the final judgment is made.

The rest of this paper is organized as follows. Audio features are discussed in Section 2. The highlight sound effects modeling and detection scheme is presented in detail in Section 3. The highlight sound effect attention model is illustrated in Section 4. In Section 5, experiments and evaluations of the proposed framework and algorithm are given.

2. AUDIO FEATURE SELECTION
In our experiments, all audio streams are 16-bit, mono-channel, and down-sampled to 8 kHz. Each frame is of 200 samples (25 ms), with 50% overlap. Based on previous work in [3][4][5], two types of features are computed for each frame: (i) perceptual features and (ii) Mel-Frequency Cepstral Coefficients (MFCCs). The perceptual features are composed of short-time energy, zero-crossing rate, sub-band energies, brightness and bandwidth.

A. Short-Time Energy
Short-Time Energy (STE) provides a convenient representation of the amplitude variation over time. The STE is normalized as follows:

$\bar{E}_k = E_k / \max_{1 \le i \le N}(E_i) \quad (1 \le k \le N)$  (1)

Here $E_k$ is the k-th frame's STE and N is the number of frames in the input audio data.

B. Average Zero-Crossing Rate
Average Zero-Crossing Rate (ZCR) gives a rough estimate of the frequency content, which is one of the most important features of an audio signal.
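As a concrete illustration, a minimal NumPy sketch of these two frame-level features (the function names and frame layout are illustrative, not from the paper):

    import numpy as np

    def short_time_energy(frames):
        # frames: (num_frames, frame_len) array of samples; returns the Eq. (1)-normalized STE
        ste = np.sum(frames.astype(float) ** 2, axis=1)
        return ste / np.max(ste)

    def zero_crossing_rate(frames):
        # fraction of adjacent-sample sign changes per frame: a rough estimate of frequency content
        signs = np.sign(frames)
        return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)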



C. Sub-band Energies
In order to model the characteristics of the spectral distribution more accurately, sub-band energies are used in our method. The entire frequency spectrum is divided into four sub-bands at the same interval of 1 kHz. The sub-band energy is defined as

$E_i = \sum_{m=\omega_{i,l}}^{\omega_{i,h}} |F(m)|^2 \quad (1 \le i \le 4)$  (2)

Here $\omega_{i,l}$ and $\omega_{i,h}$ are the lower and upper bounds of sub-band i, and $F(m)$ denotes the spectrum of the frame. Then $E_i$ is normalized as

$\bar{E}_i = E_i / \sum_{j=1}^{4} E_j \quad (1 \le i \le 4)$  (3)

D. Brightness and Bandwidth
Brightness is defined as the frequency centroid:

$Br = \sum_m m\,|F(m)|^2 / \sum_m |F(m)|^2$  (4)

Bandwidth is the square root of the power-weighted average of the squared difference between the spectral components and the brightness:

$Bw = \sqrt{\sum_m (m - Br)^2\,|F(m)|^2 / \sum_m |F(m)|^2}$  (5)
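A minimal sketch of Eqs. (2)-(5) for a single frame, assuming an 8 kHz sampling rate; the helper name and numerical guards are illustrative:

    import numpy as np

    def spectral_features(frame, sample_rate=8000, n_subbands=4):
        # Sub-band energies, brightness (spectral centroid) and bandwidth
        # computed from one frame's magnitude spectrum.
        spectrum = np.abs(np.fft.rfft(frame)) ** 2                # |F(m)|^2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

        # Eq. (2)-(3): four sub-bands of equal width (1 kHz each for 8 kHz audio)
        edges = np.linspace(0, sample_rate / 2, n_subbands + 1)
        sub_energies = np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                                 for lo, hi in zip(edges[:-1], edges[1:])])
        sub_energies = sub_energies / max(sub_energies.sum(), 1e-10)

        total = max(spectrum.sum(), 1e-10)
        # Eq. (4): brightness = power-weighted mean frequency
        brightness = (freqs * spectrum).sum() / total
        # Eq. (5): bandwidth = power-weighted standard deviation around the brightness
        bandwidth = np.sqrt((((freqs - brightness) ** 2) * spectrum).sum() / total)

        return sub_energies, brightness, bandwidth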

E. 8-order Mel-Frequency Cepstral Coefficients (MFCCs)
The Mel scale gives a more accurate simulation of the human auditory system. It is a gradually warped linear spectrum, with coarser resolution at high frequencies. MFCC is one of the Mel-frequency sub-band energy features. As suggested in [5], 8-order MFCCs are used in our experiments.

These features are then combined into a 16-dimensional feature vector for each frame. In order to describe the variance between frames, the gradient feature between adjacent frames is also computed and concatenated to the original vector. Thus, we get a 32-dimensional feature vector for each frame.
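Putting the pieces together, a sketch of the 32-dimensional per-frame vector. The paper does not name a toolkit; librosa is assumed here purely for framing and MFCC extraction, and the helper functions are the illustrative ones sketched above:

    import numpy as np
    import librosa   # assumed available for framing and MFCC extraction

    def frame_features(signal, sr=8000, frame_len=200, hop=100):
        # 16 base features per frame plus their frame-to-frame gradient = 32 dimensions
        frames = librosa.util.frame(signal, frame_length=frame_len, hop_length=hop).T
        ste = short_time_energy(frames)
        zcr = zero_crossing_rate(frames)
        spec = np.array([np.concatenate([sub, [br, bw]])
                         for sub, br, bw in (spectral_features(f, sr) for f in frames)])
        # 8-order MFCCs with the same 25 ms / 12.5 ms framing
        mfcc = librosa.feature.mfcc(y=signal.astype(float), sr=sr, n_mfcc=8,
                                    n_fft=frame_len, hop_length=hop,
                                    n_mels=26).T[:len(frames)]
        base = np.column_stack([ste, zcr, spec, mfcc])       # 16-dim base vector
        grad = np.diff(base, axis=0, prepend=base[:1])       # inter-frame gradient
        return np.hstack([base, grad])                       # 32-dim vector per frame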

3. SOUND EFFECTS MODELING AND DETECTION

3.1. Sound Effect Modeling
Most sound effects can be partitioned into several statistically significant patterns. More importantly, the time evolution of these patterns is critical for sound effect modeling. While both GMMs and HMMs possess states which can represent such patterns, an HMM also describes the time evolution between states using the transition probability matrix. Thus, HMMs are selected to model the sound effects.

A fully connected HMM is used for each sound effect, with a continuous Gaussian mixture modeling each state. The number of Gaussian mixture components for each state is set to four, since the sound effects are relatively simple and show little variation. Using more components needs more training data but gives only a slight improvement in accuracy in our experiments.

The training data for each sound effect includes 100 pieces of samples segmented from audio tracks. Each piece is about 3-10 s long, giving about 10 min of training data for each class. The basic sound effect modeling process is as follows. First, a clustering algorithm proposed in [6] is used to find a reasonable number of HMM states for each sound effect. In our experiments, the HMM state numbers for applause, cheer and laughter are 2, 4 and 4 respectively. Then, frame-based feature vectors are extracted for estimating the HMM parameters using the Baum-Welch method, which is widely used in the field of Automatic Speech Recognition (ASR).
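For concreteness, a minimal training sketch using the open-source hmmlearn package (the paper does not name an implementation; the state and mixture counts follow the text, everything else, including the training-data dictionary, is illustrative):

    import numpy as np
    from hmmlearn.hmm import GMMHMM

    def train_effect_model(feature_seqs, n_states, n_mix=4):
        # feature_seqs: list of (n_frames, 32) arrays, one per training clip
        X = np.vstack(feature_seqs)
        lengths = [len(seq) for seq in feature_seqs]
        # Fully connected HMM, 4-component Gaussian mixture per state, Baum-Welch (EM) training
        model = GMMHMM(n_components=n_states, n_mix=n_mix,
                       covariance_type="diag", n_iter=50)
        model.fit(X, lengths)
        return model

    # State counts found by clustering in the paper: applause 2, cheer 4, laughter 4.
    # 'training_clips' is an assumed dict mapping each effect name to its feature sequences.
    models = {name: train_effect_model(training_clips[name], n)
              for name, n in [("applause", 2), ("cheer", 4), ("laughter", 4)]}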

3.2. Sound Effect Detection
3.2.1. Framework Overview
The system framework of sound effect detection is illustrated in Figure 1.

Figure 1. System framework of sound effect detection

As Figure 1 shows, a sliding window of t seconds moves through the input audio stream with an overlap of Δt. In our experiments we choose t = 1 s and Δt = 0.5 s as a tradeoff between the algorithm's efficiency and accuracy. Each data window is further divided into 25 ms frames with 12.5 ms overlap, from which the features are extracted. The extracted feature vectors form the input of the HMMs. In order to reduce the processing time, silence windows are skipped before testing against the HMMs. A silence window is detected based on its average short-time energy and average zero-crossing rate:

$\overline{STE} < \delta_E \ \text{and}\ \overline{ZCR} < \delta_Z$  (6)

Here $\delta_E$ and $\delta_Z$ are thresholds on the average STE and ZCR respectively.

Each non-silence window is compared against each sound effect model, and k log-likelihood scores are obtained, given that there are k sound effect models. A judgment is made by a decision algorithm based on these scores. In this framework, it is easy to add or remove sound effect models to adapt to new requirements.

The most recent three decisions are preserved in a Recent History Records database for post-processing. The decision given by the algorithm is first sent to the Records database, and the final result is obtained after post-processing. Some simple rules are used in this post-processing. For example, considering the continuity between adjacent windows, if three consecutive decisions are "A-B-A", they are modified to "A-A-A".
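The overall detection loop might look like the following sketch. The window sizes follow the 1 s / 0.5 s setting above, the silence thresholds are illustrative, the scoring call follows hmmlearn's API, and decide() is the acceptance/argmax rule of Section 3.2.2:

    def detect_effects(features, models, thresholds, win_frames=80, hop_frames=40,
                       ste_thresh=0.01, zcr_thresh=0.05):
        # features: (n_frames, 32) array for the whole stream (column 0 = normalized STE,
        # column 1 = ZCR in this sketch); models: dict of trained HMMs per sound effect.
        decisions = []
        for start in range(0, len(features) - win_frames + 1, hop_frames):
            window = features[start:start + win_frames]
            # Skip silence windows before testing against the HMMs (Eq. (6))
            if window[:, 0].mean() < ste_thresh and window[:, 1].mean() < zcr_thresh:
                decisions.append(None)
                continue
            scores = {name: m.score(window) for name, m in models.items()}  # log-likelihoods
            decisions.append(decide(scores, thresholds))   # Section 3.2.2 decision rule
        # Post-processing on recent decisions: smooth isolated "A-B-A" runs to "A-A-A"
        for i in range(1, len(decisions) - 1):
            if decisions[i - 1] == decisions[i + 1] != decisions[i]:
                decisions[i] = decisions[i - 1]
        return decisions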



3.2.2. Log-likelihood Scores Based Decision Method
The most important issue is how to make a decision based on the log-likelihood scores. Unlike audio classification, we cannot simply classify the sliding window into the class which has the maximum log-likelihood score: sliding windows not belonging to any predefined sound effect should be ignored.

Figure 2 illustrates the flowchart of the proposed log-likelihood based decision method. As Figure 2 shows, each log-likelihood score is examined to see whether the window data is "accepted" by the corresponding sound effect. To implement this task, an optimal decision is made by minimizing the following cost function [8], based on Bayesian decision theory:

Figure 2. The flowchart of the log-likelihood based decision

$C = P(C_1)\,P(C_0 \mid C_1)\,c_{01} + P(C_0)\,P(C_1 \mid C_0)\,c_{10}$  (7)

where $C_1$ denotes the target sound effect class, $C_0$ the non-target class, and $c_{01}$, $c_{10}$ the corresponding misclassification costs. The distribution of the log-likelihood scores under each model is approximated with a Gamma distribution (see Figure 3):

$p(x) = \dfrac{x^{\alpha-1}\,e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}$

where parameters $\alpha$ and $\beta$ are estimated as

$\alpha = \mu^2 / \sigma^2, \quad \beta = \sigma^2 / \mu$  (11)

where $\mu$ and $\sigma$ are the mean and standard deviation of the log-likelihood scores. In order to get an accurate estimation of these parameters, it is necessary to prune abnormal scores first. In our experiments, an abnormal score is defined as one whose distance from $\mu$ is larger than $2\sigma$. In each iteration, $\mu$ and $\sigma$ are calculated first, and then the abnormal scores are pruned. The iteration stops when no abnormal data remains in the score set.

Based on Eq. (8), each input window is examined to see whether it is "accepted" by a sound effect. If it is accepted, the corresponding likelihood score, which is also considered a confidence, is added to the candidate stack, as Figure 2 shows. After going through all the log-likelihood scores, the final decision is made as follows. If the stack is empty, the input window does not belong to any registered model; otherwise, it is classified into the i-th sound effect with the maximum confidence,

$i = \arg\max_j \big( p(s_j \mid C_j) \big)$  (12)
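A sketch of this decision step. The per-model acceptance threshold is treated as a precomputed input here; in the paper it follows from the Bayesian cost above, and the Gamma parameters are estimated from pruned log-likelihood scores, which are assumed to have been shifted to positive values for the moment-based estimates:

    import numpy as np

    def fit_gamma_with_pruning(scores, n_sigma=2.0):
        # Iteratively prune scores farther than n_sigma from the mean, then estimate
        # Gamma parameters by moments: alpha = mu^2 / sigma^2, beta = sigma^2 / mu (Eq. (11)).
        s = np.asarray(scores, dtype=float)
        while True:
            mu, sigma = s.mean(), max(s.std(), 1e-12)
            kept = s[np.abs(s - mu) <= n_sigma * sigma]
            if len(kept) == len(s):
                break
            s = kept
        return mu ** 2 / sigma ** 2, sigma ** 2 / mu

    def decide(scores, thresholds):
        # scores: {effect: log-likelihood of the current window}
        # thresholds: {effect: minimum score for the window to be "accepted"}
        candidates = {k: v for k, v in scores.items() if v >= thresholds[k]}
        if not candidates:
            return None                                   # belongs to no registered model
        return max(candidates, key=candidates.get)        # Eq. (12): maximum confidence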

Figure 3. (a) Log-likelihood scores distributions; (b) approximation of (a) with a Gamma distribution

4. SOUND EFFECT ATTENTION MODEL
Based on the locations of these highlight sound effects, an extended audio attention model can be established, which describes the saliency of each sound effect. It is very helpful for further highlight extraction and video summarization. In this paper, two characteristics are used to represent the sound effect attention model. One is loudness, which can be represented by the segment's energy; the other is the confidence in a particular sound effect class, which is represented by the log-likelihood score under the corresponding HMM.

These two characteristics are normalized as:

$\bar{E} = E_{avr} / Max\_E_{avr}, \quad \bar{s}_j = \exp(s_j - Max\_s_j)$  (13)

where $E_{avr}$ and $s_j$ denote the average energy and the log-likelihood score under model j of an audio segment respectively, and $Max\_E_{avr}$ and $Max\_s_j$ are the maximum average energy and the maximum log-likelihood score under model j in the entire audio stream. Then the attention model for class j is defined as:

$M_j = \bar{E} \cdot \bar{s}_j$  (14)

Figure 4 illustrates the sound effect attention curves of the three effects (laughter, applause and cheer) for a 1-minute clip of NBC's TV show "Hollywood Squares".
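A small sketch of Eqs. (13)-(14), computing the attention curve of each effect over the segments of a stream (names are illustrative):

    import numpy as np

    def attention_curves(avg_energy, loglik):
        # avg_energy: (n_segments,) average energy of each audio segment
        # loglik: {effect: (n_segments,) log-likelihood of each segment under that model}
        # Eq. (13)-(14): normalize energy by its stream maximum, map log-likelihoods
        # through exp(s_j - max s_j), and combine the two multiplicatively.
        e_norm = np.asarray(avg_energy, dtype=float) / np.max(avg_energy)
        return {effect: e_norm * np.exp(scores - np.max(scores))
                for effect, scores in loglik.items()}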



Figure 4. Sound effect attention model curves

5. EXPERIMENTS
The evaluations of the proposed algorithm are performed on our database. Three sound effects, including laughter, applause and cheer, are modeled. The training data for each sound effect includes 100 pieces of samples. Each piece is about 3-10 s long, giving about 10 min of training data for each class. The testing database is about 2 hours of video, covering different programs and different languages, including NBC's TV show "Hollywood Squares" (30 min), CCTV's TV show "Lucky 52" (60 min) and a live recording of a table tennis championship (30 min). All the audio tracks are first manually labeled. Then the audio streams are segmented into 1-second windows with 0.5-second overlap, and each window gets a ground truth according to the labels.

In order to estimate the log-likelihood distributions in Figure 3 more accurately, two kinds of distribution curves, Gaussian and Gamma, are compared. Recall and precision, which are widely used in retrieval system evaluation, are used to measure the performance of our system. With all other parameters kept the same, the comparison results on the 30-min "Hollywood Squares" data are shown in Table 1. From the table, it can be seen that the Gamma distribution performs much better than the Gaussian distribution. Compared with the Gaussian distribution, the Gamma distribution increases the precision remarkably (about 9.3%) while affecting the recall only slightly (about 1.8%).

Results of general tests on all the 2 hours of data are listed in Table 2. From Table 2, it can be seen that the performance is encouraging. The average recall is about 92.95% and the average precision is about 86.88%. The high recall meets the requirements of highlight extraction and summarization well. However, there still exist some mis-detections. In the table tennis championship, the commentator sometimes uses a brief and excited voice for a wonderful play, which is often detected as laughter; this makes the precision for laughter a little low. Moreover, when sound effects are mixed with music, speech and other complicated environmental sounds, it is also hard to make a correct judgment; that is why the recall on the TV shows is somewhat lower than that on the table tennis match, which has a relatively quiet environment.

6. CONCLUSION
In this paper, we have presented in detail our framework and algorithm for highlight sound effects detection in audio streams. Three highlight sound effects are considered in the current system: laughter, applause and cheer, which are mostly semantically related to interesting events in TV shows, sports, meetings and home videos. HMMs are used to model the sound effects, and a log-likelihood based method is proposed to make the decision. A sound effect attention model is also proposed for further highlight extraction and summarization. Experimental evaluations have shown that the algorithm obtains very satisfying results.

7. REFERENCES
[1] Y. Rui, A. Gupta and A. Acero, "Automatically Extracting Highlights for TV Baseball Programs," Proc. of 8th ACM International Conference on Multimedia, 2000.
[2] Y.-F. Ma, L. Lu, H.-J. Zhang and M.-J. Li, "A User Attention Model for Video Summarization," Proc. of 10th ACM International Conference on Multimedia, 2002.
[3] L. Lu, H. Jiang and H.-J. Zhang, "A Robust Audio Classification and Segmentation Method," Proc. of 9th ACM International Conference on Multimedia, 2001.
[4] L. Lu, S. Z. Li and H.-J. Zhang, "Content-Based Audio Segmentation Using Support Vector Machines," Proc. of IEEE International Conference on Multimedia and Expo, pp. 956-959, Tokyo, Japan, 2001.
[5] S. Z. Li, "Content-based classification and retrieval of audio using the nearest feature line method," IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 5, pp. 619-625, September 2000.
[6] T. Zhang and C.-C. Jay Kuo, "Hierarchical System for Content-based Audio Classification and Retrieval," SPIE's Conference on Multimedia Storage and Archiving Systems III, Boston, Nov. 1998.
[7] M. Casey, "MPEG-7 Sound-Recognition Tools," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, pp. 737-747, June 2001.
[8] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons, 2001.
[9] L. L. Scharf, Statistical Signal Processing: Detection, Estimation, and Time Series Analysis, Addison-Wesley Inc., 1991.
