DETECTING MUSIC IN AMBIENT AUDIO BY LONG-WINDOW AUTOCORRELATION

Keansub Lee and Daniel P. W. Ellis

LabROSA, Department of Electrical Engineering
Columbia University, New York, NY 10027 USA

{kslee, dpwe}@ee.columbia.edu

ABSTRACT

We address the problem of detecting music in the background of ambient real-world audio recordings such as the soundtrack of consumer-shot video. Such material may contain high levels of noise, and we seek to devise features that will reveal music content in such circumstances. Sustained, steady musical pitches show significant, structured autocorrelation when calculated over windows of hundreds of milliseconds, whereas the autocorrelation of aperiodic noise becomes negligible at higher lags once the signal has been whitened by LPC. Using such features, further compensated by their long-term average to remove the effect of stationary periodic noise, we produce GMM- and SVM-based classifiers with high performance compared with previous approaches, as verified on a corpus of real consumer video.

Index Terms— Speech analysis, Music, Acoustic signal detection, Correlation

1. INTRODUCTION

Short video clips are in some cases replacing still-image snapshots as a medium for the casual recording of daily life. While the thousands of digital photos in a typical user’s collection already present a serious navigation and browsing challenge, video clips, which may not be well represented by a single thumbnail image, can make things still more difficult.

However, video clips contain much richer information than single images, and consequently present many new opportunities for the automatic extraction of information that can be used in intelligent browsing systems. We are particularly interested in exploiting the acoustic information – the soundtrack – that is available for video, and in seeing what useful information can be reliably extracted from these kinds of data.

One attribute that we see as both informative and useful to users, and at the same time technically feasible, is the detection of background music. For instance, if a user is searching for the video clip of a certain event, they are likely to be able to remember (or guess) whether there was music in the background, and thereby limit the scope of a search. In a manual labeling of a database of over 1000 video clips recorded by real users of current digital cameras (which include video capability), approximately 18% were found to include music – enough to be a generally-useful feature, while still retaining some discriminative power.

This work was supported by the NSF (grant IIS-0238301), the Eastman Kodak Company, and the EU project AMIDA (via ICSI).

There has been a substantial amount of work relating to the detection of music in audio, or the discrimination between a few categories such as speech and music, including [1, 2, 3, 4]. However, the soundtrack of ‘consumer video’ has many characteristics that distinguish it from the broadcast audio that has most commonly been considered in this work: casual recordings made with small, hand-held cameras will very often contain a great deal of spurious, non-stationary noise such as babble, crowd, traffic, or handling artifacts. This unpredictable noise can have a great impact on detection algorithms, particularly if they rely on the global characteristics of the signal (e.g. the broad spectral shape encoded by MFCC features), which may now be dominated by noise.

In trying to design robust features, we focus on the two key characteristics of music worldwide: pitch and rhythm. Pitch refers to the perceived musical notes that build up melodies and harmony, and is generally conveyed by locally-periodic signals (thus possessing a spectrum with harmonic peaks); musical instruments are usually designed to have relatively stable periods, and musical notes typically last for hundreds of milliseconds before the pitch changes. Rhythm is the regular temporal structuring of note events giving rise to a sense of beat or pulse, usually at several hierarchically-related levels (beat, bar, etc.). While a given musical instance may lack clear pitch (e.g. percussion music) or a strong rhythm (e.g. an extremely ‘romantic’ piano style), it is difficult to imagine music possessing neither.

In the next section, we describe a music detection feature for detecting the stable periodicities of pitch that is robust to high levels of background noise. This feature, combined with a rhythm-detection feature based on the beat tracker of [5], is evaluated in section 3, and compared to previous music detection features. We discuss the results in section 4.

[Figure 1: three columns of panels ((a) Speech, (b) Music, (c) Machine Noise), each showing, from top to bottom, a spectrogram (freq/kHz vs. time/sec), the original autocorrelogram (lags), the compensated autocorrelogram (lags), and the Dynamics trace, with and without compensation.]

Fig. 1. Examples of noisy speech, music and machine sound from a consumer audio recording.

2. MUSICAL PITCH DETECTION

Our strategy for detecting musical pitches is to identify the autocorrelation function (ACF) peaks resulting from periodic, pitched energy that is stationary for around 100–500 ms, while excluding aperiodic noise and stationary periodicity arising from background noise. Whitening by Linear Predictive (LP) inverse filtering prior to the ACF concentrates aperiodic noise energy around zero lag, so we use only higher-lag coefficients to avoid this energy. Calculating the ACF over 100 ms windows emphasizes periodicities stable on that time scale, but we then subtract the long-term average ACF to remove any stationary, periodic background. Finally, the stability (or dynamics) of the pitch content is estimated by a feature composed of the cosine similarity between successive frames of the compensated ACF.

2.1. LPC Whitening and ACF

Mono input recordings are resampled to 16 kHz and fit with a 12th-order LPC model over 64 ms windows every 32 ms. Further processing is applied to the residual of this modeling, which is a spectrally flat (whitened) version of the original signal, preserving any pitch-rate periodicity. The short-time ACF r_ee(n, τ) for the LPC residual output e(n) at a given time index n may be defined as:

r_{ee}(n, \tau) = \sum_{i=n+1}^{n+W} e(i)\, e(i + \tau) \qquad (1)

where W is the integration window size, and r_ee(n, τ) is calculated over 100 ms windows every 5 ms for lags τ = 0 ... 200 samples (i.e. up to 12.5 ms, for a lowest pitch of 80 Hz).
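
As an illustration of this front end, here is a minimal Python sketch (not the authors' released Matlab code) of 12th-order LPC whitening over 64 ms / 32 ms frames followed by the 100 ms / 5 ms short-time ACF of eq. (1). The helper names (lpc_residual, short_time_acf) and the Toeplitz-based LPC solver are our own choices, and details such as windowing may differ from the original implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_residual(x, sr=16000, order=12):
    """Whiten x by a 12th-order LPC model fit over 64 ms windows every 32 ms.
    Autocorrelation-method LPC via a Toeplitz solve; a sketch, not the paper's exact code."""
    frame, hop = int(0.064 * sr), int(0.032 * sr)
    resid = np.zeros(len(x), dtype=float)
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame] * np.hanning(frame)
        r = np.correlate(seg, seg, mode='full')[frame - 1:frame + order]  # lags 0..order
        if r[0] <= 0:
            continue                                   # skip silent frames
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])        # predictor coefficients
        e = lfilter(np.concatenate(([1.0], -a)), [1.0], x[start:start + frame])
        resid[start:start + hop] = e[:hop]             # keep the non-overlapped part
    return resid

def short_time_acf(e, sr=16000, win_ms=100, hop_ms=5, max_lag=200):
    """Eq. (1): short-time ACF of the residual, 100 ms windows every 5 ms, lags 0..200 samples."""
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    frames = []
    for n in range(0, len(e) - win - max_lag, hop):
        seg = e[n:n + win + max_lag]
        frames.append([np.dot(seg[:win], seg[tau:tau + win]) for tau in range(max_lag + 1)])
    return np.asarray(frames)                          # shape (num_frames, max_lag + 1)
```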

2.2. ACF Compensation

Assume that the residual e(n) consists of a clean musical signal m(n), background aperiodic noise a(n), and stationary periodic noise b(n), i.e. e(n) = m(n) + a(n) + b(n). If the noises a(n) and b(n) are zero-mean and uncorrelated with m(n) and with each other for large W, the ACF is given by:

r_{ee}(n, \tau) = r_{mm}(n, \tau) + r_{aa}(n, \tau) + r_{bb}(n, \tau) \qquad (2)

To simplify notation, variables n and τ are henceforth dropped.

2.2.1. Aperiodic Noise Suppression

The effect of the LPC whitening is to concentrate the ACF of unstructured noise, r_aa, at or close to the zero-lag bins. We can therefore remove the influence of aperiodic noise from our features by using only the coefficients of r_ee for lags τ ≥ τ_1 samples (i.e. in our system, τ ≥ 100).

r_{ee} = r_{mm} + r_{bb}, \quad \text{for } \tau \ge 100 \qquad (3)

Once the low-lag region has been removed, the ACF r_ee is normalized by its peak value ||r_ee|| to lie in the range -1 to 1.
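
A sketch of this step, continuing the Python example above and taking the short_time_acf output as input; τ_1 = 100 and the peak normalization follow the text, while the small epsilon guard is our own addition.

```python
import numpy as np

def high_lag_acf(acf_frames, tau1=100):
    """Eq. (3): discard lags below tau1 (aperiodic noise sits near lag 0 after whitening),
    then normalise each frame by its peak so coefficients lie in [-1, 1]."""
    high = acf_frames[:, tau1:]
    peak = np.max(np.abs(high), axis=1, keepdims=True)
    return high / np.maximum(peak, 1e-12)   # floor avoids division by zero in silent frames
```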

2.2.2. Long-time Stationary Periodic Noise Suppression

A common form of interference in environmental recordings is stationary periodic noise such as the steady hum of a machine, as shown in the third column of figure 1, resulting in

Table 1. Speech-music classification, with and without vocals (broadcast audio corpus, single Gaussian classifier with full covariance). Each value indicates how many of the 2.4 second segments, out of a total of 120, are correctly classified as speech (first number) or music (second number). The best performance in each column is shown in bold. Features are described in the text.

Feature             Speech vs. Music w/ vocals   Speech vs. Music w/o vocals

Rth                 96/120, 65/120               96/120, 62/120
mDyn                114/120, 99/120              114/120, 104/120
vDyn                89/120, 115/120              89/120, 116/120
4HzE                106/120, 118/120             106/120, 120/120
vFlux               106/120, 116/120             106/120, 120/120

Rth+mDyn            111/120, 109/120             111/120, 114/120
mDyn+vDyn           114/120, 101/120             114/120, 104/120
4HzE+vFlux          104/120, 118/120             104/120, 120/120

Rth+mDyn+vDyn       112/120, 114/120             112/120, 117/120
Rth+4HzE+vFlux      103/120, 119/120             103/120, 120/120
Rth+mDyn+4HzE       108/120, 119/120             108/120, 120/120
Rth+mDyn+vFlux      108/120, 117/120             108/120, 120/120

ACF ridges that are not, in fact, related to music [6]. The ACF contribution of this noise, r_bb, will change very little with time, so it can be approximated as the long-time average of r_ee over M adjacent frames (covering around 10 seconds). We can then estimate the autocorrelation of the music signal, r̂_mm, as the difference between the local ACF and its long-term average,

\hat{r}_{mm} = r_{ee} - \gamma \cdot \hat{r}_{bb} = r_{ee} - \gamma \cdot \mathrm{avg}\{r_{ee}\} \qquad (4)

where γ is a scaling term to accommodate the per-frame normalization of the high-lag ACF, calculated as the best projection of the average onto the current frame:

\gamma = \frac{\sum_{\tau} r_{ee} \cdot \mathrm{avg}\{r_{ee}\}}{\sum_{\tau} \mathrm{avg}\{r_{ee}\}^{2}}, \quad \text{for } \tau \ge 100 \qquad (5)

This estimated music ACF r̂_mm is shown in the third row of figure 1.
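
Continuing the sketch, the long-term compensation of eqs. (4)-(5) might look as follows; the roughly 10 s averaging window (M frames at the 5 ms hop) and its centring on the current frame are assumptions on our part.

```python
import numpy as np

def suppress_stationary(r_high, hop_ms=5, avg_s=10.0):
    """Eqs. (4)-(5): estimate r_bb as the long-term average of the normalised high-lag ACF,
    project it onto each frame (gamma), and subtract to leave the music ACF estimate."""
    M = max(1, int(avg_s * 1000 / hop_ms))             # frames covering ~10 seconds
    avg = np.array([r_high[max(0, i - M // 2): i + M // 2 + 1].mean(axis=0)
                    for i in range(len(r_high))])      # running estimate of avg{r_ee}
    gamma = (np.sum(r_high * avg, axis=1)
             / np.maximum(np.sum(avg * avg, axis=1), 1e-12))[:, None]   # eq. (5)
    return r_high - gamma * avg                        # eq. (4): estimated r_mm
```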

2.3. Pitch Dynamics Estimation

The stability of pitch in time can be estimated by comparing temporally adjacent pairs of the estimated music ACFs:

\Upsilon(n) = S_{\cos}\{\hat{r}_{mm}(n), \hat{r}_{mm}(n + 1)\} \qquad (6)

where S_cos is the cosine similarity (dot product divided by the product of the magnitudes) between the two ACF vectors. Υ is shown in the fourth row of figure 1. The sustained pitches of music result in flat pitch contours in the ACF, and values of Υ that approach 1, as shown in the second column of figure 1. By contrast, speech (column 1) has a constantly-changing pitch contour, resulting in a generally smaller Υ, and the initially larger Υ of stationary periodic noise from, e.g., a machine is attenuated by our algorithm (column 3).
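
The pitch dynamics feature of eq. (6) can be sketched directly on the output of suppress_stationary above; this is again our own illustrative code rather than the released implementation.

```python
import numpy as np

def pitch_dynamics(r_mm):
    """Eq. (6): cosine similarity between temporally adjacent compensated ACF frames.
    Values near 1 indicate sustained, stable pitch; speech and transient noise score lower."""
    a, b = r_mm[:-1], r_mm[1:]
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.maximum(den, 1e-12)   # Upsilon(n); its mean/variance give mDyn/vDyn (section 3)
```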

3. EVALUATION

The pitch dynamics feature Υ was summarized by its mean (mDyn) and variance (vDyn) for the purpose of classifying clips. We compared these features with others that have been used successfully in music detection [2], namely the 4 Hz Modulation Energy (4HzE), the Variance of the spectral Flux (vFlux), and Rhythm (Rth), which we took as the largest peak value of the normalized ACF of an ‘onset strength’ signal [5] over the tempo range (50–300 BPM).
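
For the rhythm feature, one plausible reimplementation uses librosa's onset-strength envelope as a stand-in for the onset signal of the beat tracker in [5]; the hop length, detrending, and peak search below are our assumptions rather than the paper's exact recipe.

```python
import numpy as np
import librosa

def rhythm_feature(y, sr=16000, hop_length=512):
    """Rth: largest peak of the normalised ACF of an onset-strength signal
    within the 50-300 BPM tempo range."""
    oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    oenv = oenv - oenv.mean()                               # detrend before autocorrelation
    acf = np.correlate(oenv, oenv, mode='full')[len(oenv) - 1:]
    acf = acf / max(acf[0], 1e-12)                          # normalise so lag 0 == 1
    fps = sr / hop_length                                   # onset-envelope frame rate
    lags = np.arange(1, len(acf))
    bpm = 60.0 * fps / lags                                 # tempo implied by each lag
    mask = (bpm >= 50) & (bpm <= 300)
    return float(acf[1:][mask].max()) if mask.any() else 0.0
```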

Table 1 compares performance on a data set of random clips captured from broadcast radio, as used in [2]. The data was randomly divided into 15 s segments, giving 120 for training and 60 for testing (20 each of speech, music with vocals, and music without vocals). Classification was performed by a likelihood ratio test of single Gaussians fit to the training data. 4HzE and vFlux have the best performance among single features, but Rth + mDyn + vDyn has the best performance (by a small margin) in distinguishing speech from vocal-free music.
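
The single-Gaussian classifier itself is simple enough to sketch: one full-covariance Gaussian per class and a likelihood-ratio decision. The allow_singular flag is our own safeguard for small training sets, not something stated in the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(X):
    """Fit one full-covariance Gaussian to per-segment feature vectors X (rows = segments)."""
    return multivariate_normal(mean=X.mean(axis=0),
                               cov=np.cov(X, rowvar=False),
                               allow_singular=True)

def classify(x, g_speech, g_music):
    """Likelihood-ratio test between the two class models."""
    return 'music' if g_music.logpdf(x) > g_speech.logpdf(x) else 'speech'
```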

However, classification of clean broadcast audio is not the main goal of our current work. We also tested these features on the soundtracks of 1873 video clips from YouTube [7], returned by consumer-relevant search terms such as ‘animal’, ‘people’, ‘birthday’, ‘sports’ and ‘music’, then filtered to retain only unedited, raw consumer video. Clips were manually sorted into 653 (34.9%) that contained music, and 1220 (65.1%) that did not. We labeled a clip as music if it included clearly-audible professional or quality amateur music (regardless of vocals or other instruments) throughout. These clips were recorded in a variety of locations such as homes, streets, parks and restaurants, and frequently contain noise including background voices and many different types of mechanical noise.

We used 10-fold cross-validation to evaluate performance in terms of accuracy, d′ (the equivalent separation of two normalized Gaussian distributions), and Average Precision (the average of the precision of the ordered returned list truncated at every true item). We compared two classifiers: a single Gaussian as above, and an SVM with an RBF kernel. At each fold, the classifier is trained on 40% of the data, tuned on 20%, and then tested on the remaining 40%, selected at random. For comparison, we also report the performance of the ‘1G+KL with MFCC’ system from [8], which simply takes the mean and covariance matrix of MFCC features over the entire clip, and then uses an SVM classifier with a symmetrized Kullback-Leibler (KL) kernel.
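
The evaluation metrics can be computed along the following lines. The d′ definition from hit and false-alarm rates is one common convention and may not match the paper's exact computation; AP is taken from scikit-learn's average_precision_score.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import average_precision_score

def d_prime(y_true, y_score, threshold=0.0):
    """d': separation of two unit-variance Gaussians, from hit and false-alarm rates
    at a fixed decision threshold (one common definition, used here as an illustration)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_score) > threshold
    eps = 0.5 / len(y_true)                               # avoid infinite z-scores at 0 or 1
    hit = np.clip(y_pred[y_true].mean(), eps, 1 - eps)
    fa = np.clip(y_pred[~y_true].mean(), eps, 1 - eps)
    return norm.ppf(hit) - norm.ppf(fa)

def average_precision(y_true, y_score):
    """AP: precision of the ranked clip list averaged at every true (music) clip."""
    return average_precision_score(y_true, y_score)
```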

As shown in table 2, the new mDyn feature is significantly

Table 2. Music/non-music classification performance on YouTube consumer recordings. Each data point represents the mean and standard deviation of the clip-based performance over 10 cross-validated experiments. d′ is a threshold-independent measure of the separation between two unit-variance Gaussian distributions. AP is the Average Precision over all relevant clips. The best performance in each column is shown in bold for the first three blocks.

                            One Gaussian Classifier                      SVM Classifier
Features                    Accuracy(%)   d′            AP(%)            Accuracy(%)   d′            AP(%)

Rth                         81.9 ± 0.87   1.85 ± 0.06   75.8 ± 1.67      82 ± 0.87     1.83 ± 0.08   80.9 ± 1.02
mDyn                        80.6 ± 0.73   1.67 ± 0.05   70.9 ± 2.08      81.1 ± 0.69   1.66 ± 0.06   77.5 ± 2.23
vDyn                        63 ± 1.44     0.76 ± 0.08   47.7 ± 1.26      66.7 ± 1.37   0.57 ± 0.11   50.2 ± 2.47
4HzE                        65.2 ± 1.08   0.87 ± 0.07   53.7 ± 1.62      68.6 ± 0.89   0.81 ± 0.07   53 ± 1.47
vFlux                       61.5 ± 1.17   0.74 ± 0.09   52.4 ± 1.85      67.3 ± 1.36   0.68 ± 0.09   50.2 ± 2.68

Rth+mDyn                    86.9 ± 0.84   2.17 ± 0.08   86.1 ± 0.94      88.6 ± 0.55   2.36 ± 0.07   91.7 ± 0.69
4HzE+vFlux                  63.9 ± 1.39   0.79 ± 0.07   53.1 ± 2.3       68.5 ± 0.77   0.76 ± 0.07   51.9 ± 1.98

Rth+mDyn+vDyn               89.9 ± 0.67   2.49 ± 0.08   88.2 ± 1.17      90.7 ± 0.81   2.61 ± 0.1    94.8 ± 0.67
Rth+4HzE+vFlux              83 ± 1.32     1.9 ± 0.15    80.5 ± 1.8       85.1 ± 0.91   2.02 ± 0.08   84.7 ± 2
Rth+mDyn+4HzE               88.9 ± 1      2.4 ± 0.12    88 ± 1.16        90.6 ± 1.06   2.57 ± 0.13   94 ± 1.25
Rth+mDyn+vFlux              90 ± 0.72     2.49 ± 0.08   89.3 ± 1.25      91.4 ± 0.84   2.67 ± 0.1    93.8 ± 0.89

Rth+mDyn+vDyn+4HzE          90.6 ± 1.03   2.57 ± 0.12   89.3 ± 1.4       91.7 ± 1.02   2.72 ± 0.14   95.4 ± 0.76
Rth+mDyn+vDyn+vFlux         90.2 ± 0.71   2.52 ± 0.09   88.9 ± 0.97      91.3 ± 0.78   2.66 ± 0.1    94.8 ± 0.87

1G+KL with MFCC             N/A           N/A           N/A              80.2 ± 0.75   1.68 ± 0.007  80.4 ± 1.82

better than the previous features 4HzE or vFlux, which are less able to detect music in the presence of highly-variable noise. The best two- and three-feature combinations are ‘Rth + mDyn’ and ‘Rth + mDyn + vFlux’ (which slightly outperforms ‘Rth + mDyn + vDyn’ on most metrics). This confirms the success of the pitch dynamics feature, Υ, in detecting music in noise. Matlab code for these features is available¹.

4. DISCUSSION AND CONCLUSIONS

An examination of misclassified clips revealed that many represent genuinely ambiguous cases, for instance weak or intermittent music, or partially-musical sounds such as a piano being struck by a baby. There were many examples of singing, such as at birthday parties, that were not considered music by the annotators but were still detected. More clear-cut false alarms occurred with cheering, screaming, and some alarm sounds such as car horns and telephone rings.

In this paper, we have proposed a robust musical pitch detection algorithm for identifying the presence of music in noisy, highly-variable environmental recordings such as the soundtracks of consumer video recordings. We have introduced a new technique for estimating the dynamics of musical pitch and suppressing both aperiodic and stationary periodic noise in the autocorrelation domain. The performance of our proposed algorithm is significantly better than existing music detection features for the kinds of data we are addressing.

1 http://www.ee.columbia.edu/~kslee/projects-music.html

5. REFERENCES

[1] J. Saunders, “Real-time discrimination of broadcast speech/music,” in Proc. ICASSP, 1996, pp. 993–996.

[2] E. Scheirer and M. Slaney, “Construction and evaluation of a robust multifeature speech/music discriminator,” in Proc. ICASSP, 1997, pp. 1331–1334.

[3] T. Zhang and C.-C. J. Kuo, “Audio content analysis for online audiovisual data segmentation and classification,” IEEE Tr. Speech and Audio Proc., pp. 441–457, 2001.

[4] J. Ajmera, I. McCowan, and H. Bourlard, “Speech/music segmentation using entropy and dynamism features in a HMM classification framework,” Speech Communication, vol. 40, no. 3, pp. 351–363, 2003.

[5] D. P. W. Ellis, “Beat tracking by dynamic programming,” J. New Music Res., vol. 36, no. 1, pp. 51–60, 2007.

[6] K. Lee and D. P. W. Ellis, “Voice activity detection in personal audio recordings using autocorrelogram compensation,” in Proc. Interspeech, Pittsburgh, 2006, pp. 1970–1973.

[7] “YouTube - broadcast yourself,” 2006, http://www.youtube.com/.

[8] S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, and J. Luo, “Large-scale multimodal semantic concept detection for consumer video,” in MIR workshop, ACM Multimedia, Germany, Sep. 2007.