Meter and Autocorrelation - Université de Montréaleckdoug/papers/2005_rppw.pdf · 2005. 7. 14. · Meter and Autocorrelation∗ Douglas Eck University of Montreal Department of

Meter and Autocorrelation∗

Douglas EckUniversity of Montreal

Department of Computer ScienceCP 6128, Succ. Centre-Ville

Montreal, Quebec H3C 3J7 [email protected]

Abstract

This paper introduces a novel way to detect metrical structure inmusic. We introduce a way to compute autocorrelation such that thedistribution of energy in phase space is preserved in a matrix. Theresulting autocorrelation phase matrix is useful for several tasks in-volving metrical structure. First we can use the matrix to enhancestandard autocorrelation by calculating the Shannon entropy at eachlag. This approach yields improved results for autocorrelation-basedtempo induction. Second, we can efficiently search the matrix forcombinations of lags that suggest particular metrical hierarchies. Thisapproach yields a good model for predicting the meter of a piece ofmusic. Finally we can use the phase information in the matrix to aligna candidate meter with music, making it possible to perform beat in-duction with an autocorrelation-based model. We argue that the au-tocorrelation phase matrix is a good, relatively efficient representationof temporal structure that is useful for a variety of applications. Wepresent results for several relatively large meter prediction and tempoinduction datasets, demonstrating that the approach is competitivewith models designed specifically for these tasks. We also presentpreliminary beat induction results on a small set of artificial patterns.

∗Presented at Rhythm Perception Production Workshop (RPPW) 2005. Draft. DoNot Cite.

1

1 Introduction

In this paper we introduce an autocorrelation phase matrix, a two-dimensionalstructure (computed from MIDI or digital audio) that provides the necessaryinformation for estimating the lags and phases of the music’s metrical hierar-chy. We use this matrix as the core data structure to estimate the meter of apiece (meter prediction), to estimate the tempo of a piece (tempo induction)and to align the piece of music with the predicted metrical structure (beatinduction).

We will provide algorithm details and experimental results for meter pre-diction and tempo induction. We will also present some details concerningthe alignment of the metrical structure with a piece of music. We will alsopresent alignment results for a small dataset of artificial patterns. Howeverthe details of computing this alignment online (for beat induction) are thetopic of another paper.

The structure of this paper is as follows. In Section 2 we will discussother approaches to finding meter and beat in music. In Section 3 we willdescribe our model consisting of the creation of an autocorrelation matrix,computation of the entropy for each lag in this matrix, the selection of ametrical hierarchy and the alignment of the hierarchy with music. Finally inSection 4 we present simulation results.

2 Meter and Autocorrelation

Meter is the sense of strong and weak beats that arises from the interactionamong hierarchical levels of sequences having nested periodic components.Such a hierarchy is implied in Western music notation, where different levelsare indicated by kinds of notes (whole notes, half notes, quarter notes, etc.)and where bars establish measures of an equal number of beats (Handel,1993). For instance, most contemporary pop songs are built on four-beatmeters. In such songs, the first and third beats are usually emphasized.Knowing the meter of a piece of music helps in predicting other componentsof musical structure such as the location of chord changes and repetitionboundaries (Cooper and Meyer, 1960).

Autocorrelation works by transforming a signal from the time domaininto the frequency domain. Autocorrelation provides a high-resolution pic-ture of the relative salience of different periodicities, thus motivating its use

2

in tempo and meter related music tasks. However, the autocorrelation trans-form discards all phase information, making it impossible to align salientperiodicities with the music. Thus autocorrelation can be used to predict,for example, that music has something that repeats every 1000ms but it can-not say when the repetition takes place relative to the start of the music. Oneprimary goal of our work here is to compute autocorrelation efficiently whileat the same time preserving the phase information necessary to perform suchan alignment. Our solution is the autocorrelation phase matrix.

Autocorrelation is certainly not the only way to perform meter predictionand related tasks like tempo induction. Adaptive oscillator models (Largeand Kolen, 1994; Eck, 2002) can be thought of as a time-domain correlate toautocorrelation based methods and have shown promise, especially in cogni-tive modeling. Multi-agent systems such as those by Dixon (2001) have beenapplied with success. as have Monte-Carlo sampling (Cemgil and Kappen,2003) and Kalman filtering methods (Cemgil et al., 2001). However, dueto space constraints we will omit details of these approaches and focus heresolely on autocorrelation methods.

Brown (1993) used autocorrelation to find meter in musical scores rep-resented as note onsets weighted by their duration. The durational accentshe used is applicable for musical score analysis but is impractical for digitalaudio due to difficulties in computing note durations. However it was oneof the first reported uses of autocorrelation for meter prediction. Brown re-ported that the model was able to provide a reliable estimate of meter usingrelatively little computational power.

Vos et al. (1994) proposed a similar autocorrelation method. The primarydifference between their work and that of Brown was their use of melodicintervals in computing accents. They applied their model to compositionsby Bach, demonstrating the usefulness of melodic accent in detecting meterin these examples.

Scheirer (1998) provided a model of beat tracking that treats audio filesdirectly and performs relatively well over a wide range of musical styles (41correct of 60 examples). Though he does not use autocorrelation he usesrelated comb filtering techniques elements such as note onset times. Hismodel required extensive multi-band preprocessing. He filtered an audiosignal into several bands and then downsampled, differentiated and rectifiedeach band. He then passed these signals into a bank of 150 comb filters,selecting the maximum output to recover the tempo and phase. Tempochanges were handled by repeatedly changing the choice of filter.

3

Volk (2004) explored the influence of interactions between levels in themetrical hierarchy on metrical accenting. Her model compared a metricinterpretation gained by analyzing note onsets to an interpretation gained byanalyzing the time signature of the musical score. Her method included thecomputation of a weighted score of a particular candidate meter as extendedthrough the entire piece of music.

Toiviainen and Eerola (2004) also investigated an autocorrelation-basedmeter induction model. Their focus was on the relative usefulness of dura-tional accent and melodic accent in predicting meter. The authors observedthat durational and melodic accents provide a modest boost in performancewhen used in conjunction with unaccented data, but that unaccented datawas the most useful single factor for successful meter classification. Centralto their model was stepwise discriminant function analysis, a powerful tool foranalyzing data. However as this method does not control against overfitting,further tests are necessary to know how well the model will generalize.

Klapuri et al. (2005) incorporate the signal processing approaches of Goto(2001) and Scheierer in a model that analyzes the period and phase of threelevels of the metrical hierarchy: the fastest-changing level or tatum the mostprominent level or tactus (usually the same level as the foot tapping rate),and the level at which musical measures are grouped. A probabilistic modelaided by hand-encoded musical prior knowledge is used for joint estimationof pulses at the three levels. A fixed three-level approach to meter maypose difficulties for processing rhythmically simple music, where the tactusand tatum may be identical (imagine a fugue consisting entirely of eighthnotes). It may also pose difficulties for processing rhythmically complexmusic where there can exist multiple stable metrical levels separating thetatum and tactus. Despite this, the model in fact performs very well at tempoinduction as seen in the ISMIR 2004 Tempo Induction contest (Gouyon et al.,2005). We return to this in Section 4.

3 Model Details

We describe a model that uses autocorrelation as its core, but that takes ad-vantage of the distribution of energy in phase space as a method to overcomeweaknesses in standard autocorrelation. The model is described in Sections3.1 through 3.6.

4

3.1 Preprocessing

For MIDI files, the onsets can be transformed into spikes with amplitudeproportional to their midi note onset volume. Alternately MIDI files cansimply be rendered as audio and written to wave files. Stereo audio files areconverted to mono by taking the mean of the two channels. Then files aredownsampled to some rate near 1000Hz. The actual rate is kept variablebecause it depends on the original sampling rate. For CD-audio (44.1Khz),we used a sampling rate of 1050Hz allowing us to downsample by a factor of42 from the original file. Best results were achieved by computing a sum-of-squares envelope over windows of size 42 with 5 points of overlap. Howeverfor most audio sources a simple decimation and rectification works as well.The model was not very sensitive to changes in sampling rate nor to minoradjustments in the envelope computation such as substituting RMS (rootmean square) for the sum of squares computation.

One of our goals was to avoid complicated preprocessing, and we suc-ceeded in doing so. However there is no reason that our model could not beadapted to work with multi-band filtering approaches as used by, e.g., Kla-puri et al. (2005); Goto (2001). This did not seem necessary for our currentexperiments but it may be necessary for future work in online beat tracking.

3.2 Autocorrelation Phase Matrix

The method of cross-correlation is commonly used to evaluate whether twosignals exhibit common features and are therefore correlated (Ifeachor andJervis, 1993). To perform cross-correlation one computes the sum of theproducts of corresponding pairs of two signals. A range of lags are considered,accounting for potential time delays between correlated information in thetwo signals. The formula for the lag k cross-correlation Ck between signalsx1 and x2 (having length N) is:

Ck(X1, X2) =1

N

∑0<n<N−k

x1(n) ∗ x2(n + k) (1)

Autocorrelation is a special case of cross-correlation where x1 == x2.There is a strong and somewhat surprising link between autocorrelation andthe Fourier transform. Namely the autocorrelation A of a signal X (havinglength N) is:

A(X) = ifft(|fft(X)|) (2)

5

where fft is the (fast) Fourier transform, ifft is the inverse (fast) Fouriertransform and || is the complex modulus. One advantage of autocorrelationfor our purposes is that it is defined over periods rather than frequencies (notethe application of the IFFT in Equation 2), yielding better representation oflow-frequency information than is possible with the FFT.

Autocorrelation values for a random signal should be roughly equal acrosslags. Spikes in an autocorrelation indicate temporal order in a signal, makingit possible to use autocorrelation to find the periods at which high correla-tion exists in a signal. As a music example, consider the autocorrelation fora ChaChaCha from the ISMIR 2004 Tempo Induction contest is shown (Fig-ure 1). The peaks of the autocorrelation align with the tempo and integermultiples of the tempo.

Unfortunately autocorrelation has been shown in practice to not workwell for many kinds of music. For example when a signal lacks strong onsetenergy, as it might for voice or smoothly changing musical instruments likestrings, the autocorrelation tends to be flat. See for example a song fromManos Xatzidakis from the ISMIR 2004 Tempo Induction in Figure 2. Herethe peaks are less sharp and are not well-aligned with the target tempo. Notethat the y-axis scale of this graph is identical to that in Figure 1.

One way to address this is to apply the autocorrelation to a numberof band-pass filtered versions of the signal, as discussed in Section 3.1. Inplace of multi-band processing we compute the distribution of autocorrelationenergy in phase space. This has a sharpening effect, allowing autocorrelationto be applied to a wider range of signals than autocorrelation alone withoutextensive preprocessing.

The autocorrelation phase information for lag l is a vector Al:

Al =

bN−ll

c∑i=0

xli+φ xl(i+1)+φ

l−1

φ=0

(3)

We compute an autocorrelation phase vector Al for each lag of interest.In our case the minimum lag of interest was 200ms and the maximum lag ofinterest was 3999ms. Lags were sampled at 1ms intervals yielding L = 3800lags. Equation 3 effectively “wraps” the signal modulo the lag l question,yielding vectors of differing lengths (|Al| == l). To simplify later compu-tations we normalized the length of all vectors by computing a histogramestimate. This was achieved by fixing the number of phase points for all

6

0 500 1000 1500 2000 2500 3000 3500 40000

100

200

300

400

500

600

True lag

lag (msec)

auto

corr

elat

ion

Albums−Cafe_Paradiso−08.wav

Figure 1: Autocorrelation of a ChaChaCha from the ISMIR 2004 Tempo Inductioncontest (Albums-Cafe Paradiso-08.wav). The dotted vertical line marks the actualtempo of the song (484 msec, 124 bpm).

lags at K (K = 50 for all simulations; larger values were tried and yieldedsimilar results but significantly smaller values resulted in a loss of temporalresolution) and resampling the variable length vectors to this fixed length.This process yielded a rectangular autocorrelation phase matrix P where|P | = [L, K].

As an example of an autocorrelation phase table, consider Figure 3, whichshows the rectified normalized signal from a piano rendition of one of therhythmic patterns from Povel and Essens (1985). The pattern was renderedwith a base inter-onset-interval of 300ms. On the left in Figure 4 the auto-correlation phase matrix is shown. On the right, the sum of the matrix is

7

shown. It is the standard autocorrelation.

3.3 Autocorrelation Phase Entropy

As already discussed, is possible to improve significantly on the performanceof autocorrelation by taking advantage of the distribution of energy in theautocorrelation phase matrix. The idea is that metrically-salient lags willtend to be have more “spike-like” distribution than non-metrical lags. Thuseven if the autocorrelation is evenly distributed by lag, the distribution ofautocorrelation energy in phase space should not be so evenly distributed.There are at least two possible measures of “spikiness” in a signal, varianceand entropy. We focus here on entropy, although experiments using varianceyielded very similar results.

Entropy is the amount of “disorder” in a system. Shannon entropy H:

H(X) = −N∑

i=1

X(i)log2[X(i)] (4)

where X is a probability density.We compute the entropy for lag l in the autocorrelation phase matrix by

as follows:

Asum =N∑

i=0

Al(i) (5)

Hl = −N∑

i=0

Al(i)/Asumlog2[Al(i)/Asum] (6)

This entropy value, when multiplied into the autocorrelation, significantlyimproves tempo induction. For example, in Figure 5 we show the autocor-relation along with the autocorrelation multiplied by the entropy for thesame Manos Xatzidakis show in in Figure 2. On the bottom observe howthe detrended (1- entropy) information aligns well with the target lag andits multiples. (Detrending was done to remove a linear trend that favorsshort lags. Simulations revealed that performance is only slightly degradedwhen detrending is omitted.) Most robust performance was achieved whenautocorrelation and entropy were multiplied together. This was done by de-trending both the autocorrelation and entropy vectors, scaling them bothbetween 0 and 1 and then multiplying them together.

8

3.4 Metrical hierarchy selection

We now move away from the autocorrelation phase matrix for the momentand address task of selecting a winning metrical hierarchy. A rough estimateof meter can be had by simply summing hierarchical combinations of auto-correlation lags. In place of standard autocorrelation we use the product ofautocorrelation and (1 - entropy) AE as described above. The likelihood ofa duple meter Mduple existing at lag l can be estimated using the followingsum:

Mduplel = AE(l) + AE(2l) + AE(4l) + AE(8l) (7)

The likelihood of a triple meter is estimated using the following sum:

M triplel = AE(l) + AE(3l) + AE(6l) + AE(12l) (8)

Other candidate meters can be constructed. using similar combinationsof lags. A winning meter can be chosen by sampling all reasonable lags (e.g.200ms <= l <= 2000ms) and comparing the resulting M∗

l values. Providedthat the same number of points are used for all candidate meters, these M∗

l

values can be compared directly, allowing for a single winning meter to beselected among all possible lags and all possible meters. Furthermore, thissearch is efficient given that each lag/candidate meter combination requiresonly a few additions. For the meter prediction simulations in Section 4 thiswas the process used to select the meter.

3.5 Prediction of tempo

Once a metrical hierarchy is chosen, there are several simple methods forselecting a winning tempo from among the winning lags. One option is topick the lag closest to a comfortable tapping rate, say 600ms. A secondbetter option is to multiply the autocorrelation lags by a window such thatmore accent is placed on lags near a preferred tapping rate. The windowcan be applied either before or after choosing the hierarchy. If it is appliedbefore selecting the metrical hierarchy, then the selection process is biasedtowards lags in the tapping range. We tried both approaches; applying thewindow before selection yields better results, but only marginally better (onthe order of 1% better performance on the tempo prediction tasks describedbelow). To avoid adding more parameters to our model we did not construct

9

our own windowing function. Instead we used the function (with no changesto parameters) described in Parncutt (1994): a Gaussian window centeredat 600ms and symmetrical in log-scale frequency.

3.6 Alignment of predicted hierarchy with signal

The autocorrelation phase matrix provides the necessary information foraligning the selected metrical hierarchy with a score. Such an alignmentis useful for task like downbeat induction.

Our strategy for alignment is to integrate information from the autocor-relation phase matrix at all levels in the selected metrical hierarchy. As anexample of this process, consider again the first Povel & Essens pattern showin in Figure 3. The autocorrelation phase matrix is shown in Figure 4. Giventhat the rows represent relative phase, it is illustrative to distort the matrixinto a disk, as seen in Figure 6. Here progressively slower (longer) lags areshown further from the origin.

The metrical hierarchy selection algorithm described above in Section 3.4selects a duple meter at lags 300, 600, 1200, 2400 and 4800 ms. (This isthe correct set of lags. Recall that the pattern was rendered with 300msinter-onset intervals.) If we display only these lags on the disk, the metricalstructure of the pattern begins to emerge. See Figure 7.

The metrical hierarchy selection algorithm chooses a small set of rowsfrom the autocorrelation phase matrix. We will interpret these as defininga genuine metrical hierarchy. Recall that a metrical hierarchy is a set ofnested, aligned periodicities. This suggests that slower lags must align withfaster lags. This provides us with a strong constraint on how to generatean alignment. In a bottom-up fashion (from small lags to long lags) we willselect a winner and then constrain subsequent winners to align with previouswinners. Our constraint will not be a hard one. Instead we will simplymultiply slower lags by the phase-aligned value at the closest faster level. (Itis important not to have a hard constraint here in order to allow effects likesyncopation to be seen. This topic is unfortunately out of the scope of thispaper).

This bottom-up level-by-level multiplication yields a new set of autocor-relation phase values that are accented based on the selected meter (Figure8). Observe that without the bottom-up propagation of metrical informa-tion, the autocorrelation phase matrix reveals no preference for which of thenine events cycled at 4800ms should be selected as a downbeat. After the

10

bottom-up propagation, the correct downbeat is properly accented.This example only considers a short repeating pattern having no acceler-

ation or deceleration. To apply the model to online tasks like beat induction,it is necessary to compute the model online on windowed audio and to copewith tempo changes. Our approach to this is to apply standard slow expo-nential decay to the autocorrelation phase matrix and to incorporate newevidence from the signal into the matrix such that there is some spreadingof energy to near tempos. With this approach it is not necessary to rebuildthe matrix but simply update it for each lag, thus making an efficient im-plementation possible. Another approach would be to use a Hidden MarkovModel to smooth the window-by-window predictions. This is similar to theapproach taken by Klapuri et al. (2005) to incorporate evidence in theirthree-level model.

4 Simulations

We have run the model on several datasets. To test tempo induction we usedthe Ballroom and Song Excerpts databases from the ISMIR 2004 Tempo In-duction contest. For testing the ability of the model to perform meter pre-diction we used the the Essen European Folksong database and the FinnishFolk Song database. We also include preliminary simulations on alignmentusing the 35 artificial patterns from Povel and Essens (1985) as well as

4.1 ISMIR 2004 Tempo Induction

We used two datasets from the ISMIR 2004 Tempo Induction contest (Gouyonet al., 2005). The first dataset was the Ballroom dataset consisting of 698 wavfiles each approximately 30 seconds in duration encompassing eight musicalstyles. See Table 1 for a breakdown of song styles along with the performanceof our model on the dataset. In the table, “Acc. A” is Accuracy A from thecontest: the number of correct predictions within 4% of the target tempo.“Acc. B” is Accuracy B from the contest. It also takes into account missesdue to predicting the wrong level of the metrical hierarchy. Thus answersare treated as correct if they are within 4% of the target tempo multipliedby 2,3,1/2 or 1/3. “Acc C.” is our own measure which also treats answers ascorrect if they are within 4% of the target tempo multiplied by 2/3 or 3/2.This gives us a measure of model failure due to predicting the wrong meter.

11

Table 1: Performance of model by genre on the Ballroom dataset. See textfor details.

Style Count Acc. A Acc. B Acc. CChaChaCha 111 106 107 109Jive 60 6 60 60Quickstep 82 0 77 80Rumba 98 84 85 92Samba 86 78 79 83Tango 86 81 82 83Vienn.Waltz 65 0 57 64Waltz 110 86 86 93Global 698 441 633 664

We computed several baseline models for the ballroom dataset. Theseresults are shown along with our best results and those of the contest win-ner, Klapuri et al. (2005), in Table 2. The “Acorr Only” model uses simpleautocorrelation. The “Acorr+Meter” model incorporates the strategy de-scribed in this paper for using multiple hierarchically-related lags in predic-tion. The “Acorr+Entropy” uses autocorrelation plus entropy as computedon the phase autocorrelation matrix (but no meter). The full model couldalso be called “Acorr+Entropy+Meter” and is the one described in this pa-per. “Klapuri” shows the results for the contest winner.

Two things are important to note. First, it is clear that both of our twomain ideas, meter reinforcement (“Meter”) and entropy calculation (“En-tropy”) aid in computing tempo. Second, the model seems to work well,returning results that compete with the contest winner.

We also used the “Song Excerpts” dataset from the ISMIR 2005 dataset.This dataset consisted of 465 songs of roughly 20sec duration spanning ninegenres. Due to space constraints, we do not report model performance onindividual genres. In table Table 3 the results are summarized in a formatidentical to Table 2.

Here it can be seen that our model performed slightly better than thewinning model on Accuracy A but performed considerably worse on AccuracyB. In our view, Accuracy B is a more important measure because it reflectsthat the model has correctly predicted the metrical hierarchy but has simply

12

Table 2: Summary of models on the Ballroom dataset. See text for details.

Model Acc. A Acc. B Acc. CAcorr Only 49% 77% 77%Acorr+Meter 58% 80% 85%Acorr+Entropy 41% 85% 85%Full Model 63% 91% 95%Klapuri 63% 91% 93%

Table 3: Summary of models on the Song Excerpts dataset. See text fordetails.

Model Acc. A Acc. B Acc. CAcorr Only 49% 64% 64%Acorr+Meter 50% 80% 85%Acorr+Entropy 53% 74% 74%Full Model 60% 79% 88%Klapuri 58% 91% 94%

failed to report the appropriate level in the hierarchy.

4.2 Essen Database

We computed our model on a subset of the Essen collection (Schaffrath,1995) of European folk melodies. We selected all melodies in either duple (i.e.having 2n eighth notes per measure; e.g. 2/4 and 4/4) or triple/compoundmeter (i.e having 3n eighth notes per measure; e.g. 3/4 and 6/8). Thisresulted in a total of 5507 melodies of which 57% (3121) were in duple meterand 43% (2386) were in triple/compound meter. The task was to predict themeter of the piece as being either duple or triple/compound. This is exactlythe same dataset and task studied in Toiviainen and Eerola (2004).

Our results were promising. We classified 90% of the examples cor-rectly (4935 of 5507 correct). Our model performed better on duples thantriple/compounds, classifying 94% of the duple examples correctly (2912 of3121 correct) and 85% of the triple/compound examples correctly (2023 of

13

2386 correct).These success rates are similar to those in Toiviainen and Eerola (2004).

However it is difficult to compare our approaches because their data analysistechnique (stepwise discriminant function analysis) does not control for in-sample versus out-of-sample errors. Functions are combined using the targetvalue (the meter) as a dependent variable. This is suitable for weighing therelative predictive power of each function but not suitable for predicting howwell the ensemble of functions would perform on unseen data unless trainingand testing sets or cross-validation is used. Our approach used no supervisedlearning.

4.3 Finnish Folk Songs Database

We performed the same meter prediction task on a subset of the Finnish Folk-song database (Eerola and Toiviainen, 2004). This dataset was also treatedby Toiviainen and Eerola (2004) and the selection criteria were the same. Forthis dataset we used 7139 melodies of which 80% (5720) were in duple meterand 20% (1419) were triple/compound meter. (For the Toiviainen et. al.study, 6861 melodies were used due to slightly more stringent selection cri-teria. However the ratio of duples to triple/compounds is almost identical.)Note that the datasets are seriously imbalanced: a classifier which alwaysguesses duple will have a success rate of 80%. However given the relativepopularity of duple over triple, this imbalance seems unavoidable.

Our results were promising. We classified 93% examples correctly (6635 of71239 correct). Again, our model performed better on duples than triple/compounds,classifying 95% of the duple examples correctly (5461 of 5720 correct) and83% of the triple/compound examples correctly (1174 of 1419 correct).

4.4 Povel & Essens Patterns

To test alignment (beat induction) we used a set of rhythms from Experiment1 of Povel and Essens (1985). These rhythms are generated by permutingthe interval sequence 1 1 1 1 1 2 2 3 and terminating it by the interval4. These length-16 patterns all contain nine notes and seven rests, and arecycled for the oscillator.

Their model works by applying a set of rules that forced the accentuationof (a) singleton isolated events, (b) the second of two isolated events and(c) the first and last of a longer group of isolated events. Of particular

14

importance is that they validated their model using a set of psychologicalexperiments with human subjects.

Our model predicted the correct downbeat (correct with respect to thePovel & Essens model) 97% of the time (34 of 35 patterns). The patternwhere the model failed was pattern 27. Our interest in this dataset lies lessin the error rate and more in the fact that we can make good predictions forthese patterns without resorting to perceptual accentuation rules.

5 Discussion

Though the model does not perform as well as Klapuri et. al. on AccuracyB of the Song Excerpts dataset, it still performs quite well on tempo extrac-tion in general. It achieves this without complex multi-band preprocessingand without supervised learning. While we must compute the phase auto-correlation table, which is time consuming, there are other motivations forcomputing this table such as performing an alignment. Thus the time spentcomputing the table may be offset by an ability to reuse the data structurein several ways. Finally, we had the Ballroom and Song Excerpts datasetfor nearly a month. Though our model does not use supervised learningand thus cannot explicitly cheat, we admit that it also possible to improve anonparametric model by improving it using the same dataset for which oneis reporting results.

The model seems to perform basic meter categorization relatively well.It performed at competitive levels on both the Essen and the Finnish sim-ulations. Furthermore it achieved good performance without risk of under-generalizing due to overfitting from supervised learning. One area of currentresearch is to see how well the model does at aligning (identifying the locationof downbeats) in the Essen and Finnish databases.

As evidenced by the Povel & Essens results, the model has potentialfor performing alignment of an induced metrical hierarchy with a musicalsequence. Though we have many other examples of this ability performance,including some entertaining automatic drumming to Mozart compositions,we have yet to undertake a methodical study of the the limitations of ourmodel on alignment. This, and related tasks like online beat induction, areareas of ongoing research.

15

6 Conclusions

This paper introduces a novel way to detecting metrical structure in a mu-sic and to use meter as an aid in detecting tempo. Two main ideas wereexplored in this paper. First we discussed an improvement to using autocor-relations for musical feature extraction via the computation of an autocor-relation phase matrix. We also discussed computing the Shannon entropyfor each lag in this matrix as a means for sharpening the standard autocor-relation. Second we discussed ways to use the autocorrelation phase matrixto compute an alignment of a metrical hierarchy with music. We appliedthe model to the tasks of meter prediction and tempo induction on largedatasets. We also provided preliminary results for aligning the metrical hi-erarchy with the piece (downbeat induction). Though much of this work ispreliminary, we believe the results in this paper suggest that the approachwarrants further investigation.

7 Acknowledgements

We would like to thank Fabien Gouyon, Petri Toiviainen and Tuomas Eerolafor many helpful email correspondences.

References

Brown, J. (1993). Determination of meter of musical scores by autocorrela-tion. Journal of the Acoustical Society of America, 94:953–1957.

Cemgil, A. T. and Kappen, H. J. (2003). Monte Carlo methods for tempotracking and rhythm quantization. Journal of Artificial Intelligence Re-search, 18:45–81.

Cemgil, A. T., Kappen, H. J., Desain, P., and Honing, H. (2001). On tempotracking: Tempogram representation and Kalman filtering. Journal of NewMusic Research, 28:4:259–273.

Cooper, G. and Meyer, L. B. (1960). The Rhythmic Structure of Music. TheUniv. of Chicago Press.

16

Dixon, S. E. (2001). Automatic extraction of tempo and beat from expressiveperformances. Journal of New Music Research, 30(1):39–58.

Eck, D. (2002). Finding downbeats with a relaxation oscillator. Psychol.Research, 66(1):18–25.

Eerola, T. and Toiviainen, P. (2004). Digital Archive of FinnishFolktunes. [computer database]. University of Jyvaskyla.http://www.jyu.fi/musica/sks.

Goto, M. (2001). An audio-based real-time beat tracking system for musicwith or without drum-sounds. Journal of New Music Research, 30(2):159–171.

Gouyon, F., Klapuri, A., Dixon, S., Alonso, M., Tzanetakis, G., Uhle, C., andCano, P. (2005). An experimental comparison of audio tempo inductionalgorithms. Soumis.

Handel, S. (1993). Listening: An introduction to the perception of auditoryevents. MIT Press, Cambridge, Mass.

Ifeachor, E. C. and Jervis, B. W. (1993). Digital Signal Processing: A Prac-tical Approach. Addison-Wesley Publishing Company.

Klapuri, A., Eronen, A., and Astola, J. (2005). Analysis of the meter ofacoustic musical signals. IEEE Trans. Speech and Audio Processing. Toappear.

Large, E. W. and Kolen, J. F. (1994). Resonance and the perception ofmusical meter. Connection Science, 6:177–208.

Parncutt, R. (1994). A perceptual model of pulse salience and metrical accentin musical rhythms. Music Perception, 11:409–464.

Povel, D. and Essens, P. (1985). Perception of temporal patterns. MusicPerception, 2:411–440.

Schaffrath, H. (1995). The Essen Folksong Collection in Kern Format. [com-puter database]. Center for Computer Assisted Research in the Humani-tites.

17

Scheirer, E. (1998). Tempo and beat analysis of acoustic musical signals.Journal of the Acoustical Society of America, 103(1):588–601.

Toiviainen, P. and Eerola, T. (2004). The role of accent periodicities in meterinduction: a classificatin study. In Lipscomb, S., Ashley, R., Gjerdingen,R., and Webster, P., editors, The Proceedings of the Eighth InternationalConference on Music Perception and Cognition (ICMPC8), Adelaide, Aus-tralia. Causal Productions.

Volk, A. (2004). Exploring the interaction of pulse layers regarding theirinfluence on metrical accents. In Lipscomb, S., Ashley, R., Gjerdingen,R., and Webster, P., editors, The Proceedings of the Eighth InternationalConference on Music Perception and Cognition (ICMPC8), Adelaide, Aus-tralia. Causal Productions.

Vos, P., van Dijk, A., and Schomaker, L. (1994). Melodic cues for metre.Perception, 23:965–976.

18

0 500 1000 1500 2000 2500 3000 3500 40000

100

200

300

400

500

600

True lag

lag (msec)

auto

corr

elat

ion

15−AudioTrack 15.wav

Figure 2: Autocorrelation of a song by Manos Xatzidakis from the ISMIR 2004Tempo Induction contest (15-AudioTrack 15.wav). The dotted vertical line marksthe actual tempo of the song (563 msec, 106.6 bpm). Compare the flatness of theautocorrelation and the lack of alignment between peaks and the target.

19

0 1000 2000 3000 4000 5000 6000 7000 8000 90000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

abs(

x), n

orm

aliz

ed

time ms

Pattern 1 Povel & Essens (1985), piano

Figure 3: The rectified normalized signal generated by creating a piano renderingfrom a MIDI version of Povel & Essens Pattern 1. Two repetitions of the length-16nine-event pattern are shown. See Povel and Essens (1985) for details.x

20

Figure 4: The autocorrelation phase matrix for Povel & Essens Pattern 1 com-puted for lags 250Ms through 500ms. The phase points are shown in terms ofrelative phase (0, 2π). On the right it is shown that taking the sum of the matrixby row yields exactly the autocorrelation.

21

0

0.2

0.4

0.6

0.8

1

auto

corr

elat

ion

15−AudioTrack 15.wav

0 500 1000 1500 2000 2500 3000 3500 40000

0.2

0.4

0.6

0.8

1

True lag

1 −

entro

py

lag (msec)

Figure 5: Autocorrelation and entropy calculations for the same Manos Zatzidakissong shown in Figure 2. The top is the autocorrelation and is identical to Figure2 except that it is scaled to [0, 1]. On the bottom is (1 - entropy), scaled to [0, 1]and detrended. Observe how the entropy spikes align well with the correct tempolag of 563ms and with its integer multiples (shown as vertical dotted lines in bothplots.

22

Figure 6: The autocorrelation phase matrix for Povel & Essens Pattern 1 shownas a disk with progressively slow (longer) lags shown further from the origin.

23

Figure 7: The autocorrelation phase matrix for Povel & Essens Pattern 1. Onlythose lags chosen by the metrical hierarchy selection algorithm (300, 600,1200,2400and 4800ms) are shown. The outermost ring shows the entire 9-element repeatingpattern.

24

Figure 8: The autocorrelation phase matrix for Povel & Essens Pattern 1 afterbottom-up propagation of metrical information. Progressively slower lags (furtherout on the disk) are multiplied by the phase-adjusted values at the next faster(closer) level. This biases slower lags to be phase-aligned with faster lags. Noticethat the outermost ring containing the 9-element repeating pattern now reflectsmetrical accenting, making it easy to select the correct downbeat.

25

Documents

Meter and Autocorrelation - Université de Montréaleckdoug/papers/2005_rppw.pdf · 2005. 7. 14. · Meter and Autocorrelation∗ Douglas Eck University of Montreal Department of