
Optimal filtering of dynamics in short-time features for music organization

Jerónimo Arenas-García, Jan Larsen, Lars Kai Hansen and Anders Meng
Informatics and Mathematical Modelling
Technical University of Denmark
2800 Kgs. Lyngby, Denmark

{jag,jl,lkh,am}@imm.dtu.dk

Abstract

There is an increasing interest in customizable methods for organizing music collections. Relevant music characterization can be obtained from short-time features, but it is not obvious how to combine them to get useful information. In this work, a novel method, denoted as the Positive Constrained Orthonormalized Partial Least Squares (POPLS), is proposed. Working on the periodograms of MFCC time series, this supervised method finds optimal filters which pick up the most discriminative temporal information for any music organization task. Two examples are presented in the paper, the first being a simple proof-of-concept, where an alto sax with and without vibrato is modelled. A more complex 11 music genre classification setup is also investigated to illustrate the robustness and validity of the proposed method on larger datasets. Both experiments showed the good properties of our method, as well as superior performance when compared to a fixed filter bank approach suggested previously in the MIR literature. We think that the proposed method is a natural step towards a customized MIR application that generalizes well to a wide range of different music organization tasks.

Keywords: Music organization, filter bank model, positive constrained OPLS

1. Introduction

The interest in automated methods for organizing music is increasing, primarily due to the large-scale digitization of music. Music distribution is no longer limited to physical media; users can download music titles directly from Internet services such as iTunes or Napster 1. Portable players easily store most users' personal collections and allow the user to bring the music anywhere. Navigating these seemingly endless streams of music appears daunting with current technologies. However, the increased research conducted in the field of music information retrieval (MIR) will aid users in organizing and navigating their music collections. Furthermore, there has been an increasing interest in customization when organizing the music, see e.g. [1, 2], which provides better control of the user's individual collection. The problems that researchers face when working with customization, especially in MIR, are many and indeed require robust machine learning algorithms for handling the large amount of data available for an average user. User interaction could take the form of organizing the music collection into specific taxonomies. This could be the simple flat genre taxonomy frequently used in portable players, or taxonomies based on instrumentation, artist or theme, see e.g. www.allmusic.com and [1]. Customization in terms of predicting a user's personal music taste was investigated in [3], where a support vector machine was applied in connection with active retrieval.

1 www.itunes.com and www.napster.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2006 University of Victoria

The present work introduces a method for learning important dynamical structure in the short-time features 2 extracted from the music, in such a way that this information is as relevant as possible for a given music organization task. The basic idea stems from the work of [4], where the authors investigated an audio classification task using different perceptual (and non-perceptual) short-time features at larger time-scales. A periodogram was computed for each short-time feature dimension over a frame corresponding to ∼768 ms, followed by a summarization of the power in 4 predefined frequency bands using a filter bank. This method was investigated in greater detail in [5], where different methods for handling dynamics of short-time features, denoted as temporal feature integration 3, were investigated. The fixed filter bank applied in [4] was selected from the assumed importance of the dynamics in the short-time features for the given learning task. The method, however, is not general enough: for a custom music organization task, the dynamics in the short-time features are context dependent (i.e., the relevant pattern of temporal changes in short-time features is expected to be different for, e.g., vibrato/non-vibrato detection, or for genre classification tasks), which is the reason for suggesting a method where an optimal filter bank is learned for a particular music organization task.

2 Short-time features are usually extracted from music at time scales around 5−100 ms.

3 Temporal feature integration is the process of combining all the feature vectors in a time-frame into a single new feature vector in order to capture the relevant temporal information in the frame.


The content of this paper is structured as follows. Section 2 presents the short-time features used and briefly describes the method in [4] for capturing the dynamic structure in the short-time features. Section 3 introduces the positive constrained OPLS, which can be used to find an optimal filter bank for any given music organization task. In Section 4, two experiments are described: the first experiment is a proof-of-concept illustrating the quality of the filters obtained using the proposed method, when discriminating between vibrato/non-vibrato playing of music instruments; in the second experiment, we compare the filter banks derived from our method with those proposed in [4], within an 11-genre flat taxonomy music genre classification task. Section 5 provides the conclusion and suggestions for future work.

2. Music Feature Extraction

The complete system considered in this work is illustrated in Figure 1. The purpose of the overall system is to classify music data according to some criterion, such as genre or presence of vibrato, so we are assuming that some labelled data is available for the design. Starting from the raw digital audio signal, an initial step towards an automated organization of music is feature extraction. This is the process of extracting relevant information from the audio signal that can be used in a subsequent learning algorithm. A music signal is typically stationary in periods ranging from 5−100 ms, see e.g. [6], and features extracted at this time-scale are denoted short-time features.

2.1. Short-time features

The Mel Frequency Cepstral Coefficients (MFCCs) have been selected as short-time features in this work. These coefficients were originally developed for automatic speech recognition, aiming at deconvolving the effects of the vocal tract shape and the vocal cord excitation. However, they have been applied with great success in various fields of MIR, see e.g. [7, 4, 3]. The features are perceptually inspired, meaning that they resemble the auditory system of humans. The MFCCs are ranked in such a manner that the lower-order MFCCs contain information about the slow variations in the spectral envelope. Hence, by including the higher-order MFCCs, a richer representation of the spectral envelope is obtained.

For this investigation, the 6 initial MFCCs have been used, including the first coefficient, which is correlated with the perceptual dimension of loudness. In the investigations, each music snippet is power normalized prior to the MFCC extraction stage. A frame-size of 30 ms and a hop-size of 7.5 ms have been applied in all experiments to minimize aliasing in the MFCCs.

2.2. Temporal feature integration

Temporal feature integration is the process of combining all the feature vectors in a time-frame into a single new feature vector in order to capture the relevant information in the frame. Formally, this amounts to the following (see also Fig. 1):

z_k = f(x_{k·hs_x}, x_{k·hs_x+1}, ..., x_{k·hs_x+fs_x−1}),    (1)

where x represents the short-time features (MFCCs), fs_x is the frame-size, and hs_x the hop-size, both defined in number of samples. The function f(·) maps the sequence of short-time features into a single vector z_k, for k = 0, 1, ..., K − 1.

In [4] it was proposed to perform temporal feature integration by estimating the power spectrum of the MFCCs using the periodogram method [8]. In addition to this, the authors propose to summarize the energy in different frequency bands using a predefined filter bank:

z̃_k^{(i)} = W^T z_k^{(i)}    (2)

where z_k^{(i)} is the periodogram of dimension D_z of the i-th MFCC coefficient over some frame fs_x, k = 0, ..., K − 1 is the index at the larger time-scale, and W comprises the frequency magnitude response of the filter bank. Finally, the feature vector z̃^{(i)}, which has as many components as the number of filters in the bank, is used as an input to the subsequent classification process.

In other words, the temporal feature extraction stage consists of estimating the periodogram of each MFCC dimension independently over some time-frame fs_x, after which a filter bank W is applied. In the following sections we omit the superscript i, meaning that each short-time feature dimension is processed independently, using the same filter bank for all MFCCs.

The filter bank W is a matrix of dimension D_z × 4, where D_z = fs_x/2 + 1 (throughout this paper we will use fs_x = 256, so that D_z = 129), which simply summarizes the power components in four frequency bands:

1. 0 Hz (DC value)

2. 1−2 Hz (beat rates)

3. 3−15 Hz (modulation energy, e.g. vibrato)

4. 20 − s_x/2 Hz (perceptual roughness)

where the sampling rate s_x is related to the hop-size (hs_x).

This filter bank (W) has been suggested for general audioclassification and is inherently positive, since it is applieddirectly on the estimated power spectrum (periodogram).The filter bank, however, can easily become sub-optimal fora specific music organization task, which is the reason forsuggesting a method for finding an optimal filter bank in asupervised manner. The proposed method for the optimaldesign of the filter bank is the topic of the next section.
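The periodogram computation of Eq. (1)-(2) and the fixed four-band summarization can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the MFCC-series sampling rate is assumed to follow from the 7.5 ms hop-size (s_x ≈ 133 Hz), and the band edges are the four listed above.

```python
import numpy as np

def periodogram_features(mfcc, frame_size=256, hop_size=256):
    """Periodogram of a single MFCC time series, one spectrum per frame."""
    n_frames = 1 + (len(mfcc) - frame_size) // hop_size
    P = np.empty((n_frames, frame_size // 2 + 1))
    for k in range(n_frames):
        seg = mfcc[k * hop_size : k * hop_size + frame_size]
        spec = np.fft.rfft(seg)                # one-sided spectrum
        P[k] = np.abs(spec) ** 2 / frame_size  # periodogram estimate
    return P                                   # shape: (K, Dz), Dz = 129

def fixed_filter_bank(frame_size=256, sx=1 / 0.0075):
    """Binary Dz x 4 summarization matrix W over the four bands of [4]."""
    freqs = np.fft.rfftfreq(frame_size, d=1 / sx)  # centre frequency per bin
    bands = [(0.0, 0.0), (1.0, 2.0), (3.0, 15.0), (20.0, sx / 2)]
    W = np.zeros((frame_size // 2 + 1, len(bands)))
    for j, (lo, hi) in enumerate(bands):
        W[:, j] = (freqs >= lo) & (freqs <= hi)
    return W
```

Applying `periodogram_features(series) @ fixed_filter_bank()` to each of the 6 MFCC dimensions and concatenating the results yields the feature vector passed to the classifier.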


[Figure 1: flow chart. Audio → MFCC (frame-size 30 ms, D_x = 6) → Temporal Feature Integration (Periodogram, frame-size fs_x → Filter Bank W or U) → Classifier → Postprocessing → Decision.]

Figure 1. The figure illustrates the flow-chart of the complete process. After MFCC extraction, periodograms are computed for each MFCC. The output of the "periodogram" box is a D_z = 129 dimensional vector for each MFCC, corresponding to the power in the different frequency bands. The filter bank (W or U) summarizes the power in predefined frequency bands. The dimension of z̃ will depend on the number of MFCCs, the selected frame-size fs_x and the number of filters in the filter bank W (fixed to 4) or U (n_f).

3. Supervised Design of Filter Banks

As can be understood from our previous discussion, and since the goal is to optimize the classification performance of the whole system, better behavior can be obtained if the filter bank is designed in a supervised manner, i.e., to optimize the performance on some training dataset whose labels are used during the design process.

Then, we will assume that we are given a set of N training pairs {z_k, y_k}_{k=1}^{N}, with y_k being the label vector associated to z_k. The C-dimensional vector y_k, where C is the number of classes, contains a one in the position of the true label for pattern z_k, and zeros elsewhere. In this section, we address the issue of how one can use the training data to design a filter bank U = [u_1, u_2, ..., u_{n_f}], where u_m is the frequency amplitude of the m-th filter and n_f is the total number of filters in the bank, in such a way that the outputs of the filters,

z̃_k = U^T z_k,    (3)

are as relevant as possible for the classification task at hand 4.

4 Note that we have opted to use U to denote the filter bank obtained with our method, to differentiate it from the filter bank from [4], which we will denote with W throughout the paper.

From its definition, and given that this matrix operates on the power spectrum of the different MFCCs, it should be clear that all elements in U should be non-negative numbers (i.e., u_ij ≥ 0), so that z̃_k can be effectively interpreted as the output energies of a filter bank. Note that a negative u_ij would correspond to the subtraction of the energy in a certain frequency band.

The procedure we present lies within the framework of Multivariate Analysis methods [9], and, in particular, it is a variant of Orthonormalized Partial Least Squares (OPLS). Next, we will briefly review OPLS, and explain how it can be solved under the additional constraints u_ij ≥ 0, resulting in a method that we have called Positive constrained OPLS (POPLS). Readers who prefer to skip the implementation details of the method can go directly to the experiments section.

3.1. Multi-regression model for Feature Extraction and Data Classification

For the classification process (see Fig. 1) we will consider a multi-regression model. Although other models are possible, we will see that the regression approach results in a very convenient method for computing the filter bank. The multi-regression model can be written as

ŷ_k = B U^T z_k + b = B z̃_k + b,    (4)

where ŷ_k is the predicted output, and {B, b} are the free parameters of the model. In particular, b is a bias that compensates for the different means of the input and output variables. Note that, since B is C × n_f and U is D_z × n_f, the filter bank effectively imposes a bottleneck in the system, in the sense that the z̃_k vectors given to the classifier are lower dimensional than the original z_k. This dimensionality reduction is very useful to simplify the design of the classifier and to improve generalization, and is an unavoidable step when the training dataset is small. However, in order not to degrade the performance of the classifier, it is crucial that z̃_k retains the most discriminative information in z_k, which can only be achieved with a good design of the filter bank.

Our aim is to adjust all parameters in the model, as well as the filter bank, to minimize the sum-of-squares of the differences between the real and estimated labels, i.e.,

[U_o, B_o, b_o] = arg min_{U,B,b} ||Y − B Z̃ − b 1^T||_F^2    (5)

where Y = [y_1, ..., y_N], 1 is an all-ones vector of appropriate dimensions, and Z̃ = U^T Z with Z = [z_1, ..., z_N]. The subscript 'F' refers to the Frobenius norm of a matrix.

It is known that B_o can be obtained as the solution of the following modified problem:

B_o = arg min_B ||Y_c − B Z̃_c||_F^2 = Y_c Z̃_c^T (Z̃_c Z̃_c^T)^{−1}    (6)

where Z̃_c and Y_c are centered versions of Z̃ and Y, respectively. Then, the bias is simply given by

b_o = (1/N)(Y − B_o Z̃) 1.    (7)

Once we have derived closed-form expressions for B_o and b_o, we are ready to present our POPLS method for the selection of the optimal filter bank which minimizes (5) [10], subject to the constraint that all entries in U are positive.
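For a given filter bank U, the closed-form solutions (6)-(7) translate directly into code. The NumPy sketch below is an illustration under the paper's notation (not the authors' implementation), assuming Z is the D_z × N matrix of periodogram features and Y the C × N label matrix:

```python
import numpy as np

def regression_params(Z, Y, U):
    """Closed-form regression matrix and bias of Eqs. (6)-(7)
    for a fixed filter bank U (Dz x n_f)."""
    Zt = U.T @ Z                                # filtered features z~ (Eq. 3)
    Ztc = Zt - Zt.mean(axis=1, keepdims=True)   # centred versions
    Yc = Y - Y.mean(axis=1, keepdims=True)
    B = Yc @ Ztc.T @ np.linalg.inv(Ztc @ Ztc.T) # Eq. (6)
    b = (Y - B @ Zt).sum(axis=1) / Z.shape[1]   # Eq. (7): mean residual
    return B, b
```

By construction of the bias, the residuals Y − B Z̃ − b 1^T have zero mean over the training samples, which is the role of b described in the text.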

3.2. Positive Constrained OPLS

To start with, let us introduce the optimal regression matrix, B_o, into (5). Taking also into account that Z̃_c = U^T Z_c, the minimization problem can be rewritten as

U_o = arg min_U ||Y_c − B_o Z̃_c||_F^2 = arg min_U ||Y_c [I − Z_c^T U (U^T Z_c Z_c^T U)^{−1} U^T Z_c]||_F^2

with I being the N-dimensional identity matrix. Now, using the fact that ||A||_F^2 = Tr{A A^T}, and after some algebra, we arrive at the following optimization problem:

maximize:   Tr{(U^T C_zz U)^{−1} U^T C_zy C_yz U}    (8)
subject to: U^T U = I    (9)
            u_ij ≥ 0    (10)

where we have defined the covariance matrices C_zz = Z_c Z_c^T, C_zy = Z_c Y_c^T and C_yz = C_zy^T, and where we have made explicit the positivity constraint. The additional constraint (9) is needed to make the solution unique.

There are a number of ways to solve the above problem. We will use a procedure consisting of iteratively calculating the best filter, so that we not only guarantee that U_o is the optimal bank with n_f filters, but also that any subbank consisting of some of the first columns of U_o is also optimal with respect to the number of filters used. In brief, the process consists of the following two stages:

1) Solve the "one filter" optimization problem given by:

maximize:   (u^T C_zy C_yz u) / (u^T C_zz u)    (11)
subject to: u^T u = 1    (12)
            u_i ≥ 0    (13)

2) Remove from Y_c the prediction obtained from the current filter bank.

Inputs: Z, Y, n_f

1 - Calculate centered data matrices Z_c and Y_c
2 - C_zz = Z_c Z_c^T ;  Y_c^(1) = Y_c
3 - For m = 1, ..., n_f
    3.1 - C_yz^(m) = Y_c^(m) Z_c^T ;  C_zy^(m) = (C_yz^(m))^T
    3.2 - Solve (11)-(13) to obtain u_m
    3.3 - Y_c^(m+1) = Y_c^(m) [ I − (Z_c^T u_m u_m^T Z_c) / (u_m^T C_zz u_m) ]
4 - Output filter bank: U_o = [u_1, ..., u_{n_f}]

Table 1. POPLS pseudocode.
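As a rough illustration of the loop in Table 1, the toy NumPy sketch below replaces the constrained solver used in the paper with simple projected gradient ascent on the Rayleigh quotient (11), clipping to enforce u_i ≥ 0 and renormalizing for (12). It is an approximation for illustration only, not the authors' implementation.

```python
import numpy as np

def popls(Z, Y, n_filters, n_iter=500, lr=1e-3, seed=0):
    """Toy sketch of Table 1; the inner problem (11)-(13) is solved
    approximately by projected gradient ascent."""
    rng = np.random.default_rng(seed)
    Zc = Z - Z.mean(axis=1, keepdims=True)      # step 1: centre the data
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Czz = Zc @ Zc.T                              # step 2
    U = []
    for _ in range(n_filters):                   # step 3
        Czy = Zc @ Yc.T                          # step 3.1 (C_yz = C_zy^T)
        M = Czy @ Czy.T                          # C_zy C_yz
        u = rng.random(Z.shape[0])               # non-negative random start
        u /= np.linalg.norm(u)
        for _ in range(n_iter):                  # step 3.2 (approximate)
            num, den = u @ M @ u, u @ Czz @ u
            grad = 2 * (M @ u * den - Czz @ u * num) / den**2
            u = np.maximum(u + lr * grad, 0.0)   # project onto u_i >= 0
            norm = np.linalg.norm(u)             # re-impose u^T u = 1
            u = u / norm if norm > 0 else np.full(len(u), len(u) ** -0.5)
        U.append(u)
        proj = Zc.T @ u                          # step 3.3: deflate Y_c
        Yc = Yc - Yc @ np.outer(proj, proj) / (u @ Czz @ u)
    return np.column_stack(U)
```

The deflation in step 3.3 removes from Y_c the part already predicted by the current filter, so each new filter targets the residual label information.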

Table 1 summarizes our POPLS algorithm for the supervised design of filter banks. It is also worth mentioning that, in our implementation, the maximization problem (11)-(13) was solved with the fmincon Matlab function. However, on most occasions the convergence of this routine was not satisfactory, making it necessary to resort to an alternative representation of u based on hyperspherical coordinates. The advantage of this representation is that restriction (12) is directly incorporated into the representation, which simplifies the application of any optimization algorithm.
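A hyperspherical parameterization can be illustrated as follows: n − 1 angles determine an n-dimensional vector whose norm is one by construction, so constraint (12) vanishes from the optimization, and restricting the angles to [0, π/2] additionally keeps every component non-negative. This is a generic sketch; the exact parameterization used by the authors is not spelled out in the paper.

```python
import numpy as np

def hyperspherical_to_unit(phi):
    """Map n-1 angles phi to an n-dimensional unit-norm vector u.
    With phi in [0, pi/2], all components of u are non-negative."""
    u = np.ones(len(phi) + 1)
    for i, angle in enumerate(phi):
        u[i] *= np.cos(angle)       # fix the i-th coordinate
        u[i + 1:] *= np.sin(angle)  # remaining coordinates share the sin factor
    return u
```

An unconstrained (or box-constrained) optimizer can then search over the angles while every candidate u automatically satisfies (12) and, for angles in [0, π/2], (13).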

4. Experiments

This section considers two different experiments. The experiment described in Subsection 4.1 is a proof-of-concept experiment that illustrates the basic idea of POPLS for discriminating between an instrument played with and without vibrato. The second line of experiments, described in Subsection 4.2, considers an 11 music genre dataset, investigated using the filter bank W from [4] and the filter bank U obtained from POPLS.

4.1. Experiment 1: Instrument vibrato/non-vibrato detection

This experiment considers the problem of detecting vibrato or non-vibrato playing of a single instrument and is only intended as a proof-of-concept example.

A small dataset has been created consisting of music snippets of an alto saxophone with notes ranging from Db3 to Ab5 (138.59−830.61 Hz), with and without vibrato, resulting in a total of 64 (32 train / 32 test) small music clips, each of 3−4 s. The music samples were extracted from the MIS (Music Instrument Samples) database developed by the University of Iowa [11]. This database has been applied in connection with automated instrument classification in e.g. [12].

Only the first MFCC has been used in this experiment, which is known to be correlated with the perceptual dimension of loudness. A frame-size (fs_x) corresponding to 960 ms and a corresponding hop-size of 240 ms were selected. The frame-size was selected to ensure a few periods of the modulation frequency of 4−5 Hz, hence obtaining a better spectral estimate for the instruments played with vibrato.

[Figure 2: two panels, "Filter No. 1" (left) and "Filter No. 2" (right); amplitude (0−1) versus frequency (0−62.0 Hz).]

Figure 2. The left and right figures illustrate the first and second most discriminative filters extracted by the POPLS procedure, respectively.

Leave-one-out cross-validation (LOO-CV) [13] was applied to assess the test accuracy. In each fold, the optimal filters were calculated using the POPLS method described in Subsection 3.2, and the resulting error was obtained using a linear classifier. The LOO-CV classification error obtained using n_f = 25 filters was 19% at the 960 ms time-scale, getting as low as 9.4% when performing weighted voting 5 across the frames in each music sample to achieve a single decision per music sample. It is noted that, when using the fixed filter bank W, close to random performance (48.3%, where random performance is 50%) is obtained, which is ascribed to a smearing of the relevant frequency components, since the filter summarizes the frequencies between 3−15 Hz.
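Weighted voting, as described in footnote 5, simply sums the classifier's output vectors over all frames of a clip and picks the class with the largest total. A minimal sketch:

```python
import numpy as np

def weighted_vote(frame_outputs):
    """Sum the classifier output vectors over all frames of a clip
    and return the index of the class with the largest total."""
    return int(np.argmax(np.asarray(frame_outputs).sum(axis=0)))
```

For example, three frames of two-class outputs [[0.2, 0.8], [0.6, 0.4], [0.3, 0.7]] sum to [1.1, 1.9], so class 1 is the voted class.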

The two filters with the largest discriminative performance provided by POPLS are illustrated in Figure 2. The left figure, which illustrates the most discriminative filter, clearly indicates that the most relevant information concerning the modulation (vibrato ∼4−6 Hz) of the instrument is learned by the POPLS. Using only these two filters, a classification error of 20% is obtained using weighted voting to obtain a single decision per music sample.

4.2. Experiment 2: Music genre classification

The experiment described in this subsection considers the fixed filter bank W and the POPLS method for determining a filter bank U in an 11 music genre classification setup.

4.2.1. Dataset

The dataset has previously been investigated in [5, 14], and consists of 1317 music snippets, each of 30 s, distributed evenly among the 11 music genres: alternative, country, easy listening, electronica, jazz, latin, pop&dance, rap&hip-hop, r&b and soul, reggae and rock, except for latin, which only has 117 music samples. The labels have been obtained from an external reference. The music snippets consist of

5 Weighted voting is the process of selecting class membership by summing the output vectors of the classifier, ŷ_k, corresponding to all feature vectors z̃_k belonging to the same clip. The class that obtains the largest sum is the "voted" class.

[Figure 3: two panels of cross-validation test error [%] (60−80) versus number of filter banks n_f.]

Figure 3. The left figure illustrates the mean cross-validation error for the fixed filter bank (W, solid line) and the first 4 filters of the POPLS procedure (U, broken line). The right figure illustrates the mean cross-validation error for the n_f = 25 filters obtained by the POPLS procedure. The error-bars on both plots are ± the standard deviation of the mean.

MP3 (MPEG1-layer3) encoded music with a bitrate of 128 kbps or higher, downsampled to 22050 Hz. This dataset is rather complex, having on average 1.83 songs per artist. Previous results show that this is a difficult dataset for genre classification (see, for instance, [14]).

4.2.2. Initial investigations

Previous investigations of the frame-size conducted in [5] showed that a frame-size (fs_x) of approximately 2 s was optimal for the method in [4]. Since the aim is to illustrate that a supervised determination of the filter bank is superior to the fixed filter bank, the same frame-size has been used for POPLS. With a hop-size of 7.5 ms on the short-time features, this frame-size corresponds to approximately 256 samples. Due to the symmetry of the periodogram, the resulting dimensions of the filter banks become 129 × n_f for U and 129 × 4 for W. It was observed that the mean classification test error did not improve for n_f > 25; hence, n_f = 25 was the largest number of filters investigated.

4.2.3. Results & Discussion

To assess the classification accuracy of the two methods, 10-fold cross-validation has been applied. In each fold, the optimal filters were estimated from the training set as described in Section 3, and the performance of the system was subsequently evaluated on the corresponding test fold.

Figure 3 shows the 10-fold cross-validation error as a function of the number of filters in the banks. The left figure shows the cross-validation error obtained using only the first 4 filters of the POPLS, and using the fixed filter bank W. It is observed that the filters obtained by the POPLS procedure are on average 2% better than the fixed filter bank, shown as a solid line. Furthermore, using only the first 3 filters obtained by POPLS, the cross-validation test error is similar to the performance obtained using the fixed filter bank W. Although most of the important dynamical structure of the MFCCs is captured by the first few filters of U, the right plot of Figure 3 shows that a significant error reduction can be obtained when considering a larger number of filters, achieving error rates around 61% for n_f > 15.


[Figure 4: four panels, "Filter No. 1" through "Filter No. 4"; amplitude (0−1) versus frequency (0−62.0 Hz).]

Figure 4. The four most discriminative filters, where the upper left figure illustrates the most relevant filter for the specific music genre classification setup.

Figure 4 shows the first 4 filters obtained on a single fold using the POPLS. Filter 1 includes the most important frequencies of the MFCC periodograms, which basically cover the modulation frequencies of instruments. Filters 2 and 3 focus on the lower modulation frequencies, with filter 2 having frequency components at the beat-scale. Filter 4 spans the higher modulation frequencies, which are related to the perceptual roughness. The difference between the filters obtained for each training data fold is small, which partly illustrates that the proposed method is robust to noise and, further, that the specific underlying temporal structure of the MFCCs is relevant for discriminating between the different genres.

5. Conclusions and Future work

In this paper we have presented a method for designing filter banks that are able to learn the important dynamics in short-time features for a given classification task. The proposed method is very versatile, in the sense that it can be applied to any discrimination task, as we have illustrated in our experiments section, where we tackled two very different classification problems, namely, the detection of vibrato in instrument music clips, and music genre classification. The advantage of our approach over other feature extraction methods is that it provides an elegant physical interpretation of the extracted features, in terms of the dynamical behavior of the MFCC time series.

Although here we limited the method to providing a unique filter bank, it is straightforward to allow for different filters for each of the MFCCs. Exploiting the cross-correlation among the different filters in the bank could also be used to improve the accuracy of the whole system. These lines, as well as the application of the method to other MIR problems, constitute logical directions for future research.

6. Acknowledgments

This work is partly supported by the Danish Technical Research Council, through the framework project 'Intelligent Sound', www.intelligentsound.org (STVF No. 26-04-0092), and by the European Commission through the sixth framework IST Network of Excellence: Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL), contract no. 506778. The work of J. Arenas-García was also supported by the Spanish Ministry of Education and Science with a postdoctoral fellowship.

References

[1] H. Homburg, I. Mierswa, B. Moller, K. Morik, and M. Wurst. A benchmark dataset for audio classification and clustering. In International Symposium on Music Information Retrieval, pages 528–531, 2005.

[2] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl. Meta-features and AdaBoost for music classification. Machine Learning Journal: Special Issue on Machine Learning in Music, 2006.

[3] M. Mandel, G. Poliner, and D. Ellis. Support vector machine active learning for music retrieval. Accepted for publication in ACM Multimedia Systems Journal, 2006.

[4] M. F. McKinney and J. Breebart. Features for audio and music classification. In International Symposium on Music Information Retrieval, pages 151–158, 2003.

[5] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen. Temporal feature integration for music genre classification. Submitted, 2006.

[6] J.-J. Aucouturier, F. Pachet, and M. Sandler. The way it sounds: Timbre models for analysis and retrieval of polyphonic music signals. IEEE Trans. on Multimedia, 7(6), December 2005. (in press).

[7] P. Ahrendt, A. Meng, and J. Larsen. Decision time horizon for music genre classification using short time features. In EUSIPCO, pages 1293–1296, Vienna, Austria, Sept. 2004.

[8] M. H. Hayes. Statistical Digital Signal Processing and Modeling. N.Y.: Wiley, 1996.

[9] T. W. Anderson. An Introduction to Multivariate Statistical Analysis, 3rd Ed. N.Y.: Wiley-Interscience, 2003.

[10] S. Roweis and C. Brody. Linear Heteroencoders. Gatsby Unit Technical Report GCNU-TR-1999-02, Gatsby Computational Neuroscience Unit, London, 1999.

[11] University of Iowa musical instrument sample database, http://theremin.music.uiowa.edu/index.html.

[12] E. Benetos, M. Kotti, C. Kotropoulos, J. J. Burred, G. Eisenberg, M. Haller, and T. Sikora. Comparison of subspace analysis-based and statistical model-based algorithms for musical instrument classification. In Proc. of 2nd Workshop On Immersive Communication And Broadcast Systems (ICOB), October 2005.

[13] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[14] A. Meng and J. Shawe-Taylor. An investigation of feature models for music genre classification using the Support Vector Classifier. In International Conference on Music Information Retrieval, pages 604–609, 2005.