
Journal of New Music Research, 2014, Vol. 43, No. 2, 147–172, http://dx.doi.org/10.1080/09298215.2014.894533

The State of the Art Ten Years After a State of the Art: Future

Research in Music Information Retrieval

Bob L. Sturm

Aalborg University Copenhagen, Denmark

(Received 16 July 2013; accepted 7 February 2014)

Abstract

A decade has passed since the first review of research on a ‘flagship application’ of music information retrieval (MIR): the problem of music genre recognition (MGR). During this time, about 500 works addressing MGR have been published, and at least 10 campaigns have been run to evaluate MGR systems. This makes MGR one of the most researched areas of MIR. So, where does MGR now lie? We show that in spite of this massive amount of work, MGR does not lie far from where it began, and the paramount reason for this is that most evaluation in MGR lacks validity. We perform a case study of all published research using the most-used benchmark dataset in MGR during the past decade: GTZAN. We show that none of the evaluations in these many works is valid to produce conclusions with respect to recognizing genre, i.e. that a system is using criteria relevant for recognizing genre. In fact, the problems of validity in evaluation also affect research in music emotion recognition and autotagging. We conclude by discussing the implications of our work for MGR and MIR in the next ten years.

Keywords: information retrieval, machine learning, databases, music analysis

1. Introduction

‘Representing Musical Genre: A State of the Art’ (Aucouturier & Pachet, 2003) was published a decade ago, a few years after the problem of music genre recognition (MGR) was designated a ‘flagship application’ of music information retrieval (MIR) (Aucouturier & Pampalk, 2008). During that time, there have been at least four other reviews of research on MGR (Dannenberg, 2010; Fu, Lu, Ting, & Zhang, 2011; Scaringella, Zoia, & Mlynek, 2006; Sturm, 2012b), nearly 500 published

Correspondence: Audio Analysis Lab, AD:MT, Aalborg University Copenhagen, A.C. Meyers Vænge 15, DK-2450 Copenhagen, Denmark. E-mail: [email protected]

works considering MGR, and at least 10 organized campaigns to evaluate systems proposed for MGR: ISMIR 2004,1 ISMIS 2011,2 and MIREX 2005, and 2007–2013.3 Figure 1 shows the massive amount of published work on MGR since that of Matityaho and Furst (1995).4 So, where does MGR now lie? How much progress has been made?

One might find an indication by looking at how the construction and performance of systems designed for MGR have changed during the past decade. The reviews by Aucouturier and Pachet (2003), Scaringella et al. (2006) and Fu et al. (2011) all describe a variety of approaches to feature extraction and machine learning that have been explored for MGR, and provide rough comparisons of how MGR systems perform on benchmark datasets and in evaluation campaigns. Conclusions from these evaluations as a whole have been drawn about research progress on the problem of MGR. For instance, Bergstra et al. (2006) – authors of the MGR system having the highest accuracy in MIREX 2005 – write, ‘Given the steady and significant improvement in classification performance ... we wonder if automatic methods are not already more efficient at learning genres than some people’. In Figure 2, we plot the highest reported classification accuracies of about 100 MGR systems all evaluated in the benchmark dataset GTZAN (Tzanetakis & Cook, 2002).5 We see there to be many classification accuracies higher than the 61% initially reported by Tzanetakis and Cook (2002). The maximum of

1 http://ismir2004.ismir.net/genre_contest/index.htm
2 http://tunedit.org/challenge/music-retrieval
3 http://www.music-ir.org/mirex/wiki/MIREX_HOME
4 The bibliography and spreadsheet that we use to generate this figure are available as supplementary material, as well as online here: http://imi.aau.dk/~bst/software.
5 The dataset can be downloaded from here: http://marsyas.info/download/data_sets

© 2014 Taylor & Francis


[Figure 1: bar chart; x-axis: year of publication (1995–2012); y-axis: number of publications; series: ISMIR, Conf. not ISMIR, Journal Article, In Book, Master’s, PhD.]

Fig. 1. Annual numbers of published works in MGR with experimental components, divided into publication venues.

each year shows progress up to 2009, which then appears to stop. Similarly, Humphrey, Bello and LeCun (2013) look at the best accuracies in the MGR task of MIREX from 2007 to 2012 and suggest that progress ‘is decelerating, if not altogether stalled’.

In spite of all these observations, however, might it be that the apparent progress in MGR is an illusion? Our exhaustive survey (Sturm, 2012b) of nearly 500 publications about MGR shows: of ten evaluation designs used in MGR, one in particular (which we call Classify) appears in 91% of work having an experimental component; and the most-used public dataset is GTZAN, appearing in the evaluations of about 100 published works (23%). These findings are not promising. First, Classify using a dataset having independent variables that are not controlled cannot provide any valid evidence to conclude upon the extent to which an MGR system is recognizing the genres used by music (Sturm, 2013a, 2013b). In other words, just because an MGR system reproduces all labels of a dataset does not then mean it is making decisions by using criteria relevant to genre (e.g. instrumentation, composition, subject matter, among many other characteristics that are not present in the sound). In fact, we have clearly shown (Sturm, 2012c, 2013g) that an MGR system can produce a high accuracy using confounded factors; and when these confounds break, its performance plummets. Second, we have found (Sturm, 2012a, 2013d) GTZAN has several faults, namely, repetitions, mislabellings, and distortions. Because all points in Figure 2 use Classify in GTZAN, their meaningfulness is thus ‘doubly questionable’. What valid conclusions, then, can one draw from all this work?

Our ‘state of the art’ here attempts to serve a different function than all previous reviews of MGR (Aucouturier & Pachet, 2003; Dannenberg, 2010; Fu et al., 2011; Scaringella et al., 2006). First, it aims not to summarize the variety of features and machine learning approaches used in MGR systems over the past ten years, but to look closely at how to interpret Figure 2. Indeed, we have a best case scenario for drawing conclusions: the evaluations of these many different systems use the same evaluation design (Classify) and the same dataset (GTZAN). Ultimately, we find the faults in GTZAN impose an insurmountable impediment to interpreting Figure 2. It is tempting to think that since each of these MGR systems faces the same faults in GTZAN, and that its faults are exemplary of real world data, then the results in Figure 2 are still meaningful for comparing MGR systems. We show this argument to be

wrong: the faults simply do not affect the performance of MGR systems in the same ways. Second, this article aims not to focus on MGR, but to look more broadly at how the practice of evaluation in the past decade of MGR research can inform the practice of evaluation in the next decade of MIR research. An obvious lesson from GTZAN is that a researcher must know their data, know real data has faults, and know faults have real impacts on evaluation; but we also distill five other important ‘guidelines’: (1) Define problems with use cases and formalism; (2) Design valid and relevant experiments; (3) Perform deep system analysis; (4) Acknowledge limitations and proceed with skepticism; and (5) Make reproducible work reproducible.

In the next section, we briefly review ‘the problem of music genre recognition’. The third section provides a comprehensive survey of how GTZAN has been and is being used in MGR research. In the fourth section, we analyse GTZAN, and identify several of its faults. In the fifth section, we test the real effects of its faults on the evaluation of several categorically different MGR systems. Finally, we conclude with a discussion of the implications of this work on future work in MGR, and then in MIR more broadly.

2. The problem of music genre recognition (in brief)

It is rare to find in any reference of our survey (Sturm, 2012b) a formal or explicit definition of MGR; and few works explicitly define ‘genre’, instead deferring to describing why genre is useful, e.g. to explore music collections (Aucouturier & Pachet, 2003). Aucouturier and Pachet (2003) describe MGR as ‘extracting genre information automatically from the audio signal’. Scaringella et al. (2006) and Tzanetakis and Cook (2002) both mention automatically arranging music titles in genre taxonomies. Since 91% of MGR evaluation employs Classify in datasets with uncontrolled independent variables (Sturm, 2012b), the majority of MGR research implicitly interprets MGR as reproducing by any means possible the ‘ground truth’ genre labels of a music dataset (Sturm, 2013b). Almost all work in our survey (Sturm, 2012b) thus treats genre in an Aristotelean way, assuming music belongs to categories like a specimen belongs to a species, which belongs to a genus, and so on.

So, what is genre? The work of Fabbri (1980, 1999) – dealt with in depth by Kemp (2004), and cited by only a few MGR works, e.g. McKay and Fujinaga (2006) and Craft (2007) – essentially conceives of music genre as ‘a set of musical events (real or possible) whose course is governed by a definite set of socially accepted rules’ (Fabbri, 1980). This ‘course’ applies to the ‘musical events (real or possible)’, where a ‘musical event’ Fabbri (1980) defines as being a ‘type of activity ... involving sound’. Adopting the language of set theory, Fabbri speaks of ‘genres’ and ‘sub-genres’ as unions and subsets of genres, each having well-defined boundaries, at least for some ‘community’. He goes as far as to suggest


[Figure 2: scatter plot; x-axis: Year of Publication (2002–2013); y-axis: Highest Reported Accuracy (%), 40–100.]

Fig. 2. Highest classification accuracies (%) using all GTZAN as a function of publication year. Solid grey line is our estimate of the ‘perfect’ accuracy from Table 2, discussed in Section 3.4. Six ‘×’ denote incorrect results, discussed in Section 4.5.

one can construct a matrix of rules crossed with subgenres of a genre, with each entry showing the applicability of a particular rule for a particular subgenre. By consequence, producing and using such a master list amounts to nothing more than Aristotelean categorization: to which set does a piece of music belong?

A view of genre alternative to that of Fabbri is provided by Frow (2005), even though his perspective is of literature and not music. Frow argues that ‘genre’ is a dynamic collection of rules and criteria specifying how a person or community approaches, interprets, describes, uses, judges, and so on, forms of human communication in particular contexts. These rules and criteria consequently precipitate from human communication as an aid for people to interact with information and with each other in the world.6 For instance, genre helps a person to read, use, and judge a published work as a scientific article, as opposed to a newspaper column. Genre helps an author write and sell a written work as a publish-worthy scientific article, as opposed to a newspaper column. This is not to say only that the ‘medium is the message’ (McLuhan, 1964), but also that a person reading, using, and judging a published scientific journal article is using cues – both intrinsic and extrinsic – to justify their particular treatment of it as a scientific journal article, and not as a newspaper column (although it could be read as a newspaper column in a different context). Frow’s view of genre, then, suggests that music does not belong to genre, but that the human creation and consumption of music in particular contexts necessarily use genre. Hence, instead of ‘to which genre does a piece of music belong’, the meaningful question for Frow is, ‘what genres should I use to interpret, listen to, and describe a piece of music in a particular context?’ In Frow’s conception of genre, it becomes clear why any attempt at building a taxonomy or a list of characteristics for categorizing music into musical genres will fail: they lack the human and the context, both necessary aspects of genre.

Essentially unchallenged in MGR research is the base assumption that the nature of ‘genre’ is such that music belongs to categories, and that it makes sense to talk about ‘boundaries’ between these categories. Perhaps it is a testament to

6 Fabbri (1999) also notes that genre helps ‘to speed up communication’.

the pervasiveness of the Aristotelean categorization of the world (Bowker & Star, 1999) that alternative conceptions of music genre like Frow’s have by and large gone unnoticed in MGR research. Of course, some have argued that MGR is ill-defined (Aucouturier & Pachet, 2003; McKay & Fujinaga, 2006; Pachet & Cazaly, 2000), that some genre categories are arbitrary and unique to each person (Craft, 2007; Sordo, Celma, Blech, & Guaus, 2008), that they are motivated by industry (Aucouturier & Pachet, 2003), and/or they come in large part from domains outside the purview of signal processing (Craft, 2007; Wiggins, 2009). Processing only sampled music signals necessarily ignores the ‘extrinsic’ properties that contribute to judgements of genre (Aucouturier & Pachet, 2003; McKay & Fujinaga, 2006; Wiggins, 2009). Hence, it is of little surprise when subjectivity and commercial motivations of genre necessarily wreak havoc for any MGR system.

The value of MGR has also been debated. Some have argued that the value of producing genre labels for music already labelled by record companies is limited. Some have suggested that MGR is instead encompassed by other pursuits, e.g. music similarity (Pampalk, 2006), music autotagging (Aucouturier & Pampalk, 2008; Fu et al., 2011), or extracting ‘musically relevant data’ (Serra et al., 2013). Others have argued that, in spite of its subjective nature, there is evidence that genre is not so arbitrary (Gjerdingen & Perrott, 2008; Lippens, Martens, & De Mulder, 2004; Sordo et al., 2008), that some genres can be defined by specific criteria (Barbedo & Lopes, 2007), and that research on MGR is still a worthy pursuit (McKay & Fujinaga, 2006; Scaringella et al., 2006). Many works also suggest that MGR provides a convenient testbed for comparing new audio features and/or machine learning approaches (e.g. Andén & Mallat, 2011, 2013).

While most of the work we survey (Sturm, 2012b) implicitly poses MGR as an Aristotelean categorization, the goal posed by Matityaho and Furst (1995) comes closer to the conception of Frow, and to one that is much more useful than the reproduction of ‘ground truth’ genre labels: ‘[building] a phenomenological model that [imitates] the human ability to distinguish between music [genres]’. Although it is missing the context essential to genre for Frow, and although the empirical work of Matityaho and Furst (1995) still treats genre in an Aristotelean manner, this goal places humans at the centre,


establishes the goal as imitation, and prescribes the use of humans for evaluation. This has led us to define the ‘principal goals of MGR’ (Sturm, 2013b): to imitate the human ability to organize, recognize, distinguish between, and imitate genres used by music. ‘Organization’ implies finding and expressing characteristics used by particular genres, e.g. ‘blues is strophic’; ‘recognition’ implies identification of genres, e.g. ‘that sounds like blues because ...’; ‘distinguishing’ implies describing why, or the extents to which, some music uses some genres but not others, e.g. ‘that does not sound like blues because ...’; and ‘imitation’ implies being able to exemplify the use of particular genres, e.g. ‘play the Bach piece as if it is blues’. Stated in such a way, MGR becomes much richer than generating genre-indicative labels for music signals.

3. A survey and analysis of GTZAN

In this section, we look closely at GTZAN since its creation a decade ago: how it has been used, of what it is composed, what its faults are, how its faults affect evaluation, and what all of this implies for MGR in the past decade, and GTZAN in the next decade.

3.1 How has GTZAN been used?

Figure 3 shows the annual number of publications that use GTZAN since its creation in 2002. We see that it has been used more in the past three years than in its first eight years. The next most-used public dataset is ISMIR2004,7 which was created for the MGR task of ISMIR 2004. That dataset appears in 75 works, 30 of which use GTZAN as well. Of the 100 works that use GTZAN, 49 of them use only GTZAN.8

In our review of evaluation in MGR (Sturm, 2012b), we delimit ten different evaluation designs. Of the 100 works using GTZAN, 96 employ the evaluation design Classify (an excerpt is assigned labels, which are compared against a ‘ground truth’). In seven works, GTZAN is used with the evaluation design Retrieve (a query is used to find similar music, and the labels of the retrieved items are compared); and one work (Barreira, Cavaco, & da Silva, 2011) uses GTZAN with the evaluation design Cluster (clustering of dataset and then a comparison of the labels in the resulting clusters). Our work (Sturm, 2012c) uses Compose, where an MGR system generates new music it scores as highly representative of each GTZAN category, which we then test for identifiability using a formal listening test. We find a few other uses of GTZAN. Markov and Matsui (2012a, 2012b) learn bases from GTZAN, and then apply these codebooks to classify the genres of ISMIR2004. In our work (Sturm, 2013e), we train classifiers using ISMIR2004, and then attempt to detect all excerpts in GTZAN Classical (and two in GTZAN Jazz). As described above, Figure 2 shows the highest classification accuracies reported

7 http://kom.aau.dk/~jhj/files/ismir2004genre/
8 All relevant references are available in the supplementary material, as well as online: http://imi.aau.dk/~bst/software.

[Figure 3: bar chart; x-axis: year (2002–2012); y-axis: number of publications; legend: using GTZAN, not using GTZAN; the two per-year counts shown are 2002: 11, 2; 2003: 18, 2; 2004: 27, 1; 2005: 32, 5; 2006: 41, 8; 2007: 30, 3; 2008: 38, 6; 2009: 49, 9; 2010: 65, 22; 2011: 55, 19; 2012: 47, 20.]

Fig. 3. Annual numbers of published works in MGR with experimental components, divided into ones that use and do not use GTZAN.

in the papers that consider the 10-class problem of GTZAN (we remove duplicated experiments, e.g. Lidy (2006) contains the results reported in Lidy and Rauber (2005)).

Among the 100 published works using GTZAN, we find only five outside ours (Sturm 2012c, 2013a, 2013e, 2013f) that indicate someone has listened to at least some of GTZAN. The first appears to be Li and Sleep (2005), who find that ‘two closely numbered files in each genre tend to sound similar to the files numbered far [apart]’. Bergstra et al. (2006) note that, ‘To our ears, the examples are well-labeled ... our impression from listening to the music is that no artist appears twice’. This is contradicted by Seyerlehner, Widmer and Pohle (2010), who predict ‘an artist effect ... as listening to some of the songs reveals that some artists are represented with several songs’. In his doctoral dissertation, Seyerlehner (2010) infers there to be a significant replication of artists in GTZAN because of how classifiers trained and tested with GTZAN perform as compared to other artist-filtered datasets. Furthermore, very few people have mentioned specific faults in GTZAN: Hartmann (2011) notes finding seven duplicates; and Li and Chan (2011) – who have manually estimated keys for all GTZAN excerpts9 – remember hearing some repetitions.10 Hence, it appears that GTZAN has by and large been assumed to have satisfactory integrity for MGR evaluation.

3.2 What is in GTZAN?

Until our work (Sturm, 2012a), GTZAN has never had metadata identifying its contents because those details were not assembled during compilation. Hence, every MGR evaluation using GTZAN, save ours (Sturm, 2012c, 2013a, 2013e, 2013d), has not been able to take into account its contents. We now identify the excerpts in GTZAN, determine how music by specific artists composes each category, and survey the tags people have applied to the music and/or artist in order to obtain an idea of what each GTZAN category means.

We use the Echo Nest Musical Fingerprinter (ENMFP)11 to generate a fingerprint of an excerpt in GTZAN and then to query the Echo Nest database having over 30,000,000 songs

9 Available here: http://visal.cs.cityu.edu.hk/downloads/#gtzankeys
10 Personal communication.
11 http://developer.echonest.com


(at the time of this writing). The second column of Table 1 shows that this identifies only 606 of the 1000 excerpts. We manually correct titles and artists as much as possible, e.g. we reduce ‘River Rat Jimmy (Album Version)’ to ‘River Rat Jimmy’; and ‘Bach - The #1 Bach Album (Disc 2) - 13 - Ich steh mit einem Fuss im Grabe, BWV 156 Sinfonia’ to ‘Ich steh mit einem Fuss im Grabe, BWV 156 Sinfonia’; and we correct ‘Leonard Bernstein [Piano], Rhapsody in Blue’ to ‘George Gershwin’ and ‘Rhapsody in Blue’. We find four misidentifications: Country 15 (see footnote 12) is misidentified as being by Waylon Jennings (it is by George Jones); Pop 65 is misidentified as being Mariah Carey (it is Prince); Disco 79 is misidentified as ‘Love Games’ by Gazeebo (it is ‘Love Is Just The Game’ by Peter Brown); and Metal 39 is Metallica playing ‘Star Wars Imperial March’, but ENMFP identifies it as a track on a CD for improving sleep.13 We then manually identify 313 more excerpts, but have yet to identify the remaining 81 excerpts.14

Figure 4 shows how each GTZAN category is composed of music by particular artists. We see only nine artists are represented in the GTZAN Blues. GTZAN Reggae is the category with the most excerpts from a single artist: 35 excerpts of Bob Marley. The category with the most artist diversity appears to be Disco, where we find at least 55 different artists. From this, we can bound the number of artists in GTZAN, which has until this time been unknown (Seyerlehner, 2010; Seyerlehner et al., 2010). Assuming each unidentified excerpt comes from different artists than those we have already identified, the total number of artists represented in GTZAN cannot be larger than 329. If all unlabelled excerpts are from the artists we have already identified, then the smallest this number can be is 248.

We now wish to determine the content composing each GTZAN category. In our previous analysis of GTZAN (Sturm, 2012a), we assume that since there is a category called, e.g. ‘Country’, then GTZAN Country excerpts should possess typical and distinguishing characteristics of music using the country genre (Ammer, 2004): stringed instruments such as guitar, mandolin, banjo; emphasized ‘twang’ in playing and singing; lyrics about patriotism, hard work and hard times; and so on. This led us to the claim that at least seven GTZAN Country excerpts are mislabelled because they exemplify few of these characteristics. We find other work that assumes the genres of music datasets overlap because they share the same genre labels, e.g. the taxonomies of Moerchen, Mierswa and Ultsch (2006) and Guaus (2009). In this work, however, we do not make the assumption that the excerpts in GTZAN Country possess the typical and distinguishing characteristics of music using the country genre. In other words, we now consider a GTZAN category name as ‘short hand’ for the collection of consistent and/or contradictory concepts and criteria, both

12 This is the file ‘country.00015.wav’ in GTZAN.
13 ‘Power Nap’ by J.S. Epperson (Binaural Beats Entrainment), 2010.
14 Our machine-readable index of this metadata is available at: http://imi.aau.dk/~bst/software.

objective and subjective, that Tzanetakis employed in assembling music excerpts to form that GTZAN category.

To obtain an idea of the content composing each GTZAN category, we query the application programming interface provided by last.fm, and retrieve the ‘tags’ that users of the service (whom we will call ‘taggers’) have entered for each song or artist we identify in GTZAN. A tag is a word or phrase a tagger associates with an artist, song, album, and so on, for any number of reasons (Bertin-Mahieux, Eck, & Mandel, 2010), e.g. to make a music collection more useful to themselves, to help others discover new music, or to promote their own music or criticize music they do not like. A past analysis of the last.fm tags (Bertin-Mahieux, Eck, Maillet, & Lamere, 2008) finds that they are most often genre labels (e.g. ‘blues’), but they can also be instrumentation (‘female vocalists’), tempo (e.g. ‘120 bpm’), mood (e.g. ‘happy’), how they use the music (e.g. ‘exercise’), lyrics (e.g. ‘fa la la la la’), the band (e.g. ‘The Rolling Stones’), or something else (e.g. ‘favourite song of all time’) (Law, 2011).
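This kind of tag retrieval can be reproduced against the public last.fm web API. The following is a minimal sketch, not the script used for this study: it assumes a valid API key and the track.getTopTags and artist.getTopTags methods, keeps only tags with non-zero count, and falls back to artist tags when a song is not found (as done for Table 1).

```python
import requests

API_ROOT = "http://ws.audioscrobbler.com/2.0/"
API_KEY = "YOUR_API_KEY"  # placeholder; a real last.fm API key is required

def top_tags_for(artist, track=None):
    """Return (tag, count) pairs for a track, or for the artist if track is None."""
    params = {"api_key": API_KEY, "format": "json", "artist": artist}
    if track is not None:
        params.update(method="track.gettoptags", track=track)
    else:
        params["method"] = "artist.gettoptags"
    tags = requests.get(API_ROOT, params=params).json().get("toptags", {}).get("tag", [])
    if isinstance(tags, dict):      # a single tag is returned as a dict, not a list
        tags = [tags]
    # keep only tags with a 'count' greater than 0, as described in the text
    return [(t["name"], int(t["count"])) for t in tags if int(t["count"]) > 0]

# e.g. the song behind GTZAN Country 39:
# top_tags_for("Wayne Toups & Zydecajun", "Johnnie Can't Dance")
```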

The collection of tags by last.fm is far from being a controlled process. Any given tagger is not necessarily well-versed in musicology, or knows the history of musical styles, or can correctly recognize particular instruments, or is even acting in a benevolent manner; and any group of taggers is not necessarily using the same criteria when they all decide on tagging a song with a particular tag appearing indicative of, e.g. genre or emotion. For these reasons, there is apprehension in using such a resource for music information research (Aucouturier, 2009). However, last.fm tags number in the millions and come from tens of thousands of users; and, furthermore, each tag is accompanied by a ‘count’ parameter reflecting the percentage of taggers of a song or artist that choose that particular tag (Levy & Sandler, 2009). A tag for a song having a count of 100 means that tag is selected by all taggers of that song (even if there is only one tagger), and 0 means the tag is applied by the fewest (last.fm rounds down all percentages less than 1). We cannot assume that each tag selected by a tagger for a given song is done independent of those given by previous taggers, but it is not unreasonable to interpret a tag with a high count as suggesting a kind of consensus that last.fm taggers find the tag very relevant for that song, whatever that may mean. Though last.fm tags have found use in other MIR research (Barrington, Yazdani, Turnbull, & Lanckriet, 2008; Bertin-Mahieux et al., 2008; Levy & Sandler, 2009), we proceed cautiously about interpreting the meanings behind the last.fm tags retrieved for the identified contents in GTZAN. With respect to our goals here, we assume some tags are meant by taggers to be descriptive of a song or artist.

Using the index of GTZAN we create above, the fourth column of Table 1 shows the number of songs in GTZAN we identify that have last.fm tags, and the number of tags with non-zero count (we keep only those tags with counts greater than 0). When we do not find tags for a song, we request instead the tags for the artist. For instance, though we identify all 100 excerpts in Blues, only 75 of the songs are


[Figure 4: stacked bar charts showing, for each GTZAN category, the percentage of identified excerpts contributed by each artist.]

Fig. 4. Artist composition of each GTZAN category. We do not include unidentified excerpts.

Table 1. For each category of GTZAN: number of excerpts we identify by fingerprint (ENMFP); then searching manually (by self); number of songs tagged in last.fm (and number of those tags having ‘count’ larger than 0); for songs not found, number of artists tagged in last.fm (and number of tags having ‘count’ larger than 0). Retrieved Dec. 25, 2012, 21h.

Label       ENMFP   self    # songs (# tags)    # artists (# tags)
Blues         63     100      75 (2904)           25 (2061)
Classical     63      97      12 (400)            85 (4044)
Country       54      95      81 (2475)           13 (585)
Disco         52      90      82 (5041)            8 (194)
Hip hop       64      96      90 (6289)            5 (263)
Jazz          65      80      54 (1127)           26 (2102)
Metal         65      83      73 (5579)           10 (786)
Pop           59      96      89 (7173)            7 (665)
Reggae        54      82      73 (4352)            9 (616)
Rock          67     100      99 (7227)            1 (100)
Total       60.6%   91.9%   72.8% (42567)       18.9% (11416)

tagged on last.fm. Of these, we get 2904 tags with non-zero counts. For the remaining 25 songs, we retrieve 2061 tags from those given to the artists. We thus assume the tags that taggers give to an artist would also be given to the particular song. Furthermore, since each GTZAN excerpt is a 30 s excerpt of a song, we assume the tags given the song or artist would also be given to the excerpt.

With our considerations and reservations clearly stated, we now attempt to shed light on the content of each GTZAN category using the last.fm tags. First, we define the top tags of a song or artist as those that have counts above 50. To do this, we remove spaces, hyphens, and capitalization from all tags. For example, the tags ‘Hip hop’, ‘hip-hop’ and ‘hip hop’ all become ‘hiphop’. Then, we find for each unique top tag in a GTZAN category the percentage of identified excerpts in that category having that top tag. We define category top tags as all those unique top tags of a GTZAN category that are represented by at least 5% of the excerpts identified in that category. With the numbers of excerpts we have identified in each category, this means a category top tag is one appearing as a top tag in at least four excerpts of a GTZAN category.
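The tag normalization and thresholds just described amount to only a few lines of code. Below is a minimal sketch under our own data layout (one list of (tag, count) pairs per identified excerpt); the function names are ours, not from the paper.

```python
from collections import Counter

def normalize(tag):
    # 'Hip hop', 'hip-hop' and 'hip hop' all become 'hiphop'
    return tag.lower().replace(" ", "").replace("-", "")

def top_tags(tag_counts):
    # the top tags of a song or artist are those with counts above 50
    return {normalize(t) for t, c in tag_counts if c > 50}

def category_top_tags(excerpts_tag_counts, min_fraction=0.05):
    """excerpts_tag_counts: one list of (tag, count) pairs per identified excerpt
    in a GTZAN category. Returns the set of category top tags."""
    votes = Counter()
    for tag_counts in excerpts_tag_counts:
        votes.update(top_tags(tag_counts))   # each excerpt counts once per top tag
    threshold = min_fraction * len(excerpts_tag_counts)
    return {t for t, n in votes.items() if n >= threshold}
```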

Figure 5 shows the resulting category top tags for GTZAN. We can see large differences in the numbers of category top tags, where GTZAN Rock has the most, and GTZAN Classical and GTZAN Hip hop have the least. As seen in Table 1, most of the tags for excerpts in GTZAN Classical come from those


Table 2. The repetitions, potential mislabellings and distortions we find in GTZAN. Excerpt numbers are in parentheses. Exact repetitions are those excerpts that are the same with respect to a comparison of their time-frequency content. Recording repetitions are those excerpts we suspect coming from the same recording. Artist repetitions are those excerpts featuring the same artists. Version repetitions are covers of the same song. Potential mislabellings are excerpts we argue are misplaced with regards to the category top tags of its GTZAN category. Distortions are those excerpts that we regard as having a significant amount of distortion. This table can be auditioned online at http://imi.aau.dk/~bst/research/GTZANtable2.

Blues
  Artist repetitions: John Lee Hooker (0-11); Robert Johnson (12-28); Kelly Joe Phelps (29-39); Stevie Ray Vaughan (40-49); Magic Slim (50-60); Clifton Chenier (61-72); Buckwheat Zydeco (73-84); Hot Toddy (85-97); Albert Collins (98, 99)

Classical
  Exact repetitions: (42, 53)
  Recording repetitions: (51, 80)
  Artist repetitions: J.S. Bach (00-10); Mozart (11-29); Debussy (30-33); Ravel (34-37); Dutilleux (38-41); Schubert (63-67); Haydn (68-76); Grainger (82-88); Vivaldi (89-99); and others
  Version repetitions: (44, 48)
  Distortions: static (49)

Country
  Recording repetitions: (08, 51) (52, 60)
  Artist repetitions: Willie Nelson (19, 26, 65-80); Vince Gill (50-64); Brad Paisley (81-93); George Strait (94-99); and others
  Version repetitions: (46, 47)
  Potential mislabellings: Ray Peterson ‘Tell Laura I Love Her’ (20); Burt Bacharach ‘Raindrops Keep Falling on my Head’ (21); Karl Denver ‘Love Me With All Your Heart’ (22); Wayne Toups & Zydecajun ‘Johnnie Can’t Dance’ (39); Johnny Preston ‘Running Bear’ (48)
  Distortions: static distortion (2)

Disco
  Exact repetitions: (50, 51, 70) (55, 60, 89) (71, 74) (98, 99)
  Recording repetitions: (38, 78)
  Artist repetitions: Gloria Gaynor (1, 21, 38, 78); Ottawan (17, 24, 44, 45); The Gibson Brothers (22, 28, 30, 35, 37); KC and The Sunshine Band (49-51, 70, 71, 73, 74); ABBA (67, 68, 72); and others
  Version repetitions: (66, 69)
  Potential mislabellings: Boz Scaggs ‘Lowdown’ (00); Cheryl Lynn ‘Encore’ (11); The Sugarhill Gang ‘Rapper’s Delight’ (27); Evelyn Thomas ‘Heartless’ (29); Barbra Streisand and Donna Summer ‘No More Tears (Enough Is Enough)’ (47); Tom Tom Club ‘Wordy Rappinghood’ (85); Blondie ‘Heart Of Glass’ (92); Bronski Beat ‘WHY?’ (94)
  Distortions: clipping distortion (63)

Hip hop
  Exact repetitions: (39, 45) (76, 78)
  Recording repetitions: (01, 42) (46, 65) (47, 67) (48, 68) (49, 69) (50, 72)
  Artist repetitions: Wu-Tang Clan (1, 7, 41, 42); Beastie Boys (8-25); A Tribe Called Quest (46-51, 62-75); Cypress Hill (55-61); Public Enemy (81-98); and others
  Version repetitions: (02, 32)
  Potential mislabellings: 3LW ‘No More (Baby I’ma Do Right)’ (26); Aaliyah ‘Try Again’ (29); Pink ‘Can’t Take Me Home’ (31); Lauryn Hill ‘Ex-Factor’ (40)
  Distortions: clipping distortion (3, 5); skip at start (38)

Jazz
  Exact repetitions: (33, 51) (34, 53) (35, 55) (36, 58) (37, 60) (38, 62) (39, 65) (40, 67) (42, 68) (43, 69) (44, 70) (45, 71) (46, 72)
  Artist repetitions: James Carter (2-10); Joe Lovano (11-24); Branford Marsalis Trio (25-32); Coleman Hawkins (33-46, 51, 53, 55, 57, 58, 60, 62, 65, 67-72); Dexter Gordon (73-77); Miles Davis (87-92); Joe Henderson (94-99); and others
  Potential mislabellings: Leonard Bernstein ‘On the Town: Three Dance Episodes, Mvt. 1’ (00) and ‘Symphonic dances from West Side Story, Prologue’ (01)
  Distortions: clipping distortion (52, 54, 66)

Metal
  Exact repetitions: (04, 13) (34, 94) (40, 61) (41, 62) (42, 63) (43, 64) (44, 65) (45, 66); (58) is Rock (16)
  Artist repetitions: Dark Tranquillity (12-15); Dio (40-45, 61-66); The New Bomb Turks (46-57); Queen (58-60); Metallica (33, 38, 73, 75, 78, 83, 87); Iron Maiden (2, 5, 34, 92-94); Rage Against the Machine (95-99); and others
  Version repetitions: (33, 74); (85) is Ozzy Osbourne covering Disco (14)
  Potential mislabellings: Creed ‘I’m Eighteen’ (21); Living Colour ‘Glamour Boys’ (29); The New Bomb Turks ‘Hammerless Nail’ (46), ‘Juke-box Lean’ (50), ‘Jeers of a Clown’ (51); Queen ‘Tie Your Mother Down’ (58), ‘Tear it up’ (59), ‘We Will Rock You’ (60)
  Distortions: clipping distortion (33, 73, 84)

Pop
  Exact repetitions: (15, 22) (30, 31) (45, 46) (47, 80) (52, 57) (54, 60) (56, 59) (67, 71) (87, 90) (68, 73)
  Recording repetitions: (15, 21, 22) (47, 48, 51) (52, 54) (57, 60)
  Artist repetitions: Mandy Moore (00, 87-96); Mariah Carey (2, 97-99); Alanis Morissette (3-9); Celine Dion (11, 39, 40); Britney Spears (15-38); Christina Aguilera (44-51, 80); Destiny’s Child (52-62); Janet Jackson (67-73); Jennifer Lopez (74-78, 82); Madonna (84-86); and others
  Version repetitions: (10, 14) (16, 17) (74, 77) (75, 82) (88, 89) (93, 94)
  Potential mislabellings: Diana Ross ‘Ain’t No Mountain High Enough’ (63); Prince ‘The Beautiful Ones’ (65); Kate Bush ‘Cloudbusting’ (79); Ladysmith Black Mambazo ‘Leaning On The Everlasting Arm’ (81); Madonna ‘Cherish’ (86)
  Distortions: (37) is from same recording as (15, 21, 22) but with sound effects

Reggae
  Exact repetitions: (03, 54) (05, 56) (08, 57) (10, 60) (13, 58) (41, 69) (73, 74) (80, 81, 82) (75, 91, 92)
  Recording repetitions: (07, 59) (33, 44) (85, 96)
  Artist repetitions: Bob Marley (00-27, 54-60); Dennis Brown (46-48, 64-68, 71); Prince Buster (85, 94-99); Burning Spear (33, 42, 44, 50, 63); Gregory Isaacs (70, 76-78); and others
  Version repetitions: (23, 55)
  Potential mislabellings: Pras ‘Ghetto Supastar (That Is What You Are)’ (52); Marcia Griffiths ‘Electric Boogie’ (88)
  Distortions: last 25 s of (86) are useless

Rock
  Exact repetitions: (16) is Metal (58)
  Artist repetitions: Morphine (0-9); Ani DiFranco (10-15); Queen (16-26); The Rolling Stones (28-31, 33, 35, 37); Led Zeppelin (32, 39-48); Simple Minds (49-56); Sting (57-63); Jethro Tull (64-70); Simply Red (71-78); Survivor (79-85); The Stone Roses (91-99)
  Potential mislabellings: Morphine ‘Hanging On A Curtain’ (9); Queen ‘(You’re So Square) Baby I Don’t Care’ (20); Billy Joel ‘Movin’ Out’ (36); Guns N’ Roses ‘Knockin’ On Heaven’s Door’ (38); Led Zeppelin ‘The Song Remains The Same’ (39), ‘The Crunge’ (40), ‘Dancing Days’ (41), ‘The Ocean’ (43), ‘Ten Years Gone’ (45), ‘Night Flight’ (46), ‘The Wanton Song’ (47), ‘Boogie With Stu’ (48); Simply Red ‘She’s Got It Bad’ (75), ‘Wonderland’ (78); Survivor ‘Is This Love’ (80), ‘Desperate Dreams’ (83), ‘How Much Love’ (84); The Tokens ‘The Lion Sleeps Tonight’ (90)
  Distortions: jitter (27)


[Figure 5: bar charts showing, for each GTZAN category, its category top tags (derived from last.fm) and the percentage of identified excerpts having each tag.]

Fig. 5. Category top tags for GTZAN from last.fm.

that taggers have entered for artists, which explains why the tag ‘composer’ appears. Most of the category top tags appear indicative of genre, and the one appearing most in each GTZAN category is the category label, except for Metal. We also see signs of a specific tagger in three category top tags of GTZAN Jazz. With our considerations above, and by listening to the entire dataset, it does not seem unreasonable to make some claims about the content of each GTZAN category. For instance, the category top tags of GTZAN Blues reveal its contents to include a large amount of music tagged ‘cajun’ and ‘zydeco’ by a majority of taggers. GTZAN Disco appears broader than dance music from the late 1970s (Shapiro, 2005), and also includes a significant amount of music that has been tagged ‘80s’, ‘pop’ and ‘funk’ by a majority of taggers. Listening to the excerpts in these categories confirms these observations.

3.3 What faults does GTZAN have?

We now delimit three kinds of faults in GTZAN: repetitions, mislabellings, and distortions. Table 2 summarizes these, which we reproduce online with sound examples.15

3.3.1 Repetitions

We consider four kinds of repetition, from high to low specificity: exact, recording, artist, and version. We define an exact repetition as when two excerpts are the same to such a degree that their time–frequency fingerprints are the same. This means the excerpts are not only extracted from the same

15 These can be heard online here: http://imi.aau.dk/~bst/research/GTZANtable2

[Figure 6: matrix of pair-wise fingerprint-hash comparisons between GTZAN Jazz excerpts; both axes are labelled Excerpt Number (roughly 33–46 against 48–75).]

Fig. 6. Taken from GTZAN Jazz, exact repetitions appear clearly with pair-wise comparisons of their fingerprint hashes. The darker a square, the higher the number of matching hashes.

recording, they are essentially the same excerpt, either sample for sample (up to a multiplicative factor), or displaced in time by only a small amount. To find exact repetitions, we implement a simplified version of the Shazam fingerprint (Wang, 2003). This means we compute sets of anchors in the time–frequency plane, compare hashes of the anchors of every pair of excerpts, and then listen to those excerpts sharing many of the same anchors to confirm them to be exact replicas. Figure 6 shows the clear appearance of exact repetitions in GTZAN Jazz. The second column of Table 2 lists these. Our comparison of hashes across categories reveals one exact repetition in two categories: the same excerpt of ‘Tie Your Mother Down’ by Queen appears as Rock 58 and Metal 16. In total, we find 50 exact repetitions in GTZAN.
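A minimal sketch of such a simplified fingerprint comparison is given below, assuming 22,050 Hz mono input and using librosa for the spectrogram. The anchor selection (five strongest bins per frame) and the hash quantization are illustrative choices, not the exact parameters used for Table 2.

```python
import numpy as np
import librosa

def anchor_hashes(x, fan_out=5, max_dt=64):
    """Hash pairs of time-frequency anchors (peaks), in the spirit of Wang (2003)."""
    S = np.abs(librosa.stft(x, n_fft=1024, hop_length=512))
    peaks = []
    for t in range(S.shape[1]):
        strongest = np.argpartition(S[:, t], -5)[-5:]   # 5 strongest bins per frame
        peaks.extend((t, int(f)) for f in strongest)
    peaks.sort()
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:     # pair each anchor with a few targets
            dt = t2 - t1
            if 0 < dt <= max_dt:
                hashes.add((f1, f2, dt))
    return hashes

def match_score(x1, x2):
    """Number of matching hashes between two excerpts; exact repetitions share many."""
    return len(anchor_hashes(x1) & anchor_hashes(x2))
```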

We define a recording repetition as when two excerpts come from the same recording, but are not detected with the fingerprint comparison detailed above. We find these by artist name and song repetitions in the index we create above, and by listening. For instance, Country 8 and 51 are both from ‘Never


Knew Lonely’ by Vince Gill, but excerpt 8 comes from later in the recording than excerpt 51. The second and third columns of Table 2 show the excerpts we suspect as coming from the same recording. We find GTZAN Pop has the most exact and suspected recording repetitions (16): ‘Lady Marmalade’ sung by Christina Aguilera et al., as well as ‘Bootylicious’ by Destiny’s Child, each appear four times. In total, we find 21 suspected recording repetitions in GTZAN.

The last two kinds of repetitions are not necessarily ‘faults’, but we show in Section 3.4 why they must be taken into consideration when using GTZAN. We define artist repetition as excerpts performed by the same artist. We find these easily using the index we create above. Figure 4 and Table 2 show how every GTZAN category has artist repetition. Finally, we define a version repetition as when two excerpts are of the same song but performed differently. This could be a studio version, a live version, performed by the same or different artists (covers), or possibly a remix. We identify these with the index we create above, and then confirm by listening. For instance, Classical 44 and 48 are from ‘Rhapsody in Blue’ by George Gershwin, but presumably performed by different orchestras. Metal 33 is ‘Enter Sandman’ by Metallica, and Metal 74 is a parody of it. Pop 16 and 17 are both ‘I can’t get no satisfaction’ by Britney Spears, but the latter is performed live. In total, we find 13 version repetitions in GTZAN.

3.3.2 Potential mislabellings

The collection of concepts and criteria Tzanetakis used to assemble GTZAN is of course unobservable; and in some sense, the use of GTZAN for training and testing an MGR system aims to reproduce or uncover it by reverse engineering. Regardless of whether GTZAN was constructed in a well-defined manner, or whether it makes sense to restrict the membership of an excerpt of music to one GTZAN category, we are interested here in a different question: are any excerpts mislabelled? In other words, we wish to determine whether the excerpts might be arranged in a way such that their membership to one GTZAN category, as well as their exclusion from every other GTZAN category, is in some sense ‘less debatable’ to a system, human or artificial.

Toward this end, we previously considered (Sturm, 2012a) two kinds of mislabellings in GTZAN: ‘contentious’ and ‘conspicuous’. We based these upon non-concrete criteria formed loosely around musicological principles associated with the names of the categories in GTZAN. Now that we have shown in Section 3.2 that those names do not necessarily reflect the content of the categories, we instead consider here the content of each GTZAN category. In summary, we identify excerpts that might be ‘better’ placed in another category, placed across several categories, or excluded from the dataset altogether, by comparing their tags to the category top tags of each GTZAN category.

We consider an excerpt potentially mislabelled if not one of its tags matches the category top tags of its category, or if the ‘number of votes’ for its own category is equal to or less than the ‘number of votes’ for another category. We define the number of votes in a category as the sum of the counts of tags for an excerpt matching category top tags. (We consider a match as when the category top tag appears in a tag, e.g. ‘blues’ appears in ‘blues guitar’.) As an example, consider the category top tags of GTZAN Country in Figure 5, and the pairs of tags and counts we find for Country 39 (‘Johnnie Can’t Dance’ by Wayne Toups & Zydecajun): {(‘zydeco’, 100), (‘cajun’, 50), (‘folk’, 33), (‘louisiana’, 33), (‘bayou’, 16), (‘cruise’, 16), (‘Accordeon’, 16), (‘New Orleans’, 16), (‘accordion’, 16), (‘everything’, 16), (‘boogie’, 16), (‘dance’, 16), (‘swamp’, 16), (‘country’, 16), (‘french’, 16), (‘rock’, 16)}. We find its two first tags among the category top tags of GTZAN Blues, and so the number of votes for GTZAN Blues is 100 + 50 = 150. We only find one of its tags (‘country’) among the category top tags of GTZAN Country, and so the number of votes there is 16. Hence, we argue that Country 39 is potentially mislabelled. Listening to the excerpt in relation to all the others in GTZAN Country, as well as the music tagged ‘cajun’ and ‘zydeco’ in GTZAN Blues, also supports this conclusion. It is important to emphasize that we are not claiming Country 39 should be labelled GTZAN Blues, but only that if a MGR system learning from GTZAN Blues labels Country 39 as ‘blues’, then that might not be considered a mistake.
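The vote counting can be stated compactly in code. The sketch below is under our own data layout: category_top_tags maps each GTZAN category name to its set of category top tags (Figure 5), and matching is by substring, so that ‘blues’ matches ‘blues guitar’.

```python
def normalize(tag):
    return tag.lower().replace(" ", "").replace("-", "")

def votes(tag_counts, top_tags):
    # sum the counts of an excerpt's tags that contain any category top tag
    return sum(c for t, c in tag_counts
               if any(top in normalize(t) for top in top_tags))

def potentially_mislabelled(tag_counts, own_category, category_top_tags):
    own = votes(tag_counts, category_top_tags[own_category])
    other = max(votes(tag_counts, tags)
                for cat, tags in category_top_tags.items() if cat != own_category)
    # flagged if no tag matches its own category's top tags,
    # or another category gathers at least as many votes
    return own == 0 or other >= own

# e.g. Country 39: votes for Blues = 100 + 50 = 150, votes for Country = 16 -> flagged
```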

Of the excerpts we have identified in GTZAN, we find 13 with no tags that match the category top tags of their category, 43 that have more votes in the category top tags of a different category, and 18 that have the same number of votes in their own category and at least one other. Of these, 20 appear to be due to a lack of tags, and two appear to be a result of bad tags. For instance, among the tags with count of 100 for ‘Can’t Do Nuttin’ For Ya, Man!’ by Public Enemy (Hip hop 93) are ‘glam rock’, ‘Symphonic Rock’, ‘instrumental rock’, ‘New York Punk’ and ‘Progressive rock’. We list the remaining 52 potential misclassifications in Table 2.

As we have pointed out (Sturm, 2012a, 2012c), GTZAN has other potential problems with excerpts in its categories. Disco 47 is of Barbra Streisand and Donna Summer singing the introduction to ‘No More Tears’, but an excerpt from a later time in this song might better exemplify GTZAN Disco. Hip hop 44 is of ‘Guantanamera’ by Wyclef Jean, but the majority of the excerpt is a sample of musicians playing the traditional Cuban song ‘Guantanamera.’ Hence, GTZAN Hip hop might or might not be appropriate. Hip hop 5 and 30 are Drum and Bass dance music that are quite unique among the rest of the excerpts; and Reggae 51 and 55 are electronic dance music that are contrasting to the other excerpts. Finally, Reggae 73 and 74 are exact replicas of ‘Hip-Hopera’ by Bounty Killer, which might be more appropriate in GTZAN Hip hop. We do not include these as potential mislabellings in Table 2.

3.3.3 Identifying faults: Distortions

The last column of Table 2 lists some distortions we find by listening to every excerpt in GTZAN. This dataset was


purposely created to have a variety of fidelities in the excerpts (Tzanetakis & Cook, 2002); however, one of the excerpts (Reggae 86) is so severely distorted that the value of its last 25 seconds for either machine learning or testing is debatable.

3.4 How do the faults of GTZAN affect evaluation?

We now study how the faults of GTZAN affect the evaluation of MGR systems, e.g. the classification accuracies in the nearly 100 published works seen in Figure 2. Through our experiments below, we see the following two claims are false: (1) ‘all MGR systems and evaluations are affected in the same ways by the faults of GTZAN’; and (2) ‘the performances of all MGR systems in GTZAN, working with the same data and faults, are still meaningfully comparable’. Thus, regardless of how the systems are performing the task, the results in Figure 2 cannot be meaningfully interpreted.

It is not difficult to predict how some faults can affect the evaluations of MGR systems built using different approaches. For instance, when exact replicas are distributed across train and test sets, the evaluation of some systems can be more biased than others: a nearest neighbour classifier will find features in the training set with zero distance to the test feature, while a Bayesian classifier with a parametric model may not so strongly benefit when its model parameters are estimated from all training features. If there are replicas in the test set only, then they will bias an estimate of a figure of merit because they are not independent tests – if one is classified (in)correctly then its replicas are also classified (in)correctly. In addition to exact repetitions, we show above that the number of artists in GTZAN is at most 329. Thus, as Seyerlehner (2010) and Seyerlehner et al. (2010) predict for GTZAN, its use in evaluating systems will be biased due to the artist effect (Flexer, 2007; Flexer & Schnitzer, 2009, 2010; Pampalk, Flexer, & Widmer, 2005), i.e. the observation that a music similarity system can perform significantly worse when artists are disjoint in training and test datasets than when they are not. Since all results in Figure 2 come from evaluations without using an artist filter, they are quite likely to be optimistic.

To investigate the faults of GTZAN, we create several MGR systems using a variety of feature extraction and machine learning approaches paired with different training data in GTZAN. We define a system not as an abstract proposal of a machine learning method, a feature description, and so on, but as a real and working implementation of the components necessary to produce an output from an input (Sturm, 2013c). A system, in other words, has already been trained, and might be likened to a 'black box' operated by a customer in an environment, according to some instructions. In our case, each system specifies that the customer input 30 s of monophonic audio data uniformly sampled at 22,050 Hz, for which it outputs one of the ten GTZAN category names.

Some systems we create by combining the same features with three classifiers (Duin et al., 2008): nearest neighbour (NN), minimum distance (MD), and minimum Mahalanobis distance (MMD) (Theodoridis & Koutroumbas, 2009). These systems create feature vectors from a 30 s excerpt in the following way. For each 46.4 ms frame, with a hop of half that, each computes 13 MFCCs using the approach by Slaney (1998), zero crossings, and spectral centroid and rolloff. For each 130 consecutive frames, it computes the mean and variance of each dimension, thus producing nine 32-dimensional feature vectors. From the feature vectors of the training data, the system finds a normalization transformation that maps each dimension to [0, 1], i.e. the minimum of a dimension is subtracted from all values in that dimension, and then those are divided by the maximum magnitude difference between any two values. Each system applies the same transformation to the feature vectors extracted from an input. Each classifier chooses a label for an excerpt as follows: NN selects the class by majority vote, and breaks ties by selecting randomly among the tied classes; MD and MMD both select the label with the maximum log posterior sum over the nine feature vectors. Both MD and MMD model the feature vectors as independent and identically distributed multivariate Gaussian.
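
For illustration, a minimal sketch of this baseline feature pipeline follows, assuming librosa is available; the frame length of 1024 samples (about 46.4 ms at 22,050 Hz), the hop of 512 samples, and the helper names are our own choices, not the exact implementation used here.

```python
# Sketch of the baseline features: 13 MFCCs, zero crossings, spectral centroid and
# rolloff per frame, then mean and variance over blocks of 130 frames (32 dimensions).
import numpy as np
import librosa

def baseline_features(y, sr=22050, n_fft=1024, hop=512, block=130):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    roll = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
    n = min(mfcc.shape[1], zcr.shape[1], cent.shape[1], roll.shape[1])
    frames = np.vstack([mfcc[:, :n], zcr[:, :n], cent[:, :n], roll[:, :n]])  # 16 x n
    vecs = []
    for start in range(0, n - block + 1, block):
        seg = frames[:, start:start + block]
        vecs.append(np.concatenate([seg.mean(axis=1), seg.var(axis=1)]))     # 32-dim
    return np.array(vecs)  # roughly nine vectors for a 30 s excerpt

def fit_minmax(train_vecs):
    # Normalization mapping each dimension of the training features to [0, 1];
    # the same transform is then applied to features extracted from an input.
    lo, hi = train_vecs.min(axis=0), train_vecs.max(axis=0)
    return lambda X: (X - lo) / np.maximum(hi - lo, 1e-12)
```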

While the above approaches provide 'baseline' systems, we also use two state-of-the-art approaches that produce MGR systems measured to have high classification accuracies in GTZAN. The first is SRCAM – proposed by Panagakis, Kotropoulos and Arce (2009b) but modified by us (Sturm, 2013a) – which uses psychoacoustically-motivated features. The feature extraction produces one feature vector of 768 dimensions from a 30 s excerpt. SRCAM classifies an excerpt by using sparse representation classification (Wright, Yang, Ganesh, Sastry, & Ma, 2009). We implement this using the SPGL1 solver (van den Berg & Friedlander, 2008) with at most 200 iterations, and define ε² = 0.01. The second approach is MAPsCAT, which uses feature vectors of 'scattering transform' coefficients (Andén & Mallat, 2011). The feature extraction produces 40 feature vectors of 469 dimensions from a 30 s excerpt. MAPsCAT models the features in the training set as independent and identically distributed multivariate Gaussian, and computes for a test feature the log posterior in each class. We define all classes equally likely for MAPsCAT, as well as for MD and MMD. As for the baseline systems above, systems built using SRCAM and MAPsCAT normalize input feature vectors according to the training set. (See Sturm (2013a) for further details of SRCAM and MAPsCAT.)

We use four different partition strategies to create train and test datasets from GTZAN: 10 realizations of standard non-stratified 2fCV (ST); ST without the 67 exact and recording repetitions and two distortions (ST'); a non-stratified 2fCV with artist filtering (AF); and AF without the 67 exact and recording repetitions and two distortions (AF'). Table 3 shows the composition of each fold in AF in terms of artists. We created AF manually to ensure that: (1) each GTZAN category is approximately balanced in terms of the number of training and testing excerpts; and (2) each fold of a GTZAN category has music tagged with category top tags (Figure 5). For instance, the two Blues folds have 46 and 54 excerpts, and both have music tagged 'blues' and 'zydeco'. Unless otherwise noted, we do not take into account any potential mislabellings. In total, we train and test 220 MGR systems.
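
The AF folds in Table 3 were constructed manually; purely to illustrate the principle of artist filtering, the following sketch (with assumed inputs and a greedy balancing heuristic of our own) builds a two-fold split in which no artist appears in both folds.

```python
# Greedy two-fold split with an artist filter: all excerpts by one artist go to the
# same fold, and whole artists are assigned to the currently smaller fold.
from collections import defaultdict

def artist_filtered_two_fold(excerpts):
    """excerpts: iterable of (excerpt_id, artist) pairs -> (fold1_ids, fold2_ids)."""
    by_artist = defaultdict(list)
    for excerpt_id, artist in excerpts:
        by_artist[artist].append(excerpt_id)
    fold1, fold2 = [], []
    # Assign the largest artists first to keep the two folds roughly balanced.
    for artist, ids in sorted(by_artist.items(), key=lambda kv: -len(kv[1])):
        (fold1 if len(fold1) <= len(fold2) else fold2).extend(ids)
    return fold1, fold2
```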


Table 3. Composition of each fold of the artist filter partitioning (500 excerpts in each). Italicized artists appear in two GTZAN categories.

Blues
  Fold 1: John Lee Hooker, Kelly Joe Phelps, Buckwheat Zydeco, Magic Slim & The Teardrops
  Fold 2: Robert Johnson, Stevie Ray Vaughan, Clifton Chenier, Hot Toddy, Albert Collins

Classical
  Fold 1: J. S. Bach, Percy Grainger, Maurice Ravel, Henri Dutilleux, Tchaikovsky, Franz Schubert, Leonard Bernstein, misc.
  Fold 2: Beethoven, Franz Joseph Haydn, Mozart, Vivaldi, Claude Debussy, misc.

Country
  Fold 1: Shania Twain, Johnny Cash, Willie Nelson, misc.
  Fold 2: Brad Paisley, George Strait, Vince Gill, misc.

Disco
  Fold 1: Donna Summer, KC and The Sunshine Band, Ottawan, The Gibson Brothers, Heatwave, Evelyn Thomas, misc.
  Fold 2: Carl Douglas, Village People, The Trammps, Earth Wind and Fire, Boney M., ABBA, Gloria Gaynor, misc.

Hip hop
  Fold 1: De La Soul, Ice Cube, Wu-Tang Clan, Cypress Hill, Beastie Boys, 50 Cent, Eminem, misc.
  Fold 2: A Tribe Called Quest, Public Enemy, Lauryn Hill, Wyclef Jean

Jazz
  Fold 1: Leonard Bernstein, Coleman Hawkins, Branford Marsalis Trio, misc.
  Fold 2: James Carter, Joe Lovano, Dexter Gordon, Tony Williams, Miles Davis, Joe Henderson, misc.

Metal
  Fold 1: Judas Priest, Black Sabbath, Queen, Dio, Def Leppard, Rage Against the Machine, Guns N' Roses, New Bomb Turks, misc.
  Fold 2: AC/DC, Dark Tranquillity, Iron Maiden, Ozzy Osbourne, Metallica, misc.

Pop
  Fold 1: Mariah Carey, Celine Dion, Britney Spears, Alanis Morissette, Christina Aguilera, misc.
  Fold 2: Destiny's Child, Mandy Moore, Jennifer Lopez, Janet Jackson, Madonna, misc.

Reggae
  Fold 1: Burning Spear, Desmond Dekker, Jimmy Cliff, Bounty Killer, Dennis Brown, Gregory Isaacs, Ini Kamoze, misc.
  Fold 2: Peter Tosh, Prince Buster, Bob Marley, Lauryn Hill, misc.

Rock
  Fold 1: Sting, Simply Red, Queen, Survivor, Guns N' Roses, The Stone Roses, misc.
  Fold 2: The Rolling Stones, Ani DiFranco, Led Zeppelin, Simple Minds, Morphine, misc.

We look at several figures of merit computed from a comparison of the outputs of the systems to the 'ground truth' of testing datasets: confusion, precision, recall, F-score, and classification accuracy. Define the set of GTZAN categories G. Consider that we input to a system M(g) excerpts from GTZAN category g ∈ G, and that of these the system categorizes as r ∈ G the number M(g as r) ≤ M(g). We define the confusion of GTZAN category g as r for a system

C(g \text{ as } r) := M(g \text{ as } r) / M(g). \qquad (1)

The recall for GTZAN category g of a system is then C(g as g). We define the normalized accuracy of a system by

A := \frac{1}{|G|} \sum_{g \in G} C(g \text{ as } g). \qquad (2)

We use normalized accuracy because the number of inputs in each GTZAN category may not be equal in a test set. We define the precision of a system for GTZAN category g as

P(g) := M(g \text{ as } g) \Big/ \sum_{r \in G} M(r \text{ as } g). \qquad (3)

Finally, we define the F-score of a system for GTZAN category g as

F(g) := 2 P(g) C(g \text{ as } g) / \left[ P(g) + C(g \text{ as } g) \right]. \qquad (4)
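
Equations (1)-(4) translate directly into code; the following sketch (our own helper, using NumPy) computes them from a matrix of counts whose rows are true GTZAN categories and whose columns are the labels chosen by a system.

```python
import numpy as np

def figures_of_merit(M):
    """M[g, r] = number of excerpts from category g that the system labels r."""
    M = np.asarray(M, dtype=float)
    C = M / M.sum(axis=1, keepdims=True)                 # eq. (1): confusions C(g as r)
    recall = np.diag(C)                                  # C(g as g)
    A = recall.mean()                                    # eq. (2): normalized accuracy
    precision = np.diag(M) / M.sum(axis=0)               # eq. (3)
    F = 2 * precision * recall / (precision + recall)    # eq. (4)
    return C, recall, precision, F, A
```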

To test for significant differences in the performance between two systems in the same test dataset, we build a contingency table (Salzberg, 1997). Define the random variable N to be the number of times the two systems choose different categories, but one is correct. Let t12 be the number for which system 1 is correct but system 2 is wrong. Thus, N − t12 is the number of observations for which system 2 is correct but system 1 is wrong. Define the random variable T12 from which t12 is a sample. The null hypothesis is that the systems perform equally well given N = n, i.e. E[T12 | N = n] = n/2, in which case T12 is distributed binomially, i.e.

p_{T_{12}|N=n}(t) = \binom{n}{t} (0.5)^n, \quad 0 \le t \le n. \qquad (5)

The probability we observe a particular performance given the systems actually perform equally well is

p := P[T_{12} \le \min(t_{12}, n - t_{12})] + P[T_{12} \ge \max(t_{12}, n - t_{12})]
   = \sum_{t=0}^{\min(t_{12}, n - t_{12})} p_{T_{12}|N=n}(t) + \sum_{t=\max(t_{12}, n - t_{12})}^{n} p_{T_{12}|N=n}(t). \qquad (6)

We define statistical significance as α = 0.05, and reject the null hypothesis if p < α.
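
A sketch of this exact test follows, assuming SciPy is available; t12 is the number of the n disagreement trials (those where exactly one system is correct) won by system 1, and the returned value is the p of equation (6).

```python
from scipy.stats import binom

def binomial_comparison_p(t12, n):
    # eq. (6): both tails of the Binomial(n, 0.5) distribution of T12 under the null.
    lo, hi = min(t12, n - t12), max(t12, n - t12)
    return binom.cdf(lo, n, 0.5) + binom.sf(hi - 1, n, 0.5)   # P[T <= lo] + P[T >= hi]

# Example: reject the null hypothesis of equal performance when p < 0.05.
print(binomial_comparison_p(35, 50) < 0.05)   # True: 35 of 50 disagreements is lopsided
```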

Figure 7 shows a summary of the normalized accuracies of all our MGR systems, with respect to the five approaches and four partitions we use. As each partition breaks GTZAN into two folds, the left and right end points of a segment correspond to using the first or second fold for training, and the other for testing. For the ten random partitions of ST and ST', we show the mean normalized accuracy, and a vertical line segment of two standard deviations centred on the mean. It is immediately clear that the faults of GTZAN affect an estimate of the classification accuracy for an MGR system. For systems created with all approaches except NN, the differences between conditions ST and ST' are small. As we predict above, the performance of systems created using NN appears to benefit more than the others from the exact and recording repetition faults of GTZAN, boosting their mean normalized accuracy from below 0.6 to 0.63. Between conditions AF and AF', we see that removing the repetitions produces very little change, presumably because the artist filter keeps exact and recording repetitions from being split across train and test datasets. Most clearly, however, we see for all systems a large decrease in performance between conditions ST and AF, and that each system is affected to different degrees. The difference in normalized accuracy between conditions ST and AF appears the smallest for MD (7 points), while it appears the largest for MAPsCAT (25 points). Since we have thus found systems with performance evaluations affected to different degrees by the faults in GTZAN – systems using NN are hurt by removing the repetitions while those using MMD are not – we have disproven the claim that the faults of GTZAN affect all MGR systems in the same ways.

Fig. 7. Normalized accuracy (2) of each approach (x-axis) for each fold (left and right) of different partition strategies (legend). One standard deviation above and below the mean are shown for ST and ST'.

When we test for significant differences in performance between all pairs of systems in the conditions ST and ST', only for those created with MMD and NN do we fail to reject the null hypothesis. In terms of classification accuracy in the condition ST, and the binomial test described above, we can say with statistical significance: those systems created using MD perform worse than those systems created using MMD and NN, which perform worse than those systems created using MAPsCAT, which perform worse than those systems created using SRCAM. The systems created using SRCAM and MAPsCAT are performing significantly better than the baseline systems. In the conditions AF and AF', however, we fail to reject the null hypothesis for systems created using MAPsCAT and MD, and systems created using MAPsCAT and MMD. MAPsCAT, measured as performing significantly better than the baseline in the conditions ST and ST' – and hence motivating conclusions that the features it uses are superior for MGR (Andén & Mallat, 2011, 2013) – now performs no better than the baseline in the conditions AF and AF'. Therefore, this disproves that the performances of all MGR systems in GTZAN – all results shown in Figure 2 – are still meaningfully comparable for MGR. Some of them benefit significantly more than others due to the faults in GTZAN, but which ones they are cannot be known.

We now focus our analysis upon systems created using SRCAM. Figure 8 shows averaged figures of merit for systems built using SRCAM and the 10 realizations of ST and ST', as well as for the single partition AF'. Between systems built in conditions ST and ST', we see very little change in the recalls for GTZAN categories with the fewest exact and recording repetitions: Blues, Classical and Rock. However, for the GTZAN categories having the most exact and recording repetitions, we find large changes in recall. In fact, Figure 9 shows that the number of exact and recording repetitions in a GTZAN category is correlated with a decrease in the recall of SRCAM. This makes sense because SRCAM is like adaptive nearest neighbours (Noorzad & Sturm, 2012). When evaluating the systems we created using SRCAM in the condition AF', Figure 8 shows the mean classification accuracy is 22 points lower than that in condition ST. With respect to F-score, excerpts in GTZAN Classical and Metal suffer the least; but we see decreases for all other categories by at least 10 points, e.g. 62 points for GTZAN Blues, 29 points for GTZAN Reggae, and 25 points for GTZAN Jazz. We see little change in classification accuracy between conditions AF' and AF (not shown).

The question remains: what are the worst figures of merit that a 'perfect' MGR system can obtain in GTZAN? We first assume that the 81 excerpts yet to be identified are 'correct' in their categorization, and that the last 25 s of Reggae 86 are ignored because of severe distortion. Then, we assume a 'perfect' system categorizes all GTZAN excerpts in their own categories, except for the 52 potential misclassifications in Table 2. For each of those excerpts, we consider that the system chooses the other GTZAN categories in which its number of votes is higher than, or as high as, the number of votes in its own category. For instance, Country 39 has its largest vote in GTZAN Blues, so we add one to the Blues row in the GTZAN Country column. When a vote is split between K categories, we add 1/K in the relevant positions of the confusion table. For instance, Disco 11 has the same number of votes in GTZAN Pop and Rock, so we add 0.5 to the Pop and to the Rock rows of the GTZAN Disco column.
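
As a small illustration (our own bookkeeping helper, with hypothetical inputs), the following shows how one potentially mislabelled excerpt spreads its unit of confusion across the categories that tie with or outvote its own.

```python
# For one potentially mislabelled excerpt, split one unit of 'mass' evenly among the
# categories whose vote counts are at least as high as those of its own category, and
# add it to those rows of the excerpt's own (true) column.
def add_potential_mislabelling(confusion, own_category, winning_categories):
    """confusion: dict keyed by (predicted, true) -> fractional count; mutated in place."""
    share = 1.0 / len(winning_categories)
    for predicted in winning_categories:
        key = (predicted, own_category)
        confusion[key] = confusion.get(key, 0.0) + share

confusion = {}
add_potential_mislabelling(confusion, 'Country', ['Blues'])        # e.g. Country 39
add_potential_mislabelling(confusion, 'Disco', ['Pop', 'Rock'])    # e.g. Disco 11
print(confusion)  # {('Blues', 'Country'): 1.0, ('Pop', 'Disco'): 0.5, ('Rock', 'Disco'): 0.5}
```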

Table 4 shows the worst figures of merit we expect for this 'perfect' system, which is also imposed as the thick grey line in Figure 2. If the figures of merit of an MGR system tested in GTZAN are better than this, it might actually be performing worse than the 'perfect' system. Indeed, we have found that the classification accuracies for six MGR approaches appearing close to this limit (each marked by an 'x' in Figure 2) are due instead to inappropriate evaluation designs (discussed further in Section 4.5).


Fig. 8. 100× confusion, precision (Pr), F-score (F), and normalized accuracy (bottom right corner) for systems built using SRCAM, trained and tested in ST (a) and ST' (b) (averaged over 10 realizations), as well as in AF' (c) (with artist filtering and without replicas). Columns are 'true' labels; rows are predictions. Darkness of square corresponds to value. Labels: Blues (bl), Classical (cl), Country (co), Disco (di), Hip hop (hi), Jazz (ja), Metal (me), Pop (po), Reggae (re), Rock (ro).

3.5 Discussion

Though it may not have been originally designed to be the benchmark dataset for MGR, GTZAN is currently honoured as such by virtue of its use in earlier influential works (Tzanetakis & Cook, 2002), as well as its continued widespread use, and its public availability. It was, after all, one of the first publicly-available datasets in MIR. Altogether, GTZAN appears more than any other dataset in the evaluations of nearly 100 MGR publications (Sturm, 2012b).16 It is thus fair to claim that researchers who have developed MGR systems have used, or been advised to use, GTZAN for evaluating success. For instance, high classification accuracy in GTZAN has been argued as indicating the superiority of approaches proposed for MGR (Bergstra et al., 2006; Li & Ogihara, 2006; Panagakis et al., 2009b). Furthermore, performance in GTZAN has been used to argue for the relevance of features to MGR, e.g. 'the integration of timbre features with temporal features are important for genre classification' (Fu et al., 2011). Despite its ten-year history spanning most of the research in MGR since the review of Aucouturier and Pachet (2003), only five works indicate someone has looked at what is in GTZAN. Of these, only one (Li & Chan, 2011) implicitly reports listening to all of GTZAN; and another completely misses the mark in its assessment of the integrity of GTZAN (Bergstra et al., 2006). When GTZAN has been used, it has been used for the most part without question.

GTZAN has never had metadata identifying its contents, but our work finally fills this gap, and shows the extent to which GTZAN has faults. We find 50 exact repetitions, 21 suspected recording repetitions, and a significant amount of artist repetition. Our analysis of the content of each GTZAN category shows the excerpts in each are more diverse than what the category labels suggest, e.g. the GTZAN Blues excerpts include music using the blues genre, as well as the zydeco and cajun genres. We find potential mislabellings, e.g. two excerpts of orchestral music by Leonard Bernstein appear in GTZAN Jazz, but might be better placed with the other orchestral excerpts of Bernstein in GTZAN Classical; an excerpt by Wayne Toups & Zydecajun appears in GTZAN Country, but might be better placed with the other cajun and zydeco music in GTZAN Blues. Finally, we find some excerpts suffer from distortions, such as jitter, clipping, and extreme degradation (Reggae 86).

Having proven that GTZAN is flawed, we then sought to measure the real effects of these flaws on the evaluation of MGR systems. Using three baseline and two state-of-the-art approaches, we disproved the claims that all MGR systems are affected in the same ways by the faults of GTZAN, and that the performances of MGR systems in GTZAN are still meaningfully comparable no matter how the systems are performing the task. In other words, evaluating systems in GTZAN with Classify without taking into consideration its contents provides numbers that may be precise and lend themselves to a variety of formal statistical tests, but that are nonetheless irrelevant for judging which system can satisfy the success criteria of some use case in the real world that requires musical intelligence, not to mention whether the system demonstrates a capacity for recognizing genre at all. This lack of validity in standard MGR evaluation means one cannot simply say, '94.8 in GTZAN is the new 100', and then proceed with 'business as usual'.

16 Which is now over 100 considering publications in 2013 not included in our survey (Sturm, 2012b).

Fig. 9. Scatter plot of percent change in mean figures of merit (FoM) for SRCAM between the conditions ST and ST' (Figure 8) as a function of the number of exact and recording repetitions in a GTZAN category.

Table 4. The worst figures of merit (×10⁻²) of a 'perfect' classifier evaluated with GTZAN, which takes into account the 52 potential mislabellings in Table 2, and assuming that the 81 excerpts we have yet to identify have 'correct' labels.

                 Blues  Classical  Country  Disco  Hip hop  Jazz  Metal   Pop  Reggae  Rock  Precision
Blues              100          0        1      0        0     0      0     0       0     0       99.0
Classical            0        100        0      0        0     2      0     0       0     0       98.0
Country              0          0       95      0        0     0      0     0       0     1       99.0
Disco                0          0        0     92        1     0      0     2       0     0       97.0
Hip hop              0          0        0      1       96     0      0     0       1     0       98.0
Jazz                 0          0        0      0        0    98      0     0       0     0      100.0
Metal                0          0        0      0        0     0     92     0       0    13       87.6
Pop                  0          0        0    2.5        3     0      3    95       1     3       88.3
Reggae               0          0        0      0        0     0      0     0      98     1       99.0
Rock                 0          0        4    4.5        0     0      5     3       0    82       83.3
F-score           99.5       99.0     96.9   94.4     97.2  99.0   89.8  91.6    98.5  82.7   Acc: 94.8

Since all results in Figure 2 come from experimental conditions where independent variables are not controlled, e.g. by artist filtering, the 'progress' in MGR seen over the years is very suspect. Systems built from MAPsCAT and SRCAM, previously evaluated in GTZAN to have classification accuracies of 83% without considering the faults of GTZAN (Sturm, 2012c, 2013a), now lie at the bottom in Figure 2 below 56%. Where the performances of all the other systems lie, we do not yet know. The extreme reliance of MGR systems on other uncontrolled independent variables confounded with the GTZAN category labels becomes clear with irrelevant transformations (Sturm, 2013g).

Some may argue that our litany of faults in GTZAN ignores what they say is the most serious problem with a dataset like GTZAN: that it is too small to produce meaningful results. In some respects, this is justified. While personal music collections may number thousands of music recordings, commercial datasets and library archives number in the millions. The 1000 excerpts of GTZAN are most definitely an insufficient random sample of the population of excerpts 'exemplary' of the kinds of music between which one may wish an MGR system to discriminate if it is to be useful for some use case. Hence, one might argue, it is unreasonably optimistic to assume an MGR system can learn from a fold of GTZAN those rules and characteristics people use (or at least Tzanetakis used) when describing music as belonging to or demonstrating aspects of the particular genres used by music in GTZAN. Such warranted skepticism, compounded with the pairing of well-defined algorithms and an ill-defined problem, highlights the absurdity of interpreting the results in Figure 2 as indicating real progress is being made in suffusing computers with musical intelligence.

One might hold little hope that GTZAN could ever be useful for tasks such as evaluating systems for MGR, audio similarity, autotagging, and the like. Some might call for GTZAN to be 'banished'. There are, however, many ways to evaluate an MGR system using GTZAN. Indeed, its faults are representative of data in the real world, and they themselves can be used in the service of evaluation (Sturm, 2013a, 2013g). For instance, by using the GTZAN dataset, we (Sturm, 2012c, 2013a, 2013g) perform several different experiments to illuminate the (in)sanity of a system's internal model of music genre. In one experiment (Sturm, 2012c, 2013a), we look at the kinds of pathological errors of an MGR system rather than what it labels 'correctly'. We design a reduced Turing test to measure how well the 'wrong' labels it selects imitate human choices. In another experiment (Sturm, 2012c, 2013g), we attempt to fool a system into selecting any genre label for the same piece of music by changing factors that are irrelevant to genre, e.g. by subtle time-invariant filtering. In another experiment (Sturm, 2012c), we have an MGR system compose music excerpts it hears as highly representative of the 'genres of GTZAN', and then we perform a formal listening test to determine if those genres are recognizable. The lesson is not to banish GTZAN, but to use it with full consideration of its musical content. It is currently not clear what makes a dataset 'better' than another for MGR, and whether any are free of the kinds of faults in GTZAN; but at least now with GTZAN, one has a manageable, public, and finally well-studied dataset.

4. Implications for future work in MGR and MIR

The most immediate point to come from our work here is perhaps that one should not take for granted the integrity of any given dataset, even when it has been used in a large amount of research. Any researcher evaluating an MIR system using a dataset must know the data, know real data has faults, and know such faults have real impacts. However, there are five other points important for the next ten years of MIR. First, problems in MIR should be unambiguously defined through specifying use cases and with formalism. Second, experiments must be designed, implemented, and analysed such that they have validity and relevance with respect to the intended scientific questions. Third, system analysis should be deeper than just evaluation. Fourth, the limitations of all experiments should be acknowledged, and appropriate degrees of skepticism should be applied. Finally, as a large part of MIR research is essentially algorithm- and data-based, it is imperative to make all such work reproducible.

4.1 Define problems with use cases and formalism

As mentioned in Section 2, few works of our survey (Sturm, 2012b) define 'the problem of music genre recognition', but nearly all of them treat genre as categories to which objects belong, separated by boundaries, whether crisp or fuzzy, objective or subjective, but recoverable by machine learning algorithms. The monopoly of this Aristotelean viewpoint of genre is evinced by the kind of evaluation performed in MGR work (Sturm, 2012b): 397 of 435 published works pose as relevant the comparison of labels generated by an MGR system with a 'ground truth' (Classify). All but ten of these works consider MGR as a single-label classification problem. We have posed (Sturm, 2013b) 'the principal goal of MGR', but the discussion of what its principal goal is might be premature without the consideration of well-defined use cases.

The parameters and requirements of a music recommendation service are certainly different from those of a person wanting to organize their home media collection, which are different from those of a music library looking to serve the information needs of musicologists. The light provided by a use case can help define a problem, and show how the performance of a solution should be measured; however, only a few works in the MGR literature consider use cases. In the same direction, the MIReS Roadmap (Serra et al., 2013) recommends the development of 'meaningful evaluation tasks'. A use case helps specify the solutions, and what is to be tested by evaluations. One good example is provided by one of the first works in MGR: Dannenberg, Thom and Watson (1997) seek to address the specific scenario of interactive music performance between a musician and machine listener. It poses several requirements: the musician must play styles consistently; the classification must take no longer than five seconds; and the number of false positives should be minimized. Dannenberg et al. evaluate solutions not only in the laboratory, but also in the intended situation. This work shows how a use case helps define a problem, and the success criteria for solutions.

To fully define a problem with no uncertain terms, it must be formalized. Formalization is a tool that can disambiguate all parts of a problem in order to clarify assumptions, highlight the existence of contradictions and other problems, suggest and analyse solutions, and provide a path for designing, implementing and analysing evaluations. By and large, such formalization remains implicit in much evaluation of MIR, which has been uncritically and/or unknowingly borrowed from other disciplines, like machine learning and text information retrieval (Sturm, 2013c; Urbano, Schedl, & Serra, 2013). We have attempted to finally make this formalism explicit (Sturm, 2013c): we define what a system is, and what it means to analyse one; we use the formalism of the design and analysis of experiments, e.g. Bailey (2008), to dissect an evaluation into its aims, parts, design, execution, and analysis; and we show how an evaluation in MIR is an experiment, and thus makes several assumptions. When this formalism is made explicit, it becomes clear why the systematic and rigorous evaluations performed using standardized datasets in many MIREX tasks can still not be scientific evaluation (Dougherty & Dalton, 2013; Sturm, 2013c). Consequently, what must be done in order to address this becomes clearer.

4.2 Design valid and relevant experiments

No result in Figure 2 provides a reasonable direction for addressing a use case that requires an artificial system having musical intelligence. That an MGR system evaluated using Classify achieves even a perfect classification accuracy in a dataset having uncontrolled independent variables provides no logical support for the claim that the system is using criteria relevant to the meaning behind those labels (Sturm, 2012c, 2013a, 2013b). The same is true of the results from the past several years of the MIREX MGR task, or the MIREX audio mood classification task, or the various MIREX autotagging tasks (Gouyon, Sturm, Oliveira, Hespanhol, & Langlois, 2013; Sturm, 2013c). An unwary company looking to build and sell a solution for a particular use case requiring a system capable of music intelligence will be misled by all these results when following the intuition that a system with a classification accuracy of over 0.8 must be 'better' than one below 0.7. This is not to say that none of the systems in Figure 2 are using relevant criteria to decide on the genre labels (e.g. instrumentation, rhythm, form), but that such a conclusion does not follow from the experiment, no matter its outcome. In short, none of this evaluation is valid.

In discussing the music intelligence of a system, and the lack of validity in evaluation for measuring it, we might be mistaken as misrepresenting the purpose of machine learning, or overstating its capabilities. Of course, machine learning algorithms are agnostic to the intentions of an engineer implementing and employing them, and they are impervious to any accusation of being a 'horse' (Sturm, 2013g). Machine learning does not create systems that think or behave as humans, or mimic the decision process of humans. However, there are different use cases where 'by any means' is or is not acceptable. Even when a problem is underspecified, e.g. 'The mob boss asked me to make the guy disappear', any solution might not be acceptable, e.g. 'so, I hired a magician'. If one seeks only to give the illusion that a robot can recognize genre and emotion in music (Xia, Tay, Dannenberg, & Veloso, 2012), or that a musical water fountain is optimal (Park, Park, & Sim, 2011), then 'whatever works' may be fine. However, if one seeks to make a tool that is useful to a musicologist for exploring stylistic influence using a music archive, then a system that uses musically meaningful criteria to make judgements is likely preferable over one that works by confounded but irrelevant factors.

Validity in experiments is elegantly demonstrated by the case of 'Clever Hans' (Pfungst, 1911; Sturm, 2013g). Briefly, Hans was a horse that appeared able to perform complex arithmetic feats, as well as many other tasks that require abstract thought.17 Many people were convinced by Hans' abilities; and a group of intellectuals was unable to answer the question, 'Is Hans capable of abstract thought?' It was not until valid experiments were designed and implemented to test well-posed and scientific questions that Hans was definitively proven to be incapable of arithmetic – as well as his many other intellectual feats – and only appeared so because he was responding to unintentional but confounded cues of whoever asked him a question (Pfungst, 1911). The question of Hans' ability in arithmetic is not validly addressed by asking Hans more arithmetic problems and comparing his answers with the 'ground truth' (Classify), but by designing valid and relevant experiments in which careful control is exercised over independent variables.

17 A modern-day equivalent is 'Maggie', a dog that has performed arithmetic feats on nationally syndicated television programs: http://www.oprah.com/oprahshow/Maggie-the-Dog-Does-Math

Fig. 10. Normalized classification accuracy of the male/female classifier (in the test dataset) as a function of the number of features selected by maximizing inter- and intra-class separation. The three features producing the highest classification accuracy in the training set are 'relationship', 'occupation' and 'marital-status'.

Confounds in machine learning experiments create serious barriers to drawing conclusions from evaluations – which can also be seen as the problem of 'overfitting'. Consider that we wish to create a classifier to predict the sexes of people. We begin by randomly sampling 48,842 people from around the United States of America (USA) in 1996.18 Consider that we are not aware that the sex of a person is determined (for the most part) by the presence or absence of the 'Y' chromosome, and so we ask about age, education, occupation, marital status, relationship to spouse, and many other things that might be relevant, and create a multivariate dataset from the entries. As is common in machine learning, we split this into a training dataset (2/3) and testing dataset (1/3), and then perform feature selection (Theodoridis & Koutroumbas, 2009) using the training dataset to find the best features of sex with respect to, e.g. maximizing the ratio of intra- to inter-class separation, or by considering the Fisher discriminant. Finally, we create several classifiers using these feature combinations and training dataset, evaluate each using Classify on the testing dataset, and compute its normalized classification accuracy.
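
A hedged sketch of this illustration follows; it is not the code behind Figure 10, and it assumes the Adult data can be fetched from OpenML through scikit-learn, uses a crude ordinal encoding of the attributes, and ranks features with an ANOVA F-statistic as a stand-in for the Fisher-style criterion described above.

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

# Fetch the Adult census data (48,842 records) and try to predict 'sex' from the rest.
adult = fetch_openml("adult", version=2, as_frame=True)
X_raw = adult.data.drop(columns=["sex"])
y = adult.data["sex"]

# Crude ordinal encoding of every attribute, sufficient for this illustration only.
X = OrdinalEncoder().fit_transform(X_raw.astype(str))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

for k in (1, 2, 3):
    selector = SelectKBest(f_classif, k=k).fit(X_tr, y_tr)   # rank and keep k features
    clf = KNeighborsClassifier().fit(selector.transform(X_tr), y_tr)
    chosen = list(X_raw.columns[selector.get_support(indices=True)])
    print(k, chosen, round(clf.score(selector.transform(X_te), y_te), 3))
```

The printed accuracies will not reproduce Figure 10 exactly, but the point stands: attributes confounded with sex in this sample can push accuracy well above 0.5.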

Figure 10 plots the accuracies of the classifiers as a function of the number of best features. We find that with the 'relationship' feature19 we can build a classifier with an accuracy of almost 0.7 in the testing dataset. If we add to this the 'occupation' feature20 then accuracy increases to about 0.8 in the test dataset. If we also consider the 'marital-status' feature21 then we can build a classifier having an accuracy that exceeds 0.85 in the test dataset. Since we believe random classifiers would have an accuracy of 0.5 in each case, we might claim from these results, 'Our systems predict the sex of a person better than random'. We might go further and conclude, 'The two most relevant features for predicting the sex of a person are "relationship" and "occupation"'.

18 We use here the multivariate dataset, http://archive.ics.uci.edu/ml/datasets/Adult
19 This feature takes a value in {'Wife', 'Husband', 'Unmarried', 'Child'}.
20 This feature takes a value in {'Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces'}.
21 This feature takes a value in {'Married', 'Divorced', 'Never-married', 'Widowed', 'Separated'}.

Now consider these results while knowing how sex is truly determined. While it is clear that the 'relationship' feature is a direct signifier of sex given a person is married, we know that the 'occupation' feature is irrelevant to the sex of any person. It is only a factor that happens to be confounded in our dataset with the attribute of interest through the economic and societal norms of the population from which we collected samples, i.e. 1996 USA. If we test our trained system in populations with economics and societal norms that are different from 1996 USA, the confounds could break and its performance thus degrade. Furthermore, the 'occupation' feature only appears to be relevant to our task because: (1) it is a confounded factor of sex more correlated than the others; and (2) it is among the attributes we chose for this task from the outset, few of which have any relevance to sex. Hence, due to confounds – independent variables that are not controlled in the experimental design – the evaluation lacks validity, and neither one of the conclusions can follow from the experiment.22

Our work joins the growing chorus of concerns about validity in MIR evaluation. A number of interesting and practical works have been produced by Urbano and co-workers (Urbano, 2011; Urbano, McFee, Downie, & Schedl, 2012; Urbano, Mónica, & Morato, 2011; Urbano et al., 2013). The work in Urbano et al. (2011) is concerned with evaluating the performance of audio similarity algorithms, and is extended in Urbano et al. (2012) to explore how well an MIR system must perform in a particular evaluation in order to be potentially useful in the real world. Urbano (2011) deals specifically with the question of experimental validity of evaluation in MIR; and Urbano et al. (2013) extend this to show that evaluation of evaluation remains neglected by the MIR community. Aucouturier and Bigand (2013) look at reasons why MIR has failed to acquire attention from outside computer science, and pose that it is due in no small part to its lack of a scientific approach. While evaluation is also cited in the MIReS roadmap (Serra et al., 2013) as a current area of concern, such concerns are by no means only recent. The initiatives behind MIREX were motivated in part to address the fact that evaluation in early MIR research did not facilitate comparisons with new methods (Downie, 2003, 2004). MIREX has run numerous systematic, rigorous and standardized evaluation campaigns for many different tasks since 2005 (Downie, 2008; Downie, Ehmann, Bay, & Jones, 2010), which has undoubtedly increased the visibility of MIR (Cunningham, Bainbridge, & Downie, 2012). Furthermore, past MIR evaluation has stumbled upon the observations that evaluation using training and testing datasets having music from the same albums (Mandel & Ellis, 2005) or artists (Pampalk et al., 2005) results in inflated performances compared with when the training and testing datasets do not share songs from the same album or by the same artists. Just as we have shown above for GTZAN, the effects of this can be quite significant (Flexer, 2007), and can now finally be explained through our work as resulting from independent variables that are uncontrolled in experiments.

22 At least, the scope of the first conclusion must be limited to 1996 USA, and that of the second to the dataset.

Evaluation in MIR is not just a question of what figure of merit to use, but of how to draw a valid conclusion by designing, implementing, and analysing an experiment relevant to the intended scientific question (Sturm, 2013c). For any experiment, thorough consideration must be given to its 'materials', 'treatments', 'design', 'responses', 'measurements', and 'models' of those measurements. Beginning ideally with a well-specified and scientific question, the experimentalist is faced with innumerable ways of attempting to answer that question. However, a constant threat to the validity of an experiment is the selection of the components of an experiment because of convenience, and not because they actually address the scientific question. Another threat is the persuasion of precision. Richard Hamming has argued,23 'There is a confusion between what is reliably measured, and what is relevant. ... What can be measured precisely or reliably does not mean it is relevant'. Another threat is the blind application of statistics, resulting in 'errors of the third kind' (Kimball, 1957): correct answers to the wrong questions. Hand (1994) shows why one must compare the intended scientific questions to the statistical questions that are actually answered. While the absence of formal statistics in MIR research has been highlighted by Flexer (2006), statistics is agnostic to the quality of data, the intention of the experiment, and the ability of the experimentalist. Statistics provides powerful tools, but none of them can rescue an invalid evaluation.

One might be inclined to say that since one problem with Figure 2 is that GTZAN has faults that are not considered in any of the results, better datasets could help. However, though a dataset may be larger and more modern than GTZAN, it does not mean that it is free of the same kinds of faults found in GTZAN. Furthermore, that a dataset is large does not free the designer of an MIR system of the necessarily difficult task of designing, implementing, and analysing an evaluation having the validity to conclude – if such a thing is indeed relevant to a use case – whether the decisions and behaviours of a system are related to the musical content that is supposedly behind those decisions (Gouyon et al., 2013; Sturm, 2013b, 2013g). We have shown instead that the significant root of this problem, of which data plays a role, is in the validity and relevance of experiments (Sturm, 2013b, 2013c). All independent variables must be accounted for in an evaluation, by controlling the experimental conditions, and by guaranteeing that enough material is sampled in a random way that the effects of the independent variables that are out of reach of the experimentalist will satisfy the assumptions of the measurement model (Bailey, 2008; Sturm, 2013c). What amount of data is necessary depends on the scientific question of interest; but even a million songs may not be enough (Bertin-Mahieux, Ellis, Whitman, & Lamere, 2011).

23 R. Hamming, 'You get what you measure', lecture at Naval Postgraduate School, June 1995. http://www.youtube.com/watch?v=LNhcaVi3zPA

An excellent example providing a model for the design, implementation, analysis and description of valid and relevant scientific experiments is that of Chase (2001) – which complements the earlier and pioneering results of Porter and Neuringer (1984). This work shows the effort necessary for planning and executing experiments to take place over a duration of about two years such that waste is avoided and results are maximized. Chase designs four experiments to answer two specific questions: (1) Can fish learn to discriminate between complex auditory stimuli that humans can also discriminate? and (2) If so, does this behaviour generalize to novel auditory stimuli with the same discriminability? Chase uses three koi fish as the material of the experiment, and specifies as the stimuli (treatments) recorded musical audio, both real and synthesized, in the categories 'blues' and 'classical'. Chase designs the experiments to have hundreds of trials, some of which investigate potential confounds, probe the generalizability of the discrimination of each fish, confirm results when the reinforcement stimuli are switched, and so on. From these experiments, one can conclude whether fish can indeed learn to discriminate between complex auditory stimuli.

Unfortunately, much evaluation in MGR, music emotion recognition, and autotagging has been just the kind of show performed by Clever Hans (Pfungst, 1911): certainly, an evaluation observed by an audience and done not to deceive, but entirely irrelevant for proving any capacity of these systems for musical intelligence. This has thus contributed to the training of horses of many exotic breeds, and competitions to show whose horse is better trained; but, aside from this pageantry, one does not know whether any meaningful problem has ever been addressed in it all. The design and implementation of a valid experiment requires creativity and effort that cannot be substituted with collecting more data; and a flawed design or implementation cannot be rescued by statistics. If one can elicit any response from a system by changing irrelevant factors, then one has a horse (Sturm, 2013g).

4.3 Perform system analysis deeper than just evaluation

In our work (Sturm, 2013c), we define a system ('a connected set of interacting and interdependent components that together address a goal'), system analysis (addressing questions and hypotheses related to the past, present and future of a system), and highlight the key role played by evaluation ('a "fact-finding campaign" intended to address a number of relevant questions and/or hypotheses related to the goal of a system'). A system analysis that only addresses a question of how well a system reproduces the labels of a testing dataset is quite shallow and of limited use. Such an analysis does not produce knowledge about how the system is operating or whether its performance is satisfactory with respect to some use case. More importantly, an opportunity is lost for determining how a system can be improved, or adapted to different use cases. One of the shortcomings of evaluation currently practiced in MIR is that system analysis remains shallow, and does not help one to understand or improve systems.

The MIReS Roadmap (Serra et al., 2013) also points to the need for deeper system analysis. One challenge it identifies is the design of evaluation that provides 'qualitative insights on how to improve [systems]'. Another challenge is that entire systems should be evaluated, and not just their unconnected components. In a similar light, Urbano et al. (2013) discuss how the development cycle that is common in Information Retrieval research is unsatisfactorily practiced in MIR: the definition of a task in MIR often remains contrived or artificial, the development of solutions often remains artificial, the evaluation of solutions is often unsatisfactory, the interpretation of evaluation results is often biased and not illuminating, the lack of valid results leads to improving solutions 'blindly', and returning to the beginning of the cycle rarely leads to a qualitative improvement of research. Deeper system analysis addresses these problems.

4.4 Acknowledge limitations and proceed with skepticism

The fourth point to come from our work here is that the limitations of experiments should be clearly acknowledged. Of her experiments with music discriminating fish, Chase (2001) is not terse when mentioning their limitations:

Even a convincing demonstration of categorization can fail to identify the stimulus features that exert control at any given time ... In particular, there can be uncertainty as to whether classification behavior had been under the stimulus control of the features in terms of which the experimenter had defined the categories or whether the subjects had discovered an effective discriminant of which the experimenter was unaware. ... [A] constant concern is the possible existence of a simple attribute that would have allowed the subjects merely to discriminate instead of categorizing. (emphasis ours)

Chase thereby recognizes that, with the exception of timbre (experiment 3), her experiments do not validly answer questions about what the fish are using in order to make their decisions, or even whether those criteria are related in any way to how she categorized the music in the first place. In fact, her experiments were not designed to answer such a question.

This same limitation exists for all MIR experiments that use Classify (Sturm, 2013a, 2013c, 2013b, 2013g). That an MIR system is able to perfectly reproduce the 'ground truth' labels of a testing dataset gives no reason to believe the system is using criteria relevant to the meaning of those labels. The existence of many independent variables in the complex signals of datasets like GTZAN, and the lack of their control in any evaluation using them, severely limits what one can say from the results. Even if one is to be conservative and restrict the conclusion to only the music excerpts in GTZAN, or in ISMIR2004, we have shown (Sturm, 2013g) that does not help validity.24

Other limitations come from the measurement model specified by the experimental design. Any estimate of the responses from the measurements in an experiment (Sturm, 2013c) must also be accompanied by its quality (confidence interval), and the conditions under which it holds, e.g. 'simple textbook model' (Bailey, 2008). (This is not the variance of the measurements!) With random-, fixed- and mixed-effects models, these estimates generally have wider confidence intervals than with the standard textbook model, i.e. the estimates have more uncertainty (Bailey, 2008). In the field of bioinformatics, an evaluation that does not give the confidence interval and conditions under which it holds has been called 'scientifically vacuous' (Dougherty & Dalton, 2013).

It is of course beyond the scope of this paper whether or not an artificial system can learn from the few examples in GTZAN the meaning behind the various tags shown in Figure 5; but, that SRCAM is able to achieve a classification accuracy of over 0.75 from only 50 training dataset feature vectors of 768 dimensions in each of 10 classes, is quite impressive – almost miraculous. To then claim, 'SRCAM is recognizing music genre', is as remarkable a claim as, 'Hans the horse can add'. Any artificial system purported to be capable of, e.g. identifying thrilling Disco music with male lead vocals that is useful for driving, or romantic Pop music with positive feelings and backing vocals that is useful for getting ready to go out, should be approached with as much skepticism as Pfungst (1911) approached Hans. 'An extraordinary claim requires extraordinary proof' (Truzzi, 1978).

4.5 Make reproducible work reproducible

The final implication of our work here is in the reproducibility of experiments. One of the hallmarks of modern science is the reproduction of the results of an experiment by an independent party (with no vested interests). The six '×' in Figure 2 come from our attempts at reproducing the work in several publications. In Sturm and Noorzad (2012), we detail our troubles in reproducing the work of Panagakis et al. (2009b). Through our collaboration with those authors, they found their results in that paper were due to a mistake in the experimental procedure. This also affects other published results (Panagakis & Kotropoulos, 2010; Panagakis, Kotropoulos, & Arce, 2009a). Our trouble in reproducing the results of Marques, Guiherme, Nakamura and Papa (2011a) led to the discovery that the authors had accidentally used the training dataset as the testing dataset.25 Our trouble in reproducing the results of Bagci and Erzin (2007) led us to analyse the algorithm (Sturm & Gouyon, 2013), which reveals that their results could not have been due to the proper operation of their system. Finally, we have found the approach and results of Chang, Jang and Iliopoulos (2010) are irreproducible, and they are contradicted by two well-established principles (Sturm, 2013f). In only one of these cases (Marques et al., 2011a) did we find the code to be available and accessible enough for easily recreating the published experiments. For the others, we had to make many assumptions and decisions from the beginning of our reimplementations, sometimes in consultation with the authors, to ultimately find that the description in the paper does not match what was produced, or that the published results are in error.

24 What conclusion is valid in this case has yet to be determined.
25 Personal communication with J.P. Papa.

Making work reproducible can not only increase citations, but also greatly aid peer review. For instance, the negative results in our work (Sturm & Gouyon, 2013) were greeted by skepticism in peer review; but, because we made entirely reproducible the results of our experiments and the figures in our paper, the reviewers were able to easily verify our analysis, and explore the problem themselves. The results of much of our work, and the software to produce them, are available online for anyone else to run and modify.26 The MIReS Roadmap (Serra et al., 2013) also highlights the need for reproducibility in MIR.

Among other things, 'reproducible research' (Vandewalle, Kovacevic, & Vetterli, 2009) is motivated by the requirements of modern science, and by the algorithm-centric nature of disciplines like signal processing. Vandewalle et al. (2009) define a piece of research as 'reproducible' if 'all information relevant to the work, including, but not limited to, text, data and code, is made available, such that an independent researcher can reproduce the results'. This obviously places an extraordinary burden on a researcher, and is impossible when data is privately owned; but when work can be reproduced, then it should be reproducible. Great progress is being achieved by projects such as SoundSoftware.27 Vandewalle et al. (2009) provide an excellent guide of how to make research reproducible.

5. Conclusions

Our 'state of the art' here takes a considerably different approach than the previous four reviews of MGR (Aucouturier & Pachet, 2003; Dannenberg, 2010; Fu et al., 2011; Scaringella et al., 2006). It aims not to summarize the variety of features and machine learning approaches used in MGR systems or the new datasets available and how 'ground truths' are generated. It is not concerned with the validity, well-posedness, value, usefulness, or applicability of MGR; or whether MGR is 'replaced by', or used in the service of, e.g. music similarity, autotagging, or the like, which are comprehensively addressed in other works (e.g. Craft, 2007; Craft, Wiggins, & Crawford, 2007; McKay & Fujinaga, 2006; Sturm, 2012c, 2012b, 2013a, 2013b, 2013g; Wiggins, 2009). It is not concerned with how, or even whether it is possible, to create 'faultless' datasets for MGR, music similarity, autotagging, and the like. Instead, this 'state of the art' is concerned with evaluations performed in the past ten years of work in MGR, and its implications for the next ten years of work in MIR. It aims to look closely at whether the results of Figure 2 imply any progress has been made in solving a problem.

26 http://imi.aau.dk/~bst/software/index.html
27 http://soundsoftware.ac.uk

If one is to look at Figure 2 as a 'map' showing which pairing of audio features with machine learning algorithms is successful for MGR, or which is better than another, or whether progress has been made in charting an unknown land, then one must defend such conclusions as being valid with these evaluations – which are certainly rigorous, systematic, and standardized evaluations, but that may not address the scientific question of interest. One simply cannot take for granted that a systematic, standardized and rigorous evaluation design using a benchmark dataset produces scientific results, let alone results that reflect anything to do with music intelligence. Questions of music intelligence cannot be addressed with validity by an experimental design that counts the number of matches between labels output by a system and the 'ground truth' of a testing dataset (Classify), and that does not account for the numerous independent variables in a testing dataset having flawed integrity. Only a few works have sought to determine whether the decisions made by an MGR system come from anything at all relevant to genre (Sturm, 2012c, 2013a), e.g. musicological dimensions such as instrumentation, rhythm (Dixon, Gouyon, & Widmer, 2004), harmony (Anglade, Ramirez, & Dixon, 2009), compositional form, and lyrics (Mayer, Neumayer, & Rauber, 2008). Metaphorically, whereas only a few have been using a tape measure, a level and a theodolite to survey the land, the points of Figure 2 have been found using microscopes and weighing scales – certainly scientific instruments, but irrelevant to answering the questions of interest. It may be a culmination of a decade of work, but Figure 2 is no map of reality.

These problems do not just exist for research in MGR, but are also seen in research on music emotion recognition (Sturm, 2013b, 2013c) and autotagging (Gouyon et al., 2013; Marques, Domingues, Langlois, & Gouyon, 2011b). These kinds of tasks constitute more than half of the evaluation tasks in MIREX since 2005, and are, in essence, representative of the IR in MIR. How, then, has this happened, and why does it continue? First, as we have shown for GTZAN (Sturm, 2012a, 2013d), many researchers assume a dataset is a good dataset because many others use it. Second, as we have shown for Classify (Sturm, 2013b, 2013c), many researchers assume evaluation that is standard in machine learning or information retrieval, or that is implemented systematically and rigorously, is thus relevant for MIR and is scientific. Third, researchers fail to define proper use cases, so problems and success criteria remain by and large ill-defined – or, as in the case of GTZAN, defined by and large by the artificial construction of a dataset. Fourth, the foundation of evaluation in MIR remains obscure because it lacks an explicit formalism in its design and analysis of systems and experiments (Sturm, 2013c; Urbano et al., 2013). Hence, evaluations in all MIR are accompanied by many assumptions that remain implicit, but which directly affect the conclusions, relevance, and validity of experiments. Because these assumptions remain implicit, researchers fail to acknowledge the limitations of experiments, and are persuaded that their solutions are actually addressing the problem, or that their evaluation is measuring success – the Clever Hans Effect (Pfungst, 1911; Sturm, 2013g). While we are not claiming that all evaluation in MIR lacks validity, we do argue that there is a ‘crisis of evaluation’ in MIR more severe than what is reflected by the MIReS Roadmap (Serra et al., 2013).

The situation in MIR might be reminiscent of computer science in the late 1970s. In an editorial from 1979 (McCracken, Denning, & Brandin, 1979), the President, Vice-President and Secretary of the Association for Computing Machinery acknowledge that ‘experimental computer science’ (to distinguish it from theoretical computer science) is in a crisis, in part because there is ‘a severe shortage of computer scientists engaged in, or qualified for, experimental research in computer science’. This led practitioners of the field to define more formally the aims of experimental computer science, how they relate to the scientific pursuit of knowledge (Denning, 1980), and to find model examples of scientific practice in their field (Denning, 1981). Six years later, Basili, Selby and Hutchens (1986) contributed a review of the fundamentals of the design and analysis of experiments, how they have been applied to and benefited software engineering, and suggestions for continued improvement. Almost a decade after this, however, Fenton, Pfleeger and Glass (1994) argue ‘there are far too few examples of moderately effective research in software engineering ... much of what we believe about which approaches are best is based on anecdotes, gut feelings, expert opinions, and flawed research, not on careful, rigorous software-engineering experimentation’. This is not a surprise to Fenton, however, since ‘topics like experimental design, statistical analysis, and measurement principles’ remain neglected in contemporary computer science education (Fenton et al., 1994). The conversation is still continuing (Feitelson, 2006).

The amelioration of such problems, however, may be hindered by the very ways in which research is practiced and valued. In his comment to Hand (1994), Donald Preece makes an astute but painful observation: ‘[Hand speaks on] the questions that the researcher wishes to consider ...: (a) how do I obtain a statistically significant result?; (b) how do I get my paper published?; (c) when will I be promoted?’ It thus might not be that good scientific practice is the goal. There are no shortcuts to the design, implementation, and analysis of a valid experiment. It requires hard work, creativity, and intellectual postures that can be uncomfortable. Good examples abound, however. Chase (2001) shows the time and effort necessary to scientifically and efficiently answer meaningful questions, and exemplifies the kinds of considerations that are taken for granted in disciplines where thousands of experiments can be run in minutes on machines that do not need positive reinforcement. Just as Pfungst (1911) does with Hans, Chase scientifically tests whether the fish Oro, Beauty and Pepi are actually discriminating styles of music. Neither Pfungst nor Chase gives their subjects a free pass on accountability. Why, then, should it be any different for work that considers artificial systems? The conversation is sure to continue.


Acknowledgements

Thanks to: Fabien Gouyon, Nick Collins, Arthur Flexer, Mark Plumbley, Geraint Wiggins, Mark Levy, Roger Dean, Julián Urbano, Alan Marsden, Lars Kai Hansen, Jan Larsen, Mads G. Christensen, Sergios Theodoridis, Aggelos Pikrakis, Dan Stowell, Rémi Gribonval, Geoffrey Peeters, Diemo Schwarz, Roger Dannenberg, Bernard Mont-Reynaud, Gaël Richard, Rolf Bardeli, Jort Gemmeke, Curtis Roads, Stephen Pope, Yi-Hsuan Yang, George Tzanetakis, Constantine Kotropoulos, Yannis Panagakis, Ulas Bagci, Engin Erzin, and João Paulo Papa for illuminating discussions about these topics (which does not mean any endorse the ideas herein). Mads G. Christensen, Nick Collins, Cynthia Liem, and Clemens Hage helped identify several excerpts in GTZAN, and my wife Carla Sturm endured my repeated listening to all of its excerpts. Thanks to the many, many associate editors and anonymous reviewers for the comments that helped move this work closer and closer to being publishable.

Funding

This work was supported by Independent Postdoc Grant 11-105218 from Det Frie Forskningsråd.

References

Ammer, C. (2004). Dictionary of music (4th ed.). New York: The Facts on File Inc.
Andén, J., & Mallat, S. (2011). Multiscale scattering for audio classification. In A. Klapuri & C. Leider (Eds.), Proceedings of ISMIR (pp. 657–662). Miami, FL: University of Miami.
Andén, J., & Mallat, S. (2013). Deep scattering spectrum. arXiv:1304.6763v2 [cs.SD]. Retrieved from http://arxiv.org/abs/1304.6763
Anglade, A., Ramirez, R., & Dixon, S. (2009). Genre classification using harmony rules induced from automatic chord transcriptions. In Proceedings of ISMIR (pp. 669–674). Canada: International Society for Music Information Retrieval.
Aucouturier, J.-J. (2009). Sounds like teen spirit: Computational insights into the grounding of everyday musical terms. In J. Minett & W. Wang (Eds.), Language, evolution and the brain: Frontiers in linguistics series. Taipei: Academia Sinica Press.
Aucouturier, J.-J., & Bigand, E. (2013). Seven problems that keep MIR from attracting the interest of cognition and neuroscience. Journal of Intelligent Information Systems, 41(3), 483–497.
Aucouturier, J.-J., & Pachet, F. (2003). Representing music genre: A state of the art. Journal of New Music Research, 32(1), 83–93.
Aucouturier, J.-J., & Pampalk, E. (2008). Introduction - from genres to tags: A little epistemology of music information retrieval research. Journal of New Music Research, 37(2), 87–92.
Bailey, R.A. (2008). Design of comparative experiments. Cambridge: Cambridge University Press.
Barbedo, J.G.A., & Lopes, A. (2007). Automatic genre classification of musical signals. EURASIP Journal on Advances in Signal Processing, 2007, 064960.
Barreira, L., Cavaco, S., & da Silva, J. (2011). Unsupervised music genre classification with a model-based approach. In L. Antunes & H.S. Pinto (Eds.), Progress in artificial intelligence: 15th Portuguese Conference on Artificial Intelligence, EPIA 2011, Lisbon, Portugal, October 10–13, 2011. Proceedings (Lecture Notes in Computer Science, Vol. 7026, pp. 268–281). Berlin/Heidelberg: Springer.
Barrington, L., Yazdani, M., Turnbull, D., & Lanckriet, G.R.G. (2008). Combining feature kernels for semantic music retrieval. In Proceedings of ISMIR (pp. 614–619).
Basili, V., Selby, R., & Hutchens, D. (1986). Experimentation in software engineering. IEEE Transactions on Software Engineering, 12(7), 733–743.
Bagci, U., & Erzin, E. (2007). Automatic classification of musical genres using inter-genre similarity. IEEE Signal Processing Letters, 14(8), 521–524.
Bergstra, J., Casagrande, N., Erhan, D., Eck, D., & Kégl, B. (2006). Aggregate features and AdaBoost for music classification. Machine Learning, 65(2–3), 473–484.
Bertin-Mahieux, T., Eck, D., Maillet, F., & Lamere, P. (2008). Autotagger: A model for predicting social tags from acoustic features on large music databases. Journal of New Music Research, 37(2), 115–135.
Bertin-Mahieux, T., Eck, D., & Mandel, M. (2010). Automatic tagging of audio: The state-of-the-art. In W. Wang (Ed.), Machine audition: Principles, algorithms and systems. Hershey, PA: IGI Publishing.
Bertin-Mahieux, T., Ellis, D.P., Whitman, B., & Lamere, P. (2011). The million song dataset. In Proceedings of ISMIR (pp. 591–596). Canada: International Society for Music Information Retrieval.
Bowker, G.C., & Star, S.L. (1999). Sorting things out: Classification and its consequences. Cambridge, MA: MIT Press.
Chang, K., Jang, J.-S.R., & Iliopoulos, C.S. (2010). Music genre classification via compressive sampling. In Proceedings of ISMIR (pp. 387–392). Canada: International Society for Music Information Retrieval.
Chase, A. (2001). Music discriminations by carp ‘Cyprinus carpio’. Animal Learning & Behavior, 29(4), 336–353.
Craft, A. (2007). The role of culture in the music genre classification task: Human behaviour and its effect on methodology and evaluation (Technical report). London: Queen Mary University of London.
Craft, A., Wiggins, G.A., & Crawford, T. (2007). How many beans make five? The consensus problem in music-genre classification and a new evaluation method for single-genre categorisation systems. In Proceedings of ISMIR (pp. 73–76). Vienna: Austrian Computer Society (OCG).
Cunningham, S.J., Bainbridge, D., & Downie, J.S. (2012). The impact of MIREX on scholarly research (2005–2010). In Proceedings of ISMIR (pp. 259–264). Canada: International Society for Music Information Retrieval.
Dannenberg, R.B. (2010). Style in music. In S. Argamon, K. Burns & S. Dubnov (Eds.), The structure of style (pp. 45–57). Berlin/Heidelberg: Springer.


Dannenberg, R.B., Thom, B., & Watson, D. (1997). A machine learning approach to musical style recognition. In Proceedings of ICMC (pp. 344–347). Ann Arbor, MI: Michigan Publishing.
Denning, P.J. (1980). What is experimental computer science? Communications of the ACM, 23(10), 543–544.
Denning, P.J. (1981). Performance analysis: Experimental computer science at its best. Communications of the ACM, 24(11), 725–727.
Dixon, S., Gouyon, F., & Widmer, G. (2004). Towards characterisation of music via rhythmic patterns. In Proceedings of ISMIR (pp. 509–517). Barcelona: Universitat Pompeu Fabra.
Dougherty, E.R., & Dalton, L.A. (2013). Scientific knowledge is possible with small-sample classification. EURASIP Journal on Bioinformatics and Systems Biology, 2013, 10.
Downie, J., Ehmann, A., Bay, M., & Jones, M. (2010). The music information retrieval evaluation exchange: Some observations and insights. In Z. Ras & A. Wieczorkowska (Eds.), Advances in music information retrieval (pp. 93–115). Berlin/Heidelberg: Springer.
Downie, J.S. (2003). Toward the scientific evaluation of music information retrieval systems. In Proceedings of ISMIR (pp. 25–32). Baltimore, MD: The Johns Hopkins University.
Downie, J.S. (2004). The scientific evaluation of music information retrieval systems: Foundations and future. Computer Music Journal, 28(2), 12–23.
Downie, J.S. (2008). The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4), 247–255.
Duin, R.P.W., Juszczak, P., de Ridder, D., Paclik, P., Pekalska, E., & Tax, D.M.J. (2007). PR-Tools 4.1, a MATLAB toolbox for pattern recognition. Retrieved from http://prtools.org
Fabbri, F. (1980, June). A theory of musical genres: Two applications. Paper presented at the First International Conference on Popular Music Studies, Amsterdam.
Fabbri, F. (1999). Browsing musical spaces. Paper presented at the International Association for the Study of Popular Music (UK). Available at http://www.tagg.org/others/ffabbri9907.html
Feitelson, D.G. (2006). Experimental computer science: The need for a cultural change (Technical report). Jerusalem: The Hebrew University of Jerusalem.
Fenton, N., Pfleeger, S.L., & Glass, R.L. (1994). Science and substance: A challenge to software engineers. IEEE Software, 11(4), 86–95.
Flexer, A. (2006). Statistical evaluation of music information retrieval experiments. Journal of New Music Research, 35(2), 113–120.
Flexer, A. (2007). A closer look on artist filters for musical genre classification. In Proceedings of ISMIR (pp. 341–344). Vienna: Austrian Computer Society (OCG).
Flexer, A., & Schnitzer, D. (2009). Album and artist effects for audio similarity at the scale of the web. In Proceedings of SMC 2009, July 23–25, Porto, Portugal (pp. 59–64).
Flexer, A., & Schnitzer, D. (2010). Effects of album and artist filters in audio similarity computed for very large music databases. Computer Music Journal, 34(3), 20–28.
Frow, J. (2005). Genre. New York: Routledge.
Fu, Z., Lu, G., Ting, K.M., & Zhang, D. (2011). A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia, 13(2), 303–319.
Gjerdingen, R.O., & Perrott, D. (2008). Scanning the dial: The rapid recognition of music genres. Journal of New Music Research, 37(2), 93–100.
Gouyon, F., Sturm, B.L., Oliveira, J.L., Hespanhol, N., & Langlois, T. (2013). On evaluation in music autotagging research (submitted).
Guaus, E. (2009). Audio content processing for automatic music genre classification: Descriptors, databases, and classifiers (PhD thesis). Universitat Pompeu Fabra, Barcelona, Spain.
Hand, D.J. (1994). Deconstructing statistical questions. Journal of the Royal Statistical Society A (Statistics in Society), 157(3), 317–356.
Hartmann, M.A. (2011). Testing a spectral-based feature set for audio genre classification (Master’s thesis). University of Jyväskylä, Jyväskylä, Finland.
Humphrey, E.J., Bello, J.P., & LeCun, Y. (2013). Feature learning and deep architectures: New directions for music informatics. Journal of Intelligent Information Systems, 41(3), 461–481.
Kemp, C. (2004). Towards a holistic interpretation of musical genre classification (Master’s thesis). University of Jyväskylä, Jyväskylä, Finland.
Kimball, A.W. (1957). Errors of the third kind in statistical consulting. Journal of the American Statistical Association, 52(278), 133–142.
Law, E. (2011). Human computation for music classification. In T. Li, M. Ogihara & G. Tzanetakis (Eds.), Music data mining (pp. 281–301). New York: CRC Press.
Levy, M., & Sandler, M. (2009). Music information retrieval using social tags and audio. IEEE Transactions on Multimedia, 11(3), 383–395.
Li, M., & Sleep, R. (2005). Genre classification via an LZ78-based string kernel. In Proceedings of ISMIR (pp. 252–259). London: Queen Mary, University of London.
Li, T., & Chan, A. (2011). Genre classification and the invariance of MFCC features to key and tempo. In K.-T. Lee, W.-H. Tsai, H.-Y.M. Liao, T. Chen, J.-W. Hsieh, & C.-C. Tseng (Eds.), Advances in multimedia modeling: 17th International Multimedia Modeling Conference, MMM 2011, Taipei, Taiwan, January 5–7, 2011, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 6523, pp. 317–327). Berlin/Heidelberg: Springer.
Li, T., & Ogihara, M. (2006). Toward intelligent music information retrieval. IEEE Transactions on Multimedia, 8(3), 564–574.
Lidy, T. (2006). Evaluation of new audio features and their utilization in novel music retrieval applications (Master’s thesis). Vienna University of Technology, Vienna, Austria.
Lidy, T., & Rauber, A. (2005). Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In Proceedings of ISMIR (pp. 34–41). London: Queen Mary, University of London.
Lippens, S., Martens, J., & De Mulder, T. (2004). A comparison of human and automatic musical genre classification. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP ’04) (Vol. 4, pp. 233–236). Piscataway, NJ: IEEE.

Mandel, M., & Ellis, D.P.W. (2005). Song-level features and support vector machines for music classification. In Proceedings of the International Symposium on Music Information Retrieval (pp. 594–599). London: Queen Mary, University of London.
Markov, K., & Matsui, T. (2012a). Music genre classification using self-taught learning via sparse coding. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1929–1932). Piscataway, NJ: IEEE.
Markov, K., & Matsui, T. (2012b). Nonnegative matrix factorization based self-taught learning with application to music genre classification. In Proceedings of the IEEE International Workshop on Machine Learning and Signal Processing (pp. 1–5). Piscataway, NJ: IEEE.
Marques, C., Guilherme, I.R., Nakamura, R.Y.M., & Papa, J.P. (2011a). New trends in musical genre classification using optimum-path forest. In Proceedings of ISMIR (pp. 699–704). Canada: International Society for Music Information Retrieval.
Marques, G., Domingues, M., Langlois, T., & Gouyon, F. (2011b). Three current issues in music autotagging. In Proceedings of ISMIR (pp. 795–800). Canada: International Society for Music Information Retrieval.
Matityaho, B., & Furst, M. (1995). Neural network based model for classification of music type. In Eighteenth Convention of Electrical and Electronics Engineers in Israel (pp. 1–5). Piscataway, NJ: IEEE.
Mayer, R., Neumayer, R., & Rauber, A. (2008). Rhyme and style features for musical genre classification by song lyrics. In Proceedings of ISMIR (pp. 337–342). Canada: International Society for Music Information Retrieval.
McCracken, D.D., Denning, P.J., & Brandin, D.H. (1979). An ACM executive committee position on the crisis in experimental computer science. Communications of the ACM, 22(9), 503–504.
McKay, C., & Fujinaga, I. (2006). Music genre classification: Is it worth pursuing and how can it be improved? In Proceedings of ISMIR (pp. 101–106). Victoria, BC: University of Victoria.
McLuhan, M. (1964). Understanding media: The extensions of man (94th ed.). Cambridge, MA: MIT Press.
Moerchen, F., Mierswa, I., & Ultsch, A. (2006). Understandable models of music collections based on exhaustive feature generation with temporal statistics. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (pp. 882–891). New York: ACM.
Noorzad, P., & Sturm, B.L. (2012). Regression with sparse approximations of data. In Proceedings of EUSIPCO (pp. 674–678). Piscataway, NJ: IEEE.
Pachet, F., & Cazaly, D. (2000). A taxonomy of musical genres. In Proceedings of the Content-based Multimedia Information Access Conference, Paris, France (pp. 1238–1245).
Pampalk, E. (2006). Computational models of music similarity and their application in music information retrieval (PhD thesis). Vienna University of Technology, Vienna, Austria.
Pampalk, E., Flexer, A., & Widmer, G. (2005). Improvements of audio-based music similarity and genre classification. In Proceedings of ISMIR (pp. 628–633). London: Queen Mary, University of London.
Panagakis, Y., & Kotropoulos, C. (2010). Music genre classification via topology preserving non-negative tensor factorization and sparse representations. In Proceedings of ICASSP (pp. 249–252). Piscataway, NJ: IEEE.
Panagakis, Y., Kotropoulos, C., & Arce, G.R. (2009a). Music genre classification using locality preserving non-negative tensor factorization and sparse representations. In Proceedings of ISMIR (pp. 249–254). Canada: International Society for Music Information Retrieval.
Panagakis, Y., Kotropoulos, C., & Arce, G.R. (2009b). Music genre classification via sparse representations of auditory temporal modulations. In Proceedings of EUSIPCO. New York: Curran Associates Inc.
Park, S., Park, J., & Sim, K. (2011). Optimization system of musical expression for the music genre classification. In Proceedings of the 11th International Conference on Control, Automation and Systems (ICCAS) (pp. 1644–1648). Piscataway, NJ: IEEE.
Pfungst, O. (1911). Clever Hans (The horse of Mr. Von Osten): A contribution to experimental animal and human psychology. New York: Henry Holt.
Porter, D., & Neuringer, A. (1984). Music discriminations by pigeons. Experimental Psychology: Animal Behavior Processes, 10(2), 138–148.
Salzberg, S.L. (1997). On comparing classifiers: Pitfalls to avoid and a recommended approach. In Data mining and knowledge discovery (pp. 317–328). Dordrecht: Kluwer Academic Publishers.
Scaringella, N., Zoia, G., & Mlynek, D. (2006). Automatic genre classification of music content: A survey. IEEE Signal Processing Magazine, 23(2), 133–141.
Serra, X., Magas, M., Benetos, E., Chudy, M., Dixon, S., Flexer, A., Gómez, E., Gouyon, F., Herrera, P., Jordà, S., Paytuvi, O., Peeters, G., Schlüter, J., Vinet, H., & Widmer, G. (2013). Roadmap for Music Information ReSearch (Geoffroy Peeters, Ed.). Creative Commons BY-NC-ND 3.0 license.
Seyerlehner, K. (2010). Content-based music recommender systems: Beyond simple frame-level audio similarity (PhD thesis). Johannes Kepler University, Linz, Austria.
Seyerlehner, K., Widmer, G., & Pohle, T. (2010). Fusing block-level features for music similarity estimation. In Proceedings of DAFx (pp. 1–8). Graz, Austria: Institute of Electronic Music and Acoustics.
Shapiro, P. (2005). Turn the beat around: The secret history of disco. London: Faber & Faber.
Slaney, M. (1998). Auditory toolbox (Technical report). Palo Alto, CA: Interval Research Corporation. Retrieved from https://engineering.purdue.edu/~malcolm/interval/1998-010/
Sordo, M., Celma, O., Blech, M., & Guaus, E. (2008). The quest for musical genres: Do the experts and the wisdom of crowds agree? In Proceedings of ISMIR (pp. 255–260).
Sturm, B.L. (2012a). An analysis of the GTZAN music genre dataset. In Proceedings of the ACM MIRUM Workshop (pp. 7–12). New York: ACM.
Sturm, B.L. (2012b). A survey of evaluation in music genre recognition. In Proceedings of Adaptive Multimedia Retrieval.


Sturm, B.L. (2012c). Two systems for automatic music genre recognition: What are they really recognizing? In Proceedings of the ACM MIRUM Workshop (pp. 69–74). New York: ACM.
Sturm, B.L. (2013a). Classification accuracy is not enough: On the evaluation of music genre recognition systems. Journal of Intelligent Information Systems, 41(3), 371–406.
Sturm, B.L. (2013b). Evaluating music emotion recognition: Lessons from music genre recognition? In Proceedings of ICME (pp. 1–6). Piscataway, NJ: IEEE.
Sturm, B.L. (2013c). Formalizing evaluation in music information retrieval: A look at the MIREX automatic mood classification task. In Proceedings of CMMR.
Sturm, B.L. (2013d). The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use. arXiv:1306.1461v2 [cs.SD]. Retrieved from http://arxiv.org/abs/1306.1461
Sturm, B.L. (2013e). Music genre recognition with risk and rejection. In Proceedings of ICME (pp. 1–6). Piscataway, NJ: IEEE.
Sturm, B.L. (2013f). On music genre classification via compressive sampling. In Proceedings of ICME (pp. 1–6). Piscataway, NJ: IEEE.
Sturm, B.L. (2013g). A simple method to determine if a music information retrieval system is a horse (submitted).
Sturm, B.L., & Gouyon, F. (2013). Revisiting inter-genre similarity. IEEE Signal Processing Letters, 20(11), 1050–1053.
Sturm, B.L., & Noorzad, P. (2013). On automatic music genre recognition by sparse representation classification using auditory temporal modulations. In Proceedings of CMMR (pp. 379–394).
Theodoridis, S., & Koutroumbas, K. (2009). Pattern recognition (4th ed.). Amsterdam: Academic Press, Elsevier.
Truzzi, M. (1978). On the extraordinary: An attempt at clarification. Zetetic Scholar, 1, 11–22.
Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293–302.
Urbano, J. (2011). Information retrieval meta-evaluation: Challenges and opportunities in the music domain. In Proceedings of ISMIR (pp. 609–614). Canada: International Society for Music Information Retrieval.
Urbano, J., McFee, B., Downie, J.S., & Schedl, M. (2012). How significant is statistically significant? The case of audio music similarity and retrieval. In Proceedings of ISMIR (pp. 181–186). Canada: International Society for Music Information Retrieval.
Urbano, J., Mónica, M., & Morato, J. (2011). Audio music similarity and retrieval: Evaluation power and stability. In Proceedings of ISMIR (pp. 597–602). Canada: International Society for Music Information Retrieval.
Urbano, J., Schedl, M., & Serra, X. (2013). Evaluation in music information retrieval. Journal of Intelligent Information Systems, 41(3), 345–369.
van den Berg, E., & Friedlander, M.P. (2008). Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2), 890–912.
Vandewalle, P., Kovacevic, J., & Vetterli, M. (2009). Reproducible research in signal processing: What, why, and how. IEEE Signal Processing Magazine, 26(3), 37–47.
Wang, A. (2003). An industrial strength audio search algorithm. In Proceedings of the International Society for Music Information Retrieval. Baltimore, MD: Johns Hopkins University.
Wiggins, G.A. (2009). Semantic gap?? Schemantic schmap!! Methodological considerations in the scientific study of music. In Proceedings of the IEEE International Symposium on Multimedia (pp. 477–482). Piscataway, NJ: IEEE.
Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., & Ma, Y. (2009). Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 210–227.
Xia, G., Tay, J., Dannenberg, R., & Veloso, M. (2012). Autonomous robot dancing driven by beats and emotions of music. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (pp. 205–212). New York: ACM.
