
Chord Recognition with Stacked Denoising Autoencoders

Author: Nikolaas Steenbergen

Supervisors: Prof. Dr. Theo Gevers
Dr. John Ashley Burgoyne

A thesis submitted in fulfilment of the requirements for the degree of Master of Science in Artificial Intelligence

in the

Faculty of Science

July 2014

Page 2: Chord Recognition with Stacked Denoising Autoencoders · PDF fileChord Recognition with Stacked Denoising Autoencoders ... 6.1 Reduction of Chord Vocabulary ... beat detection, symbolic

Abstract

In this thesis I propose two different approaches for chord recognition based on stacked denoising autoencoders working directly on the FFT. These approaches do not use any intermediate targets such as pitch class profiles/chroma vectors or the Tonnetz, in an attempt to remove any restrictions that might be imposed by such an interpretation. It is shown that these systems can significantly outperform a reference system based on state-of-the-art features. The first approach computes chord probabilities directly from an FFT excerpt of the audio data. In the second approach, two additional inputs, filtered with a median filter over different time spans, are added to the input. Hereafter, in both systems, a hidden Markov model is used to perform temporal smoothing after pre-classifying chords. It is shown that using several different temporal resolutions can increase the classification ability in terms of weighted chord symbol recall. All algorithms are tested in depth on the Beatles Isophonics and the Billboard datasets, on a restricted chord vocabulary containing major and minor chords and an extended chord vocabulary containing major, minor, 7th and inverted chord symbols. In addition to presenting the weighted chord symbol recall, a post-hoc Friedman multiple-comparison test for the statistical significance of performance differences is also conducted.


Acknowledgements

I would like to thank Theo Gevers and John Ashley Burgoyne for supervising my thesis. Thanks to Ashley Burgoyne for his helpful, thorough advice and guidance. Thanks to Amogh Gudi for all the fruitful discussions about deep learning techniques while lifting weights and sweating in the gym. Special thanks to my parents, Brigitte and Christiaan Steenbergen, and my brothers Alexander and Florian; without their help, support and love, I would not be where I am now.


Contents

1 Introduction
2 Musical Background
  2.1 Notes and Pitch
  2.2 Chords
  2.3 Other Structures in Music
3 Related Work
  3.1 Preprocessing / Features
    3.1.1 PCP / Chroma Vector Calculation
    3.1.2 Minor Pitch Changes
    3.1.3 Percussive Noise Reduction
    3.1.4 Repeating Patterns
    3.1.5 Harmonic / Enhanced Pitch Class Profile
    3.1.6 Modelling Human Loudness Perception
    3.1.7 Tonnetz / Tonal Centroid
  3.2 Classification
    3.2.1 Template Approaches
    3.2.2 Data-Driven Higher Context Models
4 Stacked Denoising Autoencoders
  4.1 Autoencoders
  4.2 Autoencoders and Denoising
  4.3 Training Multiple Layers
  4.4 Dropout
5 Chord Recognition Systems
  5.1 Comparison System
    5.1.1 Basic Pitch Class Profile Features
    5.1.2 Comparison System: Simplified Harmony Progression Analyzer
    5.1.3 Harmonic Percussive Sound Separation
    5.1.4 Tuning and Loudness-Based PCPs
    5.1.5 HMMs
  5.2 Stacked Denoising Autoencoders for Chord Recognition
    5.2.1 Preprocessing of Features for Stacked Denoising Autoencoders
    5.2.2 Stacked Denoising Autoencoders for Chord Recognition
    5.2.3 Multi-Resolution Input for Stacked Denoising Autoencoders
6 Results
  6.1 Reduction of Chord Vocabulary
  6.2 Score Computation
    6.2.1 Weighted Chord Symbol Recall
  6.3 Training Systems Setup
  6.4 Significance Testing
  6.5 Beatles Dataset
    6.5.1 Restricted Major-Minor Chord Vocabulary
    6.5.2 Extended Chord Vocabulary
  6.6 Billboard Dataset
    6.6.1 Restricted Major-Minor Chord Vocabulary
    6.6.2 Extended Chord Vocabulary
  6.7 Weights
7 Discussion
  7.1 Performance on the Different Datasets
  7.2 SDAE
  7.3 MR-SDAE
  7.4 Weights
  7.5 Extensions
8 Conclusion
A Joint Optimization
  A.1 Basic System Outline
  A.2 Gradient of the Hidden Markov Model
  A.3 Adjusting Neural Network Parameters
  A.4 Updating HMM Parameters
  A.5 Neural Network
  A.6 Hidden Markov Model
  A.7 Combined Training
  A.8 Joint Optimization
  A.9 Joint Optimization: Possible Interpretation

List of Figures

1 Piano keyboard and MIDI note range
2 Conventional autoencoder training
3 Denoising autoencoder training
4 Stacked denoising autoencoder training
5 SDAE for chord recognition
6 MR-SDAE for chord recognition
7 Post-hoc multiple-comparison Friedman tests for the Beatles restricted chord vocabulary
8 Whisker plot for the Beatles restricted chord vocabulary
9 Post-hoc multiple-comparison Friedman tests for the Beatles extended chord vocabulary
10 Whisker plot for the Beatles extended chord vocabulary
11 Post-hoc multiple-comparison Friedman tests for the Billboard restricted chord vocabulary
12 Post-hoc multiple-comparison Friedman tests for the Billboard extended chord vocabulary
13 Visualization of weights of the input layer of the SDAE
14 Plot of sum of absolute values for the input layer of the SDAE
15 Absolute training error for joint optimization
16 Classification performance of joint optimization while training

List of Tables

1 Semitone steps and intervals
2 Intervals and chords
3 WCSR for the Beatles restricted chord vocabulary
4 WCSR for the Beatles extended chord vocabulary
5 WCSR for the Billboard restricted chord vocabulary
6 WCSR for the Billboard extended chord vocabulary
7 Results for chord recognition in MIREX 2013

1 Introduction

The increasing amount of digitized music available online has given rise to a demand for automatic analysis methods. A new subfield of information retrieval has emerged that concerns itself only with music: music information retrieval (MIR). It spans several subcategories, from analyzing features of a music piece (e.g., beat detection, symbolic melody extraction, and audio tempo estimation), to exploring human input methods (like "query by tapping" or "query by singing/humming"), to music clustering and recommendation (like mood detection or cover song identification).

Automatic chord estimation is one of the open challenges in MIR. Chord estimation (or recognition) describes the process of extracting musical chord labels from digitally encoded music pieces: given an audio file, the specific chord symbols and their temporal positions and durations have to be determined automatically.

The main evaluation programme for MIR is the annual "Music Information Retrieval Evaluation eXchange" (MIREX) challenge.[1] It consists of challenges in different sub-tasks of MIR, including chord recognition. Improving one task can often influence the performance of other tasks; e.g., a better beat estimate can improve the localization of chord changes in time, or help the task of querying by tapping. The same is the case for chord recognition: it can improve the performance of cover song identification, in which cover versions of an input song are retrieved, because chord information is a useful if not vital feature for discrimination. Chord progressions also have an influence on the "mood" transmitted through music, so being able to retrieve the chords used in a music piece accurately could also be helpful for mood categorization, e.g., for personalized Internet radios.

Chord recognition is also valuable in itself. It can aid musicologists as well as hobby and professional musicians in transcribing songs. There is a great demand for chord transcriptions of well-known and also lesser-known songs, which manifests itself in the many Internet pages that host manual transcriptions of songs, especially for guitar.[2] Unfortunately, these mostly contain transcriptions of only the most popular songs, often in several different versions of the same song, and they are not guaranteed to be correct. Chord recognition is a difficult task which requires a lot of practice, even for humans.

[1] http://www.music-ir.org/mirex
[2] E.g., Ultimate Guitar: http://www.ultimate-guitar.com/, 911Tabs: http://www.911tabs.com/, guitartabs: http://www.guitaretab.com/

2 Musical Background

In this section I give an overview of important musical terms and concepts used later in this thesis. I first describe how musical notes relate to physical sound waves in section 2.1, then how chords relate to notes in section 2.2, and finally other aspects of music that play a role in automatic chord recognition in section 2.3.

2.1 Notes and Pitch

Pitch describes the perceived frequency of a sound. In Western tonality, pitches are labelled with the letters A to G. The transcription of a musically relevant pitch and its duration is called a note. Pitches can be ordered by frequency, whereby a pitch is said to be higher if its corresponding frequency is higher.

The human auditory system works on a logarithmic scale, which also manifests itself in music: musical pitches are ordered in octaves, repeating the note names, usually denoted in ascending order from C to B: C, D, E, F, G, A, B. We can denote the octave of a pitch with a number added as a subscript to the symbol described previously; a pitch A0 is one octave lower than the corresponding pitch A1. Two pitches one octave apart differ by a factor of two in frequency, and humans typically perceive those two pitches as the same pitch (Shepard, 1964).

In music an octave is split into twelve roughly equal semitones. By definition, the letters C to B are each two semitone steps apart, except for the steps from E to F and from B to C, which are only one semitone apart. To denote the notes in between the named letters, the additional symbols ♯, for a semitone step in the increasing frequency direction, and ♭, for a step in the decreasing direction, are used. For example, we can describe the musically relevant pitch between C and D both as C♯ and as D♭. Because this system only defines the relationships between pitches, we need a reference frequency. In modern Western tonality, the reference frequency of A4 at 440 Hz is the standard (Sikora, 2003); in practice slight deviations from this reference tuning may occur, e.g., due to instrument mistuning. The reference pitch defines the respective frequencies of all other notes implicitly through the octave and semitone relationships. Given a reference pitch, we can compute the corresponding frequencies of all other notes with the following equation:

f_n = 2^{n/12} \cdot f_r, \qquad (1)

where f_n is the frequency of the note n semitone steps from the reference pitch f_r.

The human ear can perceive a frequency range of approximately 20 Hz to 20 000 Hz. In practice this frequency range is not fully used in music. For example the MIDI standard, which is more than sufficient for musical purposes in terms of octave range, covers only notes in semitone steps from C−1, corresponding to about 8.17 Hz, to G9 at 12 543.85 Hz. A standard piano keyboard covers the range from A0 at 27.5 Hz to C8 at 4 186 Hz. Figure 1 depicts a standard piano keyboard in relation to the range of frequencies of MIDI standard notes, with the corresponding physical sound frequencies indicated.
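As a concrete illustration of equation (1), the following Python sketch computes note frequencies relative to A4 = 440 Hz; the function name and the example values are illustrative only, not part of any system described in this thesis.

```python
# Minimal sketch of equation (1): frequency of the note n semitone
# steps away from a reference pitch (A4 = 440 Hz by default).
def note_frequency(n: int, f_ref: float = 440.0) -> float:
    return 2.0 ** (n / 12.0) * f_ref

print(note_frequency(3))    # C5, three semitones above A4: ~523.25 Hz
print(note_frequency(-12))  # A3, one octave below A4: 220.0 Hz
```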


[Figure 1: Piano keyboard and MIDI note range. White keys depict the range of the standard piano, for those notes that are described by letters; black keys deviate a semitone from a note described by a letter. The gray areas depict extensions over the note range of a piano covered by the MIDI standard. The frequency axis runs from C−1 (≈8 Hz) through A4 (440 Hz) to G9 (≈12 544 Hz), spanning MIDI notes 0–127 and piano keys 1–88.]


2.2 Chords

For the purpose of this thesis we define a chord as three or more notes played simultaneously. The distance in frequency between two notes is called an interval; in a musical context we can describe an interval as the number of semitone steps two notes are apart (Sikora, 2003). A chord consists of a root note, usually the lowest note in terms of frequency, and the interval relationships of the other notes played at the same time define the chord type. Thus a chord can be defined as a root note and a type. In the following we use the notation <root-note>:<chord-type>, proposed by Harte (2010). We refer to the notes in musical intervals, in order of ascending frequency, as root note, third, fifth and, if there is a fourth note, seventh. Table 1 shows the intervals for the chords considered in this thesis and the semitone distances for those intervals. The root note and fifth have fixed intervals. For the seventh and the third, we differentiate between major and minor intervals, differing by one semitone step.

For this thesis we restrict ourselves to two different chord vocabularies to be recognized, the first one containing only major and minor chord types. Both major and minor chords consist of three notes: the root note, the third and the fifth. The interval between root note and third distinguishes major and minor chord types (see tables 1 and 2): a major chord contains a major third, while a minor chord contains a minor third. We distinguish twelve root notes for each chord type, for a total of 24 possible chords.

Burgoyne et al. (2011) propose a dataset which contains songs from the Billboard charts from the 1950s through the 1990s. The major-minor chord vocabulary accounts for 65% of the chords in this dataset. We can extend this chord vocabulary to take into account 83% of the chord types in the Billboard dataset by including variants of the seventh chords, which add an optional fourth note to a chord. Hereby, in addition to simple major and minor chords, we add 7th, major 7th and minor 7th chord types to our chord-type vocabulary. Major 7th and minor 7th chords are essentially major and minor chords in which the added fourth note has the interval major seventh and minor seventh, respectively.

In addition to different chord types, it is possible to change the frequency order of the notes by "pulling" one note below the root note in terms of frequency. This is called chord inversion. Thus our extended chord vocabulary containing major, minor, 7th, major 7th and minor 7th also contains all possible inversions. We denote this through an additional identifier in our chord syntax, <root-note>:<chord-type>/<inversion-identifier>, where the inversion identifier can be 3, 5 or 7, naming the interval played below the root note. For example, E:maj7/7 is a major 7th chord, consisting of the root note E,

interval       semitone steps
root note      0
minor third    3
major third    4
fifth          7
minor seventh  10
major seventh  11

Table 1: Semitone steps and intervals.


chord type  intervals
major       1, 3, 5
minor       1, ♭3, 5
7           1, 3, 5, ♭7
major7      1, 3, 5, 7
minor7      1, ♭3, 5, ♭7

Table 2: Intervals and chords. The root note is denoted as 1, the third as 3, the fifth as 5 and the seventh as 7; ♭ denotes the minor (lowered) variant of an interval.

a major third, fifth and major seventh, where the major seventh is played below the root note in terms of frequency.

It is possible, however, that in parts of a song no instruments, or only non-harmonic instruments (e.g., percussion), are playing. To be able to interpret this case we define an additional non-chord symbol. This adds one symbol to the 24 chord symbols of the restricted chord vocabulary, leaving us with 25 different symbols. The extended chord vocabulary contains the major, minor, 7th, major 7th and minor 7th chord types (depicted in table 2) and all possible inversions. For each root note, this leaves us with three different chord symbols each for major and minor (root position plus two inversions) and four different chord symbols for each of the extended chord types (root position plus three inversions), thus 216 different symbols, plus the additional non-chord symbol.
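To make the counting concrete, the following Python sketch enumerates the extended vocabulary in the chord syntax introduced above; the root-note spellings and the symbol "N" for the non-chord label are illustrative choices, not notation fixed by this thesis.

```python
# Sketch: enumerating the extended chord vocabulary.
# 12 roots x (2 triad types x 3 positions + 3 seventh types x 4 positions)
# = 216 symbols, plus the non-chord symbol -> 217.
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
CHORD_TYPES = {
    "maj":  ["", "/3", "/5"],          # root position + two inversions
    "min":  ["", "/3", "/5"],
    "7":    ["", "/3", "/5", "/7"],    # root position + three inversions
    "maj7": ["", "/3", "/5", "/7"],
    "min7": ["", "/3", "/5", "/7"],
}

vocabulary = ["N"]  # illustrative non-chord symbol
for root in ROOTS:
    for ctype, inversions in CHORD_TYPES.items():
        vocabulary += [f"{root}:{ctype}{inv}" for inv in inversions]

print(len(vocabulary))  # 217
```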

Furthermore, we assume that chords cannot overlap. This is not strictly true, for example due to reverb or multiple instruments playing chords, but in practice this overlap is negligible and reverb is often short. Thus we regard a chord as a continuous entity with a designated start point, end point and chord symbol (either consisting of root note, chord type and inversion, or the non-chord symbol).

2.3 Other Structures in Music

A musical piece has several other components, some contributing additional harmonic content, for example vocals, which might also carry a linguistically interpretable message. Since a music piece has an overall harmonic structure and an inherent set of music-theoretic harmonic rules, this information also influences the chords played at any time and vice versa, but does not necessarily contribute directly to the chord played.

The duration and the start and end points in time of a chord are influenced by rhythmic instruments, such as percussion. These do not contribute to the harmonic content of a music piece but are nonetheless interdependent with the other instruments in terms of timing, and thus with the beginning and end of a chord.

These additional harmonic and non-harmonic components occupy the same frequency range as the components that directly contribute to the chord played. From this viewpoint, if we do not explicitly take these additional components into account, we face the additional task of filtering out the "noise" they introduce, on top of the task of recognizing the chords themselves.


3 Related Work

Most musical chord estimation methods can broadly be divided into two sub-processes: preprocessing of features from wave-file data, and higher-level classification of those features into chords.

I first describe in section 3.1 the preprocessing steps applied to the raw waveform data, as well as extensions and refinements of the computation steps to take more properties of musical waveform data into account. An overview of higher-level classification, organized by the methods applied, is given in section 3.2. These approaches differ not only in the methods per se, but also in what kind of musical context they take into account for the final classification; more recent methods take more musical context into account and seem to perform better. Since the methods proposed in this thesis are based on machine learning, I have decided to organize the description of other higher-level classification approaches from a technical rather than a music-theoretic perspective.

3.1 Preprocessing / Features

The most common preprocessing step for feature extraction from waveform data is the computation of so-called pitch class profiles (PCPs), a human-perception-based concept coined by Shepard (1964), who conducted a perceptual study in which he found that humans perceive notes in octave relation as equivalent. A similar representation can be computed from waveform data for chord recognition: a PCP in a music-computational sense is a representation of the frequency spectrum wrapped into one musical octave, i.e., an aggregated 12-dimensional vector of the energy of the respective input frequencies. This is often called a chroma vector, and a sequence of chroma vectors over time is called a chromagram. The terms PCP and chroma vector are used interchangeably in the chord recognition literature. It should be noted, however, that only the physical sound energy is aggregated: this is not purely music-harmonic information, so the chromagram may contain additional non-harmonic noise, such as drums, harmonic overtones and transient noise.

In the following I will give an overview of the basics of calculating the chroma vector and of different extensions proposed to improve the quality of these features.

3.1.1 PCP / Chroma Vector Calculation

In order to compute a chroma vector, the input signal is broken into frames and converted to the frequency domain, most often through a discrete Fourier transform (DFT), using a window function to reduce spectral leakage. Harris (1978) compares 23 different window functions and finds that performance depends very much on the properties of the data. Since musical data is not homogeneous, there is no single best-performing window function. Different window functions have been used in the literature, and often the specific window function is not stated. Khadkevich and Omologo (2009a) compare the performance impact of using Hanning, Hamming and Blackman window functions on musical waveform data in the chord estimation domain. They state that the results are very similar for these three types. However, the Hamming window performed slightly better for window lengths of


1024 and 2048 samples (for a sampling rate of 11 025 Hz), which are the most common lengths in automatic chord recognition systems today.

To convert from the Fourier domain to a chroma vector, two different methods are used. Wakefield (1999) sums the energies of the frequencies in Fourier space closest to the pitch of a chroma vector bin (and its octave multiples), a discrete mapping from the spectral frequency domain to the corresponding chroma vector bin that converts the input directly to a chroma vector. Brown (1991) developed the so-called constant-Q transform, which uses a kernel matrix multiplication to convert the DFT spectrogram into a logarithmic frequency space in which each bin corresponds to the frequency of a musical note. After conversion into the logarithmic frequency domain, we can simply sum up the respective bins to obtain the chroma vector representation. For both methods, the aggregated sound energy in the chroma vector is usually normalized, either to sum to one or with respect to the maximum energy in a single bin. Both methods lead to similar results and are used in the current literature.
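As an illustration of the discrete mapping, consider the following Python sketch; the inputs `frame` and `sr`, the Hanning window, the 100–5000 Hz band restriction and the normalization are illustrative assumptions rather than a fixed recipe from the literature.

```python
import numpy as np

# Sketch: discrete frequency-to-pitch-class aggregation in the style of
# Wakefield (1999). `frame` is a mono signal excerpt, `sr` the sampling rate.
def chroma_from_frame(frame, sr, f_ref=440.0):
    windowed = frame * np.hanning(len(frame))      # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2  # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    chroma = np.zeros(12)
    for f, energy in zip(freqs[1:], spectrum[1:]):  # skip the DC bin
        if 100.0 <= f <= 5000.0:                    # common band restriction
            midi = 69 + 12 * np.log2(f / f_ref)     # fractional MIDI pitch
            chroma[int(round(midi)) % 12] += energy # wrap into one octave
    return chroma / max(chroma.max(), 1e-12)        # normalize to max bin
```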

3.1.2 Minor Pitch Changes

In Western tonal music, instruments are tuned to the reference frequency of A4 above middle C (MIDI note 69), whose standard frequency is 440 Hz. In some cases the tuning of the instruments can deviate slightly, usually by less than a quartertone from this standard tuning: 415–445 Hz (Mauch, 2010). Most humans are unable to determine an absolute pitch without a reference pitch: with some practice we can hear the mistuning of a single instrument, but it is difficult to notice a slight deviation of all instruments from the usual reference frequency described above.

The bins of the chroma vectors are relative to a fixed pitch, so minor deviations in the input will affect their quality. Such deviations of the reference pitch can be taken into account by shifting the pitch of the chromagram bins, and several methods have been proposed to do so. Harte and Sandler (2005) use a chroma vector with 36 bins, three per semitone. Computing a histogram of energies with respect to frequency for one chroma vector and for the whole song, and examining the peak positions in the extended chroma vector, enables them to estimate the true tuning and derive a 12-bin chroma vector, under the assumption that the tuning does not deviate during the piece of music. This takes a slightly changed reference frequency into account. Gomez (2006) first restricts the input frequencies to 100–5000 Hz, to reduce the search space and to remove additional overtone and percussive noise, and uses a weighting function which aggregates each spectral peak not to one but to several chromagram bins, with the spectral energy contributions weighted according to a squared cosine distance in frequency. Dressler and Streich (2007) treat minor tuning differences as an angle and use circular statistics to compensate for minor pitch shifts, an approach later adapted by Mauch and Dixon (2010b).

Minor tuning differences are quite prominent in Western tonal music, and adjusting the chromagram can lead to a performance increase, such that several other systems make use of one of the former methods, e.g., Papadopoulos and Peeters (2007, 2008), Reed et al. (2009), Khadkevich and Omologo (2009a) and Oudre et al. (2009).


3.1.3 Percussive Noise Reduction

Music audio often contains noise that cannot directly be used for chord recognition, such as transient or percussive noise. Percussive and transient noise is normally short, in contrast to harmonic components, which are rather stable over time. A simple way to reduce it is to smooth subsequent chroma vectors through filtering or averaging, and different filters have been proposed. Some researchers, e.g., Peeters (2006), Khadkevich and Omologo (2009b) and Mauch et al. (2008), use a median filter over time, after tuning and before aggregating the chroma vectors, to remove transient noise. Gomez (2006) uses several different filtering methods and derivatives, based on a method developed by Bonada (2000), to detect transient noise and leaves a window of 50 ms before and after transient noise out of the chroma vector calculation, reducing the input space. Catteau et al. (2007) calculate a "background spectrum" by convolving the log-frequency spectrum with a Hamming window of one octave length, which they subtract from the original chroma vector to reduce noise.
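As a small illustration of the median-filtering idea, the following Python sketch smooths a chromagram along the time axis; the nine-frame window length is an arbitrary assumption.

```python
import numpy as np
from scipy.ndimage import median_filter

# Sketch: temporal median filtering of a chromagram (12 x n_frames) to
# suppress short transient/percussive noise.
def smooth_chromagram(chromagram: np.ndarray, width: int = 9) -> np.ndarray:
    # Filter each pitch class independently along the time axis only.
    return median_filter(chromagram, size=(1, width), mode="nearest")
```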

Because there are methods to estimate the beat from the audio signal (Ellis, 2007), and chord changes are more likely to appear at these metric positions, several systems aggregate or filter the chromagram only between detected beats. Ni et al. (2012) use a so-called harmonic percussive sound separation algorithm, described in Ono et al. (2008), which attempts to split the audio signal into percussive and harmonic components; they then use the median chroma feature vector as the representation of the complete chromagram between two beats. A similar approach is used by Weil et al. (2009), who also use a beat tracking algorithm and average the chromagram between two consecutive beats. Glazyrin and Klepinin (2012) calculate a beat-synchronous smoothed chromagram and propose a modified Prewitt filter, borrowed from edge detection in image processing, to suppress non-harmonic spectral components.

3.1.4 Repeating Patterns

Musical pieces have a very repetitive structure; e.g., in popular music, higher-level structures such as verse and chorus are repeated, and usually those are themselves repetitions of different harmonic (chord) patterns. These structures can be exploited to improve the chromagram by recognizing and averaging or filtering the repetitive parts to remove local deviations. Repetitive parts can also be estimated and used later in the classification step to increase performance. Mauch et al. (2009) first perform a beat estimation and smooth the chroma vectors in a prefiltering step. Then a frame-by-frame similarity matrix is computed from the beat-synchronous chromagram, and the song is segmented into an estimation of verse and chorus. This information is used to average the beat-synchronous chromagram. Since beat estimation is a current research topic itself and often does not work perfectly, there may be errors in the beat positions. Cho and Bello (2011) argue that it is advantageous to use recurrence plots with a simple threshold operation to find similarities on the chord level for later averaging, thus leaving out the segmentation of the song into chorus and verse as well as the beat detection. Glazyrin and Klepinin (2012) build upon and alter the system of Cho and Bello, using a normalized self-similarity matrix on the computed chroma vectors with Euclidean distance as the comparison measure.


3.1.5 Harmonic / Enhanced Pitch Class Profile

One problem in the computation of PCPs in general is finding an interpretation for overtones (energy at integer multiples of the fundamental frequency), since these might generate energy at frequencies that contribute to chroma vector bins other than those of the actual notes of the respective chord. For example, the overtones of A4 (440 Hz) are at 880 Hz and 1320 Hz, the latter being close to E6 (MIDI note 88) at approximately 1318.51 Hz. Several ways to deal with this have been proposed. In most cases the frequency range taken into account is restricted, e.g., to approximately 100–5000 Hz (Lee, 2006; Gomez, 2006), as most of the harmonic content is contained in this interval. Lee (2006) refines the chroma vector by computing the so-called "harmonic product spectrum", in which the product of the energies at the octave multiples of each bin (up to a certain number) is calculated; the chromagram is then computed on the basis of this harmonic product spectrum. He states that multiplying the fundamental frequency with its octave multiples can decrease noise on notes that are not contained in the original piece of music, and additionally finds a reduction of noise induced by "false" harmonics compared to conventional chromagram calculation. Gomez (2006) proposes an aggregation function for the computation of the chroma vector in which the energies of the frequency multiples are summed, but first weighted by a "decay" factor that depends on the multiple. Mauch and Dixon (2010a) use a non-negative least-squares method to find a linear combination of "note profiles" in a dictionary matrix to compute a log-frequency representation similar to the constant-Q transform mentioned earlier.
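The following Python sketch illustrates the octave-multiple product idea; the number of octaves is an assumed parameter, and this is a schematic reading of the technique rather than Lee's exact implementation.

```python
import numpy as np

# Sketch of a harmonic product spectrum: multiply the spectrum with its
# octave-downsampled copies so that energy recurring at octave multiples
# of a fundamental reinforces that fundamental.
def harmonic_product_spectrum(spectrum: np.ndarray, n_octaves: int = 3):
    hps = spectrum.copy()
    for k in range(1, n_octaves + 1):
        downsampled = spectrum[::2 ** k]      # bin i picks up energy at i * 2^k
        hps[:len(downsampled)] *= downsampled
    return hps
```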

3.1.6 Modelling Human Loudness Perception

Human loudness perception is not directly proportional to the power or amplitude spectrum (Ni et al., 2012), so the different representations described above do not model human perception accurately. Ni et al. (2012) describe a method to incorporate this through a log10 scale for the sound power with respect to frequency. Pauws (2004) uses a tangential weighting function to achieve a similar goal for key detection. They find an improvement in the quality of the resulting chromagram compared to non-loudness-weighted methods.

3.1.7 Tonnetz / Tonal Centroid

Another representation of harmonic content is the so-called Tonnetz, attributed to Euler in the 18th century. It is a planar representation of musical notes on a 6-dimensional polytope, where pitch relations are mapped onto its vertices; close musical harmonic relations (e.g., fifths and thirds) have a small Euclidean distance. Harte et al. (2006) describe a way to compute a Tonnetz representation, the tonal centroid, from a 12-bin chroma vector, and report a performance increase for a harmonic change detection function compared to standard methods.
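A Python sketch of the 6-dimensional tonal centroid is given below; the three circles (fifths, minor thirds, major thirds) and the radii r1 = r2 = 1, r3 = 0.5 follow my reading of Harte et al. (2006), so the exact constants should be treated as assumptions.

```python
import numpy as np

# Sketch: project a 12-bin chroma vector onto a 6-D tonal centroid.
def tonal_centroid(chroma: np.ndarray) -> np.ndarray:
    k = np.arange(12)
    phi = np.array([
        np.sin(k * 7 * np.pi / 6), np.cos(k * 7 * np.pi / 6),  # circle of fifths
        np.sin(k * 3 * np.pi / 2), np.cos(k * 3 * np.pi / 2),  # minor thirds
        0.5 * np.sin(k * 2 * np.pi / 3),                       # major thirds
        0.5 * np.cos(k * 2 * np.pi / 3),
    ])
    denom = np.abs(chroma).sum() or 1.0   # avoid division by zero
    return phi @ chroma / denom
```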

Humphrey et al. (2012) use a convolutional neural network on the FFT to model a projection function from waveform input to a Tonnetz. They perform experiments on the task of chord recognition with a Gaussian mixture model, and report that the Tonnetz output representation outperforms state-of-the-art chroma vectors.


3.2 Classification

The majority of chord recognition systems compute a chromagram using one or a combination of the methods described above. Early approaches use predefined chord templates and compare them with the computed frame-wise chroma features from audio pieces, which are then classified.

With the supply of more and more hand-annotated data, data-driven learning approaches have been developed. The most prominent data-driven model adopted is taken from speech recognition: the hidden Markov model (HMM). Bayesian networks, a generalization of HMMs, are also used frequently. Recent approaches propose to take more musical context into account to increase performance, such as the local key, bass note, beat and song structure segmentation. Although most chord recognition systems rely on the computation of single chroma vectors, more recent approaches compute two chroma vectors for each frame: a bass and a treble chromagram (differing in frequency range), as it is reasoned that the sequence of bass notes plays an important role in the harmonic development of a song and can colour the treble chromagrams due to harmonics.

3.2.1 Template Approaches

The chroma vector, as an estimate of the harmonic content of a frame of a music piece, should contain peaks at bins that correspond to the chord notes played. Chord template approaches use chroma-vector-like templates, which can be either predefined through expert knowledge or learned from data. The templates are compared with the computed chroma vector of each frame by means of a fitting function, and the frame is then classified as the chord symbol corresponding to the best-fitting template.

The first research paper explicitly concerned with chord recognition is by Fujishima (1999), which constitutes a non-machine-learning system. Fujishima first computes simple chroma vectors as described above. He then uses predefined 12-dimensional binary chord patterns (either 1 or 0 for present and non-present notes in the chord) and computes the inner product with the chroma vector. For real-world chord estimation, the set of chords consists of schemata for "triadic harmonic events, and to some extent more complex chords such as sevenths and ninths". Fujishima's system was only evaluated on synthesized sound data, however. Binary chord templates with an enhanced chroma vector using harmonic overtone suppression were used by Lee (2006). Other groups use a more elaborate chromagram with tuning (36 bins) for minor pitch changes, reducing the chord types to be recognized (Harte and Sandler, 2005; Oudre et al., 2009). Oudre et al. (2011) extend these methods by comparing different filtering methods, as described in section 3.1.3, and measures of fit (Euclidean distance, Kullback-Leibler divergence and Itakura-Saito divergence) to select the most suitable chord template. They also take harmonic overtones of chord notes into account, such that bins in the templates for notes not occurring in the chord do not necessarily have to be zero. Glazyrin and Klepinin (2012) use quasi-binary chord templates in which the tonic and the fifth are enhanced and the template is normalized afterwards; the templates are compared to smoothed and fine-tuned chroma vectors.

Chord templates do not have to be in the form of chroma vectors. They can also


be modelled as a Gaussian, or as a mixture of Gaussians as used by Humphrey et al. (2012), in order to get a probabilistic estimate of chord likelihood. To eliminate short spurious chords that last only a few frames, they use a Viterbi decoder. They do not use a chroma vector for classification, but a Tonnetz as described in section 3.1.7; the transformation function is learned from data by a convolutional neural network.

It should be noted that basically all chord template approaches can model chord "probabilities" that can in turn be used as input for higher-level classification methods or for temporal smoothing such as hidden Markov models, described in section 3.2.2, as shown by Papadopoulos and Peeters (2007).

3.2.2 Data-Driven Higher Context Models

The recent increase in the availability of hand-annotated data for chord recognition has spawned new machine-learning-based methods. In the chord recognition literature, different approaches have been proposed, from neural networks to systems adopted from speech recognition, support vector machines and others. More recent machine learning systems seem to capture more and more of the context of music. In this section I describe the higher-level classification models found in the literature, organized by the machine learning methods used.

Neural Networks Su and Jeng (2001) try to model the human auditory system with artificial neural networks. They perform a wavelet transform (as an analogy to the ear) and feed the output into a neural network (as an analogy to the cerebrum) for classification. They use a self-organizing map to determine the style of chord and the tonality (C, C♯, etc.). The system was tested on classical music to recognize four different chord types (major, minor, augmented and diminished). Zhang and Gerhard (2008) propose a system based on neural networks to detect basic guitar chords and their voicings (inversions) with the help of a voicing vector and a chromagram. The neural network is first trained to identify and output the basic chords; a later post-processing step determines the voicing. Osmalsky et al. (2012) build a database of several different instruments playing single chords individually, partly recorded in a noisy and partly in a noise-free environment. They use a feed-forward neural network with a chroma vector as input to classify 10 different chords and experiment with different subsets of their training set.

HMM Neural networks do not take temporal dependencies between subsequent inputs into account. In music pieces there is a strong interdependency between subsequent chords, which makes chord classification for a whole music piece difficult to model based solely on neural networks. Since template and neural-network-based approaches do not explicitly take the temporal properties of music into account, a widely adopted method is the hidden Markov model, which has proven to be a good tool in the related field of speech recognition. The chroma vector is treated as the observation, which can be modelled by different probability distributions, and the states of the HMM are the chord symbols to be extracted.
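For illustration, the following Python sketch shows standard Viterbi decoding of the most likely chord path; the observation matrix, transition matrix and prior are hypothetical placeholders that an acoustic model would supply, and this is the textbook algorithm rather than any specific system from the literature.

```python
import numpy as np

# Sketch: Viterbi decoding. `log_obs` is (n_frames x n_chords) per-frame
# log-probabilities, `log_A` a log transition matrix, `log_pi` a log prior.
def viterbi(log_obs, log_A, log_pi):
    n_frames, n_states = log_obs.shape
    delta = log_pi + log_obs[0]                  # best score ending in each state
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_A          # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)          # best predecessor per state
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]                 # backtrack from the best end state
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                            # chord index per frame
```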

Sheh and Ellis (2003) pioneered HMMs for real-world chord recognition. They propose that the emission distribution be a single Gaussian with 24 dimensions, trained from data with expectation maximization. Burgoyne et al.


(2007) state that a mixture of Gaussians is more suitable as the emission distribution. They also compare the use of Dirichlet distributions as the emission distribution and conditional random fields as the higher-level classifier. HMMs are used with slightly different chromagram computations and training initialisations according to prior music-theoretic knowledge by Bello and Pickens (2005). Lee (2006) builds upon the systems of Bello and Pickens and of Sheh and Ellis, generates training data from symbolic (MIDI) files and uses an HMM for chord extraction. Papadopoulos and Peeters (2007) compare several different methods of determining the parameters of the HMM and the observation probabilities; they conclude that a template-based approach combined with an HMM with a "cognitive based transition matrix" shows the best performance. Later, Papadopoulos and Peeters (2008, 2011) propose an HMM approach focusing on (and extracting) beat estimates to take into account musical beat addition, beat deletion or changes in meter to enhance recognition performance. Ueda et al. (2010) use harmonic percussive sound separation chromagram features and an HMM for classification. Chen et al. (2012) cluster "song-level duration histograms" to take duration explicitly into account in a so-called duration-explicit HMM. The system of Ni et al. (2012), based on an HMM, bass and treble chroma, and beat and key detection, was the best-performing system in the 2012 MIREX chord estimation challenge.

Structured SVM Weller et al. (2009) compare the performance of HMMs and support vector machines (SVMs) for chord recognition and achieve state-of-the-art results using support vector machines.

n-grams Language and music are closely related: both spoken language and music rely on audio data, so it makes sense to apply spoken-language-recognition approaches to music analysis and chord recognition. A dominant approach in language recognition is the n-gram model. A bigram model (n = 2) is essentially a hidden Markov model, in which one state depends only on the previous one. Cheng et al. (2008) compare 2-, 3- and 4-grams, thus making one chord dependent on multiple previous chords, and use the model for song similarity after a chord recognition step. In their experiments the simple 3- and 4-grams outperform the basic HMM system of Harte and Sandler (2005); they state that n-grams are able to learn the basic rules of chord progressions from hand-annotated data.

Scholz et al. (2009) use a 5-gram model, compare different smoothing techniques and find that modelling more complex chords with 7ths and 9ths should be possible with n-grams. They do not state how features are computed and interpreted.

Dynamic Bayesian Networks Musical chords develop meaning in their interplay with other characteristics of a music piece, such as bass note, beat and key: they cannot be viewed as isolated entities. These interdependencies are difficult to model with a standard HMM approach. Bayesian networks are a generalization of HMMs in which the musical context can be modelled more intuitively: they give the opportunity to model interdependencies simultaneously, creating a more sound model of music pieces from a music-theoretic perspective. Another advantage of a Bayesian network is that it can


directly extract multiple types of information, which may not be a priority for the task of chord recognition, but is an advantage for the extended task of general transcription of music pieces.

Cemgil et al. (2006) were among the first to introduce Bayesian networks for music computation. They do not apply their system to chord recognition but to polyphonic music transcription (transcription on a note-by-note basis), implementing a special case of the switching Kalman filter. Mauch (2010) and Mauch and Dixon (2010b) make use of a Bayesian network and incorporate beat detection, bass note and key estimation; the observations of the Bayesian network in their system are treble and bass chromagrams. Dixon et al. (2011) compare a similar system to a logic-based system.

Deep Learning Techniques Deep learning techniques have beaten the state of the art in several benchmark problems in recent years, although for the task of chord recognition deep learning is a relatively unexplored method. There are three recent publications using deep learning techniques. Humphrey and Bello (2012) call for a change in the conventional approach of using a variation of the chroma vector and a higher-level classifier, since they state that recent improvements seem to bring only "diminishing return". They present a system consisting of a convolutional neural network with several layers, trained to learn a Tonnetz from a constant-Q-transformed FFT, whose output is subsequently classified with a Gaussian mixture model. Boulanger-Lewandowski et al. (2013) make use of deep learning techniques with recurrent neural networks. They use different techniques, including a Viterbi-like algorithm from HMMs and beam search, to take temporal information into account, and report upper-bound results comparable to the state of the art using the Beatles Isophonics dataset (see section 6.5 for a dataset description) for training and testing. Glazyrin (2013) uses stacked denoising autoencoders with a 72-bin constant-Q transform input, trained to output chroma vectors. A self-similarity algorithm is applied to the neural network output, which is later classified with a deterministic algorithm similar to the template approaches mentioned above.


4 Stacked Denoising Autoencoders

In this section I describe the theoretical background of the stacked denoising autoencoders used for the two chord recognition systems in this thesis, following Vincent et al. (2010). First a definition of autoencoders and their training method is given in section 4.1; then section 4.2 describes how this can be extended to form a denoising autoencoder. We can stack denoising autoencoders to train them in an unsupervised manner and possibly obtain a useful higher-level data abstraction by training several layers, as described in section 4.3.

4.1 Autoencoders

Autoencoders or autoassociators try to find an encoding of given data in their hidden layers. Similarly to Vincent et al. (2010), we define the following:

We assume a supervised learning scenario with a training set of n tuples of inputs x and targets t, D_n = \{(x^{(1)}, t^{(1)}), \dots, (x^{(n)}, t^{(n)})\}, where x \in \mathbb{R}^d if the input is real-valued, or x \in [0, 1]^d. Our goal is to infer a new, higher-level representation y of x. The new representation is again y \in \mathbb{R}^{d'} or y \in [0, 1]^{d'}, depending on whether a real-valued or a binary representation is assumed.

Encoder A deterministic mapping f_θ that transforms the input x into a hidden representation y is called an encoder. It can be described as follows:

y = f_\theta(x) = s(Wx + b), \qquad (2)

where \theta = \{W, b\}, W is a d' \times d weight matrix and b an offset (or bias) vector of dimension d'. The function s is a non-linear mapping, e.g., the sigmoid activation function s(a) = \frac{1}{1 + e^{-a}}. The output y is called the "hidden representation".

Decoder A deterministic mapping g_{θ'} that maps the hidden representation y back to input space by constructing a vector z = g_{θ'}(y) is called a decoder. Typically this is a linear mapping:

z = g_{\theta'}(y) = W'y + b' \qquad (3)

or a mapping followed by a non-linearity:

z = g_{\theta'}(y) = s(W'y + b'), \qquad (4)

where \theta' = \{W', b'\}, W' is a d \times d' weight matrix and b' an offset (or bias) vector of dimension d. Often the restriction W' = W^\top is imposed on the weights (the weights are then said to be "tied"). z can be regarded as an approximation of the original input x, reconstructed from the hidden representation y.
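A minimal NumPy sketch of the encoder/decoder pair of equations (2) and (4) with tied weights follows; the dimensions d and d_h and the weight initialization are assumptions for illustration.

```python
import numpy as np

# Sketch: encoder/decoder forward pass with tied weights (W' = W^T).
rng = np.random.default_rng(0)
d, d_h = 1024, 256                       # assumed input and hidden sizes
W = rng.normal(0.0, 0.01, size=(d_h, d))
b, b_prime = np.zeros(d_h), np.zeros(d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x):                 # y = s(Wx + b), equation (2)
    return sigmoid(W @ x + b)

def decode(y):                 # z = s(W'y + b') with W' = W^T, equation (4)
    return sigmoid(W.T @ y + b_prime)

z = decode(encode(rng.random(d)))   # reconstruction of a random input
```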


[Figure 2: Conventional autoencoder training. A vector x from the training set is projected by f_θ to the hidden representation y, then projected back to input space using g_{θ'} to compute z. The loss function L(x, z) is calculated and used as the training objective for minimization.]

Training The idea behind such a model is to obtain a good hidden representation y, from which the decoder is able to reconstruct the original input as closely as possible. It can be shown that finding the optimal parameters for such a model can be viewed as maximizing a lower bound on the mutual information between the input and the hidden representation in the first layer (Vincent et al., 2010). To estimate the parameters we define a loss function. For a binary input x \in [0, 1]^d this can be the cross-entropy:

L(x, z) = -\sum_{k=1}^{d} \left[ x_k \log(z_k) + (1 - x_k) \log(1 - z_k) \right] \qquad (5)

or, for real-valued input x \in \mathbb{R}^d, the "squared error objective":

L(x, z) = \|x - z\|^2. \qquad (6)

Since we use real-valued input data, the squared error objective is used as the loss function in this thesis.

Given this loss function, we want to minimize the average loss (Vincent et al., 2008):

\theta^*, \theta'^* = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L(x^{(i)}, z^{(i)}) = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\left(x^{(i)}, g_{\theta'}(f_\theta(x^{(i)}))\right), \qquad (7)

where \theta^* and \theta'^* denote the optimal parameters of the encoding and decoding functions for which the loss function is minimized (and which might be tied), and n is the number of training samples. The minimization can be carried out iteratively by backpropagation. Figure 2 visualizes the training procedure for an autoencoder.

If the hidden representation y is of the same dimensionality as the input x, it is trivial to construct a mapping that yields zero reconstruction error: the identity mapping. Obviously this constitutes a problem, since merely learning the identity mapping does not lead to any higher level of abstraction. To evade this problem a bottleneck is introduced, for example by using fewer nodes for a


hidden representation, thus reducing its dimensionality. It is also possible to impose a penalty on the network activations to form a bottleneck, and thus train a sparse network. These additional restrictions force the neural network to focus on the most "informative" parts of the data, leaving out noisy, "uninformative" parts. Several layers can be trained in a greedy manner to achieve a yet higher level of abstraction.
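For concreteness, here is a minimal PyTorch sketch of one gradient step on the average squared-error objective of equation (7), with a bottlenecked hidden layer and the linear decoder of equation (3); the sizes, optimizer and learning rate are assumptions, not settings from this thesis.

```python
import torch

d, d_h = 1024, 256                     # assumed input and bottleneck sizes
enc = torch.nn.Linear(d, d_h)          # y = s(Wx + b)
dec = torch.nn.Linear(d_h, d)          # z = W'y + b', equation (3)
opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=0.1)

def train_step(x):                     # x: (batch, d) real-valued input
    z = dec(torch.sigmoid(enc(x)))     # reconstruct through the bottleneck
    loss = ((x - z) ** 2).sum(dim=1).mean()   # squared error, averaged
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```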

Enforcing Sparsity To prevent autoencoders from learning the identity mapping, we can penalize activations. This is described by Hinton (2010) for restricted Boltzmann machines, but can be used for autoencoders as well. The general idea is that nodes that fire very frequently are less informative; a node that is always active does not add any useful information and could be left out. We can enforce sparsity by adding a penalty term for large average activations over the whole dataset to the backpropagated error. We can compute the average activation of hidden unit j over all training samples with:

\bar{p}_j = \frac{1}{n} \sum_{i=1}^{n} f_\theta^j(x^{(i)}) \qquad (8)

In this thesis the following addition to the loss function is used, which is derived from the KL divergence:

L_p = \beta \sum_{j=1}^{h} \left( p \log \frac{p}{\bar{p}_j} + (1 - p) \log \frac{1 - p}{1 - \bar{p}_j} \right), \qquad (9)

where \bar{p}_j is the average activation of hidden unit j over the complete training set, n is the number of training samples, p is a target activation parameter and \beta a penalty weighting parameter, both specified beforehand. The bound h is the number of hidden nodes. For a sigmoidal activation function, p is usually set to a value close to zero, for example 0.05; a frequent setting for \beta is 0.1. This ensures that units have a large activation only on a limited number of training samples and otherwise an activation close to zero. We simply add this weighted activation error term to the loss L(x, z) described above.
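A sketch of this penalty in the same PyTorch setting follows; the defaults p = 0.05 and beta = 0.1 are the values named above, while eps is a numerical-stability assumption of mine.

```python
import torch

# Sketch of the sparsity penalty of equations (8) and (9).
def sparsity_penalty(hidden, p=0.05, beta=0.1, eps=1e-8):
    # hidden: (batch, h) activations of the hidden layer.
    p_hat = hidden.mean(dim=0)   # average activation per unit, equation (8)
    kl = p * torch.log(p / (p_hat + eps)) \
       + (1 - p) * torch.log((1 - p) / (1 - p_hat + eps))
    return beta * kl.sum()       # added to the reconstruction loss L(x, z)
```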

4.2 Autoencoders and Denoising

Vincent et al. (2010) propose another training criterion in addition to the bottleneck. They state that an autoencoder can also be trained to "clean a partially corrupted input", also called denoising.

If noisy input is assumed, it can be beneficial to corrupt (parts of) the input of the autoencoder during training and use the uncorrupted input as the target. The autoencoder is thereby encouraged to reconstruct a "clean" version of the corrupted input. This can make the hidden representation of the input more robust to noise, and can potentially lead to a better higher-level abstraction of the input data.

Vincent et al. (2010) state that different types of noise can be considered. There is “masking noise”, i.e., setting a random fraction of the input to 0; “salt and pepper noise”, i.e., setting a random fraction of the input to either 0 or 1; and, especially for real-valued input, isotropic additive Gaussian noise, i.e., adding noise from a Gaussian distribution to the input. To achieve this, we corrupt the initial input x into $\tilde{x}$ according to a stochastic mapping $\tilde{x} \sim q_D(\tilde{x}|x)$. This corrupted input is then projected to the hidden representation as described before by means of $y = f_\theta(\tilde{x}) = s(W\tilde{x} + b)$. Then we can reconstruct $z = g_{\theta'}(y)$. The parameters $\theta$ and $\theta'$ are trained to minimize the average reconstruction error between output z and the uncorrupted input x, but in contrast to “conventional” autoencoders, z is now a deterministic function of $\tilde{x}$ instead of x.

For our purpose, using additive Gaussian noise, we can train the denoising autoencoder with a squared-error loss function: $L_2(x, z) = \|x - z\|^2$. Parameters can be initialized at random and then optimized by backpropagation. Figure 3 depicts the training of a denoising autoencoder.

Figure 3: Vector x from the training set is corrupted with $q_D$ and converted to hidden representation y. The loss function L(x, z) is calculated from the output and the uncorrupted input and used for training.
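To make the procedure concrete, here is a minimal sketch of one training step of a tied-weight denoising autoencoder with additive Gaussian corruption and squared-error loss; the learning rate and the plain stochastic-gradient update are assumptions for illustration, not the exact training schedule used in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x, W, b, b_prime, noise_std=0.2, lr=0.1):
    """One stochastic-gradient step for a tied-weight denoising autoencoder.

    x: (d,) input vector; W: (h, d) weight matrix; b, b_prime: biases.
    The input is corrupted with additive Gaussian noise (q_D) and the
    squared reconstruction error is taken against the *uncorrupted* x.
    """
    x_tilde = x + rng.normal(0.0, noise_std, size=x.shape)  # corruption q_D
    y = sigmoid(W @ x_tilde + b)                            # encoder f_theta
    z = sigmoid(W.T @ y + b_prime)                          # decoder g_theta' (tied weights)
    delta_z = (z - x) * z * (1.0 - z)                       # error at the output layer
    delta_y = (W @ delta_z) * y * (1.0 - y)                 # backpropagated to hidden layer
    W -= lr * (np.outer(delta_y, x_tilde) + np.outer(delta_z, y).T)
    b -= lr * delta_y
    b_prime -= lr * delta_z
    return 0.5 * np.sum((x - z) ** 2)                       # current loss L_2(x, z)
```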

4.3 Training Multiple Layers

If we want to train (or initialize training parameters for supervised backpropagation for) deep networks, we need a way to extend the approach from a single layer, as described in the previous sections, to multiple layers.

As described by Vincent et al. (2010), this can easily be achieved by repeating the process for each layer separately. Figure 4 depicts such a greedy layer-wise training. First we propagate the input x through the already trained layers. Note that we do not use additional corruption noise yet. Next we use the uncorrupted hidden representation of the previous layer as input for the layer we are about to train. We train this specific layer as described in the previous sections. The input to the layer to be trained is first corrupted by $q_D$ and then projected into latent space using $f^{(2)}_\theta$. We then project it back to the “input” space of the specific layer with $g^{(2)}_{\theta'}$. Using an error function L, we can optimize the projection functions with respect to the defined error, and therefore possibly obtain a useful higher-level representation. This process can be repeated several times to initialize a deep neural network structure, circumventing the usual problems that arise when initializing deep networks at random and then applying backpropagation.

Next we can apply a classifier on the output of this deep neural network trained to suppress noise. Alternatively, we can add another layer of hidden nodes for classification purposes on top of the previously unsupervised-trained network structure and apply standard backpropagation to fine-tune the network weights according to our supervised training targets t.

Figure 4: Training of several layers in a greedy unsupervised manner. The input is propagated without corruption. To train an additional layer, the output of the first layer is corrupted by $q_D$ and the weights are adjusted with $f^{(2)}_\theta$, $g^{(2)}_{\theta'}$ and the respective loss function. After training for this layer is completed, we can train subsequent layers.4
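A compact sketch of this greedy layer-wise procedure, reusing dae_step, sigmoid, and rng from the sketch above; the per-sample training loop and the random weight initialization are simplifications for illustration.

```python
import numpy as np

def pretrain_stack(X, layer_sizes, epochs=30, noise_std=0.2):
    """Greedy layer-wise pretraining of stacked denoising autoencoders.

    X: (n, d) matrix of training vectors. Each layer is trained on the
    *uncorrupted* output of the layers below it; corruption happens only
    inside dae_step for the layer currently being trained.
    """
    weights = []
    H = X
    for h in layer_sizes:
        W = rng.normal(0.0, 0.01, size=(h, H.shape[1]))
        b = np.zeros(h)
        b_prime = np.zeros(H.shape[1])
        for _ in range(epochs):
            for x in H:
                dae_step(x, W, b, b_prime, noise_std=noise_std)
        weights.append((W, b))
        H = sigmoid(H @ W.T + b)   # propagate without corruption to the next layer
    return weights
```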

4.4 Dropout

Hinton et al. (2012) were able to improve performance on several recognition tasks, including MNIST for handwritten digit recognition and TIMIT, a database for speech recognition, by randomly omitting a fraction of hidden nodes from training for each sample. This is in essence training a different model for each training sample, iterating on that one training sample only. According to Hinton et al., this prevents the network from overfitting. In the testing phase we make use of the complete network again. Thus what we are effectively doing with dropout is averaging: averaging many models trained on one training sample each. This has yielded improvements in different modelling tasks (Hinton et al., 2012).
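A minimal sketch of this idea for one hidden layer; the test-time scaling by the keep probability stands in for the halving of weights described by Hinton et al. (2012), and the 0.5 default matches the dropout rate used later in section 6.3.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(y, drop_prob=0.5, training=True):
    """Randomly zero hidden activations during training.

    At test time the complete network is used; activations are scaled by
    the keep probability so their expected value matches training.
    """
    if not training:
        return y * (1.0 - drop_prob)
    mask = rng.random(y.shape) >= drop_prob
    return y * mask
```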

4 As shown in Vincent et al. (2010).

5 Chord Recognition Systems

In this section I describe the structure of three different approaches to classify chords.

1. We first describe the structure of a comparison system: a simplified version of the Harmony Progression Analyzer as proposed by Ni et al. (2012). The features computed can be considered state of the art. We discard, however, additional context information like key, bass and beat tracking, since the neural network approaches developed in this thesis do not take this into account (although it should be noted that in principle the approaches developed in this thesis could be extended to take this additional context information into account as well). The simplified version of the Harmony Progression Analyzer will serve as a reference system for performance comparison.

2. A neural network initialized by stacked denoising autoencoder pretraining with later backpropagation fine-tuning can be applied to an excerpt of the FFT to estimate chord probabilities directly, which then can be smoothed with the help of an HMM to take temporal information into account. We substitute the emission probabilities with the output of the stacked denoising autoencoders.

3. This approach can be extended by adding filtered versions of the FFT over different time spans to the input. We extend the input to include two additional vectors, median-smoothed over different time spans. Here again additional temporal smoothing is applied in a post-classification process.

In section 5.1 we describe the comparison system and briefly the key ideas incorporated in the computation of state-of-the-art features. Since the two other approaches described in this thesis make use of stacked denoising autoencoders that interpret the FFT directly, we describe beneficial pre-processing steps in section 5.2.1. In section 5.2.2 we describe a stacked denoising autoencoder approach for chord recognition in which the outputs are chord symbol probabilities directly, and in section 5.2.3 we propose an extension of this approach, inspired by a system developed for face recognition and phone recognition by Tang and Mohamed (2012) using a so-called multi-resolution deep belief network, and apply it to chord recognition with stacked denoising autoencoders. Appendix A describes the theoretical foundation of applying a joint optimization of the HMM and neural network for chord recognition.

5.1 Comparison System

In this section we describe a basic comparison system for the other approaches implemented. It reflects the structure of most current approaches and uses state-of-the-art features for chord recognition.

Most recent chord recognition systems rely on an improved computation of the PCP vector and take extra information into account, such as bass notes or key information. This extra information is usually incorporated into a more elaborate higher-level framework, such as multiple HMMs or a Bayesian network.

The comparison system consists of the computation of state-of-the-art PCP vectors for all frames, but only a single HMM for later classification and temporal alignment of chords, which allows for a fairer comparison to the stacked denoising autoencoder approaches. The basic computation steps described in the following are used in the approach described by Ni et al. (2012). They split the computation of features into a bass chromagram and a treble chromagram, and track them with two additional HMMs. The computed frames are aligned according to a beat estimate. To make this more elaborate system comparable, we again compute only one chromagram, containing both bass and treble, use a single HMM for temporal smoothing, and do not align frames according to a beat estimate.

We first describe, in section 5.1.1, the basic steps of computing PCP features, which have been the predominant features in chord recognition for some 15 years; hereafter, in section 5.1.2, we describe the extensions of the basic PCP used in the comparison system.

5.1.1 Basic Pitch Class Profile Features

The basic pipeline for computing a pitch class profile as a feature for chord recognition consists of three steps:

1. The signal is projected from the time to the frequency domain through a Fourier transform. Often files are downsampled to 11 025 Hz to allow for faster computation. This is also done in the reference system. The range of frequencies is restricted through filtering, to only analyse frequencies below, e.g., 4000 Hz (about the range of the keyboard of a piano, see figure 1) or similar, since other frequencies carry less information about the chord notes played and introduce more noise to the signal. In the reference system a frequency range from approximately 55 Hz to 1661.2 Hz is used, as this interval is proposed in the original system (Ni et al., 2012).

2. The second step consists of a constant-Q transform, which projects the amplitude of the signal in the linear frequency space to a logarithmic representation of signal amplitude, in which each constant-Q transform bin represents the spectral energy with respect to the frequency of a musical note.

3. In a third step the bins representing one musical note and its octave multiples are summed, and the resulting vector is sometimes normalized.

In the following section we describe the constant-Q transform and the computation of the PCP in more detail.

Constant-Q transform After converting the signal from the time to the frequency domain through a discrete or fast Fourier transform, we can apply an additional transform to make the frequency bins logarithmically spaced. This transform can be viewed as a set of filters in the time domain, which filter a frequency band according to a logarithmic scaling of the center frequencies of the constant-Q bins. Originally it was proposed to be an additional term in the Fourier transform, but it has been shown by Brown and Puckette (1992) to be computationally more efficient to filter the signal in Fourier space, thus applying the set of filters transformed into Fourier space to the signal also in Fourier space. This can be realized with a matrix multiplication. This transformation process to logarithmically spaced bins is called the constant-Q transform (Brown, 1991).

The name stems from the factor Q, which describes the relationship between the center frequency of each filter and the filter width: $Q = \frac{f_k}{\Delta f_k}$. Q is a so-called quality factor which stays constant, $f_k$ is the center frequency and $\Delta f_k$ the width of the filter. We can choose the filters such that they filter out the energy contained in musically relevant frequencies (i.e., frequencies corresponding to musical notes):

$$f_{k_{cq}} = (2^{\frac{1}{B}})^{k_{cq}} f_{\min}, \quad (10)$$

where $f_{\min}$ is the frequency of the lowest musical note to be filtered and $f_{k_{cq}}$ the center frequency corresponding to constant-Q bin $k_{cq}$. B denotes the number of constant-Q frequency bins per octave, usually B = 12 (one bin per semitone). Setting $Q = \frac{1}{2^{1/B} - 1}$ establishes a link between musically relevant frequencies and the filter width of our filterbank.

Different types of filters can be used to aggregate the energy in relevant frequencies and to reduce spectral leakage. For the comparison system we make use of a Hamming window, as described by Brown and Puckette (1992):

$$w(n, f_{k_{cq}}) = 0.54 + 0.46 \cos\left(\frac{2\pi n}{M(f_{k_{cq}})}\right), \quad (11)$$

where $n = -\frac{M(f_{k_{cq}})}{2}, \ldots, \frac{M(f_{k_{cq}})}{2} - 1$, and $M(f_{k_{cq}})$ is the window size, computable from Q, the center frequency $f_{k_{cq}}$ of constant-Q bin $k_{cq}$, and the sampling rate $f_s$ of the input signal (Brown, 1991):

$$M(f_{k_{cq}}) = Q \frac{f_s}{f_{k_{cq}}}. \quad (12)$$

We can now compute the filters and thus the respective sound power in the signal filtered according to a musically relevant set of center frequencies.

Instead of applying these filters in the time domain, it is computationally more efficient to do so in the spectral domain, by projecting the window functions to Fourier space first. We can then apply the filters through a matrix multiplication in frequency space. As noted by Brown and Puckette (1992), for bin $k_{cq}$ of the constant-Q transform we can write:

$$X^{cq}[k_{cq}] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, K[k, k_{cq}], \quad (13)$$

where $k_{cq}$ describes the constant-Q transform bin, $X[k]$ the signal amplitude at bin k in the Fourier domain, N the number of Fourier bins, and $K[k, k_{cq}]$ the value of the Fourier transform of our filter $w(n, f_{k_{cq}})$ for constant-Q bin $k_{cq}$ at Fourier bin k.

Choosing the right minimum frequency and quality factor will result in constant-Q bins corresponding to harmonically relevant frequencies. Having transformed the linearly spaced amplitude per frequency to musically spaced constant-Q transform bins, we can now continue to aggregate notes that are one octave apart, hereby reducing the dimension of the feature vector significantly.
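The following sketch builds such a bank of Hamming-windowed filters in Fourier space, following equations (10)–(13) and Brown and Puckette (1992); the FFT length, the frequency range (taken from the reference system), and the normalization details are assumptions.

```python
import numpy as np

def cq_kernel(fmin=55.0, fmax=1661.2, B=12, fs=11025, n_fft=4096):
    """Constant-Q kernel, applied by matrix multiplication in Fourier space.

    Each row is the FFT of one Hamming-windowed complex filter (eqs. 10-12),
    so X_cq = K @ np.fft.fft(frame, n_fft) realizes eq. (13).
    """
    Q = 1.0 / (2.0 ** (1.0 / B) - 1.0)
    n_bins = int(np.ceil(B * np.log2(fmax / fmin)))
    K = np.zeros((n_bins, n_fft), dtype=complex)
    for k_cq in range(n_bins):
        f_k = fmin * 2.0 ** (k_cq / B)                     # eq. (10)
        M = int(np.ceil(Q * fs / f_k))                     # eq. (12)
        n = np.arange(M)
        w = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / M)      # Hamming window, shifted form of eq. (11)
        kernel = (w / M) * np.exp(2j * np.pi * Q * n / M)  # complex exponential at f_k
        K[k_cq, :] = np.fft.fft(kernel, n_fft)             # project the filter to Fourier space
    return K / n_fft                                       # 1/N factor from eq. (13)
```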

PCP Aggregation Shepard's (1964) experiments on human perception of music suggest that humans perceive notes one octave apart as belonging to the same group of notes, known as pitch classes. Given these results we compute pitch class profiles based on the signal energy in logarithmic spectral space. As described by Lee (2006):

$$PCP[k] = \sum_{m=0}^{N_{cq}-1} |X^{cq}(k + mB)|, \quad (14)$$

where $k = 1, 2, \ldots, B$ is the index of the PCP bin and $N_{cq}$ is the number of octaves in the frequency range of the constant-Q transform. Usually B = 12, so that one bin for each musical note in one octave is computed. For pre-processing, e.g., correction of minor tuning differences, B = 24 or B = 36 is also sometimes used. Hereafter the resulting vector is usually normalized, typically with respect to the L1, L2 or L∞ norm.
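Folding the constant-Q bins into pitch classes is then a single reshape-and-sum; the L2 normalization shown here is one of the norms mentioned above, and the small guard constant is an implementation detail.

```python
import numpy as np

def pcp(X_cq, B=12):
    """Aggregate constant-Q magnitudes into pitch classes (eq. 14)."""
    n_oct = len(X_cq) // B
    v = np.abs(X_cq[: n_oct * B]).reshape(n_oct, B).sum(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)   # L2 normalization
```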

5.1.2 Comparison System: Simplified Harmony Progression Analyzer

In this section I describe the refinements made to the very basic chromagram computation defined above. The state-of-the-art system proposed by Ni et al. (2012) takes additional context into account. They state that tracking the key and the bass line provides important context and useful additional information for recognizing musical chords.

For a more accurate comparison with the stacked denoising autoencoder approaches, which cannot easily take such context into account, we discard the musical key, bass and beat information that is used by Ni et al. We compute the features with the code that is freely available from their website5 and adjust it to a fixed step size of 1024 samples at a sampling rate of 11 025 Hz, thus a step size of approximately 0.09 s per frame, instead of a beat-aligned step size.

In addition to a so-called harmonic percussive sound separation algorithm as described by Ono et al. (2008), which attempts to split the signal into a harmonic and a percussive part, Ni et al. implement a loudness-based PCP vector and correct for minor tuning deviations.

5.1.3 Harmonic Percussive Sound Separation

Ono et al. (2008) describe a method to discriminate between the percussive contribution to the Fourier transform and the harmonic one. This can be achieved by exploiting the fact that percussive sounds most often manifest themselves as bursts of energy spanning a wide range of frequencies but only during a limited time. On the other hand, harmonic components span a limited frequency range but are more stable over time. Ono et al. present a way to estimate the percussive and harmonic parts of the signal contribution in Fourier space as an optimization problem which can be solved iteratively:

5 https://patterns.enm.bris.ac.uk/hpa-software-package

$F_{h,i}$ is the short-time Fourier transform of an audio signal f(t) and $W_{h,i} = |F_{h,i}|^2$ is its power spectrogram. We minimize the L2 norm of the power spectrogram gradients, J(H, P), with $H_{h,i}$ the harmonic component and $P_{h,i}$ the percussive component, where h is the frequency bin and i the time in Fourier space:

$$J(H, P) = \frac{1}{2\sigma_H^2} \sum_{h,i} (H_{h,i-1} - H_{h,i})^2 + \frac{1}{2\sigma_P^2} \sum_{h,i} (P_{h-1,i} - P_{h,i})^2, \quad (15)$$

subject to the constraints

$$H_{h,i} + P_{h,i} = W_{h,i}, \quad (16)$$

$$H_{h,i} \geq 0, \quad (17)$$

and

$$P_{h,i} \geq 0, \quad (18)$$

where $W_{h,i}$ is the original power spectrogram, as described above, and $\sigma_H$ and $\sigma_P$ are parameters to control the smoothness vertically and horizontally. Details of an iterative optimization procedure can be found in the original paper.

5.1.4 Tuning and Loudness-Based PCPs

Here we describe further refinements of the PCP vector: first how to take minor deviations (less than a semitone) from the reference tuning into account, and later an addition proposed by Ni et al. (2012) to model human loudness perception.

Tuning To take into account minor pitch shifts in the tuning of a specific song, features are fine-tuned as described by Harte and Sandler (2005). Instead of computing a 12-bin chromagram directly, we can compute multiple bins for each semitone, as described in section 5.1.1 for setting B > 12 (e.g., B = 36). We can then compute a histogram of sound power peaks with respect to frequency and select a subset of constant-Q bins to compute the PCP vectors, shifting our reference tuning according to the small deviations of a song.

Loudness-Based PCPs Since human loudness perception with respect to frequency is not linear, Ni et al. (2012) propose a loudness weighting function.

First we can compute a “sound power level matrix”:

$$L_{s,t} = 10 \log_{10}\left(\frac{\|X_{s,t}\|^2}{p_{\mathrm{ref}}}\right), \quad s = 1, \ldots, S,\; t = 1, \ldots, T, \quad (19)$$

where $p_{\mathrm{ref}}$ indicates the fundamental reference power and $X_{s,t}$ the constant-Q transform of our input signal as described in the previous section (s denoting the constant-Q transform bin and t the time). They propose to use A-weighting (Talbot-Smith, 2001), in which we add a specific value depending on the frequency. An approximation to the human sensitivity of loudness perception with respect to frequency is then given by:

$$L'_{s,t} = L_{s,t} + A(f_s), \quad s = 1, \ldots, S,\; t = 1, \ldots, T, \quad (20)$$

where

$$A(f_s) = 2.0 + 20 \log_{10}(R_A(f_s)), \quad (21)$$

and

$$R_A(f_s) = \frac{12200^2 f_s^4}{(f_s^2 + 20.6^2)\sqrt{(f_s^2 + 107.7^2)(f_s^2 + 737.9^2)}\,(f_s^2 + 12200^2)}. \quad (22)$$

Having calculated this, we can proceed to compute the pitch class profiles as described above, using $L'_{s,t}$.
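Equations (21) and (22) translate directly into code; this sketch assumes f is given in Hz and vectorizes over the constant-Q bin center frequencies.

```python
import numpy as np

def a_weighting_db(f):
    """A-weighting gain A(f) in dB (eqs. 21-22)."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12200.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12200.0**2)
    )
    return 2.0 + 20.0 * np.log10(ra)
```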

Ni et al. normalize the loudness-based PCP vector after aggregation according to:

$$X_{p,t} = \frac{X'_{p,t} - \min_{p'} X'_{p',t}}{\max_{p'} X'_{p',t} - \min_{p'} X'_{p',t}}, \quad (23)$$

where $X'_{p,t}$ denotes the value for PCP bin p at time t. Ni et al. state that due to this normalization, specifying the reference sound power level $p_{\mathrm{ref}}$ is not necessary.

5.1.5 HMMs

In this section we give a brief overview of the hidden Markov model (HMM), as far as it is important for this thesis. It is a widely used model for speech as well as chord recognition.

A musical song is highly structured in time (certain chord sequences and transitions are more common than others), but PCP features do not take any time dependencies into account by themselves. A temporal alignment can increase the performance of a chord recognition system. Additionally, since we compute the PCP features from the amplitude of the signal alone, which is noisy with regard to chord information due to percussion, transient noise and other effects, the resulting feature vector is not clean. HMMs in turn are well suited to dealing with noisy data, which adds another argument for using HMMs for temporal smoothing.

Definition There exist several variants of HMMs. For our comparison system we restrict ourselves to an HMM with a single Gaussian emission distribution for each state. For the stacked denoising autoencoders we use the output of the autoencoders directly as a chord estimate and as emission probability. An HMM with a Gaussian emission probability is a so-called continuous-densities HMM. It is capable of interpreting multidimensional real-valued input such as the PCP vectors we use as features, described above in section 5.1.1.

An HMM estimates the probability of a sequence of latent states corresponding to a sequence of lower-level observations. As described by Rabiner (1989), an HMM can be defined as a 5-tuple consisting of:

1. N, the number of states in the model.

2. M, the number of distinct observations, which in the case of a continuous-densities HMM is infinite.

3. $A = \{a_{ij}\}$, the state transition probability distribution, where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, $1 \leq i, j \leq N$, and $q_t$ denotes the current state at time t. If the HMM is ergodic (i.e., all transitions to every state from every state are possible), then $a_{ij} > 0$ for all i and j. Transition probabilities satisfy the stochastic constraints $\sum_{j=1}^{N} a_{ij} = 1$ for $1 \leq i \leq N$.

4. $B = \{b_j(O)\}$, the set of observation probabilities, which in our case is infinite. $b_j(O_t) = P(O_t \mid q_t = S_j)$ is the observation probability in state j, $1 \leq j \leq N$, for observation $O_t$ at time t. If we assume a continuous-density HMM, i.e., we have real-valued, possibly multidimensional input, we can use a (mixture of) Gaussian distributions for the probability distribution $b_j(O)$: $b_j(O_t) = \sum_{m=1}^{M} Z_{jm}\, \mathcal{N}(O_t, \mu_{jm}, \Sigma_{jm})$, with $1 \leq j \leq N$. Here $O_t$ is the input vector at time t, $Z_{jm}$ the mixture weight (coefficient) for the m-th mixture in state j, and $\mathcal{N}(O, \mu_{jm}, \Sigma_{jm})$ the Gaussian probability density function, with mean vector $\mu_{jm}$ and covariance matrix $\Sigma_{jm}$ for state j and component m.

5. $\pi = \{\pi_i\}$, where $\pi_i = P(q_1 = S_i)$, with $1 \leq i \leq N$. These are the initial state probabilities.

Parameter Estimation We can define the states to be the 24 chord symbols and the non-chord symbol for the simple major-minor chord discrimination task, and 217 different symbols for the extended chord vocabulary, including major, minor, 7th and inverted chords and the non-chord symbol. The features in the case of the baseline system are computed as a 12-bin PCP vector, with a single Gaussian as emission model for the HMM. In the case of the stacked denoising autoencoder systems, we can use the output of the networks directly as emission probabilities.

Since we are dealing with a fully annotated data set, it is trivial to estimate the initial state probabilities and the transitions by computing relative frequencies with the help of the supplied ground truth. In the case of the Gaussian emission model, we can estimate the parameters from training data with the EM algorithm (McLachlan et al., 2004).
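For the relative-frequency part, a minimal sketch follows; representing each annotated song as a sequence of integer state indices and the additive smoothing constant are assumptions made here for illustration.

```python
import numpy as np

def estimate_hmm_params(label_seqs, n_states, smoothing=1.0):
    """Initial-state and transition probabilities by relative frequency.

    label_seqs: iterable of annotated songs, each a list of state indices.
    """
    pi = np.full(n_states, smoothing)
    A = np.full((n_states, n_states), smoothing)
    for seq in label_seqs:
        pi[seq[0]] += 1.0
        for prev, cur in zip(seq[:-1], seq[1:]):
            A[prev, cur] += 1.0
    return pi / pi.sum(), A / A.sum(axis=1, keepdims=True)
```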

Likelihood of a Sequence To compute the likelihood that given observations belong to a certain chord sequence, we can compute the following:

$$P(q_1, q_2, \ldots, q_T, O_1, O_2, \ldots, O_T \mid \lambda) = \pi_{q_1} b_{q_1}(O_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(O_t), \quad (24)$$

where $\pi_{q_1}$ is the initial state probability for the state at time 1, $b_{q_1}(O_1)$ the emission probability for the first observation, $a_{q_{t-1} q_t}$ the transition probability from the state at time t−1 to the state at time t, and $b_{q_t}(O_t)$ the emission probability of observation $O_t$ at time t. λ denotes the parameters of our HMM. The most likely sequence of hidden states for given observations can be computed efficiently with the help of the Viterbi algorithm (see Rabiner, 1989, for details).
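A log-space sketch of the Viterbi decoder used for this smoothing step; the (T, N) matrix of log emission scores is an assumed input format.

```python
import numpy as np

def viterbi(log_emissions, log_A, log_pi):
    """Most likely state sequence given log emission, transition and
    initial-state scores. log_emissions: (T, N); log_A: (N, N); log_pi: (N,).
    """
    T, N = log_emissions.shape
    delta = log_pi + log_emissions[0]          # best score ending in each state
    psi = np.zeros((T, N), dtype=int)          # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[i, j]: prev state i -> state j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emissions[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # backtrack along the stored pointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```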

5.2 Stacked Denoising Autoencoders for Chord Recognition

A piece of music contains additional non-harmonic information, or harmonic information which does not directly contribute to the chord played at a certain time in the song. This can be considered as noise for the objective of estimating the correct chord progressions of a song. Since stacked denoising autoencoders are trained to reduce artificially added noise, they seem to be a suitable choice for application to noisy data, and have been shown to achieve state-of-the-art performance on several benchmark tests (including audio genre classification) (Vincent et al., 2010). Moreover, deep learning architectures can be partly trained in an unsupervised manner, which might prove to be useful for a field like chord recognition, since there is a huge amount of unlabeled digitized musical data available, but only a very limited fraction of it is annotated.

In this section I describe two systems relying on stacked denoising autoencoders for chord recognition. The preprocessing of the input data follows the same basic steps for the two stacked denoising autoencoder approaches, described in section 5.2.1. All approaches make use of an HMM to smooth and interpret the neural network output as a post-classification step. Since the chord ground truth is given, we are also able to calculate a “perfect” PCP and train stacked denoising autoencoders to approximate the former from given FFT input. A description of how to apply a joint optimization procedure for the HMM and neural network for chord recognition, taken from speech recognition, is given in appendix A (this did not yield any further improvements, however). Furthermore, it is possible to train a stacked denoising autoencoder to model chord probabilities directly, which are then smoothed by an HMM, as described in section 5.2.2. Hereafter I propose an extension to this approach by extending the input of the stacked denoising autoencoders to cover multiple resolutions, smoothed over different time spans, in section 5.2.3.

5.2.1 Preprocessing of Features for Stacked Denoising Autoencoders

In all approaches described below, we apply the stacked denoising autoencoders directly to the Fourier-transformed signal. This minimizes the preprocessing steps and the restrictions imposed by them, but some preprocessing of the input can still increase the performance.

1. To restrict the search space, only the first 1500 FFT bins are used. This restricts the frequency range to approximately 0 to 3000 Hz. Most of the frequencies emitted by harmonic instruments are still contained in this interval.

2. Since values taken directly from the FFT contain high-energy peaks, we apply a square-root compression, as done by Boulanger-Lewandowski et al. (2013) for deep belief networks.

3. We then normalize the FFT frames according to the L2 norm in a final preprocessing step (see the sketch below).
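Taken together, the three steps amount to a few lines; the small constant guarding against division by zero is an assumption added here.

```python
import numpy as np

def preprocess_fft_frame(frame_fft):
    """Keep the first 1500 FFT bins, apply square-root compression,
    and L2-normalize, as described in the list above."""
    v = np.sqrt(np.abs(frame_fft[:1500]))
    return v / (np.linalg.norm(v) + 1e-12)
```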

5.2.2 Stacked Denoising Autoencoders for Chord Recognition

Figure 5: Stacked denoising autoencoder for chord recognition, single resolution (pipeline: FFT input for one time frame → single-frame preprocessing → SDAE → chord symbols).

Humphrey et al. (2012) state that the performance of chord recognition systems has not improved significantly in recent years, and suggest that one reason could be the widespread usage of PCP features. They try to find a different representation by modelling a Tonnetz using convolutional neural networks. Cho and Bello (2014), who evaluate the influence of different parts of chord recognition systems on performance, also come to the conclusion that the choice of feature computation has a great influence on the overall performance and suggest the exploration of other types of features differing from the PCP. A nice property of deep learning approaches is that they are often able to find a higher-level representation of the input data by themselves and do not rely on predefined feature computation.

When classifying data, we can train a neural network to output pseudo-probabilities for each class given an input. This is done through a final logistic regression (or softmax) layer on the output of the neural network. We use a softmax output and a 1-of-K encoding, such that we have K outputs, each of which can be interpreted as the probability of a certain chord being played. Thus we can use the output of a neural network with a 1-of-K encoded softmax output layer directly as a substitute for the emission probability of the HMM, and further process it with temporal smoothing to compute a final chord symbol output.
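Combining this with the Viterbi sketch from section 5.1.5, decoding a song might look as follows; net_forward, a function returning the K softmax outputs for one preprocessed frame, is a hypothetical stand-in for the trained network.

```python
import numpy as np

def decode_song(frames, net_forward, log_A, log_pi):
    """Use frame-wise softmax chord posteriors as HMM emission scores
    and smooth them with Viterbi decoding (see the viterbi sketch above)."""
    probs = np.stack([net_forward(f) for f in frames])   # (T, K) network outputs
    return viterbi(np.log(probs + 1e-12), log_A, log_pi)
```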

Since deep learning provides us with a powerful strategy for neural network training, we are able to discard all steps of the conventional PCP vector computation, and the restrictions that might be imposed by them (apart from the FFT), and train the network to classify chords. This differs from previous approaches like Boulanger-Lewandowski et al. (2013) and Glazyrin (2013), who use deep learning techniques but still model PCPs either as an intermediate target or as the output of the neural network. Figure 5 depicts the processing pipeline of the system. This system, with a single input frame, is referred to as the stacked denoising autoencoder (SDAE).

5.2.3 Multi-Resolution Input for Stacked Denoising Autoencoders

Figure 6: Stacked denoising autoencoder for chord recognition, multi-resolution (pipeline: FFT input over multiple time frames → single frame plus two median-filtered versions → preprocessing → frame concatenation → SDAE → chord symbols).

Glazyrin (2013), who uses stacked denoising autoencoders (with and without recurrent layers) to estimate PCP vectors from the constant-Q transform, states that he suspects it to be beneficial to take multiple subsequent frames into account, but also writes that informal experiments did not show any improvements in recognition performance. Boulanger-Lewandowski et al. (2013) also make use of a recurrent layer with a deep belief network to take temporal information into account before additional (HMM) smoothing. Both approaches thus reason that it might be beneficial to take temporal information into account before using an HMM as a final computation step.

We can find a similar paradigm in Tang and Mohamed (2012), used with deep learning. They propose a system in which images of faces are analyzed by a deep belief network. In addition to the original image, they propose extending the input with differently subsampled versions of the image for face recognition, and report improved performance over a single-resolution input. They also report improved performance when extending the classifier input to several inputs with different subsampling ranges for phone recognition and temporal smoothing with deep belief networks on the TIMIT dataset.

The system proposed in this thesis is designed to take additional temporal information into account before the HMM post-processing as well. Following the intuition of Glazyrin and the idea of Tang and Mohamed, we extend the input of the stacked denoising autoencoder, computing two different time resolutions of the FFT and concatenating them with the original input of the stacked denoising autoencoders. In addition to the original FFT vector, we apply a median filter over different ranges of subsequent frames around the current frame. After median filtering, each vector is preprocessed as indicated in section 5.2.1. Hereafter we join the resulting vectors and use them as frame-wise input for the stacked denoising autoencoders.
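A sketch of this input construction, reusing preprocess_fft_frame from section 5.2.1; the spans of ±3 and ±9 frames are the values used later in section 6.3, and fft_frames is assumed to be a (T, n_bins) array of FFT magnitude frames.

```python
import numpy as np

def multires_input(fft_frames, t, spans=(3, 9)):
    """Concatenate the current FFT frame with median-filtered versions
    over +/-3 and +/-9 surrounding frames, each preprocessed separately."""
    parts = [preprocess_fft_frame(fft_frames[t])]
    for w in spans:
        lo, hi = max(0, t - w), min(len(fft_frames), t + w + 1)
        med = np.median(fft_frames[lo:hi], axis=0)   # median over the window
        parts.append(preprocess_fft_frame(med))
    return np.concatenate(parts)
```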

Cho and Bello (2014) conduct experiments to evaluate the influence on performance of the most prevalent constituents of chord recognition systems. They find that pre-smoothing has a significant impact on chord recognition performance in their experiments. They state that through filtering we can eliminate or reduce transient noise, which is generated by short bursts of energy such as from percussive instruments, although this has the disadvantage of also “smearing” chord boundaries. In the proposed system, however, we supply both the original input, in which the chord boundaries are “sharp” but which contains transient noise, and a version that is smoothed.

Cho and Bello (2014) compare average filtering and median filtering and find that there is little to no difference in terms of recognition performance. We use a median filter instead of an average filter since it is a prevalent approach in chord recognition. Median filters are applied in several other approaches, e.g., Peeters (2006) or Khadkevich and Omologo (2009b), to reduce transient noise. The stacked denoising autoencoders are again trained to output chord probabilities by fine-tuning with traditional backpropagation. In the following we refer to this as a multi-resolution stacked denoising autoencoder (MR-SDAE). Figure 6 illustrates the processing pipeline of the MR-SDAE.

6 Results

Finding suitable training and testing sets for chord estimation is difficult because transcribing chords in songs requires a significant amount of training, even for humans. Only experts are able to transcribe chord progressions of songs accurately and in full detail.

Furthermore, most musical pieces are subject to strict copyright laws. This poses the problem that ground truth and audio data are delivered separately. Different recordings of the same song might not fit the available ground truth exactly, due to minor temporal deviations. There are, fortunately, tools to align ground truth data and audio files. For the following experiments, Dan Ellis' AUDFPRINT tool was used to align audio files with publicly available ground truth.6

We report results on two different datasets: a transcription of 180 Beatles songs, and the publicly available part of the McGill Billboard dataset, containing 740 songs. The Beatles dataset has been available for several years, and as other training data is scarce, many algorithms published in the MIREX challenge have been pretrained on this dataset. Because of the same scarcity of good data, the MIREX challenge has also used the Beatles dataset (with a small number of additional songs) to evaluate the performance of chord recognition algorithms, and thus the “official” results on the Beatles dataset might be biased. We report a cross-validation performance, in which we train the algorithm on a subset of the data and test it on the remaining unseen part. This we repeat ten times for different subsets of the dataset, and report the average performance and a 95% confidence interval. This is done to give an estimate of how the proposed methods might perform on unseen data. However, the Beatles dataset is composed by one group of musicians only, which itself might bias the results, since musical groups tend to have a certain style of playing music. Therefore we also conduct experiments on the Billboard dataset, which is not restricted to one group of musicians, but rather contains popular songs from the Billboard Hot 100 charts from 1958 to 1991. Additionally, the Billboard dataset contains more songs, thus providing us with more training examples. To compare the proposed methods to other methods, we use the training and testing set of the MIREX 2012 challenge, a subset of the McGill Billboard dataset that was unpublished before 2012 but is available now. Although there are more recent results on the Billboard dataset (MIREX 2013), the test set ground truth for that part of the dataset has not yet been released.

Deep learning neural network training was implemented with the help of Palm's deep learning MATLAB toolbox (Palm, 2012). HMM smoothing was realized with functions of Kevin Murphy's Bayes Net MATLAB toolbox.7 Computation of state-of-the-art features was done using Ni et al.'s code.8

In the following I first give an explanation of how we can measure the performance of the algorithms in section 6.2. Training algorithms to learn the set of all possible chords is infeasible at this point in time, due to the number of possible chords and the relative frequencies of chords appearing in the publicly available datasets. Certain chords appear in popular songs more frequently than others, and so we train the algorithms to recognize a set of these chord symbols containing only major and minor chords, which we call the restricted chord vocabulary, and a set of chords containing major, minor, 7th and inverted chords, which we call the extended chord vocabulary. In section 6.1, I describe how to interpret chords that are not part of these sets. Results are reported for both chord symbol subsets on the Beatles dataset in section 6.5 for the reference system, SDAEs and MR-SDAEs. Results for both subsets on the Billboard set are reported in section 6.6. The results of other algorithms submitted to MIREX 2013 for the Billboard test set used in this thesis are stated in section 7.5.

6 http://www.ee.columbia.edu/~dpwe/resources/matlab/audfprint/
7 https://github.com/bayesnet/bnt
8 https://patterns.enm.bris.ac.uk/hpa-software-package

6.1 Reduction of Chord Vocabulary

As described in section 2.2, the chords considered in this thesis consist of three or four notes with distinct interval relationships to the root note. We work with two chord symbol sets: the first contains only major and minor chords with three notes; the second is an extension of this chord symbol set that also contains 7th and inverted chords. For the Billboard dataset these two subsets are already supplied. For the Beatles dataset, we need to reduce the chords in the ground truth to match the chord symbol sets we want to recognize, since those are fully detailed transcriptions, which contain chord symbols not in our defined subsets. Some chords are an extension of other chords, e.g., C:maj7 can be seen as an extension of C:maj, since the former contains the same notes as the latter but for the additional fourth note with interval 7 above the root note C. We thus reduce all other chords in the ground truth according to the following set of rules (a short code sketch follows the list):

1. If the ground truth chord symbol is in the subset of chord symbols to be recognized, leave it unchanged.

2. If a subset of the chord's notes matches a chord symbol in the recognition set, replace the original ground truth symbol with that symbol from the recognition set (e.g., C:maj7 is mapped to C:maj for the restricted vocabulary).

3. If no subset of the chord's notes matches a symbol in the recognition set, denote it as the non-chord symbol (e.g., C:dim is mapped to the non-chord symbol).
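These rules amount to a subset test on interval sets; the interval patterns and the chord representation below are hypothetical simplifications covering only the restricted vocabulary.

```python
# Hypothetical interval patterns (semitones above the root) for the
# restricted vocabulary; a real mapping would also cover the extended set.
RECOGNITION_SET = {"maj": {0, 4, 7}, "min": {0, 3, 7}}

def reduce_chord(root, intervals):
    """Map a detailed chord (root, set of intervals) onto the restricted
    vocabulary following rules 1-3 above."""
    for name, pattern in RECOGNITION_SET.items():
        if pattern <= set(intervals):   # rules 1 and 2: (subset) match
            return f"{root}:{name}"
    return "N"                          # rule 3: non-chord symbol
```

For example, reduce_chord("C", {0, 4, 7, 11}) yields "C:maj", while reduce_chord("C", {0, 3, 6}) yields "N".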

6.2 Score Computation

The results reported use a method of measurement that has been proposed by Harte (2010) and Mauch (2010): the weighted chord symbol recall (WCSR). In the following, a description of how it is computed is provided.

6.2.1 Weighted Chord Symbol Recall

Since most chord recognition algorithms, including the ones proposed here, work on a discretized input space, but the ground truth is measured in continuous segments with start time, end time and a distinct chord symbol, we need a measure to estimate the performance of any proposed algorithm. This could be achieved by simply discretizing the ground truth according to the discretization of the estimation, and hereafter performing a frame-wise comparison. However, Harte (2010) and Mauch (2010) propose a more accurate measure. The frame-wise comparison can be enhanced by computing the relative overlap of matching chord segments between the continuous-time ground truth and the frame-wise estimation of chord symbols by the recognition system. This is called the chord symbol recall (CSR):

$$CSR = \frac{\sum_{S^A_i} \sum_{S^E_j} S^A_i \cap S^E_j}{\sum_{S^A_i} S^A_i}, \quad (25)$$

where $S^A_i$ is one segment of the hand-annotated ground truth, and $S^E_j$ one segment of the machine estimation.

The test set for musical chord recognition usually contains several songs, which each have a different length and contain a different number of chords. Thus we can extend the CSR to a corpus of songs if we sum the results for each song weighted by its length. This is the weighted chord symbol recall (WCSR), used for evaluating performance on a corpus containing several songs:

$$WCSR = \frac{\sum_{i=0}^{N} L_i\, CSR_i}{\sum_{i=0}^{N} L_i}, \quad (26)$$

where $L_i$ is the length of song i and $CSR_i$ the chord symbol recall between the machine estimation and the hand-annotated segments for song i.
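A direct implementation of both scores; representing segments as (start, end, label) tuples, with matching segments contributing their temporal overlap, is an assumption of this sketch.

```python
def csr(ann_segments, est_segments):
    """Chord symbol recall (eq. 25). Segments are (start, end, label)
    tuples; matching labels contribute their temporal overlap."""
    overlap = sum(
        max(0.0, min(a_end, e_end) - max(a_start, e_start))
        for a_start, a_end, a_label in ann_segments
        for e_start, e_end, e_label in est_segments
        if a_label == e_label
    )
    total = sum(a_end - a_start for a_start, a_end, _ in ann_segments)
    return overlap / total

def wcsr(songs):
    """Length-weighted CSR over a corpus (eq. 26); songs: (length, csr) pairs."""
    return sum(L * c for L, c in songs) / sum(L for L, _ in songs)
```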

6.3 Training Systems Setup

In the course of the experiments, the following parameters were found to be suitable. The stacked denoising autoencoders are trained with 30 iterations of unsupervised training with additive Gaussian noise, variance 0.2, and a fraction of corrupted inputs of 0.7. The autoencoders have 2 hidden layers with 800 hidden nodes each, with a sigmoid activation function; the output layer contains as many nodes as there are chord symbols. To enforce sparsity, an activation penalty weighting of β = 0.1 and a target activation of p = 0.05 is used. The dropout rate is set to 0.5, and batch training with batches of 100 samples is used. The learning rate is set to 1 and the momentum to 0.5. For the MR-SDAE, the previous and subsequent 3 frames for the second input vector, and the previous and subsequent 9 frames for the third input vector, are used.

Due to memory restrictions, only a subset of the frames of the complete training set is employed for training the stacked denoising autoencoder based systems. 10% of the training data is set apart for validation during training. Additionally, I extended Palm's deep-learning library with an early-stopping mechanism, which stops supervised training after the performance on the validation set has not improved for 20 iterations, or else after 500 iterations, to restrict computation time. It then returns the best-performing weight configuration according to the training validation.
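The early-stopping logic can be summarized as follows; train_epoch and validate are hypothetical callbacks standing in for one iteration of supervised backpropagation and for scoring on the held-out 10%.

```python
import copy

def train_with_early_stopping(train_epoch, validate, patience=20, max_iters=500):
    """Stop when validation performance has not improved for `patience`
    iterations (or after max_iters) and return the best weights seen.
    train_epoch() runs one training iteration and returns the weights;
    validate(w) returns a validation score (higher is better)."""
    best_score, best_weights, stale = float("-inf"), None, 0
    for _ in range(max_iters):
        weights = train_epoch()
        score = validate(weights)
        if score > best_score:
            best_score, best_weights, stale = score, copy.deepcopy(weights), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_weights
```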

For the comparison system, since not all chords of the extended chord vocabulary are included in all datasets, missing chords are substituted with the mean PCP vector of the training set. Malformed covariance matrices are corrected by adding a small amount of random noise.

6.4 Significance Testing

Similar to Mauch and Dixon (2010a), a Friedman multiple comparison test is used to test for significant differences in the performance of the proposed algorithms and the reference system. This tests the performance of the different algorithms on a song level, and thus differs from the WCSR, which takes the song length into account in the final score. The Friedman multiple comparison test measures the statistical significance of ranks, thus indicating whether an algorithm outperforms another with statistical significance on a song level, without regard to the WCSR of the songs in general. For the purpose of testing for statistical significance of performance, we select one fold of the cross validation on the Beatles dataset on which the performance is close to the mean, and one test run for the Billboard dataset which is close to the mean as well for the SDAE-based approaches. All plots for the post hoc multiple comparison Friedman test for significance show the mean rank and a 95% confidence interval in terms of ranks.

6.5 Beatles Dataset

The Beatles Isophonics dataset9 contains songs by the Beatles and Zweieck. We only use the Beatles songs for evaluating the performance of the algorithms, since it is difficult to come by the audio data of the Zweieck songs. The Beatles-only subset of this dataset consists of 180 songs.

In section 6.5.1 and section 6.5.2, the results for the restricted and extended chord vocabulary, for the comparison system, SDAE and MR-SDAE, are reported. The cross-validation performance across ten folds is shown. We partition the dataset into ten subsets, where we use one for testing and nine for training. For the first fold we use every tenth song from the Beatles dataset starting from the first, as ordered in the ground truth; for the second fold every tenth song starting from the second; etc. We train ten different models, one for each testing partition. Since we use an HMM smoothing step, for the neural network approaches we show “raw” results without HMM smoothing as well as the final performance of the systems with temporal smoothing. The reference system uses the HMM even for classification, and thus we only report a single final performance statistic. All results are reported as WCSR, as described above and used in the MIREX challenge. Since there are ten different results, one for testing on each partition, I report the average WCSR, as well as a 95% confidence interval of the aggregated results. To get an insight into the distribution of performance results, I also plot box-and-whisker diagrams. Finally, I perform Friedman multiple comparison tests for statistical significance across algorithms.

Since the implementation of the learning algorithms in MATLAB is memory-intensive, I subsample the training partitions for the SDAEs. For the SDAE, I use every 3rd frame for training, and for the MR-SDAE, every 4th frame, resulting in approximately 95 000 and 71 000 training samples for each fold.

6.5.1 Restricted Major-Minor Chord Vocabulary

Friedman multiple comparison tests Values are computed on fold five of the Beatles dataset, which yields a result close to the mean performance for all algorithms tested. In figure 7 the results of the post hoc Friedman multiple comparison tests for all systems, smoothed and unsmoothed, on the restricted chord vocabulary task are depicted. The algorithms showed significantly different performance, with p < 0.001.

9 http://isophonics.net/datasets

Figure 7: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Beatles dataset on the restricted chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Whisker Plot and Mean Performance In this section, results for the proposed algorithms, SDAE and MR-SDAE, and the reference system on the reduced major-minor chord symbol recognition task are presented. Figure 8 depicts a box-and-whisker diagram for the performance of the algorithms with and without temporal smoothing and the performance of the reference system. The upper and lower whiskers depict the maximum and minimum performance over all results of the ten-fold cross validation, while the upper and lower boundaries of the boxes represent the upper and lower quartiles. We can see the median of all runs as a dotted line inside the box. The average WCSR together with 95% confidence intervals over folds, before and after temporal smoothing, can be found in table 3.

Figure 8: Results for the simplified HPA, SDAE and MR-SDAE for the restricted chord vocabulary 10-fold cross-validation on the Beatles dataset with and without HMM smoothing (WCSR in %). Results after smoothing are highlighted in bold.

System      Not smoothed     Smoothed
S-HPA       –                68.92 ± 9.32
SDAE        65.40 ± 6.94     69.69 ± 7.41
MR-SDAE     67.13 ± 7.06     70.05 ± 7.92

Table 3: Average WCSR for the restricted chord vocabulary on the Beatles dataset, smoothed and unsmoothed, with 95% confidence intervals.

Summary In the Friedman multiple comparison test in figure 7, we observe that the mean ranks of the post-smoothing SDAE and MR-SDAE are significantly higher than the mean rank of the reference system (S-HPA), and also that smoothing significantly improves the performance. Mean ranks for SDAE and MR-SDAE without smoothing are lower than that of the reference system, however not significantly. The MR-SDAE has a slightly higher mean rank compared to the SDAE, but not significantly.

In figure 8 we can observe that the pre- and post-smoothing SDAE and MR-SDAE distributions are negatively skewed. The S-HPA, however, is skewed positively. The skewness of the distributions does not change much for the SDAE and MR-SDAE before and after smoothing; smoothing does, however, improve the performance in general.

In table 3, we can see that the mean performance of the MR-SDAE slightly exceeds that of the SDAE, and that both achieve a higher mean performance compared to the reference system after HMM smoothing. The means before HMM smoothing for SDAE and MR-SDAE are lower, however.

6.5.2 Extended Chord Vocabulary

Friedman Multiple Comparison Tests Again, values are computed for fold five of the Beatles dataset. In figure 9 the results of the post hoc Friedman multiple comparison tests for all systems, smoothed and unsmoothed, on the extended chord vocabulary task are depicted. The algorithms showed significantly different performance, with p < 0.001.

Figure 9: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Beatles dataset on the extended chord vocabulary for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Whisker Plots and Means Similar to above, we depict box-and-whisker diagrams for the unsmoothed and smoothed results of the ten-fold cross validation for the extended chord symbol set in figure 10. Table 4 depicts the average WCSR and 95% confidence interval over folds for smoothed and unsmoothed results.

Figure 10: Whisker plot for simplified HPA, SDAE, and MR-SDAE using the extended chord vocabulary and 10-fold cross-validation on the Beatles dataset, with and without smoothing (WCSR in %). Results after smoothing are highlighted in bold.

System      Not smoothed     Smoothed
S-HPA       –                48.54 ± 7.89
SDAE        55.93 ± 7.12     59.73 ± 7.37
MR-SDAE     57.52 ± 6.52     60.02 ± 6.81

Table 4: Average WCSR for simplified HPA, SDAE and MR-SDAE using the extended chord vocabulary on the Beatles dataset, smoothed and unsmoothed, with 95% confidence intervals.

Summary In the Friedman multiple comparison tests in figure 9, we can observe that again the post-smoothing performance in terms of ranks of the SDAE and MR-SDAE is significantly better than that of the reference system. In comparison to the restricted chord vocabulary recognition task, the margin is even larger. A peculiar thing to note is that with the extended chord vocabulary, the pre-smoothing performance of the MR-SDAE is not significantly worse than the post-smoothing performance of both SDAE-based chord recognition systems. The SDAE shows lower mean ranks before smoothing than the reference system, and the MR-SDAE seems to perform slightly better than the reference system, although not significantly so, before smoothing.

In figure 10, we can see similarly negatively skewed distributions of cross-validation results for SDAE and MR-SDAE as in the restricted chord vocabulary setting. Again we can observe that the skewness of the distributions does not change much after smoothing, but we can observe an increase in performance.

However, in the extended chord vocabulary task, the medians of the SDAE and MR-SDAE are higher than that of the reference system, showing values even higher than the best performance of the reference system. The reference system on the extended chord vocabulary does not show a distinct skew.

The better performance is also reflected in table 4, where the proposed systems achieve higher means before and after HMM smoothing compared to the reference system.

6.6 Billboard Dataset

The McGill Billboard dataset10 consists of songs randomly sampled from the Billboard Hot 100 charts from 1958 to 1991. This dataset currently contains 740 songs, of which we separate 160 songs for testing and use the remaining ones for training the algorithms. The selected test set corresponds to the official test set of the MIREX 2012 challenge. Although there are results for algorithms in the MIREX 2013 challenge on the Billboard dataset, the ground truth of that specific test set has not been publicly released at this point in time.

Similar to the Beatles dataset, the audio files are not publicly available, but there are several different audio recordings for the songs in the dataset. We again use Dan Ellis' AUDFPRINT tool to align audio data with the ground truth. For this dataset the ground truth is already available in the right format for the restricted major-minor chord vocabulary and the extended 7th and inverted chord vocabulary, thus we do not need to reduce the chords ourselves.

Since the Billboard dataset is much larger than the Beatles dataset, we sample every 8th frame for the SDAE training and every 16th for the MR-SDAE, resulting in approximately 170 000 and 85 000 frames respectively. The algorithms were run five times.

6.6.1 Restricted Major-Minor Chord Vocabulary

Friedman Multiple Comparison Tests In figure 11 the results of the post hoc Friedman multiple comparison test for the Billboard restricted chord vocabulary task, for the reference system and the smoothed and unsmoothed SDAE and MR-SDAE, are presented. The algorithms showed significantly different performance, with p < 0.001.

10 http://ddmal.music.mcgill.ca/billboard

Figure 11: Mean and 95% confidence interval for post hoc Friedman multiple comparison tests for the Billboard dataset on the restricted chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).

Mean Performance In this section, results for the MIREX 2012 test partition of the Billboard dataset for the restricted major-minor chord vocabulary are depicted. Table 5 shows the results for the performance of the SDAEs with and without smoothing. Since we do not perform a cross validation on this dataset and the comparison system does not have any randomized initialization, we report the 95% confidence interval for the SDAEs only, with respect to multiple random initializations (note that these are not directly comparable to the confidence intervals over cross-validation folds as reported for the Beatles dataset).

System     Not smoothed    Smoothed
S-HPA      –               66.04
SDAE       61.19 ± 0.32    66.35 ± 0.31
MR-SDAE    62.97 ± 0.16    66.46 ± 0.40

Table 5: Average WCSR for the restricted chord vocabulary on the MIREX 2012 Billboard test set, smoothed and unsmoothed, with 95% confidence intervals where applicable.
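Such an interval over a handful of runs can be obtained with a Student-t construction over the per-run WCSR values; the following is a sketch of one standard construction, not necessarily the exact one used here:

    import numpy as np
    from scipy import stats

    def mean_with_ci(runs, confidence=0.95):
        # Mean and half-width of a t-based confidence interval over runs.
        runs = np.asarray(runs, dtype=float)
        half = stats.sem(runs) * stats.t.ppf((1 + confidence) / 2,
                                             df=len(runs) - 1)
        return runs.mean(), half  # e.g. five runs -> "61.19 +/- 0.32"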


Summary Figure 11, depicting the Friedman multiple comparison test for significance, reveals that in the Billboard restricted chord vocabulary task the reference system does not perform significantly worse than the post-smoothing SDAE and MR-SDAE. It is also notable that in this setting the pre-smoothing MR-SDAE significantly outperforms the pre-smoothing SDAE.

Similar to the restricted chord vocabulary task on the Beatles test, the means before smoothing on the Billboard dataset are lower than those of the reference system. However, we can still observe a better pre-smoothing mean performance for MR-SDAE in comparison with SDAE. Comparing mean performance after HMM smoothing, we see no significant differences.

6.6.2 Extended Chord Vocabulary

Friedman Multiple Comparison Tests In figure 12 the results of the post hoc Friedman multiple comparison test for the Billboard extended chord vocabulary task are presented for the reference system and the smoothed and unsmoothed SDAE and MR-SDAE. The algorithms showed significantly different performance, with p < 0.001.

Figure 12: Mean and 95% confidence intervals for post hoc Friedman multiple comparison tests for the Billboard dataset on the extended chord vocabulary, for the comparison system (S-HPA), SDAE, and MR-SDAE, before HMM smoothing (normal weight) and after (highlighted in bold).


Mean Performance Table 6 depicts the performance of the reference system and the SDAEs on the extended chord vocabulary containing major, minor, 7th and inverted chord symbols. Again no confidence interval is reported for the reference system, since it has no random component and results are the same over multiple runs.

System     Not smoothed    Smoothed
S-HPA      –               46.44
SDAE       46.74 ± 0.19    50.23 ± 0.32
MR-SDAE    47.77 ± 0.49    50.81 ± 0.50

Table 6: Average WCSR for the extended chord vocabulary on the MIREX 2012 Billboard test set, smoothed and unsmoothed, with 95% confidence intervals where applicable.

Summary The Friedman multiple comparison test in figure 12 again shows significantly better performance for the post-smoothing SDAE systems in comparison to their pre-smoothing performance, and also to the reference system. MR-SDAE again seems to achieve a higher mean rank than SDAE; however, this difference is not statistically significant.

In terms of mean performance in WCSR, depicted in table 6, the pre-smoothing figures for SDAE and MR-SDAE are higher than those of the reference system. Again MR-SDAE outperforms SDAE in mean WCSR. The same holds after smoothing: MR-SDAE outperforms SDAE slightly, and both perform better than the reference system.

6.7 Weights

In this section we visualize the weights of the input layer of the neural network trained on the Beatles dataset. Figure 13 shows an excerpt of the input layer, with weights depicted as a grayscale image, where black denotes negative weights and white corresponds to positive weights.

In figure 14 the sum of absolute values over all weights for each FFT input is plotted. The vertical lines mark FFT bins that correspond to musically important frequencies, i.e., musical notes.
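Both figures can be produced directly from the first-layer weight matrix; the following Matplotlib sketch illustrates this, assuming W holds the weights with one row per hidden node (illustrative, not the plotting code used for the thesis):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_input_weights(W: np.ndarray):
        fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(8, 6))
        # Excerpt of the weight matrix as a grayscale image (cf. figure 13):
        # low (negative) weights appear black, high (positive) weights white.
        ax0.imshow(W[:60], cmap="gray", aspect="auto")
        ax0.set_xlabel("weights for inputs")
        ax0.set_ylabel("hidden nodes")
        # Sum of absolute weights per FFT bin (cf. figure 14).
        ax1.plot(np.abs(W).sum(axis=0))
        ax1.set_xlabel("input (FFT bin)")
        ax1.set_ylabel("sum of absolute weights")
        plt.tight_layout()
        plt.show()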


Figure 13: Excerpt of the weights of the input layer. Black denotes negative weights, and white positive.

Figure 14: Sum of absolute values for each input of the trained neural network. Vertical gray lines indicate bins of the FFT that correspond to musically relevant frequencies.


7 Discussion

7.1 Performance on the Different Datasets

All algorithms tested seem to perform better on the Beatles dataset. This seems counter-intuitive, given that the Billboard dataset contains about four times more songs, and thus we would expect the algorithms to find a better estimate of the data. A possible explanation could be that it is easier to learn chords from a single artist, or group of artists, either due to a preference for certain chords or a preference for certain types of instruments. Furthermore, the distribution of chords is not uniform: certain chord types or even distinct chord symbols are more common than others. The Billboard dataset contains about 29% extended chords (7th and inverted chords), compared to 15% for the Beatles dataset. These chords are more difficult to model, which is one explanation for the difference in performance on the extended chord vocabulary recognition task between the two datasets.

Usually any given song contains only a very limited number of chords that are used repeatedly, which limits the amount of useful information that can be extracted from training data. For a small dataset, the distribution of chords might not be the same in training and test sets, which might also explain the high variance we observe when performing cross-validation on the Beatles dataset in tables 3 and 4.

7.2 SDAE

In this section I evaluate the performance of the SDAE in comparison with the simplified HPA reference system. First an examination of the final performance is given, followed by the distribution of results over the different folds of the cross-validation test. A description of the effects of HMM smoothing follows, then an analysis of the “raw” performance without smoothing, and concluding remarks.

Post-Smoothing In the experiments the smoothed SDAE significantly outperforms the reference system with state-of-the-art features on the extended chord vocabulary task. This is true for both the Beatles and the Billboard datasets. These results are important, since bigger chord vocabularies are the direction in which the field of chord recognition is moving. On the restricted chord vocabulary the smoothed SDAE shows at least comparable performance, significantly outperforming the reference system on the Beatles dataset and performing not significantly worse on the Billboard dataset.

Distribution for Cross-Validation We can also observe in figures 8 and 10 that the distribution of cross-validation results is negatively skewed in both the restricted and extended chord vocabulary tests. Results thus cluster towards higher-than-average performance. The reference system on the restricted chord vocabulary behaves inversely: it is positively skewed, and its upper outliers perform better than those of the SDAE. The whisker plots for the extended chord vocabulary show no apparent skewness for the reference system. However, here the lower outliers of the unsmoothed SDAE perform similarly to the median of the reference system. The superiority of the stacked denoising autoencoders is thus reflected here as well. In all cases smoothing increases the performance of the SDAE, but does not change the skewness of the distributions much.

Effects of Smoothing We can observe that temporal smoothing increases the performance in terms of ranks and mean WCSR in all cases. However, HMM smoothing is not beneficial for every song in the test sets: there are cases in which it leads to a decrease in WCSR in comparison to the “raw” SDAE estimation. This indicates that an HMM may not constitute a perfect temporal model of chords, and supports the findings of Boulanger-Lewandowski et al. (2013), who propose a dynamic programming extension to beam search for temporal smoothing instead of Viterbi decoding. This might have to do with the HMM modelling the temporal duration of chords symmetrically for the frame-wise feature computation method used in this thesis, which does not hold for real chords, since they usually last a certain amount of time (according to note length).
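For reference, the Viterbi decoding discussed here amounts to the following dynamic program over frame-wise chord scores; this is a simplified, self-contained sketch in log space (a full hybrid system would typically also divide the network outputs by the chord priors):

    import numpy as np

    def viterbi(log_emit, log_trans, log_init):
        # log_emit: (T, n_chords) frame-wise log chord scores from the network,
        # log_trans: (n_chords, n_chords) log transition matrix,
        # log_init: (n_chords,) log initial-state probabilities.
        T, n = log_emit.shape
        back = np.zeros((T, n), dtype=int)
        delta = log_init + log_emit[0]
        for t in range(1, T):
            scores = delta[:, None] + log_trans  # score of every predecessor
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_emit[t]
        path = np.empty(T, dtype=int)
        path[-1] = delta.argmax()
        for t in range(T - 2, -1, -1):           # backtrace the best path
            path[t] = back[t + 1, path[t + 1]]
        return path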

Pre-Smoothing It is notable that the proposed SDAE system shows comparable performance even without temporal HMM smoothing on the extended chord vocabulary test. On the restricted chord vocabulary test on the Beatles dataset, the reference system yields a higher mean WCSR; however, it does not significantly outperform the unsmoothed SDAE. In the restricted chord vocabulary test on the Billboard dataset, on the other hand, the reference system outperforms the unsmoothed SDAE significantly and yields a higher WCSR as well.

Concluding Remarks These experiments show that stacked denoising autoencoders can be applied directly to the FFT, extending Glazyrin's (2013) approach from the constant-Q transform, yielding performance comparable to a system with a conventional framework interpreting state-of-the-art features, or, in the case of the extended chord vocabulary, even significantly better performance.

Unlike Glazyrin (2013), Boulanger-Lewandowski et al. (2013) and Humphrey et al. (2012), we do not try to model PCPs or a Tonnetz as an intermediate target. The only restriction imposed on the system is the preprocessing of the FFT data; otherwise there are no restrictions on the computation of the features, in an attempt to circumvent a possible “glass ceiling” as suggested by Humphrey et al. (2012) and Cho and Bello (2014).

The mean performance increases in WCSR in comparison to the simplified HPA system are about 1.12% and 0.47% (0.77 and 0.31 percentage points) for the restricted chord vocabulary. For the extended chord vocabulary, however, the improvement is 23.05% and 8.16% (11.19 and 3.79 percentage points) for the Beatles and Billboard datasets, respectively.

Although we can already observe a better mean performance, one could expect a further performance increase if it were possible to use more training data, which could be realized with a memory-optimized implementation. In addition, removing the maximum number of training iterations while keeping the early stopping mechanism would also likely improve the performance at least slightly.


7.3 MR-SDAE

In the following I compare the MR-SDAE with the SDAE and the reference system. As for the SDAE above, first a description of the final performance is given, followed by a comparison of the distribution of cross-validation results on the Beatles dataset between SDAE and MR-SDAE. An examination of the pre-smoothing performance, the effects of smoothing on the performance, and concluding remarks follow.

Post-Smoothing Similar to the smoothed SDAE, the smoothed MR-SDAE outperforms the reference system in all cases except the Billboard restricted chord vocabulary test. Even on that test, it does not perform significantly worse, and still shows slightly better performance in terms of mean WCSR. Comparing the smoothed SDAE to the smoothed MR-SDAE, we always observe a slightly higher mean WCSR, but no statistical significance in the Friedman multiple comparison test.

Distribution of Cross-Validation Performance SDAE and MR-SDAE seem to have a similar distribution of results over the different folds of the cross-validation tests on the Beatles dataset. However, the MR-SDAE seems to be a little more negatively skewed in comparison with the SDAE, especially in the pre-smoothed case.

Pre-Smoothing In comparison to the unsmoothed SDAE, the unsmoothed MR-SDAE performs better in terms of WCSR; however, this is only statistically significant for the restricted chord vocabulary on the Billboard dataset.

Effects of Smoothing In spite of performing median smoothing on two different temporal levels in the multi-resolution case, HMM smoothing still increases the mean performance and thus seems beneficial. However, an improved recognition rate before smoothing does not necessarily yield the same improvement in WCSR after smoothing. The improvements over the SDAE after smoothing are diminished compared to the pre-smoothing increase in WCSR, similar to what Cho and Bello (2014) observed for a single input with median smoothing. No post-smoothing improvements over the SDAE are significant in the Friedman multiple comparison test. As is the case with SDAE, MR-SDAE performance on some songs is diminished by temporal smoothing.

Concluding Remarks These experiments show that taking temporal information into account, as suggested (but not shown) by Glazyrin (2013), through adding more frames to the input, or by using a recurrent layer as done by Boulanger-Lewandowski et al. (2013), is beneficial to classification performance, at least before HMM smoothing in terms of mean WCSR. Nonetheless, using an unsmoothed version of the input together with median-smoothed ones, we find that post-classification HMM smoothing can increase the performance more than pre-classification smoothing, as Cho and Bello (2014) found for a single smoothed input in their experiments.


This also supports the findings of Peeters (2006), Khadkevich and Omologo (2009b), Mauch et al. (2008), and Cho and Bello (2014) that the use of median smoothing techniques before further processing or classification can be beneficial.
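A sketch of how such median-smoothed input versions can be constructed follows; the filter widths here are illustrative assumptions, not the exact temporal spans used in the experiments:

    import numpy as np
    from scipy.ndimage import median_filter

    def multi_resolution_input(frames: np.ndarray, t: int, widths=(5, 9)):
        # frames: (T, n_bins) preprocessed FFT frames. The unsmoothed frame
        # at time t is concatenated with versions median-filtered over
        # `widths` frames along the time axis only.
        parts = [frames[t]]
        for w in widths:
            parts.append(median_filter(frames, size=(w, 1))[t])
        return np.concatenate(parts)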

The “raw” classification performance of MR-SDAE measured by mean WCSR pre-smoothing increases by 2.65% and 2.91% (1.73 and 1.78 percentage points) for the restricted chord vocabulary, and by 2.84% and 2.20% (1.59 and 1.03 percentage points) for the extended chord vocabulary, on the Beatles and Billboard datasets respectively, compared to SDAE. After smoothing the performance is further increased by 0.52% (0.36 percentage points) for the Beatles restricted vocabulary and by 0.17% (0.11 percentage points) for the Billboard set. For the extended chord vocabulary the improvement is 0.49% and 1.15% (0.29 and 0.58 percentage points) for the Beatles and Billboard datasets, respectively.

However, it has to be noted that the MR-SDAE was trained on less training data in the experiments: for the Beatles dataset only every fourth frame instead of every third, and for the Billboard dataset only every 16th instead of every 8th frame, was used for training. Further improvements could be achieved by improving the implementation so that it can train on more frames of the dataset. It could also be evaluated whether the proposed system works as well with a more restricted input frequency range, which would in turn enable us to take more frames into account. Other smoothing techniques could also be evaluated, similar to the work of Boulanger-Lewandowski et al. (2013).

7.4 Weights

Figure 14 shows the sum of absolute weights for the input layer of the SDAEs; vertical lines highlight input bins that correspond to harmonically relevant frequencies. For MR-SDAE the input looks similar, replicated over the three parts with different temporal smoothing. The network seems to concentrate weight on musically relevant frequencies.

In figure 13, we can see that for some nodes certain frequencies have negative weights, which might indicate that some nodes of the network block certain frequencies; thus weights for musically relevant frequencies are not positively emphasized for all nodes. In addition, the sum of the absolute input weights seems to diminish for frequencies that correspond to higher tones, as can be seen in figure 14. This might be caused by the clustering of fundamental frequencies of pitches at the lower end of the spectrum and the decaying sound amplitude of overtones. This seems similar to the PCP vector computation of Ni et al. (2012), with decreasing sensitivity to higher frequencies according to human perception.

7.5 Extensions

Table 7 depicts results from the MIREX 2013 challenge11 on the test set used in this thesis, for both the restricted and extended chord vocabulary tasks that also form the basis for the experiments evaluated here. Since this specific Billboard test set was released before 2013, it is possible that some algorithms were handed in pretrained on this part of the Billboard data, although this is unlikely.

11http://www.music-ir.org/mirex/wiki/2013:Audio_Chord_Estimation_Results_Billboard_2012

Features of the reference system used in this thesis are based on the system NMSD2 by Ni et al. (2012), highlighted in bold. Algorithms are denoted by the first letters of the authors' names, and since it is possible to hand in several algorithms, a number is added to the algorithm identifier. Some submissions were handed in multiple times, one version to be trained by MIREX and another pretrained. For algorithms submitted both pretrained and untrained we show the version that was trained by MIREX; for algorithms submitted only pretrained we show the best-performing version.

The full version of the reference system performs about 15% (10 percentage points) better on the restricted chord vocabulary task and about 24% (12 percentage points) better on the extended task in comparison with the MR-SDAE, in terms of WCSR. This can be attributed to the extension of the system to take other musical structures into account, as described in section 2.3. It performs a beat estimation, which can align the temporal duration of the HMM states more closely to the temporal duration of chords. Another problem that arises when using a single HMM for modelling chord progressions is that it assumes conditional independence between chord states. This does not correspond to the chord progressions we find in the “real world”; thus, as described by Cho and Bello (2014), we have no guarantee that a single HMM can model chord progressions sufficiently accurately. However, by combining several HMMs (or by using a Bayesian network), we are able to break this conditional independence and take more structures of a musical piece into account. Ni et al. also use information about other musical structures, for example musical key estimation and bass note estimation, tracked by a multi-stream HMM, to improve chord recognition performance.

Despite the relatively worse performance of the systems proposed in this thesis, it would be possible to fully integrate the proposed methods into the system developed by Ni et al. (2012). It is also in principle possible to estimate other musical qualities such as beat and bass with the help of stacked denoising autoencoders, although it would have to be evaluated whether they could offer a comparable performance increase. Since it was shown that the SDAE system yielded better performance in terms of mean WCSR for chord recognition, and more importantly, significantly better performance on a rank comparison level, it is to be expected that once integrated, a system based on stacked denoising autoencoders could compete with or even outperform the system by Ni et al. (2012).


System      major/minor    major/minor/7th/inv
NMSD2       76             63
CB4**       76             63
KO2**       76             60
PP4         70             53
CF2         72             53
PP3         73             51
MR-SDAE*    66             51
SDAE*       66             50
NG1         71             50

Table 7: MIREX 2013 results on the Billboard train and test set used for evaluation in this thesis. * denotes systems proposed in this thesis. The system from which the feature computation of the comparison system was taken is highlighted in bold. Performance values are sorted by the extended chord vocabulary recognition task. Systems denoted with ** were trained specifically for the MIREX evaluation. All other algorithms were submitted pre-trained.


8 Conclusion

In this thesis I presented two deep learning approaches to chord recognition based on stacked denoising autoencoders. Both are evaluated against a basic HMM system with state-of-the-art PCP features.

The first system works on a truncated version of the FFT, applying square root compression and normalizing according to the L2 norm. It outputs chord probabilities directly. Recognition performance is enhanced with the help of an HMM that performs post-classification temporal smoothing. The second algorithm adds to the input two temporally-subsampled versions filtered with a median filter. Again it estimates the chord probabilities directly without any further restrictions on computation, and the post-classification chord probabilities are again smoothed in time by an HMM.
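The per-frame preprocessing amounts to the following minimal sketch (the epsilon guard is an added safety assumption):

    import numpy as np

    def preprocess_fft(frame: np.ndarray, eps: float = 1e-12) -> np.ndarray:
        # Square-root compression of the magnitudes, then L2 normalisation.
        compressed = np.sqrt(np.abs(frame))
        return compressed / (np.linalg.norm(compressed) + eps)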

The reference system, the stacked denoising autoencoder working on the FFT, and its extension with multiple-resolution input are tested extensively on the Beatles and Billboard datasets. Two chord vocabularies were used: a conventional major-minor vocabulary and an extended vocabulary containing 7th and inverted chords, as proposed in MIREX 2012. Results for the Beatles dataset are obtained with ten-fold cross-validation, while for the Billboard dataset the MIREX 2012 test and train split is used. Post hoc Friedman tests are performed to test for statistically significant differences in performance.

This thesis shows that the multi-resolution approach can lead to better results in mean WCSR before HMM smoothing, although these improvements seem to be smaller after HMM smoothing.

It is also shown that SDAE and MR-SDAE achieve comparable performance on the restricted major-minor chord recognition task and superior performance, both in mean WCSR and in the Friedman tests, on the extended chord vocabulary on both datasets compared to the reference system. It would be possible to fully integrate the SDAE or MR-SDAE into a system similar to that of Ni et al. (2012) to take more musical information into account, potentially outperforming state-of-the-art systems.


References

Bello, J. P. and Pickens, J. (2005). A robust mid-level representation for harmonic content in music signals. In Proceedings of the International Society for Music Information Retrieval Conference, pages 304–311.

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Global optimization of a neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks, 3(2):252–259.

Bonada, J. (2000). Automatic technique in frequency domain for near-lossless time-scale modification of audio. In Proceedings of the International Computer Music Conference, pages 396–399.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2013). Audio chord recognition with recurrent neural networks. In Proceedings of the International Society for Music Information Retrieval Conference, pages 335–340.

Brown, J. C. (1991). Calculation of a constant Q spectral transform. Journal of the Acoustical Society of America, page 425.

Brown, J. C. and Puckette, M. S. (1992). An efficient algorithm for the calculation of a constant Q transform. Journal of the Acoustical Society of America, 92(5):2698–2701.

Burgoyne, J. A., Kereliuk, C., Pugin, L., and Fujinaga, I. (2007). A cross-validated study of modelling strategies for automatic chord recognition in audio. In Proceedings of the International Society for Music Information Retrieval Conference, pages 251–254.

Burgoyne, J. A., Wild, J., and Fujinaga, I. (2011). An expert ground-truth set for audio chord recognition and music analysis. In Proceedings of the International Society for Music Information Retrieval Conference, pages 633–638.

Catteau, B., Martens, J.-P., and Leman, M. (2007). A probabilistic framework for audio-based tonal key and chord recognition. In Advances in Data Analysis, pages 637–644. Springer, Berlin Heidelberg.

Cemgil, A. T., Kappen, H. J., and Barber, D. (2006). A generative model for music transcription. IEEE Transactions on Audio, Speech, and Language Processing, 14(2):679–694.

Chen, R., Shen, W., Srinivasamurthy, A., and Chordia, P. (2012). Chord recognition using duration-explicit hidden Markov models. In Proceedings of the International Society for Music Information Retrieval Conference, pages 445–450.

Cheng, H.-T., Yang, Y.-H., Lin, Y.-C., Liao, I.-B., and Chen, H. H. (2008). Automatic chord recognition for music classification and retrieval. In Proceedings of the IEEE International Conference on Multimedia and Expo, pages 1505–1508.

Cho, T. and Bello, J. P. (2011). A feature smoothing method for chord recognition using recurrence plots. In Proceedings of the International Society for Music Information Retrieval Conference, pages 651–656.


Cho, T. and Bello, J. P. (2014). On the relative importance of individual components of chord recognition systems. IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(2):477–492.

Dixon, S., Mauch, M., and Anglade, A. (2011). Probabilistic and logic-based modelling of harmony. In Exploring Music Contents, pages 1–19. Springer Berlin Heidelberg.

Dressler, K. and Streich, S. (2007). Tuning frequency estimation using circular statistics. In Proceedings of the International Society for Music Information Retrieval Conference, pages 357–360.

Ellis, D. P. (2007). Beat tracking by dynamic programming. Journal of New Music Research, 36(1):51–60.

Fujishima, T. (1999). Realtime chord recognition of musical sound: a system using common LISP music. In Proceedings of the International Computer Music Conference, pages 464–467.

Glazyrin, N. (2013). Mid-level features for audio chord estimation using stacked denoising autoencoders. Russian Summer School in Information Retrieval. http://romip.ru/russiras/doc/2013_for_participant/russirysc2013_submission_13_1.pdf [Online; accessed 12-July-2014].

Glazyrin, N. and Klepinin, A. (2012). Chord recognition using Prewitt filter and self-similarity. In Proceedings of the Sound and Music Computing Conference, pages 480–485.

Gomez, E. (2006). Tonal description of polyphonic audio for music content processing. INFORMS Journal on Computing, 18(3):294–304.

Harris, F. J. (1978). On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings of the IEEE, 66(1):51–83.

Harte, C. (2010). Towards automatic extraction of harmony information from music signals. PhD thesis, University of London.

Harte, C. and Sandler, M. (2005). Automatic chord identification using a quantised chromagram. In Proceedings of the Audio Engineering Society Convention.

Harte, C., Sandler, M., and Gasser, M. (2006). Detecting harmonic change in musical audio. In Proceedings of the ACM Workshop on Audio and Music Computing Multimedia, pages 21–26. ACM.

Hinton, G. (2010). A practical guide to training restricted Boltzmann machines. Momentum, 9(1):926.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Humphrey, E. J. and Bello, J. P. (2012). Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of the International Conference on Machine Learning and Applications, volume 2, pages 357–362.


Humphrey, E. J., Cho, T., and Bello, J. P. (2012). Learning a robust Tonnetz-space transform for automatic chord recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 453–456.

Khadkevich, M. and Omologo, M. (2009a). Phase-change based tuning for automatic chord recognition. In Proceedings of the Digital Audio Effects Conference.

Khadkevich, M. and Omologo, M. (2009b). Use of hidden Markov models and factored language models for automatic chord recognition. In Proceedings of the International Society for Music Information Retrieval Conference, pages 561–566.

Lee, K. (2006). Automatic chord recognition from audio using enhanced pitch class profile. In Proceedings of the International Computer Music Conference, pages 306–313.

Mauch, M. (2010). Automatic chord transcription from audio using computational models of musical context. PhD thesis, Queen Mary University of London.

Mauch, M. and Dixon, S. (2010a). Approximate note transcription for the improved identification of difficult chords. In Proceedings of the International Society for Music Information Retrieval Conference, pages 135–140.

Mauch, M. and Dixon, S. (2010b). Simultaneous estimation of chords and musical context from audio. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1280–1289.

Mauch, M., Dixon, S., and Mary, Q. (2008). A discrete mixture model for chord labelling. In Proceedings of the International Society for Music Information Retrieval Conference, pages 45–50.

Mauch, M., Noland, K., and Dixon, S. (2009). Using musical structure to enhance automatic chord transcription. In Proceedings of the International Society for Music Information Retrieval Conference, pages 231–236.

McLachlan, G. J., Krishnan, T., and Ng, S. K. (2004). The EM algorithm. Technical report, Humboldt-Universität Berlin, Center for Applied Statistics and Economics (CASE).

Ni, Y., McVicar, M., Santos-Rodriguez, R., and De Bie, T. (2012). An end-to-end machine learning system for harmonic analysis of music. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1771–1783.

Ono, N., Miyamoto, K., Le Roux, J., Kameoka, H., and Sagayama, S. (2008). Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. In Proceedings of the European Signal Processing Conference, pages 240–244.

Osmalsky, J., Embrechts, J.-J., Van Droogenbroeck, M., and Pierard, S. (2012). Neural networks for musical chord recognition. In Journées d'informatique musicale.


Oudre, L., Grenier, Y., and Fevotte, C. (2009). Template-based chord recognition: Influence of the chord types. In Proceedings of the International Society for Music Information Retrieval Conference, pages 153–158.

Oudre, L., Grenier, Y., and Fevotte, C. (2011). Chord recognition by fitting rescaled chroma vectors to chord templates. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2222–2233.

Palm, R. B. (2012). Prediction as a candidate for learning deep hierarchical models of data. Master's thesis, Technical University of Denmark.

Papadopoulos, H. and Peeters, G. (2007). Large-scale study of chord estimation algorithms based on chroma representation and HMM. In Proceedings of the International Workshop on Content-Based Multimedia Indexing, pages 53–60.

Papadopoulos, H. and Peeters, G. (2008). Simultaneous estimation of chord progression and downbeats from an audio file. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 121–124.

Papadopoulos, H. and Peeters, G. (2011). Joint estimation of chords and downbeats from an audio signal. IEEE Transactions on Audio, Speech, and Language Processing, 19(1):138–152.

Pauws, S. (2004). Musical key extraction from audio. In Proceedings of the International Society for Music Information Retrieval Conference, pages 96–99.

Peeters, G. (2006). Musical key estimation of audio signal based on hidden Markov modeling of chroma vectors. In Proceedings of the International Conference on Digital Audio Effects, pages 127–131.

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

Reed, J., Ueda, Y., Siniscalchi, S. M., Uchiyama, Y., Sagayama, S., and Lee, C.-H. (2009). Minimum classification error training to improve isolated chord recognition. In Proceedings of the International Society for Music Information Retrieval Conference, pages 609–614.

Scholz, R., Vincent, E., and Bimbot, F. (2009). Robust modeling of musical chord sequences using probabilistic n-grams. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 53–56.

Sheh, A. and Ellis, D. P. (2003). Chord segmentation and recognition using EM-trained hidden Markov models. In Proceedings of the International Society for Music Information Retrieval Conference, pages 185–191.

Shepard, R. N. (1964). Circularity in judgments of relative pitch. Journal of the Acoustical Society of America, 36:2346.

Sikora, F. (2003). Neue Jazz-Harmonielehre. Schott Musik International, Mainz, 3rd edition.


Su, B. and Jeng, S.-K. (2001). Multi-timbre chord classification using wavelet transform and self-organized map neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 3377–3380.

Talbot-Smith, M. (2001). Audio Engineer's Reference Book. Taylor & Francis, Oxford.

Tang, Y. and Mohamed, A.-r. (2012). Multiresolution deep belief networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 1203–1211.

Ueda, Y., Uchiyama, Y., Nishimoto, T., Ono, N., and Sagayama, S. (2010). HMM-based approach for automatic chord detection using refined acoustic features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5518–5521.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the International Conference on Machine Learning, pages 1096–1103.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408.

Wakefield, G. H. (1999). Mathematical representation of joint time-chroma distributions. In Proceedings of SPIE's International Symposium on Optical Science, Engineering, and Instrumentation, pages 637–645.

Weil, J., Sikora, T., Durrieu, J.-L., and Richard, G. (2009). Automatic generation of lead sheets from polyphonic music signals. In Proceedings of the International Society for Music Information Retrieval Conference, pages 603–608.

Weller, A., Ellis, D., and Jebara, T. (2009). Structured prediction models for chord transcription of music audio. In Proceedings of the International Conference on Machine Learning and Applications, pages 590–595.

Zhang, X. and Gerhard, D. (2008). Chord recognition using instrument voicing constraints. In Proceedings of the International Society for Music Information Retrieval Conference, pages 33–38.


A Joint Optimization

In the following, we describe the application to chord recognition of a joint neural network-HMM optimization approach proposed by Bengio et al. (1992) for speech recognition. This was the original goal of this thesis, but it did not yield any improvement beyond the basic initialization of the components.

A.1 Basic System Outline

The system consists of two main components:

1. A continuous-density HMM, which estimates the temporal correlation of chord progressions and performs the final classification.

2. A neural network (initialized as a stacked denoising autoencoder) with softmax activation, which is trained to approximate the computation of “perfect” normalized PCPs computed from the ground truth chord symbols.

Both are trained separately at first: the neural network on precomputed training data, and the HMM on the basis of the neural network output for the emission probabilities and of the ground truth chord data for the estimation of the transition probabilities.

After this, a joint optimization based on the gradient of the HMM according to a global optimization criterion (maximum likelihood) is performed, and the neural network's weights are adjusted. In turn, the emission probabilities of the HMM are updated on the basis of the new neural network output for the training data, until the system no longer improves.

A.2 Gradient of the Hidden Markov Model

We define the emission probability b_t of the HMM as follows:

b_t = P(Y_t \mid S_t), \qquad (27)

the probability of emitting the neural network output Y_t in state S_t at time t, according to the state sequence determined by the training data.

The joint probability of state and observation sequence is defined as:

P(q_1, q_2, \ldots, q_T, O_1, O_2, \ldots, O_T \mid \lambda) = \pi_1 b_1 \prod_{t=2}^{T} b_t a_{t-1,t}, \qquad (28)

with \pi_1 being the initial state probability, b_t the emission probability as stated in equation (27), and a_{t-1,t} the transition probability from state S_{t-1} to S_t, where t indicates the time step; q_t is the state at time step t, \lambda are the parameters of the HMM, and O_t is the observation at time t.

We want to maximize the log-likelihood of the model according to the following optimization criterion:

C = \log\Bigl( \pi_1 b_1 \prod_{t=2}^{T} b_t a_{t-1,t} \Bigr), \qquad (29)


similar to Bengio et al. (1992). Since the transition probabilities are fixed by the provided ground truth, we take the partial derivative with respect to b_t, leaving us with:

\frac{\partial C}{\partial b_t} = \frac{\partial \log\bigl( \pi_1 b_1 \prod_{t=2}^{T} b_t a_{t-1,t} \bigr)}{\partial b_t} \qquad (30)

We rewrite the logarithm of the product as a sum of logarithms. Since the derivative with respect to b_t does not affect the initial state probability distribution, the transition probabilities, or the emission probabilities of the other states, these terms can be dropped, leaving us with:

\frac{\partial C}{\partial b_t} = \frac{\partial \log(b_t)}{\partial b_t} = \frac{1}{b_t} \qquad (31)

Since we are using a continuous-density HMM, the emission probability b_t can be represented as a mixture of Gaussians, as described by Bengio et al. (1992):

b_{i,t} = \sum_k \frac{Z_k}{\sqrt{(2\pi)^n |\Sigma_k|}} \exp\Bigl( -\frac{1}{2} (Y_t - \mu_k) \Sigma_k^{-1} (Y_t - \mu_k)^\top \Bigr), \qquad (32)

where n is the dimensionality of the network output Y_t, and Z_k, \mu_k and \Sigma_k are the gain (or mixture weight), mean and covariance matrix of Gaussian component k, respectively.
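Equation (32) corresponds directly to the following computation; this is an unoptimized sketch for clarity (a practical implementation would work in log space):

    import numpy as np

    def emission_prob(y, Z, mu, Sigma):
        # y: (n,) network output; Z: (K,) gains; mu: (K, n) means;
        # Sigma: (K, n, n) covariance matrices -- see equation (32).
        n = y.shape[0]
        b = 0.0
        for k in range(len(Z)):
            diff = y - mu[k]
            norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma[k]))
            b += Z[k] / norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma[k]) @ diff)
        return b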

A.3 Adjusting Neural Network Parameters

Since we aim to change the neural network parameters according to the HMM optimization gradient, we need to adjust them as described by Bengio et al. (1992).

Using the chain rule, we take the partial derivative of the optimization criterion C with respect to the neural network output Y_{j,t}, the jth component of the output at time t:

\frac{\partial C}{\partial Y_{j,t}} = \frac{\partial C}{\partial b_{i,t}} \frac{\partial b_{i,t}}{\partial Y_{j,t}}, \qquad (33)

where \partial b_{i,t} / \partial Y_{j,t} is obtained by differentiating equation (32), which can be written as follows:

\frac{\partial b_{i,t}}{\partial Y_{j,t}} = \sum_k \frac{Z_k}{\sqrt{(2\pi)^n |\Sigma_k|}} \Bigl( \sum_l d_{k,lj} (\mu_{k,l} - Y_{l,t}) \Bigr) \exp\Bigl( -\frac{1}{2} (Y_t - \mu_k) \Sigma_k^{-1} (Y_t - \mu_k)^\top \Bigr), \qquad (34)

where d_{k,lj} is the element (l, j) of the inverse covariance matrix \Sigma_k^{-1} of the kth Gaussian distribution, and \mu_{k,l} is the lth element of the kth Gaussian mean vector \mu_k.
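As a sketch, the gradient of the emission with respect to the network output can be computed from equation (34) as follows (vectorized over the output components; illustrative only):

    import numpy as np

    def emission_gradient(y, Z, mu, Sigma):
        # Gradient of equation (32) w.r.t. the network output y, as in (34).
        n = y.shape[0]
        grad = np.zeros(n)
        for k in range(len(Z)):
            inv = np.linalg.inv(Sigma[k])
            diff = y - mu[k]
            norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma[k]))
            weight = Z[k] / norm * np.exp(-0.5 * diff @ inv @ diff)
            grad += weight * (inv @ (mu[k] - y))  # sum_l d_{k,lj}(mu_kl - y_l)
        return grad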

A.4 Updating HMM Parameters

Rabiner (1989) provides the standard methods for updating continuous-density HMMs; we can update the gain Z_{jk} for state j and component k as follows:


Z'_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j,k)} \qquad (35)

The mean \mu_{jk} for state j and component k can be computed with:

\mu'_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k) \, Y_t}{\sum_{t=1}^{T} \gamma_t(j,k)}, \qquad (36)

where Y_t is the observation, i.e., the neural network output at time t. The covariance \Sigma_{jk} for state j and component k can be computed with:

\Sigma'_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k) (Y_t - \mu_{jk})(Y_t - \mu_{jk})^\top}{\sum_{t=1}^{T} \gamma_t(j,k)}, \qquad (37)

with \gamma_t(j,k) describing the probability of being in state j at time t with the kth Gaussian mixture component:

\gamma_t(j,k) = \delta_{tj} \frac{Z_{jk} \, \mathcal{N}(Y_t; \mu_{jk}, \Sigma_{jk})}{\sum_m Z_{jm} \, \mathcal{N}(Y_t; \mu_{jm}, \Sigma_{jm})}, \qquad (38)

where the term \delta_{tj} is 1 if j is equal to the state in our ground truth data at time t and 0 otherwise.

A.5 Neural Network

The neural network can be pretrained as a stacked denoising autoencoder directly on a preprocessed excerpt of the FFT, as described in section 5.2.1. To approximate “perfect” PCPs computed from the ground truth, we add an additional softmax output layer and finetune with backpropagation.

A.6 Hidden Markov Model

In the implemented system we try to estimate only major, minor and non-chords, leaving us with 25 possible symbols. The emission probabilities of each state in the HMM are modelled by a mixture of two Gaussians. The HMM in turn is trained on the output of the pretrained neural network, applying the expectation-maximization algorithm to estimate the parameters of the Gaussian mixtures.

A.7 Combined Training

For the joint optimization of the neural network and the HMM we iteratively adjust the neural network weights according to the HMM gradient for the global optimization criterion (as described above). After the neural network weights are adjusted, we update the HMM with the methods described in section A.4. After every alternating neural network weight adjustment and HMM update, a test is performed, and in theory the training is complete when the change in performance of the system falls below a previously specified threshold.
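Schematically, this combined training loop looks as follows; all callables are placeholders for the components described in sections A.3 and A.4, not a real API:

    def joint_training(forward, backprop, update_hmm, evaluate,
                       frames, chords, max_epochs=20, threshold=1e-3):
        # Alternate neural network weight adjustment and HMM re-estimation
        # until the validation performance stops changing.
        prev = evaluate()
        for _ in range(max_epochs):
            outputs = forward(frames)            # network outputs Y_t
            backprop(outputs, chords)            # gradient step, eqs. (31)-(34)
            update_hmm(forward(frames), chords)  # GMM updates, eqs. (35)-(37)
            score = evaluate()                   # e.g. WCSR on validation set
            if abs(score - prev) < threshold:
                break
            prev = score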

A.8 Joint Optimization

For testing purposes we train the initialized neural network-HMM hybrid on an excerpt of the Beatles dataset, to recognize the restricted set of major and minor chord symbols. We plot the training error that is backpropagated in the joint optimization, described in section A.3, and the performance on a validation set that is held out from the training samples.

Since the SDAE is trained to model PCP vectors in this case, we compute twelve distinct training errors for each training sample in the joint optimization approach (or multiples of such vectors for batch training). In figure 15, we plot the sum of averaged absolute training errors for each output over one epoch of neural network training. Figure 16 depicts the performance on the validation set after each learning epoch of the joint optimization, denoted as the percentage of accurately estimated chord symbol segments according to the WCSR. The figures depict performance for 20 epochs (training iterations).

Figure 15: Average absolute backpropagated training error on the training set per epoch.


Figure 16: Classification performance on the validation set after each training epoch.

A.9 Possible Interpretation of the Joint Optimization

In figures 15 and 16 we can see the training error and the performance on the validation set for 20 iterations of joint training, after an initial pretraining phase. The average training error decreases with each iteration of joint optimization. This suggests that the neural network-HMM hybrid is learning according to the error measure defined above: the gradient with respect to the overall likelihood of the HMM is decreasing. However, as we can see in figure 16, the performance on the validation set is decreasing as well. This means that despite maximizing the likelihood of the HMM, the performance decreases; these are symptoms similar to those we observe with overfitting. Further experiments were conducted trying to backpropagate the Euclidean distance to the aggregated weighted means of the Gaussian mixture model for the emission of the respective state, but to no avail. I conclude that this method maximizes the likelihood according to the defined criteria, but is unable to improve performance further after the initialization of both the Gaussian-mixture-emission HMM and the stacked denoising autoencoders.
