
Efficient Bayesian inference for harmonic models via adaptive posterior factorization


Neurocomputing 72 (2008) 79–87

journal homepage: www.elsevier.com/locate/neucom

Emmanuel Vincent a,*,1, Mark D. Plumbley b

a METISS Group, IRISA-INRIA, Campus de Beaulieu, 35042 Rennes Cedex, France

b Centre for Digital Music, Department of Electronic Engineering, Queen Mary University of London, Mile End Road, London E1 4NS, UK

* Corresponding author. Tel.: +33 2 9984 2269.

E-mail addresses: [email protected] (E. Vincent), [email protected] (M.D. Plumbley).

1 This author was funded by EPSRC Grant GR/S75802/01.

0925-2312/$ - see front matter © 2008 Elsevier B.V. All rights reserved.

doi:10.1016/j.neucom.2007.12.050

Article info

Available online 1 October 2008

Keywords: Bayesian inference; Harmonic model; Adaptive factorization; Posterior dependence

Abstract

Harmonic sinusoidal models are an essential tool for music audio signal analysis. Bayesian harmonic models are particularly interesting, since they allow the joint exploitation of various priors on the model parameters. However, existing inference methods often rely on specific prior distributions and remain computationally demanding for realistic data. In this article, we investigate a generic inference method based on approximate factorization of the joint posterior into a product of independent distributions on small subsets of parameters. We discuss the conditions under which this factorization holds true and propose two criteria to choose these subsets adaptively. We evaluate the resulting performance experimentally for the task of multiple pitch estimation using different levels of factorization.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Music and speech involve different types of sounds, including periodic, transient and noisy sounds. Short-term stationary periodic sounds composed of sinusoidal partials at harmonic or near-harmonic frequencies are perceptually essential, since they contain most of the energy of musical notes and vowels. Harmonicity means that at each instant the frequencies of the partials are multiples of a single frequency called the fundamental frequency. Estimating the periodic sounds underlying a given signal, i.e. estimating their fundamental frequencies and the amplitudes and phases of their partials, is required or useful for many applications, such as speech prosody analysis [10], multiple pitch estimation and instrument recognition [11] and low bit-rate compression [20]. This problem is particularly difficult for polyphonic signals, i.e. signals containing several concurrent periodic sounds, since different periodic sounds may exhibit partials overlapping at the same frequencies.

Existing methods for polyphonic fundamental frequency estimation are often based on one of two approaches [11]: either validation of fundamental frequency candidates given by the peaks of a short-term auto-correlation function [8,22,15] or inference of the hidden states of a probabilistic model of the signal short-term power spectrum based on learned template spectra [14,18,6]. These approaches have achieved limited performance on complex polyphonic signals so far [11,15]. Moreover, neither approach provides estimates for the amplitudes and phases of the partials, which are needed for musical instrument recognition or low bit-rate compression.

A promising way to address these issues is to rely on a probabilistic model of the signal waveform incorporating various prior knowledge. Two families of such models have been proposed in the literature for music signals. One family introduced in [4,3] models each musical note signal in state-space form by a discrete fundamental frequency and a fixed number of damped oscillators at harmonic frequencies with independent transition noises. Decoding is achieved either via linear Kalman filtering or variational approximation [12], depending on whether the damping factors are fixed or subject to additional transition noises. These inference methods restrict the prior distribution of the transition noises to be Gaussian or from a class of conjugate priors [2], respectively. Another family of models described in [21,9,7] represents musical note signals by continuous fundamental frequency, amplitude and phase parameters, inferred using Markov Chain Monte Carlo (MCMC) methods [2]. These methods are applicable to all prior distributions in theory, but tend to be computationally demanding in practice. Thus the chosen priors are mostly motivated by computational issues [7]. In particular, the amplitudes of the partials are modeled by independent uniform priors or by conjugate zero-mean Gaussian priors, so that analytical marginalization can be performed.

For both families of models, the above priors differ somewhat from the empirical parameter distributions. In particular, they do not penalize partials with zero amplitude. This typically leads to missing estimated notes for signals composed of several notes at integer fundamental frequency ratios [21,7] or to erroneous fundamental frequency estimates equal to a multiple or a submultiple of the true fundamental frequencies [7]. To help overcome these limitations, we recently designed a probabilistic harmonic model involving priors motivated by empirical parameter distributions and proposed a variant of the diagonal Laplace method for fast inference [20], since analytical marginalization was no longer feasible with these priors.

[Figure 1: Bayesian network with nodes $x$ (observed signal frame), $s_p$ (signal of note $p$), $e$ (residual), $f_p$ (fundamental frequency), $r_p$ (amplitude scale factor), $a_{pm}$ (amplitude of partial $m$), $\phi_{pm}$ (phase of partial $m$) and $S_p$ (activity state of note $p$).]

Fig. 1. Bayesian network structure of the harmonic model in [20] on one signal frame. Circles denote vector random variables (some of variable size) and arrows conditional dependencies.

In this article, we propose an alternative fast inference method for probabilistic harmonic models, based on approximate factorization of the joint posterior into a product of independent distributions on subsets of parameters. This method is designed for models of the form described in [21,9,7,20], involving explicit frequency, amplitude and phase parameters. It is generic, in that it can be applied to a wide range of priors, and adaptive, since the level of factorization depends on the observed signal and the hypothesized notes. This constitutes a crucial difference compared to variational approximation methods, where the terms of the factorization are fixed a priori and their parameters can only be computed for certain classes of priors. We complete our preliminary work [19] by discussing the extension of this method to non-Gaussian likelihood and alternative model structures, investigating a new criterion for the choice of the parameter subsets and providing a detailed experimental evaluation.

The structure of the rest of the article is as follows. In Section 2, we present a possible Bayesian network structure for harmonic models and make some mild assumptions about the parameter priors. Then, we describe the proposed inference method in Section 3 and extend it to alternative model structures. In Section 4, we evaluate its performance for the task of multiple pitch estimation on short time frames. We conclude in Section 5 and suggest some perspectives for future research.

2. Assumptions about the model

The harmonic models in [21,9,7,20] are variations of the same concept. They all represent the observed music signal as a sum of note signals, each composed of several sinusoidal partials parameterized by a sequence of random variables spanning successive time frames. However, the chosen variables and their conditional dependency structure are slightly different for each model. For the sake of clarity, we first discuss our approach for the model structure in [20], which involves fewer variables. Also, we consider each signal frame separately, thus omitting temporal dependencies. Such dependencies could be taken into account by replacing the priors over the variables in a frame by conditional priors given previous frames.

2.1. Bayesian network structure

On each time frame, the model described in [20] exhibits the four-layer Bayesian network structure shown in Fig. 1. The observed signal frame $x(t)$ is assumed to be obtained by windowing the whole signal with a window $w(t)$ of length $T$. Each layer models $x(t)$ at a different abstraction level.

The bottom layer represents the active notes in this frame on a discrete pitch scale. In western music, the fundamental frequency $f_p$ of each note, expressed relative to the sampling frequency $F_s$, may vary across frames but remains close to a discrete pitch of the form

$$\mu_p = \frac{440}{F_s}\, 2^{(p-69)/12} \tag{1}$$

where $p$ is an integer on the MIDI semitone scale. Assuming no unison, i.e. that several notes corresponding to the same discrete pitch cannot be present at the same time, each discrete pitch $p$ is associated with a binary activity state $S_p$ determining whether a note with that pitch is active or not. The number of active notes and their pitches are thus represented by the activity state vector $S = \{S_p : p_{\mathrm{low}} \le p \le p_{\mathrm{high}}\}$ where $p_{\mathrm{low}}$ and $p_{\mathrm{high}}$ are the lowest and highest pitches among possible instruments.
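As a small illustration of Eq. (1) and of the activity state vector, the following Python sketch may help. The function names, the 44.1 kHz sampling rate and the pitch range 21 to 108 (a piano range) are assumptions made for this example and are not taken from the article.

```python
def discrete_pitch_frequency(p: int, sample_rate: float) -> float:
    """Normalized frequency mu_p of MIDI pitch p, following Eq. (1)."""
    return 440.0 / sample_rate * 2.0 ** ((p - 69) / 12.0)


def activity_vector(active_pitches, p_low: int, p_high: int) -> dict:
    """Binary activity state S_p for every pitch between p_low and p_high."""
    active = set(active_pitches)
    return {p: int(p in active) for p in range(p_low, p_high + 1)}


# Hypothetical settings: A4 (MIDI pitch 69) at a 44.1 kHz sampling rate.
mu_a4 = discrete_pitch_frequency(69, 44100.0)
# Hypothetical candidate: a C major triad within an assumed piano range.
S = activity_vector([60, 64, 67], p_low=21, p_high=108)
```

One candidate vector `S` of this kind corresponds to one hypothesis about the number of active notes and their pitches.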

The signal $s_p(t)$ corresponding to each active note is then defined in the middle layers for $0 \le t \le T-1$ by

$$s_p(t) = w(t) \sum_{m=1}^{M_p} a_{pm} \cos(2\pi m f_p t + \phi_{pm}) \tag{2}$$

where $f_p$, $a_{pm}$ and $\phi_{pm}$ are, respectively, its normalized fundamental frequency and the amplitude and phase of its $m$-th harmonic partial. The number of partials $M_p$ is constrained as a function of the note pitch $p$ to

$$M_p = \min\left(\frac{1}{2\mu_p},\, M_{\max}\right) \tag{3}$$

so that the partials fill the whole observed frequency range up to a maximum number of partials $M_{\max}$. The amplitudes of the partials are assumed to depend on an amplitude scale factor $r_p$ accounting for the total power of the note. The vectors of frequency, scale factor, amplitude and phase parameters for all notes are denoted, respectively, by $f = \{f_p : S_p = 1\}$, $r = \{r_p : S_p = 1\}$, $a = \{a_{pm} : S_p = 1, 1 \le m \le M_p\}$ and $\phi = \{\phi_{pm} : S_p = 1, 1 \le m \le M_p\}$. Finally, the observed signal is modeled in the top layer as

$$x(t) = \sum_{p \,\text{s.t.}\, S_p = 1} s_p(t) + e(t) \tag{4}$$

where $e(t)$ is the residual.
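The signal model of Eqs. (2) to (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Hann window, the 8 kHz sampling rate, the 1/m amplitude profile and all function names are hypothetical, and the integer part of $1/(2\mu_p)$ is taken in Eq. (3) so that the number of partials is an integer.

```python
import math


def num_partials(mu_p: float, m_max: int) -> int:
    """Eq. (3), taking the integer part so that all partials lie below Nyquist."""
    return min(int(1.0 / (2.0 * mu_p)), m_max)


def hann(T: int):
    """Hann window of length T (an assumed choice of w(t))."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * t / T) for t in range(T)]


def note_signal(f_p, amplitudes, phases, window):
    """Eq. (2): windowed sum of harmonic partials m = 1..M_p."""
    return [
        w * sum(a * math.cos(2.0 * math.pi * m * f_p * t + phi)
                for m, (a, phi) in enumerate(zip(amplitudes, phases), start=1))
        for t, w in enumerate(window)
    ]


def observed_frame(note_signals, residual):
    """Eq. (4): sum of the active note signals plus the residual."""
    return [sum(s[t] for s in note_signals) + e for t, e in enumerate(residual)]


T = 64
w = hann(T)
mu = 440.0 / 8000.0                 # A4 at a hypothetical 8 kHz sampling rate
M = num_partials(mu, m_max=30)
s = note_signal(mu, [1.0 / m for m in range(1, M + 1)], [0.0] * M, w)
x = observed_frame([s], [0.0] * T)  # one active note, zero residual
```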

2.2. Assumptions about the parameter priors

The inference method proposed below is valid given some mild assumptions about the parameter priors. Classically, we assume that the fundamental frequencies $f_p$ of different notes $p$ and the phases $\phi_{pm}$ of different partials $(p,m)$ are independent a priori and that the amplitudes $a_{pm}$ of different partials are independent a priori given the scale factors $r_p$. We also assume that the prior distribution of each fundamental frequency $f_p$ is close to zero outside the interval $[2^{-1/12}\mu_p,\, 2^{1/12}\mu_p]$, so that it enforces proximity to the underlying discrete pitch. Finally, we make the hypothesis that the residual $e(t)$ has a continuous distribution and that its values at distinct frequencies are independent a priori, so that the likelihood $P(x|f,a,\phi)$ factors as

$$P(x|f,a,\phi) = \prod_{n=0}^{T-1} P(E_n) \tag{5}$$

where $E_n$ are the discrete Fourier transform (DFT) coefficients of $e(t)$.


Note that the ubiquitous time-domain Gaussian i.i.d. distribution satisfies this hypothesis, since it is equivalent to a Gaussian i.i.d. distribution on the DFT coefficients $P(x|f,a,\phi) = (2\pi\sigma^2)^{-T/2} \prod_{n=0}^{T-1} \exp(-|E_n|^2/(2\sigma^2))$. A more general distribution in the form of (5) and of particular interest in the following is the frequency-weighted Gaussian [20]

$$P(x|f,a,\phi) = (2\pi\sigma^2)^{-T/2} \prod_{n=0}^{T-1} \exp\left(-\frac{\gamma_n |E_n|^2}{2\sigma^2}\right) \tag{6}$$

where $\gamma_n$ are constant positive weights. This can be rewritten as $P(x|f,a,\phi) = (2\pi\sigma^2)^{-T/2} \exp(-\|e\|_\gamma^2/(2\sigma^2))$ where $\|e\|_\gamma^2 = \sum_{n=0}^{T-1} \gamma_n |E_n|^2$ is the squared weighted Euclidean norm of the DFT coefficients.
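The weighted norm and the logarithm of Eq. (6) can be evaluated as in the sketch below. It assumes a unitary DFT (scaled by $1/\sqrt{T}$), so that flat weights $\gamma_n = 1$ recover the time-domain Gaussian i.i.d. case through Parseval's relation; this scaling, the naive DFT and the function names are assumptions of this example.

```python
import cmath
import math


def unitary_dft(signal):
    """Naive O(T^2) DFT scaled by 1/sqrt(T), so that sum |E_n|^2 = sum e(t)^2."""
    T = len(signal)
    scale = 1.0 / math.sqrt(T)
    return [
        scale * sum(v * cmath.exp(-2j * math.pi * n * t / T)
                    for t, v in enumerate(signal))
        for n in range(T)
    ]


def weighted_sq_norm(residual, weights):
    """Squared weighted norm: sum over bins of gamma_n |E_n|^2."""
    return sum(g * abs(E) ** 2 for g, E in zip(weights, unitary_dft(residual)))


def log_likelihood(residual, weights, sigma2):
    """Logarithm of the frequency-weighted Gaussian likelihood of Eq. (6)."""
    T = len(residual)
    return (-0.5 * T * math.log(2.0 * math.pi * sigma2)
            - weighted_sq_norm(residual, weights) / (2.0 * sigma2))


# Hypothetical residual frame and flat weights (plain Gaussian i.i.d. case).
e = [0.3, -0.1, 0.2, 0.05, -0.25, 0.0, 0.1, -0.15]
flat = [1.0] * len(e)
ll = log_likelihood(e, flat, sigma2=0.04)
```

Working with the logarithm avoids numerical underflow when many bins are multiplied together.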

3. Bayesian inference via adaptive posterior factorization

Harmonic models are typically employed to solve the multiple pitch estimation task, which consists of estimating the number of active notes and their pitches on each time frame. In the present framework, this task translates into finding the maximum a posteriori (MAP) activity state vector $\hat{S} = \arg\max_S P(S|x)$, which is achieved by trying a number of candidate vectors $S$, computing their posterior probabilities $P(S|x)$ and selecting the largest. These probabilities are defined by

$$P(S|x) = \int P(S,f,r,a,\phi|x)\, df\, dr\, da\, d\phi \tag{7}$$

where the joint posterior $P(S,f,r,a,\phi|x)$ is given by Bayes' law

$$P(S,f,r,a,\phi|x) \propto P(x|f,a,\phi)\, P(\phi|S)\, P(a|r,S)\, P(r|S)\, P(f|S)\, P(S) \tag{8}$$

In the following, we focus on the computation of the integral in (7), which is known as the Bayesian marginalization problem [12]. We briefly recall some existing integration methods, then introduce the proposed method in a simple context and extend it to a more general context later on.

3.1. Sampling-based vs. full factorization-based integration

A number of sampling techniques are available to compute such integrals [12]. However, they appear unsatisfactory in this context. Numerical integration on a uniform grid is accurate for distributions of a few parameters, but intractable here since the number of parameters is typically of the order of one hundred. Integration via importance sampling [13] is computationally demanding, since the variance of the importance weights, which is proportional to that of the estimate, increases sharply with the number of parameters [12]. Sampling of the joint posterior via reversible jump MCMC [2] is also demanding [7].

Fast inference can be achieved at the cost of lower accuracy by estimating the MAP parameter values $(\hat f, \hat r, \hat a, \hat\phi) = \arg\max_{f,r,a,\phi} P(S,f,r,a,\phi|x)$ associated with each candidate activity state vector $S$ using some nonlinear optimization algorithm and approximating the joint posterior around these values by a simpler distribution which can be integrated analytically or by tabulation. The fastest techniques include the diagonal Laplace approximation [5], which relies on full factorization of the posterior into a product of parameter-wise univariate Gaussian distributions, and its variant proposed in [20] with a specific univariate non-Gaussian distribution for the phase parameters. The full Laplace approximation [5] performs poorly here due to unbounded integration over the phase parameters [19].

The proposed inference method aims to bridge the gap between sampling-based and full factorization-based techniques by partially factoring the joint posterior into a product of distributions over subsets of a few parameters and integrating these distributions via sampling. Various levels of factorization can be obtained depending on the MAP parameter values.

3.2. Conditional posterior factorization over the partials

For simplicity, let us assume initially that the harmonic partials of the hypothesized active notes have "different enough" frequencies and that the likelihood is a frequency-weighted Gaussian as in (6). These two assumptions are relaxed later on. The former is generally true for a single hypothesized active note, but almost never for several active notes. Mathematically, it leads to the assumption that the windowed complex sinusoidal signals

$$z_{pm}(t) = w(t)\, e^{2i\pi m f_p t} \tag{9}$$

corresponding to different partials are mutually orthogonal

$$\langle z_{pm}, z_{p'm'} \rangle_\gamma = 0 \quad \forall (p,m) \ne (p',m') \tag{10}$$

according to the dot product $\langle \cdot, \cdot \rangle_\gamma$ consistent with the weighted Euclidean norm $\|\cdot\|_\gamma$ defined in Section 2.2. This dot product is defined for two signals $z(t)$ and $z'(t)$ by

$$\langle z, z' \rangle_\gamma = \sum_{n=0}^{T-1} \gamma_n Z_n \overline{Z'_n} \tag{11}$$

where $Z_n$ and $Z'_n$ are the DFT coefficients of $z(t)$ and $z'(t)$ and $\overline{Z'_n}$ is the complex conjugate of $Z'_n$. The orthogonality property (10) formalizes the fact that partials with "different enough" frequencies have almost disjoint frequency supports and can be assumed to hold true for all possible frequency weights $\gamma_n$. When the frequencies of the partials are not too close to Nyquist, the negative frequency sinusoidal signals $\bar z_{pm}(t) = w(t)\, e^{-2i\pi m f_p t}$ are also orthogonal to their positive counterparts: $\langle z_{pm}, \bar z_{p'm'} \rangle_\gamma = 0$ for all $(p,m)$ and $(p',m')$. The observed signal $x(t)$ can then be decomposed into a sum of sinusoidal signals at the frequencies of the hypothesized partials by orthogonal projection onto the two-dimensional subspaces spanned by $(z_{pm}, \bar z_{pm})$

$$x(t) = \frac{1}{2} \sum_{p,m} \tilde a_{pm} \left( e^{i\tilde\phi_{pm}} z_{pm}(t) + e^{-i\tilde\phi_{pm}} \bar z_{pm}(t) \right) + \tilde e(t) \tag{12}$$

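The projection coefficients given by

$$\tilde a_{pm}\, e^{i\tilde\phi_{pm}} = \frac{2 \langle x, z_{pm} \rangle_\gamma}{\|z_{pm}\|_\gamma^2} \tag{13}$$

represent the amplitude and phase values of each partial leading to the minimum-norm residual $\tilde e(t)$. Given hypothesized values $a_{pm}$ and $\phi_{pm}$, the corresponding residual $e(t)$ can be decomposed as a sum of mutually orthogonal terms

$$e(t) = \frac{1}{2} \sum_{p,m} \left[ \left( \tilde a_{pm} e^{i\tilde\phi_{pm}} - a_{pm} e^{i\phi_{pm}} \right) z_{pm}(t) + \left( \tilde a_{pm} e^{-i\tilde\phi_{pm}} - a_{pm} e^{-i\phi_{pm}} \right) \bar z_{pm}(t) \right] + \tilde e(t) \tag{14}$$

The squared norm of the residual then equals, by analytical computation,

$$\|e\|_\gamma^2 = \sum_{p,m} D_{pm} + D_0 \tag{15}$$

with $D_0 = \|\tilde e\|_\gamma^2$ and

$$D_{pm} = \frac{1}{2} \|z_{pm}\|_\gamma^2 \left( (a_{pm} - \tilde a_{pm})^2 + 4\, \tilde a_{pm} a_{pm} \sin^2\left( \frac{\phi_{pm} - \tilde\phi_{pm}}{2} \right) \right) \tag{16}$$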
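The projection of Eq. (13) and the distance term of Eq. (16) can be checked numerically on a toy case. The sketch below assumes a rectangular window, flat weights, a unitary DFT and a single partial lying exactly on a DFT bin, so that the orthogonality assumptions hold exactly; all names and numerical values are hypothetical.

```python
import cmath
import math


def unitary_dft(signal):
    """Naive O(T^2) DFT scaled by 1/sqrt(T)."""
    T = len(signal)
    scale = 1.0 / math.sqrt(T)
    return [
        scale * sum(v * cmath.exp(-2j * math.pi * n * t / T)
                    for t, v in enumerate(signal))
        for n in range(T)
    ]


def inner(z1, z2, weights):
    """Eq. (11): weighted dot product computed on the DFT coefficients."""
    Z1, Z2 = unitary_dft(z1), unitary_dft(z2)
    return sum(g * u * v.conjugate() for g, u, v in zip(weights, Z1, Z2))


def project_partial(x, freq, window, weights):
    """Eq. (13): returns (a_tilde, phi_tilde, squared norm of z) for one partial."""
    z = [w * cmath.exp(2j * math.pi * freq * t) for t, w in enumerate(window)]
    norm2 = inner(z, z, weights).real
    c = 2.0 * inner(x, z, weights) / norm2
    return abs(c), cmath.phase(c), norm2


def d_pm(a, phi, a_tilde, phi_tilde, norm2):
    """Eq. (16): contribution of one partial to the squared residual norm."""
    return 0.5 * norm2 * ((a - a_tilde) ** 2
                          + 4.0 * a_tilde * a * math.sin(0.5 * (phi - phi_tilde)) ** 2)


T = 32
window = [1.0] * T            # assumed rectangular window
weights = [1.0] * T           # assumed flat frequency weights
freq = 4 / T                  # partial exactly on DFT bin 4
x = [0.8 * math.cos(2.0 * math.pi * freq * t + 0.5) for t in range(T)]
a_tilde, phi_tilde, norm2 = project_partial(x, freq, window, weights)

# Residual for hypothesized values a = 0.6 and phi = 0.5 (here D_0 = 0).
e = [xt - 0.6 * math.cos(2.0 * math.pi * freq * t + 0.5) for t, xt in enumerate(x)]
residual_norm2 = inner(e, e, weights).real
d = d_pm(0.6, 0.5, a_tilde, phi_tilde, norm2)
```

In this idealized setting the residual norm and `d` agree, as predicted by Eq. (15); with a real window and off-bin frequencies the agreement is only approximate.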

Using (8) and the relationship between $P(x|f,a,\phi)$ and $\|e\|_\gamma^2$, this leads to the exact factorization of the joint posterior into a product of partial-wise bivariate conditional distributions over amplitude and phase parameters

$$P(S,f,r,a,\phi|x) \propto P_0(x,f)\, P(r|S)\, P(f|S)\, P(S) \times \prod_{p,m} P_{pm}(a_{pm}, \phi_{pm}; x, f_p)\, P(a_{pm}|r_p)\, P(\phi_{pm}) \tag{17}$$

where $P_0(x,f) = (2\pi\sigma^2)^{-T/2}\, e^{-D_0/(2\sigma^2)}$ depends on $f$ only and $P_{pm}(a_{pm}, \phi_{pm}; x, f_p) = \exp(-D_{pm}/(2\sigma^2))$ is a bivariate parametric distribution that can be quickly computed, since it depends on three hyper-parameters only: $\|z_{pm}\|_\gamma^2$, $\tilde a_{pm}$ and $\tilde\phi_{pm}$. The top part of Fig. 2 illustrates the validity of this factorization.

Denoting by $(\hat a, \hat\phi) = \arg\max_{a,\phi} P(S,f,r,a,\phi|x)$ the vectors of estimated MAP amplitude and phase parameter values associated with $S$, $f$ and $r$ and by $\hat a_{\overline{pm}} = \{\hat a_{p'm'} : (p',m') \ne (p,m)\}$ and $\hat\phi_{\overline{pm}} = \{\hat\phi_{p'm'} : (p',m') \ne (p,m)\}$ the same vectors minus one coefficient corresponding to partial $(p,m)$, the above expression can be equivalently rewritten, as proved in Appendix A, as

$$P(S,f,r,a,\phi|x) = P(S,f,r,\hat a,\hat\phi|x) \prod_{p,m} \frac{P(a_{pm}, \phi_{pm} | S, f, r_p, \hat a_{\overline{pm}}, \hat\phi_{\overline{pm}}, x)}{P(\hat a_{pm}, \hat\phi_{pm} | S, f, r_p, \hat a_{\overline{pm}}, \hat\phi_{\overline{pm}}, x)} \tag{18}$$

[Figure 2: three rows of paired panels, "Exact posterior" (left) and "Factored approx." (right), each plotting $a_{p'm'}$ (dB) against $a_{pm}$ (dB).]

Fig. 2. Joint posterior distribution $P(a_{pm}, a_{p'm'} | S, \hat f, \hat r, \hat a_{\overline{pm\,p'm'}}, \hat\phi, x)$ of the amplitudes of two partials given the MAP values of other parameters, using the priors defined in [20] and assuming 60 dB ground truth amplitudes with no other partials at the same frequencies. Dark areas denote high probability. Top: partials at different frequencies with mean prior amplitudes of 50 and 60 dB. Middle: partials at the same frequency with identical mean prior amplitudes of 60 dB. Bottom: partials at the same frequency with mean prior amplitudes of 40 and 60 dB. The posterior dependence between $a_{pm}$ and $a_{p'm'}$ equals 0, 0.99 and 0.093 bits, respectively.

This equation holds under the assumptions that the likelihood is frequency-weighted Gaussian and that the partials of the hypothesized active notes are orthogonal signals. It admits the following interpretation: the first term is the maximum of the joint posterior with fixed $S$, $f$ and $r$ and the quotient terms describe the decrease of this posterior around its maximum as proportional to the posterior distribution of the parameters of each partial with the parameters of other partials being fixed. In theory, this equation also holds for any value of $\hat a$ and $\hat\phi$ distinct from the actual MAP parameter values.

In practice, perfect orthogonality never happens. Nevertheless, this equation remains approximately valid under the more general assumptions that the partials of the hypothesized active notes have "different enough" frequencies and that the likelihood satisfies (5), although quick computation of the quotient terms by orthogonal projection is no longer feasible with a non-Gaussian likelihood. Indeed, when the amplitude parameters are not too far from their MAP values, the DFT coefficients of each partial signal are close to zero except for a few DFT bins $n$ around the frequency of that partial. The sets of bins associated with different partials are disjoint. Therefore, the likelihood $P(E_n)$ in a given bin depends mostly on the parameters of a single partial, which leads to (18) after simple analytical computation. Note that this factorization holds as long as the MAP amplitude values are not grossly overestimated; otherwise the frequency supports of different partials might overlap due to secondary lobes. In particular, it holds when the joint posterior is multimodal and a local maximum was estimated instead of the global maximum.

3.3. Conditional posterior factorization over subsets of partials

In the general case where some partials of the hypothesized active notes may have close frequencies, the terms of (18) can still be computed but this equation may not hold true anymore, as shown in the middle part of Fig. 2. It is however possible to group partials into disjoint subsets such that partials from different subsets have "different enough" frequencies. These subsets can be iteratively created as follows: a partial $(p,m)$ is assigned to a previously created subset $g$ if there exists a partial $(p',m') \in g$ such that

$$|m f_p - m' f_{p'}| \le f_{\max} \tag{19}$$

where $f_{\max}$ is a manually set frequency threshold; otherwise it forms a new singleton subset. Provided that the likelihood satisfies (5) and that the MAP amplitude values are not grossly overestimated, similar arguments as above lead to the approximate factorization of the posterior into a product of multivariate conditional distributions over subsets of amplitude and phase parameters $a_g = \{a_{pm}, (p,m) \in g\}$ and $\phi_g = \{\phi_{pm}, (p,m) \in g\}$

$$P(S,f,r,a,\phi|x) \approx P(S,f,r,\hat a,\hat\phi|x) \prod_g \frac{P(a_g, \phi_g | S, f, r, \hat a_{\bar g}, \hat\phi_{\bar g}, x)}{P(\hat a_g, \hat\phi_g | S, f, r, \hat a_{\bar g}, \hat\phi_{\bar g}, x)} \tag{20}$$

Each quotient term can be quickly computed by orthogonal projection in the particular case where the likelihood is frequency-weighted Gaussian.
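The iterative grouping rule based on criterion (19) can be sketched as follows. One detail is an assumption of this example rather than something stated in the text: when a new partial is close to partials in several existing subsets, those subsets are merged, so that partials in different subsets always differ by more than $f_{\max}$. The function name and the numerical values are hypothetical.

```python
def group_partials(partials, f_max):
    """Group partials whose frequencies are chained within f_max of each other.

    partials: list of (p, m, f_p) with f_p the normalized fundamental frequency.
    Returns a list of subsets, each a sorted list of (p, m) pairs.
    """
    subsets = []  # each subset is a list of ((p, m), partial frequency)
    for p, m, f_p in partials:
        freq = m * f_p
        linked, rest = [], []
        for g in subsets:
            close = any(abs(freq - other) <= f_max for _, other in g)
            (linked if close else rest).append(g)
        # Assumed extension: merge every subset linked to the new partial.
        merged = [((p, m), freq)] + [item for g in linked for item in g]
        subsets = rest + [merged]
    return [sorted(key for key, _ in g) for g in subsets]


# Two hypothetical notes with fundamental frequencies in a 3:2 ratio:
# partial 3 of the lower note coincides with partial 2 of the upper one.
partials = [(60, m, 0.02) for m in (1, 2, 3)] + [(67, m, 0.03) for m in (1, 2, 3)]
groups = group_partials(partials, f_max=0.002)
```

Here only the two coinciding partials end up in a common subset, and every other partial forms a singleton.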

A higher threshold $f_{\max}$ increases the accuracy of this equation, but also leads to larger subsets. In practice, it is often possible to obtain a factored expression of similar accuracy with smaller subsets. Indeed there exist some situations where partials at close frequencies may still be associated with different subsets. An example of such a situation is given in the bottom part of Fig. 2 and discussed in [19]. Denoting by the vectors $\theta$ and $\theta'$ two disjoint subsets of variables and by $\hat\theta$ and $\hat\theta'$ their estimated MAP values given the rest of the variables $\theta''$, we assess the accuracy of the approximation of the joint posterior distribution $P(\theta, \theta'|\theta'')$ by the factored distribution $P(\theta|\hat\theta', \theta'')\, P(\theta'|\hat\theta, \theta'')$ using the Kullback–Leibler divergence [12]

$$D(\theta, \theta') = \int P(\theta, \theta'|\theta'') \log_2 \frac{P(\theta, \theta'|\theta'')}{P(\theta|\hat\theta', \theta'')\, P(\theta'|\hat\theta, \theta'')}\, d\theta\, d\theta' \tag{21}$$

This quantity is always nonnegative and equal to zero only when the approximation is exact. It can be seen as a measure of the local posterior dependence between $\theta$ and $\theta'$ expressed in bits. Indeed, it is analogous to mutual information [12], except that the marginal distribution of each variable is replaced here by its posterior distribution given the estimated MAP value of the other. This suggests that partials $(p,m)$ and $(p',m')$ are to be grouped in the same subset if

$$D(\{a_{pm}, \phi_{pm}\}, \{a_{p'm'}, \phi_{p'm'}\}) \ge c_{\min} \tag{22}$$

where $c_{\min}$ is a manually set threshold. Compared to mutual information, this criterion is tractable for a wide range of distributions. However, it is less accurate when three or more partials overlap at a given frequency and the joint posterior is multimodal, since misestimation of the MAP parameter values of one partial may affect the estimated posterior dependence between the parameters of other partials. This may in turn affect the determination of the parameter subsets, hence the accuracy of (20).
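The posterior dependence of Eq. (21) and the grouping test of Eq. (22) can be illustrated on a discretized toy posterior. In the sketch below each subset of variables is reduced to a single scalar sampled on a grid, and the MAP values are taken as the grid point of highest probability; these simplifications, the function names and the example tables are assumptions of this illustration.

```python
import math


def posterior_dependence(joint):
    """Discrete analogue of Eq. (21), in bits.

    joint[i][j] is proportional to P(theta_i, theta'_j | theta'') on a grid.
    """
    total = sum(sum(row) for row in joint)
    p = [[v / total for v in row] for row in joint]
    n_rows, n_cols = len(p), len(p[0])
    # Grid point of highest probability, used as the estimated MAP values.
    i_hat, j_hat = max(((i, j) for i in range(n_rows) for j in range(n_cols)),
                       key=lambda ij: p[ij[0]][ij[1]])
    column = [row[j_hat] for row in p]
    q1 = [v / sum(column) for v in column]          # P(theta | MAP theta')
    q2 = [v / sum(p[i_hat]) for v in p[i_hat]]      # P(theta' | MAP theta)
    d = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            if p[i][j] > 0.0:
                if q1[i] * q2[j] == 0.0:
                    return math.inf
                d += p[i][j] * math.log2(p[i][j] / (q1[i] * q2[j]))
    return d


def should_group(joint, c_min):
    """Eq. (22): group the two subsets when their dependence reaches c_min."""
    return posterior_dependence(joint) >= c_min


# Hypothetical toy posteriors: an independent one and a strongly dependent one.
u, v = [0.2, 0.5, 0.3], [0.6, 0.4]
independent = [[ui * vj for vj in v] for ui in u]
dependent = [[0.4, 0.1], [0.1, 0.4]]
d_ind = posterior_dependence(independent)
d_dep = posterior_dependence(dependent)
```

For the independent table the dependence vanishes up to rounding, so the two variables would be kept in separate subsets for any positive threshold.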

Fig. 3 depicts the distribution of posterior dependence between the parameters of two partials as a function of their frequency difference measured in number of DFT bins. The choice of the frame length $T$ affects the measured frequency difference. However, the distribution for a given frequency difference remains roughly independent of the frame length. On average, posterior dependence tends to decrease with increasing frequency difference and exhibits smaller values for differences corresponding to certain zeroes of the DFT of the window $w(t)$. Most importantly, for a given frequency difference, posterior dependence values differing by up to three orders of magnitude can be observed. This shows that there exist many situations in practice where the parameters of two partials at close frequencies are much less dependent than average, so that the size of the parameter subsets can effectively be reduced. This figure can also be exploited to speed up the estimation of the subsets by avoiding the computation of the posterior dependence between partials whose frequency difference is above a certain threshold, chosen so that the posterior dependence is guaranteed to be smaller than $c_{\min}$.

[Figure 3: posterior dependence (bits, logarithmic axis from $10^{-15}$ to $10^{5}$) versus frequency difference (bins, from 0 to 6).]

Fig. 3. Posterior dependence between the parameters of two partials as a function of their frequency difference measured in number of DFT bins. The black curve and the gray area denote, respectively, the median and the two-tailed 95th percentile of the values computed for all the data of Section 4.1.

3.4. An exploitable posterior factorization

The conditional factorization (20) can be exploited for numerical integration of the posterior, either by sampling on a uniform grid or by importance sampling. Indeed, integration over amplitude and phase parameters can be achieved by multiplying lower-dimensional integrals over the parameters of each subset of partials. Using sampling on a uniform grid and denoting by $N$ the number of grid points for each scalar variable, $P$ the number of hypothesized notes, $M = \sum_p M_p$ their total number of partials and $G$ the size of the largest subset of partials, this results in a maximum complexity of $O((M/G)\,N^{2P+2G})$. This is smaller than the complexity of $O(N^{2P+2M})$ associated with straightforward integration of the joint posterior, but still intractable.
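To give a feel for the gap between these two orders of magnitude, the following minimal Python sketch simply evaluates both sample counts. The function names and the numerical values of $N$, $P$, $M$ and $G$ are illustrative assumptions, not settings taken from the experiments.

```python
def grid_points_joint(N, P, M):
    """Grid points for direct integration over 2P note-level and 2M partial-level scalars."""
    return N ** (2 * P + 2 * M)


def grid_points_factored(N, P, M, G):
    """Upper bound (M/G) * N^(2P+2G) when partials are split into subsets of size at most G."""
    n_subsets = -(-M // G)  # ceil(M / G) using integer arithmetic
    return n_subsets * N ** (2 * P + 2 * G)


if __name__ == "__main__":
    N, P, M, G = 10, 2, 20, 2  # hypothetical values chosen for illustration only
    print(f"joint:    {grid_points_joint(N, P, M):.3e}")
    print(f"factored: {grid_points_factored(N, P, M, G):.3e}")
```

Even with these small hypothetical values the factored count remains large, which is consistent with the remark that this first factorization is still intractable.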

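In order to get faster integration, additional parameter dependencies must be removed. A natural approach consists of computing the MAP values $(\hat f, \hat r, \hat a, \hat\phi) = \arg\max_{f,r,a,\phi} P(S, f, r, a, \phi \mid x)$ of all parameters, replacing the free fundamental frequency and scale factor parameters $f$ and $r$ in the quotient terms of (20) by their MAP values and factoring the first term of (20) over individual parameters $f_p$ and $r_p$. This gives

$$
P(S, f, r, a, \phi \mid x) \approx P(S, \hat f, \hat r, \hat a, \hat\phi \mid x)
\prod_p \frac{P(f_p \mid S, \hat f_{\bar p}, \hat a, \hat\phi, x)}{P(\hat f_p \mid S, \hat f_{\bar p}, \hat a, \hat\phi, x)}
\times \prod_p \frac{P(r_p \mid S, \hat a_p)}{P(\hat r_p \mid S, \hat a_p)}
\prod_g \frac{P(a_g, \phi_g \mid S, \hat f, \hat r, \hat a_{\bar g}, \hat\phi_{\bar g}, x)}{P(\hat a_g, \hat\phi_g \mid S, \hat f, \hat r, \hat a_{\bar g}, \hat\phi_{\bar g}, x)}
\tag{23}
$$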

This equation allows approximate numerical integration of the posterior with a maximum complexity of $O((M/G)\,N^{2G})$. Although it is not straightforward to justify mathematically, this additional approximation appears experimentally valid under the assumption that the prior distribution of fundamental frequency parameters enforces proximity to the underlying discrete pitches. For instance, the posterior dependence between fundamental frequencies and other parameters or between scale factors and other parameters for the data of Section 4.1 was above the best setting of $c_{\min}$ determined in that section in less than 4% of the cases, so that the error introduced by factorization of the joint posterior over fundamental frequency and scale factor parameters was generally smaller than that introduced by factorization over amplitude and phase parameters. Similarly to above, this equation may become inaccurate due to a different grouping of the partials when three or more partials overlap at a given frequency and the MAP amplitude and phase values $\hat a$ and $\hat\phi$ are misestimated. However, misestimation of the MAP fundamental frequencies $\hat f$ has little effect, since it affects mostly the grouping of upper frequency partials, which are generally independent a posteriori due to their small amplitude.
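The computational benefit of a factored posterior comes from the fact that the grid integral of a product of independent terms equals the product of the low-dimensional grid integrals. The sketch below illustrates this on a toy example, under the assumption of three one-dimensional, unnormalized factors; in the method described here each factor would instead cover the $2|g|$ amplitude and phase variables of a subset $g$. All function names and factor shapes are hypothetical.

```python
import itertools
import math


def grid(lo, hi, n):
    """Return n equally spaced points on [lo, hi] and their spacing."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)], step


def integrate_joint(factors, lo, hi, n):
    """Riemann sum of prod_k factors[k](x_k) over the full n**K grid."""
    pts, step = grid(lo, hi, n)
    total = 0.0
    for combo in itertools.product(pts, repeat=len(factors)):
        total += math.prod(f(x) for f, x in zip(factors, combo))
    return total * step ** len(factors)


def integrate_factored(factors, lo, hi, n):
    """Product of K one-dimensional Riemann sums, using only K*n evaluations."""
    pts, step = grid(lo, hi, n)
    return math.prod(step * sum(f(x) for x in pts) for f in factors)


# Three toy unnormalized factors standing in for the terms of a factored posterior.
FACTORS = [
    lambda x: math.exp(-0.5 * (x - 0.3) ** 2),
    lambda x: math.exp(-2.0 * (x + 0.1) ** 2),
    lambda x: 1.0 / (1.0 + x * x),
]

if __name__ == "__main__":
    print(integrate_joint(FACTORS, -3.0, 3.0, 15))     # 15**3 evaluations
    print(integrate_factored(FACTORS, -3.0, 3.0, 15))  # 3 * 15 evaluations
```

Both calls return the same value up to rounding error, while the factored version needs 45 function evaluations instead of 3375.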

3.5. Summary of the proposed inference method

To sum up, the proposed inference method is as follows. For each signal frame $x(t)$ and each candidate activity state vector $S$:

(1) estimate the MAP parameter values $(\hat f, \hat r, \hat a, \hat\phi)$ by nonlinear optimization;

(2) either

• group the partials into disjoint subsets $g$ according to the frequency difference criterion (19), or

• compute the posterior dependence criterion (22) between all partials with small frequency difference and group them into disjoint subsets $g$ according to this criterion;

(3) compute the integral of each term of the factored posterior (23) via numerical integration and multiply these integrals to obtain $P(S \mid x)$.
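Step (2) amounts to partitioning the partials into disjoint groups. The following sketch shows one possible way to do this for the frequency difference variant. It assumes that two partials belong to the same subset whenever their frequencies differ by less than a threshold `f_max` (in DFT bins) and that this relation is chained transitively; criterion (19) itself is not reproduced in this section, so both assumptions and the function name are hypothetical.

```python
def group_partials(freqs_bins, f_max):
    """Partition partial indices into disjoint subsets.

    Two partials whose frequencies (in DFT bins) differ by less than f_max
    are placed in the same subset; chains of close partials are merged.
    """
    parent = list(range(len(freqs_bins)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # After sorting by frequency, comparing neighbours is enough: if two
    # partials are closer than f_max, so are all partials lying between them.
    order = sorted(range(len(freqs_bins)), key=lambda i: freqs_bins[i])
    for a, b in zip(order, order[1:]):
        if freqs_bins[b] - freqs_bins[a] < f_max:
            parent[find(a)] = find(b)

    groups = {}
    for i in range(len(freqs_bins)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    freqs = [10.0, 10.4, 13.0, 20.0, 20.9, 21.5]  # made-up partial frequencies in bins
    print(group_partials(freqs, f_max=1.0))
    print(group_partials(freqs, f_max=0.0))
```

With `f_max = 0` every partial forms a singleton subset, which corresponds to full factorization; larger thresholds yield larger subsets. The posterior dependence variant could reuse the same union-find structure with a different pairwise test.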

Various algorithms can be used to address each step. In the following, the MAP parameter values were computed using the subspace trust region optimization algorithm implemented in Matlab's lsqnonlin function (footnote 2). This algorithm was initialized with the parameter values $f_p = \mu_p$, $a_{pm} = \tilde a_{pm}$ and $\phi_{pm} = \tilde\phi_{pm}$ defined in (1) and (13). The posterior dependence was computed for all pairs of partials with frequency difference smaller than 2.5 bins by numerical integration on a uniform grid with 11 points per variable, that is about $1.5 \times 10^4$ samples per pair of partials. Each term of the factored posterior was subsequently integrated by sampling on a uniform grid with $N$ points per variable, resulting in a total of $N_{\mathrm{tot}} = 2PN + \sum_g N^{2|g|}$ samples per candidate activity state vector, where $|g|$ denotes the number of partials in subset $g$. We also tried integration of these terms via importance sampling [13], but this did not significantly affect performance and increased the computation time.
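The sample counts quoted above can be reproduced with a few lines of Python. This is a sketch under two assumptions: that each pair of partials involves four scalar variables (two amplitudes and two phases), and that the total count is read as $N_{\mathrm{tot}} = 2PN + \sum_g N^{2|g|}$, i.e. one one-dimensional grid per fundamental frequency and scale factor plus one $2|g|$-dimensional grid per subset, as the factored form (23) suggests. The function names and the example subset sizes are hypothetical.

```python
def samples_per_pair(points_per_variable=11, n_variables=4):
    """Grid size used to measure the posterior dependence of one pair of partials."""
    return points_per_variable ** n_variables


def total_samples(N, P, subset_sizes):
    """N_tot = 2*P*N + sum over subsets g of N**(2*|g|)."""
    return 2 * P * N + sum(N ** (2 * size) for size in subset_sizes)


if __name__ == "__main__":
    print(samples_per_pair())  # 11**4, i.e. about 1.5e4
    # Hypothetical two-note candidate: ten singleton subsets and one pair.
    print(total_samples(N=15, P=2, subset_sizes=[1] * 10 + [2]))
```

The second call shows how a single subset of two partials dominates the total count, which is why small subsets allow more grid points per variable for a fixed budget.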

3.6. Extension to alternative model structures

The proposed marginalization could be extended to alternative harmonic model structures, such as those described in [21,9,7]. Indeed, the approximate posterior independence property of partials at different frequencies remains valid.

Among all structural differences, these models consider the number of partials per note $M_p$ as a random variable subject to a certain prior. With narrow fundamental frequency priors as in [21,9], the proposed method can be directly applied to compute the integrals of the joint posterior for each value of $M_p$. Note that this results in little additional cost compared to fixed $M_p$. Indeed, when increasing or decreasing $M_p$ by one, only one subset of partials needs to be updated, while the integral over the other subsets remains constant. With wider fundamental frequency priors as in [7], the posterior becomes multimodal with local maxima at all fundamental frequencies present in the signal and rational multiples of these. Amplitude and phase parameters then exhibit a strong dependence with fundamental frequency parameters. The proposed method can still be applied by splitting the fundamental frequency range into disjoint narrow bands, similar to the semitone bands considered above, and summing the integrals of the joint posterior within each band.

Another difference is that the models in [9,7] involve additional parameters, namely one global inharmonicity parameter and one spectral shape parameter per note in [9] and one local inharmonicity parameter per partial in [7]. The proposed method can be directly applied in the second case by grouping local inharmonicity parameters with amplitude and phase parameters from the same partials, yielding a maximum complexity of $O((M/G)\,N^{3G})$. We believe that it could also be applied in the first case after additional factorization of the joint posterior over global inharmonicity and spectral shape parameters. Indeed, these parameters are physically similar to fundamental frequency and scale factor parameters and should exhibit a similar level of posterior dependence with other parameters.

Finally, the models in [21,9,7] describe the likelihood by a Gaussian whose variance is considered as a random variable. Although this distribution does not satisfy (5), the proposed method can still be applied after additional factorization of the posterior over this variance parameter. We believe that this factorization remains approximately accurate provided that the posterior distribution of the variance is unimodal and narrow.

2 http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/lsqnonlin.html.

4. Evaluation

The precision of the integral estimates obtained by the proposed marginalization method cannot be assessed for realistic signals, since ground truth integral values are not available. However, the aim of marginalization is often not to compute the exact values of the state posteriors $P(S \mid x)$, but rather to provide accurate multiple pitch estimation, that is to select the right MAP activity state vector $\hat S$. Therefore, we evaluated the performance of the proposed method with respect to the latter task. The variant of the diagonal Laplace method employed in [20] was also evaluated for comparison.

4.1. Training data and evaluation procedure

The parameter priors were chosen as in [20], without assuming knowledge of the true number of active notes on each frame: the activity states $S_p$ were modeled by Bernoulli priors, the fundamental frequencies $f_p$, the scale factors $r_p$ and the amplitudes of the partials $a_{pm}$ by log-Gaussian priors, the phases of the partials $\phi_{pm}$ by uniform priors and the likelihood by a frequency-weighted Gaussian. Note that the prior over $a_{pm}$ helps to avoid partials with zero amplitude. The means and variances of these priors were learned on a subset of the RWC Musical Instrument database (footnote 3) consisting of isolated notes from five wind instruments (flute, oboe, clarinet, trombone and bassoon) with MIDI pitches ranging from $p_{\mathrm{low}} = 34$ to $p_{\mathrm{high}} = 96$ and about 500–3000 signal frames per pitch.

For each frame, each active note in the MAP state vector $\hat S$ was considered to be correct if it was actually present in the test signal. Performance was then classically assessed by the F-measure $F = 2RP/(R + P)$ in percent, where the recall $R$ is the total number of correct notes over all frames divided by the true number of active notes and the precision $P$ is the proportion of correct notes among the estimated active notes [17,15]. The computation time was measured for a Matlab 7.5 implementation on a 1.2 GHz dual core laptop.
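The evaluation measure can be written as a short Python sketch. The function names and the note counts in the example are hypothetical; only the formula $F = 2RP/(R + P)$ comes from the text.

```python
def f_from_recall_precision(recall, precision):
    """F-measure F = 2RP / (R + P); returns 0 when both R and P are 0."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)


def recall_precision_f(n_correct, n_estimated, n_true):
    """Recall, precision and F-measure in percent from note counts summed over all frames."""
    recall = 100.0 * n_correct / n_true if n_true else 0.0
    precision = 100.0 * n_correct / n_estimated if n_estimated else 0.0
    return recall, precision, f_from_recall_precision(recall, precision)


if __name__ == "__main__":
    # Hypothetical counts: 8 correct notes among 10 estimated, 16 truly active.
    print(recall_precision_f(n_correct=8, n_estimated=10, n_true=16))
```

As a consistency check, plugging the recall and precision pairs reported in Section 4.2 into `f_from_recall_precision`, for example $R = 89.0\%$ and $P = 98.3\%$, gives the F-measures quoted there once rounded to one decimal.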

4.2. Results with one-note and two-note signals

A first experiment was run on one-note and two-note single-frame signals generated by selecting and mixing isolated note signals from the five above wind instruments taken from the University of Iowa Musical Instrument Samples database (footnote 4). More precisely, the test set included 100 one-note signals spanning all discrete pitches from $p = 40$ to 87 and 100 two-note signals corresponding to all possible pitch intervals between 1 and 25 semitones with four different bass pitches $p = 40$, 47, 54 and 61. All signals were sampled at 22.05 kHz and cut to a single frame using a Hanning window $w(t)$ of length $T = 1024$ (46 ms). In order to avoid testing all possible activity state vectors $S$, six candidate vectors (three with one active note and three with two active notes) were automatically pre-selected for each test signal as those minimizing the residual of the orthogonal projection of the observed magnitude spectrum onto the subspace spanned by the mean magnitude spectra of active notes, derived from the amplitude prior as explained in [20]. The thresholds $f_{\max}$ and $c_{\min}$ were varied between 0 and 2 bins and between $10^{3}$ and $10^{-1.5}$ bits, respectively, resulting in a variation of the maximum number of partials per subset from one to three. The average number $N_{\mathrm{tot}}$ of integration samples per test signal and per candidate was varied between $10^{5}$ and $10^{7}$.

3 http://staff.aist.go.jp/m.goto/RWC-MDB/.

4 http://theremin.music.uiowa.edu/MIS.html.

Fig. 4. Multiple pitch estimation results on two-note signals using adaptive posterior factorization based on posterior dependence (left, as a function of $c_{\min}$ in bits) or frequency difference (right, as a function of $f_{\max}$ in bins) between partials. Top: F-measure (%) for various numbers of integration samples (plain: $N_{\mathrm{tot}} = 10^{7}$, dashed: $N_{\mathrm{tot}} = 10^{6}$, dash-dotted: $N_{\mathrm{tot}} = 10^{5}$). Bottom: percentage of partials from two-note candidates within subsets of size 1, 2 or 3. The percentage for subsets of size 3 equals 0.2% for $c_{\min} = 10^{-1.5}$ bits, 1% for $f_{\max} = 2$ bins and 0 in all other cases. All partials from one-note candidates form singleton subsets.

With one-note signals, the proposed method gave perfect results for all settings of $f_{\max}$, $c_{\min}$ and $N_{\mathrm{tot}}$, as expected of any reasonable fundamental frequency estimator. The method in [20] also provided perfect results with a faster average computation time of 0.55 s per candidate, mostly due to the optimization of the MAP parameters.

The results with two-note signals are depicted in Fig. 4. The performance of the proposed method with many integration samples $N_{\mathrm{tot}} = 10^{7}$ increases from $F = 93.4\%$ ($R = 89.0\%$, $P = 98.3\%$) to $F = 96.9\%$ ($R = 94.5\%$, $P = 99.5\%$) for both grouping criteria when the average number of partials per subset increases. This difference is statistically significant, as confirmed by a McNemar's p value [16] of $10^{-3}$. By comparison, the method in [20] achieved a performance of $F = 92.6\%$ ($R = 88.0\%$, $P = 97.8\%$), which is not statistically different from that of the proposed method with a single partial per subset. The best setting for the proposed method consists of using the posterior dependence criterion with $c_{\min} \simeq 10^{0}$ bits. Indeed, this criterion generally results in smaller subsets of partials, thus allowing a larger number $N$ of integration samples per variable for a given total integration cost $N_{\mathrm{tot}}$. The resulting increased accuracy translates into the fact that the performance curve with $N_{\mathrm{tot}} = 10^{5}$ deviates less from the curve with $N_{\mathrm{tot}} = 10^{7}$ for this criterion than for the frequency difference criterion. The above value of $c_{\min}$ is the largest one yielding maximum performance, resulting in as little as 7% of partials found within subsets of two partials for two-note candidates and no partials found within subsets of three or more. For this setting, the computation time with $N_{\mathrm{tot}} = 10^{5}$, or equivalently $N = 15$, was equal to 1.0 s per candidate on average. This can be split into about 0.55 s for the optimization of the MAP parameters, 0.1 s for the computation of the posterior dependence between the partials and 0.35 s for the numerical integration of the terms of the factored posterior. This is much faster than previously reported computation times for MCMC methods with similar models, e.g. 1080 s per active note with $T = 6000$ using a 2.6 GHz dual core computer in [7], corresponding to about 800 s per test signal for two-note signals of length $T = 1024$ with our computer.
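The conversion of the MCMC timing from [7] to the present setting is not spelled out. One plausible reading, sketched below, assumes that the run time scales linearly with the number of active notes and with the frame length, and inversely with the clock frequency of the computer. These scaling rules and the function name are assumptions; they happen to reproduce the figure of about 800 s quoted above.

```python
def rescale_runtime(seconds_per_note, n_notes,
                    ref_frame_len, frame_len,
                    ref_clock_ghz, clock_ghz):
    """Rescale a reported run time assuming linear scaling in the number of
    notes and the frame length, and inverse scaling in the clock frequency."""
    return (seconds_per_note * n_notes
            * (frame_len / ref_frame_len)
            * (ref_clock_ghz / clock_ghz))


if __name__ == "__main__":
    mcmc = rescale_runtime(1080.0, n_notes=2,
                           ref_frame_len=6000, frame_len=1024,
                           ref_clock_ghz=2.6, clock_ghz=1.2)
    proposed = 0.55 + 0.1 + 0.35  # MAP optimization + dependence + integration
    print(f"rescaled MCMC time: {mcmc:.0f} s, proposed method: {proposed:.1f} s per candidate")
```

Under these assumptions the rescaled MCMC time is roughly 800 s per two-note test signal, against 1.0 s per candidate for the proposed method.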

The pitch estimation errors made by the proposed method and the method in [20] are compared in Fig. 5. Errors arise mostly in two situations well known to be difficult [11]: pitch intervals of 12, 19 or 24 semitones, corresponding to integer fundamental frequency ratios of 2, 3 or 4, and pitch intervals of 1–10 semitones with medium or low pitch bass. These errors can be explained, respectively, by the fact that all the partials of one note overlap with the partials of the other and that the frequency resolution is too low to distinguish multiple notes at low fundamental frequencies. The proposed method reduced the number of errors in both situations. In particular, correct estimation was achieved for some pitch intervals of 19 semitones and all intervals of 24 semitones, while such intervals would typically result in estimation errors for other models based on uniform or zero-mean Gaussian amplitude priors [21,7].

Fig. 5. Comparison of pitch estimation errors on two-note signals, as a function of the interval (1 to 25 semitones) and the bass pitch (MIDI 40, 47, 54 and 61), using adaptive posterior factorization with the best setting or the diagonal Laplace variant method in [20]. White, gray and black squares denote, respectively, accurate estimation for both methods, accurate estimation for the proposed method only and erroneous estimation for both methods. There are no cases where [20] was accurate but the proposed method was not.

Table 1. Multiple pitch estimation results on real-world signals

| Method | Number of instruments | R (%) | P (%) | F (%) | Average time per candidate state (s) | Total time (h) |
|---|---|---|---|---|---|---|
| Adaptive posterior factorization | 2 | 87.0 | 91.8 | 89.3 | 1.8 | 8 |
| | 3 | 60.1 | 86.2 | 70.9 | 5.2 | 24 |
| | 4 | 51.2 | 82.2 | 63.1 | 5.4 | 26 |
| | 5 | 42.2 | 85.4 | 56.5 | 5.5 | 28 |
| Diagonal Laplace variant [20] | 2 | 86.2 | 91.3 | 88.6 | 0.8 | 4 |
| | 3 | 59.3 | 85.9 | 70.1 | 1.4 | 6 |
| | 4 | 49.4 | 81.5 | 61.5 | 1.5 | 7 |
| | 5 | 40.4 | 84.1 | 54.6 | 1.3 | 6 |
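The overlap argument made in Section 4.2 for intervals of 12, 19 and 24 semitones can be checked numerically. The sketch below assumes perfectly harmonic partials up to the Nyquist frequency, equal-tempered MIDI pitches with A4 (MIDI 69) at 440 Hz, and a tolerance of one DFT bin for deciding that two partials overlap; these conventions and all function names are assumptions made for illustration. The sampling rate and frame length are those of the experiments.

```python
def midi_to_hz(pitch):
    """Equal-tempered frequency of a MIDI pitch, assuming A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12.0)


def harmonic_partials(f0, nyquist):
    """Frequencies of the perfectly harmonic partials of f0 below the Nyquist frequency."""
    return [m * f0 for m in range(1, int(nyquist // f0) + 1)]


def overlapping_fraction(bass_pitch, interval, tol_bins=1.0,
                         sample_rate=22050.0, frame_len=1024):
    """Fraction of upper-note partials lying within tol_bins DFT bins of a bass-note partial."""
    bin_hz = sample_rate / frame_len
    nyquist = sample_rate / 2.0
    bass = harmonic_partials(midi_to_hz(bass_pitch), nyquist)
    upper = harmonic_partials(midi_to_hz(bass_pitch + interval), nyquist)
    close = sum(
        1 for fu in upper
        if min(abs(fu - fb) for fb in bass) < tol_bins * bin_hz
    )
    return close / len(upper)


if __name__ == "__main__":
    for interval in (7, 12, 19, 24):
        frac = overlapping_fraction(bass_pitch=40, interval=interval)
        print(f"interval {interval:2d} semitones: {100 * frac:.0f}% of upper partials overlap")
```

For octave-type intervals every partial of the upper note coincides with a partial of the bass note, so the upper note can only be detected through the amplitudes of shared partials, which is where informative amplitude priors matter.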

4.3. Results with real-world signals

In order to estimate the potential performance of the proposed method on real-world data, a second experiment was run on the data for the multiple fundamental frequency estimation task of the 2007 Music Information Retrieval Evaluation eXchange (MIREX; footnote 5). These data consist of public and hidden excerpts of recordings and monophonic pitch annotations of individual instrument parts of a wind quintet by Beethoven. Four test signals with two to five instruments were generated by successively summing together the parts of flute, clarinet, bassoon, horn and oboe of the first 17 s of the public excerpt. These signals were then resampled at 22.05 kHz and framed with half-overlapping Hanning windows of length $T = 1024$ (46 ms). A variable number of candidate activity state vectors $S$ with zero to five active notes was automatically pre-selected for each frame using the iterative state jump algorithm employed in [20], resulting in 23 candidates per frame on average. Inference was performed using the best settings determined above, namely $c_{\min} = 10^{0}$ bits and $N = 15$, leading to subsets of one to three partials.

The results are detailed in Table 1. On average, the proposed method ran in 17 h and achieved $F = 70.0\%$, while the method in [20] ran in 6 h and achieved $F = 68.7\%$. The similarity between these performance figures suggests that full posterior factorization can work nearly as well as adaptive posterior factorization in practice. One reason for this is that the pitch intervals for which an improvement was observed in the first experiment occur more rarely in this experiment. These figures are in line with the top ones reported at MIREX. For instance, the best method [15] achieved $F = 73\%$ on hidden excerpts of the same data, while exploiting additional temporal priors.

5 http://www.music-ir.org/mirex2007/.

5. Conclusion

We proposed a fast inference method for Bayesian harmonic models based on approximate factorization of the joint posterior into a product of distributions over disjoint parameter subsets and numerical integration of these distributions. A local posterior dependence criterion was exploited to determine relevant subsets. Although factorization based on this criterion is theoretically feasible for any Bayesian model, it does not necessarily provide small subsets, which are needed for subsequent numerical integration. The key property of harmonic models demonstrated here is that the parameters of partials with different frequencies are approximately independent a posteriori. Compared to classical inference methods such as MCMC and variational approximation, this method is tractable for a wide range of priors and allows a variable level of factorization depending on the observed signal and the hypothesized notes. However, the resulting marginal probability estimates are intrinsically less accurate. Hence, we believe that it is most beneficial when classical methods are intractable due to the chosen priors. Our experiments suggest that approximate inference with priors motivated by empirical parameter distributions can provide better pitch transcription results than exact inference with standard priors motivated by computational issues.

To improve the accuracy of the factorization, it would be interesting to investigate transformations of the parameters resulting in a smaller posterior dependence. The minimization of the dependence between subsets of random variables described by a sequence of samples is known as the independent subspace analysis (ISA) problem and can be solved in the case of linear transformations by independent component analysis (ICA) followed by grouping of the transformed variables [1]. This approach could easily be combined with subsequent integration based on importance sampling, and this may also allow other Bayesian models, which do not readily satisfy the posterior independence property, to benefit from the proposed inference method.


E. Vincent, M.D. Plumbley / Neurocomputing 72 (2008) 79–87

Acknowledgments

The authors would like to thank the anonymous reviewers for useful comments on the first version of this article. The first author also wishes to thank Cedric Fevotte and Simon J. Godsill for motivating discussions about the influence of prior distributions on the performance of harmonic models and the practical use of MCMC methods.

Appendix A. Proof of Eq. (18)

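We start from the expression of the joint posterior in (17). On the one hand, by evaluating (17) with $a = \hat a$ and $\phi = \hat\phi$, we obtain

$$
P(S, f, r, \hat a, \hat\phi \mid x) \propto P_0(x, f)\, P(r \mid S)\, P(f \mid S)\, P(S)
\times \prod_{p,m} P_{pm}(\hat a_{pm}, \hat\phi_{pm}, x, f_p)\, P(\hat a_{pm} \mid r_p)\, P(\hat\phi_{pm})
\tag{A.1}
$$

After dividing (17) by (A.1), we get

$$
\frac{P(S, f, r, a, \phi \mid x)}{P(S, f, r, \hat a, \hat\phi \mid x)}
= \prod_{p,m} \frac{P_{pm}(a_{pm}, \phi_{pm}, x, f_p)\, P(a_{pm} \mid r_p)\, P(\phi_{pm})}{P_{pm}(\hat a_{pm}, \hat\phi_{pm}, x, f_p)\, P(\hat a_{pm} \mid r_p)\, P(\hat\phi_{pm})}
\tag{A.2}
$$

On the other hand, by evaluating (17) with $a_{p'm'} = \hat a_{p'm'}$ and $\phi_{p'm'} = \hat\phi_{p'm'}$ for all partials $(p', m')$ except $(p, m)$, we have

$$
\begin{aligned}
P(S, f, r, a_{pm}, \hat a_{\overline{pm}}, \phi_{pm}, \hat\phi_{\overline{pm}} \mid x)
&\propto P_0(x, f)\, P(r \mid S)\, P(f \mid S)\, P(S) \\
&\quad \times P_{pm}(a_{pm}, \phi_{pm}, x, f_p)\, P(a_{pm} \mid r_p)\, P(\phi_{pm}) \\
&\quad \times \prod_{(p',m') \neq (p,m)} P_{p'm'}(\hat a_{p'm'}, \hat\phi_{p'm'}, x, f_{p'})\, P(\hat a_{p'm'} \mid r_{p'})\, P(\hat\phi_{p'm'})
\end{aligned}
\tag{A.3}
$$

Using Bayes' law, this leads to

$$
\begin{aligned}
P(a_{pm}, \phi_{pm} \mid S, f, r_p, \hat a_{\overline{pm}}, \hat\phi_{\overline{pm}}, x)
&\propto P_0(x, f)\, P_{pm}(a_{pm}, \phi_{pm}, x, f_p)\, P(a_{pm} \mid r_p)\, P(\phi_{pm}) \\
&\quad \times \prod_{(p',m') \neq (p,m)} P_{p'm'}(\hat a_{p'm'}, \hat\phi_{p'm'}, x, f_{p'})
\end{aligned}
\tag{A.4}
$$

When evaluating this expression for $a_{pm} = \hat a_{pm}$ and $\phi_{pm} = \hat\phi_{pm}$, we get

$$
\begin{aligned}
P(\hat a_{pm}, \hat\phi_{pm} \mid S, f, r_p, \hat a_{\overline{pm}}, \hat\phi_{\overline{pm}}, x)
&\propto P_0(x, f)\, P_{pm}(\hat a_{pm}, \hat\phi_{pm}, x, f_p)\, P(\hat a_{pm} \mid r_p)\, P(\hat\phi_{pm}) \\
&\quad \times \prod_{(p',m') \neq (p,m)} P_{p'm'}(\hat a_{p'm'}, \hat\phi_{p'm'}, x, f_{p'})
\end{aligned}
\tag{A.5}
$$

After dividing (A.4) by (A.5), we conclude that

$$
\frac{P(a_{pm}, \phi_{pm} \mid S, f, r_p, \hat a_{\overline{pm}}, \hat\phi_{\overline{pm}}, x)}{P(\hat a_{pm}, \hat\phi_{pm} \mid S, f, r_p, \hat a_{\overline{pm}}, \hat\phi_{\overline{pm}}, x)}
= \frac{P_{pm}(a_{pm}, \phi_{pm}, x, f_p)\, P(a_{pm} \mid r_p)\, P(\phi_{pm})}{P_{pm}(\hat a_{pm}, \hat\phi_{pm}, x, f_p)\, P(\hat a_{pm} \mid r_p)\, P(\hat\phi_{pm})}
\tag{A.6}
$$

The target result (18) is then readily obtained by substituting (A.6) into (A.2). Note that the fact that $\hat a$ and $\hat\phi$ are the MAP amplitude and phase parameter values associated with $S$, $f$ and $r$ is not used within this proof.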

References

[1] J.-F. Cardoso, Multidimensional independent component analysis, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1998, pp. IV-1941–1944.

[2] G. Casella, C.P. Robert, Monte Carlo Statistical Methods, second ed., Springer, New York, NY, 2005.

[3] A.T. Cemgil, S.J. Godsill, Efficient variational inference for the dynamic harmonic model, in: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2005, pp. 271–274.

[4] A.T. Cemgil, H.J. Kappen, D. Barber, A generative model for music transcription, IEEE Trans. Audio Speech Lang. Process. 14 (2) (2006) 679–694.

[5] D.M. Chickering, D. Heckerman, Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 1996, pp. 158–168.

[6] A. Cont, Realtime multiple pitch observation using sparse non-negative constraints, in: Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2006, pp. 206–212.

[7] M. Davy, S.J. Godsill, J. Idier, Bayesian analysis of western tonal music, J. Acoust. Soc. Am. 119 (4) (2006) 2498–2517.

[8] D.P.W. Ellis, Prediction-driven computational auditory scene analysis, Ph.D. Thesis, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, 1996.

[9] S.J. Godsill, M. Davy, Bayesian computational models for inharmonicity in musical instruments, in: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2005, pp. 283–286.

[10] M. Horne (Ed.), Prosody: Theory and Experiment, Kluwer Academic Publishers, Boston, MA, 2000.

[11] A. Klapuri, M. Davy, Signal Processing Methods for Music Transcription, Springer, New York, NY, 2006.

[12] D.J.C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, Cambridge, UK, 2003.

[13] R.M. Neal, Probabilistic inference using Markov chain Monte Carlo methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.

[14] C. Raphael, Automatic transcription of piano music, in: Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2002, pp. 15–19.

[15] M.P. Ryynanen, A.P. Klapuri, Polyphonic music transcription using note event modeling, in: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2005, pp. 319–322.

[16] D.J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, second ed., Chapman & Hall, Boca Raton, FL, 2000.

[17] C.J. van Rijsbergen, Information Retrieval, second ed., Butterworths, London, UK, 1979.

[18] E. Vincent, Musical source separation using time-frequency source priors, IEEE Trans. Audio Speech Lang. Process. 14 (1) (2006) 91–98.

[19] E. Vincent, M.D. Plumbley, Fast factorization-based inference for Bayesian harmonic models, in: Proceedings of the IEEE International Conference on Machine Learning for Signal Processing (MLSP), 2006, pp. 117–122.

[20] E. Vincent, M.D. Plumbley, Low bit-rate object coding of musical audio using Bayesian harmonic models, IEEE Trans. Audio Speech Lang. Process. 15 (4) (2007) 1273–1282.

[21] P.J. Walmsley, S.J. Godsill, P.J.W. Rayner, Polyphonic pitch tracking using joint Bayesian estimation of multiple frame parameters, in: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 1999, pp. 119–122.

[22] M. Wu, D. Wang, G.J. Brown, A multipitch tracking algorithm for noisy speech, IEEE Trans. Speech Audio Process. 11 (3) (2003) 229–241.

Emmanuel Vincent received the Mathematics degree of the Ecole Normale Superieure, Paris, France, in 2001 and Ph.D. degree in Acoustics, Signal Processing and Computer Science Applied to Music from the University of Paris-VI Pierre et Marie Curie, Paris, in 2004. From 2004 to 2006, he was a Research Assistant with the Centre for Digital Music at Queen Mary, University of London, London, UK. He is now a Permanent Researcher with the French National Institute for Research in Computer Science and Control (INRIA). His research focuses on structured probabilistic modeling of audio signals applied to blind source separation, indexing and object coding of musical audio.

Mark D. Plumbley began his research in the area of Neural Networks in 1987, as a Ph.D. Research Student at Cambridge University Engineering Department. Following his Ph.D. he joined King's College London in 1991, and in 2002 moved to Queen Mary University of London to help establish the new Centre for Digital Music. He is currently working on the analysis of musical audio, including automatic music transcription, beat tracking, audio source separation, independent component analysis and sparse coding. Dr. Plumbley currently coordinates two UK Research Networks: the Digital Music Research Network (www.dmrn.org) and the ICA Research Network (www.icarn.org).