[IEEE 2005 IEEE Conference on Control Applications, 2005. CCA 2005. - Toronto, Canada (Aug. 29-31, 2005)] Proceedings of 2005 IEEE Conference on Control Applications, 2005. CCA 2005



Abstract—In this paper we discuss the problem of partial tracking as applied to music signals, and propose a tracking algorithm based on Kalman filtering. This algorithm is capable of tracking both frequency and power partials, which are used in different areas of music signal analysis. We introduce a set of state-space models for our signals based on the evolution of frequency and amplitude in different classes of musical instruments. These prior models are used to estimate future values of partial tracks in successive time frames of our spectral data. We present and evaluate the performance of our tracker in different possible scenarios where there are crossing partials or vibrato.

I. INTRODUCTION

Tracking of partials plays an instrumental role in the areas of music signal analysis where the focus is on identifying fundamental properties of these signals, such as pitch and the frequency-amplitude behaviour of harmonics. In areas such as automatic music transcription and audio restoration [1] we have discrete sets of time-frequency data, and we need to obtain pseudo-stationary information by making appropriate connections between data from successive time frames. In the literature there are various methodologies for tackling this problem, all of which are based on a model of time-varying sinusoidal components plus noise [2]. The idea of partial tracking was first used in the analysis and synthesis of speech signals [3] and then adopted for music signals [2], where it was based on a heuristic approach. In both approaches, the basic idea is that a set of frequency guides advances in time through the spectral peaks, looking for appropriate peaks that lie within a small vicinity of the guide. In a more recent approach [4], and as an extension to [3], linear prediction was used to enhance the tracking of frequency components in music signals. In all these approaches peaks from successive frames are connected to each other based on their proximity in frequency, and the behaviour of the peaks' amplitude is not taken into account while performing the tracking.

Another approach [5], inspired by a similar technique in radar tracking and by a frequency tracker for avalanche signals [6], takes advantage of the Kalman filter by constructing a state-space model for the behaviour of the peaks' power (i.e. amplitude on a dB scale) and frequency. In this approach peaks are matched not by how close they are in frequency, but by the predicted future behaviour of a peak's frequency and power.

The authors are with the Electrical and Computer Engineering department, Northeastern University, Boston, MA 02115, USA (phone: 617-373-2984; fax: 617-373-4189; e-mail: {hsattar&shafai}@ece.neu.edu).

In the next section we introduce the problem of automatic music transcription as a main application area for our partial tracking method. Modeling of our music signals is discussed in section III. In section IV we present the formulation of our Kalman tracker and discuss different aspects of its performance. Tracking results for various signals and a comparison with the method of [5] are included in section V. Limitations of the conventional Kalman filter are discussed at the end.

II. BACKGROUND

Music transcription is the process of un-making or documenting music: un-making in the sense that the process of reading from a score and playing music is reversed [5], and documenting in the sense of substantiating musical sounds, whether they have been played from a score, from memory, or improvised.

A music transcription system, in its perfection, should be able to detect all attributes as written in the score, such as loudness and tempo, as well as performance gestures intended by the performer. At the fundamental level, however, it is the problem of recognizing which note is played and when. Although this can be done by a human listener with trained ears, developing a music transcription system that replaces the human listener with a computer, even at this basic level, requires sophisticated signal processing techniques. For polyphonic music, where more than one musical sound is present at a time, keeping track of individual notes and producing even a simple score is much more complex.

Each musical note contains a fundamental frequency and integer multiples of this frequency, called the harmonics of the note. To identify a note we need to know its fundamental frequency, or pitch. For transcribing a piece of music, the identity of each note and its duration are required. Since in a real scenario more than one note can be played at a time, we need to distinguish between these notes. This can be done by first identifying all the partials and their initiation and termination times. The fundamental and all the harmonics related to each note, along with their durations, are then extracted and can be directly translated into a musical score.
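As a concrete illustration (not part of the original system), the harmonic series described above can be computed directly: a note with fundamental f0 has partials at integer multiples of f0.

```python
def harmonic_frequencies(f0, n_partials):
    """Return the fundamental and its harmonics (n * f0) in Hz."""
    return [n * f0 for n in range(1, n_partials + 1)]

# The A above middle C (440 Hz) and its first partials:
print(harmonic_frequencies(440.0, 4))  # [440.0, 880.0, 1320.0, 1760.0]
```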

A general music transcription system takes the waveform of recorded music and finds the behavior of frequencies within small time frames using spectral estimation tools, assuming that the signal is a combination of sinusoids and noise. In fact, we are dealing with pseudo-stationary signals whose amplitudes and frequencies vary slowly with time.

Kalman Filtering Application in Automatic Music Transcription Hamid Satar-Boroujeni, Member, IEEE and Bahram Shafai, Senior Member, IEEE


Proceedings of the 2005 IEEE Conference on Control Applications, Toronto, Canada, August 28-31, 2005

WC3.2

0-7803-9354-6/05/$20.00 ©2005 IEEE


We choose small time frames to preserve the stationarity required for estimating the spectrum. This process results in a representation with power concentrated at specific frequencies. These frequencies, the local maxima within the spectral representation, indicate the partials of the musical notes present in that time frame. Identified peaks from adjacent frames which belong to the same partial must be connected to each other using data association techniques. One possible approach is to use the Kalman filter to track partials through neighboring time frames based on a pre-estimated state-space model for the evolution of frequency and power in time. This idea closely parallels the use of Kalman filtering in radar tracking [7].

In music signal analysis, the detection of peaks plays an important role. We need to collect all possible peaks pertaining to existing partials and reject those most likely related to noise or to imperfections in estimating the spectrum. Choosing an appropriate number of peaks keeps the computational load of the tracking process manageable; on the other hand, a large number of inaccurate peaks can lead to false partial tracks being formed from randomly successive sets of spurious peaks. With this in mind, we previously proposed an improved technique for peak detection [8].
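The peak-detection step can be sketched as follows. This is only a minimal illustration, not the improved detector of [8]: it assumes a Hann-windowed FFT frame, a simple local-maximum test, and a hypothetical power threshold.

```python
import numpy as np

def detect_peaks(frame, fs, threshold_db=-40.0):
    """Pick local maxima of the magnitude spectrum of one windowed frame.
    Returns (frequency_hz, power_db) pairs above a power threshold."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    power_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    peaks = []
    for i in range(1, len(power_db) - 1):
        # strict local maximum above the threshold
        if power_db[i] > power_db[i - 1] and power_db[i] > power_db[i + 1] \
                and power_db[i] > threshold_db:
            peaks.append((freqs[i], power_db[i]))
    return peaks
```

For a pure 440 Hz sinusoid this returns a peak near 440 Hz (to within the FFT bin spacing); a real detector would also reject spurious sidelobe maxima, which is the concern addressed in [8].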

III. MODELING

A. Time-Varying Partials

A well-known approach to modeling music signals for the purpose of statistical analysis/synthesis assumes an additive model of sinusoids plus residual, which can be formulated as [2]

y(t) = s(t) + e(t)    (1)

with

s(t) = Σ_{n=1}^{N} A_n(t) cos(ω_n(t) t + φ_n(t))    (2)

Here, s(t) reflects the pure musical part of the signal and e(t) can be modeled as a stationary autoregressive process. In the musical portion, A_n(t) and ω_n(t) represent the time-varying amplitude and frequency of the partials, and N is the number of partials. The quantity φ_n(t) represents timbral variations and performance effects. Since we do not consider such effects in our music signals, φ_n(t) will be treated as a noise process.
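As an illustration, the model of (1)-(2) can be used to synthesize a test signal. This is a sketch under stated assumptions: amplitude and frequency tracks are given as sampled arrays, and φ_n(t) is folded into a single white-noise residual standing in for e(t).

```python
import numpy as np

def synthesize(amps, freqs_hz, fs, noise_std=0.01):
    """Sum of sinusoids with slowly varying amplitude/frequency tracks
    (eq. 2), plus a white-noise residual standing in for e(t) in eq. 1.
    amps, freqs_hz: arrays of shape (n_partials, n_samples)."""
    amps = np.asarray(amps, dtype=float)
    freqs = np.asarray(freqs_hz, dtype=float)
    n_partials, n_samples = amps.shape
    # Integrate instantaneous frequency to get each partial's phase.
    phases = 2.0 * np.pi * np.cumsum(freqs, axis=1) / fs
    s = np.sum(amps * np.cos(phases), axis=0)
    return s + np.random.normal(0.0, noise_std, n_samples)
```

Signals of this kind (constant versus linearly decaying power tracks) are what section V uses to test crossing partials.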

It should be pointed out that although we need only frequency partials for the purpose of music transcription, keeping track of power partials can also improve performance of the tracker by providing more information about a peak. This is evident in the presence of crossing partials which will be discussed in section IV.C.

B. General Model for Evolution of Partials

The next step in our music signal modeling is the estimation of A_n(t) and ω_n(t) using the available observations from the peak detection step. What we have are discrete sets of peaks from successive time frames. A_n(t) and ω_n(t) can be estimated by connecting those peaks from adjacent frames that appear to be the continuation of the same partial.

The Kalman filter takes noisy observations and, based on a model for the evolution of certain states, finds the optimal estimate of the process behavior. Here the noise-corrupted observations are the identified peaks, and the system model is a state-space model for the evolution of frequency and power. This model can be represented as

x(k+1) = A x(k) + B v(k)
y(k)   = C x(k) + w(k)    (3)

where

x(k) = [f(k)  p(k)  n_1(k) … n_m(k)]^T
v(k) = [u_1(k) … u_m(k)]^T
y(k) = [f(k)  p(k)]^T    (4)

Here, f(k) and p(k) are the frequency and power of a detected peak, respectively. v(k) and w(k) are the process noise and observation noise, and n_i(k), i = 1,…,m, are the states of shaping filters for which the driving noise processes u_i(k), i = 1,…,m, are white and uncorrelated. The matrix A is the transition matrix, B describes the coupling of the process noise v(k) into the system states, and C is the observation matrix. In this model, v(k) and w(k) are zero-mean, jointly uncorrelated Gaussian processes with covariance matrices Q and R, respectively.

C. Instrument-Specific Models

To specify the matrices and the number of states needed for our modeling, prior information about the power and frequency partials is needed. This helps us specify the model through a piecewise-linear fit to p(t) = 20 log10 A_n(t) and f(t) = ω_n(t)/2π.

Melodic instruments can be classified into two groups based on how their sound-production source behaves, which directly affects the shape of the amplitude partials. If the source continues to inject energy during sound production, the overall shape of the steady-state part of the amplitude track will be non-decaying. We place these instruments in the class of Continued Energy Injection (CEI). Examples are woodwind and brass instruments, and the violin from the string family. Since we are not considering performance effects such as glissando and vibrato, which affect the amplitude and frequency partials during the steady state, we can treat these partials as nearly constant with additive noise, following the attack and before the offset. The onset and offset parts exhibit different characteristics and will not be considered in our analysis here. An example of the fundamental of the chamber note played on a clarinet is shown in the upper part of figure 1.



The second group includes those instruments for which the injection of energy is discontinued, so that the amplitude partial exhibits an exponentially decaying shape. In this case the power partial decays linearly, since it is on a logarithmic scale. These instruments form the class of Discontinued Energy Injection (DEI). Members of this group are hammered and plucked instruments such as the piano and guitar. An example of the shape of the fundamental for the chamber note played on a piano is shown in the lower part of figure 1.

As seen on the right-hand side of the figure, the frequency partials of both classes are nearly constant in time, which suggests the same model for the evolution of frequency for all melodic instruments.

In a polyphonic setting there are three possible scenarios when we do not consider non-melodic instruments such as drums. A piece of music can consist of instruments from the CEI, DEI or a combination of both.

For the first scenario, where both frequency and power are nearly constant, we have

f(k+1)   = f(k) + n_1(k)
n_1(k+1) = a_1 n_1(k) + b_1 u_1(k)
p(k+1)   = p(k) + n_2(k)
n_2(k+1) = a_2 n_2(k) + b_2 u_2(k)

x(k) = [f(k)  p(k)  n_1(k)  n_2(k)]^T
v(k) = [u_1(k)  u_2(k)]^T

y(k) = C x(k) + w(k),    C = [1 0 0 0; 0 1 0 0]    (5)
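The matrices of the fourth-order model in (5) can be assembled as follows. This is a sketch: the shaping-filter parameters a_1, b_1, a_2, b_2 are passed in as arguments, whereas in the actual system they would come from the frequency-bin tables estimated in [9].

```python
import numpy as np

def cei_model(a1, b1, a2, b2):
    """State x = [f, p, n1, n2]^T: frequency and power are random walks
    driven by first-order shaping filters (eq. 5)."""
    A = np.array([[1.0, 0.0, 1.0, 0.0],   # f(k+1)  = f(k) + n1(k)
                  [0.0, 1.0, 0.0, 1.0],   # p(k+1)  = p(k) + n2(k)
                  [0.0, 0.0, a1,  0.0],   # n1(k+1) = a1 n1(k) + b1 u1(k)
                  [0.0, 0.0, 0.0, a2]])   # n2(k+1) = a2 n2(k) + b2 u2(k)
    B = np.array([[0.0, 0.0],
                  [0.0, 0.0],
                  [b1,  0.0],
                  [0.0, b2]])
    C = np.array([[1.0, 0.0, 0.0, 0.0],   # only f and p are observed
                  [0.0, 1.0, 0.0, 0.0]])
    return A, B, C
```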

For the second and third scenarios we have the same model, but the parameters differ, since we estimate them from a different database of sounds for each scenario. This model can be written as

f(k+1)   = f(k) + n_1(k)
n_1(k+1) = a_1 n_1(k) + b_1 u_1(k)
p(k+1)   = p(k) + v_p(k)
v_p(k+1) = v_p(k) + n_2(k)
n_2(k+1) = a_2 n_2(k) + b_2 u_2(k)

x(k) = [f(k)  p(k)  v_p(k)  n_1(k)  n_2(k)]^T
v(k) = [u_1(k)  u_2(k)]^T

y(k) = C x(k) + w(k),    C = [1 0 0 0 0; 0 1 0 0 0]    (6)

where the additional state v_p(k) captures the linear decay of the power partial.
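The fifth-order model of (6) differs from (5) only in the extra slope state for the power partial; a corresponding sketch, again with hypothetical parameter arguments:

```python
import numpy as np

def dei_model(a1, b1, a2, b2):
    """State x = [f, p, v_p, n1, n2]^T: the extra state v_p gives the
    power partial a (noisy) constant slope, i.e. linear decay in dB (eq. 6)."""
    A = np.array([[1.0, 0.0, 0.0, 1.0, 0.0],   # f(k+1)   = f(k) + n1(k)
                  [0.0, 1.0, 1.0, 0.0, 0.0],   # p(k+1)   = p(k) + v_p(k)
                  [0.0, 0.0, 1.0, 0.0, 1.0],   # v_p(k+1) = v_p(k) + n2(k)
                  [0.0, 0.0, 0.0, a1,  0.0],   # shaping filter for n1
                  [0.0, 0.0, 0.0, 0.0, a2]])   # shaping filter for n2
    B = np.array([[0.0, 0.0],
                  [0.0, 0.0],
                  [0.0, 0.0],
                  [b1,  0.0],
                  [0.0, b2]])
    C = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0, 0.0]])
    return A, B, C
```

With a negative v_p, repeated application of A produces the linearly decaying power track that characterizes the DEI class.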

As noted above, we estimate the parameters of each model, e.g. a_1, b_1, a_2, b_2, by performing a statistical analysis on a large number of musical sounds with known identities in a forward-problem setting. The details of this procedure are presented in [9].

In our experience, these parameters turn out to be frequency dependent. Therefore, in each scenario we have a different set of parameters for each frequency bin. The estimated parameters for both the CEI and DEI classes are shown in figure 2.

IV. PARTIAL TRACKING

As mentioned earlier, Kalman tracking comes into the picture where we need to make appropriate connections among discrete sets of information, in separate time frames, that relate to the same event. Here we process discrete segments of a music signal and, after extracting information about the behaviour of the partials, put this information together to represent the shape of these partials in time. The extracted information, i.e. the noisy observations, consists of the frequency and power of peaks in different time frames; we use the Kalman filter to estimate the noise-free outputs and, from there, to find the set of frequency and power data in the adjacent frame most closely related to them.

Figure 1: Shape of power and frequency partials for the fundamental of the chamber note played on a clarinet and a piano.

Figure 2: Estimated parameters for CEI (dashed) and DEI (solid).

A. Kalman Tracker

Considering the evolution model of (3) with states as in either (5) or (6), we can write the recursive formulation of the Kalman tracker as follows.

Measurement update:

M(k) = P(k|k-1) C^T [C P(k|k-1) C^T + R]^{-1}
x̂(k|k) = x̂(k|k-1) + M(k) [y(k) - C x̂(k|k-1)]
P(k|k) = [I - M(k) C] P(k|k-1)    (7)

Time update:

x̂(k+1|k) = A x̂(k|k)
P(k+1|k) = A P(k|k) A^T + B Q B^T    (8)

Initial values:

P(1|0) = B Q B^T
x̂(1|0) = [f_i(0)  p_i(0)  0  0]^T    (9)

Noise covariances:

Q(k) = E[v(k) v(k)^T] = σ^2 I
R(k) = E[w(k) w(k)^T] = [σ_1^2 0; 0 σ_2^2]    (10)

Here M is the filter gain, P is the state error covariance matrix, and f_i(0) and p_i(0) are the frequency and power of the initial peak in the i-th track. σ_1^2 and σ_2^2 are the variances of the observation noise processes, and there is no theoretical way of determining these values. In practice, the performance of our tracker is not very sensitive to these parameters as long as they stay close to one [5]; we set them equal to 0.97. The formulation of the Kalman tracker applies to both the 4th- and 5th-order models in (5) and (6), except for the initial state vector in (9), which has one more zero entry for the 5th-order model.

The tracker is initiated with peak data from the first time frame. Depending on the nature of the music signal and the class of instrument, and based on the frequency of the initial peak, a set of parameters from the corresponding frequency bin is selected for the evolution model. The Kalman tracker then estimates the noise-free values of power and frequency in the following frame. If that frame contains a peak close enough to the estimated values, the peak is added to the track and used to update the tracker. This process continues through successive frames until no peak lies close enough to the last estimated peak, at which point the track is terminated, or considered "dead", and a new track is initiated in the following frame. The process starts with all peaks in the first frame, and also with all peaks from later frames that have not been used in any track.

For data association we follow the idea of "global nearest neighbour" (GNN) association [10]: one observation can be claimed by only one track, and at most one observation can be used to update a given track.

B. Adaptive Acceptance Gate

After estimating the noise-free values for the frequency and power of a peak in the i-th track at time frame k, i.e. f̂_i(k|k-1) and p̂_i(k|k-1), we compare them to all the peaks in the k-th frame. We then update the tracker with the peaks that are close enough to these estimates, or in other words, that fall into the acceptance gate of the tracker. A distance function is defined between a peak's frequency and power and the frequency and power estimated from the previous frame. This function is [7]

d^2(k) = e(k)^T [C P(k|k-1) C^T + R]^{-1} e(k)    (11)

where e(k) = y(k) - C x̂(k|k-1) is the error between the current observation and the predicted values, and C P(k|k-1) C^T + R is the covariance matrix of this error. A peak falls into the acceptance gate of an estimated peak if the value of its distance function is less than the gate value. If more than one peak falls into the acceptance gate, the one with the smallest distance is selected.

Based on our experience, it is not possible to set a universal value for the acceptance gate in our application; it must be adapted to the frequency of the partial tracks. As mentioned earlier, we are dealing with pseudo-stationary signals. The frequencies of our partials vary with time, but these variations are magnified as we move from lower to higher harmonics. If we used the same gate value at all frequencies, we would risk missing tracks at higher frequencies or loosely accepting false partial tracks at lower frequencies. To cope with these variations we set the gate value as a function of frequency:

g(f) = 10 + 0.01 f    (12)

In effect, we increase the chance of continuing a track at higher frequencies, where the peaks are sparser and less likely to join a track.
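The gating test of (11)-(12) can be sketched as follows, reusing the state-space matrices from the model section; the Mahalanobis-style distance weights the frequency and power errors by the predicted error covariance:

```python
import numpy as np

def gate_value(f):
    """Eq. 12: frequency-dependent acceptance gate (wider at high f)."""
    return 10.0 + 0.01 * f

def distance(y, x_pred, P_pred, C, R):
    """Eq. 11: squared distance between an observed peak y = [f, p]
    and a track's prediction, weighted by the innovation covariance."""
    e = y - C @ x_pred
    S = C @ P_pred @ C.T + R
    return float(e.T @ np.linalg.inv(S) @ e)

def accept(y, x_pred, P_pred, C, R):
    """A peak is accepted if its distance falls inside the gate for
    its observed frequency."""
    return distance(y, x_pred, P_pred, C, R) < gate_value(y[0])
```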

C. Crossing Partials

Although power and frequency partials evolve independently of each other, using a function of both power and frequency in the distance function (11) is especially rewarding when we are dealing with crossing partials. This mostly happens in the third scenario of section III.C, where we can have a combination of constant and linearly decaying power partials. In partial tracking techniques where power and frequency partials are tracked separately ([2], [3], [4]), the problem of crossing partials needs considerable attention and requires additional adjustments to the original tracker. In our algorithm, however, the contribution of the constant and distinct frequency partials to the distance function helps the tracker distinguish between the corresponding power partials in the crossing region, with no additional adjustments.



D. Missing Peaks

Due to imperfections in estimating the spectrum, and because partials with low power can get buried in noise, we may face the problem of missing peaks, which can cause discontinuities in parts of a partial. To overcome this problem, it is proposed in [2] to add "zombie" states to the end of a track when no peak is found within the acceptance gate. In our algorithm we update the track with the estimated states in such situations, and continue this process for a maximum of three frames. If during these attempts no peak falls into the acceptance gate, we consider the track dead and remove the fake updates from it. If we do find a peak during this process, the track is updated with this peak and we keep the fake updates, or zombies. We also have the option of interpolating the missing peaks using the newly found peak, as proposed in [4]; however, this should only be done when the track contains enough points for the interpolation to be accurate.
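The zombie bookkeeping above can be sketched as follows; a hypothetical, simplified state machine (the names `step_track`, `coast`, and `zombies` are illustrative, not from the paper):

```python
MAX_COAST = 3  # frames a track may survive on predicted ("zombie") updates

def step_track(track, peak_found):
    """Coast on predicted updates for up to MAX_COAST frames; a real peak
    confirms the zombies, otherwise the track dies and the fake updates
    are dropped. `track` is a dict with a 'coast' count and 'zombies' list."""
    if peak_found:
        track['coast'] = 0      # a real peak arrived: keep the zombies
        track['zombies'] = []   # they are now confirmed track points
        return 'alive'
    track['coast'] += 1
    track['zombies'].append('predicted-state')
    if track['coast'] > MAX_COAST:
        track['zombies'] = []   # remove the fake updates from the track
        return 'dead'
    return 'coasting'
```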

E. Backward Tracking

To improve the accuracy of our algorithm we can perform backward tracking at the end of each track. When a track is terminated, we initiate a backward tracker with the last updated states and error covariance matrix. This process is identical to forward tracking but in the reverse direction. It is helpful because the forward tracker is loosely initiated with the noisy observations for power and frequency and zero values for the other states, while the backward tracker is initiated more accurately. The backward tracker is also capable of recovering discontinuities in the forward tracking results, since it has the support of a more accurate initiation and a longer history of observation updates.

V. RESULTS

We examined the accuracy of our algorithm by performing the proposed partial tracking on a wide range of instrumental sounds from different classes of melodic instruments, as well as on fictitious signals. The tracking results were compared with those of the method proposed in [5]. For this comparison we first defined two accuracy factors:

R_dt = (n_dt / n_et) × 100,    R_ft = (n_ft / n_et) × 100    (13)

where R_dt is the detection rate, R_ft is the false rate, n_dt is the number of detected tracks, n_ft is the number of false tracks, and n_et is the number of expected tracks. The number of expected tracks was obtained by counting the real peaks that appeared to belong to genuine tracks in different time frames of the test signals. We computed these factors for 32 musical notes from all classes of melodic instruments. The same sets of peaks from our peak detection process were fed into the tracker of [5]. Table 1 contains the accuracy factors for the two trackers.

The superior performance of our method is mostly due to the adaptive acceptance gate and the instrument-specific models: the acceptance gate in [5] is fixed, and the same evolution model is used for all classes of instruments.

To test the performance of our tracking system in the presence of crossing partials, we used fictitious sound signals containing two music notes, one with constant and the other with linearly decaying power partials. We produced these notes by adding white noise to predefined frequency and power partials and using (2) to make the pure music signals. The noisy signals were then produced by adding another noise, as modeled in (1). Finally, we added the two signals to yield the fictitious sound. The tracking result for these crossing power partials is shown in figure 3.

We also tested our tracker in the presence of vibrato, defined as small oscillations of about 4 Hz in the frequency tracks; this is one of the performance effects caused by the performer or singer. The tracking result for five frequency partials of a signal with vibrato is presented in figure 4. As can be seen, the sinusoidal variations of the frequency tracks become larger for higher frequency partials. Our tracker is able to cope with all these variations by producing larger estimates at higher frequencies, as can be seen by comparing the 24th and 25th estimated samples (dots) in the lowest and highest partials.

The performance of the backward tracker is shown in figure 5. The forward track is discontinued from frame 27 to 31, but our backward tracker, which is initiated with the estimated states at the end of the forward track (frame 55), is able to recover the missing points of the partial.

A. Limitation of the Kalman Filter

For the Kalman filter to work properly, the system model must be accurate. In practical applications, where the parameters of the evolution model are not guaranteed to be accurate enough, the performance of the Kalman filter can be poor [11]. Our tracker is not exempt from this limitation, since we obtain the parameters of our model through a statistical analysis of a large database of music signals, averaging over varying estimates. With the same model parameters for different instruments in one class, we ended up with more false tracks when dealing with smoother partials, and more missing tracks when dealing with partials with larger variations.

In addition to the inaccuracy of these parameters, a large amount of effort is needed for their estimation. The future focus of the authors is on finding alternative filters that preserve the good properties of the conventional Kalman tracker but do not suffer from its limitations.

Table 1: Accuracy rates of partial tracking

                  R_dt    R_ft
Our method        98.2    18.2
Method of [5]     84.7    27.4



VI. CONCLUSION

We introduced several areas of music signal analysis and discussed the application of the Kalman filter to partial tracking of these signals. We also investigated the advantages of this filter compared with other methodologies in the field of partial tracking for music signals. The limitations inherent in the conventional Kalman filter can degrade the performance of our tracker, as is the case for other practical applications with estimated models. These constraints can be overcome by using modified versions of the filter that are less sensitive to inaccurate model parameters and can guarantee reasonable performance.

REFERENCES

[1] S. J. Godsill and P. J. W. Rayner, Digital Audio Restoration: A Statistical Model Based Approach. London; New York: Springer, 1998.

[2] X. Serra, "Musical sound modeling with sinusoids plus noise," in Musical Signal Processing. Exton, PA: Swets & Zeitlinger, 1997, pp. 91-122.

[3] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 34, pp. 744-754, 1986.

[4] M. Lagrange, S. Marchand, and J. B. Rault, "Using linear prediction to enhance the tracking of partials," presented at ICASSP '04, Montreal, Canada, 2004.

[5] A. Sterian, "Model-Based Segmentation of Time-Frequency Images for Musical Transcription," Electrical Engineering: Systems, University of Michigan, Ann Arbor, 1999.

[6] W. Roguet, N. Martin, and A. Chehikian, "Tracking of frequency in a time-frequency representation," presented at the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis, 1996.

[7] S. S. Blackman, Multiple-Target Tracking with Radar Applications. Norwood: Artech House, 1986.

[8] H. Satar-Boroujeni and B. Shafai, "Peak tracking and partial formation of music signals," accepted for the 13th European Signal Processing Conference, Antalya, Turkey, 2005.

[9] H. Satar-Boroujeni and B. Shafai, "State-space modeling and analysis for partial tracking of music signals," presented at the 24th IASTED-MIC, Innsbruck, Austria, 2005.

[10] S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems. Boston: Artech House, 1999.

[11] Y. Theodor, U. Shaked, and C. E. de Souza, "A game theory approach to robust discrete-time H-infinity estimation," IEEE Trans. Signal Processing, pp. 1486-1495, 1994.

Figure 3: Crossing power partials.

Figure 4: Tracking results for frequency partials containing vibrato (solid), along with estimated values (dots).

Figure 5: Forward and backward tracking: discontinued partial in the forward tracking (solid line) and its estimate (squares), along with the backward track (circles) and its estimates (dots).
