PDOA BASED UNDERDETERMINED BLIND SOURCE SEPARATION USING TWO MICROPHONES
Avram Levi
Avaya Labs, Basking Ridge, NJ
ABSTRACT
This paper builds on the idea of frequency bin-wise separation of mixed speech signals by Gaussian Mixture Model fitting, using the Expectation-Maximization algorithm on Phase-Difference-of-Arrival values between two microphones. We find that, using a combination of pre-processing steps and error-detecting post-processing, separation performance exceeding the state of the art is achieved on publicly available live recording data.
Index Terms: BSS, underdetermined, time-frequency masking
1. INTRODUCTION
Consider two microphones in a room environment, where x_j(t) denotes the time-signal recorded by microphone j (t is the time-index). There are N point sources in the room, emitting signals denoted by s_i(t) (i = 1..N), and some general background noise n(t). x_j(t) can be modelled as an additive mixture of the images of these signals at microphone j:

x_j(t) = sum_{i=1}^{N} s_ij(t) + n_j(t)    (1)

where s_ij(t) and n_j(t) denote the image of source i's signal and the background noise at microphone j, respectively. s_ij(t) is decomposed into the convolution of s_i(t) and the room impulse response h_ij(t), summing the effects of all paths (direct and indirect) sound has to travel from source i to microphone j:

s_ij(t) = sum_l h_ij(l) s_i(t - l)    (2)

where l is also a time-index.
The task of Blind Source Separation (BSS) [1] is to separate the recorded mixtures x_j(t) into accurate approximations of s_i(t) using very little or no prior information about the sources. For this particular development, we assume that only the number of sources, N, is known and that s_ij(t) is a reasonable approximation of s_i(t).
If our system were determined, i.e. if we had at least as many microphones as sources N, then linear filters estimated using Independent Component Analysis (ICA) could be used to effectively separate the mixtures [2]. In this work we consider cases where N > 2, i.e. under-determined systems, where the ICA procedure has to be augmented [3] with other modes of information.
Another approach to solving the under-determined case exploits source sparseness [4]. Taking the short-time Fourier transform of Equation 1,

X_j(t, f) = sum_{i=1}^{N} S_ij(t, f) + N_j(t, f)    (3)
it has been effectively shown that, for a reasonable number of sources in the signal, any time-frequency point (tf) with signal above the background noise level, |X_j(t, f)| >> |N_j(t, f)|, is dominated by a single source k only [5]:

X_j(t, f) ≈ S_kj(t, f)    (4)

where k is the active source at time-frequency point tf. This, in effect, transforms the more general source separation problem into the problem of finding the dominant source of each tf point. Once the tf points corresponding to each source are identified, binary time-frequency masks can be created and applied to X_j(t, f), as detailed in Section II-C of [6].
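As an illustration of this masking step, a minimal NumPy sketch follows. The array names and the helper function `apply_binary_masks` are illustrative placeholders, not the paper's implementation:

```python
import numpy as np

def apply_binary_masks(X, labels, n_sources):
    """Build one masked STFT per source by keeping only the tf points
    whose dominant-source label matches that source.
    X      : complex mixture STFT, shape (freqs, frames)
    labels : int array, same shape; labels[f, t] = dominant source index
    Returns a list of n_sources masked STFTs."""
    return [np.where(labels == k, X, 0.0) for k in range(n_sources)]

# Toy 2x3 STFT with a hand-made dominant-source label map
X = np.array([[1 + 1j, 2 + 0j, 0 + 3j],
              [4 + 0j, 0 + 1j, 2 + 2j]])
labels = np.array([[0, 1, 1],
                   [0, 0, 1]])
S0, S1 = apply_binary_masks(X, labels, 2)
```

Each separated signal is then obtained by an inverse STFT of its masked spectrogram.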
In [7], the authors explore various features that may be extracted from X_j(t, f) to identify the dominant source of each tf point. The idea is that feature values extracted from time-frequency points dominated by different sources will form separate clusters in the feature space. Once these separate clusters are identified, tf points may be properly assigned to their respective sources. Since there is no prior information about the whereabouts of the clusters, unsupervised clustering methods are applied to the extracted feature values. Two popular methods in this context are k-means clustering [7] and Gaussian Mixture Model (GMM) fitting using the Expectation-Maximization (EM) algorithm [8][9]. These methods assume that the number of sources in the mixtures is known [6]. However, recent work has shown that when a Dirichlet prior is imposed on the GMM mixture weights (for the second approach), this assumption may not be necessary [10].
One specific feature employed in recent work is the phase-difference of arrival (PDOA) between pairs of microphones [6][10]. There are three difficulties encountered when attempting to use PDOA values as a feature. The first is the spatial aliasing problem, which occurs when the distances between pairs of microphones are large; this problem is explained in detail in [10]. The second difficulty is the problem of permutation ambiguity. PDOA values are frequency dependent, therefore clustering has to be done separately for each frequency bin. This causes ambiguity between cluster labels at different frequencies, i.e. clusters from the same talker at different frequencies are labelled differently, so an explicit procedure to align the clusters over the full band is necessary. Examples of such procedures are detailed in [11][6]. Finally, the third difficulty is that tf points with high reverberant signal content, or only background noise, manifest themselves as noise in the PDOA values. This corrupts the unsupervised clustering procedures, causing errors in the identification of the dominant source of tf points.
In this work, we present a full separation algorithm for the two-microphone BSS problem that aims to make contributions to the second and third difficulties. The first part of the algorithm employs a frequency bin-wise clustering of PDOA values using GMM fitting via the EM algorithm. However, instead of using all possible time-frequency points, the algorithm pre-selects points where the effects of background noise and reverberation are minimal, addressing the third difficulty. The second part introduces a simple, iterative algorithm for simultaneously solving the permutation ambiguity problem and finding the time-difference-of-arrival (TDOA) values corresponding to each individual source. The last part of the algorithm is a rerun of the first part, but with better initialization of the GMMs using the TDOA information extracted in the second part. In addition, this part involves a procedure that tests whether clustering via GMMs in each frequency was successful. We present the separation performance of the algorithm on the publicly available SiSEC development database [12] and compare our results to two of the highest-ranking algorithms in the latest SiSEC results [13][3].
2. ALGORITHM
Our algorithm consists of three stages. The first stage, similar to related work [6], is a frequency bin-wise clustering of the tf points using GMM fitting via the EM algorithm with a random initialization of the means. However, contrary to related work, we only use a subset of all possible tf points to determine the GMM. The second stage uses the found GMM means and variances at every frequency to identify the TDOA values for each source. Once these are found, at stage three the first stage is repeated, but with the initial means derived from the found TDOA values for each source. This ensures that the clusters from different frequencies properly match each other at the end of the stage, effectively solving the permutation ambiguity of the first stage. In addition, some post-processing is done to check whether the found GMMs at each frequency properly separate the points; in the negative case we remove that frequency from consideration and attenuate it completely.

Fig. 1. Basic flowchart of the algorithm.
2.1. Stage 1: Initial Clustering
The PDOA value between the two microphones is equal to

φ(t, f) = ∠ X_1(t, f) X_2*(t, f)    (5)
If no source signal is dominant at a tf point, we assume φ(t, f) to be random with some unknown distribution. On the other hand, for a tf point dominated by source k, assuming that all sources are far-field and an anechoic room environment, Equation 5 becomes

φ(t, f) = 2πf τ_k(t, f)    (6)

where τ_k(t, f) is the TDOA value corresponding to source k. In a real environment, unfortunately, the anechoic model does not hold. This introduces an error term to φ(t, f), which we can model as

φ(t, f) = γ 2πf τ_k(t, f) + (1 − γ) φ_E(t, f)    (7)

where 0 ≤ γ ≤ 1 and φ_E(t, f) is an angular variable with an unknown distribution modelling the contribution of the reverberant signal to φ(t, f). For the anechoic model, γ = 1. As more reverberant signal is present in the tf point, γ decreases, tilting the φ(t, f) value towards the error-prone φ_E(t, f).
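The PDOA feature of Equation 5 and its anechoic prediction from Equation 6 can be sketched in a few lines of NumPy. The delay value and frequencies below are illustrative only:

```python
import numpy as np

def pdoa(X1, X2):
    """PDOA between two microphones (Eq. 5): the angle of
    X1(t, f) * conj(X2(t, f)) at every time-frequency point."""
    return np.angle(X1 * np.conj(X2))

# Sanity check against the anechoic model (Eq. 6): a pure delay of
# tau seconds between the microphones yields phi = 2*pi*f*tau,
# wrapped to (-pi, pi].
tau = 1.0e-4                                # hypothetical TDOA, seconds
f = np.array([250.0, 500.0])                # two frequency bins, Hz
X2 = np.ones(2, dtype=complex)
X1 = X2 * np.exp(1j * 2 * np.pi * f * tau)  # delayed copy of X2
phi = pdoa(X1, X2)
```

For these frequencies the products 2πfτ stay inside (−π, π], so no wrapping occurs; at higher frequencies the wrap is exactly the spatial aliasing problem discussed above.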
Fig. 2. Power (dB) vs. time plot at 500 Hz of a 1.5 second long portion of a three-talker mixture. The tf points in the blue parts are considered for GMM fitting; the red portions are out because they are the later parts of utterances. The black portions are thrown out because they are deemed background noise.
In general, common practice is to consider all possible φ(t, f) for the GMM fitting, including the ones that are not dominated by any source. We observe that this may be error prone because 1) data from tf points with no dominant source introduces PDOA values due to background noise, which is likely to corrupt the clustering procedures, and 2) many of the tf points with a dominant source may have a high content of reverberant signal (i.e. a lower γ value), introducing noisy PDOA values. To identify a set of tf points that are dominated by a source and have high γ, we look for points that have 1) a sufficient signal-to-background-noise ratio and 2) a high direct-to-reverberant signal ratio. For the first criterion, we run a background noise estimation algorithm based on the minimum statistics method [14] and select points with higher signal power than the background noise, based on the procedure described in [15]. For the second criterion, we introduce a novel procedure. Given that reverberant signals take more time to reach the microphones than the initial direct signal, one can assume that the earlier parts of utterances, where signal power is increasing, contain more direct signal, whereas the later parts, where signal power is decreasing, contain a lot more reverberant signal. To select the earlier parts of utterances, we compare the power content of the tf point with the average power in the past several frames at the same frequency: we select a point if, for Δt time-frames in the STFT,

|X(t, f)|² > (λ / Δt) sum_{t'=t−Δt}^{t−1} |X(t', f)|²    (8)

where t' is the time parameter. The set of points that satisfy both criteria at each frequency f is defined as Ω_f.
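The onset test of Equation 8 can be sketched as follows; the function name and the toy spectrogram are illustrative, not the paper's code:

```python
import numpy as np

def select_onset_points(P, dt=4, lam=1.0):
    """Onset-based point selection in the spirit of Eq. (8): keep a tf
    point if its power exceeds lam times the average power of the
    preceding dt frames in the same frequency bin.
    P : power spectrogram |X(t, f)|^2, shape (frames, freqs).
    Returns a boolean mask; the first dt frames have no history and
    are left unselected."""
    T = P.shape[0]
    mask = np.zeros(P.shape, dtype=bool)
    for t in range(dt, T):
        mask[t] = P[t] > lam * P[t - dt:t].mean(axis=0)
    return mask

# One frequency bin: the power jump at frame 4 (an onset) is selected,
# the decaying frame after it is not
P = np.array([[1.0], [1.0], [1.0], [1.0], [10.0], [2.0]])
mask = select_onset_points(P, dt=4)
```

In practice this mask would be intersected with the signal-to-background-noise test from the minimum statistics estimate to form Ω_f.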
The GMM for the φ(t, f) in Ω_f may be written as

p(φ) = sum_{i=1}^{N} α_i^f p(φ | μ_i^f, σ_i^f)    (9)

where α_i^f is the mixture weight for cluster i at frequency f, with the constraint sum_i α_i^f = 1, and p(φ | μ_i^f, σ_i^f) is the wrapped angular normal distribution [16] for cluster i, where μ_i^f is the mean and σ_i^f is the variance. Note that we drop the (t, f) notation after φ for brevity. Starting with randomly initialized values, we update the parameters using the following equations:
α_i^f ← (1 / |Ω_f|) sum_{φ ∈ Ω_f} r_i(φ)    (10)

μ_i^f ← sum_{φ ∈ Ω_f} r_i(φ) φ / sum_{φ ∈ Ω_f} r_i(φ)    (11)

(σ_i^f)² ← sum_{φ ∈ Ω_f} r_i(φ) (φ − μ_i^f)² / sum_{φ ∈ Ω_f} r_i(φ)    (12)

where r_i(φ) = α_i^f p(φ | μ_i^f, σ_i^f) / sum_{j=1}^{N} α_j^f p(φ | μ_j^f, σ_j^f) is the responsibility of cluster i for the point φ.
The parameters are updated until convergence or up to a pre-defined maximum number of iterations. At the end, for each frequency, we have N clusters, with cluster labels at each frequency denoted by i_f. Note that the cluster labels have permutation ambiguity across frequencies.
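A minimal sketch of this bin-wise EM fit is given below. It approximates the wrapped normal density by summing a few wrapped copies of the ordinary normal, and uses a plain (non-circular) weighted mean; both choices are simplifying assumptions, not the paper's exact implementation:

```python
import numpy as np

def wrapped_normal_pdf(phi, mu, sigma, n_wraps=2):
    """Wrapped normal density, approximated by summing 2*n_wraps+1
    wrapped copies of the ordinary normal density."""
    k = np.arange(-n_wraps, n_wraps + 1)
    d = phi[..., None] - mu + 2 * np.pi * k
    return np.exp(-0.5 * (d / sigma) ** 2).sum(-1) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_angles(phi, mu, sigma, alpha, n_iter=50):
    """EM updates in the spirit of Eqs. (10)-(12) for an N-component
    wrapped-normal mixture over the PDOA values of one frequency bin.
    phi is a 1-D array; mu, sigma, alpha are length-N initial values."""
    mu, sigma, alpha = mu.astype(float), sigma.astype(float), alpha.astype(float)
    for _ in range(n_iter):
        # E-step: responsibility r[n, i] of cluster i for point n
        p = np.stack([a * wrapped_normal_pdf(phi, m, s)
                      for a, m, s in zip(alpha, mu, sigma)], axis=1)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: weight, mean, and spread updates. The plain weighted
        # mean is adequate while clusters sit well inside (-pi, pi);
        # a fully circular mean would be more robust near the wrap.
        alpha = r.mean(axis=0)
        mu = (r * phi[:, None]).sum(axis=0) / r.sum(axis=0)
        sigma = np.sqrt((r * (phi[:, None] - mu) ** 2).sum(axis=0) / r.sum(axis=0))
    return mu, sigma, alpha

# Illustrative data: two angular clusters near -1 and +1 rad
rng = np.random.default_rng(0)
phi = np.concatenate([rng.normal(-1.0, 0.05, 200),
                      rng.normal(1.0, 0.05, 200)])
mu, sigma, alpha = em_gmm_angles(phi, mu=np.array([-0.5, 0.5]),
                                 sigma=np.array([0.3, 0.3]),
                                 alpha=np.array([0.5, 0.5]))
```

With well-separated clusters the estimated means settle near the true cluster centres within a few iterations.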
2.2. Stage 2: Identifying TDOA values
In this stage, we describe an iterative procedure to obtain the TDOA value for each source, using the clusters identified in the previous stage. We repeat the following procedure for each source k:

1. Pick a random TDOA value, τ̂.

2. Calculate the anechoic PDOA values φ̂_f corresponding to τ̂ at every frequency, using Equation 6.

3. Do until τ̂ converges:

(a) For every f, pick the Gaussian that yields the highest probability for the chosen φ̂_f: N(μ_{i_f}, σ_{i_f}).

(b) Find the τ̂ value for which sum_f |φ̂_f − μ_{i_f}|² is at a minimum.

4. Store the converged τ̂ value as τ_k, the TDOA for source k.

5. Remove the chosen Gaussians at every f.

6. Repeat for the next source.

A result of this procedure is shown in Figure 3 for a four-talker mixture. The means of the Gaussians from the same talker have different labels at different frequencies because of the bin-wise clustering. The procedure finds the correct TDOA values, as demonstrated in the plot.

Fig. 3. The circle markers indicate the found GMM means at each frequency for a four-talker mixture. Each of the four colors represents a different label i. The black lines indicate the φ̂_f from the found τ_k values.
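The matching step at the heart of this procedure can be sketched with a direct grid search that, for every candidate TDOA, picks the nearest GMM mean per frequency and scores the fit. This collapses the alternation described above into one pass and is a simplified stand-in, not the paper's iterative implementation; the example values are hypothetical:

```python
import numpy as np

def estimate_tdoa(freqs, means, tau_grid):
    """Grid-search variant of steps 3(a)-(b): for every candidate TDOA,
    predict the anechoic PDOA per frequency (Eq. 6), pick the nearest
    GMM mean, and keep the tau with the smallest summed squared
    angular distance."""
    wrap = lambda a: np.angle(np.exp(1j * a))       # wrap to (-pi, pi]
    best = (None, np.inf, None)
    for t in tau_grid:
        phi_hat = wrap(2 * np.pi * freqs * t)
        d = np.abs(wrap(means - phi_hat[:, None]))  # (F, N) distances
        cost = (d.min(axis=1) ** 2).sum()
        if cost < best[1]:
            best = (t, cost, d.argmin(axis=1))
    return best[0], best[2]

# Hypothetical example: true TDOA 1e-4 s; per frequency, one mean sits
# at the true anechoic PDOA and a decoy sits 1.5 rad away.
freqs = np.array([500.0, 1000.0, 1500.0, 2000.0])
true_tau = 1.0e-4
phi_true = 2 * np.pi * freqs * true_tau
means = np.stack([phi_true, phi_true + 1.5], axis=1)
tau_grid = np.linspace(-2e-4, 2e-4, 81)             # 5e-6 s resolution
tau, picked = estimate_tdoa(freqs, means, tau_grid)
```

The decoy means are a constant angular offset, not proportional to frequency, so no candidate TDOA can explain them across the band; only the true delay yields near-zero cost.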
2.3. Stage 3: Final Clustering and Post Processing
In this stage, we run the EM algorithm for each frequency again, as described in Stage 1. This time, the means are initialized using the φ̂_f values from the found τ_k's, keeping the cluster labelling consistent across frequencies. This is effective in two ways: 1) having the initial means in the vicinity of the eventual means makes the EM algorithm more likely to find the true maximum of the likelihood function, and 2) the found cluster labels have no permutation ambiguity. In Figure 4, one may clearly observe that the means of the found Gaussians now have no permutation ambiguity and are much closer to the φ̂_f values derived from the τ_k values.
In some cases, specifically at lower frequencies where the anechoic φ values are very close, the EM algorithm fails to separate the points; instead, it tends to end up with two or more Gaussians covering the same area. One of these Gaussians typically ends up with a higher weight, claiming all the points. We detect these cases by checking the following condition: for each k, if

(13)

then we attenuate all points in frequency f.
Fig. 4. The found GMM means at the end of Stage 3. Each of the four colors represents a different label k. The black lines indicate the φ̂_f from the found τ_k values.
Finally, we calculate the probability of every time-frequency point under each of the Gaussians in the mixture and assign each point to the source whose Gaussian yields the highest probability:

k(t, f) = argmax_i α_i^f p(φ(t, f) | μ_i^f, σ_i^f)    (14)
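This final assignment is a per-point argmax over the weighted component densities, which can be sketched as follows. The function name and parameter values are illustrative:

```python
import numpy as np

def assign_sources(phi, mu, sigma, alpha):
    """Assign every point to the source whose weighted Gaussian density
    of its PDOA value is highest (the argmax behind the final masks).
    phi : PDOA values for one frequency bin, shape (T,)
    mu, sigma, alpha : per-source GMM parameters, each length N."""
    # log(alpha_i) + log N(phi | mu_i, sigma_i), dropping the shared
    # 1/sqrt(2*pi) constant, which does not affect the argmax
    ll = (np.log(alpha) - np.log(sigma)
          - 0.5 * ((phi[:, None] - mu) / sigma) ** 2)
    return ll.argmax(axis=1)

phi = np.array([-0.9, 0.8, -1.2])
labels = assign_sources(phi,
                        mu=np.array([-1.0, 1.0]),
                        sigma=np.array([0.2, 0.2]),
                        alpha=np.array([0.5, 0.5]))
```

Working in the log domain avoids underflow for points far from every mean and leaves the argmax unchanged.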
3. EXPERIMENTS AND RESULTS
We tested our algorithm on live recordings from the SiSEC dataset [12]. Specifically, we used the live recordings made by two microphones with a spacing of 5 cm in the Dev1 dataset. These particular recordings consist of three- or four-source mixtures taken in two different reverberant conditions, with T60 values of 130 ms and 250 ms.
The sampling rate of the recordings was 16 kHz. We applied a single-tap pre-emphasis filter with a coefficient of 0.75 (i.e. x[t] = x[t] − 0.75x[t−1]). The STFT was performed on 32 ms, Hamming-windowed frames with 75% overlap.

We labelled a tf point as background noise (for Section 2.1) if 1) its power was less than 4 times the noise power calculated by the minimum statistics algorithm, or 2) its power was more than 40 dB below the maximum power in the whole STFT. The Δt value for the procedure in the same section was set to the past 4 time-frames, equivalent to 32 ms.
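The pre-emphasis step is a one-line filter; a sketch follows (the handling of the first sample is one common convention, as the paper does not state it):

```python
import numpy as np

def pre_emphasis(x, a=0.75):
    """Single-tap pre-emphasis, y[t] = x[t] - a * x[t-1], with the
    first sample passed through unchanged (an assumed convention)."""
    y = x.astype(float).copy()
    y[1:] -= a * x[:-1]
    return y

y = pre_emphasis(np.array([1.0, 2.0, 3.0]))
```

Pre-emphasis flattens the spectral tilt of speech, giving the higher frequency bins more weight in the subsequent clustering.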
For the iterative procedure in Section 2.2, we sampled all possible τ_k values at a resolution of 0.01 samples in the end-fire region; for this particular setup this amounted to [−2.3 : 0.01 : 2.3] samples. The iterative procedure was applied until the selection of points did not change from one iteration to the next.
Similarly, in the application of the EM algorithm in Sections 2.1 and 2.3, we continued the iterations until the set of points that belonged to each Gaussian did not change from one iteration to the next. In Section 2.1, the initialization of the means was done semi-randomly. The first mean was chosen at random with a uniform distribution among the points; the remaining means were chosen again at random, but with a probability distribution proportional to the points' distances to the means already selected. This ensured that the mean values were not very close to each other. In Section 2.3, the means were pre-chosen according to the found TDOA values.

Fig. 5. The top figure shows the time-series of a mixed recording. The middle figure is the time-series of one of the talkers separated from the mixture using the presented method. The bottom figure shows the true signal in the mixture. All signals are pre-emphasized.
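The semi-random initialization described above is in the spirit of k-means++-style seeding; a sketch follows, with the function name and the distance-proportional weighting as assumptions (the sketch also uses plain absolute distance, whereas angular data would call for a circular distance):

```python
import numpy as np

def seed_means(points, n_means, rng):
    """Semi-random mean initialization: the first mean is drawn
    uniformly from the points; each later mean is drawn with
    probability proportional to a point's distance to its nearest
    already-chosen mean, so the initial means tend to be far apart."""
    means = [points[rng.integers(len(points))]]
    for _ in range(n_means - 1):
        d = np.min(np.abs(points[:, None] - np.array(means)), axis=1)
        means.append(points[rng.choice(len(points), p=d / d.sum())])
    return np.array(means)

# With three points at 0 and one at 5, the two seeds are forced apart:
# whichever point is drawn first, the weighting makes the other cluster
# certain to supply the second seed.
pts = np.array([0.0, 0.0, 0.0, 5.0])
m = seed_means(pts, 2, np.random.default_rng(0))
```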
In all EM runs, the standard deviation was initialized at π/30 radians and bounded above by π/12 radians.
The application of the masks was fairly straightforward: the tf points to be excluded were attenuated in signal power by 30 dB.
3.1. Evaluation
We evaluated the algorithm using the measures in the BSS Eval toolbox provided with the SiSEC database. We specifically used the SDR and SIR measures [10], looking at the target signal distortion and the interfering signal attenuation compared to the original source signals.
In Table 1, we present our results in comparison to the results published in the SiSEC database for the methods of [13] (ISO 2011) and [3] (NESTA 2012). From the SIR point of view, our method clearly improves on the published results. From an SDR perspective, there is some improvement over ISO 2011 in both low- and high-reverberation recordings. The method marks an improvement over NESTA 2012 in the low-reverberation situation. At high reverberation, NESTA 2012 is robust in three-talker scenarios but degrades significantly in four-talker cases. In both situations, our method on average does somewhat better.

Table 1. SDR and SIR (dB) for the proposed method, ISO 2011, and NESTA 2012.

                      Proposed      ISO 2011      NESTA 2012
T60 (ms)   N        SDR    SIR    SDR    SIR    SDR    SIR
130        3        7.2    17.1   5.9    9.4    5.2    9.2
130        4        5.5    13.5   -      -      4.7    8.3
250        3        5.9    13.2   4.7    7.8    5.5    9.0
250        4        3.5    9.9    -      -      2.9    5.2
In Figure 5, we present an example of the input and output signals for a mixture of three male talkers at a reverberation time of 130 ms. As seen from the time-plots, the interference in this case is mostly gone and inaudible in the signal. Our subjective evaluation concludes that the low-frequency information is mostly missing from the output, which can be vaguely seen in the time-plots. This is mostly due to the post-processing step. In particular, the EM algorithm is unable to separate the φ(t, f) values of different talkers at low frequencies (or any other frequency where spatial aliasing occurs), simply because the true means are too close to each other. This causes two or more talkers to be represented by the same Gaussian. Once this is detected, the whole frequency is attenuated. Both by objective measures and by subjective evaluation, we can confirm that removing all the interference at these unseparable (at least by PDOA values) frequencies at the expense of the source signal is better than keeping the source signal along with the interference.
4. CONCLUSION
In this work, we presented an underdetermined blind source separation algorithm for two-microphones. This algorithm generates binary time-frequency masks for each talker using only the PDOA values as clues/features. We present the performance of the algorithm by comparing separation perfor
mance to other published work using objective measures. Future work will include generalizing this algorithm to three and four microphone as well as improving low-frequency performance.
5. REFERENCES
[1] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434-444, 1997.

[2] A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications," Neural Networks, vol. 13, no. 4, pp. 411-430, 2000.

[3] F. Nesta and M. Omologo, "Convolutive underdetermined source separation through weighted interleaved ICA and spatio-temporal source correlation," Latent Variable Analysis and Signal Separation, pp. 222-230, 2012.

[4] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830-1847, 2004.

[5] N. Roman and D. L. Wang, "Binaural sound segregation for multisource reverberant environments," in Proc. IEEE ICASSP, 2004, vol. 2, pp. ii-373.

[6] H. Sawada, S. Araki, and S. Makino, "Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516-527, 2011.

[7] S. Araki, H. Sawada, R. Mukai, and S. Makino, "Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors," Signal Processing, vol. 87, no. 8, pp. 1833-1847, 2007.

[8] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.

[9] M. I. Mandel, D. P. W. Ellis, and T. Jebara, "An EM algorithm for localizing multiple sound sources in reverberant environments," in Advances in Neural Information Processing Systems 19, MIT Press, 2007, pp. 953-960.

[10] S. Araki, T. Nakatani, H. Sawada, and S. Makino, "Blind sparse source separation for unknown number of sources using Gaussian mixture model fitting with Dirichlet prior," in Proc. IEEE ICASSP, 2009, pp. 33-36.

[11] H. Sawada, R. Mukai, S. Araki, and S. Makino, "A robust and precise method for solving the permutation problem of frequency-domain blind source separation," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 530-538, 2004.

[12] S. Araki, F. Nesta, E. Vincent, Z. Koldovsky, G. Nolte, A. Ziehe, and A. Benichoux, "The 2011 signal separation evaluation campaign (SiSEC2011): audio source separation," Latent Variable Analysis and Signal Separation, pp. 414-422, 2012.

[13] K. Iso, S. Araki, S. Makino, T. Nakatani, H. Sawada, T. Yamada, and A. Nakamura, "Blind source separation of mixed speech in a high reverberation environment," in Proc. Hands-free Speech Communication and Microphone Arrays (HSCMA), 2011, pp. 36-39.

[14] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504-512, 2001.

[15] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE ICASSP, 1979, vol. 4, pp. 208-211.

[16] S. Araki, T. Nakatani, H. Sawada, and S. Makino, "Stereo source separation and source counting with MAP estimation with Dirichlet prior considering spatial aliasing problem," Independent Component Analysis and Signal Separation, pp. 742-750, 2009.