


Blind Separation of Speech Mixtures via

Time-Frequency Masking

Özgür Yılmaz and Scott Rickard

Abstract

Binary time-frequency masks are powerful tools for the separation of sources from a single mixture. Perfect demixing via binary time-frequency masks is possible provided the time-frequency representations of the sources do not overlap, a condition we call W-disjoint orthogonality. We introduce here the concept of approximate W-disjoint orthogonality and present experimental results demonstrating the level of approximate W-disjoint orthogonality of speech in mixtures of various orders. The results demonstrate that ideal binary time-frequency masks exist which can separate several speech signals from one mixture. While determining these masks blindly from just one mixture is an open problem, we show that we can approximate the ideal masks in the case where two anechoic mixtures are provided. Motivated by the maximum likelihood mixing parameter estimators, we define a power weighted two-dimensional histogram constructed from the ratio of the time-frequency representations of the mixtures which is shown to have one peak for each source, with peak location corresponding to the relative attenuation and delay mixing parameters. The histogram is used to create time-frequency masks which partition one of the mixtures into the original sources. Experimental results on speech mixtures verify the technique. Example demixing results can be found online: http://www.alum.mit.edu/www/rickard/bss.html

I. INTRODUCTION

The goal in blind source separation (BSS) is to determine the original sources given mixtures of those sources.

When the number of sources is greater than the number of mixtures, the problem is degenerate in that traditional

Özgür Yılmaz ([email protected]) is with the Department of Mathematics, University of Maryland.

Scott Rickard ([email protected]) is with the Department of Electronic and Electrical Engineering, University College Dublin, Ireland.

This work was partially funded by Siemens Corporate Research, Princeton, NJ.

Submitted to IEEE Transactions on Signal Processing, November 4, 2002. Revised March 27, 2003. Second revision July 24, 2003.


matrix inversion demixing cannot be applied. However, when a representation of the sources exists such that the

sources have disjoint support in that representation, it is possible to partition the support of the mixtures and obtain the

original sources. One solution to the problem of degenerate demixing is thus to (1) determine an appropriate disjoint

representation of the sources and (2) determine the partitions in this representation which demix. In this paper, we

show that the Gabor expansion (i.e., the discrete short-time (or windowed) Fourier transform) is a good representation

for demixing speech mixtures. Specifically, we show that partitions of the time-frequency lattice exist that can demix

mixtures of several speech signals from one mixture. Determining the partition blindly from one mixture is an open

problem, but, given a second mixture, we describe a method for partitioning the time-frequency lattice which separates

the sources.

Formally, let $S$ be the family of signals of interest. Typically $S$ will be some collection of square integrable bandlimited functions. Suppose there exists some linear transformation $T : S \to T(S)$ (where $T$ maps the set $S$ to another family of functions) with the following properties:

(i) $T$ is invertible on $S$ (i.e., $T^{-1}(T(s)) = s$, $\forall s \in S$).

(ii) $\Lambda_i \cap \Lambda_j = \emptyset$ for $i \neq j$, where $\Lambda_i$ is the support of $T s_i$, i.e., $\Lambda_i := \operatorname{supp} T s_i = \{x : (T s_i)(x) \neq 0\}$.

For example, we can consider the case where $S$ is a collection of square integrable functions with mutually disjoint supports in the Fourier domain; any two functions $s_i$ and $s_j$ in $S$ satisfy $\hat{s}_i(\omega)\, \hat{s}_j(\omega) = 0$ for all $\omega$, where $\hat{s}_i$ denotes the Fourier transform of $s_i$. Then if we define $T$ on $S$ as $T s := \hat{s}$, it is clear that $T$ satisfies (i) and (ii).

For any $T$ with properties (i) and (ii), we can demix a mixture $x_1$ of signals in $S$, $x_1(t) = \sum_{j=1}^{N} s_j(t)$, via

$s_j = T^{-1}(1_{\Lambda_j} T x_1)$   (1)

where $1_{\Lambda_j}$ is the indicator function of the set $\Lambda_j$, i.e.,

$1_{\Lambda_j}(x) := \begin{cases} 1, & x \in \Lambda_j \\ 0, & \text{otherwise.} \end{cases}$   (2)

Going back to our example above, this corresponds to $s_j$ being equal to the inverse Fourier transform of $1_{\Lambda_j}\hat{x}_1$, which is certainly true since the functions in $S$ satisfy (ii).
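The Fourier-domain example can be made concrete with a short numerical sketch (ours, not the paper's: two synthetic tones stand in for frequency-disjoint sources, and the oracle support set `lam1` plays the role of $\Lambda_1$ in (1)):

```python
import numpy as np

n = 1024
t = np.arange(n)
s1 = np.cos(2 * np.pi * 50 * t / n)    # energy only at FFT bins 50 and n-50
s2 = np.cos(2 * np.pi * 200 * t / n)   # energy only at FFT bins 200 and n-200
x1 = s1 + s2                           # the single mixture

X1 = np.fft.fft(x1)                    # T x1, with T the Fourier transform
lam1 = np.abs(np.fft.fft(s1)) > 1e-6   # Lambda_1 = supp(T s1), known here
s1_hat = np.fft.ifft(np.where(lam1, X1, 0)).real  # eq. (1): invert 1_Lambda T x1
```

Because the supports are disjoint, `s1_hat` matches `s1` to machine precision; with overlapping supports the masked inversion would leak energy between the sources.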

Suppose now that we have another mixture $x_2(t) = \sum_{j=1}^{N} a_j s_j(t - \delta_j)$, which is the case in anechoic environments

when we have two microphones. In the mixing, $a_j$ and $\delta_j$ are the relative attenuation and delay parameters, respectively, corresponding to the $j$th source. Assume

(iii) $\operatorname{supp} T s(\cdot - \delta) = \operatorname{supp} T s$ for any $s \in S$, $|\delta| \leq \Delta$, and

(iv) there exist functions $a$ and $d$ such that $a_j = a(T x_1(x), T x_2(x))$ and $\delta_j = d(T x_1(x), T x_2(x))$ for $x \in \Lambda_j$, for $j = 1, \ldots, N$,

where $\Delta$ is the maximum possible delay between mixtures due to the distance separating the sensors. Using (iii) and (iv), we can label each $x \in \operatorname{supp} T x_1$ with the pair $(a(T x_1(x), T x_2(x)),\, d(T x_1(x), T x_2(x)))$, and $\Lambda_j$ is exactly the set of all points with the label $(a_j, \delta_j)$. It follows that given the mixtures $x_1(t)$ and $x_2(t)$, we can demix via

$s_j = T^{-1}(1_{\Lambda_j} T x_1).$   (3)

Clearly, (iii) will be satisfied for the example above since the Fourier transform of $s(\cdot - \delta)$ will be just a modulated version of the Fourier transform of $s$ and thus it will have the same support as $\hat{s}$. As to the existence of functions $a$ and $d$, one can show that $a(\hat{x}_1(\omega), \hat{x}_2(\omega)) := |\hat{x}_2(\omega)/\hat{x}_1(\omega)|$ and $d(\hat{x}_1(\omega), \hat{x}_2(\omega)) := -\frac{1}{\omega}\angle(\hat{x}_2(\omega)/\hat{x}_1(\omega))$, where $\angle z$ denotes the phase of the complex number $z$ taken between $-\pi$ and $\pi$, satisfy (iv).
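The labeling functions $a$ and $d$ can likewise be sketched numerically (illustrative only: exact-bin tones and integer circular shifts stand in for anechoic attenuation and delay):

```python
import numpy as np

n = 1024
t = np.arange(n)
s1 = np.cos(2 * np.pi * 50 * t / n)
s2 = np.cos(2 * np.pi * 200 * t / n)
a1, d1 = 0.8, 2                        # attenuation and delay of source 1
x1 = s1 + s2
x2 = a1 * np.roll(s1, d1) + 1.2 * np.roll(s2, 5)  # circular shift = delay

X1, X2 = np.fft.fft(x1), np.fft.fft(x2)
k = 50                                 # a frequency bin inside Lambda_1
omega = 2 * np.pi * k / n              # radians per sample at bin k
ratio = X2[k] / X1[k]
a_est = np.abs(ratio)                  # a(Tx1, Tx2) = |X2 / X1|
d_est = -np.angle(ratio) / omega       # d(Tx1, Tx2) = -(1/omega) angle(X2/X1)
```

At any bin inside $\Lambda_1$ only source 1 contributes, so the ratio of the two mixture transforms is $a_1 e^{-i\omega\delta_1}$ and `(a_est, d_est)` recovers $(0.8, 2)$; note that the phase-based delay estimate is only unambiguous while $|\omega\delta| < \pi$.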

The general algorithm explained above mainly depends on two major points: (a) the existence of an invertible transformation $T$ that transforms the signals to a domain in which they have disjoint representations (properties (i), (ii), and (iii)), and (b) finding functions $a$ and $d$ that provide the means of labeling on the transform domain (property (iv)). Note that in the description above we required $a$ and $d$ to yield the exact mixing parameters. Although this is desired, since the mixing parameters provide the perfect labels and can also be used for various other purposes (e.g., direction-of-arrival determination), it is not necessary for the demixing algorithm to work. Any function that provides a unique labeling on the transform domain is sufficient. Moreover, requirement (ii), that the transformation $T$ is "disjoint," is very strong. In practice, one is usually more interested in transforms that satisfy (ii) in some approximate sense. Transforms that result in sparse representations of the signals of interest, representations in which a small percentage of the signal coefficients capture a large percentage of the signal energy, can lead to (ii) being approximately satisfied.

There are many examples in the literature that use this type of approach with various choices of $T$ for various mixing models and demixing methods [1–12]. The mixing model in [1–3, 5, 8, 9, 11] is "instantaneous" (sources


have different amplifications in different mixtures) while [4, 6, 7, 10, 12] use an anechoic mixing model (sources have different amplifications and time delays in different mixtures). [1–3, 11] consider the time domain sampling operator as $T$. The general assumption in these is that at any given time at most one source is non-zero. [4–7, 9, 10, 12] use the short-time Fourier transform (STFT) operator as $T$. Condition (ii) is satisfied in this case, at least approximately, because of the sparsity of the time-frequency representations of speech signals. Empirical support for this can be found in [7, 13], and a more extensive discussion is given in Section II-A. [8] chooses $T$ depending on the signal class of interest in such a way that it yields a sparse representation. In principle, [1–12] all use some clustering algorithm for estimating the mixing parameters, although there are several different approaches to demixing. [1, 3, 4, 6, 7, 9–11] use a labeling scheme based on the estimated mixing parameters and thus demix in the above described way by creating binary masks in the transform domain corresponding to each source. That is, given the mixtures $x_1$ and $x_2$, demixing is done by grouping the clusters of points in $(T x_1, T x_2)$ space, although different techniques are used to detect these clusters. For example, [4, 6, 7, 9, 10] demix essentially by constructing binary time-frequency masks that partition the time-frequency plane such that each partition corresponds to the time-frequency points that "belong" to a particular source. The fact that such a mask exists has also been observed in [14], in the context of BSS of speech signals from one mixture, and in [15], in the context of source localization. In [2, 8, 11, 12], the demixing is done by making additional assumptions on the statistical properties of the sources and using a maximum a posteriori (MAP) estimator. [5, 11] demix by assuming that the number of sources active in the transform domain at any given point is equal to the number of mixtures. They then demix by inverting the now non-degenerate $M$-by-$M$ mixing matrices and appropriately combining the outputs. The above comparison is summarized in Table I.

TABLE I
A COMPARISON OF DEGENERATE DEMIXING METHODS USING DISJOINT REPRESENTATIONS.

mixing model                       | T operator             | demixing
instantaneous [1–3, 5, 8, 9, 11]   | sampling [1–3, 11]     | masking [1, 3, 4, 6, 7, 9–11]
anechoic [4, 6, 7, 10, 12]         | STFT [4–7, 9, 10, 12]  | MAP [2, 8, 11, 12]
                                   | signal dependent [8]   | matrix masking [5, 11]


In this paper, for the linear transform $T$, we use the short-time Fourier transform (STFT) and Gabor expansions (the discrete version of the STFT) of speech signals. We present extensive empirical evidence that speech signals indeed satisfy (ii) in an approximate sense when $T$ is the STFT with an appropriate window function. Based on this, we extend the Degenerate Unmixing Estimation Technique (DUET), originally presented in [4] for sources with disjointly supported STFTs, to anechoic mixtures of speech signals. The algorithm we propose relies on estimating the mixing parameters via maximum likelihood motivated estimators and constructing binary time-frequency masks using these estimates. Thus the method presented here: (1) uses an anechoic mixing model, (2) uses the STFT as $T$, and (3) performs demixing via masking.

In Section II we introduce a way of measuring the degree of "approximate" W-disjoint orthogonality, $\mathrm{WDO}_M$, of a signal in a given mixture for a given mask $M$. We construct a family of time-frequency masks, $M^x$, that correspond to the indicator functions of the time-frequency points in which one source dominates the others by $x$ dB. We test the demixing performance of these masks experimentally and illustrate that $\mathrm{WDO}_{M^x}$ is indeed a good measure of the demixing performance of the masks $M^x$. The results show that binary time-frequency masks exist that are capable of demixing several speech signals from just a single mixture. At present, there is no known robust technique for determining these masks blindly from one mixture. However, in Section III we derive a technique that, given a second anechoic mixture, can approximate these demixing masks blindly. We first derive the maximum likelihood estimators for the delay and attenuation coefficients. We then compare the performance of these with other estimators motivated by the maximum likelihood estimators. The modified delay and attenuation estimators are weighted averages of the instantaneous time-frequency delay and attenuation estimates. We combine the delay and attenuation estimators and show that a weighted two-dimensional histogram can be used to enumerate the sources, determine the mixing parameters, and demix the sources. The number of peaks in the histogram is the number of sources, the peak locations reveal the mixing parameters, and the mixing parameters can be used to partition the time-frequency representation of one of the mixtures to obtain estimates of the original sources. In Section IV, we verify the method, presenting demixing results for speech signals mixed synthetically and in both anechoic and echoic rooms.
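To preview the histogram idea, here is a toy sketch (ours, not the algorithm of Section III: frequency-disjoint tones, integer circular delays, and a bare-bones Hann STFT) in which the power weighted histogram of the instantaneous estimates $(|\hat{x}_2/\hat{x}_1|, -\angle(\hat{x}_2/\hat{x}_1)/\omega)$ peaks at a source's mixing parameters:

```python
import numpy as np

nwin = 256
n = 4096
t = np.arange(n)
s1 = np.cos(2 * np.pi * 30 * t / nwin)   # source 1: falls on STFT bin 30
s2 = np.cos(2 * np.pi * 100 * t / nwin)  # source 2: falls on STFT bin 100
a1, d1 = 0.7, 1                          # hypothetical mixing parameters
a2, d2 = 1.3, -1
x1 = s1 + s2
x2 = a1 * np.roll(s1, d1) + a2 * np.roll(s2, d2)  # circular shifts as delays

def stft(x):
    hop = nwin // 2
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(nwin) / nwin)  # periodic Hann
    return np.fft.rfft(np.array(
        [w * x[i:i + nwin] for i in range(0, len(x) - nwin + 1, hop)]), axis=1)

X1, X2 = stft(x1), stft(x2)
keep = np.abs(X1) > 0.05 * np.abs(X1).max()   # ignore near-silent TF points
omega = np.broadcast_to(2 * np.pi * np.arange(X1.shape[1]) / nwin, X1.shape)
ratio = X2[keep] / X1[keep]
a_est = np.abs(ratio)                    # instantaneous attenuation estimates
d_est = -np.angle(ratio) / omega[keep]   # instantaneous delay estimates

hist, a_edges, d_edges = np.histogram2d(
    a_est, d_est, bins=25, range=[[0.0, 2.0], [-3.0, 3.0]],
    weights=(np.abs(X1) * np.abs(X2))[keep])  # power weighting
ia, idel = np.unravel_index(np.argmax(hist), hist.shape)
peak_a = 0.5 * (a_edges[ia] + a_edges[ia + 1])
peak_d = 0.5 * (d_edges[idel] + d_edges[idel + 1])
```

With these toy sources, the largest histogram cell sits at the (attenuation, delay) pair of the stronger source, $(a_2, \delta_2) = (1.3, -1)$; the second source produces the other peak.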


II. W-DISJOINT ORTHOGONALITY

In this section, we focus on showing that binary time-frequency masks exist which are capable of separating multiple speech signals from one mixture. Our goal is, given a mixture

$x_1(t) := \sum_{j=1}^{N} s_j(t)$   (4)

of sources $s_j(t)$, $j = 1, \ldots, N$, to recover the original sources. In order to accomplish this, we exploit the fact that the sources are pairwise approximately W-disjoint orthogonal. In this section we will define a quantitative measure of W-disjoint orthogonality and relate this measure to demixing performance.

We call two functions $s_1(t)$ and $s_2(t)$ W-disjoint orthogonal (W-DO) if, for a given window function $W(t)$, the supports of the short-time Fourier transforms (STFTs) of $s_1$ and $s_2$ are disjoint [4]. The STFT of $s_j$ is defined

$F^W(s_j(\cdot))(\tau, \omega) := \frac{1}{\sqrt{2\pi}} \int W(t - \tau)\, s_j(t)\, e^{-i\omega t}\, dt$   (5)

which we will refer to as $\hat{s}_j(\tau, \omega)$. For a detailed discussion of the properties of this transform, consult [16]. The W-disjoint orthogonality assumption can be stated concisely:

$\hat{s}_1(\tau, \omega)\, \hat{s}_2(\tau, \omega) = 0, \quad \forall \tau, \omega.$   (6)

The two limiting cases for $W(t)$, namely $W(t) \equiv 1$ and $W(t) = \delta(t)$, result in interesting sets of W-DO signals. In the $W(t) \equiv 1$ case, the $\tau$ argument in (6) is irrelevant because the windowed Fourier transform is simply the Fourier transform. In this case, the condition is satisfied by signals which are frequency disjoint, such as frequency division multiplexed signals. In the other extreme, when $W(t) = \delta(t)$, signals which are time disjoint, such as time-division multiplexed signals, satisfy the condition. For window functions which are well localized in time and frequency, the W-disjoint orthogonality condition leads to signals such as those used in frequency-hopped multiple access systems [17]. Indeed, the method presented here could be applied to time domain multiplexed, frequency domain multiplexed, or frequency-hopped multiple access signals; however, in this paper we exclusively consider speech signals.

Unfortunately, (6) will not be satisfied for simultaneous speech signals because the time-frequency representation of active speech is rarely zero. However, speech is sparse in that a small percentage of the time-frequency coefficients in the Gabor expansion of speech capture a large percentage of the overall energy. In other words, the magnitude of the


Gabor coefficients of speech is often small. For different speech signals, it is unlikely that the large Gabor coefficients will coincide, which leads to the signals being W-disjoint orthogonal in an approximate sense. The goal of this section is to show that speech signals satisfy a weakened version of (6) and are thus approximately W-DO. The higher the degree of approximate W-disjoint orthogonality, the better the separation results that are possible. Figure 1 illustrates that speech signals have sparse time-frequency representations and satisfy a weakened version of (6), in that the product of their time-frequency representations is almost always small. A condition similar to (6) is also considered in [18], the only difference being that the time-frequency transform used was the Wigner distribution. Signals satisfying (6) for the Wigner distribution were called "time-frequency disjoint."

The approximate W-disjoint orthogonality of speech has been described as the "sparsity" and "disjointness" of the short-time Fourier transform of the sources [5], "when one source has large energy the other does not" and "harmonic components" which "hardly overlap" [7], "when a datapoint is large the most likely decomposition is to assume that it belongs to a single source" [12], "spectra [that] are non-overlapping" [14], and "useful" time-frequency points containing a "contribution of one speaker...significantly higher than the energy of the other speaker" [19]. A quantitative measure of approximate W-disjoint orthogonality is discussed later in this section.

Fig. 1. A picture of W-disjoint orthogonality. The three panels are grayscale time-frequency (time vs. frequency) images of $|\hat{s}_1(\tau, \omega)|$ (top), $|\hat{s}_2(\tau, \omega)|$ (middle), and $|\hat{s}_1(\tau, \omega)\hat{s}_2(\tau, \omega)|$ (bottom) for two speech signals $s_1(t)$ and $s_2(t)$, sampled at 16 kHz and normalized to have unit energy. A Hamming window of length 64 ms was used as $W(t)$, and all signals had a length of three seconds. $|\hat{s}_1(\tau, \omega)\hat{s}_2(\tau, \omega)|$ contains fewer large components than $|\hat{s}_1(\tau, \omega)|$ or $|\hat{s}_2(\tau, \omega)|$. Further analysis of these signals reveals that the time-frequency points containing the bulk of the energy of $s_1$ contain only a small percentage of the energy of $s_2$, and vice versa. Thus, these speech signals approximately satisfy the W-disjoint orthogonality condition.


We can rewrite the model from (4) in the time-frequency domain:

$\hat{x}_1(\tau, \omega) = \hat{s}_1(\tau, \omega) + \cdots + \hat{s}_N(\tau, \omega).$   (7)

Assuming the sources are pairwise W-DO, at most one of the sources will be non-zero for a given $(\tau, \omega)$, and thus

$\hat{x}_1(\tau, \omega) = \hat{s}_{J(\tau, \omega)}(\tau, \omega)$   (8)

where $J(\tau, \omega)$ is the index of the source active at $(\tau, \omega)$. To demix, one creates the time-frequency mask corresponding to each source and applies each mask to the mixture to produce the original source time-frequency representations. For example, defining

$M_j(\tau, \omega) := \begin{cases} 1, & \hat{s}_j(\tau, \omega) \neq 0 \\ 0, & \text{otherwise,} \end{cases}$   (9)

the indicator function for the support of $\hat{s}_j$, one obtains the time-frequency representation of $s_j$ from the mixture via

$\hat{s}_j(\tau, \omega) = M_j(\tau, \omega)\, \hat{x}_1(\tau, \omega), \quad \forall \tau, \omega.$   (10)
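Equations (7)–(10) can be exercised with a minimal STFT/overlap-add pair (our own sketch, not the paper's implementation; two exact-bin tones stand in for W-DO speech, and the mask is an oracle dominance variant of (9)):

```python
import numpy as np

def stft(x, nwin=256):
    """Windowed DFT frames at 50% overlap: a discrete version of eq. (5)."""
    hop = nwin // 2
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(nwin) / nwin)  # periodic Hann
    frames = [w * x[i:i + nwin] for i in range(0, len(x) - nwin + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(S, nwin=256):
    """Overlap-add inverse; the Hann window at 50% overlap sums to one."""
    hop = nwin // 2
    frames = np.fft.irfft(S, n=nwin, axis=1)
    x = np.zeros((len(frames) - 1) * hop + nwin)
    for m, fr in enumerate(frames):
        x[m * hop:m * hop + nwin] += fr
    return x

n = 4096
t = np.arange(n)
s1 = np.cos(2 * np.pi * 30 * t / 256)    # occupies STFT bins near 30
s2 = np.cos(2 * np.pi * 100 * t / 256)   # occupies STFT bins near 100
X = stft(s1 + s2)                        # eq. (7): the mixture's coefficients
M1 = (np.abs(stft(s1)) > np.abs(stft(s2))).astype(float)  # oracle mask, cf. (9)
s1_hat = istft(M1 * X)                   # eq. (10) followed by inversion

core = slice(256, n - 256)               # ignore partially covered edge frames
snr = 10 * np.log10(np.sum(s1[core] ** 2)
                    / np.sum((s1_hat[core] - s1[core]) ** 2))
```

Because the two tones occupy disjoint time-frequency supports, `snr` is large; for real speech the supports only approximately satisfy (6), which is exactly what the measures of Section II-A quantify.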

A. Measuring the W-Disjoint Orthogonality of Speech

Clearly, the W-disjoint orthogonality assumption is not strictly satisfied for our signals of interest. We introduce here a measure of approximate W-disjoint orthogonality based on the demixing performance of time-frequency masks created using knowledge of the instantaneous source and interference time-frequency powers. In order to measure W-disjoint orthogonality for a given mask, we combine two important performance criteria: (1) how well the mask preserves the source of interest, and (2) how well the mask suppresses the interfering sources. These two criteria, the preserved-signal ratio (PSR) and the signal-to-interference ratio (SIR), are introduced below.

First, given a time-frequency mask $M$ such that $0 \leq M(\tau, \omega) \leq 1$ for all $(\tau, \omega)$, we define $\mathrm{PSR}_M$, the PSR of the mask $M$, as

$\mathrm{PSR}_M := \frac{\| M(\tau, \omega)\, \hat{s}_j(\tau, \omega) \|^2}{\| \hat{s}_j(\tau, \omega) \|^2}$   (11)

which is the portion of energy of the $j$th source remaining after demixing using the mask. Note that $\mathrm{PSR}_M \leq 1$, with $\mathrm{PSR}_M = 1$ only if $\operatorname{supp} \hat{s}_j \subseteq \operatorname{supp} M$. Here $\| \hat{x}(\tau, \omega) \|^2 := \iint |\hat{x}(\tau, \omega)|^2 \, d\tau \, d\omega$. Now, we define

$y_j(t) := \sum_{k \neq j} s_k(t)$   (12)


so that $y_j(t)$ is the summation of the sources interfering with the $j$th source. Then, we define the signal-to-interference ratio of the time-frequency mask $M(\tau, \omega)$:

$\mathrm{SIR}_M := \frac{\| M(\tau, \omega)\, \hat{s}_j(\tau, \omega) \|^2}{\| M(\tau, \omega)\, \hat{y}_j(\tau, \omega) \|^2}$   (13)

which is the output signal-to-interference ratio after using the mask to demix.

We now combine the $\mathrm{PSR}_M$ and $\mathrm{SIR}_M$ into one measure of approximate W-disjoint orthogonality. We propose the normalized difference between the signal energy maintained in masking and the interference energy maintained in masking as a measure of the W-disjoint orthogonality associated with a particular mask:

$\mathrm{WDO}_M := \frac{\| M \hat{s}_j \|^2 - \| M \hat{y}_j \|^2}{\| \hat{s}_j \|^2}$   (14)

$= \mathrm{PSR}_M - \frac{\mathrm{PSR}_M}{\mathrm{SIR}_M}.$   (15)

For signals which are W-DO, using the mask $M_j(\tau, \omega)$ defined in (9), we note that $\mathrm{PSR}_{M_j} = 1$, $\mathrm{SIR}_{M_j} = \infty$, and $\mathrm{WDO}_{M_j} = 1$. This is the maximum obtainable WDO value because $\mathrm{WDO}_M \leq 1$ for all $M$ such that $0 \leq M(\tau, \omega) \leq 1$. Moreover, for any $M$, $\mathrm{WDO}_M = 1$ implies that $\mathrm{PSR}_M = 1$, $\mathrm{SIR}_M = \infty$, and that (6) is satisfied. That is, $\mathrm{WDO}_M = 1$ implies that the signals are W-DO and that mask $M$ perfectly separates the $j$th source from the mixture. In order for a mask to have $\mathrm{WDO}_M \approx 1$, i.e., good demixing performance, it must simultaneously preserve the energy of the signal of interest while suppressing the energy of the interference. The failure of a mask to accomplish either of these goals can result in a small, even negative, WDO value. For example, $\mathrm{WDO}_M = 0$ implies either that $\mathrm{PSR}_M = 0$ (the mask kills all the energy of the source of interest) or that $\mathrm{SIR}_M = 1$ (the mask results in equal energy for source and interference). Masks with $\mathrm{SIR}_M < 1$ have associated $\mathrm{WDO}_M < 0$.

Now we establish that binary time-frequency masks exist which are capable of demixing speech signals from one

mixture and detail their performance in relation to the three presented measures. Consider the following family of time-frequency masks:

$M_j^x(\tau, \omega) := \begin{cases} 1, & 20 \log_{10}\!\left( |\hat{s}_j(\tau, \omega)| / |\hat{y}_j(\tau, \omega)| \right) > x \\ 0, & \text{otherwise} \end{cases}$   (16)

which is the indicator function for the time-frequency points where $s_j$ dominates the interference in the mixture by $x$ dB. We will use $\mathrm{PSR}_j(x)$ and $\mathrm{SIR}_j(x)$ as shorthand for $\mathrm{PSR}_{M_j^x}$ and $\mathrm{SIR}_{M_j^x}$, respectively.

To determine the demixing ability of the above mask type, the masks for various $x$ were applied to speech mixtures of various orders, and the demixing performance measures, $\mathrm{PSR}_j(x)$ and $\mathrm{SIR}_j(x)$, were determined. We refer to a mixture of $N$ sources as a mixture of order $N$, and the mixtures used in these tests had orders $N = 2, 3, \ldots, 10$. The demixed speech was then rated by the authors as falling into one of five subjective categories. The speech signals were selected from 16 male and 16 female continuous speech segments of three seconds taken from the TIMIT database and normalized to unit energy. The time-frequency representation of the 16 kHz sampled data was created using a Hamming window of 1024 samples with 50% overlap. The results of the 333 listening tests are displayed in Figure 2. We note that there is a fairly accurate relationship between the WDO performance measure and the subjective ratings listed in the table under the figure.

rating  meaning
1       perfect
2       minor artifacts or interference
3       distorted but intelligible
4       very distorted and barely intelligible
5       not intelligible

Fig. 2. Results of the subjective listening test performed by the authors. Each test's rating (1–5) is plotted at its (SIR (dB), PSR (dB)) location, with SIR ranging from 5 to 40 dB and PSR from −10 to 0 dB, and lines of constant WDO (0.2 through 1.0) overlaid. For example, a WDO between 0.6 and 0.8 implies a "minor artifacts or interference" rating or better.

Now that we have some idea how PSR, SIR, and WDO map to demixing performance, we analyze the demixing performance of the masks described in (16). Figure 3 shows plots of $\mathrm{PSR}_j(x)$ versus $\mathrm{SIR}_j(x)$ and a table of $(\mathrm{PSR}_j(x), \mathrm{SIR}_j(x))$ pairs averaged for groups of speech mixtures of different orders. For $N = 2$, each source was compared against each of the remaining 31 sources, resulting in $32 \times 31 = 992$ tests being averaged for each data point. For larger $N$, each source was compared against a random mixing of $N - 1$ of the remaining 31 sources. This was done 31 times per source in order to keep the number of tests per data point constant at 992. As we tested mixtures from $N = 2$ to $N = 10$, a total of $9 \times 992 = 8928$ mixtures were created to generate the data for Figure 3. Figure 3


demonstrates that time-frequency masks exist which exhibit excellent demixing performance; for example, considering the 0 dB mask $M_j^0$, we see that on average this mask produces demixtures with a WDO measure greater than 0.6 for mixtures of up to ten sources.

Fig. 3. Time-frequency mask demixing performance. The plot shows $\mathrm{PSR}_j(x)$ (in dB, from −10 to 0) versus $\mathrm{SIR}_j(x)$ (in dB, from 5 to 40) for $x = 0, 5, \ldots, 30$ dB and $N = 2, 3, 4, 5, 10$; the accompanying table lists the averaged $(\mathrm{PSR}_j(x), \mathrm{SIR}_j(x))$ pairs. The different gray regions correspond to different regions of approximate W-disjoint orthogonality as determined by the lines of constant WDO. For example, using the $x = 5$ dB mask in mixtures of four sources yields 14.32 dB output SIR while maintaining 83% of the desired source energy. This $(\mathrm{PSR}, \mathrm{SIR}) = (0.83, 14.32\ \mathrm{dB})$ pair results in $\mathrm{WDO} \approx 0.8$, which from Figure 2 implies perfect demixing performance. In other words, if we can correctly map time-frequency points with 5 dB or more single source dominance to the correct corresponding output partition, we can recover 83% of the energy of each of the original sources and produce demixtures with 14.32 dB output SIR from a mixture of four sources.

Now that we know that good time-frequency masks exist, we wish to determine the dependence of these performance measures on the window function $W(t)$ and the window size. For this task, we examine the performance of the 0 dB mask, $M_j^0$. Figure 4 shows PSR, SIR, and WDO for pairwise mixing for various window sizes and types. Each data point in the figure represents the average of the results for 992 mixtures. In all measures, the Hamming window of size 1024 samples performed the best. Note, however, that the performance of the other windows (with the exception of the rectangular window) was extremely similar and exhibited better than 90% W-disjoint orthogonality for pairwise mixing across a wide range of window sizes (from roughly 500 to 4000 samples). Other mixture orders and masks (i.e., $M_j^x$ for $x > 0$) exhibited similar performance, and in all cases the Hamming window of size 1024 had the best performance. A similar

Page 12: 1 Blind Separation of Speech Mixtures via Time-Frequency ...oyilmaz/preprints/wdo_final.pdf · [5, 11] demix by assuming that the number of sources active in the transform domain


conclusion regarding the optimal time-frequency resolution of a window for speech separation was arrived at in [7].

Note that even when the window size is 1 (i.e., the time-frequency representation reduces to the time samples themselves), the mixtures still exhibit a high level of PSR, SIR,

and WDO. This fact was exploited by those methods that used the time-disjoint nature of speech [1–3, 11]. However,

Figure 4 clearly shows the advantage of moving from the time domain to the time-frequency domain: the speech

signals are more disjoint in the time-frequency domain provided the window size is sufficiently large. Choosing the

window size too large, however, results in reduced W-disjoint orthogonality.

[Figure 4 plots: PSR, SIR (dB), and WDO versus log2(window size), for window sizes from 2^0 to 2^14.]

Fig. 4. Window size and type comparison: Hamming, Blackman, Hann, Triangle, and Rectangle windows. PSR, SIR, and WDO for the 0 dB mask for window sizes 2^0, 2^1, ..., 2^14 samples for various window types for pairwise mixing of speech signals sampled at 16 kHz. The Hamming window of size 1024 has the best performance.

We close this section by proposing WDO evaluated with the 0 dB mask, \Omega_0, as the general measure of W-disjoint orthogonality. Table II shows these WDO values for mixtures of various orders. Again, each data point represents the average measurement over the set of test mixtures. It can be shown using (14) that the 0 dB mask, \Omega_0, maximizes WDO, and thus the 0 dB mask represents the upper bound of WDO for any mask. We thus say that, for example, speech signals in pairwise mixtures are W-disjoint orthogonal to the percentage listed for N = 2 in Table II.
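The interplay of the three measures can be checked numerically. The following Python helper is our own illustrative sketch (the function name and array layout are ours, not the paper's); it evaluates PSR, SIR, and the combined measure WDO = PSR - PSR/SIR for a candidate binary mask applied to a target source STFT in the presence of the summed interference:

```python
import numpy as np

def mask_performance(mask, S_target, S_interf):
    """PSR, SIR (linear), and WDO for a binary time-frequency mask.

    mask     : 0/1 array over the time-frequency grid
    S_target : STFT of the desired source
    S_interf : STFT of the summed interfering sources (assumed not fully
               masked out, so the SIR denominator is non-zero)
    """
    e_target = np.sum(np.abs(S_target) ** 2)          # total source energy
    e_kept = np.sum(np.abs(mask * S_target) ** 2)     # energy the mask keeps
    e_interf = np.sum(np.abs(mask * S_interf) ** 2)   # leaked interference
    psr = e_kept / e_target
    sir = e_kept / e_interf
    wdo = psr - psr / sir                             # combined measure
    return psr, sir, wdo
```

For the 0 dB mask, `mask` is simply the indicator of the time-frequency points where the target magnitude exceeds the interference magnitude.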

TABLE II

WDO FOR THE 0 DB MASK FOR MIXTURES OF VARIOUS ORDERS.

N       2   3   4   5   6   7   8   9   10
% WDO   (the per-order values are not recoverable from this copy)


III. PARAMETER ESTIMATION AND DEMIXING

In this section, we will present a demixing algorithm that separates an arbitrary number of sources using two mixtures. We start by describing our anechoic mixing model. Suppose we have sources s_1(t), \ldots, s_N(t). Let x_1(t) and x_2(t) be the mixtures such that

x_j(t) = \sum_{k=1}^{N} a_{jk}\, s_k(t - \delta_{jk}), \qquad j = 1, 2,   (17)

where the parameters a_{jk} and \delta_{jk} are the attenuation coefficients and the time delays associated with the path from the kth source to the jth receiver. Without loss of generality we set a_{1k} = 1 and \delta_{1k} = 0 for k = 1, \ldots, N; for simplicity we rename a_{2k} as a_k and \delta_{2k} as \delta_k. In addition we assume that the windowed Fourier transform of any source function, \hat{s}_k(\tau, \omega), satisfies the narrowband assumption for array processing, i.e.,

\hat{s}_k(\cdot - \delta)(\tau, \omega) = e^{-i\omega\delta}\, \hat{s}_k(\tau - \delta, \omega) \approx e^{-i\omega\delta}\, \hat{s}_k(\tau, \omega).   (18)

This assumption is realistic as long as the window function W is chosen appropriately. A detailed discussion about

this assumption can be found in [20].

Now we go back to discussing the mixing model described in (17). We take the STFT of x_1 and x_2 with an appropriate choice of W. Using the assumption discussed above, the mixing model (17) reduces to

\begin{pmatrix} \hat{x}_1(\tau, \omega) \\ \hat{x}_2(\tau, \omega) \end{pmatrix} = \begin{pmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{pmatrix} \begin{pmatrix} \hat{s}_1(\tau, \omega) \\ \vdots \\ \hat{s}_N(\tau, \omega) \end{pmatrix}.   (19)
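The anechoic model (17) is straightforward to simulate. The sketch below is our own illustration, restricted to nonnegative integer sample delays (a simplification; the model itself allows fractional delays); it builds the two mixtures with a_{1k} = 1 and \delta_{1k} = 0:

```python
import numpy as np

def anechoic_stereo_mix(sources, attens, delays):
    """Build the two mixtures of eq. (17) with a_{1k} = 1 and delta_{1k} = 0.

    sources : list of 1-D arrays s_k(t)
    attens  : relative attenuations a_k at the second microphone
    delays  : relative delays delta_k in samples (nonnegative integers here)
    """
    n = max(len(s) + d for s, d in zip(sources, delays))
    x1 = np.zeros(n)
    x2 = np.zeros(n)
    for s, a, d in zip(sources, attens, delays):
        x1[:len(s)] += s                 # first microphone: unit gain, no delay
        x2[d:d + len(s)] += a * s        # second microphone: scaled and delayed
    return x1, x2
```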

A. Parameter Estimation and Demixing for W-DO sources

To motivate the Degenerate Unmixing Estimation Technique (DUET), which we will describe in the next section,

we first consider the case where the sources are W-DO, i.e.,

\hat{s}_j(\tau, \omega)\, \hat{s}_k(\tau, \omega) = 0, \quad \forall (\tau, \omega), \;\; \forall j \neq k.   (20)

This condition is the idealization of the properties of speech signals discussed in Section II. We now construct the parameter estimators and the demixing algorithm for W-DO signals. Clearly, when the sources are W-DO, at most one


source will be active at any time-frequency point (\tau, \omega); in particular, for any (\tau, \omega) at which \hat{x}_1(\tau, \omega) \neq 0, there exists a k such that \hat{s}_k(\tau, \omega) \neq 0 and \hat{s}_j(\tau, \omega) = 0 for j \neq k. Recalling the definition of the time-frequency mask M_k in (10), we note that M_k(\tau, \omega)\, M_j(\tau, \omega) = 0 for all (\tau, \omega) if k \neq j. From (19) we deduce that

\hat{s}_k = M_k\, \hat{x}_1.   (21)

This shows that we can demix an arbitrary number of sources from only one of the mixtures if we can construct the corresponding mask M_k for each source. Next we will describe how to construct the masks M_k using the mixtures x_1 and x_2.

Let k be arbitrary, and define \Lambda_k := \{(\tau, \omega) : M_k(\tau, \omega) = 1\} for k = 1, \ldots, N. Note that the \Lambda_k are pairwise disjoint. Now consider

R_{21}(\tau, \omega) := \hat{x}_2(\tau, \omega) / \hat{x}_1(\tau, \omega).   (22)

Clearly, on \Lambda_k,

R_{21}(\tau, \omega) = a_k\, e^{-i\omega\delta_k}.   (23)

In this case |R_{21}(\tau, \omega)| = a_k and -\frac{1}{\omega}\,\angle R_{21}(\tau, \omega) = \delta_k, where \angle z denotes the phase of the complex number z, taken between -\pi and \pi.

The observation above yields a way of constructing the sets \Lambda_k and thus a demixing algorithm: we simply label each time-frequency point (\tau, \omega) with the pair (|R_{21}(\tau, \omega)|, -\frac{1}{\omega}\,\angle R_{21}(\tau, \omega)). Since the sources are W-DO, there will be N distinct labels. By grouping the time-frequency points (\tau, \omega) with the same label, we construct the sets \Lambda_k, and thus the masks M_k, k = 1, \ldots, N.

The above described demixing algorithm is the motivation behind DUET. Note that the algorithm separates the

sources without inverting the mixing matrix, which makes it possible to deal with mixtures of an arbitrary number of

sources. Aside from demixing, it also yields the mixing parameters: the labels (|R_{21}(\tau, \omega)|, -\frac{1}{\omega}\,\angle R_{21}(\tau, \omega)) which

we used to construct the masks are exactly the mixing parameters a_k and \delta_k. Motivated by this fact we define the instantaneous DUET attenuation and delay parameter estimators as

\hat{a}(\tau, \omega) := |R_{21}(\tau, \omega)|,   (24)

\hat{\delta}(\tau, \omega) := -\frac{1}{\omega}\, \angle R_{21}(\tau, \omega),   (25)

Page 15: 1 Blind Separation of Speech Mixtures via Time-Frequency ...oyilmaz/preprints/wdo_final.pdf · [5, 11] demix by assuming that the number of sources active in the transform domain

15

respectively. We will use these estimators in the next section.

In summary, the DUET algorithm for demixing W-DO sources is,

1) From mixtures x_1(t) and x_2(t), construct time-frequency representations \hat{x}_1(\tau, \omega) and \hat{x}_2(\tau, \omega).

2) For each non-zero time-frequency point, calculate (\hat{a}(\tau, \omega), \hat{\delta}(\tau, \omega)).

3) Take the union of the (\hat{a}(\tau, \omega), \hat{\delta}(\tau, \omega)) pairs, P := \bigcup_{(\tau, \omega)} \{(\hat{a}(\tau, \omega), \hat{\delta}(\tau, \omega))\}. Note that P will be equal to \{(a_k, \delta_k) : k = 1, \ldots, N\}.

4) For each (a_k, \delta_k) in P, k = 1, \ldots, N, set \hat{s}_k(\tau, \omega) = \hat{x}_1(\tau, \omega) for those (\tau, \omega) with (\hat{a}(\tau, \omega), \hat{\delta}(\tau, \omega)) = (a_k, \delta_k) and \hat{x}_1(\tau, \omega) \neq 0, and set \hat{s}_k(\tau, \omega) = 0 otherwise. Clearly, \hat{s}_k(\tau, \omega) will be the time-frequency representation of one of the original sources. The numbering of the sources is arbitrary.

5) Convert each \hat{s}_k(\tau, \omega) back into the time domain.
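For exactly W-DO inputs, the five steps can be sketched compactly. The following Python is our own illustration (function and variable names are ours); rounding the (â, δ̂) labels stands in for the exact label equality that holds in the ideal W-DO case:

```python
import numpy as np

def duet_wdo_demix(X1, X2, omega):
    """Demix exactly W-DO sources from mixture STFTs, per steps 1-5.

    X1, X2 : complex STFT arrays (frequency x time)
    omega  : non-zero angular frequency of each STFT row
    Returns a dict mapping each (a_k, delta_k) label to a masked copy of X1.
    """
    eps = 1e-12
    R21 = X2 / np.where(np.abs(X1) > eps, X1, eps)    # eq. (22)
    a_hat = np.abs(R21)                               # eq. (24)
    d_hat = -np.angle(R21) / omega[:, None]           # eq. (25)
    out = {}
    for i, j in zip(*np.nonzero(np.abs(X1) > eps)):
        # group points by their (attenuation, delay) label; eq. (21)
        label = (round(float(a_hat[i, j]), 6), round(float(d_hat[i, j]), 6))
        out.setdefault(label, np.zeros_like(X1))[i, j] = X1[i, j]
    return out
```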

Remark 1: Note that the instantaneous DUET delay estimator yields a meaningful estimate at a time-frequency point (\tau_0, \omega_0) only if

|\omega_0 \delta_k| < \pi.   (26)

This follows from the periodicity of the complex exponential. For (\tau, \omega) \in \Lambda_k, we have -\angle R_{21}(\tau, \omega) = \langle \omega \delta_k \rangle, where \langle \theta \rangle := \theta + 2\pi n with n the unique integer such that \theta + 2\pi n \in (-\pi, \pi]. When (26) is not satisfied, the delay estimate obtained

using the instantaneous DUET estimator will be a fraction of its true value. Let \omega_{\max} be the maximum element of \{\omega : \hat{s}_k(\tau, \omega) \neq 0 \text{ for some } k\}, which is the maximum frequency present in the sources, and denote by \omega_s the sampling rate. Let \delta_{\max} := \max_k |\delta_k|. Clearly, (26) is guaranteed for all k and for all \omega \in (-\omega_{\max}, \omega_{\max}] if

\omega_{\max}\, \delta_{\max} < \pi.   (27)

Now define \delta'_{\max} := \pi / \omega_{\max}. Any delay parameter with modulus less than \delta'_{\max} can be estimated correctly. Clearly, (27) is equivalent to the condition \delta_{\max} < \delta'_{\max}. If \omega_{\max} = \omega_s / 2, the Nyquist frequency, then this means that the maximum delay, \delta'_{\max} = 2\pi / \omega_s, is exactly equal to the sampling period. In other words, as long as the delay between the two microphone readings is less than a sample, the estimated phase will be accurate. While \omega_{\max} is determined by the characteristics of speech signals, the maximum physically possible delay, which we will denote by \delta^p_{\max}, is determined by the microphone spacing. For two microphones separated by a distance d, \delta^p_{\max} = d / c, where c is the


speed of sound. Clearly, we have \delta_{\max} \leq \delta^p_{\max}, and therefore (27) will be satisfied if \delta^p_{\max} < \delta'_{\max}. This suggests that one can guarantee (27) simply by choosing d, and thus \delta^p_{\max}, sufficiently small. For example, for a sampling rate \omega_s = 2\pi \cdot 16 kHz, assuming \omega_{\max} = \omega_s / 2 and c = 344 m/s, we obtain that \delta^p_{\max} < \delta'_{\max} as long as d < 2.15 cm. If we knew, however, that \omega_{\max} = 2\pi \cdot 2 kHz, then this distance would be increased by a factor of 4 to 8.60 cm. The smaller the largest frequency present in the signal, the larger the allowable microphone separation (or, equivalently, the larger we can choose \delta'_{\max}) that guarantees accurate phase parameter estimates. The demixing technique presented in this paper does not require knowledge of the value of d, but we assume that d is small enough that (27) is satisfied.
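The spacing bound is easy to evaluate numerically. In the sketch below the sampling rate matches the example, and the speed of sound is taken as 344 m/s (an assumed nominal value):

```python
# Largest microphone spacing for which |omega * delta| < pi holds (eqs. 26-27),
# assuming the maximum source frequency sits at the Nyquist frequency.
c = 344.0        # speed of sound in m/s (assumed nominal value)
f_s = 16000.0    # sampling rate in Hz

delta_max = 1.0 / f_s      # largest unambiguous delay: one sample period
d_max = c * delta_max      # corresponding microphone separation in metres

# If the maximum source frequency were 2 kHz rather than 8 kHz,
# the allowable spacing would grow by a factor of 4.
d_max_2k = d_max * (f_s / 2.0) / 2000.0
```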

B. Parameter Estimation and Demixing for Approximately W-DO sources

In Section II, we illustrated that the time-frequency representations of speech signals are nearly disjoint and demon-

strated that we can indeed recover a speech signal from one mixture of an arbitrary number of sources if we can

construct an appropriate time-frequency mask. This suggests that a weakened W-DO condition holds for speech sig-

nals: if at a time-frequency point one of the sources has considerable power, the contribution of all the other sources

at that time-frequency point is likely to be small. This observation is the key to the demixing algorithm we propose in

this section. First we shall discuss how to estimate the mixing parameters.

From this point on, instead of the continuous STFT, we use the equivalent discrete counterpart¹

\hat{s}_k[m, n] := \hat{s}_k(m \Delta\tau, n \Delta\omega),   (28)

where \Delta\tau and \Delta\omega are the time-frequency lattice spacing parameters. We define the instantaneous DUET delay estimate, the discrete version of (25),

\hat{\delta}[m, n] := -\frac{1}{n \Delta\omega}\, \angle R_{21}[m, n],   (29)

where R_{21}[m, n] := \hat{x}_2[m, n] / \hat{x}_1[m, n]. For convenience, we define \hat{\delta}[m, n] = 0 if \hat{x}_1[m, n] = 0 or \hat{x}_2[m, n] = 0. Similarly, we define the instantaneous DUET attenuation estimate

\hat{a}[m, n] := |R_{21}[m, n]|,   (30)

¹The equivalence is nontrivial and only true for appropriately chosen window functions W with sufficiently small \Delta\tau and \Delta\omega. An illustrative

discussion can be found in [16].


which is the discrete version of (24). For convenience, we define \hat{a}[m, n] = 1 if \hat{x}_1[m, n] = 0 or \hat{x}_2[m, n] = 0. For reasons discussed in Appendix I, we choose to estimate

\alpha_k := a_k - \frac{1}{a_k}   (31)

instead of directly estimating a_k, and define the instantaneous DUET symmetric attenuation estimate

\hat{\alpha}[m, n] := \hat{a}[m, n] - \frac{1}{\hat{a}[m, n]}.   (32)

We will say that s_k is dominant at [m, n] if |\hat{s}_k[m, n]| > |\hat{y}_k[m, n]|, where \hat{y}_k is as in (12). Note that the 0 dB mask, \Omega^k_0, in (16) is the indicator function for the dominant time-frequency points of the kth source. In Appendix I, we derive that under certain assumptions the maximum likelihood (ML) estimates for the mixing parameters a_k and \delta_k can be determined via certain weighted averages (see (57) and (59)) of the instantaneous DUET delay (\hat{\delta}[m, n]) and instantaneous DUET attenuation (\hat{a}[m, n]) estimates at the time-frequency points at which s_k is dominant. In Appendix II we compare the performance of the weighted estimators suggested by the ML derivation with other empirically motivated weighted estimators, and we illustrate that more accurate estimates (\tilde{\alpha}_k, \tilde{\delta}_k) of the true parameters (\alpha_k, \delta_k) are determined, over the set \tilde{\Lambda}_k of points where s_k is dominant, by

\tilde{\alpha}_k = \frac{\sum_{[m,n] \in \tilde{\Lambda}_k} |\hat{x}_1[m, n]\, \hat{x}_2[m, n]|^p\, \hat{\alpha}[m, n]}{\sum_{[m,n] \in \tilde{\Lambda}_k} |\hat{x}_1[m, n]\, \hat{x}_2[m, n]|^p},   (33)

\tilde{\delta}_k = \frac{\sum_{[m,n] \in \tilde{\Lambda}_k} |\hat{x}_1[m, n]\, \hat{x}_2[m, n]|^p\, (n \Delta\omega)^2\, \hat{\delta}[m, n]}{\sum_{[m,n] \in \tilde{\Lambda}_k} |\hat{x}_1[m, n]\, \hat{x}_2[m, n]|^p\, (n \Delta\omega)^2},   (34)

for an appropriately chosen weighting exponent p. To employ these estimates, however, we need first to construct the sets \tilde{\Lambda}_k := \{[m, n] : |\hat{s}_k[m, n]| > |\hat{y}_k[m, n]|\} for each k. Note that these sets would be the discrete versions of the \Lambda_k of Section III-A if the sources were W-DO. In the W-DO case we used the instantaneous DUET estimates as labels for each time-frequency point (\tau, \omega), and each \Lambda_k consisted of the points with identical labels. In the approximately W-DO case the instantaneous DUET

estimates for the time-frequency points in \tilde{\Lambda}_k will no longer be identical. However, we claim that we can still use these estimates as a means of labeling, and thus construct the sets \tilde{\Lambda}_k, at least approximately. Once we know the \tilde{\Lambda}_k, we demix simply by partitioning the support of \hat{x}_1[m, n] using the \tilde{\Lambda}_k and converting the resulting time-frequency representations back into the time domain. In order to determine the \tilde{\Lambda}_k, we rely on three observations which lead us to create a smoothed two-dimensional power weighted histogram of the (\hat{\alpha}[m, n], \hat{\delta}[m, n]) pairs. Enumerating the peaks in this histogram estimates the number of sources, the peak centers estimate the mixing parameters, and the set of time-frequency points which contribute to a given peak provides an estimate for the associated \tilde{\Lambda}_k.


Observation 1: The time-frequency points with instantaneous DUET estimates (\hat{\alpha}[m, n], \hat{\delta}[m, n]) inside a small rectangle centered on the true mixing parameter pair (\alpha_k, \delta_k) contain most of the source energy.

We wish to show that the time-frequency points which yield instantaneous DUET estimates that are in close proximity to the true mixing parameters contain most of the energy of the source. Let

I^{\alpha_k}_{\Delta}[m, n] := \begin{cases} 1 & \text{if } |\hat{\alpha}[m, n] - \alpha_k| < \Delta, \\ 0 & \text{otherwise,} \end{cases}   (35)

be the indicator function for time-frequency points with instantaneous DUET symmetric attenuation estimate within \Delta of \alpha_k, where \Delta is a resolution parameter. We are interested in

\mathrm{PSR}^{\alpha}_N(\Delta) := \frac{\sum_{m,n} I^{\alpha_k}_{\Delta}[m, n]\, |\hat{s}_k[m, n]|^2}{\sum_{m,n} |\hat{s}_k[m, n]|^2},   (36)

which will show the portion of the energy of s_k contained in time-frequency points with corresponding \hat{\alpha}[m, n] within \Delta of the true mixing parameter value \alpha_k. Figure 5 shows \mathrm{PSR}^{\alpha}_N(\Delta) averaged over 100 randomly selected speech signals taken from the TIMIT database. The curves represent the expected energy contained in time-frequency points with instantaneous DUET symmetric attenuation estimates close to the true symmetric attenuation. For example, with \Delta = 0.1 we expect more than 60% of the source energy to come from time-frequency points with corresponding \hat{\alpha}[m, n] located within 0.1 of the true value \alpha_k in mixtures of five sources. In this section, the model described by (47) in Appendix I and discussed in Appendix II was used to simulate mixtures of N = 2, 3, 5, 10 sources.

Similarly, for the delay, we define

I^{\delta_k}_{\Delta'}[m, n] := \begin{cases} 1 & \text{if } |\hat{\delta}[m, n] - \delta_k| < \Delta', \\ 0 & \text{otherwise,} \end{cases}   (37)

where \Delta' is a resolution parameter. Then we are interested in

\mathrm{PSR}^{\delta}_N(\Delta') := \frac{\sum_{m,n} I^{\delta_k}_{\Delta'}[m, n]\, |\hat{s}_k[m, n]|^2}{\sum_{m,n} |\hat{s}_k[m, n]|^2},   (38)

which will show the portion of energy of s_k with instantaneous DUET delay estimates within \Delta' of the true mixing parameter value \delta_k. Figure 5 also shows \mathrm{PSR}^{\delta}_N(\Delta') as a function of \Delta' for various mixture orders. For example, 70% of the energy of the source is expected to be contained in time-frequency points with corresponding \hat{\delta}[m, n] within 0.1 samples of the true value of \delta_k in pairwise mixing.


[Figure 5 plots: APSR (left) and DPSR (right) versus the resolution parameter, each with curves for N = 2, 3, 5, 10.]

Fig. 5. Energy distribution \mathrm{PSR}^{\alpha}_N(\Delta) (left) and \mathrm{PSR}^{\delta}_N(\Delta') (right) of instantaneous DUET estimates around the true mixing parameters. Note, for W-DO signals, the corresponding source energy portion would be 1.0 for all distances from \alpha_k (and \delta_k).

We now show that the source energy is localized simultaneously around (\alpha_k, \delta_k). To do so, we look at

\mathrm{PSR}^{\alpha,\delta}_N(\Delta, \Delta') := \frac{\sum_{m,n} I^{\alpha_k}_{\Delta}[m, n]\, I^{\delta_k}_{\Delta'}[m, n]\, |\hat{s}_k[m, n]|^2}{\sum_{m,n} |\hat{s}_k[m, n]|^2},   (39)

which measures the portion of source energy for time-frequency points with instantaneous DUET symmetric attenuation estimates within \Delta of \alpha_k and instantaneous DUET delay estimates within \Delta' of \delta_k. Before we examine \mathrm{PSR}^{\alpha,\delta}_N(\Delta, \Delta'), we need to determine the appropriate \Delta to \Delta' ratio. Plotting the (\Delta, \Delta') pairs for which \mathrm{PSR}^{\alpha}_N(\Delta) = \mathrm{PSR}^{\delta}_N(\Delta') for the same mixture order reveals that the (\Delta, \Delta') pairs lie essentially along a line. The least-mean-square fit of this line determines a ratio of \Delta' / \Delta = 1.655 samples. This means that, for example, \mathrm{PSR}^{\alpha}_N(\Delta) \approx 0.6 for \Delta = 0.1 for N = 5 implies that \mathrm{PSR}^{\delta}_N(\Delta') \approx 0.6 for \Delta' = 1.655 \times 0.1 = 0.1655 samples for N = 5, a property which can be verified from the data displayed in Figure 5. Figure 6 shows \mathrm{PSR}^{\alpha,\delta}_N(\Delta, \Delta') versus \Delta and \Delta' for \Delta' = 1.655\,\Delta. Note that the \Delta axis is at the bottom and the \Delta' axis is at the top. For example, almost 80% of the energy of the source is contained in time-frequency points with corresponding (\hat{\alpha}[m, n], \hat{\delta}[m, n]) falling in a rectangle with dimensions 0.2-by-0.33 centered on (\alpha_k, \delta_k), i.e., (\alpha_k - 0.1, \alpha_k + 0.1) \times (\delta_k - 0.165, \delta_k + 0.165), for mixtures of three sources. As the number of sources increases, the energy spreads over a wider area, but remains relatively well localized around the source's mixing parameters.
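The localization measure (39) is just an indicator-weighted energy ratio, as the following sketch (our own function name) makes explicit:

```python
import numpy as np

def psr_rect(alpha_hat, delta_hat, S_k, alpha_k, delta_k, da, dd):
    """Eq. (39): fraction of source k's energy at time-frequency points
    whose instantaneous estimates fall within da of alpha_k and within
    dd of delta_k (da, dd are the resolution parameters)."""
    inside = (np.abs(alpha_hat - alpha_k) < da) & (np.abs(delta_hat - delta_k) < dd)
    return np.sum(np.abs(S_k[inside]) ** 2) / np.sum(np.abs(S_k) ** 2)
```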

Observation 2: Observation 1 is true for the individual sources in mixtures.

This observation is based on the fact that, from the experiments with speech mixtures (see Figure 3), we know

that the time-frequency points at which one source dominates maintain a significant percentage of the dominating

source’s energy. For � � ���3� � � �3� and J 0 the percentage source energy preserved when only considering dominant

Page 20: 1 Blind Separation of Speech Mixtures via Time-Frequency ...oyilmaz/preprints/wdo_final.pdf · [5, 11] demix by assuming that the number of sources active in the transform domain

20

0 0.1 0.2 0.3 0.4 0.5 0.60

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

A

PS

R

0.00 0.17 0.33 0.50 0.66 0.83 0.99D

N=2 N=3 N=5 N=10

Fig. 6. Energy distribution PSR ��� ��� � � � � � � of instantaneous DUET estimates in a rectangle centered on the true mixing parameter pair

� � N N .time-frequency points is � � ) ��� � ) ��� J ) ���� ) � and

� � ) , respectively. Considering the time-frequency points when

one source dominates, Figure 6 shows that the instantaneous DUET estimates falling in a rectangle centered on the true parameters maintain a significant percentage of that source's energy. For example, for N = 2, 87% of the source's energy is contained in the time-frequency points with estimates falling in a rectangle of dimension 0.2-by-0.33 centered on the true mixing parameter pair. Thus, in pairwise mixing we would expect the time-frequency points which yield estimates (\hat{\alpha}[m, n], \hat{\delta}[m, n]) inside a 0.2-by-0.33 rectangle centered on (\alpha_1, \delta_1) to contain the product of the 87% (from Figure 6) and the corresponding dominant-point energy percentage (from Figure 3), roughly 83% of the energy of the first source. Similarly, we would expect 84% of the energy of the second source to come from time-frequency points which have instantaneous DUET estimate pairs within a 0.2-by-0.33 rectangle centered on (\alpha_2, \delta_2). As N increases, the source energy percentage we expect to see in a fixed-size rectangle centered on each source's mixing parameters decreases (it is 39% for N = 10); nevertheless, Observation 1 will still hold.

Observation 3: The peaks in a smoothed two-dimensional power weighted histogram of the instantaneous DUET

estimates will be in one-to-one correspondence with the rectangle centers in Observation 2.

One way to determine the mixing parameters for multiple sources is to look at the two-dimensional weighted histogram \sum_{m,n} I^{\alpha}_{\Delta}[m, n]\, I^{\delta}_{\Delta'}[m, n]\, |\hat{x}_1[m, n]\, \hat{x}_2[m, n]|^p (with the indicator functions now centered on (\alpha, \delta)) as a function of \alpha and \delta, for some fixed p. If (\Delta, \Delta') is chosen large enough to capture a large portion of the source energy, as determined by Figure 6, yet small enough so the (\Delta, \Delta') rectangle does not contain significant energy contributions from multiple sources, we would expect the local maxima to occur around the true mixing parameter pairs (\alpha_k, \delta_k), k = 1, \ldots, N. Therefore, one way of determining the mixing


parameters would be to calculate \sum_{m,n} I^{\alpha}_{\Delta}[m, n]\, I^{\delta}_{\Delta'}[m, n]\, |\hat{x}_1[m, n]\, \hat{x}_2[m, n]|^p for the range of interest of (\alpha, \delta) pairs and select the local maxima. A computationally efficient way of doing this is to construct a two-dimensional weighted histogram at a high resolution, and then smooth that histogram with a kernel of the dimensions of the desired (\Delta, \Delta') rectangle. We perform smoothing to group the time-frequency points which are likely to correspond to one source. Recall that the estimators (33) and (34) averaged the instantaneous estimates over all time-frequency points where the source of interest was dominant. We know from the results shown in Figure 6 that a rectangle centered on the true mixing parameters will capture most of the corresponding source's energy. By smoothing, we locate the rectangle centers that capture locally the largest energy contribution, and thus estimate the mixing parameters. Histograms have been used previously for parameter estimation of voice mixtures; for example, [21] clusters onset arrival differences to determine the time delays of the various sources.

Now we construct a two-dimensional weighted histogram of the (\hat{\alpha}[m, n], \hat{\delta}[m, n]) pairs, where \hat{\alpha}[m, n] and \hat{\delta}[m, n] are the instantaneous DUET estimates, with the weights |\hat{x}_1[m, n]\, \hat{x}_2[m, n]|^p for some p. The weighted histogram with resolution widths \Delta_{\alpha} and \Delta_{\delta} and weighting exponent p is defined as

H(\alpha, \delta) := \sum_{m,n} I^{\alpha}_{\Delta_{\alpha}/2}[m, n]\, I^{\delta}_{\Delta_{\delta}/2}[m, n]\, |\hat{x}_1[m, n]\, \hat{x}_2[m, n]|^p,   (40)

which we will smooth with a rectangular kernel K(\alpha, \delta),

K(\alpha, \delta) := \begin{cases} \frac{1}{w_{\alpha} w_{\delta}} & (\alpha, \delta) \in [-\frac{w_{\alpha}}{2}, \frac{w_{\alpha}}{2}] \times [-\frac{w_{\delta}}{2}, \frac{w_{\delta}}{2}], \\ 0 & \text{otherwise,} \end{cases}   (41)

where (w_{\alpha}, w_{\delta}) are the dimensions of the desired smoothing rectangle, to produce the smoothed histogram

H_s(\alpha, \delta) := (K \star H)(\alpha, \delta),   (42)

where \star denotes two-dimensional convolution.
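Steps (40)-(42) amount to a weighted two-dimensional histogram followed by convolution with a rectangular kernel. A minimal numpy sketch (our own illustration; the bin edges and the kernel size in bins are left to the caller):

```python
import numpy as np

def weighted_histogram(alpha_hat, delta_hat, X1, X2, p, a_edges, d_edges, k_shape):
    """Eqs. (40)-(42): power weighted 2-D histogram of the instantaneous
    estimates, then smoothing by convolution with a rectangular kernel.

    a_edges, d_edges : histogram bin edges (the high resolution grid)
    k_shape          : kernel size in bins (odd numbers), i.e. the desired
                       smoothing rectangle
    """
    w = np.abs(X1 * X2) ** p                         # weight of each T-F point
    H, _, _ = np.histogram2d(alpha_hat.ravel(), delta_hat.ravel(),
                             bins=(a_edges, d_edges), weights=w.ravel())
    K = np.ones(k_shape) / np.prod(k_shape)          # rectangular kernel
    pa, pd = k_shape[0] // 2, k_shape[1] // 2
    Hp = np.pad(H, ((pa, pa), (pd, pd)))             # zero-pad the borders
    Hs = np.zeros_like(H)
    for i in range(H.shape[0]):                      # direct 2-D convolution
        for j in range(H.shape[1]):
            Hs[i, j] = np.sum(Hp[i:i + k_shape[0], j:j + k_shape[1]] * K)
    return H, Hs
```

The local maxima of the returned `Hs` approximate the mixing parameter pairs.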

Figure 7 shows an example histogram before and after smoothing, generated using the dominant time-frequency points of a speech signal in a simulated mixture of five sources. The peak location matches well with the source's true mixing parameters (\alpha_k, \delta_k). The importance of the smoothing is clear in that it combines all the energy from the estimates in a local region and results in a clear single peak, and thus a mixing parameter estimate. In Appendix III, we compare the performance of histogram-based parameter estimators to the performance of the ML


estimators and discuss the choice of the weighting exponent p.

[Figure 7 plots: raw (left) and smoothed (right) histograms over the symmetric attenuation-delay plane.]

Fig. 7. Example raw (left) and smoothed (right) power weighted histograms for one speech signal in a mixture of five. The peak location of the smoothed histogram corresponds to the source's mixing parameters (\alpha_k, \delta_k).

C. Demixing Algorithm for Approximately W-DO Sources

Recall that in the W-DO case, sources were demixed using time-frequency masks that were constructed by group-

ing the time-frequency points that yield the same instantaneous parameter estimates. We demix in a similar way for

approximately W-DO sources. First we estimate the mixing parameters, for example, using the histogram method de-

scribed in the previous section. Then, we group time-frequency points that yield instantaneous parameter estimates that

are “close” to these estimated mixing parameters. One natural definition of closeness is the instantaneous likelihood

function for the kth source,

L_k[m, n] := p(\hat{x}_1[m, n], \hat{x}_2[m, n] \mid a_k, \delta_k) = \frac{1}{2\pi\sigma^2}\, \exp\!\left( -\frac{\left| a_k e^{-i \delta_k n \Delta\omega}\, \hat{x}_1[m, n] - \hat{x}_2[m, n] \right|^2}{2\sigma^2 (1 + a_k^2)} \right),   (43)

obtained by substituting the instantaneous ML source estimate (53) into the likelihood function in (48), modified to consider only the time-frequency point [m, n]. L_k[m, n] is, in a sense, the likelihood that the kth source is dominant at time-frequency point [m, n]. One way to demix the mixtures is to construct a time-frequency mask for s_k by taking those time-frequency points for which L_k[m, n] > L_j[m, n], \forall j \neq k. The time-frequency mask for demixing s_k is thus

\tilde{M}_k[m, n] := \begin{cases} 1 & k = \arg\max_j L_j[m, n], \\ 0 & \text{otherwise,} \end{cases}   (44)

and defining

\tilde{\Lambda}_k := \{[m, n] : k = \arg\max_j L_j[m, n]\},   (45)


the estimate of the set of time-frequency points for which the kth source is dominant, we can relate this demixing mask

to those that were used in the W-DO case. There are many other ways we can envision using these likelihoods, for

example, some type of relative weighting resulting in fractional masks instead of the binary winner-take-all masks

created by the scheme we have proposed. However, we have shown in Section II that the 0-dB binary masks exhibit

excellent demixing performance and maximize the WDO performance measure so we consider exclusively binary

time-frequency masks in this paper.

As before, we estimate the source by converting

\tilde{s}_k[m, n] := \tilde{M}_k[m, n]\, \hat{x}_1[m, n]   (46)

into the time domain. Note that we could apply the mask to \hat{x}_2 as well, and could combine the two demixtures using the ML estimate of the source as in (53). However, in order to compare with the results obtained in Section II, the experimental results presented in the next section will use (46).
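Since \sigma is common to all sources, maximizing L_k in (43) over k is equivalent to minimizing the residual |a_k e^{-i \delta_k \omega} \hat{x}_1 - \hat{x}_2|^2 / (1 + a_k^2), which gives a compact implementation of the ML partitioning (44)-(46). The sketch below is our own illustration (the names are ours):

```python
import numpy as np

def ml_masks(X1, X2, omega, params):
    """Eqs. (43)-(46): winner-take-all masks from the instantaneous
    likelihood.  Maximizing L_k is equivalent to minimizing the residual
    |a_k exp(-1j*omega*delta_k) X1 - X2|^2 / (1 + a_k^2), so sigma drops out.

    omega  : angular frequency of each time-frequency point (same shape as X1)
    params : list of (a_k, delta_k) pairs, e.g. from the histogram peaks
    """
    cost = np.stack([
        np.abs(a * np.exp(-1j * omega * d) * X1 - X2) ** 2 / (1.0 + a ** 2)
        for a, d in params
    ])
    winner = np.argmin(cost, axis=0)                 # dominant source index
    masks = [(winner == k).astype(float) for k in range(len(params))]
    demixed = [m * X1 for m in masks]                # eq. (46): mask one mixture
    return masks, demixed
```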

In summary, the DUET algorithm for demixing approximately W-DO sources is:

1) From mixtures x_1(t) and x_2(t), construct time-frequency representations \hat{x}_1[m, n] and \hat{x}_2[m, n].

2) For each time-frequency point, calculate (\hat{\alpha}[m, n], \hat{\delta}[m, n]) using (32) and (29).

3) Construct the histogram and locate its peaks:

   a) Construct a high resolution histogram as in (40).

   b) Smooth the histogram as in (42).

   c) Locate the peaks in the histogram. There will be N peaks, one for each source, with peak locations approximately equal to the true mixing parameter pairs \{(\alpha_k, \delta_k) : k = 1, \ldots, N\}.

4) For the N pairs of (\alpha_k, \delta_k) estimates, construct the time-frequency masks corresponding to each pair using the ML partitioning as in (44) and apply these masks to one of the mixtures as in (46) to yield estimates of the time-frequency representations of the original sources.

5) Convert each estimate back into the time domain.


IV. EXPERIMENTS

In order to demonstrate the technique, we present results in this section for both synthetic and real mixtures. One

issue that we have not addressed is how the histogram peaks are automatically enumerated and identified. For the

following demonstration, we used an ad-hoc technique that iteratively selected the highest peak and removed a region

surrounding the peak from the histogram. Peaks were removed as long as the histogram maintained a threshold

percentage of its original weight. The threshold percentage and region dimensions had to be occasionally altered in

the course of the tests to ensure the correct number of sources was found. Indeed, peak enumeration and identification

remains a topic of future research. In all examples, we used histograms with p = 1, as suggested in Appendix III.
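The ad-hoc peak enumeration described above can be sketched as follows; the half-widths of the removed region and the weight threshold are the tunable quantities mentioned in the text (the function itself is our own illustration):

```python
import numpy as np

def pick_peaks(Hs, region, weight_thresh):
    """Iteratively select the histogram maximum and zero out a surrounding
    region, while the remaining histogram keeps more than weight_thresh of
    its original weight.  `region` gives the removal half-widths, in bins."""
    H = Hs.copy()
    total = H.sum()
    peaks = []
    while H.sum() > weight_thresh * total:
        i, j = np.unravel_index(np.argmax(H), H.shape)
        peaks.append((int(i), int(j)))
        H[max(i - region[0], 0):i + region[0] + 1,
          max(j - region[1], 0):j + region[1] + 1] = 0.0
    return peaks
```

As the text notes, the threshold and region size may need adjustment to recover the correct number of sources.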

A. Synthetic mixtures

Figure 8 shows the smoothed histogram (42) for a six source synthetic mixing example, computed at high resolution and then smoothed with a kernel of the rectangle dimensions discussed in Section III. The six sources were taken from the TIMIT database and the stereo mixture was created using six distinct (symmetric attenuation, delay) mixing parameter pairs. It is clear that, given only the stereo mixture, one can determine how many sources were used to create the mixture by enumerating the peaks in the histogram. Using the ML partitioning, the first channel of the mixture was demixed and the SIR, PSR, and WDO were measured; the results are shown in Table III. For comparison, the optimal WDO, created using the 0 dB mask, is shown in the last column. The demixtures average over 13 dB SIR gain, and the WDO numbers indicate demixtures which would rate right on the border between "minor artifacts or interference" and "distorted but intelligible." Note that even though the blind method performs reasonably well, the performance of the 0 dB mask shows that there exist time-frequency masks which would further improve the performance. Figure 9 shows the original six sources, the two mixtures, and the six demixtures.

To show the limits of this technique, a ten source stereo mixture was synthetically mixed. The smoothed histogram

for the mixture is shown in Figure 8 and Table III contains the demixing performance. The SIR gains are still high, with the average gain above 12 dB; however, the WDO performance has dropped to "very distorted and barely intelligible."

However, as we are trying to demix ten sources from just two mixtures, these results are promising. More promising

Page 25: 1 Blind Separation of Speech Mixtures via Time-Frequency ...oyilmaz/preprints/wdo_final.pdf · [5, 11] demix by assuming that the number of sources active in the transform domain



Fig. 8. Six and Ten Source Synthetic Mixing Smoothed Histograms (p = 1). Each peak corresponds to one source and the peak location corresponds to the associated source's mixing parameters.

TABLE III

SIX AND TEN SOURCE DEMIXING PERFORMANCE. PERFORMANCE OF THE BLIND TECHNIQUE IS COMPARED AGAINST THE OPTIMAL

TIME-FREQUENCY MASK, THE 0 DB MASK.

source   SIR in (dB)   SIR out (dB)   SIR gain (dB)   PSR    WDO DUET   WDO 0dB

Six sources:
s1        -7.29          5.92          13.21          0.76     0.57       0.80
s2        -7.29          5.24          12.53          0.78     0.55       0.78
s3        -5.08          6.60          11.67          0.80     0.62       0.81
s4        -9.29          5.35          14.63          0.79     0.56       0.69
s5        -5.03          7.06          12.09          0.78     0.63       0.81
s6        -9.28          5.47          14.75          0.77     0.55       0.66

Ten sources:
s1        -9.74         -0.32           9.42          0.58    -0.04       0.70
s2        -7.73          3.14          10.87          0.66     0.34       0.77
s3       -11.64          3.43          15.06          0.68     0.37       0.64
s4        -9.72         -0.60           9.13          0.58    -0.09       0.67
s5        -7.73          3.93          11.66          0.66     0.39       0.73
s6       -11.61          3.14          14.75          0.56     0.29       0.51
s7        -7.75          2.57          10.31          0.56     0.25       0.74
s8       -11.62          1.36          12.98          0.61     0.16       0.62
s9        -9.72          4.70          14.42          0.60     0.39       0.67
s10       -9.74          3.33          13.07          0.60     0.32       0.64

Fig. 9. Six Sources, Stereo Mixture, and Six Demixtures.


indeed is the fact that the 0 dB mask's performance is significantly better, showing that there is room for improvement.

B. Anechoic and Echoic Mixing Results

We also tested DUET on speech mixtures recorded in an anechoic room. For the tests, each speech signal was

recorded separately at 16 kHz and then the signals were mixed additively to generate the mixtures for the tests. Knowl-

edge of the actual signals present in each mixture allows us to calculate the performance measures exactly. For the

recordings, the omnidirectional microphones were separated by 1.75 cm and the speech signals were played from

various positions on a 1.5 meter radius semicircle around the microphones with the microphone axis along the line

from the 0° position to the 180° position. Two female (F1 and F2) and one male (M1) TIMIT sound files were used

for the tests. Pairwise mixing results for female-female and male-female mixtures are shown in Table IV. Again, for

comparison purposes, the WDO obtained by the DUET algorithm is compared to the optimal WDO which is obtained

using the 0 dB mask. The separation obtained by DUET is nearly perfect: in all but the case with the smallest angular separation between sources, the DUET mask's performance is essentially the same as the performance of the optimal mask. The reason for the slight fall in performance in that case is that the source peak regions begin to overlap and, as a result, some time-frequency points are misassigned.

Higher order mixing results (i.e., more than two sources) are listed in Table V. In addition to the three, four, and five source

anechoic mixtures tested, a three source echoic mixture was tested. All of the speech signals, three female (F1, F2,

and F3) and two male (M1 and M2), were taken from the TIMIT database. The echoic recording was made in an

echoic office environment with an approximate reverberation time of 500 ms. As the number of sources increases, the

demixing performance decreases, although the performance is still acceptable in the five source mixture. As expected,

the performance drops off significantly when switching from the anechoic to the echoic environment as the method is

based on an anechoic mixing model. However, some separation is still achieved.

Figure 10 compares the one source histograms for anechoic and echoic recordings for sources at three different

angles. The histograms corresponding to the summation of the three sources are also shown. The anechoic histograms

are well localized and the peak regions are clearly distinct, even in the histogram corresponding to the summation of the

sources. The peak regions in the echoic histograms are spread out and overlap with one another. This overlap results


TABLE IV

PAIRWISE ANECHOIC DEMIXING PERFORMANCE.

test SIR in (dB) SIR out (dB) SIR gain (dB) PSR WDO DUET WDO 0dB

F1 and F2 mixtures:

F1   -0.58   12.69   13.26   0.92   0.87   0.96
F2    0.58   11.25   10.68   0.96   0.89   0.96

F1   -0.54   15.97   16.51   0.98   0.95   0.96
F2    0.54   17.21   16.68   0.98   0.96   0.96

F1   -0.62   15.29   15.91   0.97   0.94   0.94
F2    0.62   15.69   15.07   0.98   0.95   0.95

F1   -0.49   17.50   17.99   0.98   0.96   0.96
F2    0.49   17.36   16.87   0.98   0.97   0.97

F1   -0.50   15.79   16.29   0.97   0.94   0.94
F2    0.50   15.51   15.01   0.98   0.95   0.95

F1   -0.44   16.29   16.73   0.96   0.94   0.94
F2    0.44   14.49   14.05   0.98   0.94   0.95

F1 and M1 mixtures:

F1    3.54   13.99   10.46   0.96   0.92   0.97
M1   -3.54   10.35   13.88   0.91   0.83   0.94

F1    3.60   18.42   14.81   0.99   0.97   0.98
M1   -3.60   15.41   19.01   0.97   0.94   0.95

F1    3.63   18.92   15.29   0.99   0.98   0.98
M1   -3.63   15.91   19.54   0.97   0.95   0.95

F1    3.69   19.91   16.22   0.99   0.98   0.98
M1   -3.69   15.79   19.48   0.98   0.95   0.95

F1    3.75   19.57   15.82   0.99   0.98   0.98
M1   -3.75   16.37   20.12   0.97   0.95   0.95

F1    3.90   18.47   14.57   0.99   0.97   0.98
M1   -3.90   15.51   19.41   0.97   0.94   0.94

in reduced demixing performance. Note, however, that the 0 dB mask still performs well in the echoic case, so there

remains a gap between what we can separate blindly and what we can separate with knowledge of the instantaneous

time-frequency attenuations when using time-frequency masking to demix.
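The 0 dB mask used as the performance ceiling throughout these comparisons can be sketched as below. This is a hypothetical helper, not the paper's code; it assumes the 0 dB mask keeps exactly those time-frequency points where the target source's power is at least as large as that of the coherently summed interference.

```python
import numpy as np

def zero_db_mask(S_target, S_others):
    """Ideal binary '0 dB' time-frequency mask: 1 at points where the target
    source is at least as strong as the summed interfering sources.
    S_target: complex STFT of the target source.
    S_others: list of complex STFTs of the interfering sources."""
    interference = np.abs(sum(S_others)) ** 2  # power of the coherent sum
    return (np.abs(S_target) ** 2 >= interference).astype(float)
```

Computing this mask requires the original sources, so it is only available in controlled experiments; its WDO serves as the upper bound against which the blind DUET masks are measured.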

Figure 11 shows the histogram for the Te-Won Lee real office room recording consisting of two speakers [22]. The histogram shows a number of peaks: the peaks at one symmetric attenuation level are all associated with the Spanish speaker, and those at a second attenuation level correspond to the English speaker. Note that for this recording, it is the attenuation direction

in the histogram that allows for the separation and that a method that only relied on delays would not be able to separate

the sources. Demixtures generated from this recording using the DUET algorithm are compared to several other BSS

techniques in [23].


TABLE V

HIGHER ORDER DEMIXING PERFORMANCE. RESULTS FOR THREE SOURCE, FOUR SOURCE, AND FIVE SOURCE ANECHOIC MIXTURES,

AS WELL AS THREE SOURCE ECHOIC MIXING.

Anechoic

test SIR in (dB) SIR out (dB) SIR gain (dB) PSR WDO DUET WDO 0dB

Three sources:

M1   -2.72   13.67   16.39   0.92   0.88   0.90
F1   -2.05    7.96   10.00   0.96   0.80   0.93
M2   -4.37   13.32   17.70   0.88   0.84   0.87

Four sources:

M1   -6.93    9.89   16.83   0.78   0.70   0.80
F1   -3.19    7.11   10.30   0.92   0.74   0.91
M2   -4.37    6.98   11.35   0.85   0.68   0.89
F2   -5.05   10.08   15.12   0.86   0.78   0.90

Five sources:

F1   -9.77    7.97   17.74   0.73   0.62   0.76
M1   -4.30    7.16   11.46   0.83   0.67   0.86
F2   -3.77    5.99    9.76   0.91   0.68   0.91
M2   -5.60    7.05   12.65   0.80   0.65   0.85
F3   -8.59    8.53   17.11   0.76   0.65   0.82

Echoic

test SIR in (dB) SIR out (dB) SIR gain (dB) PSR WDO DUET WDO 0dB

M1   -5.20    5.38   10.58   0.56   0.40   0.81
M2    0.07    4.33    4.26   0.89   0.56   0.91
F1   -4.48    6.03   10.51   0.65   0.49   0.87

Fig. 10. Anechoic vs. Echoic Histogram Comparison. The left column images are of the histograms for three anechoic sources at three different angles, and their mixture. The histogram of the mixture is essentially the summation of the individual histograms and the peak regions in the histogram are clearly separated. The right column images are of the histograms for three echoic sources at the same angles, and their mixture. While the individual histograms show some level of localization (left, center, right), the peak regions in each histogram overlap and the peaks are difficult to identify in the summation image. Thus, the algorithm performs worse on echoic mixtures. In all images, the x-axis is delay ranging from -2 to 2 samples and the y-axis is symmetric attenuation ranging from -1 to 1. All histograms used p = 1.



Fig. 11. Histogram (p = 1) for Te-Won Lee’s “A real Cocktail Party Effect” Echoic Mixing Example.

V. CONCLUSIONS

In this paper, we presented a method to blindly separate mixtures of speech signals. We first illustrated experimen-

tally that binary time-frequency masks exist that can separate as many as 10 different speech signals from one mixture.

This relies upon a property of the Gabor expansions of speech signals, which we refer to as W-disjoint orthogonality.

W-disjoint orthogonality in the strict sense is satisfied by signals which have disjoint time-frequency supports. Speech

signals, as a result of the sparsity of their Gabor expansions, satisfy an approximate version of the W-disjoint orthog-

onality property. In Section II-A, we introduced a means of measuring the degree of W-disjoint orthogonality of a

signal in a given mixture with respect to a windowing function W. Listening experiments showed that there is a fairly accurate relationship between the WDO value of a particular signal in a mixture for a given time-frequency mask and the subjective performance of the time-frequency mask in separating the signal from the mixture.

Next, we addressed the problem of blindly constructing binary time-frequency masks that demix. The solution we

presented in this paper considered the two mixture case. For strictly W-disjoint orthogonal signals, we showed that

the instantaneous DUET attenuation and delay estimators are the anechoic mixing parameters and, using this fact,

described a simple algorithm to construct a binary time-frequency mask that demixes perfectly. Next we showed,

by modeling the contributions of the interfering sources as independent Gaussian white noise, that the ML estima-

tors for the mixing parameters are given by weighted averages of the instantaneous DUET estimates. Motivated by

this, we constructed a weighted histogram which was used to enumerate the sources and partition the time-frequency

representation of one of the mixtures to demix.

In Section IV we presented experimental results demonstrating that the algorithm works extremely well for synthetic


mixtures of speech as well as for speech mixtures recorded in an anechoic room and produces near perfect demixtures.

That is, the performance of the mask generated by the DUET algorithm was close to the performance of the ideal mask.

In an echoic room the anechoic model is violated and the quality of the demixing is reduced. In the echoic case, the

demixtures contain some crosstalk and distortion, but are intelligible. There is, in the echoic case, a performance gap

between the ideal binary time-frequency mask and the mask generated using the DUET algorithm. Closing this gap

is one goal of our future work. As we mentioned in Section IV, the enumeration and identification of the histogram

peaks are also topics for future research.

APPENDIX I

DERIVATION OF THE ML ESTIMATORS

Let us concentrate on one source, say s_j. Let Ω_j be the set of time-frequency points (τ, ω) at which s_j is dominant as defined in Section III-B. On Ω_j, we model the mixtures as follows:

\hat{x}_1(\tau,\omega) = \hat{s}_j(\tau,\omega) + n_1(\tau,\omega), \qquad \hat{x}_2(\tau,\omega) = a_j e^{-i\omega\delta_j}\,\hat{s}_j(\tau,\omega) + n_2(\tau,\omega), \quad (47)

where n_1 and n_2 are i.i.d. white complex Gaussian noise signals with mean zero and variance σ². Here n_1 and n_2 model the contributions of other sources at the time-frequency points where s_j is the dominant source. We model the

interfering sources as independent Gaussian noise in order to obtain simple closed-form source and mixing parameter

estimators. In reality, the interference in the different mixtures will be correlated and may not be Gaussian distributed.

However, the estimates we obtain here are used simply to motivate the weighted histograms used in our algorithm

discussed in Section III-B and the model is sufficient for that purpose.

For the model in (47), we want to find, via maximum likelihood, the parameter pair (a_j, δ_j) as well as the values ŝ_j(τ, ω). To that goal, we define the likelihood L_j of Θ_j = (a_j, δ_j, S_j), where S_j = {ŝ_j(τ, ω) : (τ, ω) ∈ Ω_j} with each ŝ_j(τ, ω) ∈ ℂ, given the data x̂_1(τ, ω) and x̂_2(τ, ω), by

L_j = \prod_{(\tau,\omega)\in\Omega_j} \frac{1}{(\pi\sigma^2)^2} \exp\!\Big(-\frac{1}{\sigma^2}\Big[\big|\hat{x}_1(\tau,\omega)-\hat{s}_j(\tau,\omega)\big|^2 + \big|\hat{x}_2(\tau,\omega)-a_j e^{-i\omega\delta_j}\hat{s}_j(\tau,\omega)\big|^2\Big]\Big) \quad (48)

where the product form holds because we assume n_1 and n_2 are i.i.d. complex Gaussian noise signals. Clearly, maximizing L_j is equivalent to minimizing

\rho = \sum_{(\tau,\omega)\in\Omega_j} \big|\hat{x}_1(\tau,\omega)-\hat{s}_j(\tau,\omega)\big|^2 + \big|\hat{x}_2(\tau,\omega)-a_j e^{-i\omega\delta_j}\hat{s}_j(\tau,\omega)\big|^2. \quad (49)

We want to solve the equations ∂ρ/∂ŝ_j^R(τ, ω) = 0 and ∂ρ/∂ŝ_j^I(τ, ω) = 0 for all (τ, ω) ∈ Ω_j, together with ∂ρ/∂δ_j = 0 and ∂ρ/∂a_j = 0, simultaneously, where ŝ_j^R(τ, ω) and ŝ_j^I(τ, ω) denote the real and imaginary parts of ŝ_j(τ, ω), respectively. We start with ∂ρ/∂ŝ_j^R. For any (τ, ω) ∈ Ω_j, we have

\frac{\partial\rho}{\partial \hat{s}^R_j(\tau,\omega)} = \frac{\partial}{\partial \hat{s}^R_j(\tau,\omega)}\Big[\big|\hat{x}_1(\tau,\omega) - \hat{s}^R_j(\tau,\omega) - i\,\hat{s}^I_j(\tau,\omega)\big|^2 + \big|\hat{x}_2(\tau,\omega) - a_j e^{-i\omega\delta_j}\big(\hat{s}^R_j(\tau,\omega) + i\,\hat{s}^I_j(\tau,\omega)\big)\big|^2\Big]. \quad (50)

We solve ∂ρ/∂ŝ_j^R(τ, ω) = 0 for ŝ_j^{R*}(τ, ω), the ML estimate of ŝ_j^R(τ, ω), and obtain

\hat{s}^{R*}_j(\tau,\omega) = \mathrm{Re}\left\{\frac{\hat{x}_1(\tau,\omega) + a_j e^{i\omega\delta_j}\,\hat{x}_2(\tau,\omega)}{1 + a_j^2}\right\}. \quad (51)

Similarly, solving ∂ρ/∂ŝ_j^I(τ, ω) = 0 for ŝ_j^{I*}(τ, ω), the ML estimate of ŝ_j^I(τ, ω), yields

\hat{s}^{I*}_j(\tau,\omega) = \mathrm{Im}\left\{\frac{\hat{x}_1(\tau,\omega) + a_j e^{i\omega\delta_j}\,\hat{x}_2(\tau,\omega)}{1 + a_j^2}\right\} \quad (52)

which we combine with (51) to get the ML estimate ŝ_j^*(τ, ω) of ŝ_j(τ, ω):

\hat{s}^{*}_j(\tau,\omega) = \frac{\hat{x}_1(\tau,\omega) + a_j e^{i\omega\delta_j}\,\hat{x}_2(\tau,\omega)}{1 + a_j^2}. \quad (53)


Next, we consider ∂ρ/∂δ_j. We have

\frac{\partial\rho}{\partial\delta_j} = \frac{\partial}{\partial\delta_j} \sum_{(\tau,\omega)\in\Omega_j} \big|\hat{x}_2(\tau,\omega) - a_j e^{-i\omega\delta_j}\hat{s}_j(\tau,\omega)\big|^2 = 2 a_j \sum_{(\tau,\omega)\in\Omega_j} \omega\, \mathrm{Im}\big\{\hat{x}_2(\tau,\omega)\, \overline{\hat{s}_j(\tau,\omega)}\, e^{i\omega\delta_j}\big\} \quad (54)

where the overline denotes the complex conjugate. We now plug ŝ_j(τ, ω) = ŝ_j^*(τ, ω) into (54), which yields

\frac{\partial\rho}{\partial\delta_j} = \frac{2 a_j}{1 + a_j^2} \sum_{(\tau,\omega)\in\Omega_j} \omega\, \mathrm{Im}\big\{\hat{x}_2(\tau,\omega)\, \overline{\hat{x}_1(\tau,\omega)}\, e^{i\omega\delta_j}\big\} = \frac{2 a_j}{1 + a_j^2} \sum_{(\tau,\omega)\in\Omega_j} \omega\, \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big| \sin\!\big(\omega(\delta_j - \tilde{\delta}(\tau,\omega))\big). \quad (55)

We assume that ω(δ_j − δ̃(τ, ω)) is small, which is reasonable because we are considering only the (τ, ω) where s_j is dominant, and we make the approximation

\sin\!\big(\omega(\delta_j - \tilde{\delta}(\tau,\omega))\big) \approx \omega\big(\delta_j - \tilde{\delta}(\tau,\omega)\big). \quad (56)

After plugging (56) into (55), we solve the equation ∂ρ/∂δ_j = 0 for δ_j^* and obtain (using the definition in (29))

\delta_j^* = \frac{\sum_{(\tau,\omega)\in\Omega_j} \tilde{\delta}(\tau,\omega)\, \omega^2 \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|}{\sum_{(\tau,\omega)\in\Omega_j} \omega^2 \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|}. \quad (57)

Note that δ_j^*, the ML estimate for the parameter δ_j, is a weighted average of the instantaneous DUET delay estimates, with each estimate weighted by the product magnitude of the mixtures as well as ω². We also observe that the ML estimate δ_j^* does not depend on the attenuation parameter a_j. Finally, we solve ∂ρ/∂a_j = 0 for a_j^*. We have

\frac{\partial\rho}{\partial a_j} = \frac{\partial}{\partial a_j} \sum_{(\tau,\omega)\in\Omega_j} \big|\hat{x}_2(\tau,\omega) - a_j e^{-i\omega\delta_j}\hat{s}_j(\tau,\omega)\big|^2 = \sum_{(\tau,\omega)\in\Omega_j} 2 a_j \big|\hat{s}_j(\tau,\omega)\big|^2 - 2\,\mathrm{Re}\big\{\hat{x}_2(\tau,\omega)\, \overline{\hat{s}_j(\tau,\omega)}\, e^{i\omega\delta_j}\big\}. \quad (58)

After setting ŝ_j(τ, ω) = ŝ_j^*(τ, ω) and some algebra, we get (using the definition in (30))

a_j^* - \frac{1}{a_j^*} = \frac{\sum_{(\tau,\omega)\in\Omega_j} \big|\hat{x}_2(\tau,\omega)\big|^2 - \big|\hat{x}_1(\tau,\omega)\big|^2}{\sum_{(\tau,\omega)\in\Omega_j} \mathrm{Re}\big\{\hat{x}_2(\tau,\omega)\, \overline{\hat{x}_1(\tau,\omega)}\, e^{i\omega\delta_j^*}\big\}}. \quad (59)


The estimate a_j^* − 1/a_j^* is symmetric in x̂_1 and x̂_2: swapping the mixture labels only results in a sign change of this quantity (i.e., 1/a_j − 1/(1/a_j) = −(a_j − 1/a_j)). In the original presentation of DUET [4], the logarithm of the attenuation estimates was used solely because it has the same property (i.e., log(1/a_j) = −log(a_j)). However, motivated by its appearance in the ML estimator (59), we will replace the role of the logarithm with the DUET symmetric attenuation estimator defined in (32).

Remark 2: Although the estimate given in (59) is not a weighted average, it is interesting to note that if we replace δ_j^* in (59) with δ̃(τ, ω), we obtain

\mathrm{Re}\big\{\hat{x}_2(\tau,\omega)\, \overline{\hat{x}_1(\tau,\omega)}\, e^{i\omega\tilde{\delta}(\tau,\omega)}\big\} = \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big| \quad (60)

and in this case, (59) becomes (33) with p = 1, a weighted average of α̃(τ, ω).

APPENDIX II

EXPERIMENTAL EVALUATION OF THE ML ESTIMATORS

In this section, we experimentally evaluate the ML estimators as well as other estimators motivated by the previous section. In order to simulate mixtures, we use the model in (47) and adjust the noise energy to model different numbers of interfering sources. The model in (47) is valid for the dominant time-frequency points of one source. In order to determine the set of dominant time-frequency points Ω_j, a speech signal taken from the TIMIT database was compared to a random mixture of 1, 2, 4, or 9 TIMIT speech signals to model N = 2, 3, 5, and 10, and in each case, the time-frequency points corresponding to the 0-dB mask were selected. The mixtures of interfering sources were only used to determine Ω_j and were discarded after the dominant time-frequency points were identified. In order to simulate the presence of interfering sources, i.i.d. Gaussian white noise was added to the dominant time-frequency points of source s_j on both channels. The added noise was amplified to produce a 15.12 dB, 12.26 dB, 9.87 dB, or 7.52 dB SNR so as to model mixing of order N = 2, 3, 5, or 10, respectively. These SNRs were selected to model different mixture orders because they match the average SIRs for the 0-dB mask from Figure 3. That is, 15.12 dB, 12.26 dB, 9.87 dB, and 7.52 dB are the expected SIRs after applying the 0-dB mask to mixtures of order N = 2, 3, 5, and 10, and thus in order to model these mixture orders for the dominant time-frequency points of one source, we add noise to one source to produce the corresponding SNRs. Note that the dominant time-frequency points are precisely the


support of the source’s 0 dB mask. Thus the performance of the estimator evaluated with this model at these SNRs should approximate the true performance of the estimator in speech mixtures of order N = 2, 3, 5, and 10. All results

in the remainder of this section are obtained using this model.

We choose to experimentally evaluate the estimators using the model as described above as opposed to creating

synthetic mixtures of multiple speech signals because we (1) wanted to prevent the results from depending on the

specific choice of mixing parameters of the interfering sources and (2) wanted to evaluate the estimators using the

model that motivated them. The disadvantage of modeling the presence of interfering sources in this way is that the

interference should be correlated and this correlation is lost when the interference is modeled as independent noise.

Our desire in this section is to explore the qualitative performance of a family of estimators to motivate the demixing

algorithm, and modeling the interference as noise is sufficient for this purpose.
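The noise-injection procedure described above can be sketched as follows. This is an illustrative helper, not the authors' code; it assumes the SNR is defined as mean signal power over mean noise power, and it draws circular complex Gaussian noise so that the target SNR is met in expectation.

```python
import numpy as np

def add_noise_at_snr(X, snr_db, rng=None):
    """Add i.i.d. circular complex Gaussian noise to the STFT points X so
    that the resulting SNR (mean signal power / mean noise power) is snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    sig_power = np.mean(np.abs(X) ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    # split the noise power equally between real and imaginary parts
    n = np.sqrt(noise_power / 2.0) * (rng.standard_normal(X.shape)
                                      + 1j * rng.standard_normal(X.shape))
    return X + n
```

Applying this to the dominant time-frequency points of one source on both channels, at the SNRs listed above, reproduces the interference model used in this appendix.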

Figure 12 shows the ML estimate δ_j^* from (57) versus δ_j as δ_j ranges linearly from -5 to 5 samples with a_j = 1. We

can see that the ML delay estimator exhibits bad performance outside the -1 to 1 sample range, and biased performance

inside this range. The bad performance for larger delays is due to the phase wrap around problem discussed in Re-

mark 1 in Section III-A. The squared frequency weighting factor in the ML delay estimator accentuates this problem.

In addition, such a frequency weighting would make signals with higher frequency content have higher likelihood

estimates of their delay parameters. In the next sections, we will be using these weightings to construct weighted his-

tograms for source separation and it is undesirable to assign more likelihood to one set of parameters simply because

their associated source contains higher frequencies. While methods for unwrapping the phase do exist, these methods

are inappropriate for our purposes as different sources may be active from one frequency to the next. In order to see

if we could reduce the bias, eliminate the wrap-around effect, and remove the high frequency weighting, we removed

the squared frequency weighting factor in the ML delay estimator and considered estimators of the following form

\delta_j^{(p)} = \frac{\sum_{(\tau,\omega)\in\Omega_j} \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|^p\, \tilde{\delta}(\tau,\omega)}{\sum_{(\tau,\omega)\in\Omega_j} \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|^p}. \quad (61)

The free parameter p in (61) determines how strongly the estimates obtained from time-frequency points (τ, ω) with large |x̂_1(τ, ω)x̂_2(τ, ω)| are weighted relative to those obtained from time-frequency points with small |x̂_1(τ, ω)x̂_2(τ, ω)|. Note that p = 0 corresponds to (61) being a plain average, and p = ∞ results in δ_j^{(p)} being the instantaneous DUET estimate from the


time-frequency point (τ, ω) with the largest coefficient magnitude. Moreover, with p = 1, this estimator is the ML delay estimator with the squared frequency weighting factor removed. The estimator in (61) for p = 1/2, 1, and 2 is compared


Fig. 12. Maximum Likelihood Delay Estimator. The plot compares estimated δ_j versus true δ_j for the ML estimator as δ_j ranges linearly from -5 to 5 samples with a_j = 1, for mixture models of 2, 3, 5, and 10 sources.

with the ML estimator in Figure 13. For this test, δ_j ranges linearly from -5 to 5 samples while α_j ranges linearly from 0.15 to -0.15, and the SNR and Ω_j were selected to model mixture orders of 2, 3, 5, and 10. The p = 1/2 estimator suffers deficiencies similar to the ML estimator. The p = 1 estimator is clearly biased outside the -1 to 1 delay range, but is monotonic with increasing δ_j and exhibits good (although biased) performance inside the -1 to 1 sample delay range. The p = 2 estimator exhibits near perfect estimates.

Figure 13 also shows α_j^* versus α_j for the same data used for the delay estimates. Similar to the delay case, in addition to the ML estimator, we consider estimators of the following form:

\alpha_j^{(p)} = \frac{\sum_{(\tau,\omega)\in\Omega_j} \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|^p\, \tilde{\alpha}(\tau,\omega)}{\sum_{(\tau,\omega)\in\Omega_j} \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|^p}. \quad (62)

Note that with p = 1, this estimator is the ML symmetric attenuation estimator with the substitution described in Remark 2 previously. The ML and p = 1/2 symmetric attenuation estimators are clearly biased. The p = 1 symmetric attenuation estimator is also biased, although less so. The p = 2 symmetric attenuation estimator exhibits near perfect performance.

APPENDIX III

HISTOGRAM-BASED PARAMETER ESTIMATION

In order to evaluate the usefulness of the histogram as a parameter estimator, a smoothed histogram was created

for each of the tests used to generate Figure 13 and the peak location of the histogram was used as the symmetric

attenuation and delay estimate. Figure 14 contains the results of these tests. Each estimator histogram consisted



Fig. 13. Delay (left) and Attenuation (right) Estimator Comparison. The graph on the left compares estimated δ_j versus true δ_j for the ML delay estimator and the weighted average DUET delay estimators with p = 1/2, 1, and 2 as δ_j ranges linearly from -5 to 5 samples while α_j ranges linearly from 0.15 to -0.15 for a mixture model of 5 sources. The graph on the right compares estimated α_j versus true α_j for the ML and weighted average DUET symmetric attenuation estimators for the same experimental data used to generate the delay graph.

of 401-by-401 points with a delay range from -6 to 6 samples and a symmetric attenuation range from -1.2 to 1.2, and each histogram was smoothed before the peak was extracted. Comparing Figure 13 and Figure 14, we conclude that the histogram based estimators are more accurate than the previously considered ML motivated estimators.
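The smoothed-histogram peak estimator can be sketched as follows. This is a hypothetical helper, not the authors' code: the separable triangular kernel is an illustrative stand-in for the paper's smoothing kernel, and the peak is taken as the single argmax bin of the smoothed histogram.

```python
import numpy as np

def histogram_peak(H, alpha_centers, delta_centers):
    """Smooth a 2-D weighted histogram H (alpha x delta) with a small
    separable triangular kernel and return the (alpha, delta) peak location."""
    k1 = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
    k = np.outer(k1, k1)
    k /= k.sum()
    # 'same'-size 2-D convolution via zero-padding and shifted accumulation
    pad = 2
    Hp = np.pad(H, pad)
    S = np.zeros_like(H, dtype=float)
    for di in range(5):
        for dj in range(5):
            S += k[di, dj] * Hp[di:di + H.shape[0], dj:dj + H.shape[1]]
    i, j = np.unravel_index(np.argmax(S), S.shape)
    return alpha_centers[i], delta_centers[j]
```

For multi-source demixing, the same smoothing would be followed by enumerating all local maxima rather than the single global one.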


Fig. 14. Histogram Delay (left) and Attenuation (right) Estimator Comparison. The graph on the left compares estimated δ_j versus true δ_j for the smoothed histogram peak estimators with p = 1/2, 1, and 2 for δ_j ranging linearly from -5 to 5 samples as α_j ranges linearly from 0.15 to -0.15. The graph on the right compares estimated α_j to the true α_j. Both graphs were generated using a model of 5 source mixing.

The similar estimator performance for different choices of p in Figure 14 suggests that the choice of p should be driven by other concerns. Identifying the peaks in the histogram is the crucial step in the separation process. Two important criteria for the selection of the weighting exponent p are (1) the shape around the peak (the “peak shape”) and (2) the relative peak heights. In order to aid in peak identification, we want the peak shape to be narrow and tall, and we want the peaks to be roughly of the same height. Figure 15 compares the histogram peak shapes for p = 1/2, 1, and 2


for both the α and δ axes by taking the summation along the other axis. That is, Figure 15 contains 1-D weighted histograms for both α and δ. As p increases, the peak shape becomes narrower and taller. This would suggest that we should select p as large as possible. However, the larger we choose p, the more the peak heights depend only on the largest instantaneous product power time-frequency components of each source. If these components have different magnitude distributions for different sources, the resulting peak heights can vary by several orders of magnitude, making identification of the smaller peaks impossible. While p = 2 results in the best peak shape, smaller choices of p may result in easier peak identification. The choice of p is thus data dependent; however, motivated once again by the form of the ML estimators, we suggest p = 1 as the default choice.


Fig. 15. Peak shape for p = 1/2, 1, and 2.

REFERENCES

[1] J.-K. Lin, D. G. Grier, and J. D. Cowan, “Feature extraction approach to blind source separation,” in IEEE Workshop on Neural Networks

for Signal Processing (NNSP), Amelia Island Plantation, Florida, September 24–26 1997, pp. 398–405.

[2] T.-W. Lee, M. Lewicki, M. Girolami, and T. Sejnowski, “Blind source separation of more sources than mixtures using overcomplete

representations,” IEEE Signal Proc. Letters, vol. 6, no. 4, pp. 87–90, April 1999.

[3] M. V. Hulle, “Clustering approach to square and non-square blind source separation,” in IEEE Workshop on Neural Networks for Signal

Processing (NNSP), Madison, Wisconsin, August 23–25 1999, pp. 315–323.

[4] A. Jourjine, S. Rickard, and O. Yılmaz, “Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures,” in IEEE

International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, Istanbul, Turkey, June 5–9 2000, pp. 2985–2988.

[5] P. Bofill and M. Zibulevsky, “Blind separation of more sources than mixtures using sparsity of their short-time Fourier transform,” in

International Workshop on Independent Component Analysis and Blind Signal Separation (ICA), Helsinki, Finland, June 19–22 2000, pp.

87–92.

[6] R. Balan and J. Rosca, “Statistical properties of STFT ratios for two channel systems and applications to blind source separation,” in


International Workshop on Independent Component Analysis and Blind Source Separation (ICA), Helsinki, Finland, June 19–22 2000, pp.

429–434.

[7] M. Aoki, M. Okamoto, S. Aoki, H. Matsui, T. Sakurai, and Y. Kaneda, “Sound source segregation based on estimating incident angle

of each frequency component of input signals acquired by multiple microphones,” Acoustical Science & Technology, vol. 22, no. 2, pp.

149–157, Feb. 2001.

[8] M. Zibulevsky and B. A. Pearlmutter, “Blind source separation by sparse decomposition in a signal dictionary,” Neural Computation,

vol. 13, no. 4, pp. 863–882, 2001.

[9] L.-T. Nguyen, A. Belouchrani, K. Abed-Meraim, and B. Boashash, “Separating more sources than sensors using time-frequency distribu-

tions,” in International Symposium on Signal Processing and its Applications (ISSPA), Kuala Lumpur, Malaysia, August 13–16 2001, pp.

583–586.

[10] S. Rickard, R. Balan, and J. Rosca, “Real-time time-frequency based blind source separation,” in International Workshop on Independent Component Analysis and Blind Source Separation (ICA), San Diego, CA, December 9–12 2001, pp. 651–656.

[11] L. Vielva, D. Erdogmus, C. Pantaleon, I. Santamaria, J. Pereda, and J. C. Principe, “Underdetermined blind source separation in a time-varying environment,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, Orlando, Florida, May 13–17 2002, pp. 3049–3052.

[12] P. Bofill, “Underdetermined blind separation of delayed sound sources in the frequency domain,” preprint, 2002.

[13] S. Rickard and O. Yılmaz, “On the approximate W-disjoint orthogonality of speech,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Orlando, Florida, May 13–17 2002, pp. 529–532.

[14] S. T. Roweis, “One microphone source separation,” in Neural Information Processing Systems 13 (NIPS), 2000, pp. 793–799.

[15] K. Suyama, K. Takahashi, and R. Hirabayashi, “A robust technique for sound source localization in consideration of room capacity,” in IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, October 21–24 2001, pp. 63–66.

[16] I. Daubechies, Ten Lectures on Wavelets. Philadelphia, PA: SIAM, 1992.

[17] R. Mersereau and T. Seay, “Multiple access frequency hopping patterns with low ambiguity,” IEEE Transactions on Aerospace and Electronic Systems, vol. AES-17, no. 4, July 1981.

[18] W. Kozek, “Time-frequency signal processing based on the Wigner-Weyl framework,” Signal Processing, vol. 29, pp. 77–92, 1992.

[19] B. Berdugo, J. Rosenhouse, and H. Azhari, “Speakers’ direction finding using estimated time delays in the frequency domain,” Signal Processing, vol. 82, pp. 19–30, 2002.

[20] R. Balan, J. Rosca, S. Rickard, and J. O’Ruanaidh, “The influence of windowing on time delay estimates,” in Conf. on Info. Sciences and Systems (CISS), vol. 1, Princeton, NJ, March 2000, pp. WP1–(15–17).

[21] J. Huang, N. Ohnishi, and N. Sugie, “A biomimetic system for localization and separation of multiple sound sources,” IEEE Transactions on Instrumentation and Measurement, vol. 44, no. 3, pp. 733–738, June 1995.

[22] [Online]. Available: http://www.cnl.salk.edu/~tewon/Blind/blind_audio.html


[23] [Online]. Available: http://www.alum.mit.edu/www/rickard/bss.html