Blind Separation of Speech Mixtures via
Time-Frequency Masking
Özgür Yılmaz and Scott Rickard
Abstract
Binary time-frequency masks are powerful tools for the separation of sources from a single mixture. Perfect
demixing via binary time-frequency masks is possible provided the time-frequency representations of the sources do
not overlap, a condition we call W-disjoint orthogonality. We introduce here the concept of approximate W-disjoint
orthogonality and present experimental results demonstrating the level of approximate W-disjoint orthogonality of
speech in mixtures of various orders. The results demonstrate that ideal binary time-frequency masks exist which
can separate several speech signals from one mixture. While determining these masks blindly from just one mix-
ture is an open problem, we show that we can approximate the ideal masks in the case where two anechoic mixtures
are provided. Motivated by the maximum likelihood mixing parameter estimators, we define a power weighted two-
dimensional histogram constructed from the ratio of the time-frequency representations of the mixtures which is shown
to have one peak for each source with peak location corresponding to the relative attenuation and delay mixing parameters. The histogram is used to create time-frequency masks which partition one of the mixtures into the original sources. Experimental results on speech mixtures verify the technique. Example demixing results can be found online:
http://www.alum.mit.edu/www/rickard/bss.html
I. INTRODUCTION
The goal in blind source separation (BSS) is to determine the original sources given mixtures of those sources.
When the number of sources is greater than the number of mixtures, the problem is degenerate in that traditional
Özgür Yılmaz ([email protected]) is with the Department of Mathematics, University of Maryland.
Scott Rickard ([email protected]) is with the Department of Electronic and Electrical Engineering, University College Dublin, Ireland.
This work was partially funded by Siemens Corporate Research, Princeton, NJ.
Submitted to IEEE Transactions on Signal Processing, November 4, 2002. Revised March 27, 2003. Second revision July 24, 2003.
matrix inversion demixing cannot be applied. However, when a representation of the sources exists such that the
sources have disjoint support in that representation, it is possible to partition the support of the mixtures and obtain the
original sources. One solution to the problem of degenerate demixing is thus to (1) determine an appropriate disjoint
representation of the sources and (2) determine the partitions in this representation which demix. In this paper, we
show that the Gabor expansion (i.e., the discrete short-time (or windowed) Fourier transform) is a good representation
for demixing speech mixtures. Specifically, we show that partitions of the time-frequency lattice exist that can demix
mixtures of several speech signals from one mixture. Determining the partition blindly from one mixture is an open
problem, but, given a second mixture, we describe a method for partitioning the time-frequency lattice which separates
the sources.
Formally, let $\mathcal{S}$ be the family of signals of interest. Typically $\mathcal{S}$ will be some collection of square integrable bandlimited functions. Suppose there exists some linear transformation $T : \mathcal{S} \to \tilde{\mathcal{S}}$ (where $T$ maps the set $\mathcal{S}$ to another family of functions) with the following properties:

(i) $T$ is invertible on $\mathcal{S}$ (i.e., $T^{-1}(T(s)) = s$ for all $s \in \mathcal{S}$).

(ii) $\Lambda_j \cap \Lambda_k = \emptyset$ for $j \neq k$, where $\Lambda_j$ is the support of $Ts_j$, i.e., $\Lambda_j := \mathrm{supp}(Ts_j) = \{x : (Ts_j)(x) \neq 0\}$.

For example, we can consider the case where $\mathcal{S}$ is a collection of square integrable functions with mutually disjoint supports in the Fourier domain; any two functions $s_1$ and $s_2$ in $\mathcal{S}$ satisfy $\hat{s}_1(\omega)\hat{s}_2(\omega) = 0$ for all $\omega$, where $\hat{s}_j$ denotes the Fourier transform of $s_j$. Then if we define $T$ on $\mathcal{S}$ as $Ts := \hat{s}$, it is clear that $T$ satisfies (i) and (ii).

For any $T$ with properties (i) and (ii), we can demix a mixture $x_1$ of signals in $\mathcal{S}$, $x_1(t) = \sum_{j=1}^{N} s_j(t)$, via

$s_j = T^{-1}(\mathbb{1}_{\Lambda_j} T x_1)$   (1)

where $\mathbb{1}_{\Lambda_j}$ is the indicator function of the set $\Lambda_j$, i.e.,

$\mathbb{1}_{\Lambda_j}(x) := \begin{cases} 1, & x \in \Lambda_j \\ 0, & \text{otherwise.} \end{cases}$   (2)

Going back to our example above, this corresponds to $s_j$ being equal to the inverse Fourier transform of $\mathbb{1}_{\Lambda_j}\hat{x}_1$, which is certainly true since the functions in $\mathcal{S}$ satisfy (ii).
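As a concrete sketch of this recipe (signals and threshold hypothetical), with $T$ taken to be the discrete Fourier transform, two frequency-disjoint sources can be demixed from one mixture exactly as in (1):

```python
import numpy as np

# Hypothetical sketch: T = discrete Fourier transform, two sources with
# disjoint Fourier supports, demixed from a single mixture via eq. (1).
n = 256
t = np.arange(n)
s1 = np.cos(2 * np.pi * 10 * t / n)   # energy only in bins 10 and n-10
s2 = np.sin(2 * np.pi * 40 * t / n)   # energy only in bins 40 and n-40
x1 = s1 + s2                          # the single mixture

X1 = np.fft.fft(x1)                   # T x1
# Indicator functions of Lambda_j = supp(T s_j), built here with oracle
# knowledge of the sources (finding them blindly is the hard problem).
ind1 = np.abs(np.fft.fft(s1)) > 1e-6
ind2 = np.abs(np.fft.fft(s2)) > 1e-6

# s_j = T^{-1}(1_{Lambda_j} T x1): mask in the transform domain, invert.
s1_hat = np.fft.ifft(ind1 * X1).real
s2_hat = np.fft.ifft(ind2 * X1).real
assert np.allclose(s1_hat, s1) and np.allclose(s2_hat, s2)
```

Because the supports are disjoint, masking the mixture's transform leaves each source's coefficients untouched, so the inversion is exact.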
Suppose now that we have another mixture $x_2(t) = \sum_{j=1}^{N} a_j s_j(t - \delta_j)$, which is the case in anechoic environments when we have two microphones. In the mixing, $a_j$ and $\delta_j$ are the relative attenuation and delay parameters respectively corresponding to the $j$th source. Assume

(iii) $\mathrm{supp}\, T s(\cdot - \delta) = \mathrm{supp}\, T s$ for any $s \in \mathcal{S}$, $|\delta| \leq \Delta$, and

(iv) there exist functions $\mathcal{A}$ and $\mathcal{D}$ such that $a_j = \mathcal{A}((Tx_1)(x), (Tx_2)(x))$ and $\delta_j = \mathcal{D}((Tx_1)(x), (Tx_2)(x))$ for $x \in \Lambda_j$, for $j = 1, \ldots, N$,

where $\Delta$ is the maximum possible delay between mixtures due to the distance separating the sensors. Using (iii) and (iv), we can label each $x \in \mathrm{supp}\, Tx_1$ with the pair $(\mathcal{A}((Tx_1)(x), (Tx_2)(x)), \mathcal{D}((Tx_1)(x), (Tx_2)(x)))$, and $\Lambda_j$ is exactly the set of all points with the label $(a_j, \delta_j)$. It follows that given the mixtures $x_1(t)$ and $x_2(t)$, we can demix via

$s_j = T^{-1}(\mathbb{1}_{\Lambda_j} T x_1).$   (3)

Clearly, (iii) will be satisfied for the example above since the Fourier transform of $s(\cdot - \delta)$ will be just a modulated version of the Fourier transform of $s$ and thus will have the same support as $\hat{s}$. As to the existence of functions $\mathcal{A}$ and $\mathcal{D}$, one can show that $\mathcal{A}(\hat{x}_1(\omega), \hat{x}_2(\omega)) := |\hat{x}_2(\omega)/\hat{x}_1(\omega)|$ and $\mathcal{D}(\hat{x}_1(\omega), \hat{x}_2(\omega)) := -\frac{1}{\omega}\angle(\hat{x}_2(\omega)/\hat{x}_1(\omega))$, where $\angle c$ denotes the phase of the complex number $c$ taken between $-\pi$ and $\pi$, satisfy (iv).
The general algorithm explained above mainly depends on two major points: (a) the existence of an invertible transformation $T$ that takes the signals to a domain in which they have disjoint representations (properties (i), (ii), and (iii)), and (b) finding functions $\mathcal{A}$ and $\mathcal{D}$ that provide the means of labeling on the transform domain (property (iv)). Note that in the description above we required $\mathcal{A}$ and $\mathcal{D}$ to yield the exact mixing parameters. Although this is desirable, since the mixing parameters provide perfect labels and can also be used for various other purposes (e.g., direction-of-arrival determination), it is not necessary for the demixing algorithm to work: any function that provides a unique labeling on the transform domain is sufficient. Moreover, requirement (ii), that the transformation $T$ is "disjoint," is very strong. In practice, one is usually more interested in transforms that satisfy (ii) in some approximate sense. Transforms that result in sparse representations of the signals of interest, representations where a small percentage of the signal coefficients capture a large percentage of the signal energy, can lead to (ii) being approximately satisfied.
There are many examples in the literature that use this type of approach with various choices of $T$ for various mixing models and demixing methods [1–12]. The mixing model in [1–3, 5, 8, 9, 11] is "instantaneous" (sources have different amplifications in different mixtures) while [4, 6, 7, 10, 12] use an anechoic mixing model (sources have different amplifications and time delays in different mixtures). [1–3, 11] consider the time domain sampling operator as $T$. The general assumption in these is that at any given time at most one source is non-zero. [4–7, 9, 10, 12] use the short-time Fourier transform (STFT) operator as $T$. Condition (ii) is satisfied in this case, at least approximately, because of the sparsity of the time-frequency representations of speech signals. Empirical support for this can be found in [7, 13], and a more extensive discussion is given in Section II-A. [8] chooses $T$ depending on the signal class of interest in such a way that it yields a sparse representation. In principle, [1–12] all use some clustering algorithm for estimating the mixing parameters, although there are several different approaches to demixing. [1, 3, 4, 6, 7, 9–11] use a labeling scheme based on the estimated mixing parameters and thus demix in the above described way by creating binary masks in the transform domain corresponding to each source. That is, given the mixtures $x_1$ and $x_2$, demixing is done by grouping the clusters of points in $(\mathcal{A}(Tx_1, Tx_2), \mathcal{D}(Tx_1, Tx_2))$ space, although different techniques are used to
detect these clusters. For example, [4, 6, 7, 9, 10] demix essentially by constructing binary time-frequency masks that
partition the time-frequency plane such that each partition corresponds to the time-frequency points that “belong” to
a particular source. The fact that such a mask exists has been observed also in [14] in the context of BSS of speech
signals from one mixture, and in [15] in the context of source localization. In [2, 8, 11, 12], the demixing is done by
making additional assumptions on the statistical properties of the sources and using a maximum a posteriori (MAP)
estimator. [5, 11] demix by assuming that the number of sources active in the transform domain at any given point is
equal to the number of mixtures. They then demix by inverting the now non-degenerate square mixing matrices and
appropriately combining the outputs. The above comparison is summarized in Table I.
TABLE I
A COMPARISON OF DEGENERATE DEMIXING METHODS USING DISJOINT REPRESENTATIONS.

mixing model:  instantaneous [1–3, 5, 8, 9, 11]; anechoic [4, 6, 7, 10, 12]
$T$ operator:  sampling [1–3, 11]; STFT [4–7, 9, 10, 12]; signal dependent [8]
demixing:      masking [1, 3, 4, 6, 7, 9–11]; MAP [2, 8, 11, 12]; matrix masking [5, 11]
In this paper, for the linear transform $T$, we use the short-time Fourier transform (STFT) and Gabor expansions (the
discrete version of the STFT) of speech signals. We present extensive empirical evidence that speech signals indeed
satisfy (ii) in an approximate sense when T is the STFT with an appropriate window function. Based on this, we
extend the Degenerate Unmixing Estimation Technique (DUET), originally presented in [4] for sources with disjointly
supported STFTs, to anechoic mixtures of speech signals. The algorithm we propose relies on estimating the mixing
parameters via maximum likelihood motivated estimators and constructing binary time-frequency masks using these
estimates. Thus the method presented here: (1) uses an anechoic mixing model, (2) uses the STFT as T, and (3)
performs demixing via masking.
In Section II we introduce a way of measuring the degree of "approximate" W-disjoint orthogonality, $\mathrm{WDO}_M$, of a signal in a given mixture for a given mask $M$. We construct a family of time-frequency masks, $\Phi^x_j$, that correspond to the indicator functions of the time-frequency points in which one source dominates the others by $x$ dB. We test the demixing performance of these masks experimentally and illustrate that $\mathrm{WDO}_{\Phi^x_j}$ is indeed a good measure of the demixing performance of the masks $\Phi^x_j$. The results show that binary time-frequency masks exist that are capable
of demixing several speech signals from just a single mixture. At present, there is no known robust technique for
determining these masks blindly from one mixture. However, in Section III we derive a technique that, given a second anechoic mixture, can approximate these demixing masks blindly. We first derive the maximum likelihood estimators
for the delay and attenuation coefficients. We then compare the performance of these with other estimators motivated
by the maximum likelihood estimators. The modified delay and attenuation estimators are weighted averages of the
instantaneous time-frequency delay and attenuation estimates. We combine the delay and attenuation estimators and
show that a weighted two-dimensional histogram can be used to enumerate the sources, determine the mixing parame-
ters, and demix the sources. The number of peaks in the histogram is the number of sources, the peak locations reveal
the mixing parameters, and the mixing parameters can be used to partition the time-frequency representation of one
of the mixtures to obtain estimates of the original sources. In Section IV, we verify the method by presenting demixing results for speech signals mixed synthetically as well as in both anechoic and echoic rooms.
II. W-DISJOINT ORTHOGONALITY
In this section, we focus on showing that binary time-frequency masks exist which are capable of separating multiple
speech signals from one mixture. Our goal is, given a mixture

$x_1(t) := \sum_{j=1}^{N} s_j(t)$   (4)

of sources $s_j(t)$, $j = 1, \ldots, N$, to recover the original sources. In order to accomplish this, we exploit the fact that the sources are pairwise approximately W-disjoint orthogonal. In this section we will define a quantitative measure of W-disjoint orthogonality and relate this measure to demixing performance.
We call two functions $s_j$ and $s_k$ W-disjoint orthogonal (W-DO) if, for a given window function $W(t)$, the supports of the short-time Fourier transforms (STFTs) of $s_j$ and $s_k$ are disjoint [4]. The STFT of $s_j$ is defined

$F^W[s_j](\tau, \omega) := \frac{1}{\sqrt{2\pi}} \int W(t - \tau)\, s_j(t)\, e^{-i\omega t}\, dt$   (5)

which we will refer to as $\hat{s}_j(\tau, \omega)$. For a detailed discussion of the properties of this transform consult [16]. The W-disjoint orthogonality assumption can be stated concisely:

$\hat{s}_j(\tau, \omega)\, \hat{s}_k(\tau, \omega) = 0, \quad \forall \tau, \omega,\ \forall j \neq k.$   (6)
The two limiting cases for $W$, namely $W \equiv 1$ and $W(t) = \delta(t)$, result in interesting sets of W-DO signals. In the $W \equiv 1$ case, the $\tau$ argument in (6) is irrelevant because the windowed Fourier transform is simply the Fourier transform. In this case, the condition is satisfied by signals which are frequency disjoint, such as frequency division multiplexed signals. In the other extreme, when $W(t) = \delta(t)$, signals which are time disjoint, such as time-division multiplexed signals, satisfy the condition. For window functions which are well localized in time and frequency, the W-disjoint orthogonality condition leads to signals such as those used in frequency-hopped multiple access systems [17].
Indeed, the method presented here could be applied to time domain multiplexed, frequency domain multiplexed, or
frequency-hopped multiple access signals; however, in this paper we exclusively consider speech signals.
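The two limiting cases can be checked directly; a toy numerical sketch (signals hypothetical) in which a time-disjoint pair and a frequency-disjoint pair each satisfy condition (6):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
t = np.arange(n)

# W(t) = delta(t): the transform reduces to sampling, and time-disjoint
# (TDM-style) signals satisfy (6) pointwise in time.
s1 = np.where(t < n // 2, rng.standard_normal(n), 0.0)
s2 = np.where(t >= n // 2, rng.standard_normal(n), 0.0)
assert np.all(s1 * s2 == 0)

# W = 1: the transform reduces to the Fourier transform, and frequency-
# disjoint (FDM-style) signals satisfy (6) in frequency (up to round-off).
f1 = np.fft.fft(np.cos(2 * np.pi * 5 * t / n))
f2 = np.fft.fft(np.cos(2 * np.pi * 30 * t / n))
assert np.max(np.abs(f1 * f2)) < 1e-6
```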
Unfortunately, (6) will not be satisfied for simultaneous speech signals because the time-frequency representation of
active speech is rarely zero. However, speech is sparse in that a small percentage of the time-frequency coefficients in
the Gabor expansion of speech capture a large percentage of the overall energy. In other words, the magnitude of the
Gabor coefficients of speech is often small. For different speech signals, it is unlikely that the large Gabor coefficients will coincide, which leads to the signals being W-disjoint orthogonal in an approximate sense. The goal of this
section is to show that speech signals satisfy a weakened version of (6) and are thus approximately W-DO. The higher
the degree of approximate W-disjoint orthogonality, the better separation results are possible. Figure 1 illustrates that
speech signals have sparse time-frequency representations and satisfy a weakened version of (6), in that the product of
their time-frequency representations is almost always small. A condition similar to (6) is also considered in [18], the
only difference being that the time-frequency transform used was the Wigner distribution. Signals satisfying (6) for
the Wigner distribution were called “time-frequency disjoint.”
The approximate W-disjoint orthogonality of speech has been described as the “sparsity” and “disjointness” of the
short-time Fourier transform of the sources [5], “when one source has large energy the other does not” and “harmonic
components” which “hardly overlap” [7], “when a datapoint is large the most likely decomposition is to assume that
it belongs to a single source” [12], “spectra [that] are non-overlapping” [14], and “useful” time-frequency points con-
taining a “contribution of one speaker...significantly higher than the energy of the other speaker” [19]. A quantitative
measure of approximate W-disjoint orthogonality is discussed later in this section.
[Figure 1: three grayscale time-frequency images (time vs. frequency axes).]

Fig. 1. A picture of W-disjoint orthogonality. The three figures are grayscale images of $|\hat{s}_1(\tau,\omega)|$ (top), $|\hat{s}_2(\tau,\omega)|$ (middle), and $|\hat{s}_1(\tau,\omega)\hat{s}_2(\tau,\omega)|$ (bottom) for two speech signals $s_1(t)$ and $s_2(t)$, sampled at 16 kHz and normalized to have unit energy. A Hamming window of length 64 ms was used as $W(t)$ and all signals had a length of three seconds. $|\hat{s}_1(\tau,\omega)\hat{s}_2(\tau,\omega)|$ contains fewer large components than $|\hat{s}_1(\tau,\omega)|$ or $|\hat{s}_2(\tau,\omega)|$. Further analysis of these signals reveals that the time-frequency points that contain the bulk of the energy of $s_1$ contain only a few percent of the energy of $s_2$, and vice versa. Thus, these speech signals approximately satisfy the W-disjoint orthogonality condition.
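The energy-overlap analysis described in the caption of Figure 1 is easy to replicate; the sketch below uses synthetic sparse time-frequency power grids as stand-ins for real speech STFTs (the sparsity model and all parameters are hypothetical, not TIMIT data):

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (129, 200)                      # hypothetical TF grid: bins x frames

def sparse_power(rng, shape, p=0.05):
    """Mostly-small coefficients plus a few large ones: a crude stand-in
    for the power of a speech Gabor expansion (not real speech data)."""
    P = 0.001 * rng.random(shape)
    spikes = rng.random(shape) < p
    P[spikes] += rng.exponential(1.0, spikes.sum())
    return P

P1 = sparse_power(rng, shape)
P2 = sparse_power(rng, shape)

def top_energy_mask(P, frac=0.9):
    """Smallest set of TF points holding `frac` of the energy of P."""
    flat = P.ravel()
    order = np.argsort(flat)[::-1]
    k = np.searchsorted(np.cumsum(flat[order]), frac * flat.sum()) + 1
    mask = np.zeros(flat.size, bool)
    mask[order[:k]] = True
    return mask.reshape(P.shape)

m1 = top_energy_mask(P1)
leak = P2[m1].sum() / P2.sum()          # share of P2 energy in P1's points
assert P1[m1].sum() / P1.sum() >= 0.9
assert leak < 0.2                       # independent supports rarely coincide
```

Because the large coefficients of the two grids are drawn independently, the points carrying 90% of one signal's energy carry only a small fraction of the other's, which is exactly the approximate W-disjoint orthogonality claimed for speech.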
We can rewrite the model from (4) in the time-frequency domain:

$\hat{x}_1(\tau,\omega) = \hat{s}_1(\tau,\omega) + \cdots + \hat{s}_N(\tau,\omega).$   (7)

Assuming the sources are pairwise W-DO, at most one of the sources will be non-zero for a given $(\tau,\omega)$, and thus

$\hat{x}_1(\tau,\omega) = \hat{s}_{J(\tau,\omega)}(\tau,\omega)$   (8)

where $J(\tau,\omega)$ is the index of the source active at $(\tau,\omega)$. To demix, one creates the time-frequency mask corresponding to each source and applies each mask to the mixture to produce the original source time-frequency representations. For example, defining

$M_j(\tau,\omega) := \begin{cases} 1, & \hat{s}_j(\tau,\omega) \neq 0 \\ 0, & \text{otherwise,} \end{cases}$   (9)

the indicator function for the support of $\hat{s}_j$, one obtains the time-frequency representation of $s_j$ from the mixture via

$\hat{s}_j(\tau,\omega) = M_j(\tau,\omega)\, \hat{x}_1(\tau,\omega), \quad \forall \tau, \omega.$   (10)
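Equations (7)–(10) amount to pointwise masking of the mixture's time-frequency representation; a minimal sketch with disjointly supported arrays standing in for the $\hat{s}_j$ (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (64, 100)                     # hypothetical TF grid

# Disjointly supported stand-ins for s1_hat and s2_hat.
support1 = rng.random(shape) < 0.3
support2 = ~support1 & (rng.random(shape) < 0.3)
S1 = np.where(support1, rng.standard_normal(shape), 0.0)
S2 = np.where(support2, rng.standard_normal(shape), 0.0)

X1 = S1 + S2                          # eq. (7): the mixture in the TF domain

M1 = (S1 != 0).astype(float)          # eq. (9): indicator of supp(s1_hat)
M2 = (S2 != 0).astype(float)

# Eq. (10): masking the mixture recovers each source's TF representation.
assert np.allclose(M1 * X1, S1)
assert np.allclose(M2 * X1, S2)
```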
A. Measuring the W-Disjoint Orthogonality of Speech
Clearly, the W-disjoint orthogonality assumption is not strictly satisfied for our signals of interest. We introduce
here a measure of approximate W-disjoint orthogonality based on the demixing performance of time-frequency masks
created using knowledge of the instantaneous source and interference time-frequency powers. In order to measure
W-disjoint orthogonality for a given mask, we combine two important performance criteria: (1) how well the mask
preserves the source of interest, and (2) how well the mask suppresses the interfering sources. These two criteria, the
preserved-signal ratio (PSR) and the signal-to-interference ratio (SIR), are introduced below.
First, given a time-frequency mask $M$ such that $0 \leq M(\tau,\omega) \leq 1$ for all $(\tau,\omega)$, we define $\mathrm{PSR}_M$, the PSR of the mask $M$, as

$\mathrm{PSR}_M := \frac{\| M(\tau,\omega)\, \hat{s}_j(\tau,\omega) \|^2}{\| \hat{s}_j(\tau,\omega) \|^2}$   (11)

which is the portion of energy of the $j$th source remaining after demixing using the mask. Note that $\mathrm{PSR}_M \leq 1$ with $\mathrm{PSR}_M = 1$ only if $\mathrm{supp}\, \hat{s}_j \subseteq \mathrm{supp}\, M$. Here $\| y(\tau,\omega) \|^2 := \iint |y(\tau,\omega)|^2 \, d\tau \, d\omega$. Now, we define

$y_j(t) := \sum_{k \neq j} s_k(t)$   (12)

so that $y_j(t)$ is the summation of the sources interfering with the $j$th source. Then, we define the signal-to-interference ratio of time-frequency mask $M(\tau,\omega)$,

$\mathrm{SIR}_M := \frac{\| M(\tau,\omega)\, \hat{s}_j(\tau,\omega) \|^2}{\| M(\tau,\omega)\, \hat{y}_j(\tau,\omega) \|^2}$   (13)

which is the output signal-to-interference ratio after using the mask to demix.
We now combine $\mathrm{PSR}_M$ and $\mathrm{SIR}_M$ into one measure of approximate W-disjoint orthogonality. We propose the normalized difference between the signal energy maintained in masking and the interference energy maintained in masking as a measure of the W-disjoint orthogonality associated with a particular mask:

$\mathrm{WDO}_M := \frac{\| M \hat{s}_j \|^2 - \| M \hat{y}_j \|^2}{\| \hat{s}_j \|^2}$   (14)

$\phantom{\mathrm{WDO}_M :} = \mathrm{PSR}_M - \mathrm{PSR}_M / \mathrm{SIR}_M.$   (15)

For signals which are W-DO, using the mask $M_j(\tau,\omega)$ defined in (9), we note that $\mathrm{PSR}_{M_j} = 1$, $\mathrm{SIR}_{M_j} = \infty$, and $\mathrm{WDO}_{M_j} = 1$. This is the maximum obtainable WDO value because $\mathrm{WDO}_M \leq 1$ for all $M$ such that $0 \leq M(\tau,\omega) \leq 1$. Moreover, for any $M$, $\mathrm{WDO}_M = 1$ implies that $\mathrm{PSR}_M = 1$, $\mathrm{SIR}_M = \infty$, and that (6) is satisfied. That is, $\mathrm{WDO}_M = 1$ implies that the signals are W-DO and that mask $M$ perfectly separates the $j$th source from the mixture. In order for a mask to have $\mathrm{WDO}_M \approx 1$, i.e., good demixing performance, it must simultaneously preserve the energy of the signal of interest while suppressing the energy of the interference. The failure of a mask to accomplish either of these goals can result in a small, even negative, WDO value. For example, $\mathrm{WDO}_M = 0$ implies either that $\mathrm{PSR}_M = 0$ (the mask kills all the energy of the source of interest) or that $\mathrm{SIR}_M = 1$ (the mask results in equal energy for source and interference). Masks with $\mathrm{SIR}_M < 1$ have associated $\mathrm{WDO}_M < 0$.

Now we establish that binary time-frequency masks exist which are capable of demixing speech signals from one mixture and detail their performance in relation to the three presented measures. Consider the following family of time-frequency masks:

$\Phi^x_j(\tau,\omega) := \begin{cases} 1, & 20 \log_{10} \left( |\hat{s}_j(\tau,\omega)| / |\hat{y}_j(\tau,\omega)| \right) > x \\ 0, & \text{otherwise} \end{cases}$   (16)

which is the indicator function for the time-frequency points where $s_j$ dominates the interference in the mixture by $x$ dB. We will use $\mathrm{PSR}_j(x)$ and $\mathrm{SIR}_j(x)$ as shorthand for $\mathrm{PSR}_{\Phi^x_j}$ and $\mathrm{SIR}_{\Phi^x_j}$, respectively.
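The $x$-dB mask of (16) and the measures (11), (13), and (15) can be sketched as follows (random Gaussian grids stand in for source STFTs; names and parameters hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.standard_normal((4, 64, 100))        # stand-ins for s_j_hat, N = 4 sources
j = 0
Sj, Yj = S[j], S[1:].sum(axis=0)             # source of interest; eq. (12) interference

def phi_mask(Sj, Yj, x_db):
    """Eq. (16): 1 where s_j dominates the interference by more than x dB."""
    return (20 * np.log10(np.abs(Sj) / np.abs(Yj)) > x_db).astype(float)

def psr(M, Sj):
    """Eq. (11): portion of source energy the mask preserves."""
    return np.sum(np.abs(M * Sj) ** 2) / np.sum(np.abs(Sj) ** 2)

def sir(M, Sj, Yj):
    """Eq. (13): output signal-to-interference ratio of the mask."""
    return np.sum(np.abs(M * Sj) ** 2) / np.sum(np.abs(M * Yj) ** 2)

M0 = phi_mask(Sj, Yj, 0.0)                   # the 0 dB mask
wdo = psr(M0, Sj) - psr(M0, Sj) / sir(M0, Sj, Yj)   # eq. (15)
assert 0.0 < wdo <= 1.0                      # bounded above by the perfect W-DO case
```

Every point kept by the 0 dB mask has more source than interference energy, so the mask's output SIR exceeds one and its WDO is strictly positive.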
To determine the demixing ability of the above mask type, the masks for various $x$ were applied to speech mixtures of various orders and the demixing performance measures, $\mathrm{PSR}_j(x)$ and $\mathrm{SIR}_j(x)$, were determined. We refer to a mixture of $N$ sources as a mixture of order $N$, and the mixtures used in these tests had orders ranging from $N = 2$ to $N = 10$. The demixed speech was then rated by the authors as falling into one of five subjective categories. The speech signals were selected from 16 male and 16 female continuous speech segments of three seconds taken from the TIMIT database and normalized to unit energy. The time-frequency representation of the 16 kHz sampled data was created using a Hamming window of 1024 samples with 50% overlap. The results of the 333 listening tests are displayed in Figure 2. We note that there is a fairly accurate relationship between the WDO performance measure and the subjective ratings listed in the table under the figure.
[Figure 2 plot: subjective ratings (1–5) plotted at their (SIR (dB), PSR (dB)) coordinates, SIR axis 5–40 dB, PSR axis −10–0 dB, with contours of constant WDO from 0.2 to 1.0.]
rating  meaning
1       perfect
2       minor artifacts or interference
3       distorted but intelligible
4       very distorted and barely intelligible
5       not intelligible

Fig. 2. Results of subjective listening test performed by the authors. For example, $0.6 \leq \mathrm{WDO} \leq 0.8$ implies a "minor artifacts or interference" rating or better.
Now that we have some idea how PSR, SIR, and WDO map to demixing performance, we analyze the demixing performance of the masks described in (16). Figure 3 shows plots of $\mathrm{PSR}_j(x)$ versus $\mathrm{SIR}_j(x)$ and a table of $(\mathrm{PSR}_j(x), \mathrm{SIR}_j(x))$ pairs averaged for groups of speech mixtures of different orders. For $N = 2$, each source was compared against each of the remaining 31 sources, resulting in $32 \times 31 = 992$ tests being averaged for each data point. For larger $N$, each source was compared against a random mixing of $N - 1$ of the remaining 31 sources. This was done 31 times per source in order to keep the number of tests per data point constant at 992. As we tested mixtures from $N = 2$ to $N = 10$, a total of $9 \times 992 = 8928$ mixtures were created to generate the data for Figure 3. Figure 3 demonstrates that time-frequency masks exist which exhibit excellent demixing performance; for example, considering the 0 dB mask $\Phi^0_j$, we see that on average this mask produces demixtures with WDO measure greater than 0.6 for mixtures of up to ten sources.
[Figure 3 plot: PSR (dB) versus SIR (dB) curves for mask thresholds $x = 0, 5, 10, 15, 20, 25, 30$ dB and mixture orders $N = 2, 3, 4, 5, 10$, with contours of constant WDO from 0.2 to 1.0; accompanying table of $(\mathrm{PSR}_j(x), \mathrm{SIR}_j(x))$ values.]
Fig. 3. Time-Frequency Mask Demixing Performance. Plot contains $\mathrm{PSR}_j(x)$ (in dB) versus $\mathrm{SIR}_j(x)$ (in dB) for $x = 0, \ldots, 30$ for $N = 2, 3, \ldots, 10$. Table contains $(\mathrm{PSR}_j(x), \mathrm{SIR}_j(x))$ (in dB) for $N = 2, 3, 4, 5, 10$ for $x = 0, 5, \ldots, 30$ dB. The different gray regions correspond to different regions of approximate W-disjoint orthogonality as determined by the lines of constant WDO. For example, using the $x = 5$ dB mask in mixtures of four sources yields 14.32 dB output SIR while maintaining 83% of the desired source energy. This $(\mathrm{PSR}, \mathrm{SIR}) = (0.83, 14.32\ \mathrm{dB})$ pair results in $\mathrm{WDO} \approx 0.8$, which from Figure 2 implies near-perfect demixing performance. In other words, if we can correctly map time-frequency points with 5 dB or more single source dominance to the correct corresponding output partition, we can recover 83% of the energy of each of the original sources and produce demixtures with 14.32 dB output SIR from a mixture of four sources.
Now that we know that good time-frequency masks exist, we wish to determine the dependence of these performance measures on the window function $W(t)$ and window size. For this task, we examine the performance of the 0 dB mask, $\Phi^0_j$. Figure 4 shows PSR, SIR, and WDO for pairwise mixing for various window sizes and types. Each data point in the figure represents the average of the results for 992 mixtures. In all measures, the Hamming window of size 1024 samples performed the best. Note, however, that the performance of the other window types (with the exception of the rectangular window) was extremely similar and exhibited better than 90% W-disjoint orthogonality for pairwise mixing across a wide range of window sizes (from roughly 500 to 4000 samples). Other mixture orders and masks (i.e., $\Phi^x_j$ for $x > 0$) exhibited similar performance, and in all cases the Hamming window of size 1024 had the best performance. A similar conclusion regarding the optimal time-frequency resolution of a window for speech separation was arrived at in [7].
Note that even when the window size is 1 (i.e., $T$ is sampling), the mixtures still exhibit a high level of PSR, SIR,
and WDO. This fact was exploited by those methods that used the time-disjoint nature of speech [1–3, 11]. However,
Figure 4 clearly shows the advantage of moving from the time domain to the time-frequency domain: the speech
signals are more disjoint in the time-frequency domain provided the window size is sufficiently large. Choosing the
window size too large, however, results in reduced W-disjoint orthogonality.
[Figure 4: three panels plotting PSR, SIR (dB), and WDO against log2(window size) from 0 to 14.]

Fig. 4. Window size and type comparison: Hamming, Blackman, Hann, Triangle, and Rectangle windows. PSR, SIR, and WDO for the 0 dB mask for window sizes $2, 4, \ldots, 16384$ samples for various window types for pairwise mixing of speech signals sampled at 16 kHz. The Hamming window of size 1024 has the best performance.
We close this section by proposing $\mathrm{WDO}_M$ with $M = \Phi^0_j$ as the general measure of W-disjoint orthogonality. Table II shows $\mathrm{WDO}_{\Phi^0_j}$ values for mixtures of various orders. Again, each data point represents the average measurement over 992 mixtures. It can be shown using (14) that the 0 dB mask, $\Phi^0_j$, maximizes WDO, and thus the 0 dB mask line represents the upper bound of WDO for any mask. We thus say that, for example, speech signals in pairwise mixtures are more than 90% W-disjoint orthogonal.
TABLE II
WDO FOR THE 0 DB MASK FOR MIXTURES OF VARIOUS ORDERS.
[Table II body: % WDO of the 0 dB mask for $N = 2, 3, \ldots, 10$.]
III. PARAMETER ESTIMATION AND DEMIXING
In this section, we will present a demixing algorithm that separates an arbitrary number of sources using two mixtures. We start by describing our anechoic mixing model. Suppose we have $N$ sources $s_1(t), \ldots, s_N(t)$. Let $x_1(t)$ and $x_2(t)$ be the mixtures such that

$x_k(t) = \sum_{j=1}^{N} a_{kj}\, s_j(t - \delta_{kj}), \quad k = 1, 2$   (17)

where parameters $a_{kj}$ and $\delta_{kj}$ are the attenuation coefficients and the time delays associated with the path from the $j$th source to the $k$th receiver. Without loss of generality we set $a_{1j} = 1$ and $\delta_{1j} = 0$ for $j = 1, \ldots, N$; for simplicity we rename $a_{2j}$ as $a_j$ and $\delta_{2j}$ as $\delta_j$. In addition we assume that the windowed Fourier transform of any source function, $F^W[s_j](\tau,\omega)$, satisfies the narrowband assumption for array processing, i.e.,

$F^W[s_j(\cdot - \delta_j)](\tau,\omega) = e^{-i\omega\delta_j}\, F^W[s_j](\tau - \delta_j, \omega) \approx e^{-i\omega\delta_j}\, F^W[s_j](\tau,\omega).$   (18)

This assumption is realistic as long as the window function $W$ is chosen appropriately. A detailed discussion of this assumption can be found in [20].
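A quick numerical check of the approximation in (18), using one frame of a discrete windowed Fourier transform with a Hamming window and a small integer delay (all parameters hypothetical; the delay is applied circularly for simplicity):

```python
import numpy as np

rng = np.random.default_rng(3)
L = 1024
W = np.hamming(L)                        # smooth analysis window
s = rng.standard_normal(4096)
delta, tau = 2, 1500                     # small delay (samples), frame position

def wft_frame(sig, tau):
    """One frame of a discrete windowed Fourier transform at position tau."""
    return np.fft.rfft(W * sig[tau : tau + L])

omega = 2 * np.pi * np.arange(L // 2 + 1) / L
lhs = wft_frame(np.roll(s, delta), tau)          # F^W[s(. - delta)](tau, .)
rhs = np.exp(-1j * omega * delta) * wft_frame(s, tau)

rel_err = np.linalg.norm(lhs - rhs) / np.linalg.norm(rhs)
assert rel_err < 0.05                    # approximation (18) holds closely
```

The residual comes only from the window sliding by `delta` samples, so a smooth, tapered window and a delay much shorter than the window keep the error small.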
Now we go back to discussing the mixing model described in (17). We take the STFT of $x_1$ and $x_2$ with an appropriate choice of $W$. Using the assumption discussed above, the mixing model (17) reduces to

$\begin{bmatrix} \hat{x}_1(\tau,\omega) \\ \hat{x}_2(\tau,\omega) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\tau,\omega) \\ \vdots \\ \hat{s}_N(\tau,\omega) \end{bmatrix}.$   (19)
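A sketch of the anechoic model (17) and its frequency-domain form (19), using integer-sample circular delays so that the relation is exact (sources and mixing parameters hypothetical; a plain FFT stands in for one STFT frame):

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 3, 8000
s = rng.standard_normal((N, n))          # hypothetical sources
a = np.array([1.0, 0.7, 1.3])            # relative attenuations a_j
d = np.array([0, 2, -3])                 # relative delays delta_j (samples)

# Eq. (17) with the normalization a_{1j} = 1, delta_{1j} = 0 from the text;
# np.roll applies the delays circularly so the frequency relation is exact.
x1 = s.sum(axis=0)
x2 = sum(a[j] * np.roll(s[j], d[j]) for j in range(N))

# Frequency-domain form of eq. (19): per bin, x2_hat is a weighted,
# phase-shifted sum of the source transforms.
w = 2 * np.pi * np.arange(n // 2 + 1) / n
S = np.fft.rfft(s, axis=1)
X2_model = (a[:, None] * np.exp(-1j * w * d[:, None]) * S).sum(axis=0)
assert np.allclose(np.fft.rfft(x2), X2_model)
```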
A. Parameter Estimation and Demixing for W-DO sources
To motivate the Degenerate Unmixing Estimation Technique (DUET), which we will describe in the next section,
we first consider the case where the sources are W-DO, i.e.,
$\hat{s}_j(\tau,\omega)\, \hat{s}_k(\tau,\omega) = 0, \quad \forall (\tau,\omega),\ \forall j \neq k.$   (20)
This condition is the idealization of the properties of speech signals discussed in Section II. We now construct the
parameter estimators and the demixing algorithm for W-DO signals. Clearly, when the sources are W-DO, at most one source will be active at any time-frequency point $(\tau,\omega)$; in particular, for any $(\tau,\omega)$ at which $\hat{x}_1(\tau,\omega) \neq 0$, there exists a $j$ such that $\hat{s}_j(\tau,\omega) \neq 0$ and $\hat{s}_k(\tau,\omega) = 0$ for $k \neq j$. Recalling the definition of the time-frequency mask $M_j$ in (9), we note that $M_j(\tau,\omega)\, M_k(\tau,\omega) = 0$ for all $(\tau,\omega)$ if $j \neq k$. From (19) we deduce that

$\hat{s}_j = M_j\, \hat{x}_1.$   (21)

This shows that we can demix an arbitrary number of sources from only one of the mixtures if we can construct the corresponding mask $M_j$ for each source. Next we will describe how to construct the masks $M_j$ using the mixtures $x_1$ and $x_2$.
Let $j$ be arbitrary, and define $\Lambda_j := \{(\tau,\omega) : M_j(\tau,\omega) = 1\}$ so that $M_j = \mathbb{1}_{\Lambda_j}$. Note that the $\Lambda_j$ are pairwise disjoint. Now consider

$R_{21}(\tau,\omega) := \frac{\hat{x}_2(\tau,\omega)}{\hat{x}_1(\tau,\omega)}.$   (22)

Clearly, on $\Lambda_j$,

$R_{21}(\tau,\omega) = a_j\, e^{-i\omega\delta_j}, \quad \forall (\tau,\omega) \in \Lambda_j.$   (23)

In this case $|R_{21}(\tau,\omega)| = a_j$ and $-\frac{1}{\omega}\angle R_{21}(\tau,\omega) = \delta_j$, where $\angle c$ denotes the phase of the complex number $c$ taken between $-\pi$ and $\pi$.

The observation above yields a way of constructing the sets $\Lambda_j$ and thus a demixing algorithm: we simply label each time-frequency point $(\tau,\omega)$ with the pair $(|R_{21}(\tau,\omega)|, -\frac{1}{\omega}\angle R_{21}(\tau,\omega))$. Since the sources are W-DO, there will be $N$ distinct labels. By grouping the time-frequency points $(\tau,\omega)$ with the same label, we construct the sets $\Lambda_j$ and thus the masks $M_j = \mathbb{1}_{\Lambda_j}$.

The above described demixing algorithm is the motivation behind DUET. Note that the algorithm separates the sources without inverting the mixing matrix, which makes it possible to deal with mixtures of an arbitrary number of sources. Aside from demixing, it also yields the mixing parameters: the labels $(|R_{21}(\tau,\omega)|, -\frac{1}{\omega}\angle R_{21}(\tau,\omega))$ which we used to construct the masks are exactly the mixing parameters $a_j$ and $\delta_j$. Motivated by this fact we define the instantaneous DUET attenuation and delay parameter estimators as

$\tilde{a}(\tau,\omega) := |R_{21}(\tau,\omega)|,$   (24)

$\tilde{\delta}(\tau,\omega) := -\frac{1}{\omega}\angle R_{21}(\tau,\omega)$   (25)
15
respectively. We will use these estimators in the next section.
In summary, the DUET algorithm for demixing W-DO sources is:

1) From mixtures x₁(t) and x₂(t), construct time-frequency representations x̂₁(τ,ω) and x̂₂(τ,ω).
2) For each non-zero time-frequency point, calculate (â(τ,ω), δ̂(τ,ω)).
3) Take the union of the (â(τ,ω), δ̂(τ,ω)) pairs, P := ∪_{(τ,ω)} {(â(τ,ω), δ̂(τ,ω))}. Note that P will be equal to {(aⱼ, δⱼ) : j = 1, …, N}.
4) For each (aⱼ, δⱼ) in P, j = 1, …, N, set ŝⱼ(τ,ω) = x̂₁(τ,ω) for those (τ,ω) with x̂₁(τ,ω) ≠ 0 and (â(τ,ω), δ̂(τ,ω)) = (aⱼ, δⱼ), and set ŝⱼ(τ,ω) = 0 otherwise. Clearly, ŝⱼ(τ,ω) will be the time-frequency representation of one of the original sources. The numbering of the sources is arbitrary.
5) Convert each ŝⱼ(τ,ω) back into the time domain.
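The W-DO demixing steps above can be sketched in a few lines. The sketch below builds a toy pair of mixtures directly in the time-frequency domain (skipping the STFT and its inversion, steps 1 and 5) from two sources with disjoint supports; the grid size, frequencies, and mixing parameters are made-up illustrative values, not anything from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 4x6 time-frequency grid with frequencies omega > 0, and two
# sources with disjoint TF supports (exactly W-DO).
omega = np.linspace(0.5, 2.0, 4).reshape(-1, 1)          # frequency of each row
S1 = np.zeros((4, 6), complex)
S2 = np.zeros((4, 6), complex)
S1[:2, :] = rng.standard_normal((2, 6)) + 1j * rng.standard_normal((2, 6))
S2[2:, :] = rng.standard_normal((2, 6)) + 1j * rng.standard_normal((2, 6))

a = [0.8, 1.3]                                           # attenuations a_j
d = [0.4, -0.7]                                          # delays delta_j
X1 = S1 + S2                                             # first mixture, eq. (19)
X2 = a[0] * np.exp(-1j * omega * d[0]) * S1 + a[1] * np.exp(-1j * omega * d[1]) * S2

# Steps 2-3: label every active point with (|R21|, -angle(R21)/omega).
R21 = np.where(X1 != 0, X2 / np.where(X1 != 0, X1, 1), 0)
a_hat = np.abs(R21)
d_hat = -np.angle(R21) / omega                           # valid while |omega*delta| < pi

# Step 4: group points whose label matches a mixing pair; mask the first mixture.
est = []
for aj, dj in zip(a, d):
    mask = np.isclose(a_hat, aj) & np.isclose(d_hat, dj) & (X1 != 0)
    est.append(np.where(mask, X1, 0))

assert np.allclose(est[0], S1) and np.allclose(est[1], S2)
```

Because the sources are exactly W-DO, masking x̂₁ by the label groups recovers each source's time-frequency representation exactly.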
Remark 1: Note that the instantaneous DUET delay estimator yields a meaningful estimate at a time-frequency point (τ₀, ω₀) ∈ Ωⱼ only if

|ω₀ δⱼ| < π.  (26)

This follows from the periodicity of the complex exponential: for (τ,ω) ∈ Ωⱼ, we have ∠R₂₁(τ,ω) = −ωδⱼ + 2πm, with m ∈ ℤ chosen so that the phase lies in (−π, π]. When (26) is not satisfied, the delay estimate obtained using the instantaneous DUET estimator will be a fraction of its true value. Let ω_max be the maximum element of {|ω| : ŝⱼ(τ,ω) ≠ 0 for some j}, which is the maximum frequency present in the sources, and denote by ω_s the sampling rate. Let δ_max := max_j |δⱼ|. Clearly, (26) is guaranteed for all j and for all |ω| ≤ ω_max if

ω_max δ_max < π.  (27)

Now define δ'_max := π/ω_max. Any delay parameter with modulus less than δ'_max can be estimated correctly, and (27) is equivalent to the condition δ_max < δ'_max. If ω_max = ω_s/2, the Nyquist frequency, then this means that the maximum delay, δ'_max = 2π/ω_s, is exactly equal to the sampling period. In other words, as long as the delay between the two microphone readings is less than a sample, the estimated phase will be accurate. While ω_max is determined by the characteristics of speech signals, the maximum physically possible delay, which we will denote by δ_p,max, is determined by the microphone spacing. For two microphones separated by a distance Δ, δ_p,max = Δ/c, where c is the speed of sound. Clearly, we have δ_max ≤ δ_p,max, and therefore (27) will be satisfied if δ_p,max < δ'_max. This suggests that one can guarantee (27) simply by choosing Δ, and thus δ_p,max, sufficiently small. For example, for a sampling rate ω_s/(2π) = 16 kHz, assuming ω_max = ω_s/2 and c = 344 m/s, we obtain δ_p,max < δ'_max as long as Δ < 2.15 cm. If we knew, however, that ω_max = ω_s/8 = 2 kHz, then this distance could be increased by a factor of 4, to 8.60 cm. The smaller the largest frequency present in the signal, the larger the allowable microphone separation (or, equivalently, the larger we can choose δ'_max) that guarantees accurate phase parameter estimates. The demixing technique presented in this paper does not require knowledge of the value of Δ, but we assume that Δ is small enough that (27) is satisfied.
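The spacing bound in (27) can be checked with a few lines of arithmetic. The sketch below uses the values quoted in the text (16 kHz sampling rate, c = 344 m/s); the function name is ours, not the paper's.

```python
import math

# delta'_max = pi / omega_max and the physical delay is Delta / c, so the
# no-wrapping condition Delta/c < pi/omega_max becomes Delta < c / (2 * f_max)
# when omega_max = 2 * pi * f_max.
fs = 16000.0                 # sampling rate omega_s / (2*pi), in Hz
c = 344.0                    # speed of sound, m/s

def max_spacing_cm(f_max_hz):
    # Largest microphone separation (cm) guaranteeing (27) for band limit f_max.
    return 100.0 * c / (2.0 * f_max_hz)

print(round(max_spacing_cm(fs / 2), 2))   # full Nyquist band -> 2.15 cm
print(round(max_spacing_cm(fs / 8), 2))   # sources limited to 2 kHz -> 8.6 cm
```

The two printed values reproduce the 2.15 cm and 8.60 cm figures discussed above.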
B. Parameter Estimation and Demixing for Approximately W-DO sources
In Section II, we illustrated that the time-frequency representations of speech signals are nearly disjoint and demon-
strated that we can indeed recover a speech signal from one mixture of an arbitrary number of sources if we can
construct an appropriate time-frequency mask. This suggests that a weakened W-DO condition holds for speech sig-
nals: if at a time-frequency point one of the sources has considerable power, the contribution of all the other sources
at that time-frequency point is likely to be small. This observation is the key to the demixing algorithm we propose in
this section. First we shall discuss how to estimate the mixing parameters.
From this point on, instead of the continuous STFT, we use the equivalent discrete counterpart¹

ŝⱼ[k,l] := ŝⱼ(kΔτ, lΔω),  (28)

where Δτ and Δω are the time-frequency lattice spacing parameters. We define the instantaneous DUET delay estimate, the discrete version of (25),

δ̂[k,l] := −(1/(lΔω)) ∠R₂₁[k,l],  (29)

where R₂₁[k,l] := x̂₂[k,l]/x̂₁[k,l]. For convenience, we define δ̂[k,l] = 0 if x̂₁[k,l] = 0 or x̂₂[k,l] = 0. Similarly, we define the instantaneous DUET attenuation estimate

â[k,l] := |R₂₁[k,l]|,  (30)

which is the discrete version of (24). For convenience, we define â[k,l] = 1 if x̂₁[k,l] = 0 or x̂₂[k,l] = 0. For reasons discussed in Appendix I, we choose to estimate the symmetric attenuation

αⱼ := aⱼ − 1/aⱼ  (31)

instead of directly estimating aⱼ, and define the instantaneous DUET symmetric attenuation estimate

α̂[k,l] := â[k,l] − 1/â[k,l].  (32)

¹The equivalence is nontrivial and only true for appropriately chosen window functions W with sufficiently small Δτ and Δω. An illustrative discussion can be found in [16].
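A minimal sketch of the instantaneous estimates (29), (30), and (32), assuming the two mixture STFTs are already available as arrays. The lattice spacing and test values are illustrative; the conventions at zero points (â = 1, δ̂ = 0) follow the text.

```python
import numpy as np

def duet_instantaneous(X1, X2, d_omega):
    # X1, X2: discrete STFTs x1-hat[k,l], x2-hat[k,l] with frequency along rows.
    l = np.arange(1, X1.shape[0] + 1).reshape(-1, 1)     # frequency index l > 0
    active = (X1 != 0) & (X2 != 0)
    R21 = np.where(active, X2 / np.where(X1 != 0, X1, 1), 0)
    a_hat = np.where(active, np.abs(R21), 1.0)           # a-hat := 1 when undefined
    d_hat = np.where(active, -np.angle(R21) / (l * d_omega), 0.0)  # delta-hat := 0
    alpha_hat = a_hat - 1.0 / a_hat                      # symmetric attenuation (32)
    return alpha_hat, d_hat

# A single point mixed with a = 2, delta = 0.3 recovers alpha = 2 - 1/2 = 1.5:
X1 = np.array([[1.0 + 0.5j]])
X2 = 2.0 * np.exp(-1j * 1 * 1.0 * 0.3) * X1              # l = 1, d_omega = 1.0
alpha_hat, d_hat = duet_instantaneous(X1, X2, 1.0)
print(np.round(alpha_hat[0, 0], 3), np.round(d_hat[0, 0], 3))   # 1.5 0.3
```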
We will say that sⱼ is dominant at [k,l] if |ŝⱼ[k,l]| > |ŷⱼ[k,l]|, where ŷⱼ is as in (12). Note that the 0 dB mask, Mⱼ^0dB, in (16) is the indicator function for the dominant time-frequency points of the jth source. In Appendix I, we derive that under certain assumptions the maximum likelihood (ML) estimates for the mixing parameters aⱼ and δⱼ can be determined via certain weighted averages (see (57) and (59)) of the instantaneous DUET delay (δ̂[k,l]) and instantaneous DUET attenuation (â[k,l]) estimates at the time-frequency points at which sⱼ is dominant. In Appendix II we compare the performance of the weighted estimators suggested by the ML derivation with other empirically motivated weighted estimators, and we illustrate that more accurate estimates (α̃ⱼ, δ̃ⱼ) of the true parameters (αⱼ, δⱼ) are determined by
α̃ⱼ = Σ_{[k,l]∈Λⱼ} |x̂₁[k,l] x̂₂[k,l]|^p α̂[k,l] / Σ_{[k,l]∈Λⱼ} |x̂₁[k,l] x̂₂[k,l]|^p,  (33)

δ̃ⱼ = Σ_{[k,l]∈Λⱼ} |x̂₁[k,l] x̂₂[k,l]|^p δ̂[k,l] / Σ_{[k,l]∈Λⱼ} |x̂₁[k,l] x̂₂[k,l]|^p,  (34)

for an appropriate choice of the weighting exponent p. To employ these estimates, however, we need to first construct the sets Λⱼ := {[k,l] : |ŝⱼ[k,l]| > |ŷⱼ[k,l]|} for each j. Note that these sets would be the discrete version of the Ωⱼ of Section III-A if the sources were W-DO. In the W-DO case we used the instantaneous DUET estimates as labels for each time-frequency point (τ,ω), and each Ωⱼ consisted of the points with identical labels. In the approximately W-DO case, the instantaneous DUET estimates for the time-frequency points in Λⱼ will no longer be identical. However, we claim that we can still use these estimates as a means of labeling, and thus construct the sets Λⱼ, at least approximately. Once we know the Λⱼ, we demix simply by partitioning the support of x̂₁[k,l] using the Λⱼ and converting the resulting time-frequency representations back into the time domain. In order to determine the Λⱼ, we rely on three observations which lead us to create a smoothed two-dimensional power weighted histogram of the (α̂[k,l], δ̂[k,l]) pairs. Enumerating the peaks in this histogram estimates the number of sources, the peak centers estimate the mixing parameters, and the set of time-frequency points which contribute to a given peak provides an estimate for the associated Λⱼ.
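The weighted averages (33) and (34) can be sketched as follows, assuming the instantaneous estimates and a mask marking Λⱼ are given. The array sizes, the noise level, and the choice p = 1 are illustrative assumptions.

```python
import numpy as np

def weighted_estimates(alpha_hat, delta_hat, X1, X2, mask, p=1.0):
    # Power-weighted means of the instantaneous estimates over Lambda_j,
    # as in (33)-(34); mask is 1 on Lambda_j and 0 elsewhere.
    w = (np.abs(X1 * X2) ** p) * mask
    return (w * alpha_hat).sum() / w.sum(), (w * delta_hat).sum() / w.sum()

# Noisy instantaneous estimates scattered around (alpha, delta) = (1.5, 0.3):
rng = np.random.default_rng(1)
alpha_hat = 1.5 + 0.05 * rng.standard_normal((8, 8))
delta_hat = 0.3 + 0.05 * rng.standard_normal((8, 8))
X1 = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
X2 = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
mask = np.ones((8, 8))        # pretend source j dominates everywhere (toy case)

a_est, d_est = weighted_estimates(alpha_hat, delta_hat, X1, X2, mask)
assert abs(a_est - 1.5) < 0.1 and abs(d_est - 0.3) < 0.1
```

Averaging drives the estimate toward the true parameter pair even though each individual instantaneous estimate is noisy.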
Observation 1: The time-frequency points with instantaneous DUET estimates (α̂[k,l], δ̂[k,l]) inside a small rectangle centered on the true mixing parameter pair (αⱼ, δⱼ) contain most of the source energy.

We wish to show that the time-frequency points which yield instantaneous DUET estimates in close proximity to the true mixing parameters contain most of the energy of the source. Let

Iⱼ^Δα[k,l] := 1 if |α̂[k,l] − αⱼ| < Δα, and Iⱼ^Δα[k,l] := 0 otherwise,  (35)

be the indicator function for time-frequency points with instantaneous DUET symmetric attenuation estimate within Δα of αⱼ, where Δα is a resolution parameter. We are interested in

PSR_α(Δα) := Σ_{[k,l]} Iⱼ^Δα[k,l] |ŝⱼ[k,l]|² / Σ_{[k,l]} |ŝⱼ[k,l]|²,  (36)

which gives the portion of the energy of sⱼ contained in time-frequency points with corresponding α̂[k,l] within Δα of the true mixing parameter value αⱼ. Figure 5 shows PSR_α(Δα) averaged over 100 randomly selected speech signals taken from the TIMIT database. The curves represent the expected energy contained in time-frequency points with instantaneous DUET symmetric attenuation estimates close to the true symmetric attenuation. For example, with Δα = 0.1 we expect more than 60% of the source energy to come from time-frequency points with corresponding α̂[k,l] located within 0.1 of the true value αⱼ in mixtures of five sources. In this section, the model described by (47) in Appendix I and discussed in Appendix II was used to simulate mixtures of N = 2, 3, 5, and 10 sources.
Similarly, for the delay, we define

Iⱼ^Δδ[k,l] := 1 if |δ̂[k,l] − δⱼ| < Δδ, and Iⱼ^Δδ[k,l] := 0 otherwise,  (37)

where Δδ is a resolution parameter. Then we are interested in

PSR_δ(Δδ) := Σ_{[k,l]} Iⱼ^Δδ[k,l] |ŝⱼ[k,l]|² / Σ_{[k,l]} |ŝⱼ[k,l]|²,  (38)

which gives the portion of the energy of sⱼ with instantaneous DUET delay estimates within Δδ of the true mixing parameter value δⱼ. Figure 5 also shows PSR_δ(Δδ) as a function of Δδ for various mixture orders. For example, 70% of the energy of the source is expected to be contained in time-frequency points with corresponding δ̂[k,l] within 0.1 samples of the true value of δⱼ in pairwise mixing.
Fig. 5. Energy distribution PSR_α(Δα) (left) and PSR_δ(Δδ) (right) of instantaneous DUET estimates around the true mixing parameters, for N = 2, 3, 5, and 10. Note, for W-DO signals, the corresponding source energy portion would be 1.0 for all distances from αⱼ (and δⱼ).
We now show that the source energy is localized simultaneously around (αⱼ, δⱼ). To do so, we look at

PSR_{α,δ}(Δα, Δδ) := Σ_{[k,l]} Iⱼ^Δα[k,l] Iⱼ^Δδ[k,l] |ŝⱼ[k,l]|² / Σ_{[k,l]} |ŝⱼ[k,l]|²,  (39)

which measures the portion of source energy in time-frequency points with instantaneous DUET symmetric attenuation estimates within Δα of αⱼ and instantaneous DUET delay estimates within Δδ of δⱼ. Before we examine PSR_{α,δ}(Δα, Δδ), we need to determine the appropriate Δα to Δδ ratio. Plotting the (Δα, Δδ) pairs for which PSR_α(Δα) = PSR_δ(Δδ) for the same mixture order reveals that the (Δα, Δδ) pairs lie essentially along a line. The least-mean-square fit of this line determines a ratio of Δα/Δδ = 1/1.65 samples. This means, for example, that PSR_α(Δα) ≈ 0.6 for Δα = 0.1 implies that PSR_δ(Δδ) ≈ 0.6 for Δδ = 1.65 × 0.1 = 0.165 samples, a property which can be verified from the data displayed in Figure 5. Figure 6 shows PSR_{α,δ}(Δα, Δδ) versus Δα and Δδ for Δδ = 1.65Δα. Note that the Δα axis is at the bottom and the Δδ axis is at the top. For example, almost 80% of the energy of the source is contained in time-frequency points with corresponding (α̂[k,l], δ̂[k,l]) falling in a rectangle with dimensions 0.2-by-0.33 centered on (αⱼ, δⱼ), i.e., [αⱼ − 0.1, αⱼ + 0.1] × [δⱼ − 0.165, δⱼ + 0.165], for mixtures of three sources. As the number of sources increases, the energy spreads over a wider area, but remains relatively well localized around the source's mixing parameters.
Observation 2: Observation 1 is true for the individual sources in mixtures.

This observation is based on the fact that, from the experiments with speech mixtures (see Figure 3), we know that the time-frequency points at which one source dominates maintain a significant percentage of the dominating source's energy. For N = 2, 3, 4, 5, and 10, the percentage of source energy preserved when considering only the dominant
Fig. 6. Energy distribution PSR_{α,δ}(Δα, Δδ) of instantaneous DUET estimates in a rectangle centered on the true mixing parameter pair (αⱼ, δⱼ), for N = 2, 3, 5, and 10.

time-frequency points is significant for each mixture order (the values are shown in Figure 3). Considering the time-frequency points when
one source dominates, Figure 6 shows that the instantaneous DUET estimates falling in a rectangle centered on the true mixing parameters maintain a significant percentage of that source's energy. For example, for N = 2, 87% of the source's energy is contained in the time-frequency points with estimates falling in a rectangle of dimensions 0.2-by-0.33 centered on the true mixing parameter pair. Thus, in pairwise mixing we would expect the time-frequency points which yield estimates (α̂[k,l], δ̂[k,l]) inside a 0.2-by-0.33 rectangle centered on (α₁, δ₁) to contain the product of the 87% from Figure 6 and the dominant-point energy percentage from Figure 3, roughly 84% of the energy of the first source. Similarly, we would expect 84% of the energy of the second source to come from time-frequency points which have instantaneous DUET estimate pairs within a 0.2-by-0.33 rectangle centered on (α₂, δ₂). As N increases, the source energy percentage we expect to see in a fixed-size rectangle centered on each source's mixing parameters decreases (it is 39% for N = 10); nevertheless, Observation 1 still holds.
Observation 3: The peaks in a smoothed two-dimensional power weighted histogram of the instantaneous DUET estimates will be in one-to-one correspondence with the rectangle centers in Observation 2.

One way to determine the mixing parameters for multiple sources is to look at the two-dimensional weighted histogram

Σ_{[k,l] : |α̂[k,l]−α| < Δα, |δ̂[k,l]−δ| < Δδ} |x̂₁[k,l] x̂₂[k,l]|^p

as a function of α and δ, for some fixed p. If Δα × Δδ is chosen large enough to capture a large portion of the source energy, as determined by Figure 6, yet small enough that the Δα × Δδ rectangle does not contain significant energy contributions from multiple sources, we would expect the local maxima to occur around the true mixing parameter pairs (αⱼ, δⱼ), j = 1, …, N. Therefore, one way of determining the mixing parameters would be to calculate this histogram for the range of interest of (α, δ) pairs and select the local maxima. A computationally efficient way of doing this is to construct a two-dimensional weighted histogram at a high resolution and then smooth that histogram with a kernel of the dimensions of the desired Δα × Δδ rectangle. We perform smoothing to group the time-frequency points which are likely to correspond to one source. Recall that the estimators (33) and (34) average the instantaneous estimates over all time-frequency points where the source of interest is dominant. We know from the results shown in Figure 6 that a rectangle centered on the true mixing parameters will capture most of the corresponding source's energy. By smoothing, we locate the rectangle centers that capture locally the largest energy contribution, and thus estimate the mixing parameters. Histograms have been used previously for parameter estimation of voice mixtures; for example, [21] clusters onset arrival differences to determine the time delays of the various sources.
Now we construct a two-dimensional weighted histogram of the pairs (α̂[k,l], δ̂[k,l]), where α̂[k,l] and δ̂[k,l] are the instantaneous DUET estimates, with the weights |x̂₁[k,l] x̂₂[k,l]|^p for some p. The weighted histogram with resolution widths Δα and Δδ and weighting exponent p is defined as

H(α, δ) := Σ_{[k,l] : |α̂[k,l]−α| ≤ Δα/2, |δ̂[k,l]−δ| ≤ Δδ/2} |x̂₁[k,l] x̂₂[k,l]|^p,  (40)

which we will smooth with a rectangular kernel K(α, δ),

K(α, δ) := 1/(ΔαΔδ) if |α| ≤ Δα/2 and |δ| ≤ Δδ/2, and K(α, δ) := 0 otherwise,  (41)

to produce the smoothed histogram

H_s(α, δ) := (K ∗ H)(α, δ),  (42)

where ∗ denotes two-dimensional convolution.
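A sketch of (40)-(42) using numpy: a high resolution weighted histogram of the estimate pairs, smoothed by two-dimensional convolution with a rectangular kernel. The bin counts, ranges, and kernel size below are illustrative, not the values used in the paper's experiments.

```python
import numpy as np

def smoothed_histogram(alpha_hat, delta_hat, weight, bins=50, kernel=(5, 5)):
    # High-resolution power weighted histogram, eq. (40):
    H, a_edges, d_edges = np.histogram2d(
        alpha_hat.ravel(), delta_hat.ravel(),
        bins=bins, range=[[-2, 2], [-1, 1]], weights=weight.ravel())
    K = np.ones(kernel) / (kernel[0] * kernel[1])        # rectangular kernel (41)
    # 'Same'-size 2-D convolution via zero padding, eq. (42):
    pad = np.pad(H, ((kernel[0] // 2,), (kernel[1] // 2,)))
    Hs = np.zeros_like(H)
    for i in range(H.shape[0]):
        for j in range(H.shape[1]):
            Hs[i, j] = (pad[i:i + kernel[0], j:j + kernel[1]] * K).sum()
    return Hs, a_edges, d_edges

# Estimates clustered near (0.5, 0.25) produce a smoothed peak near that bin:
rng = np.random.default_rng(2)
a = 0.5 + 0.05 * rng.standard_normal(2000)
d = 0.25 + 0.05 * rng.standard_normal(2000)
w = np.abs(rng.standard_normal(2000))                    # stands in for |x1 x2|^p
Hs, ae, de = smoothed_histogram(a, d, w)
i, j = np.unravel_index(np.argmax(Hs), Hs.shape)
print(round(ae[i], 2), round(de[j], 2))                  # near 0.5 and 0.25
```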
Figure 7 shows an example histogram, before and after smoothing, generated using the dominant time-frequency points of one speech signal in a simulated mixture of five sources; the mixing parameter pair used in the model matches the peak location well. The importance of the smoothing is clear in that it combines all the energy from the estimates in a local region and results in a single clear peak, and thus a mixing parameter estimate. In Appendix III, we compare the performance of histogram-based parameter estimators to the performance of the ML estimators and discuss the choice of p.
Fig. 7. Example raw (left) and smoothed (right) power weighted histograms for one speech signal in a mixture of five. The peak location of the smoothed histogram corresponds to the mixing parameters (αⱼ, δⱼ).
C. Demixing Algorithm for Approximately W-DO Sources
Recall that in the W-DO case, sources were demixed using time-frequency masks constructed by grouping the time-frequency points that yield the same instantaneous parameter estimates. We demix in a similar way for approximately W-DO sources. First we estimate the mixing parameters, for example using the histogram method described in the previous section. Then we group time-frequency points that yield instantaneous parameter estimates that are "close" to these estimated mixing parameters. One natural definition of closeness is the instantaneous likelihood function for the jth source,

Lⱼ[k,l] := L(x̂₁[k,l], x̂₂[k,l] | aⱼ, δⱼ) = (1/(2πσ²)) exp( −|aⱼ e^(−iδⱼlΔω) x̂₁[k,l] − x̂₂[k,l]|² / (2σ²(1 + aⱼ²)) ),  (43)

obtained by substituting the instantaneous ML source estimate (53) into the likelihood function in (48), modified to consider only the time-frequency point [k,l]. Lⱼ[k,l] is, in a sense, the likelihood that the jth source is dominant at time-frequency point [k,l]. One way to demix the mixtures is to construct a time-frequency mask for sⱼ by taking those time-frequency points for which Lⱼ[k,l] > Lₘ[k,l], ∀m ≠ j. The time-frequency mask for demixing sⱼ is thus

M̃ⱼ[k,l] := 1 if Lⱼ[k,l] > Lₘ[k,l] for all m ≠ j, and M̃ⱼ[k,l] := 0 otherwise,  (44)

and defining

Λ̃ⱼ := {[k,l] : j = argmaxₘ Lₘ[k,l]},  (45)
the estimate of the set of time-frequency points at which the jth source is dominant, we can relate this demixing mask to those that were used in the W-DO case. There are many other ways one could envision using these likelihoods, for example, some type of relative weighting resulting in fractional masks instead of the binary winner-take-all masks created by the scheme we have proposed. However, we showed in Section II that the 0 dB binary masks exhibit excellent demixing performance and maximize the WDO performance measure, so we consider exclusively binary time-frequency masks in this paper.

As before, we estimate the source by converting

s̃ⱼ[k,l] := M̃ⱼ[k,l] x̂₁[k,l]  (46)

into the time domain. Note that we could apply the mask to x̂₂ as well, and could combine the two demixtures using the ML estimate of the source as in (53). However, in order to compare with the results obtained in Section II, the experimental results presented in the next section will use (46).
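For fixed noise variance, maximizing the likelihood in (43)-(45) amounts to assigning each time-frequency point to the source minimizing |aⱼ e^(−iδⱼω) x̂₁ − x̂₂|² / (1 + aⱼ²); the sketch below follows that reading, with made-up parameters and a toy W-DO example rather than the paper's data.

```python
import numpy as np

def duet_masks(X1, X2, omega, params):
    # params: list of estimated (a_j, delta_j) pairs. Score every TF point by
    # the mismatch of the anechoic model for source j (smaller = more likely
    # that j dominates), then mask X1 by the winner-take-all assignment (44)-(46).
    scores = []
    for aj, dj in params:
        resid = aj * np.exp(-1j * dj * omega) * X1 - X2
        scores.append(np.abs(resid) ** 2 / (1.0 + aj ** 2))
    best = np.argmin(np.stack(scores), axis=0)           # ML winner at each point
    return [np.where(best == j, X1, 0) for j in range(len(params))]

# Two synthetic W-DO sources are recovered exactly by the likelihood masks:
omega = np.linspace(0.5, 2.0, 4).reshape(-1, 1)
rng = np.random.default_rng(3)
S1 = np.zeros((4, 6), complex)
S2 = np.zeros((4, 6), complex)
S1[:, :3] = rng.standard_normal((4, 3))
S2[:, 3:] = rng.standard_normal((4, 3))
params = [(0.8, 0.4), (1.3, -0.7)]
X1 = S1 + S2
X2 = sum(a * np.exp(-1j * d * omega) * S for (a, d), S in zip(params, [S1, S2]))

est = duet_masks(X1, X2, omega, params)
assert np.allclose(est[0], S1) and np.allclose(est[1], S2)
```

For approximately W-DO speech, the same assignment yields the binary masks applied to one mixture as in (46).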
In summary, the DUET algorithm for demixing approximately W-DO sources is:

1) From mixtures x₁(t) and x₂(t), construct time-frequency representations x̂₁[k,l] and x̂₂[k,l].
2) For each time-frequency point, calculate (α̂[k,l], δ̂[k,l]) using (32) and (29).
3) Construct the histogram and locate its peaks:
   a) Construct a high resolution histogram as in (40).
   b) Smooth the histogram as in (42).
   c) Locate the peaks in the histogram. There will be N peaks, one for each source, with peak locations approximately equal to the true mixing parameter pairs {(αⱼ, δⱼ) : j = 1, …, N}.
4) For the N pairs of (αⱼ, δⱼ) estimates, construct the time-frequency masks corresponding to each pair using the ML partitioning as in (44), and apply these masks to one of the mixtures as in (46) to yield estimates of the time-frequency representations of the original sources.
5) Convert each estimate back into the time domain.
IV. EXPERIMENTS
In order to demonstrate the technique, we present results in this section for both synthetic and real mixtures. One
issue that we have not addressed is how the histogram peaks are automatically enumerated and identified. For the
following demonstration, we used an ad-hoc technique that iteratively selected the highest peak and removed a region
surrounding the peak from the histogram. Peaks were removed as long as the histogram maintained a threshold
percentage of its original weight. The threshold percentage and region dimensions had to be occasionally altered in
the course of the tests to ensure the correct number of sources was found. Indeed, peak enumeration and identification
remains a topic of future research. In all examples, we used histograms with weighting exponent p = 1, as suggested in Appendix III.
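The ad hoc peak selection described above can be sketched as follows; the region size and weight threshold are illustrative stand-ins for the occasionally retuned values mentioned in the text.

```python
import numpy as np

def enumerate_peaks(Hs, region=(5, 5), threshold=0.3, max_peaks=20):
    # Iteratively select the highest remaining peak and zero out a region
    # around it; stop once the remaining weight falls below a threshold
    # fraction of the histogram's original weight (or a safety cap is hit).
    H = Hs.copy()
    total = H.sum()
    peaks = []
    while len(peaks) < max_peaks and H.sum() > threshold * total:
        i, j = np.unravel_index(np.argmax(H), H.shape)
        peaks.append((i, j))
        H[max(i - region[0], 0):i + region[0] + 1,
          max(j - region[1], 0):j + region[1] + 1] = 0.0
    return peaks

# Two well-separated bumps yield two detected peaks:
Hs = np.zeros((40, 40))
Hs[10, 10] = 3.0
Hs[30, 25] = 2.0
print(len(enumerate_peaks(Hs)))   # 2
```

As the text notes, the threshold and region dimensions interact with the mixture, which is why this step remains a research topic rather than a fixed procedure.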
A. Synthetic mixtures
Figure 8 shows the smoothed histogram (42) for a six-source synthetic mixing example. The six sources were taken from the TIMIT database, and the stereo mixture was created using six distinct (symmetric attenuation, delay) mixing parameter pairs. It is clear that, given only the stereo mixture, one can determine how many sources were used to create the mixture by enumerating the peaks in the histogram. Using the ML partitioning, the first channel of the mixture was demixed and the SIR, PSR, and WDO were measured; the results are shown in Table III. For comparison, WDO 0dB, the optimal WDO created using the 0 dB mask, is shown in the last column. The demixtures average over 13 dB SIR gain, and the WDO numbers indicate demixtures which would rate right on the border between "minor artifacts or interference" and "distorted but intelligible." Note that even though the blind method performs reasonably well, the performance of the 0 dB mask shows that there exist time-frequency masks which would further improve the performance. Figure 9 shows the original six sources, the two mixtures, and the six demixtures.
To show the limits of this technique, a ten source stereo mixture was synthetically mixed. The smoothed histogram
for the mixture is shown in Figure 8, and Table III contains the demixing performance. The SIR gains are still high, averaging above 12 dB, but the WDO performance has dropped to "very distorted and barely intelligible." Still, as we are trying to demix ten sources from just two mixtures, these results are promising. More promising
Fig. 8. Six and Ten Source Synthetic Mixing Smoothed Histograms. Each peak corresponds to one source and the peak location corresponds to the associated source's mixing parameters.
TABLE III
SIX AND TEN SOURCE DEMIXING PERFORMANCE. PERFORMANCE OF THE BLIND TECHNIQUE IS COMPARED AGAINST THE OPTIMAL
TIME-FREQUENCY MASK, THE 0 DB MASK.
source   SIR in (dB)   SIR out (dB)   SIR gain (dB)   PSR    WDO DUET   WDO 0dB

Six sources:
s1        -7.29          5.92          13.21          0.76    0.57       0.80
s2        -7.29          5.24          12.53          0.78    0.55       0.78
s3        -5.08          6.60          11.67          0.80    0.62       0.81
s4        -9.29          5.35          14.63          0.79    0.56       0.69
s5        -5.03          7.06          12.09          0.78    0.63       0.81
s6        -9.28          5.47          14.75          0.77    0.55       0.66

Ten sources:
s1        -9.74         -0.32           9.42          0.58   -0.04       0.70
s2        -7.73          3.14          10.87          0.66    0.34       0.77
s3       -11.64          3.43          15.06          0.68    0.37       0.64
s4        -9.72         -0.60           9.13          0.58   -0.09       0.67
s5        -7.73          3.93          11.66          0.66    0.39       0.73
s6       -11.61          3.14          14.75          0.56    0.29       0.51
s7        -7.75          2.57          10.31          0.56    0.25       0.74
s8       -11.62          1.36          12.98          0.61    0.16       0.62
s9        -9.72          4.70          14.42          0.60    0.39       0.67
s10       -9.74          3.33          13.07          0.60    0.32       0.64
Fig. 9. Six Sources, Stereo Mixture, and Six Demixtures.
indeed is the fact that the 0 dB mask's performance is significantly better, showing that there is room for improvement.
B. Anechoic and Echoic Mixing Results
We also tested DUET on speech mixtures recorded in an anechoic room. For the tests, each speech signal was recorded separately at 16 kHz, and the signals were then mixed additively to generate the mixtures for the tests. Knowledge of the actual signals present in each mixture allows us to calculate the performance measures exactly. For the recordings, the omnidirectional microphones were separated by 1.75 cm, and the speech signals were played from various positions on a 1.5 meter radius semicircle around the microphones, with the microphone axis along the line from the 0° position to the 180° position. Two female (F1 and F2) and one male (M1) TIMIT sound files were used for the tests. Pairwise mixing results for female-female and male-female mixtures are shown in Table IV. Again, for comparison purposes, the WDO obtained by the DUET algorithm is compared to the optimal WDO, which is obtained using the 0 dB mask. The separation obtained by DUET is nearly perfect; in all but one case, the DUET mask's performance is essentially the same as the performance of the optimal mask. The reason for the slight fall in performance in that case is that the source peak regions begin to overlap, and as a result some time-frequency points are misassigned.
Higher order mixing results (i.e., N > 2) are listed in Table V. In addition to the three, four, and five source
anechoic mixtures tested, a three source echoic mixture was tested. All of the speech signals, three female (F1, F2,
and F3) and two male (M1 and M2), were taken from the TIMIT database. The echoic recording was made in an
echoic office environment with an approximate reverberation time of 500 ms. As the number of sources increases, the
demixing performance decreases, although the performance is still acceptable in the five source mixture. As expected,
the performance drops off significantly when switching from the anechoic to the echoic environment as the method is
based on an anechoic mixing model. However, some separation is still achieved.
Figure 10 compares the one source histograms for anechoic and echoic recordings for sources at three different
angles. The histograms corresponding to the summation of the three sources are also shown. The anechoic histograms
are well localized and the peak regions are clearly distinct, even in the histogram corresponding to the summation of the
sources. The peak regions in the echoic histograms are spread out and overlap with one another. This overlap results
TABLE IV
PAIRWISE ANECHOIC DEMIXING PERFORMANCE.
test   SIR in (dB)   SIR out (dB)   SIR gain (dB)   PSR    WDO DUET   WDO 0dB

F1      -0.58         12.69          13.26          0.92    0.87       0.96
F2       0.58         11.25          10.68          0.96    0.89       0.96

F1      -0.54         15.97          16.51          0.98    0.95       0.96
F2       0.54         17.21          16.68          0.98    0.96       0.96

F1      -0.62         15.29          15.91          0.97    0.94       0.94
F2       0.62         15.69          15.07          0.98    0.95       0.95

F1      -0.49         17.50          17.99          0.98    0.96       0.96
F2       0.49         17.36          16.87          0.98    0.97       0.97

F1      -0.50         15.79          16.29          0.97    0.94       0.94
F2       0.50         15.51          15.01          0.98    0.95       0.95

F1      -0.44         16.29          16.73          0.96    0.94       0.94
F2       0.44         14.49          14.05          0.98    0.94       0.95

F1       3.54         13.99          10.46          0.96    0.92       0.97
M1      -3.54         10.35          13.88          0.91    0.83       0.94

F1       3.60         18.42          14.81          0.99    0.97       0.98
M1      -3.60         15.41          19.01          0.97    0.94       0.95

F1       3.63         18.92          15.29          0.99    0.98       0.98
M1      -3.63         15.91          19.54          0.97    0.95       0.95

F1       3.69         19.91          16.22          0.99    0.98       0.98
M1      -3.69         15.79          19.48          0.98    0.95       0.95

F1       3.75         19.57          15.82          0.99    0.98       0.98
M1      -3.75         16.37          20.12          0.97    0.95       0.95

F1       3.90         18.47          14.57          0.99    0.97       0.98
M1      -3.90         15.51          19.41          0.97    0.94       0.94

(Each F1/F2 or F1/M1 pair is one test; the source positions on the semicircle varied between tests.)
in reduced demixing performance. Note, however, that the 0 dB mask still performs well in the echoic case, so there
remains a gap between what we can separate blindly and what we can separate with knowledge of the instantaneous
time-frequency attenuations when using time-frequency masking to demix.
Figure 11 shows the histogram for the Te-Won Lee real office room recording consisting of two speakers [22]. The histogram shows a number of peaks: the peaks with α > −0.5 are all associated with the Spanish speaker, and those along the α = −1.0 line correspond to the English speaker. Note that for this recording it is the attenuation direction in the histogram that allows for the separation; a method that relied only on delays would not be able to separate
the sources. Demixtures generated from this recording using the DUET algorithm are compared to several other BSS
techniques in [23].
TABLE V
HIGHER ORDER DEMIXING PERFORMANCE. RESULTS FOR THREE SOURCE, FOUR SOURCE, AND FIVE SOURCE ANECHOIC MIXTURES,
AS WELL AS THREE SOURCE ECHOIC MIXING.
Anechoic

test   SIR in (dB)   SIR out (dB)   SIR gain (dB)   PSR    WDO DUET   WDO 0dB

Three sources:
M1      -2.72         13.67          16.39          0.92    0.88       0.90
F1      -2.05          7.96          10.00          0.96    0.80       0.93
M2      -4.37         13.32          17.70          0.88    0.84       0.87

Four sources:
M1      -6.93          9.89          16.83          0.78    0.70       0.80
F1      -3.19          7.11          10.30          0.92    0.74       0.91
M2      -4.37          6.98          11.35          0.85    0.68       0.89
F2      -5.05         10.08          15.12          0.86    0.78       0.90

Five sources:
F1      -9.77          7.97          17.74          0.73    0.62       0.76
M1      -4.30          7.16          11.46          0.83    0.67       0.86
F2      -3.77          5.99           9.76          0.91    0.68       0.91
M2      -5.60          7.05          12.65          0.80    0.65       0.85
F3      -8.59          8.53          17.11          0.76    0.65       0.82

Echoic

test   SIR in (dB)   SIR out (dB)   SIR gain (dB)   PSR    WDO DUET   WDO 0dB

M1      -5.20          5.38          10.58          0.56    0.40       0.81
M2       0.07          4.33           4.26          0.89    0.56       0.91
F1      -4.48          6.03          10.51          0.65    0.49       0.87
Fig. 10. Anechoic vs. Echoic Histogram Comparison. The left column images are of the histograms for three anechoic sources at three different angles, and of their mixture. The histogram of the mixture is essentially the summation of the individual histograms, and the peak regions in the histogram are clearly separated. The right column images are of the histograms for three echoic sources at the same angles, and of their mixture. While the individual histograms show some level of localization (left, center, right), the peak regions in each histogram overlap, and the peaks are difficult to identify in the summation image. Thus, the algorithm performs worse on echoic mixtures. In all images, the x-axis is delay, ranging from -2 to 2 samples, and the y-axis is symmetric attenuation, ranging from -1 to 1.
Fig. 11. Histogram for Te-Won Lee's "A real Cocktail Party Effect" Echoic Mixing Example.
V. CONCLUSIONS
In this paper, we presented a method to blindly separate mixtures of speech signals. We first illustrated experimen-
tally that binary time-frequency masks exist that can separate as many as 10 different speech signals from one mixture.
This relies upon a property of the Gabor expansions of speech signals, which we refer to as W-disjoint orthogonality.
W-disjoint orthogonality in the strict sense is satisfied by signals which have disjoint time-frequency supports. Speech
signals, as a result of the sparsity of their Gabor expansions, satisfy an approximate version of the W-disjoint orthog-
onality property. In Section II-A, we introduced a means of measuring the degree of W-disjoint orthogonality of a
signal in a given mixture with respect to a windowing function W. Listening experiments showed a consistent relationship between the WDO value of a particular signal in a mixture for a given time-frequency mask and the subjective quality with which that mask separates the signal from the mixture.
Next, we addressed the problem of blindly constructing binary time-frequency masks that demix. The solution we
presented in this paper considered the two mixture case. For strictly W-disjoint orthogonal signals, we showed that
the instantaneous DUET attenuation and delay estimators are the anechoic mixing parameters and, using this fact,
described a simple algorithm to construct a binary time-frequency mask that demixes perfectly. Next we showed,
by modeling the contributions of the interfering sources as independent Gaussian white noise, that the ML estima-
tors for the mixing parameters are given by weighted averages of the instantaneous DUET estimates. Motivated by
this, we constructed a weighted histogram which was used to enumerate the sources and partition the time-frequency
representation of one of the mixtures to demix.
In Section IV we presented experimental results demonstrating that the algorithm works extremely well for synthetic mixtures of speech as well as for speech mixtures recorded in an anechoic room, and produces near perfect demixtures.
That is, the performance of the mask generated by the DUET algorithm was close to the performance of the ideal mask.
In an echoic room the anechoic model is violated and the quality of the demixing is reduced. In the echoic case, the
demixtures contain some crosstalk and distortion, but are intelligible. There is, in the echoic case, a performance gap
between the ideal binary time-frequency mask and the mask generated using the DUET algorithm. Closing this gap
is one goal of our future work. As we mentioned in Section IV, the enumeration and identification of the histogram
peaks are also topics for future research.
APPENDIX I
DERIVATION OF THE ML ESTIMATORS
Let us concentrate on one source, say s_1. Let \Lambda_1 be the set of time-frequency points (\tau,\omega) at which s_1 is dominant as defined in Section III-B. On \Lambda_1, we model the mixtures as follows:

\hat{x}_1(\tau,\omega) = \hat{s}_1(\tau,\omega) + n_1(\tau,\omega),
\hat{x}_2(\tau,\omega) = a_1 e^{-i\omega\delta_1}\,\hat{s}_1(\tau,\omega) + n_2(\tau,\omega),   (47)

where n_1 and n_2 are i.i.d. white complex Gaussian noise signals with mean zero and variance \sigma^2. Here n_1 and n_2 model the contributions of other sources at the time-frequency points where s_1 is the dominant source. We model the interfering sources as independent Gaussian noise in order to obtain simple closed-form source and mixing parameter estimators. In reality, the interference in the different mixtures will be correlated and may not be Gaussian distributed. However, the estimates we obtain here are used simply to motivate the weighted histograms used in our algorithm discussed in Section III-B, and the model is sufficient for that purpose.

For the model in (47), we want to employ ML estimation to find the parameter pair (a_1, \delta_1) as well as the values \hat{s}_1(\tau,\omega) which maximize the likelihood. To that end, we define the likelihood L_1 of \theta_1 = (a_1, \delta_1, S_1), where S_1 = \{\hat{s}_1(\tau,\omega) : (\tau,\omega) \in \Lambda_1\} with each \hat{s}_1(\tau,\omega) \in \mathbb{C}, given the data \hat{x}_1(\tau,\omega) and \hat{x}_2(\tau,\omega), by

L_1(\theta_1) = \prod_{(\tau,\omega)\in\Lambda_1} \frac{1}{(\pi\sigma^2)^2}
  \exp\!\Big( -\frac{|\hat{x}_1(\tau,\omega) - \hat{s}_1(\tau,\omega)|^2
  + |\hat{x}_2(\tau,\omega) - a_1 e^{-i\omega\delta_1}\hat{s}_1(\tau,\omega)|^2}{\sigma^2} \Big),   (48)

where the factorization over (\tau,\omega) holds because we assume n_1 and n_2 are i.i.d. complex Gaussian noise signals. Clearly, maximizing L_1 is equivalent to minimizing

F(\theta_1) = \sum_{(\tau,\omega)\in\Lambda_1} \Big( |\hat{x}_1(\tau,\omega) - \hat{s}_1(\tau,\omega)|^2
  + |\hat{x}_2(\tau,\omega) - a_1 e^{-i\omega\delta_1}\hat{s}_1(\tau,\omega)|^2 \Big).   (49)

We want to solve the equations \partial F/\partial \hat{s}_1^R(\tau,\omega) = 0 and \partial F/\partial \hat{s}_1^I(\tau,\omega) = 0 for all (\tau,\omega) \in \Lambda_1, together with \partial F/\partial \delta_1 = 0 and \partial F/\partial a_1 = 0, simultaneously, where \hat{s}_1^R(\tau,\omega) and \hat{s}_1^I(\tau,\omega) denote the real and imaginary parts of \hat{s}_1(\tau,\omega), respectively. We start with \partial F/\partial \hat{s}_1^R. For any (\tau,\omega) \in \Lambda_1, we have

\frac{\partial F}{\partial \hat{s}_1^R(\tau,\omega)} = \frac{\partial}{\partial \hat{s}_1^R(\tau,\omega)}
  \Big( \big|\hat{x}_1(\tau,\omega) - \hat{s}_1^R(\tau,\omega) - i\hat{s}_1^I(\tau,\omega)\big|^2
  + \big|\hat{x}_2(\tau,\omega) - a_1 e^{-i\omega\delta_1}\big(\hat{s}_1^R(\tau,\omega) + i\hat{s}_1^I(\tau,\omega)\big)\big|^2 \Big).   (50)

We solve \partial F/\partial \hat{s}_1^R(\tau,\omega) = 0 for \hat{s}_1^{R,\mathrm{ML}}(\tau,\omega), the ML estimate of \hat{s}_1^R(\tau,\omega), and obtain

\hat{s}_1^{R,\mathrm{ML}}(\tau,\omega) = \mathrm{Re}\,\frac{\hat{x}_1(\tau,\omega) + a_1 e^{i\omega\delta_1}\,\hat{x}_2(\tau,\omega)}{1 + a_1^2}.   (51)

Similarly, solving \partial F/\partial \hat{s}_1^I(\tau,\omega) = 0 for \hat{s}_1^{I,\mathrm{ML}}(\tau,\omega), the ML estimate of \hat{s}_1^I(\tau,\omega), yields

\hat{s}_1^{I,\mathrm{ML}}(\tau,\omega) = \mathrm{Im}\,\frac{\hat{x}_1(\tau,\omega) + a_1 e^{i\omega\delta_1}\,\hat{x}_2(\tau,\omega)}{1 + a_1^2},   (52)

which we combine with (51) to get the ML estimate \hat{s}_1^{\mathrm{ML}} of \hat{s}_1:

\hat{s}_1^{\mathrm{ML}}(\tau,\omega) = \frac{\hat{x}_1(\tau,\omega) + a_1 e^{i\omega\delta_1}\,\hat{x}_2(\tau,\omega)}{1 + a_1^2}.   (53)
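For concreteness, the per-point source estimate in (53) can be checked numerically. The sketch below (NumPy; variable names are illustrative) verifies that noiseless anechoic mixtures are inverted exactly:

```python
import numpy as np

def s_ml(x1, x2, a1, d1, w):
    """ML estimate of the dominant source's STFT value at one
    time-frequency point, per (53): delay-compensate and
    attenuation-weight the second mixture, then combine."""
    return (x1 + a1 * np.exp(1j * w * d1) * x2) / (1.0 + a1 ** 2)

# Sanity check: with noiseless mixtures x1 = s and
# x2 = a1 * exp(-1j*w*d1) * s, the estimate recovers s exactly,
# since x1 + a1*exp(1j*w*d1)*x2 = (1 + a1**2) * s.
s = 0.7 - 0.2j
a1, d1, w = 0.8, 1.5, 0.3
x1 = s
x2 = a1 * np.exp(-1j * w * d1) * s
assert np.isclose(s_ml(x1, x2, a1, d1, w), s)
```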
Next, we consider \partial F/\partial \delta_1. We have

\frac{\partial F}{\partial \delta_1} = \frac{\partial}{\partial \delta_1}
  \sum_{(\tau,\omega)\in\Lambda_1} \big|\hat{x}_2(\tau,\omega) - a_1 e^{-i\omega\delta_1}\hat{s}_1(\tau,\omega)\big|^2
  = 2 a_1 \sum_{(\tau,\omega)\in\Lambda_1} \omega\, \mathrm{Im}\big[ \hat{x}_2(\tau,\omega)\, \hat{s}_1^*(\tau,\omega)\, e^{i\omega\delta_1} \big],   (54)

where \hat{s}_1^*(\tau,\omega) is the complex conjugate of \hat{s}_1(\tau,\omega). We now plug \hat{s}_1(\tau,\omega) = \hat{s}_1^{\mathrm{ML}}(\tau,\omega) into (54), which yields

\frac{\partial F}{\partial \delta_1} = \frac{2 a_1}{1 + a_1^2} \sum_{(\tau,\omega)\in\Lambda_1}
  \omega\, \mathrm{Im}\big[ \hat{x}_2(\tau,\omega)\, \hat{x}_1^*(\tau,\omega)\, e^{i\omega\delta_1} \big]
  = \frac{2 a_1}{1 + a_1^2} \sum_{(\tau,\omega)\in\Lambda_1}
  \omega\, \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big| \sin\!\big(\omega(\delta_1 - \tilde{\delta}(\tau,\omega))\big).   (55)

We assume that \omega(\delta_1 - \tilde{\delta}(\tau,\omega)) is small, which is reasonable because we are considering only the (\tau,\omega) where s_1 is dominant, and we make the approximation

\sin\!\big(\omega(\delta_1 - \tilde{\delta}(\tau,\omega))\big) \approx \omega\big(\delta_1 - \tilde{\delta}(\tau,\omega)\big).   (56)

After plugging (56) into (55), we solve the equation \partial F/\partial \delta_1 = 0 for \delta_1^{\mathrm{ML}} and obtain (using the definition in (29))

\delta_1^{\mathrm{ML}} = \frac{\sum_{(\tau,\omega)\in\Lambda_1} \tilde{\delta}(\tau,\omega)\, \omega^2\, \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|}
  {\sum_{(\tau,\omega)\in\Lambda_1} \omega^2\, \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|}.   (57)

Note that \delta_1^{\mathrm{ML}}, the ML estimate for the parameter \delta_1, is a weighted average of the instantaneous DUET delay estimates, with each estimate weighted by the product magnitude of the mixtures as well as \omega^2. We also observe that the ML estimate \delta_1^{\mathrm{ML}} does not depend on the attenuation parameter a_1.

Finally, we solve \partial F/\partial a_1 = 0 for a_1^{\mathrm{ML}}. We have
\frac{\partial F}{\partial a_1} = \frac{\partial}{\partial a_1}
  \sum_{(\tau,\omega)\in\Lambda_1} \big|\hat{x}_2(\tau,\omega) - a_1 e^{-i\omega\delta_1}\hat{s}_1(\tau,\omega)\big|^2
  = \sum_{(\tau,\omega)\in\Lambda_1} \Big( 2 a_1 \big|\hat{s}_1(\tau,\omega)\big|^2
  - 2\, \mathrm{Re}\big[ \hat{x}_2(\tau,\omega)\, \hat{s}_1^*(\tau,\omega)\, e^{i\omega\delta_1} \big] \Big).   (58)

After setting \hat{s}_1(\tau,\omega) = \hat{s}_1^{\mathrm{ML}}(\tau,\omega) and some algebra, we get (using the definition in (30))

a_1^{\mathrm{ML}} - \frac{1}{a_1^{\mathrm{ML}}} =
  \frac{\sum_{(\tau,\omega)\in\Lambda_1} \big( |\hat{x}_2(\tau,\omega)|^2 - |\hat{x}_1(\tau,\omega)|^2 \big)}
  {\sum_{(\tau,\omega)\in\Lambda_1} \mathrm{Re}\big[ \hat{x}_2(\tau,\omega)\, \hat{x}_1^*(\tau,\omega)\, e^{i\omega\delta_1^{\mathrm{ML}}} \big]}.   (59)

The estimate for a_1^{\mathrm{ML}} - 1/a_1^{\mathrm{ML}} is symmetric in \hat{x}_1 and \hat{x}_2: swapping the mixture labels results only in a sign change of this quantity (i.e., 1/a_1 - 1/(1/a_1) = -(a_1 - 1/a_1)). In the original presentation of DUET [4], the logarithm of the attenuation estimates was used solely because it has the same property (i.e., \log(1/a_1) = -\log(a_1)). However, motivated by its appearance in the ML estimator (59), we replace the role of the logarithm with the DUET symmetric attenuation estimator defined in (32).

Remark 2: Although the estimate given in (59) is not a weighted average, it is interesting to note that if we replace \delta_1^{\mathrm{ML}} in (59) with \tilde{\delta}(\tau,\omega), we obtain

\mathrm{Re}\big[ \hat{x}_2(\tau,\omega)\, \hat{x}_1^*(\tau,\omega)\, e^{i\omega\tilde{\delta}(\tau,\omega)} \big]
  = \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|,   (60)

and in this case (59) becomes (33) with p = 1, a weighted average of \tilde{\alpha}(\tau,\omega).
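The closed forms above are easy to check numerically. The sketch below (NumPy; names are illustrative) verifies the delay estimator (57) on noiseless anechoic data and the identity (60):

```python
import numpy as np

def delay_ml(X1, X2, w):
    """ML delay estimate per (57): a weighted average of the
    instantaneous DUET delay estimates -angle(X2/X1)/w, with
    weights w**2 * |X1 X2|, over the dominant points."""
    d_inst = -np.angle(X2 / X1) / w
    weight = w ** 2 * np.abs(X1 * X2)
    return np.sum(weight * d_inst) / np.sum(weight)

rng = np.random.default_rng(0)
w = np.linspace(0.1, 2.0, 50)                      # frequencies
X1 = rng.standard_normal(50) + 1j * rng.standard_normal(50)
a, d = 0.9, 0.8
X2 = a * np.exp(-1j * w * d) * X1                  # noiseless second mixture

assert np.isclose(delay_ml(X1, X2, w), d)          # recovers the true delay

# Identity (60): substituting the instantaneous delay estimate makes
# Re[X2 conj(X1) exp(1j*w*delta)] equal to |X1 X2| at every point.
d_inst = -np.angle(X2 / X1) / w
assert np.allclose(np.real(X2 * np.conj(X1) * np.exp(1j * w * d_inst)),
                   np.abs(X1 * X2))
```

The noiseless check recovers the delay only when |w * d| stays below pi, consistent with the phase-wrap limitation discussed in Appendix II.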
APPENDIX II
EXPERIMENTAL EVALUATION OF THE ML ESTIMATORS
In this section, we experimentally evaluate the ML estimators as well as other estimators motivated by the previous section. In order to simulate mixtures, we use the model in (47) and adjust the noise energy to model different numbers of interfering sources. The model in (47) is valid for the dominant time-frequency points of one source. In order to determine the set of dominant time-frequency points \Lambda_1, a speech signal taken from the TIMIT database was compared to a random mixture of 1, 2, 4, or 9 TIMIT speech signals to model N = 2, 3, 5, and 10, and in each case, the time-frequency points corresponding to the 0-dB mask were selected. The mixtures of interfering sources were only used to determine \Lambda_1 and were discarded after the dominant time-frequency points were identified. In order to simulate the presence of interfering sources, i.i.d. Gaussian white noise was added to the dominant time-frequency points of source s_1 on both channels. The added noise was amplified to produce a 15.12 dB, 12.26 dB, 9.87 dB, or 7.52 dB SNR so as to model mixing of order N = 2, 3, 5, or 10, respectively. These SNRs were selected to model different mixture orders because they match the average SIRs for the 0-dB mask from Figure 3. That is, 15.12 dB, 12.26 dB, 9.87 dB, and 7.52 dB are the expected SIRs after applying the 0-dB mask to mixtures of order N = 2, 3, 5, and 10, and thus, in order to model these mixture orders for the dominant time-frequency points of one source, we add noise to one source to produce the corresponding SNRs. Note that the dominant time-frequency points are precisely the support of the source's 0-dB mask. Thus, the performance of the estimator evaluated with this model at these SNRs should approximate the true performance of the estimator in speech mixtures of order N = 2, 3, 5, and 10. All results in the remainder of this section are obtained using this model.
We choose to experimentally evaluate the estimators using the model as described above as opposed to creating
synthetic mixtures of multiple speech signals because we (1) wanted to prevent the results from depending on the
specific choice of mixing parameters of the interfering sources and (2) wanted to evaluate the estimators using the
model that motivated them. The disadvantage of modeling the presence of interfering sources in this way is that the
interference should be correlated and this correlation is lost when the interference is modeled as independent noise.
Our desire in this section is to explore the qualitative performance of a family of estimators to motivate the demixing
algorithm, and modeling the interference as noise is sufficient for this purpose.
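The simulation setup described above can be sketched as follows (NumPy; function and parameter names are illustrative, and the SNR value shown is the one given in the text for the N = 2 case):

```python
import numpy as np

def simulate_mixtures(S, a1, d1, w, snr_db, rng):
    """Model (47): two mixture observations of one dominant source S
    at frequencies w, with i.i.d. complex Gaussian noise added to
    each channel at the given per-channel SNR (in dB)."""
    noise_power = np.mean(np.abs(S) ** 2) / 10 ** (snr_db / 10)
    def noise():
        # Complex Gaussian with total variance noise_power per sample.
        return np.sqrt(noise_power / 2) * (
            rng.standard_normal(S.shape) + 1j * rng.standard_normal(S.shape))
    X1 = S + noise()
    X2 = a1 * np.exp(-1j * w * d1) * S + noise()
    return X1, X2

rng = np.random.default_rng(2)
w = np.linspace(0.1, 3.0, 200)
S = rng.standard_normal(200) + 1j * rng.standard_normal(200)
# 15.12 dB models the two-source (N = 2) case described above.
X1, X2 = simulate_mixtures(S, a1=0.9, d1=0.5, w=w, snr_db=15.12, rng=rng)
```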
Figure 12 shows the ML estimate \delta_1^{\mathrm{ML}} from (57) versus \delta_1 as \delta_1 ranges linearly from -5 to 5 samples with a_1 = 1. We can see that the ML delay estimator exhibits poor performance outside the -1 to 1 sample range, and biased performance
inside this range. The bad performance for larger delays is due to the phase wrap around problem discussed in Re-
mark 1 in Section III-A. The squared frequency weighting factor in the ML delay estimator accentuates this problem.
In addition, such a frequency weighting would make signals with higher frequency content have higher likelihood
estimates of their delay parameters. In the next sections, we will be using these weightings to construct weighted his-
tograms for source separation and it is undesirable to assign more likelihood to one set of parameters simply because
their associated source contains higher frequencies. While methods for unwrapping the phase do exist, these methods
are inappropriate for our purposes as different sources may be active from one frequency to the next. In order to see
if we could reduce the bias, eliminate the wrap-around effect, and remove the high frequency weighting, we removed
the squared frequency weighting factor in the ML delay estimator and considered estimators of the following form
\delta_1^{[p]} = \frac{\sum_{(\tau,\omega)\in\Lambda_1} \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|^p\, \tilde{\delta}(\tau,\omega)}
  {\sum_{(\tau,\omega)\in\Lambda_1} \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|^p}.   (61)

The free parameter p in (61) determines how strongly the estimates obtained from time-frequency points (\tau,\omega) with large |\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)| are weighted relative to those obtained from time-frequency points with small |\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)|. Note that p = 0 corresponds to (61) being a plain average, and p = \infty results in \delta_1^{[p]} being the instantaneous DUET estimate from the time-frequency point (\tau,\omega) with the largest coefficient magnitude. Moreover, with p = 1, this estimator is the ML delay estimator with the squared frequency weighting factor removed. The estimator in (61) for p = 1/2, 1, and 2 is compared
Fig. 12. Maximum Likelihood Delay Estimator. The plot compares estimated \delta_1 versus true \delta_1 for the ML estimator for \delta_1 ranging linearly from -5 to 5 samples when a_1 = 1, for mixture models of 2, 3, 5, and 10 sources.
with the ML estimator in Figure 13. For this test, \delta_1 ranges linearly from -5 to 5 samples while \alpha_1 ranges linearly from 0.15 to -0.15, and the SNR and \Lambda_1 were selected to model mixture orders of 2, 3, 5, and 10. The p = 1/2 estimator suffers deficiencies similar to those of the ML estimator. The p = 1 estimator is clearly biased outside the -1 to 1 delay range, but is monotonic with increasing \delta_1 and exhibits good (although biased) performance inside the -1 to 1 sample delay range. The p = 2 estimator exhibits near perfect estimates.
Figure 13 also shows the ML symmetric attenuation estimate versus \alpha_1 for the same data used for the delay estimates. Similar to the delay case, in addition to the ML estimator, we consider estimators of the following form:

\alpha_1^{[p]} = \frac{\sum_{(\tau,\omega)\in\Lambda_1} \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|^p\, \tilde{\alpha}(\tau,\omega)}
  {\sum_{(\tau,\omega)\in\Lambda_1} \big|\hat{x}_1(\tau,\omega)\hat{x}_2(\tau,\omega)\big|^p}.   (62)

Note that with p = 1, this estimator is the ML symmetric attenuation estimator with the substitution described in Remark 2 previously. The ML and p = 1/2 symmetric attenuation estimators are clearly biased. The p = 1 symmetric attenuation estimator is also biased, although less so. The p = 2 symmetric attenuation estimator exhibits near perfect performance.
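The estimator families (61) and (62) can be sketched together in a few lines (NumPy; a noiseless anechoic check with illustrative parameter values):

```python
import numpy as np

def duet_weighted_estimates(X1, X2, w, p):
    """Weighted-average DUET estimators per (61) and (62):
    instantaneous delay and symmetric attenuation estimates,
    each averaged with weights |X1 X2|**p."""
    weight = np.abs(X1 * X2) ** p
    d_inst = -np.angle(X2 / X1) / w
    alpha_inst = np.abs(X2 / X1) - np.abs(X1 / X2)
    d_hat = np.sum(weight * d_inst) / np.sum(weight)
    alpha_hat = np.sum(weight * alpha_inst) / np.sum(weight)
    return d_hat, alpha_hat

# Noiseless anechoic check for several p: every instantaneous estimate
# equals the true parameter, so each weighted average recovers it.
rng = np.random.default_rng(3)
w = np.linspace(0.1, 2.5, 100)
S = rng.standard_normal(100) + 1j * rng.standard_normal(100)
a, d = 0.8, 0.7
X1, X2 = S, a * np.exp(-1j * w * d) * S
for p in (0.5, 1.0, 2.0):
    d_hat, alpha_hat = duet_weighted_estimates(X1, X2, w, p)
    assert np.isclose(d_hat, d) and np.isclose(alpha_hat, a - 1 / a)
```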
APPENDIX III
HISTOGRAM-BASED PARAMETER ESTIMATION
In order to evaluate the usefulness of the histogram as a parameter estimator, a smoothed histogram was created
for each of the tests used to generate Figure 13 and the peak location of the histogram was used as the symmetric
attenuation and delay estimate. Figure 14 contains the results of these tests. Each estimator histogram consisted
Fig. 13. Delay (left) and Attenuation (right) Estimator Comparison. The graph on the left compares estimated \delta_1 versus true \delta_1 for the ML delay estimator and the weighted average DUET delay estimators with p = 1/2, 1, and 2 as \delta_1 ranges linearly from -5 to 5 samples while \alpha_1 ranges linearly from 0.15 to -0.15 for a mixture model of 5 sources. The graph on the right compares estimated \alpha_1 versus true \alpha_1 for the ML and weighted average DUET symmetric attenuation estimators for the same experimental data used to generate the delay graph.
of 401-by-401 points with a delay range from -6 to 6 samples and symmetric attenuation from -1.2 to 1.2, and the
smoothing kernel parameters were held fixed across the tests. Comparing Figure 13 and Figure 14, we conclude that the
histogram based estimators are more accurate than the previously considered ML motivated estimators.
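The smoothed-histogram peak estimator can be sketched as follows. The 401-by-401 grid and the parameter ranges follow the text; the Gaussian kernel width is an illustrative placeholder, since the exact smoothing parameters are not reproduced here:

```python
import numpy as np

def histogram_peak(alpha_inst, delta_inst, weights,
                   bins=401, a_range=(-1.2, 1.2), d_range=(-6.0, 6.0),
                   sigma_bins=5.0):
    """Weighted 2-D histogram of (symmetric attenuation, delay)
    estimates, smoothed with a separable Gaussian kernel; returns
    the (alpha, delta) location of the histogram peak."""
    H, a_edges, d_edges = np.histogram2d(
        alpha_inst, delta_inst, bins=bins,
        range=[a_range, d_range], weights=weights)
    # Separable Gaussian smoothing via two 1-D convolutions.
    k = np.arange(-3 * int(sigma_bins), 3 * int(sigma_bins) + 1)
    g = np.exp(-0.5 * (k / sigma_bins) ** 2)
    g /= g.sum()
    H = np.apply_along_axis(lambda r: np.convolve(r, g, mode='same'), 1, H)
    H = np.apply_along_axis(lambda c: np.convolve(c, g, mode='same'), 0, H)
    ia, idx = np.unravel_index(np.argmax(H), H.shape)
    a_centers = 0.5 * (a_edges[:-1] + a_edges[1:])
    d_centers = 0.5 * (d_edges[:-1] + d_edges[1:])
    return a_centers[ia], d_centers[idx]

# A noisy cloud of instantaneous estimates around (alpha, delta) = (0.3, 1.0).
rng = np.random.default_rng(4)
alpha = 0.3 + 0.05 * rng.standard_normal(5000)
delta = 1.0 + 0.2 * rng.standard_normal(5000)
a_hat, d_hat = histogram_peak(alpha, delta, np.ones(5000))
```

Because the peak location is read off a discrete grid, the estimate is quantized to the bin centers; smoothing trades a small amount of resolution for robustness to the scatter of the instantaneous estimates.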
Fig. 14. Histogram Delay (left) and Attenuation (right) Estimator Comparison. The graph on the left compares estimated \delta_1 versus true \delta_1 for the smoothed histogram peak estimators with p = 1/2, 1, and 2 for \delta_1 ranging linearly from -5 to 5 samples as \alpha_1 ranges linearly from 0.15 to -0.15. The graph on the right compares estimated \alpha_1 to the true \alpha_1. Both graphs were generated using a model of 5 source mixing.
The similar estimator performance for different choices of p in Figure 14 suggests that the choice of p should be driven by other concerns. Identifying the peaks in the histogram is the crucial step in the separation process. Two important criteria for the selection of the weighting exponent p are (1) the shape around the peak (the "peak shape") and (2) the relative peak heights. In order to aid peak identification, we want the peak shape to be narrow and tall, and we want the peaks to be roughly the same height. Figure 15 compares the histogram peak shapes for p = 1/2, 1, and 2 for both the \alpha and \delta axes by summing along the other axis. That is, Figure 15 contains 1-D weighted histograms for both \alpha and \delta. As p increases, the peak shape becomes narrower and taller. This would suggest that we should select p as large as possible. However, the larger we choose p, the more the peak heights depend only on the largest instantaneous product power time-frequency components of each source. If these components have different magnitude distributions for different sources, the resulting peak heights can vary by several orders of magnitude, making identification of the smaller peaks impossible. While p = 2 results in the best peak shape, smaller choices of p may result in easier peak identification. The choice of p is thus data dependent; however, motivated once again by the form of the ML estimators, we suggest p = 1 as the default choice.
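A toy numerical illustration of this trade-off (the magnitude distributions are purely illustrative): two sources contribute equal total histogram mass at p = 1, but the "peaky" source, whose mass sits in a few large time-frequency products, comes to dominate as p grows.

```python
import numpy as np

rng = np.random.default_rng(5)
# A "peaky" source (heavy-tailed product magnitudes) and a flat one,
# normalized so their total histogram mass is equal at p = 1.
peaky = rng.lognormal(mean=0.0, sigma=2.0, size=1000)
peaky *= 1000.0 / peaky.sum()
flat = np.ones(1000)
for p in (0.5, 1.0, 2.0):
    ratio = (peaky ** p).sum() / (flat ** p).sum()
    print(f"p = {p}: peak-mass ratio = {ratio:.2f}")
```

By the power-mean inequality the ratio is at most 1 for p = 1/2, exactly 1 for p = 1, and strictly greater than 1 for p = 2, so the two peaks' heights diverge as p increases.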
Fig. 15. Peak shape for p = 1/2, 1, and 2.
REFERENCES
[1] J.-K. Lin, D. G. Grier, and J. D. Cowan, “Feature extraction approach to blind source separation,” in IEEE Workshop on Neural Networks
for Signal Processing (NNSP), Amelia Island Plantation, Florida, September 24–26 1997, pp. 398–405.
[2] T.-W. Lee, M. Lewicki, M. Girolami, and T. Sejnowski, “Blind source separation of more sources than mixtures using overcomplete
representations,” IEEE Signal Proc. Letters, vol. 6, no. 4, pp. 87–90, April 1999.
[3] M. V. Hulle, “Clustering approach to square and non-square blind source separation,” in IEEE Workshop on Neural Networks for Signal
Processing (NNSP), Madison, Wisconsin, August 23–25 1999, pp. 315–323.
[4] A. Jourjine, S. Rickard, and O. Yılmaz, “Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures,” in IEEE
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, Istanbul, Turkey, June 5–9 2000, pp. 2985–2988.
[5] P. Bofill and M. Zibulevsky, “Blind separation of more sources than mixtures using sparsity of their short-time Fourier transform,” in
International Workshop on Independent Component Analysis and Blind Signal Separation (ICA), Helsinki, Finland, June 19–22 2000, pp.
87–92.
[6] R. Balan and J. Rosca, “Statistical properties of STFT ratios for two channel systems and applications to blind source separation,” in
International Workshop on Independent Component Analysis and Blind Source Separation (ICA), Helsinki, Finland, June 19–22 2000, pp.
429–434.
[7] M. Aoki, M. Okamoto, S. Aoki, H. Matsui, T. Sakurai, and Y. Kaneda, “Sound source segregation based on estimating incident angle
of each frequency component of input signals acquired by multiple microphones,” Acoustical Science & Technology, vol. 22, no. 2, pp.
149–157, Feb. 2001.
[8] M. Zibulevsky and B. A. Pearlmutter, “Blind source separation by sparse decomposition in a signal dictionary,” Neural Computation,
vol. 13, no. 4, pp. 863–882, 2001.
[9] L.-T. Nguyen, A. Belouchrani, K. Abed-Meraim, and B. Boashash, “Separating more sources than sensors using time-frequency distribu-
tions,” in International Symposium on Signal Processing and its Applications (ISSPA), Kuala Lumpur, Malaysia, August 13–16 2001, pp.
583–586.
[10] S. Rickard, R. Balan, and J. Rosca, “Real-time time-frequency based blind source separation,” in International Workshop on Independent
Component Analysis and Blind Source Separation (ICA), San Diego, CA, December 9–12 2001, pp. 651–656.
[11] L. Vielva, D. Erdogmus, C. Pantaleon, I. Santamaria, J. Pereda, and J. C. Principe, “Underdetermined blind source separation in a time-
varying environment,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, Orlando, Florida, May 13–17
2002, pp. 3049–3052.
[12] P. Bofill, “Underdetermined blind separation of delayed sound sources in the frequency domain,” preprint, 2002.
[13] S. Rickard and O. Yılmaz, “On the approximate W-disjoint orthogonality of speech,” in IEEE Int. Conf. on Acoustics, Speech, and Signal
Processing (ICASSP), vol. 1, Orlando, Florida, May 13–17 2002, pp. 529–532.
[14] S. T. Roweis, “One microphone source separation,” in Neural Information Processing Systems 13 (NIPS), 2000, pp. 793–799.
[15] K. Suyama, K. Takahashi, and R. Hirabayashi, “A robust technique for sound source localization in consideration of room capacity,” in
IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, October 21–24 2001, pp. 63–66.
[16] I. Daubechies, Ten Lectures on Wavelets. Philadelphia, PA: SIAM, 1992.
[17] R. Mersereau and T. Seay, “Multiple access frequency hopping patterns with low ambiguity,” IEEE Transactions on Aerospace and
Electronic Systems, vol. AES-17, no. 4, July 1981.
[18] W. Kozek, “Time-frequency signal processing based on the Wigner-Weyl framework,” Signal Processing, vol. 29, pp. 77–92, 1992.
[19] B. Berdugo, J. Rosenhouse, and H. Azhari, “Speakers’ direction finding using estimated time delays in the frequency domain,” Signal
Processing, vol. 82, pp. 19–30, 2002.
[20] R. Balan, J. Rosca, S. Rickard, and J. O’Ruanaidh, “The influence of windowing on time delay estimates,” in Conf. on Info. Sciences and
Systems (CISS), vol. 1, Princeton, NJ, March 2000, pp. WP1–(15–17).
[21] J. Huang, N. Ohnishi, and N. Sugie, “A biomimetic system for localization and separation of multiple sound sources,” IEEE Transactions
on Instrumentation and Measurement, vol. 44, no. 3, pp. 733–738, June 1995.
[22] [Online]. Available: http://www.cnl.salk.edu/~tewon/Blind/blind_audio.html
[23] [Online]. Available: http://www.alum.mit.edu/www/rickard/bss.html