Time-frequency Masking
EECS 352: Machine Perception of Music & Audio
Zafar Rafii, Winter 2014
STFT
• The Short-Time Fourier Transform (STFT) is a succession of local Fourier Transforms (FT)
[Figure: time signal (window i) → FT → real spectrum + j * imaginary spectrum (vs. frequency) → frame i of the real spectrogram + j * imaginary spectrogram (time vs. frequency)]
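As a rough illustration of "a succession of local FTs", here is a minimal Python sketch. The helper names `dft` and `stft`, the rectangular window (no tapering), and the toy signal are assumptions made for this example, not part of the original slides.

```python
import cmath

def dft(frame):
    """Naive Fourier transform of one window: N samples in, N complex values out."""
    n_samples = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples) for n in range(n_samples))
        for k in range(n_samples)
    ]

def stft(signal, window_length, hop):
    """Slide a window along the signal and take the FT of each window."""
    frames = []
    for start in range(0, len(signal) - window_length + 1, hop):
        frames.append(dft(signal[start:start + window_length]))
    return frames  # frames[i][k] = complex value of frame i at frequency index k

signal = [0.0, 1.0, 0.0, -1.0] * 4   # 16 samples of a toy oscillation
frames = stft(signal, window_length=8, hop=4)

# The "real" and "imaginary" spectrograms are the two parts of the same complex STFT.
real_spectrogram = [[value.real for value in frame] for frame in frames]
imag_spectrogram = [[value.imag for value in frame] for frame in frames]
```

In practice each window would usually be multiplied by a tapering function (e.g., Hamming) before the FT; that step is left out here to keep the sketch short.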
• If we use a window of N samples, the FT has N values, at frequency indices 0 to N-1; e.g., if N = 8:
Frequency index      0   1   2   3   4   5   6   7
Real spectrum       -3  -1   0   1   2   1   0  -1
Imaginary spectrum   0  -1   0   1   0  -1   0   1
• Frequency index 0 is the DC component; for a real-valued signal it is always real (it is the sum of the time values!)
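The DC property can be checked numerically: at frequency index 0 the complex exponential equals 1, so the FT reduces to a plain sum. The sketch below assumes a hypothetical helper `dft_bin` and an arbitrary real-valued window chosen for illustration.

```python
import cmath

def dft_bin(frame, k):
    """Single Fourier transform value at frequency index k."""
    n_samples = len(frame)
    return sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples) for n in range(n_samples))

frame = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -2.0]   # arbitrary real window, N = 8
dc = dft_bin(frame, 0)

# At k = 0 every exponential is exp(0) = 1, so dc is just the sum of the samples.
sum_of_samples = sum(frame)
```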
• Frequency indices from 1 to floor(N/2) are the “unique” complex values (a + j*b)
• Frequency indices from floor(N/2)+1 to N-1 are the “mirrored” complex conjugates (a - j*b)
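The mirroring can be verified on any real-valued window: the value at index N-k should be the complex conjugate of the value at index k. This is a small sketch with a naive `dft` helper and an arbitrary test window, both assumed for the example.

```python
import cmath

def dft(frame):
    """Naive Fourier transform: N time samples in, N complex frequency values out."""
    n_samples = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples) for n in range(n_samples))
        for k in range(n_samples)
    ]

frame = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -2.0]   # real-valued window, N = 8
spectrum = dft(frame)
n_samples = len(frame)

# Largest deviation between X[N - k] and conj(X[k]) for k = 1 .. N/2 - 1.
mirror_error = max(
    abs(spectrum[n_samples - k] - spectrum[k].conjugate())
    for k in range(1, n_samples // 2)
)
```

The error is only non-zero because of floating-point rounding; the symmetry holds exactly for real input.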
• If N is even, there is a pivot component at frequency index N/2; it is always real!
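One way to see why the pivot is real: at index N/2 the complex exponential becomes exp(-j*pi*n) = (-1)^n, so the FT reduces to an alternating sum of the (real) samples. The sketch below assumes a hypothetical `dft_bin` helper and an arbitrary real window.

```python
import cmath

def dft_bin(frame, k):
    """Single Fourier transform value at frequency index k."""
    n_samples = len(frame)
    return sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples) for n in range(n_samples))

frame = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -2.0]   # real-valued window, N = 8 (even)
pivot = dft_bin(frame, len(frame) // 2)

# Samples added with alternating signs: +x[0] - x[1] + x[2] - ...
alternating_sum = sum(sample * (-1) ** n for n, sample in enumerate(frame))
```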
• Summary of the frequency indices and values in the STFT (in colors!)
– N frequency values = frequency 0 to N-1
– Frequency 0 = DC component (always real)
– Frequency 1 to floor(N/2) = “unique” complex values
– Frequency N/2 = “pivot” component (always real)
– Frequency floor(N/2)+1 to N-1 = “mirrored” complex conjugates
[Figure: real spectrogram + j * imaginary spectrogram (time vs. frequency), color-coded by frequency index range]
Spectrogram
• The (magnitude) spectrogram is the magnitude (absolute value) of the STFT
[Figure: abs(real spectrogram + j * imaginary spectrogram) = magnitude spectrogram, frame by frame (time vs. frequency)]
• For a complex number a + j*b, the absolute value is |a + j*b| = √(a² + b²)
Frequency index      0    1    2    3    4    5    6    7
Magnitude spectrum   3   1.4   0   1.4   2   1.4   0   1.4
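The magnitude values can be reproduced from the real and imaginary spectra of the N = 8 example. The two lists below are the values as read from the slide figure (their assignment to real and imaginary parts is my reading of the figure), and the variable names are chosen for this sketch.

```python
real_spectrum = [-3, -1, 0, 1, 2, 1, 0, -1]   # a values, frequency indices 0..7
imag_spectrum = [0, -1, 0, 1, 0, -1, 0, 1]    # b values, frequency indices 0..7

# Build a + j*b for every frequency index, then take the absolute value sqrt(a^2 + b^2).
frame = [complex(a, b) for a, b in zip(real_spectrum, imag_spectrum)]
magnitude_spectrum = [abs(value) for value in frame]

rounded = [round(magnitude, 1) for magnitude in magnitude_spectrum]
# rounded → [3.0, 1.4, 0.0, 1.4, 2.0, 1.4, 0.0, 1.4]
```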
• All the N frequency values (frequency indices from 0 to N-1) are real and non-negative (abs!)
• Frequency indices from 0 to floor(N/2) are the unique frequency values (with DC and pivot)
• Frequency indices from floor(N/2)+1 to N-1 are the mirrored frequency values
• Since they are redundant, we can discard the frequency values from floor(N/2)+1 to N-1
[Figure: the magnitude spectrum keeps only its floor(N/2)+1 unique frequency values]
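Discarding the redundant values amounts to a slice. The sketch below uses the rounded magnitude values of the N = 8 example and also shows that the discarded part can be rebuilt from the kept part; the variable names are assumptions made for the example.

```python
magnitude_spectrum = [3.0, 1.4, 0.0, 1.4, 2.0, 1.4, 0.0, 1.4]   # N = 8 values
n_samples = len(magnitude_spectrum)

n_unique = n_samples // 2 + 1                    # floor(N/2) + 1
unique_part = magnitude_spectrum[:n_unique]      # indices 0 .. floor(N/2)
mirrored_part = magnitude_spectrum[n_unique:]    # indices floor(N/2)+1 .. N-1

# The mirrored part is the unique part without DC (and pivot, if N is even), read backwards.
rebuilt = unique_part[n_samples - n_unique:0:-1]
```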
• The spectrogram therefore has floor(N/2)+1 unique frequency values (with DC and pivot)
[Figure: abs(real spectrogram + j * imaginary spectrogram) = magnitude spectrogram with floor(N/2)+1 frequency values per frame (time vs. frequency)]
• Why the magnitude spectrogram?
– Easy to visualize (compare with the STFT)
– Magnitude information more important
– Human ear less sensitive to phase
[Figure: time signal (vs. time) and its magnitude spectrogram (time vs. frequency)]
• When you display a spectrogram in Matlab…
– imagesc: data is scaled to use the full colormap
– 10*log10(V): spectrogram in dB (if V holds magnitudes rather than powers, the usual dB convention is 20*log10(V))
– set(gca,'YDir','normal'): y-axis from bottom to top
[Figure: time signal (vs. time) and its magnitude spectrogram (time vs. frequency)]
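The dB conversion itself is a one-liner per bin. Here is a minimal Python sketch of the 10*log10 step; the function name `to_db` and the `floor` parameter (used to avoid taking the log of zero) are assumptions added for this example, and the plotting part is omitted.

```python
import math

def to_db(spectrogram, floor=1e-10):
    """Apply 10*log10 to every bin, clamping values at `floor` to avoid log10(0)."""
    return [[10.0 * math.log10(max(value, floor)) for value in row] for row in spectrogram]

spectrogram = [[1.0, 10.0], [100.0, 0.0]]   # toy values, rows = frequency, columns = time
spectrogram_db = to_db(spectrogram)
```

With the default floor, a bin that is exactly zero is reported as about -100 dB instead of negative infinity.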
• The signal cannot be reconstructed from the spectrogram (phase information is missing!)
[Figure: time signal → STFT → real and imaginary spectrograms → abs → magnitude spectrogram; iSTFT from the magnitude spectrogram back to the time signal: ???]
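A quick way to see that the phase matters: two different signals can share exactly the same magnitude spectrum. In this sketch (helper `dft` and the toy signals are assumptions), the second signal is a circular shift of the first, which changes only the phase of the FT.

```python
import cmath

def dft(frame):
    """Naive Fourier transform: N time samples in, N complex frequency values out."""
    n_samples = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples) for n in range(n_samples))
        for k in range(n_samples)
    ]

signal_a = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
signal_b = signal_a[-1:] + signal_a[:-1]          # same samples, circularly shifted by one

magnitude_a = [abs(value) for value in dft(signal_a)]
magnitude_b = [abs(value) for value in dft(signal_b)]

# Different time signals, yet identical magnitudes (up to rounding).
largest_difference = max(abs(a - b) for a, b in zip(magnitude_a, magnitude_b))
```

Given only the magnitudes, there is no way to tell which of the two signals produced them.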
Time-frequency Masking
• Suppose we have a mixture of two sources: a music signal and a voice signal
[Figure: music signal + voice signal = mixture signal (vs. time), with the music, voice, and mixture spectrograms (time vs. frequency)]
• We assume that the sources are sparse = most of the time-frequency bins have null energy
[Figure: music, voice, and mixture spectrograms (time vs. frequency); the music and voice spectrograms contain mostly low-energy bins]
• We assume that the sources are disjoint = their time-frequency bins do not overlap
[Figure: music, voice, and mixture spectrograms (time vs. frequency); there is not a lot of overlap between the music and voice bins]
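Both assumptions can be checked on toy data by counting "active" bins. The spectrogram values, the helper `active_bins`, and the energy threshold below are all hypothetical, chosen only to illustrate what sparse and disjoint mean.

```python
def active_bins(spectrogram, threshold):
    """Set of (frequency, time) positions whose value exceeds the threshold."""
    return {
        (f, t)
        for f, row in enumerate(spectrogram)
        for t, value in enumerate(row)
        if value > threshold
    }

# Toy magnitude spectrograms: rows = frequency, columns = time.
music = [[4.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.1]]
voice = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [1.5, 0.0, 0.2]]
threshold = 0.5   # hypothetical cut-off between "low energy" and "active"

music_active = active_bins(music, threshold)
voice_active = active_bins(voice, threshold)

n_bins = len(music) * len(music[0])
music_low_energy_fraction = 1 - len(music_active) / n_bins   # sparseness: close to 1
overlapping_bins = music_active & voice_active               # disjointness: ideally empty
```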
• Assuming sparseness and disjointness, we can discriminate the bins between mixed sources
[Figure: music, voice, and mixture spectrograms (time vs. frequency); in the mixture spectrogram, source 1 = bright, source 2 = dark]
• Bins that are likely to belong to one source are assigned to 1, the rest to 0 = binary masking!
[Figure: binary mask (time vs. frequency): 1 = source of interest (music), 0 = interfering source (voice)]
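When both source spectrograms are known (as in the figures here), one simple way to build such a mask is to compare them bin by bin. This "oracle" rule and the toy values are assumptions for illustration; the slides do not specify how the mask is obtained in practice.

```python
# Toy magnitude spectrograms: rows = frequency, columns = time.
music_spectrogram = [[4.0, 0.1], [0.2, 3.0], [0.0, 0.5]]
voice_spectrogram = [[0.3, 2.0], [1.5, 0.1], [0.0, 0.4]]

# A bin is assigned to the music (1) when the music is stronger there, else to the voice (0).
binary_mask = [
    [1 if music > voice else 0 for music, voice in zip(music_row, voice_row)]
    for music_row, voice_row in zip(music_spectrogram, voice_spectrogram)
]
# binary_mask → [[1, 0], [0, 1], [0, 1]]
```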
• By multiplying the mixture spectrogram by the binary mask, we can “preview” the estimate
[Figure: binary mask .x mixture spectrogram = masked spectrogram (time vs. frequency)]
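The ".x" in the figure is an element-wise product. A minimal sketch with toy values (both arrays are made up for the example):

```python
# Rows = frequency, columns = time.
mixture_spectrogram = [[4.2, 2.1], [1.6, 3.1], [0.0, 0.9]]
binary_mask = [[1, 0], [0, 1], [0, 1]]

# Each bin of the mixture is kept (mask = 1) or zeroed out (mask = 0).
masked_spectrogram = [
    [mask * value for mask, value in zip(mask_row, mixture_row)]
    for mask_row, mixture_row in zip(binary_mask, mixture_spectrogram)
]
# masked_spectrogram → [[4.2, 0.0], [0.0, 3.1], [0.0, 0.9]]
```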
• However, we cannot derive the estimate itself because we cannot invert a spectrogram!
[Figure: binary mask .x mixture spectrogram = masked spectrogram (time vs. frequency); the step from the masked spectrogram to the music estimate (vs. time) is missing]
• We build the redundant frequencies of the mask by mirroring its unique frequencies (without DC and pivot)
[Figure: binary mask on the unique frequencies → full binary mask on all N frequencies (time vs. frequency)]
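The mirroring step can be sketched as a slice in reverse order. The function name `mirror_mask` and the toy mask are assumptions for this example; the mask rows are frequency indices 0 to floor(N/2), and the columns are time frames.

```python
def mirror_mask(unique_mask, n_samples):
    """Extend a mask with floor(N/2)+1 frequency rows to all N frequency rows."""
    n_unique = n_samples // 2 + 1
    assert len(unique_mask) == n_unique
    # Rows floor(N/2)+1 .. N-1 copy rows N-k, i.e. the unique rows without DC
    # (and without the pivot when N is even), in reverse order.
    mirrored_rows = unique_mask[n_samples - n_unique:0:-1]
    return unique_mask + mirrored_rows

unique_mask = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]   # 5 rows for N = 8
full_mask = mirror_mask(unique_mask, n_samples=8)
```

For N = 8 this gives 8 rows, with rows 5, 6, 7 copying rows 3, 2, 1 respectively.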
• We then apply this full binary mask to the STFT using an element-wise multiplication
[Figure: full binary mask .x (real spectrogram + j * imaginary spectrogram) (time vs. frequency)]
• The estimate signal can now be reconstructed via inverse STFT
[Figure: masked real + j * masked imaginary spectrograms (time vs. frequency) → iSTFT → music estimate (vs. time)]
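The last two steps can be sketched for a single frame: mask the complex FT with a mirrored mask, then invert. This is a simplification under stated assumptions (one rectangular window, no overlap-add, toy cosine sources, naive `dft`/`idft` helpers), not a full inverse STFT.

```python
import cmath
import math

def dft(frame):
    n_samples = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_samples) for n in range(n_samples))
            for k in range(n_samples)]

def idft(spectrum):
    n_samples = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * n / n_samples) for k in range(n_samples)) / n_samples
            for n in range(n_samples)]

# Toy mixture frame (N = 8): a slow source at frequency index 1 plus a fast one at index 3.
n_samples = 8
slow = [math.cos(2 * math.pi * 1 * n / n_samples) for n in range(n_samples)]
fast = [math.cos(2 * math.pi * 3 * n / n_samples) for n in range(n_samples)]
mixture = [s + f for s, f in zip(slow, fast)]

unique_mask = [0, 1, 0, 0, 0]   # keep only index 1 among the unique indices 0..4
full_mask = unique_mask + unique_mask[n_samples - len(unique_mask):0:-1]   # mirrored to N values

masked_frame = [mask * value for mask, value in zip(full_mask, dft(mixture))]
estimate = idft(masked_frame)

# Because the mask is mirrored, the estimate is real (up to rounding) and matches the slow source.
largest_imaginary_part = max(abs(value.imag) for value in estimate)
largest_error = max(abs(value.real - s) for value, s in zip(estimate, slow))
```

If the mask were not mirrored, the conjugate symmetry of the FT would be broken and the inverse would no longer be a real signal.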
• Sources are not really sparse or disjoint in time-frequency in the mixture
[Figure: music signal + voice signal = mixture signal (vs. time), with the music, voice, and mixture spectrograms (time vs. frequency)]
• Bins that are likely to belong to one source are close to 1, the rest close to 0 = soft masking!
[Figure: soft mask (time vs. frequency) with values between 0 and 1: close to 1 = source of interest (music), close to 0 = interfering source (voice)]
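One common way to get values between 0 and 1 is a ratio of the two source spectrograms. The ratio rule, the function name `soft_mask`, the small constant `eps`, and the toy values are assumptions for illustration (and require knowing both sources); the slides do not say how the soft mask is computed.

```python
def soft_mask(target, interference, eps=1e-12):
    """Ratio mask: share of each bin's magnitude that comes from the target source."""
    return [
        [t / (t + i + eps) for t, i in zip(target_row, interference_row)]
        for target_row, interference_row in zip(target, interference)
    ]

# Toy magnitude spectrograms: rows = frequency, columns = time.
music_spectrogram = [[4.0, 1.0], [0.0, 3.0]]
voice_spectrogram = [[0.0, 1.0], [2.0, 1.0]]

mask = soft_mask(music_spectrogram, voice_spectrogram)
# Roughly [[1.0, 0.5], [0.0, 0.75]]: bins shared by both sources get intermediate values.
```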
• Let’s listen to the results!
[Figure: music signal → mix → mixture signal → demix → music estimate (vs. time)]
Question
• How can we efficiently model a binary/soft time-frequency mask for source separation?...
• To be continued…
[Figure: mixture spectrogram → ??? → soft mask (time vs. frequency)]