Gaussian Processes for Audio Feature Extraction
Dr. Richard E. Turner ([email protected])
Computational and Biological Learning Lab, Department of Engineering
University of Cambridge
Machine hearing pipeline

[Figure: signal (T samples) → time-frequency (TF) analysis (non-linear: short-time Fourier transform / spectrogram, or wavelet / filter bank; T' × D > T samples) → probabilistic model (HMM: speech recognition; NMF: source separation, denoising; ICA: source separation, denoising)]
Problems with conventional pipeline

[Figure: signal (T samples) → time-frequency (TF) analysis (non-linear; T' × D > T samples) → probabilistic model]

• noise (source mixtures) is hard to model in the TF domain (hard to propagate uncertainty – noise/missing data – from the signal to the TF domain)
• hard to enforce/learn the dependencies intrinsic to the TF analysis (valid coefficients lie in the image of the injective mapping)
• learning based on the time-frequency representation ignores the Jacobian
• hard to adapt both the top and bottom layers
Goal of this talk

[Figure: signal (T samples) → time-frequency (TF) analysis (non-linear; T' × D > T samples) → probabilistic model]

• probabilise time-frequency analysis: construct a generative model in which inference corresponds to classical time-frequency analysis (classical signal processing)
• build a hierarchical model that incorporates the downstream processing module (machine learning)
A typical audio pipeline

[Figure: signal y (time /s) → window → Fourier transform (short-time Fourier transform) → magnitude → spectrogram (frequency /kHz vs time /s) → NMF]
What form of generative model corresponds to the STFT?

desire: the expected value of the latent time-frequency coefficients $s_{d,1:T}$ equals the STFT

• assume $y$ is formed by a (weighted) superposition of band-limited signals $s_{d,1:T}$
• linearity of inference can be assured by setting the distributions of each $s_{d,1:T}$ and the noise to be Gaussian
• time-invariance $\Rightarrow$ the generative model is statistically stationary

$\Rightarrow$ GP prior over STFT coefficients, $p(s_{d,1:T}) = \mathcal{G}(s_{d,1:T};\, 0, \Gamma)$, with $\Gamma$ stationary:

$$\Gamma_{t,t'} \approx \sum_{k=1}^{T} \mathrm{FT}^{-1}_{t,k}\, \gamma_k\, \mathrm{FT}_{k,t'}, \qquad \text{where } \mathrm{FT}_{k,t} = e^{-2\pi i (k-1)(t-1)/T}$$
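A covariance of this form is circulant: it is diagonalised by the DFT with eigenvalues $\gamma_k$, which is exactly what makes the prior stationary. A minimal numpy sketch (the spectral weights `gamma` are an illustrative choice, not taken from the talk) verifies that structure:

```python
import numpy as np

T = 32
k = np.arange(T)

# Illustrative spectral weights gamma_k: a symmetric pair of bumps,
# gamma_k = gamma_{T-k}, so the resulting covariance is real.
gamma = (np.exp(-0.5 * ((k - 8) / 2.0) ** 2)
         + np.exp(-0.5 * ((k - (T - 8)) / 2.0) ** 2))

# DFT matrix FT_{k,t} = exp(-2*pi*i*k*t/T) (0-indexed here) and its inverse.
FT = np.exp(-2j * np.pi * np.outer(k, k) / T)
FTinv = np.conj(FT).T / T

# Gamma = FT^{-1} diag(gamma) FT is circulant: Gamma[t, t'] depends only on
# (t - t') mod T, i.e. the GP prior is stationary.
Gamma = (FTinv @ np.diag(gamma) @ FT).real

# Check stationarity: every row is a cyclic shift of the first.
row0 = Gamma[0]
assert all(np.allclose(Gamma[t], np.roll(row0, t)) for t in range(T))
```

The same diagonalisation is what later makes inference cheap: operations with Γ reduce to FFTs and per-frequency scalings.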
Time-frequency analysis as inference

• generation: complex sinusoids modulated by time-varying (complex) coefficients
• inference: the most probable coefficients given the signal form the STFT
  – STFT window = prior covariance (frequency-shifted inverse signal covariance)
  – separating the terms that depend on the signal from those independent of the noise recovers the Wiener filter
• the same construction yields a probabilistic filter bank and a probabilistic STFT
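Under stationary Gaussian assumptions the posterior mean is the Wiener filter: per frequency, the gain is $S_x/(S_x + S_n)$. A hedged numpy sketch, with made-up power spectra standing in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1024
f = np.fft.rfftfreq(T)

# Illustrative stationary power spectra (assumptions, not from the talk):
# a peaked signal spectrum and flat (white) noise.
S_x = 1.0 / (1e-3 + (f - 0.1) ** 2)
S_n = np.full_like(f, 50.0)

def coloured_noise(spec):
    """Stationary Gaussian noise with (relative) power spectrum `spec`."""
    return np.fft.irfft(np.fft.rfft(rng.standard_normal(T)) * np.sqrt(spec), T)

x = coloured_noise(S_x)          # latent signal
y = x + coloured_noise(S_n)      # noisy observation

# Wiener filter: posterior mean of x given y under the Gaussian model.
gain = S_x / (S_x + S_n)
x_hat = np.fft.irfft(gain * np.fft.rfft(y), T)

# The estimate should beat the raw observation in mean-squared error.
assert np.mean((x_hat - x) ** 2) < np.mean((y - x) ** 2)
```

Note how the gain depends only on the two spectra, matching the slide's point that the filter splits into signal-dependent and noise-dependent parts.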
Time-frequency analysis as inference

• probabilistic models in which inference recovers STFT, filter-bank, and wavelet analysis
  – unifies a number of existing probabilistic time-series models and connects to traditional signal processing
  – can learn the window of the STFT and the frequencies (equivalently, the filter properties)
  – the frequency-shift relationship mimics the classical relationship between these time-frequency representations
• hops/down-sampling and the finite window used correspond to FITC (uniformly spaced pseudo-points) and sparse-covariance approximations
  – rediscover Nyquist in the context of approximate GPs
Probabilistic audio processing pipeline

[Figure: signal (time /ms) = Σ over channels of envelope × carrier; envelope patterns = slow Gaussian process; carriers = bandpass Gaussian noise; mean spectrum shown per channel (freq /kHz)]
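The generative model above – slowly varying positive envelopes multiplying bandpass Gaussian-noise carriers – can be sketched as follows. The channel centre frequencies, bandwidths, and envelope smoothing scale are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
T, fs = 2000, 8000                     # samples and sample rate /Hz
centres = [300.0, 800.0, 2000.0]       # assumed channel centre frequencies /Hz

def slow_envelope(T, scale=200):
    """Positive, slowly varying envelope: smoothed Gaussian noise through exp
    (a crude stand-in for the slow-GP envelope prior)."""
    z = rng.standard_normal(T + 2 * scale)
    kern = np.exp(-0.5 * (np.arange(-2 * scale, 2 * scale + 1) / scale) ** 2)
    z = np.convolve(z, kern / kern.sum(), mode="same")[scale:scale + T]
    return np.exp(3.0 * z)

def bandpass_noise(T, fs, f0, bw=100.0):
    """Carrier: Gaussian white noise filtered to a band around f0."""
    X = np.fft.rfft(rng.standard_normal(T))
    f = np.fft.rfftfreq(T, 1.0 / fs)
    x = np.fft.irfft(X * np.exp(-0.5 * ((f - f0) / bw) ** 2), T)
    return x / x.std()

# signal = sum over channels of envelope * carrier
envelopes = [slow_envelope(T) for _ in centres]
carriers = [bandpass_noise(T, fs, f0) for f0 in centres]
signal = sum(e * c for e, c in zip(envelopes, carriers))
```

Sampling from this generative model is exactly how the texture-synthesis results later in the talk are produced, once the envelope and carrier parameters have been learned from data.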
Inference and Learning

• Key observation – fix the envelopes:
  – the posterior over the carriers is Gaussian
  – the posterior mean is given by an (adaptive) filter
• Leads to MAP estimation of the envelopes (or Hamiltonian MCMC); let $z_{l,t} = \log h_{l,t}$:

$$\mathbf{Z}_{\mathrm{MAP}} = \arg\max_{\mathbf{Z}}\, p(\mathbf{Z}|\mathbf{Y})$$
$$p(\mathbf{Z}|\mathbf{Y}) = \frac{1}{Z}\, p(\mathbf{Z},\mathbf{Y}) = \frac{1}{Z}\int \mathrm{d}\mathbf{X}\, p(\mathbf{Z},\mathbf{Y},\mathbf{X}) = \frac{1}{Z}\, p(\mathbf{Z}) \int \mathrm{d}\mathbf{X}\, p(\mathbf{Y}|\mathbf{A},\mathbf{X})\, p(\mathbf{X})$$

• Compute the integral efficiently using a chain-structured approximation and Kalman smoothing
• Leads to gradient-based optimisation for the transformed amplitudes
• Learning: approximate maximum likelihood, $\theta = \arg\max_{\theta} p(\mathbf{Y}|\theta)$
• NMF: zero-temperature EM, one E-step, initialised with constant envelopes
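The key observation can be made concrete: with the envelopes held fixed, the model is linear-Gaussian in the carriers, so the posterior has a closed form. A single-channel sketch (the envelope, kernel, and noise level here are illustrative choices; the talk uses Kalman smoothing rather than this dense solve):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
t = np.arange(T)

# Fixed (known) envelope h and a squared-exponential carrier prior x ~ N(0, K).
h = 1.0 + 0.8 * np.sin(2 * np.pi * t / T)
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 5.0) ** 2) + 1e-6 * np.eye(T)
sigma2 = 0.1

# Simulate data from y = h * x + noise.
x = np.linalg.cholesky(K) @ rng.standard_normal(T)
y = h * x + np.sqrt(sigma2) * rng.standard_normal(T)

# Gaussian posterior over the carrier: its mean is a linear (adaptive) filter
# of y whose coefficients depend on the envelope h.
H = np.diag(h)
S = H @ K @ H.T + sigma2 * np.eye(T)
x_mean = K @ H.T @ np.linalg.solve(S, y)

# The posterior mean should reduce squared error relative to the prior mean.
assert np.mean((x_mean - x) ** 2) < np.mean(x ** 2)
```

The chain-structured approximation replaces the $O(T^3)$ solve above with an $O(T)$ Kalman smoother, which is what makes the approach practical for long signals.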
Audio modelling

[Figure: spectrograms (frequency /kHz vs time /s) of natural sound textures – fire, stream, wind, rain, tent-zip, footstep. Turner, 2010]
Statistical texture synthesis

• Old approach: build detailed physical models (e.g. of rain drops)
• New approach:
  – train the model on your favourite texture
  – sample from the prior, and then from the likelihood
• The waveform is unique, but statistically matched to the original
• Often perceptually indistinguishable
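The train-then-sample idea can be illustrated with a much cruder stand-in: matching only the mean amplitude spectrum of a target already produces a unique waveform with the same second-order statistics (the talk's model matches far richer envelope statistics than this). The bandpass "texture" below is synthetic, standing in for a recording:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 4096
f = np.fft.rfftfreq(T)

# Stand-in "texture" to train on: bandpass Gaussian noise.
target = np.fft.irfft(np.fft.rfft(rng.standard_normal(T)) *
                      np.exp(-0.5 * ((f - 0.1) / 0.02) ** 2), T)

# "Train": estimate the amplitude spectrum of the target.
amp = np.abs(np.fft.rfft(target))

# "Sample": impose that spectrum on fresh white noise. The waveform is new,
# but its second-order statistics match the original.
Z = np.fft.rfft(rng.standard_normal(T))
sample = np.fft.irfft(amp * Z / (np.abs(Z) + 1e-12), T)
```

Perceptually convincing textures need the envelope statistics too, which is exactly what the probabilistic pipeline above captures and this spectrum-only sketch does not.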
Audio denoising

[Figure: denoising results – SNR improvement /dB vs SNR before /dB; PESQ improvement vs PESQ before; SNR log-spec improvement /dB vs SNR log-spec before /dB; example waveforms y_t (time /ms). Methods compared: NMF, tNMF, GTF, GTF-tNMF, adapted filters, unadapted filters, Wiener, spectral subtraction, block thresholding]
Audio missing data imputation

[Figure: imputation results vs missing-region length /ms – SNR /dB, PESQ, SNR log-spec /dB; example reconstructed waveforms y_t (time /ms). Methods compared: tNMF, GTF, GTF-tNMF, unadapted filters, adapted filters]
Unifying classical and probabilistic audio signal processing

[Figure: diagram linking probabilistic signal processing (robustness, adaptation) and classical signal processing (fast methods, important variables): spectrogram, filter bank & Hilbert, STFT, amplitudes, frequency shift, estimation; related work: Cemgil & Godsill, Qi & Minka]
Additional slides