Gaussian Processes for Audio Feature Extraction
Dr. Richard E. Turner ([email protected])
Computational and Biological Learning Lab, Department of Engineering
University of Cambridge
Machine hearing pipeline

[Figure: signal (T samples) → time-frequency (TF) analysis (non-linear: short-time Fourier transform / spectrogram, or wavelet / filter bank; T' × D > T samples) → probabilistic model (HMM: speech recognition; NMF: source separation, denoising; ICA: source separation, denoising)]
Problems with conventional pipeline

[Figure: signal (T samples) → time-frequency (TF) analysis (non-linear; T' × D > T samples) → probabilistic model]

• noise (source mixtures) is hard to model in the TF domain (hard to propagate uncertainty – noise/missing data – from the signal to the TF domain)
• hard to enforce/learn the dependencies intrinsic to the TF analysis (valid coefficients lie in the image of the injective mapping)
• learning based on the time-frequency representation ignores the Jacobian
• hard to adapt both the top and bottom layers
Goal of this talk

[Figure: signal (T samples) → time-frequency (TF) analysis (non-linear; T' × D > T samples) → probabilistic model]

• probabilise time-frequency analysis: construct a generative model in which inference corresponds to classical time-frequency analysis (classical signal processing)
• build a hierarchical model that incorporates the downstream processing module (machine learning)
A typical audio pipeline

[Figure: signal y (time /s) → window → Fourier transform (short-time Fourier transform) → magnitude → spectrogram (frequency /kHz vs time /s) → NMF]
What form of generative model corresponds to the STFT?

desire: the expected value of the latent time-frequency coefficients $s_{d,1:T}$ equals the STFT

• assume $y$ is formed by a (weighted) superposition of band-limited signals $s_{d,1:T}$
• linearity of inference can be assured by setting the distributions of each $s_{d,1:T}$ and the noise to be Gaussian
• time-invariance $\Rightarrow$ the generative model is statistically stationary

$\Rightarrow$ GP prior over STFT coefficients, $p(s_{d,1:T}) = \mathcal{G}(s_{d,1:T};\, 0, \Gamma)$, with $\Gamma$ stationary:

$$\Gamma_{t,t'} \approx \sum_{k=1}^{T} \mathrm{FT}^{-1}_{t,k}\, \gamma_k\, \mathrm{FT}_{k,t'}, \qquad \text{where } \mathrm{FT}_{k,t} = e^{-2\pi i (k-1)(t-1)/T}$$
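A covariance of this form is circulant: it is diagonalised by the DFT with eigenvalues $\gamma_k$, which is exactly what makes the prior stationary. A minimal numpy sketch (the spectral weights `gamma` are an illustrative choice, not taken from the talk) verifies that structure:

```python
import numpy as np

T = 32
k = np.arange(T)

# Illustrative spectral weights gamma_k: a symmetric pair of bumps,
# gamma_k = gamma_{T-k}, so the resulting covariance is real.
gamma = (np.exp(-0.5 * ((k - 8) / 2.0) ** 2)
         + np.exp(-0.5 * ((k - (T - 8)) / 2.0) ** 2))

# DFT matrix FT_{k,t} = exp(-2*pi*i*k*t/T) (0-indexed here) and its inverse.
FT = np.exp(-2j * np.pi * np.outer(k, k) / T)
FTinv = np.conj(FT).T / T

# Gamma = FT^{-1} diag(gamma) FT is circulant: Gamma[t, t'] depends only on
# (t - t') mod T, i.e. the GP prior is stationary.
Gamma = (FTinv @ np.diag(gamma) @ FT).real

# Check stationarity: every row is a cyclic shift of the first.
row0 = Gamma[0]
assert all(np.allclose(Gamma[t], np.roll(row0, t)) for t in range(T))
```

The same diagonalisation is what later makes inference cheap: operations with Γ reduce to FFTs and per-frequency scalings.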
Time-frequency analysis as inference

• generation: complex sinusoids modulated by time-varying (complex) coefficients
• inference: the most probable coefficients given the signal form the STFT
  – STFT window = prior covariance (frequency-shifted inverse signal covariance)
  – separating the terms that depend on the signal from those independent of the noise recovers the Wiener filter
• the same construction yields a probabilistic filter bank and a probabilistic STFT
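Under stationary Gaussian assumptions the posterior mean is the Wiener filter: per frequency, the gain is $S_x/(S_x + S_n)$. A hedged numpy sketch, with made-up power spectra standing in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1024
f = np.fft.rfftfreq(T)

# Illustrative stationary power spectra (assumptions, not from the talk):
# a peaked signal spectrum and flat (white) noise.
S_x = 1.0 / (1e-3 + (f - 0.1) ** 2)
S_n = np.full_like(f, 50.0)

def coloured_noise(spec):
    """Stationary Gaussian noise with (relative) power spectrum `spec`."""
    return np.fft.irfft(np.fft.rfft(rng.standard_normal(T)) * np.sqrt(spec), T)

x = coloured_noise(S_x)          # latent signal
y = x + coloured_noise(S_n)      # noisy observation

# Wiener filter: posterior mean of x given y under the Gaussian model.
gain = S_x / (S_x + S_n)
x_hat = np.fft.irfft(gain * np.fft.rfft(y), T)

# The estimate should beat the raw observation in mean-squared error.
assert np.mean((x_hat - x) ** 2) < np.mean((y - x) ** 2)
```

Note how the gain depends only on the two spectra, matching the slide's point that the filter splits into signal-dependent and noise-dependent parts.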
Time-frequency analysis as inference

• probabilistic models in which inference recovers STFT, filter-bank, and wavelet analysis
  – unifies a number of existing probabilistic time-series models and connects to traditional signal processing
  – can learn the window of the STFT and the frequencies (equivalently, the filter properties)
  – the frequency-shift relationship mimics the classical relationship between these time-frequency representations
• hops/down-sampling and the finite window used correspond to FITC (uniformly spaced pseudo-points) and sparse-covariance approximations
  – rediscover Nyquist in the context of approximate GPs
Probabilistic audio processing pipeline

[Figure: signal (time /ms) = Σ over channels of envelope × carrier; envelope patterns = slow Gaussian process; carriers = bandpass Gaussian noise; mean spectrum shown per channel (freq /kHz)]
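The generative model above – slowly varying positive envelopes multiplying bandpass Gaussian-noise carriers – can be sketched as follows. The channel centre frequencies, bandwidths, and envelope smoothing scale are illustrative assumptions, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
T, fs = 2000, 8000                     # samples and sample rate /Hz
centres = [300.0, 800.0, 2000.0]       # assumed channel centre frequencies /Hz

def slow_envelope(T, scale=200):
    """Positive, slowly varying envelope: smoothed Gaussian noise through exp
    (a crude stand-in for the slow-GP envelope prior)."""
    z = rng.standard_normal(T + 2 * scale)
    kern = np.exp(-0.5 * (np.arange(-2 * scale, 2 * scale + 1) / scale) ** 2)
    z = np.convolve(z, kern / kern.sum(), mode="same")[scale:scale + T]
    return np.exp(3.0 * z)

def bandpass_noise(T, fs, f0, bw=100.0):
    """Carrier: Gaussian white noise filtered to a band around f0."""
    X = np.fft.rfft(rng.standard_normal(T))
    f = np.fft.rfftfreq(T, 1.0 / fs)
    x = np.fft.irfft(X * np.exp(-0.5 * ((f - f0) / bw) ** 2), T)
    return x / x.std()

# signal = sum over channels of envelope * carrier
envelopes = [slow_envelope(T) for _ in centres]
carriers = [bandpass_noise(T, fs, f0) for f0 in centres]
signal = sum(e * c for e, c in zip(envelopes, carriers))
```

Sampling from this generative model is exactly how the texture-synthesis results later in the talk are produced, once the envelope and carrier parameters have been learned from data.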
Inference and Learning

• Key observation – fix the envelopes:
  – the posterior over the carriers is Gaussian
  – the posterior mean is given by an (adaptive) filter
• Leads to MAP estimation of the envelopes (or Hamiltonian MCMC); let $z_{l,t} = \log h_{l,t}$:

$$\mathbf{Z}_{\mathrm{MAP}} = \arg\max_{\mathbf{Z}}\, p(\mathbf{Z}|\mathbf{Y})$$
$$p(\mathbf{Z}|\mathbf{Y}) = \frac{1}{Z}\, p(\mathbf{Z},\mathbf{Y}) = \frac{1}{Z}\int \mathrm{d}\mathbf{X}\, p(\mathbf{Z},\mathbf{Y},\mathbf{X}) = \frac{1}{Z}\, p(\mathbf{Z}) \int \mathrm{d}\mathbf{X}\, p(\mathbf{Y}|\mathbf{A},\mathbf{X})\, p(\mathbf{X})$$

• Compute the integral efficiently using a chain-structured approximation and Kalman smoothing
• Leads to gradient-based optimisation for the transformed amplitudes
• Learning: approximate maximum likelihood, $\theta = \arg\max_{\theta} p(\mathbf{Y}|\theta)$
• NMF: zero-temperature EM, one E-step, initialised with constant envelopes
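The key observation can be made concrete: with the envelopes held fixed, the model is linear-Gaussian in the carriers, so the posterior has a closed form. A single-channel sketch (the envelope, kernel, and noise level here are illustrative choices; the talk uses Kalman smoothing rather than this dense solve):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
t = np.arange(T)

# Fixed (known) envelope h and a squared-exponential carrier prior x ~ N(0, K).
h = 1.0 + 0.8 * np.sin(2 * np.pi * t / T)
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 5.0) ** 2) + 1e-6 * np.eye(T)
sigma2 = 0.1

# Simulate data from y = h * x + noise.
x = np.linalg.cholesky(K) @ rng.standard_normal(T)
y = h * x + np.sqrt(sigma2) * rng.standard_normal(T)

# Gaussian posterior over the carrier: its mean is a linear (adaptive) filter
# of y whose coefficients depend on the envelope h.
H = np.diag(h)
S = H @ K @ H.T + sigma2 * np.eye(T)
x_mean = K @ H.T @ np.linalg.solve(S, y)

# The posterior mean should reduce squared error relative to the prior mean.
assert np.mean((x_mean - x) ** 2) < np.mean(x ** 2)
```

The chain-structured approximation replaces the $O(T^3)$ solve above with an $O(T)$ Kalman smoother, which is what makes the approach practical for long signals.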
Audio modelling

[Figure: spectrograms (frequency /kHz vs time /s) of natural sound textures – fire, stream, wind, rain, tent-zip, footstep. Turner, 2010]
Statistical texture synthesis

• Old approach: build detailed physical models (e.g. of rain drops)
• New approach:
  – train the model on your favourite texture
  – sample from the prior, and then from the likelihood
• The waveform is unique, but statistically matched to the original
• Often perceptually indistinguishable
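The train-then-sample idea can be illustrated with a much cruder stand-in: matching only the mean amplitude spectrum of a target already produces a unique waveform with the same second-order statistics (the talk's model matches far richer envelope statistics than this). The bandpass "texture" below is synthetic, standing in for a recording:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 4096
f = np.fft.rfftfreq(T)

# Stand-in "texture" to train on: bandpass Gaussian noise.
target = np.fft.irfft(np.fft.rfft(rng.standard_normal(T)) *
                      np.exp(-0.5 * ((f - 0.1) / 0.02) ** 2), T)

# "Train": estimate the amplitude spectrum of the target.
amp = np.abs(np.fft.rfft(target))

# "Sample": impose that spectrum on fresh white noise. The waveform is new,
# but its second-order statistics match the original.
Z = np.fft.rfft(rng.standard_normal(T))
sample = np.fft.irfft(amp * Z / (np.abs(Z) + 1e-12), T)
```

Perceptually convincing textures need the envelope statistics too, which is exactly what the probabilistic pipeline above captures and this spectrum-only sketch does not.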
Audio denoising

[Figure: denoising results – SNR improvement /dB vs SNR before /dB; PESQ improvement vs PESQ before; SNR log-spec improvement /dB vs SNR log-spec before /dB; example waveforms y_t (time /ms). Methods compared: NMF, tNMF, GTF, GTF-tNMF, adapted filters, unadapted filters, Wiener, spectral subtraction, block thresholding]
Audio missing data imputation

[Figure: imputation results vs missing-region length /ms – SNR /dB, PESQ, SNR log-spec /dB; example reconstructed waveforms y_t (time /ms). Methods compared: tNMF, GTF, GTF-tNMF, unadapted filters, adapted filters]
Unifying classical and probabilistic audio signal processing

[Figure: diagram linking probabilistic signal processing (robustness, adaptation) and classical signal processing (fast methods, important variables): spectrogram, filter bank & Hilbert, STFT, amplitudes, frequency shift, estimation; related work: Cemgil & Godsill, Qi & Minka]
Additional slides