
Noise Robust Pitch Tracking by Subband Autocorrelation Classification (SAcC)

Byung Suk Lee & Dan Ellis • Columbia University / ICSI • {bsl,dpwe}@ee.columbia.edu

Summary: A neural net classifier is trained to identify the pitch of a frame of subband autocorrelation principal components. Accuracy is greatly improved for noisy, bandlimited speech matched to the training data.

Columbia University • Laboratory for the Recognition and Organization of Speech and Audio (LabROSA)

for Interspeech 2012 • 2012-09-11 • dpwe@ee.columbia.edu

MATLAB code available: http://labrosa.ee.columbia.edu/projects/SAcC/

Background

• “Classic” approaches to pitch tracking reveal time-domain waveform periodicity via autocorrelation or related approaches.

• An example is YIN [de Cheveigné & Kawahara 2002].

Material

• We are specifically interested in pitch tracking for low-quality radio reception data (e.g., hand-held narrow-FM) from the RATS project.

• This data has both high noise/distortion and narrow bandwidth.

Filterbank

• 48 4-pole, 2-zero ERB-scale cochlea filter approximations form auditory-like subbands.
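For illustration, ERB-rate spacing of the 48 center frequencies might look like the following sketch (the 100 Hz–8 kHz span is an assumption, and this computes only center frequencies, not the 4-pole, 2-zero filters themselves):

```python
import numpy as np

def erb_rate(f_hz):
    """Glasberg & Moore ERB-rate scale: frequency in Hz -> ERB-rate units."""
    return 21.4 * np.log10(0.00437 * f_hz + 1.0)

def erb_rate_inv(e):
    """Inverse of erb_rate: ERB-rate units back to Hz."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437

def erb_center_freqs(n_bands=48, f_lo=100.0, f_hi=8000.0):
    """n_bands center frequencies spaced uniformly on the ERB-rate scale."""
    e = np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_bands)
    return erb_rate_inv(e)

cfs = erb_center_freqs()  # 48 auditory-like band centers
```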

Classifier

• The core of our system is a trained multi-layer perceptron (MLP) classifier.

• It takes k (10) PCA coefficients for each of s (48) subband autocorrelations and estimates posteriors over p (67) quantized pitch values.

• The MLP is trained discriminatively on noisy data (with ground truth pitches).

• Discriminative training virtually eliminates octave/suboctave errors seen in peak-based pitch tracking.

• Added interference hurts the autocorrelation, but filtering into subbands can reveal certain bands with better local SNR. Also, the contribution of each subband can be adjusted in later fusion to select sources.

• An example is [Wu, Wang & Brown 2003] (“the Wu algorithm”):

• This work began as an investigation into the peak selection and integration stages of the Wu algorithm. We found that replacing both stages with a trained classifier offered large performance improvements, as well as the chance to train domain-specific pitch trackers.
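The classifier's input/output geometry described above (s·k = 480 features in, p = 67 posteriors out) can be sketched as a single-hidden-layer forward pass; the hidden size and random weights here are placeholders, not the trained MLP:

```python
import numpy as np

rng = np.random.default_rng(0)

S, K, P, H = 48, 10, 67, 100  # subbands, PCA coeffs, pitch classes (66 + "no pitch"), hidden units (H assumed)

# One frame of features: K PCA coefficients per subband, flattened.
x = rng.standard_normal(S * K)

# Stand-in weights; the real network is trained discriminatively on noisy speech.
W1 = rng.standard_normal((H, S * K)) * 0.01
b1 = np.zeros(H)
W2 = rng.standard_normal((P, H)) * 0.01
b2 = np.zeros(P)

h = np.tanh(W1 @ x + b1)                        # hidden layer
z = W2 @ h + b2
post = np.exp(z - z.max()); post /= post.sum()  # softmax posteriors over 67 classes
```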

Training Data

• The pitch classifier needs training data with ground truth. Performance improves when training data is more similar to test data.

• We used pitch-annotated data (Keele), then artificially filtered it and added noise to resemble the target domain.

• We also generated pseudo-ground-truth for target domain data by using only frames where three independent pitch trackers agreed.
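A minimal sketch of this agreement rule (the median-consensus criterion and 5% relative tolerance are illustrative assumptions; the poster does not specify the exact test):

```python
import numpy as np

def pseudo_ground_truth(tracks, tol=0.05):
    """Label a frame with the median pitch when every tracker reports a voiced
    pitch within tol (relative) of that median; reject all other frames.
    tracks: (n_trackers, n_frames) pitch in Hz, 0 = unvoiced."""
    tracks = np.asarray(tracks, dtype=float)
    med = np.median(tracks, axis=0)
    voiced = np.all(tracks > 0, axis=0)
    close = np.all(np.abs(tracks - med) <= tol * np.maximum(med, 1e-9), axis=0)
    agree = voiced & close
    labels = np.where(agree, med, np.nan)  # NaN marks rejected frames
    return labels, agree
```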

[Figure: YIN processing chain]
Sound → Autocorrelation-based difference function → Cumulative normalization and threshold → Local peak selection & interpolation → Pitch tracking using HMM* → Pitch
(* not in standard YIN)

[Figure: Wu algorithm processing chain]
Sound → Auditory Filterbank → Normalized Autocorrelation → Channel/Peak Selection → Cross-channel Integration → Pitch Tracking using HMM → Pitch

GPE = Evv / Nvv
VE  = (Evv + Evu) / Nv
UE  = Euv / Nu
PTE = (VE + UE) / 2

(E = error counts, N = frame counts; subscripts: first letter = ground-truth voicing, second = system voicing; v = voiced, u = unvoiced)

[Figure: PTE, GPE, VE, and UE (%) vs. SNR (−10 to 30 dB) on FDA with radio-band filtering (RBF) and pink noise, comparing YIN, Wu, get_f0, and SAcC]

[Figure: Pseudo-ground-truth labeling — pitch tracks (Freq / Hz, 0–1000) from YIN, Wu, and SWIPE′ and their agreement; frames without agreement are rejected (Pitch Index vs. Frame, 600–1600)]

[Figure: Subband autocorrelogram of clean speech — subband (1–48) vs. lag (0–400 samples)]

Autocorrelation

• Normalized autocorrelation out to 25 ms (400 samples @ 16 kHz) reveals periodic structure in each subband.

• The autocorrelation in each subband is highly constrained by the bandlimited input.

• Pitch information is distributed throughout each autocorrelation row; simple peak-picking ignores most of this.

Principal Component Analysis

• The autocorrelations in a given subband always reflect the center frequency, i.e., they occupy a small part of the 400-D space.

• We use per-subband PCA to reduce each band to 10 coefficients, which preserves almost all the variance.

• PCA bases remain stable when learned from different signal conditions, so are fixed.
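The autocorrelation-plus-PCA front end above can be sketched as follows (the lag range and k = 10 follow the text; the SVD-based PCA is a generic stand-in for the fixed learned bases):

```python
import numpy as np

def normalized_autocorr(x, max_lag=400):
    """Normalized autocorrelation of one subband frame out to max_lag samples
    (25 ms at 16 kHz)."""
    x = x - x.mean()
    full = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
    return full[:max_lag] / (full[0] + 1e-12)

def pca_bases(rows, k=10):
    """Per-subband PCA: learn k bases from many 400-D autocorrelation rows."""
    mu = rows.mean(axis=0)
    _, _, vt = np.linalg.svd(rows - mu, full_matrices=False)
    return mu, vt[:k]             # mean and top-k principal directions

def project(row, mu, bases):
    return bases @ (row - mu)     # 400-D row -> k coefficients
```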

Hidden Markov Model Smoothing

• The MLP generates posterior probabilities across 66 pitch candidates + “no pitch” for every 10 ms time frame.

• These become a single pitch track via Viterbi decoding through an HMM with pitch states.

• Transition probabilities are set parametrically (pitch-invariant) and tuned empirically.
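A sketch of the decoding step, assuming a simple Gaussian-on-pitch-jump parametric transition model (the spread and voicing-transition probabilities are illustrative placeholders, not the empirically tuned values):

```python
import numpy as np

def pitch_transition_matrix(n_pitch=66, spread=2.0, p_self_unvoiced=0.9):
    """Pitch-invariant transitions: a Gaussian on pitch-index jumps, plus one
    'no pitch' state (index n_pitch). Parameter values are illustrative."""
    n = n_pitch + 1
    d = np.arange(n_pitch)
    gauss = np.exp(-0.5 * (d[:, None] - d[None, :]) ** 2 / spread ** 2)
    A = np.zeros((n, n))
    A[:n_pitch, :n_pitch] = gauss
    A[:n_pitch, n_pitch] = 0.05                       # voiced -> unvoiced
    A[n_pitch, :n_pitch] = (1 - p_self_unvoiced) / n_pitch
    A[n_pitch, n_pitch] = p_self_unvoiced
    return A / A.sum(axis=1, keepdims=True)           # rows sum to 1

def viterbi(post, trans):
    """Best state path through frame-wise posteriors (n_frames, n_states)
    under transition matrix trans (n_states, n_states)."""
    T, N = post.shape
    logp, logt = np.log(post + 1e-12), np.log(trans + 1e-12)
    delta = logp[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logt                # (from, to)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logp[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```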

Evaluation

• We tested on the pitch-annotated FDA data with radio-band filtering and pink noise added at a range of SNRs.

• SAcC trained on Keele data (with similar corruption) substantially outperformed other pitch trackers.

• Later experiments show that SAcC can generalize to mismatched train/test scenarios.

• GPE, the most common pitch tracking metric, only considers frames where both ground truth and system report a pitch value. This rewards “punting” on difficult frames.

• We define VE as the error rate over all true-voiced frames, and UE over all true-unvoiced frames.

• We evaluate by PTE, the mean of VE and UE.
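Under these definitions, the metrics can be computed as follows (the 20% gross-error threshold is the common convention, assumed here):

```python
import numpy as np

def pitch_metrics(ref_f0, est_f0, tol=0.2):
    """GPE, VE, UE, PTE from reference and estimated f0 tracks (Hz, 0 = unvoiced);
    a gross error is a pitch deviation beyond tol (20%) on a both-voiced frame."""
    ref, est = np.asarray(ref_f0, float), np.asarray(est_f0, float)
    v_ref, v_est = ref > 0, est > 0
    Nv, Nu = v_ref.sum(), (~v_ref).sum()     # true-voiced / true-unvoiced frames
    both = v_ref & v_est
    Nvv = both.sum()                         # frames voiced in both
    Evv = (both & (np.abs(est - ref) > tol * ref)).sum()  # gross pitch errors
    Evu = (v_ref & ~v_est).sum()             # voiced frames called unvoiced
    Euv = (~v_ref & v_est).sum()             # unvoiced frames called voiced
    gpe = Evv / max(Nvv, 1)
    ve = (Evv + Evu) / max(Nv, 1)
    ue = Euv / max(Nu, 1)
    return dict(GPE=gpe, VE=ve, UE=ue, PTE=(ve + ue) / 2)
```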

[Figure: Radio-band filter (RBF) response (Gain / dB) and filtered-speech spectrogram (Freq / kHz vs. Time / s)]

[Figure: Pitch “scores” from YIN, Wu, and SAcC (Pitch Index vs. time / s)]

References

A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” J. Acoust. Soc. Am., 111(4):1917–1930, April 2002.

M. Wu, D.L. Wang, and G.J. Brown, “A multipitch tracking algorithm for noisy speech,” IEEE Tr. Speech and Audio Proc., 11(3):229–241, May 2003.

[Score-figure panels: 1 − YIN similarity, Wu likelihood, and SAcC log MLP output]
