Kanru Hua (IE598 Final Presentation)
Combining Auditory Preprocessing and Bayesian Estimation for Robust Formant Tracking
(Gläser et al., 2010)
Background
● In the context of speech processing, formants are the resonances of the vocal tract.
● Formant frequencies have a close link to vowel quality.
● Applications: speech recognition/synthesis, speech enhancement, hearing aids, language learning tools, ...
[Figure: spectrogram (frequency vs. time) of the utterance "Author of the danger trail, ..."]
Architecture (simplified)
[Figure: system block diagram. Blocks: Speech Signal, Auditory Filterbank, Enhancement, Gender Detection, Bayesian Mixture Filtering, Adaptive Frequency-Range Segmentation, and one Bayesian Smoothing stage per formant F1, F2, ..., FN]
Bayesian Filtering
● Think of it as a generalized version of the Kalman filter.
● Define the belief/message as a posterior probability.
[Figure: state-space model. Hidden states x1, x2, x3, x4 (formant freqs.) emit observations y1, y2, y3, y4 (filterbank output); arrows annotated (predict) and (update)]
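Written out in standard Bayesian-filtering notation (a standard formulation, not copied from the slide), the two steps are:

```latex
% Predict: push the previous posterior through the transition model
p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid y_{1:t-1})\, \mathrm{d}x_{t-1}

% Update: fold in the new observation via Bayes' rule
p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})
```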
Bayesian Filtering
● Formants are not normally distributed, so the Kalman filter won’t work.
● Particle filtering (non-parametric) does not guarantee that multi-modality is maintained: resampling tends to collapse all particles onto a single dominant mode.
● This leads us to mixture filtering, a technique borrowed from the computer vision community.
Bayesian Mixture Filtering
● The belief is modeled as a mixture, with each component corresponding to one formant.
● The component weights are found as in Vermaak et al. (2003).
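In the formulation of Vermaak et al. (2003), the posterior is a weighted mixture whose components are filtered independently and interact only through their weights; in my own notation (the slide's equations are not reproduced here):

```latex
% Mixture posterior: one component per formant
p(x_t \mid y_{1:t}) = \sum_{m=1}^{M} \pi_{m,t}\, p_m(x_t \mid y_{1:t})

% Weight update: prior weight times the component's predictive likelihood
\pi_{m,t} = \frac{\pi_{m,t-1}\, \tilde{w}_{m,t}}{\sum_{n=1}^{M} \pi_{n,t-1}\, \tilde{w}_{n,t}},
\qquad
\tilde{w}_{m,t} = \int p(y_t \mid x_t)\, p_m(x_t \mid y_{1:t-1})\, \mathrm{d}x_t
```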
Bayesian Mixture Filtering
A quick summary:
● Propagation of each component (formant) is independent of the others.
● The re-weighting step is the only place where mixture components interact.
[Figure: target belief and component beliefs at time t]
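This structure can be sketched as a toy 1-D mixture particle filter (an illustration of the idea, not the paper's implementation; the observation model, noise levels, and particle counts below are invented):

```python
import math
import random

random.seed(0)

def gauss_lik(y, x, sigma=0.1):
    # Toy Gaussian observation likelihood p(y | x); sigma is invented.
    return math.exp(-0.5 * ((y - x) / sigma) ** 2)

def mixture_filter_step(components, pis, y, q=0.05):
    """One predict/update step of a mixture particle filter.

    components: one particle list per mixture component (formant).
    pis: mixture weights.  Returns the updated (components, pis).
    """
    new_components, pred_liks = [], []
    for particles in components:
        # Propagation: each component evolves independently (random walk).
        pred = [x + random.gauss(0.0, q) for x in particles]
        # Weight the component's particles by the observation likelihood.
        w = [gauss_lik(y, x) for x in pred]
        s = sum(w)
        pred_liks.append(s / len(pred))  # avg. predictive likelihood
        # Resample within the component only.
        new_components.append(
            random.choices(pred, weights=w, k=len(pred)) if s > 0 else pred)
    # Re-weighting: the only step where the components interact.
    unnorm = [p * l for p, l in zip(pis, pred_liks)]
    z = sum(unnorm) or 1.0
    return new_components, [u / z for u in unnorm]

# Two modes near 0.5 and 1.5; observations favor the second mode.
components = [[random.gauss(0.5, 0.05) for _ in range(200)],
              [random.gauss(1.5, 0.05) for _ in range(200)]]
pis = [0.5, 0.5]
for y in (1.45, 1.50, 1.55):
    components, pis = mixture_filter_step(components, pis, y)
```

After the three observations near 1.5 the second component's weight dominates, but the first component still keeps a full particle set around its own mode instead of migrating, which is exactly the multi-modality the method is meant to preserve.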
Mixture Segmentation
● However, mixture filtering still does not prevent belief diffusion.
(mixture components could become over-general over time as they propagate independently; the order of formants is unconstrained)
● Solution: introduce hard frequency boundaries R1, R2, ..., RM between formants/mixture components.
Mixture Segmentation
● We need to modify & re-weight the component beliefs to enforce these hard boundaries.
● Concretely, set out-of-range probabilities to zero while keeping the target distribution unchanged, in two steps:
– accumulation
– truncation & re-weighting
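One way to write the two steps (my reconstruction, not the slide's own equations; component m's range is taken to be (R_{m-1}, R_m], with the outermost boundaries at the band edges):

```latex
% Accumulation: mass of the target belief inside component m's range
\tilde{\pi}_{m,t} = \int_{R_{m-1}}^{R_m} p(x_t \mid y_{1:t})\, \mathrm{d}x_t

% Truncation & re-weighting: restrict the target belief to the range, renormalize
\tilde{p}_m(x_t) = \frac{1}{\tilde{\pi}_{m,t}}\; p(x_t \mid y_{1:t})\;
                   \mathbf{1}\{R_{m-1} < x_t \le R_m\}

% The target distribution is preserved:
% \sum_m \tilde{\pi}_{m,t}\, \tilde{p}_m(x_t) = p(x_t \mid y_{1:t})
```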
Adaptive Segmentation
● The next step is to determine R1, R2, ..., RM, given the component beliefs before segmentation.
● We run a Viterbi search (in other words, dynamic programming) to find the most likely segmentation.
– State space: assignment of each (discretized) frequency bin to a mixture component
– Allowed transitions keep the component index non-decreasing across frequency; all other transitions have zero probability (disabled)
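A minimal sketch of such a search in Python (an illustration under the stated transition structure, not the paper's implementation; it assumes each frequency bin is assigned to one component and the component index may only stay the same or advance by one, and the toy beliefs are invented):

```python
import math

def segment(beliefs):
    """Viterbi search for frequency-range segmentation.

    beliefs[m][k]: belief of mixture component m at frequency bin k.
    Returns the boundary bins where the assignment switches from
    component m to m+1.  Only "stay" and "advance to the next
    component" transitions are allowed; all others get log-prob -inf.
    """
    M, K = len(beliefs), len(beliefs[0])
    NEG = float("-inf")

    def logp(m, k):
        b = beliefs[m][k]
        return math.log(b) if b > 0 else NEG

    # dp[k][m]: best log-prob of a path ending with bin k in component m.
    dp = [[NEG] * M for _ in range(K)]
    back = [[0] * M for _ in range(K)]
    dp[0][0] = logp(0, 0)          # the lowest bin belongs to component 1
    for k in range(1, K):
        for m in range(M):
            stay = dp[k - 1][m]
            adv = dp[k - 1][m - 1] if m > 0 else NEG
            if stay >= adv:
                dp[k][m], back[k][m] = stay + logp(m, k), m
            else:
                dp[k][m], back[k][m] = adv + logp(m, k), m - 1
    # Backtrace from the top bin, which must belong to the last component.
    path, m = [M - 1], M - 1
    for k in range(K - 1, 0, -1):
        m = back[k][m]
        path.append(m)
    path.reverse()
    return [k for k in range(1, K) if path[k] != path[k - 1]]

# Two components whose (toy) beliefs peak in the lower and upper halves:
print(segment([[0.9] * 5 + [0.1] * 5,
               [0.1] * 5 + [0.9] * 5]))  # -> [5]
```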
Bayesian Smoothing
● So far we have been making predictions based only on past observations (y1, y2, ..., yt); the re-weighted beliefs may still appear ambiguous.
● We mitigate this by also incorporating observations from the reverse direction (when the whole sequence is known in advance), in a fashion similar to the backward pass of a Kalman smoother.
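The backward pass can be written as the standard Bayesian smoothing recursion (a standard formula, not copied from the slide):

```latex
p(x_t \mid y_{1:T}) = p(x_t \mid y_{1:t})
  \int \frac{p(x_{t+1} \mid x_t)\, p(x_{t+1} \mid y_{1:T})}
            {p(x_{t+1} \mid y_{1:t})}\, \mathrm{d}x_{t+1}
```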
Results
● Final formant frequency estimate:
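The estimator equation itself is not shown above; one common choice (an assumption on my part, not necessarily the paper's exact formula) is the mean of each formant's smoothed component belief:

```latex
\hat{F}_{m,t} = \int x_t\, \tilde{p}_m(x_t \mid y_{1:T})\, \mathrm{d}x_t
```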
● Evaluation:
– Tested on 34 and 56 sentences spoken by male and female speakers, respectively
– Added white/babble/car noise at 7 different signal-to-noise ratios
Results
● Compared against other formant tracking approaches (results reported as percentage of error reduction)
● Time delay for real-time tracking: 120 ms on an Intel Q6600 @ 2.4 GHz
Summary
● A two-stage formant tracking method
– First stage: signal processing for feature extraction
– Second stage: Bayesian filtering on features
● Challenge 1 – maintaining multi-modality
– Solution: mixture tracking
● Challenge 2 – belief diffusion
– Solution: adaptive frequency range segmentation
● Post-processing: Bayesian smoothing (backward pass)
● Drawbacks:
– Computationally expensive
– Inevitable time delay for real-time tracking
– Formant continuity not guaranteed
References
● Gläser, Claudius, et al. "Combining Auditory Preprocessing and Bayesian Estimation for Robust Formant Tracking." IEEE Transactions on Audio, Speech & Language Processing 18.2 (2010): 224-236.
● Vermaak, Jaco, Arnaud Doucet, and Patrick Pérez. "Maintaining multimodality through mixture tracking." Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV), 2003.