Introduction
Power-Normalized Cepstral Coefficients (PNCC)
Robust Feature Extraction for Automatic Speech Recognition based on Data-driven and
Physiologically-motivated Approaches
Mark J. Harvilla1, Chanwoo Kim2 and Richard M. Stern1,2
1Electrical and Computer Engineering Department and 2Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA
Compensatory Spectral Averaging and Warping using Histograms (CSAWH)
It is well known that the accuracy of automatic speech recognition systems is compromised in high-noise environments. In contrast, humans have a remarkable ability to accurately recognize continuous speech fairly independently of the environment. This observation might imply that the robustness of ASR systems can be increased by exploiting principles and adopting characteristic mechanisms of the human auditory system (HAS).
A contrasting, and quite possibly complementary, approach is to confront the problem from a statistical standpoint. By designing algorithms with statistical optimality in mind, robust systems can be built that are effective, but that do not necessarily adhere to any physiological mechanism.
Objectives
Our general objectives are to:
• Develop portable front-end features that generally improve the robustness of speech-based systems, most specifically ASR, but conceivably helpful for any fundamentally speech-based system such as voice-activity detectors, speaker recognizers, keyword spotters, and so on
• Maintain generality of the features so that they are independent of ASR systems, tasks, and other adaptation and normalization techniques
• Design the features with the overall objective of reducing the mismatch between training and testing data
Suppression of Slowly-varying components and the Falling edge (SSF)
Selected Experimental Results
[PNCC block diagram] Input audio → Pre-emphasis → STFT → Magnitude squared → Gammatone filter bank → Peak power normalization → Medium-duration power bias subtraction → Power function nonlinearity → DCT & CMN → Feature
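As a concrete illustration of the power function nonlinearity stage of the PNCC pipeline, the sketch below applies power-law compression to subband power. The 1/15 exponent is the value commonly reported for PNCC, but it is an assumption here rather than a parameter stated on this poster:

```python
import numpy as np

# Sketch of the power function nonlinearity used in place of log
# compression. The exponent 1/15 is the commonly reported PNCC value
# (an assumption here); unlike the log, the power law stays bounded
# as the subband power approaches zero.
def power_law(subband_power, exponent=1.0 / 15.0):
    return np.power(np.maximum(subband_power, 0.0), exponent)
```

This rate-level compression approximates the auditory-nerve nonlinearity mentioned in the PNCC property list below.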
[SSF block diagram] Input audio → Pre-emphasis → STFT → Magnitude squared → Gammatone filter bank → SSF processing → Spectral reshaping → Inverse STFT → Post de-emphasis → Output audio
[CSAWH block diagram] Input audio → Pre-emphasis → STFT → Magnitude squared → Gammatone filter bank → Peak power normalization → Histogram matching → Power function nonlinearity → Audio resynthesis → Output audio
The audio resynthesis blocks effectively resynthesize audio; this helps to smooth spectral discontinuities introduced by the nonlinear processing.
M[m,l] = λ·M[m−1,l] + (1 − λ)·P[m,l]
P1[m,l] = max(P[m,l] − M[m,l], c0·P[m,l])   (Type-I)
P2[m,l] = max(P[m,l] − M[m,l], c0·M[m,l])   (Type-II)
The figure below illustrates the effect of SSF to emphasize onsets in reverberation:
• SSF is based on the precedence effect, which is the tendency of the HAS to focus on the first arriving wave front of a given sound source.
• By emphasizing onsets, the spectral smearing effect of reverberation can be partially counteracted.
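The SSF equations above can be sketched in code as follows; the forgetting factor `lam` and floor `c0`, as well as the initialization of the recursion, are illustrative assumptions rather than the authors' tuned parameters:

```python
import numpy as np

# Minimal sketch of SSF, assuming a (num_frames x num_channels)
# matrix P of gammatone subband power.
def ssf(P, lam=0.4, c0=0.01, type2=False):
    M = np.empty_like(P)
    M[0] = P[0]  # assumed initialization for the recursion
    for m in range(1, P.shape[0]):
        # Lowpass (slowly varying) component:
        # M[m,l] = lam * M[m-1,l] + (1 - lam) * P[m,l]
        M[m] = lam * M[m - 1] + (1 - lam) * P[m]
    # Subtracting M highpasses the power, emphasizing onsets;
    # Type-I floors the result with c0*P, Type-II with c0*M.
    floor = c0 * (M if type2 else P)
    return np.maximum(P - M, floor)
```

Subtracting the lowpassed power M suppresses slowly varying components and falling edges, which is what counteracts the spectral smearing of reverberation.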
• PNCC combines multiple properties of the HAS including:
• Knowledge of the shape of the effective auditory filters related to cochlear response
• The precedence effect (see SSF below)
• The rate-level nonlinearity
• PNCC applies power bias subtraction motivated by the mismatch in the AM-GM ratio between clean and noisy speech.
• The AM-GM ratio is related to the shape parameter of the Gamma distribution, which characterizes well the distribution of linear power of speech.
• The AM-GM ratio can also be used for blind SNR measurement (Waveform-amplitude distribution analysis, WADA).
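The AM-GM ratio is straightforward to compute from linear subband power. The sketch below (function name and epsilon flooring are assumptions) illustrates why additive noise lowers it: a noise floor raises the geometric mean relative to the arithmetic mean.

```python
import numpy as np

# Sketch: arithmetic-to-geometric mean (AM-GM) ratio of linear
# subband power. Clean speech power is heavy-tailed, giving a large
# ratio; an additive noise floor pushes the ratio toward 1.
def am_gm_ratio(power, eps=1e-12):
    power = np.asarray(power, dtype=float) + eps  # avoid log(0)
    am = power.mean()
    gm = np.exp(np.log(power).mean())
    return am / gm
```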
The figures to the left illustrate clean and noisy subband speech power signals with and without PNCC processing (bottom and top, respectively).
Weighted spectral averaging
• CSAWH is based on the observation that noise significantly alters the characteristic distribution of subband speech power
• CSAWH applies nonparametric transformations to match the distribution of the input speech to prototype distributions observed over clean reference data
• The nonlinear transformations do not inherently discriminate between speech and noise; weighted spectral averaging helps to mitigate sporadic suppression of speech or amplification of noise
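Histogram matching of the kind CSAWH relies on can be sketched as a nonparametric quantile mapping; the implementation below is an illustrative stand-in for the authors' algorithm, with synthetic reference data standing in for statistics collected over clean speech:

```python
import numpy as np

# Sketch of nonparametric histogram (quantile) matching: map each
# noisy subband power value to the clean-reference value at the same
# empirical quantile.
def histogram_match(noisy, clean_ref):
    noisy = np.asarray(noisy, dtype=float)
    ref_sorted = np.sort(np.asarray(clean_ref, dtype=float))
    # Rank of each noisy sample -> empirical quantile in [0, 1]
    ranks = noisy.argsort().argsort()
    q = ranks / max(len(noisy) - 1, 1)
    # Look up the clean reference value at the same quantile
    return np.interp(q, np.linspace(0.0, 1.0, len(ref_sorted)), ref_sorted)
```

Because this mapping is blind to whether a sample is speech or noise, the weighted spectral averaging described above is needed to mitigate sporadic suppression of speech or amplification of noise.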
The above plots show results from CMU Sphinx-3 on RM1 in white noise. Below, results are depicted for RATS-like noise on the SRI DECIPHER ASR. For both cases, the left column shows clean training and the right column shows multistyle training.
Below, results for SSF are compared to other standard feature extraction algorithms. The left plot shows results for clean speech in reverberation and the right plot shows results for speech in music noise. The differences between SSF Type-I and Type-II are pronounced in reverberation, but insignificant in the other case.
[Figure: distributions of subband power at the output of the 20th gammatone filter (CF: 1.25 kHz) for clean speech and at SNRs of 0, 5, 10, 15, and 20 dB.] The figure illustrates the effect of the noise on the distribution of the subband speech power.