
Robust Feature Extraction for Automatic Speech Recognition based on Data-driven and Physiologically-motivated Approaches

Mark J. Harvilla (1), Chanwoo Kim (2), and Richard M. Stern (1,2)
(1) Electrical and Computer Engineering Department and (2) Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA

Introduction


It is well known that the accuracy of automatic speech recognition systems is compromised in high-noise environments. In contrast, humans have a remarkable ability to accurately recognize continuous speech fairly independently of the environment. This observation might imply that the robustness of ASR systems can be increased by exploiting principles and adopting characteristic mechanisms of the human auditory system (HAS).

A contrasting, and quite possibly complementary, approach is to confront the problem from a statistical standpoint. By designing algorithms with statistical optimality in mind, robust systems can be built that are effective, but that do not necessarily adhere to any physiological mechanism.

Objectives

Our general objectives are to:

• Develop portable front-end features that work generally to improve the robustness of speech-based systems, most specifically ASR, but conceivably helpful for any fundamentally speech-based system such as voice-activity detectors, speaker recognizers, keyword spotters, and so on.

• Maintain generality of the features so that they are independent of ASR systems, tasks, and other adaptation and normalization techniques.

• Design the features with the overall objective of reducing the mismatch between training and testing data in mind.

Suppression of Slowly-varying components and the Falling edge (SSF)

Selected Experimental Results

[Block diagram: PNCC feature extraction] Input audio → Pre-emphasis → STFT → Magnitude squared → Gammatone filter bank → Peak power normalization → Medium-duration power bias subtraction → Power function nonlinearity → DCT & CMN → Feature

[Block diagram: SSF processing] Input audio → Pre-emphasis → STFT → Magnitude squared → Gammatone filter bank → SSF processing → Spectral reshaping → Inverse STFT → Post de-emphasis → Output audio

[Block diagram: CSAWH processing] Input audio → Pre-emphasis → STFT → Magnitude squared → Gammatone filter bank → Peak power normalization → Histogram matching → Power function nonlinearity → Audio resynthesis → Output audio

The final processing blocks of the SSF and CSAWH pipelines effectively resynthesize audio; this helps to smooth spectral discontinuities introduced by nonlinear processing.
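For illustration, the "power function nonlinearity" block in the PNCC and CSAWH pipelines replaces the log compression used in MFCC. A minimal sketch; the 1/15 exponent is the value commonly associated with PNCC and is used here only as an illustrative default:

```python
import numpy as np

def power_nonlinearity(power, exponent=1.0 / 15):
    """Power-law compression of subband power, approximating the
    auditory rate-level nonlinearity; unlike log compression, it
    stays bounded and well-behaved as the power approaches zero."""
    return np.asarray(power, dtype=float) ** exponent
```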

M[m,l] = λ M[m-1,l] + (1 - λ) P[m,l]
P1[m,l] = max(P[m,l] - M[m,l], c0 P[m,l])   (SSF Type-I)
P2[m,l] = max(P[m,l] - M[m,l], c0 M[m,l])   (SSF Type-II)
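A minimal NumPy sketch of SSF processing per these equations, assuming P is a [frames × channels] array of gammatone subband power; the values of λ and c0 here are illustrative, not the tuned values from the original work:

```python
import numpy as np

def ssf(P, lam=0.4, c0=0.01, type2=False):
    """Suppression of Slowly-varying components and the Falling edge.

    M[m] is a first-order lowpass (slowly varying) estimate of the
    subband power P[m]; subtracting it emphasizes onsets.  Type-I
    floors the result at c0*P, Type-II at c0*M.
    """
    P = np.asarray(P, dtype=float)
    M = np.empty_like(P)
    M[0] = P[0]
    for m in range(1, P.shape[0]):
        M[m] = lam * M[m - 1] + (1 - lam) * P[m]
    floor = c0 * (M if type2 else P)
    return np.maximum(P - M, floor)
```

At an abrupt onset P - M is large, so the onset frame passes nearly unchanged, while steady-state (and reverberant-tail) energy is suppressed toward the floor.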

The figure below illustrates the effect of SSF in emphasizing onsets in reverberation.

• SSF is based on the precedence effect, which is the tendency of the HAS to focus on the first arriving wave front of a given sound source.

• By emphasizing onsets, the spectral smearing effect of reverberation can be partially counteracted.
Power-Normalized Cepstral Coefficients (PNCC)

• PNCC combines multiple properties of the HAS including:

• Knowledge of the shape of the effective auditory filters related to cochlear response

• The precedence effect (see SSF below)

• The rate-level nonlinearity

• PNCC imposes power bias subtraction motivated by the mismatch in the AM-GM (arithmetic-to-geometric mean) ratio between clean and noisy speech.

• The AM-GM ratio is related to the shape parameter of the Gamma distribution, which characterizes well the distribution of linear power of speech.

• The AM-GM ratio can also be used for blind SNR measurement (Waveform-amplitude distribution analysis, WADA).
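The connection between the AM-GM ratio and the Gamma shape parameter can be made explicit (a standard property of the Gamma distribution, with ψ the digamma function): for X ~ Gamma(k, θ),

```latex
\log\frac{\mathbb{E}[X]}{\exp\!\left(\mathbb{E}[\log X]\right)}
  = \log(k\theta) - \left(\psi(k) + \log\theta\right)
  = \log k - \psi(k)
```

so the log AM-GM ratio depends only on the shape parameter k and not on the scale θ, which is why it tracks the change in distribution shape caused by noise.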

The figures to the left illustrate clean and noisy subband speech power signals with and without PNCC processing (bottom and top, respectively).
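The AM-GM ratio used above can be computed directly from subband power samples. A minimal sketch in log form; the small floor eps is an implementation convenience, not part of the original method:

```python
import numpy as np

def log_am_gm_ratio(power, eps=1e-12):
    """Log of (arithmetic mean / geometric mean) of nonnegative power.

    By the AM-GM inequality this is >= 0; it is near 0 for nearly
    constant power and grows as the distribution becomes peakier,
    which is what separates clean speech from noisy speech."""
    p = np.asarray(power, dtype=float) + eps
    return float(np.log(p.mean()) - np.log(p).mean())
```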

Compensatory Spectral Averaging and Warping using Histograms (CSAWH)

Weighted spectral averaging

• CSAWH is based on the observation that noise significantly alters the characteristic distribution of subband speech power

• CSAWH applies nonparametric transformations to match the distribution of the input speech to prototype distributions observed over clean reference data

• The nonlinear transformations do not inherently discriminate between speech and noise; weighted spectral averaging helps to mitigate sporadic suppression of speech or amplification of noise
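A minimal sketch of the nonparametric warping step, assuming per-subband histogram (quantile) matching against clean reference data; the weighted spectral averaging stage is omitted, and the function name is ours, not from the original work:

```python
import numpy as np

def histogram_match(x, ref):
    """Warp samples x so their empirical distribution matches ref's.

    Each sample is replaced by the value at the same empirical
    quantile in the reference data (rank-based quantile mapping).
    The mapping is monotonic, so the ordering of x is preserved."""
    x = np.asarray(x, dtype=float)
    ref = np.asarray(ref, dtype=float)
    ranks = np.argsort(np.argsort(x))   # rank of each sample of x
    q = (ranks + 0.5) / len(x)          # empirical quantiles in (0, 1)
    return np.quantile(ref, q)          # values of ref at those quantiles
```

In CSAWH-style processing, the reference quantiles would come from clean training data, computed separately for each gammatone channel.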

The above plots show results from CMU Sphinx-3 on RM1 in white noise. Below, results are depicted for RATS-like noise on the SRI DECIPHER ASR. For both cases, the left column shows clean training and the right column shows multistyle training.

Below, results for SSF are compared to other standard feature extraction algorithms. The left plot shows results for clean speech in reverberation and the right plot shows results for speech in music noise. The differences between SSF Type-I and Type-II are pronounced in reverberation, but insignificant in the other case.

[Figure: input power distribution for the 20th gammatone filter (CF: 1.25 kHz), shown for SNRs of 0, 5, 10, 15, and 20 dB and for clean speech.]

The figure illustrates the effect of the noise on the distribution of the subband speech power.