Pushing the Envelope - Aside
Nelson Morgan, Qifeng Zhu, Andreas Stolcke,Kemal Sönmez, Sunil Sivadas, Takahiro Shinozaki,Mari Ostendorf, Pratibha Jain, Hynek Hermansky,Dan Ellis, George Doddington, Barry Chen,Özgür Çetin, Hervé Bourlard, and Marios Athineos
Presenter: Shih-Hsiang
IEEE SIGNAL PROCESSING MAGAZINE SEPTEMBER,2005
Reference
- Ö. Çetin and M. Ostendorf, "Multi-rate and variable-rate modeling of speech at phone and syllable time scales," in Proc. ICASSP, 2005
- B. Chen, Q. Zhu, and N. Morgan, "Learning long term temporal features in LVCSR using neural networks," in Proc. ICSLP, 2004
- H. Hermansky and S. Sharma, "TRAPS—Classifiers of temporal patterns," in Proc. ICSLP, 1998
- H. Hermansky, S. Sharma, and P. Jain, "Data-derived nonlinear mapping for feature extraction in HMM," in Proc. ASRU, 1999

Reference (cont.)
- C. Moreno, Q. Zhu, B. Chen, and N. Morgan, "Automatic Data Selection for MLP-based Feature Extraction for ASR," in Proc. ASRU, 2005
- N. Morgan, B. Chen, Q. Zhu, and A. Stolcke, "Trapping Conversational Speech: Extending TRAP/TANDEM Approaches to Conversational Telephone Speech Recognition," in Proc. ICASSP, 2004
Today's topic
Focus on three issues:
- Using MLPs to extract long-term features (TRAPs, HATs)
- Considerations when training on large amounts of data
- A new HMM model (multi-scale, variable-scale)
Introduction
- The core acoustic operation has essentially remained the same for decades
  - A single feature vector is compared to a set of distributions derived from training
  - The feature vector is often derived from the power spectral envelope over a 20-30 ms window, stepped forward by ~10 ms per frame
- Systems using short-term cepstra for modeling have been successful both in the laboratory and in numerous applications
- But there are still significant limitations to speech recognition performance, particularly for conversational speech and/or speech with significant acoustic degradation from noise or reverberation
Introduction (cont.)
- Human phonetic categorization is poor for extremely short segments (<100 ms), suggesting that analysis of longer time regions is somehow essential to the task
- In mid-2002, they began working on a DARPA-sponsored project, EARS
- The fundamental goal of this multisite effort was to push the spectral envelope away from its role as the sole source of acoustic information incorporated by the statistical models of modern speech recognition systems (SRSs)
- This would ultimately require both a revamping of acoustic feature extraction and a fresh look at the incorporation of these features into statistical models representing speech
Temporal Representation
- Replace (or augment) the current notion of a spectral-energy-based vector at time t with variables based on posterior probabilities of speech categories for long and short time functions of the time-frequency plane
- These features may be represented as multiple streams of probabilistic information
- Working with narrow spectral subbands and long temporal windows (up to 500 ms or more, sufficiently long for two or more syllables)
  - TempoRAl Patterns (TRAPs)
  - Hidden Activation TRAPS (HATS)
TempoRAl Patterns (TRAPs)
- Substitute the conventional spectral feature vector in phonetic classification with a 1-sec-long temporal vector of critical-band logarithmic spectral energies (Bark critical bands)
ICSLP 1998
Bark Critical Band
- The scale ranges from 1 to 24 and corresponds to the first 24 critical bands of hearing

  Bark(f) = 13·arctan(0.00076·f) + 3.5·arctan((f/7500)²)

- The band edges are (in Hz): 0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500
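These band edges can be checked against the standard Bark approximation Bark(f) = 13·arctan(0.00076·f) + 3.5·arctan((f/7500)²): edge k of the list should sit near k Bark. A quick sketch (function name is ours, not from the slides):

```python
import math

def hz_to_bark(f):
    """Bark approximation from the slide:
    Bark(f) = 13*arctan(0.00076*f) + 3.5*arctan((f/7500)^2)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

# Band edges from the slide (Hz); edge k should map to roughly k Bark.
edges_hz = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
            1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
            9500, 12000, 15500]
edges_bark = [hz_to_bark(f) for f in edges_hz]
```

For example, the ninth edge (920 Hz) comes out near 8 Bark and the last edge (15500 Hz) near 24 Bark, matching the 24-band scale.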
TempoRAl Patterns (cont.)
Fig. Mean TRAPs for 16 phonemes at the fifth critical band
TempoRAl Patterns (cont.)
- The TRAPS system consists of two stages of MLPs
  - In the first stage, critical-band MLPs learn phone posterior probabilities conditioned on the input
  - In the second stage, a "merger" MLP merges the outputs of these individual critical-band MLPs, producing overall phone posterior probabilities
ASRU 1999
TempoRAl Patterns (cont.)
- Input to each TRAP is a 1-sec-long temporal vector
- Output of each TRAP is a vector of estimates of phoneme-specific likelihoods
- Output from the merging MLP is a vector of estimates of phoneme-specific posterior probabilities

Fig. TRAP architecture: 15 critical bands; per-band MLPs with 101 input units, 300 hidden units, and 29 output phonetic classes
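A minimal sketch of how a single TRAP input vector could be formed: 101 consecutive frames of one band's log critical-band energy (10 ms step ≈ 1 s of context) around the current frame. The function name is ours; per-TRAP mean removal is one common preprocessing choice, not necessarily the paper's exact recipe:

```python
import numpy as np

def trap_input(log_energies, band, center, context=50):
    """Extract a 1-second TRAP vector for one critical band.

    log_energies : (n_frames, n_bands) array of log critical-band
                   energies, assuming a 10 ms frame step.
    Returns a (2*context + 1,)-vector (101 frames ~ 1 second).
    """
    lo, hi = center - context, center + context + 1
    if lo < 0 or hi > log_energies.shape[0]:
        raise ValueError("not enough context around this frame")
    vec = log_energies[lo:hi, band].astype(float)
    return vec - vec.mean()          # per-TRAP mean removal

# Toy usage: 15 bands, 300 frames of random energies.
E = np.random.rand(300, 15)
x = trap_input(E, band=4, center=150)
```

In the full system, one such vector per band feeds its band-specific MLP, and the 15 band MLPs feed the merger.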
Hidden Activation TRAPS (HATS)
- Use the hidden activations of the critical-band MLPs, instead of their outputs, as inputs to the "merger" MLP
- Widen acoustic context by using more frames of full-band speech energies as input to the MLP
- Reduced the word error rate from 25.6% to 23.5% on the 2001 NIST evaluation set
- Reduced the word error rate from 20.3% to 18.3% on the 2004 NIST evaluation set
ICSLP 2004
Hidden Activation TRAPS (cont.)
Hidden Activation TRAPS (cont.)
- PLP features were derived from short-term spectral analysis (25 ms time slices every 10 ms)
- PLP/MLP used 9 frames of PLP features; HATs used 51 frames of log critical-band energies
Stability of Results
- Switchboard (earlier) and Fisher (later) conversational data are extremely difficult to recognize, due to their unconstrained vocabulary, speaking style, and range of telephones used
- Increasing amounts of training data achieve better performance
Some Practical Consideration
- Larger and larger training sets provide the best improvement, but imply a quadratic growth in training time
- Solutions:
  - Hyper-threading on the dual CPUs
  - Gender-specific training
  - Preliminary network training passes with fewer training patterns
  - Customization of the learning regimen to reduce the number of epochs (training iterations)
  - Using selected subsets of the data for later training passes
Some Practical Consideration (cont.)
- Faster probabilistic inference algorithms and judicious model selection methods for controlling model complexity are needed
Some Practical Consideration (cont.)
- Data selection is also an important issue
  - Reducing the redundancy in the database can help reduce the cost of learning, achieving the same performance with less effort
  - Over-represented examples in the database can harm the generalization capability of a learning machine, biasing its modeling toward those classes
  - For data selection based on the filter approach, we need an evaluation method that sorts the data according to some sampling criterion defining the usefulness of the data
ASRU 2005
Some Practical Consideration (cont.)
- Evaluation method
  - First, train an MLP selector (classifier) s on a small subset of the data, yielding a set of parameters
  - Given those parameters, obtain the posterior probabilities P_s(q_k | x[n]) for the rest of the data, for every feature frame x[n] and phoneme q_k, k = 0, ..., K-1
  - The entropy of each feature frame is then

    h_s[n] = - Σ_{k=0}^{K-1} P_s(q_k | x[n]) · log₂ P_s(q_k | x[n])
Some Practical Consideration (cont.)
- Sampling criteria
  - High entropy values indicate that taking a decision is going to be difficult
  - Low entropy values indicate that the decision is easy to make (not necessarily implying it will be the right one)
  - Very high entropy values may account for outliers or mislabeled examples: non-separable data
  - Very low entropy values can account for over-represented or easily learnt examples; this over-representation can harm the classifier's ability by forcing too much detail in the corresponding class
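The per-frame entropy h_s[n] = -Σ_k P_s(q_k | x[n]) log₂ P_s(q_k | x[n]) and a threshold-based selection along these lines can be sketched as follows; the quantile cutoffs are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def frame_entropy(posteriors):
    """h[n] = -sum_k P(q_k | x[n]) * log2 P(q_k | x[n]), per frame.

    posteriors : (n_frames, K) MLP outputs, each row summing to 1.
    """
    p = np.clip(posteriors, 1e-12, 1.0)   # avoid log2(0)
    return -(p * np.log2(p)).sum(axis=1)

def select_frames(posteriors, lo_q=0.05, hi_q=0.95):
    """Keep frames whose entropy falls between the lo_q and hi_q
    quantiles, discarding the easiest (over-represented) and hardest
    (outlier / mislabeled) frames, per the sampling criteria above."""
    h = frame_entropy(posteriors)
    lo, hi = np.quantile(h, [lo_q, hi_q])
    return np.where((h >= lo) & (h <= hi))[0]
```

A uniform posterior over K classes gives the maximum entropy log₂K (the hardest frames), while a one-hot posterior gives entropy near zero (the easiest).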
Some Practical Consideration (cont.)
[Figure: data-selection results on the NIST 2001 evaluation set]
Statistical Modeling for the New Features
- HMMs are not well suited to long-term features
  - The use of HMMs as the core acoustic modeling technology might obscure the gains from new features, especially those from long time scales
  - This may be one reason why progress with novel techniques has been so difficult
- The standard way to use longer temporal scales with an HMM is simply to use a large analysis window and a small frame step
  - The successive features at the slow time scale are even more correlated than those at the fast time scale, leading to a bias in posteriors
  - The models do not represent the high correlation between successive frames effectively
Statistical Modeling for the New Features (cont.)
- They propose instead to focus on the problem of multistream and multirate process modeling
  - It is desirable to improve robustness to corruption of individual streams
  - The use of multiple streams introduces more flexibility in characterizing speech at different time and frequency scales
  - The statistical models and features interact, and simple HMM-based combination approaches might not fully utilize complementary information in different feature sequences
- Multi-rate and variable-rate modeling is introduced
Multi-Rate and Variable-Rate Modeling
- The traditional approach for utilizing new features is to concatenate them with existing cepstral features (after over-sampling) and use them within standard HMM-based models
- HMMs have become so tuned to short-term features that their use might obscure the gains from new features
- Traditional HMM:

  P({o_t}, {s_t}) = Π_{t=0}^{T-1} P(s_t | s_{t-1}) · p(o_t | s_t)
ICASSP 2005
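The traditional HMM factorization P({o_t},{s_t}) = Π_t P(s_t|s_{t-1}) p(o_t|s_t) can be sketched as a path log-probability computation; the function name and array layout here are illustrative, not from the paper:

```python
import numpy as np

def hmm_joint_log_prob(states, obs_log_likes, log_trans, log_init):
    """log P({o_t},{s_t}) = sum_t [log P(s_t|s_{t-1}) + log p(o_t|s_t)]
    for one given state path.

    states        : state indices s_0..s_{T-1}
    obs_log_likes : (T, n_states) log p(o_t | s) for every state
    log_trans     : (n_states, n_states) log P(s'|s)
    log_init      : (n_states,) log P(s_0)
    """
    lp = log_init[states[0]] + obs_log_likes[0, states[0]]
    for t in range(1, len(states)):
        lp += log_trans[states[t - 1], states[t]] + obs_log_likes[t, states[t]]
    return lp
```

Working in log-probabilities keeps the product from underflowing for long utterances; summing over all paths instead of one path would be the forward algorithm.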
Multi-Rate and Variable-Rate Modeling (cont.)
- Basic multi-rate HMM:

  P({o_t^1},{s_t^1}, ..., {o_t^K},{s_t^K}) = Π_{k=1}^{K} Π_{t=1}^{T_k} P(s_t^k | s_{t-1}^k, s_{⌈t/M_k⌉}^{k-1}) · p(o_t^k | s_t^k)

  s: states, o: observations; scale 1 is the coarsest and scale K the finest, with M_k fine frames per coarser frame (e.g. T_1 = 3 and M_2 = 3 give T_2 = M_2 × T_1 = 9)
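A minimal two-scale sketch of this coupled factorization, where each fine-scale frame t is tied to coarse-scale frame ⌈t/M⌉. All function and parameter names are hypothetical placeholders standing in for the model's actual transition and observation distributions:

```python
import math

def multirate_log_prob(coarse, fine, M,
                       lp_ctrans, lp_cobs, lp_ftrans, lp_fobs):
    """Two-scale instance of the multi-rate factorization:
    the fine-scale transition at frame t is conditioned on both the
    previous fine state and the coarse state at index ceil(t / M).

    coarse : coarse-scale state path (length T1)
    fine   : fine-scale state path (length T2 = M * T1)
    lp_*   : callbacks returning log transition / observation terms
    """
    assert len(fine) == M * len(coarse)
    lp = 0.0
    for t, s in enumerate(coarse):               # coarse chain
        prev = coarse[t - 1] if t > 0 else None
        lp += lp_ctrans(s, prev) + lp_cobs(s)
    for t, s in enumerate(fine):                 # fine chain, coupled
        parent = coarse[math.ceil((t + 1) / M) - 1]  # s^1_{ceil(t/M)}
        prev = fine[t - 1] if t > 0 else None
        lp += lp_ftrans(s, prev, parent) + lp_fobs(s)
    return lp
```

With T1 = 3 and M = 2 = 3 as on the slide, fine frames 1-3 couple to coarse frame 1, frames 4-6 to coarse frame 2, and frames 7-9 to coarse frame 3.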
Multi-Rate and Variable-Rate Modeling (cont.)
- Variable-rate extension (2-rate):

  P({o_t^1},{s_t^1},{o_t^2},{s_t^2} | {M_t^1}) =
    Π_{t1=0}^{T_1-1} P(s_{t1}^1 | s_{t1-1}^1) · p(o_{t1}^1 | s_{t1}^1)
      · Π_{t2=l(t1)}^{l(t1)+M_{t1}^1} P(s_{t2}^2 | s_{t2-1}^2, s_{t1}^1) · p(o_{t2}^2 | s_{t2}^2)

  s: states, o: observations; the number of fine-scale frames per coarse frame, M_{t1}^1, now varies with t1, and l(t1) indexes the first fine-scale frame aligned to coarse frame t1
Multi-Rate and Variable-Rate Modeling (cont.)
- In their experiment, they modeled speech using both recognition units and feature sequences corresponding to phone and syllable time scales
  - Short-time: traditional phone HMMs using cepstral features (PLP cepstra)
  - Long-time: characterizes syllable structure and lexical stress using HATs
    - Unlike the previously mentioned HAT features trained on phone targets, these HAT features are trained on broad consonant/vowel classes, with distinctions for syllable position (onset, coda, and ambi-syllabification) for consonants and low/high stress level for vowels
- 2% word error rate reduction on the NIST 2001 Hub-5 task
Multi-Rate and Variable-Rate Modeling (cont.)
- The experimental results show that explicit modeling of speech at two time scales via the multirate, coupled-HMM architecture outperforms the simple HMM-based feature concatenation approach
- Feature extraction and statistical modeling are tailored to focus more on information-bearing regions (e.g. phone transitions) as opposed to a uniform emphasis over the whole signal space
- Research directions
  - Choice of the sampling rates according to the scale/rate of the larger time-window features
  - Multirate acoustic models with more than two time scales
    - The third or higher time scale can represent utterance-level effects such as speaking rate and style, gender, and noise
What could be next
- Determine optimal window sizes and frame rates for different regions of speech, thus creating a signal-adaptive front end
- The energy-based representations of temporal trajectories could be replaced by autoregressive models of these components of the time-frequency plane (FDLP, LP-TRAP)
- Perceptual linear prediction squared (PLP²)
  - A spectrogram-like signal representation that is iteratively approximated by all-pole models applied sequentially in the time and frequency directions of the spectrotemporal pattern
  - Unlike conventional feature processing, no frame-based spectral analysis occurs
Final Words
They wrote:
"We implored the reader not to be deterred by initial results that were poorer than those achieved by more conventional methods, since this was almost inevitable when wandering from a well-worn path. However, the goal was always to ultimately improve performance, and the explorations into relatively uncharted territory were only a path to that goal. This process can be slow and sometimes frustrating."