Pushing the Envelope - Aside
Nelson Morgan, Qifeng Zhu, Andreas Stolcke,Kemal Sönmez, Sunil Sivadas, Takahiro Shinozaki,Mari Ostendorf, Pratibha Jain, Hynek Hermansky,Dan Ellis, George Doddington, Barry Chen,Özgür Çetin, Hervé Bourlard, and Marios Athineos
Presenter: Shih-Hsiang
IEEE SIGNAL PROCESSING MAGAZINE SEPTEMBER,2005
Reference
- Ö. Çetin and M. Ostendorf, "Multi-rate and variable-rate modeling of speech at phone and syllable time scales," in Proc. ICASSP, 2005
- B. Chen, Q. Zhu, and N. Morgan, "Learning long term temporal features in LVCSR using neural networks," in Proc. ICSLP, 2004
- H. Hermansky and S. Sharma, "TRAPS—Classifiers of temporal patterns," in Proc. ICSLP, 1998
- H. Hermansky, S. Sharma, and P. Jain, "Data-derived nonlinear mapping for feature extraction in HMM," in Proc. ASRU, 1999

Reference (cont.)
- C. Moreno, Q. Zhu, B. Chen, and N. Morgan, "Automatic Data Selection for MLP-based Feature Extraction for ASR," in Proc. ASRU, 2005
- N. Morgan, B. Chen, Q. Zhu, and A. Stolcke, "Trapping Conversational Speech: Extending TRAP/TANDEM Approaches to Conversational Telephone Speech Recognition," in Proc. ICASSP, 2004
Today's topic
Focus on three issues:
- Using MLPs to extract long-term features (TRAPs, HATs)
- Considerations when training on large amounts of data
- A new HMM model (multi-scale, variable-scale)
Introduction
- The core acoustic operation has essentially remained the same for decades
  - A single feature vector is compared to a set of distributions derived from training
  - The feature vector is often derived from the power spectral envelope over a 20-30 ms window, stepped forward by ~10 ms per frame
- Systems using short-term cepstra for modeling have been successful both in the laboratory and in numerous applications
- But there are still significant limitations to speech recognition performance, particularly for conversational speech and/or speech with significant acoustic degradation from noise or reverberation
Introduction (cont.)
- Human phonetic categorization is poor for extremely short segments (<100 ms), suggesting that analysis of longer time regions is somehow essential to the task
- In mid-2002, they began working on a DARPA-sponsored project, EARS
- The fundamental goal of this multisite effort was to push the spectral envelope away from its role as the sole source of acoustic information incorporated by the statistical models of modern speech recognition systems (SRSs)
- This would ultimately require both a revamping of acoustic feature extraction and a fresh look at the incorporation of these features into statistical models representing speech
Temporal Representation
- Replace (or augment) the current notion of a spectral-energy-based vector at time t with variables based on posterior probabilities of speech categories for long and short time functions of the time-frequency plane
- These features may be represented as multiple streams of probabilistic information
- Working with narrow spectral subbands and long temporal windows (up to 500 ms or more, sufficiently long for two or more syllables)
  - TempoRAl Patterns (TRAPs)
  - Hidden Activation TRAPS (HATS)
TempoRAl Patterns (TRAPs)
- Substitute the conventional spectral feature vector in phonetic classification with a 1-sec-long temporal vector of critical-band logarithmic spectral energies (Bark critical bands)
ICSLP 1998
Bark Critical Band
- The scale ranges from 1 to 24 and corresponds to the first 24 critical bands of hearing

  Bark(f) = 13·arctan(0.00076·f) + 3.5·arctan((f/7500)²)

- The band edges are (in Hz): 0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500
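These band edges can be checked against the standard Bark approximation Bark(f) = 13·arctan(0.00076·f) + 3.5·arctan((f/7500)²): edge k of the list should sit near k Bark. A quick sketch (function name is ours, not from the slides):

```python
import math

def hz_to_bark(f):
    """Bark approximation from the slide:
    Bark(f) = 13*arctan(0.00076*f) + 3.5*arctan((f/7500)^2)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

# Band edges from the slide (Hz); edge k should map to roughly k Bark.
edges_hz = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
            1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
            9500, 12000, 15500]
edges_bark = [hz_to_bark(f) for f in edges_hz]
```

For example, the ninth edge (920 Hz) comes out near 8 Bark and the last edge (15500 Hz) near 24 Bark, matching the 24-band scale.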
TempoRAl Patterns (cont.)
Fig. Mean TRAPs for 16 phonemes at the fifth critical band
TempoRAl Patterns (cont.)
- The TRAPS system consists of two stages of MLPs
  - In the first stage, critical-band MLPs learn phone posterior probabilities conditioned on the input
  - In the second stage, a "merger" MLP merges the outputs of these individual critical-band MLPs, producing overall phone posterior probabilities
ASRU 1999
TempoRAl Patterns (cont.)
- Input to each TRAP is a 1-sec-long temporal vector
- Output of each TRAP is a vector of estimates of phoneme-specific likelihoods
- Output from the merging MLP is a vector of estimates of phoneme-specific posterior probabilities

Fig. TRAP architecture: 15 critical bands; per-band MLPs with 101 input units, 300 hidden units, and 29 output phonetic classes
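A minimal sketch of how a single TRAP input vector could be formed: 101 consecutive frames of one band's log critical-band energy (10 ms step ≈ 1 s of context) around the current frame. The function name is ours; per-TRAP mean removal is one common preprocessing choice, not necessarily the paper's exact recipe:

```python
import numpy as np

def trap_input(log_energies, band, center, context=50):
    """Extract a 1-second TRAP vector for one critical band.

    log_energies : (n_frames, n_bands) array of log critical-band
                   energies, assuming a 10 ms frame step.
    Returns a (2*context + 1,)-vector (101 frames ~ 1 second).
    """
    lo, hi = center - context, center + context + 1
    if lo < 0 or hi > log_energies.shape[0]:
        raise ValueError("not enough context around this frame")
    vec = log_energies[lo:hi, band].astype(float)
    return vec - vec.mean()          # per-TRAP mean removal

# Toy usage: 15 bands, 300 frames of random energies.
E = np.random.rand(300, 15)
x = trap_input(E, band=4, center=150)
```

In the full system, one such vector per band feeds its band-specific MLP, and the 15 band MLPs feed the merger.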
Hidden Activation TRAPS (HATS)
- Use the hidden activations of the critical-band MLPs, instead of their outputs, as inputs to the "merger" MLP
- Widen acoustic context by using more frames of full-band speech energies as input to the MLP
- Reduced the word error rate from 25.6% to 23.5% on the 2001 NIST evaluation set
- Reduced the word error rate from 20.3% to 18.3% on the 2004 NIST evaluation set
ICSLP 2004
Hidden Activation TRAPS (cont.)
Hidden Activation TRAPS (cont.)
- PLP features were derived from short-term spectral analysis (25 ms time slices every 10 ms)
- PLP/MLP used 9 frames of PLP features; HATs used 51 frames of log critical-band energies
Stability of Results
- Switchboard (earlier) and Fisher (later) conversational data are extremely difficult to recognize, due to their unconstrained vocabulary, speaking style, and range of telephones used
- Increasing amounts of training data achieve better performance
Some Practical Consideration
- Larger and larger training sets provide the best improvement, but imply a quadratic growth in training time
- Solutions:
  - Hyper-threading on the dual CPUs
  - Gender-specific training
  - Preliminary network training passes with fewer training patterns
  - Customization of the learning regimen to reduce the number of epochs (training iterations)
  - Using selected subsets of the data for later training passes
Some Practical Consideration (cont.)
- Faster probabilistic inference algorithms and judicious model selection methods for controlling model complexity are needed
Some Practical Consideration (cont.)
- Data selection is also an important issue
  - Reducing the redundancy in the database can help reduce the cost of learning, achieving the same performance with less effort
  - Over-represented examples in the database can harm the generalization capability of a learning machine, biasing its modeling toward those classes
  - For data selection based on the filter approach, we need an evaluation method that sorts the data according to some sampling criterion defining the usefulness of the data
ASRU 2005
Some Practical Consideration (cont.)
- Evaluation method
  - First, train an MLP selector (classifier) s on a small subset of the data, yielding a set of parameters
  - Given those parameters, obtain the posterior probabilities P_s(q_k | x[n]) for the rest of the data, for every feature frame x[n] and phoneme q_k, k = 0, ..., K-1
  - The entropy of each feature frame is then

    h_s[n] = - Σ_{k=0}^{K-1} P_s(q_k | x[n]) · log₂ P_s(q_k | x[n])
Some Practical Consideration (cont.)
- Sampling criteria
  - High entropy values indicate that taking a decision is going to be difficult
  - Low entropy values indicate that the decision is easy to make (not necessarily implying it will be the right one)
  - Very high entropy values may account for outliers or mislabeled examples: non-separable data
  - Very low entropy values can account for over-represented or easily learnt examples; this over-representation can harm the classifier's ability by forcing too much detail in the corresponding class
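The per-frame entropy h_s[n] = -Σ_k P_s(q_k | x[n]) log₂ P_s(q_k | x[n]) and a threshold-based selection along these lines can be sketched as follows; the quantile cutoffs are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def frame_entropy(posteriors):
    """h[n] = -sum_k P(q_k | x[n]) * log2 P(q_k | x[n]), per frame.

    posteriors : (n_frames, K) MLP outputs, each row summing to 1.
    """
    p = np.clip(posteriors, 1e-12, 1.0)   # avoid log2(0)
    return -(p * np.log2(p)).sum(axis=1)

def select_frames(posteriors, lo_q=0.05, hi_q=0.95):
    """Keep frames whose entropy falls between the lo_q and hi_q
    quantiles, discarding the easiest (over-represented) and hardest
    (outlier / mislabeled) frames, per the sampling criteria above."""
    h = frame_entropy(posteriors)
    lo, hi = np.quantile(h, [lo_q, hi_q])
    return np.where((h >= lo) & (h <= hi))[0]
```

A uniform posterior over K classes gives the maximum entropy log₂K (the hardest frames), while a one-hot posterior gives entropy near zero (the easiest).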
Some Practical Consideration (cont.)
[Figure: data-selection results on the NIST 2001 evaluation set]
Statistical Modeling for the New Features
- HMMs are not well suited to long-term features
  - The use of HMMs as the core acoustic modeling technology might obscure the gains from new features, especially those from long time scales
  - This may be one reason why progress with novel techniques has been so difficult
- The standard way to use longer temporal scales with an HMM is simply to use a large analysis window and a small frame step
  - The successive features at the slow time scale are even more correlated than those at the fast time scale, leading to a bias in posteriors
  - The models do not represent the high correlation between successive frames effectively
Statistical Modeling for the New Features (cont.)
- They propose instead to focus on the problem of multistream and multirate process modeling
  - It is desirable to improve robustness to corruption of individual streams
  - The use of multiple streams introduces more flexibility in characterizing speech at different time and frequency scales
  - The statistical models and features interact, and simple HMM-based combination approaches might not fully utilize complementary information in different feature sequences
- Multi-rate and variable-rate modeling is introduced
Multi-Rate and Variable-Rate Modeling
- The traditional approach for utilizing new features is to concatenate them with existing cepstral features (after over-sampling) and use them within standard HMM-based models
- HMMs have become so tuned to short-term features that their use might obscure the gains from new features
- Traditional HMM:

  P({o_t}, {s_t}) = Π_{t=0}^{T-1} P(s_t | s_{t-1}) · p(o_t | s_t)
ICASSP 2005
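The traditional HMM factorization P({o_t},{s_t}) = Π_t P(s_t|s_{t-1}) p(o_t|s_t) can be sketched as a path log-probability computation; the function name and array layout here are illustrative, not from the paper:

```python
import numpy as np

def hmm_joint_log_prob(states, obs_log_likes, log_trans, log_init):
    """log P({o_t},{s_t}) = sum_t [log P(s_t|s_{t-1}) + log p(o_t|s_t)]
    for one given state path.

    states        : state indices s_0..s_{T-1}
    obs_log_likes : (T, n_states) log p(o_t | s) for every state
    log_trans     : (n_states, n_states) log P(s'|s)
    log_init      : (n_states,) log P(s_0)
    """
    lp = log_init[states[0]] + obs_log_likes[0, states[0]]
    for t in range(1, len(states)):
        lp += log_trans[states[t - 1], states[t]] + obs_log_likes[t, states[t]]
    return lp
```

Working in log-probabilities keeps the product from underflowing for long utterances; summing over all paths instead of one path would be the forward algorithm.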
Multi-Rate and Variable-Rate Modeling (cont.)
- Basic multi-rate HMM:

  P({o_t^1},{s_t^1}, ..., {o_t^K},{s_t^K}) = Π_{k=1}^{K} Π_{t=1}^{T_k} P(s_t^k | s_{t-1}^k, s_{⌈t/M_k⌉}^{k-1}) · p(o_t^k | s_t^k)

  s: states, o: observations; scale 1 is the coarsest and scale K the finest, with M_k fine frames per coarser frame (e.g. T_1 = 3 and M_2 = 3 give T_2 = M_2 × T_1 = 9)
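A minimal two-scale sketch of this coupled factorization, where each fine-scale frame t is tied to coarse-scale frame ⌈t/M⌉. All function and parameter names are hypothetical placeholders standing in for the model's actual transition and observation distributions:

```python
import math

def multirate_log_prob(coarse, fine, M,
                       lp_ctrans, lp_cobs, lp_ftrans, lp_fobs):
    """Two-scale instance of the multi-rate factorization:
    the fine-scale transition at frame t is conditioned on both the
    previous fine state and the coarse state at index ceil(t / M).

    coarse : coarse-scale state path (length T1)
    fine   : fine-scale state path (length T2 = M * T1)
    lp_*   : callbacks returning log transition / observation terms
    """
    assert len(fine) == M * len(coarse)
    lp = 0.0
    for t, s in enumerate(coarse):               # coarse chain
        prev = coarse[t - 1] if t > 0 else None
        lp += lp_ctrans(s, prev) + lp_cobs(s)
    for t, s in enumerate(fine):                 # fine chain, coupled
        parent = coarse[math.ceil((t + 1) / M) - 1]  # s^1_{ceil(t/M)}
        prev = fine[t - 1] if t > 0 else None
        lp += lp_ftrans(s, prev, parent) + lp_fobs(s)
    return lp
```

With T1 = 3 and M = 2 = 3 as on the slide, fine frames 1-3 couple to coarse frame 1, frames 4-6 to coarse frame 2, and frames 7-9 to coarse frame 3.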
Multi-Rate and Variable-Rate Modeling (cont.)
- Variable-rate extension (2-rate):

  P({o_t^1},{s_t^1},{o_t^2},{s_t^2} | {M_t^1}) =
    Π_{t1=0}^{T_1-1} P(s_{t1}^1 | s_{t1-1}^1) · p(o_{t1}^1 | s_{t1}^1)
      · Π_{t2=l(t1)}^{l(t1)+M_{t1}^1} P(s_{t2}^2 | s_{t2-1}^2, s_{t1}^1) · p(o_{t2}^2 | s_{t2}^2)

  s: states, o: observations; the number of fine-scale frames per coarse frame, M_{t1}^1, now varies with t1, and l(t1) indexes the first fine-scale frame aligned to coarse frame t1
Multi-Rate and Variable-Rate Modeling (cont.)
- In their experiment, they modeled speech using both recognition units and feature sequences corresponding to phone and syllable time scales
  - Short-time: traditional phone HMMs using cepstral features (PLP cepstra)
  - Long-time: characterizes syllable structure and lexical stress using HATs
    - Unlike the previously mentioned HAT features trained on phone targets, these HAT features are trained on broad consonant/vowel classes, with distinctions for syllable position (onset, coda, and ambi-syllabification) for consonants and low/high stress level for vowels
- 2% word error rate reduction on the NIST 2001 Hub-5 task
Multi-Rate and Variable-Rate Modeling (cont.)
- The experimental results show that explicit modeling of speech at two time scales via the multirate, coupled-HMM architecture outperforms the simple HMM-based feature concatenation approach
- Feature extraction and statistical modeling are tailored to focus more on information-bearing regions (e.g. phone transitions) as opposed to a uniform emphasis over the whole signal space
- Research directions
  - Choice of the sampling rates according to the scale/rate of the larger time-window features
  - Multirate acoustic models with more than two time scales
    - The third or higher time scale can represent utterance-level effects such as speaking rate and style, gender, and noise
What could be next
- Determine optimal window sizes and frame rates for different regions of speech, thus creating a signal-adaptive front end
- The energy-based representations of temporal trajectories could be replaced by autoregressive models of these components of the time-frequency plane (FDLP, LP-TRAP)
- Perceptual linear prediction squared (PLP²)
  - A spectrogram-like signal representation that is iteratively approximated by all-pole models applied sequentially in the time and frequency directions of the spectrotemporal pattern
  - Unlike conventional feature processing, no frame-based spectral analysis occurs
Final Words
They wrote:
"We implored the reader not to be deterred by initial results that were poorer than those achieved by more conventional methods, since this was almost inevitable when wandering from a well-worn path. However, the goal was always to ultimately improve performance, and the explorations into relatively uncharted territory were only a path to that goal. This process can be slow and sometimes frustrating."