Informing Multisource Decoding for Robust Speech Recognition
Ning Ma and Phil Green
Speech and Hearing Research Group
The University of Sheffield
22/04/2005
Overview of the Talk
Introduction to Multisource Decoding
Context-dependent Word Duration Modelling
Measure the “Speechiness” of Fragments
Summary and Plans
Multisource Decoding
A framework which integrates bottom-up and top-down processes in sound understanding
Easier to find a spectro-temporal region that belongs to a single source (a fragment) than to find a speech fragment (“missing data” techniques)
Unrealistic duration information encoded in HMMs
No hard limits on word durations: the decoder may produce word matches with unusual durations
Worse with multisource decoding: segregation hypotheses are decided on incomplete data
Need more temporal constraints
Modelling Durations, Why?
An HMM state with self-transition probability a_ii implies a geometric state-duration distribution:

P(d) = a_ii^(d-1) * (1 - a_ii)

[Figure: implicit geometric duration distribution vs. observed duration histogram (ms) for 'six', state 5 of 8]
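As a rough illustration, the implicit duration model can be computed directly from a state's self-transition probability (a minimal sketch; the function name and parameters are illustrative):

```python
import numpy as np

def geometric_duration_pmf(a_ii, max_frames):
    """Implicit HMM state-duration distribution P(d) = a_ii**(d-1) * (1 - a_ii),
    where a_ii is the state's self-transition probability."""
    d = np.arange(1, max_frames + 1)
    return a_ii ** (d - 1) * (1.0 - a_ii)

# A self-loop probability of 0.9 gives a monotonically decaying
# distribution with mean duration 1 / (1 - 0.9) = 10 frames.
pmf = geometric_duration_pmf(0.9, 100)
```

The monotonic decay is the problem: the most probable duration is always a single frame, unlike the skewed, peaked histograms observed for real words.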
Factors that Determine Word Durations
Lexical Stress: Humans tend to lengthen a word when emphasizing it
Surrounding Words: The neighbouring words can affect the duration of a word
Speaking Rate: Fast speech vs. Slow speech
Pause Context: Words followed by a long pause have relatively longer durations: the "Pre-pausal Lengthening" effect [1]
1. T. Crystal, "Segmental durations in connected-speech signals: Syllabic stress," JASA, 1988.
Word Duration Model Investigation
Different words have different durational statistics
Skewed distribution shape
Discrete distribution more attractive
Word duration histograms for digit ‘oh’ and ‘six’
Context-Dependent Duration Modelling
In a connected digits domain, high-level linguistic cues are minimised
The effect of lexical stress is not obvious
Surrounding words do not affect duration statistics
This work only models the 'pre-pausal lengthening' effect
The “Pre-Pausal Lengthening” Effect
Word duration histograms obtained by forced alignment
Distributions (solid lines) have a wide variance
A clear second peak around 600 ms for ‘six’
Word duration examples divided into two parts: non-terminating vs. pre-pausal word duration examples
Determine histograms for the two parts
Smoothed word duration histograms for digit ‘oh’ and ‘six’
Compute Word Duration Penalty
Estimate P(d|w,u), the probability of word w having duration d when followed by u
Word duration histograms (bin width 10 ms) obtained by forced alignment
Smoothed and normalised to evaluate P(d|w,u)
u can only be pause or non-pause in our case, thus two histograms per digit
Scaling factors to control the impact of the word duration penalties
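The histogram-based estimate of P(d|w,u) described above might be sketched as follows (a hypothetical helper for one (word, context) pair; the bin width, smoothing window, and scaling factor are the knobs mentioned on the slide):

```python
import numpy as np

def duration_log_penalty(durations_ms, bin_width=10, smooth_win=3, scale=1.0):
    """Estimate a scaled log P(d | w, u) from forced-alignment duration
    examples for one (word, following-context) pair."""
    bins = np.arange(0, max(durations_ms) + bin_width, bin_width)
    hist, edges = np.histogram(durations_ms, bins=bins)
    # simple moving-average smoothing of the raw histogram
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.convolve(hist, kernel, mode='same')
    probs = smoothed / smoothed.sum()
    floor = 1e-10  # avoid log(0) for unseen duration bins
    return edges, scale * np.log(probs + floor)
```

In the talk's setup this would be computed twice per digit, once from pre-pausal examples and once from non-terminating ones.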
Decoding with Word Duration Modelling
In Viterbi decoding: a time-synchronous algorithm
Apply word duration penalties as paths leave the final state
But within a template, paths with different histories have different durations!
1. S. Renals and M. Hochberg (1999), “Start-synchronous search for large vocabulary continuous speech recognition,”
Multi-stack decoding: idea from the NOWAY decoder [1]
Time-asynchronous, but start-synchronous
Has knowledge of each hypothesis's future
Multi-stack Decoding
Partial word sequence hypotheses H(t, W(t), P(t)) stored on each stack:
The reference time t at which the hypothesis ends
The word sequence W(t) = w(1)w(2)…w(n) covering the time from 1 to t
Its overall likelihood P(t)
The most likely hypothesis on each stack is extended further
Viterbi algorithm is used to find one-word extension
The final result: the best hypothesis on the stack at time T
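The stack organisation above can be sketched in a few lines (a simplified, hypothetical outline: the `extend` callback stands in for the Viterbi one-word extension, and pruning is omitted):

```python
def multistack_decode(T, extend):
    """Start-synchronous multi-stack decoding sketch.  `extend(hyp, t)` is a
    hypothetical callback returning (end_time, word, log_prob) candidates
    for one-word extensions of a hypothesis whose words cover frames 1..t."""
    stacks = {t: [] for t in range(T + 1)}
    stacks[0] = [(0.0, ())]  # (overall log-likelihood P(t), word sequence W(t))
    for t in range(T):
        if not stacks[t]:
            continue
        # only the most likely hypothesis on each stack is extended further
        score, words = max(stacks[t])
        for end_t, word, logp in extend((score, words), t):
            if end_t <= T:
                stacks[end_t].append((score + logp, words + (word,)))
    # the final result is the best hypothesis on the stack at time T
    return max(stacks[T]) if stacks[T] else None
```

Because every hypothesis on a stack ends at the same reference time, a word duration penalty can be applied the moment an extension is placed on a stack.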
Viterbi Search
[Figure: timeline of multi-stack decoding, with stack times t1…t6 along the time axis and the final result at time T]
When placing a hypothesis onto stacks:
Compute the word duration penalty based on the one-word extension
Apply the penalty to the hypothesis's likelihood score
Setting a search range [WDmin, WDmax] reduces computational cost
A typical duration range for a digit is between 150 and 900 ms
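Restricting one-word extensions to the duration search range could look like this (an illustrative helper; `log_dur_penalty` is a hypothetical lookup into the smoothed duration histograms):

```python
def place_on_stacks(hyp_score, word, start_t, log_dur_penalty,
                    wd_min=15, wd_max=90):
    """Yield (end_time, penalised score) pairs for one-word extensions of a
    hypothesis, restricted to the duration search range [WDmin, WDmax].
    With 10 ms frames, 15-90 frames covers the 150-900 ms typical of digits."""
    for d in range(wd_min, wd_max + 1):
        yield start_t + d, hyp_score + log_dur_penalty(word, d)
```

Only stacks within the range need to be touched, which is where the computational saving comes from.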
Applying Word Duration Penalties
[Figure: timeline showing word duration penalties applied to hypotheses ending at times t1…t5]
Recognition Experiments
"Soft mask" missing data system, spectral domain features
16 states per HMM, 7 Gaussians per state
Silence model and short pause model in use
Aurora 2 connected digits recognition task, clean training
Experiment Results
Four recognition systems:
1. Baseline system, no duration model
2. + uniform duration model
3. + context-independent duration model
4. + context-dependent duration model
Discussion
A context-dependent word duration model can offer significant improvement
With duration constraints the decoder can produce more reasonable duration patterns
Assumes the duration pattern in clean conditions is the same as in noise
Needs normalisation by speaking rate
Overview of the Talk
Introduction to Multisource Decoding
Context-dependent Word Duration Modelling
“Speechiness” Measures of Fragments
Discussion
Motivation of Measuring “Speechiness”
The multisource decoder assumes each fragment has an equal probability of being speech or not
We can measure the “speechiness” of each fragment
These measures can be used to bias the decoder towards including the fragments that are more likely to be speech.
A Noisy Speech Corpus
Aurora 2 connected digits mixed with either violins or drums
A set of a priori fragments has been generated, but unlabelled
Allows us to study the integration problem in isolation from the problems of fragment construction
A Priori Fragments
Recognition Results
“Correct”: a priori fragments with correct labels
“Fragments”: a priori fragments with no labels
Results demonstrate that the top-down information in our HMMs is insufficient
                    Acc      DEL   SUB   INS
Violins, correct    93.04%    24    44     2
Violins, fragments  50.75%   322   171     3
Drums, correct      91.36%    38    48     1
Drums, fragments    33.76%   221   381    65
Approach to Measure “Speechiness”
Extract features that represent speech characteristics
Use statistical models such as GMMs to fit the features
Need a background model which fits everything
Take the speech-model to background-model likelihood ratio as the confidence measure
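The likelihood-ratio confidence measure can be sketched as follows (single Gaussians stand in here for the GMMs used in the talk, and the 1-D features and parameter tuples are illustrative):

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Per-frame log-likelihood of 1-D features x under a Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def speechiness(frames, speech_params, background_params):
    """Confidence measure for a fragment: mean log-likelihood ratio between
    a speech model and a background model over the fragment's frames.
    A positive value biases the decoder towards treating the fragment as
    speech."""
    s = gaussian_loglik(frames, *speech_params).mean()
    b = gaussian_loglik(frames, *background_params).mean()
    return s - b
```

Rather than thresholding this value into a hard decision, it can be passed to the multisource decoder as a per-fragment bias.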
Preliminary Experiment 1 – F0 Estimation
Speech and other sounds differ in F0, and also in delta F0
Measure the F0 of each fragment rather than the full-band signal
Compute the correlogram of all the frequency channels
Only sum those channels within the fragment
For each frame, find the peak to estimate its F0
Smooth F0 across the fragment
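The per-fragment F0 estimate might be sketched as below (a simplification under stated assumptions: plain autocorrelation stands in for the correlogram, a single estimate is made for the whole fragment rather than per frame, and the smoothing step is omitted):

```python
import numpy as np

def fragment_f0(channels, fragment_mask, fs=8000, fmin=80, fmax=400):
    """Estimate the F0 of a fragment: autocorrelate each frequency channel,
    sum only the channels the fragment covers, and pick the lag of the
    summary peak.  channels: (n_channels, n_samples) array of channel
    signals; fragment_mask: boolean (n_channels,) fragment membership."""
    summary = np.zeros(channels.shape[1])
    for ch in np.flatnonzero(fragment_mask):
        x = channels[ch]
        summary += np.correlate(x, x, mode='full')[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)  # plausible pitch-period lags
    lag = lo + np.argmax(summary[lo:hi])
    return fs / lag
```

Summing only the fragment's channels is what makes the estimate robust to energy from the competing source in other channels.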
Preliminary Experiment 1 – F0 Estimation
Accuracies using different features:
  Pitch        74.3%
  Delta pitch  77.4%
  Both         88.8%
GMMs with full covariance and two Gaussians
Speech fragments vs Violin fragments
Background model trained on violin fragments
Log likelihood ratio threshold is 0
Preliminary Experiment 2 – Energy Ratios
Speech has more energy around formants
Divide the spectral features into frequency bands
Compute the amount of energy of a fragment within each band, normalised by the full-band energy
Two-band case: channel centre frequency (CF) boundaries = 50 – 1000 – 3750 Hz
Four-band case: CF boundaries = 50 – 282 – 707 – 1214 – 3850 Hz
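The band energy ratio feature can be sketched as follows (an illustrative helper; the per-channel energies and centre-frequency array are assumed inputs):

```python
import numpy as np

def band_energy_ratios(energies, channel_cf, band_edges):
    """Fraction of a fragment's energy in each frequency band, normalised
    by the full-band energy.  energies: per-channel energy of the fragment;
    channel_cf: channel centre frequencies in Hz; band_edges: e.g.
    [50, 1000, 3750] for the two-band case."""
    total = energies.sum()
    ratios = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        in_band = (channel_cf >= lo) & (channel_cf < hi)
        ratios.append(energies[in_band].sum() / total)
    return np.array(ratios)
```

Each fragment then yields a short ratio vector (length 2 or 4) that can be modelled with a GMM in the same way as the F0 features.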
Preliminary Experiment 2 – Energy Ratios
Speech fragments vs music fragments (violins & drums)
Full covariance GMMs with 4 Gaussians
Background model trained on all types of fragments
Accuracies using different features:
  Two bands   79.7%
  Four bands  93.2%
Summary and Plans
No hard classification needed: leave the confidence measures to the multisource decoder
Assumes the background model is accessible; in practice a garbage model is needed
Combine different features together
More speech features, e.g. syllabic rate
Thanks!
Any questions?