Informing Multisource Decoding for Robust Speech Recognition
Ning Ma and Phil Green
Speech and Hearing Research Group
The University of Sheffield
22/04/2005
Overview of the Talk
Introduction to Multisource Decoding
Context-dependent Word Duration Modelling
Measure the “Speechiness” of Fragments
Summary and Plans
Multisource Decoding
A framework which integrates bottom-up and top-down processes in sound understanding
Easier to find a spectro-temporal region that belongs to a single source (a fragment) than to find a speech fragment (“missing data” techniques)
Unrealistic duration information encoded in HMMs
No hard limits on word durations: the decoder may produce word matches with unusual durations
Worse with multisource decoding: segregation hypotheses are decided on incomplete data
Need more temporal constraints
Modelling Durations, Why?
An HMM state with self-transition probability a_ii implies a geometric state-duration distribution:

P(d) = a_ii^(d-1) * (1 - a_ii)

[Figure: implicit geometric duration distribution vs. observed duration histogram (ms) for 'six', state 5 of 8]
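As a rough illustration, the implicit duration model can be computed directly from a state's self-transition probability (a minimal sketch; the function name and parameters are illustrative):

```python
import numpy as np

def geometric_duration_pmf(a_ii, max_frames):
    """Implicit HMM state-duration distribution P(d) = a_ii**(d-1) * (1 - a_ii),
    where a_ii is the state's self-transition probability."""
    d = np.arange(1, max_frames + 1)
    return a_ii ** (d - 1) * (1.0 - a_ii)

# A self-loop probability of 0.9 gives a monotonically decaying
# distribution with mean duration 1 / (1 - 0.9) = 10 frames.
pmf = geometric_duration_pmf(0.9, 100)
```

The monotonic decay is the problem: the most probable duration is always a single frame, unlike the skewed, peaked histograms observed for real words.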
Factors that Determine Word Durations
Lexical Stress: Humans tend to lengthen a word when emphasizing it
Surrounding Words: The neighbouring words can affect the duration of a word
Speaking Rate: Fast speech vs. Slow speech
Pause Context: Words followed by a long pause have relatively longer durations: the "Pre-pausal Lengthening" effect [1]
1. T. Crystal, "Segmental durations in connected-speech signals: Syllabic stress," JASA, 1988.
Word Duration Model Investigation
Different words have different durational statistics
Skewed distribution shape
Discrete distribution more attractive
Word duration histograms for digit ‘oh’ and ‘six’
Context-Dependent Duration Modelling
In a connected digits domain, high-level linguistic cues are minimised
The effect of lexical stress is not obvious
Surrounding words do not affect duration statistics
This work only models the 'pre-pausal lengthening' effect
The “Pre-Pausal Lengthening” Effect
Word duration histograms obtained by forced alignment
Distributions (solid lines) have a wide variance
A clear second peak around 600 ms for ‘six’
Word duration examples divided into two parts: non-terminating vs. pre-pausal word duration examples
Determine histograms for the two parts
Smoothed word duration histograms for digit ‘oh’ and ‘six’
Compute Word Duration Penalty
Estimate P(d|w,u), the probability of word w having duration d when followed by u
Word duration histograms (bin width 10 ms) obtained by forced alignment
Smoothed and normalised to evaluate P(d|w,u)
u can only be pause or non-pause in our case, thus two histograms per digit
Scaling factors to control the impact of the word duration penalties
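The histogram-based estimate of P(d|w,u) described above might be sketched as follows (a hypothetical helper for one (word, context) pair; the bin width, smoothing window, and scaling factor are the knobs mentioned on the slide):

```python
import numpy as np

def duration_log_penalty(durations_ms, bin_width=10, smooth_win=3, scale=1.0):
    """Estimate a scaled log P(d | w, u) from forced-alignment duration
    examples for one (word, following-context) pair."""
    bins = np.arange(0, max(durations_ms) + bin_width, bin_width)
    hist, edges = np.histogram(durations_ms, bins=bins)
    # simple moving-average smoothing of the raw histogram
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.convolve(hist, kernel, mode='same')
    probs = smoothed / smoothed.sum()
    floor = 1e-10  # avoid log(0) for unseen duration bins
    return edges, scale * np.log(probs + floor)
```

In the talk's setup this would be computed twice per digit, once from pre-pausal examples and once from non-terminating ones.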
Decoding with Word Duration Modelling
In Viterbi decoding: a time-synchronous algorithm
Apply word duration penalties as paths leave the final state
But within a template, paths with different histories have different durations!
1. S. Renals and M. Hochberg (1999), “Start-synchronous search for large vocabulary continuous speech recognition,”
Multi-stack decoding: idea from the NOWAY decoder [1]
Time-asynchronous, but start-synchronous
Has knowledge of each hypothesis's future
Multi-stack Decoding
Partial word sequence hypotheses H(t, W(t), P(t)) stored on each stack:
The reference time t at which the hypothesis ends
The word sequence W(t) = w(1)w(2)…w(n) covering the time from 1 to t
Its overall likelihood P(t)
The most likely hypothesis on each stack is extended further
Viterbi algorithm is used to find one-word extension
The final result: the best hypothesis on the stack at time T
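The stack organisation above can be sketched in a few lines (a simplified, hypothetical outline: the `extend` callback stands in for the Viterbi one-word extension, and pruning is omitted):

```python
def multistack_decode(T, extend):
    """Start-synchronous multi-stack decoding sketch.  `extend(hyp, t)` is a
    hypothetical callback returning (end_time, word, log_prob) candidates
    for one-word extensions of a hypothesis whose words cover frames 1..t."""
    stacks = {t: [] for t in range(T + 1)}
    stacks[0] = [(0.0, ())]  # (overall log-likelihood P(t), word sequence W(t))
    for t in range(T):
        if not stacks[t]:
            continue
        # only the most likely hypothesis on each stack is extended further
        score, words = max(stacks[t])
        for end_t, word, logp in extend((score, words), t):
            if end_t <= T:
                stacks[end_t].append((score + logp, words + (word,)))
    # the final result is the best hypothesis on the stack at time T
    return max(stacks[T]) if stacks[T] else None
```

Because every hypothesis on a stack ends at the same reference time, a word duration penalty can be applied the moment an extension is placed on a stack.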
Viterbi Search
[Figure: timeline of multi-stack decoding, with stack times t1…t6 along the time axis and the final result at time T]
When placing a hypothesis onto stacks:
Compute the word duration penalty based on the one-word extension
Apply the penalty to the hypothesis's likelihood score
Setting a search range [WDmin, WDmax] reduces computational cost
A typical duration range for a digit is between 150 and 900 ms
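Restricting one-word extensions to the duration search range could look like this (an illustrative helper; `log_dur_penalty` is a hypothetical lookup into the smoothed duration histograms):

```python
def place_on_stacks(hyp_score, word, start_t, log_dur_penalty,
                    wd_min=15, wd_max=90):
    """Yield (end_time, penalised score) pairs for one-word extensions of a
    hypothesis, restricted to the duration search range [WDmin, WDmax].
    With 10 ms frames, 15-90 frames covers the 150-900 ms typical of digits."""
    for d in range(wd_min, wd_max + 1):
        yield start_t + d, hyp_score + log_dur_penalty(word, d)
```

Only stacks within the range need to be touched, which is where the computational saving comes from.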
Applying Word Duration Penalties
[Figure: timeline showing word duration penalties applied to hypotheses ending at times t1…t5]
Recognition Experiments
"Soft mask" missing data system, spectral domain features
16 states per HMM, 7 Gaussians per state
Silence model and short pause model in use
Aurora 2 connected digits recognition task, clean training
Experiment Results
Four recognition systems:
1. Baseline system, no duration model
2. + uniform duration model
3. + context-independent duration model
4. + context-dependent duration model
Discussion
A context-dependent word duration model can offer significant improvement
With duration constraints the decoder can produce more reasonable duration patterns
Assumes the duration pattern in clean conditions is the same as in noise
Needs normalisation by speaking rate
Overview of the Talk
Introduction to Multisource Decoding
Context-dependent Word Duration Modelling
“Speechiness” Measures of Fragments
Discussion
Motivation of Measuring “Speechiness”
The multisource decoder assumes each fragment has an equal probability of being speech or not
We can measure the “speechiness” of each fragment
These measures can be used to bias the decoder towards including the fragments that are more likely to be speech.
A Noisy Speech Corpus
Aurora 2 connected digits mixed with either violins or drums
A set of a priori fragments has been generated, but unlabelled
Allows us to study the integration problem in isolation from the problems of fragment construction
A Priori Fragments
Recognition Results
“Correct”: a priori fragments with correct labels
“Fragments”: a priori fragments with no labels
Results demonstrate that the top-down information in our HMMs is insufficient
                    Acc      DEL   SUB   INS
Violins, correct    93.04%    24    44     2
Violins, fragments  50.75%   322   171     3
Drums, correct      91.36%    38    48     1
Drums, fragments    33.76%   221   381    65
Approach to Measure “Speechiness”
Extract features that represent speech characteristics
Use statistical models such as GMMs to fit the features
Need a background model which fits everything
Take the speech-model to background-model likelihood ratio as the confidence measure
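The likelihood-ratio confidence measure can be sketched as follows (single Gaussians stand in here for the GMMs used in the talk, and the 1-D features and parameter tuples are illustrative):

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Per-frame log-likelihood of 1-D features x under a Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def speechiness(frames, speech_params, background_params):
    """Confidence measure for a fragment: mean log-likelihood ratio between
    a speech model and a background model over the fragment's frames.
    A positive value biases the decoder towards treating the fragment as
    speech."""
    s = gaussian_loglik(frames, *speech_params).mean()
    b = gaussian_loglik(frames, *background_params).mean()
    return s - b
```

Rather than thresholding this value into a hard decision, it can be passed to the multisource decoder as a per-fragment bias.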
Preliminary Experiment 1 – F0 Estimation
Speech and other sounds differ in F0, and also in delta F0
Measure the F0 of each fragment rather than the full-band signal
Compute the correlogram of all the frequency channels
Only sum those channels within the fragment
For each frame, find the peak to estimate its F0
Smooth F0 across the fragment
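The per-fragment F0 estimate might be sketched as below (a simplification under stated assumptions: plain autocorrelation stands in for the correlogram, a single estimate is made for the whole fragment rather than per frame, and the smoothing step is omitted):

```python
import numpy as np

def fragment_f0(channels, fragment_mask, fs=8000, fmin=80, fmax=400):
    """Estimate the F0 of a fragment: autocorrelate each frequency channel,
    sum only the channels the fragment covers, and pick the lag of the
    summary peak.  channels: (n_channels, n_samples) array of channel
    signals; fragment_mask: boolean (n_channels,) fragment membership."""
    summary = np.zeros(channels.shape[1])
    for ch in np.flatnonzero(fragment_mask):
        x = channels[ch]
        summary += np.correlate(x, x, mode='full')[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)  # plausible pitch-period lags
    lag = lo + np.argmax(summary[lo:hi])
    return fs / lag
```

Summing only the fragment's channels is what makes the estimate robust to energy from the competing source in other channels.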
Preliminary Experiment 1 – F0 Estimation
Accuracies using different features:
  Pitch        74.3%
  Delta pitch  77.4%
  Both         88.8%
GMMs with full covariance and two Gaussians
Speech fragments vs Violin fragments
Background model trained on violin fragments
Log likelihood ratio threshold is 0
Preliminary Experiment 2 – Energy Ratios
Speech has more energy around formants
Divide the spectral features into frequency bands
Compute the amount of energy of a fragment within each band, normalised by the full-band energy
Two-band case: channel centre frequency (CF) boundaries = 50 – 1000 – 3750 Hz
Four-band case: CF boundaries = 50 – 282 – 707 – 1214 – 3850 Hz
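The band energy ratio feature can be sketched as follows (an illustrative helper; the per-channel energies and centre-frequency array are assumed inputs):

```python
import numpy as np

def band_energy_ratios(energies, channel_cf, band_edges):
    """Fraction of a fragment's energy in each frequency band, normalised
    by the full-band energy.  energies: per-channel energy of the fragment;
    channel_cf: channel centre frequencies in Hz; band_edges: e.g.
    [50, 1000, 3750] for the two-band case."""
    total = energies.sum()
    ratios = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        in_band = (channel_cf >= lo) & (channel_cf < hi)
        ratios.append(energies[in_band].sum() / total)
    return np.array(ratios)
```

Each fragment then yields a short ratio vector (length 2 or 4) that can be modelled with a GMM in the same way as the F0 features.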
Preliminary Experiment 2 – Energy Ratios
Speech fragments vs music fragments (violins & drums)
Full covariance GMMs with 4 Gaussians
Background model trained on all types of fragments
Accuracies using different features:
  Two bands   79.7%
  Four bands  93.2%
Summary and Plans
No hard classification needed: leave the confidence measures to the multisource decoder
Assumes the background model is accessible; in practice a garbage model is needed
Combine different features together
More speech features, e.g. syllabic rate
Thanks!
Any questions?