Segmental HMMs: Modelling Dynamics and Underlying Structure for Automatic Speech Recognition

Wendy Holmes
20/20 Speech Limited, UK
A DERA/NXT Joint Venture

Overview

• Hidden Markov models (HMMs): advantages and limitations

• Overcoming limitations with segment-based HMMs

• Modelling trajectories of acoustic features

• Theory of trajectory-based segmental HMMs

• Experimental investigations: comparing performance of different segmental HMMs

• Choice of parameters for trajectory modelling: recognition using formant trajectories

• A “unified” model for both recognition and synthesis

• Challenges and further issues

Typical speech spectral characteristics

[Spectrogram of "six three one", annotated with phone labels: s i k s th r ee o ne]

• Each sound has particular spectral characteristics.

• Characteristics change continuously with time.

• Patterns of change give cues to phone identity.

• Spectrum includes speaker identity information.

Useful properties of HMMs

1. Appropriate general structure

• Underlying Markov process allows for time-varying nature of utterances.

• Probability distributions associated with states represent short-term spectral variability.

• Can incorporate speech knowledge - e.g. context-dependent models, choice of features.

2. Tractable mathematical framework

• Algorithms for automatically training model parameters from natural speech data.

• Straightforward recognition algorithms.

Modelling observations with an HMM

[Diagram: an HMM generating one observation per frame, with the model shown against observations at times t, t+1 and t+2.]

Conventional HMM assumptions

• Piece-wise stationarity
Assume speech is produced by a piece-wise stationary process with instantaneous transitions between stationary states.

• Independence assumption
The probability of an acoustic vector given a model state depends ONLY on the vector and the state. Assume no dependency between observations, other than through the state sequence.

• Duration model
State duration conforms to a geometric pdf (given by the self-loop transition probability).
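To make the third assumption concrete, here is a minimal sketch (my illustration, not from the original slides) of the duration distribution implied by a self-loop probability a: it is geometric, so its mode is always at one frame, whatever value a takes.

```python
# Illustrative only: duration pdf implied by an HMM state with self-loop probability a.
# P(stay for exactly d frames) = a**(d - 1) * (1 - a), which decreases monotonically with d.
a = 0.8  # example self-loop probability
for d in range(1, 6):
    print(d, round(a ** (d - 1) * (1 - a), 3))
# 1 0.2, 2 0.16, 3 0.128, ... - unlike real phone durations, which peak at a typical value.
```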

Limitations of HMM assumptions

• Speech production is not a piece-wise stationary process, but a continuous one.

• Changes are mostly smoothly time-varying.

• The constraints of articulation are such that any one frame of speech is highly correlated with the previous and following frames.

• Time derivatives capture this correlation to some extent - but not within the model.

• Long-term correlations, e.g. speaker identity.

• Speech sounds have a typical duration, with shorter and longer durations being less likely, and limitations on maximum duration.

Addressing HMM limitations

AIMS WERE TO:

• retain the advantages of HMMs:
– automatic and tractable algorithms for training models on quantities of speech data;
– manageable recognition algorithms (principle of dynamic programming).

• improve the underlying model structure to address HMM shortcomings as models of speech.

ACHIEVING THE AIMS:

• Associate states with sequences of feature vectors

=> SEGMENTAL HMMS

Modelling observations with Segmental HMMs

[Diagram: a segmental HMM in which each state generates a variable-duration sequence of observations, e.g. segments of duration d=3, d=2 and d=5 starting at times t, t+3 and t+5.]

Segmental HMMs

• Associate states with sequences of feature vectors, where these sequences can vary in duration.

• Each state is associated with a meaningful acoustic-phonetic event (phones or parts of phones).

• Can easily incorporate realistic duration model.

• Enable relationship between frames comprising a segment to be modelled explicitly.

• Characterize dynamic behaviour during a segment.

Recognition calculations with HMMs

• Compute the most likely path through the model (or sequence of models).

• Evaluate efficiently using dynamic programming (Viterbi algorithm).

• To compute the probability of emitting the observations up to a given frame time, for any one state we need only consider the states which could be occupied at the previous frame.

[Diagram: model states aligned against frame times 1-7, one observation emitted per frame.]

Segmental HMM recognition calculation

• The principle of dynamic programming still applies.

• BUT the search is more complex and computationally intensive.

• For the probability of being in any one state at any given frame time (see the sketch below):
– assume that the frame represents the last frame of a segment;
– consider all possible segment durations from 1 up to some maximum D;
– therefore, must consider all possible previous states at all possible previous frame times from t-1 up to t-D.

[Diagram: model states aligned against frame times 1-7, with segments of varying duration ending at each frame.]
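A minimal sketch of this recursion (my illustration, not the original implementation): it assumes log-domain scores, a caller-supplied segment_logprob function, and that decoding starts in state 0. Note the extra loop over durations, which makes the search roughly D times more expensive than standard Viterbi decoding.

```python
import numpy as np

def segmental_viterbi(n_frames, n_states, max_dur, segment_logprob, log_trans):
    """segment_logprob(s, start, end): log prob of frames [start, end) as one segment of state s.
    log_trans[i, j]: log transition probability from state i to state j."""
    NEG = -np.inf
    # delta[t, s] = best log score of any segmentation of frames [0, t) whose final segment is in state s
    delta = np.full((n_frames + 1, n_states), NEG)
    for t in range(1, n_frames + 1):
        for s in range(n_states):
            best = NEG
            for d in range(1, min(max_dur, t) + 1):          # all segment durations ending at frame t-1
                seg = segment_logprob(s, t - d, t)
                if t - d == 0:
                    cand = seg if s == 0 else NEG            # assume decoding starts in state 0
                else:
                    cand = np.max(delta[t - d] + log_trans[:, s]) + seg   # best previous state at frame t-d
                best = max(best, cand)
            delta[t, s] = best
    return delta   # delta[n_frames, s] scores the best segmentation ending in state s; back-pointers omitted
```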

Trajectory-based segmental HMMs

[Diagram: feature value against time, with a smooth trajectory drawn through the observations of one segment.]

• Simple trajectory-based segmental HMM: associate a state with a single mean trajectory, in place of the (static) single mean value used for a standard HMM.

• Approximate the relation between successive feature vectors by some trajectory through feature space.

Segmental HMM probability calculations

• Generate observations independently, but conditioned on the trajectory.

• Aim to provide a constraining model of dynamics without requiring a complex model of correlations.

• BUT the trajectory may be different for different utterances of the same sound.

• So, if a single trajectory is used to represent all examples of a given model unit, it will not be a very accurate representation for any one example.

• One possible solution is a mixture of trajectories, but this needs many components to capture all the different trajectories.

Intra- and extra-segmental variability

• Model feature dynamics across all segment examples by, in effect, a continuous mixture of trajectories.

• This is achieved by modelling separately:
– extra-segmental variation (of the underlying trajectory);
– intra-segmental variation (about the trajectory).

=> Probabilistic-trajectory segmental HMMs

[Diagram: feature value against time, showing a family of underlying trajectories with frame-level scatter about each one.]

Comparing different models: generating a sequence of 5 observations

[Diagram: the same 5-observation sequence generated by HMM states under a standard HMM, a segmental HMM and a probabilistic-trajectory segmental HMM.]

Probabilistic-trajectory segmental HMMs

• Parametric trajectory model and Gaussian distributions.

• Simple linear trajectory - characterized by a mid-point and a slope.

• For illustration, shown with slope = 0.

[Diagram: a segment of frames 1 to D, showing extra-segmental variability of the target (the trajectory location) and intra-segmental variability of the observations about it.]
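Written out (in my notation; centring the time axis on the segment mid-point is an assumption, consistent with the mid-point parameter introduced on the next slides), the model for a segment y_0, ..., y_T is:

```latex
% Linear trajectory with mid-point c and slope m, drawn once per segment (extra-segmental
% variability); each frame is then observed about the trajectory (intra-segmental variability).
f_t(m, c) = c + m\left(t - \tfrac{T}{2}\right), \qquad t = 0, \dots, T,
\qquad
y_t \mid m, c, S \sim \mathcal{N}\!\left(f_t(m, c),\, \sigma^2_{\mathrm{intra}}\right),
\qquad
c \sim \mathcal{N}(\mu_c, \sigma^2_c), \quad m \sim \mathcal{N}(\mu_m, \sigma^2_m).
```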

PTSHMM probability (general)

• A segment of observations is y = y_0, ..., y_T.

• The probability of y and a trajectory f given state S is

  P(y, f \mid S) = P(f \mid S) \prod_{t=0}^{T} P(y_t \mid S, f)

  where P(f \mid S) is the extra-segmental term and the product over frames is the intra-segmental term.

Alternative segmental models:

1. Define a trajectory and model variation in the trajectory.

2. Fix the trajectory and model the observations - the standard HMM is the limiting case:

  P(y \mid S) = \prod_{t=0}^{T} P(y_t \mid S)
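A minimal sketch of alternative 2 (my illustration, assuming the Gaussian frame distributions adopted on the following slides): with the trajectory fixed, the observations are conditionally independent, so the segment log probability is just a sum over frames.

```python
import numpy as np
from scipy.stats import norm

def fixed_trajectory_loglik(y, f, var_intra):
    """log P(y | S, f) = sum over t of log N(y_t; f_t, var_intra), for a 1-D segment y and trajectory f."""
    y, f = np.asarray(y, float), np.asarray(f, float)
    return float(np.sum(norm.logpdf(y, loc=f, scale=np.sqrt(var_intra))))
```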

Linear Gaussian PTSHMM

• Linear trajectory: slope m and mid-point c.

• Joint probability of y and linear trajectory is:

  P(y, m, c \mid S) = P(m \mid S) \cdot P(c \mid S) \cdot \prod_{t=0}^{T} P(y_t \mid S, f_t(m, c))

  where P(m \mid S) is the slope term, P(c \mid S) the mid-point term, and the product over frames the intra-segment term.

• Gaussian distributions for slope, mid-point and intra-segment variance.

• To use model in recognition, need to compute P(y|S).

• BUT the values of the trajectory parameters m and c are not known - they are "hidden" from the observer.

Hidden-trajectory probability calculation

• One possibility: estimate the location of the trajectory, and compute the probability for that trajectory.

• This approach was used in early work, but it suffers from the difficulty of making an unbiased trajectory estimate.

• A better alternative is to allow for all possible locations of the trajectory by integrating out the unknown parameters.

• In the case of the linear model, the calculation is:

  P(y \mid S) = \int_{m} \int_{c} P(y, m, c \mid S) \, dm \, dc
              = \int_{m} \int_{c} P(m \mid S) \cdot P(c \mid S) \cdot \prod_{t=0}^{T} P(y_t \mid S, f_t(m, c)) \, dm \, dc
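Because the linear trajectory depends on (m, c) linearly and all the distributions are Gaussian, this integral has a closed form: the segment is jointly Gaussian, with a covariance that combines extra- and intra-segmental variance. A minimal sketch for a one-dimensional feature (my parameter names, not the original implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def linear_ptshmm_loglik(y, mean_c, var_c, mean_m, var_m, var_intra):
    """log P(y | S) for a 1-D segment y under a linear Gaussian PTSHMM state,
    with the hidden mid-point c and slope m integrated out analytically."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    tau = np.arange(T) - (T - 1) / 2.0             # time axis centred on the segment mid-point
    X = np.column_stack([np.ones(T), tau])         # trajectory f = X @ [c, m]
    mean = X @ np.array([mean_c, mean_m])          # mean trajectory from the extra-segmental means
    cov = X @ np.diag([var_c, var_m]) @ X.T + var_intra * np.eye(T)
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)

# Setting var_m = 0 and mean_m = 0 gives the "static" PTSHMM, and var_c = var_m = 0 a
# fixed-trajectory SHMM (the special cases listed on the next slide).
seg = np.array([0.1, 0.3, 0.5, 0.7, 0.9])                        # a rising segment
print(linear_ptshmm_loglik(seg, 0.5, 0.05, 0.2, 0.01, 0.01))     # state with a rising mean slope: good fit
print(linear_ptshmm_loglik(seg, 0.5, 0.05, 0.0, 1e-6, 0.01))     # nearly static state: much lower score
```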

Parameters of the linear PTSHMM

• The linear PTSHMM has five model parameters: mid-point mean and variance, slope mean and variance, and intra-segment variance.

• Simpler models arise as special cases, by fixing various parameters.

• If the trajectory slope is set to zero => "static" PTSHMM.

• If variability in the trajectory is prevented => "fixed-trajectory" SHMM.

• A fixed-trajectory static SHMM = a standard HMM with an explicit duration model.

Digit recognition experiments

• Speaker-independent connected-digit recognition.

• 8 mel cepstrum features + overall energy.

• Three-state monophone models.

• Segmental HMM maximum segment duration of 10 frames (=> maximum phone duration = 300 ms).

• Compared probabilistic-trajectory SHMMs with fixed-trajectory SHMMs and with standard HMMs.

• Initialised all SHMMs from segmented training data (using HMM Viterbi alignment).

• Interested in acoustic-modelling aspects, so fixed all transition and duration probabilities to be equal.

• 5 training iterations.

Digit recognition results: simple SHMMs

                            % Sub.   % Del.   % Ins.   % Err.
Standard HMM                  6.2      1.5      0.9      8.6
Add duration constraint       5.2      0.7      0.7      6.6
Linear fixed trajectory       3.8      0.5      0.6      4.9

• Some benefit from simply imposing duration constraints by introducing the segmental structure (prevents “silly” segmentations).

• Further benefit from representing dynamics by incorporating linear trajectory (one trajectory per model state).

Digit recognition results: static PTSHMMs

                              % Sub.   % Del.   % Ins.   % Err.
Static fixed SHMM               5.2      0.2      0.7      6.6
Static probabilistic SHMM       5.2      2.2      0.1      7.5

• For static models, no advantage from distinguishing between extra- and intra-segmental variability.

Digit recognition results: linear SHMMs

                                    % Sub.   % Del.   % Ins.   % Err.
Static fixed SHMM                     5.2      0.2      0.7      6.6
Linear fixed trajectory               3.8      0.5      0.6      4.9
Linear PTSHMM (slope var = 0)         2.0      0.8      0.1      2.9
Linear PTSHMM (flexible slope)        4.9      4.0      0.1      9.0

• Some advantage for linear trajectory.

• Considerable further benefit from modelling variability in mid-point.

• But modelling variability in both mid-point and slope is detrimental to recognition performance.

Conclusions from digit experiments

The best trajectory model gives nearly a 70% reduction in error rate (2.9%) compared with standard HMMs (8.6% error rate).

=> advantages from trajectory-based segmental HMM which also incorporates distinction between intra- and extra-segmental variability, but:

• Trajectory assumption must be reasonably accurate (advantage for linear but not for static models).

• Not beneficial to model variability in slope parameter - possibly too variable between speakers, or too difficult to estimate reliably for short segments.

Phonetic classification: TIMIT

• Training and recognition with given segment boundaries.

• Train on complete training set (male speakers), with classification on core test set.

• 12 mel cepstrum features + overall energy.

• Evaluated (constrained) linear PTSHMMs.

• Compared performance against standard HMMs for:

– context-dependent (biphone) versus context-independent (monophone) models

– feature set using only the mel cepstrum features versus one which also included time derivative features.

TIMIT classification results

• Improvement with linear PTSHMM is greatest for more accurate (context-dependent) models.

=> more benefit from modelling trajectories when not including different phonetic events in one model.

• Most advantage when not using delta features.

=> most benefit from modelling dynamics when not attempting to represent dynamics in front-end.

Feature set                   Model type    HMM % err.   PTSHMM % err.   % improvement of PTSHMM
mel-cepstrum features only    monophone        48.1           44.0               8.5
                              biphone          43.0           38.2              11.1
include time derivatives      monophone        38.7           36.5               5.7
                              biphone          29.4           26.8               8.8

Benefit of PTSHMMs for some different phone classes

Phone class                               No. of examples   HMM % error   PTSHMM % error   % improvement
Fricatives (f v th dh s z sh hh)                710              41.7           38.9             6.8
Vowels (iy ih eh ae ah uw uh er)               1178              53.8           48.9             9.1
Semivowels and glides (l r y w)                  97              39.2           33.2            15.4
Diphthongs (ey ay oy aw ow)                     376              48.9           41.2            15.8
Stops (p t dx k b d g)                          566              56.7           54.8             3.4

The linear PTSHMM gives most benefit for sounds characterised by continuous, smoothly changing dynamics.

Summary of findings

• Probabilistic-trajectory segmental HMMs can outperform standard HMMs and fixed-trajectory segmental HMMs.

• Separately modelling variability within/between segments is a powerful approach, provided that:
– the trajectory assumptions are appropriate (linear trajectory);
– variability in the parameter can be usefully modelled (not useful to model variability in the slope parameter with the current approach).

• The models have been shown to give useful performance gains.

Issues of modelling speech dynamics

Compare error rates on TIMIT task:

• HMMs with time derivatives: 29.8%

• best segmental HMM result WITHOUT time derivatives: 38.2%.

=> time derivatives capture some aspects of dynamics not modelled in segmental HMMs.

• Time derivative features provide some measure of dynamics for every frame.

• Current segmental HMMs only model dynamics within a segment.

Modelling issues and questions (1)

• Choice of model unit (e.g. phone, diphone)

• How to model dynamics and continuity effects across segment boundaries, to represent dynamics throughout an utterance.

• How to model context effects. (e.g. could define trajectories according to previous and following sounds - but complicates search)

• How to define trajectories. (e.g. linear or higher-order polynomial; versus dynamical-system type model with filtered output of hidden states)

Modelling issues and questions (2)

• Incorporating a realistic duration model.

• How to model any systematic effects of duration on trajectory realisation - should reduce remaining variability in trajectories.

• How to model speaker-dependent effects and speaker continuity.

• How to deal with other systematic influences - e.g. speaker stress, speaking rate.

• Dealing with external influences - e.g. noise.

• Choice of features for trajectory modelling.

Spectral representations (1)

• Typical wideband spectrogram - for display, compute the spectrum at frequent time intervals (e.g. 2 ms).

[Spectrogram of "three six six", annotated with phone labels: th r ee s I x s I x]

• Typical features for ASR: mfccs computed from an FFT of 25 ms windows at 10 ms intervals.
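A minimal sketch of this kind of fixed-frame front end (my illustration; it stops at magnitude spectra and omits the mel filterbank and cepstrum stages):

```python
import numpy as np

def framed_spectra(signal, fs, win_ms=25.0, step_ms=10.0):
    """Short-time magnitude spectra from fixed-position windows, e.g. 25 ms windows every 10 ms."""
    win = int(round(fs * win_ms / 1000.0))
    step = int(round(fs * step_ms / 1000.0))
    window = np.hamming(win)
    frames = []
    for start in range(0, len(signal) - win + 1, step):
        frames.append(np.abs(np.fft.rfft(signal[start:start + win] * window)))
    return np.array(frames)

# Example: 1 s of a 440 Hz tone sampled at 16 kHz gives about 98 frames of 201 spectral points.
fs = 16000
tone = np.sin(2 * np.pi * 440.0 * np.arange(fs) / fs)
print(framed_spectra(tone, fs).shape)
```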

Spectral representations (2)

• Using long windows at fixed positions blurs rapid events - stop bursts and rapid formant transitions.

• An alternative: use a shorter window "excitation synchronously":

[Spectrogram of "three six six" from excitation-synchronous short-window analysis, annotated: th r ee s I x s I x]

• Compare with the long fixed-window analysis above.

Standard HMM digit recognition experiments

• Compared excitation-synchronous analysis with fixed analysis for different window lengths.

• In all cases computed an FFT and then the mel cepstrum.

• A shorter window gives lower frequency resolution, but the effect is not so great on the mel scale.

• Best fixed-window condition, 20 or 25 ms: 2.1% err. (increased to 4.6% for a 5 ms window).

• Best synchronous-window condition, 10 ms: 1.9% err. But this only increased to 2.1% for a 5 ms window.

=> Some advantage to capturing rapid events. But note that a short window may be a disadvantage for fricatives. Maybe combine different analyses?

Moving beyond cepstrum trajectories

• Start with spectral analysis: this must preserve all relevant information.

• But is it appropriate to then model trajectories directly in the spectral/cepstral domain?

• Motivation for modelling dynamics is from nature of articulation, and its acoustic consequences.

=> should be modelling in domain closer to articulation.

• One possibility is an articulatory description.

• Another option is formants - closely related to articulation but also to acoustics.

Problems with formant analysis

• Unambiguous formant labelling may not be possible from a single spectral cross-section, e.g. close formants may merge to give a single spectral peak.

• A formant may not be apparent in the spectrum, e.g. when it is weakly excited (F1 in unvoiced sounds).

• NOT useful for certain distinctions where low amplitude is the main feature, e.g. identifying silence or weak fricatives.

=> It is difficult to identify formants independently of the recognition process, so they are not generally used as features for automatic speech recognition.

Estimating formant trajectories

[Spectrogram of "six three one", annotated: s i k s th r ee o ne]

• Where there is clear formant structure, F1, F2 and F3 can be identified.

• In voiceless fricatives, higher formant movements are usually continuous with those in the adjacent vowels.

• For F1, arbitrarily connect between adjacent vowels.

Formant analysis method
John Holmes (Proc. EUROSPEECH'97)

• Aims to emulate human abilities:
– ability to label single spectrum cross-sections;
– rely heavily on continuity over time;
– sometimes need knowledge of what is being said to disambiguate alternatives.

• Two fundamental features of the method:
– outputs alternatives when uncertain ("delayed decisions");
– notion of "confidence" in a formant measurement: when formants cannot be estimated (e.g. during silence), confidence is low and the estimate is not useful for recognition
=> rely on other features (general spectrum shape).

Example of formant analyser output

• Up to two sets of formants for each frame.

• Alternatives are in terms of sets - F1, F2, F3.

• Specified frame by frame, but the alternatives usually form alternative trajectories.

[Example of the formant analyser output for the utterance "four seven".]

Segmental HMM experiments

• Each segment model is associated with a linear trajectory.

• Model each phone by a sequence of one or more segments, e.g.:
– monophthongal vowels, fricatives: 1 segment;
– diphthongs: a sequence of 2 segments;
– aspirated voiceless stops: a sequence of 3 segments.

• Set the allowed minimum and maximum segment duration dependent on the identity of the phone segment (a loose constraint).

• Incorporate the confidence estimate (as a variance) in the recognition calculations (see the sketch after this list).

• Resolve formant alternatives based on probability.

• Use formants + low-order cepstrum features.
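A hedged sketch of the last three points (my own formulation, not the analyser's actual interface): each frame can carry alternative formant measurements with a confidence value, low confidence is expressed as extra variance in the frame likelihood, and the alternative with the higher probability is the one kept.

```python
import numpy as np
from scipy.stats import norm

LOW_CONF_VAR = 1.0e5   # Hz^2: illustrative amount by which an uncertain measurement is discounted

def frame_formant_loglik(alternatives, confidences, traj_value, model_var):
    """Score one formant in one frame: take the best of the alternative measurements,
    broadening the distribution when the measurement confidence is low."""
    best = -np.inf
    for f, conf in zip(alternatives, confidences):
        var = model_var + (1.0 - conf) * LOW_CONF_VAR   # low confidence -> extra variance
        best = max(best, norm.logpdf(f, loc=traj_value, scale=np.sqrt(var)))
    return best

# Example: two alternative F2 measurements for a frame, the second one less confident.
print(frame_formant_loglik([1500.0, 1900.0], [0.9, 0.4], traj_value=1550.0, model_var=2.0e4))
```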

Some connected-digit recognition results

Word error rates                                   8 cep.   5 cep. + 3 formants
Standard-HMM baseline (3 states per phone)          3.5 %         2.5 %
Standard HMMs with variable state allocation        6.4 %         5.9 %

• Performance drops when the new state allocation is introduced (the total number of states is about half that of the baseline).

Introduce segment structure                         3.2 %         2.9 %

• The segment structure is needed for good performance.

Introduce linear trajectory                         2.6 %         2.3 %

• Some advantage from the linear trajectory.

• Formants show a small, but consistent, advantage.

Formant modelling

• Expressing a model in terms of formant dynamics offers:
– potential for modelling systematic effects in a meaningful way, e.g. speaker identity, speaker stress, speaking rate;
– potential for a constrained model of speech, which should be more robust to noise (assuming the noise is also modelled).

• BUT: analysis of formants separately from hypotheses about what is being said will always be prone to errors.

• FUTURE AIM: integrate formant analysis within recognition scheme: provided speech model is accurate, this should overcome any formant tracking errors.

• A good model for speech should be appropriate for synthesis as well as for recognition: a trajectory-based formant model offers this possibility.

A “unified” speech model: applied to coding

A simple coding scheme

• Demonstrate the principles of coding using the same model for both recognition and synthesis.

• Model represents linear formant trajectories.

• Recognition: linear trajectory segmental HMMs of formant features.

• Synthesis: JSRU parallel-formant synthesizer.

• Coding is applied to analysed formant trajectories

=> relatively high bit-rate (up to about 1000 bits/s).

• Recognition is used mainly to identify segment boundaries, but also to guide the coding of the trajectories.
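An illustrative sketch of the basic idea (my own, not the JSRU scheme; quantisation and the other synthesizer parameters are omitted): once recognition has fixed segment boundaries, each segment of a formant track can be transmitted as a duration plus a mid-point and slope, and the decoder reconstructs a linear trajectory.

```python
import numpy as np

def encode_segment(track):
    """Reduce one segment of a formant track to (duration, mid-point, slope) via a least-squares line."""
    t = np.arange(len(track)) - (len(track) - 1) / 2.0
    slope, mid = np.polyfit(t, track, 1)            # fit track[t] ~= mid + slope * t
    return len(track), float(mid), float(slope)

def decode_segment(duration, mid, slope):
    """Reconstruct the linear formant trajectory for a decoded segment."""
    t = np.arange(duration) - (duration - 1) / 2.0
    return mid + slope * t

# Example: a noisy 12-frame F2 segment gliding from 1200 Hz to 1750 Hz.
f2 = np.linspace(1200.0, 1750.0, 12) + np.random.default_rng(0).normal(0.0, 15.0, 12)
print(decode_segment(*encode_segment(f2)).round(0))
```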

Segment coding scheme overview

Speech coding results

Coded at about 600 bits/s:
– Speaker 1: digits
– Speaker 2: digits
– Speaker 3: digits
– Speaker 1: ARM report

Natural:
– Speaker 1: digits
– Speaker 2: digits
– Speaker 3: digits
– Speaker 1: ARM report

Achievements of the study: established the principle of using a formant trajectory model for both recognition and synthesis, including using information from recognition to assist in coding.

Future work: better quality coding should be possible by further integrating formant analysis, recognition and synthesis within a common framework.