Frederico Rodrigues and Isabel Trancoso

INESC/IST, 2000

Robust Recognition of Digits and Natural Numbers

Summary

Problem overview

Baseline system

Extensions to the baseline system

Conclusions and future work

The Problem

Speaker

Gender

Age

Vocal tract characteristics

Pronunciation

Rate of Speech

Stress

Lombard Reflex

Microphone

Microphone

Position

Distortion

Channel

Distortion

Noise

Environment

Background noises

Intermittent noises

Cocktail party noises

Reverberation

Corpus Description

Multilingual telephone speech corpus

SPEECHDAT(M) 1000 speakers

SPEECHDAT(II) 4000 speakers

Orthographically transcribed, including noise events

Noise events

[spk] : Speaker related noises

[sta] : Stationary noises

[int] : Intermittent noises

SPEECHDAT(II)   SPEECHDAT(M)
[spk]           Blow, loud breath, other speaker noises
[sta]           Channel noise, background noise
[int]           Cross talk, radio, telephone, other

We will now ask you to read the right-hand column of the following list:

1. Read the number digit by digit: 3 6 4 8 2
2. Read the sentence: "A derrota veio num golo que teve um remate muito bonito."
3. Read the name of the city or town: Edimburgo
4. Spell the word (letter by letter): E, D, I, M, B, U, R, G, O
5. Read the sentence: "Pincele tudo com uma gema de ovo misturada com uma colher de sopa de água."
6. Read the time: onze horas e cinco minutos
7. Read the word: operador
8. Read the amount of money: 18.362$00
10. Read the credit card number: 3483 1331 0764 7082
11. Read the sentence: "Eu queria telefonar"
12. Read the number in full: 19.395
13. Read the sentence: "O deputado participou, em sessenta e um, na Operação Dulcinea, conduzida por Henrique Galvão."
14. Read the date: Domingo, 21 de Maio de 2000
27. Read the number: zero
28. Read the word: conferência
29. Read the sentence: "O estado apostou sem risco e embolsou mais de dez milhões de contos"
30. Read the personal code: 1 4 1 4 2 0
31. Read the word: sopro
32. Read the number digit by digit: 9 0 5 2 7 3 1 8 4 6

Train and Test Set Definition

Selection procedure
– Age, gender and region distribution are approximately equal in both train and test sets

SPEECHDAT(II)
– Fixed 500 speakers evaluation set
– Additional 300 speakers development set

SPEECHDAT(M)
– 200 speakers evaluation set

Overall ratio of 80% Train/20% Test
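The selection procedure can be pictured as a stratified split at speaker level. The sketch below is only an illustration: the `Speaker` record, its field names and the per-stratum rounding are assumptions, since the slides do not describe the tool that was actually used.

```python
import random
from collections import defaultdict
from typing import NamedTuple


class Speaker(NamedTuple):
    # Hypothetical speaker record; the field names are assumptions.
    speaker_id: str
    gender: str
    age_group: str
    region: str


def stratified_split(speakers, test_ratio=0.2, seed=0):
    """Split speakers so that each (gender, age, region) stratum keeps
    roughly the same train/test proportion."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for spk in speakers:
        strata[(spk.gender, spk.age_group, spk.region)].append(spk)

    train, test = [], []
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)  # held-out speakers in this stratum
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

With the speaker counts given in the corpus description, 800 of the 4000 SPEECHDAT(II) speakers (500 evaluation plus 300 development) and 200 of the 1000 SPEECHDAT(M) speakers are held out, which matches the stated 20%.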

Sub-corpus Used

                  I1     B1     N*
Train Set         2954   2905   5059
Evaluation Set    768    491    260
Development Set   -      277    467
Total             3722   3673   5786

I1 - Isolated digit strings
B1 - Sequences of 10 digits
N* - Natural numbers

Feature Extraction

MFCC (Mel Frequency Cepstral Coefficients)

– 14 Cepstra + 14 ΔCepstra + Energy + ΔEnergy

– Speech signal band-limited between 200 and 3800 Hz

– Hamming window: 25 ms every 10 ms

Cepstral Mean Subtraction
– Simple but effective technique for channel and speaker normalization
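A minimal sketch of two of the steps above: cutting the signal into Hamming-windowed frames (25 ms every 10 ms) and applying cepstral mean subtraction per utterance. The 8 kHz sampling rate is an assumption (typical for telephone speech, not stated in the slides), the function names are hypothetical, and the MFCC computation itself is left out.

```python
import math


def frame_signal(samples, sample_rate=8000, win_ms=25, shift_ms=10):
    """Cut a signal into Hamming-windowed frames (25 ms window every 10 ms)."""
    win = int(sample_rate * win_ms / 1000)      # 200 samples at the assumed 8 kHz
    shift = int(sample_rate * shift_ms / 1000)  # 80 samples at the assumed 8 kHz
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1))
               for n in range(win)]
    frames = []
    for start in range(0, len(samples) - win + 1, shift):
        chunk = samples[start:start + win]
        frames.append([s * w for s, w in zip(chunk, hamming)])
    return frames


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from every cepstral vector."""
    n_frames = len(cepstra)
    dim = len(cepstra[0])
    mean = [sum(vec[d] for vec in cepstra) / n_frames for d in range(dim)]
    return [[vec[d] - mean[d] for d in range(dim)] for vec in cepstra]
```

The normalization works because a fixed linear channel adds a roughly constant offset to the cepstrum, so removing the utterance mean removes most of that offset.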

Acoustic Modeling

Left-right continuous density HMMs
– Word models for each digit. No skips.

– Silence and filler models with forward and backward skips

Gender dependent models

HMM: Hidden Markov Model

Model Topology

Fillers and silence models topology

No. of States   Models
3               “um”, fillers, silence
6               “zero”, “três”, “quatro”, “cinco”, “oito”, “nove”
7               “sete”
8               “dois”, “seis”
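As an illustration of the digit topology, the sketch below builds the transition matrix of a left-right model with no skips for a given number of states. The state counts come from the list above; the initial self-loop probability of 0.6 and the extra exit column are assumptions made for the example.

```python
# Emitting states per digit model, as listed above.
STATES_PER_MODEL = {
    "um": 3, "zero": 6, "três": 6, "quatro": 6, "cinco": 6,
    "oito": 6, "nove": 6, "sete": 7, "dois": 8, "seis": 8,
}


def left_right_transitions(n_states, self_loop=0.6):
    """Transition matrix for a left-right HMM with no skips.

    Row i holds the probabilities of leaving state i; the extra last
    column is the exit from the model. The self-loop value is only a
    starting point that training would re-estimate.
    """
    matrix = [[0.0] * (n_states + 1) for _ in range(n_states)]
    for i in range(n_states):
        matrix[i][i] = self_loop            # stay in the same state
        matrix[i][i + 1] = 1.0 - self_loop  # advance to the next state (or exit)
    return matrix
```

For example, `left_right_transitions(STATES_PER_MODEL["sete"])` gives a 7-state model in which each state can only repeat or move one step forward. The silence and filler models would additionally need non-zero skip entries, which this sketch does not cover.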

Baseline System - Isolated Digits

Choose isolated digits with no noise marks
– HMM parameters initialized with the global mean and variance of the training data

Embedded Baum-Welch re-estimation

Evaluate performance with Viterbi decoding
– Grammar allowing one digit and initial and final silence
– Grammar allowing one digit and any number of fillers or silence
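The initialization step (often called a flat start) can be sketched as follows: every state of every model receives the same single Gaussian, whose mean and variance are computed over all training frames. The function names and the dictionary layout are hypothetical.

```python
def global_mean_variance(frames):
    """Per-dimension mean and variance over all training feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def flat_start(n_states, frames):
    """Give every state the same Gaussian (global mean and variance)."""
    mean, var = global_mean_variance(frames)
    # Each state gets its own copy so later re-estimation can update them separately.
    return [{"mean": list(mean), "var": list(var)} for _ in range(n_states)]
```

Because all states start identical, it is the embedded Baum-Welch re-estimation that makes the models differ from one another.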

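Baseline System - Isolated Digits

[Figure: the two evaluation grammars. First grammar: Silence, then one digit ("Zero", "Um", ..., "Oito", "Nove"), then Silence. Second grammar: any number of filler models or Silence, then one digit ("Zero", "Um", ..., "Oito", "Nove"), then any number of filler models or Silence.]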

Baseline System - Isolated Digits

Increment Gaussian mixtures per state up to 3 for the digit models

Introduce files with noise marks

Repeat re-estimation/evaluation process

Increment Gaussian mixtures per state up to 3 for the filler and digit models
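The slides do not say how the number of mixtures is incremented. One common heuristic, used for instance by HTK-style mixture splitting, is to split the heaviest component in two and perturb the means. The sketch below assumes that heuristic and a diagonal-covariance mixture stored as `(weight, mean, var)` tuples; both are assumptions, not the authors' documented procedure.

```python
import math


def split_heaviest(mixture, perturb=0.2):
    """Return a new mixture with one more component.

    The heaviest component is replaced by two copies, each with half the
    weight and with means moved by +/- `perturb` standard deviations.
    """
    idx = max(range(len(mixture)), key=lambda i: mixture[i][0])
    weight, mean, var = mixture[idx]
    offset = [perturb * math.sqrt(v) for v in var]
    up = (weight / 2, [m + o for m, o in zip(mean, offset)], list(var))
    down = (weight / 2, [m - o for m, o in zip(mean, offset)], list(var))
    return mixture[:idx] + [up, down] + mixture[idx + 1:]
```

Each split would be followed by further re-estimation passes before the next increment, in line with the repeat re-estimation/evaluation step.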

Connected vs Isolated Digits

Example:

Number 3 1 2 6 said as:

Isolated Digits: t r e S u~ d o j S s 6 j S

Connected Digits: t r e z u~ d o j S _ 6 j S

Baseline System - Connected Digits

Use best isolated digit models as bootstrap models

Repeat re-estimation/evaluation process

Gradually increment Gaussian mixtures per state up to 5 for the digit models

Baseline System - Results

                                            % Correctness   % Accuracy
Isolated Digits                             99.0            99.0
Connected Digits, known-length grammar      97.5            97.3
Connected Digits, unknown-length grammar    96.2            96.1
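The slides do not define the two measures. Assuming the usual convention from word-level scoring (correctness ignores insertions, accuracy penalizes them), they can be computed from alignment counts as sketched below. The function name and the counts in the usage line are invented for illustration and are not taken from the experiments.

```python
def correctness_and_accuracy(n_ref, substitutions, deletions, insertions):
    """Percent correctness and accuracy from alignment error counts.

    n_ref is the number of reference words (here, digits).
    """
    hits = n_ref - substitutions - deletions
    correctness = 100.0 * hits / n_ref
    accuracy = 100.0 * (hits - insertions) / n_ref
    return correctness, accuracy
```

For instance, `correctness_and_accuracy(200, 6, 4, 2)` yields 95.0% correctness and 94.0% accuracy. Under this definition accuracy can never exceed correctness, and the gap between the two reflects inserted words.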

Extension to the Baseline System

New way of modelling the filler models

Same training/evaluation process

Train the 9 filler and silence models with no skips

Build a unique filler model concatenating all filler and silence models

New Filler Model Architecture

[Figure: the single filler model, combining the Silence model with Filler 1, ..., Filler 8, Filler 9.]

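Results With New Filler Model

                                            % Correctness   % Accuracy
Isolated Digits                             99.5            99.5
Connected Digits, known-length grammar      98.5            98.2
Connected Digits, unknown-length grammar    97.8            96.4

                                            % Correctness     % Accuracy
                                            BS      EXT       BS      EXT
Isolated Digits                             99.0    99.4      99.0    99.4
Connected Digits, known-length grammar      97.8    98.1      97.5    98.0
Connected Digits, unknown-length grammar    97.2    97.9      95.1    96.1

BS: baseline system; EXT: extended system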

Natural Numbers

Phone models with 3 states and no skips
• Larger vocabulary size
• May be adapted to other tasks

Phones initialized from models already trained for a directory assistance task

Digits are still modeled by word models

Grammar for natural numbers ranging from zero to hundreds of millions

Natural Numbers Example

Number 25:

Hypothesis 1: vinte e cinco (Twenty and five)

Hypothesis 2: vinte cinco (Twenty five)

But “vinte cinco” could also be the sequence of natural numbers: 20 5
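The ambiguity can be made concrete with a toy parser that lists every way of reading a word sequence as one or more natural numbers. The two-entry tens lexicon, the small units lexicon and the grammar (tens, unit, tens + "e" + unit, tens + unit) are simplifying assumptions for illustration and cover only a tiny part of the real grammar.

```python
TENS = {"vinte": 20, "trinta": 30}
UNITS = {"um": 1, "dois": 2, "três": 3, "quatro": 4, "cinco": 5}


def readings(words):
    """Return every way to read `words` as a sequence of natural numbers."""
    if not words:
        return [[]]
    results = []
    first = words[0]
    if first in UNITS:
        results += [[UNITS[first]] + rest for rest in readings(words[1:])]
    if first in TENS:
        tens = TENS[first]
        # Tens alone, e.g. "vinte" -> 20
        results += [[tens] + rest for rest in readings(words[1:])]
        # Tens + "e" + unit, e.g. "vinte e cinco" -> 25
        if len(words) >= 3 and words[1] == "e" and words[2] in UNITS:
            value = tens + UNITS[words[2]]
            results += [[value] + rest for rest in readings(words[3:])]
        # Tens + unit without "e", e.g. "vinte cinco" -> 25
        if len(words) >= 2 and words[1] in UNITS:
            value = tens + UNITS[words[1]]
            results += [[value] + rest for rest in readings(words[2:])]
    return results
```

With this toy grammar, `readings("vinte e cinco".split())` has the single reading `[25]`, whereas `readings("vinte cinco".split())` returns both `[20, 5]` and `[25]`, which is exactly the conflict described above.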

Natural Numbers - Results

# Mixtures   % Correctness   % Accuracy
1            90.9            90.2
2            95.5            95.0
5            98.5            98.4

Sample Application

[Figure: sample application architecture with User, Client and Server. Blocks: State Control, Speech Recording, Feature Extraction, Speech Recognition, Speech Synthesis (DIXI - SVIT). Flows: Speech, Prompts, Speech / Commands, Synthesised answer / Commands, Answer.]

Conclusions and Future Work

Explicitly modeling fillers is a difficult task
– Improved filler model decreases error rate up to 50%

Develop context-dependent models
– Solve vowel reduction and co-articulation problems

Results may be improved through the use of discriminative training techniques
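The "up to 50%" figure is presumably a relative reduction in error rate rather than an absolute gain. A short sketch of that calculation follows; the function name is hypothetical, and the worked example uses the isolated-digit correctness reported earlier (99.0% for the baseline, 99.5% with the new filler model).

```python
def relative_error_reduction(baseline_correct, improved_correct):
    """Relative reduction (in percent) of the error rate between two systems."""
    baseline_error = 100.0 - baseline_correct
    improved_error = 100.0 - improved_correct
    return 100.0 * (baseline_error - improved_error) / baseline_error


relative_error_reduction(99.0, 99.5)  # 50.0: the error drops from 1.0% to 0.5%
```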
