Frederico Rodrigues and Isabel Trancoso INESC/IST, 2000 Robust Recognition of Digits and Natural Numbers

Page 1: Frederico Rodrigues and Isabel Trancoso

Frederico Rodrigues and Isabel Trancoso

INESC/IST, 2000

Robust Recognition of Digits and Natural Numbers

Page 2: Summary

Problem overview

Baseline system

Extensions to the baseline system

Conclusions and future work

Page 3: The Problem

Speaker

Gender

Age

Vocal tract characteristics

Pronunciation

Rate of Speech

Stress

Lombard Reflex

Microphone

Microphone

Position

Distortion

Channel

Distortion

Noise

Environment

Background noises

Intermittent noises

Cocktail party noises

Reverberation

Page 4: Corpus Description

Multilingual telephone speech corpus

SPEECHDAT(M) 1000 speakers

SPEECHDAT(II) 4000 speakers

Orthographically transcribed, including noise events

Page 5: Noise events

[spk] : Speaker related noises

[sta] : Stationary noises

[int] : Intermittent noises

SPEECHDAT(II)   SPEECHDAT(M)
[spk]           Blow, loud breath, other speaker noises
[sta]           Channel noise, background noise
[int]           Cross talk, radio, telephone, other

Page 6

We will now ask you to read the right-hand column of the following list:

1. Read the number digit by digit: 3 6 4 8 2
2. Read the sentence: A derrota veio num golo que teve um remate muito bonito.
3. Read the name of the city or town: Edimburgo
4. Spell the word (letter by letter): E, D, I, M, B, U, R, G, O
5. Read the sentence: Pincele tudo com uma gema de ovo misturada com uma colher de sopa de água.
6. Read the time: onze horas e cinco minutos
7. Read the word: operador
8. Read the amount of money: 18.362$00
10. Read the credit card number: 3483 1331 0764 7082
11. Read the sentence: Eu queria telefonar
12. Read the number in words: 19.395
13. Read the sentence: O deputado participou, em sessenta e um, na Operação Dulcinea, conduzida por Henrique Galvão.
14. Read the date: Domingo, 21 de Maio de 2000
27. Read the number: zero
28. Read the word: conferência
29. Read the sentence: O estado apostou sem risco e embolsou mais de dez milhões de contos
30. Read the personal code: 1 4 1 4 2 0
31. Read the word: sopro
32. Read the number digit by digit: 9 0 5 2 7 3 1 8 4 6

Page 7: Train and Test Set Definition

Selection procedure
– Age, gender and region distributions are approximately equal in both train and test sets

SPEECHDAT(II)
– Fixed 500-speaker evaluation set
– Additional 300-speaker development set

SPEECHDAT(M)
– 200-speaker evaluation set

Overall ratio of 80% Train / 20% Test
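The selection procedure above can be illustrated with a speaker-level stratified split. This is only a minimal sketch, not the procedure actually used for the corpus: the speaker records, the field names (`gender`, `age_group`, `region`) and the function `stratified_speaker_split` are hypothetical, and the 20% test fraction simply mirrors the overall ratio quoted on the slide.

```python
import random
from collections import defaultdict


def stratified_speaker_split(speakers, test_fraction=0.2, seed=0):
    """Split speaker ids into train/test so that every
    (gender, age_group, region) stratum has about the same share in both."""
    strata = defaultdict(list)
    for spk in speakers:
        key = (spk["gender"], spk["age_group"], spk["region"])
        strata[key].append(spk["id"])

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for key in sorted(strata):
        ids = sorted(strata[key])
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test


# Hypothetical speaker table: 900 speakers spread evenly over 30 strata.
speakers = [
    {
        "id": f"spk{i:04d}",
        "gender": ("F", "M")[i % 2],
        "age_group": ("young", "adult", "senior")[i % 3],
        "region": ("north", "centre", "south", "islands", "other")[(i // 6) % 5],
    }
    for i in range(900)
]
train_ids, test_ids = stratified_speaker_split(speakers)
```

Splitting by speaker rather than by utterance keeps every speaker entirely in one set, so the test set only contains unseen voices.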

Page 8: Sub-corpus Used

                    I1     B1     N*
Train Set         2954   2905   5059
Evaluation Set     768    491    260
Development Set      -    277    467
Total             3722   3673   5786

I1 - Isolated digit strings
B1 - Sequences of 10 digits
N* - Natural numbers

Page 9: Feature Extraction

MFCC (Mel Frequency Cepstral Coefficients)
– 14 Cepstra + 14 Δ Cepstra + Energy + Δ Energy
– Speech signal band-limited between 200 and 3800 Hz
– Hamming window: 25 ms every 10 ms

Cepstral Mean Subtraction
– Simple but effective technique for channel and speaker normalization
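The windowing and cepstral mean subtraction steps above can be sketched in a few lines. This is a simplified illustration, not the front end used in the system: the 8 kHz sampling rate is an assumption (the slides only say telephone speech band-limited to 3800 Hz), and the function names are hypothetical. The mel filterbank and cepstrum computation between the two steps are left out.

```python
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, sample_rate=8000, win_ms=25, hop_ms=10):
    """Cut the signal into 25 ms Hamming-windowed frames taken every 10 ms.
    sample_rate=8000 is an assumed value for telephone speech."""
    win = sample_rate * win_ms // 1000  # 200 samples at 8 kHz
    hop = sample_rate * hop_ms // 1000  # 80 samples at 8 kHz
    window = hamming(win)
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        chunk = samples[start:start + win]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient.
    A constant channel offset in the cepstral domain cancels out."""
    n, dim = len(cepstra), len(cepstra[0])
    mean = [sum(frame[d] for frame in cepstra) / n for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in cepstra]


frames = frame_signal([1.0] * 8000)  # one second of dummy signal
clean = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]])
shifted = cepstral_mean_subtraction([[11.0, 2.0], [13.0, 6.0]])  # same data plus a channel offset
```

The last two calls show why the technique normalizes the channel: adding a constant to a coefficient in every frame leaves the output unchanged.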

Page 10: Acoustic Modeling

Left-right continuous-density HMMs
– Word models for each digit. No skips.
– Silence and filler models with forward and backward skips

Gender dependent models

HMM: Hidden Markov Model

Page 11: Model Topology

Fillers and silence models topology

Nº of States   Models
3              “um”, fillers, silence
6              “zero”, “três”, “quatro”, “cinco”, “oito”, “nove”
7              “sete”
8              “dois”, “seis”
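As an illustration of the left-right, no-skip topology used for the digit word models, the sketch below builds an initial transition matrix for each digit from the state counts in the table. The 0.5 self-loop probability and the extra exit column are assumptions made for this example, and `left_right_transitions` is a hypothetical helper, not part of the original system.

```python
# State counts per digit word model, taken from the table above.
STATES_PER_DIGIT = {
    "um": 3,
    "zero": 6, "três": 6, "quatro": 6, "cinco": 6, "oito": 6, "nove": 6,
    "sete": 7,
    "dois": 8, "seis": 8,
}


def left_right_transitions(n_states, self_loop=0.5):
    """Transition matrix of a left-right HMM without skips.
    Row i is state i; the extra last column is the exit from the model.
    self_loop=0.5 is an arbitrary starting value (assumption)."""
    matrix = []
    for i in range(n_states):
        row = [0.0] * (n_states + 1)
        row[i] = self_loop            # stay in the same state
        row[i + 1] = 1.0 - self_loop  # move to the next state (or exit)
        matrix.append(row)
    return matrix


digit_models = {w: left_right_transitions(n) for w, n in STATES_PER_DIGIT.items()}
total_states = sum(STATES_PER_DIGIT.values())
```

Every row has exactly two non-zero entries, which is what "no skips" means: a state can only repeat or hand over to its immediate successor.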

Page 12: Baseline System - Isolated Digits

Choose isolated digits with no noise marks
– HMM parameters initialized with the global mean and variance of the training data

Embedded Baum-Welch reestimation

Evaluate performance with Viterbi decoding
– Grammar allowing one digit and initial and final silence
– Grammar allowing one digit and any number of fillers or silence
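The initialization with the global mean and variance (often called a flat start) can be sketched as follows. This is a toy version under stated assumptions: features are plain lists of floats, each state holds a single diagonal Gaussian, and the names `global_mean_and_variance` and `flat_start` are hypothetical.

```python
def global_mean_and_variance(utterances):
    """Mean and variance of every feature dimension over all training frames."""
    frames = [frame for utt in utterances for frame in utt]
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def flat_start(n_states, mean, var):
    """Give every state its own copy of the same global Gaussian."""
    return [{"mean": list(mean), "var": list(var)} for _ in range(n_states)]


# Two tiny dummy utterances with 2-dimensional features.
utterances = [[[0.0, 2.0], [2.0, 2.0]], [[4.0, 2.0]]]
mean, var = global_mean_and_variance(utterances)
states = flat_start(3, mean, var)
```

Because all states start out identical, it is the subsequent embedded Baum-Welch reestimation that makes them specialize.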

Page 13: Baseline System - Isolated Digits

[Figure: the two evaluation grammars. In the first, one digit ("Zero", "Um", ..., "Oito", "Nove") is preceded and followed by filler models or silence. In the second, one digit is preceded and followed by silence only.]

Page 14: Baseline System - Isolated Digits

Increment Gaussian mixtures per state up to 3 for the digit models

Introduce files with noise marks

Repeat re-estimation/evaluation process

Increment Gaussian mixtures per state up to 3 for the filler and digit models

Page 15: Connected vs Isolated Digits

Example:

Number 3 1 2 6 said as:

Isolated Digits: t r e S u~ d o j S s 6 j S

Connected Digits: t r e z u~ d o j S _ 6 j S

Page 16: Baseline System - Connected Digits

Use best isolated digit models as bootstrap models

Repeat re-estimation/evaluation process

Gradually increment Gaussian mixtures per state up to 5 for the digit models

Page 17: Baseline System - Results

                                           % Correctness   % Accuracy
Isolated Digits                                 99.0           99.0
Connected Digits, known-length grammar          97.5           97.3
Connected Digits, unknown-length grammar        96.2           96.1
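The slides do not define the two measures. Assuming the usual HTK-style definitions, correctness ignores insertion errors while accuracy also penalizes them, which is why accuracy is never higher than correctness in the table. The sketch below uses made-up error counts purely to show the arithmetic; they are not the counts behind the reported results.

```python
def correctness(n_ref, deletions, substitutions):
    """% Correctness = (N - D - S) / N * 100 (assumed HTK-style definition)."""
    return 100.0 * (n_ref - deletions - substitutions) / n_ref


def accuracy(n_ref, deletions, substitutions, insertions):
    """% Accuracy = (N - D - S - I) / N * 100 (assumed HTK-style definition)."""
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref


# Hypothetical counts: 1000 reference digits, 10 deleted, 15 substituted, 2 inserted.
corr = correctness(1000, 10, 15)
acc = accuracy(1000, 10, 15, 2)
```

With an unknown-length grammar the recognizer is free to insert or delete digits, so the gap between the two measures is a rough indicator of insertion errors.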

Page 18: Extension to the Baseline System

New way of modelling the filler models

Same training/evaluation process

Train the 9 filler and silence models with no skips

Build a single filler model by concatenating all filler and silence models

Page 19: New Filler Model Architecture

[Figure: new filler model architecture, built from the Silence model and the Filler 1, ..., Filler 8, Filler 9 models]

Page 20: Results With New Filler Model

                                           % Correctness   % Accuracy
Isolated Digits                                 99.5           99.5
Connected Digits, known-length grammar          98.5           98.2
Connected Digits, unknown-length grammar        97.8           96.4

                                           % Correctness     % Accuracy
                                            BS      EXT       BS      EXT
Isolated Digits                            99.0    99.4      99.0    99.4
Connected Digits, known-length grammar     97.8    98.1      97.5    98.0
Connected Digits, unknown-length grammar   97.2    97.9      95.1    96.1

(BS: baseline system; EXT: extended system)

Page 21: Natural Numbers

Phone models with 3 states and no skips
• Larger vocabulary size
• May be adapted to other tasks

Phones initialized from models already trained for a directory assistance task

Digits are still modeled by word models

Grammar for natural numbers ranging from zero to hundreds of millions

Page 22: Natural Numbers Example

Number 25:

Hypothesis 1: vinte e cinco (Twenty and five)

Hypothesis 2: vinte cinco (Twenty five)

But “vinte cinco” could also be the sequence of natural numbers: 20 5
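The ambiguity can be made concrete with a toy grammar that enumerates every reading of a word sequence. This is a hypothetical sketch covering only a few tens and units with an optional "e"; it is not the grammar used in the system, and the vocabulary tables and the function `readings` are invented for the example.

```python
# Tiny illustrative vocabulary (assumption: only a few words are covered).
TENS = {"vinte": 20, "trinta": 30}
UNITS = {"um": 1, "dois": 2, "três": 3, "quatro": 4, "cinco": 5}


def readings(words):
    """Return every way to read the words as a sequence of natural numbers."""
    if not words:
        return [[]]
    results = []
    first = words[0]
    if first in UNITS:
        results += [[UNITS[first]] + rest for rest in readings(words[1:])]
    if first in TENS:
        # Reading 1: the tens word is a complete number on its own.
        results += [[TENS[first]] + rest for rest in readings(words[1:])]
        # Reading 2: tens + optional "e" + unit form a single number.
        k = 2 if len(words) > 1 and words[1] == "e" else 1
        if len(words) > k and words[k] in UNITS:
            value = TENS[first] + UNITS[words[k]]
            results += [[value] + rest for rest in readings(words[k + 1:])]
    return results


with_e = readings(["vinte", "e", "cinco"])
without_e = readings(["vinte", "cinco"])
```

Here `with_e` has the single reading 25, while `without_e` has two readings, the sequence 20 5 and the single number 25, which is exactly the conflict described on the slide.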

Page 23: Natural Numbers - Results

# Mixtures   % Correctness   % Accuracy
1                90.9           90.2
2                95.5           95.0
5                98.5           98.4

Page 24: Sample Application

[Figure: client/server architecture of the sample application. Components: User, Client, Server, State Control, Speech Recording, Feature Extraction, Speech Recognition, Speech Synthesis (DIXI - SVIT). Data flows: Speech, Prompts, Speech / Commands, Synthesised answer / Commands, Answer.]

Page 25: Conclusions and Future Work

Explicitly modeling fillers is a difficult task
– Improved filler model decreases error rate by up to 50%

Develop context dependent models
– Solve vowel reduction and co-articulation problems

Results may be improved through the use of discriminative training techniques
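The "up to 50%" figure reads most naturally as a relative reduction of the error rate. As a worked check, the sketch below compares the correctness values of the baseline results table with those of the first table in the new filler model results. The function name is hypothetical, and treating 100 minus correctness as the error rate is an assumption.

```python
def relative_error_reduction(baseline_pct, improved_pct):
    """Relative reduction (in %) of the error rate, taking error = 100 - correctness."""
    base_err = 100.0 - baseline_pct
    new_err = 100.0 - improved_pct
    return 100.0 * (base_err - new_err) / base_err


# (baseline % correctness, new filler model % correctness), as reported in the slides
isolated = relative_error_reduction(99.0, 99.5)        # 1.0% error down to 0.5%
known_length = relative_error_reduction(97.5, 98.5)    # 2.5% error down to 1.5%
unknown_length = relative_error_reduction(96.2, 97.8)  # 3.8% error down to 2.2%
```

Under these assumptions the isolated-digit task gives the 50% reduction, and the two connected-digit tasks give about 40% and 42%.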