Multilingual Speech Recognition With A Single End-To-End Model

Shubham Toshniwal¹, Tara N. Sainath², Ron J. Weiss², Bo Li², Pedro Moreno², Eugene Weinstein², and Kanishka Rao²

¹TTI Chicago

²Google

April 18, 2017

Why Multilingual Speech Recognition Models?

- Remarkable progress in speech recognition in the past few years
- Most of this success is restricted to high-resource languages, e.g. English
- Google Voice Search supports ∼120 out of 7000 languages
- Multilingual models:
  - Utilize knowledge transfer across languages, and thus alleviate the data requirement
  - Successful in Neural Machine Translation (Google NMT)
  - Easier to deploy and maintain

Conventional ASR Systems

- Traditional ASR systems are modular
- Require expert-curated resources

[Figure: conventional ASR pipeline. Speech -> Feature Extraction -> Feature Vectors -> Decoder -> Words ("recognize speech"). The decoder draws on an Acoustic Model, a Pronunciation Dictionary, and a Language Model.]

- Multilingual models:
  - Focus on just the acoustic model (Lin, 2009; Ghoshal, 2013)
  - Separate language model and pronunciation model required for each language

End-to-end ASR Models

- Encoder-decoder models achieved state-of-the-art results on the Google Voice Search task (Chiu et al., 2018)
- Encoder-decoder models are appealing because:
  - Conceptually simple; they subsume the acoustic model, pronunciation model, and language model in a single model
  - No need for expert-curated resources!

[Figure: encoder-decoder model. Acoustic features x1, x2, x3, ..., xT are encoded into h1, h2, h3, ..., hT; the decoder starts from GO and emits "recognize speech" followed by EOS.]

End-to-End Multilingual ASR Models

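\[ u_{it} = v^\top \tanh(W_1 h_i + W_2 d_t) \]

\[ \alpha_t = \operatorname{softmax}(u_t) \]

\[ c_t = \sum_{i=1}^{T} \alpha_{it} h_i \]

[Figure: attention-based encoder-decoder. The encoder maps inputs x1, x2, x3, ..., xT to states h1, h2, h3, ..., hT; the attention layer combines them with the decoder state dt to produce weights αt and context ct; the decoder outputs yt.]

- We use attention-based encoder-decoder models
- Decoder outputs one character per time step
- For multilingual models, take the union over character sets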
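The attention equations on this slide can be traced in a few lines of plain Python. This is only a minimal sketch for one decoder step: the function name, the list-based vectors, and all toy values are assumptions made for illustration, not the authors' implementation.

```python
import math

def attention_context(h, d_t, W1, W2, v):
    """Return (alpha_t, c_t) for one decoder step.

    h   : list of T encoder states, each a list of length n
    d_t : decoder state, a list of length m
    W1  : k x n matrix, W2 : k x m matrix, v : list of length k
    """
    def matvec(M, x):
        return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

    W2d = matvec(W2, d_t)  # identical for every encoder position
    # u_it = v^T tanh(W1 h_i + W2 d_t)
    u = []
    for h_i in h:
        W1h = matvec(W1, h_i)
        hidden = [math.tanh(a + b) for a, b in zip(W1h, W2d)]
        u.append(sum(v_k * z_k for v_k, z_k in zip(v, hidden)))
    # alpha_t = softmax(u_t); subtracting max(u) keeps exp() stable
    u_max = max(u)
    exp_u = [math.exp(x - u_max) for x in u]
    total = sum(exp_u)
    alpha = [e / total for e in exp_u]
    # c_t = sum_i alpha_it * h_i
    n = len(h[0])
    c_t = [sum(alpha[i] * h[i][j] for i in range(len(h))) for j in range(n)]
    return alpha, c_t

# Toy example (hypothetical values): T = 3 encoder states of size 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_t = [0.5]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0], [1.0]]
v = [1.0, 1.0]
alpha, c_t = attention_context(h, d_t, W1, W2, v)
```

The weights in `alpha` sum to one, so `c_t` is a convex combination of the encoder states; in this toy setup the third state gets the largest weight.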

Multilingual Encoder-Decoder Models

Model               Training          Inference
Joint model         No language ID    No language ID
Multitask model     Language ID       No language ID
Conditioned model   Language ID       Language ID

Joint model:
- Naive model; unaware of the multilingual nature of the data
- Can potentially handle code-switching

Multitask model:
- Trained to jointly recognize language ID and speech

Conditioned model:
- Learnt embedding of the language ID fed as input to condition the model
- Language ID embedding can be fed in: (a) Encoder, (b) Decoder, (c) Encoder & Decoder

Encoder-Conditioned Model

[Figure: encoder of the encoder-conditioned model. The language embedding eL is fed alongside each acoustic feature vector x1, x2, x3, ..., xT, and the encoder produces h1, h2, h3, ..., hT.]
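One simple way to feed the language embedding with every acoustic frame is to append it to each feature vector. The sketch below assumes concatenation and uses hypothetical names (`condition_on_language`, the `"hi"`/`"ta"` keys) and made-up embedding values; the slides do not spell out the exact mechanism.

```python
def condition_on_language(features, lang_id, embeddings):
    """Append the language embedding e_L to every acoustic frame x_t."""
    e_L = embeddings[lang_id]
    # List concatenation builds a new, longer frame; the input is not modified.
    return [frame + e_L for frame in features]

# Hypothetical 2-dimensional embeddings; in the model these would be learnt.
embeddings = {"hi": [0.1, -0.3], "ta": [0.7, 0.2]}
features = [[0.5, 1.0, 1.5], [0.4, 0.9, 1.4]]  # T = 2 frames, 3 features each
conditioned = condition_on_language(features, "ta", embeddings)
```

Every frame grows from 3 to 5 dimensions, and the same `e_L` is repeated at each time step, matching the repeated eL in the figure.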

Task

- Recognize 9 Indian languages with a single model
- Very little script overlap, except for Hindi and Marathi
- The union of character sets is close to 1000 characters!
- But the languages have a large overlap in phonetic space (Lavanya et al., 2005)
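Taking the union over character sets can be sketched as below. The three-language corpus and the function name `build_output_vocab` are hypothetical and serve only to show how a shared script (Hindi and Marathi) overlaps while other scripts contribute disjoint characters.

```python
def build_output_vocab(transcripts_by_language):
    """Union the characters seen in each language's transcripts."""
    per_language = {
        lang: set("".join(texts))
        for lang, texts in transcripts_by_language.items()
    }
    union = set().union(*per_language.values())
    # Sort so the character-to-index mapping is reproducible.
    char_to_id = {ch: i for i, ch in enumerate(sorted(union))}
    return per_language, char_to_id

# Tiny made-up corpus, written with escapes to stay encoding-safe.
corpus = {
    "hindi": ["\u0928\u092e"],    # Devanagari letters NA, MA
    "marathi": ["\u092e\u0930"],  # Devanagari letters MA, RA
    "tamil": ["\u0ba4\u0bae"],    # Tamil letters TA, MA
}
per_language, char_to_id = build_output_vocab(corpus)
```

Here the joint vocabulary has five characters: Hindi and Marathi share one, and Tamil shares none with either.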

Experimental Setup

- Training data consists of dictated queries
- Average of 230K queries (∼170 hrs) per language

[Figure: bar chart of each language's fraction of total data (in %), with each bar labeled by its number of queries.]

Language    Queries
Bengali     364K
Gujarati    243K
Hindi       213K
Kannada     192K
Malayalam   285K
Marathi     227K
Tamil       164K
Telugu      232K
Urdu        196K

- Baseline: encoder-decoder models trained for individual languages

Joint vs Individual

[Figure: bar chart of WER (in %) for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, Urdu, and the weighted average, comparing the Joint and Individual models.]

- Joint model outperforms individual models on all languages!
- The joint model is not even language-aware at test time
- Overall, a 21% relative reduction in Word Error Rate (WER)
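A relative reduction is measured against the baseline WER rather than in absolute points. The WER values in the sketch below are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the talk.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new model."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical WERs in %: a drop of 6 absolute points from a 30% baseline.
baseline, joint = 30.0, 24.0
reduction = relative_reduction(baseline, joint)
```

With these made-up numbers, 6 absolute points correspond to a 20% relative reduction.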

Picking the Right Script

[Figure: heatmap, color scale 0.0 to 1.0. Rows: language of the utterance. Columns: script of the model output, with "Mixed" as the last column.]

           Bengali  Gujarati  Hindi  Kannada  Malayalam  Marathi  Tamil  Telugu  Urdu   Mixed
Bengali    1        0         0      0        0          0        0      0       0      0
Gujarati   0        0.99      0.003  0.001    0          0        0      0       0.001  0
Hindi      0        0         1      0        0          0        0      0       0      0
Kannada    0.002    0.004     0.035  0.93     0.008      0        0      0.005   0.014  0
Malayalam  0        0.001     0.002  0.004    0.99       0        0      0.005   0.002  0
Marathi    0.001    0.007     0      0.021    0.003      0.95     0      0.002   0.018  0
Tamil      0        0         0      0        0          0        1      0       0      0
Telugu     0        0.001     0      0.003    0.001      0        0      0.99    0      0
Urdu       0.001    0.003     0.009  0.004    0.003      0        0      0       0.98   0

Rarely confused between languages
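A rough way to label which script a transcript was written in is to read the script name from each character's Unicode name. The sketch below is an assumption for illustration and not the procedure used for the slide; note that it cannot separate Hindi from Marathi, since both use Devanagari.

```python
import unicodedata

def transcript_script(text):
    """Return the single script used in `text`, or "MIXED" if several appear."""
    scripts = set()
    for ch in text:
        # Only letters (L*) and combining marks (M*) carry script information;
        # spaces, digits, and punctuation are skipped.
        if unicodedata.category(ch)[0] in "LM":
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. "DEVANAGARI", "TAMIL"
    if not scripts:
        return "NONE"
    return scripts.pop() if len(scripts) == 1 else "MIXED"

devanagari_text = "\u0928\u092e"  # Devanagari letters NA, MA
tamil_text = "\u0ba4\u0bae"       # Tamil letters TA, MA
labels = [
    transcript_script(devanagari_text),
    transcript_script(tamil_text),
    transcript_script(devanagari_text + " " + tamil_text),
]
```

Counting these labels per spoken language and normalizing each row would give a matrix of the same shape as the one on this slide.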

Joint vs Multitask

[Figure: bar chart of WER (in %) for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, Urdu, and the weighted average, comparing the Joint and Multitask models.]

Insignificant gains from multitask training

Joint vs Conditioned Models

[Figure: bar chart of WER (in %) for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, Urdu, and the weighted average, comparing the Joint, Decoder-conditioned, and Encoder-conditioned models.]

- As expected, conditioning the model on the language ID of the speech helps
- Encoder conditioning:
  - Performs better than decoder conditioning
  - Potential acoustic model adaptation happening

Magic of Conditioning

[Figure: heatmap, color scale 0.0 to 1.0. Rows: language of the utterance. Columns: script of the model output, with "Mixed" as the last column.]

           Bengali  Gujarati  Hindi  Kannada  Malayalam  Marathi  Tamil  Telugu  Urdu   Mixed
Bengali    1        0         0      0        0          0        0      0       0      0
Gujarati   0        1         0      0        0          0        0      0       0      0
Hindi      0        0         1      0        0          0        0      0       0      0
Kannada    0        0         0      1        0          0        0      0       0      0
Malayalam  0        0         0      0        1          0        0      0       0      0
Marathi    0        0         0      0        0          1        0      0       0      0
Tamil      0        0         0      0        0          0        1      0       0      0
Telugu     0        0         0      0        0          0        0      1       0      0
Urdu       0        0         0      0        0          0        0      0       1      0

Testing the Limits: Code Switching

- Can the joint model code-switch between 2 Indian languages (having been trained to recognize them separately)?
- Artificial test set of 1000 utterances, each a Tamil query followed by a Hindi query with 50 ms of silence in between
- The model does not code-switch :(
  - Picks one of the two scripts and sticks with it
- From manual inspection:
  - Transcribes either the Hindi or the Tamil part in the corresponding script
  - Transliteration in rare cases
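An artificial code-switched utterance of this kind can be built by joining two waveforms with a short silent gap. The sketch below assumes 16 kHz audio held as plain lists of samples; the sample rate, the function name, and the stand-in signals are assumptions, as the slides give only the 50 ms gap.

```python
def concat_with_silence(first, second, sample_rate=16000, silence_ms=50):
    """Join two waveforms (lists of samples) with a silent gap in between."""
    n_silence = sample_rate * silence_ms // 1000  # 800 samples at 16 kHz
    return first + [0.0] * n_silence + second

# Hypothetical stand-ins for a Tamil query (1 s) and a Hindi query (0.5 s).
tamil_audio = [0.1] * 16000
hindi_audio = [-0.1] * 8000
mixed = concat_with_silence(tamil_audio, hindi_audio)
```

The result keeps both queries intact and in order, so the reference transcript is simply the Tamil text followed by the Hindi text.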

Feeding the Wrong Language ID

- Does the model obey the acoustics, or is it faithful to the language ID?
- Artificial dataset of 1000 Urdu queries tagged as Hindi
- Transliterates the Urdu queries into Hindi's script
- Learns to disentangle the acoustic-phonetic content from the language identity
- Transliterator as a byproduct!

Conclusion

- Encoder-decoder models:
  - Elegant and simple framework for multilingual models
  - Outperform models trained for specific languages
  - Rarely confused between individual languages
  - Fail at code-switching
- Recent work along similar lines has shown promising results as well (Kim, 2017; Watanabe, 2017; Tong, 2018; Dalmia, 2018)
- Questions?

Conditioning Encoder is Enough

[Figure: bar chart of WER (in %) for Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, Urdu, and the weighted average, comparing the Encoder-conditioned and Encoder+Decoder-conditioned models.]

- Conditioning the decoder on top of conditioning the encoder yields little additional gain
- Possibly because the attention mechanism already feeds information from the encoder to the decoder
