Multilingual Speech Recognition With A Single End-To-End Model
Shubham Toshniwal1, Tara N. Sainath2, Ron J. Weiss2, Bo Li2, Pedro Moreno2, Eugene Weinstein2, and Kanishka Rao2
1TTI Chicago
2Google
April 18, 2017
Why Multilingual Speech Recognition Models?
I Remarkable progress in speech recognition in the past few years
I Most of this success restricted to high-resource languages, e.g. English
I Google Voice Search supports ∼120 out of 7000 languages
I Multilingual models:
  I Utilize knowledge transfer across languages, and thus alleviate data requirements
  I Successful in Neural Machine Translation (Google NMT)
  I Easier to deploy and maintain
Conventional ASR Systems
I Traditional ASR systems are modular
I Require expert-curated resources
[Figure: conventional ASR pipeline — speech → feature extraction → feature vectors → decoder (acoustic model, pronunciation dictionary, language model) → words, e.g. ‘‘recognize speech’’]
I Multilingual models:
  I Focus on just the acoustic model (Lin, 2009; Ghoshal, 2013)
  I Separate language model and pronunciation model required for each language
End-to-end ASR Models
I Encoder-decoder models achieved state-of-the-art results on the Google Voice Search task (Chiu et al. 2018)
I Encoder-decoder models are appealing because:
  I Conceptually simple; subsume the acoustic model, pronunciation model, and language model in a single model
  I No need for expert-curated resources!
[Figure: encoder-decoder model unrolled over acoustic features x_1 … x_T with encoder states h_1 … h_T; the decoder emits ‘‘recognize speech’’ between GO and EOS]
End-to-End Multilingual ASR Models
[Figure: attention-based encoder-decoder — inputs x_1 … x_T, encoder states h_1 … h_T, attention layer producing weights α_t and context c_t, decoder state d_t emitting output y_t]

u_{ti} = v^T tanh(W_1 h_i + W_2 d_t)
α_t = softmax(u_t)
c_t = Σ_{i=1}^{T} α_{ti} h_i
I We use attention-based encoder-decoder models
I Decoder outputs one character per time step
I For multilingual models, take union over character sets
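The attention step above (u_{ti} = v^T tanh(W_1 h_i + W_2 d_t), α_t = softmax(u_t), c_t = Σ_i α_{ti} h_i) can be sketched in NumPy. This is a minimal illustration of additive attention, not the talk's implementation; all dimensions and variable names are made up for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def attention_context(H, d_t, W1, W2, v):
    """One step of additive attention.

    H   : (T, enc_dim) encoder states h_1..h_T
    d_t : (dec_dim,)   decoder state at step t
    Returns the context vector c_t = sum_i alpha_ti * h_i and the weights.
    """
    # u_ti = v^T tanh(W1 h_i + W2 d_t), computed for all i at once
    u_t = np.tanh(H @ W1.T + d_t @ W2.T) @ v   # shape (T,)
    alpha_t = softmax(u_t)                     # attention distribution over frames
    c_t = alpha_t @ H                          # weighted sum of encoder states
    return c_t, alpha_t

# Toy dimensions for illustration
rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 5, 4, 3, 6
H = rng.standard_normal((T, enc_dim))
d_t = rng.standard_normal(dec_dim)
W1 = rng.standard_normal((att_dim, enc_dim))
W2 = rng.standard_normal((att_dim, dec_dim))
v = rng.standard_normal(att_dim)

c_t, alpha_t = attention_context(H, d_t, W1, W2, v)
```

The softmax guarantees the weights α_t form a probability distribution over the T encoder frames, so c_t is a convex combination of the encoder states.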
Multilingual Encoder-Decoder Models

Model              Training       Inference
Joint model        No language ID No language ID
Multitask model    Language ID    No language ID
Conditioned model  Language ID    Language ID

I Naive model; unaware of the multilingual nature of the data
I Can potentially handle code-switching
Multilingual Encoder-Decoder Models

Model              Training       Inference
Joint model        No language ID No language ID
Multitask model    Language ID    No language ID
Conditioned model  Language ID    Language ID

I Trained to jointly recognize language ID and speech
Multilingual Encoder-Decoder Models

Model              Training       Inference
Joint model        No language ID No language ID
Multitask model    Language ID    No language ID
Conditioned model  Language ID    Language ID

I Learnt embedding of the language ID fed as input to condition the model
I The language ID embedding can be fed in:
  (a) Encoder, (b) Decoder, (c) Encoder & Decoder
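Encoder conditioning can be sketched as concatenating a learnt language embedding e_L to every acoustic feature frame before the encoder runs. This is a minimal sketch of the idea; the layer sizes, names, and use of concatenation (rather than any other combination) are illustrative assumptions.

```python
import numpy as np

# Hypothetical setup: one learnt embedding row per language ID.
NUM_LANGS, EMB_DIM, FEAT_DIM = 9, 8, 80
lang_embeddings = np.random.default_rng(1).standard_normal((NUM_LANGS, EMB_DIM))

def condition_on_language(features, lang_id):
    """features: (T, FEAT_DIM) acoustic frames x_1..x_T.

    Returns (T, FEAT_DIM + EMB_DIM): each frame with the language
    embedding e_L appended, so the encoder sees the language identity
    at every time step.
    """
    e_L = lang_embeddings[lang_id]                  # embedding for this language
    e_tiled = np.tile(e_L, (features.shape[0], 1))  # repeat e_L for every frame
    return np.concatenate([features, e_tiled], axis=1)

x = np.zeros((100, FEAT_DIM))  # 100 frames of dummy features
x_cond = condition_on_language(x, lang_id=3)
```

In a trained system `lang_embeddings` would be a learned parameter updated by backpropagation; here it is random purely to make the sketch runnable.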
Encoder-Conditioned Model
[Figure: encoder of the encoder-conditioned model — the language embedding e_L is paired with each acoustic feature x_1 … x_T before producing encoder states h_1 … h_T]
Task
I Recognize 9 Indian languages with a single model
I Very little script overlap, except for Hindi and Marathi
I The union of character sets is close to 1000 characters!
I But the languages have large overlap in phonetic space (Lavanya et al. 2005)
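Building the joint output vocabulary as the union of per-language character sets can be sketched as below. The transcripts, special symbols, and helper names are hypothetical; real training transcripts would be used.

```python
# Hypothetical per-language transcripts standing in for training data.
transcripts = {
    "hindi":   ["नमस्ते", "मौसम"],
    "tamil":   ["வணக்கம்"],
    "bengali": ["নমস্কার"],
}

# Union over every character seen in any language's transcripts.
charset = set()
for lang, utterances in transcripts.items():
    for utt in utterances:
        charset.update(utt)

# The decoder's output vocabulary is this union plus special symbols
# (the exact specials are an assumption for this sketch).
vocab = sorted(charset) + ["<space>", "<eos>"]
print(len(vocab))
```

Because the nine Indian languages share very little script, this union grows nearly additively, which is how the talk arrives at close to 1000 output characters.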
Experimental Setup
I Training data consists of dictated queries
I Average 230K queries (∼170 hrs) per language
[Figure: fraction of total training data per language (in %), annotated with per-language query counts]

Language    Queries
Bengali     364K
Gujarati    243K
Hindi       213K
Kannada     192K
Malayalam   285K
Marathi     227K
Tamil       164K
Telugu      232K
Urdu        196K

I Baseline: encoder-decoder models trained for individual languages
Joint vs Individual
[Figure: WER (in %) per language and weighted average, joint model vs individual models]
I Joint model outperforms individual models on all languages!
I The joint model is not even language-aware at test time
I Overall a 21% relative reduction in Word Error Rate (WER)
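The quoted 21% is a relative reduction: (baseline WER − joint WER) / baseline WER. The sketch below shows the formula; the absolute WER values are illustrative numbers chosen to be consistent with a 21% relative gain, not figures from the talk.

```python
def relative_wer_reduction(baseline_wer, model_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - model_wer) / baseline_wer

# Illustrative only: an individual-model average of 30.0% WER against
# a joint model at 23.7% WER gives the quoted ~21% relative reduction.
print(round(relative_wer_reduction(30.0, 23.7), 1))  # -> 21.0
```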
Picking the Right Script
[Figure: confusion matrix — true language (rows) vs script of the joint model's output (columns, plus a ‘Mixed’ column)]

           Ben    Guj    Hin    Kan    Mal    Mar    Tam    Tel    Urd    Mixed
Bengali    1      0      0      0      0      0      0      0      0      0
Gujarati   0      0.99   0.003  0.001  0      0      0      0      0.001  0
Hindi      0      0      1      0      0      0      0      0      0      0
Kannada    0.002  0.004  0.035  0.93   0.008  0      0      0.005  0.014  0
Malayalam  0      0.001  0.002  0.004  0.99   0      0      0.005  0.002  0
Marathi    0.001  0.007  0      0.021  0.003  0.95   0      0.002  0.018  0
Tamil      0      0      0      0      0      0      1      0      0      0
Telugu     0      0.001  0      0.003  0.001  0      0      0.99   0      0
Urdu       0.001  0.003  0.009  0.004  0.003  0      0      0      0.98   0

Rarely confused between languages
Joint vs Multitask
[Figure: WER (in %) per language and weighted average, joint model vs multitask model]
Insignificant gains from multitask training
Joint vs Conditioned Models
[Figure: WER (in %) per language and weighted average for the joint, decoder-conditioned, and encoder-conditioned models]
I As expected, conditioning the model on the language ID of the speech helps
I Encoder conditioning:
  I Performs better than decoder conditioning
  I Potential acoustic model adaptation happening
Magic of Conditioning
[Figure: confusion matrix of true language vs output script for the encoder-conditioned model — exactly the identity matrix, i.e. every language is always transcribed in its own script, with no ‘Mixed’ outputs]
Testing the Limits: Code Switching
I Can the joint model code-switch between 2 Indian languages (trained to recognize them separately)?
I Artificial test set of 1000 utterances: a Tamil query followed by a Hindi query, with 50ms of silence in between
I The model does not code-switch :(
  I Picks one of the two scripts and sticks with it
I From manual inspection:
  I Transcribes either the Hindi or the Tamil part in the corresponding script
  I Transliteration in rare cases
Feeding the Wrong Language ID
I Does the model obey the acoustics, or is it faithful to the language ID?
I Artificial dataset of 1000 Urdu queries tagged as Hindi
I The model transliterates the Urdu queries into Hindi's script
I It learns to disentangle the acoustic-phonetic content from the language identity
I A transliterator as a byproduct!
Conclusion
I Encoder-decoder models:
  I Elegant and simple framework for multilingual models
  I Outperform models trained for specific languages
  I Rarely confused between individual languages
  I Fail at code-switching
I Recent work along similar lines got promising results as well (Kim, 2017; Watanabe, 2017; Tong, 2018; Dalmia, 2018)
I Questions?
Conditioning Encoder is Enough
[Figure: WER (in %) per language and weighted average, encoder-conditioned vs encoder+decoder-conditioned models]
I Conditioning the decoder on top of conditioning the encoder doesn't buy us much
I Possibly because the attention mechanism feeds information from the encoder to the decoder