Multilingual Speech Recognition With A Single End-To-End Model
Shubham Toshniwal1, Tara N. Sainath2, Ron J. Weiss2, Bo Li2, Pedro Moreno2, Eugene Weinstein2, and Kanishka Rao2
1TTI Chicago
2Google
April 18, 2017
Why Multilingual Speech Recognition Models?
I Remarkable progress in speech recognition in the past few years
I Most of this success restricted to high-resource languages, e.g. English
I Google Voice Search supports ∼120 out of 7000 languages
I Multilingual models:
  I Utilize knowledge transfer across languages, and thus alleviate data requirements
  I Successful in Neural Machine Translation (Google NMT)
  I Easier to deploy and maintain
Conventional ASR Systems
I Traditional ASR systems are modular
I Require expert-curated resources
[Figure: conventional ASR pipeline — speech → feature extraction → feature vectors → decoder (acoustic model, pronunciation dictionary, language model) → words, e.g. ‘‘recognize speech’’]
I Multilingual models:
  I Focus on just the acoustic model (Lin, 2009; Ghoshal, 2013)
  I Separate language model and pronunciation model required for each language
End-to-end ASR Models
I Encoder-decoder models achieved state-of-the-art results on the Google Voice Search task (Chiu et al. 2018)
I Encoder-decoder models are appealing because:
  I Conceptually simple; subsume the acoustic model, pronunciation model, and language model in a single model
  I No need for expert-curated resources!
[Figure: encoder-decoder model unrolled over acoustic features x_1 … x_T with encoder states h_1 … h_T; the decoder emits ‘‘recognize speech’’ between GO and EOS]
End-to-End Multilingual ASR Models
[Figure: attention-based encoder-decoder — inputs x_1 … x_T, encoder states h_1 … h_T, attention layer producing weights α_t and context c_t, decoder state d_t emitting output y_t]

u_{ti} = v^T tanh(W_1 h_i + W_2 d_t)
α_t = softmax(u_t)
c_t = Σ_{i=1}^{T} α_{ti} h_i
I We use attention-based encoder-decoder models
I Decoder outputs one character per time step
I For multilingual models, take union over character sets
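The attention step above (u_{ti} = v^T tanh(W_1 h_i + W_2 d_t), α_t = softmax(u_t), c_t = Σ_i α_{ti} h_i) can be sketched in NumPy. This is a minimal illustration of additive attention, not the talk's implementation; all dimensions and variable names are made up for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def attention_context(H, d_t, W1, W2, v):
    """One step of additive attention.

    H   : (T, enc_dim) encoder states h_1..h_T
    d_t : (dec_dim,)   decoder state at step t
    Returns the context vector c_t = sum_i alpha_ti * h_i and the weights.
    """
    # u_ti = v^T tanh(W1 h_i + W2 d_t), computed for all i at once
    u_t = np.tanh(H @ W1.T + d_t @ W2.T) @ v   # shape (T,)
    alpha_t = softmax(u_t)                     # attention distribution over frames
    c_t = alpha_t @ H                          # weighted sum of encoder states
    return c_t, alpha_t

# Toy dimensions for illustration
rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 5, 4, 3, 6
H = rng.standard_normal((T, enc_dim))
d_t = rng.standard_normal(dec_dim)
W1 = rng.standard_normal((att_dim, enc_dim))
W2 = rng.standard_normal((att_dim, dec_dim))
v = rng.standard_normal(att_dim)

c_t, alpha_t = attention_context(H, d_t, W1, W2, v)
```

The softmax guarantees the weights α_t form a probability distribution over the T encoder frames, so c_t is a convex combination of the encoder states.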
Multilingual Encoder-Decoder Models

Model              Training       Inference
Joint model        No language ID No language ID
Multitask model    Language ID    No language ID
Conditioned model  Language ID    Language ID

I Naive model; unaware of the multilingual nature of the data
I Can potentially handle code-switching
Multilingual Encoder-Decoder Models

Model              Training       Inference
Joint model        No language ID No language ID
Multitask model    Language ID    No language ID
Conditioned model  Language ID    Language ID

I Trained to jointly recognize language ID and speech
Multilingual Encoder-Decoder Models

Model              Training       Inference
Joint model        No language ID No language ID
Multitask model    Language ID    No language ID
Conditioned model  Language ID    Language ID

I Learnt embedding of the language ID fed as input to condition the model
I The language ID embedding can be fed in:
  (a) Encoder, (b) Decoder, (c) Encoder & Decoder
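Encoder conditioning can be sketched as concatenating a learnt language embedding e_L to every acoustic feature frame before the encoder runs. This is a minimal sketch of the idea; the layer sizes, names, and use of concatenation (rather than any other combination) are illustrative assumptions.

```python
import numpy as np

# Hypothetical setup: one learnt embedding row per language ID.
NUM_LANGS, EMB_DIM, FEAT_DIM = 9, 8, 80
lang_embeddings = np.random.default_rng(1).standard_normal((NUM_LANGS, EMB_DIM))

def condition_on_language(features, lang_id):
    """features: (T, FEAT_DIM) acoustic frames x_1..x_T.

    Returns (T, FEAT_DIM + EMB_DIM): each frame with the language
    embedding e_L appended, so the encoder sees the language identity
    at every time step.
    """
    e_L = lang_embeddings[lang_id]                  # embedding for this language
    e_tiled = np.tile(e_L, (features.shape[0], 1))  # repeat e_L for every frame
    return np.concatenate([features, e_tiled], axis=1)

x = np.zeros((100, FEAT_DIM))  # 100 frames of dummy features
x_cond = condition_on_language(x, lang_id=3)
```

In a trained system `lang_embeddings` would be a learned parameter updated by backpropagation; here it is random purely to make the sketch runnable.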
Encoder-Conditioned Model
[Figure: encoder of the encoder-conditioned model — the language embedding e_L is paired with each acoustic feature x_1 … x_T before producing encoder states h_1 … h_T]
Task
I Recognize 9 Indian languages with a single model
I Very little script overlap, except for Hindi and Marathi
I The union of character sets is close to 1000 characters!
I But the languages have large overlap in phonetic space (Lavanya et al. 2005)
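Building the joint output vocabulary as the union of per-language character sets can be sketched as below. The transcripts, special symbols, and helper names are hypothetical; real training transcripts would be used.

```python
# Hypothetical per-language transcripts standing in for training data.
transcripts = {
    "hindi":   ["नमस्ते", "मौसम"],
    "tamil":   ["வணக்கம்"],
    "bengali": ["নমস্কার"],
}

# Union over every character seen in any language's transcripts.
charset = set()
for lang, utterances in transcripts.items():
    for utt in utterances:
        charset.update(utt)

# The decoder's output vocabulary is this union plus special symbols
# (the exact specials are an assumption for this sketch).
vocab = sorted(charset) + ["<space>", "<eos>"]
print(len(vocab))
```

Because the nine Indian languages share very little script, this union grows nearly additively, which is how the talk arrives at close to 1000 output characters.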
Experimental Setup
I Training data consists of dictated queries
I Average 230K queries (∼170 hrs) per language
[Figure: fraction of total training data per language (in %), annotated with per-language query counts]

Language    Queries
Bengali     364K
Gujarati    243K
Hindi       213K
Kannada     192K
Malayalam   285K
Marathi     227K
Tamil       164K
Telugu      232K
Urdu        196K

I Baseline: encoder-decoder models trained for individual languages
Joint vs Individual
[Figure: WER (in %) per language and weighted average, joint model vs individual models]
I Joint model outperforms individual models on all languages!
I The joint model is not even language-aware at test time
I Overall a 21% relative reduction in Word Error Rate (WER)
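The quoted 21% is a relative reduction: (baseline WER − joint WER) / baseline WER. The sketch below shows the formula; the absolute WER values are illustrative numbers chosen to be consistent with a 21% relative gain, not figures from the talk.

```python
def relative_wer_reduction(baseline_wer, model_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - model_wer) / baseline_wer

# Illustrative only: an individual-model average of 30.0% WER against
# a joint model at 23.7% WER gives the quoted ~21% relative reduction.
print(round(relative_wer_reduction(30.0, 23.7), 1))  # -> 21.0
```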
Picking the Right Script
[Figure: confusion matrix — true language (rows) vs script of the joint model's output (columns, plus a ‘Mixed’ column)]

           Ben    Guj    Hin    Kan    Mal    Mar    Tam    Tel    Urd    Mixed
Bengali    1      0      0      0      0      0      0      0      0      0
Gujarati   0      0.99   0.003  0.001  0      0      0      0      0.001  0
Hindi      0      0      1      0      0      0      0      0      0      0
Kannada    0.002  0.004  0.035  0.93   0.008  0      0      0.005  0.014  0
Malayalam  0      0.001  0.002  0.004  0.99   0      0      0.005  0.002  0
Marathi    0.001  0.007  0      0.021  0.003  0.95   0      0.002  0.018  0
Tamil      0      0      0      0      0      0      1      0      0      0
Telugu     0      0.001  0      0.003  0.001  0      0      0.99   0      0
Urdu       0.001  0.003  0.009  0.004  0.003  0      0      0      0.98   0

Rarely confused between languages
Joint vs Multitask
[Figure: WER (in %) per language and weighted average, joint model vs multitask model]
Insignificant gains from multitask training
Joint vs Conditioned Models
[Figure: WER (in %) per language and weighted average for the joint, decoder-conditioned, and encoder-conditioned models]
I As expected, conditioning the model on the language ID of the speech helps
I Encoder conditioning:
  I Performs better than decoder conditioning
  I Potential acoustic model adaptation happening
Magic of Conditioning
[Figure: confusion matrix of true language vs output script for the encoder-conditioned model — exactly the identity matrix, i.e. every language is always transcribed in its own script, with no ‘Mixed’ outputs]
Testing the Limits: Code Switching
I Can the joint model code-switch between 2 Indian languages (trained to recognize them separately)?
I Artificial test set of 1000 utterances: a Tamil query followed by a Hindi query, with 50ms of silence in between
I The model does not code-switch :(
  I Picks one of the two scripts and sticks with it
I From manual inspection:
  I Transcribes either the Hindi or the Tamil part in the corresponding script
  I Transliteration in rare cases
Feeding the Wrong Language ID
I Does the model obey the acoustics, or is it faithful to the language ID?
I Artificial dataset of 1000 Urdu queries tagged as Hindi
I The model transliterates the Urdu queries into Hindi's script
I It learns to disentangle the acoustic-phonetic content from the language identity
I A transliterator as a byproduct!
Conclusion
I Encoder-decoder models:
  I Elegant and simple framework for multilingual models
  I Outperform models trained for specific languages
  I Rarely confused between individual languages
  I Fail at code-switching
I Recent work along similar lines got promising results as well (Kim, 2017; Watanabe, 2017; Tong, 2018; Dalmia, 2018)
I Questions?
Conditioning Encoder is Enough
[Figure: WER (in %) per language and weighted average, encoder-conditioned vs encoder+decoder-conditioned models]
I Conditioning the decoder on top of conditioning the encoder doesn't buy us much
I Possibly because the attention mechanism feeds information from the encoder to the decoder