Listen, Attend and Spell
A Neural Network for Large Vocabulary Conversational Speech Recognition

William Chan, Navdeep Jaitly, Quoc Le, Oriol Vinyals
williamchan@cmu.edu

{ndjaitly,qvl,vinyals}@google.com

*work done at Google Brain.

September 13, 2016

Outline

1. Introduction and Motivation
2. Model: Listen, Attend and Spell
3. Experiments and Results
4. Conclusion


Introduction and Motivation

Automatic Speech Recognition

Input
- Acoustic signal

Output
- Word transcription


State-of-the-Art ASR is Complicated

- Signal Processing
- Pronunciation Dictionary
- GMM-HMM
- Context-Dependent Phonemes
- DNN Acoustic Model
- Sequence Training
- Language Model

- Many proxy problems, (mostly) independently optimized
- Disconnect between proxy problems (e.g., frame accuracy) and ASR performance
- Sequence Training solves some of the problems


HMM Assumptions

- Conditional independence between frames/symbols
- Markovian
- Phonemes

- We make untrue assumptions to simplify our problem
- Almost everything falls back to the HMM (and phonemes)


Goal: Model Characters directly from Acoustics

Input
- Acoustic signal (e.g., filterbank spectra)

Output
- English characters

- Don’t make assumptions about our distributions


End-to-End Model

- Signal Processing
- Listen, Attend and Spell (LAS)
- Language Model?

- One model optimized end-to-end
- Learn pronunciation, acoustics, and dictionary all in one end-to-end model


Model: Listen, Attend and Spell

Sequence-to-Sequence (and Attention)

Machine Translation:
- Sutskever et al., “Sequence to Sequence Learning with Neural Networks,” in NIPS 2014.
- Cho et al., “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” in EMNLP 2014.
- Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate,” in ICLR 2015.

TIMIT:
- Chorowski et al., “Attention-Based Models for Speech Recognition,” in NIPS 2015.


Listen, Attend and Spell

Let x be our acoustic features, and let y be the sequence we are trying to model (i.e., the character sequence):

h = Listen(x)  (1)
P(y_i | x, y_{<i}) = AttendAndSpell(y_{<i}, h)  (2)
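To make the factorization concrete, here is a minimal sketch in PyTorch (the class name, layer choice, and sizes are illustrative, not the paper's configuration). The Listener is a single BLSTM here; the full model uses the pyramidal encoder and attention decoder described on the next slides.

```python
# Minimal sketch of Listen(x), assuming PyTorch; sizes are illustrative.
import torch
import torch.nn as nn

class Listener(nn.Module):
    """Listen(x): encode acoustic frames x into higher-level features h."""
    def __init__(self, input_dim=40, hidden_dim=256):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, x):          # x: (batch, T, input_dim)
        h, _ = self.blstm(x)       # h: (batch, T, 2 * hidden_dim)
        return h

x = torch.randn(1, 100, 40)        # 100 frames of 40-dim filterbank features
h = Listener()(x)                  # AttendAndSpell (next slides) consumes h
print(h.shape)                     # torch.Size([1, 100, 512])
```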


Implicit Language Model

- HMM/CTC have a conditional independence assumption
- seq2seq models have a conditional dependence on the previously emitted symbols:

P(y_i | x, y_{<i})  (3)


Listen, Attend and Spell

- Listen(x) can be an RNN (e.g., an LSTM).
- Transforms our input features x into some higher-level feature h

- AttendAndSpell is an attention-based RNN decoder.


Listen, Attend and Spell

AttendAndSpell is an attention-based RNN decoder:

c_i = AttentionContext(s_i, h)  (4)
s_i = RNN(s_{i-1}, y_{i-1}, c_{i-1})  (5)
P(y_i | x, y_{<i}) = CharacterDistribution(s_i, c_i)  (6)

AttentionContext creates an alignment and context for each timestep:

e_{i,u} = 〈φ(s_i), ψ(h_u)〉  (7)
α_{i,u} = exp(e_{i,u}) / ∑_{u'} exp(e_{i,u'})  (8)
c_i = ∑_u α_{i,u} h_u  (9)
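A sketch of one decoder step implementing Equations (4)-(9), assuming PyTorch; the class name and layer sizes are mine, and φ and ψ are single linear layers here (the paper uses small MLPs).

```python
# Sketch of one AttendAndSpell step (Equations 4-9), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttendAndSpellStep(nn.Module):
    def __init__(self, enc_dim=512, dec_dim=512, att_dim=256, vocab_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.rnn = nn.LSTMCell(dec_dim + enc_dim, dec_dim)          # Eq. (5)
        self.phi = nn.Linear(dec_dim, att_dim)                      # phi(s_i)
        self.psi = nn.Linear(enc_dim, att_dim)                      # psi(h_u)
        self.char_dist = nn.Linear(dec_dim + enc_dim, vocab_size)   # Eq. (6)

    def forward(self, y_prev, state, c_prev, h):
        # Eq. (5): advance the decoder state on (y_{i-1}, c_{i-1})
        rnn_in = torch.cat([self.embed(y_prev), c_prev], dim=-1)
        s, cell = self.rnn(rnn_in, state)
        # Eq. (7): scalar energies e_{i,u} = <phi(s_i), psi(h_u)>
        e = torch.einsum('bd,bud->bu', self.phi(s), self.psi(h))
        # Eq. (8): alignment alpha_{i,u} over encoder timesteps u
        alpha = F.softmax(e, dim=-1)
        # Eq. (9): context c_i as the alpha-weighted sum of h_u
        c = torch.einsum('bu,bud->bd', alpha, h)
        # Eq. (6): distribution P(y_i | x, y_{<i})
        logits = self.char_dist(torch.cat([s, c], dim=-1))
        return logits, (s, cell), c, alpha
```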


Listen, Attend and Spell

[Figure: LAS model architecture]

Listen, Attend and Spell

- Attention mechanism creates a short circuit between each decoder output and the acoustics
- More efficient information/gradient flow!
- Creates an explicit alignment between each character and the acoustic features
- CTC’s alignment is latent


Listen, Attend and Spell

- Model works, but...
- Takes “forever” to train; after > 1 month the model still had not converged :(
- WERs in the >20s (CLDNN-HMM is 8-ish)
- Attention mechanism must focus over a long range of frames


Pyramid

[Figure: pyramid structure of the Listener]

Pyramid

- Build higher-level features with each layer
- Reduce the number of timesteps for attention to attend to
- Computational efficiency
- 8 filterbank frames → 1 pyramid frame feature

h_i^j = pBLSTM(h_{i-1}^j, [h_{2i}^{j-1}, h_{2i+1}^{j-1}])  (10)
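A sketch of one pyramid layer under Equation (10), assuming PyTorch: adjacent outputs of the layer below are concatenated, so each layer halves the number of timesteps; stacking three such layers gives the 8-frames-to-1 reduction above. Sizes are illustrative.

```python
# Sketch of one pyramid BLSTM layer (Equation 10), assuming PyTorch.
import torch
import torch.nn as nn

class PyramidBLSTMLayer(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(2 * input_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, h_below):                  # (batch, T, input_dim)
        b, t, d = h_below.shape
        if t % 2 == 1:                           # drop the odd frame, if any
            h_below = h_below[:, :-1, :]
            t -= 1
        # [h^{j-1}_{2i}, h^{j-1}_{2i+1}]: concatenate adjacent timesteps
        pairs = h_below.reshape(b, t // 2, 2 * d)
        h, _ = self.blstm(pairs)                 # (batch, T/2, 2 * hidden_dim)
        return h

x = torch.randn(1, 80, 512)
layer = PyramidBLSTMLayer(512, 256)
print(layer(x).shape)                            # torch.Size([1, 40, 512])
```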


Pyramid

1. Sequence-to-Sequence
2. Attention
3. Pyramid

- 16 to 20-ish WERs (w/o LM)
- Takes around 2-3 weeks to train; overfitting is a HUGE problem
- Mismatch between train and inference conditions


Sampling Trick

Machine Translation, Image Captioning and TIMIT:
- Bengio et al., “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks,” in NIPS 2015.


Sampling Trick

- Training is conditioned on the ground truth
- We don’t have access to the ground truth during inference!

max_θ ∑_i log P(y_i | x, y*_{<i}; θ)  (11)


Sampling Trick

- Sample from our model
- Condition on the sample for the next step’s prediction

ỹ_i ∼ CharacterDistribution(s_i, c_i)  (12)

max_θ ∑_i log P(y_i | x, ỹ_{<i}; θ)  (13)
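A sketch of this sampling trick inside a training loop, assuming PyTorch and the AttendAndSpellStep sketch above; sample_prob is the probability of feeding back a model sample rather than the ground-truth character (the paper reports a 10% sampling rate), and the 512-unit state sizes match that earlier sketch.

```python
# Sketch of the sampling trick during training, assuming PyTorch and the
# AttendAndSpellStep sketch from the attention slide.
import torch

def decode_with_sampling(step, h, targets, sample_prob=0.1, sos_id=0):
    """targets: (batch, L) ground-truth characters; returns summed log P."""
    batch = h.size(0)
    y_prev = torch.full((batch,), sos_id, dtype=torch.long)
    state = (torch.zeros(batch, 512), torch.zeros(batch, 512))   # decoder state
    c = torch.zeros(batch, h.size(-1))                           # context c_0
    log_prob = 0.0
    for i in range(targets.size(1)):
        logits, state, c, _ = step(y_prev, state, c, h)
        log_p = torch.log_softmax(logits, dim=-1)
        log_prob = log_prob + log_p.gather(1, targets[:, i:i+1]).sum()
        if torch.rand(()) < sample_prob:
            # condition the next step on a sample from the model (Eq. 12)
            y_prev = torch.distributions.Categorical(logits=logits).sample()
        else:
            y_prev = targets[:, i]       # teacher forcing on the ground truth
    return log_prob
```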


Listen, Attend and Spell

P(y | x) = ∏_i P(y_i | x, y_{<i})  (14)

h = Listen(x)  (15)
P(y | x) = AttendAndSpell(h, y)  (16)
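Greedy decoding under this factorization looks like the following sketch, assuming the Listener and AttendAndSpellStep sketches above; the sos/eos ids and the length cap are illustrative. The total log-probability is just the sum of the per-character terms in Equation (14).

```python
# Sketch of greedy decoding under Eq. (14), assuming the earlier sketches.
import torch

def greedy_decode(listener, step, x, sos_id=0, eos_id=1, max_len=200):
    h = listener(x)                                   # Eq. (15): h = Listen(x)
    batch = h.size(0)
    y_prev = torch.full((batch,), sos_id, dtype=torch.long)
    state = (torch.zeros(batch, 512), torch.zeros(batch, 512))
    c = torch.zeros(batch, h.size(-1))
    hypothesis, total_log_prob = [], 0.0
    for _ in range(max_len):
        logits, state, c, _ = step(y_prev, state, c, h)
        log_p = torch.log_softmax(logits, dim=-1)
        y_prev = log_p.argmax(dim=-1)                 # greedy choice of y_i
        total_log_prob += log_p.max(dim=-1).values.sum().item()
        if (y_prev == eos_id).all():
            break
        hypothesis.append(y_prev)
    return hypothesis, total_log_prob                 # log P(y|x) = sum of steps
```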


Language Model Rescoring

- Leverage vast quantities of text!
- Normalize our LAS model score by the number of characters in the utterance – LAS has a bias for short utterances.

s(y, x) = log P(y | x) / |y|_c + λ log P_LM(y)  (17)
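A sketch of applying Equation (17) to an n-best list; the hypotheses and log-probabilities below are made up for illustration, and λ would be tuned on a held-out set.

```python
# Sketch of n-best rescoring with Eq. (17). Each hypothesis carries its LAS
# log-probability and an external LM log-probability; lam is a tuned weight.

def rescore(nbest, lam):
    """nbest: list of (text, log_p_las, log_p_lm). Returns best-scoring entry."""
    def score(text, log_p_las, log_p_lm):
        num_chars = max(len(text), 1)          # |y|_c, guards the empty string
        return log_p_las / num_chars + lam * log_p_lm
    return max(nbest, key=lambda hyp: score(*hyp))

nbest = [
    ("call aaa roadside assistance",      -0.57, -12.3),   # illustrative numbers
    ("call triple a roadside assistance", -1.54, -11.8),
]
print(rescore(nbest, lam=0.5))
```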


Experiments and Results

Dataset

- Google voice search
- 2,000 hrs, 3M training utterances
- 16 hrs, 22K test utterances
- Room Simulator: artificially increase the acoustic data by 20x by mixing in YouTube and environmental noise
- Clean and noisy test sets


Training

- Stochastic Gradient Descent
- DistBelief, 32 replicas, minibatch size of 32
- 2-3 weeks of training time


Results

Model                               Clean WER   Noisy WER
CLDNN-HMM (Sainath et al., 2015)    8.0         8.9
LAS                                 16.2        19.0
LAS + LM                            12.6        14.7
LAS + Sampling                      14.1        16.5
LAS + Sampling + LM                 10.3        12.0


Decoding

- We didn’t decode with a dictionary!
- LAS implicitly learnt the dictionary during training
- Rare spelling mistakes!

- We didn’t decode with an LM! (only rescored)
- n-best list decoding where n = 32

- CLDNN-HMM is convolutional and unidirectional; LAS is not convolutional and is bidirectional


Decoding

[Figure: N-best list decoding, N vs. WER, with curves for WER, WER + LM rescoring, and Oracle WER]


Decoding

- 16% WER without any search (and no LM) - just take the greedy path!
- LAS does “reasonably” well even with n = 4 in n-best list decoding
- Not much to gain after n > 16
- LM rescoring recovers less than 1/2 of the oracle gap – need to improve the LM?


Results: Triple A

N       Text                                 logP    WER
Truth   call aaa roadside assistance         -       -
1       call aaa roadside assistance         -0.57   0.0
2       call triple a roadside assistance    -1.54   50.0
3       call trip way roadside assistance    -3.50   50.0
4       call xxx roadside assistance         -4.44   25.0


Conclusion

Conclusion

- Listen, Attend and Spell (LAS)
- End-to-end speech recognition model
- No conditional independence, Markovian assumptions, or proxy problems
- Sequence-to-Sequence + Attention + Pyramid
- One model: integrates all the traditional components of an ASR system (acoustic, pronunciation, language, etc.)

- Competitive with the state-of-the-art CLDNN-HMM system
- 10.3 vs. 8.0 WER

- Time to throw away HMMs and phonemes!
- Independently proposed by Bahdanau et al., 2016 on WSJ (next next talk, check out their paper too!)


Acknowledgements

Nothing is possible without help from many friends...
- Google Speech team: Tara Sainath and Babak Damavandi
- Google Brain team: Andrew Dai, Ashish Agarwal, Samy Bengio, Eugene Brevdo, Greg Corrado, Jeff Dean, Rajat Monga, Christopher Olah, Mike Schuster, Noam Shazeer, Ilya Sutskever, Vincent Vanhoucke and more!

