Recent advances in automatic speech recognition — A brief overview
Liang Lu, University of Edinburgh
Liang Lu ([email protected]), Heriot-Watt University, Feb 2014.

Source: ttic.uchicago.edu/~llu/pdf/liang_hwu14.pdf


Page 1

Recent advances in automatic speech recognition — A brief overview

Liang Lu
University of Edinburgh

Page 2

This talk

- What is happening in ASR?
- Background: speech recognition and its application
- (Recent) advances in system representation
  - Weighted finite state transducer
- Recent advances in language modelling
  - Recurrent neural network language model
- Recent advances in acoustic modelling
  - Deep neural network acoustic model
- Summary

Page 3

Background

- Speech is one of the most natural ways to communicate information
- ASR is a central component of voice-driven information processing systems

X. He and L. Deng, "Speech-Centric Information Processing: An Optimization-Oriented Approach", Proceedings of the IEEE, 2013

Page 4

Background

- What does ASR do, and how does it do it?

  Speech → [ASR] → Text

- It can be expressed mathematically as

  \hat{W} = \arg\max_W P(W \mid X)    (1)
          = \arg\max_W \underbrace{p(X \mid W)}_{\text{likelihood}} \, \underbrace{P(W)}_{\text{prior}}    (2)

  where X is a sequence of acoustic feature vectors, and W is a word sequence.
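The decision rule in equations (1)-(2) can be sketched in code. The hypotheses and log-scores below are invented for illustration; in a real system log p(X|W) comes from the acoustic model and log P(W) from the language model.

```python
# Hypothetical candidates W with invented log-scores:
# (log p(X|W) from an acoustic model, log P(W) from a language model).
candidates = {
    "i was thinking about my sweet time": (-120.5, -14.2),
    "i'm thinking while my sweet time":   (-118.9, -18.7),
    "i had what i'm thinking":            (-125.0, -13.1),
}

def decode(cands):
    """arg max_W p(X|W) P(W), computed in log space to avoid underflow."""
    return max(cands, key=lambda w: sum(cands[w]))

print(decode(candidates))   # the W with the best combined log-score
```

Working in log space turns the product in (2) into a sum, which is how real decoders avoid numerical underflow.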

Page 5

It is still hard, let’s decompose it further ...

Example n-best word sequences with scores (language model):
  i'm i'm what i'm thinking while my sweet time i'm very   0.0324
  i had what i'm thinking while my sweet time i'm very     0.0127
  i i was thinking about my sweet time i'm very            0.0046
  ...

Pronunciation lexicon entries (pronunciation model):
  abide      ax b ay d              1.0
  abiding    ax b ay d ih ng        1.0
  abilities  ax b ih l ih t iy z    0.666666
  abilities  ey b ih l ih t iy z    0.333333
  ability    ax b ih l ih t iy      1.0
  able       ax b ax l              0.413349
  able       ey b ax l              0.553356
  ...

Context dependency: ax b ay d ---> sil-ax-b ax-b-ay b-ay-d ay-d-si  1.0

HMMs: each context-dependent phone (e.g. sil-ax-b, ..., ay-d-si) is modelled by an HMM with states j-1, j, j+1.

Components: LM -- language model; PM -- pronunciation model; CD -- context dependency; HMMs.

Page 6

It is still hard, let’s decompose it further ...

[Same decomposition figure as Page 5 (LM, PM, CD, HMMs), with the components marked "Active research".]

Page 7

System training - a generative process

[Figure: the pipeline run as a generative process -- a word sequence ("i was thinking about my sweet time i'm very") is expanded through the pronunciation lexicon into phones, through context dependency into context-dependent phones (sil-ax-b, ..., ay-d-si), and through HMMs (states j-1, j, j+1) down to acoustic features.]

Page 8

Decoding - a search problem

[Figure: the same pipeline run in the other direction as a search -- acoustic features are mapped through HMMs, context dependency and the lexicon up to scored candidate word sequences, e.g. "i'm i'm what i'm thinking while my sweet time i'm very  0.0324".]

Page 9

(Recent) advances in system representation

[Figure: the full LM / PM / CD / HMM pipeline again, annotated with a question mark -- how should all of these components be represented together?]

Page 10

(Recent) advances in system representation

WFST -- weighted finite state transducer

- Input vocabulary i ∈ Φ1
- Output vocabulary o ∈ Φ2
- Weight w ∈ R
- ⊕ operation
- ⊗ operation

Example: state 0 --brad:brad/2--> state 1 --pitt:pitt/5--> state 2

M. Mohri and F. Pereira, "Weighted Finite-State Transducers in Speech Recognition", Computer Speech & Language, 2002
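As a minimal sketch (my own representation, not the OpenFst API), the two-arc transducer above can be scored in the tropical semiring, where ⊗ = + extends a path and ⊕ = min picks the better of two path weights:

```python
# Arcs of the example transducer: (src, input_label, output_label, weight, dst)
arcs = [
    (0, "brad", "brad", 2.0, 1),
    (1, "pitt", "pitt", 5.0, 2),
]

def otimes(a, b):      # tropical semiring: extend a path by adding weights
    return a + b

def oplus(a, b):       # tropical semiring: keep the better (smaller) weight
    return min(a, b)

def path_weight(arcs, input_seq, start=0, final=2):
    """Follow the (here unique) path matching input_seq, accumulating otimes."""
    state, weight, output = start, 0.0, []
    for sym in input_seq:
        matches = [a for a in arcs if a[0] == state and a[1] == sym]
        if not matches:
            return None
        _, _, osym, w, dst = matches[0]
        weight = otimes(weight, w)
        output.append(osym)
        state = dst
    return (weight, output) if state == final else None

print(path_weight(arcs, ["brad", "pitt"]))   # (7.0, ['brad', 'pitt'])
```

With several competing paths, ⊕ would combine their weights; here the path is unique, so only ⊗ is exercised.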

Page 11

(Recent) advances in system representation

WFST for language model and pronunciation model

M. Mohri and F. Pereira, "Weighted Finite-State Transducers in Speech Recognition", Computer Speech & Language, 2002

Page 12

(Recent) advances in system representation

- WFSTs can integrate all the components of an ASR system into a single joint graph, with additional optimisation
- If we define
  - H -- HMMs
  - C -- context dependency transducer
  - L -- pronunciation model
  - G -- language model

  then the ASR task can be represented simply as

  \hat{w} = \text{best\_path}(H \circ C \circ L \circ G)    (3)

  given the acoustic signals.
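A naive version of the composition in equation (3) can be sketched for two toy transducers: a hypothetical one-entry pronunciation model L and a one-word language model G. All labels and weights below are invented, and real decoders use optimised WFST libraries such as OpenFst rather than anything like this.

```python
# A transducer is (arcs, start, finals); an arc is (src, in, out, weight, dst).
# Tropical semiring: weights add along a path.

L = ([(0, "ax", "<eps>", 0.0, 1),       # toy lexicon: phones "ax b ax l" -> able
      (1, "b",  "<eps>", 0.0, 2),
      (2, "ax", "<eps>", 0.0, 3),
      (3, "l",  "able",  2.0, 4)], 0, {4})

G = ([(0, "able", "able", 3.0, 1)], 0, {1})   # toy one-word language model

def compose(t1, t2):
    """Match t1's output labels against t2's input labels; <eps> outputs on
    t1 advance t1 while t2 stays put (no epsilons on t2 assumed)."""
    (a1, s1, f1), (a2, s2, f2) = t1, t2
    states2 = {s2} | {a[0] for a in a2} | {a[4] for a in a2}
    arcs = []
    for (p, i, o, w, q) in a1:
        if o == "<eps>":
            arcs += [((p, r), i, "<eps>", w, (q, r)) for r in states2]
        else:
            arcs += [((p, p2), i, o2, w + w2, (q, q2))
                     for (p2, i2, o2, w2, q2) in a2 if i2 == o]
    return arcs, (s1, s2), {(x, y) for x in f1 for y in f2}

def run(t, input_seq):
    """Deterministically follow input_seq; return (weight, output) or None."""
    arcs, state, finals = t
    weight, output = 0.0, []
    for sym in input_seq:
        matches = [a for a in arcs if a[0] == state and a[1] == sym]
        if not matches:
            return None
        _, _, osym, w, dst = matches[0]
        weight += w
        if osym != "<eps>":
            output.append(osym)
        state = dst
    return (weight, output) if state in finals else None

LG = compose(L, G)
print(run(LG, ["ax", "b", "ax", "l"]))   # (5.0, ['able'])
```

The composed machine maps the phone string directly to the word with the combined (⊗-accumulated) weight, which is exactly what H ∘ C ∘ L ∘ G does at scale.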

Page 13

(Recent) advances in system representation

- WFST provides an elegant interface for downstream applications
- An example of spoken language understanding (ASR + NLU):

0 --Show:O--> 1 --me:O--> 2 --movies:B-movie_type--> 3 --with:O--> 4 --brad:B-movie_star--> 5 --pitt:I-movie_star--> 6

A. Deoras et al, "Joint Discriminative Decoding of Words and Semantic Tags for Spoken Language Understanding", IEEE TASLP, 2013

Page 14

(Recent) advances in system representation

- An example of speech-to-speech translation (ASR + MT)

X. He, L. Deng and A. Acero, "Why Word Error Rate is not a Good Metric for Speech Recognizer Training for the Speech Translation Task?", ICASSP 2011

Page 15

(Recent) advances in system representation

X. He, L. Deng and A. Acero, "Why Word Error Rate is not a Good Metric for Speech Recognizer Training for the Speech Translation Task?", ICASSP 2011

Page 16

(Recent) advances in system representation

- Common practice -- coupling ASR and MT with WFSTs

B. Zhou et al, "Folsom: A Fast and Memory-Efficient Phrase-based Approach to Statistical Machine Translation", SLT 2006
B. Zhou et al, "On Efficient Coupling of ASR and SMT for Speech Translation", ICASSP 2007

Page 17

(Recent) advances in system representation

Page 18

Recent advances in ASR

[Same decomposition figure as Page 5 (LM, PM, CD, HMMs), with the components marked "Active research".]

Page 19

Recent advances in ASR

[Same decomposition figure as Page 5 (LM, PM, CD, HMMs), now annotated "Neural networks" -- the recent development across these components.]

Page 20

Neural networks in language modelling

- The n-gram language model has defined the state of the art for almost 40 years [L. R. Bahl, 1978]
- There has been a long struggle to move beyond n-grams with various statistical models:
  - Random forest language model [P. Xu, 2004]
  - Class-based language model, e.g. IBM Model M [S.F. Chen, 2009]
  - Nonparametric language model [Y.W. Teh, 2006]
  - Discriminative language model [B. Roark, 2006]
  - ...
- It may finally be happening, with the recurrent neural network language model (RNNLM)

T. Mikolov, et al, "Recurrent neural network based language model", Interspeech 2010

Page 21

Neural networks in language modelling

- The aim of a language model is very simple:

  P(w_n \mid w_{n-1}, \ldots, w_1) \approx P(w_n \mid w_{n-1}, \ldots, w_{n-k})    (4)

  but estimating it is very difficult for k > 3 on a large-vocabulary task -- e.g. what is the value of 60,000^3?
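The arithmetic behind that question: a direct table of k-gram probabilities over a 60,000-word vocabulary would need on the order of V^k entries.

```python
V = 60_000                       # vocabulary size from the slide
for k in (2, 3, 4):
    # Number of distinct k-word contexts-plus-word events to estimate.
    print(f"{k}-gram: about {V ** k:.1e} possible parameters")
```

At k = 3 this is already about 2.2 × 10^14 events, vastly more than any training corpus can cover, which is why n-grams rely on back-off and smoothing.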

Page 22

Neural networks in language modelling

- The neural network language model is not new [Y. Bengio, 2003]

[Figure: feedforward NNLM -- the k context words w_{n-k}, ..., w_{n-1} enter as one-hot vectors (0 0 0 1 0 0 ...) at the input layer, pass through a shared projection layer and a hidden layer, and the output layer gives P(w_n = n | w_{n-1}, ..., w_{n-k}).]

Y. Bengio, et al, "A neural probabilistic language model", JMLR 2003
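A minimal forward pass through such a feedforward NNLM can be sketched as follows; the dimensions are toy values of my choosing, not the configuration of Bengio et al.

```python
import numpy as np

rng = np.random.default_rng(0)
V, P, H, k = 10, 4, 8, 2   # toy sizes: vocab, projection dim, hidden dim, context

C  = rng.normal(size=(V, P))        # shared projection (embedding) matrix
W1 = rng.normal(size=(k * P, H))    # projection -> hidden weights
W2 = rng.normal(size=(H, V))        # hidden -> output weights

def nnlm_forward(context_ids):
    """P(w_n | w_{n-1}, ..., w_{n-k}) over the whole vocabulary."""
    x = C[context_ids].reshape(-1)      # look up and concatenate embeddings
    h = np.tanh(x @ W1)                 # hidden layer
    logits = h @ W2
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

p = nnlm_forward([3, 7])
print(p.shape, p.sum())                 # a proper distribution over V words
```

The one-hot input layer of the figure is equivalent to the embedding lookup `C[context_ids]`, which is how it is implemented in practice.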

Page 23

Neural networks in language modelling

- The RNNLM differs in that a recurrent hidden layer is used to capture longer contextual information

T. Mikolov, et al, "Recurrent neural network based language model", Interspeech 2010
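The key difference can be sketched as a single recurrent step: the hidden state carries the whole history forward, rather than a fixed window of k words. Dimensions below are toy values, not Mikolov's actual RNNLM.

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 10, 6   # toy vocabulary and hidden sizes

U = rng.normal(size=(V, H)) * 0.1   # input -> hidden
W = rng.normal(size=(H, H)) * 0.1   # hidden -> hidden (the recurrent connection)
O = rng.normal(size=(H, V)) * 0.1   # hidden -> output

def rnnlm_step(word_id, h_prev):
    """One step: the new hidden state mixes the current word with h_prev,
    so in principle the entire history influences the next prediction."""
    h = np.tanh(U[word_id] + h_prev @ W)
    logits = h @ O
    e = np.exp(logits - logits.max())   # stable softmax over the vocabulary
    return h, e / e.sum()

h = np.zeros(H)
for w_id in [3, 1, 4]:                  # a toy word-id sequence
    h, p = rnnlm_step(w_id, h)
print(p.sum())                          # distribution over the next word
```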

Page 24

Neural networks in language modelling

- The RNNLM achieves significant reductions in both perplexity and word error rate (results on Wall Street Journal)

T. Mikolov, et al, "Recurrent neural network based language model", Interspeech 2010

Page 25

Neural networks in language modelling

Not limited to language modelling

- RNN for spoken language understanding [K. Yao, et al, 2013]

K. Yao, et al, "Recurrent neural networks for language understanding", Interspeech 2013

Page 26

Neural networks in language modelling

Not limited to language modelling

- RNN for spoken language understanding [K. Yao, et al, 2013]

K. Yao, et al, "Recurrent neural networks for language understanding", Interspeech 2013

Page 27

Neural networks in language modelling

Not limited to language modelling

- RNN for machine translation [N. Kalchbrenner, P. Blunsom, 2013]

N. Kalchbrenner, P. Blunsom, "Recurrent continuous translation models", EMNLP 2013

Page 28

Neural networks in acoustic modelling

- GMM-HMM has defined the state of the art for over 20 years

[Figure: HMM with states j-1, j, j+1]

- Pros:
  - Efficient and parallel training algorithms
  - Clear physical meaning (Gaussian means, variances, etc.)
  - Efficient adaptation algorithms (MLLR, fMLLR, etc.)
- Cons:
  - Inefficient at learning feature correlations
  - Hard to take advantage of longer context windows
  - Generative rather than discriminative model
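For reference, the generative score a GMM state assigns to a feature vector: the log-likelihood under a diagonal-covariance mixture, computed with log-sum-exp for stability. The parameters below are invented toy values.

```python
import math

# Toy diagonal-covariance GMM over 2-dim features: (weight, means, variances)
components = [
    (0.6, [0.0, 0.0], [1.0, 1.0]),
    (0.4, [3.0, 3.0], [2.0, 0.5]),
]

def log_gauss_diag(x, mu, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def gmm_loglik(x):
    """log p(x) = log sum_m w_m N(x; mu_m, Sigma_m), via log-sum-exp."""
    logs = [math.log(w) + log_gauss_diag(x, mu, var)
            for w, mu, var in components]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

print(gmm_loglik([0.1, -0.2]))
```

Note the diagonal covariance: correlations between feature dimensions are ignored, which is exactly the "inefficient in learning feature correlations" weakness listed above.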

Page 29

Neural networks in acoustic modelling

- Moving beyond GMM-HMM?
  - Conditional random fields (CRF), e.g. segmental CRF [G. Zweig, 2010], augmented CRF [Y. Hifny, 2009], hidden CRF [A. Gunawardana, 2005]
  - Support vector machines (SVM), e.g. [N. Smith, 2002]
  - Template-based acoustic models, e.g. [M. De Wachter, 2007]
  - ...
- Deep neural networks for acoustic modelling [G. Dahl, 2012]

G. Zweig, P. Nguyen, "A segmental conditional random fields toolkit for speech recognition", Interspeech 2010.
Y. Hifny, S. Renals, "Speech recognition using augmented conditional random fields", IEEE TASLP, 2009.
A. Gunawardana, et al, "Hidden conditional random fields for phone classification", Interspeech 2005.
N. Smith, M. Gales, "Speech recognition using SVMs", NIPS, 2002.
M. De Wachter, et al, "Template based continuous speech recognition", IEEE TASLP, 2007.
G. Dahl, et al, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition", IEEE TASLP, 2012.

Page 30

Deep neural networks for acoustic modelling

- Neural networks for speech recognition were extensively studied in the early 1990s
- New ingredients in deep neural networks (DNNs):
  - Pre-training using restricted Boltzmann machines (RBMs)
  - More hidden layers (≥ 4)
  - Wider output (~10^3 output units vs. fewer than 10^2 in earlier speech systems)

[Figure: a shallow neural network (input, one hidden layer, output) next to a deep neural network (input, several hidden layers, output).]

Page 31

Deep neural networks for acoustic modelling

The DNN is still combined with an HMM, as was the practice in the early 1990s

G. Dahl, et al, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition", IEEE TASLP, 2012

Page 32

Deep neural networks for acoustic modelling

- Why do the new ingredients make a difference?
  - Deep neural networks are difficult to train, since training can easily get trapped in a poor local optimum
    - Pre-training helps (in some cases)
  - Shallow networks cannot efficiently learn complex functions
    - More hidden layers help
  - For ASR, a context-dependent model normally has several thousand output states
    - A wide output layer helps
- Additionally, GPUs provide the computational power

Page 33

Deep neural networks for acoustic modelling

- How to train a deep neural network (for acoustic modelling)?
  - Step 1: Train the restricted Boltzmann machines (RBMs)
  - Step 2: Stack the RBMs
  - Step 3: Put a softmax layer on top and refine the weights using back-propagation

G. Hinton, et al, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups", IEEE Signal Processing Magazine, 2012

Page 34

Deep neural networks for acoustic modelling

- Restricted Boltzmann machine
  - Only has visible-hidden connections
  - Learning by maximising the log-likelihood

  P(\mathbf{v}) = \frac{1}{Z} \exp(-F(\mathbf{v}))    (5)

  F(\mathbf{v}) = -\log \Big( \sum_{\mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h})) \Big)  \quad \text{free energy}    (6)

  E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^T \mathbf{v} - \mathbf{c}^T \mathbf{h} - \mathbf{v}^T \mathbf{W} \mathbf{h}  \quad \text{energy function}    (7)

  Z = \sum_{\mathbf{v}, \mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h}))  \quad \text{partition function}    (8)

[Figure: visible layer v connected to hidden layer h through weight matrix W.]
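For binary hidden units, the sum over h in equation (6) factorises, giving a closed form for the free energy. The sketch below (random toy parameters of my choosing) checks that closed form against the brute-force definition.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
nv, nh = 4, 3                       # toy visible/hidden sizes
b = rng.normal(size=nv)             # visible biases
c = rng.normal(size=nh)             # hidden biases
W = rng.normal(size=(nv, nh))       # visible-hidden weights (no v-v or h-h links)

def energy(v, h):
    """E(v, h) = -b^T v - c^T h - v^T W h, as in equation (7)."""
    return -(b @ v) - (c @ h) - v @ W @ h

def free_energy_brute(v):
    """F(v) = -log sum_h exp(-E(v, h)), summing over all 2^nh binary h."""
    s = sum(np.exp(-energy(v, np.array(h)))
            for h in itertools.product([0, 1], repeat=nh))
    return -np.log(s)

def free_energy(v):
    """Closed form for binary hidden units: the sum over h factorises,
    F(v) = -b^T v - sum_j log(1 + exp(c_j + (v^T W)_j))."""
    return -(b @ v) - np.sum(np.logaddexp(0.0, c + v @ W))

v = np.array([1.0, 0.0, 1.0, 1.0])
print(free_energy(v), free_energy_brute(v))   # the two should agree
```

This factorisation is what makes RBM training tractable: the free energy and its gradient never require the exponential sum over hidden configurations.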

Page 35

Deep neural networks for acoustic modelling

- Performance for ASR -- the DNN significantly improves the state of the art

Word error rates (%) on Switchboard, 300 hours of training data:

  GMM          25.3
  +SAT         21.2
  +DT          18.6
  DNN+SAT      14.2
  +DT          12.6

K. Vesely, et al, ”Sequence-discriminative training of deep neural networks”, in Interspeech 2013
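Reading the numbers: the relative improvement from the best GMM system to the best DNN system works out as below (the dictionary keys are my shorthand for the rows above).

```python
# WER (%) from the Switchboard results above; labels are my shorthand.
wer = {"GMM": 25.3, "GMM+SAT": 21.2, "GMM+SAT+DT": 18.6,
       "DNN+SAT": 14.2, "DNN+SAT+DT": 12.6}

def rel_reduction(base, new):
    """Relative WER reduction of `new` over `base`, in percent."""
    return 100.0 * (wer[base] - wer[new]) / wer[base]

print(round(rel_reduction("GMM+SAT+DT", "DNN+SAT+DT"), 1))   # 32.3
```

A roughly 32% relative reduction over a discriminatively trained GMM system is what justified the "significantly improves state-of-the-art" claim.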

Page 36

Deep neural networks for acoustic modelling

- Current research activities in DNNs for ASR:
  - New types of neural networks, e.g. tensor networks, convolutional networks
  - Learning new acoustic feature representations, i.e. moving beyond MFCCs
  - Distributed optimisation to speed up training
  - Adaptation algorithms for speakers or domains
  - ...

Page 37

Summary

- A brief overview of recent advances in ASR:
  - System representation using WFSTs
  - Language modelling using RNNs
  - Acoustic modelling using DNNs
- Try them yourself with the open-source toolkits:
  - OpenFst - http://www.openfst.org
  - RNNLM - http://www.fit.vutbr.cz/~imikolov/rnnlm
  - DNN for ASR - http://kaldi.sourceforge.net

Page 38

Thanks!
