Recent advances in automatic speech recognition — A brief overview
Liang Lu, University of Edinburgh
Liang Lu ([email protected]), Heriot-Watt University, Feb 2014.

Source: ttic.uchicago.edu/~llu/pdf/liang_hwu14.pdf


Page 1

Recent advances in automatic speech recognition — A brief overview

Liang Lu
University of Edinburgh

Page 2

This talk

- What is happening in ASR?
- Background: speech recognition and its application
- (Recent) advances in system representation
  - Weighted finite state transducer
- Recent advances in language modelling
  - Recurrent neural network language model
- Recent advances in acoustic modelling
  - Deep neural network acoustic model
- Summary

Page 3

Background

- Speech is one of the most natural ways to communicate information
- ASR is a central component of voice-driven information processing systems

X. He and L. Deng, "Speech-Centric Information Processing: An Optimization-Oriented Approach", Proceedings of the IEEE, 2013

Page 4

Background

- What does ASR do, and how does it do it?

  Speech → [ASR] → Text

- It can be expressed mathematically as

  \hat{W} = \arg\max_W P(W \mid X)    (1)
          = \arg\max_W \underbrace{p(X \mid W)}_{\text{likelihood}} \, \underbrace{P(W)}_{\text{prior}}    (2)

  where X is a sequence of acoustic feature vectors, and W is a word sequence.
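The decision rule in equations (1)-(2) can be sketched in code. The hypotheses and log-scores below are invented for illustration; in a real system log p(X|W) comes from the acoustic model and log P(W) from the language model.

```python
# Hypothetical candidates W with invented log-scores:
# (log p(X|W) from an acoustic model, log P(W) from a language model).
candidates = {
    "i was thinking about my sweet time": (-120.5, -14.2),
    "i'm thinking while my sweet time":   (-118.9, -18.7),
    "i had what i'm thinking":            (-125.0, -13.1),
}

def decode(cands):
    """arg max_W p(X|W) P(W), computed in log space to avoid underflow."""
    return max(cands, key=lambda w: sum(cands[w]))

print(decode(candidates))   # the W with the best combined log-score
```

Working in log space turns the product in (2) into a sum, which is how real decoders avoid numerical underflow.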

Page 5

It is still hard, let’s decompose it further ...

Example n-best word sequences with scores (language model):
  i'm i'm what i'm thinking while my sweet time i'm very   0.0324
  i had what i'm thinking while my sweet time i'm very     0.0127
  i i was thinking about my sweet time i'm very            0.0046
  ...

Pronunciation lexicon entries (pronunciation model):
  abide      ax b ay d              1.0
  abiding    ax b ay d ih ng        1.0
  abilities  ax b ih l ih t iy z    0.666666
  abilities  ey b ih l ih t iy z    0.333333
  ability    ax b ih l ih t iy      1.0
  able       ax b ax l              0.413349
  able       ey b ax l              0.553356
  ...

Context dependency: ax b ay d ---> sil-ax-b ax-b-ay b-ay-d ay-d-si  1.0

HMMs: each context-dependent phone (e.g. sil-ax-b, ..., ay-d-si) is modelled by an HMM with states j-1, j, j+1.

Components: LM -- language model; PM -- pronunciation model; CD -- context dependency; HMMs.

Page 6

It is still hard, let’s decompose it further ...

[Same decomposition figure as Page 5 (LM, PM, CD, HMMs), with the components marked "Active research".]

Page 7

System training - a generative process

[Figure: the pipeline run as a generative process -- a word sequence ("i was thinking about my sweet time i'm very") is expanded through the pronunciation lexicon into phones, through context dependency into context-dependent phones (sil-ax-b, ..., ay-d-si), and through HMMs (states j-1, j, j+1) down to acoustic features.]

Page 8

Decoding - a search problem

[Figure: the same pipeline run in the other direction as a search -- acoustic features are mapped through HMMs, context dependency and the lexicon up to scored candidate word sequences, e.g. "i'm i'm what i'm thinking while my sweet time i'm very  0.0324".]

Page 9

(Recent) advances in system representation

[Figure: the full LM / PM / CD / HMM pipeline again, annotated with a question mark -- how should all of these components be represented together?]

Page 10

(Recent) advances in system representation

WFST -- weighted finite state transducer

- Input vocabulary i ∈ Φ1
- Output vocabulary o ∈ Φ2
- Weight w ∈ R
- ⊕ operation
- ⊗ operation

Example: state 0 --brad:brad/2--> state 1 --pitt:pitt/5--> state 2

M. Mohri and F. Pereira, "Weighted Finite-State Transducers in Speech Recognition", Computer Speech & Language, 2002
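As a minimal sketch (my own representation, not the OpenFst API), the two-arc transducer above can be scored in the tropical semiring, where ⊗ = + extends a path and ⊕ = min picks the better of two path weights:

```python
# Arcs of the example transducer: (src, input_label, output_label, weight, dst)
arcs = [
    (0, "brad", "brad", 2.0, 1),
    (1, "pitt", "pitt", 5.0, 2),
]

def otimes(a, b):      # tropical semiring: extend a path by adding weights
    return a + b

def oplus(a, b):       # tropical semiring: keep the better (smaller) weight
    return min(a, b)

def path_weight(arcs, input_seq, start=0, final=2):
    """Follow the (here unique) path matching input_seq, accumulating otimes."""
    state, weight, output = start, 0.0, []
    for sym in input_seq:
        matches = [a for a in arcs if a[0] == state and a[1] == sym]
        if not matches:
            return None
        _, _, osym, w, dst = matches[0]
        weight = otimes(weight, w)
        output.append(osym)
        state = dst
    return (weight, output) if state == final else None

print(path_weight(arcs, ["brad", "pitt"]))   # (7.0, ['brad', 'pitt'])
```

With several competing paths, ⊕ would combine their weights; here the path is unique, so only ⊗ is exercised.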

Page 11

(Recent) advances in system representation

WFST for language model and pronunciation model

M. Mohri and F. Pereira, "Weighted Finite-State Transducers in Speech Recognition", Computer Speech & Language, 2002

Page 12

(Recent) advances in system representation

- WFSTs can integrate all the components of an ASR system into a single joint graph, with additional optimisation
- If we define
  - H -- HMMs
  - C -- context dependency transducer
  - L -- pronunciation model
  - G -- language model

  then the ASR task can be represented simply as

  \hat{w} = \text{best\_path}(H \circ C \circ L \circ G)    (3)

  given the acoustic signals.
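A naive version of the composition in equation (3) can be sketched for two toy transducers: a hypothetical one-entry pronunciation model L and a one-word language model G. All labels and weights below are invented, and real decoders use optimised WFST libraries such as OpenFst rather than anything like this.

```python
# A transducer is (arcs, start, finals); an arc is (src, in, out, weight, dst).
# Tropical semiring: weights add along a path.

L = ([(0, "ax", "<eps>", 0.0, 1),       # toy lexicon: phones "ax b ax l" -> able
      (1, "b",  "<eps>", 0.0, 2),
      (2, "ax", "<eps>", 0.0, 3),
      (3, "l",  "able",  2.0, 4)], 0, {4})

G = ([(0, "able", "able", 3.0, 1)], 0, {1})   # toy one-word language model

def compose(t1, t2):
    """Match t1's output labels against t2's input labels; <eps> outputs on
    t1 advance t1 while t2 stays put (no epsilons on t2 assumed)."""
    (a1, s1, f1), (a2, s2, f2) = t1, t2
    states2 = {s2} | {a[0] for a in a2} | {a[4] for a in a2}
    arcs = []
    for (p, i, o, w, q) in a1:
        if o == "<eps>":
            arcs += [((p, r), i, "<eps>", w, (q, r)) for r in states2]
        else:
            arcs += [((p, p2), i, o2, w + w2, (q, q2))
                     for (p2, i2, o2, w2, q2) in a2 if i2 == o]
    return arcs, (s1, s2), {(x, y) for x in f1 for y in f2}

def run(t, input_seq):
    """Deterministically follow input_seq; return (weight, output) or None."""
    arcs, state, finals = t
    weight, output = 0.0, []
    for sym in input_seq:
        matches = [a for a in arcs if a[0] == state and a[1] == sym]
        if not matches:
            return None
        _, _, osym, w, dst = matches[0]
        weight += w
        if osym != "<eps>":
            output.append(osym)
        state = dst
    return (weight, output) if state in finals else None

LG = compose(L, G)
print(run(LG, ["ax", "b", "ax", "l"]))   # (5.0, ['able'])
```

The composed machine maps the phone string directly to the word with the combined (⊗-accumulated) weight, which is exactly what H ∘ C ∘ L ∘ G does at scale.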

Page 13

(Recent) advances in system representation

- WFST provides an elegant interface for downstream applications
- An example of spoken language understanding (ASR + NLU):

0 --Show:O--> 1 --me:O--> 2 --movies:B-movie_type--> 3 --with:O--> 4 --brad:B-movie_star--> 5 --pitt:I-movie_star--> 6

A. Deoras et al, "Joint Discriminative Decoding of Words and Semantic Tags for Spoken Language Understanding", IEEE TASLP, 2013

Page 14

(Recent) advances in system representation

- An example of speech-to-speech translation (ASR + MT)

X. He, L. Deng and A. Acero, "Why Word Error Rate is not a Good Metric for Speech Recognizer Training for the Speech Translation Task?", ICASSP 2011

Page 15

(Recent) advances in system representation

X. He, L. Deng and A. Acero, "Why Word Error Rate is not a Good Metric for Speech Recognizer Training for the Speech Translation Task?", ICASSP 2011

Page 16

(Recent) advances in system representation

- Common practice -- coupling ASR and MT with WFSTs

B. Zhou et al, "Folsom: A Fast and Memory-Efficient Phrase-based Approach to Statistical Machine Translation", SLT 2006
B. Zhou et al, "On Efficient Coupling of ASR and SMT for Speech Translation", ICASSP 2007

Page 17

(Recent) advances in system representation

Page 18

Recent advances in ASR

[Same decomposition figure as Page 5 (LM, PM, CD, HMMs), with the components marked "Active research".]

Page 19

Recent advances in ASR

[Same decomposition figure as Page 5 (LM, PM, CD, HMMs), now annotated "Neural networks" -- the recent development across these components.]

Page 20

Neural networks in language modelling

- The n-gram language model has defined the state of the art for almost 40 years [L. R. Bahl, 1978]
- There has been a long struggle to move beyond n-grams with various statistical models:
  - Random forest language model [P. Xu, 2004]
  - Class-based language model, e.g. IBM Model M [S.F. Chen, 2009]
  - Nonparametric language model [Y.W. Teh, 2006]
  - Discriminative language model [B. Roark, 2006]
  - ...
- It may finally be happening, with the recurrent neural network language model (RNNLM)

T. Mikolov, et al, "Recurrent neural network based language model", Interspeech 2010

Page 21

Neural networks in language modelling

- The aim of a language model is very simple:

  P(w_n \mid w_{n-1}, \ldots, w_1) \approx P(w_n \mid w_{n-1}, \ldots, w_{n-k})    (4)

  but estimating it is very difficult for k > 3 on a large-vocabulary task -- e.g. what is the value of 60,000^3?
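The arithmetic behind that question: a direct table of k-gram probabilities over a 60,000-word vocabulary would need on the order of V^k entries.

```python
V = 60_000                       # vocabulary size from the slide
for k in (2, 3, 4):
    # Number of distinct k-word contexts-plus-word events to estimate.
    print(f"{k}-gram: about {V ** k:.1e} possible parameters")
```

At k = 3 this is already about 2.2 × 10^14 events, vastly more than any training corpus can cover, which is why n-grams rely on back-off and smoothing.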

Page 22

Neural networks in language modelling

- The neural network language model is not new [Y. Bengio, 2003]

[Figure: feedforward NNLM -- the k context words w_{n-k}, ..., w_{n-1} enter as one-hot vectors (0 0 0 1 0 0 ...) at the input layer, pass through a shared projection layer and a hidden layer, and the output layer gives P(w_n = n | w_{n-1}, ..., w_{n-k}).]

Y. Bengio, et al, "A neural probabilistic language model", JMLR 2003
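A minimal forward pass through such a feedforward NNLM can be sketched as follows; the dimensions are toy values of my choosing, not the configuration of Bengio et al.

```python
import numpy as np

rng = np.random.default_rng(0)
V, P, H, k = 10, 4, 8, 2   # toy sizes: vocab, projection dim, hidden dim, context

C  = rng.normal(size=(V, P))        # shared projection (embedding) matrix
W1 = rng.normal(size=(k * P, H))    # projection -> hidden weights
W2 = rng.normal(size=(H, V))        # hidden -> output weights

def nnlm_forward(context_ids):
    """P(w_n | w_{n-1}, ..., w_{n-k}) over the whole vocabulary."""
    x = C[context_ids].reshape(-1)      # look up and concatenate embeddings
    h = np.tanh(x @ W1)                 # hidden layer
    logits = h @ W2
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

p = nnlm_forward([3, 7])
print(p.shape, p.sum())                 # a proper distribution over V words
```

The one-hot input layer of the figure is equivalent to the embedding lookup `C[context_ids]`, which is how it is implemented in practice.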

Page 23

Neural networks in language modelling

- The RNNLM differs in that a recurrent hidden layer is used to capture longer contextual information

T. Mikolov, et al, "Recurrent neural network based language model", Interspeech 2010
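The key difference can be sketched as a single recurrent step: the hidden state carries the whole history forward, rather than a fixed window of k words. Dimensions below are toy values, not Mikolov's actual RNNLM.

```python
import numpy as np

rng = np.random.default_rng(1)
V, H = 10, 6   # toy vocabulary and hidden sizes

U = rng.normal(size=(V, H)) * 0.1   # input -> hidden
W = rng.normal(size=(H, H)) * 0.1   # hidden -> hidden (the recurrent connection)
O = rng.normal(size=(H, V)) * 0.1   # hidden -> output

def rnnlm_step(word_id, h_prev):
    """One step: the new hidden state mixes the current word with h_prev,
    so in principle the entire history influences the next prediction."""
    h = np.tanh(U[word_id] + h_prev @ W)
    logits = h @ O
    e = np.exp(logits - logits.max())   # stable softmax over the vocabulary
    return h, e / e.sum()

h = np.zeros(H)
for w_id in [3, 1, 4]:                  # a toy word-id sequence
    h, p = rnnlm_step(w_id, h)
print(p.sum())                          # distribution over the next word
```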

Page 24

Neural networks in language modelling

- The RNNLM achieves significant reductions in both perplexity and word error rate (results on Wall Street Journal)

T. Mikolov, et al, "Recurrent neural network based language model", Interspeech 2010

Page 25

Neural networks in language modelling

Not limited to language modelling

- RNN for spoken language understanding [K. Yao, et al, 2013]

K. Yao, et al, "Recurrent neural networks for language understanding", Interspeech 2013

Page 26

Neural networks in language modelling

Not limited to language modelling

- RNN for spoken language understanding [K. Yao, et al, 2013]

K. Yao, et al, "Recurrent neural networks for language understanding", Interspeech 2013

Page 27

Neural networks in language modelling

Not limited to language modelling

- RNN for machine translation [N. Kalchbrenner, P. Blunsom, 2013]

N. Kalchbrenner, P. Blunsom, "Recurrent continuous translation models", EMNLP 2013

Page 28

Neural networks in acoustic modelling

- GMM-HMM has defined the state of the art for over 20 years

[Figure: HMM with states j-1, j, j+1]

- Pros:
  - Efficient and parallel training algorithms
  - Clear physical meaning (Gaussian means, variances, etc.)
  - Efficient adaptation algorithms (MLLR, fMLLR, etc.)
- Cons:
  - Inefficient at learning feature correlations
  - Hard to take advantage of longer context windows
  - Generative rather than discriminative model
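For reference, the generative score a GMM state assigns to a feature vector: the log-likelihood under a diagonal-covariance mixture, computed with log-sum-exp for stability. The parameters below are invented toy values.

```python
import math

# Toy diagonal-covariance GMM over 2-dim features: (weight, means, variances)
components = [
    (0.6, [0.0, 0.0], [1.0, 1.0]),
    (0.4, [3.0, 3.0], [2.0, 0.5]),
]

def log_gauss_diag(x, mu, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def gmm_loglik(x):
    """log p(x) = log sum_m w_m N(x; mu_m, Sigma_m), via log-sum-exp."""
    logs = [math.log(w) + log_gauss_diag(x, mu, var)
            for w, mu, var in components]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

print(gmm_loglik([0.1, -0.2]))
```

Note the diagonal covariance: correlations between feature dimensions are ignored, which is exactly the "inefficient in learning feature correlations" weakness listed above.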

Page 29

Neural networks in acoustic modelling

- Moving beyond GMM-HMM?
  - Conditional random fields (CRF), e.g. segmental CRF [G. Zweig, 2010], augmented CRF [Y. Hifny, 2009], hidden CRF [A. Gunawardana, 2005]
  - Support vector machines (SVM), e.g. [N. Smith, 2002]
  - Template-based acoustic models, e.g. [M. De Wachter, 2007]
  - ...
- Deep neural networks for acoustic modelling [G. Dahl, 2012]

G. Zweig, P. Nguyen, "A segmental conditional random fields toolkit for speech recognition", Interspeech 2010.
Y. Hifny, S. Renals, "Speech recognition using augmented conditional random fields", IEEE TASLP, 2009.
A. Gunawardana, et al, "Hidden conditional random fields for phone classification", Interspeech 2005.
N. Smith, M. Gales, "Speech recognition using SVMs", NIPS, 2002.
M. De Wachter, et al, "Template based continuous speech recognition", IEEE TASLP, 2007.
G. Dahl, et al, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition", IEEE TASLP, 2012.

Page 30

Deep neural networks for acoustic modelling

- Neural networks for speech recognition were extensively studied in the early 1990s
- New ingredients in deep neural networks (DNNs):
  - Pre-training using restricted Boltzmann machines (RBMs)
  - More hidden layers (≥ 4)
  - Wider output (~10^3 output units vs. fewer than 10^2 in earlier speech systems)

[Figure: a shallow neural network (input, one hidden layer, output) next to a deep neural network (input, several hidden layers, output).]

Page 31

Deep neural networks for acoustic modelling

The DNN is still combined with an HMM, as was the practice in the early 1990s

G. Dahl, et al, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition", IEEE TASLP, 2012

Page 32

Deep neural networks for acoustic modelling

- Why do the new ingredients make a difference?
  - Deep neural networks are difficult to train, since training can easily get trapped in a poor local optimum
    - Pre-training helps (in some cases)
  - Shallow networks cannot efficiently learn complex functions
    - More hidden layers help
  - For ASR, a context-dependent model normally has several thousand output states
    - A wide output layer helps
- Additionally, GPUs provide the computational power

Page 33

Deep neural networks for acoustic modelling

- How to train a deep neural network (for acoustic modelling)?
  - Step 1: Train the restricted Boltzmann machines (RBMs)
  - Step 2: Stack the RBMs
  - Step 3: Put a softmax layer on top and refine the weights using back-propagation

G. Hinton, et al, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups", IEEE Signal Processing Magazine, 2012

Page 34

Deep neural networks for acoustic modelling

- Restricted Boltzmann machine
  - Only has visible-hidden connections
  - Learning by maximising the log-likelihood

  P(\mathbf{v}) = \frac{1}{Z} \exp(-F(\mathbf{v}))    (5)

  F(\mathbf{v}) = -\log \Big( \sum_{\mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h})) \Big)  \quad \text{free energy}    (6)

  E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^T \mathbf{v} - \mathbf{c}^T \mathbf{h} - \mathbf{v}^T \mathbf{W} \mathbf{h}  \quad \text{energy function}    (7)

  Z = \sum_{\mathbf{v}, \mathbf{h}} \exp(-E(\mathbf{v}, \mathbf{h}))  \quad \text{partition function}    (8)

[Figure: visible layer v connected to hidden layer h through weight matrix W.]
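For binary hidden units, the sum over h in equation (6) factorises, giving a closed form for the free energy. The sketch below (random toy parameters of my choosing) checks that closed form against the brute-force definition.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
nv, nh = 4, 3                       # toy visible/hidden sizes
b = rng.normal(size=nv)             # visible biases
c = rng.normal(size=nh)             # hidden biases
W = rng.normal(size=(nv, nh))       # visible-hidden weights (no v-v or h-h links)

def energy(v, h):
    """E(v, h) = -b^T v - c^T h - v^T W h, as in equation (7)."""
    return -(b @ v) - (c @ h) - v @ W @ h

def free_energy_brute(v):
    """F(v) = -log sum_h exp(-E(v, h)), summing over all 2^nh binary h."""
    s = sum(np.exp(-energy(v, np.array(h)))
            for h in itertools.product([0, 1], repeat=nh))
    return -np.log(s)

def free_energy(v):
    """Closed form for binary hidden units: the sum over h factorises,
    F(v) = -b^T v - sum_j log(1 + exp(c_j + (v^T W)_j))."""
    return -(b @ v) - np.sum(np.logaddexp(0.0, c + v @ W))

v = np.array([1.0, 0.0, 1.0, 1.0])
print(free_energy(v), free_energy_brute(v))   # the two should agree
```

This factorisation is what makes RBM training tractable: the free energy and its gradient never require the exponential sum over hidden configurations.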

Page 35

Deep neural networks for acoustic modelling

- Performance for ASR -- the DNN significantly improves the state of the art

Word error rates (%) on Switchboard, 300 hours of training data:

  GMM          25.3
  +SAT         21.2
  +DT          18.6
  DNN+SAT      14.2
  +DT          12.6

K. Vesely, et al, ”Sequence-discriminative training of deep neural networks”, in Interspeech 2013
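Reading the numbers: the relative improvement from the best GMM system to the best DNN system works out as below (the dictionary keys are my shorthand for the rows above).

```python
# WER (%) from the Switchboard results above; labels are my shorthand.
wer = {"GMM": 25.3, "GMM+SAT": 21.2, "GMM+SAT+DT": 18.6,
       "DNN+SAT": 14.2, "DNN+SAT+DT": 12.6}

def rel_reduction(base, new):
    """Relative WER reduction of `new` over `base`, in percent."""
    return 100.0 * (wer[base] - wer[new]) / wer[base]

print(round(rel_reduction("GMM+SAT+DT", "DNN+SAT+DT"), 1))   # 32.3
```

A roughly 32% relative reduction over a discriminatively trained GMM system is what justified the "significantly improves state-of-the-art" claim.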

Page 36

Deep neural networks for acoustic modelling

- Current research activities in DNNs for ASR:
  - New types of neural networks, e.g. tensor networks, convolutional networks
  - Learning new acoustic feature representations, i.e. moving beyond MFCCs
  - Distributed optimisation to speed up training
  - Adaptation algorithms for speakers or domains
  - ...

Page 37

Summary

- A brief overview of recent advances in ASR:
  - System representation using WFSTs
  - Language modelling using RNNs
  - Acoustic modelling using DNNs
- Try them yourself with the open-source toolkits:
  - OpenFst - http://www.openfst.org
  - RNNLM - http://www.fit.vutbr.cz/~imikolov/rnnlm
  - DNN for ASR - http://kaldi.sourceforge.net

Page 38

Thanks!
